Natural Language to SQL from Scratch with Tensorflow – Towards Data Science
Eileen Pangu
Towards Data Science
Introduction
In this blog post, we’re going to look at an interesting task: translating natural language to SQL. The academic term for this is natural language interface to databases (NLIDB). Even though NLIDB is still an area of active research, building a model for one simple table is actually pretty straightforward. We’ll do that for an employee table with 3 columns: name, gender, and salary (as shown in Figure-1). By the end of this blog post, you will learn how to go from a natural language input such as "show me the names of the employees whose income is higher than 50000" to the SQL query output "select e.name from employee e where e.salary > 50000".
Overview
In this section, we explain the core idea at a high level.
At its core, this task is a machine translation problem. While it may be tempting to apply a sequence-to-sequence machine translation model that goes directly from the input natural language to the output SQL query, that approach does not perform well in practice. The main reason is that the model may encounter out-of-vocabulary (OOV) tokens for column values. Though the model can learn to tolerate other minor unknown words to some extent, OOV for column values is fatal. Suppose the example from the introduction used a salary value that the training dataset doesn’t cover (60000, 70000, 80000, you name it): there will always be a salary number that’s out of vocabulary. The same goes for the name column. OOV tokens are mapped to a [UNK] symbol before being fed to the translation model, so the model has no way to reconstruct the exact column values in the SQL output.
The typical way to handle this is to run the raw input through a process called schema linking, which identifies and caches the column values, and substitutes them with placeholders that the model has seen during training. For instance, the input example from the introduction will become
show me the names of the employees whose income is higher than
…
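To make the OOV problem and the schema-linking idea described above concrete, here is a minimal pure-Python sketch. It is not the author's implementation: the toy vocabulary, the function names, and in particular the placeholder format [NUM_1] are assumptions made for illustration (the placeholder actually used in the post is not visible in this excerpt). The sketch only links integer values; linking names would need a separate lookup against the table.

```python
import re

UNK = "[UNK]"

# Toy vocabulary (assumed). A real one would be built from the training set.
# It contains the hypothetical placeholder "[NUM_1]" but not the value 60000.
VOCAB = {
    "show", "me", "the", "names", "of", "employees", "whose",
    "income", "is", "higher", "than", "50000", "[NUM_1]",
}


def to_model_tokens(text, vocab=VOCAB):
    """Split on whitespace and map every unknown token to [UNK]."""
    return [tok if tok in vocab else UNK for tok in text.split()]


def link_numbers(text):
    """Replace each integer with a numbered placeholder and cache its value."""
    cache = {}

    def substitute(match):
        placeholder = f"[NUM_{len(cache) + 1}]"  # hypothetical format
        cache[placeholder] = match.group(0)
        return placeholder

    linked = re.sub(r"\b\d+\b", substitute, text)
    return linked, cache


def restore_values(sql, cache):
    """Put the cached column values back into the generated SQL."""
    for placeholder, value in cache.items():
        sql = sql.replace(placeholder, value)
    return sql


question = "show me the names of the employees whose income is higher than 60000"

# Without schema linking, the unseen salary value is lost as [UNK].
raw_tokens = to_model_tokens(question)

# With schema linking, the model only sees a placeholder it knows.
linked, cache = link_numbers(question)
linked_tokens = to_model_tokens(linked)

# Stand-in for the translation model's output (assumed, not a real prediction).
predicted_sql = "select e.name from employee e where e.salary > [NUM_1]"
final_sql = restore_values(predicted_sql, cache)
print(final_sql)
```

The design point is that the translation model never has to reproduce the literal value: it only copies a placeholder it has seen in training, and the cached value is substituted back after decoding.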
Manager and Tech Lead @ FANG. Enthusiastic tech generalist. Enjoy distilling wisdom from experiences. Believe that learning is a lifelong journey.