We are Airsequel, a platform that hosts SQLite databases and automatically generates a GraphQL API, a spreadsheet UI, an SQL workbench, and a dashboard builder for each database.

A few days ago I was writing some examples for our documentation section “Useful SQL Snippets”. While typing the text, Notion AI (powered by ChatGPT) auto-completed the result tables and it was surprisingly good at printing out the correct result. This made my wonder: How good are LLMs at executing SQL code?

Let’s put it to the test with ChatGPT 4. I’m using the OpenAI Playground with the model “gpt-4-1106-preview” and a maximum number of 4096 tokens. I’m leaving the system prompt empty to work with neutrally primed version of ChatGPT.

First, a simple select query:

Given the following table `users`:

id | name | count | email
---|---|---|---
1 | Alice | 25 | alice@example.com
2 | Bob | 32 | bob@example.com
3 | Carol | 41 | carol@example.com
4 | David | 28 | david@example.com
5 | Eve | 37 | eve@example.com

Execute following SQL query (SQLite flavor):
```sql
SELECT name FROM users;
```
Answer with the resulting table and no explanation.

ℹ️ I’m executing each prompt in a new chat to keep the context short. I’m not citing the setup part of the prompt again, to keep this blog post more readable.

Response:

name
---
Alice
Bob
Carol
David
Eve

Perfect! Easy queries do not seem to be a problem. Let’s make it a little harder:

SELECT length(name) FROM users

Response:

length(name)
---
5
3
5
5
3

Also correct! What about renaming columns? Or constant columns?

SELECT 
  name AS person_of_interest,
  'OK' AS status 
FROM users

Response:

person_of_interest | status
---|---
Alice | OK
Bob | OK
Carol | OK
David | OK
Eve | OK

Still correct!

Maybe ordering and limiting the results?

SELECT *
FROM users
ORDER BY count
LIMIT 2

Response:

id | name | count | email
---|---|---|---
1 | Alice | 25 | alice@example.com
4 | David | 28 | david@example.com

Exactly right! I’m starting to get impressed! What if we combine several of SQL’s features?

SELECT
  name AS person_of_interest,
  length(name) AS name_length,
  'PENDING' AS status
FROM users
ORDER BY name_length DESC
LIMIT 3

Response:

person_of_interest | name_length | status
-------------------|-------------|--------
Carol              | 5           | PENDING
Alice              | 5           | PENDING
David              | 5           | PENDING

Wow, correct again! However, the order of the rows is undefined for a query like this. Let’s add another column to ORDER BY to get a definitive result:

SELECT
  name AS person_of_interest,
  length(name) AS name_length,
  'PENDING' AS status
FROM users
ORDER BY name_length DESC, name ASC
LIMIT 3

And we’ve got our first error:

person_of_interest | name_length | status
---|---|---
Carol | 5 | PENDING
Alice | 5 | PENDING
David | 5 | PENDING

It did not sort by the name of the person. Let’s check if this is a general problem, or just an omission due to the increasing complexity of the query:

Given the following table `users`:

id | name | color
---|---|---|---
2 | Bob | green
3 | Carol | blue
1 | Alice | red
5 | Eve | green
4 | David | blue

Execute following SQL query (SQLite flavor):
```sql
SELECT * FROM users
ORDER BY color ASC, name ASC
```
Answer with the resulting table and no explanation.

Response:

id | name | color
---|---|---
3 | Carol | blue
4 | David | blue
5 | Eve | green
2 | Bob | green
1 | Alice | red

Wrong. Eve and Bob are swapped. What if I change it to name DESC?

Response:

id | name | color
---|---|---|
3 | Carol | blue
4 | David | blue
5 | Eve | green
2 | Bob | green
1 | Alice | red

Also wrong. David and Carol are swapped, but it shall be noted that the first column was always sorted correctly!

Overall, all the executions seemed to be quite stable, as I executed all examples several times and they mostly yielded the same results. The order of rows was e.g. something that changed between executions.

The tested queries were of course just a tiny fraction of all useful queries. It would be interesting to expand the testing based on following questions:

  • How does the amount of rows or columns influence the results?
  • How does the complexity of the data change it?
    Long strings vs. short strings, large numbers vs. small numbers.
  • How important is the formatting of the code?
  • Can we improve the results with a better system prompt or priming it with examples?

If you just got nerd sniped and happen to run more extensive tests, I’d love to see your results! 😁

Apart from the intrinsics of executing SQL queries, I’m fascinated about the general implications of this. For simple languages / code it might actually be a viable option to implement the interpreter or compiler with an ML Model. After all, ML models are interpreters for human language already.

The reliability could probably be increased by introducing an intermediate step, where the prompt is compiled (JIT compiled?) into an intermediate language like Python and then executed. This is even already available with ChatGPT’s include code interpreter .

Based on this, I have som predictions:

  • Ever increasing amounts of code / instructions / algorithms will be executed by ML models in the future
  • They could be helpful in ways we haven’t even considered yet:
    • Automatically give feedback during testing when they have ideas on how to improve the code.
      E.g. I accidentally used a wrong SQL column name during testing, but instead of erroring out like a normal SQL interpreter would, it just informed me that it assumes I mean another column and executed it with that!
    • Intelligently change execution characteristics depending on deployment environment, hardware resources, or payload.
  • Soon it will be common to build interpreters and compilers for high level DSLs with fine tuned ML models. I know that this is quite a bold statement, but if devices have increasingly sophisticated hardware for ML model execution, it might actually be a quite performant option!

One thing’s for sure:
The programming languages community is in for some exciting years ahead!