Text-to-SQL Done Properly Is a Retrieval Problem, Not a Prompt

Picture this. You ask your new data tool a plain question. How many active customers did the Durban branch invoice last month? It thinks for a second and hands you a number. The number looks reasonable, so it goes into the board pack. Nobody checks it, because checking it would mean writing the query by hand, which is the whole reason you bought the tool. The problem is the number is wrong, and nothing on the screen tells you so.
This is the failure mode people miss. The danger with text-to-SQL is not that the model writes broken SQL that throws an error. A broken query is a gift. It fails loudly, you see red, you move on. The real danger is confident, runnable SQL that means something other than what you asked. It runs clean, returns a tidy figure, and quietly answers a different question.
Where it actually goes wrong
You see this most clearly in messy enterprise data, and there is nothing messier than a multi-branch ERP export. We did a project for a national distributor where the same idea wore three different names. One table called a location a branch, another called it a depot, a third called it a site. Active customer had at least three definitions depending on who built the report. One date column was order date, the next was dispatch date, and they disagreed by days. A person who has worked there for a decade carries all of that in their head. A model handed the raw schema does not. It guesses.
The research lines up with what you see in the field. When people sort large samples of wrong text-to-SQL answers by cause, most failures are schema-level. The model picks the wrong column, or misreads what a field actually means. Syntax is rarely the issue. The model usually writes valid SQL. It just writes valid SQL about the wrong thing.
It gets worse as the database grows. On benchmarks built from large real schemas, with hundreds of columns and dozens of tables, accuracy falls off hard, and even the strongest systems miss a meaningful share of queries on a curated test set. Your ERP is not a curated test set. It is older, dirtier and full of columns nobody has explained since 2014.
Getting the right answer is a retrieval problem first
Here is the shift. Stop dumping the whole schema into a clever prompt and hoping. The first job is retrieval. Narrow thousands of columns down to the handful that matter for this question, before the model writes a single line of SQL. This is schema linking, and it is the core lever. Get the right columns in front of the model and accuracy climbs. Feed it everything and it drowns.
Pruning is a trade-off, though. Cut too hard and you drop one column the query needs, which breaks it outright. So the better systems retrieve in both directions. Tables first, then their columns, then columns first, then the tables they sit in. Methods like RSL-SQL use this two-way linking to strip most of the schema away while keeping nearly all of what a given query needs. The point is not the exact ratio. The point is that retrieval, not prompting, is doing the work.
At real scale this becomes a search problem. A large slice of enterprise questions need many tables joined together, a case that barely shows up in academic benchmarks. Big catalogues run to thousands of tables, well past what fits in any prompt. The fix is the same one behind good document search. Embed every table, column and relationship as text, store the vectors, and pull back only the relevant slice for each question. Retrieve, then generate.
Define the business words once
Retrieval gets you to the right tables. It does not settle what revenue means, or which of those three definitions of active customer you intended. For that you need a semantic layer. One place where the business definitions live, in version-controlled config, so revenue and active branch resolve to the same thing every time. The model's job shrinks to picking the right metric and dimension. A deterministic engine writes the SQL. If the model picks the right metric, the query is correct by construction, because the joins and the maths are already defined.
Put a semantic layer in front of the same models and the gap closes sharply. In-scope questions stop being a coin toss and start being reliable, because the model is choosing from defined metrics instead of inventing joins. But accuracy is not the best part.
The real win of a semantic layer is honesty. Asked something outside its scope, it says it cannot answer, instead of cheerfully handing you a wrong number.
Guardrails made of code, not prose
Two more things have to be true before any of this touches a live business. First, evaluate against a fixed set of your own real questions with known answers, not a public leaderboard. Leaderboard scoring is noisier than it looks, and a benchmark average tells you nothing about the eleven questions your finance team asks every month. Those eleven beat someone else's average every time.
Second, the safety has to be deterministic. A read-write database connection wired to a non-deterministic model is, in effect, full admin access. A system prompt that says SELECT only is a guardrail made of tape, and a prompt injection walks straight through it. The pattern that holds up is plain engineering:
- Point at a read-only replica using a dedicated read-only database role, so a question can never mutate live data.
- Parse and allowlist the SQL before it runs. Single statement, SELECT only, no multi-statement tricks, with a forced LIMIT and a timeout.
- Enforce row-level security at the database, so each user sees only their permitted rows. Under POPIA that is not optional.
- Keep an audit log of the question, the normalised query, the policy outcome, runtime and row counts.
The honest close
Reserve the word agentic for software that genuinely plans its retrieval and checks its own output. A thin prompt sitting on a raw schema is not an agent, however it is marketed. That is agent washing, and it is exactly the setup that produces confident wrong numbers.
A system that says I cannot answer that yet is worth more than one that always answers and is sometimes wrong. Which one you end up with is an engineering decision about retrieval, definitions, evaluation and guardrails. It is not a question of which model you bought. That is the part most demos skip, and it is the part we build first.