How does AI generate SQL from natural language?
Why naive NL-to-SQL fails.
The simplest NL-to-SQL implementation is one prompt: send the user’s question and the database schema to an LLM, get SQL back, run it. This works on toy schemas in demos. On real enterprise data, it fails predictably in three ways:
- Schema blindness. A real warehouse has hundreds of tables, weird column names from legacy systems, denormalizations, foreign keys that aren’t declared, and business definitions that aren’t in the data itself. An LLM handed a raw schema dump can’t reliably pick the right join, the right filter, or the right aggregation. It guesses, often plausibly, often wrongly.
- Hallucination. When the LLM is uncertain, it doesn’t pause and ask — it generates. The generated SQL might reference columns that don’t exist, use joins that aren’t valid, or compute aggregations on the wrong grain. Some of these throw errors (visible failure). The dangerous ones run successfully but return wrong results (silent failure).
- No verification. Without a verification layer, the natural-language answer the LLM produces is decoupled from the actual query result. The narrative might say “sales were $4.2M” while the query returned $3.8M — with no warning to the user.
Each of these failure modes is well-documented in the research literature. Each is unacceptable for enterprise decisions. Production NL-to-SQL systems are designed to make each impossible by architecture.
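The naive approach is worth seeing in code because its failure modes are structural, not incidental. A minimal sketch, assuming a generic `call_llm` chat-completion callable (a placeholder, not any specific vendor API):

```python
# Minimal sketch of the naive one-prompt NL-to-SQL approach.
# `call_llm` is a placeholder for any chat-completion API; the raw schema
# dump and the question are the only context the model ever receives.

def naive_nl_to_sql(question: str, schema_ddl: str, call_llm) -> str:
    """Single prompt in, raw SQL out -- no grounding, no verification."""
    prompt = (
        "You are a SQL assistant. Given this schema:\n"
        f"{schema_ddl}\n"
        f"Write one SQL query answering: {question}\n"
        "Return only the SQL."
    )
    return call_llm(prompt)  # the caller runs this blindly

# Nothing here checks that the SQL references real columns, that joins
# are valid, or that a later narrative matches the result -- exactly the
# three failure modes above.
```

Every architectural pattern in the rest of this piece exists to replace some part of this one-shot flow with a grounded, verifiable step.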
The schema-aware approach. Semantic Knowledge Graphs.
The first architectural answer is to give the LLM more than the raw schema. A Semantic Knowledge Graph captures the platform’s full mental model of the warehouse:
- Every table, with descriptions of what it represents in the business
- Every column, with data types, units, valid value ranges, and definitions
- Foreign-key relationships, both declared and inferred
- Joins — including non-obvious ones that require intermediate tables
- Business definitions: what does “active customer” mean for this company? What’s the GAAP-vs-gross convention for revenue?
- Common query patterns: how do analysts typically slice this data?
- Denormalizations: which tables aggregate which raw sources?
When the LLM generates SQL, it’s not guessing from raw schema — it’s grounded in a structured knowledge representation that captures how the warehouse actually works. The accuracy delta is dramatic: published benchmarks show schema-aware NL-to-SQL outperforming raw-schema approaches by 2-3x on real enterprise warehouses.
Building the Semantic Knowledge Graph is upfront work — typically done at deployment by a domain-aware team. In Diwo, the graph is built from automatic schema introspection plus a curation step where analysts annotate business definitions. The graph improves continuously as the platform observes which queries succeed and which fail.
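To make the idea concrete, here is a hedged sketch of what a Semantic Knowledge Graph might hold and how a relevant slice of it could be rendered as grounding context for the LLM. The field names and the `grounding_context` method are illustrative assumptions, not Diwo’s actual internal model:

```python
# Illustrative data model for a Semantic Knowledge Graph.
# Field names are assumptions for this sketch, not a real platform schema.
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    dtype: str
    description: str  # business meaning, units, valid value ranges

@dataclass
class Table:
    name: str
    description: str  # what this table represents in the business
    columns: list[Column] = field(default_factory=list)

@dataclass
class SemanticKnowledgeGraph:
    tables: dict[str, Table] = field(default_factory=dict)
    joins: list[tuple[str, str, str]] = field(default_factory=list)  # (left, right, ON clause)
    definitions: dict[str, str] = field(default_factory=dict)        # "active customer" -> rule

    def grounding_context(self, table_names: list[str]) -> str:
        """Render only the relevant slice of the graph as LLM prompt context."""
        lines = []
        for t in (self.tables[n] for n in table_names):
            lines.append(f"TABLE {t.name}: {t.description}")
            lines += [f"  {c.name} ({c.dtype}): {c.description}" for c in t.columns]
        lines += [f"JOIN {l} -> {r} ON {on}" for l, r, on in self.joins
                  if l in table_names and r in table_names]
        lines += [f"DEFINITION {k}: {v}" for k, v in self.definitions.items()]
        return "\n".join(lines)
```

The design point is retrieval: the generator never sees the whole warehouse, only the annotated slice relevant to the question, which is what makes the grounding tractable on an 800-table schema.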
The verification loop. Anti-hallucination by architecture.
Schema awareness reduces hallucination on the SQL generation step. It doesn’t eliminate hallucination in the natural-language answer. The LLM might generate correct SQL, get back $3.8M, and then summarize the result as “sales were $4.2M” in the narrative. This is rare but catastrophic when it happens.
The architectural answer is a verification loop. After the SQL executes and returns numerical results, an Anti-Hallucination Agent compares every number in the natural-language narrative against the actual query result. Mismatches are blocked or corrected before the answer reaches the user.
Three forms of verification typically run in parallel:
- Numerical match. Every number in the narrative must trace to a value in the query result.
- Semantic match. Claims about direction (“sales increased”) must match the actual direction in the data.
- Aggregation grain match. A claim about a monthly figure can’t be derived from an annual aggregation.
The combined effect: by the time an answer reaches the user, every numerical claim has been verified against the actual data the SQL returned. The number you see is the number that came out of your warehouse — not the number the LLM guessed.
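The numerical-match check above can be sketched in a few lines. This is a simplified illustration, assuming narrative numbers and query results are already in the same units (real systems would also normalize “$4.2M” to 4,200,000 and handle percentages); the regex and tolerance are assumptions of the sketch:

```python
# Sketch of the numerical-match verification: every number in the
# narrative must trace to a value in the query result. Semantic and
# grain checks would run alongside this in parallel.
import re

def verify_numbers(narrative: str, result_values: list[float],
                   rel_tol: float = 0.005) -> list[float]:
    """Return the narrative numbers that do NOT match any result value."""
    # Pull numbers like 4.2, 3800000, or 4,200,000 out of the text
    claims = [float(m.replace(",", ""))
              for m in re.findall(r"\d[\d,]*\.?\d*", narrative)]
    unmatched = []
    for c in claims:
        if not any(abs(c - v) <= rel_tol * max(abs(v), 1.0)
                   for v in result_values):
            unmatched.append(c)
    return unmatched  # non-empty -> block or correct the answer
```

A narrative claiming 4.2 against a query result of 3.8 falls outside the tolerance and gets flagged, which is the mechanism that blocks the “$4.2M vs $3.8M” mismatch described earlier.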
The multi-agent production architecture.
Production NL-to-SQL is multi-agent by necessity. Each agent has a narrow scope, verifiable outputs, and is testable in isolation. Diwo Catalyst’s pipeline:
- Diwo Supervisor Agent — receives the natural-language question, classifies the intent (retrieval / aggregation / comparison / what-if), routes to the correct downstream specialists.
- SQL Generator Agent — uses the Semantic Knowledge Graph to generate schema-aware, syntactically correct SQL. Multi-shot retry with error feedback if the warehouse returns an error.
- Query Execution Layer — runs the SQL against the live warehouse with row-level security applied. Returns results plus the SQL for audit.
- Visual Agent — selects an appropriate chart type based on data shape (time series, categorical breakdown, distribution, scatter).
- Insight Agent — generates the narrative summary of the result. Pulls candidate interpretations from the data.
- Anti-Hallucination Agent — verifies every numerical claim in the Insight Agent’s narrative against the query result.
- Recommendation Agent (when the question is decision-shaped) — produces a ranked, dollar-quantified next-best-action with three AI-validated alternatives.
Underneath the agents, the LLM is a swappable component. In Diwo, customers configure their preferred LLM provider per tenant: OpenAI, Anthropic, Google, Groq, or a private deployment. The architecture, not the LLM choice, is what makes the system enterprise-grade.
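The hand-offs between agents can be sketched as a simple orchestration function. Agent internals are stubbed as callables here; the names mirror the pipeline described above but are not real Diwo APIs, and the retry count is an assumption:

```python
# Hedged sketch of the agent hand-offs in a multi-agent NL-to-SQL
# pipeline. Each stage is an injected callable, so every agent is
# swappable and testable in isolation.

def run_pipeline(question, classify, generate_sql, execute, visualize,
                 narrate, verify, max_retries=2):
    intent = classify(question)                   # Supervisor: route by intent
    sql, rows, err = None, None, None
    for _ in range(max_retries + 1):              # multi-shot retry on errors
        sql = generate_sql(question, intent, error=err)
        rows, err = execute(sql)                  # row-level security applied here
        if err is None:
            break
    if err is not None:
        raise RuntimeError(f"query failed after retries: {err}")
    chart = visualize(rows)                       # Visual Agent: chart from data shape
    narrative = narrate(rows)                     # Insight Agent: narrative summary
    if not verify(narrative, rows):               # Anti-Hallucination Agent: gate
        raise ValueError("narrative does not match query result")
    # SQL is returned alongside the answer so the query stays auditable
    return {"sql": sql, "rows": rows, "chart": chart, "narrative": narrative}
```

Note that the verification gate sits between the Insight Agent and the user: a narrative that fails the check never leaves the pipeline, which is the “anti-hallucination by architecture” property in code form.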
What this means for buyers and builders.
If you’re evaluating an NL-to-SQL platform, three diagnostic questions separate production-grade systems from demoware:
- Does the platform maintain a Semantic Knowledge Graph or equivalent business-definition layer over the warehouse? Or does it just send the schema to the LLM?
- Is there an Anti-Hallucination layer that verifies every numerical claim in the answer against the actual query result? Or does the LLM’s narrative go straight to the user?
- Are the SQL queries visible and auditable for every answer? Or does the platform hide the SQL behind “trust us” UX?
A platform that fails any of the three is a science experiment, not an enterprise system. Diwo Catalyst was built with all three as foundational requirements; that architectural commitment is what separates conversational analytics that produces governed decisions from a ChatGPT-shaped wrapper that produces guesses.
The questions readers ask.
What is NL-to-SQL?
NL-to-SQL (Natural Language to SQL, sometimes Text-to-SQL) is the practice of translating a plain-English question into an executable SQL query against a database. The user types 'show me last quarter's top 10 customers by revenue' and the system generates a query that answers it. NL-to-SQL is one of the foundational capabilities of conversational analytics — without it, every conversational interface is just a wrapper that asks an LLM to invent numbers.
Why do generic LLMs fail at SQL on enterprise data?
Three structural reasons. (1) Schema blindness — the LLM doesn't know your 800 tables, your column meanings, or which join is correct. It guesses, often wrongly. (2) Hallucination — when uncertain, LLMs generate plausible-looking SQL that runs but returns wrong results, or doesn't run at all. (3) No verification — without a layer that checks the generated SQL against the schema and the actual returned values, the user has no way to know whether the answer is correct. For enterprise decisions, all three are unacceptable.
What is a Semantic Knowledge Graph?
A Semantic Knowledge Graph is the platform's mental model of your data warehouse. It captures every table, every column, the data types, the foreign-key relationships, the joins (including non-obvious ones), the business definitions ('active customer means logged in within 90 days', 'revenue means GAAP not gross'), and the metadata describing how concepts map to physical schema. NL-to-SQL grounded in a Semantic Knowledge Graph is dramatically more accurate than NL-to-SQL grounded in raw schema only.
What's the multi-agent verification loop?
A multi-agent NL-to-SQL pipeline decomposes the work. The SQL Generator Agent writes the query. The Anti-Hallucination Agent verifies that every numerical claim in the natural-language answer matches the actual query result before showing it to the user. The Visual Agent picks an appropriate chart type. The Insight Agent narrates the finding. Each agent is testable in isolation; each has a verifiable output. The architecture means a single LLM hallucination can't cascade into a wrong answer that reaches the user.
Can I just use ChatGPT to write SQL?
ChatGPT can write SQL on demand for small toy schemas you paste into the prompt. It cannot persistently connect to your warehouse, doesn't know your schema across sessions, hallucinates joins on real production warehouses with hundreds of tables, and has no verification layer that checks whether the generated SQL produces correct results. For ad-hoc personal queries on small CSVs, ChatGPT is fine. For governed analytics on enterprise data, you need a purpose-built NL-to-SQL platform with schema awareness and multi-agent verification.
What does Diwo's NL-to-SQL pipeline look like?
Diwo Catalyst's NL-to-SQL is multi-agent: the Diwo Supervisor Agent receives the natural-language question, routes it to the SQL Generator Agent (which uses the Semantic Knowledge Graph to write a schema-aware query), executes the query against your warehouse with row-level security applied, returns the result to the Anti-Hallucination Agent (which verifies every numeric claim), and ships the verified result to the user with the SQL visible for audit. Underneath the agents, the LLM is configurable per tenant — OpenAI, Anthropic, Google, Groq, or a private deployment.
Related reading from Diwo.
Stop reading. Start trying.
Free 15-day Catalyst trial. White-glove onboarding. No credit card. Connect your warehouse — Snowflake, Databricks, BigQuery, Redshift, Postgres, MySQL — or upload a CSV.
