Data Validation Report
Every query result on this site is validated against published statistics from the CDC, NCHS, and CMS. This report shows our automated test suite results.
Methodology
Each test case compares a result from our data against a published value from an official CDC, NCHS, or CMS source. We run two independent layers of validation for every test:
Layer 1 (Gold SQL): A hand-written SQL query is executed directly against the DuckDB database on Railway. This tests whether the data itself reproduces published statistics, independent of the AI layer. If Layer 1 fails, the data or our understanding of the codebook is wrong.
Layer 2 (NL Query): A natural language question is sent through the full production pipeline: the question goes to our API, Claude generates SQL, Railway executes it, and the result is checked against the same published value. This tests the end-to-end system that users interact with. If Layer 2 fails but Layer 1 passes, the AI is misinterpreting the question or generating incorrect SQL.
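How the two layer outcomes are read together can be sketched as a small decision function. This is a minimal illustration, not the site's actual test harness; the function name `diagnose` and its return strings are hypothetical.

```python
def diagnose(layer1_pass: bool, layer2_pass: bool) -> str:
    """Map the pass/fail outcome of the two layers to a likely cause."""
    if not layer1_pass:
        # The hand-written SQL cannot reproduce the published value, so the
        # data or the codebook interpretation is suspect. The Layer 2 outcome
        # does not change this reading.
        return "data or codebook issue"
    if not layer2_pass:
        # The data reproduces the statistic, but the generated SQL does not.
        return "AI issue: misread question or incorrect generated SQL"
    return "pass"


print(diagnose(True, False))  # AI issue: misread question or incorrect generated SQL
```

Checking Layer 1 first reflects the ordering in the text: a Layer 2 result is only informative about the AI once the data itself is known to reproduce the published value.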
BRFSS Results
11 tests. Behavioral Risk Factor Surveillance System (BRFSS): self-reported survey data, 400K+ respondents/year. Values are weighted prevalence percentages using CDC's _LLCPWT survey weights.
| Statistic | Year | Published | Gold SQL | Gold SQL Dev (pp) | NL Query | NL Query Dev (pp) | Source |
|---|---|---|---|---|---|---|---|
| Adult obesity (national) | 2017 | 30.1% | 30.1% | 0.0 | 30.1% | 0.0 | CDC Obesity Maps |
| Adult obesity (national) | 2018 | 30.9% | 30.9% | 0.0 | 30.9% | 0.0 | CDC Obesity Maps |
| Adult obesity (West Virginia) | 2018 | 39.5% | 39.5% | 0.0 | 39.5% | 0.0 | CDC State Data |
| Adult obesity (Colorado) | 2018 | 22.9% | 22.9% | 0.0 | 22.9% | 0.0 | CDC State Data |
| Current smoking | 2018 | 15.5% | 15.5% | 0.0 | 15.5% | 0.0 | CDC Tobacco Data |
| Adult obesity (national) | 2020 | 31.9% | 31.9% | 0.0 | 31.9% | 0.0 | CDC BRFSS Overweight and Obesity Dataset |
| Diagnosed diabetes | 2018 | 10.9% | 11.4% | +0.5 | 11.8% | +0.9 | CDC Chronic Disease Indicators — Diabetes |
| Current asthma | 2018 | 9.2% | 9.2% | 0.0 | 9.2% | 0.0 | CDC Asthma |
| Physical inactivity | 2018 | 24.5% | 24.5% | 0.0 | 24.5% | 0.0 | CDC PCD |
| Adult obesity (national) | 2023 | 34.3% | 32.8% | -1.5 | 32.8% | -1.5 | CDC Newsroom |
| Lifetime depression diagnosis (national) | 2020 | 18.5% | 18.8% | +0.3 | 18.8% | +0.3 | CDC MMWR 72(24), June 2023 |
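The BRFSS values above are weighted prevalence percentages: each respondent counts in proportion to their survey weight instead of counting once. A minimal sketch of that calculation follows. The function name `weighted_prevalence` and the sample records are made up for illustration, and the handling of missing or refused responses in the real queries is not shown.

```python
def weighted_prevalence(records):
    """Return the weighted percentage of respondents who have the condition.

    `records` is an iterable of (has_condition, weight) pairs, where `weight`
    plays the role of a survey weight such as _LLCPWT.
    """
    total = 0.0
    positive = 0.0
    for has_condition, weight in records:
        total += weight
        if has_condition:
            positive += weight
    if total == 0:
        raise ValueError("no weighted respondents")
    return 100.0 * positive / total


# Made-up respondents, not BRFSS data.
sample = [(True, 1200.0), (False, 800.0), (False, 2500.0), (True, 500.0)]
print(round(weighted_prevalence(sample), 1))  # 34.0
```

In this sample, two of four respondents have the condition (50% unweighted), but they carry only 1,700 of the 5,000 total weight, so the weighted prevalence is 34.0%.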
NHANES Results
8 tests. National Health and Nutrition Examination Survey (NHANES), 2021–2023 cycle: clinical exams and lab measurements. Values are weighted prevalence percentages using WTMEC2YR exam weights.
| Statistic | Year | Published | Gold SQL | Gold SQL Dev (pp) | NL Query | NL Query Dev (pp) | Source |
|---|---|---|---|---|---|---|---|
| Obesity overall (BMI≥30) | 2021–23 | 40.3% | 40.3% | 0.0 | 39.8% | -0.5 | NCHS Brief #508 |
| Obesity, men (BMI≥30) | 2021–23 | 39.2% | 39.2% | 0.0 | 38.7% | -0.5 | NCHS Brief #508 |
| Obesity, women (BMI≥30) | 2021–23 | 41.3% | 41.3% | 0.0 | 40.8% | -0.5 | NCHS Brief #508 |
| Total diabetes (incl. undiagnosed) | 2021–23 | 15.8% | 13.8% | -2.0 | 13.8% | -2.0 | NCHS Brief #516 |
| High cholesterol (≥240 mg/dL) | 2021–23 | 11.3% | 11.4% | +0.1 | 11.1% | -0.2 | NCHS Brief #515 |
| Hypertension (measured + Dx) | 2021–23 | 47.7% | 50.0% | +2.3 | 50.0% | +2.3 | NCHS Brief #511 |
| Severe obesity (BMI≥40) | 2021–23 | 9.4% | 9.4% | 0.0 | 9.3% | -0.1 | NCHS Brief #508 |
| Depression (PHQ-9≥10) | 2021–23 | 13.1% | 12.6% | -0.5 | 12.6% | -0.5 | NCHS Brief #527 |
Medicare Inpatient (Part A) Results
4 tests. Medicare Inpatient Prospective Payment System (IPPS): hospital discharges by DRG, ~2M rows across 11 years (2013–2023). Values are counts from the CMS Provider Summary PUF, which only includes hospitals with ≥11 discharges per DRG.
| Statistic | Year | Published | Gold SQL | Gold SQL Dev (%) | NL Query | NL Query Dev (%) | Source |
|---|---|---|---|---|---|---|---|
| IPPS hospitals | 2023 | 3,100 | 2,941 | -5.1 | 2,941 | -5.1 | CMS IPPS PUF |
| Distinct DRG codes | 2023 | 600 | 534 | -11.0 | 534 | -11.0 | CMS FY 2023 IPPS Rule |
| Top DRG: Septicemia (871) | 2023 | 561,177 | 561,177 | 0.0 | 561,177 | 0.0 | CMS IPPS PUF |
| #2 DRG: Heart Failure (291) | 2023 | 319,367 | 319,367 | 0.0 | 319,367 | 0.0 | CMS IPPS PUF |
Medicare Part D Results
5 tests. Medicare Part D Prescribers by Provider and Drug: 276M rows across 11 years (2013–2023). Published values are aggregate totals from the CMS Public Use File. Prescriber-drug combinations with fewer than 11 claims are suppressed by CMS before release.
| Statistic | Year | Published | Gold SQL | Gold SQL Dev | NL Query | NL Query Dev | Source |
|---|---|---|---|---|---|---|---|
| Unique prescribers | 2023 | 1,104,162 | 1,104,162 | 0.0 | 1,104,162 | 0.0 | CMS Part D PUF |
| Total claims | 2023 | 1,393,568,104 | 1,393,568,104 | 0.0 | 1,393,568,104 | 0.0 | CMS Part D PUF |
| Total drug cost | 2023 | $212.7B | $212.7B | 0.0 | $212.7B | 0.0 | CMS Part D PUF |
| Unique prescribers | 2019 | 985,533 | 985,533 | 0.0 | 985,533 | 0.0 | CMS Part D PUF |
| Total drug cost | 2019 | $137.0B | $137.0B | 0.0 | $137.0B | 0.0 | CMS Part D PUF |
Notes
Tolerance thresholds
Each test has a predefined tolerance (typically 1–2 percentage points for BRFSS and 1.5–5 percentage points for NHANES). These tolerances account for differences in survey weight versions, age cutoffs, and rounding. A deviation whose absolute value is within tolerance counts as a pass.
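The pass rule can be sketched as follows, using two rows from the tables above. The function names are hypothetical, and the tolerance values passed in are assumptions chosen from the typical ranges, not the tolerances configured for those specific tests.

```python
def deviation_pp(published: float, observed: float) -> float:
    """Deviation in percentage points, rounded to one decimal place."""
    return round(observed - published, 1)


def within_tolerance(published: float, observed: float, tolerance_pp: float) -> bool:
    """A test passes when the absolute deviation does not exceed the tolerance."""
    return abs(deviation_pp(published, observed)) <= tolerance_pp


# BRFSS adult obesity (national), 2023: published 34.3%, Gold SQL 32.8%.
print(deviation_pp(34.3, 32.8))           # -1.5
print(within_tolerance(34.3, 32.8, 2.0))  # True, with an assumed 2.0 pp tolerance

# NHANES hypertension, 2021-23: published 47.7%, Gold SQL 50.0%.
print(deviation_pp(47.7, 50.0))           # 2.3
print(within_tolerance(47.7, 50.0, 5.0))  # True, with an assumed 5.0 pp tolerance
```

Rounding before comparing keeps the check aligned with the one-decimal deviations shown in the tables and avoids floating-point noise at the boundary.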
BRFSS vs NHANES obesity gap
BRFSS reports roughly 30–34% obesity across the years tested; NHANES reports ~40%. This is not an error. BRFSS uses self-reported height and weight (respondents tend to underreport weight), while NHANES uses clinical measurements. The gap is well-documented in epidemiological literature.
CMS Public Use File suppression
CMS suppresses all provider-level rows in the Medicare PUF that have fewer than 11 claims, beneficiaries, or discharges. This means aggregate totals computed from the PUF are systematically lower than universe totals. For Medicare Inpatient, hospital and DRG counts are ~5–15% below CMS-reported totals (5.1% and 11.0% in the two count tests above). For Part D, published values are computed directly from the PUF, so Gold SQL matches exactly.
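The effect of suppression on counts and totals can be illustrated with a few made-up rows. This is a sketch under stated assumptions: the rows, the `apply_suppression` helper, and the `percent_deviation` helper are hypothetical. The deviations reported for the hospital and DRG count tests in the Medicare Inpatient table are consistent with a relative difference in percent, which is what `percent_deviation` computes.

```python
SUPPRESSION_THRESHOLD = 11  # rows with fewer than 11 discharges are withheld


def apply_suppression(rows):
    """Keep only (hospital, drg, discharges) rows at or above the threshold."""
    return [row for row in rows if row[2] >= SUPPRESSION_THRESHOLD]


def percent_deviation(published: float, observed: float) -> float:
    """Relative difference in percent, rounded to one decimal place."""
    return round(100.0 * (observed - published) / published, 1)


# Made-up universe of hospital-DRG rows, not CMS data.
universe = [("A", "871", 120), ("A", "291", 8), ("B", "871", 40), ("C", "291", 5)]
released = apply_suppression(universe)

print(len({h for h, _, _ in universe}), len({h for h, _, _ in released}))  # 3 2
print(sum(n for _, _, n in universe), sum(n for _, _, n in released))      # 173 160

# Counts from the Medicare Inpatient table above.
print(percent_deviation(3100, 2941))  # -5.1
print(percent_deviation(600, 534))    # -11.0
```

Hospital C disappears entirely because its only row falls below the threshold, which is why distinct counts drop as well as discharge totals.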
What each layer catches
Layer 1 failures indicate data issues: wrong codebook interpretation, missing survey weights, or incorrect variable coding. Layer 2 failures (with Layer 1 passing) indicate AI issues: the NL-to-SQL model is misreading the question or generating incorrect queries. Both layers passing means the data reproduces the published value within tolerance and the AI can reproduce that result from a plain English question.