You Can’t Scale What You Can’t See 

Last week I went to the American Educational Research Association’s 2026 conference. This post reviews two of my presentations: (1) a five-state collaboration on tutoring data systems, and (2) a forensic audit of federal ESSER data. The papers were written with different co-authors, sat in different AERA divisions, and used different methods.

But across those different divisions, organizations, and panels, the papers shared one guiding theme: scaling high-dosage tutoring is a measurement problem as much as it is an instructional one. The evidence on whether tutoring works is settled. It works. The open question is whether we can see what’s happening clearly enough to steadily improve delivery at scale.

Here’s what each paper contributed to that question.

The playbook for state-wide tutoring insight

The first paper was a two-year collaboration between Accelerate, Harvard’s Strategic Data Project, and the state education agencies of Arkansas, Colorado, Delaware, Louisiana, and Ohio. It asked one question: what does it take for a state to scale high-dosage tutoring without losing sight of it? Across these five states (reaching ~45,000 students in 300+ schools across 150+ districts), we found four remarkably consistent lessons:

  1. Plan for interoperability from day one. Data needs to flow interoperably across authorized parties. Colorado collected tutoring data from many disparate districts; by specifying what data districts needed to report as a condition of its grant-making, the state was able to aggregate the data cleanly and understand implementation across its sites.
  2. Keep the collection burden low and the feedback loop fast. Arkansas nearly lost its first year of data to misformatted student IDs. They simplified their template, trained districts directly, and jumped to 98% reporting compliance.
  3. Tutoring data is most useful when it connects to broader student data systems. Ohio built reporting requirements into vendor contracts before launch, and was able to build a tutoring variable directly into their state data system, giving state-wide insight into which students were designated for intervention.
  4. Transparency drives improvement. When states can see which providers are delivering dosage and results, districts and vendors respond differently. Across all five states, providers and districts that clearly understood implementation requirements and could check their progress against them in near real time were able to pivot practice.

The headline is that funding tutoring is not enough if states don’t also build systems to monitor and improve it. With a simple dashboard, a state can move from “we spent X dollars” to “10,000 students received an average of 16 weeks, 30 sessions, and 20 hours of tutoring, and their gains outpaced non-tutored peers by about a quarter of a standard deviation.”
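To make that concrete, here is a minimal sketch, in Python, of the kind of per-student dosage aggregation such a dashboard runs. The file and column names are hypothetical, not the actual *Measure, Monitor, Improve* schema, and the sketch covers dosage only (not the comparison to non-tutored peers).

```python
import pandas as pd

# Hypothetical session-level tutoring log; columns assumed to be
# student_id, session_date, minutes. Not the actual toolkit schema.
sessions = pd.read_csv("tutoring_sessions.csv")

per_student = (
    sessions.assign(session_date=pd.to_datetime(sessions["session_date"]))
    .groupby("student_id")
    .agg(
        sessions=("session_date", "count"),
        hours=("minutes", lambda m: m.sum() / 60),
        # Rough week count: number of distinct ISO weeks with a session.
        weeks=("session_date", lambda d: d.dt.isocalendar().week.nunique()),
    )
)

print(f"{len(per_student):,} students tutored")
print(per_student[["weeks", "sessions", "hours"]].mean().round(1))
```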

The full multi-state data toolkit *Measure, Monitor, Improve* is available for any state or district that wants to adapt it. The landing page links to each of the five components: the narrative overview, the data collection protocol, the data dictionary and collection template, the editable dashboard, and state case studies from Arkansas, Delaware, Colorado, Louisiana, and Ohio.

What the federal record can and can’t tell us

The second paper asked the same visibility question at the national level: can we see, from the outside, where pandemic-era tutoring dollars actually went and what they bought? If you wanted to study pandemic-era tutoring spending nationally, you’d reach for the federal ESSER public data release. I ran that file through two analyses: (1) do the leading digits of the reported dollars look like naturally occurring data (Benford’s Law), and (2) are the holes in the data random (structured missingness)?
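For readers who want the shape of the first check, here is a rough Benford first-digit sketch in Python. The file and column names are placeholders, and the paper’s analysis may use different conformity statistics; this is an illustration, not the published method.

```python
import numpy as np
import pandas as pd

def benford_check(values: pd.Series) -> pd.DataFrame:
    """Compare observed leading-digit frequencies to Benford's Law expectations."""
    positive = values.dropna()
    positive = positive[positive > 0].astype(float)
    # Leading digit of each positive dollar amount.
    leading = positive.apply(lambda x: int(str(x).lstrip("0.")[0]))
    observed = leading.value_counts(normalize=True).reindex(range(1, 10), fill_value=0.0)
    expected = pd.Series({d: np.log10(1 + 1 / d) for d in range(1, 10)})
    table = pd.DataFrame({"observed": observed, "expected": expected})
    table["abs_deviation"] = (table["observed"] - table["expected"]).abs()
    return table

# Hypothetical usage on one expenditure column (placeholder names):
# esser = pd.read_csv("esser_public_file.csv")
# report = benford_check(esser["tutoring_expenditure"])
# print(report, "\nMAD:", report["abs_deviation"].mean())
```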

In plain terms, the reported numbers don’t look like naturally occurring spending. The fingerprint is consistent with rounded estimates, repeated placeholder values, and post-hoc reconstruction. It’s unlikely to be outright fabrication, but it isn’t precise accounting either. A specific district’s specific line item may still be fine, but using these data for aggregate comparisons across states or for fine-grained category analyses may not produce trustworthy results. That’s a problem, because those are exactly the uses most studies put them to.

The missingness story matters even more for equity-oriented work. Large-town and large-city districts, exactly the settings where recovery and equity research concentrate, have the highest rates of missing ESSER fields. 
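A structured-missingness check of that kind is straightforward to reproduce; the sketch below uses placeholder column names for locale and the expenditure fields, since the actual ESSER file has its own field names.

```python
import pandas as pd

# Placeholder file and column names; the real ESSER release uses its own schema.
esser = pd.read_csv("esser_public_file.csv")
spend_cols = [c for c in esser.columns if c.startswith("exp_")]

# Share of missing expenditure fields per district, averaged by locale
# (e.g., city, town, suburb, rural).
missing_by_locale = (
    esser.assign(missing_share=esser[spend_cols].isna().mean(axis=1))
    .groupby("locale")["missing_share"]
    .mean()
    .sort_values(ascending=False)
)
print(missing_by_locale)
```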

Year 5 ESSER reporting was suspended in April 2025 and is, per the US Department of Education, “not expected to be reinstated”. So this file is essentially the final federal ESSER public record. The practical implication for applied researchers is that ESSER expenditure columns need to be treated as measured with structured error by state and category, not as ground truth. 

To be clear, treating the whole file as off-limits is the wrong response. To make the triage practical, I built a companion web app that lets researchers search any ESSER variable and see, at a glance, its Benford conformity, its missingness pattern, and a suggested quality bucket (safer to use, use with sensitivity checks, needs category-specific caveats). If you’re planning to use ESSER spending data in a paper, start there before you pick your outcome columns.
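The bucketing logic itself is simple. A stylized version is below; the thresholds are illustrative assumptions, not the web app’s actual cutoffs.

```python
def quality_bucket(benford_mad: float, missing_rate: float) -> str:
    """Map a variable's Benford deviation and missingness into a triage bucket.

    Thresholds are illustrative only, not the web app's actual cutoffs.
    """
    if benford_mad < 0.012 and missing_rate < 0.05:
        return "safer to use"
    if benford_mad < 0.022 and missing_rate < 0.20:
        return "use with sensitivity checks"
    return "needs category-specific caveats"

# Example: a column with moderate Benford deviation and 12% missingness.
print(quality_bucket(benford_mad=0.018, missing_rate=0.12))
```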

The low quality of the data represents a major missed opportunity for understanding how huge sums of money were spent during the pandemic recovery. Failing to standardize metrics in a transparent, well-planned, well-designed way before moving billions of dollars hamstrung our ability to learn what was happening in real time, and it limits our ability to learn what worked, and what didn’t, as students recovered (or, mostly, didn’t) from the pandemic. That’s not just a research failure, or a data flow failure. That’s a failure of governance.

Use the web app here.

The through-line

If there is one lesson running through both papers, it is that educational improvement depends on visibility. We cannot scale what we cannot see, and we cannot improve what we do not measure well. Whether the unit is a single essay, a statewide tutoring initiative, or a federal spending stream, better outcomes depend on better information about what is actually happening on the ground. The next phase of improving outcomes and closing achievement gaps depends less on proving what works and more on building the systems that let leaders see where it is happening, where it is not, and what to do next.


Jason Godfrey is the Managing Director of Data & Information Systems at Accelerate.

