Is Dark Energy Evolving? A Statistician’s Guide to Not Fooling Yourself

by Will Handley

In cosmology, combining disparate datasets is a delicate business. If two observations yield conflicting constraints on, say, the expansion rate of the universe, you cannot simply average them. You must identify the source of the discrepancy. We construct our picture of the universe from fundamentally different types of data — the cosmic microwave background, type Ia supernovae, and baryon acoustic oscillations — and before we trust a joint analysis, we must check that these datasets are telling a consistent statistical story.

The Jeffreys-Lindley Paradox

In 2024, the DESI collaboration released measurements suggesting that dark energy might evolve over time rather than acting as a static cosmological constant. If true, this challenges the standard LCDM paradigm. One frequentist metric quoted the result at approximately 4.2 sigma — close enough to discovery language that the community understandably paid attention.

To test this, we ran a large grid of dataset and model combinations on the DiRAC supercomputer. We computed not just parameter constraints, but the Bayesian evidence: a score for the *entire model*, not just its best-fitting parameters. Bayesian evidence automatically penalises unnecessary complexity, making it the appropriate tool for model comparison.
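To see how the evidence penalises unused parameter freedom, here is a deliberately tiny toy sketch — hypothetical data and made-up numbers, nothing to do with the real pipeline. We compare a Gaussian model with a parameter pinned to zero against one where that parameter is free under a wide uniform prior, computing the evidence integral by brute-force quadrature:

```python
import math

# Hypothetical toy dataset: 10 unit-variance draws, mean close to zero.
data = [0.3, -0.1, 0.5, 0.2, 0.0, 0.4, -0.2, 0.1, 0.3, 0.2]

def log_likelihood(mu):
    """Gaussian log-likelihood with unit variance."""
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi) for x in data)

# Model A: mu fixed at 0 (no free parameters) -> evidence is just the likelihood.
logZ_A = log_likelihood(0.0)

# Model B: mu free, uniform prior on [-10, 10] -> evidence marginalises over mu.
# Brute-force Riemann sum of likelihood * prior over a grid.
lo, hi, ngrid = -10.0, 10.0, 20001
step = (hi - lo) / (ngrid - 1)
prior = 1.0 / (hi - lo)
Z_B = sum(math.exp(log_likelihood(lo + i * step)) * prior * step
          for i in range(ngrid))
logZ_B = math.log(Z_B)

print(f"log Z_A (fixed parameter) = {logZ_A:.2f}")
print(f"log Z_B (free parameter)  = {logZ_B:.2f}")
```

The free-parameter model fits slightly better at its best-fitting point, yet its evidence is lower: the wide prior spreads probability over values the data rule out, and the evidence charges for that. This Occam penalty is what the frequentist significance in the DESI story never accounts for.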

During this analysis, a bug was identified in the DES supernova code, and the frequentist significance dropped to 3.2 sigma. The Bayesian evidence, however, provided a starker conclusion: the signal effectively vanished. Evolving dark energy and LCDM explain the data equally well once you account for the extra parameter freedom. The 4.2 sigma arose from a frequentist test answering a fundamentally different question to “which model should I believe?” — a well-known statistical mismatch called the Jeffreys-Lindley paradox. Our tension analysis confirmed that the original signal was driven by a systematic mismatch *between* datasets, not by genuine dark energy evolution. After the bug fix, the mismatch resolved.
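A minimal numerical sketch of the paradox, with invented numbers (not DESI's): given a large enough sample, a tiny effect can be "several sigma" by p-value while the Bayes factor — which integrates the alternative over a broad prior — still prefers the null. The closed form below is the standard one for a Gaussian likelihood with unit variance and a Gaussian prior on the mean:

```python
import math

# Invented numbers purely for illustration: a tiny mean offset measured
# with a very large sample of unit-variance data.
n = 100_000
xbar = 0.01
z = xbar * math.sqrt(n)                 # frequentist significance in "sigma"
p_value = math.erfc(z / math.sqrt(2))   # two-sided p-value

# Bayes factor for H0: mu = 0 against H1: mu ~ N(0, sigma_p^2), closed form
# for a Gaussian likelihood with unit variance.
sigma_p = 10.0                          # deliberately broad prior on the effect
t = n * sigma_p ** 2
bf01 = math.sqrt(1 + t) * math.exp(-0.5 * z ** 2 * t / (1 + t))

print(f"significance: {z:.2f} sigma (p = {p_value:.4f})")
print(f"Bayes factor in favour of the null: {bf01:.1f}")
```

The same data are "significant" by one yardstick and favour the null by the other — the two tools are answering different questions, which is the paradox in miniature.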

Code Debt and the AI Transition

Historically, computing Bayesian evidence has been about an order of magnitude more expensive than standard parameter fitting, which is why it is not performed routinely. The elephant in the room is code debt. There is an enormous amount of embodied intelligence in codes like CAMB, CLASS, and CosmoSIS, but PhD students routinely spend the bulk of their time debugging legacy Fortran rather than inserting new physics.

Large language models are starting to change this — not by doing the science, but by chewing through the tedious parts: code translation, refactoring, boilerplate, debugging. It is not fully push-button yet (there is still skill involved), but the productivity gains are already large. For a talk last year, I realised a key sampler was missing from BlackJAX and vibe-coded it in about fifteen minutes. Sam Leeney built JaxBandFlux — a JAX supernova photometry tool the community is already using — in hours rather than days. Natalie Hogg used off-the-shelf tools to find bottlenecks and shrink parts of a workflow from roughly 24 hours to 1–2 minutes.

Two separations matter here. First: GPUs are not AI. When someone claims their AI method is faster, the first question should be whether it is the method that is fast, or whether they just ran it on a GPU. Second: LLMs are not science. The goal is not to automate scientific judgement but to make the non-science parts of the job cheap enough that we can spend our time on physics and steering the analysis. I and many others mistook mental athleticism — typing fast, remembering syntax, doing long calculations — for scientific strength. With these tools, scientific principles and creative drive become the differentiators, and those are barely correlated with who was fastest at algebra or most fluent in C++.

This is not without risk. The indiscriminate adoption of AI does threaten the foundations of academia, and we need to engage with this seriously rather than with complacency.

From Days to Minutes

Once the same classical inference machinery — nested sampling, evidence calculations, consistency checks — runs on GPUs via JAX, the economics flip. We see thousandfold speedups: analyses that required days on HPC now land in minutes. These are not AI methods; they are the same trusted algorithms, finally using hardware properly.

If that becomes routine, then broad model comparison and dataset consistency checks stop being special one-off projects and become part of the default workflow.

What Comes Next

All of our analysis chains are publicly available through the Python package [anesthetic](https://github.com/williamjameshandley/anesthetic). As DESI and other surveys release further data, we will continue running these checks. Evolving dark energy would be an extraordinary discovery, but the current Bayesian accounting says the data do not demand it. If future releases keep pushing in the same direction *and* the datasets remain internally consistent, that would constitute compelling evidence to look beyond the standard model.

If producing code and text were virtually free, what would you do differently as a scientist? That is the world we now inhabit.