Self-archived preprint · May 14, 2026 (rev. Jun 2, 2026)

The Validation Crisis in AGI Capability Forecasting

Kacper Saks

Abstract

Forecasts of when artificial general intelligence will arrive increasingly shape capital allocation, regulation, and where a generation of talent is placed — yet the confidence attached to them exceeds what the methods producing them can support. This paper argues the gap is structural: the predictable consequence of fitting a model to a measured window and projecting it forward without the validation discipline other quantitative fields require. We import that discipline from quantitative finance — the deflated Sharpe ratio, the probability of backtest overfitting, and a walk-forward retrodiction protocol — and introduce the Deflated Capability Forecast (DCF), a method that widens a forecast's stated interval by the amount its underlying methodology warrants, returning a distribution with explicit treatment of the tails in place of a point estimate carrying unearned precision. Across the forecasts where the method could be fully computed, deflation factors cluster between 1.3× and 2.0× — the stated intervals are systematically too narrow. We then turn the method on this work itself: a preregistered prediction that one landmark forecast's interval would widen by at least 2.3× produced 1.285×. We report the failure rather than revise the threshold — a discipline of honest validation is supposed to surface exactly this.

Overview

The major published AGI timeline forecasts share a methodological foundation that would not survive scrutiny in a mature quantitative discipline. This is not a claim about whether they are right; it is a claim about how they are made. Each extrapolates from a measured window, fits a curve to it, and reads the confidence around the extension as though the future were a continuation of the sample. Quantitative finance has a name for this and a set of tools built specifically to defend against it. The forecasting community has, for the most part, neither.

The central argument is narrow: the current debate over AGI timelines suffers from the same methodological errors quantitative finance diagnosed and partially solved between 2014 and 2018 — in-sample extrapolation, multiple testing without correction, the absence of walk-forward validation, and selection bias toward success. These are the failures a hedge fund learns to detect before it is allowed to manage external capital, because each reliably produces a track record that looks excellent and means nothing.
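To make the first and third of these errors concrete, the following is a minimal sketch of in-sample extrapolation next to a walk-forward check. It is not taken from the paper: the series is synthetic, the linear model and the 1.96-sigma band are illustrative assumptions, and the names `fit_line` and `coverage` are hypothetical helpers.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, statistics.pstdev(resid)

def coverage(xs, ys, a, b, sd, z=1.96):
    """Share of points inside the naive band a + b*x +/- z*sd."""
    hits = sum(abs(y - (a + b * x)) <= z * sd for x, y in zip(xs, ys))
    return hits / len(xs)

# Synthetic series (illustrative only): linear growth that flattens at t = 30.
rng = random.Random(0)
ts = list(range(50))
ys = [min(t, 24 + 0.2 * t) + rng.gauss(0, 1.0) for t in ts]

cutoff = 30  # freeze the information set here
a, b, sd = fit_line(ts[:cutoff], ys[:cutoff])

in_cov = coverage(ts[:cutoff], ys[:cutoff], a, b, sd)    # in-sample window
out_cov = coverage(ts[cutoff:], ys[cutoff:], a, b, sd)   # held-out, elapsed period
print(f"in-sample coverage: {in_cov:.2f}")
print(f"held-out coverage:  {out_cov:.2f}")
```

The band is calibrated on the window it was fitted to, so it looks well behaved in-sample. Scoring it on the period after the cutoff is the only step that reveals whether the stated confidence survives a change of regime.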

The work critiques methodology, not people. It cites the forecasters without a single pejorative adjective, because their work is serious, and that seriousness is exactly why its methodological exposure is worth examining.

Method

┌─────────────────────────────────────────────────────────┐
│         THE PROBLEM: in-sample extrapolation            │
│   measure a window → fit a model → project forward      │
│   read the confidence interval as if the future were    │
│   a continuation of the sample (it is not)              │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│        THREE TOOLS FROM QUANTITATIVE FINANCE            │
│                                                         │
│  DSR  Deflated Sharpe Ratio                             │
│       corrects the reported statistic for best-of-N     │
│       (Bailey & López de Prado 2014)                    │
│                                                         │
│  PFO  Probability of Forecast Overfitting               │
│       adapted from PBO via sequential-test partition    │
│       under the publication-time filtration             │
│                                                         │
│  WFR  Walk-Forward Retrodiction                         │
│       manufactures the held-out sample the forecast     │
│       never reserved — freeze the info set, score the   │
│       elapsed period                                    │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│        DCF — Deflated Capability Forecast               │
│   Equation 14.1': deflate the point estimate, weight    │
│   by overfitting probability, then widen + shift by a   │
│   certification-friction factor (φ, δ).                 │
│   Normal spine + generalized-Pareto right tail (EVT)    │
│   → a distribution with honest tails, not a point.      │
└─────────────────────────────────────────────────────────┘
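As a rough illustration of the first tool in the diagram, the sketch below implements the deflated Sharpe ratio in its original finance form, following Bailey and López de Prado (2014). It is not the paper's reference implementation. The input values are illustrative assumptions, the Sharpe ratio is per observation rather than annualized, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_N = NormalDist()  # standard normal

def expected_max_sharpe(n_trials, sr_std):
    """Expected maximum Sharpe ratio across n_trials skill-less trials (n_trials >= 2)."""
    g = EULER_GAMMA
    return sr_std * ((1 - g) * _N.inv_cdf(1 - 1 / n_trials)
                     + g * _N.inv_cdf(1 - 1 / (n_trials * math.e)))

def deflated_sharpe(sr, n_obs, n_trials, sr_std, skew=0.0, kurt=3.0):
    """Probability that the observed Sharpe ratio exceeds the best-of-N benchmark.

    sr      observed (non-annualized) Sharpe ratio of the selected trial
    n_obs   number of observations behind sr
    sr_std  standard deviation of Sharpe ratios across the trials
    """
    sr0 = expected_max_sharpe(n_trials, sr_std)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _N.cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)

# Illustrative inputs (assumptions, not values from the paper).
for n_trials in (2, 10, 100):
    dsr = deflated_sharpe(sr=0.10, n_obs=250, n_trials=n_trials, sr_std=0.05)
    print(n_trials, round(dsr, 3))
```

The same reported statistic earns less credit as the number of trials behind it grows, which is why an undisclosed search count matters so much.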

The financial tools do not transfer for free. Three adaptations are carried as explicit honesty statements: capability series are non-stationary (the IID variance assumption is replaced with a stationary bootstrap), the search count is undisclosed (the effective trial count is ranged, not asserted), and the performance statistic has no off-the-shelf analog (a five-candidate panel — Brier score, log-loss, calibration error, interval coverage, capability-Sharpe — is selected per forecast). Each adaptation is labeled (speculative) by default rather than waved past.
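The first adaptation can be sketched as follows. This is a minimal version of the stationary bootstrap of Politis and Romano (1994), which resamples blocks of random, geometrically distributed length so that serial dependence is preserved. The data are hypothetical, applying the resampling to increments rather than levels is an assumption of this sketch, and the function names are not taken from the package.

```python
import random
import statistics

def stationary_bootstrap(series, mean_block, rng):
    """One resample with geometric block lengths of mean `mean_block` (>= 1)."""
    n = len(series)
    p = 1.0 / mean_block              # probability of starting a new block
    idx = rng.randrange(n)
    out = []
    for _ in range(n):
        out.append(series[idx])
        if rng.random() < p:
            idx = rng.randrange(n)    # jump: start a new block
        else:
            idx = (idx + 1) % n       # continue the block, wrapping around
    return out

def bootstrap_se_of_mean(series, mean_block, n_resamples=500, seed=0):
    """Bootstrap standard error of the mean under serial dependence."""
    rng = random.Random(seed)
    means = [statistics.fmean(stationary_bootstrap(series, mean_block, rng))
             for _ in range(n_resamples)]
    return statistics.stdev(means)

# Hypothetical log-capability levels. The levels trend, so the sketch
# resamples their increments instead.
levels = [0.0, 0.4, 0.9, 1.1, 1.8, 2.1, 2.2, 2.9, 3.5, 3.6, 4.4, 4.9]
increments = [b - a for a, b in zip(levels, levels[1:])]

se = bootstrap_se_of_mean(increments, mean_block=3)
print(f"bootstrap SE of mean increment: {se:.3f}")
```

The mean block length is a tuning choice. Like the effective trial count, it is probably better reported over a range than asserted as a single value.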

Results

The DCF was computed for five forecasts. Across the surveyed set, deflation factors cluster between 1.28× and 2.02× — the stated intervals are systematically too narrow.

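┌─────────────────────────────────┬─────────────────┬─────────────────┐
│ Forecast                        │ Reported 95% CI │ Deflation ratio │
├─────────────────────────────────┼─────────────────┼─────────────────┤
│ Aschenbrenner (OOM, 2027)       │ [2025, 2029]    │ 1.285× / 1.539× │
│ Cotra 2020 (anchors, 2052)      │ [2031, 2100]    │ 1.531×          │
│ Cotra 2022 (anchors, 2040)      │ [2030, 2100]    │ 1.732×          │
│ Davidson (takeoff duration)     │ [1, 10] yr      │ 2.021×          │
│ Self-prediction (preregistered) │ [10, 82.5] %    │ 1.320×          │
└─────────────────────────────────┴─────────────────┴─────────────────┘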
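To give a feel for what a deflation ratio does to a stated interval, here is a deliberately simplified sketch. It scales each side of the interval away from the point estimate by the ratio. This is an assumption made for illustration, not the paper's Equation 14.1', which also weights by overfitting probability and applies a shift. Reading the year in a row label as the point estimate is likewise an assumption, and `widen` is a hypothetical helper.

```python
def widen(point, low, high, factor):
    """Scale both sides of [low, high] away from `point` by `factor`."""
    if not (low <= point <= high):
        raise ValueError("point estimate must lie inside the interval")
    return point - factor * (point - low), point + factor * (high - point)

# First table row: point estimate 2027 (assumed from the row label),
# reported interval [2025, 2029], and the two listed ratios.
for factor in (1.285, 1.539):
    lo, hi = widen(2027, 2025, 2029, factor)
    print(f"{factor}x -> [{lo:.2f}, {hi:.2f}]")
```

Plain scaling of this kind keeps the interval's shape and only stretches it. It says nothing about the tails, which is the part the normal spine and generalized-Pareto construction in the Method section is meant to handle.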

The Integrity Capstone

The strongest element of the paper is turned on the author. Before computing anything, the work preregistered a specific prediction: that applying the deflated Sharpe ratio to one landmark forecast would widen its interval by at least 2.3×.

It produced 1.285×. The prediction failed.

The paper reports the failure rather than revising the threshold. The failure lies not in the framework, which computed correctly and deflated each interval exactly as its derivation specifies, but in the threshold: a prior set by intuition when it should have been derived from the deflation calculation itself. That is precisely the error this work exists to identify, surfaced now against the author's own number. A discipline of honest validation is supposed to surface exactly this. The preregistration is cryptographically locked (git tag preregistration-v3-locked, SHA-256 sidecar fingerprint), so verification proceeds against the immutable record rather than against prose that could drift from it.
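The two checks described here can be sketched in a few lines: compare the observed ratio with the preregistered threshold, and confirm that a document matches its recorded SHA-256 fingerprint. The sidecar layout (digest as the first whitespace-separated token, as `sha256sum` writes it) and the sample bytes are assumptions of this sketch, and the function names are hypothetical rather than taken from the repository.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()

def verify_sidecar(document: bytes, sidecar_text: str) -> bool:
    """True if the sidecar's first token equals the document's digest."""
    tokens = sidecar_text.split()
    return bool(tokens) and tokens[0].lower() == sha256_hex(document)

def prediction_held(observed: float, threshold: float) -> bool:
    """The preregistered claim was 'at least threshold'."""
    return observed >= threshold

# Stand-in bytes for the locked preregistration text (hypothetical content).
prereg = b"Prediction: the interval widens by at least 2.3x.\n"
sidecar = sha256_hex(prereg) + "  preregistration.md\n"  # hypothetical file name

print("record intact:  ", verify_sidecar(prereg, sidecar))
print("tampered record:", verify_sidecar(prereg + b"edited", sidecar))
print("prediction held:", prediction_held(observed=1.285, threshold=2.3))
```

Keeping the two checks separate mirrors the point of the section: the record can verify cleanly while the prediction it contains still fails.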

What the Paper Refuses to Claim

The refusals matter as much as the claims. The paper does not argue that AGI will not arrive. It does not argue that the forecasters are dishonest or incompetent. It does not offer its own, better date — to do so would repeat the exact error it identifies. And it does not claim that quantitative finance has solved its own validation problems — only that finance has paid, in real capital over real years, for a discipline capability forecasting has not yet adopted.

Reproducibility

LaTeX (single-column preprint), Python reference implementation (deflated-capability-forecast package), Jupyter notebooks that re-verify every plotted value against the package to a 1e-3 tolerance, git-tagged cryptographic preregistration, trace-to-source discipline (every paragraph cites a primary source), and explicit epistemic labeling throughout: (established) / (speculative) / (speculative — derived).
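A re-verification pass of the kind described might look like the sketch below. Treating the 1e-3 tolerance as absolute rather than relative is an assumption, the "recomputed" numbers are hypothetical stand-ins for package output, and `reverify` is not a function from the package.

```python
import math

def reverify(plotted, recomputed, tol=1e-3):
    """Return the names whose plotted value is missing or off by more than tol."""
    mismatches = []
    for name, value in plotted.items():
        other = recomputed.get(name)
        if other is None or not math.isclose(value, other, rel_tol=0.0, abs_tol=tol):
            mismatches.append(name)
    return mismatches

# Plotted values as they appear in the results table.
plotted = {"cotra_2020": 1.531, "cotra_2022": 1.732, "davidson": 2.021}
# Hypothetical recomputed values; the last one is deliberately off.
recomputed = {"cotra_2020": 1.5314, "cotra_2022": 1.7318, "davidson": 2.030}

print(reverify(plotted, recomputed))
```

Returning the list of offending names, instead of stopping at the first mismatch, lets a single notebook run report every value that has drifted.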

Cite this work

BibTeX
@misc{saks2026validation,
  title        = {The Validation Crisis in AGI Capability Forecasting},
  author       = {Saks, Kacper},
  year         = {2026},
  howpublished = {Self-archived preprint},
  url          = {https://kacpersaks.dev/research/validation-crisis},
  note         = {21 pp., 186 references}
}
quantitative-finance · statistics · ai-forecasting · validation · methodology