Published · Zenodo · archived research artifact · Jun 9, 2026 (rev. Jul 22, 2026)

An Executable-Gate Multi-Agent Research Organization: Artifact, Case Study, and a Pre-Registered Gate-Calibration Study

Kacper Saks

Abstract

Multi-agent systems are increasingly proposed as autonomous research organizations, yet their quality-control mechanisms are rarely measured. This work releases a domain-agnostic, 39-role multi-agent research organization in which every role output must pass an executable gate and an adversarial sign-off (nothing self-certifies), together with two empirical records of its behavior. The first is a limitations-forward case study of the initial end-to-end run (v0.1.0), in which the organization abandoned all five candidate research directions and retracted its own flagship integrity exhibit after it failed its verifier. The second is a pre-registered, blinded, seeded gate-calibration study (v0.2.0) answering the question the first run left open: do the gates discriminate between sound and flawed research, or do they uniformly reject? On a corpus of 20 theses (10 known-flawed with planted defects, 10 known-sound reproductions, SHA-256 sealed labels), the gates detected all 15 planted flaws and cited the correct reason in 13 of 15 cases with the round-1 auto-fail prior active and in 14 of 15 with it ablated. The ablation isolated the round-1 auto-fail prior as the dominant false-positive source: with the prior active, pooled validity-gate specificity fell from 0.96 [0.86, 1.00] to 0.72 [0.58, 0.84] and 8 of 10 known-sound theses were killed (versus 1 of 10 without it), with no sensitivity benefit. The release includes a Citation Fidelity Protocol with SHA-256-pinned verbatim anchors, a full reproducibility harness (make reproduce / make verify), and seven named limitations stated in full.

Overview

The artifact is a complete research organization expressed as 39 Claude agent roles across 13 divisions — leadership, frontier research, theory and statistics, interpretability, research engineering, evaluation and red team, safety and governance, literature, scientific writing, visualization, program management, publication, and a self-improvement loop — operating over a 9-phase lifecycle with 4 human checkpoints.

The structural rule that defines the system: nothing self-certifies. Every role output passes an executable gate (tests, make verify, citation checks, grep gates) plus adversarial sign-off from a paired critic role holding kill authority.
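The rule can be illustrated with a short sketch. This is not the artifact's code: `GateResult`, `accept_output`, and the toy checks below are hypothetical names chosen for illustration, and the real gates (tests, make verify, citation checks, grep gates) are stood in for by simple predicates. The only point shown is the conjunction: an output is accepted only when every executable check passes and the paired critic signs off.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GateResult:
    """Outcome of one gate evaluation (hypothetical structure)."""
    accepted: bool
    reasons: List[str]


def accept_output(
    output: str,
    executable_checks: List[Callable[[str], bool]],
    critic_signs_off: Callable[[str], bool],
) -> GateResult:
    """Accept only if ALL executable checks pass AND the critic signs off.

    The producing role never appears here: it cannot certify its own output.
    """
    reasons: List[str] = []
    for i, check in enumerate(executable_checks):
        if not check(output):
            reasons.append(f"executable check {i} failed")
    if not critic_signs_off(output):
        reasons.append("critic withheld sign-off")
    return GateResult(accepted=not reasons, reasons=reasons)


# Toy stand-ins for real gates (assumptions for the demo only).
def not_empty(text: str) -> bool:
    return bool(text.strip())


def has_citation(text: str) -> bool:
    return "[ref]" in text


def strict_critic(text: str) -> bool:
    return "TODO" not in text


good = accept_output("Claim with support [ref]", [not_empty, has_citation], strict_critic)
bad = accept_output("TODO claim", [not_empty, has_citation], strict_critic)
print(good.accepted, bad.accepted, bad.reasons)
```

Collecting every failure reason, rather than stopping at the first, mirrors the study's interest in whether a rejection cites the correct reason; that design choice is an assumption of this sketch.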

The Two Empirical Records

Record 1 — the v0.1.0 run. The organization examined five candidate research directions and abandoned all five at the gates. Its flagship integrity exhibit was retracted after failing its own verifier. Zero theses survived. This is reported as a finding, not buried: a single run (n = 1) consistent with either working quality control or uniform rejection.

Record 2 — the calibration study. A pre-registered (git tag calibration-prereg-v1), blinded, seeded study distinguished the two hypotheses. Twenty theses — ten with planted defects across six flaw classes, ten known-sound reproductions — were evaluated under salted filenames with SHA-256 sealed labels, in two arms: M1-ON (round-1 auto-fail prior active) and M1-OFF (ablation).
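One way to picture the blinding step is a commit-and-reveal scheme: each ground-truth label is hashed together with a secret salt before evaluation, and filenames are derived from the salt so they carry no hint of the thesis identity. The sketch below is a minimal illustration under those assumptions; the function names, the payload layout, and the 16-byte salt are hypothetical and may differ from what the release actually does.

```python
import hashlib
import secrets


def seal_label(thesis_id: str, label: str, salt: bytes) -> str:
    """Commit to a ground-truth label without revealing it (hypothetical layout)."""
    payload = salt + b"|" + thesis_id.encode("utf-8") + b"|" + label.encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def salted_filename(thesis_id: str, salt: bytes) -> str:
    """Blind filename: not derivable from thesis_id without the salt."""
    digest = hashlib.sha256(salt + thesis_id.encode("utf-8")).hexdigest()
    return digest[:16] + ".md"


def verify_seal(thesis_id: str, label: str, salt: bytes, seal: str) -> bool:
    """At unblinding, recompute the hash and compare in constant time."""
    return secrets.compare_digest(seal_label(thesis_id, label, salt), seal)


salt = secrets.token_bytes(16)           # kept secret until unblinding
seal = seal_label("thesis-07", "flawed", salt)
blind_name = salted_filename("thesis-07", salt)

print(blind_name, verify_seal("thesis-07", "flawed", salt, seal))
```

Publishing the seals before evaluation and the salt afterwards lets anyone confirm that labels were fixed in advance and not adjusted after verdicts came in.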

Headline Results

Metric	M1-ON	M1-OFF (ablation)
Sensitivity (correct-reason)	13/15 = 0.87 [0.60, 0.98]	14/15 = 0.93 [0.68, 1.00]
Pooled validity-gate specificity	36/50 = 0.72 [0.58, 0.84]	48/50 = 0.96 [0.86, 1.00]
Known-sound theses killed	8 of 10	1 of 10
Known-flawed theses killed	10 of 10	10 of 10
Plants missed entirely	0 of 15	0 of 15

Intervals are 95% exact Clopper–Pearson. The round-1 auto-fail prior — "nothing passes review on the first try," a plausible-sounding skeptical default — was the dominant source of false positives: it cost 0.24 of specificity and killed 8 sound theses while adding nothing to sensitivity. The default was flipped to off in v0.2.0; the flip is itself flagged as unvalidated on fresh data (the calibration corpus is spent).
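For readers who want to check the bracketed intervals, the following is a minimal standard-library sketch of an exact Clopper–Pearson interval, found by bisection on the binomial tail probabilities. The helper names are illustrative and this is not the release's own harness; the counts are taken from the table above.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, target: float, increasing: bool) -> float:
    """Solve f(p) = target on [0, 1] for a monotone f."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha/2 (increasing in p).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, True)
    # Upper bound: the p at which P(X <= k) equals alpha/2 (decreasing in p).
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper


rows = [
    ("sensitivity, M1-ON", 13, 15),
    ("sensitivity, M1-OFF", 14, 15),
    ("specificity, M1-ON", 36, 50),
    ("specificity, M1-OFF", 48, 50),
]
for name, k, n in rows:
    lo, hi = clopper_pearson(k, n)
    print(f"{name}: {k}/{n} = {k / n:.2f} [{lo:.2f}, {hi:.2f}]")

# Specificity cost attributed to the round-1 prior.
specificity_drop = 48 / 50 - 36 / 50
print(f"specificity drop: {specificity_drop:.2f}")
```

The printed intervals should agree with the table up to rounding. The widths (roughly 0.3 to 0.4 on n = 15) are the reason the small-n caveat matters when reading the sensitivity figures.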

Citation Fidelity Protocol

Three of the five v0.1.0 abandonment citations were found misread on audit (scope overclaim, a smoothing bandwidth read as a block length, a risk metric read as a validation protocol). The v0.2.0 fix is structural: every citation carries a typed claim card — claimed object type, scope conditions, and a verbatim anchor pinned by SHA-256 to its source — verified mechanically and by a dedicated citation-object-adversary role.
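The mechanical half of that check might look like the sketch below. The field names follow the description above, but the `ClaimCard` class, the choice to hash the full source text, and the `verify_claim_card` function are assumptions made for illustration, not the protocol's actual schema.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClaimCard:
    """Typed claim card (hypothetical schema based on the description)."""
    claimed_object_type: str   # e.g. "smoothing bandwidth", "risk metric"
    scope_conditions: str      # where the cited result is claimed to hold
    verbatim_anchor: str       # exact quote supporting the claim
    source_sha256: str         # hash pinning the source text the quote came from


def verify_claim_card(card: ClaimCard, source_text: str) -> List[str]:
    """Return a list of mechanical problems; an empty list means the card passes."""
    problems: List[str] = []
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    if digest != card.source_sha256:
        problems.append("source hash mismatch")
    if card.verbatim_anchor not in source_text:
        problems.append("anchor not found verbatim in source")
    return problems


source = "We use a smoothing bandwidth of h = 0.2 throughout."
card = ClaimCard(
    claimed_object_type="smoothing bandwidth",
    scope_conditions="all experiments in the cited paper",
    verbatim_anchor="smoothing bandwidth of h = 0.2",
    source_sha256=hashlib.sha256(source.encode("utf-8")).hexdigest(),
)

print(verify_claim_card(card, source))
print(verify_claim_card(card, "We use a block length of 20 throughout."))
```

A hash and substring check can only confirm that the quote exists in the pinned source. Whether the quoted object really is of the claimed type, which is the kind of misreading found in the audit, is a judgment left to the citation-object-adversary role.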

Limitations

Seven limitations are stated in full in the release (L1–L7), including: a single run and small-n calibration (wide intervals reported in full), seeder and evaluator drawn from the same model family, two flaw classes detected at chance under M1-ON, single-model dependence, the spent corpus behind the M1 default flip, historical citation defects, and the retracted exhibit shipped provenance-only.

Reproducibility

A fresh clone plus make reproduce regenerates all figures and headline numbers (pinned seeds; verification against a frozen reference with Monte-Carlo tolerance). The calibration corpus, all 40 verdicts, sealed-label hashes, pre-registration, and full provenance trail for every abandonment ship in the release. Code is MIT; documentation is CC BY 4.0.
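Verification against a frozen reference with a tolerance can be sketched as follows. This is an illustration only: the metric keys, the `verify_against_reference` function, and the absolute tolerance of 0.01 are assumptions, and the release's make verify target may compare different quantities with a different tolerance. The reference values are the headline proportions from the table above.

```python
import math
from typing import Dict, List


def verify_against_reference(
    regenerated: Dict[str, float],
    reference: Dict[str, float],
    abs_tol: float = 0.01,   # assumed tolerance for Monte-Carlo noise
) -> List[str]:
    """Return a list of mismatches; an empty list means verification passed."""
    failures: List[str] = []
    for name, expected in reference.items():
        if name not in regenerated:
            failures.append(f"{name}: missing from regenerated output")
            continue
        actual = regenerated[name]
        if not math.isclose(actual, expected, rel_tol=0.0, abs_tol=abs_tol):
            failures.append(f"{name}: expected {expected:.4f}, got {actual:.4f}")
    return failures


# Frozen reference: headline proportions from the results table.
reference = {
    "sensitivity_m1_on": 13 / 15,
    "sensitivity_m1_off": 14 / 15,
    "specificity_m1_on": 36 / 50,
    "specificity_m1_off": 48 / 50,
}

# Hypothetical regenerated values: one run within tolerance, one that drifted.
close_run = {name: value + 0.002 for name, value in reference.items()}
drifted_run = dict(reference, specificity_m1_on=0.80)

print(verify_against_reference(close_run, reference))
print(verify_against_reference(drifted_run, reference))
```

Returning the full list of mismatches instead of a bare pass/fail makes a failed verification easier to diagnose from a fresh clone.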

What Came Next (June 2026)

This archived artifact (v0.2.0) is the foundation for an active research program built on top of it. Three developments followed the release:

The organization is now running real research (C1). Rather than studying its own gates, the org selected — through its own 12-candidate, four-gate selection process — a frontier project that turns the calibration idea outward: a pre-registered, cross-model benchmark of whether third-party LLM paper-review systems reject flawed research for the right reasons. C1 uses a matched-pair, single-flaw, externally certified corpus across ≥3 evaluator families (the author's own model family is barred from every confirmatory cell), with attribution scoring and a same-family seeding axis — the construction-validity controls that SoundnessBench (arXiv:2605.30329), the predecessor that established the bare deficit, structurally lacks. The cross-model pilot has since run across three reasoning families (GPT-5, Gemini 3.1 Pro, DeepSeek-R1) and surfaced a wide real attribution spread (correct-reason 0.93 vs 0.47) that killed the original primary hypothesis and re-selected the per-system calibration card (H1″); confirmatory n was ratified at ~300 per arm per system. The program is nonetheless still halted at its pre-registration checkpoint: a pre-lock eligibility gate stopped the freeze over a third-party licensing risk (DeepSeek → Mistral swap), and it now waits on a single API key — $0 of the confirmatory budget spent, nothing frozen.

A public erratum (v0.2.1). When the program established that the specificity numbers above are engine-specific, they were downgraded to exploratory on the public repository before any new freeze — so the public record matches what the program privately knows.

A second adversarial pre-review organization (System B). A separate 7-agent org now reviews each paper against the real, frozen rubric of its target venue before any human submits — deliberately inverting the round-1 prior toward high flaw-recall, with a non-removable human gate and an accept that is necessary but not sufficient (the external venue is terminal). Every verdict logs a prediction of the real outcome, calibrating the pipeline against actual peer review over time.

The system, the C1 program, and System B are tracked in the project writeup.

Cite this work

BibTeX

@software{saks_2026_20645678,
  author    = {Saks, Kacper},
  title     = {An Executable-Gate Multi-Agent Research Organization:
               Artifact, Case Study, and a Pre-Registered
               Gate-Calibration Study},
  month     = jun,
  year      = 2026,
  publisher = {Zenodo},
  version   = {v0.2.0},
  doi       = {10.5281/zenodo.20645678},
  url       = {https://doi.org/10.5281/zenodo.20645678}
}

multi-agent · research-automation · evaluation · pre-registration · calibration · reproducibility