Completed Jun 7, 2026 · updated Jul 22, 2026

WORLDCLASS RESEARCH ORG: AN EXECUTABLE-GATE MULTI-AGENT RESEARCH ORGANIZATION

A 39-role multi-agent research organization where every output must pass executable gates and adversarial review. After a pre-registered calibration study, it now runs its own frontier research program (C1) and a second adversarial pre-review org (System B).

modules 39

multi-agent · ai-agents · research-automation · python · claude · reproducibility · pre-registration


Overview

A domain-agnostic, multi-agent research automation system designed to conduct frontier research from ideation through submission — with one structural rule: nothing self-certifies. Every role output must pass an executable (machine-checkable) gate plus adversarial sign-off from a paired critic role.

The project began as a dual artifact — a reusable organizational structure (39 roles across 13 divisions, 9-phase lifecycle) and an empirical record of what happened when it ran. It has since grown into a three-part research pipeline: the writing org (this artifact), a real research program it is now executing (C1), and a separate adversarial pre-review org (System B) that hardens a paper before any human submits it. Released on GitHub and archived on Zenodo with DOI 10.5281/zenodo.20645678.

Architecture

┌─────────────────────────────────────────────────────────┐
│        13 DIVISIONS · 39 ROLES (Claude agents)          │
│                                                         │
│  Leadership · Frontier Research · Theory & Statistics   │
│  Interpretability · Research Engineering · Eval/Red Team│
│  Safety & Governance · Literature · Scientific Writing  │
│  Visualization · Program Mgmt · Publication · Self-Impr │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│         9-PHASE LIFECYCLE (P0–P8) + 4 CHECKPOINTS       │
│   Ideation → Thesis Lock ⛳ → Literature → Prereg ⛳     │
│   → Execution → Drafting → Red Team ⛳ → Production      │
│   → Submission ⛳   (Self-improvement runs in parallel)  │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│              THE GATE RULE: nothing self-certifies      │
│                                                         │
│   role output → executable gate (make verify, tests,    │
│   citation checks, grep gates) → adversarial sign-off   │
│   → paired critic with KILL AUTHORITY                   │
└─────────────────────────────────────────────────────────┘
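The gate rule in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the `review` function, the `GateResult` type, and the three example checks (`has_citation`, `no_todo`, `skeptical_critic`) are hypothetical names invented here. The only point it shows is the ordering: an output is accepted only when every executable gate passes and the paired critic then signs off.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reason: str


def review(output: str,
           executable_gates: list[Callable[[str], bool]],
           critic: Callable[[str], bool]) -> GateResult:
    """Accept an output only if all machine checks pass AND the critic signs off."""
    for gate in executable_gates:
        if not gate(output):
            return GateResult(False, f"failed executable gate: {gate.__name__}")
    # The critic runs only after the machine-checkable gates, and can still kill.
    if not critic(output):
        return GateResult(False, "killed by paired critic")
    return GateResult(True, "passed gates and adversarial sign-off")


# Hypothetical stand-ins for real gates (tests, citation checks, grep gates).
def has_citation(text: str) -> bool:
    return "[cite:" in text


def no_todo(text: str) -> bool:
    return "TODO" not in text


def skeptical_critic(text: str) -> bool:
    return len(text.split()) >= 5


result = review("Claim holds under X [cite:smith2020] as shown",
                [has_citation, no_todo], skeptical_critic)
print(result)
```

Nothing in this sketch lets the producing role mark its own output as passed, which is the structural property the diagram describes.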

What Happened When It Ran

v0.1.0 (first end-to-end run): the organization examined 5 candidate research directions and abandoned all 5 at the gates. The flagship integrity exhibit was itself retracted after failing its own verifier. Zero surviving theses.

That result left a critical question open: were the gates correctly identifying flaws — or was the system just a uniform rejector that kills everything it touches?

v0.2.0 (the calibration study): a pre-registered, blinded, seeded study answered it. A corpus of 20 theses — 10 known-flawed (with planted defects across 6 flaw classes) and 10 known-sound reproductions — was evaluated under salted filenames with SHA-256 sealed labels.
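One way to picture "salted filenames with SHA-256 sealed labels" is a commit-then-reveal scheme. The sketch below is an assumption about how such blinding could work, not the study's actual code: the function names, the `thesis_<digest>.md` filename pattern, and the `id:label` encoding are all hypothetical. Evaluators see only the blinded filename; the sealed digest is published up front and checked against the revealed label and salt after evaluation.

```python
import hashlib
import hmac


def blind_filename(thesis_id: str, salt: bytes) -> str:
    """Derive a filename that does not reveal the thesis ID or its label."""
    digest = hashlib.sha256(salt + thesis_id.encode("utf-8")).hexdigest()
    return f"thesis_{digest[:12]}.md"  # hypothetical naming pattern


def seal_label(thesis_id: str, label: str, salt: bytes) -> str:
    """Commit to a ground-truth label without disclosing it."""
    payload = salt + f"{thesis_id}:{label}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify_label(thesis_id: str, label: str, salt: bytes, sealed: str) -> bool:
    """After unblinding, confirm the revealed label matches the sealed digest."""
    return hmac.compare_digest(seal_label(thesis_id, label, salt), sealed)


salt = b"example-salt-kept-secret-until-unblinding"  # illustrative value only
sealed = seal_label("T01", "known-flawed", salt)
print(blind_filename("T01", salt), verify_label("T01", "known-flawed", salt, sealed))
```

Because the digest is fixed before evaluation, a label cannot be quietly changed afterwards without the verification failing.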

Calibration Results

Metric	M1-ON (round-1 auto-fail)	M1-OFF (ablation)
Sensitivity (correct-reason)	0.87 [0.60, 0.98]	0.93 [0.68, 1.00]
Validity-gate specificity	0.72 [0.58, 0.84]	0.96 [0.86, 1.00]
Known-sound theses killed	8 of 10	1 of 10
Known-flawed theses killed	10 of 10	10 of 10
Plants missed entirely	0 of 15	0 of 15
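The bracketed ranges in the table are consistent with Clopper–Pearson exact intervals, which the tech stack lists. As a rough check, the standard-library sketch below computes a 95% interval by bisection on the binomial tail probabilities. The count of 13 correct out of 15 plants is an assumption inferred from the reported 0.87 and the 15 plants; the release itself is the authority on the underlying counts.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(func, target: float, increasing: bool) -> float:
    """Bisection on [0, 1] for a function that is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for a binomial proportion k/n."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    # Lower bound: p at which P(X >= k) equals alpha/2 (increasing in p).
    lower = 0.0 if k == 0 else _solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, True)
    # Upper bound: p at which P(X <= k) equals alpha/2 (decreasing in p).
    upper = 1.0 if k == n else _solve(
        lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper


low, high = clopper_pearson(13, 15)  # assumed counts, see note above
print(f"{13 / 15:.2f} [{low:.2f}, {high:.2f}]")  # prints 0.87 [0.60, 0.98]
```

With so few items the interval is wide, which is why the small-n limitation matters when reading the sensitivity row.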

The headline finding: the round-1 auto-fail prior ("nothing passes review on the first try") tanked specificity from 0.96 to 0.72 and killed 8 of 10 known-sound theses — with no sensitivity benefit. Institutionalized skepticism, hard-coded as a prior, destroyed the system's ability to recognize good work. The default was flipped, and the limitation that the new default remains unvalidated on fresh data is stated in the release rather than hidden.

From Artifact to Operator: A Real Research Program (C1)

After the calibration study the organization stopped being a demo and started doing the job it was built for. In research-program-2026 it ran its own selection process — generating a 12-candidate slate, freezing it before any judgment (commit 64763e7), then killing 10 candidates at four non-compensatory gates (asset, falsifier, feasibility, novelty). One survived: C1 — "Evaluator Calibration at Scale."

C1 turns the v0.2.0 idea outward. Instead of measuring the org's own gates, it measures the field's: a pre-registered, cross-model benchmark of whether third-party LLM paper-review systems reject flawed research for the right reasons.

Cross-model, cross-family. ≥3 third-party evaluator families (OpenAI, Google, Meta-Llama). The author's own model family is barred from every confirmatory cell — it may seed flaws, never judge them.
Matched-pair, single-flaw corpus. Each flawed document is twinned with a sound version differing only in one planted, externally certified defect, across six flaw classes (mechanical vs epistemic).
Attribution, not just rejection. It separates "rejected" from "rejected for the correct reason," measures false-kills on the sound twins, and tests a same-family seeding effect — the construction-validity axes that SoundnessBench (arXiv:2605.30329, the predecessor that established the bare deficit) structurally lacks.
External human ground truth. Every plant is certified by ≥2 independent non-author raters under a frozen rubric (Cohen's κ ≥ 0.70), recruited with an unconditional co-authorship offer made before any rating begins — so the certification cannot be biased by the result.
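The κ ≥ 0.70 certification threshold refers to Cohen's kappa, which corrects raw agreement between two raters for the agreement expected by chance. The sketch below is a plain implementation of the standard formula; the rating vectors are made-up illustration data, not ratings from the program.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum(counts_a[c] * counts_b[c]
                   for c in counts_a.keys() | counts_b.keys()) / n**2
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative data only: 10 items, the raters disagree on one.
a = ["flaw"] * 5 + ["sound"] * 5
b = ["flaw"] * 4 + ["sound"] * 6
kappa = cohens_kappa(a, b)
print(round(kappa, 2), kappa >= 0.70)
```

In this toy case raw agreement is 0.9 and chance agreement is 0.5, so kappa works out to 0.8 and the pair would clear a 0.70 bar.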

The governance is the point. An independent circularity ("F6") review imposed 8 binding conditions; a deviations log, a budget ledger ($0 of a $7,000 cap), and a standing prior-art "collision watch" run continuously. When four independent reviewers flagged "no independent baseline detector" as the predecessor's #1 weakness, a single-judge baseline arm was added before any confirmatory run — and an earlier "multi-gate wall" description was retracted as inaccurate (the only real structural multiplicity is cross-model majority voting).

The Cross-Model Pilot — and What It Found

The API blocker cleared. Real third-party keys were provisioned and a directional power-input pilot ran across three reasoning families — OpenAI GPT-5, Google Gemini 3.1 Pro, and DeepSeek-R1 (all cleared the ≥80% structured-output feasibility filter; OpenReviewer, Qwen, and Kimi-K2 were tested and rejected as infeasible or degenerate). The pilot is explicitly a power input, never a finding — but the spread it surfaced is exactly what the benchmark exists to measure:

Evaluator (pilot, n=15/arm)	Correct-reason	Wrong-reason	False-kill (sound)
Gemini 3.1 Pro	0.93	0.07	0.60
GPT-5	0.47	0.20	0.40
DeepSeek-R1	0.47	0.13	0.13

That 2× attribution spread did real work. It killed the original primary hypothesis — H1 ("systems reject flawed work for the wrong reason") required ≥2 of 3 evaluators above a 0.15 wrong-reason gap, and only one cleared it. The primary was re-selected to H1″: the per-system, per-flaw-class calibration card itself — which is precisely what a spread this wide makes worth publishing. Confirmatory sample size was ratified at ~300 per arm per system (±0.057). The author's own timed adjudication packet clocked ~12 min/item at a perfect 32/32 accuracy (an upper bound), selecting the robust Variant-B design.
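The two numeric steps in this paragraph can be reproduced with simple arithmetic. The sketch below makes two assumptions that the text does not state outright: that the "wrong-reason gap" is read directly from the wrong-reason column of the pilot table, and that ±0.057 is a worst-case (p = 0.5) normal-approximation half-width at 95% confidence. The function names are hypothetical.

```python
from math import sqrt

# Wrong-reason rates from the pilot table (n=15 per arm).
PILOT_WRONG_REASON = {"Gemini 3.1 Pro": 0.07, "GPT-5": 0.20, "DeepSeek-R1": 0.13}


def h1_supported(wrong_reason: dict[str, float],
                 threshold: float = 0.15,
                 min_systems: int = 2) -> tuple[bool, list[str]]:
    """H1 needs at least `min_systems` evaluators above the threshold."""
    clearing = [name for name, rate in wrong_reason.items() if rate > threshold]
    return len(clearing) >= min_systems, clearing


def worst_case_half_width(n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval at p = 0.5."""
    return z * sqrt(0.25 / n)


supported, clearing = h1_supported(PILOT_WRONG_REASON)
print(supported, clearing)                   # False ['GPT-5']
print(round(worst_case_half_width(300), 3))  # 0.057
```

Under these assumptions only one evaluator clears the 0.15 bar, matching the text, and 300 items per arm gives the stated ±0.057 margin.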

The Governance Halting the Author — Again

The pre-registration lock was attempted on 2026-07-01 — and halted itself at the roster-eligibility gate. DeepSeek-R1's license and 2026 operational context (EU data-protection probes, an Australian government restriction, a pending US Entity-List review) meant a confirmatory run built on it could become unexecutable mid-study. Rather than freeze anyway, the third evaluator slot was swapped to Mistral (Apache 2.0, open weights, EU, self-hostable — "execution-proof"), logged as a pre-factual deviation, and driven back through the repair cycles.

As of the latest state the program is still halted at its pre-registration checkpoint — now on a single, precise blocker: one API key the author must provision. $0 of the confirmatory budget has been spent ($5.81 of a $7,000 cap, all on pilots); nothing is frozen, no confirmatory tag applied. That is the whole thesis restated in miniature: report the halt in full rather than ship a study you couldn't stand behind.

System B: Adversarial Pre-Review

A second, separate organization now sits between writing and submission. Where the research org produces work, System B tries to destroy it — a simulated hostile reviewer (7 agents: a rubric-internalizer, five specialist reviewers, and an area chair) that reviews a paper against the real, frozen rubric of its target venue before any human submits.

It deliberately inverts the M1 lesson: not "reject everything," but high recall of flaws with a comfortable accept bar, because the cost asymmetry is reversed (a false accept can cost months when it ends in a desk-reject; a false reject costs a day of revision). Two rules are non-negotiable. First, System B can never emit a "submit": the human gate is non-removable. Second, because it shares the author's model family, its accept is necessary but not sufficient, and the external venue is the only terminal authority. Every verdict logs a prediction of the real outcome, so the pipeline calibrates itself against actual peer review over time. Self-test: 13/13 injected flaws caught, 3/3 sound control papers accepted.
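The two non-negotiable rules can be expressed as a type-level constraint. The following is a hypothetical sketch, not System B's code: the `Verdict` enum, the `PreReview` record, and `may_submit` are names invented for illustration. The verdict type simply has no "submit" value, and submission additionally requires an explicit human decision.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    # Deliberately no SUBMIT member: the pre-review org cannot emit one.
    ACCEPT = "accept"
    REVISE = "revise"


@dataclass(frozen=True)
class PreReview:
    verdict: Verdict
    predicted_venue_outcome: str  # logged so predictions can be scored later


def may_submit(pre_review: PreReview, human_approved: bool) -> bool:
    """An accept is necessary but not sufficient; a human must also approve."""
    return pre_review.verdict is Verdict.ACCEPT and human_approved


review_result = PreReview(Verdict.ACCEPT, predicted_venue_outcome="weak accept")
print(may_submit(review_result, human_approved=False))  # False
print(may_submit(review_result, human_approved=True))   # True
```

Even when `may_submit` returns True, the outcome that counts is the venue's decision, which is what the logged prediction is later compared against.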

Engineering Highlights

Citation Fidelity Protocol (CFP) — every citation carries a typed claim card with scope conditions and a SHA-256-pinned verbatim anchor; a dedicated citation-object-adversary role verifies each card. Built after 3 of 5 v0.1.0 abandonment citations were found misread.
Public erratum (v0.2.1) — when later work showed the specificity numbers were engine-specific, they were downgraded to exploratory on the public repo, so the public record matches what the program privately knows before any new freeze.
Reproducibility harness — make reproduce regenerates all figures and headline numbers from a fresh clone; make verify checks against a frozen reference with Monte-Carlo tolerance; pinned seeds throughout.
Pre-registration discipline — predictions frozen at git tags before any evaluation ran; deviations logged pre-factually, never retrofitted.
Limitations-forward reporting — named limitations stated in full, including small-n, same-model-family evaluation, and the spent corpus.
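As an illustration of the Citation Fidelity Protocol item above, a typed claim card with a SHA-256-pinned verbatim anchor might look like the sketch below. The field names, the whitespace normalization, and the `verify_card` check are assumptions made for this example; the project's actual card schema may differ.

```python
import hashlib
from dataclasses import dataclass


def _normalize(text: str) -> str:
    """Collapse whitespace so line wrapping does not change the hash."""
    return " ".join(text.split())


def pin_anchor(verbatim: str) -> str:
    """SHA-256 of the normalized verbatim quote."""
    return hashlib.sha256(_normalize(verbatim).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ClaimCard:  # hypothetical schema
    citation_key: str
    claim: str
    scope_conditions: tuple[str, ...]
    anchor_text: str
    anchor_sha256: str


def verify_card(card: ClaimCard, source_text: str) -> bool:
    """The quote must match its pinned hash AND appear verbatim in the source."""
    hash_ok = pin_anchor(card.anchor_text) == card.anchor_sha256
    present = _normalize(card.anchor_text) in _normalize(source_text)
    return hash_ok and present


source = "We find the effect holds only\nfor models under 1B parameters."
quote = "the effect holds only for models under 1B parameters"
card = ClaimCard("example2024", "Effect is size-limited",
                 ("models under 1B parameters",), quote, pin_anchor(quote))
print(verify_card(card, source))
```

A check of this kind catches a quote that was edited after pinning or that never appeared in the cited source; whether the claim is a fair reading of the quote is still left to the adversary role.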

Tech Stack

Python, Claude (38+ agent role definitions, ROS v3 operating standard), cross-model evaluation (OpenAI · Google · Meta-Llama), Makefile reproducibility gates, SHA-256 sealed-label blinding, external human certification (Cohen's κ), Clopper–Pearson exact intervals, git-tagged pre-registration, MIT (code) + CC BY 4.0 (docs)