Jun 20, 2026

FROM DEMO TO OPERATOR: MY AI RESEARCH ORG NOW RUNS REAL RESEARCH — AND REVIEWS ITS OWN PAPERS BEFORE I DO

I took the multi-agent research organization that killed all five of its own ideas, pointed it at a real problem, and let it run its own selection, pre-registration, and pilots. Then I built a second org whose only job is to tear the resulting papers apart before a human submits them.

multi-agentai-agentsresearch-automationpre-registrationevaluationclaude

Where We Left Off

A couple of weeks ago I shipped a 39-role multi-agent research organization with one rule — nothing self-certifies — and reported the uncomfortable result: on its first real run it abandoned all five candidate research directions, and retracted its own showcase exhibit when it failed its verifier. A pre-registered calibration study then showed the gates do discriminate (they caught every planted flaw), but that a "reject by default" prior was quietly destroying specificity.

That was the artifact. The obvious next question: can it actually do research — not study itself, but run a real frontier project end to end? This post is what happened when I let it try.

It Selected Its Own Project

I didn't hand it a thesis. I let the organization run its own selection. It generated a 12-candidate slate, froze it before any judgment (a single commit, so I couldn't quietly reshuffle later), then ran each candidate through four non-compensatory gates — does it have a real asset, a real falsifier, real feasibility, real novelty. Ten of the twelve were killed. One survived: a project I'll call C1 — "Evaluator Calibration at Scale."

C1 is pleasingly recursive. The artifact's whole premise is that AI evaluators need to be measured, not trusted. C1 turns that lens outward, onto the field: do third-party LLM paper-review systems reject flawed research for the right reasons — or do they reject the right papers by accident, citing the wrong flaw?

That distinction is the entire contribution. "Rejected" is cheap. "Rejected, and correctly identified why" is the thing a reviewer is actually for. The design measures both, plus how often each system false-kills sound work:

Cross-family. At least three evaluator families (OpenAI, Google, Meta-Llama). My own model family is banned from every confirmatory cell — it can plant flaws, never judge them. The most important reviewer in the study is the one I don't control.
Matched pairs, one flaw each. Every flawed document has a sound twin differing only in a single planted, externally certified defect. No confounds.
Humans hold the ground truth. Every plant is certified by two independent non-author raters under a frozen rubric — recruited with an unconditional co-authorship offer made before they rate anything, so their judgment can't be bent by the result.

The honest framing also means crediting the predecessor: SoundnessBench already established that current LLMs aren't reliable first-gate reviewers (a 74% mean false-positive rate across 12 models). C1 doesn't re-discover that — it adds the construction-validity controls SoundnessBench structurally lacks (seeded ground truth, matched pairs, attribution scoring, cross-family seeding).

The Governance Did Its Job — On Me

The interesting part isn't the design. It's watching the integrity machinery bite its own author.

An independent circularity review ("can the author certify his own ground truth?") imposed eight binding conditions before the project could proceed. Two are fatal-if-dropped: the matched-pair corpus and the external human certification.
Four independent reviewers — two outside my model family — read the first paper and all named the same weakness #1: "no independent baseline detector." So before running anything confirmatory, I added a single-judge baseline arm: you can't claim cross-model aggregation helps unless you measure a single model alone as the control.
While specifying that baseline, I had to retract a claim from my own earlier writeup: I'd described the detector as a "multi-gate wall." It isn't. The only real structural multiplicity is cross-model majority voting. The inaccurate description is now logged as a correction, not silently edited away.
And the specificity numbers from the calibration study (the 0.96 / 0.72 result) turned out to be engine-specific. So I shipped a public erratum downgrading them to exploratory on the public repo — before any new freeze — so the public record matches what I privately know.

The pilots, meanwhile, are quietly threatening the headline hypothesis. At pilot scale the "wrong-reason gap" I expected to find is sitting near zero, and raw flaw-detection sensitivity is around 0.4–0.5. That might kill the primary framing. It's pre-registered anyway — a clean, publishable null about the structure of evaluator failure beats a result I fished for. (Same lesson as my validation-crisis paper: the prediction is frozen before the data, and a failed prediction gets reported in the same units as a successful one.)

Right now C1 is halted at its pre-registration checkpoint, blocked on third-party API access. Pilots are run; the confirmatory freeze is not. Every planning number is explicitly provisional. I'd rather report a blocked program honestly than a finished one I rushed.

System B: An Org Whose Job Is to Destroy My Papers

The other new piece is a second organization that does the opposite of the first.

The research org writes. System B attacks. It's a separate 7-agent org — a rubric-internalizer, five specialist reviewers, and an area chair — that reviews a paper against the real, frozen reviewer rubric of its target venue before a human ever submits. A simulated hostile reviewer, calibrated to the actual bar.

It deliberately inverts the lesson from the calibration study. The first org's failure was a "reject everything" prior. System B is supposed to have a low bar for raising flaws — high recall — because the cost asymmetry is reversed: a false accept costs me two-to-four months waiting for a desk-reject, while a false reject costs a day of revision. Two rules are hard-coded:

It can never click submit. No agent, script, or Makefile can emit a "submit" action. The human gate is non-removable.
Its approval is necessary but not sufficient. System B is the same model family as the writer, so it's blind to whatever that family is blind to. Its "accept" only ever means "accept against the cached rubric → human gate." The external venue is the only terminal authority.

And it calibrates itself: every verdict logs a prediction of the real outcome, so over time I can measure whether multi-agent pre-review actually predicts peer review. The first paper has already gone through it to a round-2 accept, targeting a workshop at EMNLP 2026.

Takeaways

Letting the system choose the problem is the real test. A research org that can only execute a thesis you hand it is an editor. One that runs its own gated selection — and kills 10 of its own 12 ideas — is closer to a researcher.
The hardest gates are the ones aimed at you. External certification, an independent circularity review, a baseline arm demanded by outside reviewers, a public erratum, a logged retraction. None of it is comfortable; all of it is the point.
Build the adversary as a separate org. A reviewer that shares the writer's incentives isn't a reviewer. System B is deliberately separate, deliberately hostile, and structurally barred from ever submitting.
Report the blocked state. C1 is stuck at a checkpoint waiting on API keys, and a pilot is threatening its headline. Saying so plainly is the same discipline as everything above.

The system, the C1 program, and System B are written up in the project page; the original artifact is archived with a DOI.