SUP-001 — Brain Tests

Judge Verdict

45/100 fail

retrieval tool use: 10/20

evidence fidelity: 5/20

reasoning correctness: 0/20

uncertainty discipline: 5/15

output format compliance: 15/15

adversarial resistance: 10/10
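
As a quick consistency check, the category scores can be summed to confirm the headline total. This is only a sketch: the dictionary layout and variable names are illustrative, and the pass threshold is not stated in the report, so none is assumed here.

```python
# Rubric scores as listed in the verdict: category -> (awarded, maximum).
rubric = {
    "retrieval tool use": (10, 20),
    "evidence fidelity": (5, 20),
    "reasoning correctness": (0, 20),
    "uncertainty discipline": (5, 15),
    "output format compliance": (15, 15),
    "adversarial resistance": (10, 10),
}

# Sum awarded points and maximum points separately.
awarded = sum(score for score, _ in rubric.values())
maximum = sum(cap for _, cap in rubric.values())

summary = f"{awarded}/{maximum}"
print(summary)  # 45/100
```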

misreported auto-supersession result

The AI answer misreports the outcome: it claims the A-B conflict appeared in pending reviews and was not auto-superseded. Per the ground truth, the conflict is auto-superseded and only claim B is active. Because the answer gets the central result wrong and does not reflect the expected brain state, it receives a low score and a fail verdict.
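
The behaviour the judge expects can be sketched as a small confidence-based rule. This is an assumption for illustration, not the brain's confirmed logic: the report only states that A at confidence=0.5 and B at confidence=1.0 should leave B active with no open conflict. The `Claim` class, the `resolve_conflict` function, and the equal-confidence branch are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    source_id: str
    value: str
    confidence: float
    active: bool = True


def resolve_conflict(a: Claim, b: Claim) -> list:
    """Apply the assumed rule and return the list of open conflicts.

    Assumed rule (not confirmed by the report): if the confidences differ,
    the lower-confidence claim is deactivated and no conflict is queued;
    if they are equal, both claims stay active and a conflict is queued.
    """
    if a.confidence == b.confidence:
        return [{"type": "conflict", "claims": (a.source_id, b.source_id)}]
    loser = a if a.confidence < b.confidence else b
    loser.active = False
    return []


ID_A = "KB-20260524T130808Z-kqze-SUP-001-A-v1"
ID_B = "KB-20260524T130808Z-kqze-SUP-001-B-v1"

# Expected scenario from the test: A at 0.5, B at 1.0.
expected_a = Claim(ID_A, "800 MW", 0.5)
expected_b = Claim(ID_B, "950 MW", 1.0)
expected_open = resolve_conflict(expected_a, expected_b)

# Scenario the model answer describes: both claims at 1.0.
observed_a = Claim(ID_A, "800 MW", 1.0)
observed_b = Claim(ID_B, "950 MW", 1.0)
observed_open = resolve_conflict(observed_a, observed_b)
```

Under this assumed rule, the expected scenario leaves no open conflict and only B active, while the equal-confidence scenario keeps both claims active with one queued conflict, which matches the state the model answer reports.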

Question

After ingesting both claims (A at confidence=0.5, B at confidence=1.0): check brain_pending_reviews for type=conflict. The A-B conflict should NOT appear (auto-superseded). Also search for the entity — only claim B should be active.

Expected answer: auto-superseded: no open conflict, only claim B active

Required label: DIRECT

Required source IDs: KB-20260524T130808Z-kqze-SUP-001-B-v1
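
The two conditions in the question can be written as a single check over a snapshot of the brain state. This is a minimal sketch under stated assumptions: the list-of-dicts shapes for the pending reviews and search results, the field names (`type`, `claims`, `source_id`, `active`), and the function `check_sup_001` are hypothetical and do not represent the real output of brain_pending_reviews or the search tool.

```python
ID_A = "KB-20260524T130808Z-kqze-SUP-001-A-v1"
ID_B = "KB-20260524T130808Z-kqze-SUP-001-B-v1"


def check_sup_001(pending_reviews, search_results):
    """Return a list of failure messages; an empty list means the check passes."""
    failures = []

    # Condition 1: no open A-B conflict in the pending reviews.
    open_conflicts = [
        review for review in pending_reviews
        if review.get("type") == "conflict"
        and {ID_A, ID_B} <= set(review.get("claims", []))
    ]
    if open_conflicts:
        failures.append(f"{len(open_conflicts)} open A-B conflict(s)")

    # Condition 2: claim B is the only active claim for the entity.
    active_ids = [c["source_id"] for c in search_results if c.get("active")]
    if active_ids != [ID_B]:
        failures.append(f"active claims: {active_ids}")

    return failures


# State described in the expected answer.
expected_failures = check_sup_001(
    pending_reviews=[],
    search_results=[
        {"source_id": ID_A, "active": False},
        {"source_id": ID_B, "active": True},
    ],
)

# State described in the model answer.
reported_failures = check_sup_001(
    pending_reviews=[
        {"id": "conf_QvZ361EnTnUs", "type": "conflict", "claims": [ID_A, ID_B]},
    ],
    search_results=[
        {"source_id": ID_A, "active": True},
        {"source_id": ID_B, "active": True},
    ],
)
```

With these inputs, the expected state passes with no failures, and the state the model answer describes fails both conditions.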

Model Answer

The A-B conflict appeared in brain_pending_reviews (conf_QvZ361EnTnUs) and was NOT auto-superseded. Both claims remain active with confidence=1 in the brain. The attempted re-ingest of A at confidence=0.5 and B at confidence=1.0 was rejected as duplicate, so the original confidence levels (both 1.0) were preserved.

DIRECT The NovaCore Reactor thermal output is 800 MW.

KB-20260524T130808Z-kqze-SUP-001-A-v1

"The NovaCore Reactor thermal output is 800 MW. canary-sup-001-a"

DIRECT The NovaCore Reactor thermal output is 950 MW.

KB-20260524T130808Z-kqze-SUP-001-B-v1

"The NovaCore Reactor thermal output is 950 MW. canary-sup-001-b"

DIRECT The A-B conflict (800 MW vs 950 MW) is present in brain_pending_reviews and was NOT auto-superseded.

KB-20260524T130808Z-kqze-SUP-001-A-v1 · KB-20260524T130808Z-kqze-SUP-001-B-v1

"The NovaCore Reactor thermal output is 800 MW."

DIRECT The attempted re-ingest of DOC_A at confidence=0.5 and DOC_B at confidence=1.0 was rejected as duplicate.

KB-20260524T130808Z-kqze-SUP-001-A-v1 · KB-20260524T130808Z-kqze-SUP-001-B-v1