SUP-001 — Brain Tests

Judge Verdict

45/100 (fail)

retrieval tool use: 10/20

evidence fidelity: 5/20

reasoning correctness: 0/20

uncertainty discipline: 5/15

output format compliance: 15/15

adversarial resistance: 10/10

misreported auto-supersession result

The model answer reverses the expected outcome: it reports an open conflict with both claims active, whereas the expected result is auto-supersession, with no pending conflict and only the high-confidence claim B active. Misreporting the brain state is a critical failure, and the reasoning behind it is fundamentally flawed. The output does use the correct source IDs and proper JSON formatting, but the core answer is wrong.

Question

After ingesting both claims (A at confidence=0.5, B at confidence=1.0), check brain_pending_reviews for type=conflict. The A-B conflict should NOT appear, because it should have been auto-superseded. Also search for the entity; only claim B should be active.

Expected answer: auto-superseded: no open conflict, only claim B active

Required label: DIRECT

Required source IDs: KB-20260524T113756Z-kduj-SUP-001-B-v1
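
The expected outcome in the question can be illustrated with a minimal sketch. It assumes a simple rule in which the strictly higher-confidence claim supersedes the other and a tie leaves a conflict open; the report does not state the brain's actual rule or threshold. The `Claim` class, `resolve_pair`, `expected_state_holds`, and the short IDs "A" and "B" are hypothetical names used only for illustration, not part of the brain API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    claim_id: str        # hypothetical short ID, not a real brain claim ID
    value: str
    confidence: float
    active: bool = True


def resolve_pair(a: Claim, b: Claim) -> Optional[dict]:
    """Assumed rule: the strictly higher-confidence claim supersedes the other.

    Returns None when the pair was auto-superseded, or a pending-review
    record of type "conflict" when the confidences tie.
    """
    if a.confidence == b.confidence:
        return {"type": "conflict", "claim_ids": [a.claim_id, b.claim_id]}
    loser = a if a.confidence < b.confidence else b
    loser.active = False
    return None


def expected_state_holds(review: Optional[dict], claims: List[Claim],
                         winner_id: str) -> bool:
    """True when no conflict is pending and only the winner is active."""
    active_ids = [c.claim_id for c in claims if c.active]
    return review is None and active_ids == [winner_id]


# The scenario from the question: A at 0.5, B at 1.0.
claim_a = Claim("A", "7.2 GHz", 0.5)
claim_b = Claim("B", "9.8 GHz", 1.0)
review = resolve_pair(claim_a, claim_b)
print(expected_state_holds(review, [claim_a, claim_b], "B"))  # True
```

Under the same assumed rule, two claims that are both stored at confidence 1 would tie, so a conflict record would stay open and both claims would remain active.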

Model Answer

The A-B conflict DOES appear in brain_pending_reviews (conf_WvboajGy0pt6) and was NOT auto-superseded. Both claims remain active: claim_36PwjztWrUTZ (7.2 GHz, conf=1) and claim_Xc2x8_rh5a_t (9.8 GHz, conf=1). The re-ingestion attempt was detected as a duplicate for both sources (committed_claim_ids: []), so the existing claims with their original confidence values persisted — the intended confidence differential (A=0.5, B=1.0) was never applied to the stored claims.
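
The duplicate-detection behaviour described in this answer can be sketched as follows. This is a toy model, assuming ingests are deduplicated by a hash of the source text; the report does not say how the brain actually detects duplicates. `ToyBrain`, its methods, and the `duplicate` field are hypothetical; only the `committed_claim_ids` field name and the quoted text come from the report.

```python
import hashlib


class ToyBrain:
    """Hypothetical store that deduplicates ingests by a hash of the text."""

    def __init__(self) -> None:
        # text hash -> {"claim_id": str, "confidence": float}
        self.claims = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def ingest(self, text: str, confidence: float) -> dict:
        key = self._key(text)
        if key in self.claims:
            # Duplicate: nothing is committed, and the stored confidence
            # is left untouched even if the new ingest asked for another.
            return {"duplicate": True, "committed_claim_ids": []}
        claim_id = f"claim_{len(self.claims) + 1}"
        self.claims[key] = {"claim_id": claim_id, "confidence": confidence}
        return {"duplicate": False, "committed_claim_ids": [claim_id]}

    def confidence_of(self, text: str) -> float:
        return self.claims[self._key(text)]["confidence"]


brain = ToyBrain()
doc_a = "The core resonance frequency of Project Chimera is 7.2 GHz. canary1"

first = brain.ingest(doc_a, confidence=1.0)   # earlier ingest
second = brain.ingest(doc_a, confidence=0.5)  # re-ingest with intended value

print(second["committed_claim_ids"])  # []
print(brain.confidence_of(doc_a))     # 1.0
```

In this toy model the second ingest is silently dropped, so a test that relies on a confidence differential would need to confirm that each ingest actually committed a claim before checking for supersession.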

DIRECT The core resonance frequency of Project Chimera is 7.2 GHz (from DOC_A).

KB-20260524T113756Z-kduj-SUP-001-A-v1

"The core resonance frequency of Project Chimera is 7.2 GHz. canary1"

DIRECT The core resonance frequency of Project Chimera is 9.8 GHz (from DOC_B).

KB-20260524T113756Z-kduj-SUP-001-B-v1

"The core resonance frequency of Project Chimera is 9.8 GHz. canary2"

DIRECT The A-B conflict (conf_WvboajGy0pt6) appears in brain_pending_reviews and was NOT auto-superseded.

KB-20260524T113756Z-kduj-SUP-001-A-v1 · KB-20260524T113756Z-kduj-SUP-001-B-v1

"The core resonance frequency of Project Chimera is 7.2 GHz."

DIRECT Both claims remain active with confidence=1, contradicting the expectation that only claim B should be active.

KB-20260524T113756Z-kduj-SUP-001-A-v1 · KB-20260524T113756Z-kduj-SUP-001-B-v1

"The core resonance frequency of Project Chimera is 7.2 GHz."