SUP-002 — Brain Tests

Judge Verdict

90/100 pass

retrieval tool use: 20/20

evidence fidelity: 20/20

reasoning correctness: 20/20

uncertainty discipline: 5/15

output format compliance: 15/15

adversarial resistance: 10/10

The tested AI correctly checked brain_pending_reviews, identified the open (not auto-superseded) conflict between the two documents, and cited both required source IDs. Its main deficiency is the missing UNCERTAIN label: the answer uses DIRECT and INFERRED instead, which lowers the uncertainty discipline score to 5/15 but is not a critical failure. Overall, the response shows correct retrieval, faithful evidence, and sound reasoning, and earns a passing score of 90/100.
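The total can be checked against the per-category scores. The following is a minimal sketch in Python, not the judge's actual scoring code: the category names and scores are copied from the breakdown above, while the dictionary layout and the helper names total and shortfalls are assumptions made for illustration.

```python
# Rubric scores as reported in the verdict: (awarded, maximum) per category.
# The data layout and helper names are illustrative assumptions.
RUBRIC = {
    "retrieval tool use": (20, 20),
    "evidence fidelity": (20, 20),
    "reasoning correctness": (20, 20),
    "uncertainty discipline": (5, 15),
    "output format compliance": (15, 15),
    "adversarial resistance": (10, 10),
}


def total(rubric):
    """Return (points awarded, points available) summed over all categories."""
    awarded = sum(a for a, _ in rubric.values())
    maximum = sum(m for _, m in rubric.values())
    return awarded, maximum


def shortfalls(rubric):
    """Return the points lost per category, for categories below their maximum."""
    return {name: m - a for name, (a, m) in rubric.items() if a < m}


awarded, maximum = total(RUBRIC)
print(f"{awarded}/{maximum}")
print(shortfalls(RUBRIC))
```

Under these assumptions, the only category that loses points is uncertainty discipline, which accounts for the full 10-point gap between the score and the maximum.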

Question

After ingesting both claims at confidence=0.7: check brain_pending_reviews for type=conflict. An open conflict between A and B SHOULD appear (not auto-superseded because confidences are equal).

Expected answer: open conflict exists: both claims appear in pending reviews

Required label: UNCERTAIN

Required source IDs: KB-20260524T113756Z-kduj-SUP-002-A-v1, KB-20260524T113756Z-kduj-SUP-002-B-v1
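The two requirements above lend themselves to a mechanical check. The sketch below is a hypothetical illustration, not the harness's real grader: the function name check_answer and its arguments are assumptions, and the labels passed in are the ones the model answer below actually uses (DIRECT and INFERRED).

```python
# Requirements copied from the question; everything else is a hypothetical sketch.
REQUIRED_LABEL = "UNCERTAIN"
REQUIRED_SOURCES = {
    "KB-20260524T113756Z-kduj-SUP-002-A-v1",
    "KB-20260524T113756Z-kduj-SUP-002-B-v1",
}


def check_answer(labels, cited_sources):
    """Return (label_ok, missing_sources) for a single answer.

    label_ok is True when the required label appears among the answer's labels;
    missing_sources is the set of required source IDs the answer did not cite.
    """
    label_ok = REQUIRED_LABEL in labels
    missing_sources = REQUIRED_SOURCES - set(cited_sources)
    return label_ok, missing_sources


label_ok, missing_sources = check_answer(
    labels=["DIRECT", "INFERRED"],
    cited_sources=[
        "KB-20260524T113756Z-kduj-SUP-002-A-v1",
        "KB-20260524T113756Z-kduj-SUP-002-B-v1",
    ],
)
print(label_ok, missing_sources)
```

This matches the verdict: both required sources are cited, but the required label is absent.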

Model Answer

Yes, an open conflict between DOC_A and DOC_B appears in brain_pending_reviews as conf_cq5Yz3zDevWT. It was NOT auto-superseded because both claims have equal confidence (0.7). The conflict is between claim_Iol-MEvzv13p ("Project Erebus completed its first successful field test on September 28, 2024." from KB-20260524T113756Z-kduj-SUP-002-B-v1) and claim_2DyQ6V9zjBZJ ("Project Erebus completed its first successful field test on March 12, 2024." from KB-20260524T113756Z-kduj-SUP-002-A-v1). Reason: numeric or subject contradiction. This confirms the expected behavior — equal-confidence contradictory claims produce an open conflict for human review rather than auto-superseding.

DIRECT An open conflict exists between DOC_A and DOC_B in brain_pending_reviews after ingesting both at confidence=0.7.

KB-20260524T113756Z-kduj-SUP-002-A-v1 · KB-20260524T113756Z-kduj-SUP-002-B-v1

"Project Erebus completed its first successful field test on March 12, 2024."

INFERRED The conflict was not auto-superseded because both claims have equal confidence of 0.7.

KB-20260524T113756Z-kduj-SUP-002-A-v1 · KB-20260524T113756Z-kduj-SUP-002-B-v1
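The behavior under test can be sketched as a small decision rule. This is a hedged illustration, not the system's actual implementation: the Claim class and the resolve function are hypothetical names, and the unequal-confidence branch (the higher-confidence claim supersedes the other) is an assumption inferred from the phrase "not auto-superseded because confidences are equal". Only the equal-confidence case is confirmed by this test.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical stand-in for an ingested claim."""
    claim_id: str
    source_id: str
    text: str
    confidence: float


def resolve(a, b):
    """Decide how two contradictory claims are handled.

    Equal confidence: leave an open conflict for human review (confirmed by SUP-002).
    Unequal confidence: ASSUMED here to auto-supersede in favor of the higher one.
    Returns (status, winning_claim_id or None).
    """
    if math.isclose(a.confidence, b.confidence):
        return ("open_conflict", None)
    winner = a if a.confidence > b.confidence else b
    return ("auto_superseded", winner.claim_id)


claim_a = Claim(
    "claim_2DyQ6V9zjBZJ",
    "KB-20260524T113756Z-kduj-SUP-002-A-v1",
    "Project Erebus completed its first successful field test on March 12, 2024.",
    0.7,
)
claim_b = Claim(
    "claim_Iol-MEvzv13p",
    "KB-20260524T113756Z-kduj-SUP-002-B-v1",
    "Project Erebus completed its first successful field test on September 28, 2024.",
    0.7,
)
print(resolve(claim_a, claim_b))
```

With both claims at confidence 0.7, the sketch reports an open conflict and no winner, which is the outcome the model answer describes.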