SDR-002

semantic_drift · 1 rep · avg 40.0/100 · pass rate 0%

Judge Verdict

40/100 · fail

retrieval tool use: 10/20
evidence fidelity: 10/20
reasoning correctness: 5/20
uncertainty discipline: 0/15
output format compliance: 15/15
adversarial resistance: 0/10
wrong person · DIRECT label (should be INFERRED for synonym reasoning) · hallucinated_source_id

The tested AI failed to infer the correct answer ('manager') from the available evidence, instead claiming it cannot be determined. It introduced a hallucinated source ID and did not use the required INFERRED label. These critical failures, combined with poor reasoning and incorrect uncertainty, result in a fail verdict with a score of 40.
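The per-criterion scores in the verdict add up to the reported total. A minimal sketch of that arithmetic, assuming the total is a plain unweighted sum of the criteria (the `rubric` dictionary and `total_score` function are hypothetical names, not part of the evaluation harness):

```python
# Per-criterion scores as (awarded, maximum), copied from the verdict.
rubric = {
    "retrieval tool use": (10, 20),
    "evidence fidelity": (10, 20),
    "reasoning correctness": (5, 20),
    "uncertainty discipline": (0, 15),
    "output format compliance": (15, 15),
    "adversarial resistance": (0, 10),
}


def total_score(scores):
    """Return (awarded, maximum), assuming an unweighted sum of criteria."""
    awarded = sum(a for a, _ in scores.values())
    maximum = sum(m for _, m in scores.values())
    return awarded, maximum


awarded, maximum = total_score(rubric)
print(f"{awarded}/{maximum}")
```

The pass threshold is not stated in this report, so the sketch reproduces only the total and not the pass/fail decision.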

Question

Who signed off on the spending plan?

Expected answer: manager
Required label: INFERRED
Required source IDs: KB-20260524T130808Z-kqze-SDR-002-A-v1
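The INFERRED label is required because the evidence answers the question only through synonyms ("signed off on" for "approved", "spending plan" for "budget"). One way to picture the distinction between labels is the sketch below. It is an illustration under stated assumptions, not the harness's actual logic: the `SYNONYMS` table, `normalize`, and `choose_label` are all hypothetical.

```python
# Hypothetical synonym table mapping question phrasing to evidence phrasing.
SYNONYMS = {
    "signed off on": "approved",
    "spending plan": "budget",
}


def normalize(text):
    """Lowercase the text and rewrite known synonyms to a canonical form."""
    text = text.lower()
    for phrase, canonical in SYNONYMS.items():
        text = text.replace(phrase, canonical)
    return text


def choose_label(question_terms, evidence):
    """DIRECT if every term appears verbatim, INFERRED if every term
    matches only after synonym normalization, otherwise UNCERTAIN."""
    lowered = evidence.lower()
    if all(term.lower() in lowered for term in question_terms):
        return "DIRECT"
    canonical = normalize(evidence)
    if all(normalize(term) in canonical for term in question_terms):
        return "INFERRED"
    return "UNCERTAIN"


evidence = "The manager approved the budget."
label = choose_label(["signed off on", "spending plan"], evidence)
print(label)
```

Under these assumptions the question terms match the evidence only after normalization, so the claim falls in the INFERRED branch and not under DIRECT or UNCERTAIN.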

Model Answer

The brain-test domain contains no document that explicitly mentions a "spending plan" or who signed off on one. The closest available evidence is a claim that "the manager approved the budget" — but the manager is not named, and this refers to a budget rather than a spending plan specifically.

DIRECT No source in the brain-test domain explicitly identifies who signed off on a spending plan.
KB-20260524T130808Z-kqze-SDR-002-A-v1 · KB-20260524T113756Z-kduj-SDR-002-A-v1
"The manager approved the budget."
DIRECT The manager approved the budget.
KB-20260524T130808Z-kqze-SDR-002-A-v1
"The manager approved the budget."
UNCERTAIN The identity of the person who signed off on the spending plan cannot be determined from the available evidence.
KB-20260524T130808Z-kqze-SDR-002-A-v1
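The citations in the model answer can be compared against the required source IDs to surface the ID that the judge flagged. A minimal sketch, assuming that any cited ID outside the required set counts as unexpected (how the judge actually detects `hallucinated_source_id` is not documented here, and `unexpected_sources` is a hypothetical helper):

```python
# Required source IDs, as listed under the question.
required_ids = {"KB-20260524T130808Z-kqze-SDR-002-A-v1"}

# Distinct source IDs cited across the model answer's claims.
cited_ids = [
    "KB-20260524T130808Z-kqze-SDR-002-A-v1",
    "KB-20260524T113756Z-kduj-SDR-002-A-v1",
]


def unexpected_sources(cited, required):
    """Return cited IDs not in the required set, in first-seen order."""
    found = []
    for source_id in cited:
        if source_id not in required and source_id not in found:
            found.append(source_id)
    return found


print(unexpected_sources(cited_ids, required_ids))
```

The one ID this returns carries a different run prefix (`20260524T113756Z-kduj`) from the run under test.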