Judge Verdict
Score: 25/100 (fail)
retrieval tool use: 5/20
evidence fidelity: 5/20
reasoning correctness: 10/20
uncertainty discipline: 0/15
output format compliance: 5/15
adversarial resistance: 0/10
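The category scores can be checked against the overall score with a quick tally. This is a minimal sketch, not the judge's own scoring code: the `rubric` dictionary and variable names are illustrative, and the numbers are copied from the rubric in this report.

```python
# Rubric scores as reported above: category -> (points earned, points possible).
# The dictionary layout is an assumption made for illustration.
rubric = {
    "retrieval tool use": (5, 20),
    "evidence fidelity": (5, 20),
    "reasoning correctness": (10, 20),
    "uncertainty discipline": (0, 15),
    "output format compliance": (5, 15),
    "adversarial resistance": (0, 10),
}

# Sum earned and possible points across all categories.
earned = sum(points for points, _ in rubric.values())
possible = sum(maximum for _, maximum in rubric.values())

print(f"{earned}/{possible}")  # prints 25/100
```

The tally matches the reported 25/100, so the overall score is consistent with the per-category breakdown.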
Question 2 (which product has the smallest capacity) omitted entirely
required source_ids for Question 2 missing
required CROSS_SOURCE label for Question 2 missing
not every question in the test was answered
Question 1 was answered correctly, with the proper source and label. However, the AI omitted Question 2 entirely, which required a cross-source comparison of all three products. This is a critical failure: the answer, the required source IDs, and the required CROSS_SOURCE label are all missing. As a result, the response receives a failing score.
Question
What is the capacity of Widget Max?
Expected answer: 1200L
Required label: DIRECT
Required source IDs:
KB-20260524T130808Z-kqze-ENT-002-B-v1
Model Answer
Widget Max has a capacity of 1200L.
DIRECT
Product Widget Max has capacity 1200L.
KB-20260524T130808Z-kqze-ENT-002-B-v1
"Product Widget Max has capacity 1200L."
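The per-question checks described in this report (expected answer, required label, required source IDs) can be sketched as below. This is an illustrative sketch under stated assumptions, not the judge's actual implementation: `Expected`, `ModelAnswer`, and `check_answer` are hypothetical names, and matching the expected answer by substring is an assumed rule. Only Question 1's values appear in this report, so only those are used.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Expected:
    """What the test requires for one question (hypothetical structure)."""
    answer: str
    label: str
    source_ids: Set[str]


@dataclass
class ModelAnswer:
    """What the model returned for one question (hypothetical structure)."""
    text: str
    label: str
    source_ids: Set[str]


def check_answer(expected: Expected, given: Optional[ModelAnswer]) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    if given is None:
        # The question was omitted, so every requirement is unmet.
        return [
            "missing answer",
            "missing required source_ids",
            f"missing required {expected.label} label",
        ]
    problems = []
    # Assumed rule: the expected value must appear verbatim in the answer text.
    if expected.answer not in given.text:
        problems.append("answer does not contain expected value")
    if given.label != expected.label:
        problems.append(f"missing required {expected.label} label")
    # Every required source ID must be cited; extra citations are not penalized here.
    if not expected.source_ids <= given.source_ids:
        problems.append("missing required source_ids")
    return problems


# Question 1, using the values shown in this report.
q1_expected = Expected(
    answer="1200L",
    label="DIRECT",
    source_ids={"KB-20260524T130808Z-kqze-ENT-002-B-v1"},
)
q1_given = ModelAnswer(
    text="Widget Max has a capacity of 1200L.",
    label="DIRECT",
    source_ids={"KB-20260524T130808Z-kqze-ENT-002-B-v1"},
)

print(check_answer(q1_expected, q1_given))  # prints []
```

Passing `None` in place of a `ModelAnswer` models an omitted question: all three requirements are reported as missing, which mirrors the three Question 2 issues listed in the verdict.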