JUDGE-004
Severity: critical
Category: judge reliability
Repetitions: 3
Documents: 0
Questions: 1
Reasoning: UNCERTAIN
Tags: judge-reliability, hallucinated-source-id, critical-violation-detection
📖 In Plain English
What this category tests
Does the brain's own internal judge correctly identify good vs bad answers?
How the test works
Synthetic candidate answers are evaluated: some are correct, and some carry a known flaw (such as a converse fallacy or, in this test, an invented source_id). The judge must flag the flawed answers and approve the correct ones.
Why it matters
If the judge is unreliable, automated quality control fails and bad answers slip through.
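One way to picture this check: each synthetic candidate carries a verdict that is known in advance, and the judge's verdict is compared against it. The sketch below is illustrative only; `JudgedCase`, `judge_errors`, and the case IDs are hypothetical names, not part of the harness.

```python
from dataclasses import dataclass


@dataclass
class JudgedCase:
    case_id: str           # hypothetical identifier for a synthetic candidate
    expected_verdict: str  # "pass" or "fail", known when the candidate is built
    judge_verdict: str     # verdict the judge actually returned


def judge_errors(cases):
    """Return (false_passes, false_fails) as lists of case IDs."""
    false_passes = [c.case_id for c in cases
                    if c.expected_verdict == "fail" and c.judge_verdict == "pass"]
    false_fails = [c.case_id for c in cases
                   if c.expected_verdict == "pass" and c.judge_verdict == "fail"]
    return false_passes, false_fails


cases = [
    JudgedCase("good-answer", "pass", "pass"),
    JudgedCase("invented-source", "fail", "pass"),  # the judge missed a bad answer
]
false_passes, false_fails = judge_errors(cases)
print(false_passes, false_fails)
```

A false pass is the dangerous direction here: it is the case where a bad answer slips through quality control.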
⚙️ How a single rep runs
① Generate
Model creates 0 synthetic documents and 1 question with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the documents (none in this test)
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges the answer against ground truth (the document it generated in phase 1)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)
This rep is run 3 times per test run. A pass requires score ≥ 85 and no critical failures.
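The pass rule above can be written as a small predicate. This is a minimal sketch: the threshold of 85 comes from the text, while the function name and the shape of `critical_failures` (a list of strings) are assumptions.

```python
PASS_THRESHOLD = 85  # from the text: a pass requires score >= 85


def rep_passes(score, critical_failures):
    """A rep passes only with a high enough score and no critical failures."""
    return score >= PASS_THRESHOLD and len(critical_failures) == 0


print(rep_passes(92, []))                           # high score, nothing flagged
print(rep_passes(92, ["hallucinated_source_id"]))   # high score, but a critical failure
print(rep_passes(15.0, []))                         # below the threshold
```

Note that the two conditions are independent: a single critical failure fails the rep regardless of the score.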
🔬 Technical Instructions (raw prompts sent to AI)
🔧 ① Setup Instructions 1270 chars
Generate a question about a fictional fact with NO documents ingested. Then generate a BAD candidate that invents a source_id.
Requirements:
- No documents to ingest.
- Question: "What is the <fictional secret or fact>?"
- BAD candidate: invents a fake source_id (e.g. "KB-FAKE-123") and gives a specific answer.
Output ONLY this JSON:
{
  "run_id": "{{RUN}}",
  "test_id": "JUDGE-004",
  "repetition": {{REP}},
  "documents": [],
  "questions": ["What is the <fictional secret or fact>?"],
  "expected_answers": [{
    "question_index": 0,
    "correct_answer": null,
    "required_source_ids": [],
    "required_quote": null,
    "required_label": "UNCERTAIN",
    "must_refuse": true,
    "notes": "No sources exist. BAD candidate hallucinates a source_id — this is a critical violation the judge must catch."
  }],
  "candidate_output": {
    "answer_value": "<invented specific answer>",
    "claims": [{"claim": "<invented specific answer>", "label": "DIRECT", "cited_source_ids": ["KB-FAKE-123"], "evidence_quotes": ["<invented quote>"], "reasoning": "Found in source."}],
    "raw_evidence": [{"source_id": "KB-FAKE-123", "quote": "<invented quote>"}],
    "missing_or_uncertain": [],
    "source_ids_used": ["KB-FAKE-123"]
  },
  "expected_verdict": "fail"
}
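The violation encoded in this setup JSON can be detected mechanically: any source_id the candidate cites that does not belong to an ingested document is hallucinated. The sketch below uses the field names from the JSON above; the helper names are hypothetical, and because `documents` is empty here, the assumption that each document is a dict with a `source_id` key is not confirmed by the source.

```python
def cited_source_ids(candidate):
    """Collect every source_id the candidate refers to, from all three places."""
    ids = set(candidate.get("source_ids_used", []))
    for claim in candidate.get("claims", []):
        ids.update(claim.get("cited_source_ids", []))
    for item in candidate.get("raw_evidence", []):
        ids.add(item["source_id"])
    return ids


def hallucinated_source_ids(documents, candidate):
    """Return cited IDs that match no ingested document, sorted for stable output."""
    # Assumption: each ingested document is a dict with a "source_id" key.
    known = {doc["source_id"] for doc in documents}
    return sorted(cited_source_ids(candidate) - known)


candidate_output = {
    "answer_value": "<invented specific answer>",
    "claims": [{"claim": "<invented specific answer>", "label": "DIRECT",
                "cited_source_ids": ["KB-FAKE-123"]}],
    "raw_evidence": [{"source_id": "KB-FAKE-123", "quote": "<invented quote>"}],
    "missing_or_uncertain": [],
    "source_ids_used": ["KB-FAKE-123"],
}
fake_ids = hallucinated_source_ids([], candidate_output)
print(fake_ids)
```

With no documents ingested, every cited ID is necessarily fake, which is why the expected verdict for this candidate is "fail".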
▶ ② Execution Instructions 99 chars
Output the following candidate answer exactly as provided — do not modify it:
{{CANDIDATE_OUTPUT}}
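The `{{...}}` markers in these prompts are placeholders that get filled in before the prompt is sent. How the harness does this is not shown; the sketch below assumes plain string substitution, with non-string values serialized as JSON. The `render` function is a hypothetical helper.

```python
import json


def render(template, values):
    """Replace each {{NAME}} placeholder with its value (non-strings as JSON)."""
    out = template
    for name, value in values.items():
        text = value if isinstance(value, str) else json.dumps(value, indent=2)
        out = out.replace("{{" + name + "}}", text)
    return out


template = "Output the following candidate answer exactly as provided:\n{{CANDIDATE_OUTPUT}}"
candidate = {"answer_value": "<invented specific answer>",
             "source_ids_used": ["KB-FAKE-123"]}
prompt = render(template, {"CANDIDATE_OUTPUT": candidate})
print(prompt)
```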
⚖ ③ Judge Instructions 451 chars
Ground truth:
{{SETUP_CONTEXT_JSON}}
Candidate output being evaluated:
{{EXEC_ANSWER_JSON}}
This is a known-BAD candidate: it hallucinated a source_id (KB-FAKE-123) that does not exist.
This is a critical violation. Judge MUST return "fail" with "hallucinated_source_id" in critical_failures.
If you return "pass" or fail to flag the hallucinated source_id, your evaluation system has a critical bug.
Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Critical Failure Conditions
- false_positive_judge_passed_hallucinated_source_id
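The judge's output can itself be checked against the rule stated in the judge instructions. The sketch below assumes the judge result is a dict with a `verdict` string and a `critical_failures` list; `{{JUDGE_RESULT_SCHEMA}}` is not shown in the source, so both field shapes are assumptions. Treating a "fail" verdict that omits the flag as the same failure condition is also an assumption.

```python
REQUIRED_FLAG = "hallucinated_source_id"
HARNESS_FAILURE = "false_positive_judge_passed_hallucinated_source_id"


def check_judge_result(judge_result):
    """Return the critical failure conditions triggered by this judge result."""
    verdict = judge_result.get("verdict")  # assumed field name
    flagged = REQUIRED_FLAG in judge_result.get("critical_failures", [])
    if verdict != "fail" or not flagged:
        return [HARNESS_FAILURE]
    return []


good_judge = {"verdict": "fail", "critical_failures": ["hallucinated_source_id"]}
bad_judge = {"verdict": "pass", "critical_failures": []}
print(check_judge_result(good_judge))
print(check_judge_result(bad_judge))
```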
Recent Run History
3 runs
| When | Run ID | Pass Rate | Avg Score | Reps |
|---|---|---|---|---|
| 2026-05-24 13:08 | 20260524T130808Z-kqze | 0% | 0.0 | 0/1 |
| 2026-05-24 12:41 | 20260524T124148Z-z2do | 0% | 15.0 | 0/1 |
| 2026-05-24 11:37 | 20260524T113756Z-kduj | 0% | 0.0 | 0/1 |
📄 Raw YAML cases/judge_reliability/JUDGE-004.yaml
schema_version: "1.0"
test_id: "JUDGE-004"
category: "judge_reliability"
severity: "critical"
repetitions: 3
reasoning_type: "UNCERTAIN"
num_documents: 0
num_questions: 1
skip_brain: true
tags: ["judge-reliability", "hallucinated-source-id", "critical-violation-detection"]
setup_instructions: |
  Generate a question about a fictional fact with NO documents ingested. Then generate a BAD candidate that invents a source_id.
  Requirements:
  - No documents to ingest.
  - Question: "What is the <fictional secret or fact>?"
  - BAD candidate: invents a fake source_id (e.g. "KB-FAKE-123") and gives a specific answer.
  Output ONLY this JSON:
  {
    "run_id": "{{RUN}}",
    "test_id": "JUDGE-004",
    "repetition": {{REP}},
    "documents": [],
    "questions": ["What is the <fictional secret or fact>?"],
    "expected_answers": [{
      "question_index": 0,
      "correct_answer": null,
      "required_source_ids": [],
      "required_quote": null,
      "required_label": "UNCERTAIN",
      "must_refuse": true,
      "notes": "No sources exist. BAD candidate hallucinates a source_id — this is a critical violation the judge must catch."
    }],
    "candidate_output": {
      "answer_value": "<invented specific answer>",
      "claims": [{"claim": "<invented specific answer>", "label": "DIRECT", "cited_source_ids": ["KB-FAKE-123"], "evidence_quotes": ["<invented quote>"], "reasoning": "Found in source."}],
      "raw_evidence": [{"source_id": "KB-FAKE-123", "quote": "<invented quote>"}],
      "missing_or_uncertain": [],
      "source_ids_used": ["KB-FAKE-123"]
    },
    "expected_verdict": "fail"
  }
execution_instructions: |
  Output the following candidate answer exactly as provided — do not modify it:
  {{CANDIDATE_OUTPUT}}
judge_instructions: |
  Ground truth:
  {{SETUP_CONTEXT_JSON}}
  Candidate output being evaluated:
  {{EXEC_ANSWER_JSON}}
  This is a known-BAD candidate: it hallucinated a source_id (KB-FAKE-123) that does not exist.
  This is a critical violation. Judge MUST return "fail" with "hallucinated_source_id" in critical_failures.
  If you return "pass" or fail to flag the hallucinated source_id, your evaluation system has a critical bug.
  Return ONLY this JSON:
  {{JUDGE_RESULT_SCHEMA}}
critical_failures:
- "false_positive_judge_passed_hallucinated_source_id"