JUDGE-005

judge reliability

high

Repetitions: 3

Documents: 2

Questions: 1

Reasoning: CROSS_SOURCE

judge-reliability incomplete-retrieval partial-credit-detection

📖 In Plain English

What this category tests

Does the brain's internal judge correctly distinguish good answers from bad ones?

How the test works

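Synthetic candidate answers, some correct and some with known flaws such as the converse fallacy, are submitted to the judge. The judge must flag the flawed ones and approve the correct ones. In JUDGE-005 the flawed candidate reaches the correct answer but cites only one of the two required sources.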

Why it matters

If the judge is unreliable, automated quality control fails and bad answers slip through.

⚙️ How a single rep runs

① Generate

Model creates 2 synthetic documents and 1 question with unique canary tokens

→ Fresh content per run prevents memorization and proves real retrieval

② Ingest (MCP)

Model calls brain_ingest to store the 2 documents

→ Tests the brain's storage and indexing pipeline

③ Query (MCP)

Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)

→ Core test — does the brain return correct evidence and let the model build a faithful answer?

④ Evaluate

Model judges the answer against ground truth (the documents it generated in phase 1)

→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

This rep is run 3 times per test run. A pass requires score ≥ 85 and no critical failures.
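
The pass rule stated above can be written down directly. The following is a minimal sketch, not the harness's actual code: the `RepResult` class and the `rep_passes` and `pass_rate` functions are hypothetical names introduced for illustration. Only the threshold of 85 and the "no critical failures" condition come from the text.

```python
from dataclasses import dataclass, field

PASS_THRESHOLD = 85  # from the text: a pass requires score >= 85


@dataclass
class RepResult:
    """Hypothetical container for the outcome of one repetition."""
    score: float
    critical_failures: list = field(default_factory=list)


def rep_passes(result: RepResult) -> bool:
    # Both conditions must hold: high enough score AND no critical failures.
    return result.score >= PASS_THRESHOLD and not result.critical_failures


def pass_rate(results: list) -> float:
    # Fraction of repetitions that passed; 0.0 when there are no results.
    if not results:
        return 0.0
    return sum(rep_passes(r) for r in results) / len(results)
```

Under this sketch a rep scoring 90 still fails if it carries a critical failure, which is the point of listing the two conditions separately.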

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 1915 chars

Generate two documents forming a two-step chain. Generate a PARTIAL candidate that gets the right answer but only cites one source.

Requirements:
- Doc A: "<entity> has property <X>."
- Doc B: "The thing with property <X> is located in <place>."
- source_ids: KB-{{RUN}}-JUDGE-005-A-v1, KB-{{RUN}}-JUDGE-005-B-v1
- Question: "Where is <entity> located?"
- PARTIAL candidate: gives the correct answer (<place>) but cites ONLY Doc B, not Doc A. The reasoning is incomplete because it skips the first hop.

Output ONLY this JSON:
{
  "run_id": "{{RUN}}",
  "test_id": "JUDGE-005",
  "repetition": {{REP}},
  "documents": [
    {"source_id": "KB-{{RUN}}-JUDGE-005-A-v1", "content": "<entity> has property <X>.", "title": "<title A>", "version": 1},
    {"source_id": "KB-{{RUN}}-JUDGE-005-B-v1", "content": "The thing with property <X> is located in <place>.", "title": "<title B>", "version": 1}
  ],
  "questions": ["Where is <entity> located?"],
  "expected_answers": [{
    "question_index": 0,
    "correct_answer": "<place>",
    "required_source_ids": ["KB-{{RUN}}-JUDGE-005-A-v1", "KB-{{RUN}}-JUDGE-005-B-v1"],
    "required_quote": null,
    "required_label": "CROSS_SOURCE",
    "must_refuse": false,
    "notes": "PARTIAL candidate: correct answer but missing Doc A. Judge must penalize missing required source and incomplete chain. Should fail or warn, not pass."
  }],
  "candidate_output": {
    "answer_value": "<place>",
    "claims": [{"claim": "<entity> is in <place>.", "label": "DIRECT", "cited_source_ids": ["KB-{{RUN}}-JUDGE-005-B-v1"], "evidence_quotes": ["The thing with property <X> is located in <place>."], "reasoning": "Found in Doc B."}],
    "raw_evidence": [{"source_id": "KB-{{RUN}}-JUDGE-005-B-v1", "quote": "The thing with property <X> is located in <place>."}],
    "missing_or_uncertain": [],
    "source_ids_used": ["KB-{{RUN}}-JUDGE-005-B-v1"]
  },
  "expected_verdict": "fail"
}
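
To make the intended flaw concrete, the sketch below compares a `candidate_output` against one `expected_answers` entry using the field names from the JSON above. The function name `find_gaps` and the sample values ("demo" in place of `{{RUN}}`, "Oslo" in place of `<place>`) are assumptions for illustration only.

```python
def find_gaps(expected: dict, candidate: dict) -> dict:
    """Compare a candidate_output against one expected_answers entry."""
    required = set(expected["required_source_ids"])
    # Collect every source the candidate cites, at top level and per claim.
    cited = set(candidate["source_ids_used"])
    for claim in candidate["claims"]:
        cited.update(claim["cited_source_ids"])
    labels = {claim["label"] for claim in candidate["claims"]}
    return {
        "answer_ok": candidate["answer_value"] == expected["correct_answer"],
        "missing_source_ids": sorted(required - cited),
        "label_ok": labels == {expected["required_label"]},
    }


# Illustrative values only: "demo" stands in for {{RUN}}, "Oslo" for <place>.
expected = {
    "correct_answer": "Oslo",
    "required_source_ids": ["KB-demo-JUDGE-005-A-v1", "KB-demo-JUDGE-005-B-v1"],
    "required_label": "CROSS_SOURCE",
}
candidate = {
    "answer_value": "Oslo",
    "claims": [{"label": "DIRECT", "cited_source_ids": ["KB-demo-JUDGE-005-B-v1"]}],
    "source_ids_used": ["KB-demo-JUDGE-005-B-v1"],
}
gaps = find_gaps(expected, candidate)
```

For the PARTIAL candidate, the answer matches but Doc A is missing and the label is wrong, which is exactly the combination the judge is expected to catch.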

▶ ② Execution Instructions 99 chars

Output the following candidate answer exactly as provided — do not modify it:
{{CANDIDATE_OUTPUT}}
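
The `{{...}}` placeholders in these prompts are filled in before the prompt is sent. The harness's templating code is not shown on this page, so the following is only a sketch of how such substitution could work. The `render` function and its strict handling of unknown placeholders are assumptions.

```python
import json
import re

# Matches placeholders such as {{RUN}}, {{REP}} or {{CANDIDATE_OUTPUT}}.
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def render(template: str, values: dict) -> str:
    """Substitute {{NAME}} placeholders; non-string values become JSON."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            # Fail loudly rather than send a prompt with a raw placeholder.
            raise KeyError(f"no value for placeholder {name}")
        value = values[name]
        return value if isinstance(value, str) else json.dumps(value)

    return PLACEHOLDER.sub(substitute, template)


source_id = render("KB-{{RUN}}-JUDGE-005-A-v1", {"RUN": "demo"})
prompt_tail = render("{{CANDIDATE_OUTPUT}}", {"CANDIDATE_OUTPUT": {"answer_value": "Oslo"}})
```

Serializing the candidate with `json.dumps` keeps it byte-for-byte reproducible, which matters when the instruction is to echo it without modification.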

⚖ ③ Judge Instructions 472 chars

Ground truth:
{{SETUP_CONTEXT_JSON}}

Candidate output being evaluated:
{{EXEC_ANSWER_JSON}}

This is a PARTIAL candidate: correct final answer but missing Doc A (KB-{{RUN}}-JUDGE-005-A-v1) and incorrectly labeled DIRECT instead of CROSS_SOURCE.
Judge must penalize missing required source_id and wrong label. Should return "fail" or "warn".
If you return "pass", your evaluation system does not enforce complete retrieval.

Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
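
The penalties described in the judge instructions above can be sketched as a deterministic reference check, useful for comparing against whatever the model judge returns. This is an assumption-laden illustration: `reference_verdict` and the reason strings are hypothetical, the contents of `{{JUDGE_RESULT_SCHEMA}}` are not shown on this page, and the choice to map a label-only problem to "warn" is one possible policy rather than the documented one.

```python
def reference_verdict(expected: dict, candidate: dict):
    """Return (verdict, reasons) for a candidate under simple fixed rules."""
    reasons = []
    if candidate["answer_value"] != expected["correct_answer"]:
        reasons.append("wrong_answer")
    missing = set(expected["required_source_ids"]) - set(candidate["source_ids_used"])
    if missing:
        reasons.append("missing_required_source")
    if any(c["label"] != expected["required_label"] for c in candidate["claims"]):
        reasons.append("wrong_label")

    # Assumed policy: a wrong answer or a missing source is a hard fail,
    # a label problem alone is only a warning.
    if "wrong_answer" in reasons or "missing_required_source" in reasons:
        verdict = "fail"
    elif reasons:
        verdict = "warn"
    else:
        verdict = "pass"
    return verdict, reasons


# Illustrative values only: "demo" stands in for {{RUN}}, "Oslo" for <place>.
expected = {
    "correct_answer": "Oslo",
    "required_source_ids": ["KB-demo-JUDGE-005-A-v1", "KB-demo-JUDGE-005-B-v1"],
    "required_label": "CROSS_SOURCE",
}
partial = {
    "answer_value": "Oslo",
    "claims": [{"label": "DIRECT"}],
    "source_ids_used": ["KB-demo-JUDGE-005-B-v1"],
}
complete = {
    "answer_value": "Oslo",
    "claims": [{"label": "CROSS_SOURCE"}],
    "source_ids_used": ["KB-demo-JUDGE-005-A-v1", "KB-demo-JUDGE-005-B-v1"],
}
```

With these rules the PARTIAL candidate fails for two independent reasons, so a judge that returns "pass" has missed both.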

Critical Failure Conditions

false_positive_judge_passed_incomplete_retrieval
false_positive_judge_ignored_wrong_label
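
One way these two conditions could be derived from a judge's result is sketched below. The `verdict` and `flagged_issues` fields are hypothetical, since the real judge result schema is not shown on this page, and the mapping from judge output to each condition is an assumption.

```python
def detect_critical_failures(judge_result: dict) -> list:
    """Flag critical failures for the known-PARTIAL JUDGE-005 candidate."""
    failures = []
    verdict = judge_result.get("verdict")
    issues = set(judge_result.get("flagged_issues", []))

    # The candidate is known to be incomplete, so "pass" is a false positive.
    if verdict == "pass":
        failures.append("false_positive_judge_passed_incomplete_retrieval")
    # The candidate is labeled DIRECT instead of CROSS_SOURCE; a judge that
    # does not mention the label problem has ignored it.
    if "wrong_label" not in issues:
        failures.append("false_positive_judge_ignored_wrong_label")
    return failures


lenient = detect_critical_failures({"verdict": "pass", "flagged_issues": []})
strict = detect_critical_failures(
    {"verdict": "fail", "flagged_issues": ["missing_required_source", "wrong_label"]}
)
```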

Recent Run History

3 runs

When	Run ID	Pass Rate	Avg Score	Reps
2026-05-24 13:08	20260524T130808Z-kqze	0%	47.0	0/1
2026-05-24 12:41	20260524T124148Z-z2do	0%	35.0	0/1
2026-05-24 11:37	20260524T113756Z-kduj	0%	45.0	0/1

📄 Raw YAML cases/judge_reliability/JUDGE-005.yaml

schema_version: "1.0"
test_id: "JUDGE-005"
category: "judge_reliability"
severity: "high"
repetitions: 3
reasoning_type: "CROSS_SOURCE"
num_documents: 2
num_questions: 1
skip_brain: true
tags: ["judge-reliability", "incomplete-retrieval", "partial-credit-detection"]

setup_instructions: |
  Generate two documents forming a two-step chain. Generate a PARTIAL candidate that gets the right answer but only cites one source.

  Requirements:
  - Doc A: "<entity> has property <X>."
  - Doc B: "The thing with property <X> is located in <place>."
  - source_ids: KB-{{RUN}}-JUDGE-005-A-v1, KB-{{RUN}}-JUDGE-005-B-v1
  - Question: "Where is <entity> located?"
  - PARTIAL candidate: gives the correct answer (<place>) but cites ONLY Doc B, not Doc A. The reasoning is incomplete because it skips the first hop.

  Output ONLY this JSON:
  {
    "run_id": "{{RUN}}",
    "test_id": "JUDGE-005",
    "repetition": {{REP}},
    "documents": [
      {"source_id": "KB-{{RUN}}-JUDGE-005-A-v1", "content": "<entity> has property <X>.", "title": "<title A>", "version": 1},
      {"source_id": "KB-{{RUN}}-JUDGE-005-B-v1", "content": "The thing with property <X> is located in <place>.", "title": "<title B>", "version": 1}
    ],
    "questions": ["Where is <entity> located?"],
    "expected_answers": [{
      "question_index": 0,
      "correct_answer": "<place>",
      "required_source_ids": ["KB-{{RUN}}-JUDGE-005-A-v1", "KB-{{RUN}}-JUDGE-005-B-v1"],
      "required_quote": null,
      "required_label": "CROSS_SOURCE",
      "must_refuse": false,
      "notes": "PARTIAL candidate: correct answer but missing Doc A. Judge must penalize missing required source and incomplete chain. Should fail or warn, not pass."
    }],
    "candidate_output": {
      "answer_value": "<place>",
      "claims": [{"claim": "<entity> is in <place>.", "label": "DIRECT", "cited_source_ids": ["KB-{{RUN}}-JUDGE-005-B-v1"], "evidence_quotes": ["The thing with property <X> is located in <place>."], "reasoning": "Found in Doc B."}],
      "raw_evidence": [{"source_id": "KB-{{RUN}}-JUDGE-005-B-v1", "quote": "The thing with property <X> is located in <place>."}],
      "missing_or_uncertain": [],
      "source_ids_used": ["KB-{{RUN}}-JUDGE-005-B-v1"]
    },
    "expected_verdict": "fail"
  }

execution_instructions: |
  Output the following candidate answer exactly as provided — do not modify it:
  {{CANDIDATE_OUTPUT}}

judge_instructions: |
  Ground truth:
  {{SETUP_CONTEXT_JSON}}

  Candidate output being evaluated:
  {{EXEC_ANSWER_JSON}}

  This is a PARTIAL candidate: correct final answer but missing Doc A (KB-{{RUN}}-JUDGE-005-A-v1) and incorrectly labeled DIRECT instead of CROSS_SOURCE.
  Judge must penalize missing required source_id and wrong label. Should return "fail" or "warn".
  If you return "pass", your evaluation system does not enforce complete retrieval.

  Return ONLY this JSON:
  {{JUDGE_RESULT_SCHEMA}}

critical_failures:
  - "false_positive_judge_passed_incomplete_retrieval"
  - "false_positive_judge_ignored_wrong_label"