
JUDGE-002

Category: judge reliability
Severity: critical
Repetitions: 3
Documents: 1
Questions: 1
Reasoning: DIRECT
Tags: judge-reliability, paraphrase-detection, fail-detection

📖 In Plain English

What this category tests

Does the brain's internal judge correctly tell good answers from bad ones?

How the test works

The judge evaluates synthetic candidate answers, some correct and some carrying a known flaw (for example, a converse fallacy). It must flag the bad ones and approve the good ones.

Why it matters

If the judge is unreliable, automated quality control fails and bad answers slip through.

⚙️ How a single rep runs

① Generate
Model creates 1 synthetic document and 1 question with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the 1 document
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges the answer against ground truth (the document it generated in phase 1)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

This rep is run 3 times per test run. A pass requires score ≥ 85 and no critical failures.
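The pass rule above can be sketched in a few lines. This is a minimal illustration, not the harness's implementation: the names `RepResult`, `rep_passes`, and `pass_rate` are hypothetical, and applying the threshold to each rep individually (rather than to an average across reps) is an assumption, since the page does not say which is used.

```python
from dataclasses import dataclass, field

PASS_THRESHOLD = 85  # from the page: "A pass requires score ≥ 85"


@dataclass
class RepResult:
    """Hypothetical container for the outcome of one repetition."""
    score: float
    critical_failures: list[str] = field(default_factory=list)


def rep_passes(rep: RepResult) -> bool:
    # Both conditions must hold: high enough score AND no critical failures.
    return rep.score >= PASS_THRESHOLD and not rep.critical_failures


def pass_rate(reps: list[RepResult]) -> float:
    # Fraction of reps that passed; defined as 0.0 when there are no reps.
    if not reps:
        return 0.0
    return sum(rep_passes(r) for r in reps) / len(reps)
```

Under this reading, a rep that scores 100 but records a critical failure still fails, which is why a single false positive from the judge is enough to sink the rep.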

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 1600 chars
Generate a document with one directly answerable fact. Then generate a BAD candidate answer that paraphrases the quote.
The candidate should get the general answer right but use different words in the quote.

Requirements:
- Content: "<object> is <location>." (e.g. "The key is under the mat.")
- source_id: KB-{{RUN}}-JUDGE-002-A-v1
- Candidate: cites correct source_id, gives correct answer, BUT uses a paraphrased quote (e.g. "under the rug" instead of "under the mat").

Output ONLY this JSON:
{
  "run_id": "{{RUN}}",
  "test_id": "JUDGE-002",
  "repetition": {{REP}},
  "documents": [
    {"source_id": "KB-{{RUN}}-JUDGE-002-A-v1", "content": "<sentence>", "title": "<title>", "version": 1}
  ],
  "questions": ["<question about the fact>"],
  "expected_answers": [{
    "question_index": 0,
    "correct_answer": "<the fact>",
    "required_source_ids": ["KB-{{RUN}}-JUDGE-002-A-v1"],
    "required_quote": "<exact sentence>",
    "required_label": "DIRECT",
    "must_refuse": false,
    "notes": "Known-BAD candidate: correct answer but paraphrased quote. Judge must return fail for evidence_fidelity."
  }],
  "candidate_output": {
    "answer_value": "<the fact>",
    "claims": [{"claim": "<the fact>", "label": "DIRECT", "cited_source_ids": ["KB-{{RUN}}-JUDGE-002-A-v1"], "evidence_quotes": ["<paraphrased version of the quote>"], "reasoning": "Found in source."}],
    "raw_evidence": [{"source_id": "KB-{{RUN}}-JUDGE-002-A-v1", "quote": "<paraphrased version>"}],
    "missing_or_uncertain": [],
    "source_ids_used": ["KB-{{RUN}}-JUDGE-002-A-v1"]
  },
  "expected_verdict": "fail"
}
▶ ② Execution Instructions 99 chars
Output the following candidate answer exactly as provided — do not modify it:
{{CANDIDATE_OUTPUT}}
⚖ ③ Judge Instructions 344 chars
Ground truth:
{{SETUP_CONTEXT_JSON}}

Candidate output being evaluated:
{{EXEC_ANSWER_JSON}}

This is a known-BAD candidate: the quote is paraphrased, not verbatim. Judge must catch this and return "fail".
If you return "pass" for this candidate, your evaluation system has a false-positive bug.

Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
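The property the judge is expected to enforce here, that every quoted piece of evidence appears verbatim in the cited source, can also be expressed as a deterministic check. The sketch below assumes the field names shown in the setup JSON (`documents`, `claims`, `raw_evidence`, and so on); the function names are hypothetical, and the real judge is a model prompt, not this code.

```python
def quote_is_verbatim(quote: str, document_content: str) -> bool:
    """True only if the quote appears character-for-character in the document."""
    return bool(quote) and quote in document_content


def evidence_fidelity_verdict(candidate: dict, documents: list[dict]) -> str:
    """Return "fail" if any quoted evidence is not verbatim, else "pass".

    Only quotes that are present are checked; a candidate with no
    evidence at all is outside the scope of this sketch.
    """
    content_by_id = {d["source_id"]: d["content"] for d in documents}

    # Each claim's quotes must appear in at least one source that claim cites.
    for claim in candidate.get("claims", []):
        cited = [content_by_id.get(sid, "") for sid in claim.get("cited_source_ids", [])]
        for quote in claim.get("evidence_quotes", []):
            if not any(quote_is_verbatim(quote, text) for text in cited):
                return "fail"

    # Each raw evidence quote must appear in the source it names.
    for item in candidate.get("raw_evidence", []):
        text = content_by_id.get(item["source_id"], "")
        if not quote_is_verbatim(item["quote"], text):
            return "fail"

    return "pass"
```

With a document whose content is "The key is under the mat." and a candidate quoting "The key is under the rug.", this returns "fail", matching the `expected_verdict` in the setup JSON. A plain substring test is deliberately strict: it also rejects differences in whitespace or punctuation, which may or may not match how the real judge treats them.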

Critical Failure Conditions

false_positive_judge_passed_paraphrased_quote

Recent Run History

3 runs

When              Run ID                 Pass Rate  Avg Score  Reps
2026-05-24 13:08  20260524T130808Z-kqze  0%         70.0       0/1
2026-05-24 12:41  20260524T124148Z-z2do  0%         60.0       0/1
2026-05-24 11:37  20260524T113756Z-kduj  0%         50.0       0/1
📄 Raw YAML cases/judge_reliability/JUDGE-002.yaml
schema_version: "1.0"
test_id: "JUDGE-002"
category: "judge_reliability"
severity: "critical"
repetitions: 3
reasoning_type: "DIRECT"
num_documents: 1
num_questions: 1
skip_brain: true
tags: ["judge-reliability", "paraphrase-detection", "fail-detection"]

setup_instructions: |
  Generate a document with one directly answerable fact. Then generate a BAD candidate answer that paraphrases the quote.
  The candidate should get the general answer right but use different words in the quote.

  Requirements:
  - Content: "<object> is <location>." (e.g. "The key is under the mat.")
  - source_id: KB-{{RUN}}-JUDGE-002-A-v1
  - Candidate: cites correct source_id, gives correct answer, BUT uses a paraphrased quote (e.g. "under the rug" instead of "under the mat").

  Output ONLY this JSON:
  {
    "run_id": "{{RUN}}",
    "test_id": "JUDGE-002",
    "repetition": {{REP}},
    "documents": [
      {"source_id": "KB-{{RUN}}-JUDGE-002-A-v1", "content": "<sentence>", "title": "<title>", "version": 1}
    ],
    "questions": ["<question about the fact>"],
    "expected_answers": [{
      "question_index": 0,
      "correct_answer": "<the fact>",
      "required_source_ids": ["KB-{{RUN}}-JUDGE-002-A-v1"],
      "required_quote": "<exact sentence>",
      "required_label": "DIRECT",
      "must_refuse": false,
      "notes": "Known-BAD candidate: correct answer but paraphrased quote. Judge must return fail for evidence_fidelity."
    }],
    "candidate_output": {
      "answer_value": "<the fact>",
      "claims": [{"claim": "<the fact>", "label": "DIRECT", "cited_source_ids": ["KB-{{RUN}}-JUDGE-002-A-v1"], "evidence_quotes": ["<paraphrased version of the quote>"], "reasoning": "Found in source."}],
      "raw_evidence": [{"source_id": "KB-{{RUN}}-JUDGE-002-A-v1", "quote": "<paraphrased version>"}],
      "missing_or_uncertain": [],
      "source_ids_used": ["KB-{{RUN}}-JUDGE-002-A-v1"]
    },
    "expected_verdict": "fail"
  }

execution_instructions: |
  Output the following candidate answer exactly as provided — do not modify it:
  {{CANDIDATE_OUTPUT}}

judge_instructions: |
  Ground truth:
  {{SETUP_CONTEXT_JSON}}

  Candidate output being evaluated:
  {{EXEC_ANSWER_JSON}}

  This is a known-BAD candidate: the quote is paraphrased, not verbatim. Judge must catch this and return "fail".
  If you return "pass" for this candidate, your evaluation system has a false-positive bug.

  Return ONLY this JSON:
  {{JUDGE_RESULT_SCHEMA}}

critical_failures:
  - "false_positive_judge_passed_paraphrased_quote"
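
The instruction blocks in the YAML rely on `{{NAME}}` placeholders such as `{{RUN}}`, `{{REP}}`, and `{{CANDIDATE_OUTPUT}}`. How the harness fills them is not shown on this page; the sketch below is one plausible approach, with a hypothetical `render` function, and it assumes placeholder names consist only of uppercase letters and underscores.

```python
import re

# Matches {{NAME}} where NAME is uppercase letters and underscores (assumed).
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def render(template: str, values: dict[str, str]) -> str:
    """Replace every {{NAME}} in the template with values["NAME"].

    Raises KeyError if a placeholder has no supplied value, so an
    unfilled template is never sent to the model silently.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError("no value supplied for placeholder {{" + name + "}}")
        return values[name]

    # A function replacement inserts the value literally, so backslashes
    # in JSON payloads are not interpreted as regex escapes.
    return PLACEHOLDER.sub(substitute, template)
```

For example, `render("KB-{{RUN}}-JUDGE-002-A-v1", {"RUN": "abc"})` produces `KB-abc-JUDGE-002-A-v1`. Failing loudly on a missing value is a design choice in this sketch; a harness could equally leave unknown placeholders untouched.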