← All Test Cases

JUDGE-001

judge reliability

high

Repetitions

Documents

Questions

Reasoning

DIRECT

judge-reliability known-good pass-detection

📖 In Plain English

What this category tests

Does the brain's own internal judge correctly identify good vs bad answers?

How the test works

Synthetic 'candidate answers' (some correct, some with known flaws like converse fallacy) are evaluated. The judge must correctly flag the bad ones and approve the good ones.

Why it matters

If the judge is unreliable, automated quality control fails and bad answers slip through.

⚙️ How a single rep runs

① Generate

Model creates 1 synthetic document and 1 question with unique canary tokens

→ Fresh content per run prevents memorization and proves real retrieval

② Ingest (MCP)

Model calls brain_ingest to store the 1 document

→ Tests the brain's storage and indexing pipeline

③ Query (MCP)

Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)

→ Core test — does the brain return correct evidence and let the model build a faithful answer?

④ Evaluate

Model judges the answer against ground truth (the document it generated in phase 1)

→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

This rep is run 3 times per test run. A pass requires score ≥ 85 and no critical failures.

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 1598 chars

Generate a document with one directly answerable fact. Then generate a CORRECT candidate answer.
The candidate answer should cite the exact source_id, exact quote, and correct DIRECT label.

Requirements:
- Content: "<object> is <location/property>." (one simple direct fact, e.g. "The key is under the mat.")
- source_id: KB-{{RUN}}-JUDGE-001-A-v1
- Generate a candidate_output (a correct answer) that: cites KB-{{RUN}}-JUDGE-001-A-v1, quotes the exact sentence verbatim, labels the claim DIRECT, gives the correct answer.

Output ONLY this JSON:
{
  "run_id": "{{RUN}}",
  "test_id": "JUDGE-001",
  "repetition": {{REP}},
  "documents": [
    {"source_id": "KB-{{RUN}}-JUDGE-001-A-v1", "content": "<sentence>", "title": "<title>", "version": 1}
  ],
  "questions": ["<question about the fact>"],
  "expected_answers": [{
    "question_index": 0,
    "correct_answer": "<the fact>",
    "required_source_ids": ["KB-{{RUN}}-JUDGE-001-A-v1"],
    "required_quote": "<exact sentence>",
    "required_label": "DIRECT",
    "must_refuse": false,
    "notes": "This is a known-good candidate. Judge should return pass with high score."
  }],
  "candidate_output": {
    "answer_value": "<the fact>",
    "claims": [{"claim": "<the fact>", "label": "DIRECT", "cited_source_ids": ["KB-{{RUN}}-JUDGE-001-A-v1"], "evidence_quotes": ["<exact sentence>"], "reasoning": "Explicitly stated."}],
    "raw_evidence": [{"source_id": "KB-{{RUN}}-JUDGE-001-A-v1", "quote": "<exact sentence>"}],
    "missing_or_uncertain": [],
    "source_ids_used": ["KB-{{RUN}}-JUDGE-001-A-v1"]
  },
  "expected_verdict": "pass"
}

▶ ② Execution Instructions 99 chars

Output the following candidate answer exactly as provided — do not modify it:
{{CANDIDATE_OUTPUT}}

⚖ ③ Judge Instructions 400 chars

Ground truth:
{{SETUP_CONTEXT_JSON}}

Candidate output being evaluated:
{{EXEC_ANSWER_JSON}}

This is a known-GOOD candidate. It should score >= 85 and receive verdict "pass".
Check: correct answer, exact source_id, exact quote, DIRECT label, valid JSON format.
If you return "fail" for this candidate, your evaluation system has a false-negative bug.

Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}

Critical Failure Conditions

false_negative_judge_failed_known_good_candidate

Recent Run History

3 runs

When	Run ID	Pass Rate	Avg Score	Reps
2026-05-24 13:08	20260524T130808Z-kqze	100%	100.0	1/1	View →
2026-05-24 12:41	20260524T124148Z-z2do	100%	100.0	1/1	View →
2026-05-24 11:37	20260524T113756Z-kduj	100%	100.0	1/1	View →

📄 Raw YAML cases/judge_reliability/JUDGE-001.yaml

schema_version: "1.0"
test_id: "JUDGE-001"
category: "judge_reliability"
severity: "high"
repetitions: 3
reasoning_type: "DIRECT"
num_documents: 1
num_questions: 1
skip_brain: true
tags: ["judge-reliability", "known-good", "pass-detection"]

setup_instructions: |
  Generate a document with one directly answerable fact. Then generate a CORRECT candidate answer.
  The candidate answer should cite the exact source_id, exact quote, and correct DIRECT label.

  Requirements:
  - Content: "<object> is <location/property>." (one simple direct fact, e.g. "The key is under the mat.")
  - source_id: KB-{{RUN}}-JUDGE-001-A-v1
  - Generate a candidate_output (a correct answer) that: cites KB-{{RUN}}-JUDGE-001-A-v1, quotes the exact sentence verbatim, labels the claim DIRECT, gives the correct answer.

  Output ONLY this JSON:
  {
    "run_id": "{{RUN}}",
    "test_id": "JUDGE-001",
    "repetition": {{REP}},
    "documents": [
      {"source_id": "KB-{{RUN}}-JUDGE-001-A-v1", "content": "<sentence>", "title": "<title>", "version": 1}
    ],
    "questions": ["<question about the fact>"],
    "expected_answers": [{
      "question_index": 0,
      "correct_answer": "<the fact>",
      "required_source_ids": ["KB-{{RUN}}-JUDGE-001-A-v1"],
      "required_quote": "<exact sentence>",
      "required_label": "DIRECT",
      "must_refuse": false,
      "notes": "This is a known-good candidate. Judge should return pass with high score."
    }],
    "candidate_output": {
      "answer_value": "<the fact>",
      "claims": [{"claim": "<the fact>", "label": "DIRECT", "cited_source_ids": ["KB-{{RUN}}-JUDGE-001-A-v1"], "evidence_quotes": ["<exact sentence>"], "reasoning": "Explicitly stated."}],
      "raw_evidence": [{"source_id": "KB-{{RUN}}-JUDGE-001-A-v1", "quote": "<exact sentence>"}],
      "missing_or_uncertain": [],
      "source_ids_used": ["KB-{{RUN}}-JUDGE-001-A-v1"]
    },
    "expected_verdict": "pass"
  }

execution_instructions: |
  Output the following candidate answer exactly as provided — do not modify it:
  {{CANDIDATE_OUTPUT}}

judge_instructions: |
  Ground truth:
  {{SETUP_CONTEXT_JSON}}

  Candidate output being evaluated:
  {{EXEC_ANSWER_JSON}}

  This is a known-GOOD candidate. It should score >= 85 and receive verdict "pass".
  Check: correct answer, exact source_id, exact quote, DIRECT label, valid JSON format.
  If you return "fail" for this candidate, your evaluation system has a false-negative bug.

  Return ONLY this JSON:
  {{JUDGE_RESULT_SCHEMA}}

critical_failures:
  - "false_negative_judge_failed_known_good_candidate"