← All Test Cases

VER-005

update versioning

critical

Repetitions

Documents

Questions

Reasoning

DIRECT

update_versioning manual_section material_supersession stale_rejection safety_critical

📖 In Plain English

What this category tests

Does the brain return the latest version of a document after updates?

How the test works

A document is ingested at v1, then a v2 supersedes it. The test asks about the topic and checks the v2 content is returned, with v1 marked superseded.

Why it matters

Stale data is dangerous — users must always see the current truth.

⚙️ How a single rep runs

① Generate

Model creates 2 synthetic documents and 1 question with unique canary tokens

→ Fresh content per run prevents memorization and proves real retrieval

② Ingest (MCP)

Model calls brain_ingest to store the 2 documents

→ Tests the brain's storage and indexing pipeline

③ Query (MCP)

Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)

→ Core test — does the brain return correct evidence and let the model build a faithful answer?

④ Evaluate

Model judges the answer against ground truth (the document it generated in phase 1)

→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 2523 chars

You are the Generator AI. Your job is to produce a single JSON object for this test run.

Run ID: {{RUN}}
Repetition: {{REP}}

Steps:
1. Choose a manual section number: SECTION_N (e.g. 4, 7, 12, 3, 9).
2. Choose a stale material name: MATERIAL_A (e.g. "copper wire", "rubber tubing",
   "iron bracket", "ceramic tile", "steel cable").
3. Choose a current material name: MATERIAL_B (e.g. "fiber cable", "polymer conduit",
   "titanium alloy", "tempered glass", "carbon composite").
   Ensure MATERIAL_B is clearly different from MATERIAL_A.
4. Construct two documents in this exact ingestion order (ingest A first, then B):
     Doc A (v1 — stale):
       source_id : "KB-{{RUN}}-VER-005-A-v1"
       title     : "Technical Manual Section <SECTION_N> v1"
       content   : "Manual section <SECTION_N> says use <MATERIAL_A>."
     Doc B (v2 — current, supersedes A):
       source_id : "KB-{{RUN}}-VER-005-B-v2"
       title     : "Technical Manual Section <SECTION_N> v2"
       content   : "Manual section <SECTION_N> says use <MATERIAL_B>. Version 2 supersedes version 1."
5. Construct exactly one question (as a string):
     "What does manual section <SECTION_N> currently say to use?"
6. Construct the expected answer:
     question_index       : 0
     correct_answer       : "<MATERIAL_B>"
     required_label       : "DIRECT"
     required_source_ids  : ["KB-{{RUN}}-VER-005-B-v2"]
     must_refuse          : false
     notes                : "Executor must NOT say <MATERIAL_A> (stale from KB-{{RUN}}-VER-005-A-v1)."

Output ONLY valid JSON, no prose, no markdown fences:
{
  "run_id": "{{RUN}}",
  "test_id": "VER-005",
  "repetition": {{REP}},
  "documents": [
    {
      "source_id": "KB-{{RUN}}-VER-005-A-v1",
      "title": "Technical Manual Section <SECTION_N> v1",
      "content": "Manual section <SECTION_N> says use <MATERIAL_A>."
    },
    {
      "source_id": "KB-{{RUN}}-VER-005-B-v2",
      "title": "Technical Manual Section <SECTION_N> v2",
      "content": "Manual section <SECTION_N> says use <MATERIAL_B>. Version 2 supersedes version 1."
    }
  ],
  "questions": [
    "What does manual section <SECTION_N> currently say to use?"
  ],
  "expected_answers": [
    {
      "question_index": 0,
      "correct_answer": "<MATERIAL_B>",
      "required_label": "DIRECT",
      "required_source_ids": ["KB-{{RUN}}-VER-005-B-v2"],
      "must_refuse": false,
      "notes": "Primary source: KB-{{RUN}}-VER-005-B-v2. Stale: KB-{{RUN}}-VER-005-A-v1. Executor must NOT say <MATERIAL_A>."
    }
  ]
}

▶ ② Execution Instructions 750 chars

Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.

Question: {{QUESTION}}

Retrieve all relevant documents for this manual section. If multiple versions exist,
use the most current version. The question asks what the manual section CURRENTLY says —
always prefer the newest version when versions conflict.

Return ONLY this JSON (no prose, no markdown):
{{REQUIRED_ANSWER_SCHEMA}}
Where the schema is:
{
  "question_id": "Q1",
  "answer": "<material from current version>",
  "reasoning_type": "DIRECT",
  "primary_source_id": "<source_id of the current document>",
  "stale_source_ids": ["<any older version source_ids retrieved>"],
  "supersession_noted": <true|false>,
  "confidence": "<high|medium|low>"
}

⚖ ③ Judge Instructions 1289 chars

You are the Judge AI. Evaluate whether the Executor answered with the correct current material
and did NOT revert to the stale material from the superseded document.

Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}

Executor's answer:
{{EXEC_ANSWER_JSON}}

Evaluation rules:
1. Expected answer = MATERIAL_B from "KB-<run_id>-VER-005-B-v2".
2. must_not_say = MATERIAL_A (from stale v1 document).
3. Check executor's answer.answer equals MATERIAL_B (case-insensitive).
   If answer equals or contains MATERIAL_A = critical failure.
4. Check executor's primary_source_id equals "KB-<run_id>-VER-005-B-v2".
   If only "KB-<run_id>-VER-005-A-v1" cited as primary = critical failure.
5. If executor cited a source_id not in setup documents = hallucinated source = critical failure.
6. Bonus for supersession_noted = true (score 1.0 vs 0.85 without).
7. If executor gives MATERIAL_A as the answer at all (even partially) = critical failure.

Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
  "test_id": "VER-005",
  "run_id": "<run_id>",
  "repetition": <rep>,
  "passed": <true|false>,
  "critical_failure": <true|false>,
  "critical_failure_reason": "<null or description>",
  "score": <0.0-1.0>,
  "supersession_noted": <true|false>,
  "notes": "<brief explanation>"
}

Critical Failure Conditions

Executor answered with stale MATERIAL_A instead of current MATERIAL_B
Executor cited only the v1 source_id (KB-{{RUN}}-VER-005-A-v1) as the authoritative answer
Executor cited a source_id not present in setup documents (hallucinated source)
Executor presented the stale material as the current recommendation

Recent Run History

2 runs

When	Run ID	Pass Rate	Avg Score	Reps
2026-05-24 13:08	20260524T130808Z-kqze	100%	100.0	1/1	View →
2026-05-24 11:37	20260524T113756Z-kduj	0%	0	0/1	View →

📄 Raw YAML cases/update_versioning/VER-005.yaml

schema_version: "1.0"
test_id: "VER-005"
category: "update_versioning"
severity: "critical"
repetitions: 5
reasoning_type: "DIRECT"
num_documents: 2
num_questions: 1
tags: [update_versioning, manual_section, material_supersession, stale_rejection, safety_critical]

setup_instructions: |
  You are the Generator AI. Your job is to produce a single JSON object for this test run.

  Run ID: {{RUN}}
  Repetition: {{REP}}

  Steps:
  1. Choose a manual section number: SECTION_N (e.g. 4, 7, 12, 3, 9).
  2. Choose a stale material name: MATERIAL_A (e.g. "copper wire", "rubber tubing",
     "iron bracket", "ceramic tile", "steel cable").
  3. Choose a current material name: MATERIAL_B (e.g. "fiber cable", "polymer conduit",
     "titanium alloy", "tempered glass", "carbon composite").
     Ensure MATERIAL_B is clearly different from MATERIAL_A.
  4. Construct two documents in this exact ingestion order (ingest A first, then B):
       Doc A (v1 — stale):
         source_id : "KB-{{RUN}}-VER-005-A-v1"
         title     : "Technical Manual Section <SECTION_N> v1"
         content   : "Manual section <SECTION_N> says use <MATERIAL_A>."
       Doc B (v2 — current, supersedes A):
         source_id : "KB-{{RUN}}-VER-005-B-v2"
         title     : "Technical Manual Section <SECTION_N> v2"
         content   : "Manual section <SECTION_N> says use <MATERIAL_B>. Version 2 supersedes version 1."
  5. Construct exactly one question (as a string):
       "What does manual section <SECTION_N> currently say to use?"
  6. Construct the expected answer:
       question_index       : 0
       correct_answer       : "<MATERIAL_B>"
       required_label       : "DIRECT"
       required_source_ids  : ["KB-{{RUN}}-VER-005-B-v2"]
       must_refuse          : false
       notes                : "Executor must NOT say <MATERIAL_A> (stale from KB-{{RUN}}-VER-005-A-v1)."

  Output ONLY valid JSON, no prose, no markdown fences:
  {
    "run_id": "{{RUN}}",
    "test_id": "VER-005",
    "repetition": {{REP}},
    "documents": [
      {
        "source_id": "KB-{{RUN}}-VER-005-A-v1",
        "title": "Technical Manual Section <SECTION_N> v1",
        "content": "Manual section <SECTION_N> says use <MATERIAL_A>."
      },
      {
        "source_id": "KB-{{RUN}}-VER-005-B-v2",
        "title": "Technical Manual Section <SECTION_N> v2",
        "content": "Manual section <SECTION_N> says use <MATERIAL_B>. Version 2 supersedes version 1."
      }
    ],
    "questions": [
      "What does manual section <SECTION_N> currently say to use?"
    ],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "<MATERIAL_B>",
        "required_label": "DIRECT",
        "required_source_ids": ["KB-{{RUN}}-VER-005-B-v2"],
        "must_refuse": false,
        "notes": "Primary source: KB-{{RUN}}-VER-005-B-v2. Stale: KB-{{RUN}}-VER-005-A-v1. Executor must NOT say <MATERIAL_A>."
      }
    ]
  }

execution_instructions: |
  Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.

  Question: {{QUESTION}}

  Retrieve all relevant documents for this manual section. If multiple versions exist,
  use the most current version. The question asks what the manual section CURRENTLY says —
  always prefer the newest version when versions conflict.

  Return ONLY this JSON (no prose, no markdown):
  {{REQUIRED_ANSWER_SCHEMA}}
  Where the schema is:
  {
    "question_id": "Q1",
    "answer": "<material from current version>",
    "reasoning_type": "DIRECT",
    "primary_source_id": "<source_id of the current document>",
    "stale_source_ids": ["<any older version source_ids retrieved>"],
    "supersession_noted": <true|false>,
    "confidence": "<high|medium|low>"
  }

judge_instructions: |
  You are the Judge AI. Evaluate whether the Executor answered with the correct current material
  and did NOT revert to the stale material from the superseded document.

  Ground truth (from setup):
  {{SETUP_CONTEXT_JSON}}

  Executor's answer:
  {{EXEC_ANSWER_JSON}}

  Evaluation rules:
  1. Expected answer = MATERIAL_B from "KB-<run_id>-VER-005-B-v2".
  2. must_not_say = MATERIAL_A (from stale v1 document).
  3. Check executor's answer.answer equals MATERIAL_B (case-insensitive).
     If answer equals or contains MATERIAL_A = critical failure.
  4. Check executor's primary_source_id equals "KB-<run_id>-VER-005-B-v2".
     If only "KB-<run_id>-VER-005-A-v1" cited as primary = critical failure.
  5. If executor cited a source_id not in setup documents = hallucinated source = critical failure.
  6. Bonus for supersession_noted = true (score 1.0 vs 0.85 without).
  7. If executor gives MATERIAL_A as the answer at all (even partially) = critical failure.

  Return ONLY this JSON:
  {{JUDGE_RESULT_SCHEMA}}
  Where the schema is:
  {
    "test_id": "VER-005",
    "run_id": "<run_id>",
    "repetition": <rep>,
    "passed": <true|false>,
    "critical_failure": <true|false>,
    "critical_failure_reason": "<null or description>",
    "score": <0.0-1.0>,
    "supersession_noted": <true|false>,
    "notes": "<brief explanation>"
  }

critical_failures:
  - "Executor answered with stale MATERIAL_A instead of current MATERIAL_B"
  - "Executor cited only the v1 source_id (KB-{{RUN}}-VER-005-A-v1) as the authoritative answer"
  - "Executor cited a source_id not present in setup documents (hallucinated source)"
  - "Executor presented the stale material as the current recommendation"