← All Test Cases
critical
VER-005
update versioning
Repetitions
5
Documents
2
Questions
1
Reasoning
DIRECT
update_versioning
manual_section
material_supersession
stale_rejection
safety_critical
📖 In Plain English
What this category tests
Does the brain return the latest version of a document after updates?
How the test works
A document is ingested at v1, then a v2 supersedes it. The test asks about the topic and checks the v2 content is returned, with v1 marked superseded.
Why it matters
Stale data is dangerous — users must always see the current truth.
⚙️ How a single rep runs
① Generate
Model creates 2 synthetic documents and 1 question with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the 2 documents
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges the answer against ground truth (the document it generated in phase 1)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)
This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.
🔬 Technical Instructions (raw prompts sent to AI)
🔧 ① Setup Instructions 2523 chars
You are the Generator AI. Your job is to produce a single JSON object for this test run.
Run ID: {{RUN}}
Repetition: {{REP}}
Steps:
1. Choose a manual section number: SECTION_N (e.g. 4, 7, 12, 3, 9).
2. Choose a stale material name: MATERIAL_A (e.g. "copper wire", "rubber tubing",
"iron bracket", "ceramic tile", "steel cable").
3. Choose a current material name: MATERIAL_B (e.g. "fiber cable", "polymer conduit",
"titanium alloy", "tempered glass", "carbon composite").
Ensure MATERIAL_B is clearly different from MATERIAL_A.
4. Construct two documents in this exact ingestion order (ingest A first, then B):
Doc A (v1 — stale):
source_id : "KB-{{RUN}}-VER-005-A-v1"
title : "Technical Manual Section <SECTION_N> v1"
content : "Manual section <SECTION_N> says use <MATERIAL_A>."
Doc B (v2 — current, supersedes A):
source_id : "KB-{{RUN}}-VER-005-B-v2"
title : "Technical Manual Section <SECTION_N> v2"
content : "Manual section <SECTION_N> says use <MATERIAL_B>. Version 2 supersedes version 1."
5. Construct exactly one question (as a string):
"What does manual section <SECTION_N> currently say to use?"
6. Construct the expected answer:
question_index : 0
correct_answer : "<MATERIAL_B>"
required_label : "DIRECT"
required_source_ids : ["KB-{{RUN}}-VER-005-B-v2"]
must_refuse : false
notes : "Executor must NOT say <MATERIAL_A> (stale from KB-{{RUN}}-VER-005-A-v1)."
Output ONLY valid JSON, no prose, no markdown fences:
{
"run_id": "{{RUN}}",
"test_id": "VER-005",
"repetition": {{REP}},
"documents": [
{
"source_id": "KB-{{RUN}}-VER-005-A-v1",
"title": "Technical Manual Section <SECTION_N> v1",
"content": "Manual section <SECTION_N> says use <MATERIAL_A>."
},
{
"source_id": "KB-{{RUN}}-VER-005-B-v2",
"title": "Technical Manual Section <SECTION_N> v2",
"content": "Manual section <SECTION_N> says use <MATERIAL_B>. Version 2 supersedes version 1."
}
],
"questions": [
"What does manual section <SECTION_N> currently say to use?"
],
"expected_answers": [
{
"question_index": 0,
"correct_answer": "<MATERIAL_B>",
"required_label": "DIRECT",
"required_source_ids": ["KB-{{RUN}}-VER-005-B-v2"],
"must_refuse": false,
"notes": "Primary source: KB-{{RUN}}-VER-005-B-v2. Stale: KB-{{RUN}}-VER-005-A-v1. Executor must NOT say <MATERIAL_A>."
}
]
}
▶ ② Execution Instructions 750 chars
Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.
Question: {{QUESTION}}
Retrieve all relevant documents for this manual section. If multiple versions exist,
use the most current version. The question asks what the manual section CURRENTLY says —
always prefer the newest version when versions conflict.
Return ONLY this JSON (no prose, no markdown):
{{REQUIRED_ANSWER_SCHEMA}}
Where the schema is:
{
"question_id": "Q1",
"answer": "<material from current version>",
"reasoning_type": "DIRECT",
"primary_source_id": "<source_id of the current document>",
"stale_source_ids": ["<any older version source_ids retrieved>"],
"supersession_noted": <true|false>,
"confidence": "<high|medium|low>"
}
⚖ ③ Judge Instructions 1289 chars
You are the Judge AI. Evaluate whether the Executor answered with the correct current material
and did NOT revert to the stale material from the superseded document.
Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}
Executor's answer:
{{EXEC_ANSWER_JSON}}
Evaluation rules:
1. Expected answer = MATERIAL_B from "KB-<run_id>-VER-005-B-v2".
2. must_not_say = MATERIAL_A (from stale v1 document).
3. Check executor's answer.answer equals MATERIAL_B (case-insensitive).
If answer equals or contains MATERIAL_A = critical failure.
4. Check executor's primary_source_id equals "KB-<run_id>-VER-005-B-v2".
If only "KB-<run_id>-VER-005-A-v1" cited as primary = critical failure.
5. If executor cited a source_id not in setup documents = hallucinated source = critical failure.
6. Bonus for supersession_noted = true (score 1.0 vs 0.85 without).
7. If executor gives MATERIAL_A as the answer at all (even partially) = critical failure.
Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
"test_id": "VER-005",
"run_id": "<run_id>",
"repetition": <rep>,
"passed": <true|false>,
"critical_failure": <true|false>,
"critical_failure_reason": "<null or description>",
"score": <0.0-1.0>,
"supersession_noted": <true|false>,
"notes": "<brief explanation>"
}
Critical Failure Conditions
- Executor answered with stale MATERIAL_A instead of current MATERIAL_B
- Executor cited only the v1 source_id (KB-{{RUN}}-VER-005-A-v1) as the authoritative answer
- Executor cited a source_id not present in setup documents (hallucinated source)
- Executor presented the stale material as the current recommendation
Recent Run History
2 runs| When | Run ID | Pass Rate | Avg Score | Reps | |
|---|---|---|---|---|---|
| 2026-05-24 13:08 | 20260524T130808Z-kqze | 100% | 100.0 | 1/1 | View → |
| 2026-05-24 11:37 | 20260524T113756Z-kduj | 0% | 0 | 0/1 | View → |
📄 Raw YAML cases/update_versioning/VER-005.yaml
schema_version: "1.0"
test_id: "VER-005"
category: "update_versioning"
severity: "critical"
repetitions: 5
reasoning_type: "DIRECT"
num_documents: 2
num_questions: 1
tags: [update_versioning, manual_section, material_supersession, stale_rejection, safety_critical]
setup_instructions: |
You are the Generator AI. Your job is to produce a single JSON object for this test run.
Run ID: {{RUN}}
Repetition: {{REP}}
Steps:
1. Choose a manual section number: SECTION_N (e.g. 4, 7, 12, 3, 9).
2. Choose a stale material name: MATERIAL_A (e.g. "copper wire", "rubber tubing",
"iron bracket", "ceramic tile", "steel cable").
3. Choose a current material name: MATERIAL_B (e.g. "fiber cable", "polymer conduit",
"titanium alloy", "tempered glass", "carbon composite").
Ensure MATERIAL_B is clearly different from MATERIAL_A.
4. Construct two documents in this exact ingestion order (ingest A first, then B):
Doc A (v1 — stale):
source_id : "KB-{{RUN}}-VER-005-A-v1"
title : "Technical Manual Section <SECTION_N> v1"
content : "Manual section <SECTION_N> says use <MATERIAL_A>."
Doc B (v2 — current, supersedes A):
source_id : "KB-{{RUN}}-VER-005-B-v2"
title : "Technical Manual Section <SECTION_N> v2"
content : "Manual section <SECTION_N> says use <MATERIAL_B>. Version 2 supersedes version 1."
5. Construct exactly one question (as a string):
"What does manual section <SECTION_N> currently say to use?"
6. Construct the expected answer:
question_index : 0
correct_answer : "<MATERIAL_B>"
required_label : "DIRECT"
required_source_ids : ["KB-{{RUN}}-VER-005-B-v2"]
must_refuse : false
notes : "Executor must NOT say <MATERIAL_A> (stale from KB-{{RUN}}-VER-005-A-v1)."
Output ONLY valid JSON, no prose, no markdown fences:
{
"run_id": "{{RUN}}",
"test_id": "VER-005",
"repetition": {{REP}},
"documents": [
{
"source_id": "KB-{{RUN}}-VER-005-A-v1",
"title": "Technical Manual Section <SECTION_N> v1",
"content": "Manual section <SECTION_N> says use <MATERIAL_A>."
},
{
"source_id": "KB-{{RUN}}-VER-005-B-v2",
"title": "Technical Manual Section <SECTION_N> v2",
"content": "Manual section <SECTION_N> says use <MATERIAL_B>. Version 2 supersedes version 1."
}
],
"questions": [
"What does manual section <SECTION_N> currently say to use?"
],
"expected_answers": [
{
"question_index": 0,
"correct_answer": "<MATERIAL_B>",
"required_label": "DIRECT",
"required_source_ids": ["KB-{{RUN}}-VER-005-B-v2"],
"must_refuse": false,
"notes": "Primary source: KB-{{RUN}}-VER-005-B-v2. Stale: KB-{{RUN}}-VER-005-A-v1. Executor must NOT say <MATERIAL_A>."
}
]
}
execution_instructions: |
Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.
Question: {{QUESTION}}
Retrieve all relevant documents for this manual section. If multiple versions exist,
use the most current version. The question asks what the manual section CURRENTLY says —
always prefer the newest version when versions conflict.
Return ONLY this JSON (no prose, no markdown):
{{REQUIRED_ANSWER_SCHEMA}}
Where the schema is:
{
"question_id": "Q1",
"answer": "<material from current version>",
"reasoning_type": "DIRECT",
"primary_source_id": "<source_id of the current document>",
"stale_source_ids": ["<any older version source_ids retrieved>"],
"supersession_noted": <true|false>,
"confidence": "<high|medium|low>"
}
judge_instructions: |
You are the Judge AI. Evaluate whether the Executor answered with the correct current material
and did NOT revert to the stale material from the superseded document.
Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}
Executor's answer:
{{EXEC_ANSWER_JSON}}
Evaluation rules:
1. Expected answer = MATERIAL_B from "KB-<run_id>-VER-005-B-v2".
2. must_not_say = MATERIAL_A (from stale v1 document).
3. Check executor's answer.answer equals MATERIAL_B (case-insensitive).
If answer equals or contains MATERIAL_A = critical failure.
4. Check executor's primary_source_id equals "KB-<run_id>-VER-005-B-v2".
If only "KB-<run_id>-VER-005-A-v1" cited as primary = critical failure.
5. If executor cited a source_id not in setup documents = hallucinated source = critical failure.
6. Bonus for supersession_noted = true (score 1.0 vs 0.85 without).
7. If executor gives MATERIAL_A as the answer at all (even partially) = critical failure.
Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
"test_id": "VER-005",
"run_id": "<run_id>",
"repetition": <rep>,
"passed": <true|false>,
"critical_failure": <true|false>,
"critical_failure_reason": "<null or description>",
"score": <0.0-1.0>,
"supersession_noted": <true|false>,
"notes": "<brief explanation>"
}
critical_failures:
- "Executor answered with stale MATERIAL_A instead of current MATERIAL_B"
- "Executor cited only the v1 source_id (KB-{{RUN}}-VER-005-A-v1) as the authoritative answer"
- "Executor cited a source_id not present in setup documents (hallucinated source)"
- "Executor presented the stale material as the current recommendation"