VER-002: update versioning
Severity: high
Repetitions: 5
Documents: 2
Questions: 1
Reasoning: DIRECT
Tags: update_versioning, policy, numeric_limit, dual_citation, prior_and_current
📖 In Plain English
What this category tests
Does the brain return the latest version of a document after updates?
How the test works
A document is ingested at v1, then a v2 supersedes it. The test asks about the topic and checks that the v2 content is returned as current, with v1 identified as superseded.
Why it matters
Stale data is dangerous — users must always see the current truth.
⚙️ How a single rep runs
① Generate
Model creates 2 synthetic documents and 1 question with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the 2 documents
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges the answer against ground truth (the document it generated in phase 1)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)
Each rep runs these four phases, and a test run executes 5 reps. A pass requires a score ≥ 85 on the 0–100 scale and no critical failures.
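The pass rule above can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: the names `RepResult`, `rep_passed`, and `pass_rate` are hypothetical, and applying the rule per rep and then averaging into a pass rate is an assumption based on the "Pass Rate" column in the run history.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 85  # from the page: "score ≥ 85" on the 0-100 scale


@dataclass
class RepResult:
    """Outcome of one repetition (hypothetical structure)."""
    score: float            # 0-100
    critical_failure: bool  # True if any critical failure condition fired


def rep_passed(rep: RepResult) -> bool:
    # A rep passes only if it clears the threshold AND has no critical failure.
    return rep.score >= PASS_THRESHOLD and not rep.critical_failure


def pass_rate(reps: list[RepResult]) -> float:
    # Assumption: the pass rate is the percentage of passing reps.
    if not reps:
        return 0.0
    return 100.0 * sum(rep_passed(r) for r in reps) / len(reps)
```

Note that a high score does not rescue a rep with a critical failure: `rep_passed(RepResult(100, True))` is `False`.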
🔬 Technical Instructions (raw prompts sent to AI)
🔧 ① Setup Instructions 2425 chars
You are the Generator AI. Your job is to produce a single JSON object for this test run.
Run ID: {{RUN}}
Repetition: {{REP}}
Steps:
1. Choose a small integer for the prior limit: NUMBER_A (e.g. 8, 10, 15, 20, 25).
2. Choose a larger integer for the current limit: NUMBER_B, where NUMBER_B > NUMBER_A
(e.g. if NUMBER_A=10 then NUMBER_B=12; if NUMBER_A=15 then NUMBER_B=18).
3. Construct two documents in this exact ingestion order:
Doc A (v1 — prior policy):
source_id : "KB-{{RUN}}-VER-002-A-v1"
title : "Policy Limit Memo v1"
content : "Policy limit is <NUMBER_A>."
Doc B (v2 — current policy, supersedes A):
source_id : "KB-{{RUN}}-VER-002-B-v2"
title : "Policy Limit Memo v2"
content : "Policy limit is <NUMBER_B>. This supersedes the prior limit of <NUMBER_A>."
4. Ingest Doc A first, then Doc B.
5. Construct exactly one question (as a string):
"What is the current policy limit, and what was the prior limit? Cite source IDs for each."
6. Construct the expected answer:
question_index : 0
correct_answer : "Current limit: <NUMBER_B> (KB-{{RUN}}-VER-002-B-v2); prior limit: <NUMBER_A> (KB-{{RUN}}-VER-002-A-v1)."
required_label : "DIRECT"
required_source_ids : ["KB-{{RUN}}-VER-002-A-v1", "KB-{{RUN}}-VER-002-B-v2"]
must_refuse : false
Output ONLY valid JSON, no prose, no markdown fences:
{
  "run_id": "{{RUN}}",
  "test_id": "VER-002",
  "repetition": {{REP}},
  "documents": [
    {
      "source_id": "KB-{{RUN}}-VER-002-A-v1",
      "title": "Policy Limit Memo v1",
      "content": "Policy limit is <NUMBER_A>."
    },
    {
      "source_id": "KB-{{RUN}}-VER-002-B-v2",
      "title": "Policy Limit Memo v2",
      "content": "Policy limit is <NUMBER_B>. This supersedes the prior limit of <NUMBER_A>."
    }
  ],
  "questions": [
    "What is the current policy limit, and what was the prior limit? Cite source IDs for each."
  ],
  "expected_answers": [
    {
      "question_index": 0,
      "correct_answer": "Current limit: <NUMBER_B> (KB-{{RUN}}-VER-002-B-v2); prior limit: <NUMBER_A> (KB-{{RUN}}-VER-002-A-v1).",
      "required_label": "DIRECT",
      "required_source_ids": ["KB-{{RUN}}-VER-002-A-v1", "KB-{{RUN}}-VER-002-B-v2"],
      "must_refuse": false,
      "notes": "Executor must cite both versions: current=NUMBER_B from B-v2, prior=NUMBER_A from A-v1."
    }
  ]
}
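For reference, the object the Generator is asked to emit can also be built deterministically. The sketch below fills the template above for a given run ID, repetition, and pair of limits. The function name `build_setup` and its parameters are hypothetical; in the real test the Generator model picks the numbers itself.

```python
import json


def build_setup(run_id: str, rep: int, number_a: int, number_b: int) -> dict:
    """Fill the VER-002 setup template (illustrative helper, not harness code)."""
    if number_b <= number_a:
        raise ValueError("NUMBER_B must be greater than NUMBER_A")
    id_a = f"KB-{run_id}-VER-002-A-v1"  # prior policy (v1)
    id_b = f"KB-{run_id}-VER-002-B-v2"  # current policy (v2)
    return {
        "run_id": run_id,
        "test_id": "VER-002",
        "repetition": rep,
        # Order matters: Doc A is ingested first, then Doc B.
        "documents": [
            {"source_id": id_a, "title": "Policy Limit Memo v1",
             "content": f"Policy limit is {number_a}."},
            {"source_id": id_b, "title": "Policy Limit Memo v2",
             "content": f"Policy limit is {number_b}. "
                        f"This supersedes the prior limit of {number_a}."},
        ],
        "questions": [
            "What is the current policy limit, and what was the prior limit? "
            "Cite source IDs for each."
        ],
        "expected_answers": [{
            "question_index": 0,
            "correct_answer": f"Current limit: {number_b} ({id_b}); "
                              f"prior limit: {number_a} ({id_a}).",
            "required_label": "DIRECT",
            "required_source_ids": [id_a, id_b],
            "must_refuse": False,
            "notes": "Executor must cite both versions: current=NUMBER_B "
                     "from B-v2, prior=NUMBER_A from A-v1.",
        }],
    }


if __name__ == "__main__":
    # "demo-run" is a placeholder run ID, not one from the run history.
    print(json.dumps(build_setup("demo-run", 1, 10, 12), indent=2))
```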
▶ ② Execution Instructions 604 chars
Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.
Question: {{QUESTION}}
Retrieve all relevant policy limit documents. Identify the current and prior values,
and cite the source_id for each.
Return ONLY this JSON (no prose, no markdown):
{{REQUIRED_ANSWER_SCHEMA}}
Where the schema is:
{
  "question_id": "Q1",
  "current_limit": <number>,
  "current_source_id": "<source_id of current policy document>",
  "prior_limit": <number>,
  "prior_source_id": "<source_id of prior policy document>",
  "reasoning_type": "DIRECT",
  "confidence": "<high|medium|low>"
}
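A structural check of the Executor's answer against this schema might look like the following. It is a sketch under stated assumptions: `validate_answer` is a hypothetical helper, and the page does not say whether the harness validates the answer shape before judging.

```python
from numbers import Number

# Field names and types taken from the answer schema above.
REQUIRED_FIELDS = {
    "question_id": str,
    "current_limit": Number,
    "current_source_id": str,
    "prior_limit": Number,
    "prior_source_id": str,
    "reasoning_type": str,
    "confidence": str,
}


def validate_answer(answer: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the shape is valid."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in answer:
            errors.append(f"missing field: {name}")
            continue
        value = answer[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"wrong type for {name}")
    if "reasoning_type" in answer and answer["reasoning_type"] != "DIRECT":
        errors.append("reasoning_type must be DIRECT")
    if "confidence" in answer and answer["confidence"] not in {"high", "medium", "low"}:
        errors.append("confidence must be high, medium, or low")
    return errors
```

This only checks shape. Whether the values are correct is the Judge's job.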
⚖ ③ Judge Instructions 1449 chars
You are the Judge AI. Evaluate whether the Executor correctly identified both the current
and prior policy limits with correct source citations.
Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}
Executor's answer:
{{EXEC_ANSWER_JSON}}
Evaluation rules:
1. Extract expected values from setup: expected_answers[0]
- current_limit = NUMBER_B, current_source_id = "KB-<run_id>-VER-002-B-v2"
- prior_limit = NUMBER_A, prior_source_id = "KB-<run_id>-VER-002-A-v1"
2. Check executor's current_limit equals NUMBER_B. If wrong = critical failure.
3. Check executor's prior_limit equals NUMBER_A. If wrong = critical failure.
4. Check executor's current_source_id equals "KB-<run_id>-VER-002-B-v2".
If only one version cited = critical failure.
5. Check executor's prior_source_id equals "KB-<run_id>-VER-002-A-v1".
6. If executor swapped current and prior = critical failure.
7. If any cited source_id is not in setup documents = hallucinated source = critical failure.
8. Full score 1.0 requires all four values correct. Deduct 0.25 per wrong field.
Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
  "test_id": "VER-002",
  "run_id": "<run_id>",
  "repetition": <rep>,
  "passed": <true|false>,
  "critical_failure": <true|false>,
  "critical_failure_reason": "<null or description>",
  "score": <0.0-1.0>,
  "current_limit_correct": <true|false>,
  "prior_limit_correct": <true|false>,
  "notes": "<brief explanation>"
}
Critical Failure Conditions
- Executor only cited one version of the policy (missing prior or current)
- Executor swapped current and prior limits (returned NUMBER_A as current)
- Executor cited a source_id not present in setup documents (hallucinated source)
- Executor returned incorrect numeric values for current or prior limit
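The evaluation rules and critical failure conditions above are applied by a Judge model, but they are mechanical enough to sketch as a deterministic function. Everything here is illustrative: `judge` is a hypothetical name, "only one version cited" is interpreted as fewer than two distinct non-empty source IDs, and the 0.85 pass cutoff on the 0.0-1.0 scale is an assumption carried over from the page's "score ≥ 85" rule.

```python
FIELDS = ("current_limit", "current_source_id", "prior_limit", "prior_source_id")


def judge(expected: dict, answer: dict, known_source_ids: set) -> dict:
    """Score an Executor answer against expected values (illustrative sketch)."""
    wrong = [f for f in FIELDS if answer.get(f) != expected[f]]
    reasons = []

    # Rules 2-3: a wrong numeric value is a critical failure.
    if "current_limit" in wrong or "prior_limit" in wrong:
        reasons.append("incorrect numeric value for current or prior limit")
    # Rule 6: NUMBER_A returned as the current limit means the values were swapped.
    if answer.get("current_limit") == expected["prior_limit"]:
        reasons.append("swapped current and prior limits")
    # Rule 4: both versions must be cited (assumed: two distinct non-empty IDs).
    cited = {answer.get("current_source_id"), answer.get("prior_source_id")} - {None, ""}
    if len(cited) < 2:
        reasons.append("only one version cited")
    # Rule 7: any cited ID outside the setup documents is a hallucinated source.
    if cited - known_source_ids:
        reasons.append("cited a source_id not in setup documents")

    # Rule 8: start at 1.0 and deduct 0.25 per wrong field.
    score = max(0.0, 1.0 - 0.25 * len(wrong))
    critical = bool(reasons)
    return {
        "passed": (not critical) and score >= 0.85,  # assumed cutoff
        "critical_failure": critical,
        "critical_failure_reason": "; ".join(reasons) or None,
        "score": score,
        "current_limit_correct": "current_limit" not in wrong,
        "prior_limit_correct": "prior_limit" not in wrong,
    }
```

Under this reading, a wrong source ID costs 0.25 and is critical only when it collapses the citation to one version or names a document that was never ingested, whereas any wrong numeric value is always critical.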
Recent Run History
2 runs
| When | Run ID | Pass Rate | Avg Score | Reps |
|---|---|---|---|---|
| 2026-05-24 13:08 | 20260524T130808Z-kqze | 100% | 100.0 | 1/1 |
| 2026-05-24 11:37 | 20260524T113756Z-kduj | 0% | 0 | 0/1 |
📄 Raw YAML cases/update_versioning/VER-002.yaml
schema_version: "1.0"
test_id: "VER-002"
category: "update_versioning"
severity: "high"
repetitions: 5
reasoning_type: "DIRECT"
num_documents: 2
num_questions: 1
tags: [update_versioning, policy, numeric_limit, dual_citation, prior_and_current]
setup_instructions: |
  You are the Generator AI. Your job is to produce a single JSON object for this test run.
  Run ID: {{RUN}}
  Repetition: {{REP}}
  Steps:
  1. Choose a small integer for the prior limit: NUMBER_A (e.g. 8, 10, 15, 20, 25).
  2. Choose a larger integer for the current limit: NUMBER_B, where NUMBER_B > NUMBER_A
  (e.g. if NUMBER_A=10 then NUMBER_B=12; if NUMBER_A=15 then NUMBER_B=18).
  3. Construct two documents in this exact ingestion order:
  Doc A (v1 — prior policy):
  source_id : "KB-{{RUN}}-VER-002-A-v1"
  title : "Policy Limit Memo v1"
  content : "Policy limit is <NUMBER_A>."
  Doc B (v2 — current policy, supersedes A):
  source_id : "KB-{{RUN}}-VER-002-B-v2"
  title : "Policy Limit Memo v2"
  content : "Policy limit is <NUMBER_B>. This supersedes the prior limit of <NUMBER_A>."
  4. Ingest Doc A first, then Doc B.
  5. Construct exactly one question (as a string):
  "What is the current policy limit, and what was the prior limit? Cite source IDs for each."
  6. Construct the expected answer:
  question_index : 0
  correct_answer : "Current limit: <NUMBER_B> (KB-{{RUN}}-VER-002-B-v2); prior limit: <NUMBER_A> (KB-{{RUN}}-VER-002-A-v1)."
  required_label : "DIRECT"
  required_source_ids : ["KB-{{RUN}}-VER-002-A-v1", "KB-{{RUN}}-VER-002-B-v2"]
  must_refuse : false
  Output ONLY valid JSON, no prose, no markdown fences:
  {
    "run_id": "{{RUN}}",
    "test_id": "VER-002",
    "repetition": {{REP}},
    "documents": [
      {
        "source_id": "KB-{{RUN}}-VER-002-A-v1",
        "title": "Policy Limit Memo v1",
        "content": "Policy limit is <NUMBER_A>."
      },
      {
        "source_id": "KB-{{RUN}}-VER-002-B-v2",
        "title": "Policy Limit Memo v2",
        "content": "Policy limit is <NUMBER_B>. This supersedes the prior limit of <NUMBER_A>."
      }
    ],
    "questions": [
      "What is the current policy limit, and what was the prior limit? Cite source IDs for each."
    ],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "Current limit: <NUMBER_B> (KB-{{RUN}}-VER-002-B-v2); prior limit: <NUMBER_A> (KB-{{RUN}}-VER-002-A-v1).",
        "required_label": "DIRECT",
        "required_source_ids": ["KB-{{RUN}}-VER-002-A-v1", "KB-{{RUN}}-VER-002-B-v2"],
        "must_refuse": false,
        "notes": "Executor must cite both versions: current=NUMBER_B from B-v2, prior=NUMBER_A from A-v1."
      }
    ]
  }
execution_instructions: |
  Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.
  Question: {{QUESTION}}
  Retrieve all relevant policy limit documents. Identify the current and prior values,
  and cite the source_id for each.
  Return ONLY this JSON (no prose, no markdown):
  {{REQUIRED_ANSWER_SCHEMA}}
  Where the schema is:
  {
    "question_id": "Q1",
    "current_limit": <number>,
    "current_source_id": "<source_id of current policy document>",
    "prior_limit": <number>,
    "prior_source_id": "<source_id of prior policy document>",
    "reasoning_type": "DIRECT",
    "confidence": "<high|medium|low>"
  }
judge_instructions: |
  You are the Judge AI. Evaluate whether the Executor correctly identified both the current
  and prior policy limits with correct source citations.
  Ground truth (from setup):
  {{SETUP_CONTEXT_JSON}}
  Executor's answer:
  {{EXEC_ANSWER_JSON}}
  Evaluation rules:
  1. Extract expected values from setup: expected_answers[0]
  - current_limit = NUMBER_B, current_source_id = "KB-<run_id>-VER-002-B-v2"
  - prior_limit = NUMBER_A, prior_source_id = "KB-<run_id>-VER-002-A-v1"
  2. Check executor's current_limit equals NUMBER_B. If wrong = critical failure.
  3. Check executor's prior_limit equals NUMBER_A. If wrong = critical failure.
  4. Check executor's current_source_id equals "KB-<run_id>-VER-002-B-v2".
  If only one version cited = critical failure.
  5. Check executor's prior_source_id equals "KB-<run_id>-VER-002-A-v1".
  6. If executor swapped current and prior = critical failure.
  7. If any cited source_id is not in setup documents = hallucinated source = critical failure.
  8. Full score 1.0 requires all four values correct. Deduct 0.25 per wrong field.
  Return ONLY this JSON:
  {{JUDGE_RESULT_SCHEMA}}
  Where the schema is:
  {
    "test_id": "VER-002",
    "run_id": "<run_id>",
    "repetition": <rep>,
    "passed": <true|false>,
    "critical_failure": <true|false>,
    "critical_failure_reason": "<null or description>",
    "score": <0.0-1.0>,
    "current_limit_correct": <true|false>,
    "prior_limit_correct": <true|false>,
    "notes": "<brief explanation>"
  }
critical_failures:
- "Executor only cited one version of the policy (missing prior or current)"
- "Executor swapped current and prior limits (returned NUMBER_A as current)"
- "Executor cited a source_id not present in setup documents (hallucinated source)"
- "Executor returned incorrect numeric values for current or prior limit"