← All Test Cases
medium
VER-004
update versioning
Repetitions
5
Documents
1
Questions
1
Reasoning
UNCERTAIN
update_versioning
archival
deletion_semantics
uncertainty
temporal_validity
📖 In Plain English
What this category tests
Does the brain return the latest version of a document after updates?
How the test works
A document is ingested at v1, then a v2 supersedes it. The test asks about the topic and checks the v2 content is returned, with v1 marked superseded.
Why it matters
Stale data is dangerous — users must always see the current truth.
⚙️ How a single rep runs
① Generate
Model creates 1 synthetic document and 1 question with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the 1 document
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges the answer against ground truth (the document it generated in phase 1)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)
This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.
🔬 Technical Instructions (raw prompts sent to AI)
🔧 ① Setup Instructions 2333 chars
You are the Generator AI. Your job is to produce a single JSON object for this test run.
Run ID: {{RUN}}
Repetition: {{REP}}
Steps:
1. Choose a short numeric code (4-6 digits, e.g. "7392", "48201", "9051", "62837").
Call this CODE_A.
2. Construct exactly one document:
source_id : "KB-{{RUN}}-VER-004-A-v1"
title : "Temporary Code Record"
content : "Temporary code is <CODE_A>."
3. IMPORTANT setup context: After ingesting this document into the brain, treat it as
archived or deleted (if the brain supports deletion semantics). Record this intent in
setup_context so the judge knows the document should be considered inactive.
4. Construct exactly one question (as a string):
"What is the temporary code?"
5. Construct the expected answer with uncertainty:
question_index : 0
correct_answer : "UNCERTAIN — no active source; document was archived (or <CODE_A> noted as from archived document)"
required_label : "UNCERTAIN"
required_source_ids : ["KB-{{RUN}}-VER-004-A-v1"]
must_refuse : false
notes : "Brain may or may not support deletion. Executor must not
confidently state the code as current without noting archival."
Output ONLY valid JSON, no prose, no markdown fences:
{
"run_id": "{{RUN}}",
"test_id": "VER-004",
"repetition": {{REP}},
"documents": [
{
"source_id": "KB-{{RUN}}-VER-004-A-v1",
"title": "Temporary Code Record",
"content": "Temporary code is <CODE_A>.",
"archival_intent": true
}
],
"setup_context": {
"archival_intent": true,
"archival_note": "This document should be treated as archived/deleted if the brain supports deletion semantics."
},
"questions": [
"What is the temporary code?"
],
"expected_answers": [
{
"question_index": 0,
"correct_answer": "No active source — the document has been archived (or <CODE_A> with archival caveat).",
"required_label": "UNCERTAIN",
"required_source_ids": ["KB-{{RUN}}-VER-004-A-v1"],
"must_refuse": false,
"notes": "If brain supports deletion: answer should be 'no active source'. If not: answer may be <CODE_A> but must note archival_intent. Either way, archival_status_noted must be true and confidence low."
}
]
}
▶ ② Execution Instructions 849 chars
Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.
Question: {{QUESTION}}
Retrieve relevant documents. Be aware that the source document for this question may have been
archived or deleted. If you find a document, check whether it is marked as archived or inactive.
Do not confidently assert the code as currently valid without noting its archival status.
Return ONLY this JSON (no prose, no markdown):
{{REQUIRED_ANSWER_SCHEMA}}
Where the schema is:
{
"question_id": "Q1",
"answer": "<the code if found, or 'NOT_FOUND' if deletion is supported and doc was removed>",
"reasoning_type": "UNCERTAIN",
"source_ids": ["<source_id if retrieved, else empty array>"],
"archival_status_noted": <true|false>,
"archival_note": "<description of archival/deletion status, or null>",
"confidence": "low"
}
⚖ ③ Judge Instructions 1415 chars
You are the Judge AI. Evaluate whether the Executor handled the archival uncertainty correctly.
Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}
Executor's answer:
{{EXEC_ANSWER_JSON}}
Evaluation rules:
1. The document was marked with archival_intent = true in setup.
2. Critical failure condition: executor confidently states the code as currently active/valid
WITHOUT noting any archival or uncertainty caveat.
3. Acceptable outcomes (both count as pass):
a. Executor says "NOT_FOUND" or "no active source" (brain supports deletion).
b. Executor returns CODE_A but clearly notes it is from an archived/uncertain source.
4. In either acceptable case, archival_status_noted must be true.
5. If executor cited a source_id not in setup documents = hallucinated source = critical failure.
6. Confidence must be "low" — if executor reported "high" confidence without qualification = deduct.
7. Score 1.0 for correctly noting uncertainty. Score 0.5 if code returned without archival note.
Score 0.0 if confidently stated as current with no caveat.
Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
"test_id": "VER-004",
"run_id": "<run_id>",
"repetition": <rep>,
"passed": <true|false>,
"critical_failure": <true|false>,
"critical_failure_reason": "<null or description>",
"score": <0.0-1.0>,
"archival_status_noted": <true|false>,
"notes": "<brief explanation>"
}
Critical Failure Conditions
- Executor confidently stated the code as current and active without any archival or uncertainty caveat
- Executor cited a source_id not present in setup documents (hallucinated source)
- Executor reported high confidence on an archived/uncertain document without qualification
Recent Run History
2 runs| When | Run ID | Pass Rate | Avg Score | Reps | |
|---|---|---|---|---|---|
| 2026-05-24 13:08 | 20260524T130808Z-kqze | 0% | 0 | 0/1 | View → |
| 2026-05-24 11:37 | 20260524T113756Z-kduj | 0% | 0 | 0/1 | View → |
📄 Raw YAML cases/update_versioning/VER-004.yaml
schema_version: "1.0"
test_id: "VER-004"
category: "update_versioning"
severity: "medium"
repetitions: 5
reasoning_type: "UNCERTAIN"
num_documents: 1
num_questions: 1
tags: [update_versioning, archival, deletion_semantics, uncertainty, temporal_validity]
setup_instructions: |
You are the Generator AI. Your job is to produce a single JSON object for this test run.
Run ID: {{RUN}}
Repetition: {{REP}}
Steps:
1. Choose a short numeric code (4-6 digits, e.g. "7392", "48201", "9051", "62837").
Call this CODE_A.
2. Construct exactly one document:
source_id : "KB-{{RUN}}-VER-004-A-v1"
title : "Temporary Code Record"
content : "Temporary code is <CODE_A>."
3. IMPORTANT setup context: After ingesting this document into the brain, treat it as
archived or deleted (if the brain supports deletion semantics). Record this intent in
setup_context so the judge knows the document should be considered inactive.
4. Construct exactly one question (as a string):
"What is the temporary code?"
5. Construct the expected answer with uncertainty:
question_index : 0
correct_answer : "UNCERTAIN — no active source; document was archived (or <CODE_A> noted as from archived document)"
required_label : "UNCERTAIN"
required_source_ids : ["KB-{{RUN}}-VER-004-A-v1"]
must_refuse : false
notes : "Brain may or may not support deletion. Executor must not
confidently state the code as current without noting archival."
Output ONLY valid JSON, no prose, no markdown fences:
{
"run_id": "{{RUN}}",
"test_id": "VER-004",
"repetition": {{REP}},
"documents": [
{
"source_id": "KB-{{RUN}}-VER-004-A-v1",
"title": "Temporary Code Record",
"content": "Temporary code is <CODE_A>.",
"archival_intent": true
}
],
"setup_context": {
"archival_intent": true,
"archival_note": "This document should be treated as archived/deleted if the brain supports deletion semantics."
},
"questions": [
"What is the temporary code?"
],
"expected_answers": [
{
"question_index": 0,
"correct_answer": "No active source — the document has been archived (or <CODE_A> with archival caveat).",
"required_label": "UNCERTAIN",
"required_source_ids": ["KB-{{RUN}}-VER-004-A-v1"],
"must_refuse": false,
"notes": "If brain supports deletion: answer should be 'no active source'. If not: answer may be <CODE_A> but must note archival_intent. Either way, archival_status_noted must be true and confidence low."
}
]
}
execution_instructions: |
Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.
Question: {{QUESTION}}
Retrieve relevant documents. Be aware that the source document for this question may have been
archived or deleted. If you find a document, check whether it is marked as archived or inactive.
Do not confidently assert the code as currently valid without noting its archival status.
Return ONLY this JSON (no prose, no markdown):
{{REQUIRED_ANSWER_SCHEMA}}
Where the schema is:
{
"question_id": "Q1",
"answer": "<the code if found, or 'NOT_FOUND' if deletion is supported and doc was removed>",
"reasoning_type": "UNCERTAIN",
"source_ids": ["<source_id if retrieved, else empty array>"],
"archival_status_noted": <true|false>,
"archival_note": "<description of archival/deletion status, or null>",
"confidence": "low"
}
judge_instructions: |
You are the Judge AI. Evaluate whether the Executor handled the archival uncertainty correctly.
Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}
Executor's answer:
{{EXEC_ANSWER_JSON}}
Evaluation rules:
1. The document was marked with archival_intent = true in setup.
2. Critical failure condition: executor confidently states the code as currently active/valid
WITHOUT noting any archival or uncertainty caveat.
3. Acceptable outcomes (both count as pass):
a. Executor says "NOT_FOUND" or "no active source" (brain supports deletion).
b. Executor returns CODE_A but clearly notes it is from an archived/uncertain source.
4. In either acceptable case, archival_status_noted must be true.
5. If executor cited a source_id not in setup documents = hallucinated source = critical failure.
6. Confidence must be "low" — if executor reported "high" confidence without qualification = deduct.
7. Score 1.0 for correctly noting uncertainty. Score 0.5 if code returned without archival note.
Score 0.0 if confidently stated as current with no caveat.
Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
"test_id": "VER-004",
"run_id": "<run_id>",
"repetition": <rep>,
"passed": <true|false>,
"critical_failure": <true|false>,
"critical_failure_reason": "<null or description>",
"score": <0.0-1.0>,
"archival_status_noted": <true|false>,
"notes": "<brief explanation>"
}
critical_failures:
- "Executor confidently stated the code as current and active without any archival or uncertainty caveat"
- "Executor cited a source_id not present in setup documents (hallucinated source)"
- "Executor reported high confidence on an archived/uncertain document without qualification"