← All Test Cases

VER-004

update versioning

medium

Repetitions

Documents

Questions

Reasoning

UNCERTAIN

update_versioning archival deletion_semantics uncertainty temporal_validity

📖 In Plain English

What this category tests

Does the brain return the latest version of a document after updates?

How the test works

A document is ingested at v1, then a v2 supersedes it. The test asks about the topic and checks the v2 content is returned, with v1 marked superseded.

Why it matters

Stale data is dangerous — users must always see the current truth.

⚙️ How a single rep runs

① Generate

Model creates 1 synthetic document and 1 question with unique canary tokens

→ Fresh content per run prevents memorization and proves real retrieval

② Ingest (MCP)

Model calls brain_ingest to store the 1 document

→ Tests the brain's storage and indexing pipeline

③ Query (MCP)

Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)

→ Core test — does the brain return correct evidence and let the model build a faithful answer?

④ Evaluate

Model judges the answer against ground truth (the document it generated in phase 1)

→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 2333 chars

You are the Generator AI. Your job is to produce a single JSON object for this test run.

Run ID: {{RUN}}
Repetition: {{REP}}

Steps:
1. Choose a short numeric code (4-6 digits, e.g. "7392", "48201", "9051", "62837").
   Call this CODE_A.
2. Construct exactly one document:
     source_id : "KB-{{RUN}}-VER-004-A-v1"
     title     : "Temporary Code Record"
     content   : "Temporary code is <CODE_A>."
3. IMPORTANT setup context: After ingesting this document into the brain, treat it as
   archived or deleted (if the brain supports deletion semantics). Record this intent in
   setup_context so the judge knows the document should be considered inactive.
4. Construct exactly one question (as a string):
     "What is the temporary code?"
5. Construct the expected answer with uncertainty:
     question_index       : 0
     correct_answer       : "UNCERTAIN — no active source; document was archived (or <CODE_A> noted as from archived document)"
     required_label       : "UNCERTAIN"
     required_source_ids  : ["KB-{{RUN}}-VER-004-A-v1"]
     must_refuse          : false
     notes                : "Brain may or may not support deletion. Executor must not
                             confidently state the code as current without noting archival."

Output ONLY valid JSON, no prose, no markdown fences:
{
  "run_id": "{{RUN}}",
  "test_id": "VER-004",
  "repetition": {{REP}},
  "documents": [
    {
      "source_id": "KB-{{RUN}}-VER-004-A-v1",
      "title": "Temporary Code Record",
      "content": "Temporary code is <CODE_A>.",
      "archival_intent": true
    }
  ],
  "setup_context": {
    "archival_intent": true,
    "archival_note": "This document should be treated as archived/deleted if the brain supports deletion semantics."
  },
  "questions": [
    "What is the temporary code?"
  ],
  "expected_answers": [
    {
      "question_index": 0,
      "correct_answer": "No active source — the document has been archived (or <CODE_A> with archival caveat).",
      "required_label": "UNCERTAIN",
      "required_source_ids": ["KB-{{RUN}}-VER-004-A-v1"],
      "must_refuse": false,
      "notes": "If brain supports deletion: answer should be 'no active source'. If not: answer may be <CODE_A> but must note archival_intent. Either way, archival_status_noted must be true and confidence low."
    }
  ]
}

▶ ② Execution Instructions 849 chars

Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.

Question: {{QUESTION}}

Retrieve relevant documents. Be aware that the source document for this question may have been
archived or deleted. If you find a document, check whether it is marked as archived or inactive.
Do not confidently assert the code as currently valid without noting its archival status.

Return ONLY this JSON (no prose, no markdown):
{{REQUIRED_ANSWER_SCHEMA}}
Where the schema is:
{
  "question_id": "Q1",
  "answer": "<the code if found, or 'NOT_FOUND' if deletion is supported and doc was removed>",
  "reasoning_type": "UNCERTAIN",
  "source_ids": ["<source_id if retrieved, else empty array>"],
  "archival_status_noted": <true|false>,
  "archival_note": "<description of archival/deletion status, or null>",
  "confidence": "low"
}

⚖ ③ Judge Instructions 1415 chars

You are the Judge AI. Evaluate whether the Executor handled the archival uncertainty correctly.

Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}

Executor's answer:
{{EXEC_ANSWER_JSON}}

Evaluation rules:
1. The document was marked with archival_intent = true in setup.
2. Critical failure condition: executor confidently states the code as currently active/valid
   WITHOUT noting any archival or uncertainty caveat.
3. Acceptable outcomes (both count as pass):
   a. Executor says "NOT_FOUND" or "no active source" (brain supports deletion).
   b. Executor returns CODE_A but clearly notes it is from an archived/uncertain source.
4. In either acceptable case, archival_status_noted must be true.
5. If executor cited a source_id not in setup documents = hallucinated source = critical failure.
6. Confidence must be "low" — if executor reported "high" confidence without qualification = deduct.
7. Score 1.0 for correctly noting uncertainty. Score 0.5 if code returned without archival note.
   Score 0.0 if confidently stated as current with no caveat.

Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
  "test_id": "VER-004",
  "run_id": "<run_id>",
  "repetition": <rep>,
  "passed": <true|false>,
  "critical_failure": <true|false>,
  "critical_failure_reason": "<null or description>",
  "score": <0.0-1.0>,
  "archival_status_noted": <true|false>,
  "notes": "<brief explanation>"
}

Critical Failure Conditions

Executor confidently stated the code as current and active without any archival or uncertainty caveat
Executor cited a source_id not present in setup documents (hallucinated source)
Executor reported high confidence on an archived/uncertain document without qualification

Recent Run History

2 runs

When	Run ID	Pass Rate	Avg Score	Reps
2026-05-24 13:08	20260524T130808Z-kqze	0%	0	0/1	View →
2026-05-24 11:37	20260524T113756Z-kduj	0%	0	0/1	View →

📄 Raw YAML cases/update_versioning/VER-004.yaml

schema_version: "1.0"
test_id: "VER-004"
category: "update_versioning"
severity: "medium"
repetitions: 5
reasoning_type: "UNCERTAIN"
num_documents: 1
num_questions: 1
tags: [update_versioning, archival, deletion_semantics, uncertainty, temporal_validity]

setup_instructions: |
  You are the Generator AI. Your job is to produce a single JSON object for this test run.

  Run ID: {{RUN}}
  Repetition: {{REP}}

  Steps:
  1. Choose a short numeric code (4-6 digits, e.g. "7392", "48201", "9051", "62837").
     Call this CODE_A.
  2. Construct exactly one document:
       source_id : "KB-{{RUN}}-VER-004-A-v1"
       title     : "Temporary Code Record"
       content   : "Temporary code is <CODE_A>."
  3. IMPORTANT setup context: After ingesting this document into the brain, treat it as
     archived or deleted (if the brain supports deletion semantics). Record this intent in
     setup_context so the judge knows the document should be considered inactive.
  4. Construct exactly one question (as a string):
       "What is the temporary code?"
  5. Construct the expected answer with uncertainty:
       question_index       : 0
       correct_answer       : "UNCERTAIN — no active source; document was archived (or <CODE_A> noted as from archived document)"
       required_label       : "UNCERTAIN"
       required_source_ids  : ["KB-{{RUN}}-VER-004-A-v1"]
       must_refuse          : false
       notes                : "Brain may or may not support deletion. Executor must not
                               confidently state the code as current without noting archival."

  Output ONLY valid JSON, no prose, no markdown fences:
  {
    "run_id": "{{RUN}}",
    "test_id": "VER-004",
    "repetition": {{REP}},
    "documents": [
      {
        "source_id": "KB-{{RUN}}-VER-004-A-v1",
        "title": "Temporary Code Record",
        "content": "Temporary code is <CODE_A>.",
        "archival_intent": true
      }
    ],
    "setup_context": {
      "archival_intent": true,
      "archival_note": "This document should be treated as archived/deleted if the brain supports deletion semantics."
    },
    "questions": [
      "What is the temporary code?"
    ],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "No active source — the document has been archived (or <CODE_A> with archival caveat).",
        "required_label": "UNCERTAIN",
        "required_source_ids": ["KB-{{RUN}}-VER-004-A-v1"],
        "must_refuse": false,
        "notes": "If brain supports deletion: answer should be 'no active source'. If not: answer may be <CODE_A> but must note archival_intent. Either way, archival_status_noted must be true and confidence low."
      }
    ]
  }

execution_instructions: |
  Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.

  Question: {{QUESTION}}

  Retrieve relevant documents. Be aware that the source document for this question may have been
  archived or deleted. If you find a document, check whether it is marked as archived or inactive.
  Do not confidently assert the code as currently valid without noting its archival status.

  Return ONLY this JSON (no prose, no markdown):
  {{REQUIRED_ANSWER_SCHEMA}}
  Where the schema is:
  {
    "question_id": "Q1",
    "answer": "<the code if found, or 'NOT_FOUND' if deletion is supported and doc was removed>",
    "reasoning_type": "UNCERTAIN",
    "source_ids": ["<source_id if retrieved, else empty array>"],
    "archival_status_noted": <true|false>,
    "archival_note": "<description of archival/deletion status, or null>",
    "confidence": "low"
  }

judge_instructions: |
  You are the Judge AI. Evaluate whether the Executor handled the archival uncertainty correctly.

  Ground truth (from setup):
  {{SETUP_CONTEXT_JSON}}

  Executor's answer:
  {{EXEC_ANSWER_JSON}}

  Evaluation rules:
  1. The document was marked with archival_intent = true in setup.
  2. Critical failure condition: executor confidently states the code as currently active/valid
     WITHOUT noting any archival or uncertainty caveat.
  3. Acceptable outcomes (both count as pass):
     a. Executor says "NOT_FOUND" or "no active source" (brain supports deletion).
     b. Executor returns CODE_A but clearly notes it is from an archived/uncertain source.
  4. In either acceptable case, archival_status_noted must be true.
  5. If executor cited a source_id not in setup documents = hallucinated source = critical failure.
  6. Confidence must be "low" — if executor reported "high" confidence without qualification = deduct.
  7. Score 1.0 for correctly noting uncertainty. Score 0.5 if code returned without archival note.
     Score 0.0 if confidently stated as current with no caveat.

  Return ONLY this JSON:
  {{JUDGE_RESULT_SCHEMA}}
  Where the schema is:
  {
    "test_id": "VER-004",
    "run_id": "<run_id>",
    "repetition": <rep>,
    "passed": <true|false>,
    "critical_failure": <true|false>,
    "critical_failure_reason": "<null or description>",
    "score": <0.0-1.0>,
    "archival_status_noted": <true|false>,
    "notes": "<brief explanation>"
  }

critical_failures:
  - "Executor confidently stated the code as current and active without any archival or uncertainty caveat"
  - "Executor cited a source_id not present in setup documents (hallucinated source)"
  - "Executor reported high confidence on an archived/uncertain document without qualification"