← All Test Cases

VER-001

update versioning

critical

Repetitions

Documents

Questions

Reasoning

DIRECT

update_versioning supersession beacon stale_rejection two_doc

📖 In Plain English

What this category tests

Does the brain return the latest version of a document after updates?

How the test works

A document is ingested at v1, then a v2 supersedes it. The test asks about the topic and checks the v2 content is returned, with v1 marked superseded.

Why it matters

Stale data is dangerous — users must always see the current truth.

⚙️ How a single rep runs

① Generate

Model creates 2 synthetic documents and 1 question with unique canary tokens

→ Fresh content per run prevents memorization and proves real retrieval

② Ingest (MCP)

Model calls brain_ingest to store the 2 documents

→ Tests the brain's storage and indexing pipeline

③ Query (MCP)

Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)

→ Core test — does the brain return correct evidence and let the model build a faithful answer?

④ Evaluate

Model judges the answer against ground truth (the document it generated in phase 1)

→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 2279 chars

You are the Generator AI. Your job is to produce a single JSON object for this test run.

Run ID: {{RUN}}
Repetition: {{REP}}

Steps:
1. Invent a fictional beacon name (e.g. "Beacon Halvorn", "Beacon Quelspire", "Beacon Driftmast",
   "Beacon Ironclaw", "Beacon Thalvex"). Call this BEACON_NAME.
2. Construct two documents in this exact ingestion order (ingest A first, then B):
     Doc A (v1 — stale):
       source_id : "KB-{{RUN}}-VER-001-A-v1"
       title     : "Beacon Status Report v1"
       content   : "Status of <BEACON_NAME> is RED."
     Doc B (v2 — current, supersedes A):
       source_id : "KB-{{RUN}}-VER-001-B-v2"
       title     : "Beacon Status Report v2"
       content   : "Status of <BEACON_NAME> is GREEN. This report supersedes the previous RED status."
3. IMPORTANT: Ingest Doc A into the brain first, then Doc B. Doc B is the authoritative version.
4. Construct exactly one question (as a string):
     "What is the current status of <BEACON_NAME>?"
5. Construct the expected answer:
     question_index       : 0
     correct_answer       : "GREEN"
     required_label       : "DIRECT"
     required_source_ids  : ["KB-{{RUN}}-VER-001-B-v2"]
     must_refuse          : false
     notes                : "Current source is KB-{{RUN}}-VER-001-B-v2; KB-{{RUN}}-VER-001-A-v1 is superseded. If both docs are retrieved, executor should note v1 is superseded."

Output ONLY valid JSON, no prose, no markdown fences:
{
  "run_id": "{{RUN}}",
  "test_id": "VER-001",
  "repetition": {{REP}},
  "documents": [
    {
      "source_id": "KB-{{RUN}}-VER-001-A-v1",
      "title": "Beacon Status Report v1",
      "content": "Status of <BEACON_NAME> is RED."
    },
    {
      "source_id": "KB-{{RUN}}-VER-001-B-v2",
      "title": "Beacon Status Report v2",
      "content": "Status of <BEACON_NAME> is GREEN. This report supersedes the previous RED status."
    }
  ],
  "questions": [
    "What is the current status of <BEACON_NAME>?"
  ],
  "expected_answers": [
    {
      "question_index": 0,
      "correct_answer": "GREEN",
      "required_label": "DIRECT",
      "required_source_ids": ["KB-{{RUN}}-VER-001-B-v2"],
      "must_refuse": false,
      "notes": "Primary source: KB-{{RUN}}-VER-001-B-v2. Superseded: KB-{{RUN}}-VER-001-A-v1."
    }
  ]
}

▶ ② Execution Instructions 711 chars

Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.

Question: {{QUESTION}}

Retrieve all relevant documents about the beacon status. If multiple versions exist,
identify which is most current and note if any older version is superseded.

Return ONLY this JSON (no prose, no markdown):
{{REQUIRED_ANSWER_SCHEMA}}
Where the schema is:
{
  "question_id": "Q1",
  "answer": "<current status: GREEN or RED>",
  "reasoning_type": "DIRECT",
  "primary_source_id": "<source_id of the current/authoritative document>",
  "additional_source_ids": ["<any other source_ids retrieved, e.g. older versions>"],
  "supersession_noted": <true|false>,
  "confidence": "<high|medium|low>"
}

⚖ ③ Judge Instructions 1131 chars

You are the Judge AI. Evaluate whether the Executor answered with the correct current status.

Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}

Executor's answer:
{{EXEC_ANSWER_JSON}}

Evaluation rules:
1. Expected answer is "GREEN" from source "KB-<run_id>-VER-001-B-v2".
2. Check executor's answer.answer is "GREEN" (case-insensitive).
   If "RED" = critical failure (stale answer).
3. Check executor's primary_source_id equals "KB-<run_id>-VER-001-B-v2".
   If only "KB-<run_id>-VER-001-A-v1" is cited = critical failure.
4. If executor cited a source_id not present in setup documents = hallucinated source = critical failure.
5. Bonus: supersession_noted = true improves score (full 1.0 vs 0.85 without it).
6. If executor answered RED even partially = critical failure.

Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
  "test_id": "VER-001",
  "run_id": "<run_id>",
  "repetition": <rep>,
  "passed": <true|false>,
  "critical_failure": <true|false>,
  "critical_failure_reason": "<null or description>",
  "score": <0.0-1.0>,
  "supersession_noted": <true|false>,
  "notes": "<brief explanation>"
}

Critical Failure Conditions

Executor answered RED (stale answer from v1 document)
Executor cited only the v1 source_id (KB-{{RUN}}-VER-001-A-v1) as the current answer
Executor cited a source_id not present in setup documents (hallucinated source)

Recent Run History

2 runs

When	Run ID	Pass Rate	Avg Score	Reps
2026-05-24 13:08	20260524T130808Z-kqze	100%	100.0	1/1	View →
2026-05-24 11:37	20260524T113756Z-kduj	0%	0	0/1	View →

📄 Raw YAML cases/update_versioning/VER-001.yaml

schema_version: "1.0"
test_id: "VER-001"
category: "update_versioning"
severity: "critical"
repetitions: 5
reasoning_type: "DIRECT"
num_documents: 2
num_questions: 1
tags: [update_versioning, supersession, beacon, stale_rejection, two_doc]

setup_instructions: |
  You are the Generator AI. Your job is to produce a single JSON object for this test run.

  Run ID: {{RUN}}
  Repetition: {{REP}}

  Steps:
  1. Invent a fictional beacon name (e.g. "Beacon Halvorn", "Beacon Quelspire", "Beacon Driftmast",
     "Beacon Ironclaw", "Beacon Thalvex"). Call this BEACON_NAME.
  2. Construct two documents in this exact ingestion order (ingest A first, then B):
       Doc A (v1 — stale):
         source_id : "KB-{{RUN}}-VER-001-A-v1"
         title     : "Beacon Status Report v1"
         content   : "Status of <BEACON_NAME> is RED."
       Doc B (v2 — current, supersedes A):
         source_id : "KB-{{RUN}}-VER-001-B-v2"
         title     : "Beacon Status Report v2"
         content   : "Status of <BEACON_NAME> is GREEN. This report supersedes the previous RED status."
  3. IMPORTANT: Ingest Doc A into the brain first, then Doc B. Doc B is the authoritative version.
  4. Construct exactly one question (as a string):
       "What is the current status of <BEACON_NAME>?"
  5. Construct the expected answer:
       question_index       : 0
       correct_answer       : "GREEN"
       required_label       : "DIRECT"
       required_source_ids  : ["KB-{{RUN}}-VER-001-B-v2"]
       must_refuse          : false
       notes                : "Current source is KB-{{RUN}}-VER-001-B-v2; KB-{{RUN}}-VER-001-A-v1 is superseded. If both docs are retrieved, executor should note v1 is superseded."

  Output ONLY valid JSON, no prose, no markdown fences:
  {
    "run_id": "{{RUN}}",
    "test_id": "VER-001",
    "repetition": {{REP}},
    "documents": [
      {
        "source_id": "KB-{{RUN}}-VER-001-A-v1",
        "title": "Beacon Status Report v1",
        "content": "Status of <BEACON_NAME> is RED."
      },
      {
        "source_id": "KB-{{RUN}}-VER-001-B-v2",
        "title": "Beacon Status Report v2",
        "content": "Status of <BEACON_NAME> is GREEN. This report supersedes the previous RED status."
      }
    ],
    "questions": [
      "What is the current status of <BEACON_NAME>?"
    ],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "GREEN",
        "required_label": "DIRECT",
        "required_source_ids": ["KB-{{RUN}}-VER-001-B-v2"],
        "must_refuse": false,
        "notes": "Primary source: KB-{{RUN}}-VER-001-B-v2. Superseded: KB-{{RUN}}-VER-001-A-v1."
      }
    ]
  }

execution_instructions: |
  Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.

  Question: {{QUESTION}}

  Retrieve all relevant documents about the beacon status. If multiple versions exist,
  identify which is most current and note if any older version is superseded.

  Return ONLY this JSON (no prose, no markdown):
  {{REQUIRED_ANSWER_SCHEMA}}
  Where the schema is:
  {
    "question_id": "Q1",
    "answer": "<current status: GREEN or RED>",
    "reasoning_type": "DIRECT",
    "primary_source_id": "<source_id of the current/authoritative document>",
    "additional_source_ids": ["<any other source_ids retrieved, e.g. older versions>"],
    "supersession_noted": <true|false>,
    "confidence": "<high|medium|low>"
  }

judge_instructions: |
  You are the Judge AI. Evaluate whether the Executor answered with the correct current status.

  Ground truth (from setup):
  {{SETUP_CONTEXT_JSON}}

  Executor's answer:
  {{EXEC_ANSWER_JSON}}

  Evaluation rules:
  1. Expected answer is "GREEN" from source "KB-<run_id>-VER-001-B-v2".
  2. Check executor's answer.answer is "GREEN" (case-insensitive).
     If "RED" = critical failure (stale answer).
  3. Check executor's primary_source_id equals "KB-<run_id>-VER-001-B-v2".
     If only "KB-<run_id>-VER-001-A-v1" is cited = critical failure.
  4. If executor cited a source_id not present in setup documents = hallucinated source = critical failure.
  5. Bonus: supersession_noted = true improves score (full 1.0 vs 0.85 without it).
  6. If executor answered RED even partially = critical failure.

  Return ONLY this JSON:
  {{JUDGE_RESULT_SCHEMA}}
  Where the schema is:
  {
    "test_id": "VER-001",
    "run_id": "<run_id>",
    "repetition": <rep>,
    "passed": <true|false>,
    "critical_failure": <true|false>,
    "critical_failure_reason": "<null or description>",
    "score": <0.0-1.0>,
    "supersession_noted": <true|false>,
    "notes": "<brief explanation>"
  }

critical_failures:
  - "Executor answered RED (stale answer from v1 document)"
  - "Executor cited only the v1 source_id (KB-{{RUN}}-VER-001-A-v1) as the current answer"
  - "Executor cited a source_id not present in setup documents (hallucinated source)"