VER-002

Category: update versioning

Severity: high

Repetitions: 5

Documents: 2

Questions: 1

Reasoning: DIRECT

Tags: update_versioning, policy, numeric_limit, dual_citation, prior_and_current

📖 In Plain English

What this category tests

Does the brain return the latest version of a document after updates?

How the test works

A document is ingested at v1, then a v2 that supersedes it is ingested. The test asks about the topic and checks that the v2 content is returned as current, with v1 identified as superseded.

Why it matters

Stale data is dangerous: users must always see the current version of the truth.

⚙️ How a single rep runs

① Generate

Model creates 2 synthetic documents and 1 question with unique canary tokens

→ Fresh content per run prevents memorization and proves real retrieval

② Ingest (MCP)

Model calls brain_ingest to store the 2 documents

→ Tests the brain's storage and indexing pipeline

③ Query (MCP)

Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)

→ Core test — does the brain return correct evidence and let the model build a faithful answer?

④ Evaluate

Model judges the answer against ground truth (the document it generated in phase 1)

→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

Each test run executes this rep 5 times. A rep passes only if its score is ≥ 85 and it has no critical failures.
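The pass rule above can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the names RepResult, rep_passed, and pass_rate are hypothetical, the score is assumed to already be on the 0–100 scale described in this section, and returning 0.0 for an empty run is an assumption.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 85  # from the text: a pass requires score >= 85


@dataclass
class RepResult:
    score: float            # assumed to be on the 0-100 scale
    critical_failure: bool  # True if the judge flagged any critical failure


def rep_passed(rep: RepResult) -> bool:
    """A rep passes only with a high enough score and no critical failure."""
    return rep.score >= PASS_THRESHOLD and not rep.critical_failure


def pass_rate(reps: list[RepResult]) -> float:
    """Percentage of passing reps; 0.0 for an empty run (an assumption)."""
    if not reps:
        return 0.0
    return 100.0 * sum(rep_passed(r) for r in reps) / len(reps)
```

Note that a critical failure overrides the score: a rep scored 100 with a critical failure still fails under this reading.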

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 2425 chars

You are the Generator AI. Your job is to produce a single JSON object for this test run.

Run ID: {{RUN}}
Repetition: {{REP}}

Steps:
1. Choose a small integer for the prior limit: NUMBER_A (e.g. 8, 10, 15, 20, 25).
2. Choose a larger integer for the current limit: NUMBER_B, where NUMBER_B > NUMBER_A
   (e.g. if NUMBER_A=10 then NUMBER_B=12; if NUMBER_A=15 then NUMBER_B=18).
3. Construct two documents in this exact ingestion order:
     Doc A (v1 — prior policy):
       source_id : "KB-{{RUN}}-VER-002-A-v1"
       title     : "Policy Limit Memo v1"
       content   : "Policy limit is <NUMBER_A>."
     Doc B (v2 — current policy, supersedes A):
       source_id : "KB-{{RUN}}-VER-002-B-v2"
       title     : "Policy Limit Memo v2"
       content   : "Policy limit is <NUMBER_B>. This supersedes the prior limit of <NUMBER_A>."
4. Ingest Doc A first, then Doc B.
5. Construct exactly one question (as a string):
     "What is the current policy limit, and what was the prior limit? Cite source IDs for each."
6. Construct the expected answer:
     question_index       : 0
     correct_answer       : "Current limit: <NUMBER_B> (KB-{{RUN}}-VER-002-B-v2); prior limit: <NUMBER_A> (KB-{{RUN}}-VER-002-A-v1)."
     required_label       : "DIRECT"
     required_source_ids  : ["KB-{{RUN}}-VER-002-A-v1", "KB-{{RUN}}-VER-002-B-v2"]
     must_refuse          : false

Output ONLY valid JSON, no prose, no markdown fences:
{
  "run_id": "{{RUN}}",
  "test_id": "VER-002",
  "repetition": {{REP}},
  "documents": [
    {
      "source_id": "KB-{{RUN}}-VER-002-A-v1",
      "title": "Policy Limit Memo v1",
      "content": "Policy limit is <NUMBER_A>."
    },
    {
      "source_id": "KB-{{RUN}}-VER-002-B-v2",
      "title": "Policy Limit Memo v2",
      "content": "Policy limit is <NUMBER_B>. This supersedes the prior limit of <NUMBER_A>."
    }
  ],
  "questions": [
    "What is the current policy limit, and what was the prior limit? Cite source IDs for each."
  ],
  "expected_answers": [
    {
      "question_index": 0,
      "correct_answer": "Current limit: <NUMBER_B> (KB-{{RUN}}-VER-002-B-v2); prior limit: <NUMBER_A> (KB-{{RUN}}-VER-002-A-v1).",
      "required_label": "DIRECT",
      "required_source_ids": ["KB-{{RUN}}-VER-002-A-v1", "KB-{{RUN}}-VER-002-B-v2"],
      "must_refuse": false,
      "notes": "Executor must cite both versions: current=NUMBER_B from B-v2, prior=NUMBER_A from A-v1."
    }
  ]
}
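
For reference, the JSON that the setup prompt asks for can also be built deterministically. The sketch below is illustrative only: build_setup is a hypothetical helper, and the harness described here has a model generate this payload rather than code. Field names and string formats are copied from the prompt above.

```python
def build_setup(run_id: str, rep: int, number_a: int, number_b: int) -> dict:
    """Assemble the VER-002 setup payload (hypothetical helper)."""
    if number_b <= number_a:
        raise ValueError("NUMBER_B must be greater than NUMBER_A")
    id_a = f"KB-{run_id}-VER-002-A-v1"
    id_b = f"KB-{run_id}-VER-002-B-v2"
    return {
        "run_id": run_id,
        "test_id": "VER-002",
        "repetition": rep,
        "documents": [  # ingestion order: v1 first, then v2
            {"source_id": id_a, "title": "Policy Limit Memo v1",
             "content": f"Policy limit is {number_a}."},
            {"source_id": id_b, "title": "Policy Limit Memo v2",
             "content": f"Policy limit is {number_b}. "
                        f"This supersedes the prior limit of {number_a}."},
        ],
        "questions": [
            "What is the current policy limit, and what was the prior limit? "
            "Cite source IDs for each."
        ],
        "expected_answers": [{
            "question_index": 0,
            "correct_answer": (f"Current limit: {number_b} ({id_b}); "
                               f"prior limit: {number_a} ({id_a})."),
            "required_label": "DIRECT",
            "required_source_ids": [id_a, id_b],
            "must_refuse": False,
            "notes": ("Executor must cite both versions: current=NUMBER_B "
                      "from B-v2, prior=NUMBER_A from A-v1."),
        }],
    }
```

Calling build_setup("demo-run", 1, 10, 12) yields a v2 memo whose content reads "Policy limit is 12. This supersedes the prior limit of 10."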

▶ ② Execution Instructions 604 chars

Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.

Question: {{QUESTION}}

Retrieve all relevant policy limit documents. Identify the current and prior values,
and cite the source_id for each.

Return ONLY this JSON (no prose, no markdown):
{{REQUIRED_ANSWER_SCHEMA}}
Where the schema is:
{
  "question_id": "Q1",
  "current_limit": <number>,
  "current_source_id": "<source_id of current policy document>",
  "prior_limit": <number>,
  "prior_source_id": "<source_id of prior policy document>",
  "reasoning_type": "DIRECT",
  "confidence": "<high|medium|low>"
}
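
Before an answer reaches the judge, its shape could be checked against the schema above. The following is a sketch under assumptions: validate_answer is a hypothetical helper, and nothing in the source says the harness performs this check. It verifies only field presence, basic types, and the two enumerated values.

```python
import json

# Field names come from the schema above; the type mapping is an assumption.
REQUIRED_FIELDS = {
    "question_id": str,
    "current_limit": (int, float),
    "current_source_id": str,
    "prior_limit": (int, float),
    "prior_source_id": str,
    "reasoning_type": str,
    "confidence": str,
}


def validate_answer(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the shape is acceptable."""
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(answer, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in answer:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(answer[name], bool) or not isinstance(answer[name], expected):
            problems.append(f"wrong type for field: {name}")
    if answer.get("reasoning_type") != "DIRECT":
        problems.append("reasoning_type must be DIRECT")
    if answer.get("confidence") not in {"high", "medium", "low"}:
        problems.append("confidence must be high, medium, or low")
    return problems
```

A shape check like this says nothing about correctness; it only separates malformed output from wrong answers.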

⚖ ③ Judge Instructions 1449 chars

You are the Judge AI. Evaluate whether the Executor correctly identified both the current
and prior policy limits with correct source citations.

Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}

Executor's answer:
{{EXEC_ANSWER_JSON}}

Evaluation rules:
1. Extract expected values from setup: expected_answers[0]
   - current_limit = NUMBER_B, current_source_id = "KB-<run_id>-VER-002-B-v2"
   - prior_limit = NUMBER_A, prior_source_id = "KB-<run_id>-VER-002-A-v1"
2. Check executor's current_limit equals NUMBER_B. If wrong = critical failure.
3. Check executor's prior_limit equals NUMBER_A. If wrong = critical failure.
4. Check executor's current_source_id equals "KB-<run_id>-VER-002-B-v2".
   If only one version cited = critical failure.
5. Check executor's prior_source_id equals "KB-<run_id>-VER-002-A-v1".
6. If executor swapped current and prior = critical failure.
7. If any cited source_id is not in setup documents = hallucinated source = critical failure.
8. Full score 1.0 requires all four values correct. Deduct 0.25 per wrong field.

Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
  "test_id": "VER-002",
  "run_id": "<run_id>",
  "repetition": <rep>,
  "passed": <true|false>,
  "critical_failure": <true|false>,
  "critical_failure_reason": "<null or description>",
  "score": <0.0-1.0>,
  "current_limit_correct": <true|false>,
  "prior_limit_correct": <true|false>,
  "notes": "<brief explanation>"
}
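
Rule 8 is plain arithmetic over the four checked fields, and it can be sketched deterministically. In this harness the judging is done by a model, so the code below is only one possible reading: expected_from_setup and score_answer are hypothetical names, and the sketch assumes documents[0] is the v1 memo, documents[1] is the v2 memo, and each memo's content begins with "Policy limit is <N>".

```python
import re

FIELDS = ("current_limit", "current_source_id", "prior_limit", "prior_source_id")


def expected_from_setup(setup: dict) -> dict:
    """Derive the four expected values from the setup JSON (rule 1)."""
    def limit(doc: dict) -> int:
        # Assumes the content starts with "Policy limit is <N>"
        return int(re.match(r"Policy limit is (\d+)", doc["content"]).group(1))

    v1, v2 = setup["documents"]  # assumed order: v1 first, then v2
    return {
        "current_limit": limit(v2),
        "current_source_id": v2["source_id"],
        "prior_limit": limit(v1),
        "prior_source_id": v1["source_id"],
    }


def score_answer(expected: dict, answer: dict) -> dict:
    """Apply rule 8: start at 1.0 and deduct 0.25 per wrong field."""
    correct = {f: answer.get(f) == expected[f] for f in FIELDS}
    wrong = sum(not ok for ok in correct.values())
    return {
        "score": 1.0 - 0.25 * wrong,
        "current_limit_correct": correct["current_limit"],
        "prior_limit_correct": correct["prior_limit"],
    }
```

An answer with both limits right but one wrong source_id scores 0.75 under this reading. Whether such an answer also counts as a critical failure is decided separately by the conditions listed under Critical Failure Conditions.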

Critical Failure Conditions

Executor only cited one version of the policy (missing prior or current)
Executor swapped current and prior limits (returned NUMBER_A as current)
Executor cited a source_id not present in setup documents (hallucinated source)
Executor returned incorrect numeric values for current or prior limit
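
The four conditions above can be expressed as explicit checks. This is a sketch, not the judge's implementation: critical_failures is a hypothetical function, "only one version cited" is interpreted here as a missing source_id or the same source_id given twice, and "swapped" follows the parenthetical above (NUMBER_A returned as current). Both interpretations are assumptions.

```python
def critical_failures(expected: dict, answer: dict, known_ids: set[str]) -> list[str]:
    """Return the critical failure reasons that apply; empty list if none."""
    reasons = []
    cited = [answer.get("current_source_id"), answer.get("prior_source_id")]

    # Assumption: a missing ID, or the same ID twice, means one version was cited
    if not all(cited) or cited[0] == cited[1]:
        reasons.append("only one version cited")

    # Swapped: the prior value (NUMBER_A) was returned as the current limit
    if answer.get("current_limit") == expected["prior_limit"]:
        reasons.append("current and prior limits swapped")

    # Any cited ID that is not among the setup documents is hallucinated
    if any(sid and sid not in known_ids for sid in cited):
        reasons.append("hallucinated source_id")

    if (answer.get("current_limit") != expected["current_limit"]
            or answer.get("prior_limit") != expected["prior_limit"]):
        reasons.append("incorrect numeric value")
    return reasons
```

A swapped answer triggers two reasons at once, since a swap also makes the numeric values incorrect. Here known_ids would be the set of source_id values from the setup documents.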

Recent Run History

2 runs

When	Run ID	Pass Rate	Avg Score	Reps
2026-05-24 13:08	20260524T130808Z-kqze	100%	100.0	1/1
2026-05-24 11:37	20260524T113756Z-kduj	0%	0	0/1

📄 Raw YAML cases/update_versioning/VER-002.yaml

schema_version: "1.0"
test_id: "VER-002"
category: "update_versioning"
severity: "high"
repetitions: 5
reasoning_type: "DIRECT"
num_documents: 2
num_questions: 1
tags: [update_versioning, policy, numeric_limit, dual_citation, prior_and_current]

setup_instructions: |
  You are the Generator AI. Your job is to produce a single JSON object for this test run.

  Run ID: {{RUN}}
  Repetition: {{REP}}

  Steps:
  1. Choose a small integer for the prior limit: NUMBER_A (e.g. 8, 10, 15, 20, 25).
  2. Choose a larger integer for the current limit: NUMBER_B, where NUMBER_B > NUMBER_A
     (e.g. if NUMBER_A=10 then NUMBER_B=12; if NUMBER_A=15 then NUMBER_B=18).
  3. Construct two documents in this exact ingestion order:
       Doc A (v1 — prior policy):
         source_id : "KB-{{RUN}}-VER-002-A-v1"
         title     : "Policy Limit Memo v1"
         content   : "Policy limit is <NUMBER_A>."
       Doc B (v2 — current policy, supersedes A):
         source_id : "KB-{{RUN}}-VER-002-B-v2"
         title     : "Policy Limit Memo v2"
         content   : "Policy limit is <NUMBER_B>. This supersedes the prior limit of <NUMBER_A>."
  4. Ingest Doc A first, then Doc B.
  5. Construct exactly one question (as a string):
       "What is the current policy limit, and what was the prior limit? Cite source IDs for each."
  6. Construct the expected answer:
       question_index       : 0
       correct_answer       : "Current limit: <NUMBER_B> (KB-{{RUN}}-VER-002-B-v2); prior limit: <NUMBER_A> (KB-{{RUN}}-VER-002-A-v1)."
       required_label       : "DIRECT"
       required_source_ids  : ["KB-{{RUN}}-VER-002-A-v1", "KB-{{RUN}}-VER-002-B-v2"]
       must_refuse          : false

  Output ONLY valid JSON, no prose, no markdown fences:
  {
    "run_id": "{{RUN}}",
    "test_id": "VER-002",
    "repetition": {{REP}},
    "documents": [
      {
        "source_id": "KB-{{RUN}}-VER-002-A-v1",
        "title": "Policy Limit Memo v1",
        "content": "Policy limit is <NUMBER_A>."
      },
      {
        "source_id": "KB-{{RUN}}-VER-002-B-v2",
        "title": "Policy Limit Memo v2",
        "content": "Policy limit is <NUMBER_B>. This supersedes the prior limit of <NUMBER_A>."
      }
    ],
    "questions": [
      "What is the current policy limit, and what was the prior limit? Cite source IDs for each."
    ],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "Current limit: <NUMBER_B> (KB-{{RUN}}-VER-002-B-v2); prior limit: <NUMBER_A> (KB-{{RUN}}-VER-002-A-v1).",
        "required_label": "DIRECT",
        "required_source_ids": ["KB-{{RUN}}-VER-002-A-v1", "KB-{{RUN}}-VER-002-B-v2"],
        "must_refuse": false,
        "notes": "Executor must cite both versions: current=NUMBER_B from B-v2, prior=NUMBER_A from A-v1."
      }
    ]
  }

execution_instructions: |
  Use ONLY MCP Knowledge Brain retrieval tools. Do not rely on memory or workspace context.

  Question: {{QUESTION}}

  Retrieve all relevant policy limit documents. Identify the current and prior values,
  and cite the source_id for each.

  Return ONLY this JSON (no prose, no markdown):
  {{REQUIRED_ANSWER_SCHEMA}}
  Where the schema is:
  {
    "question_id": "Q1",
    "current_limit": <number>,
    "current_source_id": "<source_id of current policy document>",
    "prior_limit": <number>,
    "prior_source_id": "<source_id of prior policy document>",
    "reasoning_type": "DIRECT",
    "confidence": "<high|medium|low>"
  }

judge_instructions: |
  You are the Judge AI. Evaluate whether the Executor correctly identified both the current
  and prior policy limits with correct source citations.

  Ground truth (from setup):
  {{SETUP_CONTEXT_JSON}}

  Executor's answer:
  {{EXEC_ANSWER_JSON}}

  Evaluation rules:
  1. Extract expected values from setup: expected_answers[0]
     - current_limit = NUMBER_B, current_source_id = "KB-<run_id>-VER-002-B-v2"
     - prior_limit = NUMBER_A, prior_source_id = "KB-<run_id>-VER-002-A-v1"
  2. Check executor's current_limit equals NUMBER_B. If wrong = critical failure.
  3. Check executor's prior_limit equals NUMBER_A. If wrong = critical failure.
  4. Check executor's current_source_id equals "KB-<run_id>-VER-002-B-v2".
     If only one version cited = critical failure.
  5. Check executor's prior_source_id equals "KB-<run_id>-VER-002-A-v1".
  6. If executor swapped current and prior = critical failure.
  7. If any cited source_id is not in setup documents = hallucinated source = critical failure.
  8. Full score 1.0 requires all four values correct. Deduct 0.25 per wrong field.

  Return ONLY this JSON:
  {{JUDGE_RESULT_SCHEMA}}
  Where the schema is:
  {
    "test_id": "VER-002",
    "run_id": "<run_id>",
    "repetition": <rep>,
    "passed": <true|false>,
    "critical_failure": <true|false>,
    "critical_failure_reason": "<null or description>",
    "score": <0.0-1.0>,
    "current_limit_correct": <true|false>,
    "prior_limit_correct": <true|false>,
    "notes": "<brief explanation>"
  }

critical_failures:
  - "Executor only cited one version of the policy (missing prior or current)"
  - "Executor swapped current and prior limits (returned NUMBER_A as current)"
  - "Executor cited a source_id not present in setup documents (hallucinated source)"
  - "Executor returned incorrect numeric values for current or prior limit"