← All Test Cases

ISO-003

context isolation

critical

Repetitions

Documents

Questions

Reasoning

DIRECT

context_isolation codename single_doc memory_discipline retrieval_only

📖 In Plain English

What this category tests

Does each query work with only its own documents, ignoring data from other tests?

How the test works

A document with a unique color/animal is ingested per rep. The test fetches that specific document by label and checks the right value is returned — even when the brain contains documents from other tests.

Why it matters

In production, the brain has thousands of docs. Tests must prove retrieval is precise.

⚙️ How a single rep runs

① Generate

Model creates 1 synthetic document and 1 question with unique canary tokens

→ Fresh content per run prevents memorization and proves real retrieval

② Ingest (MCP)

Model calls brain_ingest to store the 1 document

→ Tests the brain's storage and indexing pipeline

③ Query (MCP)

Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)

→ Core test — does the brain return correct evidence and let the model build a faithful answer?

④ Evaluate

Model judges the answer against ground truth (the document it generated in phase 1)

→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 1536 chars

You are the Generator AI. Your job is to produce a single JSON object for this test run.

Run ID: {{RUN}}
Repetition: {{REP}}

Steps:
1. Invent a fictional project codename. It must be two words, capitalized, and completely made up
   (e.g. "Project Tambourine", "Operation Glintspire", "Initiative Velvethorn",
    "Project Zumblefar", "Directive Copperlatch").
   Call this CODENAME_A.
2. Construct exactly one document:
     source_id : "KB-{{RUN}}-ISO-003-A-v1"
     title     : "Project Registry"
     content   : "The project codename is <CODENAME_A>."
3. Construct exactly one question (as a string):
     "What is the project codename? Use only brain-retrieved evidence."
4. Construct the expected answer:
     question_index       : 0
     correct_answer       : "<CODENAME_A>"
     required_label       : "DIRECT"
     required_source_ids  : ["KB-{{RUN}}-ISO-003-A-v1"]
     must_refuse          : false

Output ONLY valid JSON, no prose, no markdown fences:
{
  "run_id": "{{RUN}}",
  "test_id": "ISO-003",
  "repetition": {{REP}},
  "documents": [
    {
      "source_id": "KB-{{RUN}}-ISO-003-A-v1",
      "title": "Project Registry",
      "content": "The project codename is <CODENAME_A>."
    }
  ],
  "questions": [
    "What is the project codename? Use only brain-retrieved evidence."
  ],
  "expected_answers": [
    {
      "question_index": 0,
      "correct_answer": "<CODENAME_A>",
      "required_label": "DIRECT",
      "required_source_ids": ["KB-{{RUN}}-ISO-003-A-v1"],
      "must_refuse": false
    }
  ]
}

▶ ② Execution Instructions 762 chars

You have access to MCP Knowledge Brain tools.

MANDATORY: Use brain_fetch to retrieve the document DIRECTLY by its source label.
Do NOT use brain_search — it will return noise documents and distract you.

Step 1: Call brain_fetch with URI: brain://source/by-label/KB-{{RUN}}-ISO-003-A-v1
Step 2: Read the content of the returned document
Step 3: Extract the value asked for in the question (e.g. the flag color, animal, etc.)
Step 4: Return that value as your answer, citing the source_id from the fetched document

If brain_fetch returns no document for that label: answer null, label UNCERTAIN.
Do NOT fall back to brain_search. Do NOT cite any source other than the fetched document.

Question: {{QUESTION}}

Return ONLY this JSON:
{{REQUIRED_ANSWER_SCHEMA}}

⚖ ③ Judge Instructions 1120 chars

You are the Judge AI. Evaluate whether the Executor answered correctly.

Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}

Executor's answer:
{{EXEC_ANSWER_JSON}}

Evaluation rules:
1. Extract expected_codename from setup: expected_answers[0].answer
2. Extract expected_source_id from setup: expected_answers[0].source_ids[0]
3. Check executor's answer.answer matches expected_codename (case-insensitive).
4. Check executor's source_ids[0] matches expected_source_id exactly.
5. If executor's codename does not match expected_codename = critical failure
   (indicates use of memory or prior session knowledge rather than retrieval).
6. If executor's source_id is not in setup documents = hallucinated source = critical failure.
7. If executor's retrieval_confirmed is false or absent, deduct from score.

Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
  "test_id": "ISO-003",
  "run_id": "<run_id>",
  "repetition": <rep>,
  "passed": <true|false>,
  "critical_failure": <true|false>,
  "critical_failure_reason": "<null or description>",
  "score": <0.0-1.0>,
  "notes": "<brief explanation>"
}

Critical Failure Conditions

Executor provided a codename not found in the brain-retrieved document (memory contamination)
Executor cited a source_id not present in setup documents (hallucinated source)
Executor explicitly referenced knowledge from outside the brain (prior sessions, external memory)

Recent Run History

3 runs

When	Run ID	Pass Rate	Avg Score	Reps
2026-05-24 13:08	20260524T130808Z-kqze	100%	100.0	1/1	View →
2026-05-24 12:41	20260524T124148Z-z2do	100%	100.0	1/1	View →
2026-05-24 11:37	20260524T113756Z-kduj	0%	0	0/1	View →

📄 Raw YAML cases/context_isolation/ISO-003.yaml

schema_version: "1.0"
test_id: "ISO-003"
category: "context_isolation"
severity: "critical"
repetitions: 5
reasoning_type: "DIRECT"
num_documents: 1
num_questions: 1
tags: [context_isolation, codename, single_doc, memory_discipline, retrieval_only]

setup_instructions: |
  You are the Generator AI. Your job is to produce a single JSON object for this test run.

  Run ID: {{RUN}}
  Repetition: {{REP}}

  Steps:
  1. Invent a fictional project codename. It must be two words, capitalized, and completely made up
     (e.g. "Project Tambourine", "Operation Glintspire", "Initiative Velvethorn",
      "Project Zumblefar", "Directive Copperlatch").
     Call this CODENAME_A.
  2. Construct exactly one document:
       source_id : "KB-{{RUN}}-ISO-003-A-v1"
       title     : "Project Registry"
       content   : "The project codename is <CODENAME_A>."
  3. Construct exactly one question (as a string):
       "What is the project codename? Use only brain-retrieved evidence."
  4. Construct the expected answer:
       question_index       : 0
       correct_answer       : "<CODENAME_A>"
       required_label       : "DIRECT"
       required_source_ids  : ["KB-{{RUN}}-ISO-003-A-v1"]
       must_refuse          : false

  Output ONLY valid JSON, no prose, no markdown fences:
  {
    "run_id": "{{RUN}}",
    "test_id": "ISO-003",
    "repetition": {{REP}},
    "documents": [
      {
        "source_id": "KB-{{RUN}}-ISO-003-A-v1",
        "title": "Project Registry",
        "content": "The project codename is <CODENAME_A>."
      }
    ],
    "questions": [
      "What is the project codename? Use only brain-retrieved evidence."
    ],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "<CODENAME_A>",
        "required_label": "DIRECT",
        "required_source_ids": ["KB-{{RUN}}-ISO-003-A-v1"],
        "must_refuse": false
      }
    ]
  }

execution_instructions: |
  You have access to MCP Knowledge Brain tools.

  MANDATORY: Use brain_fetch to retrieve the document DIRECTLY by its source label.
  Do NOT use brain_search — it will return noise documents and distract you.

  Step 1: Call brain_fetch with URI: brain://source/by-label/KB-{{RUN}}-ISO-003-A-v1
  Step 2: Read the content of the returned document
  Step 3: Extract the value asked for in the question (e.g. the flag color, animal, etc.)
  Step 4: Return that value as your answer, citing the source_id from the fetched document

  If brain_fetch returns no document for that label: answer null, label UNCERTAIN.
  Do NOT fall back to brain_search. Do NOT cite any source other than the fetched document.

  Question: {{QUESTION}}

  Return ONLY this JSON:
  {{REQUIRED_ANSWER_SCHEMA}}


judge_instructions: |
  You are the Judge AI. Evaluate whether the Executor answered correctly.

  Ground truth (from setup):
  {{SETUP_CONTEXT_JSON}}

  Executor's answer:
  {{EXEC_ANSWER_JSON}}

  Evaluation rules:
  1. Extract expected_codename from setup: expected_answers[0].answer
  2. Extract expected_source_id from setup: expected_answers[0].source_ids[0]
  3. Check executor's answer.answer matches expected_codename (case-insensitive).
  4. Check executor's source_ids[0] matches expected_source_id exactly.
  5. If executor's codename does not match expected_codename = critical failure
     (indicates use of memory or prior session knowledge rather than retrieval).
  6. If executor's source_id is not in setup documents = hallucinated source = critical failure.
  7. If executor's retrieval_confirmed is false or absent, deduct from score.

  Return ONLY this JSON:
  {{JUDGE_RESULT_SCHEMA}}
  Where the schema is:
  {
    "test_id": "ISO-003",
    "run_id": "<run_id>",
    "repetition": <rep>,
    "passed": <true|false>,
    "critical_failure": <true|false>,
    "critical_failure_reason": "<null or description>",
    "score": <0.0-1.0>,
    "notes": "<brief explanation>"
  }

critical_failures:
  - "Executor provided a codename not found in the brain-retrieved document (memory contamination)"
  - "Executor cited a source_id not present in setup documents (hallucinated source)"
  - "Executor explicitly referenced knowledge from outside the brain (prior sessions, external memory)"