← All Test Cases
critical
ISO-003
context isolation
Repetitions
5
Documents
1
Questions
1
Reasoning
DIRECT
context_isolation
codename
single_doc
memory_discipline
retrieval_only
📖 In Plain English
What this category tests
Does each query work with only its own documents, ignoring data from other tests?
How the test works
A document with a unique color/animal is ingested per rep. The test fetches that specific document by label and checks the right value is returned — even when the brain contains documents from other tests.
Why it matters
In production, the brain has thousands of docs. Tests must prove retrieval is precise.
⚙️ How a single rep runs
① Generate
Model creates 1 synthetic document and 1 question with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the 1 document
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges the answer against ground truth (the document it generated in phase 1)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)
This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.
🔬 Technical Instructions (raw prompts sent to AI)
🔧 ① Setup Instructions 1536 chars
You are the Generator AI. Your job is to produce a single JSON object for this test run.
Run ID: {{RUN}}
Repetition: {{REP}}
Steps:
1. Invent a fictional project codename. It must be two words, capitalized, and completely made up
(e.g. "Project Tambourine", "Operation Glintspire", "Initiative Velvethorn",
"Project Zumblefar", "Directive Copperlatch").
Call this CODENAME_A.
2. Construct exactly one document:
source_id : "KB-{{RUN}}-ISO-003-A-v1"
title : "Project Registry"
content : "The project codename is <CODENAME_A>."
3. Construct exactly one question (as a string):
"What is the project codename? Use only brain-retrieved evidence."
4. Construct the expected answer:
question_index : 0
correct_answer : "<CODENAME_A>"
required_label : "DIRECT"
required_source_ids : ["KB-{{RUN}}-ISO-003-A-v1"]
must_refuse : false
Output ONLY valid JSON, no prose, no markdown fences:
{
"run_id": "{{RUN}}",
"test_id": "ISO-003",
"repetition": {{REP}},
"documents": [
{
"source_id": "KB-{{RUN}}-ISO-003-A-v1",
"title": "Project Registry",
"content": "The project codename is <CODENAME_A>."
}
],
"questions": [
"What is the project codename? Use only brain-retrieved evidence."
],
"expected_answers": [
{
"question_index": 0,
"correct_answer": "<CODENAME_A>",
"required_label": "DIRECT",
"required_source_ids": ["KB-{{RUN}}-ISO-003-A-v1"],
"must_refuse": false
}
]
}
▶ ② Execution Instructions 762 chars
You have access to MCP Knowledge Brain tools.
MANDATORY: Use brain_fetch to retrieve the document DIRECTLY by its source label.
Do NOT use brain_search — it will return noise documents and distract you.
Step 1: Call brain_fetch with URI: brain://source/by-label/KB-{{RUN}}-ISO-003-A-v1
Step 2: Read the content of the returned document
Step 3: Extract the value asked for in the question (e.g. the flag color, animal, etc.)
Step 4: Return that value as your answer, citing the source_id from the fetched document
If brain_fetch returns no document for that label: answer null, label UNCERTAIN.
Do NOT fall back to brain_search. Do NOT cite any source other than the fetched document.
Question: {{QUESTION}}
Return ONLY this JSON:
{{REQUIRED_ANSWER_SCHEMA}}
⚖ ③ Judge Instructions 1120 chars
You are the Judge AI. Evaluate whether the Executor answered correctly.
Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}
Executor's answer:
{{EXEC_ANSWER_JSON}}
Evaluation rules:
1. Extract expected_codename from setup: expected_answers[0].answer
2. Extract expected_source_id from setup: expected_answers[0].source_ids[0]
3. Check executor's answer.answer matches expected_codename (case-insensitive).
4. Check executor's source_ids[0] matches expected_source_id exactly.
5. If executor's codename does not match expected_codename = critical failure
(indicates use of memory or prior session knowledge rather than retrieval).
6. If executor's source_id is not in setup documents = hallucinated source = critical failure.
7. If executor's retrieval_confirmed is false or absent, deduct from score.
Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
"test_id": "ISO-003",
"run_id": "<run_id>",
"repetition": <rep>,
"passed": <true|false>,
"critical_failure": <true|false>,
"critical_failure_reason": "<null or description>",
"score": <0.0-1.0>,
"notes": "<brief explanation>"
}
Critical Failure Conditions
- Executor provided a codename not found in the brain-retrieved document (memory contamination)
- Executor cited a source_id not present in setup documents (hallucinated source)
- Executor explicitly referenced knowledge from outside the brain (prior sessions, external memory)
Recent Run History
3 runs| When | Run ID | Pass Rate | Avg Score | Reps | |
|---|---|---|---|---|---|
| 2026-05-24 13:08 | 20260524T130808Z-kqze | 100% | 100.0 | 1/1 | View → |
| 2026-05-24 12:41 | 20260524T124148Z-z2do | 100% | 100.0 | 1/1 | View → |
| 2026-05-24 11:37 | 20260524T113756Z-kduj | 0% | 0 | 0/1 | View → |
📄 Raw YAML cases/context_isolation/ISO-003.yaml
schema_version: "1.0"
test_id: "ISO-003"
category: "context_isolation"
severity: "critical"
repetitions: 5
reasoning_type: "DIRECT"
num_documents: 1
num_questions: 1
tags: [context_isolation, codename, single_doc, memory_discipline, retrieval_only]
setup_instructions: |
You are the Generator AI. Your job is to produce a single JSON object for this test run.
Run ID: {{RUN}}
Repetition: {{REP}}
Steps:
1. Invent a fictional project codename. It must be two words, capitalized, and completely made up
(e.g. "Project Tambourine", "Operation Glintspire", "Initiative Velvethorn",
"Project Zumblefar", "Directive Copperlatch").
Call this CODENAME_A.
2. Construct exactly one document:
source_id : "KB-{{RUN}}-ISO-003-A-v1"
title : "Project Registry"
content : "The project codename is <CODENAME_A>."
3. Construct exactly one question (as a string):
"What is the project codename? Use only brain-retrieved evidence."
4. Construct the expected answer:
question_index : 0
correct_answer : "<CODENAME_A>"
required_label : "DIRECT"
required_source_ids : ["KB-{{RUN}}-ISO-003-A-v1"]
must_refuse : false
Output ONLY valid JSON, no prose, no markdown fences:
{
"run_id": "{{RUN}}",
"test_id": "ISO-003",
"repetition": {{REP}},
"documents": [
{
"source_id": "KB-{{RUN}}-ISO-003-A-v1",
"title": "Project Registry",
"content": "The project codename is <CODENAME_A>."
}
],
"questions": [
"What is the project codename? Use only brain-retrieved evidence."
],
"expected_answers": [
{
"question_index": 0,
"correct_answer": "<CODENAME_A>",
"required_label": "DIRECT",
"required_source_ids": ["KB-{{RUN}}-ISO-003-A-v1"],
"must_refuse": false
}
]
}
execution_instructions: |
You have access to MCP Knowledge Brain tools.
MANDATORY: Use brain_fetch to retrieve the document DIRECTLY by its source label.
Do NOT use brain_search — it will return noise documents and distract you.
Step 1: Call brain_fetch with URI: brain://source/by-label/KB-{{RUN}}-ISO-003-A-v1
Step 2: Read the content of the returned document
Step 3: Extract the value asked for in the question (e.g. the flag color, animal, etc.)
Step 4: Return that value as your answer, citing the source_id from the fetched document
If brain_fetch returns no document for that label: answer null, label UNCERTAIN.
Do NOT fall back to brain_search. Do NOT cite any source other than the fetched document.
Question: {{QUESTION}}
Return ONLY this JSON:
{{REQUIRED_ANSWER_SCHEMA}}
judge_instructions: |
You are the Judge AI. Evaluate whether the Executor answered correctly.
Ground truth (from setup):
{{SETUP_CONTEXT_JSON}}
Executor's answer:
{{EXEC_ANSWER_JSON}}
Evaluation rules:
1. Extract expected_codename from setup: expected_answers[0].answer
2. Extract expected_source_id from setup: expected_answers[0].source_ids[0]
3. Check executor's answer.answer matches expected_codename (case-insensitive).
4. Check executor's source_ids[0] matches expected_source_id exactly.
5. If executor's codename does not match expected_codename = critical failure
(indicates use of memory or prior session knowledge rather than retrieval).
6. If executor's source_id is not in setup documents = hallucinated source = critical failure.
7. If executor's retrieval_confirmed is false or absent, deduct from score.
Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Where the schema is:
{
"test_id": "ISO-003",
"run_id": "<run_id>",
"repetition": <rep>,
"passed": <true|false>,
"critical_failure": <true|false>,
"critical_failure_reason": "<null or description>",
"score": <0.0-1.0>,
"notes": "<brief explanation>"
}
critical_failures:
- "Executor provided a codename not found in the brain-retrieved document (memory contamination)"
- "Executor cited a source_id not present in setup documents (hallucinated source)"
- "Executor explicitly referenced knowledge from outside the brain (prior sessions, external memory)"