
XSR-007

Category: cross source
Severity: critical
Repetitions: 5
Documents: 5
Questions: 1
Reasoning: CROSS_SOURCE
Tags: cross_source, four_hop, distractor, project, token, log, server, rack, critical

📖 In Plain English

What this category tests

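Can the brain answer questions that require combining information from two or more documents? This case uses a four-hop chain across four documents, plus one distractor document that must be ignored.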

How the test works

Documents form chains (e.g. Doc A: 'X9 is the velvet pouch'; Doc B: 'velvet pouch contains a silver locket'). The test asks 'what does X9 contain?' which requires Doc A → Doc B reasoning.
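The chain lookup described above can be sketched as follows. This is only an illustration, assuming each document reduces to a single subject-to-object fact; the `facts` dictionary and the `follow_chain` helper are hypothetical and are not part of the test harness.

```python
# Hypothetical illustration: each document contributes one "subject -> object" link.
facts = {
    "X9": "velvet pouch",             # from Doc A
    "velvet pouch": "silver locket",  # from Doc B
}


def follow_chain(start, facts, max_hops=10):
    """Follow subject -> object links until no further link exists."""
    path = [start]
    current = start
    for _ in range(max_hops):  # the hop limit guards against cyclic facts
        if current not in facts:
            break
        current = facts[current]
        path.append(current)
    return path


print(" -> ".join(follow_chain("X9", facts)))
```

Neither document alone answers the question: the first hop only names the pouch, and the second hop only makes sense once the first is resolved.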

Why it matters

Real questions often span multiple documents. Single-doc retrieval is not enough.

⚙️ How a single rep runs

① Generate
Model creates 5 synthetic documents and 1 question with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the 5 documents
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges the answer against ground truth (the documents and expected answer it generated in step ①)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

Each test run repeats this cycle 5 times. A rep passes only if its score is ≥ 85 and no critical failure is triggered.
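The pass rule can be sketched in a few lines. This assumes scores are on the 0–100 scale described in step ④, and that the run-level pass rate is the share of passing reps; the function names are hypothetical and the harness may aggregate differently.

```python
def rep_passes(score, critical_failure, threshold=85.0):
    """A rep passes when the score reaches the threshold and no critical failure fired."""
    return score >= threshold and not critical_failure


def summarize(reps):
    """Summarize one test run; reps is a list of (score, critical_failure) tuples."""
    if not reps:
        return {"pass_rate": 0.0, "avg_score": 0.0}
    passed = sum(1 for score, critical in reps if rep_passes(score, critical))
    avg_score = sum(score for score, _ in reps) / len(reps)
    return {"pass_rate": 100.0 * passed / len(reps), "avg_score": avg_score}
```

Note that a rep with a perfect score still fails if a critical failure is flagged, so the two conditions are checked independently.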

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 3678 chars
You are a test-data Generator AI.

Generate one self-consistent test scenario using the following structure:
  - Choose a fictional project name (e.g. "Project Helion").
  - Choose a fictional token code (e.g. "TOK-2291-ZX").
  - Choose a fictional log number (e.g. "LOG-5540").
  - Choose a fictional primary server name (e.g. "SRVR-Koval").
  - Choose a fictional primary rack number (e.g. "Rack-19B").
  - Choose a DIFFERENT fictional server name for the distractor (e.g. "SRVR-Dantec").
  - Choose a DIFFERENT fictional rack number for the distractor (e.g. "Rack-04A").

Produce exactly five documents:
  Doc A  source_id: KB-{{RUN}}-XSR-007-A-v1
         title: Project Token Index
         content: "Project <Name> uses token <token_code>."

  Doc B  source_id: KB-{{RUN}}-XSR-007-B-v1
         title: Token Audit Log
         content: "Token <token_code> appears in log <log_number>."

  Doc C  source_id: KB-{{RUN}}-XSR-007-C-v1
         title: Log Origin Record
         content: "Log <log_number> was generated by Server <primary_server>."

  Doc D  source_id: KB-{{RUN}}-XSR-007-D-v1
         title: Server Location Register
         content: "Server <primary_server> is in Rack <primary_rack>."

  Doc E  source_id: KB-{{RUN}}-XSR-007-E-v1
         title: Server Location Register (supplementary)
         content: "Server <distractor_server> is in Rack <distractor_rack>."
         note: "DISTRACTOR — must NOT be used in the answer chain."

Produce exactly one question:
  "Where is Project <Name>'s server rack?"

Produce the expected answer:
  question_index: 0
  correct_answer: "<primary_rack>"
  required_label: "CROSS_SOURCE"
  required_source_ids: [
    "KB-{{RUN}}-XSR-007-A-v1",
    "KB-{{RUN}}-XSR-007-B-v1",
    "KB-{{RUN}}-XSR-007-C-v1",
    "KB-{{RUN}}-XSR-007-D-v1"
  ]
  must_refuse: false
  notes: |
    forbidden_source_ids: ["KB-{{RUN}}-XSR-007-E-v1"]
    The four-hop chain is: project -> token -> log -> server -> rack (Docs A-D).
    Doc E describes a different server and must not appear in the answer chain.

Replace {{RUN}} with the actual run_id value.

Output ONLY this JSON (no markdown, no commentary):
{
  "run_id": "<string>",
  "test_id": "XSR-007",
  "repetition": <integer 1-5>,
  "documents": [
    {
      "source_id": "KB-<run_id>-XSR-007-A-v1",
      "title": "Project Token Index",
      "content": "Project <Name> uses token <token_code>."
    },
    {
      "source_id": "KB-<run_id>-XSR-007-B-v1",
      "title": "Token Audit Log",
      "content": "Token <token_code> appears in log <log_number>."
    },
    {
      "source_id": "KB-<run_id>-XSR-007-C-v1",
      "title": "Log Origin Record",
      "content": "Log <log_number> was generated by Server <primary_server>."
    },
    {
      "source_id": "KB-<run_id>-XSR-007-D-v1",
      "title": "Server Location Register",
      "content": "Server <primary_server> is in Rack <primary_rack>."
    },
    {
      "source_id": "KB-<run_id>-XSR-007-E-v1",
      "title": "Server Location Register (supplementary)",
      "content": "Server <distractor_server> is in Rack <distractor_rack>.",
      "role": "distractor"
    }
  ],
  "questions": ["Where is Project <Name>'s server rack?"],
  "expected_answers": [
    {
      "question_index": 0,
      "correct_answer": "<primary_rack>",
      "required_label": "CROSS_SOURCE",
      "required_source_ids": [
        "KB-<run_id>-XSR-007-A-v1",
        "KB-<run_id>-XSR-007-B-v1",
        "KB-<run_id>-XSR-007-C-v1",
        "KB-<run_id>-XSR-007-D-v1"
      ],
      "must_refuse": false,
      "notes": "Four-hop chain through Docs A-D; Doc E (KB-<run_id>-XSR-007-E-v1) is a distractor and must NOT be cited."
    }
  ]
}
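Before ingesting, a harness could sanity-check the Generator's JSON against the structure requested above. The sketch below is hypothetical: `validate_scenario` is not a documented part of the harness, and requiring Docs A–E in order is an assumption made for simplicity.

```python
DOC_LETTERS = ["A", "B", "C", "D", "E"]


def expected_source_id(run_id, letter):
    """Build a source_id following the KB-<run_id>-XSR-007-<letter>-v1 pattern."""
    return f"KB-{run_id}-XSR-007-{letter}-v1"


def validate_scenario(scenario):
    """Return a list of problems in a Generator output dict (empty list means OK)."""
    problems = []
    run_id = scenario.get("run_id", "")
    want_ids = [expected_source_id(run_id, letter) for letter in DOC_LETTERS]

    docs = scenario.get("documents", [])
    got_ids = [doc.get("source_id") for doc in docs]
    if got_ids != want_ids:  # assumption: Docs A-E must appear in this order
        problems.append("documents must be Docs A-E with the expected source_ids")

    if len(scenario.get("questions", [])) != 1:
        problems.append("expected exactly one question")

    answers = scenario.get("expected_answers", [])
    if len(answers) != 1:
        problems.append("expected exactly one expected_answers entry")
        return problems

    expected = answers[0]
    if expected.get("required_source_ids") != want_ids[:4]:
        problems.append("required_source_ids must list Docs A-D only")

    rack = expected.get("correct_answer", "")
    if not rack:
        problems.append("correct_answer is empty")
    elif got_ids == want_ids:
        if rack not in docs[3].get("content", ""):
            problems.append("correct_answer does not appear in Doc D")
        if rack in docs[4].get("content", ""):
            problems.append("correct_answer also appears in distractor Doc E")
    return problems
```

The last two checks catch a Generator that accidentally reuses the primary rack for the distractor, which would make the distractor test meaningless.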
▶ ② Execution Instructions 531 chars
Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
Do not use any external knowledge or memory.

Question: {{QUESTION}}

Return ONLY this JSON (no markdown, no commentary):
{
  "question": "<the question text>",
  "answer_value": "<your answer or null>",
  "reasoning_type": "<CROSS_SOURCE | DIRECT | UNCERTAIN | INFERRED>",
  "source_ids_used": ["<id1>", "<id2>", "<id3>", "<id4>"],
  "confidence": "<high | medium | low>",
  "explanation": "<one or two sentences describing how you reached the answer>"
}
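Because the tested model is asked to return bare JSON, its reply can be checked against the template above before it reaches the judge. This is a minimal sketch using only the standard `json` module; `parse_exec_answer` is a hypothetical helper, and rejecting malformed replies outright is an assumption about how a harness might behave.

```python
import json

REASONING_TYPES = {"CROSS_SOURCE", "DIRECT", "UNCERTAIN", "INFERRED"}
CONFIDENCE_LEVELS = {"high", "medium", "low"}
REQUIRED_KEYS = {
    "question", "answer_value", "reasoning_type",
    "source_ids_used", "confidence", "explanation",
}


def parse_exec_answer(raw):
    """Parse the model's reply; raise ValueError if it does not fit the template."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["reasoning_type"] not in REASONING_TYPES:
        raise ValueError(f"unknown reasoning_type: {data['reasoning_type']!r}")
    if data["confidence"] not in CONFIDENCE_LEVELS:
        raise ValueError(f"unknown confidence: {data['confidence']!r}")

    ids = data["source_ids_used"]
    if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
        raise ValueError("source_ids_used must be a list of strings")
    return data
```

`answer_value` is deliberately left unchecked here, since the template allows it to be null.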
⚖ ③ Judge Instructions 1267 chars
You are an impartial judge evaluating a Knowledge Brain retrieval test.

Ground truth:
{{SETUP_CONTEXT_JSON}}

Tested AI answer:
{{EXEC_ANSWER_JSON}}

Evaluation criteria:
1. ANSWER_CORRECT        — answer_value matches the primary rack designation (case-insensitive).
2. REASONING_TYPE_OK     — reasoning_type is exactly "CROSS_SOURCE".
3. SOURCES_COMPLETE      — all four required source_ids (A-D) appear in source_ids_used.
4. DISTRACTOR_EXCLUDED   — the forbidden source_id (Doc E) does NOT appear in source_ids_used,
                           and the distractor rack value does NOT appear in answer_value or explanation.
5. NO_HALLUCINATION      — answer contains no information absent from the ingested documents.

Return ONLY this JSON (no markdown, no commentary):
{
  "test_id": "XSR-007",
  "run_id": "<from ground truth>",
  "repetition": <integer>,
  "pass": <true|false>,
  "score": <0.0-1.0>,
  "criteria": {
    "ANSWER_CORRECT":      <true|false>,
    "REASONING_TYPE_OK":   <true|false>,
    "SOURCES_COMPLETE":    <true|false>,
    "DISTRACTOR_EXCLUDED": <true|false>,
    "NO_HALLUCINATION":    <true|false>
  },
  "critical_failure_triggered": <true|false>,
  "critical_failure_reason": "<string or null>",
  "notes": "<brief free-text notes>"
}
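Criteria 1–4 above are mechanical and could be cross-checked in code alongside the judge model. The sketch below is an assumption-laden illustration: `evaluate_criteria` is hypothetical, "matches" is interpreted as equality after trimming and lowercasing, and NO_HALLUCINATION is omitted because it needs a judgement that simple string checks cannot make.

```python
def evaluate_criteria(expected, answer, forbidden_id, distractor_rack):
    """Evaluate criteria 1-4 from the judge prompt; returns a dict of booleans."""
    value = (answer.get("answer_value") or "").strip().lower()
    explanation = (answer.get("explanation") or "").lower()
    used = set(answer.get("source_ids_used") or [])
    distractor = distractor_rack.strip().lower()

    # Criterion 4 has two parts: the forbidden ID and the distractor rack value.
    distractor_excluded = (
        forbidden_id not in used
        and distractor not in value
        and distractor not in explanation
    )
    return {
        "ANSWER_CORRECT": value == expected["correct_answer"].strip().lower(),
        "REASONING_TYPE_OK": answer.get("reasoning_type") == "CROSS_SOURCE",
        "SOURCES_COMPLETE": set(expected["required_source_ids"]) <= used,
        "DISTRACTOR_EXCLUDED": distractor_excluded,
    }
```

Comparing these booleans with the judge's own `criteria` object is one way to spot a judge that is too lenient or too strict.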

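Critical Failure Conditions

- Answer gives the distractor rack (Doc E value) instead of the correct primary rack.
- Doc E source_id appears in source_ids_used.
- Any of the four required source_ids (A-D) is missing from source_ids_used.
- reasoning_type is not CROSS_SOURCE.
- Answer is null or empty when all relevant documents were retrievable.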

Recent Run History

3 runs
When              Run ID                 Pass Rate  Avg Score  Reps
2026-05-24 13:08  20260524T130808Z-kqze  100%       100.0      1/1
2026-05-24 12:41  20260524T124148Z-z2do  0%         0          0/1
2026-05-24 11:37  20260524T113756Z-kduj  0%         0          0/1
📄 Raw YAML cases/cross_source/XSR-007.yaml
schema_version: "1.0"
test_id: "XSR-007"
category: "cross_source"
severity: "critical"
repetitions: 5
reasoning_type: "CROSS_SOURCE"
num_documents: 5
num_questions: 1
tags: [cross_source, four_hop, distractor, project, token, log, server, rack, critical]

setup_instructions: |
  You are a test-data Generator AI.

  Generate one self-consistent test scenario using the following structure:
    - Choose a fictional project name (e.g. "Project Helion").
    - Choose a fictional token code (e.g. "TOK-2291-ZX").
    - Choose a fictional log number (e.g. "LOG-5540").
    - Choose a fictional primary server name (e.g. "SRVR-Koval").
    - Choose a fictional primary rack number (e.g. "Rack-19B").
    - Choose a DIFFERENT fictional server name for the distractor (e.g. "SRVR-Dantec").
    - Choose a DIFFERENT fictional rack number for the distractor (e.g. "Rack-04A").

  Produce exactly five documents:
    Doc A  source_id: KB-{{RUN}}-XSR-007-A-v1
           title: Project Token Index
           content: "Project <Name> uses token <token_code>."

    Doc B  source_id: KB-{{RUN}}-XSR-007-B-v1
           title: Token Audit Log
           content: "Token <token_code> appears in log <log_number>."

    Doc C  source_id: KB-{{RUN}}-XSR-007-C-v1
           title: Log Origin Record
           content: "Log <log_number> was generated by Server <primary_server>."

    Doc D  source_id: KB-{{RUN}}-XSR-007-D-v1
           title: Server Location Register
           content: "Server <primary_server> is in Rack <primary_rack>."

    Doc E  source_id: KB-{{RUN}}-XSR-007-E-v1
           title: Server Location Register (supplementary)
           content: "Server <distractor_server> is in Rack <distractor_rack>."
           note: "DISTRACTOR — must NOT be used in the answer chain."

  Produce exactly one question:
    "Where is Project <Name>'s server rack?"

  Produce the expected answer:
    question_index: 0
    correct_answer: "<primary_rack>"
    required_label: "CROSS_SOURCE"
    required_source_ids: [
      "KB-{{RUN}}-XSR-007-A-v1",
      "KB-{{RUN}}-XSR-007-B-v1",
      "KB-{{RUN}}-XSR-007-C-v1",
      "KB-{{RUN}}-XSR-007-D-v1"
    ]
    must_refuse: false
    notes: |
      forbidden_source_ids: ["KB-{{RUN}}-XSR-007-E-v1"]
      The four-hop chain is: project -> token -> log -> server -> rack (Docs A-D).
      Doc E describes a different server and must not appear in the answer chain.

  Replace {{RUN}} with the actual run_id value.

  Output ONLY this JSON (no markdown, no commentary):
  {
    "run_id": "<string>",
    "test_id": "XSR-007",
    "repetition": <integer 1-5>,
    "documents": [
      {
        "source_id": "KB-<run_id>-XSR-007-A-v1",
        "title": "Project Token Index",
        "content": "Project <Name> uses token <token_code>."
      },
      {
        "source_id": "KB-<run_id>-XSR-007-B-v1",
        "title": "Token Audit Log",
        "content": "Token <token_code> appears in log <log_number>."
      },
      {
        "source_id": "KB-<run_id>-XSR-007-C-v1",
        "title": "Log Origin Record",
        "content": "Log <log_number> was generated by Server <primary_server>."
      },
      {
        "source_id": "KB-<run_id>-XSR-007-D-v1",
        "title": "Server Location Register",
        "content": "Server <primary_server> is in Rack <primary_rack>."
      },
      {
        "source_id": "KB-<run_id>-XSR-007-E-v1",
        "title": "Server Location Register (supplementary)",
        "content": "Server <distractor_server> is in Rack <distractor_rack>.",
        "role": "distractor"
      }
    ],
    "questions": ["Where is Project <Name>'s server rack?"],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "<primary_rack>",
        "required_label": "CROSS_SOURCE",
        "required_source_ids": [
          "KB-<run_id>-XSR-007-A-v1",
          "KB-<run_id>-XSR-007-B-v1",
          "KB-<run_id>-XSR-007-C-v1",
          "KB-<run_id>-XSR-007-D-v1"
        ],
        "must_refuse": false,
        "notes": "Four-hop chain through Docs A-D; Doc E (KB-<run_id>-XSR-007-E-v1) is a distractor and must NOT be cited."
      }
    ]
  }

execution_instructions: |
  Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
  Do not use any external knowledge or memory.

  Question: {{QUESTION}}

  Return ONLY this JSON (no markdown, no commentary):
  {
    "question": "<the question text>",
    "answer_value": "<your answer or null>",
    "reasoning_type": "<CROSS_SOURCE | DIRECT | UNCERTAIN | INFERRED>",
    "source_ids_used": ["<id1>", "<id2>", "<id3>", "<id4>"],
    "confidence": "<high | medium | low>",
    "explanation": "<one or two sentences describing how you reached the answer>"
  }

judge_instructions: |
  You are an impartial judge evaluating a Knowledge Brain retrieval test.

  Ground truth:
  {{SETUP_CONTEXT_JSON}}

  Tested AI answer:
  {{EXEC_ANSWER_JSON}}

  Evaluation criteria:
  1. ANSWER_CORRECT        — answer_value matches the primary rack designation (case-insensitive).
  2. REASONING_TYPE_OK     — reasoning_type is exactly "CROSS_SOURCE".
  3. SOURCES_COMPLETE      — all four required source_ids (A-D) appear in source_ids_used.
  4. DISTRACTOR_EXCLUDED   — the forbidden source_id (Doc E) does NOT appear in source_ids_used,
                             and the distractor rack value does NOT appear in answer_value or explanation.
  5. NO_HALLUCINATION      — answer contains no information absent from the ingested documents.

  Return ONLY this JSON (no markdown, no commentary):
  {
    "test_id": "XSR-007",
    "run_id": "<from ground truth>",
    "repetition": <integer>,
    "pass": <true|false>,
    "score": <0.0-1.0>,
    "criteria": {
      "ANSWER_CORRECT":      <true|false>,
      "REASONING_TYPE_OK":   <true|false>,
      "SOURCES_COMPLETE":    <true|false>,
      "DISTRACTOR_EXCLUDED": <true|false>,
      "NO_HALLUCINATION":    <true|false>
    },
    "critical_failure_triggered": <true|false>,
    "critical_failure_reason": "<string or null>",
    "notes": "<brief free-text notes>"
  }

critical_failures:
  - "Answer gives the distractor rack (Doc E value) instead of the correct primary rack."
  - "Doc E source_id appears in source_ids_used."
  - "Any of the four required source_ids (A-D) is missing from source_ids_used."
  - "reasoning_type is not CROSS_SOURCE."
  - "Answer is null or empty when all relevant documents were retrievable."
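The `critical_failures` list above can be turned into an ordered check. This is a hedged sketch, not the harness's logic: `critical_failure_reason` is a hypothetical helper, the check order is arbitrary, and the `docs_retrievable` flag is an assumption standing in for whatever signal tells the harness that retrieval was possible.

```python
def critical_failure_reason(answer, distractor_rack, required_ids,
                            forbidden_id, docs_retrievable=True):
    """Return the first matching critical failure description, or None."""
    value = (answer.get("answer_value") or "").strip().lower()
    used = set(answer.get("source_ids_used") or [])

    if not value and docs_retrievable:
        return "Answer is null or empty when all relevant documents were retrievable."
    if value == distractor_rack.strip().lower():
        return "Answer gives the distractor rack (Doc E value) instead of the correct primary rack."
    if forbidden_id in used:
        return "Doc E source_id appears in source_ids_used."
    if not set(required_ids) <= used:
        return "Any of the four required source_ids (A-D) is missing from source_ids_used."
    if answer.get("reasoning_type") != "CROSS_SOURCE":
        return "reasoning_type is not CROSS_SOURCE."
    return None
```

The returned string maps naturally onto the judge's `critical_failure_reason` field, with `None` corresponding to `critical_failure_triggered: false`.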