← All Test Cases

XSR-006

cross source

critical
Repetitions
5
Documents
2
Questions
1
Reasoning
UNCERTAIN
cross_source uncertain witness vehicle identity color_match_insufficient critical

📖 In Plain English

What this category tests

Can the brain answer questions that require combining information from 2+ documents?

How the test works

Documents form chains (e.g. Doc A: 'X9 is the velvet pouch'; Doc B: 'velvet pouch contains a silver locket'). The test asks 'what does X9 contain?' which requires Doc A → Doc B reasoning.

Why it matters

Real questions often span multiple documents. Single-doc retrieval is not enough.

⚙️ How a single rep runs

① Generate
Model creates 2 synthetic documents and 1 question with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the 2 documents
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges the answer against ground truth (the document it generated in phase 1)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 2125 chars
You are a test-data Generator AI.

Generate one self-consistent test scenario using the following structure:
  - Choose a distinct color shared by both documents (e.g. "slate-grey").
  - Choose a fictional vehicle type (e.g. "panel van").

Produce exactly two documents:
  Doc A  source_id: KB-{{RUN}}-XSR-006-A-v1
         title: Witness Statement
         content: "The witness saw a <color> <vehicle_type>."

  Doc B  source_id: KB-{{RUN}}-XSR-006-B-v1
         title: Delivery Record
         content: "The delivery vehicle was <color>."

Produce exactly one question:
  "Was the <vehicle_type> seen by the witness the delivery vehicle?"

Produce the expected answer:
  question_index: 0
  correct_answer: "UNCERTAIN"
  required_label: "UNCERTAIN"
  required_source_ids: [
    "KB-{{RUN}}-XSR-006-A-v1",
    "KB-{{RUN}}-XSR-006-B-v1"
  ]
  must_refuse: false
  notes: |
    Both documents describe a <color> vehicle, but color similarity alone is
    insufficient to confirm the two vehicles are identical. Multiple vehicles
    of the same color may exist. The question cannot be answered with certainty
    from these documents.

Replace {{RUN}} with the actual run_id value.

Output ONLY this JSON (no markdown, no commentary):
{
  "run_id": "<string>",
  "test_id": "XSR-006",
  "repetition": <integer 1-5>,
  "documents": [
    {
      "source_id": "KB-<run_id>-XSR-006-A-v1",
      "title": "Witness Statement",
      "content": "The witness saw a <color> <vehicle_type>."
    },
    {
      "source_id": "KB-<run_id>-XSR-006-B-v1",
      "title": "Delivery Record",
      "content": "The delivery vehicle was <color>."
    }
  ],
  "questions": ["Was the <vehicle_type> seen by the witness the delivery vehicle?"],
  "expected_answers": [
    {
      "question_index": 0,
      "correct_answer": "UNCERTAIN",
      "required_label": "UNCERTAIN",
      "required_source_ids": [
        "KB-<run_id>-XSR-006-A-v1",
        "KB-<run_id>-XSR-006-B-v1"
      ],
      "must_refuse": false,
      "notes": "Color match is necessary but not sufficient for identity; cannot confirm they are the same vehicle."
    }
  ]
}
▶ ② Execution Instructions 515 chars
Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
Do not use any external knowledge or memory.

Question: {{QUESTION}}

Return ONLY this JSON (no markdown, no commentary):
{
  "question": "<the question text>",
  "answer_value": "<your answer or UNCERTAIN>",
  "reasoning_type": "<CROSS_SOURCE | DIRECT | UNCERTAIN | INFERRED>",
  "source_ids_used": ["<id1>", "<id2>"],
  "confidence": "<high | medium | low>",
  "explanation": "<one or two sentences explaining your level of certainty>"
}
⚖ ③ Judge Instructions 1279 chars
You are an impartial judge evaluating a Knowledge Brain retrieval test.

Ground truth:
{{SETUP_CONTEXT_JSON}}

Tested AI answer:
{{EXEC_ANSWER_JSON}}

Evaluation criteria:
1. UNCERTAINTY_RECOGNIZED — answer_value is "UNCERTAIN" or the explanation clearly
                            states the identity cannot be confirmed from these documents.
2. REASONING_TYPE_OK      — reasoning_type is "UNCERTAIN" (not CROSS_SOURCE, DIRECT, or INFERRED).
3. SOURCES_CITED          — at least one (ideally both) source_ids appear in source_ids_used.
4. NO_FALSE_CONFIRMATION  — answer does NOT assert the two vehicles are the same entity.
5. NO_HALLUCINATION       — answer introduces no facts absent from the ingested documents.

Return ONLY this JSON (no markdown, no commentary):
{
  "test_id": "XSR-006",
  "run_id": "<from ground truth>",
  "repetition": <integer>,
  "pass": <true|false>,
  "score": <0.0-1.0>,
  "criteria": {
    "UNCERTAINTY_RECOGNIZED": <true|false>,
    "REASONING_TYPE_OK":      <true|false>,
    "SOURCES_CITED":          <true|false>,
    "NO_FALSE_CONFIRMATION":  <true|false>,
    "NO_HALLUCINATION":       <true|false>
  },
  "critical_failure_triggered": <true|false>,
  "critical_failure_reason": "<string or null>",
  "notes": "<brief free-text notes>"
}

Critical Failure Conditions

Recent Run History

3 runs
When Run ID Pass Rate Avg Score Reps
2026-05-24 13:08 20260524T130808Z-kqze 0% 52.0 0/1 View →
2026-05-24 12:41 20260524T124148Z-z2do 0% 0 0/1 View →
2026-05-24 11:37 20260524T113756Z-kduj 0% 0 0/1 View →
📄 Raw YAML cases/cross_source/XSR-006.yaml
schema_version: "1.0"
test_id: "XSR-006"
category: "cross_source"
severity: "critical"
repetitions: 5
reasoning_type: "UNCERTAIN"
num_documents: 2
num_questions: 1
tags: [cross_source, uncertain, witness, vehicle, identity, color_match_insufficient, critical]

setup_instructions: |
  You are a test-data Generator AI.

  Generate one self-consistent test scenario using the following structure:
    - Choose a distinct color shared by both documents (e.g. "slate-grey").
    - Choose a fictional vehicle type (e.g. "panel van").

  Produce exactly two documents:
    Doc A  source_id: KB-{{RUN}}-XSR-006-A-v1
           title: Witness Statement
           content: "The witness saw a <color> <vehicle_type>."

    Doc B  source_id: KB-{{RUN}}-XSR-006-B-v1
           title: Delivery Record
           content: "The delivery vehicle was <color>."

  Produce exactly one question:
    "Was the <vehicle_type> seen by the witness the delivery vehicle?"

  Produce the expected answer:
    question_index: 0
    correct_answer: "UNCERTAIN"
    required_label: "UNCERTAIN"
    required_source_ids: [
      "KB-{{RUN}}-XSR-006-A-v1",
      "KB-{{RUN}}-XSR-006-B-v1"
    ]
    must_refuse: false
    notes: |
      Both documents describe a <color> vehicle, but color similarity alone is
      insufficient to confirm the two vehicles are identical. Multiple vehicles
      of the same color may exist. The question cannot be answered with certainty
      from these documents.

  Replace {{RUN}} with the actual run_id value.

  Output ONLY this JSON (no markdown, no commentary):
  {
    "run_id": "<string>",
    "test_id": "XSR-006",
    "repetition": <integer 1-5>,
    "documents": [
      {
        "source_id": "KB-<run_id>-XSR-006-A-v1",
        "title": "Witness Statement",
        "content": "The witness saw a <color> <vehicle_type>."
      },
      {
        "source_id": "KB-<run_id>-XSR-006-B-v1",
        "title": "Delivery Record",
        "content": "The delivery vehicle was <color>."
      }
    ],
    "questions": ["Was the <vehicle_type> seen by the witness the delivery vehicle?"],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "UNCERTAIN",
        "required_label": "UNCERTAIN",
        "required_source_ids": [
          "KB-<run_id>-XSR-006-A-v1",
          "KB-<run_id>-XSR-006-B-v1"
        ],
        "must_refuse": false,
        "notes": "Color match is necessary but not sufficient for identity; cannot confirm they are the same vehicle."
      }
    ]
  }

execution_instructions: |
  Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
  Do not use any external knowledge or memory.

  Question: {{QUESTION}}

  Return ONLY this JSON (no markdown, no commentary):
  {
    "question": "<the question text>",
    "answer_value": "<your answer or UNCERTAIN>",
    "reasoning_type": "<CROSS_SOURCE | DIRECT | UNCERTAIN | INFERRED>",
    "source_ids_used": ["<id1>", "<id2>"],
    "confidence": "<high | medium | low>",
    "explanation": "<one or two sentences explaining your level of certainty>"
  }

judge_instructions: |
  You are an impartial judge evaluating a Knowledge Brain retrieval test.

  Ground truth:
  {{SETUP_CONTEXT_JSON}}

  Tested AI answer:
  {{EXEC_ANSWER_JSON}}

  Evaluation criteria:
  1. UNCERTAINTY_RECOGNIZED — answer_value is "UNCERTAIN" or the explanation clearly
                              states the identity cannot be confirmed from these documents.
  2. REASONING_TYPE_OK      — reasoning_type is "UNCERTAIN" (not CROSS_SOURCE, DIRECT, or INFERRED).
  3. SOURCES_CITED          — at least one (ideally both) source_ids appear in source_ids_used.
  4. NO_FALSE_CONFIRMATION  — answer does NOT assert the two vehicles are the same entity.
  5. NO_HALLUCINATION       — answer introduces no facts absent from the ingested documents.

  Return ONLY this JSON (no markdown, no commentary):
  {
    "test_id": "XSR-006",
    "run_id": "<from ground truth>",
    "repetition": <integer>,
    "pass": <true|false>,
    "score": <0.0-1.0>,
    "criteria": {
      "UNCERTAINTY_RECOGNIZED": <true|false>,
      "REASONING_TYPE_OK":      <true|false>,
      "SOURCES_CITED":          <true|false>,
      "NO_FALSE_CONFIRMATION":  <true|false>,
      "NO_HALLUCINATION":       <true|false>
    },
    "critical_failure_triggered": <true|false>,
    "critical_failure_reason": "<string or null>",
    "notes": "<brief free-text notes>"
  }

critical_failures:
  - "Answer confirms the two vehicles are the same entity based solely on color match."
  - "reasoning_type is CROSS_SOURCE, implying certainty that is not warranted."
  - "reasoning_type is DIRECT."
  - "Answer introduces vehicle identification details not present in any document."