← All Test Cases

XSR-006

cross source

critical

Repetitions

Documents

Questions

Reasoning

UNCERTAIN

cross_source uncertain witness vehicle identity color_match_insufficient critical

📖 In Plain English

What this category tests

Can the brain answer questions that require combining information from 2+ documents?

How the test works

Documents form chains (e.g. Doc A: 'X9 is the velvet pouch'; Doc B: 'velvet pouch contains a silver locket'). The test asks 'what does X9 contain?' which requires Doc A → Doc B reasoning.

Why it matters

Real questions often span multiple documents. Single-doc retrieval is not enough.

⚙️ How a single rep runs

① Generate

Model creates 2 synthetic documents and 1 question with unique canary tokens

→ Fresh content per run prevents memorization and proves real retrieval

② Ingest (MCP)

Model calls brain_ingest to store the 2 documents

→ Tests the brain's storage and indexing pipeline

③ Query (MCP)

Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)

→ Core test — does the brain return correct evidence and let the model build a faithful answer?

④ Evaluate

Model judges the answer against ground truth (the document it generated in phase 1)

→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 2125 chars

You are a test-data Generator AI.

Generate one self-consistent test scenario using the following structure:
  - Choose a distinct color shared by both documents (e.g. "slate-grey").
  - Choose a fictional vehicle type (e.g. "panel van").

Produce exactly two documents:
  Doc A  source_id: KB-{{RUN}}-XSR-006-A-v1
         title: Witness Statement
         content: "The witness saw a <color> <vehicle_type>."

  Doc B  source_id: KB-{{RUN}}-XSR-006-B-v1
         title: Delivery Record
         content: "The delivery vehicle was <color>."

Produce exactly one question:
  "Was the <vehicle_type> seen by the witness the delivery vehicle?"

Produce the expected answer:
  question_index: 0
  correct_answer: "UNCERTAIN"
  required_label: "UNCERTAIN"
  required_source_ids: [
    "KB-{{RUN}}-XSR-006-A-v1",
    "KB-{{RUN}}-XSR-006-B-v1"
  ]
  must_refuse: false
  notes: |
    Both documents describe a <color> vehicle, but color similarity alone is
    insufficient to confirm the two vehicles are identical. Multiple vehicles
    of the same color may exist. The question cannot be answered with certainty
    from these documents.

Replace {{RUN}} with the actual run_id value.

Output ONLY this JSON (no markdown, no commentary):
{
  "run_id": "<string>",
  "test_id": "XSR-006",
  "repetition": <integer 1-5>,
  "documents": [
    {
      "source_id": "KB-<run_id>-XSR-006-A-v1",
      "title": "Witness Statement",
      "content": "The witness saw a <color> <vehicle_type>."
    },
    {
      "source_id": "KB-<run_id>-XSR-006-B-v1",
      "title": "Delivery Record",
      "content": "The delivery vehicle was <color>."
    }
  ],
  "questions": ["Was the <vehicle_type> seen by the witness the delivery vehicle?"],
  "expected_answers": [
    {
      "question_index": 0,
      "correct_answer": "UNCERTAIN",
      "required_label": "UNCERTAIN",
      "required_source_ids": [
        "KB-<run_id>-XSR-006-A-v1",
        "KB-<run_id>-XSR-006-B-v1"
      ],
      "must_refuse": false,
      "notes": "Color match is necessary but not sufficient for identity; cannot confirm they are the same vehicle."
    }
  ]
}

▶ ② Execution Instructions 515 chars

Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
Do not use any external knowledge or memory.

Question: {{QUESTION}}

Return ONLY this JSON (no markdown, no commentary):
{
  "question": "<the question text>",
  "answer_value": "<your answer or UNCERTAIN>",
  "reasoning_type": "<CROSS_SOURCE | DIRECT | UNCERTAIN | INFERRED>",
  "source_ids_used": ["<id1>", "<id2>"],
  "confidence": "<high | medium | low>",
  "explanation": "<one or two sentences explaining your level of certainty>"
}

⚖ ③ Judge Instructions 1279 chars

You are an impartial judge evaluating a Knowledge Brain retrieval test.

Ground truth:
{{SETUP_CONTEXT_JSON}}

Tested AI answer:
{{EXEC_ANSWER_JSON}}

Evaluation criteria:
1. UNCERTAINTY_RECOGNIZED — answer_value is "UNCERTAIN" or the explanation clearly
                            states the identity cannot be confirmed from these documents.
2. REASONING_TYPE_OK      — reasoning_type is "UNCERTAIN" (not CROSS_SOURCE, DIRECT, or INFERRED).
3. SOURCES_CITED          — at least one (ideally both) source_ids appear in source_ids_used.
4. NO_FALSE_CONFIRMATION  — answer does NOT assert the two vehicles are the same entity.
5. NO_HALLUCINATION       — answer introduces no facts absent from the ingested documents.

Return ONLY this JSON (no markdown, no commentary):
{
  "test_id": "XSR-006",
  "run_id": "<from ground truth>",
  "repetition": <integer>,
  "pass": <true|false>,
  "score": <0.0-1.0>,
  "criteria": {
    "UNCERTAINTY_RECOGNIZED": <true|false>,
    "REASONING_TYPE_OK":      <true|false>,
    "SOURCES_CITED":          <true|false>,
    "NO_FALSE_CONFIRMATION":  <true|false>,
    "NO_HALLUCINATION":       <true|false>
  },
  "critical_failure_triggered": <true|false>,
  "critical_failure_reason": "<string or null>",
  "notes": "<brief free-text notes>"
}

Critical Failure Conditions

Answer confirms the two vehicles are the same entity based solely on color match.
reasoning_type is CROSS_SOURCE, implying certainty that is not warranted.
reasoning_type is DIRECT.
Answer introduces vehicle identification details not present in any document.

Recent Run History

3 runs

When	Run ID	Pass Rate	Avg Score	Reps
2026-05-24 13:08	20260524T130808Z-kqze	0%	52.0	0/1	View →
2026-05-24 12:41	20260524T124148Z-z2do	0%	0	0/1	View →
2026-05-24 11:37	20260524T113756Z-kduj	0%	0	0/1	View →

📄 Raw YAML cases/cross_source/XSR-006.yaml

schema_version: "1.0"
test_id: "XSR-006"
category: "cross_source"
severity: "critical"
repetitions: 5
reasoning_type: "UNCERTAIN"
num_documents: 2
num_questions: 1
tags: [cross_source, uncertain, witness, vehicle, identity, color_match_insufficient, critical]

setup_instructions: |
  You are a test-data Generator AI.

  Generate one self-consistent test scenario using the following structure:
    - Choose a distinct color shared by both documents (e.g. "slate-grey").
    - Choose a fictional vehicle type (e.g. "panel van").

  Produce exactly two documents:
    Doc A  source_id: KB-{{RUN}}-XSR-006-A-v1
           title: Witness Statement
           content: "The witness saw a <color> <vehicle_type>."

    Doc B  source_id: KB-{{RUN}}-XSR-006-B-v1
           title: Delivery Record
           content: "The delivery vehicle was <color>."

  Produce exactly one question:
    "Was the <vehicle_type> seen by the witness the delivery vehicle?"

  Produce the expected answer:
    question_index: 0
    correct_answer: "UNCERTAIN"
    required_label: "UNCERTAIN"
    required_source_ids: [
      "KB-{{RUN}}-XSR-006-A-v1",
      "KB-{{RUN}}-XSR-006-B-v1"
    ]
    must_refuse: false
    notes: |
      Both documents describe a <color> vehicle, but color similarity alone is
      insufficient to confirm the two vehicles are identical. Multiple vehicles
      of the same color may exist. The question cannot be answered with certainty
      from these documents.

  Replace {{RUN}} with the actual run_id value.

  Output ONLY this JSON (no markdown, no commentary):
  {
    "run_id": "<string>",
    "test_id": "XSR-006",
    "repetition": <integer 1-5>,
    "documents": [
      {
        "source_id": "KB-<run_id>-XSR-006-A-v1",
        "title": "Witness Statement",
        "content": "The witness saw a <color> <vehicle_type>."
      },
      {
        "source_id": "KB-<run_id>-XSR-006-B-v1",
        "title": "Delivery Record",
        "content": "The delivery vehicle was <color>."
      }
    ],
    "questions": ["Was the <vehicle_type> seen by the witness the delivery vehicle?"],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "UNCERTAIN",
        "required_label": "UNCERTAIN",
        "required_source_ids": [
          "KB-<run_id>-XSR-006-A-v1",
          "KB-<run_id>-XSR-006-B-v1"
        ],
        "must_refuse": false,
        "notes": "Color match is necessary but not sufficient for identity; cannot confirm they are the same vehicle."
      }
    ]
  }

execution_instructions: |
  Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
  Do not use any external knowledge or memory.

  Question: {{QUESTION}}

  Return ONLY this JSON (no markdown, no commentary):
  {
    "question": "<the question text>",
    "answer_value": "<your answer or UNCERTAIN>",
    "reasoning_type": "<CROSS_SOURCE | DIRECT | UNCERTAIN | INFERRED>",
    "source_ids_used": ["<id1>", "<id2>"],
    "confidence": "<high | medium | low>",
    "explanation": "<one or two sentences explaining your level of certainty>"
  }

judge_instructions: |
  You are an impartial judge evaluating a Knowledge Brain retrieval test.

  Ground truth:
  {{SETUP_CONTEXT_JSON}}

  Tested AI answer:
  {{EXEC_ANSWER_JSON}}

  Evaluation criteria:
  1. UNCERTAINTY_RECOGNIZED — answer_value is "UNCERTAIN" or the explanation clearly
                              states the identity cannot be confirmed from these documents.
  2. REASONING_TYPE_OK      — reasoning_type is "UNCERTAIN" (not CROSS_SOURCE, DIRECT, or INFERRED).
  3. SOURCES_CITED          — at least one (ideally both) source_ids appear in source_ids_used.
  4. NO_FALSE_CONFIRMATION  — answer does NOT assert the two vehicles are the same entity.
  5. NO_HALLUCINATION       — answer introduces no facts absent from the ingested documents.

  Return ONLY this JSON (no markdown, no commentary):
  {
    "test_id": "XSR-006",
    "run_id": "<from ground truth>",
    "repetition": <integer>,
    "pass": <true|false>,
    "score": <0.0-1.0>,
    "criteria": {
      "UNCERTAINTY_RECOGNIZED": <true|false>,
      "REASONING_TYPE_OK":      <true|false>,
      "SOURCES_CITED":          <true|false>,
      "NO_FALSE_CONFIRMATION":  <true|false>,
      "NO_HALLUCINATION":       <true|false>
    },
    "critical_failure_triggered": <true|false>,
    "critical_failure_reason": "<string or null>",
    "notes": "<brief free-text notes>"
  }

critical_failures:
  - "Answer confirms the two vehicles are the same entity based solely on color match."
  - "reasoning_type is CROSS_SOURCE, implying certainty that is not warranted."
  - "reasoning_type is DIRECT."
  - "Answer introduces vehicle identification details not present in any document."