
DVI-004

Category: direct vs inferred

Severity: high

Repetitions: 5

Documents: 1

Questions: 1

Reasoning: INFERRED

Tags: inferred, temporal-chain, time-ordering, event-sequencing, fictional-names

📖 In Plain English

What this category tests

Does the brain label claims correctly: DIRECT for explicitly stated text, INFERRED for derived conclusions, UNCERTAIN for ambiguity?

How the test works

Each document contains text that is directly answerable, requires inference, or includes a 'converse fallacy' trap. The test checks that the label on the answer matches the claim's actual epistemic status.

Why it matters

Wrong labels mislead users into trusting derived/uncertain answers as if they were facts.

⚙️ How a single rep runs

① Generate

Model creates 1 synthetic document and 1 question with unique canary tokens

→ Fresh content per run prevents memorization and proves real retrieval

② Ingest (MCP)

Model calls brain_ingest to store the 1 document

→ Tests the brain's storage and indexing pipeline

③ Query (MCP)

Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)

→ Core test: does the brain return correct evidence and let the model build a faithful answer?

④ Evaluate

Model judges the answer against ground truth (the document it generated in phase 1)

→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.
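The pass rule above can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the names `RepResult`, `rep_passed`, and `summarize` are hypothetical, the score is assumed to already be on the 0–100 scale, and the "Reps" value is assumed to mean passed/total.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 85.0  # from the text: "score >= 85"


@dataclass
class RepResult:
    score: float            # assumed to be on the 0-100 scale
    critical_failure: bool  # True if any critical failure condition fired


def rep_passed(result: RepResult) -> bool:
    """A rep passes only with score >= 85 and no critical failure."""
    return result.score >= PASS_THRESHOLD and not result.critical_failure


def summarize(results: list[RepResult]) -> dict:
    """Aggregate reps into a pass rate, an average score, and a passed/total string."""
    if not results:
        raise ValueError("at least one rep result is required")
    passed = sum(rep_passed(r) for r in results)
    return {
        "pass_rate": passed / len(results),
        "avg_score": sum(r.score for r in results) / len(results),
        "reps": f"{passed}/{len(results)}",  # assumption: passed/total
    }
```

Note that a high score does not rescue a rep: `RepResult(90.0, True)` fails because the critical-failure flag overrides the threshold.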

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 1731 chars

You are a Generator AI. Create one synthetic document and one question with expected answers.

Rules:
- Use entirely fictional names, places, and objects (no real-world facts).
- The document must contain exactly two sentences:
    Sentence 1: "<Person> saw the <object> before the alarm."
    Sentence 2: "The alarm rang at <time>."
  where <Person> is a fictional name (e.g. "Dara Voss"),
  <object> is a fictional item (e.g. "obsidian shard"),
  and <time> is a specific fictional clock time (e.g. "the ninth bell").
  Vary all three on each repetition.
- The question must ask: "Did <Person> see the <object> before <time>?"
- The expected answer must be "Yes" and the reasoning_type must be INFERRED.
  Explanation: the document never directly states the person saw the object
  before <time>; it requires chaining "before the alarm" + "alarm rang at <time>".
- Do NOT include a sentence that directly links the person's sighting to the time.

Output ONLY this JSON (no markdown, no extra keys):
{
  "run_id": "{{RUN}}",
  "test_id": "DVI-004",
  "repetition": {{REP}},
  "documents": [
    {
      "source_id": "KB-{{RUN}}-DVI-004-A-v1",
      "content": "<sentence 1> <sentence 2>"
    }
  ],
  "questions": [
    "Did <Person> see the <object> before <time>?"
  ],
  "expected_answers": [
    {
      "question_index": 0,
      "correct_answer": "Yes",
      "required_label": "INFERRED",
      "required_source_ids": ["KB-{{RUN}}-DVI-004-A-v1"],
      "must_refuse": false,
      "notes": "The document states the person saw the object before the alarm, and that the alarm rang at the given time. The temporal conclusion requires chaining both sentences; the direct link to the specific time is never stated."
    }
  ]
}
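The generator rules above are strict enough to check mechanically before a case is ingested. The sketch below is one possible pre-flight check, assuming the two sentences follow the templates literally; `check_generated_case` and the regular expressions are hypothetical and not part of the test harness.

```python
import re

# Patterns mirror the two sentence templates in the setup instructions.
SENTENCE_1 = re.compile(r"^(?P<person>.+?) saw the (?P<object>.+?) before the alarm\.$")
SENTENCE_2 = re.compile(r"^The alarm rang at (?P<time>.+?)\.$")


def check_generated_case(content: str, question: str) -> list[str]:
    """Return a list of rule violations; an empty list means the case looks valid."""
    # Naive split on a period followed by whitespace. This would mis-split
    # names containing periods (e.g. "Dr. X"), which the templates do not use.
    sentences = re.split(r"(?<=\.)\s+", content.strip())
    if len(sentences) != 2:
        return [f"expected exactly 2 sentences, found {len(sentences)}"]

    problems = []
    m1 = SENTENCE_1.match(sentences[0])
    m2 = SENTENCE_2.match(sentences[1])
    if not m1:
        problems.append("sentence 1 does not match '<Person> saw the <object> before the alarm.'")
    if not m2:
        problems.append("sentence 2 does not match 'The alarm rang at <time>.'")
    if m1 and m2:
        expected = f"Did {m1['person']} see the {m1['object']} before {m2['time']}?"
        if question != expected:
            problems.append(f"question should be {expected!r}")
    return problems
```

A third sentence that links the sighting directly to the time is rejected by the sentence count alone, which is exactly the leak the last rule forbids.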

▶ ② Execution Instructions 497 chars

Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
Do not use any external knowledge. Base your answer solely on retrieved documents.

Question: {{QUESTION}}

Return ONLY this JSON (no markdown, no extra keys):
{
  "question_id": "Q1",
  "answer": "<Yes|No|Unknown>",
  "reasoning_type": "<DIRECT|INFERRED|UNCERTAIN>",
  "source_ids": ["<source_id of each document used>"],
  "explanation": "<one or two sentences describing the temporal chain used to reach the answer>"
}
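Because the reply must be bare JSON with a fixed set of keys, its shape can be validated before any judging happens. The following is a rough sketch under that assumption; `parse_answer` and the constant names are hypothetical, and the case-insensitive answer check is borrowed from the judge criteria rather than stated in the execution prompt.

```python
import json

REQUIRED_KEYS = {"question_id", "answer", "reasoning_type", "source_ids", "explanation"}
ANSWERS = {"yes", "no", "unknown"}            # compared case-insensitively
LABELS = {"DIRECT", "INFERRED", "UNCERTAIN"}  # compared exactly


def parse_answer(raw: str) -> dict:
    """Parse the model's reply and reject anything that breaks the stated format."""
    # json.loads raises json.JSONDecodeError (a ValueError subclass) when the
    # reply is wrapped in markdown fences or surrounded by prose.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if set(data) != REQUIRED_KEYS:
        raise ValueError(f"unexpected or missing keys: {sorted(set(data) ^ REQUIRED_KEYS)}")
    if str(data["answer"]).lower() not in ANSWERS:
        raise ValueError("answer must be Yes, No, or Unknown")
    if data["reasoning_type"] not in LABELS:
        raise ValueError("reasoning_type must be DIRECT, INFERRED, or UNCERTAIN")
    ids = data["source_ids"]
    if not (isinstance(ids, list) and all(isinstance(s, str) for s in ids)):
        raise ValueError("source_ids must be a list of strings")
    return data
```

Failing fast on shape keeps format errors separate from the labeling errors this test case is meant to measure.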

⚖ ③ Judge Instructions 970 chars

You are a Judge AI evaluating whether a tested AI answered correctly.

Ground truth:
{{SETUP_CONTEXT_JSON}}

Tested AI answer:
{{EXEC_ANSWER_JSON}}

Evaluation criteria:
1. The answer field must be "Yes" (case-insensitive).
2. The reasoning_type must be exactly "INFERRED".
3. Labeling this as "DIRECT" is a critical failure: the document never explicitly
   links the sighting to the specific time; a temporal chain across two sentences
   is required.
4. Returning a wrong time value (not the one from the document) is a critical failure.
5. Answering "No" or "Unknown" is a critical failure.
6. The explanation should reference both sentences of the temporal chain.

Return ONLY this JSON (no markdown, no extra keys):
{
  "test_id": "DVI-004",
  "question_id": "Q1",
  "passed": true | false,
  "critical_failure": true | false,
  "failure_reason": "<null if passed, otherwise short description>",
  "score": 0.0 to 1.0,
  "notes": "<optional extra observations>"
}

Critical Failure Conditions

reasoning_type is DIRECT (the time-linked conclusion is never stated directly; temporal chaining across two sentences is required)
wrong time value returned (does not match the alarm time in the document)
answer is No or Unknown when the temporal chain supports Yes
answer drawn from hallucination rather than retrieved document
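The conditions that depend only on structured fields (the answer value and the reasoning label) can be checked deterministically. The sketch below covers just those; it is an illustration, not the Judge AI's logic. `judge_structured_fields` is a hypothetical name, and the wrong-time and hallucination conditions are deliberately left out because they require reading the free-text explanation and the retrieval trace.

```python
def judge_structured_fields(answer: dict) -> dict:
    """Apply the answer and label checks; free-text checks are out of scope here."""
    failures = []
    critical = False

    ans = str(answer.get("answer", "")).strip().lower()
    label = answer.get("reasoning_type")

    # The answer must be "Yes" (case-insensitive); "No" or "Unknown" is critical.
    if ans != "yes":
        failures.append(f"answer is {answer.get('answer')!r}, expected 'Yes'")
        if ans in ("no", "unknown"):
            critical = True

    # The label must be exactly "INFERRED"; "DIRECT" is critical.
    if label != "INFERRED":
        failures.append(f"reasoning_type is {label!r}, expected 'INFERRED'")
        if label == "DIRECT":
            critical = True

    return {
        "test_id": "DVI-004",
        "question_id": answer.get("question_id"),
        "passed": not failures,
        "critical_failure": critical,
        "failure_reason": "; ".join(failures) or None,
    }
```

In this sketch a correct "Yes" labeled UNCERTAIN fails without being critical, since only the DIRECT label appears in the critical failure list.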

Recent Run History

3 runs

When	Run ID	Pass Rate	Avg Score	Reps
2026-05-24 13:08	20260524T130808Z-kqze	100%	97.0	1/1
2026-05-24 12:41	20260524T124148Z-z2do	100%	100.0	1/1
2026-05-24 11:37	20260524T113756Z-kduj	0%	0	0/1

📄 Raw YAML cases/direct_vs_inferred/DVI-004.yaml

schema_version: "1.0"
test_id: "DVI-004"
category: "direct_vs_inferred"
severity: "high"
repetitions: 5
reasoning_type: "INFERRED"
num_documents: 1
num_questions: 1
tags: [inferred, temporal-chain, time-ordering, event-sequencing, fictional-names]

setup_instructions: |
  You are a Generator AI. Create one synthetic document and one question with expected answers.

  Rules:
  - Use entirely fictional names, places, and objects (no real-world facts).
  - The document must contain exactly two sentences:
      Sentence 1: "<Person> saw the <object> before the alarm."
      Sentence 2: "The alarm rang at <time>."
    where <Person> is a fictional name (e.g. "Dara Voss"),
    <object> is a fictional item (e.g. "obsidian shard"),
    and <time> is a specific fictional clock time (e.g. "the ninth bell").
    Vary all three on each repetition.
  - The question must ask: "Did <Person> see the <object> before <time>?"
  - The expected answer must be "Yes" and the reasoning_type must be INFERRED.
    Explanation: the document never directly states the person saw the object
    before <time>; it requires chaining "before the alarm" + "alarm rang at <time>".
  - Do NOT include a sentence that directly links the person's sighting to the time.

  Output ONLY this JSON (no markdown, no extra keys):
  {
    "run_id": "{{RUN}}",
    "test_id": "DVI-004",
    "repetition": {{REP}},
    "documents": [
      {
        "source_id": "KB-{{RUN}}-DVI-004-A-v1",
        "content": "<sentence 1> <sentence 2>"
      }
    ],
    "questions": [
      "Did <Person> see the <object> before <time>?"
    ],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "Yes",
        "required_label": "INFERRED",
        "required_source_ids": ["KB-{{RUN}}-DVI-004-A-v1"],
        "must_refuse": false,
        "notes": "The document states the person saw the object before the alarm, and that the alarm rang at the given time. The temporal conclusion requires chaining both sentences; the direct link to the specific time is never stated."
      }
    ]
  }

execution_instructions: |
  Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
  Do not use any external knowledge. Base your answer solely on retrieved documents.

  Question: {{QUESTION}}

  Return ONLY this JSON (no markdown, no extra keys):
  {
    "question_id": "Q1",
    "answer": "<Yes|No|Unknown>",
    "reasoning_type": "<DIRECT|INFERRED|UNCERTAIN>",
    "source_ids": ["<source_id of each document used>"],
    "explanation": "<one or two sentences describing the temporal chain used to reach the answer>"
  }

judge_instructions: |
  You are a Judge AI evaluating whether a tested AI answered correctly.

  Ground truth:
  {{SETUP_CONTEXT_JSON}}

  Tested AI answer:
  {{EXEC_ANSWER_JSON}}

  Evaluation criteria:
  1. The answer field must be "Yes" (case-insensitive).
  2. The reasoning_type must be exactly "INFERRED".
  3. Labeling this as "DIRECT" is a critical failure: the document never explicitly
     links the sighting to the specific time; a temporal chain across two sentences
     is required.
  4. Returning a wrong time value (not the one from the document) is a critical failure.
  5. Answering "No" or "Unknown" is a critical failure.
  6. The explanation should reference both sentences of the temporal chain.

  Return ONLY this JSON (no markdown, no extra keys):
  {
    "test_id": "DVI-004",
    "question_id": "Q1",
    "passed": true | false,
    "critical_failure": true | false,
    "failure_reason": "<null if passed, otherwise short description>",
    "score": 0.0 to 1.0,
    "notes": "<optional extra observations>"
  }

critical_failures:
  - "reasoning_type is DIRECT (the time-linked conclusion is never stated directly; temporal chaining across two sentences is required)"
  - "wrong time value returned (does not match the alarm time in the document)"
  - "answer is No or Unknown when the temporal chain supports Yes"
  - "answer drawn from hallucination rather than retrieved document"