Severity: high
Test ID: DVI-004
Category: direct vs inferred
Repetitions: 5
Documents: 1
Questions: 1
Reasoning: INFERRED
Tags: inferred, temporal-chain, time-ordering, event-sequencing, fictional-names
📖 In Plain English
What this category tests
Does the brain label claims correctly — DIRECT for explicit text, INFERRED for derivation, UNCERTAIN for ambiguity?
How the test works
Each document contains text that is directly answerable, requires inference, or includes a 'converse fallacy' trap. The test checks that the label on the answer matches the actual epistemic status of the claim.
Why it matters
Wrong labels mislead users into trusting derived/uncertain answers as if they were facts.
⚙️ How a single rep runs
① Generate
Model creates 1 synthetic document and 1 question with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the 1 document
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges the answer against ground truth (the document it generated in phase 1)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)
This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.
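The pass rule above can be sketched in a few lines of Python. This is only an illustration: the names `RepResult`, `rep_passed`, and `pass_rate` are hypothetical, and it assumes the threshold is applied to each rep individually on the 0–100 scale, which the page does not spell out.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 85  # from the page: "A pass requires score >= 85"


@dataclass
class RepResult:
    """Outcome of one repetition (hypothetical container, not the harness's type)."""
    score: float            # assumed to be on the 0-100 scale from phase 4
    critical_failure: bool  # True if the judge flagged any critical failure


def rep_passed(result: RepResult) -> bool:
    """A rep passes only if it clears the threshold AND has no critical failure."""
    return result.score >= PASS_THRESHOLD and not result.critical_failure


def pass_rate(results: list[RepResult]) -> float:
    """Percentage of passing reps; 0.0 for an empty run."""
    if not results:
        return 0.0
    return 100.0 * sum(rep_passed(r) for r in results) / len(results)
```

Note that a high score does not rescue a rep with a critical failure: `RepResult(97.0, True)` still fails under this reading.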
🔬 Technical Instructions (raw prompts sent to AI)
🔧 ① Setup Instructions 1731 chars
You are a Generator AI. Create one synthetic document and one question with expected answers.
Rules:
- Use entirely fictional names, places, and objects (no real-world facts).
- The document must contain exactly two sentences:
Sentence 1: "<Person> saw the <object> before the alarm."
Sentence 2: "The alarm rang at <time>."
where <Person> is a fictional name (e.g. "Dara Voss"),
<object> is a fictional item (e.g. "obsidian shard"),
and <time> is a specific fictional clock time (e.g. "the ninth bell").
Vary all three on each repetition.
- The question must ask: "Did <Person> see the <object> before <time>?"
- The expected answer must be "Yes" and the reasoning_type must be INFERRED.
Explanation: the document never directly states the person saw the object
before <time>; it requires chaining "before the alarm" + "alarm rang at <time>".
- Do NOT include a sentence that directly links the person's sighting to the time.
Output ONLY this JSON (no markdown, no extra keys):
{
"run_id": "{{RUN}}",
"test_id": "DVI-004",
"repetition": {{REP}},
"documents": [
{
"source_id": "KB-{{RUN}}-DVI-004-A-v1",
"content": "<sentence 1> <sentence 2>"
}
],
"questions": [
"Did <Person> see the <object> before <time>?"
],
"expected_answers": [
{
"question_index": 0,
"correct_answer": "Yes",
"required_label": "INFERRED",
"required_source_ids": ["KB-{{RUN}}-DVI-004-A-v1"],
"must_refuse": false,
"notes": "The document states the person saw the object before the alarm, and that the alarm rang at the given time. The temporal conclusion requires chaining both sentences; the direct link to the specific time is never stated."
}
]
}
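To make the template concrete, here is a minimal sketch of what a filled-in setup payload looks like. In the real test the Generator AI writes this JSON itself; the helper `build_case` below is hypothetical and exists only to show how the two sentences and the question are assembled from the same three slots.

```python
import json


def build_case(run_id: str, rep: int, person: str, obj: str, time: str) -> dict:
    """Fill the DVI-004 template (illustrative helper, not part of the harness)."""
    source_id = f"KB-{run_id}-DVI-004-A-v1"
    # The two sentences share only "the alarm"; the time never touches the sighting.
    sentence_1 = f"{person} saw the {obj} before the alarm."
    sentence_2 = f"The alarm rang at {time}."
    return {
        "run_id": run_id,
        "test_id": "DVI-004",
        "repetition": rep,
        "documents": [
            {"source_id": source_id, "content": f"{sentence_1} {sentence_2}"}
        ],
        "questions": [f"Did {person} see the {obj} before {time}?"],
        "expected_answers": [
            {
                "question_index": 0,
                "correct_answer": "Yes",
                "required_label": "INFERRED",
                "required_source_ids": [source_id],
                "must_refuse": False,
            }
        ],
    }


# Example values taken from the prompt above; the run id is a made-up placeholder.
case = build_case("RUN1", 1, "Dara Voss", "obsidian shard", "the ninth bell")
print(json.dumps(case, indent=2))
```

The question joins `<Person>` and `<time>` in one clause, while the document keeps them in separate sentences. That gap is what forces the INFERRED label.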
▶ ② Execution Instructions 497 chars
Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
Do not use any external knowledge. Base your answer solely on retrieved documents.
Question: {{QUESTION}}
Return ONLY this JSON (no markdown, no extra keys):
{
"question_id": "Q1",
"answer": "<Yes|No|Unknown>",
"reasoning_type": "<DIRECT|INFERRED|UNCERTAIN>",
"source_ids": ["<source_id of each document used>"],
"explanation": "<one or two sentences describing the temporal chain used to reach the answer>"
}
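Because the prompt demands "ONLY this JSON" with "no extra keys", a harness would likely want to check the shape of the reply before judging it. The following is a rough sketch of such a check; `validate_answer` is a hypothetical name, and the page does not say whether the real pipeline validates the reply this way.

```python
import json

REQUIRED_KEYS = {"question_id", "answer", "reasoning_type", "source_ids", "explanation"}
ALLOWED_ANSWERS = ("Yes", "No", "Unknown")
ALLOWED_LABELS = ("DIRECT", "INFERRED", "UNCERTAIN")


def validate_answer(raw: str) -> list[str]:
    """Return a list of shape problems; an empty list means the reply is well-formed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - set(data)
    extra = set(data) - REQUIRED_KEYS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"extra keys: {sorted(extra)}")
    if data.get("answer") not in ALLOWED_ANSWERS:
        problems.append("answer must be Yes, No, or Unknown")
    if data.get("reasoning_type") not in ALLOWED_LABELS:
        problems.append("reasoning_type must be DIRECT, INFERRED, or UNCERTAIN")
    ids = data.get("source_ids")
    if not (isinstance(ids, list) and all(isinstance(s, str) for s in ids)):
        problems.append("source_ids must be a list of strings")
    return problems
```

This only checks form. A reply labelled `DIRECT` is well-formed here even though it fails the test; deciding that is the judge's job.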
⚖ ③ Judge Instructions 970 chars
You are a Judge AI evaluating whether a tested AI answered correctly.
Ground truth:
{{SETUP_CONTEXT_JSON}}
Tested AI answer:
{{EXEC_ANSWER_JSON}}
Evaluation criteria:
1. The answer field must be "Yes" (case-insensitive).
2. The reasoning_type must be exactly "INFERRED".
3. Labeling this as "DIRECT" is a critical failure: the document never explicitly
links the sighting to the specific time; a temporal chain across two sentences
is required.
4. Returning a wrong time value (not the one from the document) is a critical failure.
5. Answering "No" or "Unknown" is a critical failure.
6. The explanation should reference both sentences of the temporal chain.
Return ONLY this JSON (no markdown, no extra keys):
{
"test_id": "DVI-004",
"question_id": "Q1",
"passed": true | false,
"critical_failure": true | false,
"failure_reason": "<null if passed, otherwise short description>",
"score": 0.0 to 1.0,
"notes": "<optional extra observations>"
}
Critical Failure Conditions
- reasoning_type is DIRECT (the time-linked conclusion is never stated directly; temporal chaining across two sentences is required)
- wrong time value returned (does not match the alarm time in the document)
- answer is No or Unknown when the temporal chain supports Yes
- answer drawn from hallucination rather than retrieved document
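The mechanical part of these rules can be approximated in code. The sketch below covers only the answer and label checks; the "wrong time value" and "hallucination" conditions need the judge model to read the explanation and the retrieved evidence, so they are deliberately left out. `judge_answer` is a hypothetical function, not the harness's judge, and it omits the `score` field because the page does not define how the score is computed.

```python
def judge_answer(answer: dict) -> dict:
    """Apply the label and answer rules for DVI-004 (partial, illustrative)."""
    verdict = str(answer.get("answer", "")).strip().lower()
    label = answer.get("reasoning_type")

    critical = []  # failures listed under "Critical Failure Conditions"
    other = []     # failures that break a criterion without being critical

    if verdict in ("no", "unknown"):
        critical.append("answer is No or Unknown when the temporal chain supports Yes")
    elif verdict != "yes":
        other.append("answer is not Yes")

    if label == "DIRECT":
        critical.append("reasoning_type is DIRECT")
    elif label != "INFERRED":
        # e.g. UNCERTAIN: violates criterion 2 but is not listed as critical
        other.append("reasoning_type is not INFERRED")

    reasons = critical + other
    return {
        "test_id": "DVI-004",
        "question_id": answer.get("question_id"),
        "passed": not reasons,
        "critical_failure": bool(critical),
        "failure_reason": "; ".join(reasons) or None,
    }
```

Under this reading, an `UNCERTAIN` label fails the rep without counting as a critical failure, because only `DIRECT` is named in the list above.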
Recent Run History
3 runs

| When | Run ID | Pass Rate | Avg Score | Reps |
|---|---|---|---|---|
| 2026-05-24 13:08 | 20260524T130808Z-kqze | 100% | 97.0 | 1/1 |
| 2026-05-24 12:41 | 20260524T124148Z-z2do | 100% | 100.0 | 1/1 |
| 2026-05-24 11:37 | 20260524T113756Z-kduj | 0% | 0 | 0/1 |
📄 Raw YAML cases/direct_vs_inferred/DVI-004.yaml
schema_version: "1.0"
test_id: "DVI-004"
category: "direct_vs_inferred"
severity: "high"
repetitions: 5
reasoning_type: "INFERRED"
num_documents: 1
num_questions: 1
tags: [inferred, temporal-chain, time-ordering, event-sequencing, fictional-names]
setup_instructions: |
  You are a Generator AI. Create one synthetic document and one question with expected answers.
  Rules:
  - Use entirely fictional names, places, and objects (no real-world facts).
  - The document must contain exactly two sentences:
      Sentence 1: "<Person> saw the <object> before the alarm."
      Sentence 2: "The alarm rang at <time>."
    where <Person> is a fictional name (e.g. "Dara Voss"),
    <object> is a fictional item (e.g. "obsidian shard"),
    and <time> is a specific fictional clock time (e.g. "the ninth bell").
    Vary all three on each repetition.
  - The question must ask: "Did <Person> see the <object> before <time>?"
  - The expected answer must be "Yes" and the reasoning_type must be INFERRED.
    Explanation: the document never directly states the person saw the object
    before <time>; it requires chaining "before the alarm" + "alarm rang at <time>".
  - Do NOT include a sentence that directly links the person's sighting to the time.
  Output ONLY this JSON (no markdown, no extra keys):
  {
    "run_id": "{{RUN}}",
    "test_id": "DVI-004",
    "repetition": {{REP}},
    "documents": [
      {
        "source_id": "KB-{{RUN}}-DVI-004-A-v1",
        "content": "<sentence 1> <sentence 2>"
      }
    ],
    "questions": [
      "Did <Person> see the <object> before <time>?"
    ],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "Yes",
        "required_label": "INFERRED",
        "required_source_ids": ["KB-{{RUN}}-DVI-004-A-v1"],
        "must_refuse": false,
        "notes": "The document states the person saw the object before the alarm, and that the alarm rang at the given time. The temporal conclusion requires chaining both sentences; the direct link to the specific time is never stated."
      }
    ]
  }
execution_instructions: |
  Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
  Do not use any external knowledge. Base your answer solely on retrieved documents.
  Question: {{QUESTION}}
  Return ONLY this JSON (no markdown, no extra keys):
  {
    "question_id": "Q1",
    "answer": "<Yes|No|Unknown>",
    "reasoning_type": "<DIRECT|INFERRED|UNCERTAIN>",
    "source_ids": ["<source_id of each document used>"],
    "explanation": "<one or two sentences describing the temporal chain used to reach the answer>"
  }
judge_instructions: |
  You are a Judge AI evaluating whether a tested AI answered correctly.
  Ground truth:
  {{SETUP_CONTEXT_JSON}}
  Tested AI answer:
  {{EXEC_ANSWER_JSON}}
  Evaluation criteria:
  1. The answer field must be "Yes" (case-insensitive).
  2. The reasoning_type must be exactly "INFERRED".
  3. Labeling this as "DIRECT" is a critical failure: the document never explicitly
     links the sighting to the specific time; a temporal chain across two sentences
     is required.
  4. Returning a wrong time value (not the one from the document) is a critical failure.
  5. Answering "No" or "Unknown" is a critical failure.
  6. The explanation should reference both sentences of the temporal chain.
  Return ONLY this JSON (no markdown, no extra keys):
  {
    "test_id": "DVI-004",
    "question_id": "Q1",
    "passed": true | false,
    "critical_failure": true | false,
    "failure_reason": "<null if passed, otherwise short description>",
    "score": 0.0 to 1.0,
    "notes": "<optional extra observations>"
  }
critical_failures:
  - "reasoning_type is DIRECT (the time-linked conclusion is never stated directly; temporal chaining across two sentences is required)"
  - "wrong time value returned (does not match the alarm time in the document)"
  - "answer is No or Unknown when the temporal chain supports Yes"
  - "answer drawn from hallucination rather than retrieved document"