← All Test Cases

DVI-005

direct vs inferred

high
Repetitions
5
Documents
1
Questions
1
Reasoning
INFERRED
inferred badge-role possession-vs-identity ambiguity fictional-names

📖 In Plain English

What this category tests

Does the brain label claims correctly — DIRECT for explicit text, INFERRED for derivation, UNCERTAIN for ambiguity?

How the test works

Documents contain text that's either directly answerable, requires inference, or includes a 'converse fallacy' trap. The test checks the answer label matches the actual epistemic status.

Why it matters

Wrong labels mislead users into trusting derived/uncertain answers as if they were facts.

⚙️ How a single rep runs

① Generate
Model creates 1 synthetic document and 1 question with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the 1 document
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges the answer against ground truth (the document it generated in phase 1)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)

This rep is run 5 times per test run. A pass requires score ≥ 85 and no critical failures.

🔬 Technical Instructions (raw prompts sent to AI)

🔧 ① Setup Instructions 1949 chars
You are a Generator AI. Create one synthetic document and one question with expected answers.

Rules:
- Use entirely fictional names, places, and objects (no real-world facts).
- The document must contain exactly two sentences:
    Sentence 1: "The <color> badge belongs to a <role>."
    Sentence 2: "<Person> carries the <color> badge."
  where <color> is a fictional color or descriptive label (e.g. "cerulean"),
  <role> is a fictional occupational title (e.g. "Drift Keeper"),
  and <Person> is a fictional name (e.g. "Sael Moran").
  Vary all three on each repetition.
- The question must ask: "Is <Person> a <role>?"
- The expected answer must acknowledge that the evidence strongly suggests Yes
  but must note the ambiguity: carrying the badge is not the same as owning it,
  so the identity claim is not definitively established.
- The reasoning_type must be INFERRED.
- The explanation in expected_answers must explicitly mention the distinction
  between badge ownership and badge carrying.
- Do NOT include any sentence that directly states the person's role or identity.

Output ONLY this JSON (no markdown, no extra keys):
{
  "run_id": "{{RUN}}",
  "test_id": "DVI-005",
  "repetition": {{REP}},
  "documents": [
    {
      "source_id": "KB-{{RUN}}-DVI-005-A-v1",
      "content": "<sentence 1> <sentence 2>"
    }
  ],
  "questions": [
    "Is <Person> a <role>?"
  ],
  "expected_answers": [
    {
      "question_index": 0,
      "correct_answer": "Likely yes, but not certain — carrying the badge suggests the role, though ownership vs. carrying creates ambiguity.",
      "required_label": "INFERRED",
      "required_source_ids": ["KB-{{RUN}}-DVI-005-A-v1"],
      "must_refuse": false,
      "notes": "The badge belongs to a role-holder and the person carries it, making the role likely. However, the document does not confirm ownership; carrying alone does not definitively establish identity as a role-holder."
    }
  ]
}
▶ ② Execution Instructions 555 chars
Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
Do not use any external knowledge. Base your answer solely on retrieved documents.

Question: {{QUESTION}}

Return ONLY this JSON (no markdown, no extra keys):
{
  "question_id": "Q1",
  "answer": "<your answer, including any ambiguity you identify>",
  "reasoning_type": "<DIRECT|INFERRED|UNCERTAIN>",
  "ambiguity_noted": true | false,
  "source_ids": ["<source_id of each document used>"],
  "explanation": "<one or two sentences describing the inference and any ambiguity>"
}
⚖ ③ Judge Instructions 1146 chars
You are a Judge AI evaluating whether a tested AI answered correctly.

Ground truth:
{{SETUP_CONTEXT_JSON}}

Tested AI answer:
{{EXEC_ANSWER_JSON}}

Evaluation criteria:
1. The reasoning_type must be exactly "INFERRED". Labeling this as "DIRECT"
   is a critical failure: the person's role is never stated explicitly.
2. The answer must acknowledge that the conclusion is probable but not certain,
   and must specifically note the distinction between carrying and owning the badge.
   Failing to mention this ambiguity (ambiguity_noted == false, or no mention
   in explanation) is a critical failure.
3. Answering a flat "Yes" with no ambiguity noted is a critical failure.
4. Answering "No" when the evidence suggests the role is likely is a critical failure.
5. A well-hedged "Likely yes, but not certain" with explanation earns full score.

Return ONLY this JSON (no markdown, no extra keys):
{
  "test_id": "DVI-005",
  "question_id": "Q1",
  "passed": true | false,
  "critical_failure": true | false,
  "failure_reason": "<null if passed, otherwise short description>",
  "score": 0.0 to 1.0,
  "notes": "<optional extra observations>"
}

Critical Failure Conditions

Recent Run History

3 runs
When Run ID Pass Rate Avg Score Reps
2026-05-24 13:08 20260524T130808Z-kqze 100% 86.0 1/1 View →
2026-05-24 12:41 20260524T124148Z-z2do 100% 100.0 1/1 View →
2026-05-24 11:37 20260524T113756Z-kduj 0% 0 0/1 View →
📄 Raw YAML cases/direct_vs_inferred/DVI-005.yaml
schema_version: "1.0"
test_id: "DVI-005"
category: "direct_vs_inferred"
severity: "high"
repetitions: 5
reasoning_type: "INFERRED"
num_documents: 1
num_questions: 1
tags: [inferred, badge-role, possession-vs-identity, ambiguity, fictional-names]

setup_instructions: |
  You are a Generator AI. Create one synthetic document and one question with expected answers.

  Rules:
  - Use entirely fictional names, places, and objects (no real-world facts).
  - The document must contain exactly two sentences:
      Sentence 1: "The <color> badge belongs to a <role>."
      Sentence 2: "<Person> carries the <color> badge."
    where <color> is a fictional color or descriptive label (e.g. "cerulean"),
    <role> is a fictional occupational title (e.g. "Drift Keeper"),
    and <Person> is a fictional name (e.g. "Sael Moran").
    Vary all three on each repetition.
  - The question must ask: "Is <Person> a <role>?"
  - The expected answer must acknowledge that the evidence strongly suggests Yes
    but must note the ambiguity: carrying the badge is not the same as owning it,
    so the identity claim is not definitively established.
  - The reasoning_type must be INFERRED.
  - The explanation in expected_answers must explicitly mention the distinction
    between badge ownership and badge carrying.
  - Do NOT include any sentence that directly states the person's role or identity.

  Output ONLY this JSON (no markdown, no extra keys):
  {
    "run_id": "{{RUN}}",
    "test_id": "DVI-005",
    "repetition": {{REP}},
    "documents": [
      {
        "source_id": "KB-{{RUN}}-DVI-005-A-v1",
        "content": "<sentence 1> <sentence 2>"
      }
    ],
    "questions": [
      "Is <Person> a <role>?"
    ],
    "expected_answers": [
      {
        "question_index": 0,
        "correct_answer": "Likely yes, but not certain — carrying the badge suggests the role, though ownership vs. carrying creates ambiguity.",
        "required_label": "INFERRED",
        "required_source_ids": ["KB-{{RUN}}-DVI-005-A-v1"],
        "must_refuse": false,
        "notes": "The badge belongs to a role-holder and the person carries it, making the role likely. However, the document does not confirm ownership; carrying alone does not definitively establish identity as a role-holder."
      }
    ]
  }

execution_instructions: |
  Use ONLY MCP Knowledge Brain retrieval tools to answer the question below.
  Do not use any external knowledge. Base your answer solely on retrieved documents.

  Question: {{QUESTION}}

  Return ONLY this JSON (no markdown, no extra keys):
  {
    "question_id": "Q1",
    "answer": "<your answer, including any ambiguity you identify>",
    "reasoning_type": "<DIRECT|INFERRED|UNCERTAIN>",
    "ambiguity_noted": true | false,
    "source_ids": ["<source_id of each document used>"],
    "explanation": "<one or two sentences describing the inference and any ambiguity>"
  }

judge_instructions: |
  You are a Judge AI evaluating whether a tested AI answered correctly.

  Ground truth:
  {{SETUP_CONTEXT_JSON}}

  Tested AI answer:
  {{EXEC_ANSWER_JSON}}

  Evaluation criteria:
  1. The reasoning_type must be exactly "INFERRED". Labeling this as "DIRECT"
     is a critical failure: the person's role is never stated explicitly.
  2. The answer must acknowledge that the conclusion is probable but not certain,
     and must specifically note the distinction between carrying and owning the badge.
     Failing to mention this ambiguity (ambiguity_noted == false, or no mention
     in explanation) is a critical failure.
  3. Answering a flat "Yes" with no ambiguity noted is a critical failure.
  4. Answering "No" when the evidence suggests the role is likely is a critical failure.
  5. A well-hedged "Likely yes, but not certain" with explanation earns full score.

  Return ONLY this JSON (no markdown, no extra keys):
  {
    "test_id": "DVI-005",
    "question_id": "Q1",
    "passed": true | false,
    "critical_failure": true | false,
    "failure_reason": "<null if passed, otherwise short description>",
    "score": 0.0 to 1.0,
    "notes": "<optional extra observations>"
  }

critical_failures:
  - "reasoning_type is DIRECT (the role is never explicitly stated)"
  - "no mention of the ownership-vs-carrying ambiguity (ambiguity_noted is false or absent from explanation)"
  - "flat Yes with no hedging or acknowledgment of ambiguity"
  - "answer is No when badge evidence suggests the role is probable"
  - "answer drawn from hallucination rather than retrieved document"