ENT-002
Severity: high
Category: entity disambiguation
Repetitions: 3
Documents: 3
Questions: 2
Reasoning: DIRECT
Tags: entity-disambiguation, three-entities, same-category
📖 In Plain English
What this category tests
Does the brain correctly distinguish entities with similar names?
How the test works
Documents describe different entities sharing a common word (e.g. Widget Pro / Widget Max / Widget Lite). The test asks about one specific entity and checks the others aren't returned by mistake.
Why it matters
Many domains contain similarly named entities, and confusing them produces wrong answers.
Specifically for ENT-002
Multi-question test — Q1 asks for a specific product's spec (DIRECT), Q2 compares all 3 products (CROSS_SOURCE).
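As a rough illustration of the Q1 check, the sketch below treats a retrieval result as a plain list of source IDs and reports any sibling entity's document that was returned alongside the target. The function name `find_confusions` and the run ID `demo` are assumptions made for this example; they are not part of the test harness.

```python
def find_confusions(returned_ids, target_id, sibling_ids):
    """Return sibling source IDs that were cited although only target_id applies."""
    siblings = set(sibling_ids) - {target_id}
    return sorted(sid for sid in set(returned_ids) if sid in siblings)


# Hypothetical source IDs following the KB-<run>-ENT-002-<doc>-v1 pattern.
DOC_A = "KB-demo-ENT-002-A-v1"
DOC_B = "KB-demo-ENT-002-B-v1"
DOC_C = "KB-demo-ENT-002-C-v1"

# Q1 asks about Doc B only, so citing A or C counts as a confusion.
clean = find_confusions([DOC_B], DOC_B, [DOC_A, DOC_C])
confused = find_confusions([DOC_B, DOC_A], DOC_B, [DOC_A, DOC_C])
print(clean)
print(confused)
```

An empty list means the answer stayed on the named entity. Q2 is the opposite case: all three IDs are expected there, so this check would only apply to Q1.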
⚙️ How a single rep runs
① Generate
Model creates 3 synthetic documents and 2 questions with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the 3 documents
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers each question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges each answer against ground truth (the documents it generated in step ①)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)
Each test run repeats this cycle 3 times. A pass requires a score ≥ 85 and no critical failures.
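The pass rule can be written as a small predicate. This is a minimal sketch that assumes the rule applies to each rep individually and that a rep is summarised as a score plus a list of critical-failure codes. The names `rep_passes` and `pass_rate` are hypothetical, and how the harness actually aggregates reps is not stated on this page.

```python
PASS_THRESHOLD = 85  # from the page: a pass requires score >= 85


def rep_passes(score, critical_failures):
    """A rep passes only with a high enough score and zero critical failures."""
    return score >= PASS_THRESHOLD and not critical_failures


def pass_rate(reps):
    """reps: list of (score, critical_failures) tuples; returns the fraction passed."""
    if not reps:
        return 0.0
    passed = sum(1 for score, failures in reps if rep_passes(score, failures))
    return passed / len(reps)


# Made-up example values, not taken from any real run.
example_reps = [(92, []), (88, ["hallucinated_source_id"]), (70, [])]
print(pass_rate(example_reps))
```

Note that the second example rep fails despite scoring above the threshold, because a single critical failure is enough to block a pass.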
🔬 Technical Instructions (raw prompts sent to AI)
🔧 ① Setup Instructions 2038 chars
Generate three documents about three different products with similar names.
Ask specific questions that only apply to one of them.
Requirements:
- Doc A: "Product <Name-A> has spec <Spec-A>." (e.g., "Widget Pro has capacity 500L.")
- Doc B: "Product <Name-B> has spec <Spec-B>." (e.g., "Widget Max has capacity 1200L.")
- Doc C: "Product <Name-C> has spec <Spec-C>." (e.g., "Widget Lite has capacity 250L.")
- Names share a common word (e.g., all "Widget X") but differ in suffix.
- source_ids: KB-{{RUN}}-ENT-002-A-v1, KB-{{RUN}}-ENT-002-B-v1, KB-{{RUN}}-ENT-002-C-v1.
- Question 1: "What is the capacity of <Name-B>?"
- Question 2: "Which product has the smallest capacity?"
Output ONLY this JSON:
{
"run_id": "{{RUN}}",
"test_id": "ENT-002",
"repetition": {{REP}},
"documents": [
{
"source_id": "KB-{{RUN}}-ENT-002-A-v1",
"title": "<title-A>",
"content": "Product <Name-A> has spec <Spec-A>."
},
{
"source_id": "KB-{{RUN}}-ENT-002-B-v1",
"title": "<title-B>",
"content": "Product <Name-B> has spec <Spec-B>."
},
{
"source_id": "KB-{{RUN}}-ENT-002-C-v1",
"title": "<title-C>",
"content": "Product <Name-C> has spec <Spec-C>."
}
],
"questions": [
"What is the capacity of <Name-B>?",
"Which product has the smallest capacity?"
],
"expected_answers": [
{
"question_index": 0,
"correct_answer": "<Spec-B>",
"required_source_ids": ["KB-{{RUN}}-ENT-002-B-v1"],
"required_quote": "<exact Doc B sentence>",
"required_label": "DIRECT",
"must_refuse": false,
"notes": "Must retrieve only Doc B, not A or C."
},
{
"question_index": 1,
"correct_answer": "<Name-C> with <Spec-C>",
"required_source_ids": ["KB-{{RUN}}-ENT-002-A-v1", "KB-{{RUN}}-ENT-002-B-v1", "KB-{{RUN}}-ENT-002-C-v1"],
"required_quote": null,
"required_label": "CROSS_SOURCE",
"must_refuse": false,
"notes": "Requires comparing all 3 products. CROSS_SOURCE required."
}
]
}
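For reference, here is a sketch of what a filled-in setup context could look like when built programmatically instead of by the model. It reuses the Widget example values from the requirements. The helper `build_setup_context`, the title format, and the run ID `demo` are assumptions for illustration. The sketch omits canary tokens and picks the smallest product with `min` rather than assuming it is always Doc C.

```python
import json


def build_setup_context(run_id, repetition, products):
    """products: three (name, capacity_litres) pairs in A, B, C order."""
    if len(products) != 3:
        raise ValueError("ENT-002 expects exactly three products")
    docs = []
    for letter, (name, litres) in zip("ABC", products):
        docs.append({
            "source_id": f"KB-{run_id}-ENT-002-{letter}-v1",
            "title": f"{name} specification",  # hypothetical title format
            "content": f"Product {name} has capacity {litres}L.",
        })
    name_b, litres_b = products[1]
    smallest_name, smallest_litres = min(products, key=lambda p: p[1])
    return {
        "run_id": run_id,
        "test_id": "ENT-002",
        "repetition": repetition,
        "documents": docs,
        "questions": [
            f"What is the capacity of {name_b}?",
            "Which product has the smallest capacity?",
        ],
        "expected_answers": [
            {
                "question_index": 0,
                "correct_answer": f"{litres_b}L",
                "required_source_ids": [docs[1]["source_id"]],
                "required_quote": docs[1]["content"],
                "required_label": "DIRECT",
                "must_refuse": False,
                "notes": "Must retrieve only Doc B, not A or C.",
            },
            {
                "question_index": 1,
                "correct_answer": f"{smallest_name} with {smallest_litres}L",
                "required_source_ids": [d["source_id"] for d in docs],
                "required_quote": None,
                "required_label": "CROSS_SOURCE",
                "must_refuse": False,
                "notes": "Requires comparing all 3 products. CROSS_SOURCE required.",
            },
        ],
    }


context = build_setup_context(
    "demo", 1, [("Widget Pro", 500), ("Widget Max", 1200), ("Widget Lite", 250)]
)
print(json.dumps(context, indent=2))
```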
▶ ② Execution Instructions 667 chars
You have access to MCP Knowledge Brain retrieval tools.
The setup_context contains two questions. Answer BOTH using brain_search.
Do NOT use {{QUESTION}} placeholder — instead read the questions from setup_context:
- questions[0]: asks for a specific product's spec
- questions[1]: asks which product has the smallest spec value
Steps:
1. Use brain_search to find all three product documents
2. For question 1: find the spec of the named product — cite its source_id
3. For question 2: compare specs across all three products — cite all source_ids used
Return ONLY a JSON array with two answer objects:
[{{REQUIRED_ANSWER_SCHEMA}}, {{REQUIRED_ANSWER_SCHEMA}}]
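The three steps above can be sketched without any MCP calls by letting an in-memory list stand in for the search hits. Everything here is an assumption for illustration: the hit shape (`source_id`, `content`), the answer keys (`answer`, `source_ids`), and the helper names. The real answer schema is whatever `{{REQUIRED_ANSWER_SCHEMA}}` expands to, which this page does not show.

```python
import re

# Matches phrases such as "capacity 500L" in the generated documents.
CAPACITY_RE = re.compile(r"capacity\s+(\d+(?:\.\d+)?)\s*L\b")


def parse_capacity(content):
    match = CAPACITY_RE.search(content)
    if match is None:
        raise ValueError(f"no capacity found in: {content!r}")
    return float(match.group(1))


def answer_questions(hits, target_name):
    """hits: list of dicts with source_id and content, standing in for search results."""
    # Question 1: match the full product name, not just the shared word.
    target = [h for h in hits if f"Product {target_name} has " in h["content"]]
    if len(target) != 1:
        raise LookupError(f"expected one document for {target_name}, got {len(target)}")
    q1 = {
        "answer": f"{parse_capacity(target[0]['content']):g}L",
        "source_ids": [target[0]["source_id"]],
    }
    # Question 2: compare every retrieved product and cite all sources used.
    smallest = min(hits, key=lambda h: parse_capacity(h["content"]))
    q2 = {
        "answer": smallest["content"],
        "source_ids": sorted(h["source_id"] for h in hits),
    }
    return [q1, q2]


hits = [
    {"source_id": "KB-demo-ENT-002-A-v1", "content": "Product Widget Pro has capacity 500L."},
    {"source_id": "KB-demo-ENT-002-B-v1", "content": "Product Widget Max has capacity 1200L."},
    {"source_id": "KB-demo-ENT-002-C-v1", "content": "Product Widget Lite has capacity 250L."},
]
answers = answer_questions(hits, "Widget Max")
print(answers)
```

The design point is that Q1 cites exactly one source while Q2 cites all three, which mirrors the `required_source_ids` in the setup context.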
⚖ ③ Judge Instructions 350 chars
Ground truth:
{{SETUP_CONTEXT_JSON}}
Tested AI answer:
{{EXEC_ANSWER_JSON}}
Two questions. Check each answer separately. Q1: exact spec of Name-B. Q2: correct smallest product using all 3 sources.
Critical failures: wrong spec, wrong smallest product, missing sources for Q2, hallucinated source_id.
Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Critical Failure Conditions
- wrong_spec_for_named_product
- wrong_smallest_product
- missing_required_source_id_for_comparison
- hallucinated_source_id
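In the harness these conditions are assessed by the judge model. As a deterministic approximation, the sketch below maps each condition to a simple check. It assumes answers carry hypothetical `answer` and `source_ids` keys and uses naive substring matching, so it would miss paraphrases and could be fooled by values such as "11200L".

```python
def detect_critical_failures(setup_context, answers):
    """Return the critical-failure codes triggered by a pair of answers."""
    known_ids = {doc["source_id"] for doc in setup_context["documents"]}
    expected_q1, expected_q2 = setup_context["expected_answers"]
    failures = []

    # Q1 must state the exact spec of the named product.
    if expected_q1["correct_answer"] not in answers[0]["answer"]:
        failures.append("wrong_spec_for_named_product")
    # Q2 must name the product with the smallest capacity.
    smallest_name = expected_q2["correct_answer"].split(" with ")[0]
    if smallest_name not in answers[1]["answer"]:
        failures.append("wrong_smallest_product")
    # Q2 must cite every document needed for the comparison.
    if not set(expected_q2["required_source_ids"]) <= set(answers[1]["source_ids"]):
        failures.append("missing_required_source_id_for_comparison")
    # No answer may cite a source_id that was never ingested.
    cited = set(answers[0]["source_ids"]) | set(answers[1]["source_ids"])
    if cited - known_ids:
        failures.append("hallucinated_source_id")
    return failures


A, B, C = (f"KB-demo-ENT-002-{x}-v1" for x in "ABC")  # hypothetical run ID
setup = {
    "documents": [{"source_id": A}, {"source_id": B}, {"source_id": C}],
    "expected_answers": [
        {"correct_answer": "1200L", "required_source_ids": [B]},
        {"correct_answer": "Widget Lite with 250L", "required_source_ids": [A, B, C]},
    ],
}
good = [
    {"answer": "Widget Max has capacity 1200L.", "source_ids": [B]},
    {"answer": "Widget Lite (250L) is the smallest.", "source_ids": [A, B, C]},
]
bad = [
    {"answer": "500L", "source_ids": [A]},
    {"answer": "Widget Lite", "source_ids": [C, "KB-demo-ENT-002-D-v1"]},
]
print(detect_critical_failures(setup, good))
print(detect_critical_failures(setup, bad))
```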
Recent Run History
3 runs
| When | Run ID | Pass Rate | Avg Score | Reps |
|---|---|---|---|---|
| 2026-05-24 13:08 | 20260524T130808Z-kqze | 0% | 25.0 | 0/1 |
| 2026-05-24 12:41 | 20260524T124148Z-z2do | 0% | 0 | 0/1 |
| 2026-05-24 11:37 | 20260524T113756Z-kduj | 0% | 45.0 | 0/1 |
📄 Raw YAML cases/entity_disambiguation/ENT-002.yaml
schema_version: "1.0"
test_id: "ENT-002"
category: "entity_disambiguation"
severity: "high"
repetitions: 3
reasoning_type: "DIRECT"
num_documents: 3
num_questions: 2
tags: ["entity-disambiguation", "three-entities", "same-category"]
setup_instructions: |
  Generate three documents about three different products with similar names.
  Ask specific questions that only apply to one of them.
  Requirements:
  - Doc A: "Product <Name-A> has spec <Spec-A>." (e.g., "Widget Pro has capacity 500L.")
  - Doc B: "Product <Name-B> has spec <Spec-B>." (e.g., "Widget Max has capacity 1200L.")
  - Doc C: "Product <Name-C> has spec <Spec-C>." (e.g., "Widget Lite has capacity 250L.")
  - Names share a common word (e.g., all "Widget X") but differ in suffix.
  - source_ids: KB-{{RUN}}-ENT-002-A-v1, KB-{{RUN}}-ENT-002-B-v1, KB-{{RUN}}-ENT-002-C-v1.
  - Question 1: "What is the capacity of <Name-B>?"
  - Question 2: "Which product has the smallest capacity?"
  Output ONLY this JSON:
  {
  "run_id": "{{RUN}}",
  "test_id": "ENT-002",
  "repetition": {{REP}},
  "documents": [
  {
  "source_id": "KB-{{RUN}}-ENT-002-A-v1",
  "title": "<title-A>",
  "content": "Product <Name-A> has spec <Spec-A>."
  },
  {
  "source_id": "KB-{{RUN}}-ENT-002-B-v1",
  "title": "<title-B>",
  "content": "Product <Name-B> has spec <Spec-B>."
  },
  {
  "source_id": "KB-{{RUN}}-ENT-002-C-v1",
  "title": "<title-C>",
  "content": "Product <Name-C> has spec <Spec-C>."
  }
  ],
  "questions": [
  "What is the capacity of <Name-B>?",
  "Which product has the smallest capacity?"
  ],
  "expected_answers": [
  {
  "question_index": 0,
  "correct_answer": "<Spec-B>",
  "required_source_ids": ["KB-{{RUN}}-ENT-002-B-v1"],
  "required_quote": "<exact Doc B sentence>",
  "required_label": "DIRECT",
  "must_refuse": false,
  "notes": "Must retrieve only Doc B, not A or C."
  },
  {
  "question_index": 1,
  "correct_answer": "<Name-C> with <Spec-C>",
  "required_source_ids": ["KB-{{RUN}}-ENT-002-A-v1", "KB-{{RUN}}-ENT-002-B-v1", "KB-{{RUN}}-ENT-002-C-v1"],
  "required_quote": null,
  "required_label": "CROSS_SOURCE",
  "must_refuse": false,
  "notes": "Requires comparing all 3 products. CROSS_SOURCE required."
  }
  ]
  }
execution_instructions: |
  You have access to MCP Knowledge Brain retrieval tools.
  The setup_context contains two questions. Answer BOTH using brain_search.
  Do NOT use {{QUESTION}} placeholder — instead read the questions from setup_context:
  - questions[0]: asks for a specific product's spec
  - questions[1]: asks which product has the smallest spec value
  Steps:
  1. Use brain_search to find all three product documents
  2. For question 1: find the spec of the named product — cite its source_id
  3. For question 2: compare specs across all three products — cite all source_ids used
  Return ONLY a JSON array with two answer objects:
  [{{REQUIRED_ANSWER_SCHEMA}}, {{REQUIRED_ANSWER_SCHEMA}}]
judge_instructions: |
  Ground truth:
  {{SETUP_CONTEXT_JSON}}
  Tested AI answer:
  {{EXEC_ANSWER_JSON}}
  Two questions. Check each answer separately. Q1: exact spec of Name-B. Q2: correct smallest product using all 3 sources.
  Critical failures: wrong spec, wrong smallest product, missing sources for Q2, hallucinated source_id.
  Return ONLY this JSON:
  {{JUDGE_RESULT_SCHEMA}}
critical_failures:
  - "wrong_spec_for_named_product"
  - "wrong_smallest_product"
  - "missing_required_source_id_for_comparison"
  - "hallucinated_source_id"