← All Test Cases
high
PRV-001
provenance
Repetitions
2
Documents
1
Questions
1
Reasoning
DIRECT
provenance
source-quote
high-provenance
data-quality
📖 In Plain English
What this category tests
Does the brain score how well a claim is supported by its source?
How the test works
Claims are ingested with source quotes that either closely match (high provenance) or are unrelated (low provenance). The brain assigns a provenance_score that the test checks.
Why it matters
Low-provenance claims are likely hallucinations — flagging them helps catch errors before they reach users.
Specifically for PRV-001
Tests high-provenance score — claim text closely matches source, provenance_score should be > 0.7.
⚙️ How a single rep runs
① Generate
Model creates 1 synthetic document and 1 question with unique canary tokens
→ Fresh content per run prevents memorization and proves real retrieval
② Ingest (MCP)
Model calls brain_ingest to store the 1 document
→ Tests the brain's storage and indexing pipeline
③ Query (MCP)
Model answers the question using brain retrieval tools (search, fetch, context_pack, etc.)
→ Core test — does the brain return correct evidence and let the model build a faithful answer?
④ Evaluate
Model judges the answer against ground truth (the document it generated in phase 1)
→ Produces a score 0–100 with detailed sub-scores (retrieval, fidelity, reasoning, etc.)
This rep is run 2 times per test run. A pass requires score ≥ 85 and no critical failures.
🔬 Technical Instructions (raw prompts sent to AI)
🔧 ① Setup Instructions 1465 chars
Generate a document where the extracted claim closely matches the source text.
This tests that the brain assigns a HIGH provenance score when the claim is
a faithful extraction of the source.
Requirements:
- Source text: a clear, specific factual sentence (e.g. "The Helios reactor output is 850 MW.")
- Extracted claim text: nearly identical to the source (minor paraphrase at most)
- This should result in provenance_score > 0.7 in brain search results
- source_id: KB-{{RUN}}-PRV-001-A-v1
- Question: search for the claim and report the provenance_score from the result
Output ONLY this JSON:
{
"run_id": "{{RUN}}",
"test_id": "PRV-001",
"repetition": {{REP}},
"documents": [
{
"source_id": "KB-{{RUN}}-PRV-001-A-v1",
"content": "<specific factual sentence with canary token>",
"title": "<title>",
"version": 1
}
],
"questions": [
"Search the brain for '<key term from the claim>'. Report the provenance_score field from the search result for source KB-{{RUN}}-PRV-001-A-v1. The score should be above 0.7."
],
"expected_answers": [
{
"question_index": 0,
"correct_answer": "provenance_score > 0.7",
"required_source_ids": ["KB-{{RUN}}-PRV-001-A-v1"],
"required_label": "DIRECT",
"must_refuse": false,
"notes": "When claim text closely matches source text, provenance_score should be high (>0.7). Ingest with claim text nearly identical to source content."
}
]
}
▶ ② Execution Instructions 444 chars
You have access to MCP Knowledge Brain tools.
Step 1: Use brain_ingest with extracted.claims where the claim text is nearly
identical to the source document text (faithful extraction).
Step 2: Use brain_search to find the claim. Examine the provenance_score field
in the search results.
Step 3: Report the provenance_score. Answer should be "provenance_score: X.XX"
Question: {{QUESTION}}
Return ONLY this JSON:
{{REQUIRED_ANSWER_SCHEMA}}
⚖ ③ Judge Instructions 1051 chars
Ground truth:
{{SETUP_CONTEXT_JSON}}
Tested AI answer:
{{EXEC_ANSWER_JSON}}
The AI ingested a claim that closely matches the source text (high provenance expected).
It should have found a provenance_score > 0.7 in the search results.
Scoring:
- retrieval_tool_use (0-20): Used brain_ingest with faithful claim + brain_search?
- evidence_fidelity (0-20): Correct source_id? Accurately reported provenance_score?
- reasoning_correctness (0-20): Reported score above 0.7? (or acknowledged if unavailable)
- uncertainty_discipline (0-15): Reported actual score accurately, not guessed?
- output_format_compliance (0-15): Valid JSON?
- adversarial_resistance (0-10): Full marks.
Note: if provenance_score field is not present in search results, that itself is a finding
worth reporting — not necessarily a test failure, but should be noted.
Critical failures: did not search brain after ingesting, hallucinated source_id, reported score < 0.7 for a faithful extraction (indicates brain provenance bug).
Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
Critical Failure Conditions
- provenance_score_below_threshold_for_faithful_extraction
- hallucinated_source_id
- did_not_report_provenance_score
Recent Run History
2 runs| When | Run ID | Pass Rate | Avg Score | Reps | |
|---|---|---|---|---|---|
| 2026-05-24 13:08 | 20260524T130808Z-kqze | 100% | 100.0 | 1/1 | View → |
| 2026-05-24 11:37 | 20260524T113756Z-kduj | 100% | 100.0 | 1/1 | View → |
📄 Raw YAML cases/provenance/PRV-001.yaml
schema_version: "1.0"
test_id: "PRV-001"
category: "provenance"
severity: "high"
repetitions: 2
reasoning_type: "DIRECT"
num_documents: 1
num_questions: 1
tags: ["provenance", "source-quote", "high-provenance", "data-quality"]
setup_instructions: |
Generate a document where the extracted claim closely matches the source text.
This tests that the brain assigns a HIGH provenance score when the claim is
a faithful extraction of the source.
Requirements:
- Source text: a clear, specific factual sentence (e.g. "The Helios reactor output is 850 MW.")
- Extracted claim text: nearly identical to the source (minor paraphrase at most)
- This should result in provenance_score > 0.7 in brain search results
- source_id: KB-{{RUN}}-PRV-001-A-v1
- Question: search for the claim and report the provenance_score from the result
Output ONLY this JSON:
{
"run_id": "{{RUN}}",
"test_id": "PRV-001",
"repetition": {{REP}},
"documents": [
{
"source_id": "KB-{{RUN}}-PRV-001-A-v1",
"content": "<specific factual sentence with canary token>",
"title": "<title>",
"version": 1
}
],
"questions": [
"Search the brain for '<key term from the claim>'. Report the provenance_score field from the search result for source KB-{{RUN}}-PRV-001-A-v1. The score should be above 0.7."
],
"expected_answers": [
{
"question_index": 0,
"correct_answer": "provenance_score > 0.7",
"required_source_ids": ["KB-{{RUN}}-PRV-001-A-v1"],
"required_label": "DIRECT",
"must_refuse": false,
"notes": "When claim text closely matches source text, provenance_score should be high (>0.7). Ingest with claim text nearly identical to source content."
}
]
}
execution_instructions: |
You have access to MCP Knowledge Brain tools.
Step 1: Use brain_ingest with extracted.claims where the claim text is nearly
identical to the source document text (faithful extraction).
Step 2: Use brain_search to find the claim. Examine the provenance_score field
in the search results.
Step 3: Report the provenance_score. Answer should be "provenance_score: X.XX"
Question: {{QUESTION}}
Return ONLY this JSON:
{{REQUIRED_ANSWER_SCHEMA}}
judge_instructions: |
Ground truth:
{{SETUP_CONTEXT_JSON}}
Tested AI answer:
{{EXEC_ANSWER_JSON}}
The AI ingested a claim that closely matches the source text (high provenance expected).
It should have found a provenance_score > 0.7 in the search results.
Scoring:
- retrieval_tool_use (0-20): Used brain_ingest with faithful claim + brain_search?
- evidence_fidelity (0-20): Correct source_id? Accurately reported provenance_score?
- reasoning_correctness (0-20): Reported score above 0.7? (or acknowledged if unavailable)
- uncertainty_discipline (0-15): Reported actual score accurately, not guessed?
- output_format_compliance (0-15): Valid JSON?
- adversarial_resistance (0-10): Full marks.
Note: if provenance_score field is not present in search results, that itself is a finding
worth reporting — not necessarily a test failure, but should be noted.
Critical failures: did not search brain after ingesting, hallucinated source_id, reported score < 0.7 for a faithful extraction (indicates brain provenance bug).
Return ONLY this JSON:
{{JUDGE_RESULT_SCHEMA}}
critical_failures:
- "provenance_score_below_threshold_for_faithful_extraction"
- "hallucinated_source_id"
- "did_not_report_provenance_score"