20260524T130808Z-kqze

Pass Rate: 65%
Avg Score: 77.8
Total Cases: 100
Passed: 65

Category Breakdown

Category	Cases	Pass Rate	Avg Score	Stability
adversarial	5	40%	82.2	flaky
alias	3	100%	90.0	stable pass
causal	2	50%	86.0	flaky
conflicting_evidence	3	33%	62.7	flaky
context_isolation	5	80%	98.0	flaky
cross_source	10	80%	85.2	flaky
deduplication	2	100%	100.0	stable pass
direct_vs_inferred	10	100%	96.8	stable pass
entity_disambiguation	3	33%	58.3	flaky
exact_phrase	5	40%	61.8	flaky
exact_source_id	5	40%	60.0	flaky
identifier	3	100%	100.0	stable pass
judge_reliability	5	20%	54.4	flaky
missing_source	5	20%	44.0	flaky
needle_haystack	3	0%	21.7	stable fail
negation	3	100%	100.0	stable pass
numerical	3	100%	98.3	stable pass
provenance	2	50%	65.0	flaky
raw_text	5	100%	100.0	stable pass
semantic_drift	3	33%	59.0	flaky
storage	5	80%	90.0	flaky
supersession	2	50%	67.0	flaky
update_versioning	5	80%	80.0	flaky
verified	3	67%	67.7	flaky
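
As a consistency check, the summary figures can be recomputed from the category table. The sketch below rests on two assumptions that the dashboard does not state: that per-category pass counts can be recovered by rounding cases × pass rate, and that the overall average score is the case-weighted mean of the per-category averages. The variable names are illustrative only.

```python
# Rows copied from the Category Breakdown table:
# (category, cases, pass rate in %, avg score)
ROWS = [
    ("adversarial", 5, 40, 82.2),
    ("alias", 3, 100, 90.0),
    ("causal", 2, 50, 86.0),
    ("conflicting_evidence", 3, 33, 62.7),
    ("context_isolation", 5, 80, 98.0),
    ("cross_source", 10, 80, 85.2),
    ("deduplication", 2, 100, 100.0),
    ("direct_vs_inferred", 10, 100, 96.8),
    ("entity_disambiguation", 3, 33, 58.3),
    ("exact_phrase", 5, 40, 61.8),
    ("exact_source_id", 5, 40, 60.0),
    ("identifier", 3, 100, 100.0),
    ("judge_reliability", 5, 20, 54.4),
    ("missing_source", 5, 20, 44.0),
    ("needle_haystack", 3, 0, 21.7),
    ("negation", 3, 100, 100.0),
    ("numerical", 3, 100, 98.3),
    ("provenance", 2, 50, 65.0),
    ("raw_text", 5, 100, 100.0),
    ("semantic_drift", 3, 33, 59.0),
    ("storage", 5, 80, 90.0),
    ("supersession", 2, 50, 67.0),
    ("update_versioning", 5, 80, 80.0),
    ("verified", 3, 67, 67.7),
]

total_cases = sum(cases for _, cases, _, _ in ROWS)

# Assumption: displayed pass rates are rounded percentages, so rounding
# cases * rate / 100 recovers the whole number of passing cases.
passed = sum(round(cases * rate / 100) for _, cases, rate, _ in ROWS)

# Assumption: the overall score is the case-weighted mean of category averages.
avg_score = sum(cases * score for _, cases, _, score in ROWS) / total_cases

print(total_cases, passed, f"{passed / total_cases:.0%}", round(avg_score, 1))
```

Under these assumptions the recomputed values agree with the summary figures: 100 cases, 65 passed, a 65% pass rate, and an average score of 77.8.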

Test Cases (100 total — sorted by pass rate)

adversarial critical

missing_source high

entity_disambiguation high

cross_source high

semantic_drift high

adversarial critical

provenance high

conflicting_evidence critical

adversarial critical

exact_phrase high

exact_source_id high

exact_phrase critical

judge_reliability critical

needle_haystack high

semantic_drift high

judge_reliability critical

judge_reliability high

update_versioning medium

missing_source critical

context_isolation critical

supersession high

needle_haystack high

exact_phrase high

missing_source critical

judge_reliability critical

exact_source_id critical

cross_source critical

conflicting_evidence high

needle_haystack critical

entity_disambiguation critical

missing_source critical

exact_source_id critical

direct_vs_inferred high

semantic_drift high

storage critical

missing_source critical

direct_vs_inferred high

direct_vs_inferred medium

identifier critical

update_versioning critical

adversarial critical

context_isolation critical

cross_source high

direct_vs_inferred critical

cross_source high

raw_text critical

cross_source high

cross_source critical

direct_vs_inferred high

context_isolation critical

update_versioning high

direct_vs_inferred high

conflicting_evidence high

storage critical

direct_vs_inferred high

raw_text critical

exact_source_id high

entity_disambiguation critical

provenance high

identifier high

negation critical

deduplication high

update_versioning critical

exact_phrase critical

cross_source critical

negation critical

direct_vs_inferred medium

cross_source critical

context_isolation high

exact_source_id critical

storage critical

supersession high

deduplication medium

direct_vs_inferred high

cross_source critical

identifier high

context_isolation high

judge_reliability high

adversarial critical

cross_source critical

direct_vs_inferred critical

exact_phrase critical

verified medium

update_versioning high

