
20260524T124148Z-z2do

Pass Rate: 61%
Avg Score: 71.1
Total Cases: 72
Passed: 44

Pass Rate by Category

Category Breakdown

Category               Cases  Pass Rate  Avg Score  Stability
adversarial            5      80%        95.4       flaky
alias                  3      100%       100.0      stable pass
causal                 2      100%       88.5       stable pass
conflicting_evidence   3      100%       100.0      stable pass
context_isolation      5      80%        98.0       flaky
cross_source           10     10%        10.0       flaky
deduplication          2      100%       100.0      stable pass
direct_vs_inferred     10     100%       98.4       stable pass
entity_disambiguation  3      33%        28.7       flaky
exact_phrase           5      100%       100.0      stable pass
exact_source_id        5      20%        63.6       flaky
identifier             3      100%       100.0      stable pass
judge_reliability      5      20%        49.6       flaky
missing_source         5      20%        56.6       flaky
needle_haystack        1      0%         0.0        stable fail
negation               3      67%        85.7       flaky
numerical              1      100%       100.0      stable pass
raw_text               1      0%         0.0        stable fail
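
The headline figures for this run can be cross-checked against the category rows. The sketch below is only an illustration: it assumes the overall average score is the case-weighted mean of the per-category averages and that each category's passed count is its case count times its pass rate, rounded to the nearest whole case. The names `ROWS` and `summarize` are hypothetical and not part of the dashboard.

```python
# Rows copied from the Category Breakdown table:
# (category, cases, pass rate in percent, average score)
ROWS = [
    ("adversarial", 5, 80, 95.4),
    ("alias", 3, 100, 100.0),
    ("causal", 2, 100, 88.5),
    ("conflicting_evidence", 3, 100, 100.0),
    ("context_isolation", 5, 80, 98.0),
    ("cross_source", 10, 10, 10.0),
    ("deduplication", 2, 100, 100.0),
    ("direct_vs_inferred", 10, 100, 98.4),
    ("entity_disambiguation", 3, 33, 28.7),
    ("exact_phrase", 5, 100, 100.0),
    ("exact_source_id", 5, 20, 63.6),
    ("identifier", 3, 100, 100.0),
    ("judge_reliability", 5, 20, 49.6),
    ("missing_source", 5, 20, 56.6),
    ("needle_haystack", 1, 0, 0.0),
    ("negation", 3, 67, 85.7),
    ("numerical", 1, 100, 100.0),
    ("raw_text", 1, 0, 0.0),
]


def summarize(rows):
    """Return (total cases, passed cases, pass rate %, average score)."""
    total = sum(cases for _, cases, _, _ in rows)
    # Assumption: passed cases per category = cases * pass rate, rounded,
    # because the table shows pass rates as whole percentages.
    passed = sum(round(cases * rate / 100) for _, cases, rate, _ in rows)
    # Assumption: the overall average is weighted by cases per category.
    avg_score = sum(cases * avg for _, cases, _, avg in rows) / total
    return total, passed, round(100 * passed / total), round(avg_score, 1)


if __name__ == "__main__":
    total, passed, pass_rate, avg_score = summarize(ROWS)
    print(f"Total Cases: {total}")
    print(f"Passed: {passed}")
    print(f"Pass Rate: {pass_rate}%")
    print(f"Avg Score: {avg_score}")
```

Under these assumptions the recomputed totals agree with the summary at the top of the run: 72 cases, 44 passed, a 61% pass rate, and an average score of 71.1.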

Test Cases (72 total, sorted by pass rate)

Case       Category               Severity  Pass Rate  Score
ADV-001    adversarial            critical  0%         95.0/100
MISS-003   missing_source         high      0%         68.0/100
ENT-002    entity_disambiguation  high      0%         0/100
XSR-008    cross_source           high      0%         0/100
XSR-005    cross_source           high      0%         0/100
XSR-010    cross_source           high      0%         0/100
XSR-004    cross_source           high      0%         0/100
XSR-007    cross_source           critical  0%         0/100
SID-003    exact_source_id        high      0%         60.0/100
RAW-001    raw_text               critical  0%         0.0/100
JUDGE-004  judge_reliability      critical  0%         15.0/100
SID-004    exact_source_id        high      0%         70.0/100
ENT-001    entity_disambiguation  critical  0%         0/100
JUDGE-002  judge_reliability      critical  0%         60.0/100
JUDGE-005  judge_reliability      high      0%         35.0/100
XSR-009    cross_source           critical  0%         0/100
NEG-003    negation               critical  0%         57.0/100
MISS-005   missing_source         critical  0%         47.0/100
XSR-003    cross_source           critical  0%         0/100
ISO-005    context_isolation      critical  0%         90.0/100
SID-002    exact_source_id        critical  0%         23.0/100
MISS-004   missing_source         critical  0%         30.0/100
JUDGE-003  judge_reliability      critical  0%         38.0/100
SID-005    exact_source_id        critical  0%         65.0/100
XSR-006    cross_source           critical  0%         0/100
XSR-002    cross_source           critical  0%         0/100
NIH-002    needle_haystack        critical  0%         0/100
MISS-002   missing_source         critical  0%         38.0/100
DVI-004    direct_vs_inferred     high      100%       100.0/100
MISS-001   missing_source         critical  100%       100.0/100
DVI-003    direct_vs_inferred     high      100%       100.0/100
DVI-006    direct_vs_inferred     medium    100%       99.0/100
IDN-001    identifier             critical  100%       100.0/100
CAU-001    causal                 high      100%       85.0/100
ADV-004    adversarial            critical  100%       90.0/100
ISO-001    context_isolation      critical  100%       100.0/100
CAU-002    causal                 high      100%       92.0/100
NEG-002    negation               high      100%       100.0/100
ADV-005    adversarial            critical  100%       100.0/100
DVI-008    direct_vs_inferred     critical  100%       100.0/100
CON-001    conflicting_evidence   critical  100%       100.0/100
ADV-002    adversarial            critical  100%       92.0/100
PHR-004    exact_phrase           high      100%       100.0/100
DVI-010    direct_vs_inferred     high      100%       100.0/100
ISO-003    context_isolation      critical  100%       100.0/100
ALI-003    alias                  medium    100%       100.0/100
PHR-005    exact_phrase           critical  100%       100.0/100
DVI-002    direct_vs_inferred     high      100%       100.0/100
CON-002    conflicting_evidence   high      100%       100.0/100
DVI-005    direct_vs_inferred     high      100%       100.0/100
IDN-003    identifier             high      100%       100.0/100
NEG-001    negation               critical  100%       100.0/100
DDP-001    deduplication          high      100%       100.0/100
PHR-002    exact_phrase           critical  100%       100.0/100
DVI-007    direct_vs_inferred     medium    100%       100.0/100
ALI-002    alias                  high      100%       100.0/100
ISO-002    context_isolation      high      100%       100.0/100
PHR-003    exact_phrase           high      100%       100.0/100
ALI-001    alias                  high      100%       100.0/100
DDP-002    deduplication          medium    100%       100.0/100
DVI-009    direct_vs_inferred     high      100%       90.0/100
NUM-002    numerical              high      100%       100.0/100
CON-003    conflicting_evidence   high      100%       100.0/100
IDN-002    identifier             high      100%       100.0/100
ISO-004    context_isolation      high      100%       100.0/100
ENT-003    entity_disambiguation  critical  100%       86.0/100
JUDGE-001  judge_reliability      high      100%       100.0/100
ADV-003    adversarial            critical  100%       100.0/100
XSR-001    cross_source           critical  100%       100.0/100
SID-001    exact_source_id        critical  100%       100.0/100
DVI-001    direct_vs_inferred     critical  100%       95.0/100
PHR-001    exact_phrase           critical  100%       100.0/100

