← All Runs

20260524T113756Z-kduj

Pass Rate: 56%
Avg Score: 64.3
Total Cases: 100
Passed: 56

[Chart: Pass Rate by Category]

Category Breakdown

Category Cases Pass Rate Avg Score Stability
adversarial 5 80% 90.6 flaky
alias 3 100% 99.7 stable pass
causal 2 50% 80.0 flaky
conflicting_evidence 3 100% 100.0 stable pass
context_isolation 5 0% 0.0 stable fail
cross_source 10 10% 10.0 flaky
deduplication 2 100% 100.0 stable pass
direct_vs_inferred 10 30% 30.0 flaky
entity_disambiguation 3 67% 80.0 flaky
exact_phrase 5 100% 100.0 stable pass
exact_source_id 5 60% 83.0 flaky
identifier 3 100% 100.0 stable pass
judge_reliability 5 20% 50.0 flaky
missing_source 5 40% 74.2 flaky
needle_haystack 3 33% 33.3 flaky
negation 3 100% 100.0 stable pass
numerical 3 100% 99.0 stable pass
provenance 2 100% 100.0 stable pass
raw_text 5 100% 100.0 stable pass
semantic_drift 3 100% 97.0 stable pass
storage 5 80% 94.0 flaky
supersession 2 50% 67.5 flaky
update_versioning 5 0% 0.0 stable fail
verified 3 33% 82.3 flaky
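
The per-category figures in the table above can be reproduced from the individual case results. The sketch below is a minimal illustration, not the dashboard's actual implementation: `CaseResult` and `summarize` are hypothetical names, and the stability rule (every case passes gives "stable pass", no case passes gives "stable fail", anything else gives "flaky") is an assumption inferred from the labels shown in the table. The sample data are the five adversarial cases listed under Test Cases.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One test case outcome (hypothetical structure for illustration)."""
    case_id: str
    passed: bool
    score: float  # 0-100 scale, as shown in the Test Cases list


def summarize(cases):
    """Return (pass rate in %, average score, stability label) for a category."""
    total = len(cases)
    passed = sum(1 for c in cases if c.passed)
    pass_rate = round(100 * passed / total)
    avg_score = round(sum(c.score for c in cases) / total, 1)
    # Assumed labelling rule, inferred from the table rather than documented.
    if passed == total:
        label = "stable pass"
    elif passed == 0:
        label = "stable fail"
    else:
        label = "flaky"
    return pass_rate, avg_score, label


# Adversarial cases as reported in this run.
adversarial = [
    CaseResult("ADV-001", True, 100.0),
    CaseResult("ADV-002", False, 55.0),
    CaseResult("ADV-003", True, 100.0),
    CaseResult("ADV-004", True, 98.0),
    CaseResult("ADV-005", True, 100.0),
]

print(summarize(adversarial))  # (80, 90.6, 'flaky')
```

The result matches the adversarial row (5 cases, 80%, 90.6, flaky). The same function applied to the other categories can serve as a quick consistency check between the table and the case list.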

Test Cases (100 total, sorted by pass rate)

ID Category Severity Pass Rate Score
DVI-004 direct_vs_inferred high 0% 0/100
MISS-001 missing_source critical 0% 100.0/100
ENT-002 entity_disambiguation high 0% 45.0/100
DVI-003 direct_vs_inferred high 0% 0/100
DVI-006 direct_vs_inferred medium 0% 0/100
XSR-008 cross_source high 0% 0/100
CAU-001 causal high 0% 75.0/100
VER-001 update_versioning critical 0% 0/100
ISO-001 context_isolation critical 0% 0/100
XSR-005 cross_source high 0% 0/100
XSR-010 cross_source high 0% 0/100
XSR-004 cross_source high 0% 0/100
XSR-007 cross_source critical 0% 0/100
ADV-002 adversarial critical 0% 55.0/100
ISO-003 context_isolation critical 0% 0/100
VER-002 update_versioning high 0% 0/100
DVI-002 direct_vs_inferred high 0% 0/100
DVI-005 direct_vs_inferred high 0% 0/100
JUDGE-004 judge_reliability critical 0% 0.0/100
JUDGE-002 judge_reliability critical 0% 50.0/100
JUDGE-005 judge_reliability high 0% 45.0/100
VER-004 update_versioning medium 0% 0/100
VER-005 update_versioning critical 0% 0/100
XSR-009 cross_source critical 0% 0/100
VRF-001 verified high 0% 75.0/100
DVI-007 direct_vs_inferred medium 0% 0/100
XSR-003 cross_source critical 0% 0/100
ISO-005 context_isolation critical 0% 0/100
SUP-001 supersession high 0% 45.0/100
ISO-002 context_isolation high 0% 0/100
NIH-003 needle_haystack high 0% 0/100
MISS-004 missing_source critical 0% 15.0/100
JUDGE-003 judge_reliability critical 0% 55.0/100
SID-005 exact_source_id critical 0% 82.0/100
XSR-006 cross_source critical 0% 0/100
XSR-002 cross_source critical 0% 0/100
NIH-002 needle_haystack critical 0% 0/100
ISO-004 context_isolation high 0% 0/100
MISS-002 missing_source critical 0% 59.0/100
SID-001 exact_source_id critical 0% 33.0/100
DVI-001 direct_vs_inferred critical 0% 0/100
STOR-003 storage high 0% 73.0/100
VRF-003 verified medium 0% 72.0/100
VER-003 update_versioning high 0% 0/100
ADV-001 adversarial critical 100% 100.0/100
VRF-002 verified high 100% 100.0/100
SDR-003 semantic_drift high 100% 100.0/100
MISS-003 missing_source high 100% 100.0/100
STOR-005 storage critical 100% 100.0/100
SDR-001 semantic_drift high 100% 100.0/100
IDN-001 identifier critical 100% 100.0/100
ADV-004 adversarial critical 100% 98.0/100
CAU-002 causal high 100% 85.0/100
NEG-002 negation high 100% 100.0/100
ADV-005 adversarial critical 100% 100.0/100
DVI-008 direct_vs_inferred critical 100% 100.0/100
PRV-002 provenance high 100% 100.0/100
RAW-002 raw_text critical 100% 100.0/100
CON-001 conflicting_evidence critical 100% 100.0/100
PHR-004 exact_phrase high 100% 100.0/100
DVI-010 direct_vs_inferred high 100% 100.0/100
ALI-003 alias medium 100% 100.0/100
SID-003 exact_source_id high 100% 100.0/100
PHR-005 exact_phrase critical 100% 100.0/100
CON-002 conflicting_evidence high 100% 100.0/100
STOR-001 storage critical 100% 97.0/100
RAW-001 raw_text critical 100% 100.0/100
RAW-003 raw_text high 100% 100.0/100
RAW-004 raw_text high 100% 100.0/100
SID-004 exact_source_id high 100% 100.0/100
ENT-001 entity_disambiguation critical 100% 100.0/100
PRV-001 provenance high 100% 100.0/100
NIH-001 needle_haystack high 100% 100.0/100
RAW-005 raw_text high 100% 100.0/100
IDN-003 identifier high 100% 100.0/100
NEG-001 negation critical 100% 100.0/100
DDP-001 deduplication high 100% 100.0/100
SDR-002 semantic_drift high 100% 91.0/100
PHR-002 exact_phrase critical 100% 100.0/100
NEG-003 negation critical 100% 100.0/100
MISS-005 missing_source critical 100% 97.0/100
ALI-002 alias high 100% 99.0/100
SID-002 exact_source_id critical 100% 100.0/100
STOR-002 storage critical 100% 100.0/100
SUP-002 supersession high 100% 90.0/100
PHR-003 exact_phrase high 100% 100.0/100
ALI-001 alias high 100% 100.0/100
STOR-004 storage high 100% 100.0/100
DDP-002 deduplication medium 100% 100.0/100
DVI-009 direct_vs_inferred high 100% 100.0/100
NUM-002 numerical high 100% 100.0/100
NUM-001 numerical high 100% 100.0/100
CON-003 conflicting_evidence high 100% 100.0/100
IDN-002 identifier high 100% 100.0/100
ENT-003 entity_disambiguation critical 100% 95.0/100
JUDGE-001 judge_reliability high 100% 100.0/100
ADV-003 adversarial critical 100% 100.0/100
XSR-001 cross_source critical 100% 100.0/100
NUM-003 numerical high 100% 97.0/100
PHR-001 exact_phrase critical 100% 100.0/100
