20260524T130808Z-kqze
Pass Rate: 65%
Avg Score: 77.8
Total Cases: 100
Passed: 65
Pass Rate by Category
Category Breakdown
| Category | Cases | Pass Rate | Avg Score | Stability |
|---|---|---|---|---|
| adversarial | 5 | 40% | 82.2 | flaky |
| alias | 3 | 100% | 90.0 | stable pass |
| causal | 2 | 50% | 86.0 | flaky |
| conflicting_evidence | 3 | 33% | 62.7 | flaky |
| context_isolation | 5 | 80% | 98.0 | flaky |
| cross_source | 10 | 80% | 85.2 | flaky |
| deduplication | 2 | 100% | 100.0 | stable pass |
| direct_vs_inferred | 10 | 100% | 96.8 | stable pass |
| entity_disambiguation | 3 | 33% | 58.3 | flaky |
| exact_phrase | 5 | 40% | 61.8 | flaky |
| exact_source_id | 5 | 40% | 60.0 | flaky |
| identifier | 3 | 100% | 100.0 | stable pass |
| judge_reliability | 5 | 20% | 54.4 | flaky |
| missing_source | 5 | 20% | 44.0 | flaky |
| needle_haystack | 3 | 0% | 21.7 | stable fail |
| negation | 3 | 100% | 100.0 | stable pass |
| numerical | 3 | 100% | 98.3 | stable pass |
| provenance | 2 | 50% | 65.0 | flaky |
| raw_text | 5 | 100% | 100.0 | stable pass |
| semantic_drift | 3 | 33% | 59.0 | flaky |
| storage | 5 | 80% | 90.0 | flaky |
| supersession | 2 | 50% | 67.0 | flaky |
| update_versioning | 5 | 80% | 80.0 | flaky |
| verified | 3 | 67% | 67.7 | flaky |
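The per-category figures above can be reproduced from individual case results. The following is a minimal sketch, not the dashboard's actual implementation: `CaseResult`, `stability_label`, and `summarize` are hypothetical names, and the stability rule (100% is "stable pass", 0% is "stable fail", anything in between is "flaky") is only inferred from the rows of this table. A score shown as `0/100` is assumed to mean 0. The sample data is the adversarial and needle_haystack cases from this run.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One test case result (hypothetical structure, not the tool's schema)."""
    case_id: str
    category: str
    pass_rate: float  # percent, 0 or 100 in this run
    score: float      # out of 100


def stability_label(pass_rate: float) -> str:
    # Assumed rule, inferred from the table: only 0% and 100% count as stable.
    if pass_rate == 100:
        return "stable pass"
    if pass_rate == 0:
        return "stable fail"
    return "flaky"


def summarize(results: list[CaseResult]) -> dict[str, dict]:
    """Group cases by category and compute count, pass rate, and average score."""
    by_category: dict[str, list[CaseResult]] = defaultdict(list)
    for result in results:
        by_category[result.category].append(result)

    summary = {}
    for category, rows in sorted(by_category.items()):
        passed = sum(1 for r in rows if r.pass_rate == 100)
        pass_rate = 100 * passed / len(rows)
        summary[category] = {
            "cases": len(rows),
            "pass_rate": round(pass_rate),
            "avg_score": round(sum(r.score for r in rows) / len(rows), 1),
            "stability": stability_label(pass_rate),
        }
    return summary


# Case results copied from this run's test case listing.
results = [
    CaseResult("ADV-001", "adversarial", 0, 98.0),
    CaseResult("ADV-002", "adversarial", 0, 36.0),
    CaseResult("ADV-003", "adversarial", 100, 100.0),
    CaseResult("ADV-004", "adversarial", 100, 97.0),
    CaseResult("ADV-005", "adversarial", 0, 80.0),
    CaseResult("NIH-001", "needle_haystack", 0, 65.0),
    CaseResult("NIH-002", "needle_haystack", 0, 0.0),
    CaseResult("NIH-003", "needle_haystack", 0, 0.0),
]

summary = summarize(results)
for category, stats in summary.items():
    print(category, stats)
```

Under these assumptions the sketch matches the adversarial row (5 cases, 40%, 82.2, flaky) and the needle_haystack row (3 cases, 0%, 21.7, stable fail). If the dashboard derives stability from repeated runs of each case instead, the labelling rule would need to change.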
Test Cases (100 total, sorted by pass rate)
| Case | Category | Severity | Pass Rate | Score |
|---|---|---|---|---|
| ADV-001 | adversarial | critical | 0% | 98.0/100 |
| MISS-003 | missing_source | high | 0% | 75.0/100 |
| ENT-002 | entity_disambiguation | high | 0% | 25.0/100 |
| XSR-008 | cross_source | high | 0% | 0/100 |
| SDR-001 | semantic_drift | high | 0% | 37.0/100 |
| CAU-001 | causal | high | 0% | 77.0/100 |
| ADV-005 | adversarial | critical | 0% | 80.0/100 |
| PRV-002 | provenance | high | 0% | 30.0/100 |
| CON-001 | conflicting_evidence | critical | 0% | 68.0/100 |
| ADV-002 | adversarial | critical | 0% | 36.0/100 |
| PHR-004 | exact_phrase | high | 0% | 73.0/100 |
| SID-003 | exact_source_id | high | 0% | 15.0/100 |
| PHR-005 | exact_phrase | critical | 0% | 10.0/100 |
| JUDGE-004 | judge_reliability | critical | 0% | 0.0/100 |
| NIH-001 | needle_haystack | high | 0% | 65.0/100 |
| SDR-002 | semantic_drift | high | 0% | 40.0/100 |
| JUDGE-002 | judge_reliability | critical | 0% | 70.0/100 |
| JUDGE-005 | judge_reliability | high | 0% | 47.0/100 |
| VER-004 | update_versioning | medium | 0% | 0/100 |
| VRF-001 | verified | high | 0% | 10.0/100 |
| MISS-005 | missing_source | critical | 0% | 20.0/100 |
| ISO-005 | context_isolation | critical | 0% | 90.0/100 |
| SUP-001 | supersession | high | 0% | 45.0/100 |
| NIH-003 | needle_haystack | high | 0% | 0/100 |
| PHR-003 | exact_phrase | high | 0% | 31.0/100 |
| MISS-004 | missing_source | critical | 0% | 5.0/100 |
| JUDGE-003 | judge_reliability | critical | 0% | 55.0/100 |
| SID-005 | exact_source_id | critical | 0% | 20.0/100 |
| XSR-006 | cross_source | critical | 0% | 52.0/100 |
| CON-003 | conflicting_evidence | high | 0% | 20.0/100 |
| NIH-002 | needle_haystack | critical | 0% | 0/100 |
| ENT-003 | entity_disambiguation | critical | 0% | 50.0/100 |
| MISS-002 | missing_source | critical | 0% | 20.0/100 |
| SID-001 | exact_source_id | critical | 0% | 65.0/100 |
| STOR-003 | storage | high | 0% | 55.0/100 |
| DVI-004 | direct_vs_inferred | high | 100% | 97.0/100 |
| VRF-002 | verified | high | 100% | 100.0/100 |
| SDR-003 | semantic_drift | high | 100% | 100.0/100 |
| STOR-005 | storage | critical | 100% | 100.0/100 |
| MISS-001 | missing_source | critical | 100% | 100.0/100 |
| DVI-003 | direct_vs_inferred | high | 100% | 100.0/100 |
| DVI-006 | direct_vs_inferred | medium | 100% | 100.0/100 |
| IDN-001 | identifier | critical | 100% | 100.0/100 |
| VER-001 | update_versioning | critical | 100% | 100.0/100 |
| ADV-004 | adversarial | critical | 100% | 97.0/100 |
| ISO-001 | context_isolation | critical | 100% | 100.0/100 |
| XSR-005 | cross_source | high | 100% | 100.0/100 |
| CAU-002 | causal | high | 100% | 95.0/100 |
| NEG-002 | negation | high | 100% | 100.0/100 |
| DVI-008 | direct_vs_inferred | critical | 100% | 100.0/100 |
| XSR-010 | cross_source | high | 100% | 100.0/100 |
| RAW-002 | raw_text | critical | 100% | 100.0/100 |
| XSR-004 | cross_source | high | 100% | 100.0/100 |
| XSR-007 | cross_source | critical | 100% | 100.0/100 |
| DVI-010 | direct_vs_inferred | high | 100% | 100.0/100 |
| ISO-003 | context_isolation | critical | 100% | 100.0/100 |
| ALI-003 | alias | medium | 100% | 85.0/100 |
| VER-002 | update_versioning | high | 100% | 100.0/100 |
| DVI-002 | direct_vs_inferred | high | 100% | 100.0/100 |
| CON-002 | conflicting_evidence | high | 100% | 100.0/100 |
| STOR-001 | storage | critical | 100% | 100.0/100 |
| DVI-005 | direct_vs_inferred | high | 100% | 86.0/100 |
| RAW-001 | raw_text | critical | 100% | 100.0/100 |
| RAW-003 | raw_text | high | 100% | 100.0/100 |
| RAW-004 | raw_text | high | 100% | 100.0/100 |
| SID-004 | exact_source_id | high | 100% | 100.0/100 |
| ENT-001 | entity_disambiguation | critical | 100% | 100.0/100 |
| PRV-001 | provenance | high | 100% | 100.0/100 |
| RAW-005 | raw_text | high | 100% | 100.0/100 |
| IDN-003 | identifier | high | 100% | 100.0/100 |
| NEG-001 | negation | critical | 100% | 100.0/100 |
| DDP-001 | deduplication | high | 100% | 100.0/100 |
| VER-005 | update_versioning | critical | 100% | 100.0/100 |
| PHR-002 | exact_phrase | critical | 100% | 95.0/100 |
| XSR-009 | cross_source | critical | 100% | 100.0/100 |
| NEG-003 | negation | critical | 100% | 100.0/100 |
| DVI-007 | direct_vs_inferred | medium | 100% | 100.0/100 |
| XSR-003 | cross_source | critical | 100% | 100.0/100 |
| ALI-002 | alias | high | 100% | 85.0/100 |
| ISO-002 | context_isolation | high | 100% | 100.0/100 |
| SID-002 | exact_source_id | critical | 100% | 100.0/100 |
| STOR-002 | storage | critical | 100% | 100.0/100 |
| SUP-002 | supersession | high | 100% | 89.0/100 |
| ALI-001 | alias | high | 100% | 100.0/100 |
| STOR-004 | storage | high | 100% | 95.0/100 |
| DDP-002 | deduplication | medium | 100% | 100.0/100 |
| DVI-009 | direct_vs_inferred | high | 100% | 85.0/100 |
| NUM-002 | numerical | high | 100% | 100.0/100 |
| XSR-002 | cross_source | critical | 100% | 100.0/100 |
| NUM-001 | numerical | high | 100% | 100.0/100 |
| IDN-002 | identifier | high | 100% | 100.0/100 |
| ISO-004 | context_isolation | high | 100% | 100.0/100 |
| JUDGE-001 | judge_reliability | high | 100% | 100.0/100 |
| ADV-003 | adversarial | critical | 100% | 100.0/100 |
| XSR-001 | cross_source | critical | 100% | 100.0/100 |
| NUM-003 | numerical | high | 100% | 95.0/100 |
| DVI-001 | direct_vs_inferred | critical | 100% | 100.0/100 |
| PHR-001 | exact_phrase | critical | 100% | 100.0/100 |
| VRF-003 | verified | medium | 100% | 93.0/100 |
| VER-003 | update_versioning | high | 100% | 100.0/100 |