20260524T124148Z-z2do
| Metric | Value |
|---|---|
| Pass Rate | 61% |
| Avg Score | 71.1 |
| Total Cases | 72 |
| Passed | 44 |
Category Breakdown
| Category | Cases | Pass Rate | Avg Score | Stability |
|---|---|---|---|---|
| adversarial | 5 | 80% | 95.4 | flaky |
| alias | 3 | 100% | 100.0 | stable pass |
| causal | 2 | 100% | 88.5 | stable pass |
| conflicting_evidence | 3 | 100% | 100.0 | stable pass |
| context_isolation | 5 | 80% | 98.0 | flaky |
| cross_source | 10 | 10% | 10.0 | flaky |
| deduplication | 2 | 100% | 100.0 | stable pass |
| direct_vs_inferred | 10 | 100% | 98.4 | stable pass |
| entity_disambiguation | 3 | 33% | 28.7 | flaky |
| exact_phrase | 5 | 100% | 100.0 | stable pass |
| exact_source_id | 5 | 20% | 63.6 | flaky |
| identifier | 3 | 100% | 100.0 | stable pass |
| judge_reliability | 5 | 20% | 49.6 | flaky |
| missing_source | 5 | 20% | 56.6 | flaky |
| needle_haystack | 1 | 0% | 0.0 | stable fail |
| negation | 3 | 67% | 85.7 | flaky |
| numerical | 1 | 100% | 100.0 | stable pass |
| raw_text | 1 | 0% | 0.0 | stable fail |
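
The headline figures can be cross-checked against the category table. The sketch below is a minimal Python example, not part of the dashboard itself: the `CATEGORIES` list and the `summarize` helper are hypothetical names, and the per-category passed counts are an assumption, recovered by rounding `cases * pass rate` to the nearest integer because the table only reports rounded percentages.

```python
# Rows copied from the Category Breakdown table:
# (category, cases, pass_rate_percent, avg_score)
CATEGORIES = [
    ("adversarial", 5, 80, 95.4),
    ("alias", 3, 100, 100.0),
    ("causal", 2, 100, 88.5),
    ("conflicting_evidence", 3, 100, 100.0),
    ("context_isolation", 5, 80, 98.0),
    ("cross_source", 10, 10, 10.0),
    ("deduplication", 2, 100, 100.0),
    ("direct_vs_inferred", 10, 100, 98.4),
    ("entity_disambiguation", 3, 33, 28.7),
    ("exact_phrase", 5, 100, 100.0),
    ("exact_source_id", 5, 20, 63.6),
    ("identifier", 3, 100, 100.0),
    ("judge_reliability", 5, 20, 49.6),
    ("missing_source", 5, 20, 56.6),
    ("needle_haystack", 1, 0, 0.0),
    ("negation", 3, 67, 85.7),
    ("numerical", 1, 100, 100.0),
    ("raw_text", 1, 0, 0.0),
]


def summarize(rows):
    """Recompute run-level totals from per-category rows (hypothetical helper)."""
    total = sum(cases for _, cases, _, _ in rows)
    # Assumption: passed cases per category = cases * pass rate, rounded,
    # since the table shows only whole-number percentages.
    passed = sum(round(cases * rate / 100) for _, cases, rate, _ in rows)
    # Overall average score is the case-weighted mean of category averages.
    avg_score = sum(cases * score for _, cases, _, score in rows) / total
    return {
        "total_cases": total,
        "passed": passed,
        "pass_rate_percent": round(100 * passed / total),
        "avg_score": round(avg_score, 1),
    }


summary = summarize(CATEGORIES)
print(summary)
```

Under these assumptions the recomputed totals agree with the summary at the top of the run: 72 cases, 44 passed, a 61% pass rate, and an average score of 71.1.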
Test Cases (72 total, sorted by pass rate)
| Case | Category | Severity | Pass Rate | Score |
|---|---|---|---|---|
| ADV-001 | adversarial | critical | 0% | 95.0/100 |
| MISS-003 | missing_source | high | 0% | 68.0/100 |
| ENT-002 | entity_disambiguation | high | 0% | 0/100 |
| XSR-008 | cross_source | high | 0% | 0/100 |
| XSR-005 | cross_source | high | 0% | 0/100 |
| XSR-010 | cross_source | high | 0% | 0/100 |
| XSR-004 | cross_source | high | 0% | 0/100 |
| XSR-007 | cross_source | critical | 0% | 0/100 |
| SID-003 | exact_source_id | high | 0% | 60.0/100 |
| RAW-001 | raw_text | critical | 0% | 0.0/100 |
| JUDGE-004 | judge_reliability | critical | 0% | 15.0/100 |
| SID-004 | exact_source_id | high | 0% | 70.0/100 |
| ENT-001 | entity_disambiguation | critical | 0% | 0/100 |
| JUDGE-002 | judge_reliability | critical | 0% | 60.0/100 |
| JUDGE-005 | judge_reliability | high | 0% | 35.0/100 |
| XSR-009 | cross_source | critical | 0% | 0/100 |
| NEG-003 | negation | critical | 0% | 57.0/100 |
| MISS-005 | missing_source | critical | 0% | 47.0/100 |
| XSR-003 | cross_source | critical | 0% | 0/100 |
| ISO-005 | context_isolation | critical | 0% | 90.0/100 |
| SID-002 | exact_source_id | critical | 0% | 23.0/100 |
| MISS-004 | missing_source | critical | 0% | 30.0/100 |
| JUDGE-003 | judge_reliability | critical | 0% | 38.0/100 |
| SID-005 | exact_source_id | critical | 0% | 65.0/100 |
| XSR-006 | cross_source | critical | 0% | 0/100 |
| XSR-002 | cross_source | critical | 0% | 0/100 |
| NIH-002 | needle_haystack | critical | 0% | 0/100 |
| MISS-002 | missing_source | critical | 0% | 38.0/100 |
| DVI-004 | direct_vs_inferred | high | 100% | 100.0/100 |
| MISS-001 | missing_source | critical | 100% | 100.0/100 |
| DVI-003 | direct_vs_inferred | high | 100% | 100.0/100 |
| DVI-006 | direct_vs_inferred | medium | 100% | 99.0/100 |
| IDN-001 | identifier | critical | 100% | 100.0/100 |
| CAU-001 | causal | high | 100% | 85.0/100 |
| ADV-004 | adversarial | critical | 100% | 90.0/100 |
| ISO-001 | context_isolation | critical | 100% | 100.0/100 |
| CAU-002 | causal | high | 100% | 92.0/100 |
| NEG-002 | negation | high | 100% | 100.0/100 |
| ADV-005 | adversarial | critical | 100% | 100.0/100 |
| DVI-008 | direct_vs_inferred | critical | 100% | 100.0/100 |
| CON-001 | conflicting_evidence | critical | 100% | 100.0/100 |
| ADV-002 | adversarial | critical | 100% | 92.0/100 |
| PHR-004 | exact_phrase | high | 100% | 100.0/100 |
| DVI-010 | direct_vs_inferred | high | 100% | 100.0/100 |
| ISO-003 | context_isolation | critical | 100% | 100.0/100 |
| ALI-003 | alias | medium | 100% | 100.0/100 |
| PHR-005 | exact_phrase | critical | 100% | 100.0/100 |
| DVI-002 | direct_vs_inferred | high | 100% | 100.0/100 |
| CON-002 | conflicting_evidence | high | 100% | 100.0/100 |
| DVI-005 | direct_vs_inferred | high | 100% | 100.0/100 |
| IDN-003 | identifier | high | 100% | 100.0/100 |
| NEG-001 | negation | critical | 100% | 100.0/100 |
| DDP-001 | deduplication | high | 100% | 100.0/100 |
| PHR-002 | exact_phrase | critical | 100% | 100.0/100 |
| DVI-007 | direct_vs_inferred | medium | 100% | 100.0/100 |
| ALI-002 | alias | high | 100% | 100.0/100 |
| ISO-002 | context_isolation | high | 100% | 100.0/100 |
| PHR-003 | exact_phrase | high | 100% | 100.0/100 |
| ALI-001 | alias | high | 100% | 100.0/100 |
| DDP-002 | deduplication | medium | 100% | 100.0/100 |
| DVI-009 | direct_vs_inferred | high | 100% | 90.0/100 |
| NUM-002 | numerical | high | 100% | 100.0/100 |
| CON-003 | conflicting_evidence | high | 100% | 100.0/100 |
| IDN-002 | identifier | high | 100% | 100.0/100 |
| ISO-004 | context_isolation | high | 100% | 100.0/100 |
| ENT-003 | entity_disambiguation | critical | 100% | 86.0/100 |
| JUDGE-001 | judge_reliability | high | 100% | 100.0/100 |
| ADV-003 | adversarial | critical | 100% | 100.0/100 |
| XSR-001 | cross_source | critical | 100% | 100.0/100 |
| SID-001 | exact_source_id | critical | 100% | 100.0/100 |
| DVI-001 | direct_vs_inferred | critical | 100% | 95.0/100 |
| PHR-001 | exact_phrase | critical | 100% | 100.0/100 |