Retrieval precision@5
designed, unverified
Current
—
Target
≥ 0.80
gold-set eval pending corpus assembly
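For reference, precision@5 is conventionally the share of the top five retrieved items that appear in the gold relevant set. The sketch below is illustrative only: the function name, the document identifiers, and the choice to divide by k rather than by the number of results returned are assumptions, not details taken from this page.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are in the gold relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(ranked_ids)[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Assumption: divide by k (not len(top_k)) so short result lists are penalised.
    return hits / k


# Hypothetical query: three of the top five results are in the gold set.
ranked = ["d1", "d7", "d3", "d9", "d4", "d8"]
gold = {"d1", "d3", "d4", "d8"}
score = precision_at_k(ranked, gold, k=5)
meets_target = score >= 0.80  # target stated on this page
```

Averaging this score over every query in the gold set would give the single number compared against the ≥ 0.80 target.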
[demo:03] evaluation · Evaluation · sha:29b2de8e4054 · build:2026-06-20T23:34:47.545Z
Gold-set metrics, baseline targets, and known degraded results. Nothing hidden. Evaluation is the surface that proves or disproves the architecture claims.
This page separates measured demo evidence from Phase I design targets. The local gold-evidence package exists; public release of the source set, full shape-gate execution, and signed receipt packaging remain Phase I work.
Model identifiers and private run paths are omitted from this public surface. Hallucination rate improved from 0.673 in the prior baseline to 0.183 in the strongest current run (Local model A).
run 2026-05-25
| Run | Precision | Hallucination | Abstention | Composite | Verdict |
|---|---|---|---|---|---|
| Local model A current run | 0.833 | 0.183 | 0.917 | 0.845 | PASS DEGRADED |
| Local model B current run | 0.810 | 0.190 | 0.783 | 0.804 | PASS DEGRADED |
| Prior baseline | 0.882 | 0.673 | 0.917 | 0.723 | PASS DEGRADED |
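The change between the prior baseline and Local model A can be read off the table as an absolute and a relative difference. The snippet below only restates the table's numbers; the helper name `change` is hypothetical. On these figures the hallucination rate falls by 0.490 (about 73% relative), while precision falls by 0.049.

```python
from typing import Tuple

# Values copied from the table above.
PRIOR = {"precision": 0.882, "hallucination": 0.673}
MODEL_A = {"precision": 0.833, "hallucination": 0.183}


def change(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute change, relative change) from before to after."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


for metric in ("hallucination", "precision"):
    absolute, relative = change(PRIOR[metric], MODEL_A[metric])
    print(f"{metric}: {absolute:+.3f} absolute, {relative:+.1%} relative")
```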
Current
—
Target
≥ 0.80
gold-set eval pending corpus assembly
Current
—
Target
≥ 0.85
gold-set eval pending corpus assembly
Current
—
Target
≥ 0.90
groundedness metric: claim-to-citation ratio
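One plausible reading of a claim-to-citation ratio is the share of claims in a report that carry at least one citation. This page does not define the metric precisely, so the `Claim` type, the function name, and the handling of empty reports below are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    """Hypothetical container: one factual statement and the citations attached to it."""
    text: str
    citation_ids: List[str] = field(default_factory=list)


def groundedness(claims: List[Claim]) -> float:
    """Share of claims that have at least one citation attached."""
    if not claims:
        # Assumption: a report with no claims earns no groundedness credit.
        return 0.0
    cited = sum(1 for claim in claims if claim.citation_ids)
    return cited / len(claims)


report_claims = [
    Claim("The gate rejects clipped citations.", ["src-1"]),
    Claim("Receipts are signed.", ["src-2", "src-3"]),
    Claim("The corpus is public.", []),  # uncited claim lowers the score
    Claim("Targets are Phase I goals.", ["src-1"]),
]
score = groundedness(report_claims)
```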
Current
—
Target
≥ 0.95
citation resolver verification pass
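A citation resolver pass could be scored as the fraction of citations that resolve to a known source and whose quoted text actually appears in that source. The following is a minimal sketch under that assumption; the data shapes and the function name are hypothetical, and the page does not describe the real resolver.

```python
from typing import Dict, List, Tuple


def citation_resolution_rate(
    citations: List[Tuple[str, str]], sources: Dict[str, str]
) -> float:
    """citations: (source_id, quoted_text) pairs; sources: source_id -> full text."""
    if not citations:
        # Assumption: no citations means nothing was verified.
        return 0.0
    resolved = 0
    for source_id, quote in citations:
        text = sources.get(source_id)
        # A citation counts only if the source exists and contains the quote verbatim.
        if text is not None and quote in text:
            resolved += 1
    return resolved / len(citations)
```

Exact substring matching is the strictest possible check; a real resolver might normalise whitespace or tolerate minor quoting differences.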
Current
—
Target
≥ 0.90 (abstains when it should, answers when it can)
false-abstention / false-answer tradeoff analysis
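The tradeoff named here can be made concrete by counting two kinds of error: abstaining on a question the gold set marks answerable, and answering one it marks unanswerable. The sketch below assumes each gold item carries an `answerable` flag and each system output an `abstained` flag; both names and the report keys are illustrative.

```python
from typing import Dict, Iterable, Tuple


def abstention_report(decisions: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """decisions: (answerable, abstained) pairs, one per gold-set question."""
    total = correct = false_abstentions = false_answers = 0
    for answerable, abstained in decisions:
        total += 1
        if answerable and abstained:
            false_abstentions += 1  # should have answered
        elif not answerable and not abstained:
            false_answers += 1  # should have abstained
        else:
            correct += 1
    if total == 0:
        raise ValueError("no decisions to score")
    return {
        "accuracy": correct / total,
        "false_abstention_rate": false_abstentions / total,
        "false_answer_rate": false_answers / total,
    }
```

Under this reading, the ≥ 0.90 target would apply to `accuracy`, with the two error rates showing which side of the tradeoff is costing points.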
Current
designed, not yet run
Target
repaired output passes verifier on re-check
Known degraded result: the initial composer output may contain missing citation fields or inconsistent confidence signals. The repair gate is designed to catch these and trigger recomposition. This behavior is intentional: the system is expected to self-correct, and a repaired report must pass the verifier again before it is accepted.
design doc: report-shape repair gate
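The verify, recompose, and re-check cycle described above might look like the loop below. It is a sketch under stated assumptions: the `verify` checks, the `recompose` callback, the dictionary report shape, and the retry limit are all hypothetical and are not taken from the design doc.

```python
from typing import Callable, Dict, List, Tuple

Report = Dict[str, object]


def verify(report: Report) -> List[str]:
    """Return a list of problems; an empty list means the report passes."""
    problems = []
    if not report.get("citations"):
        problems.append("missing citation fields")
    confidence = report.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append("inconsistent confidence signal")
    return problems


def repair_gate(
    report: Report,
    recompose: Callable[[Report, List[str]], Report],
    max_attempts: int = 2,
) -> Tuple[Report, List[str]]:
    """Recompose until the verifier passes or the attempt budget is spent."""
    problems = verify(report)
    attempts = 0
    while problems and attempts < max_attempts:
        report = recompose(report, problems)
        attempts += 1
        problems = verify(report)  # the repaired output is always re-checked
    return report, problems
```

Returning the remaining problems alongside the report lets a caller distinguish a clean pass from a report that is still degraded after the retry budget runs out.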
Flags answers that blend incompatible source claims without declaring the conflict. Current state: specified for Phase I implementation.
public-safe gate summary
Rejects reports that repeat the same section heading or evidence block. Current state: specified for Phase I implementation.
public-safe gate summary
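A repetition check of this kind could be as simple as counting normalised headings. The sketch below is an assumption about how such a gate might work, not the specified implementation; the function name and the normalisation rule (case-insensitive, whitespace-collapsed) are illustrative.

```python
from collections import Counter
from typing import List


def duplicate_headings(headings: List[str]) -> List[str]:
    """Return normalised headings that occur more than once, sorted alphabetically."""
    # Lower-case and collapse whitespace so "Answer" and "answer " count as the same heading.
    normalized = [" ".join(heading.lower().split()) for heading in headings]
    counts = Counter(normalized)
    return sorted(heading for heading, n in counts.items() if n > 1)


def passes_repetition_gate(headings: List[str]) -> bool:
    """A report passes only if no heading repeats."""
    return not duplicate_headings(headings)
```

The same counting approach would apply to evidence blocks if each block were first reduced to a hash of its normalised text.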
Rejects reports with incomplete sections, clipped citations, or missing receipt fields. Current state: specified for Phase I implementation.
public-safe gate summary
Rejects unresolved template tokens and unfinished draft markers before review packaging. Current state: specified for Phase I implementation.
public-safe gate summary
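Unresolved template tokens and draft markers are typically found with pattern matching. The patterns below (double-brace tokens and TODO/TBD/FIXME markers) are assumed examples; the page does not say which token syntax or markers the gate targets.

```python
import re
from typing import List

# Assumed placeholder syntaxes, for illustration only.
PLACEHOLDER_PATTERNS = [
    re.compile(r"\{\{.*?\}\}"),  # template tokens such as {{answer}}
    re.compile(r"\b(?:TODO|TBD|FIXME)\b"),  # unfinished draft markers
]


def find_placeholders(text: str) -> List[str]:
    """Return every placeholder match, grouped by pattern in the order listed above."""
    found: List[str] = []
    for pattern in PLACEHOLDER_PATTERNS:
        found.extend(match.group(0) for match in pattern.finditer(text))
    return found


def passes_placeholder_gate(text: str) -> bool:
    """A report passes only if no placeholder survives."""
    return not find_placeholders(text)
```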
Rejects reports that omit required answer, citation, uncertainty, and receipt sections. Current state: specified for Phase I implementation.
public-safe gate summary
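The four required sections named above lend themselves to a presence check. In this sketch the report is assumed to be a dictionary keyed by section name; the key names and the rule that an empty section counts as missing are illustrative choices.

```python
from typing import Dict, List

# Section names follow the gate summary; the exact keys are hypothetical.
REQUIRED_SECTIONS = ("answer", "citations", "uncertainty", "receipt")


def missing_sections(report: Dict[str, object]) -> List[str]:
    """List required sections that are absent or empty, in the order defined above."""
    return [name for name in REQUIRED_SECTIONS if not report.get(name)]


def passes_required_sections_gate(report: Dict[str, object]) -> bool:
    """A report passes only if every required section is present and non-empty."""
    return not missing_sections(report)
```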