[demo:03] evaluation · Evaluation · sha:29b2de8e4054 · build:2026-06-20T23:34:47.545Z

Evaluation

Gold-set metrics, baseline targets, and known degraded results. Nothing hidden. Evaluation is the surface that proves — or disproves — the architecture claims.

This page separates measured demo evidence from Phase I design targets. The local gold-evidence package exists; public release of the source set, full shape-gate execution, and signed receipt packaging remain Phase I work.

Measured demo summary

Model identifiers and private run paths are omitted from this public surface. Hallucination rate (lower is better) fell from 0.673 in the prior baseline to 0.183 in the strongest current run, Local model A.

run 2026-05-25

Run                        Precision  Hallucination  Abstention  Composite  Verdict
Local model A current run  0.833      0.183          0.917       0.845      PASS DEGRADED
Local model B current run  0.810      0.190          0.783       0.804      PASS DEGRADED
Prior baseline             0.882      0.673          0.917       0.723      PASS DEGRADED
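As a quick arithmetic check on the table above, the hallucination drop can be expressed as an absolute and a relative reduction. This is a minimal sketch that uses only the figures reported in the table; the function name `reduction` is illustrative and not part of the system.

```python
def reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute, relative) reduction for a lower-is-better rate."""
    absolute = before - after
    relative = absolute / before
    return absolute, relative


# Hallucination figures from the run 2026-05-25 table:
# prior baseline 0.673, Local model A current run 0.183.
absolute, relative = reduction(before=0.673, after=0.183)
print(f"absolute: {absolute:.3f}, relative: {relative:.1%}")
# prints "absolute: 0.490, relative: 72.8%"
```

Precision in the same comparison moves from 0.882 to 0.833, so the two columns are best read together.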

Retrieval precision@5

designed, unverified

Current: none reported
Target: ≥ 0.80

gold-set eval pending corpus assembly
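For reference, precision@5 is conventionally the share of the top five retrieved items that appear in the gold relevant set. The sketch below assumes document IDs are plain strings and that the gold set supplies a set of relevant IDs per query; the function name and the example IDs are hypothetical.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are relevant.

    Divides by k even when fewer than k items were retrieved, so a short
    result list is penalised rather than rewarded.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hypothetical query: three of the top five results are in the gold set.
ranked = ["d1", "d7", "d3", "d9", "d4", "d8"]
gold = {"d1", "d3", "d4", "d8"}
score = precision_at_k(ranked, gold, k=5)
print(score, score >= 0.80)  # 0.6 False
```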

Retrieval recall@10

designed, unverified

Current: none reported
Target: ≥ 0.85

gold-set eval pending corpus assembly
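Recall@10 differs from a precision measure in its denominator: it divides by the size of the gold relevant set rather than by the cutoff. A minimal sketch under the same assumptions (string IDs, a per-query gold set; all names hypothetical):

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the gold relevant items that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("recall is undefined for an empty relevant set")
    found = set(ranked_ids[:k]) & relevant_ids
    return len(found) / len(relevant_ids)


# Hypothetical query: the gold set has four items, three are retrieved in the top ten.
ranked = ["d1", "d7", "d3", "d9", "d4", "d2", "d5", "d6", "d0", "d11", "d8"]
gold = {"d1", "d3", "d4", "d8"}
print(recall_at_k(ranked, gold, k=10))  # 0.75
```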

Answer groundedness

designed, unverified

Current: none reported
Target: ≥ 0.90

groundedness metric: claim-to-citation ratio
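The page names the groundedness metric as a claim-to-citation ratio but does not define it further. One plausible reading, sketched below, is the share of answer claims that carry at least one citation. The `Claim` type, its fields, and the handling of empty answers are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    citation_ids: list[str] = field(default_factory=list)


def groundedness(claims: list[Claim]) -> float:
    """Share of claims backed by at least one citation (assumed definition)."""
    if not claims:
        return 0.0  # assumption: an answer with no claims counts as ungrounded
    cited = sum(1 for claim in claims if claim.citation_ids)
    return cited / len(claims)


# Hypothetical answer with four claims, one of them uncited.
answer = [
    Claim("Logs are retained for 30 days.", ["doc-1"]),
    Claim("Retention is configurable."),
    Claim("Deletion is logged.", ["doc-2", "doc-3"]),
    Claim("Exports are encrypted.", ["doc-1"]),
]
print(groundedness(answer))  # 0.75
```

Note that this ratio only checks that a citation is present; whether the citation actually supports the claim is a separate question.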

Citation accuracy

designed, unverified

Current: none reported
Target: ≥ 0.95

citation resolver verification pass
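A citation resolver verification pass could be sketched as follows: each citation is resolved to its source text and counted as accurate only if the quoted span is found there. The data shapes (a `(source_id, quote)` pair and a dictionary of source texts) and the exact-substring match are simplifying assumptions, not the system's actual resolver.

```python
def citation_accuracy(citations: list[tuple[str, str]], sources: dict[str, str]) -> float:
    """Share of (source_id, quote) citations whose quote appears in the resolved source."""
    if not citations:
        raise ValueError("no citations to verify")
    verified = 0
    for source_id, quote in citations:
        source_text = sources.get(source_id)  # None when the ID does not resolve
        if source_text is not None and quote in source_text:
            verified += 1
    return verified / len(citations)


# Hypothetical sources and citations: one correct, one misquoted, one unresolvable.
sources = {"doc-1": "The service retains logs for 30 days."}
citations = [
    ("doc-1", "retains logs for 30 days"),
    ("doc-1", "retains logs for 90 days"),
    ("doc-9", "retains logs for 30 days"),
]
print(round(citation_accuracy(citations, sources), 3))  # 0.333
```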

Abstention correctness

designed, unverified

Current: none reported
Target: ≥ 0.90 (abstains when it should, answers when it can)

false-abstention / false-answer tradeoff analysis
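The false-abstention / false-answer tradeoff can be made concrete by scoring each gold-set question on two booleans: whether the system should have abstained and whether it did. The sketch below assumes that labelling scheme; the function and key names are hypothetical.

```python
def abstention_report(cases: list[tuple[bool, bool]]) -> dict[str, float]:
    """Each case is (should_abstain, did_abstain) for one gold-set question."""
    if not cases:
        raise ValueError("no cases to score")
    correct = sum(1 for should, did in cases if should == did)
    answerable = [did for should, did in cases if not should]
    unanswerable = [did for should, did in cases if should]
    # False abstention: the system abstained on a question it could have answered.
    false_abstention = sum(answerable) / len(answerable) if answerable else 0.0
    # False answer: the system answered a question it should have declined.
    false_answer = unanswerable.count(False) / len(unanswerable) if unanswerable else 0.0
    return {
        "abstention_correctness": correct / len(cases),
        "false_abstention_rate": false_abstention,
        "false_answer_rate": false_answer,
    }


# Hypothetical five-question sample: two unanswerable, three answerable.
sample = [(True, True), (True, False), (False, False), (False, False), (False, True)]
print(abstention_report(sample))
```

Reporting the two error rates separately keeps a system from reaching the target simply by abstaining more often.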

Report-shape repair gate

degraded

Current: designed, not yet run
Target: repaired output passes verifier on re-check

Known degraded result: initial composer output may include reports with missing citation fields or inconsistent confidence signals. The repair gate, which is designed but not yet run, is meant to catch these and trigger recomposition. This is intended behavior, not a bug: the system is designed to self-correct.

design doc: report-shape repair gate
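The verify, recompose, re-check cycle described for the repair gate can be sketched as a bounded loop. Everything below is illustrative: the report fields (`citations`, `source_id`, `confidence_label`, `confidence_score`), the consistency rule, and the `recompose` callback are assumptions, since the page does not publish the report schema.

```python
from typing import Callable


def verify(report: dict) -> list[str]:
    """Return a list of shape problems; an empty list means the report passes."""
    problems = []
    for i, citation in enumerate(report.get("citations", [])):
        if not citation.get("source_id"):
            problems.append(f"citation {i}: missing source_id")
    label = report.get("confidence_label")
    score = report.get("confidence_score")
    # Assumed consistency rule: a "high" label needs a score of at least 0.5.
    if label == "high" and (score is None or score < 0.5):
        problems.append("confidence label and score disagree")
    return problems


def repair_gate(
    report: dict,
    recompose: Callable[[dict, list[str]], dict],
    max_attempts: int = 2,
) -> tuple[dict, bool]:
    """Recompose until the verifier passes or attempts run out."""
    for _ in range(max_attempts):
        problems = verify(report)
        if not problems:
            return report, True
        report = recompose(report, problems)
    # Final re-check after the last recomposition.
    return report, not verify(report)


# Hypothetical draft with an empty citation field and a stand-in recomposer.
draft = {"citations": [{"source_id": ""}], "confidence_label": "high", "confidence_score": 0.9}
fixed, passed = repair_gate(draft, lambda r, p: {**r, "citations": [{"source_id": "doc-1"}]})
print(passed)  # True
```

Bounding the number of attempts keeps a report that cannot be repaired from looping indefinitely; it is returned with a failing flag instead.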

Output-shape gates

Conflict-leak detection

designed, unverified

Flags answers that blend incompatible source claims without declaring the conflict. Current state: specified for Phase I implementation.

public-safe gate summary
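One way to picture conflict-leak detection, assuming source claims have already been normalised into `(topic, value, source_id)` triples, is to flag any topic where the answer draws on differing values without declaring a conflict. That normalisation step is the hard part and is not shown; the triple format and function name are hypothetical.

```python
from collections import defaultdict


def leaked_conflicts(
    used_claims: list[tuple[str, str, str]],
    declared_conflicts: set[str],
) -> list[str]:
    """Topics where the answer blends differing values without declaring a conflict."""
    values_by_topic: dict[str, set[str]] = defaultdict(set)
    for topic, value, _source_id in used_claims:
        values_by_topic[topic].add(value)
    return sorted(
        topic
        for topic, values in values_by_topic.items()
        if len(values) > 1 and topic not in declared_conflicts
    )


# Hypothetical answer that draws on two sources disagreeing about one topic.
claims = [
    ("retention_period", "30 days", "doc-1"),
    ("retention_period", "90 days", "doc-2"),
    ("export_format", "CSV", "doc-1"),
]
print(leaked_conflicts(claims, declared_conflicts=set()))  # ['retention_period']
```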

Duplicate-section detection

designed, unverified

Rejects reports that repeat the same section heading or evidence block. Current state: specified for Phase I implementation.

public-safe gate summary
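A duplicate check of this kind can be sketched as counting items after light normalisation, applied once to section headings and once to evidence blocks. The normalisation used here (case-folding and whitespace collapsing) is an assumption; the specified gate may compare more strictly or more loosely.

```python
from collections import Counter


def duplicate_items(items: list[str]) -> list[str]:
    """Return normalised items that occur more than once."""
    normalised = [" ".join(item.lower().split()) for item in items]
    counts = Counter(normalised)
    return sorted(item for item, n in counts.items() if n > 1)


# Hypothetical report headings: "Answer" appears twice with different spacing and case.
headings = ["Answer", "Citations", " answer  ", "Receipt"]
print(duplicate_items(headings))  # ['answer']
```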

Truncation detection

designed, unverified

Rejects reports with incomplete sections, clipped citations, or missing receipt fields. Current state: specified for Phase I implementation.

public-safe gate summary
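The three truncation signals named above could be approximated with simple heuristics: section text that stops without closing punctuation, citations with unbalanced brackets, and receipts lacking required fields. The heuristics, the bracketed citation format, and the receipt field names in the example are all assumptions for illustration.

```python
def looks_clipped(text: str) -> bool:
    """Heuristic: empty, ends in an ellipsis, or lacks closing punctuation."""
    stripped = text.rstrip()
    if not stripped or stripped.endswith(("...", "\u2026")):
        return True
    return stripped[-1] not in ".!?)]\"'"


def truncation_problems(
    sections: dict[str, str],
    citations: list[str],
    receipt: dict[str, str],
    required_receipt_fields: tuple[str, ...],
) -> list[str]:
    problems = []
    for name, body in sections.items():
        if looks_clipped(body):
            problems.append(f"section {name!r} looks incomplete")
    for citation in citations:
        # A citation such as "[doc-3, p. 1" has lost its closing bracket.
        if citation.count("[") != citation.count("]"):
            problems.append(f"citation looks clipped: {citation!r}")
    for field_name in required_receipt_fields:
        if not receipt.get(field_name):
            problems.append(f"receipt field missing: {field_name}")
    return problems


# Hypothetical report cut off mid-sentence, with a clipped citation and a partial receipt.
print(truncation_problems(
    sections={"answer": "The limit is 40 requests per", "uncertainty": "Low."},
    citations=["[doc-3, p. 1"],
    receipt={"run_id": "r-1"},
    required_receipt_fields=("run_id", "signature"),
))
```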

Scaffold-leak detection

designed, unverified

Rejects unresolved template tokens and unfinished draft markers before review packaging. Current state: specified for Phase I implementation.

public-safe gate summary
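Scaffold-leak detection lends itself to a pattern scan. The sketch below assumes three kinds of residue: double-brace template tokens, TODO-style markers, and bracketed draft tags. The actual token syntax used by the composer is not stated on this page, so the patterns are placeholders.

```python
import re

# Assumed residue patterns; replace with the composer's real template syntax.
SCAFFOLD_PATTERNS = [
    re.compile(r"\{\{.*?\}\}"),                               # unresolved {{token}}
    re.compile(r"\b(?:TODO|TBD|FIXME)\b"),                    # unfinished draft markers
    re.compile(r"\[(?:DRAFT|PLACEHOLDER)\]", re.IGNORECASE),  # bracketed draft tags
]


def scaffold_leaks(text: str) -> list[str]:
    """Return every scaffold fragment found, grouped by pattern order."""
    found: list[str] = []
    for pattern in SCAFFOLD_PATTERNS:
        found.extend(match.group(0) for match in pattern.finditer(text))
    return found


print(scaffold_leaks("Summary: {{answer}} TODO cite [draft]"))
# ['{{answer}}', 'TODO', '[draft]']
```

A report would be rejected before review packaging whenever the returned list is non-empty.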

Section-floor enforcement

designed, unverified

Rejects reports that omit required answer, citation, uncertainty, and receipt sections. Current state: specified for Phase I implementation.

public-safe gate summary
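Section-floor enforcement reduces to a presence check over the four required sections listed above. The sketch assumes a report is a mapping from section name to text and that the keys match the names used on this page; both are illustrative choices.

```python
# Required sections as named on this page; the dictionary keys are an assumption.
REQUIRED_SECTIONS = ("answer", "citation", "uncertainty", "receipt")


def missing_sections(report: dict[str, str]) -> list[str]:
    """Return required sections that are absent or contain only whitespace."""
    return [name for name in REQUIRED_SECTIONS if not report.get(name, "").strip()]


# Hypothetical report with a blank uncertainty section and no receipt.
report = {"answer": "Logs are kept for 30 days.", "citation": "[doc-1]", "uncertainty": "  "}
print(missing_sections(report))  # ['uncertainty', 'receipt']
```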