[demo:03] evaluation · Evaluation · sha:29b2de8e4054 · build:2026-06-20T23:34:47.545Z

Evaluation

Gold-set metrics, baseline targets, and known degraded results. Nothing hidden. Evaluation is the surface that proves — or disproves — the architecture claims.

This page separates measured demo evidence from Phase I design targets. The local gold-evidence package exists; public release of the source set, full shape-gate execution, and signed receipt packaging remain Phase I work.

Measured demo summary

Model identifiers and private run paths are omitted from this public surface. Hallucination rate fell from 0.673 in the prior baseline to 0.183 in the strongest current run (Local model A), while precision dropped from 0.882 to 0.833 over the same comparison.

run 2026-05-25

Run	Precision	Hallucination	Abstention	Composite	Verdict
Local model A current run	0.833	0.183	0.917	0.845	PASS DEGRADED
Local model B current run	0.810	0.190	0.783	0.804	PASS DEGRADED
Prior baseline	0.882	0.673	0.917	0.723	PASS DEGRADED
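The comparison against the prior baseline can be recomputed from the table. The sketch below copies the three rows by hand; the dictionary keys and the `delta` helper are illustrative names, not part of the evaluation harness. The composite formula is not published here, so the composite values are treated as given.

```python
# Rows copied from the measured demo summary table (run 2026-05-25).
RUNS = {
    "model_a": {"precision": 0.833, "hallucination": 0.183,
                "abstention": 0.917, "composite": 0.845},
    "model_b": {"precision": 0.810, "hallucination": 0.190,
                "abstention": 0.783, "composite": 0.804},
    "baseline": {"precision": 0.882, "hallucination": 0.673,
                 "abstention": 0.917, "composite": 0.723},
}


def delta(metric, run, reference="baseline"):
    """Return (absolute change, relative change) of `run` against `reference`."""
    new = RUNS[run][metric]
    old = RUNS[reference][metric]
    return round(new - old, 3), round((new - old) / old, 3)


if __name__ == "__main__":
    for metric in ("precision", "hallucination", "composite"):
        print(metric, delta(metric, "model_a"))
```

For Local model A this gives a hallucination change of -0.490 absolute (about -72.8% relative) and a precision change of -0.049 absolute.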

Retrieval precision@5

designed, unverified

Current

—

Target

≥ 0.80

gold-set eval pending corpus assembly

Retrieval recall@10

designed, unverified

Current

—

Target

≥ 0.85

gold-set eval pending corpus assembly
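For reference, the two retrieval targets use the standard definitions of precision@k and recall@k. The sketch below is generic: the function names and the document ids are placeholders, and it assumes the gold set supplies a set of relevant ids per query.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant.

    Divides by k even when fewer than k items were retrieved,
    which penalizes short result lists (one common convention).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant_ids:
        raise ValueError("recall is undefined without relevant items")
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


if __name__ == "__main__":
    ranked = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
    relevant = {"d3", "d1", "d2", "d11"}
    print(precision_at_k(ranked, relevant, 5))  # 2 of top 5 relevant -> 0.4
    print(recall_at_k(ranked, relevant, 10))    # 3 of 4 relevant found -> 0.75
```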

Answer groundedness

designed, unverified

Current

—

Target

≥ 0.90

groundedness metric: claim-to-citation ratio
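One plausible reading of "claim-to-citation ratio" is the share of claims that carry at least one citation. The sketch below assumes that reading; the `Claim` type and its fields are hypothetical, and the page does not specify how claims are extracted from an answer.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A single extracted claim and the ids of the citations attached to it."""
    text: str
    citation_ids: list = field(default_factory=list)


def groundedness(claims):
    """Share of claims backed by at least one citation (assumed definition)."""
    if not claims:
        raise ValueError("groundedness is undefined for an answer with no claims")
    cited = sum(1 for claim in claims if claim.citation_ids)
    return cited / len(claims)
```

Under this definition, an answer with four claims of which three are cited scores 0.75, below the 0.90 target.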

Citation accuracy

designed, unverified

Current

—

Target

≥ 0.95

citation resolver verification pass
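A resolver pass could be scored as the fraction of citations that resolve to a known source and whose quoted span actually appears in it. This is a sketch under that assumption; the `(source_id, quoted_text)` pair format and the exact-substring match are illustrative simplifications, not the documented resolver behavior.

```python
def citation_accuracy(citations, sources):
    """Fraction of citations whose quote is found in the cited source.

    citations: list of (source_id, quoted_text) pairs (hypothetical format)
    sources:   dict mapping source_id to the full source text
    """
    if not citations:
        raise ValueError("citation accuracy is undefined with no citations")
    resolved = sum(
        1
        for source_id, quote in citations
        if quote and quote in sources.get(source_id, "")
    )
    return resolved / len(citations)
```

A citation fails either because its source id is unknown or because the quote is not in the cited text, so a correct quote attributed to the wrong source counts as an error.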

Abstention correctness

designed, unverified

Current

—

Target

≥ 0.90 (abstains when it should, answers when it can)

false-abstention / false-answer tradeoff analysis
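The false-abstention / false-answer tradeoff can be tabulated from labeled cases. A minimal sketch, assuming each gold-set item is labeled as answerable or not and each response is recorded as abstained or not; the function and key names are illustrative.

```python
def abstention_report(cases):
    """Summarize abstention behavior.

    cases: iterable of (answerable, abstained) boolean pairs.
    A false abstention is an answerable question that was refused;
    a false answer is an unanswerable question that was answered.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("no cases to score")
    answerable = sum(1 for a, _ in cases if a)
    unanswerable = len(cases) - answerable
    false_abstentions = sum(1 for a, abstained in cases if a and abstained)
    false_answers = sum(1 for a, abstained in cases if not a and not abstained)
    correct = len(cases) - false_abstentions - false_answers
    return {
        "correctness": correct / len(cases),
        "false_abstention_rate": (
            false_abstentions / answerable if answerable else None
        ),
        "false_answer_rate": (
            false_answers / unanswerable if unanswerable else None
        ),
    }
```

Reporting both error rates alongside the single correctness number matters because a system can reach a high correctness score on a skewed gold set while still refusing too often or answering too readily.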

Report-shape repair gate

degraded

Current

designed, not yet run

Target

repaired output passes verifier on re-check

Known degraded result: Initial composer output may produce reports with missing citation fields or inconsistent confidence signals. The repair gate is designed to catch these and trigger recomposition; it has not yet been run end to end. This is a design feature, not a bug: the system is designed to self-correct.

design doc: report-shape repair gate
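The verify, recompose, re-check cycle described for this gate could look roughly like the loop below. Everything here is a sketch: `verify`, `repair_gate`, the report dictionary layout, the allowed confidence labels, and the attempt limit are assumptions, and the confidence check is a simple stand-in for detecting inconsistent confidence signals.

```python
ALLOWED_CONFIDENCE = {"low", "medium", "high"}  # hypothetical label set


def verify(report):
    """Return a list of shape problems; an empty list means the report passes."""
    problems = []
    for index, citation in enumerate(report.get("citations", [])):
        if not citation.get("source_id"):
            problems.append(f"citation {index} missing source_id")
    if report.get("confidence") not in ALLOWED_CONFIDENCE:
        problems.append("confidence missing or unrecognized")
    return problems


def repair_gate(report, recompose, max_attempts=2):
    """Recompose until the verifier passes or attempts run out.

    recompose: callable(report, problems) -> new report (supplied by caller).
    Returns (final report, remaining problems, attempts used).
    """
    problems = verify(report)
    attempts = 0
    while problems and attempts < max_attempts:
        report = recompose(report, problems)
        attempts += 1
        problems = verify(report)  # re-check the repaired output
    return report, problems, attempts
```

Bounding the number of attempts keeps a composer that never converges from looping forever; a non-empty problem list on return signals that the report should be rejected rather than packaged.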

Output-shape gates

Conflict-leak detection

designed, unverified

Flags answers that blend incompatible source claims without declaring the conflict. Current state: specified for Phase I implementation.

public-safe gate summary
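As a rough illustration of the idea, a conflict leak can be modeled as a field for which the cited sources disagree while the answer declares no conflict. The evidence triple format and the `declared_conflicts` set are hypothetical; real claim comparison would need more than exact value matching.

```python
from collections import defaultdict


def conflict_leaks(evidence, declared_conflicts):
    """Return fields where sources disagree but no conflict was declared.

    evidence: list of (field, value, source_id) triples (hypothetical format)
    declared_conflicts: set of fields the answer explicitly flags as disputed
    """
    values_by_field = defaultdict(set)
    for field_name, value, _source_id in evidence:
        values_by_field[field_name].add(value)
    return sorted(
        field_name
        for field_name, values in values_by_field.items()
        if len(values) > 1 and field_name not in declared_conflicts
    )
```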

Duplicate-section detection

designed, unverified

Rejects reports that repeat the same section heading or evidence block. Current state: specified for Phase I implementation.

public-safe gate summary
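A duplicate check of this kind can be sketched by counting normalized headings and evidence bodies. The `(heading, body)` pair representation and the whitespace and case normalization are assumptions made for the example.

```python
from collections import Counter


def _normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def duplicate_sections(sections):
    """Find repeated headings and repeated bodies.

    sections: list of (heading, body) pairs in report order.
    """
    headings = Counter(_normalize(heading) for heading, _ in sections)
    bodies = Counter(_normalize(body) for _, body in sections if body.strip())
    return {
        "headings": sorted(h for h, count in headings.items() if count > 1),
        "bodies": sorted(b for b, count in bodies.items() if count > 1),
    }
```

A report would be rejected when either list is non-empty.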

Truncation detection

designed, unverified

Rejects reports with incomplete sections, clipped citations, or missing receipt fields. Current state: specified for Phase I implementation.

public-safe gate summary
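Truncation is hard to detect exactly, but a few cheap heuristics cover the cases listed. The sketch below assumes sections arrive as a heading-to-text mapping, citations use square brackets, and the receipt has two required fields; `REQUIRED_RECEIPT_FIELDS` and all field names are hypothetical.

```python
REQUIRED_RECEIPT_FIELDS = ("run_id", "timestamp")  # hypothetical field names


def truncation_problems(sections, receipt):
    """Return heuristic signs of truncation in a report.

    sections: dict mapping heading to body text
    receipt:  dict of receipt fields
    """
    problems = []
    for heading, body in sections.items():
        text = body.strip()
        if not text:
            problems.append(f"{heading}: empty section")
            continue
        if text[-1] not in ".!?)]":
            problems.append(f"{heading}: ends mid-sentence")
        if text.count("[") != text.count("]"):
            problems.append(f"{heading}: clipped citation bracket")
    for name in REQUIRED_RECEIPT_FIELDS:
        if not receipt.get(name):
            problems.append(f"receipt: missing {name}")
    return problems
```

These checks will produce false positives on bodies that legitimately end without punctuation, such as a list, so they are a starting point rather than a complete detector.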

Scaffold-leak detection

designed, unverified

Rejects unresolved template tokens and unfinished draft markers before review packaging. Current state: specified for Phase I implementation.

public-safe gate summary
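Scaffold leaks lend themselves to pattern matching. The patterns below are examples only; which template syntax and draft markers the composer actually uses is not stated on this page, so `SCAFFOLD_PATTERNS` is an assumed list.

```python
import re

# Assumed examples of leftover scaffolding; adjust to the real template syntax.
SCAFFOLD_PATTERNS = [
    re.compile(r"\{\{[^{}]*\}\}"),           # {{placeholder}} template tokens
    re.compile(r"\$\{[^{}]*\}"),             # ${placeholder} template tokens
    re.compile(r"\b(?:TODO|TBD|FIXME)\b"),   # unfinished draft markers
    re.compile(r"\[DRAFT\]", re.IGNORECASE), # explicit draft tag
]


def scaffold_leaks(text):
    """Return every scaffold fragment found, grouped by pattern order."""
    found = []
    for pattern in SCAFFOLD_PATTERNS:
        found.extend(match.group(0) for match in pattern.finditer(text))
    return found
```

A report passes this gate only when the returned list is empty.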

Section-floor enforcement

designed, unverified

Rejects reports that omit any required section: answer, citation, uncertainty, or receipt. Current state: specified for Phase I implementation.

public-safe gate summary
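The section floor reduces to a set-membership check. This sketch assumes a report is a mapping from section name to content and that an empty section counts as missing; the lowercase section keys are an assumption about naming.

```python
REQUIRED_SECTIONS = ("answer", "citation", "uncertainty", "receipt")


def missing_sections(report):
    """Return required sections that are absent or empty, in canonical order.

    report: dict mapping section name to its content.
    """
    present = {
        name.strip().lower()
        for name, body in report.items()
        if str(body).strip()
    }
    return [section for section in REQUIRED_SECTIONS if section not in present]
```

For example, a report with only a non-empty answer and citation section returns `["uncertainty", "receipt"]` and would be rejected.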