Why moment-aware evaluation
Status: working draft. The full 2–3 page whitepaper lands before v1.0. The argument below is the abstract.
Most content linters treat a string as a string. They run readability
math (Flesch, syllable counts), check for forbidden words, maybe flag
title-case violations. They are linters in the same way wc -l is a
linter for prose: technically correct, structurally indifferent.
This is fine for a help center, where most strings live in roughly the same context. It breaks down in product surfaces, where the same phrase can be exactly right or disastrously wrong depending on the moment of contact.
Two examples
"Got it." As a confirmation in a low-stakes settings flow: warm, quick, calibrated. As the headline of an error message after a payment fails: callous bordering on cruel.
"Save" as a button on a routine form: invisible, correct. As the button on the dialog confirming you're about to overwrite a collaborator's edits: under-built; the situation calls for "Replace" or "Overwrite teammate's changes" so the user can register the gravity.
A stringwise linter has no way to see this difference. It only sees the literal text.
What changes when you add situational awareness
Three things:
- Rule selection. Empathy-in-error-states rules fire only when the string actually appears in an error-recovery moment. The same string shown as a routine confirmation is not graded against them.
- Suggestion shape. Plain-language guidance during first contact suggests a plain-language alternative; the same guidance in a reference surface may suggest a glossary link instead.
- Severity. A length-cap violation on a destructive confirmation is a higher-severity finding than the same violation on a browsing surface.
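As an illustration of the first and third points, here is a minimal Python sketch of situation-gated rule selection and situation-dependent severity. It is not ContentRX's code: the situation labels, the `Rule` and `Finding` types, the 40-character cap, and the toy tone check are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical situation labels; the real taxonomy is not given here.
ERROR_RECOVERY = "error_recovery"
ROUTINE_CONFIRMATION = "routine_confirmation"
DESTRUCTIVE_CONFIRMATION = "destructive_confirmation"
BROWSING = "browsing"


@dataclass(frozen=True)
class Finding:
    issue: str
    severity: str  # "low", "medium", or "high"


@dataclass(frozen=True)
class Rule:
    name: str
    applies_to: frozenset  # situations where the rule fires; empty means all
    check: Callable[[str, str], Optional[Finding]]


def dismissive_headline(text: str, situation: str) -> Optional[Finding]:
    # Toy empathy check (assumption): flag a breezy acknowledgement.
    if text.strip().lower().rstrip(".!") in {"got it", "ok", "okay"}:
        return Finding("Dismissive tone in an error state", "high")
    return None


def length_cap(text: str, situation: str) -> Optional[Finding]:
    # Same rule everywhere, but severity depends on the situation.
    if len(text) <= 40:  # the 40-character cap is an illustrative value
        return None
    severity = "high" if situation == DESTRUCTIVE_CONFIRMATION else "low"
    return Finding("Exceeds 40-character cap", severity)


RULES = [
    Rule("empathy-in-error-states", frozenset({ERROR_RECOVERY}), dismissive_headline),
    Rule("length-cap", frozenset(), length_cap),
]


def evaluate(text: str, situation: str) -> list:
    findings = []
    for rule in RULES:
        # Rule selection: skip rules that do not apply to this situation.
        if rule.applies_to and situation not in rule.applies_to:
            continue
        finding = rule.check(text, situation)
        if finding is not None:
            findings.append(finding)
    return findings
```

Under these assumptions, `evaluate("Got it.", ROUTINE_CONFIRMATION)` returns no findings, while the same string evaluated with `ERROR_RECOVERY` is flagged at high severity. Suggestion shape is left out of the sketch; it would follow the same pattern, with the suggestion builder taking the situation as an input.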
How ContentRX implements it
The engine classifies each string into a (content_type, situation)
pair before any rule runs. Mechanical rules check what they can
statically; nuanced rules go to an LLM with the relevant subset of the
standards library injected as the system prompt. The merge layer
reconciles deterministic and LLM findings, deduplicates, and
prioritizes by severity. Output is a violations list with only the
public fields the surface needs: issue, suggestion, severity, and
confidence.
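The merge step described above could look roughly like the following Python sketch. It is an assumption-laden illustration, not the engine's implementation: the `Violation` type, the `merge` function, deduplication by normalized issue text, and the tie-break on confidence are all hypothetical choices. Only the four output field names come from the description above.

```python
from dataclasses import dataclass, asdict

# Lower rank sorts first. The three-level scale is an assumption.
SEVERITY_RANK = {"high": 0, "medium": 1, "low": 2}


@dataclass(frozen=True)
class Violation:
    issue: str
    suggestion: str
    severity: str
    confidence: float


def _rank(v: Violation) -> tuple:
    # Most severe first; among equals, higher confidence first.
    return (SEVERITY_RANK[v.severity], -v.confidence)


def merge(deterministic: list, llm: list) -> list:
    """Reconcile both finding sources into one prioritized list of dicts."""
    best = {}
    # Deterministic findings come first, so they win exact ties.
    for v in deterministic + llm:
        key = v.issue.strip().lower()  # hypothetical dedup key
        current = best.get(key)
        if current is None or _rank(v) < _rank(current):
            best[key] = v
    ordered = sorted(best.values(), key=_rank)
    # Emit only the public fields: issue, suggestion, severity, confidence.
    return [asdict(v) for v in ordered]
```

A usage example under the same assumptions:

```python
det = [Violation("Exceeds length cap", "Shorten the label", "low", 1.0)]
llm = [
    Violation("exceeds length cap", "Trim it", "low", 0.7),
    Violation("Dismissive tone", "Acknowledge the failure first", "high", 0.8),
]
for row in merge(det, llm):
    print(row["severity"], row["issue"])
```

Keying deduplication on the issue text is the simplest possible choice; a real merge layer would more plausibly key on a rule identifier, which this sketch does not model.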
The full eval methodology and accuracy reporting land at /accuracy; the weekly calibration log at /calibration shows how the numbers move.