ENG-1234 · "Fix slow dashboard" × Spec Completeness Judge
5/8/2026, 8:01:15 AM · Edited by gallery-seed
Scores a product / engineering spec on whether it gives a team enough to build without follow-up questions.
- Acceptance criteria: 21%
- Problem clarity: 19%
- Success metric: 18%
Score: 15.0 (first iteration — no prior baseline).
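The 15.0 overall score is consistent with a weighted average of the per-dimension scores below. As a sketch of that arithmetic: the five dimensions shown carry weights summing to 89%, so the sixth row here ("Owner / DoD" at 11% with score 1.0/10) is an assumption that closes the gap to 100% — the report does not state it.

```python
# Weights (as fractions) and 0-10 scores for the five dimensions are
# taken from the report; the "Owner / DoD" row is an assumed sixth
# dimension, not published in the breakdown.
dimensions = {
    "Acceptance criteria": (0.21, 1.0),
    "Problem clarity":     (0.19, 2.0),
    "Success metric":      (0.18, 1.0),
    "Scope boundaries":    (0.16, 2.0),
    "Edge cases":          (0.15, 2.0),
    "Owner / DoD":         (0.11, 1.0),  # assumed, not in the report
}

# Per-dimension scores are on a 0-10 scale; the overall score is 0-100.
overall = 10 * sum(weight * score for weight, score in dimensions.values())
print(round(overall, 1))  # → 15.0
```

Under that assumption the weighted sum reproduces the reported 15.0 exactly.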
Why the judge scored it this way
This spec scores poorly across all dimensions. The problem statement relies entirely on anecdotal, unmeasured language ("feels slow," "forever to load," "looks bad") with no baseline numbers or reproducible conditions. Two separate concerns — query/caching latency and pie-chart flicker — are mixed into one ticket alongside a speculative third issue ("possibly related to new analytics rollout"), making scope impossible to bound. There are zero acceptance criteria and no measurable success metric; the closest thing to a target is "maybe add some caching." Ownership is assigned to "someone from platform," which is not actionable, and no definition of done exists anywhere in the spec.
1. Acceptance criteria · Poor · weight 21% · score 1.0/10
ACs are mechanical and testable.
Diagnosis: There are zero mechanically testable ACs; the entire spec contains only suggestions ('should probably look at queries,' 'maybe add some caching') with no pass/fail criteria.
Do this next: Add numbered ACs in Given/When/Then or checklist form, e.g. 'Given a user with ≥500 rows of data, when the homepage loads, then Time-to-Interactive is ≤2 s in a Lighthouse CI run on the main branch.'
2. Success metric · Poor · weight 18% · score 1.0/10
A measurable signal of success is named.
Diagnosis: No measurable success signal is defined anywhere; 'feels slow' and 'looks bad' are the only proxies offered, and neither is quantified.
Do this next: Define at least one instrumented metric with a target and measurement method, e.g. 'Homepage P95 server response time drops from the current X ms to ≤500 ms, as tracked in Datadog dashboard [link].'
3. Problem clarity · Poor · weight 19% · score 2.0/10
Problem statement is concrete and bounded.
Diagnosis: The problem is described entirely in subjective, unbounded terms — 'feels slow,' 'forever to load,' 'looks bad' — with no baseline measurement, no affected user count, and no reproducible scenario.
Do this next: Replace vague language with concrete data: add a sentence like 'P95 homepage TTI measured at X s on YYYY-MM-DD via [tool]; target is ≤Y s' and specify which user segments, browsers, and data sizes reproduce the issue.
4. Scope boundaries · Poor · weight 16% · score 2.0/10
Explicit in-scope and out-of-scope.
Diagnosis: Two distinct concerns (load-time performance and pie-chart flicker) are bundled together with no explicit in-scope/out-of-scope list, and a third speculative concern ('possibly related to analytics rollout') is left dangling.
Do this next: Split into at least two separate tickets (query/caching performance vs. chart flicker) and add explicit non-goals, e.g. 'Out of scope: analytics pipeline changes, mobile dashboard, other pages.'
5. Edge cases · Poor · weight 15% · score 2.0/10
Realistic edge / failure cases are enumerated.
Diagnosis: The possible analytics-rollout regression is noted but not investigated or structured as a hypothesis; no other failure modes (e.g., large datasets, concurrent users, cache invalidation, empty-state rendering) are considered.
Do this next: Add an edge-cases section listing scenarios to test: empty dataset, >10 k rows, cache miss on cold start, simultaneous refresh during a data-pipeline run, and the analytics-rollout regression — each with an expected outcome.
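Several of the "Do this next" items above lean on a percentile target like "P95 ≤ 500 ms." As a minimal sketch of what a mechanically testable check for such a target could look like — the sample latencies and the 500 ms threshold are placeholders, not data from any real dashboard:

```python
import math

def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of latency samples (ms)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

# Hypothetical homepage latency samples, in ms (18 values 100-440,
# plus 490 and one 700 ms outlier that P95 tolerates).
samples = list(range(100, 460, 20)) + [490, 700]

TARGET_MS = 500  # placeholder target from the example AC
value = p95(samples)
print(f"P95={value} ms, target <={TARGET_MS} ms:",
      "PASS" if value <= TARGET_MS else "FAIL")
# → P95=490 ms, target <=500 ms: PASS
```

A check like this, fed from real instrumentation, is what turns "feels slow" into a pass/fail acceptance criterion.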
Metrics
- Problem clarity: 20%
- Scope boundaries: 20%
- Edge cases: 20%
- Acceptance criteria: 10%
- Success metric: 10%
- Owner / DoD: 10%
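A quick sanity check on the list above: the six published weights sum to 90%, not 100%, so either a dimension is missing from the public view or the judge normalizes weights internally. A throwaway check (dimension names copied verbatim from the list) makes this visible before scores are aggregated:

```python
# Dimension weights exactly as listed in the Metrics section, in percent.
weights = {
    "problem clarity": 20,
    "scope boundaries": 20,
    "edge cases": 20,
    "acceptance criteria": 10,
    "success metric": 10,
    "owner dod": 10,
}

total = sum(weights.values())
# The listed weights sum to 90%, not 100% -- a check like this catches
# the gap before the weighted average is computed.
print(f"listed weights sum to {total}%")  # → listed weights sum to 90%
```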
spec-completeness · Public link · read-only