For solo builders and teams shipping AI faster than humans can review

An AI reviewer
for everything you ship.

Judge scores your code, AI output, content, and designs against a rubric, an expert critic, or your real buyer's perspective. On every PR and prompt change, see exactly what got better, what got worse, and what to fix — before it reaches users.

  • What it scores: code · PRs · AI output · content · screenshots · designs.
  • How it judges: rubric for objective metrics, expert critic for taste, buyer persona for fit.
  • Where it lives: GitHub App on every PR, CLI in CI, MCP for agents. Slack on regression.
Rubrics · Reviewers · Personas · GitHub App · CI · MCP
judge.dashboard / score #314
Persona · Sarah K. — Series A founder

Scoring /pricing on acme.com

Value-prop clarity      8.1  +0.6
Pricing transparency    6.4  +0.9
Trust signals           7.8  -0.2
Next-step clarity       6.9  +0.4
Last 8 iterations · trending
Triggered by MCP · claude-code
Try it

See it in 30 seconds. No signup.

What you actually type, what you actually get back. One Node CLI (npx judge), three real workflows. Every score is a typed row in your dashboard with cited evidence — not a chat blob.

JUDGE_API_KEY=jdg_… · mint on /connect · BYOK Anthropic key works too
Free account, no card · BYOK pays Anthropic at cost — one Sonnet 4.6 call per score (~$0.01–$0.05) · First score in <5 min — `npm i -g judge` → `judge init` → `judge run`
01

"Did my last edit make this file better — or worse?"

# Score what HEAD did to a file vs HEAD~1
$ judge score-edit src/auth.ts --judge code-quality

→ scoring before (HEAD~1)…  72.4
→ scoring after  (HEAD)  …  68.9

✗ verdict: REGRESSED   weighted Δ −3.5
  ↓ input_validation  0.9 → 0.6  (−0.3)
  ↓ error_handling    0.8 → 0.5  (−0.3)
  = type_safety       1.0 → 1.0   (·)

history: 12 runs → 71.6 → 71.8 → 72.4 → 68.9
scorecard: judge.app/s/8a7…f2c (cited evidence)
exit 2  (--fail-on-regression)

The wedge: edit scoring. Diffs HEAD~1 → HEAD, scores both, returns a verdict with per-metric deltas. Plug it into a pre-commit hook with --fail-on-regression and you stop shipping silent quality drops.
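
A minimal sketch of that hook, reusing the file and judge from the example above; hard-coding one file keeps it short, and the hook lives at the standard git path.

#!/bin/sh
# .git/hooks/pre-commit (illustrative; make it executable)
# A non-zero exit from --fail-on-regression blocks the commit.
judge score-edit src/auth.ts --judge code-quality --fail-on-regression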

02

"Does my landing page actually land for my buyer?"

# Score a live URL through a buyer persona
$ judge score \
$   --judge sarah-k-founder \
$   --url https://acme.com/pricing

→ fetch + visible-text extract  (1.4s)
→ submit_score (forced tool-use)

● overall 72.4 / 100   △ +1.8 IMPROVED
  ✓ value_clarity      0.85   (+0.10)
  ✓ pricing_clarity    0.70   (+0.20)
  ✗ social_proof       0.40   (·)

rationale (cited):
  "Starter $29" — clear price; no annual toggle
  no logos / quotes above the fold

A persona judge reads the page through a specific buyer's lens (optionally grounded in a crawl of your product). Ship a copy change, re-run, see if the persona's pricing-clarity metric moved before users do.

03

"Block regressions in CI without writing glue."

# Once: scaffold .judge/config.json + GH Action
$ judge init && judge install gh-actions

# On every PR (CI):
$ judge run --fail-on-regression

→ syncing 3 pipelines from .judge/config.json
✓ pricing-page     78.2  IMPROVED  (+2.1)
✓ pr-diff:auth     74.0  STABLE    (·)
✗ readme-clarity   61.5  REGRESSED (−4.3)

1 regression → exit 1, build fails
PR comment posted with 3 scorecards

Check in .judge/config.json with named pipelines (PR diff, landing page, README — anything). One judge run in CI scores them all, fails the build on any REGRESSED, and posts the scorecard URL to the PR.
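
A sketch of what that file might hold, reusing the pipeline names from the run above; the keys are illustrative, not the canonical schema.

# .judge/config.json (illustrative sketch; field names are assumptions)
{
  "pipelines": [
    { "name": "pricing-page",   "source": "url",     "target": "https://acme.com/pricing", "judge": "sarah-k-founder" },
    { "name": "pr-diff:auth",   "source": "pr-diff", "judge": "code-quality" },
    { "name": "readme-clarity", "source": "file",    "target": "README.md",                "judge": "doc-clarity" }
  ]
}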

· Same engine via MCP (Claude Code, Cursor) and REST
Drop it into your loop

Three install paths. Same engine.

From zero-effort to full programmatic control. Most teams use all three: GitHub App on PRs, CLI in CI, MCP for the agents that ship the AI features.

Zero install · GitHub App

Score every PR automatically.

Connect once. Every PR that changes a watched artifact gets scored against your chosen rubric or reviewer. Slack pings on REGRESSED, link to the diff in the dashboard.

  • One-click install on any repo or org
  • Typed score commented on the PR with cited evidence
  • Block the merge on regression — or just observe
Recent reviews · auto-refresh
PERSONA · Sarah K. — founder
72.4 +4.1
REVIEWER · B2B copy critic
64.2 -1.8
RUBRIC · Conversion checklist
81.0 +2.3
One binary · CLI · MCP · REST

Wire it into CI, agents, anywhere.

`judge score` in your pipeline. MCP server for Claude Code and other agents to score their own output. REST endpoints if you want to call it from anywhere else. Same scores, same history.

  • CI: fail the deploy on regression with one config file
  • MCP: agents check their work before returning to the user
  • REST + cron: scheduled runs against any artifact
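
For the scheduled case, a hedged sketch: one cron entry that re-runs the checked-in pipelines with the CLI (the working directory and log path are illustrative; the REST route covers the same job without a checkout).

# crontab entry (illustrative): re-score every pipeline in .judge/config.json at 06:00 daily
0 6 * * * cd /srv/acme && judge run >> /var/log/judge-cron.log 2>&1
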
One scoring engine

Score by rubric, by expert, or by your buyer.

Each judge produces an overall score from 0 to 100, per-metric values, cited rationale, and a deterministic delta vs. the previous run. Stack as many as you want — disagreements are signal, not noise.

Rubric

A checklist for measurable quality

Typed metrics — boolean gates and 0–10 scales, weighted. Deterministic, repeatable, audit-friendly. Use the seeded system rubrics or generate one from a one-line description.

Code quality rubric · 81.0
  • Type safety
  • Test coverage

code quality · factuality · tone · security · brevity

Reviewer

An expert critic with a worldview

A senior practitioner ("Engineering lead", "UX strategist", "Brand director", "Conversion auditor") reading the artifact through their lens. Catches taste-level issues a checklist won't.

Senior B2B copy critic · 64.2
  • Narrative coherence
  • Audience fit

engineering reviews · UX critiques · brand audits

Persona

A real-feeling end user

A specific buyer ("Sarah K., Series A founder") reading through their own lens. Optionally grounded in a crawl of your product so the persona knows your context. Powerful for buyer-facing artifacts; checklists serve code better.

Sarah K. — founder · 72.4
  • Pricing clarity
  • Service fit

landing pages · onboarding flows · generated content

How it works

Output → Judge → Score → Delta → Decision.

  1. Point it at an output

    A webpage, a PR diff, a code file, a generated report, a screenshot, an MCP resource — anything an LLM can read.

  2. Pick a judge

    Use a seeded system rubric, generate one from a one-line description, or stack a reviewer + persona for cross-checked signal.

  3. Run it — PR, CI, agent, or by hand

    GitHub App scores on every PR. `judge score` runs in CI. MCP tools fire from Claude Code. Click Run in the dashboard. Same engine, same scores.

  4. See what moved

    Every run is a typed Score row with per-metric deltas. IMPROVED / STABLE / REGRESSED is computed in code, not generated. Slack alert on regression.

Inputs

If an LLM can read it, Judge can score it.

Source providers fetch the artifact and hand it to the scoring engine. Code, content, generated AI output, designs, dashboards, reports — all read the same way. Default model is Claude Sonnet 4.6; the architecture isn't Claude-locked.

  • Website HTML · live URL
  • GitHub PR diff · App or PAT
  • GitHub repo file · any branch
  • Document · Markdown / text
  • Screenshot · any image
  • Image URL · remote asset
  • MCP resource · tool or doc
  • Raw text · paste & score
Why the score holds up

We publish our noise floor.

Every LLM judge has run-to-run variance. Most tools hide it. We measure it, render it, and refuse to call something IMPROVED unless it's actually above the band.

Same input → same verdict

Run it twice on identical input, get the same STABLE. Phantom regressions render inside a confidence band — never as a red trend chip.

Schema-validated output

Tool-use forces the LLM to call submit_score with a JSON-Schema-validated payload. No prose to parse, no malformed runs.
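
Roughly what a forced submit_score call carries; the keys below are illustrative, not the exact schema.

# Illustrative submit_score arguments (field names assumed; the real payload is JSON-Schema-validated)
{
  "overall": 72.4,
  "metrics": [
    { "id": "pricing_clarity", "value": 0.70, "evidence": "\"Starter $29\": clear price, no annual toggle" },
    { "id": "social_proof",    "value": 0.40, "evidence": "no logos / quotes above the fold" }
  ]
}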

Cited evidence

Every metric comment quotes the artifact — line, phrase, screenshot region — so you can verify the score didn't make it up.

Versioned judges

Editing a judge mints a new slug. Old scores stay honest; new ones compound on a fresh baseline. No silent recalibration.

Deterministic deltas

IMPROVED / STABLE / REGRESSED is computed in code from typed metric values, not generated by a second LLM call. No drift, no theater.

Open data model

Targets, Judges, Jobs, Runs, Scores, Metrics — everything is a typed Postgres row. Query directly, export, feed pipelines, BYOK or SaaS.

Where Judge sits

Eval tools (Braintrust, Promptfoo, Langfuse) target ML engineers with golden datasets. Code review tools (CodeRabbit, Greptile) target reviewers on diffs. UX tools target marketers on flows. Judge sits in the middle — typed quality scoring for any team that ships a thing, iterates on it, and needs to know whether the last change was a step forward.

Stop shipping changes you can't measure.

Paste an output. Get a score, a delta vs. last run, and a reason. 30 seconds, no card. Free account with the 10 system rubrics seeded — BYOK pays Anthropic at cost (~$0.02/score), no Judge markup. Connect GitHub when you're ready to score every PR.