Judge scores any artifact (URL, file, PR, screenshot) against a typed rubric, an expert critic, or a buyer persona — and computes IMPROVED / STABLE / REGRESSED against the previous run in code, not by a second LLM call. Use it three ways: CLI in CI to fail builds on regression, GitHub App to score every PR, or MCP so agents check their own output. BYOK Anthropic key — pay at cost, no Judge markup.
Quickstart
Score your first thing in five minutes — from the dashboard, the CLI, or an MCP-aware agent.
Create an account
Email + password. The 10 system rubrics are seeded into your account on first sign-in.
Create a judge
Add a website and click Suggest personas / reviewers / rubrics. Or describe a custom rubric in one line on the Judges tab.
Run it
Pick a target (URL, file, PR, document, screenshot…) and run. Scores stream in with deterministic deltas and per-metric history.
The data model
Six entities. Each is a typed Postgres row, queryable directly or via the API.
Target
The thing being scored. Has a kind (WEBSITE, COMPONENT, PR, OTHER) and a stable key (URL, file path, PR identifier). History is tracked per (target × judge).
Source
How the input is fetched. Pluggable providers: GitHub PR diff, GitHub repo file, website HTML, screenshot, image URL, MCP resource, raw text. New ones drop in by writing a provider.
Judge
A system prompt + metrics spec. Three kinds:
- Persona — a real-feeling visitor reacting through their own lens.
- Reviewer — a senior third-party expert performing an audit.
- Rubric — a fixed checklist with weighted boolean / numeric metrics.
Job
A pairing of (Source × Judge × Target) with an optional cron. One-off runs and scheduled runs share the same row shape.
Run + Score
One execution of a Job produces one Run. The Run produces one Score: an overall in 0–100 plus typed per-metric values, a rationale, and a deterministic trend bucket (FIRST · IMPROVED · STABLE · REGRESSED).
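As a rough orientation, a Score could be typed like this (field names are illustrative, inferred from the description above; the actual Postgres schema may differ):
type Trend = "FIRST" | "IMPROVED" | "STABLE" | "REGRESSED";

interface Score {
  overall: number;                             // 0–100 weighted aggregate
  metrics: Record<string, number | boolean>;   // typed per-metric values
  rationale: string;                           // the judge's written reasoning
  trend: Trend;                                // computed in code vs. the previous Score
}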
Iteration comment
Auto-generated diff summary between this Score and the previous one for the same (target × judge). Built from typed metric deltas in code — no extra LLM call.
How the overall is computed
Each metric is normalized to 0..1 inside its range (numbers scaled in min..max; booleans mapped to 0 or 1), multiplied by its weight, summed, and divided by the total weight. The result × 100 is the overall. Heavier weights = the metric dominates. Boolean metrics are typically used for hard gates.
overall = (Σ normalized(metric) × weight) / (Σ weight) × 100
trend   = bucket(overall − previousOverall)   // |Δ| < 1.0 = STABLE
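A minimal TypeScript sketch of the same arithmetic (the normalization rules and the 1.0 STABLE threshold come from the description above; the types and names are illustrative):
type MetricSpec =
  | { kind: "number"; weight: number; min: number; max: number }
  | { kind: "boolean"; weight: number };

// Normalize a metric value into 0..1 inside its declared range.
function normalize(spec: MetricSpec, value: number | boolean): number {
  if (spec.kind === "boolean") return value ? 1 : 0;
  const clamped = Math.min(Math.max(value as number, spec.min), spec.max);
  return (clamped - spec.min) / (spec.max - spec.min);
}

// Weighted average of normalized metrics, scaled to 0–100.
function overall(
  specs: Record<string, MetricSpec>,
  values: Record<string, number | boolean>
): number {
  let sum = 0;
  let totalWeight = 0;
  for (const [name, spec] of Object.entries(specs)) {
    sum += normalize(spec, values[name]) * spec.weight;
    totalWeight += spec.weight;
  }
  return (sum / totalWeight) * 100;
}

// Deterministic trend bucket — no second LLM call.
function bucket(delta: number): "IMPROVED" | "STABLE" | "REGRESSED" {
  if (Math.abs(delta) < 1.0) return "STABLE";
  return delta > 0 ? "IMPROVED" : "REGRESSED";
}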
Dashboard
The web UI for humans. Everything you can do via CLI / MCP / REST you can also do here.
Pages
- /dashboard — overview of recent activity across all targets and judges.
- /websites — crawl a site, generate site-grounded personas / reviewers / rubrics, schedule cron-based scoring per URL.
- /judges — every judge you have access to: system + your own. Filter by kind, owner, or search; edit and delete user-owned ones.
- /scores — score history per (target × judge) with charts, deltas, and re-run buttons.
- /jobs — scheduled jobs, enable / disable, cron expressions, last + next run.
- /connect — mint API keys, install the GitHub App, configure BYOK.
Command line
A small Node CLI that wraps the MCP endpoint with ergonomic subcommands. Designed for terminals, scripts, and CI. Two operating modes: one-off scoring (good for quick checks) and project workflow with a checked-in config (good for teams + CI).
Install
The CLI ships with the repo as bin/judge.mjs. From a clone of the Judge repo (npm publish coming):
pnpm install
pnpm link --global   # pnpm flow — needs `pnpm setup` first
# or, equivalent and Windows-friendly:
npm link             # uses the npm global bin already on your PATH
judge help           # verify "judge" is on PATH
Configure
Two env vars. Mint a key on /connect; copy it once (it's only shown then).
export JUDGE_URL="https://judge.example.com"   # default: http://localhost:3000
export JUDGE_API_KEY="jdg_xxxxxxxxxxxx"        # mint on /connect
Every command requires JUDGE_API_KEY except judge help and judge install … when used to write a local config (some installs read the key to embed it). Missing key → the CLI exits with a one-line error and a pointer to /connect.
One-command installs
Each subcommand is idempotent and writes the right file in the right place — no copy-paste, no jq, no curl. They're described in detail on the Integrations page.
judge install claude-code # ~/.claude/settings.json (HTTP MCP)
judge install claude-desktop # OS-specific Claude Desktop config (stdio bridge)
judge install codex # ~/.codex/config.toml
judge install agent-system-prompt # appends Judge usage block to CLAUDE.md /
# AGENTS.md / .cursorrules — agents stop
# fragmenting history with bad keys
judge install precommit # .husky/pre-commit (or .git/hooks/pre-commit)
judge install gh-actions # .github/workflows/judge.yml
judge install all # all of the above
One-off scoring
Use these when you want to score something now without committing a config. Each call creates the Target server-side (or reuses one with the same (kind, key)) and accumulates history under it.
# List judges available to your account
judge judges

# Show one judge's full rubric (metrics + system prompt)
judge get code-quality

# Score a website URL with a persona judge
judge score \
  --judge sarah-k-founder \
  --target-kind WEBSITE --target-key https://acme.com/pricing \
  --target-label "Pricing page" \
  --url https://acme.com/pricing

# Score a GitHub PR with the security rubric
judge score \
  --judge security \
  --target-kind REPO --target-key acme/web \
  --target-label "Web app" \
  --pr acme/web#314

# Score a local file as raw text (read by the CLI, sent inline)
judge score \
  --judge code-quality \
  --target-kind COMPONENT --target-key src/auth.ts \
  --target-label "Auth module" \
  --text "$(cat src/auth.ts)"

# Generate a custom rubric from a one-line prompt
judge create-judge "B2B onboarding clarity judge"

# View history per (target × judge)
judge history --judge code-quality --target src/auth.ts --limit 25

# Roll-up across all judges for a target — or platform-wide if --target omitted
judge progress --target src/auth.ts
Source flag shorthands
judge score needs exactly one source. Pick the shorthand that matches your input:
--pr owner/name#NUMBER # GITHUB_PR (diff)
--file owner/name@ref:path # GITHUB_FILE (one file at ref)
--url https://… # WEBSITE_HTML (visible-text extract)
--shot https://… # WEBSITE_SCREENSHOT (Playwright PNG, multimodal)
--image https://… # IMAGE_URL (png/jpeg/webp/gif)
--text "literal text" # TEXT (inline)
--source-kind X --source-config '{…}' # escape hatch for kinds without a flag
Add --connection <id> when the source needs auth (private GitHub repo, MCP server). Connection IDs are visible in the dashboard URL on each connection's row.
Edit scoring (the wedge)
Most evaluators score one artifact at a time. Judge also scores the change between two versions and tells you whether it improved or regressed quality — the answer to "is today's iteration better than yesterday's?". It returns editVerdict ∈ {IMPROVED, STABLE, REGRESSED}, weightedEditDelta, and per-metric deltas — without you having to score before/after manually and diff the results yourself.
# Score what the last commit did to a file (HEAD~1 → HEAD)
judge score-edit src/foo.ts --judge code-quality

# Score uncommitted changes (HEAD → working tree)
judge score-edit src/foo.ts --judge code-quality --working

# Compare two arbitrary refs
judge score-edit src/foo.ts --judge code-quality --before main --after HEAD

# Score everything currently staged for commit (the pre-commit hook does this)
judge score-edit-staged --judge code-quality
judge score-edit-staged --judge spec-completeness --include '\.md$'
judge score-edit-staged --judge code-quality --fail-on-regression
Common edit-scoring flags
- --agent <id> — attribution. Defaults to human:<git config user.email>. Set to claude-code / cursor / devin / copilot-workspace when an agent is making the edit. Powers the per-agent reliability rollups on /reliability.
- --fail-on-regression — exit code 2 if the verdict is REGRESSED. Use in pre-commit / CI gates.
- --target-kind, --target-key, --target-label — overrides; default to COMPONENT + the file path.
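For example, gating an agent's uncommitted edit could combine the flags above (src/foo.ts reused from the earlier examples; target defaults apply):
judge score-edit src/foo.ts --judge code-quality \
  --working --agent claude-code --fail-on-regression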
Project workflow
For repos that want a checked-in score config, use judge init. It creates a Project on the server and writes .judge/config.json with named pipelines. From then on, every judge run mirrors the file to the server (so the dashboard knows what you have), scores every pipeline, and caches the result locally.
judge init # interactive — creates .judge/config.json,
# adds 'judges' + 'judges:sync' npm scripts,
# runs codegen so autocomplete works
judge run # score every pipeline; write .judge/cache/<slug>.json
judge run --only homepage # filter — only one pipeline (or comma-list)
judge run --fail-on-regression # exit 1 on any REGRESSED trend (CI gate)
judge run --no-sync # skip the server upsert step (rare; lets you
# score from a config that hasn't been blessed
# by the server yet — e.g. air-gapped CI)
judge sync # pull server's canonical artifacts into
# .judge/cache/ (read-only mirror; useful on
# fresh clone or in CI without credentials)
judge codegen # refresh .judge/generated — IDE autocomplete +
# strict zod slug validation
judge artifacts # latest score per pipeline (one line each)
judge projects # list your projects
judge pipelines # list pipelines in this project
Config file — anatomy
.judge/config.json is the source of truth. Every field below is validated by Zod before the CLI makes a single RPC call — bad slugs, missing labels, unknown kinds all fail fast with an explicit error.
{
// Local JSON Schema → editors auto-complete the `judge:` field
// with the slugs that actually exist on your server. Generated by
// `judge codegen`; `judge init` runs codegen for you.
"$schema": "./.judge/generated/config.schema.json",
// Project slug — must match a project visible in `judge projects`.
// Created by `judge init` and validated against the server on `judge run`.
"project": "acme-web",
"pipelines": [
{
// kebab-case, unique within the project. Stable identifier — the
// server tracks history per (project, slug) so you can rename the
// judge or move the target without losing the iteration timeline.
"slug": "homepage-buyer",
// Human-readable. Shown in the dashboard + iteration emails.
"name": "Homepage × buyer persona",
// Slug of a judge available on your account. Tab-completes in your
// editor after `judge codegen`. Frozen-by-string: surviving renames
// requires editing the field too. Strict mode rejects unknown slugs.
"judge": "buyer-persona",
"target": {
// COMPONENT | SCREEN | PACKAGE | WEBSITE | REPO | OTHER
"kind": "WEBSITE",
// Stable canonical identity. Identity is (kind, key) FOREVER.
// Two pipelines with the same (kind, key) reuse the same Target
// and share history — so use a stable string (full URL, repo-
// relative path, owner/name, package import name).
"key": "https://acme.com",
// Human-readable noun phrase. Doesn't affect identity; can be
// edited freely.
"label": "Homepage"
},
"source": {
// GITHUB_PR | GITHUB_FILE | WEBSITE_HTML | WEBSITE_SCREENSHOT |
// IMAGE_URL | MCP | TEXT
"kind": "WEBSITE_HTML",
// EXACTLY ONE of: `config`, `configFromFile`, or `configFromShell`.
// `config` is provider-specific (see "Source kinds" below).
"config": { "url": "https://acme.com" }
},
// Optional cron expression. Null/absent = on-demand only. The
// server's cron dispatcher fires this on schedule.
"cron": null,
// Optional. Default true. When false, `judge run` skips this
// pipeline AND the server's cron won't fire it either.
"enabled": true
}
]
}
Validation rules (the Zod schema)
- project — required, non-empty string. Must match an existing Project on the server (otherwise pipeline_upsert errors at run time).
- pipelines — at least one. Order doesn't matter; judge run processes them sequentially.
- pipelines[].slug — kebab-case ^[a-z0-9-]+$, unique within the project. Has Spaces or UPPER are rejected.
- pipelines[].judge — known slug in strict mode (.judge/generated/judges.json present), any non-empty string in loose mode. Typo → did you mean "code-quality"?.
- pipelines[].target.{kind,key,label} — all three required. Forgetting label is the most common mistake.
- pipelines[].source — must declare exactly one of: config, configFromFile, configFromShell.
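A minimal sketch of roughly what that schema enforces, written with zod — illustrative only; field names follow the config anatomy above, not the actual source:
import { z } from "zod";

// Loose-mode pipeline schema: structural checks only, any non-empty judge slug.
const pipelineSchema = z.object({
  slug: z.string().regex(/^[a-z0-9-]+$/, "kebab-case only"),
  name: z.string().min(1),
  judge: z.string().min(1), // strict mode swaps in z.enum([...known slugs])
  target: z.object({
    kind: z.enum(["COMPONENT", "SCREEN", "PACKAGE", "WEBSITE", "REPO", "OTHER"]),
    key: z.string().min(1),
    label: z.string().min(1),
  }),
  source: z.object({
    kind: z.enum(["GITHUB_PR", "GITHUB_FILE", "WEBSITE_HTML", "WEBSITE_SCREENSHOT", "IMAGE_URL", "MCP", "TEXT"]),
    config: z.record(z.unknown()).optional(),
    configFromFile: z.string().optional(),
    configFromShell: z.string().optional(),
  }).refine(
    (s) => [s.config, s.configFromFile, s.configFromShell].filter(Boolean).length === 1,
    { message: "declare exactly one of config / configFromFile / configFromShell" }
  ),
  cron: z.string().nullable().optional(),
  enabled: z.boolean().optional(),
});

const configSchema = z.object({
  project: z.string().min(1),
  pipelines: z.array(pipelineSchema).min(1),
});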
Composing pipelines
A pipeline is (target × source × judge). Each axis is a decision; here's how to choose.
How many pipelines per project?
- One per concern, not one per file. Score the landing page with the buyer persona as one pipeline, the checkout page with the same persona as another. Don't fan out 20 pipelines for one judge across 20 files — use a shell loop with judge score-edit-staged for that.
- Multiple judges on the same target = multiple pipelines. Pricing page × buyer-persona AND pricing page × ux-accessibility = two pipelines, two named slugs, two history timelines.
- Make slugs concept-stable. Prefer homepage-buyer over score-acme-com. The slug should describe the question being asked, not the URL — so renaming the domain doesn't orphan history.
Picking the target kind
- COMPONENT — file or function. Key: repo-relative path (src/components/Button.tsx) or <file>::<function>.
- SCREEN — route or page. Key: route slug (/checkout) — never abbreviate.
- PACKAGE — fully-qualified name (@my-org/auth-core).
- WEBSITE — full URL with protocol (https://acme.com/pricing).
- REPO — owner/name. Use this for PR scoring — the same target accumulates history across every PR on that repo (the PR number lives in the source, not the target).
- OTHER — any short, stable, distinct slug you control.
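For example, a function-level COMPONENT target using the <file>::<function> key form (the function name here is hypothetical), in the same shape as the config anatomy above:
"target": { "kind": "COMPONENT", "key": "src/auth.ts::refreshToken", "label": "Token refresh" }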
Picking the source
Match the source to what the rubric needs to read, not to what's easiest:
- A code-quality rubric reading a 1500-line file: use TEXT with configFromFile. Cheaper than fetching from GitHub on every run.
- A buyer-persona rubric reading a landing page: use WEBSITE_HTML for the visible-text extract, or WEBSITE_SCREENSHOT if visual hierarchy matters.
- A security rubric on a PR: use GITHUB_PR — the diff is the unit of review.
- A brand-consistency rubric on an image: IMAGE_URL sends it multimodal.
Source kinds — reference
Each source kind has a typed config. The CLI shorthand flags (above) write these for you; the table is for the moments when you write the config by hand.
WEBSITE_HTML — pull visible text
"source": {
"kind": "WEBSITE_HTML",
"config": { "url": "https://acme.com/pricing" }
}
WEBSITE_SCREENSHOT — full-page PNG via Playwright
"source": {
"kind": "WEBSITE_SCREENSHOT",
"config": { "url": "https://acme.com/pricing" }
}
IMAGE_URL — direct fetch
"source": {
"kind": "IMAGE_URL",
"config": { "url": "https://cdn.acme.com/hero.png" }
}
GITHUB_PR — diff + base/head SHA
"source": {
"kind": "GITHUB_PR",
"config": { "repo": "acme/web", "prNumber": 314 },
"connectionId": "ckzz…" // optional — needed for private repos
}
GITHUB_FILE — one file at a ref
"source": {
"kind": "GITHUB_FILE",
"config": { "repo": "acme/web", "ref": "main", "path": "src/auth.ts" },
"connectionId": "ckzz…"
}
TEXT — inline string OR (better) read from disk at run time
// Inline:
"source": { "kind": "TEXT", "config": { "text": "Your card was declined." } }
// Or read from a local file each run (CI-friendly):
"source": { "kind": "TEXT", "configFromFile": "src/components/Button.tsx" }
// Or run a command and use stdout (great for diffs / build logs):
"source": { "kind": "TEXT", "configFromShell": "git diff HEAD~1 -- src/" }MCP — read a resource from another MCP server
"source": {
"kind": "MCP",
"config": { "uri": "mcp://my-server/resources/spec.md" },
"connectionId": "ckzz…" // a Connection of kind=MCP
}
Source hints — configFromFile / configFromShell / configFromBundle
Three hints exist on every source, resolved by the CLI before calling the server (the server has no filesystem access into your repo). They turn whatever kind is declared into a TEXT source filled with the resolved content.
- source.configFromFile: "<path>" — read one file at run time. Path is relative to the repo root (where .judge/config.json lives). Caution: fine for self-contained artifacts (a website snapshot, a single doc), but for code-quality judges this hides test files and pins test_signal-style metrics low — prefer configFromBundle below.
- source.configFromShell: "<cmd>" — run the shell command and use stdout. Useful for git diff HEAD~1, build logs, generated reports, anything that has to be computed at run time.
- source.configFromBundle: { paths, includeTests } — bundle one or more files into a single TEXT artifact prefixed with a === SOURCE PROVENANCE === header (file list, line counts, whether tests were detected). With includeTests: true the CLI auto-discovers {base}.test.*, {base}.spec.*, and __tests__/**/{base}* for each path so context-needing metrics see the real codebase. This is the recommended mode for code-quality and similar judges.
// Recommended for code: bundle impl + sibling tests
"source": {
"kind": "TEXT",
"configFromBundle": {
"paths": ["src/lib/foo.ts"],
"includeTests": true // auto-discovers foo.test.ts, __tests__/foo*
}
}
Run judge run with a context-mismatched pipeline and the CLI prints a concrete migration warning: e.g. when a judge has a test_signal metric (declared via requires: ["impl", "tests"] on the metric) but your source ships only one file, the warning shows the exact configFromBundle snippet to switch to.
Pipelines whose source uses one of these hints can be scored by judge run but not from the dashboard's "Run now" button — the UI shows a CLI only chip on those rows. Keep them for things that genuinely depend on local state; for static content, prefer an explicit config so the dashboard can rerun.
Validation & codegen
Two layers catch bad config, with codegen tightening the second:
1. Editor (JSON Schema)
The $schema field at the top of .judge/config.json points at ./.judge/generated/config.schema.json. VS Code, Cursor, JetBrains pick this up automatically: tab-complete on judge: shows the slugs that exist on your server, with markdown tooltips listing each rubric's name and system / custom badge. Typos light up red before you save.
2. CLI (Zod, two modes)
- Loose — used when .judge/generated/judges.json is missing (fresh clone, before judge codegen has ever run). Validates structure (kinds, required fields, slug regex); accepts any string for judge:. The CLI prints a one-line hint suggesting judge codegen.
- Strict — automatic once codegen has run. Adds judge: as a z.enum of known slugs. Typo → the CLI exits with the available list and the closest match (Levenshtein):
$ judge run
judge: .judge/config.json failed validation (strict mode)
- pipelines.0.judge: Invalid enum value. Expected 'code-quality' | 'security' | …, received 'code-qualty'
→ did you mean "code-quality"?
→ available: code-quality, security, spec-completeness, …
  → if you just created it, run: judge codegen
When to re-run codegen
- You created or renamed a judge → judge codegen (otherwise the new slug is "unknown" in strict mode).
- You deleted a judge that some pipeline still references → judge codegen + edit the config.
- Switching between JUDGE_URL environments (dev / staging / prod) where rubrics differ — codegen pulls from the server pointed at by JUDGE_URL, so re-run it when you switch.
Use case recipes
1. Score every PR against a code-quality rubric — exit non-zero on regression
{
"project": "acme-web",
"pipelines": [{
"slug": "pr-quality",
"name": "PR × code-quality",
"judge": "code-quality",
"target": { "kind": "REPO", "key": "acme/web", "label": "Web app" },
"source": {
"kind": "GITHUB_PR",
"config": { "repo": "acme/web", "prNumber": 0 },
"configFromShell": "echo $PR_NUMBER"
}
}]
}
In CI: PR_NUMBER=$GITHUB_PR judge run --fail-on-regression. See judge install gh-actions for the canonical workflow.
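A sketch of the corresponding GitHub Actions step — the PR-number wiring and secret names are assumptions, and it assumes the CLI is on PATH in the job; judge install gh-actions writes the canonical file:
# hypothetical step in .github/workflows/judge.yml for this recipe
- run: judge run --only pr-quality --fail-on-regression
  env:
    PR_NUMBER: ${{ github.event.pull_request.number }}
    JUDGE_URL: ${{ secrets.JUDGE_URL }}
    JUDGE_API_KEY: ${{ secrets.JUDGE_API_KEY }}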
2. Track a landing page weekly under three personas
{
"project": "acme-marketing",
"pipelines": [
{
"slug": "home-buyer",
"name": "Home × buyer persona",
"judge": "buyer-persona",
"target": { "kind": "WEBSITE", "key": "https://acme.com",
"label": "Homepage" },
"source": { "kind": "WEBSITE_HTML",
"config": { "url": "https://acme.com" } },
"cron": "0 6 * * MON"
},
{
"slug": "home-skeptic",
"name": "Home × technical-skeptic persona",
"judge": "technical-skeptic",
"target": { "kind": "WEBSITE", "key": "https://acme.com",
"label": "Homepage" },
"source": { "kind": "WEBSITE_HTML",
"config": { "url": "https://acme.com" } },
"cron": "0 6 * * MON"
},
{
"slug": "home-mobile-shot",
"name": "Home × visual-hierarchy",
"judge": "visual-hierarchy",
"target": { "kind": "WEBSITE", "key": "https://acme.com",
"label": "Homepage" },
"source": { "kind": "WEBSITE_SCREENSHOT",
"config": { "url": "https://acme.com" } },
"cron": "0 6 * * MON"
}
]
}
Three pipelines, same target — three independent timelines you can graph side-by-side on /projects/acme-marketing.
3. Catch regressions on uncommitted changes (pre-commit)
No config needed; judge install precommit drops a hook that calls judge precommit-run. To make it score the change rather than the absolute file:
# .husky/pre-commit (replace the generated 'judge precommit-run' line)
judge score-edit-staged --judge code-quality --fail-on-regression
4. Compare two prompt versions (A/B)
Treat each prompt version as a different artifact with its own target key. Score both; judge progress shows them side-by-side.
judge score --judge response-quality \
  --target-kind OTHER --target-key prompt/v1 --target-label "Prompt v1" \
  --text "$(cat prompts/v1.txt)"

judge score --judge response-quality \
  --target-kind OTHER --target-key prompt/v2 --target-label "Prompt v2" \
  --text "$(cat prompts/v2.txt)"

judge progress --judge response-quality
5. Track an LLM agent's edits (per-agent reliability)
Pass --agent on every score-edit call so per-agent rollups make sense:
# In a Claude Code session that just edited src/foo.ts:
judge score-edit src/foo.ts --judge code-quality \
  --working --agent claude-code --agent-version "claude-sonnet-4-6"

# In a Cursor session:
judge score-edit src/foo.ts --judge code-quality \
  --working --agent cursor --agent-version "cursor-2025-04"

# Then on the dashboard: /reliability shows IMPROVED / REGRESSED rates per agent.
6. Score raw text (microcopy, error messages, emails)
judge score \
  --judge customer-centric \
  --target-kind OTHER --target-key copy/declined-card \
  --target-label "Declined-card error copy" \
  --text "Your card was declined. Please try again."
CI integration
Two CI patterns, both gated on the same exit code semantics.
Pattern A — single PR-level gate
judge install gh-actions writes .github/workflows/judge.yml. It scores the PR against an opinionated set of judges (default: code-quality, security) and fails the build on any REGRESSED trend. Override: judge install gh-actions --judges code-quality,security,custom-slug.
Pattern B — multi-pipeline run from a checked-in config
For repos already on the project workflow:
# .github/workflows/judge.yml
- run: pnpm install
- run: pnpm judges --fail-on-regression
env:
JUDGE_URL: ${{ secrets.JUDGE_URL }}
JUDGE_API_KEY: ${{ secrets.JUDGE_API_KEY }}
In PR builds, use judge run --no-sync to avoid re-upserting pipeline definitions on every run. Drop --no-sync on a scheduled job that should keep the server in sync with what's in the repo.
Stable identity — what NOT to do
Identity is (kind, key) forever. Two pipelines with different keys for the same thing accumulate two histories that never reconcile. The most common mistakes:
- Abbreviating — Button.tsx instead of src/components/Button.tsx; acme.com instead of https://acme.com; dashboard instead of /dashboard.
- Putting variable info in the key — e.g. acme/web#314 as a REPO key. Put the PR number in the source config.prNumber; the target stays acme/web, and history accumulates across all PRs on the repo.
- Reusing one key for two unrelated things — silently destroys both histories. Pick a new, stable, distinct key.
See MCP for the full identity rules and recall pattern (target_resolve before every judge_score).
MCP server
Judge exposes its full toolset over the Model Context Protocol. Wire it into Claude Code, Claude Desktop, or any MCP-aware client.
Endpoint
The server is mounted at /mcp on the same host that serves the dashboard. It speaks JSON-RPC over HTTP and authenticates with a bearer API key (mint one on /connect).
POST https://judge.example.com/mcp
Authorization: Bearer jdg_xxxxxxxxxxxx
Content-Type: application/json
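As a quick connectivity check, a hand-rolled JSON-RPC call also works (tools/list is the standard MCP method for enumerating tools; exact headers and response framing depend on the MCP transport version):
curl -s https://judge.example.com/mcp \
  -H "Authorization: Bearer $JUDGE_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'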
Wire it into Claude Code
One command — the CLI edits ~/.claude/settings.json for you, preserving anything you already had there:
judge install claude-code
# → adds the Judge MCP entry to ~/.claude/settings.json
# → restart Claude Code, then ask: "list everything Judge is tracking"
Manual fallback (paste into ~/.claude/settings.json or a project-local .mcp.json):
{
"mcpServers": {
"judge": {
"type": "http",
"url": "https://judge.example.com/mcp",
"headers": { "Authorization": "Bearer jdg_xxxxxxxxxxxx" }
}
}
}
Wire it into Codex CLI / Claude Desktop
These clients spawn MCP servers as stdio processes. The Judge CLI ships a stdio bridge — judge mcp-stdio — so the same toolset works there too. The install commands write the platform-correct config file:
judge install codex            # → ~/.codex/config.toml
judge install claude-desktop   # → claude_desktop_config.json (macOS/Win/Linux)
Available tools
Each tool is a typed JSON-Schema endpoint. Claude Code will surface them as callable tools automatically.
Judges
- judge_list — List judges available to you
- judge_get — Show a judge's full rubric and prompt
- judge_create — Generate a custom rubric from a one-line prompt
- judge_update — Patch a user-owned judge
- judge_delete — Delete a user-owned judge
Scoring
- judge_score — Run one Job and return a Score
- judge_history — Score history per (target × judge)
- judge_actions — Ranked actions to move the score
- target_progress — Roll-up across all judges for a target
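For orientation, a judge_score call from an agent might carry arguments like these — the argument names mirror the REST payload shown later and are an assumption, not the published tool schema:
{
  "name": "judge_score",
  "arguments": {
    "judgeSlug": "code-quality",
    "target": { "kind": "COMPONENT", "key": "src/auth.ts", "label": "Auth module" },
    "source": { "kind": "TEXT", "config": { "text": "…file contents…" } }
  }
}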
Targets
- target_search — Find targets by key or label
- target_resolve — Get-or-create a target from kind+key
- target_upsert — Create or update a target
- target_merge — Merge duplicate targets
Websites
- website_list — Crawled sites
- website_create — Add a site and start crawling
- website_suggest_judges — Auto-generate site-grounded rubrics
- website_generate_personas — Auto-generate site-grounded personas
- website_create_persona — One persona from a free-text brief
- website_schedule_scoring — Cron-schedule (URL × judge) jobs
Projects & pipelines
- project_create — Create a project
- project_list — List your projects
- pipeline_upsert — Define / update a pipeline by slug
- pipeline_artifacts — Latest score per pipeline
Setup
- setup_list_playbooks — Curated quick-start configs
- setup_get_playbook — Detail of one playbook
REST API
When you don't want a tool layer, hit the same endpoints that the dashboard hits. Bearer-token authenticated.
Authentication
Mint an API key on /connect in the dashboard. Pass it as Authorization: Bearer …. Keys are scoped per user and revocable.
Score one thing
POST /api/score
Authorization: Bearer jdg_xxxxxxxxxxxx
Content-Type: application/json
{
"target": {
"kind": "WEBSITE",
"key": "https://acme.com/pricing",
"label": "Pricing page"
},
"source": {
"kind": "WEBSITE_HTML",
"config": { "url": "https://acme.com/pricing" }
},
"judgeSlug": "buyer-persona"
}
Other endpoints
- GET /api/scores/[id] — one score with full metric breakdown and rationale
- GET /api/jobs — scheduled jobs; POST /api/jobs/[id]/run to fire one manually
- POST /api/websites — add a site to crawl
- POST /api/websites/[id]/suggest — generate judge drafts of a kind (PERSONA / REVIEWER / RUBRIC); preview before saving
- POST /api/websites/[id]/judges/save — persist chosen drafts
- PATCH /api/judges/[id] / DELETE /api/judges/[id] — edit / delete user-owned judges
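A curl sketch of the "Score one thing" request above (host and key are placeholders):
curl -s -X POST https://judge.example.com/api/score \
  -H "Authorization: Bearer $JUDGE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "target": { "kind": "WEBSITE", "key": "https://acme.com/pricing", "label": "Pricing page" },
    "source": { "kind": "WEBSITE_HTML", "config": { "url": "https://acme.com/pricing" } },
    "judgeSlug": "buyer-persona"
  }'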
GitHub integration
Pull diffs and files straight into Judge with one of two auth modes — the GitHub App for permanent install, or a personal access token for ad-hoc work.
The Judge GitHub App
Install on your org / repo from /connect. The app gets read access to PRs and repo contents, and Judge subscribes to pull_request and push webhooks. PRs get scored automatically against any pipelines you've linked.
Personal access tokens
Want to score one repo without an install? Add a fine-grained PAT to a Connection on /connections and reference it via --connection <id> in the CLI or connectionId in REST / MCP source configs.
Source kinds
- GITHUB_PR — diff + base/head SHA. Config: { repo: "owner/name", prNumber: 314 }.
- GITHUB_FILE — one file at a ref. Config: { repo: "owner/name", ref: "main", path: "src/auth.ts" }.
BYOK & model choice
Bring your own LLM key for zero-margin pricing. The architecture isn't Claude-locked — Sonnet 4.6 is the default, swap if you prefer.
Bring your own key
Add an Anthropic API key on /byok. Stored encrypted with AES-256-GCM and decrypted only at request time. When present, all scoring runs use your key — the hosted plan steps out of the loop.
Why Sonnet 4.6 by default
Sonnet 4.6 is fast, cheap, and reliably tool-calls submit_score with a JSON-Schema-validated payload — which is what makes typed scores possible. You can override per-judge via modelId if you want to test against a different model.
Output is enforced
On every run, the LLM is forced to call the submit_score tool. No prose to parse, no malformed runs — only typed values that match the judge's metrics spec. That's why deltas, history, and dashboards all work mechanically.
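For illustration, the forced tool call produces something shaped like this — the metric names here are hypothetical, since the real argument schema is derived from each judge's metrics spec:
{
  "name": "submit_score",
  "arguments": {
    "metrics": {
      "clarity": 7,
      "has_pricing_table": true
    },
    "rationale": "The headline states the value proposition, but the pricing tiers are buried below the fold."
  }
}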