Last updated 2026-05-06 · For developers and teams shipping LLM features who want a regression gate before merge
TL;DR

Judge scores any artifact (URL, file, PR, screenshot) against a typed rubric, an expert critic, or a buyer persona — and computes IMPROVED / STABLE / REGRESSED against the previous run in code, not by a second LLM call. Use it three ways: CLI in CI to fail builds on regression, GitHub App to score every PR, or MCP so agents check their own output. BYOK Anthropic key — pay at cost, no Judge markup.

Start here

Quickstart

Score your first thing in five minutes — from the dashboard, the CLI, or an MCP-aware agent.

    01

    Create an account

    Email + password. The 10 system rubrics are seeded into your account on first sign-in.

    02

    Create a judge

    Add a website and click Suggest personas / reviewers / rubrics. Or describe a custom rubric in one line on the Judges tab.

    03

    Run it

    Pick a target (URL, file, PR, document, screenshot…) and run. Scores stream in with deterministic deltas and per-metric history.

Prefer not to use the dashboard? Go straight to the CLI or MCP. They use the same engine and write to the same database.
Concepts

The data model

Six entities. Each is a typed Postgres row, queryable directly or via the API.

Target

The thing being scored. Has a kind (COMPONENT, SCREEN, PACKAGE, WEBSITE, REPO, OTHER) and a stable key (URL, file path, owner/name). History is tracked per (target × judge).

Source

How the input is fetched. Pluggable providers: GitHub PR diff, GitHub repo file, website HTML, screenshot, image URL, MCP resource, raw text. New ones drop in by writing a provider.
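
For illustration, a provider is just "typed config in, artifact out". A minimal sketch of what that contract could look like — the names and shape here are hypothetical, not Judge's actual plugin API:

// Hypothetical provider contract — illustrative only.
interface FetchedArtifact {
  text?: string;        // text-like sources (HTML extract, diff, file)
  imagePngUrl?: string; // multimodal sources (screenshot, image)
}

interface SourceProvider<Config> {
  kind: string; // e.g. "WEBSITE_HTML"
  fetch(config: Config): Promise<FetchedArtifact>;
}

// The raw-text provider is the degenerate case: the config is the artifact.
const textProvider: SourceProvider<{ text: string }> = {
  kind: "TEXT",
  fetch: async ({ text }) => ({ text }),
};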

Judge

A system prompt + metrics spec. Three kinds:

  • Persona — a real-feeling visitor reacting through their own lens.
  • Reviewer — a senior third-party expert performing an audit.
  • Rubric — a fixed checklist with weighted boolean / numeric metrics.

Job

A pairing of (Source × Judge × Target) with an optional cron. One-off runs and scheduled runs share the same row shape.

Run + Score

One execution of a Job produces one Run. The Run produces one Score: an overall score from 0–100, typed per-metric values, a rationale, and a deterministic trend bucket (FIRST · IMPROVED · STABLE · REGRESSED).

Iteration comment

Auto-generated diff summary between this Score and the previous one for the same (target × judge). Built from typed metric deltas in code — no extra LLM call.

How the overall is computed

Each metric is normalized to 0..1 inside its range (numbers scaled within min..max; booleans mapped to 0 or 1), multiplied by its weight, summed, and divided by the total weight. The result × 100 is the overall. The heavier a metric's weight, the more it dominates the overall; boolean metrics are typically used as hard gates.

overall = (Σ normalized(metric) × weight) / (Σ weight) × 100
trend   = bucket(overall - previousOverall)   // |Δ| < 1.0 = STABLE
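
In code, the whole computation is a few lines. A minimal sketch in TypeScript (names illustrative, not Judge's internals):

type Metric =
  | { weight: number; kind: "number"; min: number; max: number; value: number }
  | { weight: number; kind: "boolean"; value: boolean };

function normalized(m: Metric): number {
  if (m.kind === "boolean") return m.value ? 1 : 0;   // booleans map to 0 or 1
  return (m.value - m.min) / (m.max - m.min);         // numbers scaled inside min..max
}

function overall(metrics: Metric[]): number {
  const weighted = metrics.reduce((s, m) => s + normalized(m) * m.weight, 0);
  const total = metrics.reduce((s, m) => s + m.weight, 0);
  return (weighted / total) * 100;
}

function bucket(delta: number): "IMPROVED" | "STABLE" | "REGRESSED" {
  if (Math.abs(delta) < 1.0) return "STABLE";         // |Δ| < 1.0 = STABLE
  return delta > 0 ? "IMPROVED" : "REGRESSED";
}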
UI

Dashboard

The web UI for humans. Everything you can do via CLI / MCP / REST you can also do here.

Pages

  • /dashboard — overview of recent activity across all targets and judges.
  • /websites — crawl a site, generate site-grounded personas / reviewers / rubrics, schedule cron-based scoring per URL.
  • /judges — every judge you have access to: system + your own. Filter by kind, owner, or search; edit and delete user-owned ones.
  • /scores — score history per (target × judge) with charts, deltas, and re-run buttons.
  • /jobs — scheduled jobs, enable / disable, cron expressions, last + next run.
  • /connect — mint API keys, install the GitHub App, configure BYOK.
CLI

Command line

A small Node CLI that wraps the MCP endpoint with ergonomic subcommands. Designed for terminals, scripts, and CI. Two operating modes: one-off scoring (good for quick checks) and project workflow with a checked-in config (good for teams + CI).

Install

The CLI ships with the repo as bin/judge.mjs. From a clone of the Judge repo (npm publish coming):

pnpm install
pnpm link --global         # pnpm flow — needs `pnpm setup` first
# or, equivalent and Windows-friendly:
npm link                   # uses the npm global bin already on your PATH

judge help                 # verify "judge" is on PATH

Configure

Two env vars. Mint a key on /connect and copy it immediately — it's shown only once.

export JUDGE_URL="https://judge.example.com"     # default: http://localhost:3000
export JUDGE_API_KEY="jdg_xxxxxxxxxxxx"          # mint on /connect
Every command below requires JUDGE_API_KEY, except judge help and the judge install subcommands that only write a local config (some installs do read the key in order to embed it). With the key missing, the CLI exits with a one-line error and a pointer to /connect.

One-command installs

Each subcommand is idempotent and writes the right file in the right place — no copy-paste, no jq, no curl. They're described in detail on the Integrations page.

judge install claude-code           # ~/.claude/settings.json (HTTP MCP)
judge install claude-desktop        # OS-specific Claude Desktop config (stdio bridge)
judge install codex                 # ~/.codex/config.toml
judge install agent-system-prompt   # appends Judge usage block to CLAUDE.md /
                                    # AGENTS.md / .cursorrules — agents stop
                                    # fragmenting history with bad keys
judge install precommit             # .husky/pre-commit (or .git/hooks/pre-commit)
judge install gh-actions            # .github/workflows/judge.yml
judge install all                   # all of the above

One-off scoring

Use these when you want to score something now without committing a config. Each call creates the Target server-side (or reuses one with the same (kind, key)) and accumulates history under it.

# List judges available to your account
judge judges

# Show one judge's full rubric (metrics + system prompt)
judge get code-quality

# Score a website URL with a persona judge
judge score \
  --judge sarah-k-founder \
  --target-kind WEBSITE --target-key https://acme.com/pricing \
  --target-label "Pricing page" \
  --url https://acme.com/pricing

# Score a GitHub PR with the security rubric
judge score \
  --judge security \
  --target-kind REPO --target-key acme/web \
  --target-label "Web app" \
  --pr acme/web#314

# Score a local file as raw text (read by the CLI, sent inline)
judge score \
  --judge code-quality \
  --target-kind COMPONENT --target-key src/auth.ts \
  --target-label "Auth module" \
  --text "$(cat src/auth.ts)"

# Generate a custom rubric from a one-line prompt
judge create-judge "B2B onboarding clarity judge"

# View history per (target × judge)
judge history --judge code-quality --target src/auth.ts --limit 25

# Roll-up across all judges for a target — or platform-wide if --target omitted
judge progress --target src/auth.ts

Source flag shorthands

judge score needs exactly one source. Pick the shorthand that matches your input:

--pr      owner/name#NUMBER        # GITHUB_PR (diff)
--file    owner/name@ref:path      # GITHUB_FILE (one file at ref)
--url     https://…                # WEBSITE_HTML (visible-text extract)
--shot    https://…                # WEBSITE_SCREENSHOT (Playwright PNG, multimodal)
--image   https://…                # IMAGE_URL (png/jpeg/webp/gif)
--text    "literal text"           # TEXT (inline)
--source-kind X --source-config '{…}'   # escape hatch for kinds without a flag
Add --connection <id> when the source needs auth (private GitHub repo, MCP server). Connection IDs are visible in the dashboard URL on each connection's row.

Edit scoring (the wedge)

Most evaluators score one artifact at a time. Judge also scores the change between two versions and tells you whether it improved or regressed quality — the answer to "is today's iteration better than yesterday's?". It returns editVerdict ∈ {IMPROVED, STABLE, REGRESSED}, weightedEditDelta, and per-metric deltas — without you having to score before/after manually and diff the results yourself.

# Score what the last commit did to a file (HEAD~1 → HEAD)
judge score-edit src/foo.ts --judge code-quality

# Score uncommitted changes (HEAD → working tree)
judge score-edit src/foo.ts --judge code-quality --working

# Compare two arbitrary refs
judge score-edit src/foo.ts --judge code-quality --before main --after HEAD

# Score everything currently staged for commit (the pre-commit hook does this)
judge score-edit-staged --judge code-quality
judge score-edit-staged --judge spec-completeness --include '\.md$'
judge score-edit-staged --judge code-quality --fail-on-regression

Common edit-scoring flags

  • --agent <id> — attribution. Defaults to human:<git config user.email>. Set to claude-code / cursor / devin / copilot-workspace when an agent is making the edit. Powers the per-agent reliability rollups on /reliability.
  • --fail-on-regression — exit code 2 if the verdict is REGRESSED. Use in pre-commit / CI gates.
  • --target-kind, --target-key, --target-label — overrides; the default is COMPONENT + the file path.
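
The exit-code contract makes the verdict scriptable beyond the built-in hooks. A sketch, assuming the documented exit code 2 when --fail-on-regression sees REGRESSED:

// gate-edit.ts — fail an automated edit on regression (sketch).
import { spawnSync } from "node:child_process";

const result = spawnSync("judge", [
  "score-edit", "src/foo.ts",
  "--judge", "code-quality",
  "--working",
  "--agent", "claude-code",
  "--fail-on-regression",
], { stdio: "inherit" });

if (result.status === 2) {
  console.error("Edit regressed quality — reverting instead of committing.");
  process.exit(1);
}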

Project workflow

For repos that want a checked-in score config, use judge init. It creates a Project on the server and writes .judge/config.json with named pipelines. From then on, every judge run mirrors the file to the server (so the dashboard knows what you have), scores every pipeline, and caches the result locally.

judge init                     # interactive — creates .judge/config.json,
                               # adds 'judges' + 'judges:sync' npm scripts,
                               # runs codegen so autocomplete works
judge run                      # score every pipeline; write .judge/cache/<slug>.json
judge run --only homepage      # filter — only one pipeline (or comma-list)
judge run --fail-on-regression # exit 1 on any REGRESSED trend (CI gate)
judge run --no-sync            # skip the server upsert step (rare; lets you
                               # score from a config that hasn't been blessed
                               # by the server yet — e.g. air-gapped CI)
judge sync                     # pull server's canonical artifacts into
                               # .judge/cache/ (read-only mirror; useful on
                               # fresh clone or in CI without credentials)
judge codegen                  # refresh .judge/generated — IDE autocomplete +
                               # strict zod slug validation
judge artifacts                # latest score per pipeline (one line each)
judge projects                 # list your projects
judge pipelines                # list pipelines in this project

Config file — anatomy

.judge/config.json is the source of truth. Every field below is validated by Zod before the CLI makes a single RPC call — bad slugs, missing labels, unknown kinds all fail fast with an explicit error.

{
  // Local JSON Schema → editors auto-complete the `judge:` field
  // with the slugs that actually exist on your server. Generated by
  // `judge codegen`; `judge init` runs codegen for you.
  "$schema": "./.judge/generated/config.schema.json",

  // Project slug — must match a project visible in `judge projects`.
  // Created by `judge init` and validated against the server on `judge run`.
  "project": "acme-web",

  "pipelines": [
    {
      // kebab-case, unique within the project. Stable identifier — the
      // server tracks history per (project, slug) so you can rename the
      // judge or move the target without losing the iteration timeline.
      "slug": "homepage-buyer",

      // Human-readable. Shown in the dashboard + iteration emails.
      "name": "Homepage × buyer persona",

      // Slug of a judge available on your account. Tab-completes in your
      // editor after `judge codegen`. The reference is by string: renaming
      // the judge means editing this field too. Strict mode rejects unknown slugs.
      "judge": "buyer-persona",

      "target": {
        // COMPONENT | SCREEN | PACKAGE | WEBSITE | REPO | OTHER
        "kind": "WEBSITE",

        // Stable canonical identity. Identity is (kind, key) FOREVER.
        // Two pipelines with the same (kind, key) reuse the same Target
        // and share history — so use a stable string (full URL, repo-
        // relative path, owner/name, package import name).
        "key": "https://acme.com",

        // Human-readable noun phrase. Doesn't affect identity; can be
        // edited freely.
        "label": "Homepage"
      },

      "source": {
        // GITHUB_PR | GITHUB_FILE | WEBSITE_HTML | WEBSITE_SCREENSHOT |
        // IMAGE_URL | MCP | TEXT
        "kind": "WEBSITE_HTML",

        // EXACTLY ONE of: `config`, `configFromFile`, `configFromShell`,
        // or `configFromBundle`. `config` is provider-specific (see
        // "Source kinds" below).
        "config": { "url": "https://acme.com" }
      },

      // Optional cron expression. Null/absent = on-demand only. The
      // server's cron dispatcher fires this on schedule.
      "cron": null,

      // Optional. Default true. When false, `judge run` skips this
      // pipeline AND the server's cron won't fire it either.
      "enabled": true
    }
  ]
}

Validation rules (the Zod schema)

  • project — required, non-empty string. Must match an existing Project on the server (otherwise pipeline_upsert errors at run time).
  • pipelines — at least one. Order doesn't matter; judge run processes them sequentially.
  • pipelines[].slug — kebab-case ^[a-z0-9-]+$, unique within the project. Spaces and uppercase are rejected.
  • pipelines[].judge — a known slug in strict mode (.judge/generated/judges.json present), any non-empty string in loose mode. A typo gets a did you mean "code-quality"? suggestion.
  • pipelines[].target.{kind,key,label} — all three required. Forgetting label is the most common mistake.
  • pipelines[].source — must declare exactly one of: config, configFromFile, configFromShell, configFromBundle.
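
Condensed, the rules above look roughly like this in Zod — a sketch; the real schema lives in the CLI and its field names may differ:

import { z } from "zod";

const pipelineSchema = z.object({
  slug: z.string().regex(/^[a-z0-9-]+$/),        // kebab-case, no spaces/UPPER
  name: z.string().min(1),
  judge: z.string().min(1),                      // z.enum([...known slugs]) in strict mode
  target: z.object({
    kind: z.enum(["COMPONENT", "SCREEN", "PACKAGE", "WEBSITE", "REPO", "OTHER"]),
    key: z.string().min(1),
    label: z.string().min(1),                    // all three required
  }),
  source: z
    .object({
      kind: z.string(),
      config: z.record(z.unknown()).optional(),
      configFromFile: z.string().optional(),
      configFromShell: z.string().optional(),
      configFromBundle: z
        .object({ paths: z.array(z.string()).min(1), includeTests: z.boolean().optional() })
        .optional(),
    })
    .refine(
      (s) => [s.config, s.configFromFile, s.configFromShell, s.configFromBundle]
        .filter(Boolean).length === 1,
      { message: "declare exactly one of config / configFromFile / configFromShell / configFromBundle" },
    ),
  cron: z.string().nullable().optional(),
  enabled: z.boolean().optional(),
});

const configSchema = z.object({
  project: z.string().min(1),
  pipelines: z.array(pipelineSchema).min(1),
});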

Composing pipelines

A pipeline is (target × source × judge). Each axis is a decision; here's how to choose.

How many pipelines per project?

  • One per concern, not one per file. Score the landing page with the buyer persona as one pipeline, the checkout page with the same persona as another. Don't fan out 20 pipelines for one judge across 20 files — use a shell loop with judge score-edit-staged for that.
  • Multiple judges on the same target = multiple pipelines. Pricing page × buyer-persona AND pricing page × ux-accessibility = two pipelines, two named slugs, two history timelines.
  • Make slugs concept-stable. Prefer homepage-buyer over score-acme-com. The slug should describe the question being asked, not the URL — so renaming the domain doesn't orphan history.

Picking the target kind

  • COMPONENT — file or function. Key: repo-relative path (src/components/Button.tsx) or <file>::<function>.
  • SCREEN — route or page. Key: route slug (/checkout) — never abbreviate.
  • PACKAGE — fully-qualified name (@my-org/auth-core).
  • WEBSITE — full URL with protocol (https://acme.com/pricing).
  • REPO — owner/name. Use this for PR scoring — the same target accumulates history across every PR on that repo (the PR number lives in the source, not the target).
  • OTHER — any short, stable, distinct slug you control.

Picking the source

Match the source to what the rubric needs to read, not to what's easiest:

  • A code-quality rubric reading a 1500-line file: use TEXT with configFromFile. Cheaper than fetching from GitHub on every run.
  • A buyer-persona rubric reading a landing page: use WEBSITE_HTML for the visible-text extract, or WEBSITE_SCREENSHOT if visual hierarchy matters.
  • A security rubric on a PR: use GITHUB_PR — the diff is the unit of review.
  • A brand-consistency rubric on an image: IMAGE_URL sends it multimodal.

Source kinds — reference

Each source kind has a typed config. The CLI shorthand flags (above) write these for you; the table is for the moments when you write the config by hand.

WEBSITE_HTML — pull visible text

"source": {
  "kind": "WEBSITE_HTML",
  "config": { "url": "https://acme.com/pricing" }
}

WEBSITE_SCREENSHOT — full-page PNG via Playwright

"source": {
  "kind": "WEBSITE_SCREENSHOT",
  "config": { "url": "https://acme.com/pricing" }
}

IMAGE_URL — direct fetch

"source": {
  "kind": "IMAGE_URL",
  "config": { "url": "https://cdn.acme.com/hero.png" }
}

GITHUB_PR — diff + base/head SHA

"source": {
  "kind": "GITHUB_PR",
  "config": { "repo": "acme/web", "prNumber": 314 },
  "connectionId": "ckzz…"     // optional — needed for private repos
}

GITHUB_FILE — one file at a ref

"source": {
  "kind": "GITHUB_FILE",
  "config": { "repo": "acme/web", "ref": "main", "path": "src/auth.ts" },
  "connectionId": "ckzz…"
}

TEXT — inline string OR (better) read from disk at run time

// Inline:
"source": { "kind": "TEXT", "config": { "text": "Your card was declined." } }

// Or read from a local file each run (CI-friendly):
"source": { "kind": "TEXT", "configFromFile": "src/components/Button.tsx" }

// Or run a command and use stdout (great for diffs / build logs):
"source": { "kind": "TEXT", "configFromShell": "git diff HEAD~1 -- src/" }

MCP — read a resource from another MCP server

"source": {
  "kind": "MCP",
  "config": { "uri": "mcp://my-server/resources/spec.md" },
  "connectionId": "ckzz…"     // a Connection of kind=MCP
}

Source hints — configFromFile / configFromShell / configFromBundle

Three hints exist on every source, resolved by the CLI before calling the server (the server has no filesystem access into your repo). They turn whatever kind is declared into a TEXT source filled with the resolved content.

  • source.configFromFile: "<path>" — read one file at run time. Path is relative to the repo root (where .judge/config.json lives). Caution: fine for self-contained artifacts (a website snapshot, a single doc), but for code-quality judges it hides test files and pins test_signal-style metrics low — prefer configFromBundle below.
  • source.configFromShell: "<cmd>" — run the shell command and use stdout. Useful for git diff HEAD~1, build logs, generated reports, anything that has to be computed at run time.
  • source.configFromBundle: { paths, includeTests } — bundle one or more files into a single TEXT artifact prefixed with a === SOURCE PROVENANCE === header (file list, line counts, whether tests were detected). With includeTests: true the CLI auto-discovers {base}.test.*, {base}.spec.*, and __tests__/**/{base}* for each path so context-needing metrics see the real codebase. This is the recommended mode for code-quality and similar judges.
// Recommended for code: bundle impl + sibling tests
"source": {
  "kind": "TEXT",
  "configFromBundle": {
    "paths": ["src/lib/foo.ts"],
    "includeTests": true        // auto-discovers foo.test.ts, __tests__/foo*
  }
}

Run judge run with a context-mismatched pipeline and the CLI prints a concrete migration warning: e.g. when a judge has a test_signal metric (declared via requires: ["impl", "tests"] on the metric) but your source ships only one file, the warning shows the exact configFromBundle snippet to switch to.

Pipelines that use these hints can be run from judge run but not from the dashboard's "Run now" button — the UI shows a CLI only chip on those rows. Keep them for things that genuinely depend on local state; for static content, prefer an explicit config so the dashboard can rerun.
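
Mechanically, hint resolution is simple. A sketch of the behavior described above — illustrative, not the CLI's actual code:

import { readFileSync } from "node:fs";
import { execSync } from "node:child_process";

type SourceDecl = {
  kind: string;
  config?: Record<string, unknown>;
  configFromFile?: string;
  configFromShell?: string;
};

// Resolve hints client-side; the server only ever sees a concrete config.
// (configFromBundle similarly concatenates files — plus discovered tests —
// under a === SOURCE PROVENANCE === header before sending as TEXT.)
function resolveSource(src: SourceDecl): SourceDecl {
  if (src.configFromFile)
    return { kind: "TEXT", config: { text: readFileSync(src.configFromFile, "utf8") } };
  if (src.configFromShell)
    return { kind: "TEXT", config: { text: execSync(src.configFromShell, { encoding: "utf8" }) } };
  return src; // explicit `config` passes through untouched
}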

Validation & codegen

Two layers catch bad config, with codegen tightening the second:

1. Editor (JSON Schema)

The $schema field at the top of .judge/config.json points at ./.judge/generated/config.schema.json. VS Code, Cursor, JetBrains pick this up automatically: tab-complete on judge: shows the slugs that exist on your server, with markdown tooltips listing each rubric's name and system / custom badge. Typos light up red before you save.

2. CLI (Zod, two modes)

  • Loose — used when .judge/generated/judges.json is missing (fresh clone, before judge codegen has ever run). Validates structure (kinds, required fields, slug regex); accepts any string for judge:. CLI prints a one-line hint suggesting judge codegen.
  • Strict — automatic once codegen has run. Adds judge: as a z.enum of known slugs. Typo → CLI exits with the available list and the closest match (Levenshtein):
$ judge run
judge: .judge/config.json failed validation (strict mode)
  - pipelines.0.judge: Invalid enum value. Expected 'code-quality' | 'security' | …, received 'code-qualty'
      → did you mean "code-quality"?
      → available: code-quality, security, spec-completeness, …
      → if you just created it, run: judge codegen

When to re-run codegen

  • You created or renamed a judge → judge codegen (otherwise the new slug is "unknown" in strict mode).
  • You deleted a judge that some pipeline still references → judge codegen + edit the config.
  • Switching between JUDGE_URL environments (dev / staging / prod) where rubrics differ — codegen pulls from the server pointed at by JUDGE_URL, so re-run it when you switch.

Use case recipes

1. Score every PR against a code-quality rubric — exit non-zero on regression

{
  "project": "acme-web",
  "pipelines": [{
    "slug": "pr-quality",
    "name": "PR × code-quality",
    "judge": "code-quality",
    "target": { "kind": "REPO", "key": "acme/web", "label": "Web app" },
    "source": {
      "kind": "GITHUB_PR",
      "config": { "repo": "acme/web", "prNumber": 0 },
      "configFromShell": "echo $PR_NUMBER"
    }
  }]
}

In CI: judge run --fail-on-regression, after a checkout with enough history that origin/main exists to diff against. See judge install gh-actions for the canonical workflow.

2. Track a landing page weekly under three personas

{
  "project": "acme-marketing",
  "pipelines": [
    {
      "slug": "home-buyer",
      "name": "Home × buyer persona",
      "judge": "buyer-persona",
      "target": { "kind": "WEBSITE", "key": "https://acme.com",
                  "label": "Homepage" },
      "source": { "kind": "WEBSITE_HTML",
                  "config": { "url": "https://acme.com" } },
      "cron": "0 6 * * MON"
    },
    {
      "slug": "home-skeptic",
      "name": "Home × technical-skeptic persona",
      "judge": "technical-skeptic",
      "target": { "kind": "WEBSITE", "key": "https://acme.com",
                  "label": "Homepage" },
      "source": { "kind": "WEBSITE_HTML",
                  "config": { "url": "https://acme.com" } },
      "cron": "0 6 * * MON"
    },
    {
      "slug": "home-mobile-shot",
      "name": "Home × visual-hierarchy",
      "judge": "visual-hierarchy",
      "target": { "kind": "WEBSITE", "key": "https://acme.com",
                  "label": "Homepage" },
      "source": { "kind": "WEBSITE_SCREENSHOT",
                  "config": { "url": "https://acme.com" } },
      "cron": "0 6 * * MON"
    }
  ]
}

Three pipelines, same target — three independent timelines you can graph side-by-side on /projects/acme-marketing.

3. Catch regressions on uncommitted changes (pre-commit)

No config needed; judge install precommit drops a hook that calls judge precommit-run. To make it score the change rather than the absolute file:

# .husky/pre-commit  (replace the generated 'judge precommit-run' line)
judge score-edit-staged --judge code-quality --fail-on-regression

4. Compare two prompt versions (A/B)

Treat each prompt version as a different artifact with its own target key. Score both; judge progress shows them side-by-side.

judge score --judge response-quality \
  --target-kind OTHER --target-key prompt/v1 --target-label "Prompt v1" \
  --text "$(cat prompts/v1.txt)"

judge score --judge response-quality \
  --target-kind OTHER --target-key prompt/v2 --target-label "Prompt v2" \
  --text "$(cat prompts/v2.txt)"

judge progress --judge response-quality

5. Track an LLM agent's edits (per-agent reliability)

Pass --agent on every score-edit call so per-agent rollups make sense:

# In a Claude Code session that just edited src/foo.ts:
judge score-edit src/foo.ts --judge code-quality \
  --working --agent claude-code --agent-version "claude-sonnet-4-6"

# In a Cursor session:
judge score-edit src/foo.ts --judge code-quality \
  --working --agent cursor --agent-version "cursor-2025-04"

# Then on the dashboard: /reliability shows IMPROVED / REGRESSED rates per agent.

6. Score raw text (microcopy, error messages, emails)

judge score \
  --judge customer-centric \
  --target-kind OTHER --target-key copy/declined-card \
  --target-label "Declined-card error copy" \
  --text "Your card was declined. Please try again."

CI integration

Two CI patterns, both gated on the same exit code semantics.

Pattern A — single PR-level gate

judge install gh-actions writes .github/workflows/judge.yml. It scores the PR against an opinionated set of judges (default: code-quality, security) and fails the build on any REGRESSED trend. Override: judge install gh-actions --judges code-quality,security,custom-slug.

Pattern B — multi-pipeline run from a checked-in config

For repos already on the project workflow:

# .github/workflows/judge.yml
- run: pnpm install
- run: pnpm judges --fail-on-regression
  env:
    JUDGE_URL: ${{ secrets.JUDGE_URL }}
    JUDGE_API_KEY: ${{ secrets.JUDGE_API_KEY }}
In CI you usually want judge run --no-sync to avoid re-upserting pipeline definitions on every run. Drop --no-sync on a scheduled job that should keep the server in sync with what's in the repo.

Stable identity — what NOT to do

Identity is (kind, key) forever. Two pipelines with different keys for the same thing accumulate two histories that never reconcile. The most common mistakes:

  • Abbreviating Button.tsx instead of src/components/Button.tsx; acme.com instead of https://acme.com; dashboard instead of /dashboard.
  • Putting variable info in the key — e.g. acme/web#314 as a REPO key. Put the PR number in the source config.prNumber; the target stays acme/web, and history accumulates across all PRs on the repo.
  • Reusing one key for two unrelated things — silently destroys both histories. Pick a new, stable, distinct key.

See MCP for the full identity rules and recall pattern (target_resolve before every judge_score).

MCP

MCP server

Judge exposes its full toolset over the Model Context Protocol. Wire it into Claude Code, Claude Desktop, or any MCP-aware client.

Endpoint

The server is mounted at /mcp on the same host that serves the dashboard. It speaks JSON-RPC over HTTP and authenticates with a bearer API key (mint one on /connect).

POST  https://judge.example.com/mcp
Authorization: Bearer jdg_xxxxxxxxxxxx
Content-Type: application/json
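
If you're not using an MCP client, you can still poke the endpoint directly. A sketch with plain fetch — assuming standard MCP JSON-RPC framing (tools/call; some servers require an initialize handshake first) and judge_score arguments mirroring the REST body. Verify the exact field names against the tool's schema:

const res = await fetch("https://judge.example.com/mcp", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.JUDGE_API_KEY}`,
    "Content-Type": "application/json",
    Accept: "application/json, text/event-stream", // streamable-HTTP servers expect both
  },
  body: JSON.stringify({
    jsonrpc: "2.0",
    id: 1,
    method: "tools/call",
    params: {
      name: "judge_score",
      arguments: { // assumed to mirror POST /api/score — check the tool schema
        target: { kind: "WEBSITE", key: "https://acme.com/pricing", label: "Pricing page" },
        source: { kind: "WEBSITE_HTML", config: { url: "https://acme.com/pricing" } },
        judgeSlug: "buyer-persona",
      },
    },
  }),
});
console.log(await res.json());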

Wire it into Claude Code

One command — the CLI edits ~/.claude/settings.json for you, preserving anything you already had there:

judge install claude-code
# → adds the Judge MCP entry to ~/.claude/settings.json
# → restart Claude Code, then ask: "list everything Judge is tracking"

Manual fallback (paste into ~/.claude/settings.json or a project-local .mcp.json):

{
  "mcpServers": {
    "judge": {
      "type": "http",
      "url": "https://judge.example.com/mcp",
      "headers": { "Authorization": "Bearer jdg_xxxxxxxxxxxx" }
    }
  }
}

Wire it into Codex CLI / Claude Desktop

These clients spawn MCP servers as stdio processes. The Judge CLI ships a stdio bridge — judge mcp-stdio — so the same toolset works there too. The install commands write the platform-correct config file:

judge install codex            # → ~/.codex/config.toml
judge install claude-desktop   # → claude_desktop_config.json (macOS/Win/Linux)

Available tools

Each tool is a typed JSON-Schema endpoint. Claude Code will surface them as callable tools automatically.

Judges

  • judge_list — List judges available to you
  • judge_get — Show a judge's full rubric and prompt
  • judge_create — Generate a custom rubric from a one-line prompt
  • judge_update — Patch a user-owned judge
  • judge_delete — Delete a user-owned judge

Scoring

  • judge_score — Run one Job and return a Score
  • judge_history — Score history per (target × judge)
  • judge_actions — Ranked actions to move the score
  • target_progress — Roll-up across all judges for a target

Targets

  • target_search — Find targets by key or label
  • target_resolve — Get-or-create a target from kind+key
  • target_upsert — Create or update a target
  • target_merge — Merge duplicate targets

Websites

  • website_list — Crawled sites
  • website_create — Add a site and start crawling
  • website_suggest_judges — Auto-generate site-grounded rubrics
  • website_generate_personas — Auto-generate site-grounded personas
  • website_create_persona — One persona from a free-text brief
  • website_schedule_scoring — Cron-schedule (URL × judge) jobs

Projects & pipelines

  • project_create — Create a project
  • project_list — List your projects
  • pipeline_upsert — Define / update a pipeline by slug
  • pipeline_artifacts — Latest score per pipeline

Setup

  • setup_list_playbooks — Curated quick-start configs
  • setup_get_playbook — Detail of one playbook
HTTP

REST API

When you don't want a tool layer, hit the same endpoints that the dashboard hits. Bearer-token authenticated.

Authentication

Mint an API key on /connect in the dashboard. Pass it as Authorization: Bearer …. Keys are scoped per user and revocable.

Score one thing

POST /api/score
Authorization: Bearer jdg_xxxxxxxxxxxx
Content-Type: application/json

{
  "target": {
    "kind": "WEBSITE",
    "key": "https://acme.com/pricing",
    "label": "Pricing page"
  },
  "source": {
    "kind": "WEBSITE_HTML",
    "config": { "url": "https://acme.com/pricing" }
  },
  "judgeSlug": "buyer-persona"
}
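
The same request from Node or the browser — this one is exactly the documented shape:

const res = await fetch("https://judge.example.com/api/score", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.JUDGE_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    target: { kind: "WEBSITE", key: "https://acme.com/pricing", label: "Pricing page" },
    source: { kind: "WEBSITE_HTML", config: { url: "https://acme.com/pricing" } },
    judgeSlug: "buyer-persona",
  }),
});
const score = await res.json(); // overall, per-metric values, rationale, trend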

Other endpoints

  • GET /api/scores/[id] — one score with full metric breakdown and rationale
  • GET /api/jobs — scheduled jobs; POST /api/jobs/[id]/run to fire one manually
  • POST /api/websites — add a site to crawl
  • POST /api/websites/[id]/suggest — generate judge drafts of a kind (PERSONA / REVIEWER / RUBRIC); preview before saving
  • POST /api/websites/[id]/judges/save — persist chosen drafts
  • PATCH /api/judges/[id] / DELETE /api/judges/[id] — edit / delete user-owned judges
Every dashboard action is one of these endpoints. If something is possible in the UI but not documented here, look at the network tab — the URL is the API.
Git

GitHub integration

Pull diffs and files straight into Judge with one of two auth modes — the GitHub App for permanent install, or a personal access token for ad-hoc work.

The Judge GitHub App

Install on your org / repo from /connect. The app gets read access to PRs and repo contents, and Judge subscribes to pull_request and push webhooks. PRs get scored automatically against any pipelines you've linked.

Personal access tokens

Want to score one repo without an install? Add a fine-grained PAT to a Connection on /connections and reference it via --connection <id> in the CLI or connectionId in REST / MCP source configs.

Source kinds

  • GITHUB_PR — diff + base/head SHA. Config: { repo: "owner/name", prNumber: 314 }.
  • GITHUB_FILE — one file at a ref. Config: { repo: "owner/name", ref: "main", path: "src/auth.ts" }.
Models

BYOK & model choice

Bring your own LLM key for zero-margin pricing. The architecture isn't Claude-locked — Sonnet 4.6 is the default, swap if you prefer.

Bring your own key

Add an Anthropic API key on /byok. Stored encrypted with AES-256-GCM and decrypted only at request time. When present, all scoring runs use your key — the hosted plan steps out of the loop.
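
For the curious, the scheme named above looks like this in Node's crypto module — an illustration of AES-256-GCM at rest, not Judge's actual storage code:

import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// masterKey must be 32 bytes (256 bits), e.g. from a KMS or env secret.
function encryptApiKey(plaintext: string, masterKey: Buffer) {
  const iv = randomBytes(12); // 96-bit nonce, the GCM standard
  const cipher = createCipheriv("aes-256-gcm", masterKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() }; // persist all three
}

function decryptApiKey(row: { iv: Buffer; ciphertext: Buffer; tag: Buffer }, masterKey: Buffer) {
  const decipher = createDecipheriv("aes-256-gcm", masterKey, row.iv);
  decipher.setAuthTag(row.tag); // authenticates the ciphertext
  return Buffer.concat([decipher.update(row.ciphertext), decipher.final()]).toString("utf8");
}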

Why Sonnet 4.6 by default

Sonnet 4.6 is fast, cheap, and reliably tool-calls submit_score with a JSON-Schema-validated payload — which is what makes typed scores possible. You can override per-judge via modelId if you want to test against a different model.

Output is enforced

On every run, the LLM is forced to call the submit_score tool. No prose to parse, no malformed runs — only typed values that match the judge's metrics spec. That's why deltas, history, and dashboards all work mechanically.
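
Under the hood this is Anthropic's forced tool use. A sketch of the pattern — the actual submit_score schema is generated from each judge's metrics spec, so the fields shown here are illustrative:

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // uses ANTHROPIC_API_KEY — your BYOK key

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",      // the default; override per-judge via modelId
  max_tokens: 2048,
  tools: [{
    name: "submit_score",
    description: "Submit the typed score for this artifact.",
    input_schema: {                // generated from the judge's metrics spec
      type: "object",
      properties: {
        rationale: { type: "string" },  // illustrative field names
        metrics: { type: "object" },
      },
      required: ["rationale", "metrics"],
    },
  }],
  tool_choice: { type: "tool", name: "submit_score" }, // forces the call — no prose
  messages: [{ role: "user", content: "<rubric + artifact>" }],
});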