Benchmark harness

scripts/benchmark.py is a cross-commit benchmark runner that walks hyperfine across HEAD, trunk, a range, the last N commits, an explicit list of tags / SHAs, or just HEAD vs. trunk. It surfaces uniform cold-start regressions that are easy to miss by eyeballing — promoting the one-off shell-script prototype to a first-class repo tool turns “where did grep slow down?” into a one-liner.

The harness is a development-only tool — it isn’t shipped with the agentgrep wheel and isn’t installed by pip install agentgrep. It lives in scripts/ alongside scripts/mcp_swap.py and runs straight from a repo checkout via uv run.

The script is a PEP 723 self-contained file — its third-party deps (typer / rich / pydantic) resolve through uv run’s transient venv, so the harness is portable across any uv-managed project. Drop in your own scripts/benchmark.toml and it works.

Invocation recipes

Every subcommand is reachable via uv run scripts/benchmark.py:

$ uv run scripts/benchmark.py run --target HEAD

Single-commit bench against the configured [bench.*] entries.

$ uv run scripts/benchmark.py run --target trunk

Single-commit bench against the trunk ref (default master, override in [settings].trunk).

$ uv run scripts/benchmark.py run --head-vs-trunk

HEAD versus trunk — the most common “did my branch regress?” shape.

$ uv run scripts/benchmark.py run --range master..HEAD

Walk every commit on the current branch since it diverged from trunk.

$ uv run scripts/benchmark.py run --lookback 10

The last 10 commits from HEAD, oldest first.

$ uv run scripts/benchmark.py run --from-trunk-back 5

Trunk and the five commits before it — useful for establishing a baseline that pre-dates a refactor.

$ uv run scripts/benchmark.py run --tags

Every git tag, sorted by --sort=v:refname.

$ uv run scripts/benchmark.py run --commits abc1234,def5678

Explicit list of refs (SHAs, tags, or branches).

$ uv run scripts/benchmark.py compare HEAD master

Sugar for run --commits HEAD,master.

Discovery subcommands

$ uv run scripts/benchmark.py list-commits --lookback 10

Print the commits a selector would resolve to, without running any benchmarks — handy for sanity-checking --range and --lookback arguments before committing to a long run.

$ uv run scripts/benchmark.py list-commands

Print every configured [bench.X] entry after applying the TOML layers — useful when iterating on benchmark.local.toml.

$ uv run scripts/benchmark.py show-config

Dump the post-layering config as JSON, the closest thing to a “what will actually run” inspection.

Output formats

--format selects the renderer; the default is a rich terminal table:

Format

Use case

rich

Interactive terminal viewing (default)

json

Single JSON document with raw samples preserved

ndjson

One JSON object per line — pipe-friendly for jq

md

Markdown tables (mirrors the prototype performance.md)

csv

Flat row-per-measurement; raw samples joined by ;

$ uv run scripts/benchmark.py run --lookback 50 --format md --output performance.md

--show-percentiles picks the subset of stat labels rendered in the visible cells:

$ uv run scripts/benchmark.py run --target HEAD --show-percentiles min,avg,p90,max

Accepted labels: min, max, avg, and any p<N> (e.g. p50, p90, p95, p99).

Config layering

Four layers compose into the effective config, in this precedence order (each layer overlays the previous):

  1. Built-in pydantic defaultsruns = 3, warmup = 1, trunk = "master", etc.

  2. scripts/benchmark.toml — the committed defaults shipped with the repo. Defines [bench.grep], [bench.find], [bench.search], [bench.import-time] for agentgrep against libtmux.

  3. scripts/benchmark.local.toml — per-machine overrides (gitignored). Copy scripts/benchmark.local.toml.example to start.

  4. CLI flags--runs N, --warmup N, --query STR, --commands grep,find always trump the TOML.

Deep-merge semantics: only the keys you set in a higher layer are replaced. So adding [bench.fuzzy] in benchmark.local.toml extends the bench set without disturbing the existing entries.

Templating

Command strings in [bench.X].command support these placeholders:

Token

Value

{venv}

[settings].venv (default .venv)

{query}

per-bench default_query, or --query STR

{sha}

full commit SHA being benchmarked

{short_sha}

first seven chars

{repo}

repo root absolute path

Unknown placeholders raise — the harness fails loud rather than emit a broken command.

Safety

  • Dirty-tree refusal. The script aborts on uncommitted changes unless --allow-dirty is passed. Per-commit benchmarks shred the worktree via git checkout; running them on top of uncommitted work would silently roll your edits forward into the wrong commit.

  • HEAD restore on exit. An atexit hook plus SIGINT / SIGTERM trap restores the original ref (branch name or HEAD SHA) on every exit path. Pass --keep-checkout to disable the trap when you want to iterate on a checked-out commit after the run.

  • Per-commit isolation. Each iteration runs git checkout + (by default) uv sync to rebuild .venv against the current source. --no-sync skips the sync if the wheel happens to be ABI-compatible across the range you’re testing.

When to skip the harness

The harness is a developer tool, not a CI gate. It expects:

  • A clean worktree (or --allow-dirty).

  • A repo where every commit in the range builds — failed uv sync marks the row sync_fail and moves on, but a wide swath of unbuildable history makes the output less useful.

  • Hyperfine on PATH for the default timing path. The pure-Python fallback (--no-hyperfine, or simply not having hyperfine installed) is portable but produces noisier samples.