Benchmark harness¶
scripts/benchmark.py is a cross-commit benchmark runner that walks
hyperfine across HEAD, trunk, a range, the last N commits, an explicit
list of tags / SHAs, or just HEAD vs. trunk. It surfaces uniform
cold-start regressions that are easy to miss by eyeballing — promoting
the one-off shell-script prototype to a first-class repo tool turns
“where did grep slow down?” into a one-liner.
The harness is a development-only tool — it isn’t shipped with the
agentgrep wheel and isn’t installed by pip install agentgrep. It
lives in scripts/ alongside scripts/mcp_swap.py and runs straight
from a repo checkout via uv run.
The script is a PEP 723 self-contained file — its third-party deps
(typer / rich / pydantic) resolve through uv run’s transient venv,
so the harness is portable across any uv-managed project. Drop in your
own scripts/benchmark.toml and it works.
Invocation recipes¶
Every subcommand is reachable via uv run scripts/benchmark.py:
$ uv run scripts/benchmark.py run --target HEAD
Single-commit bench against the configured [bench.*] entries.
$ uv run scripts/benchmark.py run --target trunk
Single-commit bench against the trunk ref (default master, override
in [settings].trunk).
$ uv run scripts/benchmark.py run --head-vs-trunk
HEAD versus trunk — the most common “did my branch regress?” shape.
$ uv run scripts/benchmark.py run --range master..HEAD
Walk every commit on the current branch since it diverged from trunk.
$ uv run scripts/benchmark.py run --lookback 10
The last 10 commits from HEAD, oldest first.
$ uv run scripts/benchmark.py run --from-trunk-back 5
Trunk and the five commits before it — useful for establishing a baseline that pre-dates a refactor.
$ uv run scripts/benchmark.py run --tags
Every git tag, sorted by --sort=v:refname.
$ uv run scripts/benchmark.py run --commits abc1234,def5678
Explicit list of refs (SHAs, tags, or branches).
$ uv run scripts/benchmark.py compare HEAD master
Sugar for run --commits HEAD,master.
Discovery subcommands¶
$ uv run scripts/benchmark.py list-commits --lookback 10
Print the commits a selector would resolve to, without running any
benchmarks — handy for sanity-checking --range and --lookback
arguments before committing to a long run.
$ uv run scripts/benchmark.py list-commands
Print every configured [bench.X] entry after applying the TOML
layers — useful when iterating on benchmark.local.toml.
$ uv run scripts/benchmark.py show-config
Dump the post-layering config as JSON, the closest thing to a “what will actually run” inspection.
Output formats¶
--format selects the renderer; the default is a rich terminal table:
Format |
Use case |
|---|---|
|
Interactive terminal viewing (default) |
|
Single JSON document with raw samples preserved |
|
One JSON object per line — pipe-friendly for |
|
Markdown tables (mirrors the prototype |
|
Flat row-per-measurement; raw samples joined by |
$ uv run scripts/benchmark.py run --lookback 50 --format md --output performance.md
--show-percentiles picks the subset of stat labels rendered in the
visible cells:
$ uv run scripts/benchmark.py run --target HEAD --show-percentiles min,avg,p90,max
Accepted labels: min, max, avg, and any p<N> (e.g. p50,
p90, p95, p99).
Config layering¶
Four layers compose into the effective config, in this precedence order (each layer overlays the previous):
Built-in pydantic defaults —
runs = 3,warmup = 1,trunk = "master", etc.scripts/benchmark.toml— the committed defaults shipped with the repo. Defines[bench.grep],[bench.find],[bench.search],[bench.import-time]for agentgrep againstlibtmux.scripts/benchmark.local.toml— per-machine overrides (gitignored). Copyscripts/benchmark.local.toml.exampleto start.CLI flags —
--runs N,--warmup N,--query STR,--commands grep,findalways trump the TOML.
Deep-merge semantics: only the keys you set in a higher layer are
replaced. So adding [bench.fuzzy] in benchmark.local.toml extends
the bench set without disturbing the existing entries.
Templating¶
Command strings in [bench.X].command support these placeholders:
Token |
Value |
|---|---|
|
|
|
per-bench |
|
full commit SHA being benchmarked |
|
first seven chars |
|
repo root absolute path |
Unknown placeholders raise — the harness fails loud rather than emit a broken command.
Safety¶
Dirty-tree refusal. The script aborts on uncommitted changes unless
--allow-dirtyis passed. Per-commit benchmarks shred the worktree viagit checkout; running them on top of uncommitted work would silently roll your edits forward into the wrong commit.HEAD restore on exit. An
atexithook plus SIGINT / SIGTERM trap restores the original ref (branch name or HEAD SHA) on every exit path. Pass--keep-checkoutto disable the trap when you want to iterate on a checked-out commit after the run.Per-commit isolation. Each iteration runs
git checkout+ (by default)uv syncto rebuild.venvagainst the current source.--no-syncskips the sync if the wheel happens to be ABI-compatible across the range you’re testing.
When to skip the harness¶
The harness is a developer tool, not a CI gate. It expects:
A clean worktree (or
--allow-dirty).A repo where every commit in the range builds — failed
uv syncmarks the rowsync_failand moves on, but a wide swath of unbuildable history makes the output less useful.Hyperfine on PATH for the default timing path. The pure-Python fallback (
--no-hyperfine, or simply not having hyperfine installed) is portable but produces noisier samples.