ADR 0007: Query language comparison and full queryability¶

Status¶

Accepted.

Context¶

agentgrep exposes a Lucene-inspired query language: field predicates (agent:codex), boolean composition (AND / OR / NOT, + / -), grouping, date comparisons (timestamp:>2026-01-01), and ranges (timestamp:[a TO b]). ADR 0006 makes this language a public, discoverable surface. This ADR records what that language is, what it deliberately is not, and how it is completed.

The language is frequently mistaken for a full-text engine. It is not. agentgrep matches by substring containment (casefolded), regex (the grep verb), and rapidfuzz relevance ranking (the search verb). There is no inverted index, no tokenizer, no postings list, and no BM25 scoring. The semantic source of truth is the pure-Python compiler in agentgrep.query.compile.

For comparison we studied Tantivy, a Rust full-text search engine, and its Python bindings:

Tantivy 0.26.1 (QueryParser and the Query trait implementations).
tantivy-py 0.26.0 (the Query.* constructors and Index.parse_query exposed to Python).

Tantivy tokenizes text into an inverted index and scores matches with BM25. Its QueryParser accepts terms (default OR), phrases with slop and prefix, set membership, boost, configurable fuzzy fields, optional regex, field-exists, and typed ranges over dates and IP addresses — each lowering to an indexed Query. agentgrep cannot adopt those semantics without adopting an index, which ADR 0003 would classify as a native engine or worker decision, not a parser change.

Mapping the agentgrep query language against Tantivy’s QueryParser:

Capability	Tantivy `QueryParser`	agentgrep
Bare terms	Yes (default OR)	Yes (default AND)
`field:value`	Yes	Yes
`AND` / `OR` / `NOT`, `+` / `-`	Yes	Yes
Grouping `( )`	Yes	Yes
Ranges `[a TO b]` / `{a TO b}`	Yes	Yes (date fields)
Comparison `< > <= >=`	Yes	Yes (date fields)
Phrase `"..."`	Yes	Added by this ADR (substring)
Field-exists `field:*`	Yes	Added by this ADR
Wildcards `*` / `?`	Partial	Added by this ADR (field values)
Phrase slop / prefix `"..."~N` / `"..."*`	Yes	No (non-goal)
Set `IN [...]`	Yes	No (non-goal)
Boost `^N`	Yes	No (non-goal)
Regex field predicates	Yes (opt-in)	No (`grep` is regex-native)
Dismax / more-like-this / const-score	API	No (non-goal)
Term matching model	Tokenized + BM25	Substring / regex / rapidfuzz

Two gaps motivated this ADR. First, the overlapping features are usable but incomplete: phrases, field-exists, and wildcards have no expression even though the substring model can support them. Second, and more damaging, boolean composition only engaged when a positional contained a colon. A query such as search "ruff OR uv" was split into the literal terms ruff, OR, and uv and AND-matched, returning zero results instead of a union. The operators existed in the grammar but were unreachable from the most natural input.

Decision¶

agentgrep completes the subset of query features that its substring/regex matching model can express, and makes boolean composition reachable from bare input. It does not pursue Tantivy parity.

Boolean composition without a field predicate¶

Boolean operators, grouping, and quoted phrases engage the parser whether or not a field predicate is present. A cheap, dependency-free heuristic decides whether input carries query syntax (a field colon, an uppercase boolean keyword, a parenthesis, or a quote) so that plain bare-term queries keep the legacy fast path and the cold-start budget in ADR 0006.

Phrase queries¶

A quoted value ("deploy v1") is a phrase: its internal whitespace is collapsed and it matches as a single casefolded substring of the record text. A phrase is a text term for every downstream purpose — prefilter, relevance ranking, and matching — so it carries no scoring semantics beyond ordered adjacency within the substring.

Field-exists¶

field:* matches records or sources where the field has a non-empty value. An empty string counts as absent. Negation uses the existing NOT / - forms (-model:*).

Wildcards on text and string fields¶

A field value containing * or ? matches by anchored, casefolded glob (fnmatch) rather than substring. Wildcards apply to field values only, never to bare terms, and never to enum or date fields. The path field keeps its existing case-sensitive glob.

Scope¶

This ADR governs the agentgrep query language as compiled by agentgrep.query. It applies to every surface that consumes a compiled query: the search, grep, and find CLI verbs, the Textual UI search box, and the MCP search and validation tools. It does not change the execution engine (ADR 0004) or introduce native code (ADR 0003).

Requirements¶

Matching semantics¶

Bare terms match as casefolded substrings and combine with implicit AND.
Phrases match as a single casefolded substring with collapsed internal whitespace.
Field-exists is true when the field value is non-empty; for source-prunable fields it resolves at the source layer, for record fields it resolves after parsing.
Wildcard field values match by anchored casefolded glob; users who want substring semantics write *value*.

Discoverability¶

The query language must be discoverable from CLI help examples, the MCP server instructions, MCP tool and parameter descriptions, and a machine-readable MCP resource, consistent with the registry-backed discovery direction in ADR 0006.
Field and operator descriptions derive from the field registry so the surfaces cannot drift from the compiler.

Tests¶

Parser, compiler, and engine behavior are covered by fast, pure tests with no subprocess or home-directory access, following the existing parametrized case-table pattern.
A guard test proves that plain bare-term queries never import the query module, protecting cold start.

Consequences¶

Positive¶

Boolean queries, phrases, field-exists, and wildcards are reachable from the input users actually type.
One registry feeds the compiler, CLI help, and MCP hints, so discovery and behavior stay aligned.
The substring model stays the single semantic source of truth; no index or native dependency is introduced.

Behavior changes¶

Input that previously searched for the literal words OR, AND, or NOT, or for literal parentheses or quotes, now parses as a query. For example search OR raises a parse error rather than searching for the substring OR. The escape valves are lowercase keywords (or, and, not remain literal terms) and quoting ("OR" searches for the literal text). Users of a published release experienced the old colon-gated behavior, so this is a documented behavior change rather than branch-internal narrative.

Tradeoffs and asymmetries¶

Wildcards apply to field values only, never to bare terms. A bare c*x stays a literal substring.
The path field keeps case-sensitive glob (filesystem semantics) while text and string field wildcards are casefolded (text semantics). This asymmetry is intentional.
field:* is field-exists, but field:** is a wildcard. The distinction is the exact value *.

Risks¶

Heuristic over-engagement: a positional that looks like a field predicate engages the parser. The heuristic matches only registry-shaped identifiers, and unknown fields raise a clean parse error rather than searching silently.
Generated-description drift: registry-backed help can still drift if generation is partial. Drift-guard tests compare the rendered field list against the registry.

Relationship to other ADRs¶

ADR 0003 owns the native boundary: adopting an index or BM25 scoring would be an engine or worker decision under that policy, which this ADR explicitly declines. ADR 0004 owns planning, execution, and result payloads, which are unchanged. ADR 0006 owns the public CLI and MCP surface and calls for registry-backed query discovery; this ADR supplies the query-language content those surfaces expose.

Final position¶

agentgrep’s query language is a deliberately bounded, substring-based subset of Lucene shapes, not a Tantivy-class index. This ADR completes the reachable parts of that subset, makes boolean composition work from bare input, and keeps the registry as the one source of both behavior and discovery.