Evaluation · benchmarks
Empirical results across six reproducible benchmarks.
skylakegrep ships six reproducible benchmarks. Each answers a
different question, so quoting the right number for the right claim
matters. The repository's
parity-benchmarks.html
holds the full raw tables, methodology, and per-task data; this
page is the curated summary.
0a. Wrong-path quick-answer (NEW in 0.5.7)
The newest UX gate. Measures how fast the parallel proactive
umbrella streams a first answer when the user is in an
unrelated cwd and the answer lives in a sibling folder.
Setup: cd into an empty directory,
SKYGREP_PROACTIVE_DIRS=/tmp/oss-bench, query
"do I have any package configuration files". The
user runs the real skygrep CLI; we measure both wall clock to first
stream block and wall clock to script exit.
| Surface | Wall clock | Result |
|---|---|---|
| First stream block (proactive umbrella / cross-folder) | ~1.1 s | 5 cosine-ranked package.json / jest.config.js hits from React + Django repos in SKYGREP_PROACTIVE_DIRS |
| Script exit (full pipeline) | 1.091 s | cascade has nothing to do (cwd empty); umbrella subprocesses delivered directly |
| Same query in 0.5.4 (sequential pre-refactor) | ~12 m 50 s on a code-repo cwd that wasn't the right project | cascade rerank ran 99.7 s on σ-zero query before the proactive answer surfaced |
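A minimal sketch of how the two wall-clock numbers can be measured, assuming only that the skygrep CLI streams its answer to stdout; flags beyond the query itself are omitted, and the env var and empty working directory mirror the setup above:

# Illustrative timing harness, not the benchmark script.
import os, subprocess, time

env = dict(os.environ, SKYGREP_PROACTIVE_DIRS="/tmp/oss-bench")
query = "do I have any package configuration files"

start = time.monotonic()
proc = subprocess.Popen(["skygrep", query], stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL, env=env)
first_block = None
while True:
    chunk = proc.stdout.read(1)                    # drain stdout byte by byte until EOF
    if not chunk:
        break
    if first_block is None:
        first_block = time.monotonic() - start     # wall clock to first streamed output
proc.wait()
if first_block is not None:
    print(f"first stream block: {first_block:.3f} s")
print(f"script exit: {time.monotonic() - start:.3f} s")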
Concrete output captured in the v0.5.7 release notes
(full notes →): the
streaming markers 🔍 / ▾ / 🌐 / 🌊 / ⚡ arrive
with route + quality labels (filename_extend, ~100 ms-1 s;
pure filename glob, no semantic understanding /
cross-folder lazy, embed budget 5 seeds, σ-validated
cosine) so the user can judge the answer's quality
against the route that produced it. Cascade has a 30 s hard
timeout; cross-folder lazy has an 8 s hard timeout.
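The per-route hard timeouts can be pictured as a thin asyncio wrapper around the two routes; the coroutine names below are hypothetical stand-ins, and only the 30 s / 8 s budgets come from the text above:

import asyncio

ROUTE_TIMEOUTS = {"cascade": 30.0, "cross_folder_lazy": 8.0}   # budgets quoted above

async def run_cascade(query): ...                # placeholder for the real route
async def run_cross_folder_lazy(query): ...      # placeholder for the real route

async def guarded(name, coro):
    # Each route gets its own hard timeout; a slow route is dropped, never awaited forever.
    try:
        return name, await asyncio.wait_for(coro, ROUTE_TIMEOUTS[name])
    except asyncio.TimeoutError:
        return name, None

async def proactive_umbrella(query):
    tasks = [asyncio.create_task(guarded("cascade", run_cascade(query))),
             asyncio.create_task(guarded("cross_folder_lazy", run_cross_folder_lazy(query)))]
    for done in asyncio.as_completed(tasks):       # stream whichever route finishes first
        name, result = await done
        if result is not None:
            yield name, result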
0b. Cold-start lazy auto-trigger (NEW in 0.5.3)
Measures whether the
--lazy auto-trigger that fires on a never-indexed
project actually beats plain ripgrep cold-start on
vocabulary-mismatch queries. Run via
benchmarks/release-0.5.3-rg-vs-lazy.py
— 10 hand-labelled Django queries phrased as natural language
("where is the migration runner that applies pending schema
changes to the database"), every query starts from a freshly
cleaned SQLite DB, and the script invokes the real
skygrep search CLI (not the python API) so the
measured numbers are exactly what a user would see. Compares
--no-lazy (pure rg cold-start) vs default
(auto-trigger).
| Config | hit @ 5 | avg latency | note |
|---|---|---|---|
| --no-lazy | 0 / 10 | 4.85 s | pure ripgrep cold-start; vocabulary mismatch finds nothing |
| default (auto-trigger) | 4 / 10 | 20.76 s | LLM-routed dir picker + token-shortcut + import-diffusion |
| delta | +4 / 10 | +15.9 s / query | real, measurable +40-point hit-rate gain over rg cold-start |
Specific hits: Q1 → django/urls/resolvers.py;
Q3 → migration.py + executor.py;
Q4 → backends.py; Q7 → base.py.
The 0.5.3 release notes call out the misses (Q2, Q5, Q6, Q8, Q9, Q10
still don't hit because qwen 2.5:3b can't reliably pick the right
dir for some oracle phrasings — tracked as a 0.6 candidate for a
larger router model).
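A minimal sketch of the hit@5 scoring loop, assuming a hand-labelled query-to-path mapping and that the --json output exposes a path per result; the real script is benchmarks/release-0.5.3-rg-vs-lazy.py:

import json, subprocess, time

LABELLED = {
    # Illustrative entry; the real benchmark ships 10 labelled Django queries and oracle paths.
    "where is the migration runner that applies pending schema changes to the database":
        "django/db/migrations/executor.py",
}

def hit_at_5(extra_flags=()):
    hits, latencies = 0, []
    for query, expected in LABELLED.items():
        start = time.monotonic()
        out = subprocess.run(["skygrep", query, "--json", "--top", "5", *extra_flags],
                             capture_output=True, text=True, check=True).stdout
        latencies.append(time.monotonic() - start)
        paths = [r.get("path", "") for r in json.loads(out)]   # assumed result shape
        hits += any(expected in p for p in paths)
    return hits, sum(latencies) / len(latencies)

print(hit_at_5(("--no-lazy",)))   # pure rg cold-start baseline
print(hit_at_5())                 # default auto-trigger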
1. End-to-end Claude Code agent
Real Claude Code sub-agents answering hand-labelled code-search questions in two prompted conditions — rg-only (skygrep forbidden) vs skygrep-on (rg/grep/find forbidden) — across 21 questions and 1 multi-turn session in three repos.
| Bench | Tasks | rg-only tools | skygrep tools | Δ tools | Δ tokens |
|---|---|---|---|---|---|
| Multi-turn (3-turn Rust workspace session) | 1 × 3 | 38 | 7 | −82 % | −5 % |
| 6 medium tasks | 6 | 25 | 6 | −76 % | −8 % |
| 14 single-turn | 14 | 124 | 87 | −30 % | +12 % |
| 20-task single-turn aggregate | 20 | 149 | 93 | −37.6 % | +6.5 % |
| Strict-label correctness (20 tasks) | — | 12 / 20 | 14 / 20 | +2 tasks | — |
The cleanest, most consistent signal is tool-call reduction: −37.6 % single-turn, −82 % multi-turn. Each agent tool call costs an LLM round-trip + network RTT + serialization + context-window growth, so reducing them shortens the agent loop even when total tokens are equal. Token cost across the 20-task aggregate is roughly flat (+6.5 %); we do not claim skygrep saves the LLM bill.
Best-case task in the original aggregate: a vocabulary-mismatch
question (NL phrasing didn't match any code identifier) — skygrep
finished in 1 tool call vs rg-only's 25 (25× fewer), at one
eighth the wall time. Worst-case: a token-friendly question whose
vocabulary (auth / session / token) overlapped directly with
code-path tokens, so rg's straightforward scan was already
efficient — skygrep needed 40 % more tool calls.
The lexical pre-gate addresses this via a
conservative four-condition gate that detects exactly these
queries and short-circuits to rg internally in
~50 ms. Methodology and the full aggregate are published in
the release notes, along with the routing fix.
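A rough sketch of what such a conservative pre-gate can look like; the four conditions below are illustrative assumptions, not the gate actually shipped:

import re

STOPWORDS = {"where", "how", "does", "the", "a", "an", "is", "that", "should", "into"}

def short_circuit_to_rg(query, repo_tokens):
    # repo_tokens: set of identifier / path tokens already known for the repo
    terms = [t.lower() for t in re.findall(r"[A-Za-z_]\w+", query)]
    content = [t for t in terms if t not in STOPWORDS]
    if not content:
        return False
    return (len(content) <= 4                            # 1. short, keyword-style query
            and all(t in repo_tokens for t in content)   # 2. every term is a real code token
            and all(len(t) >= 4 for t in content)        # 3. no tiny, noisy terms
            and not query.strip().endswith("?"))         # 4. not phrased as an NL question

# An auth / session / token style query whose terms all overlap code tokens would pass
# all four conditions and go straight to rg; an NL question like django-001 would not.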
2. Public OSS recall (Django · React · Tokio)
30 hand-labelled questions across three popular open-source
codebases. Anyone can clone the repos and rerun
benchmarks/public_oss_bench.py
to reproduce every number below.
| Repo | Language | LOC ≈ | Tasks | skygrep recall | rg recall | Token reduction |
|---|---|---|---|---|---|---|
| django/django | Python | 524 K | 10 | 10 / 10 | 10 / 10 | 703 × |
| tokio-rs/tokio | Rust | 80 K | 10 | 10 / 10 | 10 / 10 | 61 × |
| facebook/react | JS+TS | 270 K | 10 | 10 / 10 | 10 / 10 | 773 × |
| Aggregate | 3 langs | ~ 870 K | 30 | 30 / 30 (100 %) | 30 / 30 | 60×–770× |
Honest framing: rg's 100 % is a recall-ceiling
baseline — it returns 20 M+ tokens of term-OR scan output per
query, so the answer is in there but the agent has to read the
whole haystack. skygrep returns the right file ranked
top-10 in 30 / 30 cases while emitting 60×–770× less context.
React reached 10 / 10 after the Option-C substrate upgrade
(bge-m3 embedder + content-agnostic
non-canonical-path filter); the original failure modes on
react-007 and react-010 and the
resolution are documented in
parity-benchmarks.html
as the engineering record, not erased.
3. Worked example — django-001 (one query, real numbers)
One of the 30 tasks aggregated into the public-OSS recall table above, run against the actual Django source tree (524 K LOC); reproduce locally to verify every number.
Query: "Where does Django turn an incoming URL into the view function that should handle it?"
Expected canonical: django/urls/resolvers.py (URLResolver.resolve())
Vocab mismatch: the query says "URL into view", the code identifier is resolve — the failure mode that grep-as-search collapses on.
Side A — rg term-OR scan
The rg-agent extracts up to 8 terms (TF-IDF-ish stopword filter),
runs rg -i -F --max-count=20 -C2 per term,
concatenates output. For this query the extractor produced:
['function', 'incoming', 'incom', 'django', 'handle', 'should', 'into', 'that']
The high-signal words URL, view,
resolve did not survive the extractor — they were
either stopword-filtered or pushed past the 8-term cap. That is
the vocab-mismatch failure: the query's actual intent is
URLResolver.resolve(), but rg searches
for function, django, that
instead.
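A rough reconstruction of the kind of extractor described above; the stopword list, length filter, and ranking are assumptions, and the point is only to show the mechanism: a term like URL can be silently dropped before rg ever runs, and everything else competes for at most 8 slots:

import re
from collections import Counter

MAX_TERMS = 8
STOPWORDS = {"where", "does", "turn", "an", "the", "it", "to", "a", "is"}

def extract_terms(query, corpus_doc_freq=None, max_terms=MAX_TERMS):
    freq = corpus_doc_freq or Counter()
    tokens = [t for t in re.findall(r"[a-z]+", query.lower()) if len(t) > 3]   # drops "url"
    kept = [t for t in dict.fromkeys(tokens) if t not in STOPWORDS]            # de-dupe, keep order
    # TF-IDF-ish: prefer terms that are rarer in the corpus, then truncate at the cap;
    # anything ranked past the cap never reaches rg at all.
    return sorted(kept, key=lambda t: freq.get(t, 0))[:max_terms]

# Which high-signal words survive depends entirely on the stopword list, the length
# filter, and the 8-term cap; that dependence is exactly the vocab-mismatch failure mode.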
Real measured output volumes per rg invocation:
| Term | rg output (tokens) |
|---|---|
| django | 1,438,245 |
| that | 841,092 |
| function | 452,938 |
| should | 423,617 |
| handle | 238,008 |
| into | 175,039 |
| incom | 63,593 |
| incoming | 4,024 |
django alone is 1.4 million tokens
because the term matches in basically every file of the Django
source tree. that is 840 K tokens
for the same reason — high-frequency words that the stopword
filter let through.
Side B — skygrep --top 10 --json
$ skygrep "Where does Django turn an incoming URL into the view function that should handle it?" \
--json --top 10
$ wc -c # 10,430 chars ≈ 2,607 tokens
Top files returned include django/urls/resolvers.py
and related canonical implementation files.
Reduction — visual comparison
| Side | Output tokens |
|---|---|
| rg term-OR (8 terms, concatenated) | 3,636,556 |
| skygrep --top 10 | 2,607 |
The 1,395 × ratio here is higher than the 60 ×–770 × headline range
because vocab-mismatch queries are the worst case for rg
(stopwords flood the output) and the best case for skygrep
(the embedder bridges "URL into view" → resolve()).
Why the headline is 60 × – 770 ×, not a single number
rg output scales with
(repo LOC) × (term frequency of the high-signal terms);
skygrep output is roughly constant at a given top-k (top-10 ≈ 10 KB):
| Repo | LOC | Per-query rg tokens (avg) | Per-query skygrep tokens | Ratio |
|---|---|---|---|---|
| Tokio (Rust) | 80 K | ~190 K | ~3.1 K | 61 × |
| Django (Python) | 524 K | ~2.06 M | ~2.9 K | 703 × |
| React (JS+TS) | 270 K | ~2.28 M | ~2.9 K | 773 × |
Tokio is the floor (small repo, focused vocabulary). Django and
React both blow up because django / react
saturate term-OR scans across mid-sized monorepos.
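The per-repo ratios in the table are just the per-query averages divided; recomputing them from the rounded values above:

rows = {"tokio": (190_000, 3_100), "django": (2_060_000, 2_900), "react": (2_280_000, 2_900)}
for repo, (rg_tokens, sky_tokens) in rows.items():
    print(f"{repo}: {rg_tokens / sky_tokens:.0f}x")
# tokio: 61x, django: 710x, react: 786x; these drift slightly from the 61 × / 703 × / 773 ×
# column only because the per-query averages quoted here are themselves rounded.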
Reproduce yourself (3 commands)
# rg side — counts bytes from term-OR scan
cd /tmp/oss-bench/django
for term in function incoming incom django handle should into that; do
rg -i -F --max-count 20 -C 2 "$term" .
done | wc -c
# skygrep side — counts bytes from top-10 JSON
skygrep "Where does Django turn an incoming URL into the view function that should handle it?" \
--json --top 10 | wc -c
# divide chars by 4 to approximate tokens
Numbers will land within ± 5 % of the ones above (variance from
your Django clone's commit and rg minor-version output formatting).
Per-task analysis for all 30 queries lives in
parity-benchmarks.html.
4. Agent tool-context depth benchmark (0.5.13)
0.5.13 adds a benchmark for the context that coding agents actually
consume between reasoning steps. It does not call remote Claude, GPT, or
any cloud model. Instead it compares two deterministic local policies:
one structured skygrep --json --content call per task versus
a raw rg agent that runs several term searches and then reads
line-window context.
The task set covers eight generic repository-maintenance questions across locate, snippet, deep, and abstract levels, then runs low / medium / high effort profiles. The scoring model is intentionally simple: sufficiency = 60 % expected-path coverage + 40 % evidence-term coverage. Path precision and sufficiency density measure how much irrelevant context the next LLM turn has to filter.
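A sketch of the stated scoring model; the 60 / 40 weighting and the per-1k-token density come from the text and table above, while the argument names and shapes are illustrative (the real implementation is benchmarks/agent_tool_depth_benchmark.py):

def score_task(expected_paths, expected_terms, returned_paths, returned_text, context_tokens):
    path_cov = sum(p in returned_paths for p in expected_paths) / len(expected_paths)
    term_cov = sum(t in returned_text for t in expected_terms) / len(expected_terms)
    sufficiency = 0.60 * path_cov + 0.40 * term_cov          # the stated 60 / 40 weighting
    # Precision and density capture how much irrelevant context the next LLM turn must filter.
    precision = (sum(p in expected_paths for p in returned_paths) / len(returned_paths)
                 if returned_paths else 0.0)
    density = sufficiency / (context_tokens / 1000) if context_tokens else 0.0
    return sufficiency, precision, density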
| Metric | skygrep-agent | raw rg-agent | Reading |
|---|---|---|---|
| Tasks × effort profiles | 24 | 24 | 8 generic tasks × 3 effort profiles |
| Path coverage | 81.9 % | 100.0 % | rg remains the recall ceiling |
| Path precision | 34.9 % | 12.2 % | skygrep returns less irrelevant path noise |
| Evidence coverage | 79.2 % | 92.7 % | support packs close much of the evidence gap |
| Sufficiency score | 80.8 % | 97.1 % | weighted path + evidence score |
| Tool calls | 24 | 147 | 6.12× fewer calls for skygrep |
| Context tokens | 56,424 | 2,129,655 | 37.74× less context for skygrep |
| Sufficiency per 1k tokens | 0.344 | 0.011 | 31.27× denser context for skygrep |
Honest framing: raw rg can still be faster at producing a
large unranked dump, and it remains the ceiling when an agent can afford
to inspect everything. The 0.5.13 win is agent-context efficiency:
compact ranked evidence with far fewer tool calls and far fewer tokens.
Reproduce with
benchmarks/agent_tool_depth_benchmark.py --summary-only.
5. skylakegrep self-test (regression guard)
Deterministic local benchmark over 30 repository-navigation tasks
against this very repo. Compares a single skygrep search
call against a simulated grep-agent. Token volumes are estimated
as chars / 4. Every release is verified to keep
30 / 30 at top-k 10.
Run via the benchmarks/agent_context_benchmark.py --top-k N command.
| top-k | recall | total-token reduction | context-token reduction |
|---|---|---|---|
| 5 | 28 / 30 | 2.66× | 5.53× |
| 10 | 30 / 30 | 2.00× | 2.90× |
| 20 | 30 / 30 | 1.36× | 1.53× |
| 50 | 30 / 30 | 0.67× | 0.60× |
Vs real ripgrep (not the simulated grep-agent), this same task set shows ~17.7× total-token reduction at equal recall. See the benchmark protocol for definitions and limitations.
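The token estimate and reduction ratio behind the table are deliberately simple; a sketch with illustrative function names:

def estimate_tokens(text):
    return len(text) // 4                    # the chars / 4 estimate used by the self-test

def token_reduction(grep_agent_output, skygrep_output):
    # e.g. at top-k 10 the self-test reports a 2.00× total-token reduction
    return estimate_tokens(grep_agent_output) / max(estimate_tokens(skygrep_output), 1)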
Which number to cite for which claim
- "skygrep cuts agent tool calls" → benchmark 1: −37.6 % single-turn, −82 % multi-turn (real Claude Code sub-agents, 20 + 1 hand-labelled tasks).
- "skygrep matches rg on hit-rate while emitting 60×–770× less context" → benchmark 2: 30 / 30 (100 %) across Django + React + Tokio public OSS, vs rg's 30 / 30 baseline (which dumps 20 M+ tokens per query for the agent to filter).
- "this is what one query actually looks like" → benchmark 3: the django-001 worked example with real measured volumes, rg 3.6 M tokens vs skygrep 2.6 K tokens (≈ 1,395 ×) on a single vocab-mismatch query.
- "skygrep is more token-efficient than ripgrep when feeding LLM context" → benchmark 5: ~17.7× total-token reduction at equal recall on the 30-task self-test.
- "skygrep gives agents denser context with fewer tool calls" → benchmark 4: 6.12× fewer tool calls, 37.74× less context, and 31.27× higher sufficiency density on the 0.5.13 agent tool-context benchmark.
Don't combine these into a single number; they answer different
questions. The honest framing is in
parity-benchmarks.html's
"Strongest claims" section.