Evaluation · benchmarks
Empirical results across six reproducible benchmarks.
skylakegrep ships six reproducible benchmarks. Each answers a
different question, so quoting the right number for the right claim
matters. The repository's
parity-benchmarks.html
holds the full raw tables, methodology, and per-task data; this
page is the curated summary.
0a. Wrong-path quick-answer (NEW in 0.5.7)
The newest UX gate. Measures how fast the parallel proactive
umbrella streams a first answer when the user is in an
unrelated cwd and the answer lives in a sibling folder.
Setup: cd into an empty directory,
SKYGREP_PROACTIVE_DIRS=/tmp/oss-bench, query
"do I have any package configuration files". The
user runs the real skygrep CLI; we measure both wall clock to first
stream block and wall clock to script exit.
| Surface | Wall clock | Result |
|---|---|---|
| First stream block (proactive umbrella / cross-folder) | ~1.1 s | 5 cosine-ranked package.json / jest.config.js hits from React + Django repos in SKYGREP_PROACTIVE_DIRS |
| Script exit (full pipeline) | 1.091 s | cascade has nothing to do (cwd empty); umbrella subprocesses delivered directly |
| Same query in 0.5.4 (sequential pre-refactor) | ~12 m 50 s on a code-repo cwd that wasn't the right project | cascade rerank ran 99.7 s on σ-zero query before the proactive answer surfaced |
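A minimal sketch of how the two wall-clock numbers can be measured, assuming only that the skygrep CLI streams its answer to stdout; flags beyond the query itself are omitted, and the env var and empty working directory mirror the setup above:

# Illustrative timing harness, not the benchmark script.
import os, subprocess, time

env = dict(os.environ, SKYGREP_PROACTIVE_DIRS="/tmp/oss-bench")
query = "do I have any package configuration files"

start = time.monotonic()
proc = subprocess.Popen(["skygrep", query], stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL, env=env)
first_block = None
while True:
    chunk = proc.stdout.read(1)                    # drain stdout byte by byte until EOF
    if not chunk:
        break
    if first_block is None:
        first_block = time.monotonic() - start     # wall clock to first streamed output
proc.wait()
if first_block is not None:
    print(f"first stream block: {first_block:.3f} s")
print(f"script exit: {time.monotonic() - start:.3f} s")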
Concrete output captured in the v0.5.7 release notes
(full notes →): the
streaming markers 🔍 / ▾ / 🌐 / 🌊 / ⚡ arrive
with route + quality labels (filename_extend, ~100 ms-1 s;
pure filename glob, no semantic understanding /
cross-folder lazy, embed budget 5 seeds, σ-validated
cosine) so the user can judge the answer's quality
against the route that produced it. Cascade has a 30 s hard
timeout; cross-folder lazy has an 8 s hard timeout.
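The per-route hard timeouts can be pictured as a thin asyncio wrapper around the two routes; the coroutine names below are hypothetical stand-ins, and only the 30 s / 8 s budgets come from the text above:

import asyncio

ROUTE_TIMEOUTS = {"cascade": 30.0, "cross_folder_lazy": 8.0}   # budgets quoted above

async def run_cascade(query): ...                # placeholder for the real route
async def run_cross_folder_lazy(query): ...      # placeholder for the real route

async def guarded(name, coro):
    # Each route gets its own hard timeout; a slow route is dropped, never awaited forever.
    try:
        return name, await asyncio.wait_for(coro, ROUTE_TIMEOUTS[name])
    except asyncio.TimeoutError:
        return name, None

async def proactive_umbrella(query):
    tasks = [asyncio.create_task(guarded("cascade", run_cascade(query))),
             asyncio.create_task(guarded("cross_folder_lazy", run_cross_folder_lazy(query)))]
    for done in asyncio.as_completed(tasks):       # stream whichever route finishes first
        name, result = await done
        if result is not None:
            yield name, result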
0b. Cold-start lazy auto-trigger (NEW in 0.5.3)
Measures whether the
--lazy auto-trigger that fires on a never-indexed
project actually beats plain ripgrep cold-start on
vocabulary-mismatch queries. Run via
benchmarks/release-0.5.3-rg-vs-lazy.py
— 10 hand-labelled Django queries phrased as natural language
("where is the migration runner that applies pending schema
changes to the database"), every query starts from a freshly
cleaned SQLite DB, and the script invokes the real
skygrep search CLI (not the python API) so the
measured numbers are exactly what a user would see. Compares
--no-lazy (pure rg cold-start) vs default
(auto-trigger).
| Config | hit @ 5 | avg latency | note |
|---|---|---|---|
| --no-lazy | 0 / 10 | 4.85 s | pure ripgrep cold-start; vocabulary mismatch finds nothing |
| default (auto-trigger) | 4 / 10 | 20.76 s | LLM-routed dir picker + token-shortcut + import-diffusion |
| delta | +4 / 10 | +15.9 s / query | real, measurable +40-point hit-rate gain over rg cold-start |
Specific hits: Q1 → django/urls/resolvers.py;
Q3 → migration.py + executor.py;
Q4 → backends.py; Q7 → base.py.
The 0.5.3 release notes call out the misses (Q2, Q5, Q6, Q8, Q9, Q10
still don't hit because qwen 2.5:3b can't reliably pick the right
dir for some oracle phrasings — tracked as a 0.6 candidate for a
larger router model).
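A minimal sketch of the hit@5 scoring loop, assuming a hand-labelled query-to-path mapping and that the --json output exposes a path per result; the real script is benchmarks/release-0.5.3-rg-vs-lazy.py:

import json, subprocess, time

LABELLED = {
    # Illustrative entry; the real benchmark ships 10 labelled Django queries and oracle paths.
    "where is the migration runner that applies pending schema changes to the database":
        "django/db/migrations/executor.py",
}

def hit_at_5(extra_flags=()):
    hits, latencies = 0, []
    for query, expected in LABELLED.items():
        start = time.monotonic()
        out = subprocess.run(["skygrep", query, "--json", "--top", "5", *extra_flags],
                             capture_output=True, text=True, check=True).stdout
        latencies.append(time.monotonic() - start)
        paths = [r.get("path", "") for r in json.loads(out)]   # assumed result shape
        hits += any(expected in p for p in paths)
    return hits, sum(latencies) / len(latencies)

print(hit_at_5(("--no-lazy",)))   # pure rg cold-start baseline
print(hit_at_5())                 # default auto-trigger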
1. End-to-end Claude Code agent
Real Claude Code sub-agents answering hand-labelled code-search questions in two prompted conditions — rg-only (skygrep forbidden) vs skygrep-on (rg/grep/find forbidden) — across 21 questions and 1 multi-turn session in three repos.
| Bench | Tasks | rg-only tools | skygrep tools | Δ tools | Δ tokens |
|---|---|---|---|---|---|
| Multi-turn (3-turn Rust workspace session) | 1 × 3 | 38 | 7 | −82 % | −5 % |
| 6 medium tasks | 6 | 25 | 6 | −76 % | −8 % |
| 14 single-turn | 14 | 124 | 87 | −30 % | +12 % |
| 20-task single-turn aggregate | 20 | 149 | 93 | −37.6 % | +6.5 % |
| Strict-label correctness (20 tasks) | — | 12 / 20 | 14 / 20 | +2 tasks | — |
The cleanest, most consistent signal is tool-call reduction: −37.6 % single-turn, −82 % multi-turn. Each agent tool call costs an LLM round-trip + network RTT + serialization + context-window growth, so reducing them shortens the agent loop even when total tokens are equal. Token cost across the 20-task aggregate is roughly flat (+6.5 %); we do not claim skygrep saves the LLM bill.
Best-case task in the original aggregate: a vocabulary-mismatch
question (NL phrasing didn't match any code identifier) — skygrep
finished in 1 tool call vs rg-only's 25 (25× fewer), at one
eighth the wall time. Worst-case: a token-friendly question whose
vocabulary (auth / session / token) overlapped directly with
code-path tokens, so rg's straightforward scan was already
efficient — skygrep needed 40 % more tool calls.
The lexical pre-gate addresses this via a
conservative four-condition gate that detects exactly these
queries and short-circuits to rg internally in
~50 ms. Methodology and the full aggregate are published in
the release notes, along with the routing fix.
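A rough sketch of what such a conservative pre-gate can look like; the four conditions below are illustrative assumptions, not the gate actually shipped:

import re

STOPWORDS = {"where", "how", "does", "the", "a", "an", "is", "that", "should", "into"}

def short_circuit_to_rg(query, repo_tokens):
    # repo_tokens: set of identifier / path tokens already known for the repo
    terms = [t.lower() for t in re.findall(r"[A-Za-z_]\w+", query)]
    content = [t for t in terms if t not in STOPWORDS]
    if not content:
        return False
    return (len(content) <= 4                            # 1. short, keyword-style query
            and all(t in repo_tokens for t in content)   # 2. every term is a real code token
            and all(len(t) >= 4 for t in content)        # 3. no tiny, noisy terms
            and not query.strip().endswith("?"))         # 4. not phrased as an NL question

# An auth / session / token style query whose terms all overlap code tokens would pass
# all four conditions and go straight to rg; an NL question like django-001 would not.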
2. Public OSS recall (Django · React · Tokio)
30 hand-labelled questions across three popular open-source
codebases. Anyone can clone the repos and rerun
benchmarks/public_oss_bench.py
to reproduce every number below.
| Repo | Language | LOC ≈ | Tasks | skygrep recall | rg recall | Token reduction |
|---|---|---|---|---|---|---|
| django/django | Python | 524 K | 10 | 10 / 10 | 10 / 10 | 703 × |
| tokio-rs/tokio | Rust | 80 K | 10 | 10 / 10 | 10 / 10 | 61 × |
| facebook/react | JS+TS | 270 K | 10 | 10 / 10 | 10 / 10 | 773 × |
| Aggregate | 3 langs | ~ 870 K | 30 | 30 / 30 (100 %) | 30 / 30 | 60×–770× |
Honest framing: rg's 100 % is a recall-ceiling
baseline — it returns 20 M+ tokens of term-OR scan output per
query, so the answer is in there but the agent has to read the
whole haystack. skygrep returns the right file ranked
top-10 in 30 / 30 cases while emitting 60×–770× less context.
React reached 10 / 10 after the Option-C substrate upgrade
(bge-m3 embedder + content-agnostic
non-canonical-path filter); the original failure modes on
react-007 and react-010 and the
resolution are documented in
parity-benchmarks.html
as the engineering record, not erased.
3. Worked example — django-001 (one query, real numbers)
One of the 30 tasks aggregated into the public-OSS recall table above, run against the actual Django source tree (524 K LOC); reproduce locally to verify every number.
Query: "Where does Django turn an incoming URL into the view function that should handle it?"
Expected canonical: django/urls/resolvers.py (URLResolver.resolve())
Vocab mismatch: the query says "URL into view", the code identifier is resolve — the failure mode that grep-as-search collapses on.
Side A — rg term-OR scan
The rg-agent extracts up to 8 terms (TF-IDF-ish stopword filter),
runs rg -i -F --max-count=20 -C2 per term,
concatenates output. For this query the extractor produced:
['function', 'incoming', 'incom', 'django', 'handle', 'should', 'into', 'that']
The high-signal words URL, view,
resolve did not survive the extractor — they were
either stopword-filtered or pushed past the 8-term cap. That is
the vocab-mismatch failure: the query's actual intent is
URLResolver.resolve(), but rg searches
for function, django, that
instead.
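A rough reconstruction of the kind of extractor described above; the stopword list, length filter, and ranking are assumptions, and the point is only to show the mechanism: a term like URL can be silently dropped before rg ever runs, and everything else competes for at most 8 slots:

import re
from collections import Counter

MAX_TERMS = 8
STOPWORDS = {"where", "does", "turn", "an", "the", "it", "to", "a", "is"}

def extract_terms(query, corpus_doc_freq=None, max_terms=MAX_TERMS):
    freq = corpus_doc_freq or Counter()
    tokens = [t for t in re.findall(r"[a-z]+", query.lower()) if len(t) > 3]   # drops "url"
    kept = [t for t in dict.fromkeys(tokens) if t not in STOPWORDS]            # de-dupe, keep order
    # TF-IDF-ish: prefer terms that are rarer in the corpus, then truncate at the cap;
    # anything ranked past the cap never reaches rg at all.
    return sorted(kept, key=lambda t: freq.get(t, 0))[:max_terms]

# Which high-signal words survive depends entirely on the stopword list, the length
# filter, and the 8-term cap; that dependence is exactly the vocab-mismatch failure mode.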
Real measured output volumes per rg invocation:
| Term | rg output (tokens) |
|---|---|
| django | 1,438,245 |
| that | 841,092 |
| function | 452,938 |
| should | 423,617 |
| handle | 238,008 |
| into | 175,039 |
| incom | 63,593 |
| incoming | 4,024 |
django alone is 1.4 million tokens
because the term matches in basically every file of the Django
source tree. that is 840 K tokens
for the same reason — high-frequency words that the stopword
filter let through.
Side B — skygrep --top 10 --json
$ skygrep "Where does Django turn an incoming URL into the view function that should handle it?" \
--json --top 10
$ wc -c # 10,430 chars ≈ 2,607 tokens
Top files returned include django/urls/resolvers.py
and related canonical implementation files.
Reduction — visual comparison
| Side | Output tokens |
|---|---|
| rg term-OR (8 terms, concatenated) | 3,636,556 |
| skygrep --top 10 | 2,607 |
The 1,395 × ratio here is higher than the 60 ×–770 × headline range
because vocab-mismatch queries are the worst case for rg
(stopwords flood the output) and the best case for skygrep
(the embedder bridges "URL into view" → resolve()).
Why the headline is 60 × – 770 ×, not a single number
rg output scales with
(repo LOC) × (term frequency of the high-signal terms);
skygrep output is roughly constant at a given top-k (top-10 ≈ 10 KB):
| Repo | LOC | Per-query rg tokens (avg) | Per-query skygrep tokens | Ratio |
|---|---|---|---|---|
| Tokio (Rust) | 80 K | ~190 K | ~3.1 K | 61 × |
| Django (Python) | 524 K | ~2.06 M | ~2.9 K | 703 × |
| React (JS+TS) | 270 K | ~2.28 M | ~2.9 K | 773 × |
Tokio is the floor (small repo, focused vocabulary). Django and
React both blow up because django / react
saturate term-OR scans across mid-sized monorepos.
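The per-repo ratios in the table are just the per-query averages divided; recomputing them from the rounded values above:

rows = {"tokio": (190_000, 3_100), "django": (2_060_000, 2_900), "react": (2_280_000, 2_900)}
for repo, (rg_tokens, sky_tokens) in rows.items():
    print(f"{repo}: {rg_tokens / sky_tokens:.0f}x")
# tokio: 61x, django: 710x, react: 786x; these drift slightly from the 61 × / 703 × / 773 ×
# column only because the per-query averages quoted here are themselves rounded.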
Reproduce yourself (3 commands)
# rg side — counts bytes from term-OR scan
cd /tmp/oss-bench/django
for term in function incoming incom django handle should into that; do
rg -i -F --max-count 20 -C 2 "$term" .
done | wc -c
# skygrep side — counts bytes from top-10 JSON
skygrep "Where does Django turn an incoming URL into the view function that should handle it?" \
--json --top 10 | wc -c
# divide chars by 4 to approximate tokens
Numbers will land within ± 5 % of the ones above (variance from
your Django clone's commit and rg minor-version output formatting).
Per-task analysis for all 30 queries lives in
parity-benchmarks.html.
4. Agent tool-context depth benchmark (0.5.13)
0.5.13 adds a benchmark for the context that coding agents actually
consume between reasoning steps. It does not call remote Claude, GPT, or
any cloud model. Instead it compares two deterministic local policies:
one structured skygrep --json --content call per task versus
a raw rg agent that runs several term searches and then reads
line-window context.
The task set covers eight generic repository-maintenance questions across locate, snippet, deep, and abstract levels, then runs low / medium / high effort profiles. The scoring model is intentionally simple: sufficiency = 60 % expected-path coverage + 40 % evidence-term coverage. Path precision and sufficiency density measure how much irrelevant context the next LLM turn has to filter.
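A sketch of the stated scoring model; the 60 / 40 weighting and the per-1k-token density come from the text and table above, while the argument names and shapes are illustrative (the real implementation is benchmarks/agent_tool_depth_benchmark.py):

def score_task(expected_paths, expected_terms, returned_paths, returned_text, context_tokens):
    path_cov = sum(p in returned_paths for p in expected_paths) / len(expected_paths)
    term_cov = sum(t in returned_text for t in expected_terms) / len(expected_terms)
    sufficiency = 0.60 * path_cov + 0.40 * term_cov          # the stated 60 / 40 weighting
    # Precision and density capture how much irrelevant context the next LLM turn must filter.
    precision = (sum(p in expected_paths for p in returned_paths) / len(returned_paths)
                 if returned_paths else 0.0)
    density = sufficiency / (context_tokens / 1000) if context_tokens else 0.0
    return sufficiency, precision, density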
| Metric | skygrep-agent | raw rg-agent | Reading |
|---|---|---|---|
| Tasks × effort profiles | 24 | 24 | 8 generic tasks × 3 effort profiles |
| Path coverage | 81.9 % | 100.0 % | rg remains the recall ceiling |
| Path precision | 34.9 % | 12.2 % | skygrep returns less irrelevant path noise |
| Evidence coverage | 79.2 % | 92.7 % | support packs close much of the evidence gap |
| Sufficiency score | 80.8 % | 97.1 % | weighted path + evidence score |
| Tool calls | 24 | 147 | 6.12× fewer calls for skygrep |
| Context tokens | 56,424 | 2,129,655 | 37.74× less context for skygrep |
| Sufficiency per 1k tokens | 0.344 | 0.011 | 31.27× denser context for skygrep |
Honest framing: raw rg can still be faster at producing a
large unranked dump, and it remains the ceiling when an agent can afford
to inspect everything. The 0.5.13 win is agent-context efficiency:
compact ranked evidence with far fewer tool calls and far fewer tokens.
Reproduce with
benchmarks/agent_tool_depth_benchmark.py --summary-only.
5. skylakegrep self-test (regression guard)
Deterministic local benchmark over 30 repository-navigation tasks
against this very repo. Compares a single skygrep search
call against a simulated grep-agent. Token volumes are estimated
as chars / 4. Every release is verified to keep
30 / 30 at top-k 10.
Run via the benchmarks/agent_context_benchmark.py --top-k N command.
| top-k | recall | total-token reduction | context-token reduction |
|---|---|---|---|
| 5 | 28 / 30 | 2.66× | 5.53× |
| 10 | 30 / 30 | 2.00× | 2.90× |
| 20 | 30 / 30 | 1.36× | 1.53× |
| 50 | 30 / 30 | 0.67× | 0.60× |
Vs real ripgrep (not the simulated grep-agent), this same task set shows ~17.7× total-token reduction at equal recall. See the benchmark protocol for definitions and limitations.
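The token estimate and reduction ratio behind the table are deliberately simple; a sketch with illustrative function names:

def estimate_tokens(text):
    return len(text) // 4                    # the chars / 4 estimate used by the self-test

def token_reduction(grep_agent_output, skygrep_output):
    # e.g. at top-k 10 the self-test reports a 2.00× total-token reduction
    return estimate_tokens(grep_agent_output) / max(estimate_tokens(skygrep_output), 1)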
Which number to cite for which claim
- "skygrep cuts agent tool calls" → benchmark 1: −37.6 % single-turn, −82 % multi-turn (real Claude Code sub-agents, 20 + 1 hand-labelled tasks).
- "skygrep matches rg on hit-rate while emitting 60×–770× less context" → benchmark 2: 30 / 30 (100 %) across Django + React + Tokio public OSS, vs rg's 30 / 30 baseline (which dumps 20 M+ tokens per query for the agent to filter).
- "this is what one query actually looks like" → benchmark 3: the django-001 worked example with real measured volumes, rg 3.6 M tokens vs skygrep 2.6 K tokens (≈ 1,395 ×) on a single vocab-mismatch query.
- "skygrep is more token-efficient than ripgrep when feeding LLM context" → benchmark 5: ~17.7× total-token reduction at equal recall on the 30-task self-test.
- "skygrep gives agents denser context with fewer tool calls" → benchmark 4: 6.12× fewer tool calls, 37.74× less context, and 31.27× higher sufficiency density on the 0.5.13 agent tool-context benchmark.
Don't combine these into a single number; they answer different
questions. The honest framing is in
parity-benchmarks.html's
"Strongest claims" section.