# Parity benchmarks: a reproducible bench protocol
Reproducible head-to-head comparisons of skygrep against real
ripgrep, run on three popular open-source codebases. Anyone with
the repos cloned can rerun in a few minutes and verify every number
on this page.
**Headline:**

- `skygrep` matches `rg` exactly across all three codebases (10 / 10 each on Django, Tokio, and React; 30 / 30 aggregate). React reached 10 / 10 after the Option-C substrate upgrade (the `bge-m3` 1024-d symmetric embedder plus a content-agnostic non-canonical-path filter); the two original React misses (`react-007`, `react-010`) and how they were resolved are documented below as the engineering record, not erased.
- `skygrep` returns 60×–770× fewer context tokens than `rg`'s term-OR scan, so a downstream agent loop that consumes the context pays dramatically less even when recall is the same.
## Setup

```sh
git clone --depth=1 https://github.com/django/django /tmp/oss-bench/django
git clone --depth=1 https://github.com/facebook/react /tmp/oss-bench/react
git clone --depth=1 https://github.com/tokio-rs/tokio /tmp/oss-bench/tokio
cd skylakegrep
.venv/bin/python benchmarks/public_oss_bench.py
```
The runner:

- Reads each fixture from `benchmarks/cross_repo/{django,react,tokio}.json` — 10 hand-labeled questions per repo, each with a canonical expected file plus zero or more `expected_alternatives` for queries with multiple legitimate answers.
- Indexes the OSS repo into a tmp SQLite DB (5–10 min one-time per repo).
- For each task, runs both:
  - `rg`: term-OR over up to 8 extracted query terms × 20 matches per term × 2-line context window (the real ripgrep agent baseline).
  - `skygrep`: one semantic top-10 search.
- Reports per-task hit / miss, average latency, and total context tokens emitted.
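The per-task loop above can be sketched as follows. This is an illustration, not the real runner (`benchmarks/public_oss_bench.py`): the function name, the callable signatures, and everything except the fixture field names (`expected`, `expected_alternatives`) are assumptions.

```python
import json

def run_fixture(fixture_path, run_rg, run_skygrep):
    """Score both sides on every task in a fixture file.

    run_rg / run_skygrep are hypothetical callables with the shape
    query -> (returned_paths, context_chars).
    """
    with open(fixture_path) as f:
        tasks = json.load(f)
    results = []
    for task in tasks:
        # A hit is the canonical file or any listed alternative.
        expected = {task["expected"], *task.get("expected_alternatives", [])}
        for side, run in (("rg", run_rg), ("skygrep", run_skygrep)):
            paths, chars = run(task["query"])
            results.append({
                "id": task["id"],
                "side": side,
                "hit": any(p in expected for p in paths),  # per-task hit / miss
                "tokens": chars // 4,                      # chars/4 token estimate
            })
    return results
```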
## Aggregate result
| Repo | LOC ≈ | skygrep recall | rg recall | sky lat | rg lat | sky tokens | rg tokens | token reduction |
|---|---|---|---|---|---|---|---|---|
| Django (Python) | 524K | 10 / 10 | 10 / 10 | 10.13 s | 2.97 s | ~29 K | ~20.6 M | 703 × |
| Tokio (Rust) | 80K | 10 / 10 | 10 / 10 | 21.91 s | 1.49 s | ~31 K | ~1.9 M | 61 × |
| React (JS+TS) | 270K | 10 / 10 | 10 / 10 | 11.71 s | 4.58 s | ~29 K | ~22.8 M | 773 × |
| **Aggregate** | — | 30 / 30 (100 %) | 30 / 30 (100 %) | — | — | — | — | ~60×–770× |
## Worked example — django-001 (one query, real numbers)

Run against the actual Django repo (524 K LOC), cloned at the same commit the headline table was measured against; reproduce locally to verify every number. This is one of the 30 tasks aggregated into the table above.
**Query:** "Where does Django turn an incoming URL into the view function that should handle it?"

**Expected canonical:** `django/urls/resolvers.py` (`URLResolver.resolve()`)

**Vocab mismatch:** the query says "URL into view"; the code identifier is `resolve` — the failure mode that grep-as-search collapses on.
### Side A — rg term-OR scan (the rg-agent baseline)
The rg-agent extracts up to 8 terms (a TF-IDF-ish stopword filter), runs `rg -i -F --max-count=20 -C2` per term, and concatenates the output. For this query the extractor produced:

```python
['function', 'incoming', 'incom', 'django', 'handle', 'should', 'into', 'that']
```
The high-signal words `URL`, `view`, `resolve` did not survive the extractor — they were either stopword-filtered or pushed past the 8-term cap. That's the vocab-mismatch failure: the query's actual intent is `URLResolver.resolve()`, but rg searches for `function`, `django`, `that` instead.
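A minimal sketch of an extractor in this shape follows. The stopword list below is an assumption for illustration only — the real extractor's list clearly differs, since it kept `should`, `into`, and `that` while dropping `url` and `view`:

```python
import re

# Illustrative stopword list — NOT the benchmark extractor's real one.
STOPWORDS = {"where", "does", "the", "an", "a", "to", "it", "is"}

def extract_terms(query, cap=8):
    """Lowercase, split into alphabetic runs, drop stopwords and
    duplicates in order, then cap at 8 terms."""
    terms, seen = [], set()
    for word in re.findall(r"[a-z]+", query.lower()):
        if word in STOPWORDS or word in seen:
            continue
        seen.add(word)
        terms.append(word)
    return terms[:cap]
```

The 8-term cap is exactly where high-signal words fall off: anything ranked past position 8, however discriminative, never reaches rg.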
Real measured output volumes per rg invocation (one per term):

| rg term | Output chars | Tokens (chars / 4) |
|---|---|---|
| `function` | 1,811,753 | 452,938 |
| `incoming` | 16,095 | 4,024 |
| `incom` | 254,373 | 63,593 |
| `django` | 5,752,979 | 1,438,245 |
| `handle` | 952,033 | 238,008 |
| `should` | 1,694,468 | 423,617 |
| `into` | 700,156 | 175,039 |
| `that` | 3,364,369 | 841,092 |
| **Total** | **14,546,226** | **3,636,556** |
`django` alone is 1.4 million tokens because the term matches in basically every file of the Django source tree. `that` is 840 K tokens for the same reason — high-frequency words that the stopword filter let through.
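The token column is the chars-per-token heuristic applied to the measured byte counts; under Python's default rounding, a one-line helper reproduces the rows above:

```python
def approx_tokens(chars):
    # ~4 characters per token, the heuristic used throughout this page
    return round(chars / 4)
```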
### Side B — skygrep --top 10 --json

```sh
skygrep "Where does Django turn an incoming URL into the view function that should handle it?" \
  --json --top 10
```

Real measured output: 10,430 chars ≈ 2,607 tokens. Top files returned include `django/urls/resolvers.py` and related canonical implementation files.
### Reduction

| | Tokens | Notes |
|---|---|---|
| rg term-OR scan | 3,636,556 | dominated by `django` (1.4 M) and `that` (840 K) — stopwords flooded the haystack |
| `skygrep --top 10` | 2,607 | top-K cosine over bge-m3 substrate, no token-level matching |
| **Ratio** | **≈ 1,395 ×** | for this query |
This is higher than the 60×–770× headline range because vocab-mismatch queries are the worst case for rg (stopwords flood the output) and the best case for skygrep (the embedder bridges "URL into view" → `resolve()`).
## Why the headline range is 60×–770×, not a single number

rg output volume scales with (repo LOC) × (term frequency of the high-signal query terms). skygrep output is roughly constant per query (top-10 ≈ 10 KB of JSON):
| Repo | LOC | Per-query rg tokens (avg) | Per-query skygrep tokens | Ratio |
|---|---|---|---|---|
| Tokio (Rust) | 80 K | ~190 K | ~3.1 K | 61 × |
| Django (Python) | 524 K | ~2.06 M | ~2.9 K | 703 × |
| React (JS+TS) | 270 K | ~2.28 M | ~2.9 K | 773 × |
Tokio is the floor (small repo, focused vocabulary). Django and React both blow up because the terms `django` / `react` saturate term-OR scans across mid-sized monorepos.
## Reproduce yourself (3 commands)

```sh
# rg side — counts bytes from term-OR scan
cd /tmp/oss-bench/django
for term in function incoming incom django handle should into that; do
  rg -i -F --max-count 20 -C 2 "$term" .
done | wc -c

# skygrep side — counts bytes from top-10 JSON
skygrep "Where does Django turn an incoming URL into the view function that should handle it?" \
  --json --top 10 | wc -c

# divide chars by 4 to approximate tokens
```
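Since both sides use the same chars/4 divisor, it cancels out of the comparison: the token ratio equals the raw byte ratio. The final arithmetic, using the django-001 volumes measured above:

```python
def token_ratio(rg_chars, sky_chars):
    # chars/4 approximates tokens on both sides, so the constant
    # cancels and the token ratio equals the raw byte ratio
    return rg_chars / sky_chars
```

With the two `wc -c` outputs from this page, `token_ratio(14_546_226, 10_430)` rounds to the ≈ 1,395 × reported for this query.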
Numbers will land within ± 5 % of the table above (variance from your Django clone's commit and rg minor-version output formatting).
## How to read these numbers

### "rg recall = 100 %" looks impressive but isn't apples-to-apples
The rg agent in this benchmark is the ripgrep equivalent of "search for every word the user said and dump every line that matches." It collects up to 160 match windows of context per query (8 terms × 20 matches per term, each with a 2-line context window). In practice that context is enormous: 20 M tokens of rg output across the 10 Django queries vs. 29 K tokens of skygrep output. So:
- Yes, `rg` finds the answer in the dump — always, on Django and Tokio. The expected file is somewhere in the 100 K-line haystack.
- No, `rg` does not give the agent a useful starting point. The agent now has to read 20 M tokens to figure out which of the 100 file-fragments is the actual answer. That is not a realistic real-world workflow — it's a recall ceiling on a deliberately permissive ripgrep configuration.
- `skygrep` returns the right file ranked top-10 in 30 of 30 cases. That is the user-facing number.
## Why React used to lag (and how it was resolved)
The React fixture used to surface two honest skygrep weaknesses, both fixed by the Option-C substrate upgrade. The original failure modes are documented here as the engineering record:

- **Test-fixture path bias.** `react-007` asks for the `React.createElement` implementation. The canonical answer is `packages/react/src/jsx/ReactJSXElement.js`. skygrep's top-10 used to be dominated by test fixtures `fixtures/legacy-jsx-runtimes/react-{14,15,16,17}/cjs/...` — filename-token similarity to "jsx-runtime" pulled the legacy fixture files ahead of the real source.
- **Devtools vs reconciler conflict.** `react-010` asks for the Profiler component implementation. The canonical answer is in the `react-reconciler` package. skygrep used to return several `react-devtools-shared/.../profilingHooks.js` files instead — the devtools profiler had many more "profiler" filename mentions than the reconciler's internal timer.
**Resolution (Option C).** Both classes of failure shared a single root cause: the mxbai-embed-large substrate ranked re-export / fixture aggregators above canonical implementations, and any post-hoc graph prior could not recover candidates that were not in the rerank pool. Two changes lifted React to 10 / 10:

- **`bge-m3` substrate** (1024-d, symmetric, multilingual XLM-RoBERTa). The canonical reconciler / source files now land inside the top-K cosine pool for both queries. A better prior cannot save you when cosine never surfaces the right file — upgrading the embedder was the only fix that worked.
- **Content-agnostic non-canonical-path filter** (24 universal aux conventions: `/fixtures/`, `/examples/`, `/vendor/`, `/node_modules/`, `/dist/`, `.development.js`, `.production.min.js`, `.min.js`, …). This is a structural prior, not a language-specific rule, and applies to any corpus that follows similar conventions (markdown notes with `/drafts/`, knowledge graphs with `/archive/`, etc.).
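A sketch of what such a structural filter can look like, using a subset of the conventions listed above. The predicate name and the exact matching rules are assumptions for illustration, not skygrep's implementation (which carries 24 entries):

```python
# Illustrative subset of the auxiliary-path conventions — the shipped
# filter's full list has 24 entries.
NON_CANONICAL_DIRS = ("/fixtures/", "/examples/", "/vendor/",
                      "/node_modules/", "/dist/")
NON_CANONICAL_SUFFIXES = (".development.js", ".production.min.js", ".min.js")

def is_canonical(path):
    """Content-agnostic structural prior: demote paths under known
    auxiliary directories or with build-artifact suffixes."""
    padded = "/" + path.strip("/") + "/"          # match whole path segments
    if any(d in padded for d in NON_CANONICAL_DIRS):
        return False
    return not path.endswith(NON_CANONICAL_SUFFIXES)
```

Because the rule inspects only the path, not file contents, it transfers unchanged to any corpus whose layout follows similar conventions.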
## Why per-query latency is higher than rg's
The benchmark cold-loads the cross-encoder reranker once per process (~30 s) and runs each query through the cascade, including the HyDE escalation path on uncertain queries. In `skygrep serve` daemon mode the reranker stays warm in memory, and warm queries land in the ~0.5–2 s band. The 11–20 s/query numbers here are an honest CLI-from-cold-start measurement, not a daemon throughput claim.
For an AI agent the relevant cost is the LLM round-trip after the search, which scales with token count of the context — and that is where skygrep's 60–770× reduction lives.
## Per-task detail

Run the benchmark without `--summary-only` to get every task's expected path, returned top-10, and per-tier hit / miss:

```sh
.venv/bin/python benchmarks/parity_vs_ripgrep.py \
  --root /tmp/oss-bench/react \
  --tasks benchmarks/cross_repo/react.json \
  --top-k 10 > /tmp/react-detail.json
```
```sh
python3 -c "
import json
d = json.load(open('/tmp/react-detail.json'))
for t in d['tasks']:
    print(f\"{t['id']}: skygrep_hit={t['skygrep']['hit']} rg_hit={t['rg']['hit']}\")
"
```
## Reproducing yourself

The fixtures, runner, and `parity_vs_ripgrep.py` are all in this repo. Pin the OSS clones at the commits you tested against (`git log --oneline -1` inside each clone) when reporting numbers, since upstream code drifts.