Real-corpus benchmark — 0.3.0 graph-walk + 0.3.1 rollback
Date: 2026-05-06
Corpus: skylakegrep/src/ (~30 Python files)
Bench script: benchmarks/graph_walk_bench.py
This document records the first end-to-end benchmark of the v2
graph-walk substrate introduced in 0.3.0, plus the rollback decision
made in 0.3.1 after the bench exposed a stack of preset hyperparameters
in violation of docs/PRINCIPLES.md Principle 1.
Bench setup
- 30 Python source files indexed (one chunk per file, no L2 embeddings)
- 284 symbols extracted via lite regex (def name / class name)
- Graph substrate built: 33 containment edges + 18 name-sim edges + 168 path-prox edges → 219 edges over 34 nodes
- 5 representative cold-start queries with ground-truth expectations
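The "lite regex" symbol extraction mentioned in the setup can be sketched roughly as follows. The pattern and the function name extract_symbols are assumptions for illustration; the regex the indexer actually uses is not shown in this document.

```python
import re

# Hypothetical pattern: a "def name" or "class name" header at the start of a
# (possibly indented) line, optionally preceded by "async".
SYMBOL_RE = re.compile(
    r"^[ \t]*(?:async[ \t]+)?(def|class)[ \t]+([A-Za-z_]\w*)",
    re.MULTILINE,
)


def extract_symbols(source: str) -> list[tuple[str, str]]:
    """Return (kind, name) pairs for every def/class header found in source."""
    return [(m.group(1), m.group(2)) for m in SYMBOL_RE.finditer(source)]


# Made-up sample source, only to exercise the pattern.
sample = '''
class WalkResult:
    def top(self, k):
        return self.scores[:k]

async def ppr_walk(graph, seeds):
    pass
'''
symbols = extract_symbols(sample)
```

A regex of this kind has no notion of scope or strings, so it will also pick up headers that appear inside docstrings or comments.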
0.3.0 results — 2 / 5 hits (40%)
Q1. "where is the PPR walk implemented"
expect graph_walk.py → ✗ miss (top-5: __init__.py / cli.py / ...)
Q2. "how does the cold-start seed mapping work"
expect query_seeds.py → ✗ miss
Q3. "where do we build the graph edges during indexing"
expect graph_substrate.py → ✗ miss
Q4. "find the cascade search function"
expect storage.py → ✓ hit
Q5. "where is the LLM router decision class"
expect llm_router.py → ✓ hit (rank 2)
Walk latency: p50 = 6.4 ms · max = 18.3 ms. Latency claim held; accuracy claim did not — 3 of 5 queries failed.
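The hit / miss scoring above can be read as a top-5 membership check. The sketch below is one way to compute it; hit_rank and hit_rate are hypothetical helper names, not functions from benchmarks/graph_walk_bench.py, and the file names in the example are placeholders rather than bench output.

```python
def hit_rank(ranked_files, expected, k=5):
    """Return the 1-based rank of expected within the top k, or None on a miss."""
    for rank, name in enumerate(ranked_files[:k], start=1):
        if name == expected:
            return rank
    return None


def hit_rate(results, k=5):
    """Fraction of (ranked_files, expected) pairs whose expected file is in the top k."""
    hits = sum(1 for ranked, expected in results if hit_rank(ranked, expected, k) is not None)
    return hits / len(results)


# Placeholder data: one hit at rank 2, one miss.
toy_results = [
    (["a.py", "b.py", "c.py"], "b.py"),
    (["a.py", "c.py"], "d.py"),
]
toy_rate = hit_rate(toy_results)
```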
Root cause — preset hyperparameters dilute the seed signal
A debug pass showed the seed mapper itself was correct on Q1:
FILENAME hits: graph_walk.py score=1.0
SYMBOL hits: graph_walk.py score=4.5 (5+ symbol hits via ppr_walk, WalkResult, …)
indexer.py score=3.0 (false-positive symbol substring hits)
storage.py score=1.5
FINAL SEEDS:
prob=0.550 graph_walk.py ← right answer at top
prob=0.300 indexer.py
prob=0.150 storage.py
So query_to_seeds() correctly placed graph_walk.py at 55 % of the
seed mass. But after ppr_walk() traversed the graph through the
preset edge weights, graph_walk.py dropped out of the top-5
entirely. The walk was diluting the correct signal into structurally-
adjacent siblings.
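The seed probabilities in the debug output are consistent with summing each file's raw scores and dividing by the total (5.5 + 3.0 + 1.5 = 10.0). The sketch below reproduces that arithmetic and then runs a plain power-iteration personalized PageRank over a toy graph to show how a low restart probability moves mass away from the seeds. This is an illustration under stated assumptions: the project's ppr_walk() is not shown here and appears to use eps / max_visited cutoffs rather than power iteration, and the star-shaped graph is invented, not the bench's 219-edge substrate.

```python
def normalize(raw_scores):
    """Turn raw per-file scores into a probability distribution."""
    total = sum(raw_scores.values())
    return {name: score / total for name, score in raw_scores.items()}


def ppr_power_iteration(adj, seeds, alpha, iters=200):
    """Personalized PageRank by power iteration.

    adj maps node -> {neighbour: edge weight}; every node must have at
    least one outgoing edge. alpha is the restart probability.
    """
    p = {node: seeds.get(node, 0.0) for node in adj}
    for _ in range(iters):
        # Restart step: a fraction alpha of the mass returns to the seeds.
        nxt = {node: alpha * seeds.get(node, 0.0) for node in adj}
        # Walk step: the rest follows edges in proportion to their weight.
        for node, out in adj.items():
            out_total = sum(out.values())
            for neighbour, weight in out.items():
                nxt[neighbour] += (1 - alpha) * p[node] * weight / out_total
        p = nxt
    return p


# Raw scores from the Q1 debug output: filename 1.0 + symbol 4.5 for graph_walk.py.
seeds = normalize({"graph_walk.py": 1.0 + 4.5, "indexer.py": 3.0, "storage.py": 1.5})

# Hypothetical star graph: every file is linked only to __init__.py.
leaves = ["graph_walk.py", "indexer.py", "storage.py", "cli.py"]
adj = {"__init__.py": {leaf: 1.0 for leaf in leaves}}
adj.update({leaf: {"__init__.py": 1.0} for leaf in leaves})

scores = ppr_power_iteration(adj, seeds, alpha=0.15)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the hub, which received no seed mass, ends with the highest score: (1 − α) / (2 − α) ≈ 0.46 at α = 0.15, against roughly 0.18 for graph_walk.py. It shows the dilution mechanism only; it does not reproduce the actual Q1 miss.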
The 9+ preset hyperparameters in 0.3.0:
- query_seeds.match_filenames score_per_hit = 1.0
- query_seeds.match_symbols score_per_hit = 1.5
- query_seeds.match_path_tokens score_per_hit = 0.5
- query_seeds.match_semantic threshold = 0.45
- graph_substrate path_prox weight = 0.35 (lowered from 0.7 in a failed mid-bench tuning attempt; STILL a preset)
- graph_walk.DEFAULT_ALPHA = 0.15
- graph_walk.DEFAULT_EPS = 1e-3
- graph_walk.DEFAULT_MAX_VISITED = 200
- graph_walk.DEFAULT_TOP_K_EDGES = 8
Each one a hand-chosen magic number. Each one a future tech-debt liability. Each one trading off bench-quality vs implementation cleanliness with no principled basis.
User feedback (2026-05-06)
"How are the weights here actually tuned: are they preset, or tuned automatically? Performance has dropped now, right? Remember that nothing can be preset, otherwise it becomes a hyperparameter. We cannot have this many hyperparameters, because they accumulate technical debt. The intelligence we want has to be generic enough, and it has to preserve the accuracy and latency of previous versions." (Translated from the original Chinese.)
This is exactly the Principle 1 anti-pattern (Understanding >
Enumeration) the project's own principles file forbids.
0.3.1 rollback — three concrete moves
- Cascade integration removed. _graph_walk_candidates is no longer called from cascade_search. The SKYGREP_GRAPH_WALK env var is dead: toggling it has no effect. The escalation path reverts to 0.2.21's pure Round-A ∪ Round-C union.
- Preset score-per-hit constants stripped from query_seeds.py. All four matchers (match_filenames, match_symbols, match_path_tokens, match_semantic) now return raw token-hit counts (or, for semantic, the cosine score itself, which is data-derived). No 1.0 / 1.5 / 0.5 ratios.
- path_prox edge weight derived from data. Instead of the preset 0.35 (or earlier 0.7) constant, the weight is now 1 / (1 + max(0, 8 - depth)), depth-derived from the path itself, no free parameter.
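The depth-derived path_prox weight can be written out directly from the formula 1 / (1 + max(0, 8 - depth)). In the sketch below, path_prox_weight is a hypothetical name, and depth is taken as a plain non-negative integer because this document does not say how the indexer measures it.

```python
def path_prox_weight(depth: int) -> float:
    """Weight from the 0.3.1 formula: 1 / (1 + max(0, 8 - depth)).

    The weight grows with depth and saturates at 1.0 once depth reaches 8.
    """
    return 1.0 / (1.0 + max(0, 8 - depth))


# Weights for a range of depths, to show the shape of the curve.
weights_by_depth = {depth: path_prox_weight(depth) for depth in range(0, 11)}
```

Under this formula a depth of 0 gives 1/9, each extra level of depth raises the weight, and anything at depth 8 or deeper gets the full weight of 1.0.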
What stays:
- The schema (graph_node, graph_edge tables). Idempotent, no
cost in default behaviour.
- The modules (graph_walk.py, query_seeds.py,
graph_substrate.py). They're useful primitives even when not on
the cascade critical path; future work can use them with derived
weights.
- The 20 unit tests. They cover the modules in isolation; passing
them is necessary even when integration is shelved.
What's deferred (until weights are derived from corpus stats):
- Graph-walk integration into cascade
- Public SKYGREP_GRAPH_WALK flag
- PPR walk's α / eps / max_visited need principled defaults
0.3.1 invariants — by construction
- Latency: identical to 0.2.21. Cascade path unchanged; the rolled-back integration was on the escalation path only and was gated behind a now-dead env var.
- Accuracy: identical to 0.2.21 on the public-OSS bench. The rollback strictly removes the integration that hurt 3/5 queries on this internal bench.
Lesson recorded in auto-memory
feedback_comprehensive_test_before_release.md — every release must
include real-corpus end-to-end benchmark with concrete numbers, not
just unit tests. Receipt: 0.3.0 shipped on by-construction
arguments; 0.3.1 is the principled rollback.
What "principled v2" would look like
For the next attempt at integrating graph-walk into the cascade, edge weights and walk parameters must come from corpus statistics or from learned signals — not from human picks. Concrete options:
- TF-IDF weighting: each token → file edge weight = log(N_files / N_files_containing_token). Self-deriving from the indexed corpus.
- Learned restart probability: α tuned per-query by an LLM head that classifies "this query needs local lookup" (high α) vs "this query needs broader exploration" (low α). The router is already doing this for cascade scope; extending it for graph walk is one prompt change, not a new hyperparameter.
- σ-adaptive walk stop: replace eps with the same Bayesian σ-evidence framing already used in cascade_search: eps_eff = max(eps_floor, k · σ(top_K_residuals)). Ties the walk stop to the data, not a constant.
These are not implemented in 0.3.1; they are the design contract for when graph-walk reintegration is attempted again.
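As a rough illustration of two of the options above, the sketch below computes token → file edge weights as log(N_files / N_files_containing_token) and an effective stop threshold as max(eps_floor, k · σ(top_K_residuals)). Everything here is an assumption rather than project code: the function names are hypothetical, σ is read as the population standard deviation of the K largest residuals, and eps_floor and k are taken as inputs because the formula names them without giving values.

```python
import math
import statistics
from collections import defaultdict


def idf_edge_weights(file_tokens):
    """Map (token, file) -> log(N_files / N_files_containing_token)."""
    n_files = len(file_tokens)
    doc_freq = defaultdict(int)
    for tokens in file_tokens.values():
        for token in set(tokens):
            doc_freq[token] += 1
    return {
        (token, path): math.log(n_files / doc_freq[token])
        for path, tokens in file_tokens.items()
        for token in set(tokens)
    }


def effective_eps(residuals, top_k, eps_floor, k):
    """eps_eff = max(eps_floor, k * sigma(top_K_residuals))."""
    top = sorted(residuals, reverse=True)[:top_k]
    if len(top) < 2:
        # A spread needs at least two values; fall back to the floor.
        return eps_floor
    return max(eps_floor, k * statistics.pstdev(top))


# Made-up token lists for three files.
toy_corpus = {
    "graph_walk.py": ["import", "ppr", "walk"],
    "indexer.py": ["import", "walk"],
    "storage.py": ["import"],
}
edge_weights = idf_edge_weights(toy_corpus)
eps_eff = effective_eps([0.4, 0.2, 0.0, 0.0], top_k=2, eps_floor=1e-3, k=1.0)
```

A token that occurs in every file, such as "import" in the toy corpus, gets weight log(1) = 0 and so contributes no edge signal, while a token unique to one file gets the largest weight.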