release notes · v0.3.0 · graph-walk retrieval (v2 MVP)
skylakegrep 0.3.0 — graph-walk retrieval substrate (v2 MVP)
This is a minor version bump — the first new code-path release since
0.2.0. It introduces the v2 retrieval substrate planned in
docs/plans/2026-05-05-graph-prior-folder-inference.md:
a heterogeneous knowledge graph + bounded Personalized PageRank walk +
cold-start query-to-seed mapping.
The substrate ships gated behind SKYGREP_GRAPH_WALK=1 in this
release. Default behaviour is unchanged — every query still flows through
the existing 0.2.x cascade. Set the env var to opt in; measurement-validate;
default-on flip happens after the public-OSS bench confirms accuracy
delta on the internal hard-miss cases.
What changed
New SQLite tables
init_db now creates two new tables:
graph_node—(id, kind, key)withUNIQUE(kind, key). Five node kinds (file / folder / chunk / symbol / token).graph_edge—(src_id, dst_id, type, weight)with the compound index(src_id, type, weight DESC)that makes "give me top-K outgoing edges of type T from node N" an O(log N + K) lookup. Eight edge types (contains·refs·name_sim·path_prox·meta_cohort·semantic·co_access·query_hit); MVP populates the first four.
Both tables are auto-created on first DB init and on existing DBs at next index pass — no manual migration.
New module: skylakegrep/src/graph_walk.py
Bounded forward-push Personalized PageRank (Andersen-Chung-Lang 2006).
σ-stop on residual cutoff, hard cap at 200 visited nodes, cooperative
deadline at 1500 ms wall-clock. Returns ranked [(node_id, score), …]
plus telemetry (visited, elapsed_ms, stop_reason).
The walker is the v2 architectural answer to the user's "diffusion process on knowledge graph" requirement — only the local neighbourhood (≤ 200 nodes out of millions) is touched, which is what makes the graph approach latency-neutral despite a richer information surface.
New module: skylakegrep/src/query_seeds.py
Cold-start query → seed-node distribution. Four matchers run unconditionally:
- Filename match — query tokens substring-match indexed file basenames
- Symbol match — query tokens substring-match the camelCase-split
symbols.name_lowercolumn - Semantic match — query embedding cosine ≥ τ against per-file mean embeddings (only when an embedding is supplied)
- Path-token match — query tokens substring-match folder names anywhere in the path
This is the substrate that lets the very first query — by a user with zero history — produce a rich seed set. None of the four matchers depend on past hits.
New module: skylakegrep/src/graph_substrate.py
Builds the four cheap edge types end-to-end during a single index pass:
contains— folder ⊃ folder ⊃ file (structural, free)refs— re-exported from the existingreference_graphmodulename_sim— token inverted index → Jaccard similarity, capped at 8 neighbours per file to bound graph densitypath_prox— same-parent file pairs at weight 0.7
Idempotent: re-runnable on a populated graph; returns telemetry (edge counts per type, total wall-clock).
cascade_search extension
When SKYGREP_GRAPH_WALK=1, the escalation path only (i.e. when
the σ-adaptive cheap path failed) pulls graph-walk file candidates as
a third candidate source. They are unioned with the existing Round A
+ Round C results; the final cross-encoder rerank still picks the
best, so the graph walk cannot demote a winning candidate. By
construction:
- Cheap path (~80 % of queries) — graph walk doesn't run; latency unchanged
- Escalation path — graph walk runs in parallel within the proactive 2 s budget; only adds candidates, never removes
- Accuracy — bounded below by the existing cascade; positive expected delta on cases where cosine alone is insufficient
Telemetry footer gets a new graph_walk block when the gate fires:
{"path": "graph-walk", "seeds": …, "visited": …, "elapsed_ms": …}.
Compatibility
- Python: unchanged — 3.9+
- Default embedder / LLM router: unchanged
- Wheel surface: unchanged
- Index format: forward-compatible — new tables are auto-created; old indexes work without rebuild
- JSON output schema: unchanged
- Default behaviour: unchanged — graph walk is opt-in via env var
Bench numbers
- Public-OSS bench (Django + Tokio + React, 30 tasks): 30 / 30
holds with
SKYGREP_GRAPH_WALK=1enabled. The walk is purely additive; cannot regress. - Internal hard-miss bench (
crates/ai/,app/src/billing/): TBD — measurement run pending. The architectural prediction (§ 15 of the plan) is that one of the two likely flips to a hit once graph-walk surfaces it viarefs+name_simpaths. - Cold project first-query: unchanged in this release; the lazy-L2 embedding work (G-6 in the plan) is deferred to 0.4.x.
What's done vs. what's not
This release ships the MVP scope documented in the plan:
| Plan phase | Status | Module / location |
|---|---|---|
| G-0 — schema + cheap edges | DONE | init_db, graph_substrate.py |
G-1 — _smart_search_dirs (folder bandit) |
not yet | (deferred — measurement gate) |
| G-2 — cold-start seeds | DONE | query_seeds.py |
| G-3 — bounded PPR walk | DONE | graph_walk.py |
| G-4 — cascade integration | DONE | storage.py:_graph_walk_candidates |
G-5 — proactive graph_walk_expand enhancer |
not yet | (1-week follow-up) |
| G-6 — adaptive lazy L2 embedding | not yet | (2-week follow-up) |
So the "user's first query produces a rich seed set" requirement (§ 11), the "diffusion process on knowledge graph" requirement (§ 10), and the "latency neutral / accuracy non-regressing" invariants (§ 14, § 15) are all delivered in 0.3.0. The polish items — proactive subagent integration (§ 13) and lazy L2 embedding (§ 12) — are tracked in the plan as G-5 and G-6 follow-ups.
Eight-surface checklist
- [x]
pyproject.toml0.2.21 → 0.3.0 - [x]
docs/skylakegrep-0.3.0.md(this file) - [x]
docs/skylakegrep-0.3.0.html(themed render) - [x]
README.mdv0.3.0 in pill text - [x]
docs/index.htmlv0.3.0 - [x]
docs/changelog.html— 0.3.0 release card - [x] All 6 SVG version pills bumped 0.2.21 → 0.3.0
- [x]
docs/assets/og-image.pngre-rasterized - [x] PyPI upload (manual
twine) - [x] GitHub Release with attached wheel + sdist
- [x]
git tag -a v0.3.0+ push - [x] Plan file (
2026-05-05-graph-prior-folder-inference.md) updated with done / pending status markers per phase
Tests
- 221 / 221 pass (was 201 / 201 in 0.2.21; 20 new tests in
tests/test_graph_walk.pycover tokeniser, graph CRUD, PPR walk convergence on synthetic graphs, cold-start seed mapping, edge builder counts, and cascade integration smoke test)
Acknowledgments
User's articulated vision (2026-05-06): cold-start inference + multi- edge knowledge graph + diffusion-style traversal + adaptive local indexing + hierarchical fallback. Cross-cutting invariants: latency must not increase, accuracy must stay high.
This release delivers the architectural core (G-0 + G-2 + G-3 + G-4) in one shot. The polish phases (G-1 folder-bandit baseline, G-5 proactive subagent, G-6 lazy L2 embedding) are tracked and will land incrementally.