Concept · indexing model
Source files become line-attributed chunks with dense vectors.
Indexing is divided into five stages, all of which run in-process and commit to a single SQLite file:
- Discover. Recursively walk the repository, restricted to a fixed set of supported source extensions. Apply the union of .gitignore, .skygrepignore, and a built-in skip-list of build/cache/vendor directories.
- Chunk. When a tree-sitter grammar is available for the language, walk the AST and emit nodes whose span is within configured size limits (≤ 50 lines and ≤ 1000 chars by default, with a minimum of 3 lines). Otherwise fall back to a deterministic line-window splitter.
- Embed. Send chunks in batches to the Ollama /api/embed endpoint; fall back to single-text /api/embeddings requests if the batch endpoint is unavailable on the running model.
- Persist. Insert one row into chunks and one parallel row into vectors, keyed by the same integer id. Embeddings are stored as raw float32 BLOBs.
- Reconcile. On incremental runs, compare disk mtime against file_mtime in the index. Reindex changed files; delete rows for files no longer present under the indexed root.
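The line-window fallback named in the Chunk stage can be sketched as follows. This is a minimal illustration, not the real skylakegrep implementation: the function name is hypothetical, and only the documented defaults (windows of at most 50 lines, minimum 3 lines) are taken from the text above.

```python
def line_window_chunks(text: str, max_lines: int = 50, min_lines: int = 3):
    """Yield (start_line, end_line, chunk_text) tuples, 1-indexed inclusive.

    Deterministic: the same input always produces the same windows.
    Trailing windows shorter than min_lines are skipped in this sketch;
    how skylakegrep handles the tail is not specified in the docs.
    """
    lines = text.splitlines()
    for start in range(0, len(lines), max_lines):
        window = lines[start:start + max_lines]
        if len(window) < min_lines:
            break  # remainder too small to index on its own
        yield (start + 1, start + len(window), "\n".join(window))
```

A 120-line file would split into windows covering lines 1-50, 51-100, and 101-120 under these defaults.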
Supported languages
The extension-to-parser map lives in skylakegrep/src/indexer.py: .py, .js, .ts, .tsx, .jsx, .go, .rs, .java, .c, .cpp, .h, .cs, .rb, .php, .swift, .kt, .scala, .vue, .svelte. When a tree-sitter grammar package is missing for a supported extension, the file still indexes via the line-window fallback.
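The selection logic this implies can be sketched as below. The dict entries, the set of installed grammars, and the helper name are all illustrative stand-ins; the real map lives in skylakegrep/src/indexer.py and is not reproduced here.

```python
# A few representative entries from the documented extension list.
EXT_TO_LANGUAGE = {
    ".py": "python",
    ".ts": "typescript",
    ".go": "go",
    ".rs": "rust",
}

# Stand-in for "which tree-sitter grammar packages are installed".
AVAILABLE_GRAMMARS = {"python", "go"}

def choose_chunker(path: str) -> str:
    """Return 'ast' when a grammar is installed, else 'line-window'."""
    ext = path[path.rfind("."):] if "." in path else ""
    lang = EXT_TO_LANGUAGE.get(ext)
    if lang in AVAILABLE_GRAMMARS:
        return "ast"
    # Supported extension without an installed grammar, or an
    # unsupported file: both take the line-window fallback.
    return "line-window"
```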
Concept · ranking
Hybrid scoring with capped per-file diversification.
A query produces a single ranked list. Scoring proceeds in four steps:
- Cosine. Embed the query with the same model used at index time and compute cosine similarity against the candidate embedding matrix in a single NumPy pass.
- Lexical adjustment. When the raw query text is available and --semantic-only is not set, compute a token-and-phrase overlap score against the chunk and its file path, then take a weighted sum: 0.8 · cosine + 0.2 · lexical.
- Span deduplication. Drop later candidates whose (file, start_line, end_line, text) tuple has already been observed.
- Diversification. Cap the number of accepted candidates per file at 2 before filling the remaining top-k slots from the overflow.
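The four steps above can be sketched in one pass. The candidate representation and function signature are illustrative assumptions; only the 0.8/0.2 blend, the span-tuple dedup key, and the per-file cap of 2 come from the description above.

```python
import numpy as np

LEXICAL_WEIGHT = 0.2
MAX_RESULTS_PER_FILE = 2

def rank(query_vec, cand_vecs, candidates, lexical_scores, top_k=10):
    """candidates: list of (file, start_line, end_line, text) tuples."""
    # Step 1: cosine similarity in a single matrix-vector pass.
    q = query_vec / np.linalg.norm(query_vec)
    m = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    cosine = m @ q
    # Step 2: weighted blend with the lexical overlap score.
    scores = (1 - LEXICAL_WEIGHT) * cosine + LEXICAL_WEIGHT * np.asarray(lexical_scores)
    order = np.argsort(-scores)
    seen, per_file, picked, overflow = set(), {}, [], []
    for i in order:
        cand = candidates[i]
        if cand in seen:          # step 3: span deduplication
            continue
        seen.add(cand)
        f = cand[0]
        if per_file.get(f, 0) < MAX_RESULTS_PER_FILE:
            per_file[f] = per_file.get(f, 0) + 1
            picked.append((cand, float(scores[i])))    # diversified slot
        else:
            overflow.append((cand, float(scores[i])))  # step 4: overflow
    return (picked + overflow)[:top_k]
```

With three near-identical chunks from one file and one weaker chunk from another, the third same-file chunk drops behind the other file's hit, which is the diversification behavior described above.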
The numerical defaults (LEXICAL_WEIGHT = 0.2,
MAX_RESULTS_PER_FILE = 2, term/phrase weights
0.75/0.25) are module-level constants in
skylakegrep/src/storage.py. They are not exposed
through CLI flags in this release.
Metadata as a facet, not an intent
Filesystem metadata such as opened time, modified time, created time, and size is treated as a query-plan facet. If metadata fully answers the question, skygrep uses a fast filesystem lane and skips semantic work. If the query still contains a target or content constraint, metadata becomes a modifier: retrieval still finds relevant files, and metadata only reranks those already-relevant candidates.
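The routing decision this paragraph describes can be sketched as a small planner. The query structure and return labels are invented for illustration; only the three-way split (filesystem lane, semantic retrieval, metadata-as-reranker) comes from the text.

```python
def plan(query: dict) -> str:
    """Route a parsed query to a retrieval lane.

    query: dict with optional 'content' text and 'metadata' constraints
    (e.g. mtime/ctime/size filters). Both keys are illustrative.
    """
    has_content = bool(query.get("content"))
    has_metadata = bool(query.get("metadata"))
    if has_metadata and not has_content:
        # Metadata fully answers the question: skip semantic work.
        return "filesystem-lane"
    if has_content and has_metadata:
        # Retrieval finds relevant files; metadata only reranks them.
        return "semantic+rerank"
    return "semantic"
```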
Concept · output
Three rendering modes.
Human-readable (default)
Plain text records, one per result, ordered by descending score:
=== skylakegrep/src/storage.py:197-271 (score: 0.824) ===
def search(conn, query_embedding, top_k=10, …):
query_vec = np.array(query_embedding, dtype=np.float32)
…
JSON (--json)
Stable output for scripts and coding agents. In 0.5.13, JSON agent
calls can also include optional candidate-recall provenance fields and
same-file support evidence when --content is requested. See
JSON schema.
Synthesized answer (--answer)
The retrieved snippets are passed as context to a local Ollama generation model along with a fixed system prompt. The model is instructed to cite file paths and line ranges and to refrain from asking follow-up questions. The original ranked sources are still printed below the synthesized answer.
--agentic precedes the search with an LLM step that
decomposes the question into up to --max-subqueries
related queries; each is searched independently and the results are
merged by score.
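The merge step at the end of --agentic can be sketched as below. Deduplicating on the (file, start, end) span and keeping the best score across subqueries is an assumption about the merge, not confirmed behavior; the function name is illustrative.

```python
def merge_by_score(result_lists, top_k=10):
    """Merge per-subquery result lists into one ranked list.

    result_lists: iterable of lists of ((file, start, end), score) pairs,
    one list per subquery, each already scored independently.
    """
    best = {}
    for results in result_lists:
        for span, score in results:
            if score > best.get(span, float("-inf")):
                best[span] = score  # keep the highest score per span
    merged = sorted(best.items(), key=lambda kv: -kv[1])
    return merged[:top_k]
```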