skylakegrep Concepts

Concept · indexing model

Source files become line-attributed chunks with dense vectors.

Indexing is divided into five stages, all of which run in-process and commit to a single SQLite file:

  1. Discover. Recursively walk the repository, restricted to a fixed set of supported source extensions. Apply the union of .gitignore, .skygrepignore, and a built-in skip-list of build/cache/vendor directories.
  2. Chunk. When a tree-sitter grammar is available for the language, walk the AST and emit nodes whose span is within configured size limits (≤ 50 lines and ≤ 1000 chars by default, with a minimum of 3 lines). Otherwise fall back to a deterministic line-window splitter.
  3. Embed. Send chunks in batches to the Ollama /api/embed endpoint; fall back to single-text /api/embeddings requests if the batch endpoint is unavailable on the running model.
  4. Persist. Insert one row into chunks and one parallel row into vectors, keyed by the same integer id. Embeddings are stored as raw float32 BLOBs.
  5. Reconcile. On incremental runs, compare disk mtime against file_mtime in the index. Reindex changed files; delete rows for files no longer present under the indexed root.
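The line-window fallback in stage 2 can be sketched as follows. Only the ≤ 50-line and ≥ 3-line limits come from the text above; the windowing policy shown here (non-overlapping windows, undersized tails dropped) is an assumption of this sketch, not skylakegrep's exact splitter:

```python
def line_window_chunks(text, max_lines=50, min_lines=3):
    """Deterministic fallback splitter for files without a tree-sitter
    grammar. Emits (start_line, end_line, chunk_text) tuples with
    1-based, inclusive line attribution, matching the chunk rows.
    Non-overlapping windows and the dropped tail are assumptions."""
    lines = text.splitlines()
    chunks = []
    for start in range(0, len(lines), max_lines):
        window = lines[start:start + max_lines]
        if len(window) < min_lines:
            break  # tail is below the minimum chunk size
        chunks.append((start + 1, start + len(window), "\n".join(window)))
    return chunks
```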

Supported languages

The extension-to-parser map lives in skylakegrep/src/indexer.py: .py, .js, .ts, .tsx, .jsx, .go, .rs, .java, .c, .cpp, .h, .cs, .rb, .php, .swift, .kt, .scala, .vue, .svelte. When a tree-sitter grammar package is missing for a supported extension, the file still indexes via the line-window fallback.
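The dispatch between the two chunkers can be pictured roughly like this; `EXT_TO_LANGUAGE` and `pick_chunker` are illustrative names, not the identifiers actually used in indexer.py, and the map excerpt is truncated:

```python
import os

# Illustrative excerpt only; the real map in skylakegrep/src/indexer.py
# covers all nineteen extensions listed above.
EXT_TO_LANGUAGE = {
    ".py": "python", ".go": "go", ".rs": "rust",
    ".ts": "typescript", ".tsx": "tsx", ".java": "java",
}

def pick_chunker(path, available_grammars):
    """Decide how a discovered file will be chunked."""
    lang = EXT_TO_LANGUAGE.get(os.path.splitext(path)[1].lower())
    if lang is not None and lang in available_grammars:
        return "tree-sitter"   # AST-based chunking (stage 2, primary)
    return "line-window"       # deterministic fallback
```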

Concept · ranking

Hybrid scoring with capped per-file diversification.

A query produces a single ranked list. Scoring proceeds in four steps:

  1. Cosine. Embed the query with the same model used at index time and compute cosine similarity against the candidate embedding matrix in a single NumPy pass.
  2. Lexical adjustment. When the raw query text is available and --semantic-only is not set, compute a token-and-phrase overlap score against the chunk and its file path, then take a weighted sum: 0.8 · cosine + 0.2 · lexical.
  3. Span deduplication. Drop later candidates whose (file, start_line, end_line, text) tuple has already been observed.
  4. Diversification. Cap the number of accepted candidates per file at 2 before filling the remaining top-k slots from the overflow.
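Steps 1 and 2 amount to a normalized matrix-vector product followed by a fixed blend. A minimal NumPy sketch, assuming the 0.8/0.2 split stated above (function and argument names are illustrative, and lexical_scores is taken as precomputed values in [0, 1]):

```python
import numpy as np

LEXICAL_WEIGHT = 0.2  # default weight stated in the text

def hybrid_scores(query_vec, emb_matrix, lexical_scores, semantic_only=False):
    """Single-pass cosine against all candidates, then the 0.8/0.2 blend.
    Shapes: query_vec (d,), emb_matrix (n, d), lexical_scores (n,)."""
    q = query_vec / np.linalg.norm(query_vec)
    m = emb_matrix / np.linalg.norm(emb_matrix, axis=1, keepdims=True)
    cosine = m @ q                       # one NumPy pass over candidates
    if semantic_only:                    # mirrors --semantic-only
        return cosine
    return (1 - LEXICAL_WEIGHT) * cosine + LEXICAL_WEIGHT * lexical_scores
```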

The numerical defaults (LEXICAL_WEIGHT = 0.2, MAX_RESULTS_PER_FILE = 2, term/phrase weights 0.75/0.25) are module-level constants in skylakegrep/src/storage.py. They are not exposed through CLI flags in this release.
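Steps 3 and 4 can be sketched as a single pass over the ranked candidates; the tuple layout and function name here are assumptions, with only the per-file cap of 2 and the (file, start_line, end_line, text) dedup key taken from the text:

```python
from collections import defaultdict

MAX_RESULTS_PER_FILE = 2  # default noted above

def dedupe_and_diversify(ranked, top_k):
    """ranked: (score, file, start, end, text) tuples, best first.
    Drop exact span repeats, cap hits per file, and backfill any
    unused top-k slots from the overflow, preserving rank order."""
    seen_spans, per_file = set(), defaultdict(int)
    accepted, overflow = [], []
    for cand in ranked:
        span = cand[1:]                  # (file, start, end, text)
        if span in seen_spans:
            continue                     # step 3: span deduplication
        seen_spans.add(span)
        if per_file[cand[1]] < MAX_RESULTS_PER_FILE:
            per_file[cand[1]] += 1
            accepted.append(cand)        # step 4: within the cap
        else:
            overflow.append(cand)        # step 4: held for backfill
    return (accepted + overflow)[:top_k]
```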

Metadata as a facet, not an intent

Filesystem metadata such as opened time, modified time, created time, and size is treated as a query-plan facet. If metadata fully answers the question, skygrep uses a fast filesystem lane and skips semantic work. If the query still contains a target or content constraint, metadata becomes a modifier: retrieval still finds relevant files, and metadata only reranks those already-relevant candidates.
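As a sketch, the facet decision reduces to a three-way branch; the function and lane names below are hypothetical, not skygrep's actual planner API:

```python
def plan_query(wants_metadata, wants_content):
    """Hypothetical planner sketch of the facet rule described above:
    metadata-only queries take the fast filesystem lane; mixed queries
    retrieve semantically and use metadata only to rerank."""
    if wants_metadata and not wants_content:
        return "filesystem"        # metadata fully answers the query
    if wants_metadata:
        return "retrieve+rerank"   # metadata modifies relevance only
    return "retrieve"
```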

Concept · output

Three rendering modes.

Human-readable (default)

Plain text records, one per result, ordered by descending score:

=== skylakegrep/src/storage.py:197-271 (score: 0.824) ===
def search(conn, query_embedding, top_k=10, …):
    query_vec = np.array(query_embedding, dtype=np.float32)
    …

JSON (--json)

Stable output for scripts and coding agents. In 0.5.13, JSON output for agent calls can also include optional candidate-recall provenance fields and, when --content is requested, same-file support evidence. See JSON schema.

Synthesized answer (--answer)

The retrieved snippets are passed as context to a local Ollama generation model along with a fixed system prompt. The model is instructed to cite file paths and line ranges and to refrain from asking follow-up questions. The original ranked sources are still printed below the synthesized answer.

--agentic precedes the search with an LLM step that decomposes the question into up to --max-subqueries related queries; each is searched independently and the results are merged by score.
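The merge at the end of the --agentic flow can be sketched as a span-keyed, best-score union; the tuple shape and function name are assumptions of this sketch:

```python
def merge_subquery_results(result_lists, top_k):
    """Merge independently searched subqueries: union their results,
    keep the best score per (file, start, end) span, re-sort by score.
    result_lists: one list of (score, file, start, end) per subquery."""
    best = {}
    for results in result_lists:
        for score, *span in results:
            key = tuple(span)
            if key not in best or score > best[key]:
                best[key] = score      # keep the strongest occurrence
    merged = sorted(((s, *k) for k, s in best.items()), reverse=True)
    return merged[:top_k]
```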