Code Indexing¶
Purpose¶
The code-index POC provides a local Qdrant-backed search index for this bf-dev workspace.
It is intended for developer-only exploration and agent assistance, not for production traffic or CI hosting.
The Qdrant REST and gRPC ports are bound to 127.0.0.1 only by default.
What It Includes¶
- A dedicated Qdrant service started with ./code-index-backend.
- A Python CLI at ./code-index with rebuild, update, status, validate, and query.
- Persisted local state under bfd-caches/code-index/, with SQLite metadata in bfd-caches/code-index/state.sqlite3 by default.
- An optional .pre-commit-config.yaml hook that runs ./code-index update.
- Separate Qdrant collections per repo root, with workspace-wide queries fanning out across them.
- Tree-sitter-aware chunking when a parser is available, with a newline-aware fallback when it is not.
Setup¶
- Install the repo-managed tool dependencies:
uv sync --dev
- Start the isolated Qdrant service:
The REST API listens on 127.0.0.1:${CODE_INDEX_QDRANT_PORT:-6333} and the built-in dashboard is available at http://127.0.0.1:6333/dashboard by default.
- Optionally copy the example config and adjust it:
- Build the first index:
./code-index rebuild
To rebuild only selected repo roots, repeat --project. Use root for bf-dev files at the workspace root:
./code-index rebuild --project bf-manage-core --project bf-manage-web
./code-index rebuild --project root
./code-index rebuild --project bf-manage-roaming --ignore '**/build/**' --ignore '**/dist/**'
The code-index CLI resolves the state database path from state_dir in .code-index.json and stores the database at <state_dir>/state.sqlite3. With the default config, that is bfd-caches/code-index/state.sqlite3.
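The path resolution described above can be sketched in a few lines of Python. This is an illustration only, not the CLI's actual code: the function name resolve_state_path is hypothetical, and the default state_dir is inferred from the default path quoted above.

```python
import json
from pathlib import Path

# Assumption: the default state_dir, inferred from the documented default path.
DEFAULT_STATE_DIR = "bfd-caches/code-index"


def resolve_state_path(workspace_root: Path) -> Path:
    """Return <state_dir>/state.sqlite3 for the given workspace root."""
    config_path = workspace_root / ".code-index.json"
    state_dir = DEFAULT_STATE_DIR
    if config_path.is_file():
        # .code-index.json is optional; only override when it sets state_dir.
        config = json.loads(config_path.read_text(encoding="utf-8"))
        state_dir = config.get("state_dir", DEFAULT_STATE_DIR)
    return workspace_root / state_dir / "state.sqlite3"
```

Calling resolve_state_path(Path.cwd()) from the workspace root should give the same location that ./code-index status reports, provided the real CLI follows the rule stated above.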
Default Workflow¶
Use update as the normal local workflow:
./code-index update
Add --profile to rebuild, update, status, validate, or query when you want timing data in the JSON response:
./code-index update --profile
update compares tracked files against the last saved manifest and only refreshes changed or deleted files.
This keeps incremental indexing as the default path for day-to-day work.
update runs a cheap stat scan first and hashes file contents only when the saved metadata suggests a file may have changed, which avoids reindexing files whose only change is their mtime.
You can scope the same workflow to selected repo roots with --project, and add one-off path globs with --ignore.
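The stat-then-hash check that update relies on can be sketched as follows. This is a minimal illustration, not the indexer's real code: the SavedFile record and the needs_reindex helper are hypothetical, SHA-256 is an assumed hash choice, and only the size and mtime_ns fields correspond to columns the files table is documented to have.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SavedFile:
    # Hypothetical manifest record; size and mtime_ns mirror the files table.
    size: int
    mtime_ns: int
    sha256: str  # assumption: some content hash is stored alongside the stat data


def file_sha256(path: str) -> str:
    """Hash a file in fixed-size blocks so large files are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_reindex(path: str, saved: Optional[SavedFile]) -> bool:
    if saved is None:
        return True  # file was never indexed
    stat = os.stat(path)
    if stat.st_size == saved.size and stat.st_mtime_ns == saved.mtime_ns:
        return False  # cheap path: metadata unchanged, skip hashing
    # Metadata changed: hash the contents to separate real edits from mtime-only churn.
    return file_sha256(path) != saved.sha256
```

The design point is that hashing only runs for files whose stat data changed, so a checkout that merely touches timestamps costs one extra hash per touched file and no re-embedding.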
Use status to inspect the local state:
./code-index status
status prints the resolved SQLite path in config.state_path and includes the persisted metadata snapshot in state. For deeper inspection, use the local sqlite3 shell against that path.
Use validate when you want an integrity check across the saved SQLite state, the current discovered file set, and the scoped Qdrant collections without mutating anything:
./code-index validate
validate is a read-only command. It does not repair drift, rewrite SQLite metadata, re-embed files, or recreate collections. Use it when status shows something suspicious, after local migrations or manual recovery work, or when you want a machine-readable integrity report before deciding whether a later update or rebuild is needed.
When validate completes successfully, it exits 0 if no integrity issues were found and 2 if the validation pass found drift. Operational failures still exit non-zero as normal command errors.
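A wrapper script can branch on these exit codes. The sketch below assumes it runs from the workspace root where ./code-index lives; the helper names are hypothetical, and treating every code other than 0 and 2 as an operational failure is an interpretation of the paragraph above rather than a documented contract.

```python
import subprocess


def classify_validate_exit(returncode: int) -> str:
    """Map a validate exit code to a coarse outcome label."""
    if returncode == 0:
        return "clean"  # validation ran and found no integrity issues
    if returncode == 2:
        return "drift"  # validation ran and found drift
    return "error"      # assumption: anything else is an operational failure


def run_validate() -> str:
    # check=False because a non-zero exit is an expected, meaningful result here.
    completed = subprocess.run(["./code-index", "validate"], check=False)
    return classify_validate_exit(completed.returncode)
```

A caller might run update or rebuild only when run_validate() returns "drift", and surface "error" to the user without attempting any repair.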
At a high level, the JSON response includes a validation summary with issue categories such as:
- state chunks missing from Qdrant
- orphan chunks present in Qdrant but not in saved state
- payload mismatches for chunks that exist in both places
- collection count drift between saved state and Qdrant
- file snapshot drift between saved state and the current working tree
- discovery drift between saved state paths and the current Git-tracked discovery set
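The first two categories amount to set differences between the chunk IDs recorded in SQLite and the point IDs present in Qdrant. The following is a sketch of that idea only; the function and the result keys are hypothetical and do not reflect the real JSON field names.

```python
def compare_chunk_ids(state_ids: set[str], qdrant_ids: set[str]) -> dict[str, list[str]]:
    """Classify chunk IDs that exist on only one side."""
    return {
        # Saved state expects these chunks, but Qdrant does not have them.
        "missing_from_qdrant": sorted(state_ids - qdrant_ids),
        # Qdrant holds these chunks, but saved state does not know about them.
        "orphans_in_qdrant": sorted(qdrant_ids - state_ids),
    }
```

For example, comparing {"a", "b", "c"} from state against {"b", "c", "d"} from Qdrant reports "a" as missing and "d" as an orphan.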
Use status for a fast local snapshot, validate for a deeper read-only integrity pass, and update when you want to bring the index forward to match current tracked files.
Use query to search the local collection:
./code-index query "where is profile.ini loaded"
./code-index query --project bf-manage-core "incident model"
SQLite State Inspection¶
The code-index state store is a local SQLite database, not a committed project artifact. By default it lives at:
bfd-caches/code-index/state.sqlite3
You can confirm the resolved path at any time:
./code-index status
Inspect the schema and metadata with sqlite3:
sqlite3 bfd-caches/code-index/state.sqlite3 ".tables"
sqlite3 bfd-caches/code-index/state.sqlite3 "SELECT key, value FROM metadata ORDER BY key;"
sqlite3 bfd-caches/code-index/state.sqlite3 "SELECT path, collection_name, size, mtime_ns FROM files ORDER BY path LIMIT 20;"
sqlite3 bfd-caches/code-index/state.sqlite3 "SELECT path, chunk_ordinal, chunk_id FROM file_chunks ORDER BY path, chunk_ordinal LIMIT 20;"
Expected tables are metadata, files, and file_chunks.
SQLite runs in WAL mode for this state store. While the indexer or another SQLite client has the database open, sidecar files such as state.sqlite3-wal and state.sqlite3-shm may be present beside the main database file. That is expected local SQLite behavior; do not delete those files while the database is in use.
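For scripted inspection, the same tables can be read from Python's standard sqlite3 module. This is a minimal sketch: the summarize_state helper is hypothetical, and it requests a read-only connection so that inspection cannot modify the state store.

```python
import sqlite3
from pathlib import Path


def summarize_state(db_path: str) -> dict[str, int]:
    """Return row counts for the three expected tables."""
    # mode=ro asks SQLite for a read-only connection via a file: URI.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        counts = {}
        # Table names come from the fixed list documented above, not user input.
        for table in ("metadata", "files", "file_chunks"):
            counts[table] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        return counts
    finally:
        conn.close()
```

Running summarize_state("bfd-caches/code-index/state.sqlite3") before and after an update gives a quick view of how many file and chunk rows changed.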
Manual Recovery¶
Use rebuild when the collection is missing, the schema/config changed, or validate confirms drift that you want to replace with a clean rebuild:
./code-index rebuild
rebuild is the manual full refresh escape hatch.
Do not use it as the default hook or commit-time workflow.
With no --project, rebuild recreates the whole workspace index. With one or more --project flags, it only recreates those project collections and keeps the rest of the state intact.
Optional Pre-commit Integration¶
This repo includes a local-only .pre-commit-config.yaml entry for ./code-index update.
It does not write directly into .git/hooks/.
If you want to enable it:
Config Contract¶
.code-index.json is optional.
If present, it can override:
- state_dir
- chunk_size
- chunk_overlap
- max_file_bytes
- update_batch_size
- query_limit
- query_multiplier
- include_extensions
- exclude_dirs
- exclude_path_globs
- filetype_map
- chunk_filters
- qdrant.host
- qdrant.port
- qdrant.grpc_port
- qdrant.https
- qdrant.api_key_env
- qdrant.collection_name acts as the collection name prefix. Each repo is indexed into <prefix>__root or <prefix>__<repo-name>.
- qdrant.on_disk
- qdrant.hnsw.distance
- qdrant.hnsw.ef_construct
- qdrant.hnsw.search_ef
- qdrant.hnsw.m
- qdrant.hnsw.full_scan_threshold
- qdrant.hnsw.max_indexing_threads
- qdrant.hnsw.on_disk
- qdrant.hnsw.payload_m
- embedding_provider.kind
- embedding_provider.model
- embedding_provider.base_url
- embedding_provider.api_key_env
- embedding_provider.options
query_multiplier controls query over-fetch before the service deduplicates chunk hits down to the best result per file.
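The over-fetch and deduplication step can be illustrated like this. It is a sketch under stated assumptions: the Hit shape and both helpers are hypothetical, the over-fetch is assumed to be query_limit multiplied by query_multiplier, and a higher score is assumed to mean a better match.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical shape of one chunk-level search result.
    path: str
    score: float
    chunk_id: str


def overfetch_limit(query_limit: int, query_multiplier: int) -> int:
    """Number of chunk hits to request before deduplicating (assumed rule)."""
    return query_limit * query_multiplier


def best_hit_per_file(hits: list[Hit], query_limit: int) -> list[Hit]:
    """Keep the highest-scoring chunk per file, then return the top results."""
    best: dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.path)
        if current is None or hit.score > current.score:
            best[hit.path] = hit
    ranked = sorted(best.values(), key=lambda h: h.score, reverse=True)
    return ranked[:query_limit]
```

Over-fetching matters because several top-ranked chunks often come from the same file; without it, deduplication could leave fewer than query_limit distinct files in the response.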
filetype_map lets you force a parser/language for unusual file names or extensions using a simple name-or-suffix-to-language mapping.
chunk_filters applies regex-based chunk exclusion by language before chunks are upserted, using either a language-keyed object or an explicit rule list.
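As an illustration of the name-or-suffix lookup that filetype_map describes, consider the sketch below. Everything in it is an assumption rather than the indexer's real behavior: the resolve_language helper is hypothetical, suffix keys are assumed to start with a dot, and exact file names are assumed to take precedence over suffixes.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_language(path: str, filetype_map: dict[str, str]) -> Optional[str]:
    """Return the forced language for a path, or None when no entry matches."""
    name = PurePosixPath(path).name
    if name in filetype_map:
        return filetype_map[name]  # assumption: an exact file name wins
    # Try longer suffixes first so a key like ".d.ts" is checked before ".ts".
    for key in sorted(filetype_map, key=len, reverse=True):
        if key.startswith(".") and name.endswith(key):
            return filetype_map[key]
    return None
```

With a map such as {"BUILD": "python", ".mjs": "javascript"}, the path tools/BUILD resolves to python and src/app.mjs resolves to javascript, while unmatched files fall through to normal detection.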
The default provider uses local Qdrant FastEmbed integration, and the CLI also supports openai, ollama, and sentence-transformer embedding functions when the relevant local dependencies and credentials are available.
Tree-sitter chunking is opportunistic for an allowlisted set of code-heavy languages such as Python, TypeScript, TSX, JavaScript, Go, Java, Rust, C, and C++; if parser resolution fails, the indexer falls back to newline-aware window chunking.
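A newline-aware window chunker of the kind used as the fallback can be sketched as follows. This is one plausible reading, not the indexer's implementation: the chunk_by_lines helper is hypothetical, and chunk_size and chunk_overlap are assumed to be measured in characters.

```python
def chunk_by_lines(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of whole lines, carrying trailing lines as overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks: list[str] = []
    window: list[str] = []
    window_len = 0
    for line in text.splitlines(keepends=True):
        if window and window_len + len(line) > chunk_size:
            chunks.append("".join(window))
            # Carry whole trailing lines forward, up to chunk_overlap characters.
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(window):
                if carried_len + len(prev) > chunk_overlap:
                    break
                carried.insert(0, prev)
                carried_len += len(prev)
            # Drop the overlap if it would push the next window past chunk_size.
            if carried_len + len(line) > chunk_size:
                carried, carried_len = [], 0
            window, window_len = carried, carried_len
        window.append(line)
        window_len += len(line)
    if window:
        chunks.append("".join(window))
    return chunks
```

Lines are never split, so a single line longer than chunk_size becomes its own oversized chunk. Whether the real fallback behaves that way is not stated in this document.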
The default max_file_bytes is 256000, so larger files are skipped unless you raise that limit in .code-index.json.
CLI --ignore flags append extra globs on top of exclude_path_globs for a single run.
Environment overrides are also supported:
- CODE_INDEX_QDRANT_HOST
- CODE_INDEX_QDRANT_PORT
- CODE_INDEX_QDRANT_GRPC_PORT
- CODE_INDEX_COLLECTION
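One way such overrides are commonly layered is environment first, then .code-index.json, then built-in defaults. The sketch below assumes that precedence, which this document does not state explicitly; the resolve_qdrant_endpoint helper is hypothetical, and the defaults are the 127.0.0.1 host and 6333 REST port mentioned in Setup.

```python
import os
from typing import Mapping


def resolve_qdrant_endpoint(
    config: Mapping[str, dict],
    env: Mapping[str, str] = os.environ,
) -> tuple[str, int]:
    """Resolve the Qdrant REST host and port from env, config, then defaults."""
    qdrant = config.get("qdrant", {})
    # Assumption: a set, non-empty environment variable beats the config file.
    host = env.get("CODE_INDEX_QDRANT_HOST") or qdrant.get("host", "127.0.0.1")
    port = int(env.get("CODE_INDEX_QDRANT_PORT") or qdrant.get("port", 6333))
    return host, port
```

Passing env explicitly, as in resolve_qdrant_endpoint(config, {"CODE_INDEX_QDRANT_PORT": "7333"}), keeps the function easy to exercise without touching the real process environment.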
Migrating Existing Local State¶
If you already have legacy local state from an older checkout that used bfd-caches/code-index/state.json, decide which path you want before running the new SQLite-backed workflow:
- If you do not need to preserve the old incremental state, delete the old JSON file and run ./code-index rebuild.
- If you do want to preserve it, first stop any running code-index work, ensure uv sync --dev has completed, ensure the Qdrant backend is available if you plan to validate counts, and import the JSON into bfd-caches/code-index/state.sqlite3 with a one-time local migration script.
Migration prerequisites:
- No concurrent ./code-index rebuild or ./code-index update process is running.
- You know the target SQLite path (./code-index status reports it as config.state_path).
- You keep the legacy state.json untouched until you have validated the imported row counts and spot-checked a few file-to-chunk mappings with sqlite3.
The migration helper is intentionally temporary and local-only. Do not commit it. After a successful import and validation, delete the temporary migration script from your working tree.
Rollback¶
To stop using the POC entirely:
- Stop and remove the isolated service:
- Remove the local code-index state, including the SQLite database and any WAL sidecars:
- If you enabled pre-commit, uninstall it or remove the local hook from .pre-commit-config.yaml.
If you only need to recover from bad local SQLite state, prefer ./code-index rebuild first. That recreates the indexed state without requiring a full feature rollback.
If you created a one-time migration helper for legacy state.json import, delete that script as part of rollback or immediately after a successful migration. It is a temporary local tool and should not remain in the repo.
Constraints¶
- This is a local-only proof of concept.
- Default indexing excludes common secret-bearing paths and file names, but it is still your responsibility not to index sensitive material carelessly.
- The default provider uses local Qdrant FastEmbed integration unless you opt into another provider in .code-index.json.
- Tests are intentionally written to avoid requiring a live Qdrant service.
- The CLI runs through uv run --dev, so uv sync --dev must complete successfully first.
- Query quality depends on the configured embedding provider, tree-sitter parser availability, and current chunk/query settings.
- The standing Qdrant service is treated as persistent local state; destructive rebuild behavior remains an explicit CLI workflow.
- Discovery currently follows Git-tracked files only; untracked files are ignored for now.
- SQLite writes use WAL mode and immediate write transactions for local durability; avoid manually deleting state.sqlite3-wal or state.sqlite3-shm while code-index commands are running.