Guide AI agents through systematic performance analysis of DuckDB's VSS extension using functional programming patterns, Korean text data, and parallel experiment execution.
This skill guides AI agents working on a DuckDB benchmarking project that systematically analyzes the performance characteristics of DuckDB's VSS (Vector Similarity Search) extension in text vector search scenarios.
The project follows a **functional programming paradigm** and runs 48 experimental combinations, varying parameters such as data scale, vector dimension, and search type.
First, verify and set up the Python environment:
```bash
# Install dependencies (uv manages the environment; pip shown as a fallback)
uv sync
pip install duckdb faker pandas numpy psutil matplotlib seaborn plotly pyrsistent

# Verify that DuckDB and the VSS extension are available
python test_duckdb_vss_installation.py
python -c "import duckdb; conn = duckdb.connect(); print(conn.execute('SELECT * FROM duckdb_extensions() WHERE extension_name = \'vss\'').fetchall())"
python -c "import duckdb; print(f'DuckDB version: {duckdb.__version__}')"

# Run the test suite
pytest tests/ -v
```
Navigate the functional architecture:
```
src/
├── types/                    # Type definitions (frozen dataclasses)
├── pure/                     # Pure functions (no side effects)
│   ├── generators/           # Data generation
│   ├── transformers/         # Data transformations
│   └── calculators/          # Metrics and analysis
├── effects/                  # Side effect management
│   ├── db/                   # Database IO operations
│   ├── io/                   # File IO operations
│   └── metrics/              # Performance monitoring
├── pipelines/                # Function composition pipelines
└── runners/                  # Main entry points
    ├── experiment_runner.py  # CLI experiment runner with parallel support
    ├── parallel_runner.py    # Parallel execution engine
    ├── checkpoint.py         # Checkpoint management
    └── monitoring.py         # Resource monitoring
```
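The `pipelines/` layer composes pure functions into end-to-end flows. A minimal sketch of what such a composition helper can look like (the name `pipe` and the stage functions here are illustrative, not the project's actual API):

```python
from functools import reduce
from typing import Callable

def pipe(*fns: Callable) -> Callable:
    """Compose functions left to right: pipe(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), fns, x)

# Illustrative stages: each takes a dict and returns a new dict (no mutation).
to_lower = lambda doc: {**doc, "text": doc["text"].lower()}
add_length = lambda doc: {**doc, "length": len(doc["text"])}

process = pipe(to_lower, add_length)
print(process({"text": "DuckDB VSS"}))  # {'text': 'duckdb vss', 'length': 10}
```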
Execute benchmarks using the experiment runner:
```bash
# Run all 48 experiments in parallel
python -m src.runners.experiment_runner --all --parallel

# Limit worker count and memory usage
python -m src.runners.experiment_runner --all --parallel --workers 6 --max-memory 8000

# Run a subset: small data scale, 128- and 256-dimensional vectors
python -m src.runners.experiment_runner --data-scale small --dimensions 128,256 --parallel

# Resume an interrupted run from its checkpoints
python -m src.runners.experiment_runner --resume --checkpoint-dir checkpoints/

# Monitor a running experiment's resource usage
python -m src.tools.monitor --experiment-dir experiments/
```
When modifying or extending code, adhere to these patterns:
**Immutability**: Model configuration and results as frozen dataclasses
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    data_scale: int
    dimension: int
    search_type: str
```
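Frozen dataclasses reject mutation after construction; derived configs are created with `dataclasses.replace` instead. A quick illustration (the field values here are made up):

```python
import dataclasses
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    data_scale: int
    dimension: int
    search_type: str

base = ExperimentConfig(data_scale=10_000, dimension=128, search_type="vector")

try:
    base.dimension = 256  # mutation is rejected on a frozen dataclass
except dataclasses.FrozenInstanceError:
    pass

# Derive a new, independent config instead of mutating in place.
variant = dataclasses.replace(base, dimension=256)
print(base.dimension, variant.dimension)  # 128 256
```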
**Pure Functions**: Separate business logic from I/O operations
```python
import numpy as np

Vector = np.ndarray

def normalize(v: Vector) -> Vector:
    return v / np.linalg.norm(v)

def generate_vector(seed: int, dimension: int) -> Vector:
    rng = np.random.default_rng(seed)  # same seed always yields the same vector
    return normalize(rng.random(dimension))
```
**Effect Handling**: Wrap all side effects in an IO monad
```python
def insert_documents(conn: Connection, docs: List[Document]) -> IO[int]:
    return IO(lambda: conn.executemany("INSERT ...", docs))
```
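The project's actual `IO` type is not shown here; a minimal sketch of what such a wrapper might look like, deferring the effect until `.run()` and supporting `map`/`bind` for composition:

```python
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")

class IO(Generic[A]):
    """Wraps a zero-argument effectful function; nothing runs until .run()."""

    def __init__(self, effect: Callable[[], A]):
        self._effect = effect

    def run(self) -> A:
        return self._effect()

    def map(self, fn: Callable[[A], B]) -> "IO[B]":
        return IO(lambda: fn(self.run()))

    def bind(self, fn: Callable[[A], "IO[B]"]) -> "IO[B]":
        return IO(lambda: fn(self.run()).run())

# Constructing the IO value performs no side effect yet:
action = IO(lambda: 41).map(lambda n: n + 1)
print(action.run())  # 42
```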
**Error Handling**: Use an Either type instead of exceptions
```python
def validate_config(config: ExperimentConfig) -> Either[str, ExperimentConfig]:
    if config.dimension not in [128, 256, 512, 1024]:
        return Left("Invalid dimension")
    return Right(config)
```
Key SQL patterns for vector search:
**Create HNSW Index**:
```sql
CREATE INDEX idx_name ON table_name USING HNSW(vector_column)
WITH (ef_construction = 128, ef_search = 64, M = 16, metric = 'cosine');
```
**Vector Similarity Search**: the distance function must match the index metric for the HNSW index to be used (`array_cosine_distance` for `metric = 'cosine'`; plain `array_distance` for L2):
```sql
SELECT * FROM table_name
ORDER BY array_cosine_distance(vector_column, query_vector::FLOAT[n])
LIMIT k;
```
**Hybrid Search (Vector + BM25)**:
```sql
WITH vector_results AS (
    SELECT id, array_cosine_distance(vector, ?::FLOAT[n]) AS v_score
    FROM table_name
    ORDER BY v_score LIMIT 100
),
text_results AS (
    SELECT id, fts_main_table.match_bm25(id, ?) AS t_score
    FROM table_name
    WHERE text LIKE '%' || ? || '%'
)
SELECT v.id, (0.7 * (1 - v.v_score)) + (0.3 * t.t_score) AS score
FROM vector_results v
JOIN text_results t ON v.id = t.id
ORDER BY score DESC LIMIT k;
```
Run tests at different levels:
```bash
python -m pytest tests/ -v                          # full suite
python -m pytest tests/pure/ -v                     # pure-function unit tests
python -m pytest tests/runners/ -v                  # runner tests
python -m pytest tests/effects/ -v --db-mode=mock   # effect tests against a mock DB
```
When optimizing or troubleshooting, keep in mind that the project uses Korean text for realistic benchmarking:
```python
from faker import Faker

Faker.seed(42)        # deterministic output across runs
fake = Faker('ko_KR')
text = fake.text(max_nb_chars=200)  # Korean sample text
```
The complete workflow with checkpointing support:
1. **Data Generation**: Korean text with deterministic seeds
2. **Database Initialization**: Create tables and load VSS extension
3. **Data Insertion**: Batch insertion with performance metrics
4. **Index Building**: HNSW index with dimension-optimized parameters
5. **Search Execution**: Pure vector or hybrid search with filtering
6. **Result Analysis**: Calculate accuracy metrics and generate reports
Each stage supports checkpointing for resumability. Check `plan/06-experiment-workflow.md` for detailed workflow design.
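The stage-by-stage resumability described above can be sketched as follows (the stage names and JSON checkpoint format are illustrative; the project's actual design lives in `plan/06-experiment-workflow.md`):

```python
import json
import tempfile
from pathlib import Path

STAGES = ["generate", "init_db", "insert", "build_index", "search", "analyze"]

def run_with_checkpoints(stages, run_stage, ckpt_path):
    """Run each stage once, skipping stages already recorded in the checkpoint."""
    done = json.loads(ckpt_path.read_text()) if ckpt_path.exists() else []
    for stage in stages:
        if stage in done:
            continue  # already completed in a previous run
        run_stage(stage)
        done.append(stage)
        ckpt_path.write_text(json.dumps(done))  # persist progress after each stage
    return done

executed = []
ckpt = Path(tempfile.mkdtemp()) / "ckpt.json"
run_with_checkpoints(STAGES[:3], executed.append, ckpt)  # simulate an interrupted run
run_with_checkpoints(STAGES, executed.append, ckpt)      # resume: first 3 stages skipped
print(executed)  # each stage ran exactly once, in order
```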
When a user asks to modify or extend this project, always maintain the functional programming paradigm: pure functions for logic, monadic wrappers for effects, and explicit error handling with Either types.