Guide AI agents through systematic performance analysis of DuckDB's VSS extension using functional programming patterns, Korean text data, and parallel experiment execution.
This skill guides AI agents working on a DuckDB benchmarking project that systematically analyzes the performance characteristics of DuckDB's VSS (Vector Similarity Search) extension in text vector search scenarios.
The project follows a **functional programming paradigm** and runs 48 experimental combinations, varying parameters such as data scale, vector dimension, and search type.
First, verify and set up the Python environment:
```bash
# Install dependencies (uv manages the environment; pip shown as a fallback)
uv sync
pip install duckdb faker pandas numpy psutil matplotlib seaborn plotly pyrsistent

# Verify that DuckDB and the VSS extension are available
python test_duckdb_vss_installation.py
python -c "import duckdb; conn = duckdb.connect(); print(conn.execute('SELECT * FROM duckdb_extensions() WHERE extension_name = \'vss\'').fetchall())"
python -c "import duckdb; print(f'DuckDB version: {duckdb.__version__}')"

# Run the test suite
pytest tests/ -v
```
Navigate the functional architecture:
```
src/
├── types/                    # Type definitions (frozen dataclasses)
├── pure/                     # Pure functions (no side effects)
│   ├── generators/           # Data generation
│   ├── transformers/         # Data transformations
│   └── calculators/          # Metrics and analysis
├── effects/                  # Side effect management
│   ├── db/                   # Database IO operations
│   ├── io/                   # File IO operations
│   └── metrics/              # Performance monitoring
├── pipelines/                # Function composition pipelines
└── runners/                  # Main entry points
    ├── experiment_runner.py  # CLI experiment runner with parallel support
    ├── parallel_runner.py    # Parallel execution engine
    ├── checkpoint.py         # Checkpoint management
    └── monitoring.py         # Resource monitoring
```
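The `pipelines/` layer composes pure functions into end-to-end flows. A minimal sketch of what such a composition helper can look like (the name `pipe` and the stage functions here are illustrative, not the project's actual API):

```python
from functools import reduce
from typing import Callable

def pipe(*fns: Callable) -> Callable:
    """Compose functions left to right: pipe(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), fns, x)

# Illustrative stages: each takes a dict and returns a new dict (no mutation).
to_lower = lambda doc: {**doc, "text": doc["text"].lower()}
add_length = lambda doc: {**doc, "length": len(doc["text"])}

process = pipe(to_lower, add_length)
print(process({"text": "DuckDB VSS"}))  # {'text': 'duckdb vss', 'length': 10}
```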
Execute benchmarks using the experiment runner:
```bash
# Run all 48 experiments in parallel
python -m src.runners.experiment_runner --all --parallel

# Limit worker count and memory usage
python -m src.runners.experiment_runner --all --parallel --workers 6 --max-memory 8000

# Run a subset: small data scale, 128- and 256-dimensional vectors
python -m src.runners.experiment_runner --data-scale small --dimensions 128,256 --parallel

# Resume an interrupted run from its checkpoints
python -m src.runners.experiment_runner --resume --checkpoint-dir checkpoints/

# Monitor a running experiment's resource usage
python -m src.tools.monitor --experiment-dir experiments/
```
When modifying or extending code, adhere to these patterns:
**Immutability**: Model configuration and results as frozen dataclasses
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    data_scale: int
    dimension: int
    search_type: str
```
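Frozen dataclasses reject mutation after construction; derived configs are created with `dataclasses.replace` instead. A quick illustration (the field values here are made up):

```python
import dataclasses
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentConfig:
    data_scale: int
    dimension: int
    search_type: str

base = ExperimentConfig(data_scale=10_000, dimension=128, search_type="vector")

try:
    base.dimension = 256  # mutation is rejected on a frozen dataclass
except dataclasses.FrozenInstanceError:
    pass

# Derive a new, independent config instead of mutating in place.
variant = dataclasses.replace(base, dimension=256)
print(base.dimension, variant.dimension)  # 128 256
```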
**Pure Functions**: Separate business logic from I/O operations
```python
import numpy as np

Vector = np.ndarray

def normalize(v: Vector) -> Vector:
    return v / np.linalg.norm(v)

def generate_vector(seed: int, dimension: int) -> Vector:
    rng = np.random.default_rng(seed)  # same seed always yields the same vector
    return normalize(rng.random(dimension))
```
**Effect Handling**: Wrap all side effects in an IO monad
```python
def insert_documents(conn: Connection, docs: List[Document]) -> IO[int]:
    return IO(lambda: conn.executemany("INSERT ...", docs))
```
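The project's actual `IO` type is not shown here; a minimal sketch of what such a wrapper might look like, deferring the effect until `.run()` and supporting `map`/`bind` for composition:

```python
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")

class IO(Generic[A]):
    """Wraps a zero-argument effectful function; nothing runs until .run()."""

    def __init__(self, effect: Callable[[], A]):
        self._effect = effect

    def run(self) -> A:
        return self._effect()

    def map(self, fn: Callable[[A], B]) -> "IO[B]":
        return IO(lambda: fn(self.run()))

    def bind(self, fn: Callable[[A], "IO[B]"]) -> "IO[B]":
        return IO(lambda: fn(self.run()).run())

# Constructing the IO value performs no side effect yet:
action = IO(lambda: 41).map(lambda n: n + 1)
print(action.run())  # 42
```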
**Error Handling**: Use an Either type instead of exceptions
```python
def validate_config(config: ExperimentConfig) -> Either[str, ExperimentConfig]:
    if config.dimension not in [128, 256, 512, 1024]:
        return Left("Invalid dimension")
    return Right(config)
```
Key SQL patterns for vector search:
**Create HNSW Index**:
```sql
CREATE INDEX idx_name ON table_name USING HNSW(vector_column)
WITH (ef_construction = 128, ef_search = 64, M = 16, metric = 'cosine');
```
**Vector Similarity Search**: the distance function must match the index metric for the HNSW index to be used (`array_cosine_distance` for `metric = 'cosine'`; plain `array_distance` for L2):
```sql
SELECT * FROM table_name
ORDER BY array_cosine_distance(vector_column, query_vector::FLOAT[n])
LIMIT k;
```
**Hybrid Search (Vector + BM25)**:
```sql
WITH vector_results AS (
    SELECT id, array_cosine_distance(vector, ?::FLOAT[n]) AS v_score
    FROM table_name
    ORDER BY v_score LIMIT 100
),
text_results AS (
    SELECT id, fts_main_table.match_bm25(id, ?) AS t_score
    FROM table_name
    WHERE text LIKE '%' || ? || '%'
)
SELECT v.id, (0.7 * (1 - v.v_score)) + (0.3 * t.t_score) AS score
FROM vector_results v
JOIN text_results t ON v.id = t.id
ORDER BY score DESC LIMIT k;
```
Run tests at different levels:
```bash
python -m pytest tests/ -v                          # full suite
python -m pytest tests/pure/ -v                     # pure-function unit tests
python -m pytest tests/runners/ -v                  # runner tests
python -m pytest tests/effects/ -v --db-mode=mock   # effect tests against a mock DB
```
When optimizing or troubleshooting, keep in mind that the project uses Korean text for realistic benchmarking:
```python
from faker import Faker

Faker.seed(42)        # deterministic output across runs
fake = Faker('ko_KR')
text = fake.text(max_nb_chars=200)  # Korean sample text
```
The complete workflow with checkpointing support:
1. **Data Generation**: Korean text with deterministic seeds
2. **Database Initialization**: Create tables and load VSS extension
3. **Data Insertion**: Batch insertion with performance metrics
4. **Index Building**: HNSW index with dimension-optimized parameters
5. **Search Execution**: Pure vector or hybrid search with filtering
6. **Result Analysis**: Calculate accuracy metrics and generate reports
Each stage supports checkpointing for resumability. Check `plan/06-experiment-workflow.md` for detailed workflow design.
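The stage-by-stage resumability described above can be sketched as follows (the stage names and JSON checkpoint format are illustrative; the project's actual design lives in `plan/06-experiment-workflow.md`):

```python
import json
import tempfile
from pathlib import Path

STAGES = ["generate", "init_db", "insert", "build_index", "search", "analyze"]

def run_with_checkpoints(stages, run_stage, ckpt_path):
    """Run each stage once, skipping stages already recorded in the checkpoint."""
    done = json.loads(ckpt_path.read_text()) if ckpt_path.exists() else []
    for stage in stages:
        if stage in done:
            continue  # already completed in a previous run
        run_stage(stage)
        done.append(stage)
        ckpt_path.write_text(json.dumps(done))  # persist progress after each stage
    return done

executed = []
ckpt = Path(tempfile.mkdtemp()) / "ckpt.json"
run_with_checkpoints(STAGES[:3], executed.append, ckpt)  # simulate an interrupted run
run_with_checkpoints(STAGES, executed.append, ckpt)      # resume: first 3 stages skipped
print(executed)  # each stage ran exactly once, in order
```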
When a user asks to modify or extend this project, always maintain the functional programming paradigm: pure functions for logic, monadic wrappers for effects, and explicit error handling with Either types.