PDFKB-MCP Development Assistant

Expert assistant for working with pdfkb-mcp, a Model Context Protocol (MCP) server providing intelligent document search and retrieval from PDF collections with semantic search, vector storage, and dual interfaces (MCP + web UI).

What This Skill Does

This skill provides comprehensive guidance for developing, testing, and maintaining the pdfkb-mcp project. It understands the architecture, development workflows, configuration patterns, and best practices specific to this codebase.

Instructions

Project Context

You are working with **pdfkb-mcp**, an MCP server that enables intelligent PDF document search and retrieval with:

Semantic search powered by OpenAI/HuggingFace/local embeddings and ChromaDB vector storage

Multiple PDF parsing strategies (PyMuPDF4LLM, Marker, MinerU, Docling, LLM) with fallback mechanisms

Intelligent multi-stage caching system that invalidates cached results when the configuration changes

Dual interface: MCP protocol integration + modern FastAPI web interface with WebSocket support

Plugin-based parser architecture for extensibility

Architecture Overview

**Core Components:**

`src/pdfkb/main.py` - FastMCP-based server with MCP tools (add_document, search_documents, list_documents, remove_document)

`src/pdfkb/vector_store.py` - ChromaDB-based semantic search implementation

`src/pdfkb/pdf_processor.py` - Document processing orchestrator

`src/pdfkb/intelligent_cache.py:139` - Multi-stage caching with smart invalidation logic

`src/pdfkb/parsers/` - Modular PDF parser plugins

`src/pdfkb/chunker/` - Text chunking strategies (LangChain, Unstructured)

`src/pdfkb/config.py` - Environment-based configuration system

`src/pdfkb/web/` - FastAPI web server with WebSocket support

Development Workflow

**Always use Hatch for development tasks:**

```bash
# Run tests
hatch run test

# Run tests with coverage
hatch run test-cov

# Format code (Black + isort)
hatch run format

# Lint code (Black, isort, flake8)
hatch run lint

# Generate HTML coverage report
hatch run cov-html
```

**Critical Rule:** After any significant code changes, ALWAYS run:

1. `hatch run format`

2. `hatch run lint`

3. `hatch run test`

Testing Guidelines

**Never run the web server during tests** - it's blocking and will hang test execution

Use pytest markers: `unit`, `integration`, `slow`, `performance`, `asyncio`

Test file patterns: `test_*.py` or `*_test.py`

Focus on fast unit tests; mark slow tests appropriately

Configuration Management

Main config class: `ServerConfig` in `src/pdfkb/config.py`

All environment variables prefixed with `PDFKB_`

Parser selection: `PDFKB_PDF_PARSER` (pymupdf4llm, marker, mineru, docling, llm)

Chunker selection: `PDFKB_PDF_CHUNKER` (langchain, unstructured)

Web interface: disabled by default, enable with `PDFKB_WEB_ENABLE=true`

Embedding providers: `PDFKB_EMBEDDING_PROVIDER` (local, openai, huggingface)

**Essential Environment Variables:**

`PDFKB_OPENAI_API_KEY` - Required only for OpenAI embeddings (local is default)

`PDFKB_OPENAI_API_BASE` - Custom base URL for OpenAI-compatible endpoints (e.g., Nebius)

`HF_TOKEN` - Required for HuggingFace embeddings (from https://huggingface.co/settings/tokens)

`PDFKB_KNOWLEDGEBASE_PATH` - PDF directory path

`PDFKB_MIN_CHUNK_SIZE` - Minimum chunk size in characters (default: 0 = disabled)
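To make the variables above concrete, here is a minimal sketch of how environment-based configuration of this kind could be loaded and validated. It is not the project's `ServerConfig`: the `SketchConfig` class, the `load_config` helper, and the defaults for the knowledge base path, parser, and chunker are assumptions made for illustration. Only the variable names, the allowed values, and the defaults for the embedding provider, web interface, and minimum chunk size come from this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence

# Allowed values as listed in this document.
PARSERS = ("pymupdf4llm", "marker", "mineru", "docling", "llm")
CHUNKERS = ("langchain", "unstructured")
PROVIDERS = ("local", "openai", "huggingface")


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for ServerConfig; field names are illustrative."""

    knowledgebase_path: str
    pdf_parser: str
    pdf_chunker: str
    embedding_provider: str
    web_enable: bool
    min_chunk_size: int
    openai_api_key: Optional[str] = None


def _choice(env: Mapping[str, str], name: str, allowed: Sequence[str], default: str) -> str:
    """Read a variable and reject values outside the allowed set."""
    value = env.get(name, default).strip().lower()
    if value not in allowed:
        raise ValueError(f"{name} must be one of {tuple(allowed)}, got {value!r}")
    return value


def load_config(env: Optional[Mapping[str, str]] = None) -> SketchConfig:
    """Build a config from PDFKB_* variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    provider = _choice(env, "PDFKB_EMBEDDING_PROVIDER", PROVIDERS, "local")
    api_key = env.get("PDFKB_OPENAI_API_KEY")
    if provider == "openai" and not api_key:
        raise ValueError("PDFKB_OPENAI_API_KEY is required for OpenAI embeddings")
    if provider == "huggingface" and not env.get("HF_TOKEN"):
        raise ValueError("HF_TOKEN is required for HuggingFace embeddings")
    return SketchConfig(
        # The next three defaults are assumptions, not documented values.
        knowledgebase_path=env.get("PDFKB_KNOWLEDGEBASE_PATH", "./pdfs"),
        pdf_parser=_choice(env, "PDFKB_PDF_PARSER", PARSERS, PARSERS[0]),
        pdf_chunker=_choice(env, "PDFKB_PDF_CHUNKER", CHUNKERS, CHUNKERS[0]),
        embedding_provider=provider,
        # Web interface is disabled unless explicitly set to "true".
        web_enable=env.get("PDFKB_WEB_ENABLE", "false").strip().lower() == "true",
        # 0 means the minimum chunk size check is disabled.
        min_chunk_size=int(env.get("PDFKB_MIN_CHUNK_SIZE", "0")),
        openai_api_key=api_key,
    )
```

Passing a plain dictionary instead of `os.environ` keeps tests isolated from the developer's shell, for example `load_config({"PDFKB_WEB_ENABLE": "true"})`.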

**Optional Parser Installations:**

Marker: `pip install "pdfkb-mcp[marker]"`

Docling: `pip install "pdfkb-mcp[docling]"`

MinerU: `pip install "pdfkb-mcp[mineru]"`

LLM: `pip install "pdfkb-mcp[llm]"`

Common Development Tasks

**Adding a New Parser:**

1. Create `src/pdfkb/parsers/parser_newname.py`

2. Implement the `PDFParser` interface

3. Register in parser registry

4. Add tests in `tests/parsers/`

5. Update documentation
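The steps above can be sketched roughly as follows. This is an illustration of the plugin pattern, not the project's actual code: the `parse` method signature, the `PARSER_REGISTRY` dictionary, and the `register_parser` and `create_parser` helpers are all hypothetical, so check the existing parsers in `src/pdfkb/parsers/` for the real interface and registration mechanism before copying anything.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable, Dict, Type


class PDFParser(ABC):
    """Hypothetical base class; the real PDFParser interface may differ."""

    @abstractmethod
    def parse(self, file_path: Path) -> str:
        """Return the document's text, for example as Markdown."""


# Hypothetical registry mapping a parser name to its class.
PARSER_REGISTRY: Dict[str, Type[PDFParser]] = {}


def register_parser(name: str) -> Callable[[Type[PDFParser]], Type[PDFParser]]:
    """Class decorator that records a parser under the given name."""

    def decorator(cls: Type[PDFParser]) -> Type[PDFParser]:
        PARSER_REGISTRY[name] = cls
        return cls

    return decorator


@register_parser("newname")
class NewNameParser(PDFParser):
    """Placeholder parser; a real one would extract text from the PDF."""

    def parse(self, file_path: Path) -> str:
        # Emit only a title line derived from the file name.
        return f"# {file_path.stem}\n"


def create_parser(name: str) -> PDFParser:
    """Instantiate a registered parser or fail with a clear message."""
    try:
        return PARSER_REGISTRY[name]()
    except KeyError:
        known = ", ".join(sorted(PARSER_REGISTRY))
        raise ValueError(f"Unknown parser {name!r}; registered: {known}") from None
```

Keeping registration next to the class definition means a new parser touches a single file, which fits the goal of keeping parsers independent and testable.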

**Modifying Caching Logic:**

1. Edit `src/pdfkb/intelligent_cache.py`

2. Understand invalidation rules (configuration changes trigger cache invalidation)

3. Test with multiple parser/chunker configurations

4. Verify cache directory structure
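One way to reason about configuration-aware invalidation is to give each cache stage a fingerprint of only the settings it depends on. The sketch below assumes three stages (parsing, chunking, embedding) and a particular mapping from stage to settings; both are assumptions made for illustration and may not match what `src/pdfkb/intelligent_cache.py` does.

```python
import hashlib
import json
from typing import Any, Dict, List, Mapping

# Assumed stages and the settings each one depends on (illustrative only).
STAGE_KEYS: Dict[str, List[str]] = {
    "parsing": ["pdf_parser"],
    "chunking": ["pdf_parser", "pdf_chunker", "min_chunk_size"],
    "embedding": ["pdf_parser", "pdf_chunker", "min_chunk_size", "embedding_provider"],
}


def stage_fingerprints(config: Mapping[str, Any]) -> Dict[str, str]:
    """Hash the settings relevant to each stage into a stable fingerprint."""
    fingerprints: Dict[str, str] = {}
    for stage, keys in STAGE_KEYS.items():
        relevant = {key: config[key] for key in keys}
        payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
        fingerprints[stage] = hashlib.sha256(payload).hexdigest()
    return fingerprints


def stale_stages(old: Mapping[str, Any], new: Mapping[str, Any]) -> List[str]:
    """List the stages whose cached output no longer matches the new config."""
    old_fp = stage_fingerprints(old)
    new_fp = stage_fingerprints(new)
    return [stage for stage in STAGE_KEYS if old_fp[stage] != new_fp[stage]]
```

Under these assumptions, switching only the chunker leaves parsed output reusable while chunks and embeddings are rebuilt, which is the kind of behavior worth asserting when testing multiple parser and chunker configurations.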

**Adding Web Endpoints:**

1. Extend `src/pdfkb/web/server.py`

2. Follow FastAPI patterns

3. Add WebSocket support if needed

4. Update web interface tests

**Changing Chunking Strategy:**

1. Modify chunker classes in `src/pdfkb/chunker/`

2. Ensure chunkers respect `min_chunk_size` configuration

3. Test with various PDF structures

4. Validate cache invalidation behavior
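As a rough illustration of step 2, the sketch below treats `min_chunk_size` as a filter that drops chunks shorter than the threshold and does nothing when the value is 0. Dropping (rather than merging) short chunks and measuring length after stripping whitespace are assumptions; the helper name `apply_min_chunk_size` is hypothetical, and the project's chunkers may handle short chunks differently.

```python
from typing import List


def apply_min_chunk_size(chunks: List[str], min_chunk_size: int = 0) -> List[str]:
    """Drop chunks shorter than min_chunk_size characters; 0 disables the check."""
    if min_chunk_size <= 0:
        return list(chunks)
    # Length is measured after stripping surrounding whitespace (an assumption).
    return [chunk for chunk in chunks if len(chunk.strip()) >= min_chunk_size]
```

A shared helper like this makes it easier to verify that every chunker respects the setting in the same way.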

Version Management

Version is managed by `bump2version`

**Never manually change version numbers**

Only bump version when explicitly requested by the user

Use semantic versioning (major.minor.patch)

Commit Message Conventions

Use conventional commit format without Anthropic/Claude references:

`feat:` - New features

`fix:` - Bug fixes

`chore:` - Maintenance tasks

`docs:` - Documentation updates

`refactor:` - Code refactoring

`test:` - Test additions/modifications

`perf:` - Performance improvements

**Example:** `feat: add support for HuggingFace embeddings`
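The convention above is simple enough to check mechanically. The following is a minimal sketch of such a check; the function name `is_valid_commit_message` is hypothetical, and the pattern only covers the `type: subject` form listed here, so optional scopes such as `feat(parser):` are rejected by this sketch.

```python
import re

# Commit types listed in this document.
ALLOWED_TYPES = ("feat", "fix", "chore", "docs", "refactor", "test", "perf")

# Subject line must look like "<type>: <non-empty description>".
_SUBJECT = re.compile(r"^(?:%s): \S.*$" % "|".join(ALLOWED_TYPES))
# References that must not appear anywhere in the message.
_FORBIDDEN = re.compile(r"\b(?:anthropic|claude)\b", re.IGNORECASE)


def is_valid_commit_message(message: str) -> bool:
    """Return True if the message follows the convention described above."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    return bool(_SUBJECT.match(subject)) and not _FORBIDDEN.search(message)
```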

Diagrams and Visualization

Use Mermaid charts for architecture diagrams, flowcharts, and sequence diagrams

Include visual representations when explaining complex workflows

Code Quality Standards

1. **Type Hints:** Use comprehensive type annotations

2. **Error Handling:** Implement robust error handling with informative messages

3. **Logging:** Use structured logging throughout

4. **Documentation:** Maintain docstrings for public APIs

5. **Testing:** Aim for high coverage, especially for critical paths

Key Architecture Patterns to Follow

1. **Plugin-based Design:** Parsers are modular and interchangeable

2. **Intelligent Caching:** Multi-stage caching with configuration-aware invalidation

3. **Background Processing:** Non-blocking document processing queue

4. **Dual Interface:** MCP protocol and web UI share underlying services

5. **Fallback Mechanisms:** Graceful degradation when optional dependencies are missing
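The fallback pattern in item 5 can be sketched as trying a preferred optional dependency first and moving on when it cannot be imported. The helper name `first_available` is hypothetical, and the module names in the usage note are placeholders rather than the packages the project actually probes.

```python
import importlib
from typing import Iterable, Optional


def first_available(module_names: Iterable[str]) -> Optional[str]:
    """Return the first module name that imports successfully, else None."""
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Optional dependency is missing; try the next candidate.
            continue
        return name
    return None
```

For example, `first_available(["some_optional_parser", "json"])` returns `"json"` when the first (placeholder) module is not installed, so the caller can log the downgrade and continue with the fallback.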

Troubleshooting Common Issues

**Cache invalidation not working:** Check `intelligent_cache.py:139` for invalidation rules

**Parser not found:** Verify that the optional dependencies for that parser are installed

**Web server hanging tests:** Ensure web server is not started in test environment

**Embedding failures:** Verify API keys and provider configuration

**Chunk size issues:** Check `PDFKB_MIN_CHUNK_SIZE` and chunker implementation

When to Explore the Codebase

Before making changes:

1. Read the relevant source files (main.py, config.py, etc.)

2. Check existing tests for patterns

3. Review parser implementations for consistency

4. Understand caching behavior for the affected components

Best Practices

Keep parsers independent and testable

Maintain backward compatibility in configuration

Document environment variable changes

Test with multiple parser/chunker combinations

Validate cache behavior after config changes

Use type hints consistently

Follow existing code style (Black + isort)

Example Usage

**User:** "Add support for a new PDF parser using pdfplumber"

**Expected Approach:**

1. Read existing parser implementations in `src/pdfkb/parsers/`

2. Create `src/pdfkb/parsers/parser_pdfplumber.py` implementing `PDFParser` interface

3. Add parser registration logic

4. Create tests in `tests/parsers/test_parser_pdfplumber.py`

5. Update configuration documentation

6. Run `hatch run format`, `hatch run lint`, `hatch run test`

7. Commit with message: `feat: add pdfplumber parser support`

**User:** "Why is the cache not invalidating when I change the chunker?"

**Expected Approach:**

1. Examine `src/pdfkb/intelligent_cache.py:139` for invalidation rules

2. Check `src/pdfkb/config.py` for chunker configuration handling

3. Verify cache key generation includes chunker type

4. Test cache behavior with different chunker settings

5. If bug found, fix invalidation logic and add regression test

Constraints

Never manually modify version numbers (use bump2version)

Do not run web server during test execution

Always use Hatch for development commands

Maintain backward compatibility in configuration changes

Follow conventional commit format without Anthropic/Claude references

Keep parsers modular and independent

Respect the multi-stage caching architecture
