Extract material properties from TDS/MSDS PDFs with 99%+ accuracy and power a RAG-based chatbot for material property prediction and Q&A, backed by MongoDB, PostgreSQL, Elasticsearch, Neo4j, and Qdrant.
AI coding assistant skill for the PromTree project: a production-grade system that extracts material property information from TDS/MSDS PDF files and provides RAG-based chatbot functionality.
This skill helps you work with a complex full-stack application spanning a FastAPI backend, a React frontend, multiple extraction pipelines, and a five-database retrieval stack.
When asked about the system architecture:
1. **Primary Backend**: Located in `app/` directory, uses FastAPI + Beanie ODM + MongoDB
2. **Frontend**: React 19.1 + TypeScript + Vite in `frontend/` directory
3. **RAG System**: Advanced hybrid retrieval in `retriever/` directory
4. **Extraction Pipelines**: TDS extraction in `db3/`, MSDS in `db1/` and `db1_2py/`
5. **Database Stack**:
- MongoDB: Document storage (markdown, chunks)
- PostgreSQL: Structured property data
- Elasticsearch: Full-text search with Korean support (Nori analyzer)
- Neo4j: Knowledge graph storage
- Qdrant: Vector database
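The five services above each expose a default local port (listed in the Docker setup below; the PostgreSQL port is not stated in this document, so the standard 5432 is assumed here). A minimal, hypothetical helper for building local connection URLs:

```python
# Hypothetical mapping of each service in the stack to its local port.
# The real client wiring lives in the app's own config; PostgreSQL's
# port is an assumption (standard default), not from docker-compose.
SERVICE_PORTS = {
    "mongodb": 27017,        # document storage (markdown, chunks)
    "elasticsearch": 9200,   # BM25 / Nori full-text search
    "neo4j": 7687,           # Bolt port for the knowledge graph
    "qdrant": 6333,          # vector database REST API
    "postgresql": 5432,      # structured property data (assumed default)
}

def service_url(name: str, host: str = "localhost") -> str:
    """Build a plain host:port URL for a service by name."""
    return f"http://{host}:{SERVICE_PORTS[name]}"
```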
When helping with environment setup:
1. **Check for uv package manager**: This project uses `uv` for Python dependency management
```bash
which uv || curl -LsSf https://astral.sh/uv/install.sh | sh
```
2. **Start Docker services**:
```bash
docker-compose up -d
```
Services: MongoDB (27017), Qdrant (6333, 6334), Elasticsearch (9200), Neo4j (7474, 7687)
3. **Verify `.env` file** contains:
- `MONGO_INITDB_ROOT_USERNAME=promtree`
- `MONGO_INITDB_ROOT_PASSWORD=ssafy13s307`
- `MONGO_HOST=localhost`
- `MONGO_PORT=27017`
- `GOOGLE_API_KEY=<api_key>`
4. **Backend setup**:
```bash
uv sync
source .venv/bin/activate
python main.py
```
5. **Frontend setup**:
```bash
cd frontend
npm install
npm run dev
```
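The `.env` check in step 3 above can be automated. The sketch below parses a simple `KEY=VALUE` file and reports which required keys are missing or empty; the function name is hypothetical and not part of the project.

```python
import os

# Required keys taken from the environment-setup checklist.
REQUIRED_KEYS = [
    "MONGO_INITDB_ROOT_USERNAME",
    "MONGO_INITDB_ROOT_PASSWORD",
    "MONGO_HOST",
    "MONGO_PORT",
    "GOOGLE_API_KEY",
]

def missing_env_keys(path: str = ".env") -> list[str]:
    """Parse a simple KEY=VALUE .env file and return required keys
    that are absent or have empty values."""
    found: dict[str, str] = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    found[key.strip()] = value.strip()
    return [k for k in REQUIRED_KEYS if not found.get(k)]
```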
When modifying backend code:
1. **Models** are in `app/models/` using Beanie Document classes
2. **API routes** are in `app/routers/` (users, chats, collections)
3. **Services** contain business logic in `app/services/`
4. **All identifiers use snake_case**: `chat_id`, `collection_id`, `document_id`, `created_at`, `updated_at`
5. **Message contents** use `contents` field (not `content` or `message`)
6. **API docs** available at http://localhost:8000/docs
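The naming conventions above (snake_case identifiers, `contents` for message bodies) can be illustrated with a plain dataclass. This is only an illustrative shape: the real models are Beanie Document classes in `app/models/`, and the field set here is assumed, not copied from them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative shape only: real models are Beanie Documents in app/models/.
# Note the snake_case identifiers and the `contents` field name.
@dataclass
class ChatMessage:
    chat_id: str
    contents: str  # NOT `content` or `message`
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    updated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```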
Key API endpoints are documented in the interactive Swagger UI at http://localhost:8000/docs; the quick-reference curl commands at the end of this document cover user registration, collection creation, and document upload.
When modifying frontend code:
1. **Main routing** in `App.tsx` (collections, collection-detail, chat pages)
2. **Key components**:
- `Chat.tsx`: Main chat interface with message history
- `Sidebar.tsx`: Navigation sidebar with chat history
- `Collections.tsx`: Grid view of document collections
- `CollectionDetail.tsx`: Document list within collection, file upload
3. **API client** centralized in `lib/api.ts`
4. **Custom hooks**:
- `useToast.ts`: Toast notifications
- `useUpload.ts`: File upload progress
5. **State management**: User state in localStorage, data fetched from backend API
When working with RAG functionality:
1. **Main RAG class**: `rag_system.py` with FAISS retrieval
2. **Hybrid RAG**: `lightrag/lightrag_hybrid_rag.py` combines vector + graph search
3. **Chunking**: `chunker/markdown_chunker.py` handles markdown with table unpivoting
4. **Elasticsearch**: `indexer/elasticsearch_indexer.py` for BM25 keyword search
5. **Knowledge Graph**: `knowledge_graph/neo4j_knowledge_graph.py` for Neo4j operations
6. **Vector stores**: `vector_store/` supports Qdrant and Weaviate
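The hybrid retrieval in `lightrag/lightrag_hybrid_rag.py` combines results from multiple retrievers. Its exact merging logic is not shown in this document; a common, generic way to fuse ranked lists (e.g. BM25, vector, and graph results) is reciprocal rank fusion, sketched here:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.
    Standard RRF: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both keyword and vector search (like "b" below) rises above documents found by only one retriever.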
MongoDB chunks schema:
```python
{
    "vector_id": int,
    "content": str,
    "source_file_name": str,
    "page_num": int,
    "chunk_index": int
}
```
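A small validator for that schema can catch malformed chunks before insertion; the helper below is a sketch, not part of the codebase.

```python
# Field names and types taken from the chunks schema above.
CHUNK_SCHEMA = {
    "vector_id": int,
    "content": str,
    "source_file_name": str,
    "page_num": int,
    "chunk_index": int,
}

def validate_chunk(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the chunk
    matches the schema."""
    problems = []
    for key, expected in CHUNK_SCHEMA.items():
        if key not in doc:
            problems.append(f"missing field: {key}")
        elif not isinstance(doc[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems
```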
When working with extraction:
1. **TDS extraction**: `app/core/tds.py` (integrated into main app)
2. **MSDS extraction**: `app/core/msds.py` (integrated into main app)
3. **Legacy TDS**: `db3/` directory (standalone venv if needed)
4. **Legacy MSDS**: `db1/` (OCR-based) and `db1_2py/` (PyMuPDF-based)
5. **PDF parsing**: `app/promtree/parsing.py` converts PDF to markdown
6. **Table handling**: `app/promtree/unpivot.py` unpivots HTML tables
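The "unpivot" step turns a wide property table into one record per value. The real `app/promtree/unpivot.py` operates on parsed HTML tables; this is only a simplified sketch of the idea on plain lists:

```python
def unpivot(header: list[str], rows: list[list[str]]) -> list[dict]:
    """Turn a wide table (first column = property name, remaining
    columns = grades/conditions) into long-form records, one per value.
    Simplified sketch; the real code works on HTML tables."""
    records = []
    for row in rows:
        prop, values = row[0], row[1:]
        for col, value in zip(header[1:], values):
            records.append({"property": prop, "column": col, "value": value})
    return records
```

For example, a row `["Density", "1.1", "1.2"]` under columns `["Property", "Grade A", "Grade B"]` becomes two records, one per grade, which chunk and index far better than the original wide layout.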
When committing changes:
1. **Remote**: `https://lab.ssafy.com/s13-final/S13P31S307.git`
2. **Branch flow**: `feature/* → develop → master`
3. **Commit format**: `[S13P31S307-<issue-number>] <Type>: <description>`
4. **Branch naming**: `S13P31S307-<issue-number>-<description>`
5. **Main branch**: `master` (not `main`)
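A quick regex check for the commit format can run in a pre-commit hook. The document does not enumerate the allowed `<Type>` values, so the pattern below accepts any single word there, and the example message is hypothetical:

```python
import re

# Matches "[S13P31S307-<issue-number>] <Type>: <description>".
# The allowed Type values are not specified in this document,
# so any single word is accepted.
COMMIT_RE = re.compile(r"^\[S13P31S307-\d+\] [A-Za-z]+: .+$")

def is_valid_commit(message: str) -> bool:
    return bool(COMMIT_RE.match(message))
```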
When debugging issues:
1. **Port conflicts**:
```bash
lsof -ti:8000 | xargs kill -9 # Backend
lsof -ti:5173 | xargs kill -9 # Frontend
```
2. **Database connection**:
```bash
docker-compose restart mongodb
docker-compose ps
docker logs mongodb
```
3. **RAG initialization**: First message query may take 30-60 seconds to initialize FAISS embeddings
4. **Verify service health**:
- MongoDB: `mongosh mongodb://promtree:ssafy13s307@localhost:27017`
- Qdrant: http://localhost:6333/dashboard
- Neo4j: http://localhost:7474 (credentials: neo4j/ssafy13s307)
- Elasticsearch: `curl -u elastic:ssafy13s307 http://localhost:9200`
- Backend API: http://localhost:8000/docs
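The health checks above can be scripted with a plain TCP reachability probe against the listed ports; the helper names are hypothetical and this only confirms a port is open, not that the service behind it is healthy.

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Quick TCP reachability check for a local service port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

# Ports from the service list above; hypothetical helper.
SERVICES = {"mongodb": 27017, "qdrant": 6333, "elasticsearch": 9200,
            "neo4j": 7474, "backend": 8000}

def report() -> dict[str, bool]:
    return {name: port_open("localhost", port)
            for name, port in SERVICES.items()}
```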
Critical project conventions:
1. **Use uv for Python dependencies**, not pip directly (except in legacy modules)
2. **Primary codebase is `app/`**, not `backend/` (which was removed)
3. **All API identifiers use snake_case**, not camelCase
4. **Collection creation requires `type` field** (`"msds"` or `"tds"`)
5. **Message field is `contents`**, not `content` or `message`
6. **Main git branch is `master`**, not `main`
7. **Korean text support** is critical (Elasticsearch uses Nori analyzer)
8. **First RAG query may be slow** due to FAISS initialization
**Start full stack**:
```bash
docker-compose up -d
uv sync && source .venv/bin/activate && python main.py
cd frontend && npm install && npm run dev
```
**Test API**:
```bash
curl -X POST http://localhost:8000/users/register \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]","password":"secret","name":"Test User"}'
curl -X POST http://localhost:8000/collections \
  -H "Content-Type: application/json" \
  -d '{"name":"My Collection","description":"Test","type":"tds"}'
```
**Upload document**:
```bash
curl -X POST http://localhost:8000/collections/{collection_id} \
  -F "[email protected]"
```