Extract material properties from TDS/MSDS PDFs with 99%+ accuracy and power a RAG-based chatbot for material property prediction and Q&A, backed by MongoDB, PostgreSQL, Elasticsearch, Neo4j, and Qdrant.
AI coding assistant skill for the PromTree project: a production-grade system that extracts material property information from TDS/MSDS PDF files and provides RAG-based chatbot functionality.
This skill helps you work with a complex full-stack application spanning a FastAPI backend, a React frontend, multiple extraction pipelines, and a five-database retrieval stack.
When asked about the system architecture:
1. **Primary Backend**: Located in `app/` directory, uses FastAPI + Beanie ODM + MongoDB
2. **Frontend**: React 19.1 + TypeScript + Vite in `frontend/` directory
3. **RAG System**: Advanced hybrid retrieval in `retriever/` directory
4. **Extraction Pipelines**: TDS extraction in `db3/`, MSDS in `db1/` and `db1_2py/`
5. **Database Stack**:
- MongoDB: Document storage (markdown, chunks)
- PostgreSQL: Structured property data
- Elasticsearch: Full-text search with Korean support (Nori analyzer)
- Neo4j: Knowledge graph storage
- Qdrant: Vector database
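The five services above each expose a default local port (listed in the Docker setup below; the PostgreSQL port is not stated in this document, so the standard 5432 is assumed here). A minimal, hypothetical helper for building local connection URLs:

```python
# Hypothetical mapping of each service in the stack to its local port.
# The real client wiring lives in the app's own config; PostgreSQL's
# port is an assumption (standard default), not from docker-compose.
SERVICE_PORTS = {
    "mongodb": 27017,        # document storage (markdown, chunks)
    "elasticsearch": 9200,   # BM25 / Nori full-text search
    "neo4j": 7687,           # Bolt port for the knowledge graph
    "qdrant": 6333,          # vector database REST API
    "postgresql": 5432,      # structured property data (assumed default)
}

def service_url(name: str, host: str = "localhost") -> str:
    """Build a plain host:port URL for a service by name."""
    return f"http://{host}:{SERVICE_PORTS[name]}"
```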
When helping with environment setup:
1. **Check for uv package manager**: This project uses `uv` for Python dependency management
```bash
which uv || curl -LsSf https://astral.sh/uv/install.sh | sh
```
2. **Start Docker services**:
```bash
docker-compose up -d
```
Services: MongoDB (27017), Qdrant (6333, 6334), Elasticsearch (9200), Neo4j (7474, 7687)
3. **Verify `.env` file** contains:
- `MONGO_INITDB_ROOT_USERNAME=promtree`
- `MONGO_INITDB_ROOT_PASSWORD=ssafy13s307`
- `MONGO_HOST=localhost`
- `MONGO_PORT=27017`
- `GOOGLE_API_KEY=<api_key>`
4. **Backend setup**:
```bash
uv sync
source .venv/bin/activate
python main.py
```
5. **Frontend setup**:
```bash
cd frontend
npm install
npm run dev
```
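The `.env` check in step 3 above can be automated. The sketch below parses a simple `KEY=VALUE` file and reports which required keys are missing or empty; the function name is hypothetical and not part of the project.

```python
import os

# Required keys taken from the environment-setup checklist.
REQUIRED_KEYS = [
    "MONGO_INITDB_ROOT_USERNAME",
    "MONGO_INITDB_ROOT_PASSWORD",
    "MONGO_HOST",
    "MONGO_PORT",
    "GOOGLE_API_KEY",
]

def missing_env_keys(path: str = ".env") -> list[str]:
    """Parse a simple KEY=VALUE .env file and return required keys
    that are absent or have empty values."""
    found: dict[str, str] = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    found[key.strip()] = value.strip()
    return [k for k in REQUIRED_KEYS if not found.get(k)]
```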
When modifying backend code:
1. **Models** are in `app/models/` using Beanie Document classes
2. **API routes** are in `app/routers/` (users, chats, collections)
3. **Services** contain business logic in `app/services/`
4. **All identifiers use snake_case**: `chat_id`, `collection_id`, `document_id`, `created_at`, `updated_at`
5. **Message contents** use `contents` field (not `content` or `message`)
6. **API docs** available at http://localhost:8000/docs
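The naming conventions above (snake_case identifiers, `contents` for message bodies) can be illustrated with a plain dataclass. This is only an illustrative shape: the real models are Beanie Document classes in `app/models/`, and the field set here is assumed, not copied from them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative shape only: real models are Beanie Documents in app/models/.
# Note the snake_case identifiers and the `contents` field name.
@dataclass
class ChatMessage:
    chat_id: str
    contents: str  # NOT `content` or `message`
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    updated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```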
Key API endpoints are documented in the interactive Swagger UI at http://localhost:8000/docs; the quick-reference curl commands at the end of this document cover user registration, collection creation, and document upload.
When modifying frontend code:
1. **Main routing** in `App.tsx` (collections, collection-detail, chat pages)
2. **Key components**:
- `Chat.tsx`: Main chat interface with message history
- `Sidebar.tsx`: Navigation sidebar with chat history
- `Collections.tsx`: Grid view of document collections
- `CollectionDetail.tsx`: Document list within collection, file upload
3. **API client** centralized in `lib/api.ts`
4. **Custom hooks**:
- `useToast.ts`: Toast notifications
- `useUpload.ts`: File upload progress
5. **State management**: User state in localStorage, data fetched from backend API
When working with RAG functionality:
1. **Main RAG class**: `rag_system.py` with FAISS retrieval
2. **Hybrid RAG**: `lightrag/lightrag_hybrid_rag.py` combines vector + graph search
3. **Chunking**: `chunker/markdown_chunker.py` handles markdown with table unpivoting
4. **Elasticsearch**: `indexer/elasticsearch_indexer.py` for BM25 keyword search
5. **Knowledge Graph**: `knowledge_graph/neo4j_knowledge_graph.py` for Neo4j operations
6. **Vector stores**: `vector_store/` supports Qdrant and Weaviate
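The hybrid retrieval in `lightrag/lightrag_hybrid_rag.py` combines results from multiple retrievers. Its exact merging logic is not shown in this document; a common, generic way to fuse ranked lists (e.g. BM25, vector, and graph results) is reciprocal rank fusion, sketched here:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.
    Standard RRF: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both keyword and vector search (like "b" below) rises above documents found by only one retriever.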
MongoDB chunks schema:
```python
{
    "vector_id": int,
    "content": str,
    "source_file_name": str,
    "page_num": int,
    "chunk_index": int
}
```
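A small validator for that schema can catch malformed chunks before insertion; the helper below is a sketch, not part of the codebase.

```python
# Field names and types taken from the chunks schema above.
CHUNK_SCHEMA = {
    "vector_id": int,
    "content": str,
    "source_file_name": str,
    "page_num": int,
    "chunk_index": int,
}

def validate_chunk(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the chunk
    matches the schema."""
    problems = []
    for key, expected in CHUNK_SCHEMA.items():
        if key not in doc:
            problems.append(f"missing field: {key}")
        elif not isinstance(doc[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems
```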
When working with extraction:
1. **TDS extraction**: `app/core/tds.py` (integrated into main app)
2. **MSDS extraction**: `app/core/msds.py` (integrated into main app)
3. **Legacy TDS**: `db3/` directory (standalone venv if needed)
4. **Legacy MSDS**: `db1/` (OCR-based) and `db1_2py/` (PyMuPDF-based)
5. **PDF parsing**: `app/promtree/parsing.py` converts PDF to markdown
6. **Table handling**: `app/promtree/unpivot.py` unpivots HTML tables
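The "unpivot" step turns a wide property table into one record per value. The real `app/promtree/unpivot.py` operates on parsed HTML tables; this is only a simplified sketch of the idea on plain lists:

```python
def unpivot(header: list[str], rows: list[list[str]]) -> list[dict]:
    """Turn a wide table (first column = property name, remaining
    columns = grades/conditions) into long-form records, one per value.
    Simplified sketch; the real code works on HTML tables."""
    records = []
    for row in rows:
        prop, values = row[0], row[1:]
        for col, value in zip(header[1:], values):
            records.append({"property": prop, "column": col, "value": value})
    return records
```

For example, a row `["Density", "1.1", "1.2"]` under columns `["Property", "Grade A", "Grade B"]` becomes two records, one per grade, which chunk and index far better than the original wide layout.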
When committing changes:
1. **Remote**: `https://lab.ssafy.com/s13-final/S13P31S307.git`
2. **Branch flow**: `feature/* → develop → master`
3. **Commit format**: `[S13P31S307-<issue-number>] <Type>: <description>`
4. **Branch naming**: `S13P31S307-<issue-number>-<description>`
5. **Main branch**: `master` (not `main`)
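A quick regex check for the commit format can run in a pre-commit hook. The document does not enumerate the allowed `<Type>` values, so the pattern below accepts any single word there, and the example message is hypothetical:

```python
import re

# Matches "[S13P31S307-<issue-number>] <Type>: <description>".
# The allowed Type values are not specified in this document,
# so any single word is accepted.
COMMIT_RE = re.compile(r"^\[S13P31S307-\d+\] [A-Za-z]+: .+$")

def is_valid_commit(message: str) -> bool:
    return bool(COMMIT_RE.match(message))
```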
When debugging issues:
1. **Port conflicts**:
```bash
lsof -ti:8000 | xargs kill -9 # Backend
lsof -ti:5173 | xargs kill -9 # Frontend
```
2. **Database connection**:
```bash
docker-compose restart mongodb
docker-compose ps
docker logs mongodb
```
3. **RAG initialization**: First message query may take 30-60 seconds to initialize FAISS embeddings
4. **Verify service health**:
- MongoDB: `mongosh mongodb://promtree:ssafy13s307@localhost:27017`
- Qdrant: http://localhost:6333/dashboard
- Neo4j: http://localhost:7474 (credentials: neo4j/ssafy13s307)
- Elasticsearch: `curl -u elastic:ssafy13s307 http://localhost:9200`
- Backend API: http://localhost:8000/docs
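The health checks above can be scripted with a plain TCP reachability probe against the listed ports; the helper names are hypothetical and this only confirms a port is open, not that the service behind it is healthy.

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Quick TCP reachability check for a local service port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

# Ports from the service list above; hypothetical helper.
SERVICES = {"mongodb": 27017, "qdrant": 6333, "elasticsearch": 9200,
            "neo4j": 7474, "backend": 8000}

def report() -> dict[str, bool]:
    return {name: port_open("localhost", port)
            for name, port in SERVICES.items()}
```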
Critical project conventions:
1. **Use uv for Python dependencies**, not pip directly (except in legacy modules)
2. **Primary codebase is `app/`**, not `backend/` (which was removed)
3. **All API identifiers use snake_case**, not camelCase
4. **Collection creation requires `type` field** (`"msds"` or `"tds"`)
5. **Message field is `contents`**, not `content` or `message`
6. **Main git branch is `master`**, not `main`
7. **Korean text support** is critical (Elasticsearch uses Nori analyzer)
8. **First RAG query may be slow** due to FAISS initialization
**Start full stack**:
```bash
docker-compose up -d
uv sync && source .venv/bin/activate && python main.py
cd frontend && npm install && npm run dev
```
**Test API**:
```bash
curl -X POST http://localhost:8000/users/register \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]","password":"secret","name":"Test User"}'
curl -X POST http://localhost:8000/collections \
  -H "Content-Type: application/json" \
  -d '{"name":"My Collection","description":"Test","type":"tds"}'
```
**Upload document**:
```bash
curl -X POST http://localhost:8000/collections/{collection_id} \
  -F "[email protected]"
```