Law-GPT Pakistan Legal Assistant

A RAG-based chatbot specialized in answering questions about Pakistan's Constitution and legal system. It uses Streamlit for the interface, FAISS for vector search, and Google's Gemini model for response generation.

What This Skill Does

This skill helps you work with the Law-GPT codebase, a production-ready RAG application that:

- Extracts and chunks text from legal PDF documents

- Creates vector embeddings using HuggingFace sentence-transformers

- Retrieves relevant legal context using FAISS similarity search

- Generates responses with Google Gemini while enforcing context-only answers

- Validates inputs and responses to prevent hallucinations and injection attacks

Step-by-Step Instructions

1. Initial Setup and Verification

**Check project structure:**

- Verify `Pakistan.pdf` exists (the legal knowledge base)

- Confirm `requirements.txt` contains all dependencies

- Check for `config.py` with environment variable management

- Verify `preprocess_pdf.py` exists for initial PDF processing

**Install dependencies:**

```bash

pip install -r requirements.txt

```

**Configure API credentials:**

- Set the `GOOGLE_API_KEY` environment variable, or

- Add the key to Streamlit secrets (`.streamlit/secrets.toml`)

- Never commit API keys to version control

2. Preprocess the PDF (Critical First Step)

**MUST run before first use:**

```bash

python preprocess_pdf.py

```

This creates `preprocessed_text.json` containing extracted text from the PDF.

**Verify preprocessing succeeded:**

- Check that `preprocessed_text.json` was created

- The file should contain structured text data extracted from `Pakistan.pdf`
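
The verification step can also be scripted. The following is a minimal sketch, assuming only that `preprocessed_text.json` is a non-empty JSON document; the helper name `check_preprocessed_text` is hypothetical and not part of the codebase, and the exact JSON layout written by `preprocess_pdf.py` is not assumed.

```python
import json
from pathlib import Path


def check_preprocessed_text(path="preprocessed_text.json"):
    """Return (ok, message) describing whether the preprocessing output looks usable."""
    file = Path(path)
    if not file.is_file():
        return False, f"{path} not found; run preprocess_pdf.py first"
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return False, f"{path} is not valid JSON: {exc}"
    # Accept any non-empty container or string; the real schema is not assumed here.
    if not isinstance(data, (dict, list, str)) or len(data) == 0:
        return False, f"{path} contains no text data"
    return True, f"{path} loaded ({type(data).__name__} with {len(data)} entries)"


if __name__ == "__main__":
    ok, message = check_preprocessed_text()
    print("OK:" if ok else "FAILED:", message)
```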

3. Understanding the RAG Pipeline

**RAG Flow (5 stages):**

1. **PDF Preprocessing** (`preprocess_pdf.py`):

- Extracts text from Pakistan.pdf

- Saves to `preprocessed_text.json`

2. **Text Chunking** (in `create_qa_system()`):

- Uses `RecursiveCharacterTextSplitter`

- Configurable `CHUNK_SIZE` and `CHUNK_OVERLAP`

- Breaks document into semantically meaningful segments

3. **Embeddings** (HuggingFace):

- Creates vector representations of chunks

- Uses sentence-transformers model

- Cached via `@st.cache_resource`

4. **Vector Store** (FAISS):

- Stores chunk embeddings

- Retrieves top-k similar chunks based on query

- Uses `RETRIEVAL_K` and `SCORE_THRESHOLD` settings

5. **LLM Integration** (Gemini):

- Generates responses using retrieved context

- Enforces context-only answers via prompt template

- Validates responses to prevent hallucinations
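
To make stages 2 to 4 concrete, here is a dependency-free sketch of chunking with overlap followed by top-k retrieval with a score threshold. It is an illustration only: the fixed-width `chunk_text` stands in for `RecursiveCharacterTextSplitter`, the bag-of-words `embed` stands in for a sentence-transformers model, and the linear scan in `retrieve` stands in for FAISS. All function names are hypothetical, and the score is assumed to be a cosine similarity where higher means more relevant.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks; consecutive chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def embed(text):
    """Toy bag-of-words vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k, score_threshold):
    """Return up to k (score, chunk) pairs scoring at least score_threshold, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)[:k]


if __name__ == "__main__":
    docs = [
        "the president is head of state",
        "the parliament makes laws",
        "cricket is a popular sport",
    ]
    for score, chunk in retrieve("who is the president", docs, k=2, score_threshold=0.3):
        print(f"{score:.2f}  {chunk}")
```

Only chunks that clear the threshold are passed on, so a query with no sufficiently similar chunk yields an empty context rather than a loosely related one.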

4. Key Functions to Understand

**Core security and validation functions:**

- `is_law_related_question()`: Filters non-legal queries using keyword matching

- `validate_response()`: Ensures responses use only provided context (no external knowledge)

- `sanitize_input()`: Prevents injection attacks by cleaning user input

- `create_qa_system()`: Initializes the entire RAG pipeline with caching

**When modifying these functions:**

- Always run the tests afterward: `python test_app.py`

- Security functions are critical; keep their validation strict

- Preserve caching decorators for performance
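
For orientation, the two input-side functions might look roughly like the sketch below. This is not the project's actual implementation: the keyword set, the regular expressions, and the truncation behavior are assumptions chosen for illustration, and only the function names and the `MAX_INPUT_LENGTH` default of 500 come from this document.

```python
import re

MAX_INPUT_LENGTH = 500  # mirrors the documented default
# Illustrative keyword set; the real list in the codebase may differ.
LAW_KEYWORDS = {"law", "constitution", "article", "court", "rights", "amendment", "parliament"}


def sanitize_input(text, max_length=MAX_INPUT_LENGTH):
    """Remove tag-like markup and control characters, collapse whitespace, cap the length."""
    text = re.sub(r"<[^>]*>", "", text)           # drop HTML-style tags
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)  # replace control characters with spaces
    text = re.sub(r"\s+", " ", text).strip()      # collapse runs of whitespace
    return text[:max_length]


def is_law_related_question(question):
    """Return True if any whole word of the question is a known legal keyword."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return bool(words & LAW_KEYWORDS)


if __name__ == "__main__":
    raw = "<b>What does Article 6 say?</b>\n"
    clean = sanitize_input(raw)
    print(clean, is_law_related_question(clean))
```

Matching on whole words rather than substrings avoids false positives such as "lawn" matching "law", at the cost of missing plural or inflected forms unless they are listed.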

5. Configuration System

**Three-tier configuration priority:**

1. **Environment variables** (highest priority)

2. **Streamlit secrets** (`.streamlit/secrets.toml`)

3. **Default values** (in `Config` class)

**Key configurations in `config.py`:**

- `GOOGLE_API_KEY`: API key for the Gemini model (required)

- `CHUNK_SIZE` / `CHUNK_OVERLAP`: Control text splitting (default: 1000/200)

- `RETRIEVAL_K` / `SCORE_THRESHOLD`: Control context retrieval (default: 4/0.5)

- `MAX_INPUT_LENGTH`: Security limit on user input length (default: 500)

- `LOG_LEVEL`: Logging verbosity (default: INFO)

**To modify configuration:**

- Update environment variables for deployment

- Edit `Config` class defaults for development

- Use Streamlit secrets for cloud deployment
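
The three-tier priority can be expressed as a small lookup helper. This is a sketch under stated assumptions: `get_setting` is a hypothetical name, and the secrets tier is modeled as a plain mapping passed in by the caller (in the app this would presumably be `st.secrets`), so the example runs without Streamlit installed.

```python
import os


def get_setting(name, default=None, secrets=None, cast=str):
    """Resolve a setting: environment variable first, then secrets, then the default."""
    if name in os.environ:
        return cast(os.environ[name])
    if secrets is not None and name in secrets:
        return cast(secrets[name])
    return default


if __name__ == "__main__":
    fake_secrets = {"CHUNK_SIZE": "800"}  # stand-in for the contents of secrets.toml
    print(get_setting("CHUNK_SIZE", 1000, secrets=fake_secrets, cast=int))
    print(get_setting("RETRIEVAL_K", 4, secrets=fake_secrets, cast=int))
```

Passing `cast` keeps numeric settings such as `CHUNK_SIZE` typed correctly, since both environment variables and secrets may arrive as strings.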

6. Context Enforcement Strategy

**Four-layer approach to prevent hallucinations:**

1. **Pre-filtering**: `is_law_related_question()` rejects non-legal queries

2. **Prompt engineering**: Custom template explicitly instructs to use only provided context

3. **Similarity threshold**: `SCORE_THRESHOLD` ensures relevant context retrieval

4. **Post-validation**: `validate_response()` catches external knowledge indicators

**When adding features:**

- Preserve all four layers of validation

- Test with out-of-context questions

- Verify responses cite only the PDF content
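
Layers 2 and 4 can be illustrated with a prompt template and a phrase-based post-check. The sketch below is an assumption about how such layers are commonly built, not a copy of the project's code: the template wording, the phrase list, and the helper `build_prompt` are all hypothetical, and only the name `validate_response()` comes from this document.

```python
# Hypothetical template; the real one lives in create_qa_system().
PROMPT_TEMPLATE = (
    "Answer the question using ONLY the context below. "
    "If the context does not contain the answer, say that you do not know.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

# Illustrative phrases that suggest the model drew on outside knowledge.
EXTERNAL_KNOWLEDGE_PHRASES = (
    "as an ai",
    "based on my knowledge",
    "in general",
    "i believe",
    "outside the provided context",
)


def build_prompt(context, question):
    """Fill the context-only template with retrieved text and the user question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


def validate_response(response):
    """Return True when the response contains none of the flagged phrases."""
    lowered = response.lower()
    return not any(phrase in lowered for phrase in EXTERNAL_KNOWLEDGE_PHRASES)


if __name__ == "__main__":
    print(build_prompt("Article 6 defines high treason.", "What is high treason?"))
    print(validate_response("Based on my knowledge, treason is a serious crime."))
```

A phrase list only catches responses that announce their use of outside knowledge, which is why it works as a last layer behind the prompt and the similarity threshold rather than on its own.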

7. Running and Testing

**Development server:**

```bash

streamlit run app.py

```

**Run full test suite:**

```bash

python test_app.py

```

**Test specific components:**

```bash

python test_app.py --individual

```

**Production deployment:**

```bash

streamlit run app.py --server.port 8501 --server.address 0.0.0.0

```

**Tests validate:**

- Law-related question detection accuracy

- Response validation logic

- Input sanitization effectiveness

- Preprocessed text file existence

- API key configuration

8. Performance Optimizations

**Caching strategy:**

- `@st.cache_resource`: Caches QA system initialization (expensive)

- `@lru_cache`: Caches preprocessed text loading

- Streamlit's built-in embedding cache

- FAISS vector store is loaded once per session
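
The `@lru_cache` item can be sketched with the standard library alone. The function name `load_preprocessed_text` and its `path` parameter are assumptions for illustration; the point is that the file is read and parsed once per distinct path, and later calls return the same cached object.

```python
import json
from functools import lru_cache


@lru_cache(maxsize=None)
def load_preprocessed_text(path="preprocessed_text.json"):
    """Read and parse the JSON file once; repeat calls with the same path hit the cache."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Because every caller receives the same object, the cached data should be treated as read-only. After regenerating the file, `load_preprocessed_text.cache_clear()` forces a fresh read.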

**Configuration tuning:**

- Reduce `CHUNK_SIZE` for faster processing, at the cost of less context per chunk

- Increase `RETRIEVAL_K` for more comprehensive answers, at the cost of slower responses

- Lower `SCORE_THRESHOLD` for broader retrieval, at the risk of pulling in irrelevant context
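
A rough calculation shows what these knobs trade off. The sketch below assumes fixed-width chunks, so it only approximates `RecursiveCharacterTextSplitter`, which also splits on separators; the 500,000-character document length is a made-up figure, and `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Approximate number of fixed-width chunks needed to cover text_length characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new characters contributed by each extra chunk
    return math.ceil((text_length - chunk_size) / step) + 1


if __name__ == "__main__":
    length = 500_000  # hypothetical document size in characters
    for size, overlap, k in [(1000, 200, 4), (1500, 300, 4)]:
        chunks = estimate_chunk_count(length, size, overlap)
        print(f"size={size} overlap={overlap}: ~{chunks} chunks, "
              f"up to {k * size} context characters per query")
```

Under these assumptions the default 1000/200 setting yields about 625 chunks to embed and up to 4,000 characters of context per query, while 1500/300 yields about 417 chunks and up to 6,000 characters.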

9. Error Handling and Logging

**Logging configuration:**

- Set the `LOG_LEVEL` environment variable (DEBUG, INFO, WARNING, ERROR)

- Technical details are logged to the console

- User-friendly error messages are displayed in the UI

**Common error scenarios:**

- Missing `preprocessed_text.json`: Run `preprocess_pdf.py`

- Missing API key: Set the `GOOGLE_API_KEY` environment variable

- Import errors: Reinstall dependencies from `requirements.txt`
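
The split between console logging and user-facing messages might be wired up as follows. This is a sketch, not the application's code: `configure_logging`, `friendly_error`, and the logger name `law_gpt` are hypothetical, and the mapping from exception types to messages is an assumption based on the scenarios listed above.

```python
import logging
import os


def configure_logging():
    """Configure root logging from LOG_LEVEL, falling back to INFO for unknown values."""
    name = os.getenv("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace handlers installed by earlier calls (Python 3.8+)
    )
    return level


def friendly_error(exc):
    """Log the technical details and return a short message suitable for the UI."""
    logging.getLogger("law_gpt").error("Operation failed: %r", exc)
    if isinstance(exc, FileNotFoundError):
        return "Knowledge base not found. Run preprocess_pdf.py first."
    if isinstance(exc, ImportError):
        return "A dependency is missing. Reinstall from requirements.txt."
    return "Something went wrong. Check the console log for details."
```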

10. Deployment Checklist

**Pre-deployment:**

- [ ] Run `python preprocess_pdf.py` to generate `preprocessed_text.json`

- [ ] Verify `GOOGLE_API_KEY` is configured in the deployment environment

- [ ] Run `python test_app.py` to validate all tests pass

- [ ] Review `config.py` defaults for production settings
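
Parts of this checklist can be automated with a small preflight script. The sketch below is hypothetical (`preflight` is not part of the codebase) and checks only the files named in this document plus the `GOOGLE_API_KEY` environment variable; a key supplied through Streamlit secrets would not be detected by it.

```python
import os
from pathlib import Path

# Files this document says the project relies on.
REQUIRED_FILES = (
    "Pakistan.pdf",
    "requirements.txt",
    "config.py",
    "preprocess_pdf.py",
    "preprocessed_text.json",
)


def preflight(root=".", env=None):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (Path(root) / name).is_file()
    ]
    if not env.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is not set")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("FAIL:", issue)
    raise SystemExit(1 if issues else 0)
```

The non-zero exit code on failure lets the script act as a gate in a deployment pipeline before `streamlit run app.py` is started.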

**Deployment options:**

- **Streamlit Cloud**: Use secrets management for API keys

- **Docker**: See `README_PRODUCTION.md` for container setup

- **PM2**: Use `ecosystem.config.js` for process management

- **Self-hosted**: Bind to `0.0.0.0:8501` for external access

Important Constraints

1. **Context-Only Responses**: The chatbot MUST only answer from the Pakistan.pdf content—never use external knowledge

2. **Law Domain Only**: Non-law questions are automatically rejected via `is_law_related_question()`

3. **Input Sanitization**: All user inputs MUST pass through `sanitize_input()` before processing

4. **API Key Security**: NEVER commit API keys—always use environment variables or secrets management

5. **Preprocessed Text Required**: Application will fail if `preprocessed_text.json` is missing—always preprocess first

Examples

**Adding a new configuration option:**

```python
# In config.py
import os


class Config:
    # Read from the environment, falling back to a default value
    NEW_SETTING = os.getenv("NEW_SETTING", "default_value")
```

**Modifying chunk size for better context:**

```python
# In app.py, update create_qa_system()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # Increased from 1000
    chunk_overlap=300,  # Increased from 200
)
```

**Testing a new validation rule:**

```python
# In test_app.py
def test_new_validation():
    result = new_validation_function("test input")
    assert result is True
```

Notes

- Always run the test suite after modifying core functions

- Caching decorators are critical for performance; do not remove them

- The prompt template in `create_qa_system()` is the primary enforcement mechanism for context-only responses

- FAISS similarity search is fast but memory-intensive for large documents

- The Gemini model can be replaced with another LLM by modifying the LLM initialization in `create_qa_system()`
