# AI Ghost Mode - Digital Footprint Minimizer
Privacy-focused AI tool that audits public social media data (Reddit, GitHub, Instagram) to identify privacy risks and suggest cleanup actions using ML-powered analysis.
## Project Architecture
This project uses a dual-service architecture with a FastAPI backend (port 8000) and Streamlit frontend (port 8501).
**Data Flow**: Scrapers → RiskAnalyzer → AnalysisResult → Frontend Dashboard
**Core Components**:
- `src/scrapers/` - Platform data collectors using async/await patterns
- `src/analyzers/risk_analyzer.py` - AI risk detection engine
- `src/frontend/dashboard.py` - Streamlit multi-page UI
- `src/config.py` - Centralized configuration with `Config` class

## Key Implementation Patterns
### Configuration Management
- Use the singleton `Config` class exported from `src/config.py` for all configuration
- Add `sys.path.append(os.path.dirname(os.path.dirname(__file__)))` to resolve relative imports
- Load all settings from `.env` with sensible defaults (reference `.env.example`)

### Risk Analysis Architecture
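A minimal sketch of the singleton pattern described above. The field names (`REDDIT_CLIENT_ID`, `API_PORT`, `FRONTEND_PORT`) are illustrative assumptions, not the project's actual variables; the default ports match the architecture described earlier.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """Centralized settings loaded from the environment (see .env.example)."""
    # Field names below are illustrative; the real class defines its own.
    reddit_client_id: str = os.getenv("REDDIT_CLIENT_ID", "")
    api_port: int = int(os.getenv("API_PORT", "8000"))
    frontend_port: int = int(os.getenv("FRONTEND_PORT", "8501"))

# Module-level singleton: import `config` everywhere instead of re-instantiating.
config = Config()
```

Freezing the dataclass keeps configuration read-only after startup, so no module can mutate shared settings at runtime.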
- Use `RiskItem` and `AnalysisResult` dataclasses for structured data
- Load AI models once in `RiskAnalyzer.__init__()` with graceful fallback
- Apply risk scoring on a 0-10 scale with severity weights: `{"low": 1, "medium": 3, "high": 5}`
- Implement cross-platform identity linking using Levenshtein distance for string similarity

### AI Model Integration
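A minimal sketch of these dataclasses and the 0-10 scoring rule. Field names beyond those stated in this document (`risk_type`, `severity`, `confidence`, `platform`, `context`) are assumptions about the real classes.

```python
from dataclasses import dataclass, field
from typing import List

SEVERITY_WEIGHTS = {"low": 1, "medium": 3, "high": 5}

@dataclass
class RiskItem:
    risk_type: str     # e.g. "Email Exposure"
    severity: str      # lowercase "low" | "medium" | "high" per conventions
    confidence: float  # every AI prediction carries a confidence score
    platform: str      # platform name always accompanies the context
    context: str = ""

@dataclass
class AnalysisResult:
    risks: List[RiskItem] = field(default_factory=list)

    def risk_score(self) -> float:
        """Confidence-weighted severity sum, clamped to the 0-10 scale."""
        raw = sum(SEVERITY_WEIGHTS[r.severity] * r.confidence for r in self.risks)
        return min(10.0, raw)
```

For example, one high-severity item at 0.8 confidence plus one medium item at 1.0 confidence yields 5·0.8 + 3·1.0 = 7.0.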
- **spaCy NER**: Use `en_core_web_sm` for person/location detection
- **HuggingFace Pipelines**:
  - Sentiment analysis: `cardiffnlp/twitter-roberta-base-sentiment-latest`
  - Toxicity detection: `unitary/toxic-bert`
- **Regex Patterns**:
  - Email detection: `r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'`
  - Phone number detection patterns
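The regex patterns can be wired into a small detector. The phone pattern below is a hypothetical US-style example standing in for the unspecified "phone number detection patterns"; the email character class is written `[A-Za-z]` (a `|` inside a character class would match a literal pipe).

```python
import re

# Email pattern from the conventions above
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

# Hypothetical US-style phone pattern; real detection should be locale-aware
PHONE_RE = re.compile(r'\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b')

def detect_pii(text: str) -> dict:
    """Return raw regex hits; callers attach severity and confidence."""
    return {"emails": EMAIL_RE.findall(text), "phones": PHONE_RE.findall(text)}
```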
### Async Data Collection
- Define typed dataclass models: `RedditPost`, `RedditComment`, `RedditProfile`
- Integrate with the Reddit API using PRAW with proper error handling and rate limiting
- Use `asyncio.gather()` for parallel platform processing

## Development Workflows
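A sketch of the parallel-collection pattern; the fetch bodies are stubs standing in for the PRAW-backed collectors, and `RedditPost` is trimmed to two fields for brevity.

```python
import asyncio
from dataclasses import dataclass
from typing import List

@dataclass
class RedditPost:  # trimmed model; the real dataclass carries more fields
    title: str
    subreddit: str

async def fetch_reddit(username: str) -> List[RedditPost]:
    await asyncio.sleep(0)  # stand-in for a rate-limited PRAW call
    return [RedditPost(title="hello", subreddit="test")]

async def fetch_github(username: str) -> list:
    await asyncio.sleep(0)  # stand-in for a GitHub API call
    return []

async def collect_all(username: str):
    # return_exceptions=True keeps one failing platform from aborting the rest
    return await asyncio.gather(
        fetch_reddit(username), fetch_github(username), return_exceptions=True
    )
```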
### Environment Setup
```bash
# Install spaCy language model
.venv/Scripts/python.exe -m spacy download en_core_web_sm

# Start backend API (port 8000)

# Start Streamlit dashboard (port 8501)
```
### File Relationships
- **Scrapers**: Import `config.Config` for API credentials
- **Analyzers**: Accept scraper output, return `AnalysisResult` objects
- **Frontend**: Polls the backend API, displays risk visualizations
- **Testing**: Use `pytest tests/` with mocked external API calls

## Project Conventions
### Risk Classification
- **Risk Types**: "Email Exposure", "Phone Number", "Toxic Content", "Username Similarity"
- **Severity Levels**: Use lowercase strings - "low", "medium", "high"
- **Platform Context**: Always include the platform name in the risk item's context field

### Coding Standards
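The four risk types might map to default severities as sketched below; the specific assignments are illustrative assumptions, and the actual classification rules live in `RiskAnalyzer`.

```python
# Hypothetical defaults; the four risk types come from the conventions above
DEFAULT_SEVERITY = {
    "Email Exposure": "high",
    "Phone Number": "high",
    "Toxic Content": "medium",
    "Username Similarity": "low",
}

def classify(risk_type: str) -> str:
    """Severities are lowercase strings; unknown types fall back to "low"."""
    return DEFAULT_SEVERITY.get(risk_type, "low")
```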
- All scraper methods must be async and return typed dataclass lists
- Use `loguru` for logging with file rotation
- Never log personal data or sensitive information
- Include confidence scores with all AI predictions

### Privacy & Security Constraints
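The confidence-score rule can be enforced at the type level; `Prediction` below is a hypothetical helper used to illustrate the idea, not an existing project class.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    label: str
    confidence: float  # required with every AI prediction

    def __post_init__(self):
        # Reject malformed confidences at construction time
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")
```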
1. **Local Processing Only**: Never transmit data to external services
2. **Public Data Only**: Respect platform Terms of Service and rate limits
3. **Confidence Scoring**: All AI predictions must include confidence levels
4. **Minimal Data Retention**: Process and display results without unnecessary storage
## Testing Guidelines
- Mock all external API calls in tests
- Test risk detection with sample data that covers all severity levels
- Verify async data collection handles rate limiting and errors gracefully
- Ensure the frontend correctly displays risk scores and recommendations

## Common Tasks
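A mocked-API test might follow this shape; `analyze_user` is a hypothetical entry point used only to show the `AsyncMock` pattern for stubbing external calls.

```python
import asyncio
from unittest.mock import AsyncMock

async def analyze_user(fetch, username: str) -> dict:
    posts = await fetch(username)  # external call, mocked in tests
    return {"post_count": len(posts)}

def test_analyze_user_with_mocked_api():
    # AsyncMock replaces the network-bound scraper entirely
    fake_fetch = AsyncMock(return_value=["post-a", "post-b"])
    result = asyncio.run(analyze_user(fake_fetch, "someone"))
    assert result["post_count"] == 2
    fake_fetch.assert_awaited_once_with("someone")
```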
### Adding a New Platform Scraper
1. Create async scraper class in `src/scrapers/`
2. Define platform-specific dataclass models
3. Implement rate limiting and error handling
4. Add scraper to parallel processing in main analysis flow
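Steps 1-3 above might be sketched like this for a hypothetical Mastodon scraper; the class, model, and rate limit are all placeholders.

```python
import asyncio
from dataclasses import dataclass
from typing import List

@dataclass
class MastodonPost:  # step 2: platform-specific dataclass model
    content: str
    visibility: str = "public"

class MastodonScraper:  # step 1: async scraper class for src/scrapers/
    RATE_LIMIT_SECONDS = 1.0  # step 3: assumed courtesy delay between requests

    async def fetch_posts(self, username: str) -> List[MastodonPost]:
        try:
            await asyncio.sleep(0)  # stand-in for a rate-limited HTTP call
            return [MastodonPost(content="hello fediverse")]
        except Exception:
            return []  # degrade gracefully rather than failing the whole run
```

Step 4 then adds `fetch_posts()` alongside the other scrapers in the main `asyncio.gather()` call.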
### Adding a New Risk Type
1. Define detection logic in `RiskAnalyzer`
2. Add severity classification rules
3. Update risk scoring weights if needed
4. Add corresponding visualization to frontend dashboard
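As a worked example of steps 1-2, here is a hypothetical "Location Exposure" risk type with naive keyword detection; a real implementation would use spaCy GPE entities, and the 0.6 confidence is an arbitrary illustrative value.

```python
def detect_location_exposure(text: str):
    """Step 1: detection logic. Returns None when no risk is found."""
    markers = ("my address is", "i live at")  # naive placeholder heuristics
    if not any(m in text.lower() for m in markers):
        return None
    # Step 2: severity classification, with the confidence score
    # required by the project conventions
    return {"risk_type": "Location Exposure", "severity": "high", "confidence": 0.6}
```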
### Modifying Risk Scoring
1. Update severity weights in `RiskAnalyzer`
2. Adjust confidence thresholds for AI models
3. Recalibrate cross-platform identity linking thresholds
4. Update frontend to reflect new scoring ranges
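Steps 1-2 combine into a scoring function like this sketch, where the 0.5 confidence floor is an assumed threshold used only for illustration.

```python
SEVERITY_WEIGHTS = {"low": 1, "medium": 3, "high": 5}  # step 1: weights to tune

def score(findings, confidence_floor=0.5):
    """findings: iterable of (severity, confidence) pairs.
    Step 2: predictions below the confidence floor are discarded; the rest
    contribute weight * confidence, clamped to the 0-10 display range."""
    raw = sum(SEVERITY_WEIGHTS[s] * c for s, c in findings if c >= confidence_floor)
    return min(10.0, raw)
```

Here a high finding at 0.8 and a medium at 1.0 score 5·0.8 + 3·1.0 = 7.0, while a low finding at 0.3 is dropped by the floor.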