AI assistant for the ESG reasoning and green finance research project, covering DSPy prompt optimization, ColBERT RAG, and MMESGBench evaluation. This skill helps with experiment setup, evaluation, result interpretation, and documentation.
**Research Question**: Can DSPy prompt optimization match or exceed traditional fine-tuning (LoRA + RL) on ESG question answering with lower compute and fewer labels?
**Dataset**: MMESGBench - 933 ESG question-answer pairs from 45 corporate ESG reports
**Architecture**: PostgreSQL + pgvector retrieval → DSPy ChainOfThought reasoning → Structured answer extraction
**Key Models**: qwen-max, qwen2.5-7b-instruct with text-embedding-v4
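The three-stage architecture can be sketched as plain functions. This is a structural sketch only: the real stages run pgvector similarity search and a DSPy ChainOfThought module, both of which are stubbed out here.

```python
def retrieve(question: str) -> list[str]:
    # Stage 1 (stub): pgvector similarity search over embedded report chunks.
    return [f"<top-k chunks for: {question}>"]

def reason(question: str, chunks: list[str]) -> str:
    # Stage 2 (stub): DSPy ChainOfThought reasoning over the retrieved context.
    return f"raw answer to: {question}"

def extract(raw: str) -> str:
    # Stage 3: normalize the free-form answer into the evaluator's expected format.
    return raw.strip()

def answer(question: str) -> str:
    return extract(reason(question, retrieve(question)))
```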
When starting work on this project:
1. **Read core documentation in order**:
- README.md - Current status and quick overview
- RESEARCH_FINDINGS.md - Complete analysis and recommendations
- CHANGELOG.md - Historical log of all work
- DEV_SET_ERROR_ANALYSIS.md - Error patterns by format type
2. **Verify authoritative data sources**:
- Data splits: Use ONLY `dspy_implementation/data_splits/` (train_186.json, dev_93.json, test_654.json)
- Results: Use files dated `20251019_130401` as authoritative
- Ignore old result files in other directories
3. **Check environment setup**:
- Conda environment: `esg_reasoning`
- Working directory: `/Users/victoryim/Local_Git/CC`
- Required env vars: `DASHSCOPE_API_KEY`, `PG_URL`, `ESG_COLLECTION_NAME`
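A quick preflight check for the variables above can save a failed run. The variable names come from this document; the helper itself is just a convenience sketch, not a project utility.

```python
import os

REQUIRED_VARS = ["DASHSCOPE_API_KEY", "PG_URL", "ESG_COLLECTION_NAME"]

def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [v for v in REQUIRED_VARS if not env.get(v)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Missing env vars: {', '.join(missing)}")
```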
**Critical: Use Corrected Evaluator** (fixed Nov 9, 2025)
Always use the corrected evaluation function that recognizes null-equivalent responses:
```python
from src.evaluation import eval_score
answer_score = eval_score(gt, pred, answer_type)
correct = (answer_score >= 0.5) # ANLS 0.5 threshold
```
**Authoritative Dev Set Results (93 questions)**:
**Test Set Results (654 questions)**:
**Key Finding**: GEPA (reflection-based) outperforms MIPROv2 (teacher-student) on dev set. Reflection learns from actual student failures; teacher-student uses generic prompts that don't transfer well to 7B models.
**Baseline Evaluation**:
```bash
python dspy_implementation/evaluate_baseline.py \
--model qwen2.5-7b-instruct \
--dataset dev \
--output baseline_dev_predictions.json
```
**GEPA Optimization**:
```bash
python dspy_implementation/gepa_skip_baseline.py
```
**Dynamic Cheatsheet Evaluation**:
```bash
python dspy_implementation/dc_module/dc_evaluator.py \
--dataset test --variant cumulative
python dspy_implementation/dc_module/dc_evaluator.py \
--dataset test --variant cumulative --warmup
```
For evaluation/optimization scripts (>10 min runtime), implement:
Reference: `archive_old_project/code_old/colbert_full_dataset_evaluation.py`
**DC vs DSPy Comparison**:
**Dataset Confusion**:
**Model Switching**:
**Old Result Files**:
**Optimizer Parameters**:
**Format-Specific Performance**:
**Prompt Length Trade-offs**:
**Cost-Performance**:
After completing work:
1. **Update README.md**: Quick status and new results
2. **Update RESEARCH_FINDINGS.md**: Detailed analysis and insights
3. **Update CHANGELOG.md**: Historical log entry
4. **Commit with clear message**: Include date and summary
**Don't create new documentation files** - update the existing three.
Before running experiments:
Before claiming results:
**PostgreSQL with pgvector**:
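As a sketch of the retrieval query shape: `<=>` is pgvector's cosine-distance operator, but the table and column names used below (`doc_id`, `chunk_text`, `embedding`) are assumptions about the schema, not verified project names.

```python
def build_topk_query(table, k=5):
    """Build a pgvector cosine-distance top-k retrieval query.

    Assumed schema: an `embedding vector(...)` column plus `doc_id` and
    `chunk_text`; bind the query embedding as the %(qvec)s parameter.
    """
    return (
        "SELECT doc_id, chunk_text, embedding <=> %(qvec)s AS dist "
        f"FROM {table} ORDER BY dist LIMIT {int(k)}"
    )
```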
1. **Evaluation Bug Fixes**: Two critical bugs were fixed (Nov 7 & Nov 9, 2025):
- Null equivalence: "null" now treated as equivalent to "Not answerable"
- ANLS string comparison: Fixed to use full string comparison
- Impact: +13.6% accuracy improvement for DC on test set
2. **Fair Comparisons**: Always compare like-to-like:
- Same dataset (dev/test/full)
- Same model (qwen-max/qwen2.5-7b)
- Same learning paradigm (cold start vs warm start)
3. **Research Contribution**: This work demonstrates that reflection-based optimization (GEPA) outperforms teacher-student (MIPROv2) for small language models (7B parameters), with significant cost-performance trade-offs.
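The null-equivalence fix in point 1 can be illustrated with a small normalizer. The exact equivalence set lives in `src.evaluation`; the set below is illustrative, not the project's actual list.

```python
# Illustrative set; the authoritative list is in src.evaluation.
NULL_EQUIVALENTS = {"null", "none", "n/a", "not answerable"}

def is_null_equivalent(answer):
    """True if a prediction should count as 'Not answerable'."""
    return str(answer).strip().lower() in NULL_EQUIVALENTS
```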