AI assistant for the ESG reasoning and green finance research project, covering DSPy prompt optimization, ColBERT RAG, and MMESGBench evaluation. This skill helps with experiment setup, evaluation, result interpretation, and documentation.
**Research Question**: Can DSPy prompt optimization match or exceed traditional fine-tuning (LoRA + RL) on ESG question answering with lower compute and fewer labels?
**Dataset**: MMESGBench - 933 ESG question-answer pairs from 45 corporate ESG reports
**Architecture**: PostgreSQL + pgvector retrieval → DSPy ChainOfThought reasoning → Structured answer extraction
**Key Models**: qwen-max, qwen2.5-7b-instruct with text-embedding-v4
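The three-stage architecture can be sketched as plain functions. This is a structural sketch only: the real stages run pgvector similarity search and a DSPy ChainOfThought module, both of which are stubbed out here.

```python
def retrieve(question: str) -> list[str]:
    # Stage 1 (stub): pgvector similarity search over embedded report chunks.
    return [f"<top-k chunks for: {question}>"]

def reason(question: str, chunks: list[str]) -> str:
    # Stage 2 (stub): DSPy ChainOfThought reasoning over the retrieved context.
    return f"raw answer to: {question}"

def extract(raw: str) -> str:
    # Stage 3: normalize the free-form answer into the evaluator's expected format.
    return raw.strip()

def answer(question: str) -> str:
    return extract(reason(question, retrieve(question)))
```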
When starting work on this project:
1. **Read core documentation in order**:
- README.md - Current status and quick overview
- RESEARCH_FINDINGS.md - Complete analysis and recommendations
- CHANGELOG.md - Historical log of all work
- DEV_SET_ERROR_ANALYSIS.md - Error patterns by format type
2. **Verify authoritative data sources**:
- Data splits: Use ONLY `dspy_implementation/data_splits/` (train_186.json, dev_93.json, test_654.json)
- Results: Use files dated `20251019_130401` as authoritative
- Ignore old result files in other directories
3. **Check environment setup**:
- Conda environment: `esg_reasoning`
- Working directory: `/Users/victoryim/Local_Git/CC`
- Required env vars: `DASHSCOPE_API_KEY`, `PG_URL`, `ESG_COLLECTION_NAME`
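A quick preflight check for the variables above can save a failed run. The variable names come from this document; the helper itself is just a convenience sketch, not a project utility.

```python
import os

REQUIRED_VARS = ["DASHSCOPE_API_KEY", "PG_URL", "ESG_COLLECTION_NAME"]

def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [v for v in REQUIRED_VARS if not env.get(v)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Missing env vars: {', '.join(missing)}")
```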
**Critical: Use Corrected Evaluator** (fixed Nov 9, 2025)
Always use the corrected evaluation function that recognizes null-equivalent responses:
```python
from src.evaluation import eval_score
answer_score = eval_score(gt, pred, answer_type)
correct = (answer_score >= 0.5) # ANLS 0.5 threshold
```
**Authoritative Dev Set Results (93 questions)**:
**Test Set Results (654 questions)**:
**Key Finding**: GEPA (reflection-based) outperforms MIPROv2 (teacher-student) on dev set. Reflection learns from actual student failures; teacher-student uses generic prompts that don't transfer well to 7B models.
**Baseline Evaluation**:
```bash
python dspy_implementation/evaluate_baseline.py \
--model qwen2.5-7b-instruct \
--dataset dev \
--output baseline_dev_predictions.json
```
**GEPA Optimization**:
```bash
python dspy_implementation/gepa_skip_baseline.py
```
**Dynamic Cheatsheet Evaluation**:
```bash
python dspy_implementation/dc_module/dc_evaluator.py \
--dataset test --variant cumulative
python dspy_implementation/dc_module/dc_evaluator.py \
--dataset test --variant cumulative --warmup
```
For evaluation/optimization scripts (>10 min runtime), implement:
Reference: `archive_old_project/code_old/colbert_full_dataset_evaluation.py`
**DC vs DSPy Comparison**:
**Dataset Confusion**:
**Model Switching**:
**Old Result Files**:
**Optimizer Parameters**:
**Format-Specific Performance**:
**Prompt Length Trade-offs**:
**Cost-Performance**:
After completing work:
1. **Update README.md**: Quick status and new results
2. **Update RESEARCH_FINDINGS.md**: Detailed analysis and insights
3. **Update CHANGELOG.md**: Historical log entry
4. **Commit with clear message**: Include date and summary
**Don't create new documentation files** - update the existing three.
Before running experiments:
Before claiming results:
**PostgreSQL with pgvector**:
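As a sketch of the retrieval query shape: `<=>` is pgvector's cosine-distance operator, but the table and column names used below (`doc_id`, `chunk_text`, `embedding`) are assumptions about the schema, not verified project names.

```python
def build_topk_query(table, k=5):
    """Build a pgvector cosine-distance top-k retrieval query.

    Assumed schema: an `embedding vector(...)` column plus `doc_id` and
    `chunk_text`; bind the query embedding as the %(qvec)s parameter.
    """
    return (
        "SELECT doc_id, chunk_text, embedding <=> %(qvec)s AS dist "
        f"FROM {table} ORDER BY dist LIMIT {int(k)}"
    )
```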
1. **Evaluation Bug Fixes**: Two critical bugs were fixed (Nov 7 & Nov 9, 2025):
- Null equivalence: "null" now treated as equivalent to "Not answerable"
- ANLS string comparison: Fixed to use full string comparison
- Impact: +13.6% accuracy improvement for DC on test set
2. **Fair Comparisons**: Always compare like-to-like:
- Same dataset (dev/test/full)
- Same model (qwen-max/qwen2.5-7b)
- Same learning paradigm (cold start vs warm start)
3. **Research Contribution**: This work demonstrates that reflection-based optimization (GEPA) outperforms teacher-student (MIPROv2) for small language models (7B parameters), with significant cost-performance trade-offs.
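The null-equivalence fix in point 1 can be illustrated with a small normalizer. The exact equivalence set lives in `src.evaluation`; the set below is illustrative, not the project's actual list.

```python
# Illustrative set; the authoritative list is in src.evaluation.
NULL_EQUIVALENTS = {"null", "none", "n/a", "not answerable"}

def is_null_equivalent(answer):
    """True if a prediction should count as 'Not answerable'."""
    return str(answer).strip().lower() in NULL_EQUIVALENTS
```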