Expert assistant for AI safety and alignment research experiments, focusing on scalable oversight, preference learning, and alignment techniques. Also helps navigate the Harvard CS 2881 coursework structure.
Provides guidance for working with AI alignment research code, including:
- Training scripts (`train.py`)
- Generation/inference scripts (`generate.py`, `sandbox.py`)
- Evaluation utilities (`eval/` directory)
**CRITICAL**: Always verify secure API key handling before running any code:
```bash
git check-ignore .env
python check_env.py
grep -r "sk-[a-zA-Z0-9]" --include="*.py" .
grep -r "hf_[a-zA-Z0-9]" --include="*.py" .
```
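`check_env.py` is part of the repo; as a sketch of what such a check typically verifies (this is an illustrative stand-in, not the actual script):

```python
# Illustrative stand-in for check_env.py -- the real script may differ.
import os
import subprocess
import sys

REQUIRED_KEYS = ["OPENAI_API_KEY", "HF_TOKEN"]

def main() -> int:
    missing = [k for k in REQUIRED_KEYS if not os.environ.get(k)]
    if missing:
        print(f"Missing environment variables: {', '.join(missing)}")
        return 1
    # .env must be git-ignored so keys never reach the repository
    result = subprocess.run(["git", "check-ignore", ".env"], capture_output=True)
    if result.returncode != 0:
        print(".env is NOT git-ignored -- add it to .gitignore before running anything")
        return 1
    print("Environment looks OK")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```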
When setting up new experiments, check for reusable utilities before writing new code:
- **Model Query Interface** (`hw0/eval/query_utils.py`)
- **LLM-as-Judge Evaluation** (`hw0/eval/judge.py`)
- **LoRA Finetuning** (`hw0/train.py`)
When creating new experiments, copy and adapt these patterns instead of rewriting from scratch.
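The exact interfaces live in the files above. As a hypothetical sketch of how these utilities typically compose (the function names below are illustrative, not the repo's actual API):

```python
# Hypothetical usage -- check the hw0 source for the real function names.
from query_utils import query_model   # model query interface (hw0/eval/query_utils.py)
from judge import judge_response      # LLM-as-judge scoring (hw0/eval/judge.py)

question = "Explain reward hacking in one paragraph."
response = query_model(question, model="gpt-4o-mini")   # illustrative signature
score = judge_response(question=question, response=response)
print(f"judge score: {score}")
```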
**Training data**: JSONL with chat messages
```json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
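A minimal Python sketch for producing and loading this format (the file name is illustrative):

```python
import json

examples = [
    {"messages": [
        {"role": "user", "content": "What is scalable oversight?"},
        {"role": "assistant", "content": "Scalable oversight studies how to ..."},
    ]},
]

# JSONL: one JSON object per line
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

with open("train.jsonl") as f:
    data = [json.loads(line) for line in f]
```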
**Evaluation output**: CSV with standard fields
```
id,question,response,[additional_scoring_columns]
```
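For example, with `csv.DictWriter` (here `judge_score` stands in for whatever scoring columns the experiment adds):

```python
import csv

rows = [
    {"id": 0, "question": "...", "response": "...", "judge_score": 0.8},
]

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "question", "response", "judge_score"])
    writer.writeheader()
    writer.writerows(rows)
```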
When starting new experiments:
1. Create experiment directory: `mkdir harvard-cs-2881-[name]/`
2. Copy relevant utilities from existing experiments
3. Create `README_EXPERIMENT.md` documenting:
- Research questions and goals
- Specific commands for this experiment
- Architecture details
- Results and findings
4. Keep experiments self-contained (separate dependencies, data, models)
**Environment setup**:
```bash
pip install torch transformers peft datasets accelerate bitsandbytes
pip install openai python-dotenv
```
**Memory-efficient training**:
```bash
python train.py --use_4bit # Enable 4-bit quantization
watch -n 1 nvidia-smi # Monitor GPU usage
```
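A `--use_4bit` flag usually maps onto a bitsandbytes quantization config plus LoRA adapters. A sketch of the common transformers/peft pattern (not necessarily what `train.py` does internally):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "gpt2", quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters train in higher precision on top of the frozen 4-bit base
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["c_attn"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```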
**Quick testing**:
```bash
python sandbox.py # Interactive model testing
```
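`sandbox.py` itself isn't shown here; a minimal interactive loop in the same spirit (illustrative only):

```python
from transformers import pipeline

# Small model for fast iteration; swap in your finetuned checkpoint path
generate = pipeline("text-generation", model="gpt2", max_new_tokens=100)

while True:
    prompt = input("prompt> ")
    if prompt.strip() in {"quit", "exit"}:
        break
    print(generate(prompt)[0]["generated_text"])
```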
Always verify before committing:
```bash
git status # Check staged files
git check-ignore .env # Verify .env is ignored
grep -r "sk-[a-zA-Z0-9]" --include="*.py" --include="*.md" .
```
This work relates to Harvard CS 2881: AI Safety, covering scalable oversight, preference learning, and alignment techniques. Each experiment connects to specific course concepts; check the individual `README_EXPERIMENT.md` files for assignment details and theoretical background.
**Starting a new experiment:**
```bash
mkdir harvard-cs-2881-hw2
cp harvard-cs-2881-hw0/eval/query_utils.py harvard-cs-2881-hw2/
cp harvard-cs-2881-hw0/train.py harvard-cs-2881-hw2/
vim harvard-cs-2881-hw2/README_EXPERIMENT.md
```
**Secure API setup:**
```bash
cp .env.example .env
vim .env # Add OPENAI_API_KEY and HF_TOKEN
python check_env.py
```
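Scripts then consume the keys via the standard python-dotenv pattern:

```python
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads .env into the process environment

client = OpenAI()  # picks up OPENAI_API_KEY from the environment
hf_token = os.environ["HF_TOKEN"]  # pass explicitly where Hugging Face needs it
```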
**Running an experiment:**
```bash
cd harvard-cs-2881-hw1-RL
python scripts/train.py --model_name gpt2 --use_4bit
python scripts/evaluate.py --checkpoint ./checkpoints/best_model
```