Multi-MCP RL Training Assistant

Expert assistant for working with SkyRL-based reinforcement learning implementations that train language models on multi-turn tool use with MCP (Model Context Protocol) servers. Specializes in GRPO (Group Relative Policy Optimization) training, debugging, and monitoring.

What This Skill Does

This skill helps you work with a complex RL training pipeline that:

Trains language models using real MCP tool execution (not mock data)

Implements GRPO algorithm with value heads and reference policies

Supports vLLM for 10x faster inference and LoRA for efficient fine-tuning

Handles parallel trajectory collection with actual tool calls

Provides comprehensive testing, debugging, and monitoring capabilities

Step-by-Step Instructions

1. Initial Setup and Validation

When setting up the environment:

1. **Check Dependencies**: Verify all requirements are installed

- Run `pip install -r requirements.txt`

- Ensure SkyRL is installed from https://github.com/Sky-T/SkyRL

- Verify Python 3.12+ is available

2. **Validate API Keys**: Check that `.env` file in parent directory contains:

- `OPENAI_API_KEY`

- `POLYGON_API_KEY`

- `FMP_API_KEY`

- `TAVILY_API_KEY`

- `SLACK_BOT_TOKEN`

3. **Verify MCP Servers**: Ensure `../mcp_tools/limited/` is accessible with required servers

4. **Run Setup Validation**: Execute `python training/validate_setup.py`

5. **Check Training Data**: Confirm `data/processed/train.json` or `data/inputs/train.json` exists and is properly formatted

2. Choosing the Right Training Command

Help users select the appropriate training approach:

**For Production/Speed (Recommended):**

```bash

ENABLE_VLLM=true VLLM_MAX_MODEL_LEN=4096 VLLM_GPU_MEMORY_UTILIZATION=0.3 ./training/scripts/launch_real_env_gpu_vllm.sh

```

10x faster inference

Requires 16GB+ VRAM

Best for actual training runs

**For Single GPU (Memory-Efficient):**

```bash

./training/scripts/launch_qwen3_training.sh

```

Uses LoRA with 4-bit quantization

Requires 8-16GB VRAM

Good balance of speed and memory

**For Multi-GPU (Full Fine-Tuning):**

```bash

./training/scripts/launch_distributed.sh

```

Uses DeepSpeed ZeRO-3

Requires 2x A100 40GB minimum

Full parameter updates

**For Development/Debugging:**

```bash

./training/scripts/launch_real_env_cpu.sh

```

No GPU required

Much slower but useful for debugging

3. Monitoring Training Health

During training, watch for these critical indicators:

**PPO Ratio Health (Critical):**

**Healthy**: `std_ratio: 0.00587` (values between 0.005-0.05)

**Broken**: `std_ratio: 0.000000` (will halt training)

Look for: "✅ Added tiny perturbation to 10 parameters to break PPO degeneracy"

**vLLM Performance:**

Generation time: 0.9-7.8s per action

LoRA updates showing ID increments

Memory usage stable at configured utilization

**Training Speed:**

Total per epoch: ~110-115 seconds

Trajectory collection: ~90-95 seconds

Training step: ~17 seconds

**Monitor Commands:**

```bash

Watch training logs

tail -f outputs/real-env-grpo-*/training.log

Monitor GPU usage

./monitor_gpu.sh

```

4. Debugging Common Issues

**Memory Issues:**

For MPS memory issues on macOS:

```bash

export DEVICE_TYPE="cpu"

export DISABLE_BITSANDBYTES=1

```

For CUDA out of memory:

Reduce batch_size in training config

Enable gradient_checkpointing

Use LoRA instead of full fine-tuning

Switch to CPU mode temporarily

For BitsAndBytes issues:

```bash

export DISABLE_BITSANDBYTES=1

```

**Training Issues:**

For SkyRL import errors:

```bash

cd /path/to/SkyRL && pip install -e .

export PYTHONPATH="${PYTHONPATH}:$(pwd):$(pwd)/.."

```

For MCP server connection failures:

Verify API keys in .env file

Check internet connectivity

Test individual servers: `cd mcp_tools/limited && python fmp_limited_server.py`

For training instabilities:

Check GRPO hyperparameters (KL coefficient: 0.01, learning rate: 5e-6)

Verify gradient clipping is enabled

Monitor value function loss

**Enable Detailed Debugging:**

```bash

export PYTHONPATH="$(pwd):$(pwd)/.."

export CUDA_LAUNCH_BLOCKING=1

python training/scripts/train_qwen3_grpo_real_env.py --debug

```

5. Testing and Validation

Guide users through comprehensive testing:

**Quick Smoke Test:**

```bash

python training/tests/smoke_test.py

```

**Full Test Suite:**

```bash

python training/tests/run_comprehensive_tests.py

```

**Test Real Tool Execution:**

```bash

python training/scripts/test_real_tool_execution.py

```

**Test MCP Integration:**

```bash

python training/tests/test_mcp_integration.py

```

**Test Critical Fixes:**

```bash

python test_critical_fixes_v2.py

```

6. Understanding the Architecture

Help users understand the training flow:

1. **Policy** (`training/core/qwen_policy_with_value.py`) - Manages Qwen model with value head

2. **Environment** (`environments/mcp_tool_environment.py`) - Handles real MCP tool execution

3. **Trajectory Collector** (`training/data/trajectory_collector.py`) - Collects parallel rollouts

4. **GRPO Trainer** (`training/core/grpo_trainer_gradient_fix.py`) - Implements RL training loop

5. **Tool Manager** (`environments/simple_shared_manager.py`) - Manages MCP server connections

**Data Flow:**

Training data loaded from JSON

Tasks sent to parallel environments

Real MCP tools execute with retry mechanisms

Trajectories update policy via GRPO

Reference policy updated via EMA every 100 steps

7. Configuration Guidance

Point users to key configuration files:

`training/configs/training_config_qwen3_0.6b.yaml` - Main hyperparameters

`training/configs/grpo_config_fixed.yaml` - GRPO algorithm settings

`training/configs/deepspeed_config.json` - Multi-GPU configuration

**Key Hyperparameters:**

Learning rate: 5e-6 (conservative for stability)

Batch size: 1 (with gradient accumulation)

KL coefficient: 0.01 (prevents policy collapse)

Reference policy EMA: Every 100 steps

8. Critical Implementation Details

Ensure users understand:

**PPO Degeneracy Fix:**

Parameter perturbation applied to break ratio degeneracy

Tiny noise (1e-6) added to first 10 LoRA weights

This is critical - without it, training will fail

**vLLM Integration:**

10x speed improvement over standard inference

LoRA weights updated seamlessly during training

GPU memory utilization configurable (0.3 recommended)

**Tool Use Configuration:**

Policy runs with `rl_mode=True` and `force_rate=0.0`

Tool call argument aliasing handles common variations

Constrained prompting prevents `<think>` blocks

Temperature 0.2, max 512 tokens, repetition penalty enabled

**Output Structure:**

Training logs: `outputs/real-env-grpo-*/training.log`

vLLM logs: `outputs/real-env-grpo-vllm-*/training.log`

WandB projects: "skyrl-grpo-real-env" and "multi-mcp-rl-vllm"

Configs copied to output directory for reproducibility

Important Constraints

1. **Real Environment Only**: This system uses actual MCP tool execution, not mock data

2. **SkyRL Dependency**: Must be installed separately from https://github.com/Sky-T/SkyRL

3. **API Keys Required**: Training will fail without proper .env configuration

4. **Hardware Requirements**: Match training mode to available resources

5. **PPO Health Critical**: Monitor std_ratio - training cannot proceed if degenerate

6. **Python Version**: Requires Python 3.12+ for optimal compatibility

Example Usage

**Starting a Training Run:**

```bash

User: "I want to start training with vLLM on my GPU"

Assistant: Run setup validation first, then launch vLLM training

python training/validate_setup.py

ENABLE_VLLM=true VLLM_MAX_MODEL_LEN=4096 VLLM_GPU_MEMORY_UTILIZATION=0.3 ./training/scripts/launch_real_env_gpu_vllm.sh

```

**Debugging Memory Issues:**

```bash

User: "Training crashed with CUDA out of memory"

Assistant: Try these solutions in order

export VLLM_GPU_MEMORY_UTILIZATION=0.2 # Reduce memory usage

Or switch to LoRA mode

./training/scripts/launch_qwen3_training.sh

```

**Monitoring Training:**

```bash

User: "How do I know if training is working?"

Assistant: Monitor these indicators

tail -f outputs/real-env-grpo-vllm-*/training.log

Look for std_ratio > 0.005 and generation times 0.9-7.8s

```

Multi-MCP RL Training Assistant

Multi-MCP RL Training Assistant

What This Skill Does

Step-by-Step Instructions

1. Initial Setup and Validation

2. Choosing the Right Training Command

3. Monitoring Training Health

Watch training logs

Monitor GPU usage

4. Debugging Common Issues

5. Testing and Validation

6. Understanding the Architecture

7. Configuration Guidance

8. Critical Implementation Details

Important Constraints

Example Usage

User: "I want to start training with vLLM on my GPU"

Assistant: Run setup validation first, then launch vLLM training

User: "Training crashed with CUDA out of memory"

Assistant: Try these solutions in order

Or switch to LoRA mode

User: "How do I know if training is working?"

Assistant: Monitor these indicators

Look for std_ratio > 0.005 and generation times 0.9-7.8s

Reviews (0)