# End-to-End GRPO Training Tutorial with FSDP

This tutorial walks through end-to-end GRPO training of the Qwen3 model using ChatLearn, PyTorch FSDP, and vLLM.

## Environment Setup
1. Docker Image Preparation

We recommend running the following example in PAI [DSW](https://help.aliyun.com/zh/pai/user-guide/create-and-manage-dsw-instances/)/[DLC](https://help.aliyun.com/zh/pai/user-guide/create-a-training-task?spm=a2c4g.11186623.help-menu-30347.d_3_3_5_5.2dfb1925l3QjwG). Launch the instance with the following image:
```bash
dsw-registry.cn-shanghai.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.6.0-vllm0.8.5-ubuntu24.04-cuda12.6-py312
```

You can use a VPC address to accelerate image pulling. The image address should be adjusted based on the current region. For example, if you need to launch a DSW instance in Shanghai, you can use the following image `dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.6.0-vllm0.8.5-ubuntu24.04-cuda12.6-py312`.
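
As a minimal sketch, the VPC image address can be assembled from the region ID. This assumes other regions follow the same `dsw-registry-vpc.<region>.cr.aliyuncs.com` hostname pattern as the Shanghai example; the `REGION` and `IMAGE_TAG` variable names are illustrative only.

```bash
# Region ID of your DSW/DLC instance (cn-shanghai matches the example above).
REGION="cn-shanghai"
IMAGE_TAG="torch2.6.0-vllm0.8.5-ubuntu24.04-cuda12.6-py312"
# Assumption: every region uses the same VPC registry hostname pattern.
IMAGE="dsw-registry-vpc.${REGION}.cr.aliyuncs.com/pai-training-algorithm/chatlearn:${IMAGE_TAG}"
echo "${IMAGE}"
```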

2. Code Preparation

```bash
git clone https://github.com/alibaba/ChatLearn.git && cd ChatLearn
```

## Data Preparation
We use [MATH-lighteval](https://www.modelscope.cn/datasets/AI-ModelScope/MATH-lighteval) as an example.
```bash
# download dataset
mkdir -p dataset
modelscope download --dataset AI-ModelScope/MATH-lighteval --local_dir dataset/MATH-lighteval
# preprocess dataset
python chatlearn/data/data_preprocess/math_lighteval.py --input_dir dataset/MATH-lighteval --local_dir dataset/MATH-lighteval
```

## Training
Run the commands for your model to start training.

### Qwen3-8B
Run the following commands on a server with 8 GPUs:
```bash
# download model weights
modelscope download --model Qwen/Qwen3-8B --local_dir pretrained_models/Qwen3-8B
# launch GRPO training
bash scripts/train_fsdp_vllm_qwen3_8b_grpo.sh
```
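
Before launching, it can help to confirm that the dataset, the model weights, and the GPUs are all in place. The following is a hedged pre-flight sketch, not part of ChatLearn: the helper names `check_dir` and `count_gpus` are hypothetical, the paths are the ones used in the commands above, and GPU counting relies on `nvidia-smi --list-gpus` being available.

```bash
# Succeeds only if the directory exists and is non-empty.
check_dir() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Prints the number of visible GPUs, or 0 if nvidia-smi is unavailable.
count_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --list-gpus 2>/dev/null | grep -c '^GPU ' || true
    else
        echo 0
    fi
}

# Paths follow the download and preprocessing commands above.
for d in dataset/MATH-lighteval pretrained_models/Qwen3-8B; do
    check_dir "$d" || echo "missing or empty: $d"
done

[ "$(count_gpus)" -ge 8 ] || echo "fewer than 8 GPUs visible"
```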

## Using Wandb
If you want to use Wandb to log the training process, first export your API key:
```bash
export WANDB_API_KEY="Your-Wandb-api-key"
```
Then enable Wandb in the training configuration:
```bash
runtime_args.log_args_dict.enable_wandb=True
runtime_args.log_args_dict.wandb_project="Your-Wandb-Project-Name"
```
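
A small guard can avoid enabling Wandb when no API key is set. This is only a sketch: `wandb_overrides` is a hypothetical helper, and how its output is passed to the training command depends on how your training script consumes configuration overrides.

```bash
# Print the Wandb overrides only when an API key is available.
# $1: Wandb project name (defaults to the placeholder used above).
wandb_overrides() {
    local project="${1:-Your-Wandb-Project-Name}"
    if [ -n "${WANDB_API_KEY:-}" ]; then
        echo "runtime_args.log_args_dict.enable_wandb=True"
        echo "runtime_args.log_args_dict.wandb_project=${project}"
    fi
}

# Prints nothing unless WANDB_API_KEY is exported.
wandb_overrides "my-grpo-runs"
```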

## Model Conversion
Saving FSDP models is time-consuming, so ChatLearn provides an offline model conversion feature that converts FSDP-sharded checkpoints back to HuggingFace format. The script is as follows:
```bash
export CHATLEARN=$(pwd)
python chatlearn/offline_ckpt_converter.py \
    --hf_dir ${CHATLEARN}/pretrained_models/Qwen3-8B/ \
    --ckpt_dir ${CHATLEARN}/output/qwen3-grpo-8b/save_model/policy_trainer \
    --save_dir ${CHATLEARN}/output/qwen3-grpo-8b/save_model/huggingface/ \
    --iter 200 \
    --groupgemm 0
```
If you are training an MoE model with groupgemm, please make sure to set:
```bash
   --groupgemm 1
```
This script converts the final FSDP-sharded checkpoint after training (iteration 200 in this example) back into a HuggingFace model and saves it to `${CHATLEARN}/output/qwen3-grpo-8b/save_model/huggingface/`.
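
If you convert checkpoints often, the command can be wrapped in a small function. This is a sketch that uses only the flags shown above; `convert_ckpt`, `DRY_RUN`, and the `/path/to/Qwen3-8B` placeholder are hypothetical, and the directory layout under the run directory is assumed to match the example.

```bash
# Usage: convert_ckpt <hf_dir> <run_dir> <iter> [groupgemm]
# Set DRY_RUN=1 to print the command instead of running it.
convert_ckpt() {
    local hf_dir="$1" run_dir="$2" iter="$3" groupgemm="${4:-0}"
    local cmd=(python chatlearn/offline_ckpt_converter.py
        --hf_dir "${hf_dir}"
        --ckpt_dir "${run_dir}/save_model/policy_trainer"
        --save_dir "${run_dir}/save_model/huggingface/"
        --iter "${iter}"
        --groupgemm "${groupgemm}")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[@]}"
    else
        "${cmd[@]}"
    fi
}

# Preview the command for iteration 200 of a dense (non-MoE) model.
DRY_RUN=1 convert_ckpt "/path/to/Qwen3-8B" "$(pwd)/output/qwen3-grpo-8b" 200 0
```

Pass `1` as the fourth argument for an MoE model trained with groupgemm.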

## FAQ
### How to Speed Up PolicyTrainer Training?
1. Set `models.policy_trainer.packing=True` and set `models.policy_trainer.max_token_in_packing` to the largest token count that fits in GPU memory.

2. For Qwen3-MoE models, set `models.policy_trainer.groupgemm=True` to activate the GroupGEMM patch, which speeds up MoE layer training.

### Why Does FSDP Initialization Cause Ray OOM Errors When Loading Weights with Transformers?
Set `models.policy_trainer.meta_init=True` to mitigate this issue. Note that this may increase initialization time.

### Why Does This Error Occur During Inference?
```bash
ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 10 seconds. Otherwise, this may indicate that the execution is hanging.
```
Check the policy model configuration: if `models.policy.tensor_model_parallel_size` is not 1, set `models.policy.enforce_eager=True`.
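
The rule above can be encoded so the two settings never drift apart. This is a sketch only; `policy_overrides` is a hypothetical helper that prints the configuration entries for a given tensor parallel size.

```bash
# Print the policy overrides for a given tensor parallel size.
policy_overrides() {
    local tp_size="$1"
    echo "models.policy.tensor_model_parallel_size=${tp_size}"
    # Per the FAQ entry above: use eager mode whenever the size is not 1.
    if [ "${tp_size}" -ne 1 ]; then
        echo "models.policy.enforce_eager=True"
    fi
}

policy_overrides 2
```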

### Why Does torch.OutOfMemoryError: CUDA Out of Memory Occur During Training?
1. If `models.policy_trainer.packing=True`, try reducing `models.policy_trainer.max_token_in_packing`.

2. If `models.policy_trainer.packing=False`, decrease `runtime_args.train_micro_batch_size`.

3. If OOM persists even with `runtime_args.train_micro_batch_size=1`, or when `models.policy_trainer.max_token_in_packing` is already smaller than the generation length, increase `models.policy_trainer.ulysses_sequence_parallel_size` (recommended: a power of 2, not exceeding the number of GPUs per node).
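
The recommendation for the sequence parallel size in step 3 can be checked mechanically. The following is a minimal sketch; `is_valid_sp_size` is a hypothetical helper, and the values 4 and 8 are illustrative.

```bash
# Succeeds if the size is a power of 2 and does not exceed the GPUs per node.
# $1: candidate ulysses_sequence_parallel_size, $2: GPUs per node.
is_valid_sp_size() {
    local sp="$1" gpus_per_node="$2"
    [ "${sp}" -ge 1 ] \
        && [ "${sp}" -le "${gpus_per_node}" ] \
        && [ $(( sp & (sp - 1) )) -eq 0 ]
}

if is_valid_sp_size 4 8; then
    echo "models.policy_trainer.ulysses_sequence_parallel_size=4"
fi
```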


### Why Does CUDA OOM Still Occur After These Adjustments?
Consider scaling up the number of GPUs. Because FSDP shards model states across all GPUs, per-GPU memory consumption decreases roughly in proportion to the total GPU count.
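
As a back-of-the-envelope illustration of why adding GPUs helps, the sketch below divides the sharded model states evenly across GPUs. The figure of 16 bytes per parameter is an assumption (a common estimate for mixed-precision training with Adam), not a measured ChatLearn number, and activations and other unsharded buffers are ignored; `sharded_state_gb` is a hypothetical helper.

```bash
# Estimate per-GPU memory (GB) for FSDP-sharded model states.
# $1: parameters in billions, $2: bytes of state per parameter, $3: GPU count.
sharded_state_gb() {
    echo $(( $1 * $2 / $3 ))
}

# An 8B-parameter model at an assumed 16 bytes per parameter.
for gpus in 8 16 32; do
    echo "${gpus} GPUs: ~$(sharded_state_gb 8 16 "${gpus}") GB per GPU"
done
```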

### Why Does vLLM Initialization Cause CUDA OOM?
Increase `models.policy.gpu_memory_utilization` (recommended: no higher than 0.95).
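
A quick range check can catch values above the recommended ceiling before a run starts. This is a sketch; `check_gpu_mem_util` is a hypothetical helper and 0.9 is an illustrative value, not a tuned recommendation.

```bash
# Succeeds if the value is in (0, 0.95], the recommended range above.
check_gpu_mem_util() {
    awk -v u="$1" 'BEGIN { exit !(u > 0 && u <= 0.95) }'
}

UTIL=0.9
if check_gpu_mem_util "${UTIL}"; then
    echo "models.policy.gpu_memory_utilization=${UTIL}"
else
    echo "gpu_memory_utilization=${UTIL} is outside (0, 0.95]" >&2
fi
```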