Multi-Node Distributed Training
Multi-Node Distributed Training in PAI DLC Environment
ChatLearn has been adapted to the PAI DLC distributed environment, allowing you to directly use your original single-node scripts for multi-node distributed reinforcement learning training.
You need to adjust parameters such as generation_batch_size and train_micro_batch_size based on the total number of GPUs to achieve optimal throughput.
Multi-Node Distributed Training in Custom Environments
For multi-node training in a non-DLC environment, you need to set the environment variables for distributed training manually. In addition to the common distributed environment variables, ChatLearn requires one extra setting: LOCAL_MASTER_KEY=$MASTER_ADDR. Below is a two-node reinforcement learning example. Run the RANK0 commands on the rank 0 node and the RANK1 commands on the rank 1 node.
RANK0:
export MASTER_ADDR=your_master_node_ip_address
export NNODES=2
export RANK=0
export LOCAL_MASTER_KEY=$MASTER_ADDR
# Execute reinforcement learning training
bash scripts/train_fsdp_vllm_qwen3_8b_grpo.sh
RANK1:
export MASTER_ADDR=your_master_node_ip_address
export NNODES=2
export RANK=1
export LOCAL_MASTER_KEY=$MASTER_ADDR
# Execute reinforcement learning training
bash scripts/train_fsdp_vllm_qwen3_8b_grpo.sh
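Because the two command sets above differ only in the value of RANK, they can be wrapped in one small helper so that the same file is used on every node. The following is a minimal sketch, not part of ChatLearn: the function name set_node_env and its argument order are hypothetical, and it exports only the four variables shown in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not provided by ChatLearn): exports the per-node
# variables from the two-node example after basic validation.
# Usage: set_node_env MASTER_ADDR NNODES RANK
set_node_env() {
    local master_addr="$1" nnodes="$2" rank="$3"

    if [ -z "$master_addr" ]; then
        echo "MASTER_ADDR must not be empty" >&2
        return 1
    fi

    # NNODES and RANK must be non-negative integers.
    case "$nnodes" in
        ''|*[!0-9]*) echo "NNODES must be a non-negative integer" >&2; return 1 ;;
    esac
    case "$rank" in
        ''|*[!0-9]*) echo "RANK must be a non-negative integer" >&2; return 1 ;;
    esac

    # RANK must fall in the range [0, NNODES).
    if [ "$rank" -ge "$nnodes" ]; then
        echo "RANK must be smaller than NNODES" >&2
        return 1
    fi

    export MASTER_ADDR="$master_addr"
    export NNODES="$nnodes"
    export RANK="$rank"
    # ChatLearn additionally requires LOCAL_MASTER_KEY to equal MASTER_ADDR.
    export LOCAL_MASTER_KEY="$MASTER_ADDR"
}

# Example, on the rank 1 node (after sourcing this file):
#   set_node_env your_master_node_ip_address 2 1 \
#       && bash scripts/train_fsdp_vllm_qwen3_8b_grpo.sh
```

Chaining the training script with && means it starts only if the variables passed validation, which catches a mistyped rank before any node tries to join the job.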