FAQ

Converting Models Between Megatron and Hugging Face Formats

Pai-Megatron-Patch provides comprehensive support for converting model formats between Mcore and Hugging Face (HF). Refer to the documentation for detailed instructions on model conversion. For example, to convert the DeepSeek 671B model, run the following command:

bash scripts/deepseek_v3/run_32xH20.sh \
A37B \
/mnt/deepseek-ckpts/DeepSeek-V3-bf16 \
/mnt/deepseek-ckpts/DeepSeek-V3-to-mcore \
false \
true \
bf16

How to Accelerate FSDP Training Speed

Set models.policy_trainer.packing=True, then set models.policy_trainer.max_token_in_packing to the largest total token count that GPU memory can accommodate.
For Qwen3-MoE models, set models.policy_trainer.groupgemm=True to enable the GroupGEMM patch, which improves training speed of MoE layers.
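As a minimal sketch, the settings above can be collected as dotted key=value strings. The helper name `fsdp_speedup_overrides`, the `is_qwen3_moe` flag, and the token count in the usage line are hypothetical, and how such overrides are passed to the training entry point is an assumption this FAQ does not specify.

```python
# Hypothetical helper: gathers the FSDP speed-up settings named in this FAQ
# as dotted key=value strings. Only the three config keys come from the FAQ;
# the function name, its arguments, and the example value are assumptions.
def fsdp_speedup_overrides(max_token_in_packing: int,
                           is_qwen3_moe: bool = False) -> list[str]:
    overrides = [
        "models.policy_trainer.packing=True",
        f"models.policy_trainer.max_token_in_packing={max_token_in_packing}",
    ]
    if is_qwen3_moe:
        # The GroupGEMM patch is described only for Qwen3-MoE models.
        overrides.append("models.policy_trainer.groupgemm=True")
    return overrides


if __name__ == "__main__":
    # 32768 is an illustrative value, not a recommendation.
    print(" ".join(fsdp_speedup_overrides(32768, is_qwen3_moe=True)))
```

In practice, max_token_in_packing would be tuned per model and GPU by raising it until memory is nearly full.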

Common Errors

`ray.exceptions.RayChannelTimeoutError` During Rollout Inference

ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 10 seconds. Otherwise, this may indicate that the execution is hanging.

When this error occurs during vLLM rollout inference, check the value of models.policy.tensor_model_parallel_size. If it is greater than 1, set models.policy.enforce_eager=True.
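The check above could be expressed as a small pre-launch guard. This is a sketch only: `rollout_timeout_fix` is a hypothetical helper, and it assumes the two settings have already been read from the configuration into plain Python values.

```python
# Hypothetical pre-launch check for the rollout timeout described above.
# Only the two config keys come from this FAQ; the helper is an assumption.
def rollout_timeout_fix(tensor_model_parallel_size: int,
                        enforce_eager: bool) -> list[str]:
    """Return the override to apply, or an empty list if none is needed."""
    if tensor_model_parallel_size > 1 and not enforce_eager:
        return ["models.policy.enforce_eager=True"]
    return []


if __name__ == "__main__":
    print(rollout_timeout_fix(tensor_model_parallel_size=4, enforce_eager=False))
```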

Why does `torch.OutOfMemoryError: CUDA out of memory` occur during FSDP training?

If models.policy_trainer.packing=True, try reducing models.policy_trainer.max_token_in_packing.
If models.policy_trainer.packing=False, try reducing runtime_args.train_micro_batch_size.
If runtime_args.train_micro_batch_size=1, or if OOM still occurs even when models.policy_trainer.max_token_in_packing is smaller than the generation length, consider increasing models.policy_trainer.ulysses_sequence_parallel_size. Set it to a power of 2 that does not exceed the number of GPUs per node.
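The order of these adjustments can be sketched as a decision function, together with a helper that lists candidate sequence-parallel sizes. Both function names and the `generation_length` argument are hypothetical; the sketch assumes the relevant settings are available as plain Python values and only mirrors the rules stated above.

```python
# Hypothetical helpers mirroring the OOM guidance above. The config keys
# come from this FAQ; the function names and arguments are assumptions.
def valid_ulysses_sizes(gpus_per_node: int) -> list[int]:
    """Powers of 2 that do not exceed the number of GPUs per node."""
    sizes = []
    size = 1
    while size <= gpus_per_node:
        sizes.append(size)
        size *= 2
    return sizes


def next_oom_adjustment(packing: bool,
                        max_token_in_packing: int,
                        train_micro_batch_size: int,
                        generation_length: int) -> str:
    """Suggest which setting to change after a CUDA OOM during FSDP training."""
    if packing:
        # With packing on, shrink the packed token budget first.
        if max_token_in_packing >= generation_length:
            return "reduce models.policy_trainer.max_token_in_packing"
    elif train_micro_batch_size > 1:
        # With packing off, shrink the micro batch first.
        return "reduce runtime_args.train_micro_batch_size"
    # Nothing left to shrink: shard the sequence across more GPUs.
    return "increase models.policy_trainer.ulysses_sequence_parallel_size"


if __name__ == "__main__":
    print(next_oom_adjustment(True, 4096, 1, 8192))
    print(valid_ulysses_sizes(8))
```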
