Config Explanation

ChatLearn’s configuration comprises two main components:

runtime_args: Primary training configurations for the framework.
models: Configuration settings for each individual model.

Configuration templates are provided in ChatLearn/template.

runtime_args

runtime_args:
  # setup config
  train_backend: fsdp
  rollout_backend: vllm
  exp_name: grpo_fsdp
  colocation: [policy,policy_trainer,ref_policy]
  # path config
  output_dir: your_output_dir
  data_path: your_data_path
  eval_data_path: your_eval_data_path
  data_checkpoint_path: ${runtime_args.output_dir}/data_checkpoint_path/
  # config for training
  num_episode: 200
  sample_per_episode: 512
  train_global_batch_size: 512
  save_episode_interval: 200
  # config for data
  data_shuffle: True
  data_rerank: True
  # config for eval
  eval_episode_interval: 5
  enable_eval_before_training: False
  log_args_dict:
    log_dir: ${runtime_args.output_dir}
    enable_wandb: False
    wandb_project: your_wandb_project
    wandb_dir: ${runtime_args.output_dir}
    wandb_name: ${runtime_args.exp_name}
    wandb_id: ${runtime_args.exp_name}
    wandb_resume: allow

runtime_args.train_backend: Training backend to use; supports fsdp or megatron.
runtime_args.rollout_backend: Rollout backend to use; either vllm or sglang.
runtime_args.exp_name: Experiment name, used for logging metrics.
runtime_args.colocation: Models listed here are colocated on the same GPUs and execute sequentially.
runtime_args.output_dir: Directory to save all intermediate training results.
runtime_args.data_path: Training data path. Ensure data files in this directory are compatible with the data reading code.
runtime_args.eval_data_path: Evaluation data path. Ensure data files in this directory are compatible with the data reading code.
runtime_args.data_checkpoint_path: Path for saving data checkpoints. The default is data_checkpoint_path/ under your runtime_args.output_dir.
runtime_args.num_episode: Total number of training episodes. Each episode involves several weight updates.
runtime_args.sample_per_episode: Number of training samples per episode (sample_per_episode = prompt_per_episode * num_inference_per_prompt).
runtime_args.train_global_batch_size: Global batch size for training. The runtime_args.sample_per_episode samples of each episode are divided into global batches of this size, and each batch is used in one training round (global across actors).
runtime_args.save_episode_interval: Interval, in episodes, between saves of intermediate checkpoints.
runtime_args.eval_episode_interval: Interval, in episodes, between evaluation rounds.
runtime_args.data_shuffle: If enabled, dataset samples will be shuffled, ignoring the original order.
runtime_args.data_rerank: If enabled, multiple replicas of the same data sample will not be assigned to the same rollout actor.
runtime_args.enable_eval_before_training: Whether to run an evaluation before training starts.
runtime_args.log_args_dict: Logging configuration. Ensure you are logged in to Weights & Biases when runtime_args.log_args_dict.enable_wandb=True.
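
As a worked example of how the batch-related settings fit together, the sketch below derives the number of prompts and global batches per episode from the template values above. The helper name episode_plan is hypothetical and not part of ChatLearn; num_inference_per_prompt is taken from models.policy.num_inference_per_prompt, and the divisibility checks are an assumption about what a consistent configuration looks like, not a documented requirement.

```python
def episode_plan(sample_per_episode: int,
                 num_inference_per_prompt: int,
                 train_global_batch_size: int) -> dict:
    """Derive per-episode counts from the batch settings (illustrative only)."""
    # sample_per_episode = prompt_per_episode * num_inference_per_prompt
    if sample_per_episode % num_inference_per_prompt != 0:
        raise ValueError(
            "sample_per_episode should be a multiple of num_inference_per_prompt")
    # Assumption: an episode splits evenly into global batches.
    if sample_per_episode % train_global_batch_size != 0:
        raise ValueError(
            "sample_per_episode should be a multiple of train_global_batch_size")
    return {
        "prompt_per_episode": sample_per_episode // num_inference_per_prompt,
        "global_batches_per_episode": sample_per_episode // train_global_batch_size,
    }


# Template values: sample_per_episode=512, num_inference_per_prompt=32,
# train_global_batch_size=512.
plan = episode_plan(512, 32, 512)
print(plan)  # {'prompt_per_episode': 16, 'global_batches_per_episode': 1}
```

With the template values, each episode draws 16 prompts and forms a single global batch; lowering train_global_batch_size to 128 would give four global batches per episode.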

models

For the GRPO algorithm, ChatLearn uses four models: policy_trainer, ref_policy, policy, and reward. Note that policy_trainer and ref_policy share the same training backend.

policy_trainer

Common config

policy_trainer:
  free_gpu_memory:
    offload_weights: True
    offload_optimizer_states: True
    free_grad_buffers: True
  optimizer:
    lr: 2e-6
    clip_grad: 1
  trainable: True
  generation_batch_size: 8
  train_micro_batch_size: ${runtime_args.train_micro_batch_size}
  packing: False
  max_token_in_packing: 32768
  load: your_hf_model_path
  pos_clip_ratio: 0.2
  neg_clip_ratio: 0.2
  entropy_coef: 0.0
  kl_coef: 0.0
  gpu_per_process: 1
  num_gpu: 1

models.policy_trainer.free_gpu_memory.*: Controls GPU memory offloading; set all to True for colocation scenarios (non-colocation is not supported yet).
models.policy_trainer.optimizer.lr: Learning rate.
models.policy_trainer.optimizer.clip_grad: Gradient clipping threshold.
models.policy_trainer.trainable: Enables training. This should always be True for the trainer.
models.policy_trainer.generation_batch_size: Batch size for a single model forward pass (used to compute old_logprobs).
models.policy_trainer.train_micro_batch_size: Local batch size per model for forward/backward pass; used for gradient accumulation.
models.policy_trainer.packing: If enabled, samples are regrouped into batches whose total token count is lower than models.policy_trainer.max_token_in_packing. After regrouping, each batch is packed into a single sequence. When packing is enabled, models.policy_trainer.generation_batch_size and models.policy_trainer.train_micro_batch_size are ignored.
models.policy_trainer.max_token_in_packing: Token limit per batch used for regrouping when models.policy_trainer.packing is enabled.
models.policy_trainer.load: Path to load base checkpoint model.
models.policy_trainer.pos_clip_ratio, models.policy_trainer.neg_clip_ratio: Clipping ratios used by the GRPO algorithm.
models.policy_trainer.entropy_coef: Set above 0.0 to enable the entropy loss in the backward pass.
models.policy_trainer.kl_coef: Set above 0.0 to enable the KL loss in the backward pass.
models.policy_trainer.gpu_per_process: GPUs assigned to each Ray actor.
models.policy_trainer.num_gpu: Total GPUs for training (on multinode, this is across all nodes).
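
The regrouping step described for models.policy_trainer.packing can be pictured with a simple greedy grouping by token count. This is a minimal sketch under stated assumptions, not ChatLearn's actual implementation: the function name regroup_by_token_budget is hypothetical, samples are grouped in their original order, and the limit is treated as inclusive.

```python
from typing import List


def regroup_by_token_budget(sample_lengths: List[int],
                            max_token_in_packing: int) -> List[List[int]]:
    """Greedily group sample indices so each group's token total fits the budget."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, length in enumerate(sample_lengths):
        if length > max_token_in_packing:
            raise ValueError(f"sample {idx} alone exceeds max_token_in_packing")
        # Close the current group when adding this sample would exceed the budget.
        if current and current_tokens + length > max_token_in_packing:
            groups.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += length
    if current:
        groups.append(current)
    return groups


# Five samples with different token counts and a budget of 2500 tokens.
print(regroup_by_token_budget([1200, 900, 2048, 300, 1500], 2500))
# [[0, 1], [2, 3], [4]]
```

Because the number of samples per group depends on their lengths, fixed batch sizes such as generation_batch_size and train_micro_batch_size no longer apply once packing is enabled.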

FSDP policy_trainer config

The following are specific configuration options for the FSDP training backend:

policy_trainer: 
  fsdp_size: ${models.policy_trainer.num_gpu}
  ulysses_sequence_parallel_size: 1
  meta_init: False
  groupgemm: False
  gradient_checkpointing: True
  save_hf: True

models.policy_trainer.fsdp_size: Sets the FSDP parallel group size; by default, this includes all available GPUs.
models.policy_trainer.ulysses_sequence_parallel_size: Enables Ulysses sequence parallelism when set greater than 1. Currently supports Qwen3-Dense and Qwen2.5-Dense.
models.policy_trainer.meta_init: Enables meta initialization for the FSDP wrapper. Model weights are loaded only on rank 0 and broadcast to the other ranks during setup.
models.policy_trainer.groupgemm: Replaces the sequential MLP with GroupGEMM. Currently only Qwen3-MoE is supported.
models.policy_trainer.gradient_checkpointing: Enables recomputation of intermediate activations during training to save memory (gradient checkpointing).
models.policy_trainer.save_hf: If True, saves Hugging Face format checkpoints during training. An offline merge script is provided to merge FSDP distributed checkpoints into a Hugging Face checkpoint.

Megatron policy_trainer config

The following are specific configuration options for the Megatron-Core training backend:

policy_trainer: 
  bf16: True
  seq_length: 2048
  tokenizer_type: 'HuggingFaceTokenizer'
  tokenizer_model: ${models.policy.load}
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  expert_tensor_parallel_size: null
  expert_model_parallel_size: 1
  virtual_pipeline_model_parallel_size: null
  decoder_first_pipeline_num_layers: null
  decoder_last_pipeline_num_layers: null
  moe_router_force_load_balancing: False
  # train config
  load: your_megatron_model_path
  sequence_parallel: True
  use_distributed_optimizer: True
  recompute_granularity: null
  # other 
  use_group_sequence_policy: False

models.policy_trainer.bf16: Enables bfloat16 precision. If set to False, fp32 will be used.
models.policy_trainer.seq_length: Sequence length for Megatron training. If models.policy_trainer.packing is enabled, this value is ignored. Otherwise, it must equal the value used for data generation.
models.policy_trainer.tokenizer_type: Tokenizer type for Megatron training. In most cases, HuggingFaceTokenizer is recommended.
models.policy_trainer.tokenizer_model: Path to the tokenizer model for Megatron training.
models.policy_trainer.tensor_model_parallel_size: Tensor model parallel world size.
models.policy_trainer.pipeline_model_parallel_size: Pipeline model parallel world size.
models.policy_trainer.expert_tensor_parallel_size: Expert tensor model parallel world size.
models.policy_trainer.expert_model_parallel_size: Expert model parallel world size.
models.policy_trainer.virtual_pipeline_model_parallel_size: Virtual pipeline model parallel world size. Used when pipeline_model_parallel_size is larger than 1.
models.policy_trainer.decoder_first_pipeline_num_layers: Number of decoder layers in the first pipeline stage. Used when the model's num_layers is not evenly divisible by pipeline_model_parallel_size.
models.policy_trainer.decoder_last_pipeline_num_layers: Number of decoder layers in the last pipeline stage. Used when the model's num_layers is not evenly divisible by pipeline_model_parallel_size.
models.policy_trainer.moe_router_force_load_balancing: (Benchmarking) Forces load balancing for MoE routers if enabled.
models.policy_trainer.load: Path to the model checkpoint.
models.policy_trainer.sequence_parallel: Whether to use sequence parallelism. Takes effect only when tensor_model_parallel_size is larger than 1.
models.policy_trainer.use_distributed_optimizer: Whether to use the distributed optimizer to reduce memory consumption. Setting this to True is recommended.
models.policy_trainer.recompute_granularity: Selects the recompute granularity used to reduce memory usage. Must be null, sel, or full.
models.policy_trainer.use_group_sequence_policy: Whether to use GSPO (Group Sequence Policy Optimization).
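
To illustrate when decoder_first_pipeline_num_layers and decoder_last_pipeline_num_layers are needed, the sketch below computes a per-stage layer layout. It is only an illustration: the function layers_per_pipeline_stage is hypothetical, and it assumes that layers not pinned to the first or last stage are split evenly across the remaining stages. Consult the Megatron-Core documentation for the exact rule.

```python
from typing import List, Optional


def layers_per_pipeline_stage(num_layers: int,
                              pipeline_model_parallel_size: int,
                              first: Optional[int] = None,
                              last: Optional[int] = None) -> List[int]:
    """Return the number of decoder layers on each pipeline stage (illustrative)."""
    fixed = [n for n in (first, last) if n is not None]
    middle_stages = pipeline_model_parallel_size - len(fixed)
    middle_layers = num_layers - sum(fixed)
    if middle_stages < 0 or middle_layers < 0:
        raise ValueError("first/last settings do not fit the pipeline size")
    if middle_stages == 0:
        if middle_layers != 0:
            raise ValueError("first/last layers must cover all layers")
        per_stage = 0
    else:
        # Assumption: the remaining layers are split evenly over the other stages.
        if middle_layers % middle_stages != 0:
            raise ValueError("remaining layers are not evenly divisible")
        per_stage = middle_layers // middle_stages
    stages = [per_stage] * middle_stages
    if first is not None:
        stages.insert(0, first)
    if last is not None:
        stages.append(last)
    return stages


# 30 layers cannot be split evenly over 4 stages, so pin the first and last stage.
print(layers_per_pipeline_stage(30, 4, first=7, last=7))  # [7, 8, 8, 7]
```

When num_layers is evenly divisible by pipeline_model_parallel_size, both options can stay null, for example 32 layers over 4 stages gives 8 layers per stage.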

ref_policy

ref_policy uses the same backend as policy_trainer, but can be customized separately.

Common ref_policy config

ref_policy:
  free_gpu_memory:
    offload_weights: True
  generation_batch_size: 8
  gpu_per_process: 1
  num_gpu: ${models.policy_trainer.num_gpu}
  trainable: False
  load: ${models.policy_trainer.load}
  packing: ${models.policy_trainer.packing}
  max_token_in_packing: ${models.policy_trainer.max_token_in_packing}

models.ref_policy.free_gpu_memory.offload_weights: Enables offloading of weights. This should be set to True in colocation scenarios (non-colocation is not supported yet).
models.ref_policy.generation_batch_size: Batch size used for each forward pass when computing ref_logprobs. It can differ from models.policy_trainer.generation_batch_size.
models.ref_policy.trainable: Should always be False for this model, as it only performs inference.
models.ref_policy.packing: Same functionality as models.policy_trainer.packing; controls batch regrouping and packing.
models.ref_policy.max_token_in_packing: Token limit per batch used for regrouping when models.ref_policy.packing is enabled. This value can differ from models.policy_trainer.max_token_in_packing.
models.ref_policy.load: Path to the base model checkpoint. This should match models.policy_trainer.load unless a different checkpoint is needed.
models.ref_policy.gpu_per_process: Number of GPUs to assign to each Ray actor.
models.ref_policy.num_gpu: Total number of GPUs allocated for this model. In multinode training, this represents the total across all nodes.

FSDP ref_policy config

Custom settings for the FSDP training backend.

ref_policy: 
  fsdp_size: ${models.policy_trainer.num_gpu}
  meta_init: False
  groupgemm: False

models.ref_policy.fsdp_size, models.ref_policy.meta_init, and models.ref_policy.groupgemm: These mirror the options in models.policy_trainer, but can be set independently to override the defaults.

Megatron ref_policy config

Custom settings for the Megatron training backend.

ref_policy:
  seq_length: ${models.policy_trainer.seq_length}
  tokenizer_type: 'HuggingFaceTokenizer'
  tokenizer_model: ${models.policy.load}
  bf16: True
  sequence_parallel: True
  tensor_model_parallel_size: ${models.policy_trainer.tensor_model_parallel_size}
  pipeline_model_parallel_size: ${models.policy_trainer.pipeline_model_parallel_size}
  expert_tensor_parallel_size: ${models.policy_trainer.expert_tensor_parallel_size}
  expert_model_parallel_size: ${models.policy_trainer.expert_model_parallel_size}
  decoder_first_pipeline_num_layers: ${models.policy_trainer.decoder_first_pipeline_num_layers}
  decoder_last_pipeline_num_layers: ${models.policy_trainer.decoder_last_pipeline_num_layers}
  moe_router_force_load_balancing: ${models.policy_trainer.moe_router_force_load_balancing}
  load: ${models.policy_trainer.load}

All of the above settings have the same meaning as for the policy trainer, but can be overridden for the reference policy model. However, to improve numerical stability, we recommend keeping the two models consistent, especially in MoE training.

policy

SGLang and vLLM share the same configuration.

policy:
  free_gpu_memory:
    offload_weights: True
  generation_batch_size: 256
  gpu_per_process: 1
  num_gpu: ${models.policy_trainer.num_gpu}
  tensor_model_parallel_size: 1
  trainable: False
  load: ${models.policy_trainer.load}
  num_inference_per_prompt: 32
  seq_length: 2048
  max_seq_len_to_capture: 2348
  temperature: 1.0
  top_p: 1.0
  eval_temperature: 0.6
  eval_top_p: 0.95
  eval_top_k: 20
  enable_thinking: False
  gpu_memory_utilization: 0.8

models.policy.free_gpu_memory.offload_weights: If enabled, model weights are offloaded. Recommended for colocation scenarios. Non-colocation is not supported yet.
models.policy.generation_batch_size: Sets max_num_seqs for vLLM.
models.policy.gpu_per_process: Number of GPUs assigned per Ray actor.
models.policy.num_gpu: Total GPUs allotted for policy inference. In multinode setups, sums across all nodes.
models.policy.tensor_model_parallel_size: Specifies tensor parallel size for model parallelism.
models.policy.trainable: Policy model is non-trainable.
models.policy.load: Path to model checkpoint; should match models.policy_trainer.load for GRPO.
models.policy.num_inference_per_prompt: Number of responses generated per prompt.
models.policy.seq_length: Maximum total sequence length (prompt length + response length).
models.policy.max_seq_len_to_capture: Max sequence length captured during inference. Must be ≥ models.policy.seq_length.
models.policy.temperature, models.policy.top_p: Sampling hyperparameters for training rollouts.
models.policy.eval_temperature, models.policy.eval_top_p, models.policy.eval_top_k: Sampling hyperparameters for evaluation rollouts.
models.policy.enable_thinking: Enables “thinking mode” for Qwen3 models, if applicable.
models.policy.gpu_memory_utilization: Target GPU memory utilization for the rollout engine. Use with caution in colocation mode to avoid OOM.
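
The constraints stated above can be checked before launching a run. The following is a minimal sketch with a hypothetical helper check_policy_config that operates on a plain dict. Only the max_seq_len_to_capture rule comes from this page; the bounds on gpu_memory_utilization and the divisibility of num_gpu by tensor_model_parallel_size are assumptions added for illustration.

```python
from typing import Any, Dict, List


def check_policy_config(policy: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in a policy config dict."""
    problems: List[str] = []
    # Documented rule: max_seq_len_to_capture must be >= seq_length.
    if policy["max_seq_len_to_capture"] < policy["seq_length"]:
        problems.append("max_seq_len_to_capture must be >= seq_length")
    # Assumption: utilization is a fraction of GPU memory in (0, 1].
    if not 0.0 < policy["gpu_memory_utilization"] <= 1.0:
        problems.append("gpu_memory_utilization should be in (0, 1]")
    # Assumption: each rollout replica spans tensor_model_parallel_size GPUs.
    if policy["num_gpu"] % policy["tensor_model_parallel_size"] != 0:
        problems.append("num_gpu should be a multiple of tensor_model_parallel_size")
    return problems


# Values from the template above (num_gpu resolved to 1).
template_policy = {
    "num_gpu": 1,
    "tensor_model_parallel_size": 1,
    "seq_length": 2048,
    "max_seq_len_to_capture": 2348,
    "gpu_memory_utilization": 0.8,
}
print(check_policy_config(template_policy))  # []
```

Returning a list of messages rather than raising on the first problem makes it easier to report every inconsistency in one pass.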

reward

models:
  reward:
    num_cpu: 2
    cpu_per_process: 1
    generation_batch_size: 256

models.reward.num_cpu: Total CPUs to allocate for the rule-based (CPU-based) reward actors.
models.reward.cpu_per_process: Number of CPUs used by each individual reward actor.
models.reward.generation_batch_size: Batch size for a single reward forward computation.
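
Taken together, num_cpu and cpu_per_process suggest how many reward actors are created. The sketch below makes that arithmetic explicit; the helper num_actors is hypothetical, and the relation actors = total / per-process is an inference from the two descriptions above, not a documented formula.

```python
def num_actors(total_devices: int, devices_per_process: int) -> int:
    """Number of actors implied by a total device count and a per-actor count."""
    if devices_per_process <= 0:
        raise ValueError("devices_per_process must be positive")
    # Assumption: the total should split evenly across actors.
    if total_devices % devices_per_process != 0:
        raise ValueError("total_devices should be a multiple of devices_per_process")
    return total_devices // devices_per_process


# Reward template: num_cpu=2, cpu_per_process=1.
print(num_actors(2, 1))  # 2
```

The same arithmetic would apply to num_gpu and gpu_per_process for the GPU-based models, under the same assumption.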