Common Issues

ECC Error

An ECC Error indicates a machine (hardware) failure. It is recommended to use Continued Training and Fault Tolerance to automatically blacklist the faulty machines and restart the job.

How to build a custom training flow for multiple reward models

The provided examples are for training a single reward model. If you need to customize the training flow for multiple reward models, please refer to Custom Inference and Training Workflow.

RuntimeError: Error(s) in loading state_dict for VocabParallelEmbedding

RuntimeError: Error(s) in loading state_dict for VocabParallelEmbedding:
   size mismatch for weight: copying a param with shape torch.Size([xxx, xxx]) from checkpoint, the shape in the current model is torch.Size([xxx, xxx]).

This is generally caused by a change in the tensor parallel (TP) size, and requires adjusting the parameter make_vocab_size_divisible_by so that the padded embedding parameters have the same shape under both settings.
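Megatron pads the vocabulary up to a multiple of make_vocab_size_divisible_by times the TP size, which is why the embedding shape depends on TP. A quick sketch of the rounding rule (the vocabulary size and TP value below are hypothetical):

```shell
# Hypothetical values: a 32000-token vocabulary, TP=8,
# and the common default make_vocab_size_divisible_by=128.
vocab_size=32000
tp_size=8
divisible_by=128

# The vocabulary is rounded up to a multiple of (divisible_by * tp_size).
multiple=$((divisible_by * tp_size))
padded=$(( (vocab_size + multiple - 1) / multiple * multiple ))
echo "padded embedding rows: ${padded}"
```

Changing tp_size changes the multiple, so a checkpoint saved with one TP size may have a different padded shape under another; pick make_vocab_size_divisible_by so that both settings produce the same padded size.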

YAML Configuration

Refer to Configuration File.

How to enable ‘Efficient memory sharing’ to reduce memory usage

Refer to the documentation on Efficient memory sharing.

Megatron Model Conversion and Parallel Strategy

cd $CHATLEARN
model_type=GPT  # for a reward model, set model_type to REWARD
load_dir=xxx    # path of the source checkpoint
save_dir=xxx    # path to write the converted checkpoint
target_tp=xxx   # target tensor parallel size
target_pp=xxx   # target pipeline parallel size
python chatlearn/tools/megatron_checkpoint_utils.py --model-type ${model_type} --load-dir ${load_dir} --save-dir ${save_dir} \
    --target-tensor-parallel-size ${target_tp} --target-pipeline-parallel-size ${target_pp}

Note that this script has only been validated on official Megatron-LM scripts.

Applying for custom_port

In the DLC environment, the current RLHF training requires about 50 ports to cover all usage scenarios. It is recommended to set the advanced configuration as follows:

customPortList=30000-30050

Task failure but DLC status shows success

  1. The log is redirected to a file:

python train_rlhf.py -c configs/llama2/rlhf.yaml 2>&1 | tee -a ${LOG_DIR}/log_${RANK}.txt

In this case the exit code of the pipeline is that of tee, which is always 0, so the DLC job shows as successful. Change the command to the following:

python train_rlhf.py -c configs/llama2/rlhf.yaml 2>&1 | tee -a ${LOG_DIR}/log_${RANK}.txt ; exit ${PIPESTATUS[0]}
  2. There are additional operations after the training command, so the script's final exit code differs from the training command's exit code. It is recommended to add set -e at the beginning of the script, so that it exits at the first failing command.
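A minimal demonstration of why the tee pipeline masks failures, and how PIPESTATUS recovers the real exit code (here false stands in for a failing training command):

```shell
# A pipeline's exit code is that of its last command (tee), not the training command.
false | tee /dev/null
echo "pipeline exit code: $?"              # 0 -- tee succeeded, the failure is masked

# PIPESTATUS[0] holds the exit code of the first command in the last pipeline.
false | tee /dev/null
echo "real exit code: ${PIPESTATUS[0]}"    # 1 -- the exit code of 'false'
```

Note that PIPESTATUS is a bash feature and is reset after every command, so it must be read immediately after the pipeline, as in the `exit ${PIPESTATUS[0]}` fix above.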

Adjusting lr error in continued training

Megatron checks whether the lr has changed when loading a checkpoint (load_checkpoint). Set the Megatron model parameter override_opt_param_scheduler to True to bypass this check.

How to specify the frequency of model saving during training

In rlhf.yaml, configure save_episode_interval.
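For example (a hypothetical fragment; the section layout and value below are illustrative and depend on your rlhf.yaml):

```yaml
runtime:
  # Hypothetical value: save a checkpoint every 100 episodes.
  save_episode_interval: 100
```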