Common Issues¶
ECC Error¶
An ECC error indicates a hardware failure on the machine. It is recommended to use Continued Training and Fault Tolerance to automatically blacklist the faulty machine and restart the job.
How to build a custom training flow for multiple reward models¶
The provided examples are for training a single reward model. If you need to customize the training flow for multiple reward models, please refer to Custom Inference and Training Workflow.
RuntimeError: Error(s) in loading state_dict for VocabParallelEmbedding¶
RuntimeError: Error(s) in loading state_dict for VocabParallelEmbedding:
size mismatch for weight: copying a param with shape torch.Size([xxx, xxx]) from checkpoint, the shape in the current model is torch.Size([xxx, xxx]).
This is generally caused by a change in the tensor-parallel (TP) size; adjust the make_vocab_size_divisible_by parameter so that the padded embedding parameters have the same shape as in the checkpoint.
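As background, Megatron pads the embedding so that the vocabulary size is divisible by make_vocab_size_divisible_by multiplied by the tensor-parallel size, which is why the padded shape changes with TP. A minimal sketch of that rounding (the vocabulary size of 32000 and the divisor of 128 below are illustrative values, not taken from your checkpoint):
vocab_size=32000                  # raw tokenizer vocabulary size (illustrative)
make_vocab_size_divisible_by=128  # Megatron padding divisor (illustrative)
tp=8                              # tensor-parallel size of the current run
multiple=$((make_vocab_size_divisible_by * tp))
padded_vocab_size=$(( (vocab_size + multiple - 1) / multiple * multiple ))
echo ${padded_vocab_size}         # 32768 with tp=8, but 32256 with tp=4 -> shape mismatch
If the checkpoint was saved with a different TP size, choose make_vocab_size_divisible_by so that both runs round to the same padded vocabulary size.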
YAML Configuration¶
Refer to Configuration File.
How to enable ‘Efficient memory sharing’ to reduce memory usage¶
Refer to the documentation on Efficient memory sharing.
Megatron Model Conversion and Parallel Strategy¶
cd $CHATLEARN
model_type=GPT # for reward model, set model_type to REWARD
load_dir=xxx
save_dir=xxx
target_tp=xxx
target_pp=xxx
python chatlearn/tools/megatron_checkpoint_utils.py --model-type ${model_type} --load-dir ${load_dir} --save-dir ${save_dir} \
--target-tensor-parallel-size ${target_tp} --target-pipeline-parallel-size ${target_pp}
Note that this script has only been validated on checkpoints produced by official Megatron-LM scripts.
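For example, to re-shard a GPT checkpoint to TP=4 and PP=2 (the paths and target sizes below are illustrative):
cd $CHATLEARN
python chatlearn/tools/megatron_checkpoint_utils.py --model-type GPT \
    --load-dir /path/to/gpt_checkpoint --save-dir /path/to/gpt_checkpoint_tp4_pp2 \
    --target-tensor-parallel-size 4 --target-pipeline-parallel-size 2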
Applying for custom_port¶
In the DLC environment, the current RLHF training allocates 50 ports, which covers all usage scenarios. It is recommended to set the following advanced configuration:
customPortList=30000-30050
Task failure but DLC status shows success¶
If the training log is redirected to a file, for example:
python train_rlhf.py -c configs/llama2/rlhf.yaml 2>&1 | tee -a ${LOG_DIR}/log_${RANK}.txt
In this situation, the exit code is always 0 (the exit code of tee), and the DLC job is reported as successful even if training fails. Change the command to the following:
python train_rlhf.py -c configs/llama2/rlhf.yaml 2>&1 | tee -a ${LOG_DIR}/log_${RANK}.txt ; exit ${PIPESTATUS[0]}
If there are additional commands after the training command, the final exit code may differ from that of the training command. In that case it is recommended to add
set -e
at the beginning of the script, so that it exits at the first command that fails.
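Putting both fixes together, a launch script could look like the following sketch (the config path and log location are illustrative):
set -e                                        # abort on the first failing command
mkdir -p ${LOG_DIR}
python train_rlhf.py -c configs/llama2/rlhf.yaml 2>&1 | tee -a ${LOG_DIR}/log_${RANK}.txt
exit ${PIPESTATUS[0]}                         # propagate the training command's exit code, not tee's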
Error when adjusting the lr during continued training¶
Megatron checks whether the lr has changed during load_checkpoint. Set the Megatron model parameter override_opt_param_scheduler to True to bypass this check.
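A minimal sketch, assuming the Megatron arguments of the trainer model are specified as keys in its model configuration yaml (the file and exact placement are illustrative; adapt them to your own configuration):
# e.g. in the trainer model's yaml (placement is illustrative)
override_opt_param_scheduler: True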
How to specify the frequency of model saving during training¶
In rlhf.yaml, configure save_episode_interval.
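A minimal rlhf.yaml fragment as a sketch, assuming save_episode_interval sits in the runtime section (the interval value is illustrative; check your own configuration file for the exact location):
runtime:
  save_episode_interval: 100   # save model checkpoints every 100 episodes (illustrative value)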