Resuming and Fault Tolerance¶
The alignment task involves the computation and interaction of multiple models. As the model scale and computational resources increase, occasional exceptions may occur due to the dependent software stack and hardware environment, leading to task interruption.
To ensure that interrupted tasks can automatically resume their state, ChatLearn provides the resuming function, which in combination with PAI-DLC’s AIMaster, can automatically detect errors and resume functionality.
Configuring ChatLearn Resuming¶
Resuming an alignment task requires consideration of the following points:
Recording and restoring data progress: For recording data status, users need to configure
data_checkpoint_path
in the training configuration master file, such asrlhf.yaml
. Ifdata_checkpoint_path
is not empty, ChatLearn will record the current data progress and store the data checkpoint during eachsave_checkpoint
.Restoring training states such as episodes and iterations: When users configure
data_checkpoint_path
and the corresponding data checkpoint exists in the folder, ChatLearn will automatically restore the training state to the latest checkpoint status and set theresume_training
variable toTrue
.Loading checkpoints: When
resume_training==True
, the checkpoints forreference
andreward
in RLHF remain unchanged. However,ppo_policy
andppo_value
need to load the checkpoints stored during training, rather than the original initialized checkpoints. Therefore, special processing needs to be done in the setup phase.
if self.resume_training:
self.args.load = get_args().save
self.args.load_iteration = -1
self.args.no_load_optim = False
self.args.no_load_rng = False
self.args.no_load_args = False
self.args.no_load_scheduler = False
self.args.finetune = False
For more details, refer to examples/megatron/scripts/train_rlhf_llama.sh
.
If a user configures data_checkpoint_path
in the program but does not want to enable the resuming function, they can also disable this functionality by configuring enable_resume_training: False
.
Combining with DLC AIMaster to Achieve Fault Tolerance and Automatic Resuming¶
DLC provides fault tolerance monitoring based on AIMaster. AIMaster is a task-level component that, when enabled for fault tolerance monitoring, launches an AIMaster instance to run with other task instances, serving the roles of task monitoring, fault judgment, and resource control.
Users can combine AIMaster’s fault tolerance functionality with ChatLearn’s resuming functionality to achieve automatic resuming of training tasks.
The following is an example of fault tolerance monitoring configuration, which includes enabling hang detection and error detection. When the hang exceeds 1 hour or AIMaster detects an error, the task will be automatically restarted, with a maximum number of restarts being 3 times.
For more fault tolerance configuration, please refer to the DLC Fault Tolerance Documentation.