Efficient Memory Sharing (EMS)
ChatLearn provides an Efficient Memory Sharing (EMS) feature to significantly reduce GPU memory usage during alignment training. With the GPU memory saved, users can make the most of limited resources to train larger models, or improve overall training efficiency by adopting a better parallel strategy and a larger batch size.
When multiple models in ChatLearn share the same resources for training or inference, enabling the EMS feature allows these models to use GPU memory sequentially:

1. After each model is initialized, the tensors/buffers that would otherwise reside in GPU memory permanently (such as weights, gradient buffers, and optimizer states) are offloaded to RAM or freed, releasing the GPU memory they occupy.
2. Right before a given model runs training or inference, its tensors/buffers are loaded back from RAM or reconstructed, and the computation proceeds.
3. Once that model's training or inference is complete, its tensors/buffers are offloaded to RAM or freed again, releasing the GPU memory for the next model.

By repeating this cycle, multiple models share GPU memory in turn, maximizing the efficiency of GPU memory usage.
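The following is a minimal PyTorch sketch of this offload/onload cycle. The helper names and the two toy models are illustrative assumptions only; they are not ChatLearn's actual API, which handles weights, gradient buffers, and optimizer states internally when EMS is enabled.

# Illustrative sketch of the EMS cycle, not ChatLearn's implementation.
import torch

def offload(model: torch.nn.Module) -> None:
    # Move resident tensors to host RAM and return the freed blocks to the CUDA allocator.
    model.to("cpu")
    torch.cuda.empty_cache()

def onload(model: torch.nn.Module, device: str = "cuda") -> None:
    # Reload the tensors onto the GPU just before this model's turn to run.
    model.to(device)

policy = torch.nn.Linear(4096, 4096)
reward = torch.nn.Linear(4096, 4096)

# The two models take turns occupying the same GPU memory.
for model in (policy, reward):
    onload(model)                 # load before this model's compute phase
    with torch.no_grad():
        _ = model(torch.randn(8, 4096, device="cuda"))
    offload(model)                # unload afterwards so the next model can use the memory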
Usage
Users can enable the EMS feature per model by setting its free_memory parameter (bool, default False). This can be set directly in rlhf.yaml for each model. For example, to enable EMS for the policy model:
policy:
    model_config_file: old_policy_inference.yaml
    ...
    free_memory: ${free_memory_policy:True}
Alternatively, EMS can be enabled per model in the training script via environment variables:
Policy model:
export free_memory_policy=True
Reference model:
export free_memory_reference=True
Reward model:
export free_memory_reward=True
Value model:
export free_memory_value=True
PPO policy model:
export free_memory_ppo_policy=True
PPO value model:
export free_memory_ppo_value=True
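For instance, a training script can enable EMS for all models at once by exporting every variable above before the launch command; the snippet below is only a sketch of that pattern, and the launch step is a placeholder for your usual entry script:

# Enable EMS for every model sharing the GPUs (sketch; adapt to your setup)
export free_memory_policy=True
export free_memory_reference=True
export free_memory_reward=True
export free_memory_value=True
export free_memory_ppo_policy=True
export free_memory_ppo_value=True
# ... then launch the training entry script as usual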
A complete example can be found in the llama2 configuration.