End-to-end Training Tutorial with Llama Model

This document provides instructions for end-to-end training with the ChatLearn, Megatron-LM, and vLLM frameworks using the Llama/Llama2 models. ChatLearn supports the following three training modes:

  1. RLHF (Reinforcement Learning from Human Feedback): includes three stages of training: SFT, Reward, and RLHF training.

  2. Direct Preference Optimization (DPO): includes two stages of training: SFT and DPO training.

  3. OnlineDPO/GRPO: falls between RLHF and DPO; includes three stages of training: SFT, Reward, and DPO training.

The following general environment variables are used by the scripts in this tutorial:

| ENV | Explanation |
|---|---|
| CHATLEARN | The location where the ChatLearn code is cloned: https://github.com/alibaba/ChatLearn.git |
| MEGATRON | The location where the Megatron-LM code is cloned: https://github.com/NVIDIA/Megatron-LM.git |
| DATASET_ROOT | The root directory for storing the SFT/Reward/RLHF/DPO/OnlineDPO/GRPO training dataset collection. |
| TOKENIZER_MODEL | The path of tokenizer.model used by the Tokenizer. |
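For instance, a minimal setup might export these variables as follows (all paths below are placeholders to adapt to your environment):

export CHATLEARN=/path/to/ChatLearn          # cloned from https://github.com/alibaba/ChatLearn.git
export MEGATRON=/path/to/Megatron-LM         # cloned from https://github.com/NVIDIA/Megatron-LM.git
export DATASET_ROOT=/path/to/dataset_root    # root for the training datasets prepared below
export TOKENIZER_MODEL=/path/to/tokenizer.model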

Setup: Image / Code and Data Preparation

Image / Code

Please refer to Environment and Code Setup.

Data

Please refer to 3-stage data to prepare your training data.
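Based on the paths used in the commands later in this tutorial, the prepared data under $DATASET_ROOT is expected to look roughly like the sketch below (exact file names depend on the data preparation step):

# Expected layout under $DATASET_ROOT (a sketch inferred from the commands in this tutorial)
# $DATASET_ROOT/sft/        # SFT training data, e.g. train.jsonl
# $DATASET_ROOT/rm/         # Reward model training data
# $DATASET_ROOT/alignment/  # RLHF/DPO/OnlineDPO alignment data, e.g. train.jsonl
# $DATASET_ROOT/math/       # GRPO Math data, e.g. train.jsonl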

SFT

SFT refers to the process of fine-tuning a pre-trained language model with annotated dialogue data. In this example, we first download a pre-trained model and then run a simple SFT training demonstration.

Download and Convert Pretrained Models

If you are using a model from HuggingFace transformers, you will first need to download the pre-trained checkpoint, such as the Llama2 model available on HuggingFace Hub (meta-llama/Llama-2-7b-hf), or a locally saved SFT model. Then, you can use the following code to convert the HuggingFace transformers model into the Megatron-LM model format:

  1. For the llama2-7B model, we will convert the model into a checkpoint with TP (tensor_model_parallel_size)=4 and PP (pipeline_model_parallel_size)=1; the converted model will be saved to SAVE_PATH (referenced later as MEGATRON_LLAMA2_CKPT_PATH when starting SFT training). A concrete example of this setting is shown after the generic conversion command below.

  2. For the llama2-13B model, we will convert the model into a checkpoint with TP=8 and PP=1.

  3. For the llama2-70B model, we will convert the model into a checkpoint with TP=8 and PP=4.

export MEGATRON=path-to-megatron-lm
export CHATLEARN=path-to-chatlearn

cd ${CHATLEARN}/examples/megatron/

TP=num_of_tp \
PP=num_of_pp \
LOAD_PATH=path-to-hf-model \
TOKENIZER_MODEL=$LOAD_PATH/tokenizer.model \
SAVE_PATH=path-to-megatron-model \
bash scripts/convert_hf_to_megatron.sh
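As a concrete illustration, converting a llama2-7B HuggingFace checkpoint under the TP=4 / PP=1 setting from the list above might look like this (paths are placeholders, and MEGATRON_LLAMA2_CKPT_PATH is assumed to be the destination you will load when starting SFT training):

# llama2-7B: TP=4, PP=1 (see the list above); all paths are placeholders
TP=4 \
PP=1 \
LOAD_PATH=path-to-llama-2-7b-hf \
TOKENIZER_MODEL=$LOAD_PATH/tokenizer.model \
SAVE_PATH=$MEGATRON_LLAMA2_CKPT_PATH \
bash scripts/convert_hf_to_megatron.sh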

Start SFT Training

The script below is an example of SFT training. DATASET_PATH is the path to the SFT training set, such as $DATASET_ROOT/sft/train.jsonl. model_size is an environment variable specified in the script that indicates the size of the model; it can be llama2-7B, llama2-13B, or llama2-70B.

export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
cd ${CHATLEARN}/examples/megatron/

export model_size=llama2-7B

LOAD_PATH=$MEGATRON_LLAMA2_CKPT_PATH \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
DATASET_PATH=$DATASET_ROOT/sft/ \
bash scripts/train_sft_llama.sh

The training logs and trained checkpoints will be stored in ${CHATLEARN}/output/sft by default. For the exact definitions, please refer to the script ${CHATLEARN}/examples/megatron/scripts/train_sft_llama.sh.

In our training script, the resource requirements (assuming the resources are A100-80GB/A800-80GB GPUs) are as follows:

  1. llama2-7B SFT: 8 GPUs

  2. llama2-13B SFT: 8 GPUs

  3. llama2-70B SFT: 4*8 GPUs

For the environment variables and configurations required for distributed execution, please refer to Distributed Execution.

Reward Model Training

The Reward model serves as a proxy for human evaluation in RLHF. It provides real-time evaluation and scoring of the responses the model generates to questions: given a question and a model response, the Reward model produces a scalar score representing the quality of the response.

Hint: Reward model training is not needed for the DPO mode.

Start Reward Model Training

Following InstructGPT [1], Reward model training is initialized from the SFT model checkpoint. The training script is as follows:

export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
cd ${CHATLEARN}/examples/megatron/

LOAD_PATH=path-to-sft-ckpt \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
DATASET_PATH=$DATASET_ROOT/rm/ \
bash scripts/train_reward_llama.sh

The training logs and trained models will be saved by default in ${CHATLEARN}/output/reward. The specific definitions can be found in the script ${CHATLEARN}/examples/megatron/scripts/train_reward_llama.sh.

The resource requirements for training a reward model of the same scale are the same as those for SFT models.

For the environment variables and configurations required for distributed execution, please refer to Distributed Execution.

Alignment Training

ChatLearn supports multiple alignment training modes: RLHF, DPO, OnlineDPO, and GRPO.

Start Alignment Training

The following takes Llama2-7B as an example.

RLHF

Here is a training script for the Llama2-7B Policy and 7B Reward models. In this example, the user needs to set POLICY_LOAD to the checkpoint path generated by SFT training; the Policy and Reference models will be initialized with the SFT checkpoint. REWARD_LOAD should be set to the checkpoint path generated by Reward training, and the user can specify the iteration number of the checkpoint to load; the Reward and Value models will be initialized with the weights of the Reward model. TOKENIZER_MODEL should be set to the folder containing the tokenizer.model used by Llama2Tokenizer.

export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
export DATASET_PATH=$DATASET_ROOT/alignment/train.jsonl

cd ${CHATLEARN}/examples/megatron/

export model_size=llama2-7B

POLICY_LOAD=path-to-sft-ckpt \
REWARD_LOAD=path-to-rm-ckpt \
REWARD_LOAD_ITERATION=1000 \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
bash scripts/train_rlhf_llama.sh

OnlineDPO/GRPO

The OnlineDPO/GRPO training process is similar to that of RLHF.

export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
export DATASET_PATH=$DATASET_ROOT/alignment/train.jsonl

cd ${CHATLEARN}/examples/megatron/

export model_size=llama2-7B

POLICY_LOAD=path-to-sft-ckpt \
REWARD_LOAD=path-to-rm-ckpt \
REWARD_LOAD_ITERATION=1000 \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
bash scripts/train_online_dpo_llama.sh

DPO

Here is a DPO training script for the Llama2-7B model. In this example, the user needs to set POLICY_LOAD to the checkpoint path generated by SFT training; the Policy and Reference models will be initialized with the SFT checkpoint. TOKENIZER_MODEL should be set to the folder containing the tokenizer.model used by Llama2Tokenizer.

export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
export DATASET_PATH=$DATASET_ROOT/alignment/train.jsonl

cd ${CHATLEARN}/examples/megatron/

export model_size=llama2-7B

POLICY_LOAD=path-to-sft-ckpt \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
bash scripts/train_dpo_llama.sh

GRPO Math

To train a GRPO Math model, first refer to Math data to prepare the mathematics dataset. Below is an example of training a Llama2-7B model.

export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm
export DATASET_PATH=$DATASET_ROOT/math/train.jsonl

cd ${CHATLEARN}/examples/megatron/

export model_size=llama2-7B

POLICY_LOAD=path-to-sft-ckpt \
REWARD_LOAD=path-to-rm-ckpt \
REWARD_LOAD_ITERATION=1000 \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
bash scripts/train_grpo_math_llama.sh

Models of Other Sizes

If you need to train a llama2-13B / llama2-70B model, simply replace export model_size=llama2-7B with export model_size=llama2-13B / export model_size=llama2-70B. You can also modify the model configuration and other parameters according to your needs.
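For example, launching RLHF training for a llama2-13B model only changes the model_size line; the remaining variables are the same placeholders as in the 7B example above:

export model_size=llama2-13B

POLICY_LOAD=path-to-sft-ckpt \
REWARD_LOAD=path-to-rm-ckpt \
REWARD_LOAD_ITERATION=1000 \
TOKENIZER_MODEL=$LLAMA2_TOKENIZER_MODEL \
bash scripts/train_rlhf_llama.sh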

In our training script, the resource requirements (assuming the resources are A100-80GB / A800-80GB GPUs) are as follows:

  1. llama2-7B RLHF: 8 GPUs

  2. llama2-13B RLHF: 2*8 GPUs

  3. llama2-70B RLHF: 4*8 GPUs

For the environment variables and configurations required for distributed execution, please refer to Distributed Execution.

Evaluation

First, use ChatLearn's model conversion tool to convert the Megatron-LM formatted model into the HuggingFace transformers model format.

export CHATLEARN=path-to-chatlearn
export MEGATRON=path-to-megatron-lm

cd $CHATLEARN/examples/megatron/

LOAD_PATH=path-to-megatron-model \
SAVE_PATH=path-to-hf-model \
VOCAB_PATH=path-to-vocab \
target_params_dtype=bf16 \
bash scripts/convert_megatron_to_hf.sh

We evaluated the performance of Llama2-7B trained with SFT, RLHF, DPO, and OnlineDPO on the HH dataset, using MT-Bench with the GPT-4 API. The results show that RLHF improves the average performance of the model compared to SFT, with significant gains in the Humanities, Math, Roleplay, Reasoning, and Writing categories. The improvement observed here comes from a Reward model trained on the open-source HH dataset; customizing the Reward model can yield better results.

| Metric | llama_sft | llama_rlhf | llama_dpo | llama_onlinedpo |
|---|---|---|---|---|
| Coding | 2.05 | 1.65 | 2.17 | 1.75 |
| Extraction | 4.40 | 4.0 | 4.35 | 3.70 |
| Humanities | 5.85 | 7.17 | 6.70 | 7.52 |
| Math | 1.15 | 1.70 | 1.25 | 1.05 |
| Reasoning | 3.15 | 3.30 | 3.15 | 2.00 |
| Roleplay | 4.75 | 5.50 | 5.65 | 6.10 |
| STEM | 6.05 | 5.75 | 6.77 | 7.10 |
| Writing | 4.55 | 4.75 | 4.8 | 5.30 |
| Avg | 3.94 | 4.22 | 4.33 | 4.31 |

Reference

  1. Training language models to follow instructions with human feedback https://arxiv.org/abs/2203.02155