End-to-end training tutorial based on the Qwen model

This document describes DPO training with the Qwen model, built on the ChatLearn and DeepSpeed frameworks.

The following environment variables are used by the scripts in this tutorial:

ENV             Meaning
CHATLEARN       Location where the ChatLearn code repository (https://github.com/alibaba/ChatLearn.git) is cloned
DATASET_ROOT    Root directory where the training datasets are stored

Setup: Image, Code, and Data Preparation

Image / Code

Please refer to Environment and Code Setup.

Data

Qwen2 requires training data in the ChatML format, for example:

{"type": "chatml", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Tell me something about large language models."}, {"role": "assistant", "content": "Large language models are a type of language model that is trained on a large corpus of text data. They are capable of generating human-like text and are used in a variety of natural language processing tasks..."}], "source": "unknown"}

The following script converts the Dahoas/full-hh-rlhf dataset to ChatML format and stores the result in $DATASET_ROOT/alignment/train.jsonl:

cd ${CHATLEARN}/examples/huggingface/
DATASET_ROOT=path-to-dataset-root
# Writes the converted data to $DATASET_ROOT/alignment/train.jsonl
python data/preprocess_data_chatml.py $DATASET_ROOT
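
For readers curious what the conversion involves, below is a rough, simplified sketch of the idea rather than the actual contents of data/preprocess_data_chatml.py. It assumes Dahoas/full-hh-rlhf exposes prompt and chosen fields with "Human:"/"Assistant:" turn markers, and it emits only the chosen responses; the real script is authoritative for what DPO training actually consumes.

# Rough sketch of the conversion idea only -- the tutorial's real logic lives in
# data/preprocess_data_chatml.py. The "prompt"/"chosen" field names and the
# "Human:"/"Assistant:" turn markers are assumptions about Dahoas/full-hh-rlhf.
import json
import os
import re
import sys

from datasets import load_dataset


def to_chatml(prompt: str, response: str) -> dict:
    # Split the HH-style dialogue into alternating user/assistant turns.
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    turns = re.findall(r"(Human|Assistant): (.*?)(?=\n\n(?:Human|Assistant):|$)", prompt, re.S)
    for marker, text in turns:
        if text.strip():
            role = "user" if marker == "Human" else "assistant"
            messages.append({"role": role, "content": text.strip()})
    # The preferred ("chosen") response becomes the final assistant turn.
    messages.append({"role": "assistant", "content": response.strip()})
    return {"type": "chatml", "messages": messages, "source": "Dahoas/full-hh-rlhf"}


if __name__ == "__main__":
    dataset_root = sys.argv[1]
    os.makedirs(f"{dataset_root}/alignment", exist_ok=True)
    train = load_dataset("Dahoas/full-hh-rlhf", split="train")
    with open(f"{dataset_root}/alignment/train.jsonl", "w") as f:
        for row in train:
            f.write(json.dumps(to_chatml(row["prompt"], row["chosen"])) + "\n")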

DPO

Here is an example of DPO training for Qwen2-7B. Set policy_model_path to the path of the initial model checkpoint; both the Policy model and the Reference model are initialized from this checkpoint.

export CHATLEARN=path-to-chatlearn
export DATASET_PATH=$DATASET_ROOT/alignment/train.jsonl
# This checkpoint initializes both the Policy and Reference models.
export policy_model_path=path-to-qwen2-ckpt
cd ${CHATLEARN}/examples/huggingface/
bash scripts/train_dpo_qwen.sh
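
Before kicking off a long training run, it can help to smoke-test the converted dataset against the ChatML schema shown above. The sketch below is a minimal, optional check rather than part of the tutorial scripts; it relies on the DATASET_PATH variable exported above and validates only the fields visible in the example record.

# Optional sanity check, not part of the tutorial scripts. It validates only the
# chatml fields shown in the example record earlier in this document.
import json
import os

with open(os.environ["DATASET_PATH"]) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        assert record["type"] == "chatml", f"line {i}: unexpected type {record.get('type')}"
        for message in record["messages"]:
            assert message["role"] in {"system", "user", "assistant"}, f"line {i}: bad role"
            assert isinstance(message["content"], str), f"line {i}: content must be a string"
        if i >= 99:  # spot-checking the first 100 records is enough here
            break
print("train.jsonl looks well-formed")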