Config

class chatlearn.utils.arguments.RuntimeEnvConfig[source]

Runtime environment config; see https://docs.ray.io/en/latest/ray-core/handling-dependencies.html for more information.

pip: List[str] = []

packages to install via pip

py_modules: List[str] = []

Python modules

working_dir: str = os.getcwd()

working directory, defaults to the current working directory

platform: str = ''

platform, e.g., DLC

excludes: List[str] = []

files to exclude from packaging

get(key)[source]

Get other config by key.

Parameters:

key (str) – Key to get config.
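
As a hedged illustration (not confirmed API), the sketch below shows how these fields might be populated after ChatLearn parses its config; chatlearn.get_args() and the env_args attribute name are assumptions.

    import chatlearn

    chatlearn.init()
    # Assumption: the parsed RuntimeEnvConfig is exposed as env_args.
    env = chatlearn.get_args().env_args
    env.pip = ["transformers==4.31.0"]   # packages pip-installed on each worker
    env.excludes = ["logs", "*.ckpt"]    # paths excluded when packaging working_dir
    value = env.get("custom_key")        # look up an extra config by key (hypothetical key)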

class chatlearn.utils.arguments.RuntimeConfig[source]

Training-related configs.

num_episode: int = 5000

[required] number of episodes. One episode includes an inference and training loop.

sample_per_episode: int = 1000

[required] number of samples per episode.

num_training_epoch: int = 1

[optional] number of training epochs per episode, defaults to 1.

generation_batch_size: int = 2

[required] generation (inference) batch size.

train_micro_batch_size: int = 2

[required] training micro batch size.

train_global_batch_size: int = None

[required] training global batch size.
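
To make the relationship between the batch-size knobs concrete, here is an illustrative (not default) calculation; the gradient-accumulation line assumes a data-parallel size of 1.

    # Illustrative arithmetic only; names mirror the config fields above.
    sample_per_episode = 1000
    generation_batch_size = 2
    train_micro_batch_size = 2
    train_global_batch_size = 200

    inference_steps_per_episode = sample_per_episode // generation_batch_size    # 500
    optimizer_steps_per_episode = sample_per_episode // train_global_batch_size  # 5
    grad_accum_steps = train_global_batch_size // train_micro_batch_size         # 100 (DP = 1 assumed)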

save_episode_interval: int = None

[required] save a checkpoint every save_episode_interval episodes.

log_interval: int = 1

[optional] log time and memory every log_interval iterations.

data_path: str = None

[required] data path for the dataset.

colocation: List[str] = []

[optional] colocate models onto the same devices.
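
For instance, a minimal hedged sketch of colocation; the comma-separated grouping format inside one list entry and the model names are assumptions.

    # Assumption: models named policy/reference/reward/value exist in the config,
    # and one list entry groups the comma-separated model names to colocate.
    runtime = chatlearn.get_args().runtime_args
    runtime.colocation = ["policy,reference,reward,value"]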

eval_episode_interval: int = 0

[optional] evaluate every N episodes; if 0, no evaluation is performed.

enable_resume_training: bool = True

[optional] enable resuming training when a data checkpoint is set.

data_checkpoint_path: str = None

[optional] checkpoint path for the dataloader.

max_data_ckpt_nums: int = None

[optional] maximum number of data checkpoints to keep.

load_data_checkpoint_iteration: int = None

[optional] load the data checkpoint from a given iteration.

stream_data_loader_type: str = 'fixed'

[optional] stream data loader type, one of ["fixed", "dynamic"].
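
The data-checkpoint fields above work together to resume a run; a hedged sketch follows (the runtime_args access path is an assumption).

    # Assumption: the parsed RuntimeConfig is exposed as runtime_args.
    runtime = chatlearn.get_args().runtime_args
    runtime.data_checkpoint_path = "/path/to/data_ckpt"  # where dataloader state is saved
    runtime.enable_resume_training = True                # resume when a checkpoint exists
    runtime.max_data_ckpt_nums = 3                       # keep at most 3 data checkpoints
    runtime.load_data_checkpoint_iteration = None        # None: load the latest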

debug: bool = False

private debug flag

nsys: bool = False

enable NVIDIA Nsight Systems (nsys) NVTX annotations

profiler_dir: str = None

directory for profiler output

coalesce_param: bool = True

coalesce parameters in model sync

coalesced_buffer_mb: int = 100

coalesced buffer size in MB

concurrent_comm: bool = True

concurrent parameter sync

param_sync_comm_type: str = 'broadcast'

parameter sync communication type, broadcast/p2p

param_sync_max_workers: int = None

maximum number of workers for parameter sync

max_relay_episode: int = 0

max number of relay episodes; if set to -1, all episodes are relayed; if set to 0, relay is disabled

relay_episode_offset: int = 0

relay after n episodes

consumed_samples: int = 0

consumed samples

concurrent_setup: bool = False

concurrent model setup

bucket_size_mb_in_memory_manager: int = 1024

bucket size in the memory manager to reduce peak memory

free_sync_collective_group: bool = False

free the collective group after parameter synchronization and rebuild it before the next synchronization

cpu_schedule_strategy: str = 'SPREAD'

[optional] CPU-only model schedule policy, PACK or SPREAD. PACK: all provided bundles are packed onto a single node on a best-effort basis. SPREAD: each bundle is spread onto separate nodes on a best-effort basis.

exp_name: str = 'CHATLEARN'

experiment name for each run

output_dir: str = './'

output directory

get(key)[source]

Get other config by key.

Parameters:

key (str) – key to get config

class chatlearn.utils.arguments.ModelConfig[source]

Config for model.

num_device: int = 0

[legacy] number of GPUs used for one model, defaults to 0.

num_gpu: int = 0

[required] number of GPUs used for one model, defaults to 0; same as num_device

num_cpu: int = 0

[required] number of CPUs used for one model, defaults to 0

gpu_per_process: int = None

[optional] GPUs per process; e.g., for PyTorch DDP, Megatron, and DeepSpeed, gpu_per_process is set to 1

cpu_per_process: int = None

[optional] CPUs per process

num_replica: int = 1

[optional] number of module replicas; for a GPU model, num_replica = num_gpu // (TP * PP * DP); for a CPU model, num_replica = num_cpu // cpu_per_process
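
For example, with illustrative numbers:

    # Illustrative: a GPU model with TP=8, PP=2 and (assumed) DP=1.
    num_gpu = 16
    tensor_model_parallel_size = 8
    pipeline_model_parallel_size = 2
    data_parallel_size = 1
    num_replica = num_gpu // (
        tensor_model_parallel_size * pipeline_model_parallel_size * data_parallel_size
    )  # -> 1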

trainable: bool = False

[required] whether model is trainable

tensor_model_parallel_size: int = None

[optional] tensor model parallel size

pipeline_model_parallel_size: int = None

[optional] pipeline model parallel size

zero_size: int = None

[optional] ZeRO size

model_config_file: str = ''

[optional] config file for model

config_dir: str = ''

[optional] config dir for model

model_type: str = ''

[optional] model type, e.g., Torch/TensorFlow, etc.

generation_batch_size: int = -1

[optional] generation batch size; overrides the generation batch size in RuntimeConfig

offload_optimizer_states = False

offload optimizer states

sync_frequency = 1

parameter sync frequency

offload_weights = False

offload weights

free_grad_buffers = False

free grad buffers

free_memory = False

overall switch for offload optimizer states/weights and free grad buffers

args_dict: dict = None

[optional] placeholder for other args

lora: LoraConfig = None

LoRA config

batch_generation: BatchGenerationConfig = None

batch generation config
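
Putting the pieces together, a hedged sketch of configuring one model's resources; the models mapping and the model name "policy" are assumptions, not confirmed API.

    # Assumption: parsed ModelConfig objects are exposed via a models mapping.
    policy = chatlearn.get_args().models["policy"]
    policy.num_gpu = 8
    policy.gpu_per_process = 1
    policy.tensor_model_parallel_size = 4
    policy.pipeline_model_parallel_size = 2
    policy.trainable = False
    policy.generation_batch_size = 64  # overrides RuntimeConfig.generation_batch_size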

class chatlearn.utils.arguments.BatchGenerationConfig[source]

Config for batch generation ranking and memory-efficiency.

ranking: bool = False

[optional] sort prompts by length each episode.

min_prompt_length: int = 0

[optional] min prompt length in the first stage of batch generation.

class chatlearn.utils.arguments.LoraConfig[source]

Config for LoRA.

enable_lora: bool = False

enable LoRA, defaults to False.

part_module_name: str = None

Specifies a particular module (by name scope) to be converted to LoRA. By default it is set to None, meaning there is no restriction: any module matched by lora_layer can be converted. If set to a specific value (e.g., "encoder"), only modules whose name scope contains "encoder" are converted to LoRA.

lora_dim: int = 8

The LoRA rank, i.e., the r dimension of the A/B matrices.

lora_dropout: float = 0.0

The dropout ratio applied in the forward pass of the LoRA layer. Defaults to 0.0 (no dropout).

lora_scaling: float = 1.0

The scaling applied when adding the LoRA A and B matrices to the original weight matrix: W = W + A * B * lora_scaling. Defaults to 1.0.
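
The merge rule is the standard LoRA update; a minimal PyTorch sketch for illustration (not ChatLearn's implementation):

    import torch

    # Illustrative shapes; lora_dim is the rank r documented above.
    d_out, d_in, lora_dim, lora_scaling = 16, 16, 8, 1.0
    W = torch.randn(d_out, d_in)
    A = torch.randn(d_out, lora_dim)      # low-rank factor A
    B = torch.randn(lora_dim, d_in)       # low-rank factor B
    W_merged = W + A @ B * lora_scaling   # W = W + A * B * lora_scaling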

lora_layer: str = 'ColumnParallelLinear,Embedding,LinearLayer,RowParallelLinear,VocabParallelEmbedding'

The layer class names involved in LoRA training in the model, separated by commas.

column_only_qkv: bool = False

If True, LoRA training is enabled only for the ColumnParallelLinear layers of the MHA QKV modules.