Config¶
- class chatlearn.utils.arguments.RuntimeEnvConfig[source]¶
Runtime env config. For more information, refer to https://docs.ray.io/en/latest/ray-core/handling-dependencies.html.
- pip: List[str] = []¶
packages to install via pip
- py_modules: List[str] = []¶
python modules
- working_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/chatlearn/checkouts/v1.0.2/docs/en'¶
working directory
- platform: str = ''¶
platform, e.g., DLC
- excludes: List[str] = []¶
files to exclude from packaging
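For example, a minimal sketch of setting these fields programmatically. The attribute names follow the documentation above; direct instantiation with defaults and the concrete values (platform, package names, patterns) are assumptions for illustration, since in practice these values are usually supplied through a config file rather than set in code.
from chatlearn.utils.arguments import RuntimeEnvConfig

# Sketch: populate the Ray runtime-env options documented above.
env = RuntimeEnvConfig()              # assumes default construction works
env.platform = "DLC"                  # target platform (example value)
env.pip = ["sentencepiece"]           # packages to install via pip on workers (placeholder)
env.py_modules = ["my_module"]        # local Python modules to ship (hypothetical name)
env.working_dir = "./"                # working directory uploaded to the cluster
env.excludes = ["*.ckpt", "logs"]     # file patterns excluded from packaging (placeholders)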
- class chatlearn.utils.arguments.RuntimeConfig[source]¶
Training-related configs.
- num_episode: int = 5000¶
[required] number of episodes. One episode includes an inference and training loop.
- sample_per_episode: int = 1000¶
[required] number of samples per episode.
- num_training_epoch: int = 1¶
[optional] number of training epochs per episode. Defaults to 1.
- generation_batch_size: int = 2¶
[required] generation (inference) batch size.
- train_micro_batch_size: int = 2¶
[required] training micro batch size.
- train_global_batch_size: int = None¶
[required] training global batch size.
- save_episode_interval: int = None¶
[required] save a checkpoint every save_episode_interval episodes.
- log_interval: int = 1¶
[optional] log time and memory every log_interval iterations.
- data_path: str = None¶
[required] data_path for the dataset
- colocation: List[str] = []¶
[optional] colocate models onto the same devices
- eval_episode_interval: int = 0¶
[optional] evaluate every eval_episode_interval episodes; if 0, evaluation is disabled
- enable_resume_training: bool = True¶
[optional] enable resume training when a data checkpoint is set
- data_checkpoint_path: str = None¶
[optional] checkpoint path for the dataloader
- max_data_ckpt_nums: int = None¶
[optional] max number of data checkpoints
- load_data_checkpoint_iteration: int = None¶
[optional] load the data checkpoint from the given iteration
- stream_data_loader_type: str = 'fixed'¶
[optional] stream_data_loader type, one of ["fixed", "dynamic"]
- debug: bool = False¶
private
- nsys: bool = False¶
enable nsys nvtx
- profiler_dir: str = None¶
profiler dir
- coalesce_param: bool = True¶
coalesce parameters in model sync
- coalesced_buffer_mb: int = 100¶
coalesced buffer size in MB
- concurrent_comm: bool = True¶
concurrent parameter sync
- param_sync_comm_type: str = 'broadcast'¶
parameter sync communication type, broadcast/p2p
- param_sync_max_workers: int = None¶
parameter sync max workers
- max_relay_episode: int = 0¶
max number of relay episodes; if max_relay_episode is set to -1, all episodes are relayed; if set to 0, relay is disabled
- relay_episode_offset: int = 0¶
relay after n episodes
- consumed_samples: int = 0¶
consumed samples
- concurrent_setup: bool = False¶
concurrent model setup
- bucket_size_mb_in_memory_manager: int = 1024¶
bucket size in the memory manager to reduce peak memory
- free_sync_collective_group: bool = False¶
free collective group after parameter synchronization and rebuild before next synchronization
- cpu_schedule_strategy: str = 'SPREAD'¶
[optional] CPU-only model schedule policy, PACK or SPREAD. PACK: all provided bundles are packed onto a single node on a best-effort basis. SPREAD: each bundle is spread onto separate nodes on a best-effort basis.
- exp_name: str = 'CHATLEARN'¶
experiment name for each run
- output_dir: str = './'¶
output dir
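As a worked illustration of the [required] fields, the sketch below fills in placeholder values. The attribute names come from the documentation above; the direct instantiation and the concrete numbers are assumptions, since these values are normally provided via the training config.
from chatlearn.utils.arguments import RuntimeConfig

runtime = RuntimeConfig()                 # assumes default construction works
# [required] fields
runtime.num_episode = 100                 # 100 inference + training loops (placeholder)
runtime.sample_per_episode = 1024         # samples generated per episode (placeholder)
runtime.generation_batch_size = 4         # inference batch size (placeholder)
runtime.train_micro_batch_size = 2        # micro batch size per training step (placeholder)
runtime.train_global_batch_size = 128     # global training batch size (placeholder)
runtime.save_episode_interval = 10        # checkpoint every 10 episodes (placeholder)
runtime.data_path = "/path/to/dataset"    # dataset path (placeholder)
# a couple of [optional] fields
runtime.eval_episode_interval = 5         # evaluate every 5 episodes; 0 disables evaluation
runtime.exp_name = "ppo_exp"              # experiment name (placeholder)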
- class chatlearn.utils.arguments.ModelConfig[source]¶
Config for model.
- num_device: int = 0¶
[legacy] number of GPUs used for one model, default 0.
- num_gpu: int = 0¶
[required] number of GPUs used for one model, default 0; same as num_device
- num_cpu: int = 0¶
[required] number of CPUs used for one model, default 0
- gpu_per_process: int = None¶
[optional] gpu per process, e.g., for PyTorch DDP, Megatron, DeepSpeed, gpu_per_process is set to 1
- cpu_per_process: int = None¶
[optional] cpu per process
- num_replica: int = 1¶
[optional] number of module replicas; for a GPU model, num_replica = num_gpu // (TP * PP * DP); for a CPU model, num_replica = num_cpu // cpu_per_process
- trainable: bool = False¶
[required] whether model is trainable
- tensor_model_parallel_size: int = None¶
[optional] tensor model parallel size
- pipeline_model_parallel_size: int = None¶
[optional] pipeline model parallel size
- zero_size: int = None¶
[optional] zero size
- model_config_file: str = ''¶
[optional] config file for model
- config_dir: str = ''¶
- model_type: str = ''¶
[optional] model type, e.g., Torch/TensorFlow, etc.
- generation_batch_size: int = -1¶
[optional] generation batch size, will overwrite generation batch size in RuntimeConfig
- offload_optimizer_states = False¶
offload optimizer states
- sync_frequency = 1¶
parameter sync frequency
- offload_weights = False¶
offload weights
- free_grad_buffers = False¶
free grad buffers
- free_memory = False¶
overall switch for offloading optimizer states/weights and freeing grad buffers
- args_dict: dict = None¶
[optional] placeholder for other args
- lora: LoraConfig = None¶
lora config
- batch_generation: BatchGenerationConfig = None¶
batch generation config
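A hedged sketch of a ModelConfig for an 8-GPU inference model follows. The attribute names are taken from the documentation above; the values, the file name, and the direct instantiation are illustrative assumptions.
from chatlearn.utils.arguments import ModelConfig

policy = ModelConfig()                    # assumes default construction works
policy.num_gpu = 8                        # GPUs used by this model (placeholder)
policy.gpu_per_process = 1                # one GPU per process, e.g., Megatron/DDP style
policy.trainable = False                  # inference-only model
policy.tensor_model_parallel_size = 4     # TP = 4 (placeholder)
policy.pipeline_model_parallel_size = 2   # PP = 2 (placeholder)
policy.generation_batch_size = 8          # overrides RuntimeConfig.generation_batch_size
policy.model_config_file = "policy.yaml"  # hypothetical file name
With TP = 4, PP = 2, and DP = 1, the num_replica formula above gives 8 // (4 * 2 * 1) = 1 replica.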
- class chatlearn.utils.arguments.BatchGenerationConfig[source]¶
Config for batch generation ranking and memory-efficiency.
- ranking: bool = False¶
[optional] sort prompts by length each episode.
- min_prompt_length: int = 0¶
[optional] min prompt length in the first stage of batch generation.
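For instance, a short sketch that enables length-based ranking and attaches it to a model config via the batch_generation field documented above; the instantiation and values are assumptions.
from chatlearn.utils.arguments import BatchGenerationConfig, ModelConfig

bg = BatchGenerationConfig()              # assumes default construction works
bg.ranking = True                         # sort prompts by length each episode
bg.min_prompt_length = 32                 # min prompt length in the first stage (placeholder)

model = ModelConfig()
model.batch_generation = bg               # attach to the model config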
- class chatlearn.utils.arguments.LoraConfig[source]¶
Config for LoRA.
- enable_lora: bool = False¶
enable LoRA, default False.
- part_module_name: str = None¶
Specifies the name scope of the modules to be converted to LoRA. By default, it is set to None, which means there is no restriction and any module matching lora_layer can be converted. If it is set to a specific value (e.g., "encoder"), only modules whose name scope contains "encoder" are converted to LoRA.
- lora_dim: int = 8¶
The rank value of the LoRA, which is the r dimension of the A/B matrix.
- lora_dropout: float = 0.0¶
The dropout ratio applied in the forward pass of the LoRA layers. By default, it is set to 0.0.
- lora_scaling: float = 1.0¶
When adding the product of the LoRA A and B matrices to the original weight matrix, the scaling is applied as W = W + A * B * lora_scaling. By default, the scaling value is set to 1.0.
- lora_layer: str = 'ColumnParallelLinear,Embedding,LinearLayer,RowParallelLinear,VocabParallelEmbedding'¶
The layer class names involved in LoRA training in the model, separated by commas.
- column_only_qkv: bool = False¶
If True, LoRA training is enabled only in the ColumnParallelLinear layers of the MHA QKV module.
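Finally, a hedged sketch of a LoRA configuration restricted to encoder modules. The field names follow the documentation above; the specific values and the direct instantiation are assumptions.
from chatlearn.utils.arguments import LoraConfig, ModelConfig

lora = LoraConfig()                       # assumes default construction works
lora.enable_lora = True                   # turn on LoRA
lora.part_module_name = "encoder"         # only modules whose name scope contains "encoder"
lora.lora_dim = 16                        # rank r of the A/B matrices (placeholder)
lora.lora_dropout = 0.05                  # dropout in the LoRA forward pass (placeholder)
lora.lora_scaling = 1.0                   # applied as W = W + A * B * lora_scaling

model = ModelConfig()
model.lora = lora                         # attach via the ModelConfig.lora field documented above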