Distributed Execution

This document will provide instructions on how to execute a distributed training task.

PAI DLC Distributed Execution

Aliyun PAI DLC [1] can conveniently and efficiently support training for various tasks.

The screenshots of the PAI-DLC task creation page is shown as follows. Select the job type as PyTorch and paste the command into the Execution Command window.

image.png

image.png

Non-PAI-DLC environment

If you want to submit distributed training in a non-PAI-DLC environment, the following environment variables need to be configured on each node before executing the script:

export MASTER_ADDR=xxx
export MASTER_PORT=xxx
export WORLD_SIZE=xxx
export GPUS_PER_NODE=8
export RANK=xx

Reference

  1. Aliyun Machine Learning PAI-DLC: https://www.aliyun.com/activity/bigdata/pai-dlc