DeepSpeed Installation and Usage

DeepSpeed+Ubuntu+CPU

CPU support is currently very limited; only partial inference is supported. The environment setup can follow the CI configuration.

Install intel_extension_for_pytorch

python -m pip install intel_extension_for_pytorch
python -m pip install oneccl_bind_pt==2.0 -f https://developer.intel.com/ipex-whl-stable-cpu

Install oneCCL

git clone https://github.com/oneapi-src/oneCCL
cd oneCCL
mkdir build
cd build
cmake ..
make
make install
source ./_install/env/setvars.sh

Install Transformers (for running the test cases)

git clone https://github.com/huggingface/transformers
cd transformers
git rev-parse --short HEAD  # record the commit being tested, for reproducibility
pip install .

Install DeepSpeed

pip install deepspeed

ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented [NO] ....... [OKAY]
deepspeed_ccl_comm ....... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/hua/anaconda3/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/hua/anaconda3/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.4+e5fe5f65, e5fe5f65, master
deepspeed wheel compiled w. ...... torch 0.0
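
With the environment in place, a minimal smoke test of the CPU inference path might look like the following sketch (my own example, not from the original notes: "gpt2" is just an illustrative model, and kernel injection is left off since the custom CUDA kernels are unavailable on CPU):

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; any small causal LM works for a smoke test
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# No kernel injection: the custom CUDA kernels are not available on CPU
engine = deepspeed.init_inference(model, dtype=torch.float32,
                                  replace_with_kernel_inject=False)

inputs = tok("DeepSpeed is", return_tensors="pt")
out = engine.module.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0]))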

DeepSpeed+Ubuntu+GPU

Operating System

The GPU image on Huawei Cloud is Ubuntu 16.04, where many of the dependent packages are too old; upgrading to 20.04 is recommended. Some DeepSpeedExamples require a newer glibc, so you can upgrade straight to 22.04.

sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
sudo do-release-upgrade

GPU Driver and CUDA

  • On the driver download page, select the matching GPU model and choose CUDA 11.7.

  • On the CUDA Toolkit download page, select your OS version and download the runfile (installing through apt always upgrades the driver and CUDA to the latest version, so install directly from the binary; see the sketch below).
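
For example, a runfile-based install could look like this (a sketch; the exact filename depends on the 11.7.x build you download, the one below being the CUDA 11.7.0 runfile):

wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
sudo sh cuda_11.7.0_515.43.04_linux.run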

pip Mirror

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

安装pytorch

DeepSpeed 0.9.2 (the pip release) currently depends on PyTorch 1.13.1, so install the matching version.

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

triton==1.0.0

DeepSpeed depends on triton 1.0.0, which cannot be installed directly from the pip repository; download the source from GitHub and build it.

# Build dependencies
sudo apt-get install llvm-11 llvm-11-*
# Install from source
wget https://github.com/openai/triton/archive/refs/tags/v1.0.zip
unzip v1.0.zip
cd triton-1.0/python
pip install cmake
pip install .
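
A quick import check confirms the from-source build is usable:

python -c "import triton; print('triton OK')"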

安装DeepSpeed

# Precompile the ops (build failure: https://github.com/microsoft/DeepSpeed/issues/425)
# Workaround: set NVCC_PREPEND_FLAGS="--forward-unknown-opts"
DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j8"

# JIT load (ops are compiled at runtime instead)
pip install deepspeed

ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/hua/anaconda3/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/hua/anaconda3/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

Running an Example

git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/training/cifar
./run_ds.sh

Python errors out because the DataLoader iterator's .next() method was removed in newer PyTorch; fix it as follows:

diff --git a/training/cifar/cifar10_deepspeed.py b/training/cifar/cifar10_deepspeed.py
index 33ea569..d1117c3 100755
--- a/training/cifar/cifar10_deepspeed.py
+++ b/training/cifar/cifar10_deepspeed.py
@@ -159,7 +159,7 @@ def imshow(img):

# get some random training images
dataiter = iter(trainloader)
-images, labels = dataiter.next()
+images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
@@ -309,7 +309,7 @@ print('Finished Training')
# Okay, first step. Let us display an image from the test set to get familiar.

dataiter = iter(testloader)
-images, labels = dataiter.next()
+images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))
diff --git a/training/cifar/cifar10_tutorial.py b/training/cifar/cifar10_tutorial.py
index 2154e36..114e8c5 100644
--- a/training/cifar/cifar10_tutorial.py
+++ b/training/cifar/cifar10_tutorial.py
@@ -110,7 +110,7 @@ def imshow(img):

# get some random training images
dataiter = iter(trainloader)
-images, labels = dataiter.next()
+images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
@@ -219,7 +219,7 @@ torch.save(net.state_dict(), PATH)
# Okay, first step. Let us display an image from the test set to get familiar.

dataiter = iter(testloader)
-images, labels = dataiter.next()
+images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))

Rewriting a Simple Model with DeepSpeed

import torch

with_ds = True

if with_ds:
    import deepspeed

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x_data = torch.tensor([[1.0], [2.0], [3.0]]).to(device)
y_data = torch.tensor([[2.0], [4.0], [6.0]]).to(device)


class LinearModel(torch.nn.Module):
    def __init__(self):
        super(LinearModel, self).__init__()
        self.linear = torch.nn.Linear(1, 1).to(device)

    def forward(self, x):
        y_pred = self.linear(x)
        return y_pred


model = LinearModel()
# size_average is deprecated; reduction='sum' is the equivalent
criterion = torch.nn.MSELoss(reduction='sum')

if not with_ds:
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
else:
    ds_config = {
        "train_micro_batch_size_per_gpu": 2,
        "optimizer": {
            "type": "SGD",
            "params": {
                "lr": 1e-2
            }
        },
    }

    model, _, _, _ = deepspeed.initialize(model=model,
                                          model_parameters=model.parameters(),
                                          config=ds_config)

for epoch in range(10):
    y_pred = model(x_data)
    loss = criterion(y_pred, y_data)
    print(epoch, loss.item())

    if not with_ds:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    else:
        # the DeepSpeed engine owns backward() and step()
        model.backward(loss)
        model.step()

print('w = ', model.linear.weight.item())
print('b = ', model.linear.bias.item())

x_test = torch.tensor([[4.0]]).to(device)
y_test = model(x_test)
print('y_pred = ', y_test.data)
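
To exercise the DeepSpeed branch, start the script through the deepspeed launcher so that the rank/world-size environment variables deepspeed.initialize expects are set (linear_ds.py is an illustrative filename):

deepspeed --num_gpus=1 linear_ds.py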

DeepSpeed Basic Execution Flow

This simple model still ends up calling PyTorch functions end to end; the DeepSpeed wrapper does essentially nothing here and is a pure pass-through.

  1. First obtain the Accelerator: DeepSpeed checks whether intel_extension_for_deepspeed can be imported; if so it uses XPU, otherwise CUDA. (The Accelerator abstracts device management, memory management, tensors, and so on, with each backend device subclassing and implementing it; newer versions changed this logic and allow control via environment variables.) Entry point 1: this is where detecting whether an NPU can be used would go. See the sketch after this list.
  2. Select the parallel communication backend. If torch.distributed is already initialized, it is used directly; otherwise DeepSpeed checks whether the machine is on Azure or AWS and sets machine-specific environment variables, then tries to locate MPI, and finally initializes a TorchBackend according to the Accelerator's comm backend type. Entry point 2: this is where Huawei Cloud machines would need dedicated environment-variable configuration, plus support for the Ascend parallel backend.
  3. Parse the DeepSpeed config file.
  4. Create the DeepSpeed engine:
    1. Check environment variables and set up dist-related configuration, including rank, world size, and so on.
    2. Distribute the model parameters via dist so that all processes hold synchronized parameters.
    3. Create the optimizer from the config. In the example above this produces PyTorch's built-in SGD optimizer; DeepSpeed's own optimizers such as Adam or LAMB can be specified instead.
    4. Configure checkpointing.
    5. Compile the utils extension, i.e. flatten_unflatten.cpp.
  5. Run forward and compute the loss. Entry point 3: from here on, the code mostly calls PyTorch optimizers, which requires Ascend support in PyTorch; the optimizers provided by DeepSpeed would also need Ascend support.
  6. Backward pass:
    1. Accumulate gradients.
    2. Depending on whether ZeRO, automatic precision, mixed precision, etc. are enabled, call backward on the corresponding optimizer wrapper with the appropriate arguments.
    3. Compute gradients across processes and gather the results.
  7. Update parameters:
    1. Once the gradient-accumulation boundary is reached, the engine ultimately calls optimizer.step to update the parameters according to the configuration, then clears the gradients.
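
A minimal sketch of steps 1 and 2, querying the Accelerator abstraction and initializing the communication backend explicitly (deepspeed.initialize normally does this implicitly; the printed values assume a CUDA machine):

import deepspeed
from deepspeed.accelerator import get_accelerator

accel = get_accelerator()
print(accel.device_name())                 # e.g. 'cuda' ('xpu'/'npu' on other backends)
print(accel.communication_backend_name())  # e.g. 'nccl'

# Explicit counterpart of the implicit call inside deepspeed.initialize;
# needs RANK/WORLD_SIZE/MASTER_ADDR env vars, e.g. set by the deepspeed launcher
deepspeed.init_distributed(dist_backend=accel.communication_backend_name())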