DeepSpeed Installation and Usage

DeepSpeed+Ubuntu+CPU

CPU support is currently very limited; only partial inference is supported. The environment setup can follow the CI configuration.

Install intel_extension_for_pytorch

python -m pip install intel_extension_for_pytorch
python -m pip install oneccl_bind_pt==2.0 -f https://developer.intel.com/ipex-whl-stable-cpu

Install oneCCL

git clone https://github.com/oneapi-src/oneCCL
cd oneCCL
mkdir build
cd build
cmake ..
make
make install
source ./_install/env/setvars.sh

Install Transformers (for running the test cases)

git clone https://github.com/huggingface/transformers
cd transformers
git rev-parse --short HEAD  # record the commit being tested, for reproducibility
pip install .

Install DeepSpeed

pip install deepspeed

ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented [NO] ....... [OKAY]
deepspeed_ccl_comm ....... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/hua/anaconda3/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/hua/anaconda3/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.4+e5fe5f65, e5fe5f65, master
deepspeed wheel compiled w. ...... torch 0.0
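
With the environment in place, a minimal smoke test of the CPU inference path might look like the following sketch (my own example, not from the original notes: "gpt2" is just an illustrative model, and kernel injection is left off since the custom CUDA kernels are unavailable on CPU):

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; any small causal LM works for a smoke test
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# No kernel injection: the custom CUDA kernels are not available on CPU
engine = deepspeed.init_inference(model, dtype=torch.float32,
                                  replace_with_kernel_inject=False)

inputs = tok("DeepSpeed is", return_tensors="pt")
out = engine.module.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0]))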

DeepSpeed+Ubuntu+GPU

Operating System

The GPU image on Huawei Cloud is Ubuntu 16.04, where many of the dependent packages are too old; upgrading to 20.04 is recommended. Some DeepSpeedExamples require a newer glibc, so you can upgrade straight to 22.04.

sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
sudo do-release-upgrade

GPU Driver and CUDA

  • On the driver download page, select the matching GPU model and choose CUDA 11.7.

  • On the CUDA Toolkit download page, select your OS version and download the runfile (installing through apt always upgrades the driver and CUDA to the latest version, so install directly from the binary; see the sketch below).
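
For example, a runfile-based install could look like this (a sketch; the exact filename depends on the 11.7.x build you download, the one below being the CUDA 11.7.0 runfile):

wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
sudo sh cuda_11.7.0_515.43.04_linux.run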

pip Mirror

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

安装pytorch

DeepSpeed 0.9.2 (the pip release) currently depends on PyTorch 1.13.1, so install the matching version.

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

triton==1.0.0

DeepSpeed depends on triton 1.0.0, which cannot be installed directly from the pip repository; download the source from GitHub and build it.

# Build dependencies
sudo apt-get install llvm-11 llvm-11-*
# Install from source
wget https://github.com/openai/triton/archive/refs/tags/v1.0.zip
unzip v1.0.zip
cd triton-1.0/python
pip install cmake
pip install .
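
A quick import check confirms the from-source build is usable:

python -c "import triton; print('triton OK')"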

安装DeepSpeed

# Precompile the ops (build failure: https://github.com/microsoft/DeepSpeed/issues/425)
# Workaround: set NVCC_PREPEND_FLAGS="--forward-unknown-opts"
DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j8"

# JIT load (ops are compiled at runtime instead)
pip install deepspeed

ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/hua/anaconda3/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/hua/anaconda3/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

Running an Example

git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/training/cifar
./run_ds.sh

Python errors out because the DataLoader iterator's .next() method was removed in newer PyTorch; fix it as follows:

diff --git a/training/cifar/cifar10_deepspeed.py b/training/cifar/cifar10_deepspeed.py
index 33ea569..d1117c3 100755
--- a/training/cifar/cifar10_deepspeed.py
+++ b/training/cifar/cifar10_deepspeed.py
@@ -159,7 +159,7 @@ def imshow(img):

# get some random training images
dataiter = iter(trainloader)
-images, labels = dataiter.next()
+images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
@@ -309,7 +309,7 @@ print('Finished Training')
# Okay, first step. Let us display an image from the test set to get familiar.

dataiter = iter(testloader)
-images, labels = dataiter.next()
+images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))
diff --git a/training/cifar/cifar10_tutorial.py b/training/cifar/cifar10_tutorial.py
index 2154e36..114e8c5 100644
--- a/training/cifar/cifar10_tutorial.py
+++ b/training/cifar/cifar10_tutorial.py
@@ -110,7 +110,7 @@ def imshow(img):

# get some random training images
dataiter = iter(trainloader)
-images, labels = dataiter.next()
+images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
@@ -219,7 +219,7 @@ torch.save(net.state_dict(), PATH)
# Okay, first step. Let us display an image from the test set to get familiar.

dataiter = iter(testloader)
-images, labels = dataiter.next()
+images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))

Rewriting a Simple Model with DeepSpeed

import torch

with_ds = True

if with_ds:
    import deepspeed

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x_data = torch.tensor([[1.0], [2.0], [3.0]]).to(device)
y_data = torch.tensor([[2.0], [4.0], [6.0]]).to(device)


class LinearModel(torch.nn.Module):
    def __init__(self):
        super(LinearModel, self).__init__()
        self.linear = torch.nn.Linear(1, 1).to(device)

    def forward(self, x):
        y_pred = self.linear(x)
        return y_pred


model = LinearModel()
# size_average is deprecated; reduction='sum' is the equivalent
criterion = torch.nn.MSELoss(reduction='sum')

if not with_ds:
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
else:
    ds_config = {
        "train_micro_batch_size_per_gpu": 2,
        "optimizer": {
            "type": "SGD",
            "params": {
                "lr": 1e-2
            }
        },
    }

    model, _, _, _ = deepspeed.initialize(model=model,
                                          model_parameters=model.parameters(),
                                          config=ds_config)

for epoch in range(10):
    y_pred = model(x_data)
    loss = criterion(y_pred, y_data)
    print(epoch, loss.item())

    if not with_ds:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    else:
        # the DeepSpeed engine owns backward() and step()
        model.backward(loss)
        model.step()

print('w = ', model.linear.weight.item())
print('b = ', model.linear.bias.item())

x_test = torch.tensor([[4.0]]).to(device)
y_test = model(x_test)
print('y_pred = ', y_test.data)
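
To exercise the DeepSpeed branch, start the script through the deepspeed launcher so that the rank/world-size environment variables deepspeed.initialize expects are set (linear_ds.py is an illustrative filename):

deepspeed --num_gpus=1 linear_ds.py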

DeepSpeed Basic Execution Flow

This simple model still ends up calling PyTorch functions end to end; the DeepSpeed wrapper does essentially nothing here and is a pure pass-through.

  1. First obtain the Accelerator: DeepSpeed checks whether intel_extension_for_deepspeed can be imported; if so it uses XPU, otherwise CUDA. (The Accelerator abstracts device management, memory management, tensors, and so on, with each backend device subclassing and implementing it; newer versions changed this logic and allow control via environment variables.) Entry point 1: this is where detecting whether an NPU can be used would go. See the sketch after this list.
  2. Select the parallel communication backend. If torch.distributed is already initialized, it is used directly; otherwise DeepSpeed checks whether the machine is on Azure or AWS and sets machine-specific environment variables, then tries to locate MPI, and finally initializes a TorchBackend according to the Accelerator's comm backend type. Entry point 2: this is where Huawei Cloud machines would need dedicated environment-variable configuration, plus support for the Ascend parallel backend.
  3. Parse the DeepSpeed config file.
  4. Create the DeepSpeed engine:
    1. Check environment variables and set up dist-related configuration, including rank, world size, and so on.
    2. Distribute the model parameters via dist so that all processes hold synchronized parameters.
    3. Create the optimizer from the config. In the example above this produces PyTorch's built-in SGD optimizer; DeepSpeed's own optimizers such as Adam or LAMB can be specified instead.
    4. Configure checkpointing.
    5. Compile the utils extension, i.e. flatten_unflatten.cpp.
  5. Run forward and compute the loss. Entry point 3: from here on, the code mostly calls PyTorch optimizers, which requires Ascend support in PyTorch; the optimizers provided by DeepSpeed would also need Ascend support.
  6. Backward pass:
    1. Accumulate gradients.
    2. Depending on whether ZeRO, automatic precision, mixed precision, etc. are enabled, call backward on the corresponding optimizer wrapper with the appropriate arguments.
    3. Compute gradients across processes and gather the results.
  7. Update parameters:
    1. Once the gradient-accumulation boundary is reached, the engine ultimately calls optimizer.step to update the parameters according to the configuration, then clears the gradients.
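
A minimal sketch of steps 1 and 2, querying the Accelerator abstraction and initializing the communication backend explicitly (deepspeed.initialize normally does this implicitly; the printed values assume a CUDA machine):

import deepspeed
from deepspeed.accelerator import get_accelerator

accel = get_accelerator()
print(accel.device_name())                 # e.g. 'cuda' ('xpu'/'npu' on other backends)
print(accel.communication_backend_name())  # e.g. 'nccl'

# Explicit counterpart of the implicit call inside deepspeed.initialize;
# needs RANK/WORLD_SIZE/MASTER_ADDR env vars, e.g. set by the deepspeed launcher
deepspeed.init_distributed(dist_backend=accel.communication_backend_name())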