DeepSpeed+Ubuntu+CPU
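CPU support is currently very limited: only part of inference is supported. For the environment setup, refer to the CI configuration.
Install intel_extension_for_pytorch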
    python -m pip install intel_extension_for_pytorch
    python -m pip install oneccl_bind_pt==2.0 -f https://developer.intel.com/ipex-whl-stable-cpu
Install oneCCL
    git clone https://github.com/oneapi-src/oneCCL
    cd oneCCL
    mkdir build
    cd build
    cmake ..
    make
    make install
    source ./_install/env/setvars.sh
Install Transformers (used to run the test cases)
    git clone https://github.com/huggingface/transformers
    cd transformers
    git rev-parse --short HEAD
    pip install .
Install DeepSpeed
ds_report
    --------------------------------------------------
    DeepSpeed C++/CUDA extension op report
    --------------------------------------------------
    NOTE: Ops not installed will be just-in-time (JIT) compiled at
          runtime if needed. Op compatibility means that your system
          meet the required dependencies to JIT install the op.
    --------------------------------------------------
    JIT compiled ops requires ninja
    ninja .................. [OKAY]
    --------------------------------------------------
    op name ................ installed .. compatible
    --------------------------------------------------
    deepspeed_not_implemented [NO] ....... [OKAY]
    deepspeed_ccl_comm ....... [NO] ....... [OKAY]
    --------------------------------------------------
    DeepSpeed general environment info:
    torch install path ............... ['/home/hua/anaconda3/lib/python3.10/site-packages/torch']
    torch version .................... 2.0.1+cu117
    deepspeed install path ........... ['/home/hua/anaconda3/lib/python3.10/site-packages/deepspeed']
    deepspeed info ................... 0.9.4+e5fe5f65, e5fe5f65, master
    deepspeed wheel compiled w. ...... torch 0.0
DeepSpeed+Ubuntu+GPU
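Operating system
The GPU images on Huawei Cloud ship Ubuntu 16.04, where many dependencies are too old. Upgrading to 20.04 is recommended; because some of the DeepSpeedExamples need a newer glibc, you can also upgrade straight to 22.04.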
    sudo apt-get update
    sudo apt-get upgrade
    sudo apt-get dist-upgrade
    sudo do-release-upgrade
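To confirm which release and glibc a machine is actually running, before or after the upgrade, a small standard-library script is enough. This is only a sketch: the helper name `describe_system` is made up for this example, and `platform.freedesktop_os_release()` needs Python 3.10 or newer.

```python
import platform


def describe_system():
    """Return the distribution, release and C library version of this machine."""
    info = {"distro": "unknown", "release": "unknown"}
    try:
        # Reads /etc/os-release; available since Python 3.10.
        os_release = platform.freedesktop_os_release()
        info["distro"] = os_release.get("ID", "unknown")
        info["release"] = os_release.get("VERSION_ID", "unknown")
    except (OSError, AttributeError):
        # No os-release file (non-Linux system) or an older Python.
        pass
    libc_name, libc_version = platform.libc_ver()
    info["libc"] = f"{libc_name} {libc_version}".strip() or "unknown"
    return info


if __name__ == "__main__":
    for key, value in describe_system().items():
        print(f"{key}: {value}")
```

On an upgraded machine the `release` field should read 20.04 or 22.04.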
GPU driver and CUDA
pip index
    pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
Install PyTorch
DeepSpeed 0.9.2 (the pip release) currently depends on PyTorch 1.13.1, so install the matching version.
    pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
triton==1.0.0
DeepSpeed depends on triton 1.0.0, which cannot be installed directly from the pip index, so download the source from GitHub and build it.
    # Build dependencies
    sudo apt-get install llvm-11 llvm-11-*
    # Install from source
    wget https://github.com/openai/triton/archive/refs/tags/v1.0.zip
    unzip v1.0.zip
    # The GitHub archive unpacks to triton-1.0, not triton
    cd triton-1.0/python
    pip install cmake
    pip install .
Install DeepSpeed
    # Pre-build the ops (the build fails: https://github.com/microsoft/DeepSpeed/issues/425)
    # Workaround: set NVCC_PREPEND_FLAGS="--forward-unknown-opts"
    DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext" --global-option="-j8"

    # JIT load: ops are compiled at runtime when first needed
    pip install deepspeed
ds_report
    --------------------------------------------------
    DeepSpeed C++/CUDA extension op report
    --------------------------------------------------
    NOTE: Ops not installed will be just-in-time (JIT) compiled at
          runtime if needed. Op compatibility means that your system
          meet the required dependencies to JIT install the op.
    --------------------------------------------------
    JIT compiled ops requires ninja
    ninja .................. [OKAY]
    --------------------------------------------------
    op name ................ installed .. compatible
    --------------------------------------------------
    async_io ............... [NO] ....... [OKAY]
    cpu_adagrad ............ [NO] ....... [OKAY]
    cpu_adam ............... [NO] ....... [OKAY]
    fused_adam ............. [NO] ....... [OKAY]
    fused_lamb ............. [NO] ....... [OKAY]
    quantizer .............. [NO] ....... [OKAY]
    random_ltd ............. [NO] ....... [OKAY]
    sparse_attn ............ [NO] ....... [OKAY]
    spatial_inference ...... [NO] ....... [OKAY]
    transformer ............ [NO] ....... [OKAY]
    stochastic_transformer . [NO] ....... [OKAY]
    transformer_inference .. [NO] ....... [OKAY]
    utils .................. [NO] ....... [OKAY]
    --------------------------------------------------
    DeepSpeed general environment info:
    torch install path ............... ['/home/hua/anaconda3/lib/python3.10/site-packages/torch']
    torch version .................... 1.13.1+cu117
    deepspeed install path ........... ['/home/hua/anaconda3/lib/python3.10/site-packages/deepspeed']
    deepspeed info ................... 0.9.2, unknown, unknown
    torch cuda version ............... 11.7
    torch hip version ................ None
    nvcc version ..................... 11.7
    deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
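The version pins used in this section (torch 1.13.1, triton 1.0.0, deepspeed 0.9.2) can also be checked from Python without importing the packages. The sketch below reads the installed package metadata; `check_pins` and `base_version` are names invented for this example, and whether a source-built triton reports exactly `1.0.0` is an assumption.

```python
from importlib import metadata

# Pins taken from the installation steps in this section.
EXPECTED = {"torch": "1.13.1", "triton": "1.0.0", "deepspeed": "0.9.2"}


def base_version(version):
    """Drop a local build suffix, e.g. '1.13.1+cu117' -> '1.13.1'."""
    return version.split("+", 1)[0]


def check_pins(expected=EXPECTED, lookup=metadata.version):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, base_version(installed) == wanted)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_pins().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name:<10} {installed or 'not installed':<16} {status}")
```

The `lookup` parameter exists only so the function can be exercised with fake data; in normal use the default is enough.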
Run an example
    git clone https://github.com/microsoft/DeepSpeedExamples.git
    cd DeepSpeedExamples/training/cifar
    ./run_ds.sh
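Python raises an error because the DataLoader iterator has no .next() method; replace each dataiter.next() call with the built-in next(dataiter):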
    @@ -159,7 +159,7 @@ def imshow(img):

     # get some random training images
     dataiter = iter(trainloader)
    -images, labels = dataiter.next()
    +images, labels = next(dataiter)

     # show images
     imshow(torchvision.utils.make_grid(images))
    @@ -309,7 +309,7 @@ print('Finished Training')

     # Okay, first step. Let us display an image from the test set to get familiar.
     dataiter = iter(testloader)
    -images, labels = dataiter.next()
    +images, labels = next(dataiter)

     # print images
     imshow(torchvision.utils.make_grid(images))

    @@ -110,7 +110,7 @@ def imshow(img):

     # get some random training images
     dataiter = iter(trainloader)
    -images, labels = dataiter.next()
    +images, labels = next(dataiter)

     # show images
     imshow(torchvision.utils.make_grid(images))
    @@ -219,7 +219,7 @@ torch.save(net.state_dict(), PATH)

     # Okay, first step. Let us display an image from the test set to get familiar.
     dataiter = iter(testloader)
    -images, labels = dataiter.next()
    +images, labels = next(dataiter)

     # print images
     imshow(torchvision.utils.make_grid(images))
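The same change can be applied mechanically. The sketch below shows why the old call fails on a plain Python 3 iterator and rewrites simple `name.next()` calls with a regular expression. `modernize_next_calls` is a helper invented for this example; it deliberately skips attribute chains such as `self.it.next()` and is not a general refactoring tool.

```python
import re

# Matches `<name>.next()` where <name> is a bare identifier (not `a.b.next()`).
_NEXT_CALL = re.compile(r"(?<![\w.])([A-Za-z_]\w*)\.next\(\)")


def modernize_next_calls(source):
    """Rewrite `it.next()` to `next(it)` in a string of Python source."""
    return _NEXT_CALL.sub(r"next(\1)", source)


# Python 3 iterators expose __next__, not .next(), so use the built-in next().
dataiter = iter([("images", "labels")])
assert not hasattr(dataiter, "next")
images, labels = next(dataiter)

print(modernize_next_calls("images, labels = dataiter.next()"))
# prints: images, labels = next(dataiter)
```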
Rewriting a simple model for DeepSpeed
    import torch

    # Toggle between plain PyTorch and DeepSpeed with a single flag.
    with_ds = True
    if with_ds:
        import deepspeed

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Training data for y = 2x.
    x_data = torch.tensor([[1.0], [2.0], [3.0]]).to(device)
    y_data = torch.tensor([[2.0], [4.0], [6.0]]).to(device)


    class LinearModel(torch.nn.Module):
        def __init__(self):
            super(LinearModel, self).__init__()
            self.linear = torch.nn.Linear(1, 1).to(device)

        def forward(self, x):
            y_pred = self.linear(x)
            return y_pred


    model = LinearModel()

    # reduction='sum' replaces the deprecated size_average=False (same behavior).
    criterion = torch.nn.MSELoss(reduction='sum')

    if not with_ds:
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    else:
        ds_config = {
            "train_micro_batch_size_per_gpu": 2,
            "optimizer": {
                "type": "SGD",
                "params": {
                    "lr": 1e-2
                }
            },
        }
        # The returned engine wraps both the model and the optimizer.
        model, _, _, _ = deepspeed.initialize(model=model,
                                              model_parameters=model.parameters(),
                                              config=ds_config)

    for epoch in range(10):
        y_pred = model(x_data)
        loss = criterion(y_pred, y_data)
        print(epoch, loss.item())

        if not with_ds:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        else:
            # The engine runs backward and the optimizer step itself.
            model.backward(loss)
            model.step()

    print('w = ', model.linear.weight.item())
    print('b = ', model.linear.bias.item())

    x_test = torch.tensor([[4.0]]).to(device)
    y_test = model(x_test)
    print('y_pred = ', y_test.data)
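Basic DeepSpeed execution flow
For this simple model, everything still ends up calling PyTorch functions; the DeepSpeed wrapper does essentially nothing and simply passes the calls through.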
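To make "passes the calls through" concrete, here is a toy stand-in written for this post. It is not DeepSpeed's engine class: `PassThroughEngine` and `Recorder` are invented names, and the fake objects only log which methods were called.

```python
class PassThroughEngine:
    """Toy stand-in for the engine object: every call is forwarded unchanged."""

    def __init__(self, module, optimizer):
        self.module = module
        self.optimizer = optimizer

    def __call__(self, *args, **kwargs):
        # Forward pass: just call the wrapped model.
        return self.module(*args, **kwargs)

    def backward(self, loss):
        # Same as loss.backward() in the plain PyTorch branch.
        loss.backward()

    def step(self):
        # Same as optimizer.step() followed by optimizer.zero_grad().
        self.optimizer.step()
        self.optimizer.zero_grad()

    def __getattr__(self, name):
        # Only reached for attributes the wrapper lacks, e.g. `engine.linear`.
        if name == "module":
            raise AttributeError(name)
        return getattr(self.module, name)


class Recorder:
    """Fake model, optimizer and loss in one object; logs the calls it receives."""

    def __init__(self):
        self.calls = []
        self.linear = "linear-layer"

    def __call__(self, x):
        self.calls.append("forward")
        return self  # doubles as the "loss"

    def backward(self):
        self.calls.append("backward")

    def step(self):
        self.calls.append("step")

    def zero_grad(self):
        self.calls.append("zero_grad")


fake = Recorder()
engine = PassThroughEngine(module=fake, optimizer=fake)
loss = engine(1.0)
engine.backward(loss)
engine.step()
print(fake.calls)     # ['forward', 'backward', 'step', 'zero_grad']
print(engine.linear)  # linear-layer
```

The real engine adds distributed setup, ZeRO, mixed precision and so on around these calls; with none of those enabled, the training loop behaves like the plain PyTorch branch.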
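First, get the Accelerator: check whether intel_extension_for_deepspeed can be imported; if it can, use XPU, otherwise use CUDA. (The Accelerator abstracts device management, memory management, tensors and so on, and each backend device inherits from it and implements the interface.) Newer versions change this logic, and the accelerator can be selected through an environment variable. Entry point 1: this is where a check for whether an NPU can be used needs to be added.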
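The selection logic described above can be sketched in a few lines. This is an illustration, not DeepSpeed's code: `pick_accelerator` is a made-up function, `DS_ACCELERATOR` is assumed to be the controlling environment variable in newer versions, and `torch_npu` is assumed to be the module that signals Ascend NPU support.

```python
import importlib.util
import os


def _module_available(name):
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_accelerator(env=None, has_module=_module_available):
    """Return the accelerator name chosen by the rules described above."""
    env = os.environ if env is None else env

    # Newer behaviour (assumed variable name): an explicit override wins.
    override = env.get("DS_ACCELERATOR")
    if override:
        return override

    # Older behaviour: XPU if the Intel extension is importable, else CUDA.
    if has_module("intel_extension_for_deepspeed"):
        return "xpu"
    # Entry point 1: an NPU probe would go here (`torch_npu` is an assumption).
    if has_module("torch_npu"):
        return "npu"
    return "cuda"


if __name__ == "__main__":
    print(pick_accelerator())
```

Passing `env` and `has_module` explicitly keeps the decision testable on a machine that has none of these packages installed.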
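Select the parallel communication backend. If torch.distributed is already initialized, it is used directly. Otherwise DeepSpeed checks whether it is running on an Azure or AWS machine and sets up environment variables for those platforms; failing that, it tries to discover MPI. It then initializes TorchBackend according to the Accelerator's comm backend type. Entry point 2: this is where environment-variable setup specific to Huawei Cloud machines, and support for an Ascend parallel backend, need to be added.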
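The decision order can be written down as a small planning function. Everything here is a sketch under assumptions: `plan_dist_init` is invented, the Azure and AWS marker variables (`AZUREML_EXPERIMENT_ID`, `SM_TRAINING_ENV`) and the list of required variables are my reading of how the check works, and `HUAWEI_CLOUD_JOB_ID` is a purely hypothetical marker for entry point 2.

```python
import os

# Variables torch.distributed needs for env:// initialization (assumed list).
REQUIRED_ENV = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT", "LOCAL_RANK")


def plan_dist_init(env=None, dist_initialized=False, comm_backend="nccl"):
    """Return the ordered list of actions for setting up communication."""
    env = os.environ if env is None else env
    if dist_initialized:
        return ["reuse existing torch.distributed"]

    actions = []
    if not all(name in env for name in REQUIRED_ENV):
        if "AZUREML_EXPERIMENT_ID" in env:
            actions.append("patch env for Azure ML")
        elif "SM_TRAINING_ENV" in env:
            actions.append("patch env for AWS SageMaker")
        # Entry point 2: a Huawei Cloud branch (hypothetical marker variable).
        elif "HUAWEI_CLOUD_JOB_ID" in env:
            actions.append("patch env for Huawei Cloud")
        else:
            actions.append("discover rank/world size via MPI")
    actions.append(f"init TorchBackend with {comm_backend}")
    return actions


if __name__ == "__main__":
    for step in plan_dist_init():
        print(step)
```

For Ascend, the comm backend name passed in would presumably be `hccl` instead of `nccl`; that too is an assumption here.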
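Parse the DeepSpeed configuration file
Create the DeepSpeed engine
Check the environment variables and set up the dist-related configuration, including rank, world size and so on
Broadcast the model parameters with dist so that all processes start from the same parameters
Create the optimizer from the configuration. In the example above this produces PyTorch's built-in SGD optimizer; the optimizers DeepSpeed provides, such as Adam and LAMB, can be specified as well
Configure checkpointing
Compile Utils, which is flatten_unflatten.cpp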
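As an example of the optimizer step in the engine setup above, the `ds_config` from the earlier example could request Adam instead of SGD. The hyperparameter values below are placeholders chosen for illustration, not recommendations.

```python
import json

# Same shape as the earlier ds_config, with the optimizer switched to Adam.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-3,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.0,
        },
    },
}

# The dictionary is plain JSON, so it can also live in a config file.
print(json.dumps(ds_config, indent=2))
```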
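Run forward and compute the loss
Entry point 3: from here on, almost everything calls PyTorch optimizers, so PyTorch itself needs Ascend support; the optimizers DeepSpeed provides need Ascend support as well
Backward propagation
Gradient accumulation
Depending on whether ZeRO optimization, automatic mixed precision, mixed precision and so on are in use, call the backward of the corresponding optimizer wrapper with the appropriate arguments
Compute the gradients across processes and collect the results
Update the parameters
Once the gradient accumulation boundary is reached, optimizer.step is finally called, depending on the configuration, to update the parameters, and the gradients are cleared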
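The backward and update steps above boil down to "accumulate for N micro-batches, then step and clear". Here is a pure-Python sketch of that loop on the y = 2x data from the earlier example. `train` is a made-up function, and dividing each gradient by the number of accumulation steps is an assumption about how the accumulated value is averaged.

```python
def train(samples, accumulation_steps, lr=0.01, epochs=1):
    """Fit y = w * x with SGD, stepping once every `accumulation_steps` micro-batches."""
    w = 0.0
    grad_sum = 0.0
    micro_steps = 0
    updates = 0
    for _ in range(epochs):
        for x, y in samples:
            # backward: d/dw (w*x - y)^2, scaled so the accumulated sum is a mean
            grad_sum += 2 * (w * x - y) * x / accumulation_steps
            micro_steps += 1
            if micro_steps % accumulation_steps == 0:  # accumulation boundary
                w -= lr * grad_sum  # optimizer.step()
                grad_sum = 0.0      # clear the gradients
                updates += 1
    return w, updates


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# One update per three micro-batches versus one update per micro-batch.
print(train(data, accumulation_steps=3))
print(train(data, accumulation_steps=1, epochs=100))
```

With `accumulation_steps=1` the weight converges to 2; with `accumulation_steps=3` and a single epoch only one update happens, after all three samples have contributed.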