Deployment
mpi-operator is not installed in the cluster by default, so you have to install it yourself.
```sh
# Note: fetch the raw file; the github.com/.../blob/... URL returns an HTML page.
wget https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/mpi-operator.yaml
kubectl create -f mpi-operator.yaml
```
Inside mainland China there is the perennial hassle of unreachable image registries; for mpi-operator you need to replace these two images yourself:
```
mpioperator/mpi-operator:latest
mpioperator/kubectl-delivery:latest
```
You can find both images directly on Docker Hub, or build them yourself: operator address, delivery address.
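If you mirror them through a private registry, a minimal sketch looks like this (`REGISTRY` is a placeholder for your own registry address, not something from the original setup):

```sh
# Sketch: mirror the two images through a registry the cluster can reach.
# REGISTRY is a placeholder; substitute your own registry address.
REGISTRY=registry.example.com/mirror
for img in mpi-operator kubectl-delivery; do
  docker pull mpioperator/${img}:latest
  docker tag mpioperator/${img}:latest ${REGISTRY}/${img}:latest
  docker push ${REGISTRY}/${img}:latest
done
# Point the manifest at the mirrored images before applying it.
sed -i "s#mpioperator/#${REGISTRY}/#g" mpi-operator.yaml
```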
Use the following commands to check that the installation succeeded (the CRD has been created && the operator pod is Running):
```sh
$ kubectl get crd | grep mpijobs
mpijobs.kubeflow.org            2019-09-18T07:44:48Z

$ kubectl get pod -n mpi-operator
mpi-operator-584466c4f6-frw4x   1/1   Running   1   3d
```
Example
Create the tensorflow-benchmarks example as described on the official site (file address):
```yaml
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 1
```
This defines one Launcher and two Workers; each Worker uses a GPU, so in the Launcher's mpirun arguments -np must be 2, meaning two processes run in parallel.
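As a sanity check (my own sketch, not part of the original example), you can read both numbers back from the live object and confirm they multiply out to the -np value:

```sh
# Sanity check: -np must equal Worker replicas x slotsPerWorker (2 x 1 = 2 here).
REPLICAS=$(kubectl get mpijob tensorflow-benchmarks \
  -o jsonpath='{.spec.mpiReplicaSpecs.Worker.replicas}')
SLOTS=$(kubectl get mpijob tensorflow-benchmarks \
  -o jsonpath='{.spec.slotsPerWorker}')
echo "expected -np: $((REPLICAS * SLOTS))"
```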
Once the two Workers are up, the Launcher starts streaming logs:
```
NAME                                   READY   STATUS    RESTARTS   AGE
mpi-operator-584466c4f6-frw4x          1/1     Running   1          3d
tensorflow-benchmarks-launcher-vsq8b   1/1     Running   0          40s
tensorflow-benchmarks-worker-0         1/1     Running   0          40s
tensorflow-benchmarks-worker-1         1/1     Running   0          40s

Done warm up
Step    Img/sec  total_loss
Done warm up
Step    Img/sec  total_loss
1       images/sec: 109.6 +/- 0.0 (jitter = 0.0)   9.181
1       images/sec: 109.9 +/- 0.0 (jitter = 0.0)   9.110
10      images/sec: 108.1 +/- 0.7 (jitter = 1.3)   8.864
10      images/sec: 108.1 +/- 0.7 (jitter = 1.5)   9.184
20      images/sec: 107.9 +/- 0.8 (jitter = 1.1)   9.246
20      images/sec: 107.9 +/- 0.8 (jitter = 1.2)   9.073
30      images/sec: 107.7 +/- 0.6 (jitter = 1.5)   9.147
30      images/sec: 107.7 +/- 0.6 (jitter = 1.5)   9.096
40      images/sec: 107.9 +/- 0.4 (jitter = 1.8)   9.069
40      images/sec: 107.9 +/- 0.4 (jitter = 1.8)   9.194
50      images/sec: 108.3 +/- 0.4 (jitter = 2.2)   9.206
50      images/sec: 108.3 +/- 0.4 (jitter = 2.0)   9.485
60      images/sec: 108.3 +/- 0.3 (jitter = 2.1)   9.139
60      images/sec: 108.3 +/- 0.3 (jitter = 2.0)   9.237
70      images/sec: 107.8 +/- 0.5 (jitter = 2.2)   9.132
70      images/sec: 107.8 +/- 0.5 (jitter = 2.2)   9.045
80      images/sec: 107.8 +/- 0.4 (jitter = 2.3)   9.092
80      images/sec: 107.8 +/- 0.4 (jitter = 2.2)   9.098
90      images/sec: 107.7 +/- 0.4 (jitter = 2.2)   9.205
90      images/sec: 107.7 +/- 0.4 (jitter = 2.3)   9.145
100     images/sec: 107.7 +/- 0.4 (jitter = 2.1)   9.050
----------------------------------------------------------------
total images/sec: 215.33
----------------------------------------------------------------
100     images/sec: 107.7 +/- 0.4 (jitter = 2.1)   9.013
----------------------------------------------------------------
total images/sec: 215.32
----------------------------------------------------------------
```
The mpioperator/tensorflow-benchmarks:latest image is used here. Note that the current latest version is built on CUDA 10, while Huawei Cloud's K8s environment only supports CUDA 9 for now, so use the 0.2.0 tag, which gets the job done.
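Pinning the tag is a one-line edit; here is a sketch (the manifest filename is my assumption):

```sh
# Pin the benchmark image to the CUDA 9 compatible 0.2.0 tag (filename assumed).
sed -i 's#tensorflow-benchmarks:latest#tensorflow-benchmarks:0.2.0#g' tensorflow-benchmarks.yaml
```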
A Brief Look at the Implementation
The first thing that shows up in the Launcher's log is the dispatch command:
```
POD_NAME=tensorflow-benchmarks-worker-1
shift
/opt/kube/kubectl exec tensorflow-benchmarks-worker-1 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2828730368" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-benchmarks-launcher-[1:5]l8hm,tensorflow-benchmarks-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "2828730368.0;tcp://172.16.0.86:37557" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
POD_NAME=tensorflow-benchmarks-worker-0
shift
/opt/kube/kubectl exec tensorflow-benchmarks-worker-0 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2828730368" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-benchmarks-launcher-[1:5]l8hm,tensorflow-benchmarks-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "2828730368.0;tcp://172.16.0.86:37557" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
```
Via /opt/kube/kubectl exec, the real command to execute is shipped to each worker; the only difference between the two commands is the -mca ess_base_vpid rank number.
The worker's startup command is sleep 365d: apart from accepting commands from the Launcher, it does nothing at all.
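You can verify that yourself; the following check is my own addition (the exact output format varies by kubectl version):

```sh
# Confirm the worker container is literally just sleeping; all real work
# arrives later via `kubectl exec` from the launcher.
kubectl get pod tensorflow-benchmarks-worker-0 -n mpi-operator \
  -o jsonpath='{.spec.containers[0].command} {.spec.containers[0].args}'
# e.g. ["sleep","365d"]
```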
So how does the Launcher wrap up these commands? When the MPIJob starts, the hostfile and kubexec.sh are packed into a ConfigMap.
The hostfile is assembled from entries of the form <name>-worker-<id>, while kubexec.sh should be identical for every job.
So every MPIJob gets a corresponding ConfigMap:
```yaml
apiVersion: v1
data:
  hostfile: |
    tensorflow-benchmarks-worker-0 slots=1
    tensorflow-benchmarks-worker-1 slots=1
  kubexec.sh: |
    set -x
    POD_NAME=$1
    shift
    /opt/kube/kubectl exec ${POD_NAME} -- /bin/sh -c "$*"
kind: ConfigMap
metadata:
  creationTimestamp: 2019-09-22T04:07:36Z
  labels:
    app: tensorflow-benchmarks
  name: tensorflow-benchmarks-config
  namespace: mpi-operator
  ownerReferences:
  - apiVersion: kubeflow.org/v1alpha2
    blockOwnerDeletion: true
    controller: true
    kind: MPIJob
    name: tensorflow-benchmarks
    uid: 82cd05da-dcee-11e9-ac58-fa163e3a1ebd
  resourceVersion: "39344012"
  selfLink: /api/v1/namespaces/mpi-operator/configmaps/tensorflow-benchmarks-config
  uid: 82cef441-dcee-11e9-860a-fa163e132ef9
```
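To see the agent's contract concretely, here is a hypothetical manual test (the launcher pod name is taken from the earlier listing): kubexec.sh treats its first argument as the "host" and everything after it as the command to relay.

```sh
# Hypothetical manual test of the rsh agent, run from inside the launcher pod:
# $1 is the target "host" (a worker pod name), the rest is the command to relay.
kubectl exec -n mpi-operator tensorflow-benchmarks-launcher-vsq8b -- \
  /etc/mpi/kubexec.sh tensorflow-benchmarks-worker-0 hostname
# Should print the worker's hostname, i.e. tensorflow-benchmarks-worker-0
```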
So how do these two files get injected into the MPI workflow? Presumably through the environment variables OMPI_MCA_plm_rsh_agent and OMPI_MCA_orte_default_hostfile, which should be picked up, callback-like, during the mpirun process (this part is my guess).
What Exactly Is MPI?
We have been talking about Kubeflow's MPI Operator all this time, but for anyone new to MPI this is probably completely unfamiliar territory.
In this example the final argument is --variable_update=horovod. Horovod is a distributed training framework built on the MPI architecture; I plan to write a separate side post on Horovod and use it to learn MPI along the way.
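As a tiny preview (a sketch assuming the tensorflow-benchmarks image, which bundles Horovod), each process launched by mpirun initializes Horovod and receives an MPI rank, which is what that flag relies on:

```sh
# Sketch (assumes an image with Horovod installed, e.g. the benchmarks image):
# each of the two mpirun processes gets its own Horovod rank over MPI.
mpirun --allow-run-as-root -np 2 \
  python -c 'import horovod.tensorflow as hvd; hvd.init(); print("rank", hvd.rank(), "of", hvd.size())'
```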