[Kubeflow Series] An Introduction to MPI-Operator

Deployment

mpi-operator is not installed by default, so you need to install it yourself.

wget https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/mpi-operator.yaml
# replace the image addresses first (see below)
kubectl create -f mpi-operator.yaml

In mainland China there is the perennial hassle of unreachable image registries; for mpi-operator you need to replace these two images yourself:

mpioperator/mpi-operator:latest
mpioperator/kubectl-delivery:latest

You can find both images directly on Docker Hub, or build them yourself from the operator and kubectl-delivery sources.
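
If the default registry is not reachable from your cluster, one workaround is to pull the two images wherever you can, retag them, push them to your own registry, and point mpi-operator.yaml at that registry. A rough sketch, where REGISTRY is a placeholder for your private registry:

REGISTRY=registry.example.com/kubeflow   # placeholder, replace with your own registry
for img in mpi-operator:latest kubectl-delivery:latest; do
    docker pull mpioperator/${img}
    docker tag mpioperator/${img} ${REGISTRY}/${img}
    docker push ${REGISTRY}/${img}
done
# finally, update the image references in mpi-operator.yaml to point at ${REGISTRY}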

Use the following commands to check whether the installation succeeded (the CRD has been created && the pod is Running):

# kubectl get crd|grep mpi
mpijobs.kubeflow.org 2019-09-18T07:44:48Z

# kubectl get pod -n mpi-operator|grep mpi
mpi-operator-584466c4f6-frw4x 1/1 Running 1 3d

Example

Following the official documentation, create the tensorflow-benchmarks example.

The manifest file:

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 1

Here one Launcher and two Workers are defined, and each Worker uses one GPU, so the -np argument in the Launcher's mpirun command must be 2 (Worker replicas × slotsPerWorker = 2 × 1), i.e. two parallel processes.
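
To submit the job, assuming the manifest above is saved as tensorflow-benchmarks.yaml (the filename is mine, not from the official example):

kubectl create -f tensorflow-benchmarks.yaml -n mpi-operator
# watch the launcher and workers come up
kubectl get pod -n mpi-operator -w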

Once the two Workers are up, the Launcher starts streaming logs:

# kubectl get pod -n mpi-operator
NAME READY STATUS RESTARTS AGE
mpi-operator-584466c4f6-frw4x 1/1 Running 1 3d
tensorflow-benchmarks-launcher-vsq8b 1/1 Running 0 40s
tensorflow-benchmarks-worker-0 1/1 Running 0 40s
tensorflow-benchmarks-worker-1 1/1 Running 0 40s

# kubectl logs -f tensorflow-benchmarks-launcher-vsq8b -n mpi-operator
Done warm up
Step Img/sec total_loss
Done warm up
Step Img/sec total_loss
1 images/sec: 109.6 +/- 0.0 (jitter = 0.0) 9.181
1 images/sec: 109.9 +/- 0.0 (jitter = 0.0) 9.110
10 images/sec: 108.1 +/- 0.7 (jitter = 1.3) 8.864
10 images/sec: 108.1 +/- 0.7 (jitter = 1.5) 9.184
20 images/sec: 107.9 +/- 0.8 (jitter = 1.1) 9.246
20 images/sec: 107.9 +/- 0.8 (jitter = 1.2) 9.073
30 images/sec: 107.7 +/- 0.6 (jitter = 1.5) 9.147
30 images/sec: 107.7 +/- 0.6 (jitter = 1.5) 9.096
40 images/sec: 107.9 +/- 0.4 (jitter = 1.8) 9.069
40 images/sec: 107.9 +/- 0.4 (jitter = 1.8) 9.194
50 images/sec: 108.3 +/- 0.4 (jitter = 2.2) 9.206
50 images/sec: 108.3 +/- 0.4 (jitter = 2.0) 9.485
60 images/sec: 108.3 +/- 0.3 (jitter = 2.1) 9.139
60 images/sec: 108.3 +/- 0.3 (jitter = 2.0) 9.237
70 images/sec: 107.8 +/- 0.5 (jitter = 2.2) 9.132
70 images/sec: 107.8 +/- 0.5 (jitter = 2.2) 9.045
80 images/sec: 107.8 +/- 0.4 (jitter = 2.3) 9.092
80 images/sec: 107.8 +/- 0.4 (jitter = 2.2) 9.098
90 images/sec: 107.7 +/- 0.4 (jitter = 2.2) 9.205
90 images/sec: 107.7 +/- 0.4 (jitter = 2.3) 9.145
100 images/sec: 107.7 +/- 0.4 (jitter = 2.1) 9.050
----------------------------------------------------------------
total images/sec: 215.33
----------------------------------------------------------------
100 images/sec: 107.7 +/- 0.4 (jitter = 2.1) 9.013
----------------------------------------------------------------
total images/sec: 215.32
----------------------------------------------------------------

The mpioperator/tensorflow-benchmarks:latest image is used here. Note that the current latest tag is built on CUDA 10, while Huawei Cloud's K8s environment only supports CUDA 9 for now, so I used the 0.2.0 tag instead, which completes the job.
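
To switch tags, just edit the two image references in the manifest; a sed one-liner sketch (assuming the filename used above):

sed -i 's|mpioperator/tensorflow-benchmarks:latest|mpioperator/tensorflow-benchmarks:0.2.0|g' tensorflow-benchmarks.yaml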

A Brief Look at the Implementation

The first thing that appears in the Launcher's log is the dispatch commands:

POD_NAME=tensorflow-benchmarks-worker-1
shift
/opt/kube/kubectl exec tensorflow-benchmarks-worker-1 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2828730368" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-benchmarks-launcher-[1:5]l8hm,tensorflow-benchmarks-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "2828730368.0;tcp://172.16.0.86:37557" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
POD_NAME=tensorflow-benchmarks-worker-0
shift
/opt/kube/kubectl exec tensorflow-benchmarks-worker-0 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2828730368" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-benchmarks-launcher-[1:5]l8hm,tensorflow-benchmarks-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "2828730368.0;tcp://172.16.0.86:37557" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"

The real command to execute is delivered to each worker via /opt/kube/kubectl exec; the only difference between the two commands is the -mca ess_base_vpid rank.

The workers' startup command is sleep 365d; apart from accepting commands from the Launcher, they do nothing at all.
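
You can verify this by inspecting the worker pod spec, for example:

# print the worker container's command as injected by the operator
kubectl get pod tensorflow-benchmarks-worker-0 -n mpi-operator \
    -o jsonpath='{.spec.containers[0].command}'
# expected to show something like ["sleep","365d"]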

So how does the Launcher wrap up the command?

When the MPIJob is started, the operator packs hostfile and kubexec.sh into a ConfigMap.

The hostfile entries are assembled as <job-name>-worker-<id>, while kubexec.sh should be identical for every job.

Therefore each MPIJob has its own corresponding ConfigMap:

# kubectl get configmap tensorflow-benchmarks-config -o yaml -n mpi-operator
apiVersion: v1
data:
  hostfile: |
    tensorflow-benchmarks-worker-0 slots=1
    tensorflow-benchmarks-worker-1 slots=1
  kubexec.sh: |
    #!/bin/sh
    set -x
    POD_NAME=$1
    shift
    /opt/kube/kubectl exec ${POD_NAME} -- /bin/sh -c "$*"
kind: ConfigMap
metadata:
  creationTimestamp: 2019-09-22T04:07:36Z
  labels:
    app: tensorflow-benchmarks
  name: tensorflow-benchmarks-config
  namespace: mpi-operator
  ownerReferences:
  - apiVersion: kubeflow.org/v1alpha2
    blockOwnerDeletion: true
    controller: true
    kind: MPIJob
    name: tensorflow-benchmarks
    uid: 82cd05da-dcee-11e9-ac58-fa163e3a1ebd
  resourceVersion: "39344012"
  selfLink: /api/v1/namespaces/mpi-operator/configmaps/tensorflow-benchmarks-config
  uid: 82cef441-dcee-11e9-860a-fa163e132ef9

So how do these two files get injected into the MPI workflow?

Presumably through the environment variables OMPI_MCA_plm_rsh_agent and OMPI_MCA_orte_default_hostfile: Open MPI reads MCA parameters from such variables, so mpirun ends up using kubexec.sh as its remote launch agent (in place of ssh) and reading the host list from the mounted hostfile (this part is partly my own inference).
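
In Open MPI, any MCA parameter can be set via an environment variable named OMPI_MCA_<param>, so the two variables above are equivalent to passing the corresponding -mca flags on the command line. A sketch (using hostname as a stand-in workload):

# set the parameters through the environment...
export OMPI_MCA_plm_rsh_agent=/etc/mpi/kubexec.sh
export OMPI_MCA_orte_default_hostfile=/etc/mpi/hostfile
mpirun --allow-run-as-root -np 2 hostname

# ...or pass them explicitly; both forms behave the same
mpirun --allow-run-as-root -np 2 \
    -mca plm_rsh_agent /etc/mpi/kubexec.sh \
    -mca orte_default_hostfile /etc/mpi/hostfile \
    hostname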

What Exactly Is MPI?

We have talked about Kubeflow's MPI Operator at length, but for anyone unfamiliar with MPI this is probably still completely foreign territory.

In this example, the final argument is --variable_update=horovod. Horovod is a distributed training framework built on the MPI architecture; I plan to write a separate follow-up post dedicated to Horovod and use it as a way to learn about MPI.