[Kubeflow Series] Deploying Kubeflow

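The official site has installation documentation; see the official installation guide.

In China, however, there are always a few China-specific problems, so I am recording my installation process here.

The environment is Huawei Cloud's Cloud Container Engine (CCE); other environments may differ slightly.

Huawei Cloud also already has a KubeFlow installation guide.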

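Installing dependencies

First, following the official documentation, install these three dependencies:

  1. kubectl
  2. ks
  3. kfctl

Put all three on the PATH. kubectl must be configured for your cluster: kubectl get pod should run without errors.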

mv ks kubectl kfctl /usr/local/bin/
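
As a quick sanity check before moving on, something like the following can confirm that the three binaries are on the PATH. This is only a minimal sketch; the function name missing_tools is my own and not part of any of these tools.

```shell
# Print the name of every given command that is NOT found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage (on the machine where the tools were installed):
#   missing_tools kubectl ks kfctl   # prints nothing when all three are found
#   kubectl get pod                  # should return without errors
```

An empty result only shows that the binaries can be found; whether kubectl can actually reach the cluster still has to be checked with kubectl get pod.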

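I installed Kubeflow v0.5; there is no guarantee these steps still work on later versions.

Generating the configuration files

Run the following commands to generate the configuration files. This step needs internet access, so a VM on Huawei Cloud must have an Elastic IP (EIP) bound.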

mkdir /root/kubeflow
cd /root/kubeflow
export KFAPP=kubeflow
kfctl init ${KFAPP}
cd ${KFAPP}
kfctl generate all -V

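When this finishes, the generated files are placed in this directory.

Replacing the images

The generated configuration points at public Google Cloud images. These are unreachable from networks inside China, so you have to find a way to download them and upload them to Huawei Cloud's SoftWare Repository for Container (SWR).

Downloading and uploading the images

Here is the full list of images I used. With a proxy configured, pull them to the local machine: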

docker pull gcr.io/ml-pipeline/api-server:0.1.16
docker pull gcr.io/ml-pipeline/persistenceagent:0.1.16
docker pull gcr.io/ml-pipeline/scheduledworkflow:0.1.16
docker pull gcr.io/ml-pipeline/frontend:0.1.16
docker pull gcr.io/ml-pipeline/viewer-crd-controller:0.1.16
docker pull gcr.io/kubeflow-images-public/notebook-controller:v20190401-v0.4.0-rc.1-308-g33618cc9-e3b0c4
docker pull gcr.io/kubeflow-images-public/pytorch-operator:v0.5.0
docker pull gcr.io/kubeflow-images-public/katib/studyjob-controller:v0.1.2-alpha-156-g4ab3dbd
docker pull gcr.io/kubeflow-images-public/tf_operator:v0.5.0
docker pull gcr.io/kubeflow-images-public/katib/vizier-core:v0.1.2-alpha-156-g4ab3dbd
docker pull gcr.io/kubeflow-images-public/katib/vizier-core-rest:v0.1.2-alpha-156-g4ab3dbd
docker pull gcr.io/kubeflow-images-public/centraldashboard:v0.5.0
docker pull gcr.io/kubeflow-images-public/jupyter-web-app:v0.5.0
docker pull gcr.io/kubeflow-images-public/katib/katib-ui:v0.1.2-alpha-156-g4ab3dbd
docker pull gcr.io/kubeflow-images-public/katib/suggestion-bayesianoptimization:v0.1.2-alpha-156-g4ab3dbd
docker pull gcr.io/kubeflow-images-public/katib/suggestion-grid:v0.1.2-alpha-156-g4ab3dbd
docker pull gcr.io/kubeflow-images-public/katib/suggestion-hyperband:v0.1.2-alpha-156-g4ab3dbd
docker pull gcr.io/kubeflow-images-public/katib/suggestion-random:v0.1.2-alpha-156-g4ab3dbd
docker pull tensorflow/tensorflow:1.8.0
docker pull mysql:5.6
docker pull mysql:8.0.3
docker pull quay.io/datawire/ambassador:0.37.0
docker pull argoproj/argoui:v2.2.0
docker pull argoproj/workflow-controller:v2.2.0
docker pull metacontroller/metacontroller:v0.3.0
docker pull minio/minio:RELEASE.2018-02-09T22-40-05Z
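
The pulls above only succeed once the Docker daemon itself goes through a proxy. On a systemd host, one common way to set that up is a drop-in file for docker.service. The sketch below writes such a file; the proxy address is a placeholder of mine, and the NO_PROXY entry for .myhuaweicloud.com (so that pushes to SWR skip the proxy) is an assumption about what you want.

```shell
# Placeholder proxy address (an assumption); replace with your own.
PROXY_URL="http://127.0.0.1:8118"

# Write a systemd drop-in with proxy settings for the Docker daemon.
# $1: target directory; on a systemd host this would normally be
#     /etc/systemd/system/docker.service.d
write_docker_proxy_dropin() {
  mkdir -p "$1"
  cat > "$1/http-proxy.conf" <<EOF
[Service]
Environment="HTTP_PROXY=${PROXY_URL}"
Environment="HTTPS_PROXY=${PROXY_URL}"
Environment="NO_PROXY=localhost,127.0.0.1,.myhuaweicloud.com"
EOF
}

# After writing the file as root, reload systemd and restart Docker:
#   write_docker_proxy_dropin /etc/systemd/system/docker.service.d
#   systemctl daemon-reload && systemctl restart docker
```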

Then tag each image for SWR:

docker tag gcr.io/ml-pipeline/api-server:0.1.16 swr.cn-north-5.myhuaweicloud.com/kubeflow/api-server:0.1.16
docker tag gcr.io/ml-pipeline/persistenceagent:0.1.16 swr.cn-north-5.myhuaweicloud.com/kubeflow/persistenceagent:0.1.16
docker tag gcr.io/ml-pipeline/scheduledworkflow:0.1.16 swr.cn-north-5.myhuaweicloud.com/kubeflow/scheduledworkflow:0.1.16
docker tag gcr.io/ml-pipeline/frontend:0.1.16 swr.cn-north-5.myhuaweicloud.com/kubeflow/frontend:0.1.16
docker tag gcr.io/ml-pipeline/viewer-crd-controller:0.1.16 swr.cn-north-5.myhuaweicloud.com/kubeflow/viewer-crd-controller:0.1.16
docker tag gcr.io/kubeflow-images-public/notebook-controller:v20190401-v0.4.0-rc.1-308-g33618cc9-e3b0c4 swr.cn-north-5.myhuaweicloud.com/kubeflow/notebook-controller:v20190401-v0.4.0-rc.1-308-g33618cc9-e3b0c4
docker tag gcr.io/kubeflow-images-public/pytorch-operator:v0.5.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/pytorch-operator:v0.5.0
docker tag gcr.io/kubeflow-images-public/katib/studyjob-controller:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/studyjob-controller:v0.1.2-alpha-156-g4ab3dbd
docker tag gcr.io/kubeflow-images-public/tf_operator:v0.5.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/tf_operator:v0.5.0
docker tag gcr.io/kubeflow-images-public/katib/vizier-core:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/vizier-core:v0.1.2-alpha-156-g4ab3dbd
docker tag gcr.io/kubeflow-images-public/katib/vizier-core-rest:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/vizier-core-rest:v0.1.2-alpha-156-g4ab3dbd
docker tag gcr.io/kubeflow-images-public/centraldashboard:v0.5.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/centraldashboard:v0.5.0
docker tag gcr.io/kubeflow-images-public/jupyter-web-app:v0.5.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/jupyter-web-app:v0.5.0
docker tag gcr.io/kubeflow-images-public/katib/katib-ui:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/katib-ui:v0.1.2-alpha-156-g4ab3dbd
docker tag gcr.io/kubeflow-images-public/katib/suggestion-bayesianoptimization:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-bayesianoptimization:v0.1.2-alpha-156-g4ab3dbd
docker tag gcr.io/kubeflow-images-public/katib/suggestion-grid:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-grid:v0.1.2-alpha-156-g4ab3dbd
docker tag gcr.io/kubeflow-images-public/katib/suggestion-hyperband:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-hyperband:v0.1.2-alpha-156-g4ab3dbd
docker tag gcr.io/kubeflow-images-public/katib/suggestion-random:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-random:v0.1.2-alpha-156-g4ab3dbd
docker tag tensorflow/tensorflow:1.8.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/tensorflow:1.8.0
docker tag mysql:5.6 swr.cn-north-5.myhuaweicloud.com/kubeflow/mysql:5.6
docker tag mysql:8.0.3 swr.cn-north-5.myhuaweicloud.com/kubeflow/mysql:8.0.3
docker tag quay.io/datawire/ambassador:0.37.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/ambassador:0.37.0
docker tag argoproj/argoui:v2.2.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/argoui:v2.2.0
docker tag argoproj/workflow-controller:v2.2.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/workflow-controller:v2.2.0
docker tag metacontroller/metacontroller:v0.3.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/metacontroller:v0.3.0
docker tag minio/minio:RELEASE.2018-02-09T22-40-05Z swr.cn-north-5.myhuaweicloud.com/kubeflow/minio:RELEASE.2018-02-09T22-40-05Z

Then push them to SWR:

docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/api-server:0.1.16
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/persistenceagent:0.1.16
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/scheduledworkflow:0.1.16
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/frontend:0.1.16
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/viewer-crd-controller:0.1.16
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/notebook-controller:v20190401-v0.4.0-rc.1-308-g33618cc9-e3b0c4
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/pytorch-operator:v0.5.0
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/studyjob-controller:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/tf_operator:v0.5.0
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/vizier-core:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/vizier-core-rest:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/centraldashboard:v0.5.0
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/jupyter-web-app:v0.5.0
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/katib-ui:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-bayesianoptimization:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-grid:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-hyperband:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-random:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/tensorflow:1.8.0
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/mysql:5.6
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/mysql:8.0.3
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/ambassador:0.37.0
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/argoui:v2.2.0
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/workflow-controller:v2.2.0
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/metacontroller:v0.3.0
docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/minio:RELEASE.2018-02-09T22-40-05Z
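
The three lists above follow one pattern: pull the source image, retag it under the SWR prefix using only the last path component, and push. A loop along these lines could replace the hand-written lists. It is a sketch, not the method used above; the function names, the one-image-per-line input file, and the DRY_RUN switch are my own additions.

```shell
# Target registry and namespace, as used in the commands above.
SWR_PREFIX="swr.cn-north-5.myhuaweicloud.com/kubeflow"

# Map a source image to its SWR name: drop the registry and any
# intermediate path, keep only the last component and the tag.
swr_target() {
  echo "${SWR_PREFIX}/${1##*/}"
}

# Pull, tag, and push every image listed one per line in the file $1.
# Set DRY_RUN=1 to print the docker commands instead of running them.
mirror_images() {
  while read -r src; do
    [ -n "$src" ] || continue   # skip blank lines
    dst=$(swr_target "$src")
    for cmd in "pull $src" "tag $src $dst" "push $dst"; do
      if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker $cmd"
      else
        # Unquoted on purpose: $cmd is split into docker arguments.
        docker $cmd || return 1
      fi
    done
  done < "$1"
}
```

Running it once with DRY_RUN=1 and comparing the output against the lists above is a cheap way to check the name mapping before anything is pushed.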

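Huawei Cloud has a few restrictions to be aware of:

  1. The path cannot have multiple levels, i.e. the image name cannot contain /
  2. The image name cannot contain a dot (.). Some images carry a version number in their name, so the original name has to be changed; dots are still allowed in the tag.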
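
Both restrictions can be applied mechanically when deriving the target name. The sketch below keeps only the last path component and replaces dots in the name part, leaving the tag alone. Replacing "." with "-" is my own convention rather than anything SWR requires, and the example image in the comment is hypothetical.

```shell
# Turn a source image reference into "name:tag" where the name is
# single-level (no "/") and contains no "."; the tag is left untouched.
swr_safe_ref() {
  ref=${1##*/}                      # last path component, e.g. "name:tag"
  case "$ref" in
    *:*) name=${ref%%:*}; tag=":${ref#*:}" ;;
    *)   name=$ref;       tag="" ;;
  esac
  echo "$(printf '%s' "$name" | tr '.' '-')${tag}"
}

# Example with a hypothetical image:
#   swr_safe_ref example.com/team/my.app:1.2.3
```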

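The list above may also be incomplete, since components such as ModelDB were not installed. Here is a longer push list (note that it uses a different region and organization, cn-north-1 and hzw-kubeflow):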

docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/api-server:0.1.16
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/persistenceagent:0.1.16
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/scheduledworkflow:0.1.16
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/frontend:0.1.16
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/viewer-crd-controller:0.1.16
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/notebook-controller:v20190401-v0.4.0-rc.1-308-g33618cc9-e3b0c4
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/pytorch-operator:v0.5.0
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/studyjob-controller:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/tf_operator:v0.5.0
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/vizier-core:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/vizier-core-rest:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/centraldashboard:v0.5.0
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/jupyter-web-app:v0.5.0
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/katib-ui:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/suggestion-bayesianoptimization:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/suggestion-grid:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/suggestion-hyperband:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/suggestion-random:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/tensorflow:1.8.0
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/mysql:5.6
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/mysql:8.0.3
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/ambassador:0.37.0
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/argoui:v2.2.0
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/workflow-controller:v2.2.0
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/metacontroller:v0.3.0
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/minio:RELEASE.2018-02-09T22-40-05Z
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/volume-nfs:0.8
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/argoexec:v2.2.0
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/metrics-collector:v0.1.2-alpha-156-g4ab3dbd
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/modeldb-artifact-store:kubeflow
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/modeldb-backend:kubeflow
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/modeldb-backend-proxy:kubeflow
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/mysql:5.7
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/modeldb-frontend:kubeflow
docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/serving:1.11.1

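That longer list comes to a full 35 images.

Updating the image references

After some digging, I found that all the image references live in these two files: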

vim ks_app/vendor/kubeflow/pipeline/pipeline.libsonnet
vim ks_app/components/params.libsonnet

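Open both files in vim and search with /Image\c (the \c makes the search case-insensitive) to find every Image entry, then change each value to the image's address on SWR.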
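
If editing by hand gets tedious, the same substitution can be roughed out with sed. This is only a sketch under assumptions I have not verified against the generated files: that image values are double-quoted "path/name:tag" strings, and that they sit on lines whose key contains "image" or "Image". Review the diff before replacing anything.

```shell
SWR_PREFIX="swr.cn-north-5.myhuaweicloud.com/kubeflow"

# Read text on stdin. On lines mentioning "image"/"Image", rewrite each
# double-quoted "path/name:tag" value to "<SWR_PREFIX>/name:tag".
rewrite_images() {
  sed -E '/[Ii]mage/ s#"([A-Za-z0-9._-]+/)*([A-Za-z0-9._-]+:[A-Za-z0-9._-]+)"#"'"$SWR_PREFIX"'/\2"#g'
}

# Usage: write to a new file, inspect the diff, then replace by hand.
#   f=ks_app/components/params.libsonnet
#   rewrite_images < "$f" > "$f.new" && diff "$f" "$f.new"
```

The rewrite is idempotent: a value that already points at the SWR prefix is mapped to itself.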

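How to find which images need replacing

There are two ways:

  1. Search the two files above and collect the list of images they reference
  2. Deploy first, then see which images fail to pull

The first is faster in practice and covers essentially all of the images.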
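
The first approach can be approximated with grep. The sketch below prints the unique quoted name:tag values found on lines that mention "image". It assumes the references are written as quoted strings with an explicit tag, which I have not checked for every entry, so treat the output as a starting list rather than a complete one.

```shell
# Print the unique quoted "...name:tag" values found on lines that
# mention "image" (any case) in the given files.
list_images() {
  grep -ih 'image' "$@" |
    grep -oE "[\"'][A-Za-z0-9._/-]+:[A-Za-z0-9._-]+[\"']" |
    tr -d "\"'" |
    sort -u
}

# Usage:
#   list_images ks_app/vendor/kubeflow/pipeline/pipeline.libsonnet \
#               ks_app/components/params.libsonnet
```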

Installing the Kubeflow components

Run the following command to deploy Kubeflow into k8s:

kfctl apply all -V

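At this point you might think the job is done, but a look at the k8s console shows that the Pods cannot start. Let's fix the problems one by one.

Most Kubeflow components are Deployments, so their status is visible on the Workloads -> Stateless page.

Image pull permissions

The first problem: the Pods could never pull their images. After a long search I found that Huawei Cloud has this nasty restriction: imagePullSecrets must be specified explicitly.

So let's fix it.

First, run the following command to list all the serviceaccounts: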

kubectl get serviceaccount -n kubeflow

Next, run the following commands to add a default imagePullSecrets entry to each serviceaccount:

kubectl -n kubeflow patch serviceaccount ambassador -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount argo -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount argo-ui -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount centraldashboard -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount default -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount jupyter-notebook -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount jupyter-web-app -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount katib-ui -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount meta-controller-service -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount metrics-collector -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount ml-pipeline -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount ml-pipeline-persistenceagent -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount ml-pipeline-scheduledworkflow -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount ml-pipeline-ui -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount ml-pipeline-viewer-crd-service-account -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount notebook-controller -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount pipeline-runner -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount pytorch-operator -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount studyjob-controller -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount tf-job-dashboard -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount tf-job-operator -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
kubectl -n kubeflow patch serviceaccount vizier-core -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
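
Since every command above applies the same patch, the two steps can be combined into a loop over whatever serviceaccounts exist in the namespace. A minimal sketch, assuming the secret is named default-secret as above; the function name is mine.

```shell
NAMESPACE="kubeflow"
SECRET="default-secret"
PATCH="{\"imagePullSecrets\": [{\"name\": \"${SECRET}\"}]}"

# Patch every serviceaccount in $NAMESPACE with the same imagePullSecrets.
# `get -o name` prints entries like "serviceaccount/ambassador", which
# `kubectl patch` accepts directly as TYPE/NAME.
patch_all_serviceaccounts() {
  kubectl -n "$NAMESPACE" get serviceaccount -o name |
    while read -r sa; do
      kubectl -n "$NAMESPACE" patch "$sa" -p "$PATCH"
    done
}
```

This also covers serviceaccounts that a hand-written list might miss. It has to be rerun if more are created later.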

Delete the failed Pods. After a while most of them come up normally, though a few still have problems.
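
The Pods have to be deleted because imagePullSecrets from a serviceaccount are only injected when a Pod is created. To avoid picking the failed ones out by eye, a filter like the one below can help. It is a sketch that relies on the default column layout of kubectl get pods (NAME, READY, STATUS, ...); the function name is my own.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the names
# of Pods whose STATUS column indicates an image pull failure.
image_pull_failures() {
  awk '$3 ~ /ImagePullBackOff|ErrImagePull/ { print $1 }'
}

# Usage:
#   kubectl -n kubeflow get pods --no-headers | image_pull_failures |
#     while read -r pod; do kubectl -n kubeflow delete pod "$pod"; done
```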

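Creating the database PVCs

By now you should notice that the mysql-related workloads stay yellow. Storage is the likely culprit; check the PVC status with the command below and, sure enough: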

kubectl get pvc -n kubeflow
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
katib-mysql Pending 12h
minio-pvc Pending 12h
mysql-pv-claim Pending 12h

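So create the three corresponding PVCs manually in the management console (the screenshot shows an extra notebook PVC; ignore it).

PVCs on CCE always carry a prefix such as cce-sfs-, so the PVC names end up different and the YAML of each corresponding Deployment has to be updated to match.

This can be done directly in the Huawei Cloud console: open the Deployment's configuration and use Edit YAML to make the change.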
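
To see from the command line which Deployments reference a given claim, a jsonpath query can list each Deployment together with the claim names in its Pod template. This is a sketch; the function name is mine.

```shell
# Print the names of Deployments in namespace $1 whose Pod template mounts
# the PersistentVolumeClaim named $2.
deployments_using_claim() {
  kubectl -n "$1" get deployment \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}' |
    awk -v claim="$2" '{ for (i = 2; i <= NF; i++) if ($i == claim) { print $1; break } }'
}

# Usage:
#   deployments_using_claim kubeflow mysql-pv-claim
```

Each Deployment it prints can then be opened with kubectl -n kubeflow edit deployment NAME to change claimName to the prefixed PVC name, as an alternative to the console.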

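After restarting the Deployments, these workloads turn green as well.

Changing the vizier-db configuration

One workload still stubbornly refused to run. The events showed the readinessProbe failing: initialDelaySeconds defaults to 1s, which is too short for startup. After changing it to 15 seconds, the workload started normally.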

readinessProbe:
  exec:
    command:
      - /bin/bash
      - '-c'
      - '"""mysql -D $$MYSQL_DATABASE -p$$MYSQL_ROOT_PASSWORD -e ''SELECT 1''"""'
  initialDelaySeconds: 15
  timeoutSeconds: 1
  periodSeconds: 2
  successThreshold: 1
  failureThreshold: 3
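
The same change can be made without the console by sending a JSON Patch with kubectl patch. The sketch below only builds the patch document. It assumes the Deployment is named vizier-db in the kubeflow namespace and that the probe belongs to the first container in the Pod template, neither of which I have verified.

```shell
# Build a JSON Patch that sets readinessProbe.initialDelaySeconds on the
# first container of a Deployment's Pod template to $1 seconds.
# "add" is used because it also overwrites the value if it already exists.
probe_delay_patch() {
  printf '[{"op":"add","path":"/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds","value":%d}]' "$1"
}

# Usage (names are assumptions, see above):
#   kubectl -n kubeflow patch deployment vizier-db --type=json \
#     -p "$(probe_delay_patch 15)"
```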

At this point all the Deployments are running normally.

All done!

To finish, here is a screenshot of the dashboard.