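The official site has installation documentation; see the official installation guide.
Installing from inside China always runs into some China-specific problems, though, so I am recording my installation process here.
The environment is Huawei Cloud's Cloud Container Engine (CCE) service; other environments may differ slightly.
Huawei Cloud also already has its own Kubeflow installation guide.

Installing the dependencies

First, following the official documentation, install these three dependencies: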
kubectl
ks
kfctl
Put all three on your PATH. kubectl also has to be configured for the cluster: kubectl get pod must run without errors.

    mv ks kubectl kfctl /usr/local/bin/
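Before going further it can help to confirm that the three binaries are really visible. The following is a minimal sketch, not part of the official guide; the helper name missing_tools is made up for illustration.

```shell
# Print each given command that is NOT on PATH (prints nothing if all are found).
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools kubectl ks kfctl)
if [ -n "$missing" ]; then
  echo "not on PATH: $missing" >&2
else
  # kubectl is only useful once it can actually reach the cluster.
  kubectl get pod >/dev/null || echo "kubectl cannot reach the cluster" >&2
fi
```

If nothing is printed, all three tools were found and kubectl could talk to the cluster.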
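I installed Kubeflow v0.5; there is no guarantee these steps still work on later versions.

Generating the configuration files

Run the following commands to generate the configuration files. This step needs internet access, so a Huawei Cloud VM must have an Elastic IP (EIP) bound to it.

    mkdir /root/kubeflow
    cd /root/kubeflow
    export KFAPP=kubeflow
    kfctl init ${KFAPP}
    cd ${KFAPP}
    kfctl generate all -V

When it finishes, the configuration files have been generated under this directory.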
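Replacing the images

The generated configuration references public Google Cloud images by default. These are not reachable from networks inside China, so you have to find a way to download them and upload them to Huawei Cloud's container image service (SWR).

Downloading and uploading the images

Below is the full list of images I used. After configuring a proxy, pull them to the local machine: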
    docker pull gcr.io/ml-pipeline/api-server:0.1.16
    docker pull gcr.io/ml-pipeline/persistenceagent:0.1.16
    docker pull gcr.io/ml-pipeline/scheduledworkflow:0.1.16
    docker pull gcr.io/ml-pipeline/frontend:0.1.16
    docker pull gcr.io/ml-pipeline/viewer-crd-controller:0.1.16
    docker pull gcr.io/kubeflow-images-public/notebook-controller:v20190401-v0.4.0-rc.1-308-g33618cc9-e3b0c4
    docker pull gcr.io/kubeflow-images-public/pytorch-operator:v0.5.0
    docker pull gcr.io/kubeflow-images-public/katib/studyjob-controller:v0.1.2-alpha-156-g4ab3dbd
    docker pull gcr.io/kubeflow-images-public/tf_operator:v0.5.0
    docker pull gcr.io/kubeflow-images-public/katib/vizier-core:v0.1.2-alpha-156-g4ab3dbd
    docker pull gcr.io/kubeflow-images-public/katib/vizier-core-rest:v0.1.2-alpha-156-g4ab3dbd
    docker pull gcr.io/kubeflow-images-public/centraldashboard:v0.5.0
    docker pull gcr.io/kubeflow-images-public/jupyter-web-app:v0.5.0
    docker pull gcr.io/kubeflow-images-public/katib/katib-ui:v0.1.2-alpha-156-g4ab3dbd
    docker pull gcr.io/kubeflow-images-public/katib/suggestion-bayesianoptimization:v0.1.2-alpha-156-g4ab3dbd
    docker pull gcr.io/kubeflow-images-public/katib/suggestion-grid:v0.1.2-alpha-156-g4ab3dbd
    docker pull gcr.io/kubeflow-images-public/katib/suggestion-hyperband:v0.1.2-alpha-156-g4ab3dbd
    docker pull gcr.io/kubeflow-images-public/katib/suggestion-random:v0.1.2-alpha-156-g4ab3dbd
    docker pull tensorflow/tensorflow:1.8.0
    docker pull mysql:5.6
    docker pull mysql:8.0.3
    docker pull quay.io/datawire/ambassador:0.37.0
    docker pull argoproj/argoui:v2.2.0
    docker pull argoproj/workflow-controller:v2.2.0
    docker pull metacontroller/metacontroller:v0.3.0
    docker pull minio/minio:RELEASE.2018-02-09T22-40-05Z
Then add the SWR tags:
    docker tag gcr.io/ml-pipeline/api-server:0.1.16 swr.cn-north-5.myhuaweicloud.com/kubeflow/api-server:0.1.16
    docker tag gcr.io/ml-pipeline/persistenceagent:0.1.16 swr.cn-north-5.myhuaweicloud.com/kubeflow/persistenceagent:0.1.16
    docker tag gcr.io/ml-pipeline/scheduledworkflow:0.1.16 swr.cn-north-5.myhuaweicloud.com/kubeflow/scheduledworkflow:0.1.16
    docker tag gcr.io/ml-pipeline/frontend:0.1.16 swr.cn-north-5.myhuaweicloud.com/kubeflow/frontend:0.1.16
    docker tag gcr.io/ml-pipeline/viewer-crd-controller:0.1.16 swr.cn-north-5.myhuaweicloud.com/kubeflow/viewer-crd-controller:0.1.16
    docker tag gcr.io/kubeflow-images-public/notebook-controller:v20190401-v0.4.0-rc.1-308-g33618cc9-e3b0c4 swr.cn-north-5.myhuaweicloud.com/kubeflow/notebook-controller:v20190401-v0.4.0-rc.1-308-g33618cc9-e3b0c4
    docker tag gcr.io/kubeflow-images-public/pytorch-operator:v0.5.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/pytorch-operator:v0.5.0
    docker tag gcr.io/kubeflow-images-public/katib/studyjob-controller:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/studyjob-controller:v0.1.2-alpha-156-g4ab3dbd
    docker tag gcr.io/kubeflow-images-public/tf_operator:v0.5.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/tf_operator:v0.5.0
    docker tag gcr.io/kubeflow-images-public/katib/vizier-core:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/vizier-core:v0.1.2-alpha-156-g4ab3dbd
    docker tag gcr.io/kubeflow-images-public/katib/vizier-core-rest:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/vizier-core-rest:v0.1.2-alpha-156-g4ab3dbd
    docker tag gcr.io/kubeflow-images-public/centraldashboard:v0.5.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/centraldashboard:v0.5.0
    docker tag gcr.io/kubeflow-images-public/jupyter-web-app:v0.5.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/jupyter-web-app:v0.5.0
    docker tag gcr.io/kubeflow-images-public/katib/katib-ui:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/katib-ui:v0.1.2-alpha-156-g4ab3dbd
    docker tag gcr.io/kubeflow-images-public/katib/suggestion-bayesianoptimization:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-bayesianoptimization:v0.1.2-alpha-156-g4ab3dbd
    docker tag gcr.io/kubeflow-images-public/katib/suggestion-grid:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-grid:v0.1.2-alpha-156-g4ab3dbd
    docker tag gcr.io/kubeflow-images-public/katib/suggestion-hyperband:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-hyperband:v0.1.2-alpha-156-g4ab3dbd
    docker tag gcr.io/kubeflow-images-public/katib/suggestion-random:v0.1.2-alpha-156-g4ab3dbd swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-random:v0.1.2-alpha-156-g4ab3dbd
    docker tag tensorflow/tensorflow:1.8.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/tensorflow:1.8.0
    docker tag mysql:5.6 swr.cn-north-5.myhuaweicloud.com/kubeflow/mysql:5.6
    docker tag mysql:8.0.3 swr.cn-north-5.myhuaweicloud.com/kubeflow/mysql:8.0.3
    docker tag quay.io/datawire/ambassador:0.37.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/ambassador:0.37.0
    docker tag argoproj/argoui:v2.2.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/argoui:v2.2.0
    docker tag argoproj/workflow-controller:v2.2.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/workflow-controller:v2.2.0
    docker tag metacontroller/metacontroller:v0.3.0 swr.cn-north-5.myhuaweicloud.com/kubeflow/metacontroller:v0.3.0
    docker tag minio/minio:RELEASE.2018-02-09T22-40-05Z swr.cn-north-5.myhuaweicloud.com/kubeflow/minio:RELEASE.2018-02-09T22-40-05Z
Then push them to SWR:
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/api-server:0.1.16
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/persistenceagent:0.1.16
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/scheduledworkflow:0.1.16
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/frontend:0.1.16
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/viewer-crd-controller:0.1.16
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/notebook-controller:v20190401-v0.4.0-rc.1-308-g33618cc9-e3b0c4
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/pytorch-operator:v0.5.0
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/studyjob-controller:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/tf_operator:v0.5.0
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/vizier-core:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/vizier-core-rest:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/centraldashboard:v0.5.0
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/jupyter-web-app:v0.5.0
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/katib-ui:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-bayesianoptimization:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-grid:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-hyperband:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/suggestion-random:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/tensorflow:1.8.0
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/mysql:5.6
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/mysql:8.0.3
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/ambassador:0.37.0
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/argoui:v2.2.0
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/workflow-controller:v2.2.0
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/metacontroller:v0.3.0
    docker push swr.cn-north-5.myhuaweicloud.com/kubeflow/minio:RELEASE.2018-02-09T22-40-05Z
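The three command lists follow one pattern: pull, retag under the SWR prefix using only the last path segment, and push. The sketch below expresses that pattern as a loop. It is an illustration rather than the script I actually ran; the function names and the DOCKER dry-run switch are my own, and the prefix is simply the one used in the commands above.

```shell
# Assumed prefix, copied from the commands above; change it to your own region/organization.
SWR_PREFIX="${SWR_PREFIX:-swr.cn-north-5.myhuaweicloud.com/kubeflow}"
# Dry run by default: commands are only printed. Set DOCKER=docker to execute them.
DOCKER="${DOCKER:-echo docker}"

# Map a source image to its SWR name: keep only the last path segment and the tag.
swr_target() {
  printf '%s/%s\n' "$SWR_PREFIX" "${1##*/}"
}

# Pull, retag, and push every image passed as an argument.
mirror_images() {
  for src in "$@"; do
    dst=$(swr_target "$src")
    $DOCKER pull "$src" && $DOCKER tag "$src" "$dst" && $DOCKER push "$dst"
  done
}

mirror_images \
  gcr.io/ml-pipeline/api-server:0.1.16 \
  gcr.io/kubeflow-images-public/katib/vizier-core:v0.1.2-alpha-156-g4ab3dbd \
  mysql:5.6
```

Run as-is it only prints the docker commands, which makes it easy to compare against the lists above before switching to DOCKER=docker.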
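Huawei Cloud has a few restrictions to watch out for:
The path cannot have multiple levels, i.e. the image name cannot contain /.
The path cannot contain a dot (.). Some images carry a version number in their name, so the original name has to be changed; dots are still allowed in the tag, though.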
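The two naming restrictions can be applied mechanically. This is a small sketch under my own assumptions: the function name swr_repo_name is made up, and replacing dots with hyphens is just one possible renaming, not something SWR prescribes.

```shell
# Turn an image reference into a "name:tag" that fits the two restrictions:
# drop every path level, and replace dots in the name (but not in the tag).
swr_repo_name() {
  ref="${1##*/}"                      # strip registry host and path levels
  case "$ref" in
    *:*) name="${ref%%:*}"; tag=":${ref#*:}" ;;
    *)   name="$ref";       tag="" ;;
  esac
  # Assumption: dots in the name become hyphens.
  printf '%s%s\n' "$(printf '%s' "$name" | tr '.' '-')" "$tag"
}

swr_repo_name gcr.io/kubeflow-images-public/katib/vizier-core:v0.1.2-alpha-156-g4ab3dbd
```

The tag is passed through unchanged, since dots are accepted there.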
The image list above may also be incomplete, because components such as ModelDB were not installed. A longer push list looks like this:
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/api-server:0.1.16
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/persistenceagent:0.1.16
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/scheduledworkflow:0.1.16
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/frontend:0.1.16
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/viewer-crd-controller:0.1.16
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/notebook-controller:v20190401-v0.4.0-rc.1-308-g33618cc9-e3b0c4
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/pytorch-operator:v0.5.0
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/studyjob-controller:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/tf_operator:v0.5.0
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/vizier-core:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/vizier-core-rest:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/centraldashboard:v0.5.0
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/jupyter-web-app:v0.5.0
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/katib-ui:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/suggestion-bayesianoptimization:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/suggestion-grid:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/suggestion-hyperband:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/suggestion-random:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/tensorflow:1.8.0
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/mysql:5.6
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/mysql:8.0.3
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/ambassador:0.37.0
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/argoui:v2.2.0
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/workflow-controller:v2.2.0
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/metacontroller:v0.3.0
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/minio:RELEASE.2018-02-09T22-40-05Z
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/volume-nfs:0.8
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/argoexec:v2.2.0
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/metrics-collector:v0.1.2-alpha-156-g4ab3dbd
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/modeldb-artifact-store:kubeflow
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/modeldb-backend:kubeflow
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/modeldb-backend-proxy:kubeflow
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/mysql:5.7
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/modeldb-frontend:kubeflow
    docker push swr.cn-north-1.myhuaweicloud.com/hzw-kubeflow/serving:1.11.1
This longer list contains a full 35 images.
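Replacing the image references

After some investigation, it turned out that all the image references live in these two files:

    vim ks_app/vendor/kubeflow/pipeline/pipeline.libsonnet
    vim ks_app/components/params.libsonnet

Open both files in vim and search with /Image\c, which matches every occurrence of Image regardless of case, then change the value that follows each match to the image address on SWR.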
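If editing by hand gets tedious, part of the substitution can be scripted. The sketch below is an assumption-heavy illustration: it only handles references that start with gcr.io/ or quay.io/ and carry their tag in the same string, and it reuses the SWR prefix from the docker tag commands. Docker Hub images such as mysql:5.6 would still need manual edits.

```shell
# Assumed SWR prefix, copied from the docker tag commands earlier in this post.
SWR_PREFIX="${SWR_PREFIX:-swr.cn-north-5.myhuaweicloud.com/kubeflow}"

# Print FILE with every gcr.io/... or quay.io/... "path/name:tag" reference
# rewritten to "$SWR_PREFIX/name:tag" (intermediate path levels are dropped).
rewrite_images() {
  sed -E "s#(gcr|quay)\.io/[A-Za-z0-9._/-]*/([A-Za-z0-9._-]+:)#${SWR_PREFIX}/\2#g" "$1"
}

# Preview the changes for the two files named above, if they exist here.
for f in ks_app/vendor/kubeflow/pipeline/pipeline.libsonnet \
         ks_app/components/params.libsonnet; do
  if [ -f "$f" ]; then
    rewrite_images "$f" | diff "$f" - || true
  fi
done
```

The loop only shows a diff; nothing is written back, so the result can be reviewed before replacing the files.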
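How to find which images need replacing

There are two ways:
Search the two files above and extract the list of images.
Deploy first, then see which images fail to pull.
In practice the first approach is faster, and it covers essentially all of the images.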
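The first approach can be approximated with grep. This is a rough sketch, assuming each image is written as a quoted "name:tag" string on a line that mentions image; the helper name list_images is hypothetical and the pattern may need tuning for your generated files.

```shell
# Print the distinct quoted "name:tag" strings found on lines that mention
# "image" (case-insensitive) in the given files.
list_images() {
  grep -ih 'image' "$@" | grep -oE '"[^" ]+:[^" ]+"' | tr -d '"' | sort -u
}

# Run it against the two files if we are inside the generated app directory.
if [ -f ks_app/components/params.libsonnet ]; then
  list_images ks_app/vendor/kubeflow/pipeline/pipeline.libsonnet \
              ks_app/components/params.libsonnet
fi
```

The output is one image per line, which can be fed straight into the pull, tag, and push steps.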
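Installing the Kubeflow components

Run the deploy command to deploy Kubeflow into Kubernetes (for v0.5 the official guide uses kfctl apply all -V from inside the ${KFAPP} directory).
At this point you might think the job is done, but a look at the Kubernetes console shows that the Pods cannot start. Let's fix the problems one by one.
Most Kubeflow components are Deployments, so their status is visible in the console under Workloads -> Stateless.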
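The same status check can be done from the command line. A minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS, ...); the helper name unhealthy_pods is made up.

```shell
# Read `kubectl get pods` output on stdin and print "NAME STATUS" for every Pod
# whose STATUS column is neither Running nor Completed.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Only query the cluster when kubectl is actually available.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -n kubeflow | unhealthy_pods
fi
```

An empty result means every Pod in the kubeflow namespace is running or has completed.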
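Image pull permissions

The first problem: the Pods could never pull their images. After a lot of digging, I found yet another annoying Huawei Cloud restriction: imagePullSecrets must be specified explicitly.
So let's fix it.
First, run the following command to list all the service accounts:

    kubectl get serviceaccount -n kubeflow

Next, run the following commands to add the default imagePullSecrets to each service account: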
    kubectl -n kubeflow patch serviceaccount ambassador -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount argo -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount argo-ui -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount centraldashboard -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount default -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount jupyter-notebook -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount jupyter-web-app -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount katib-ui -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount meta-controller-service -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount metrics-collector -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount ml-pipeline -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount ml-pipeline-persistenceagent -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount ml-pipeline-scheduledworkflow -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount ml-pipeline-ui -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount ml-pipeline-viewer-crd-service-account -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount notebook-controller -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount pipeline-runner -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount pytorch-operator -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount studyjob-controller -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount tf-job-dashboard -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount tf-job-operator -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
    kubectl -n kubeflow patch serviceaccount vizier-core -p '{"imagePullSecrets": [{"name": "default-secret"}]}'
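Since every service account gets the same patch, the list can be collapsed into a loop. The following is a sketch rather than what I ran: the function name and the KUBECTL dry-run switch are my own additions, while the namespace and the default-secret name come from the commands above.

```shell
# Dry run by default: commands are only printed. Set KUBECTL=kubectl to execute them.
KUBECTL="${KUBECTL:-echo kubectl}"
PATCH='{"imagePullSecrets": [{"name": "default-secret"}]}'

# Read service account names on stdin, one per line, and patch each one.
patch_service_accounts() {
  while IFS= read -r sa; do
    [ -n "$sa" ] || continue
    $KUBECTL -n kubeflow patch serviceaccount "$sa" -p "$PATCH"
  done
}

printf '%s\n' ambassador argo default | patch_service_accounts
```

To cover whatever accounts actually exist, the names could instead be fed from kubectl get serviceaccount -n kubeflow -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'.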
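Delete the failed Pods. After a while most of them come up normally, but a few still have problems.

Creating the database PVCs

By now you should notice that the mysql-related workloads stay yellow. Storage is the likely culprit, and checking the PVC status with the command below confirms it:

    kubectl get pvc -n kubeflow
    NAME             STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    katib-mysql      Pending                                                     12h
    minio-pvc        Pending                                                     12h
    mysql-pv-claim   Pending                                                     12h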
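So create the three corresponding PVCs by hand in the management console (the screenshot shows an extra notebook PVC; ignore it).
PVCs created through CCE always carry a prefix such as cce-sfs-, so the PVC names actually change, and the YAML of the corresponding Deployments has to be updated to match.
This can be done directly in the Huawei Cloud console: open the settings of the relevant Deployment and use Edit YAML to make the change.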
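The same change can in principle be made from the command line with a JSON patch. This is only a sketch: the helper name claim_patch, the volume index 0, and the PVC name are hypothetical, so check the Deployment's YAML for the real index and the real cce-sfs- name first.

```shell
# Build a JSON patch that points volume number $1 of a Deployment's Pod template
# at the PVC named $2.
claim_patch() {
  printf '[{"op":"replace","path":"/spec/template/spec/volumes/%s/persistentVolumeClaim/claimName","value":"%s"}]\n' "$1" "$2"
}

# Example with hypothetical values: volume 0 and a made-up CCE PVC name.
claim_patch 0 cce-sfs-example-mysql-pv-claim
```

The printed patch could then be applied with kubectl -n kubeflow patch deployment NAME --type=json -p "$(claim_patch 0 YOUR_PVC_NAME)", where NAME and YOUR_PVC_NAME are placeholders.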
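After restarting the Deployments, these workloads turn green as well.

Fixing the vizier-db configuration

One workload still stubbornly refused to start. Its events showed the readinessProbe failing: initialDelaySeconds defaults to 1 second, which is too short for startup. After changing it to 15 seconds, the workload started normally.

    readinessProbe:
      exec:
        command:
          - /bin/bash
          - '-c'
          - '"""mysql -D $$MYSQL_DATABASE -p$$MYSQL_ROOT_PASSWORD -e ''SELECT 1''"""'
      initialDelaySeconds: 15  # raised from the default of 1
      timeoutSeconds: 1
      periodSeconds: 2
      successThreshold: 1
      failureThreshold: 3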
At this point all the Deployments are running normally.

Wrapping up

Finally, here is a screenshot of the dashboard.