Integrating Kubernetes with Ceph

Posted by elrond on June 19, 2021

1. Versions

Software          Version       Notes
docker            19.03.14
kubernetes        1.19.7
ceph              14.2.21       nautilus
ceph-csi          release-v3.3  manually change quay.io/cephcsi/cephcsi:v3.3.0 to quay.io/cephcsi/cephcsi:v3.2-canary, otherwise authentication fails because of the Ceph version mismatch
external-storage  master

2. Block Storage

2.1. Preparation

  • Create a Ceph pool
ceph osd pool create k8s 64 64
rbd pool init k8s
  • Get the Ceph keyring
ceph auth get client.admin
# Output:
key = AQDk18FgMo7NABAA4ufuz3O6/0lE4vsVgHs1yQ==
  • Get the cluster ID
ceph -s
# Output:
    id:     8cfb6405-d75e-466a-8abf-51ba0480d783
...
  • Get the monitor info
ceph mon stat
# Output:
e2: 3 mons at {ceph01=[v2:172.16.2.237:3300/0,v1:172.16.2.237:6789/0],ceph02=[v2:172.16.2.238:3300/0,v1:172.16.2.238:6789/0],ceph03=[v2:172.16.2.239:3300/0,v1:172.16.2.239:6789/0]}, election epoch 10, leader 0 ceph01, quorum 0,1,2 ceph01,ceph02,ceph03
  • Install ceph-common on every client node
yum install ceph-common -y 

2.2. CSI mode (currently in use)

2.2.1. Configure the ConfigMap

  • Generate the ConfigMap

    Fields to modify:

    • clusterID
    • monitors
cat <<EOF > csi-config-map.yaml
---
apiVersion: v1
kind: ConfigMap
data:
  config.json: |-
    [
      {
        "clusterID": "8cfb6405-d75e-466a-8abf-51ba0480d783",
        "monitors": [
          "172.16.2.237:6789",
          "172.16.2.238:6789",
          "172.16.2.239:6789"
        ]
      }
    ]
metadata:
  name: ceph-csi-config
EOF
  • Import the ConfigMap

kubectl apply -f csi-config-map.yaml
  • Configure KMS
cat <<EOF > csi-kms-config-map.yaml
---
apiVersion: v1
kind: ConfigMap
data:
  config.json: |-
    {}
metadata:
  name: ceph-csi-encryption-kms-config
EOF
kubectl apply -f csi-kms-config-map.yaml

2.2.2. Configure the Secret

  • Generate the Secret. Fields to modify:
    • userID
    • userKey
cat <<EOF > csi-rbd-secret.yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: csi-rbd-secret
  namespace: default
stringData:
  userID: admin
  userKey: AQDk18FgMo7NABAA4ufuz3O6/0lE4vsVgHs1yQ==
EOF
  • Import the Secret
kubectl apply -f csi-rbd-secret.yaml

2.2.3. Configure RBAC

kubectl apply -f https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-provisioner-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-nodeplugin-rbac.yaml

2.2.4. Configure the provisioner and node plugins

Note ⚠️: k8s.gcr.io is not reachable from mainland China, so replace k8s.gcr.io/sig-storage with quay.io/k8scsi in the two files below.

wget https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-rbdplugin-provisioner.yaml
wget https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-rbdplugin.yaml
sed -i 's#k8s.gcr.io/sig-storage#quay.io/k8scsi#g' csi-rbdplugin-provisioner.yaml csi-rbdplugin.yaml
kubectl apply -f  csi-rbdplugin-provisioner.yaml
kubectl apply -f csi-rbdplugin.yaml

2.2.5. Configure the StorageClass

  • Create the config. Parameters to modify:
    • clusterID
    • pool
cat <<EOF > csi-rbd-sc.yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: csi-rbd-sc
provisioner: rbd.csi.ceph.com
parameters:
   clusterID: 8cfb6405-d75e-466a-8abf-51ba0480d783
   pool: k8s
   imageFeatures: layering
   csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
   csi.storage.k8s.io/provisioner-secret-namespace: default
   csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
   csi.storage.k8s.io/controller-expand-secret-namespace: default
   csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
   csi.storage.k8s.io/node-stage-secret-namespace: default
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
- discard
EOF
  • Import it
kubectl apply -f csi-rbd-sc.yaml

2.3. external-storage mode (outdated, no longer used)

2.3.1. Configure the provisioner

cat <<EOF > deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rbd-provisioner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rbd-provisioner
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: rbd-provisioner
    spec:
      containers:
      - name: rbd-provisioner
        image: "quay.io/external_storage/rbd-provisioner:latest"
        env:
        - name: PROVISIONER_NAME
          value: ceph.com/rbd
EOF
kubectl apply -f  deployment.yaml

The Ceph cluster here is version 14, but ceph-common inside the container is version 13; it has to be upgraded to 14 before PVCs can be created successfully. One way to do this is sketched below.
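
A minimal rebuild sketch, assuming the nautilus release RPM URL below is still valid and using a hypothetical registry/tag for the result:

cat <<EOF > Dockerfile
FROM quay.io/external_storage/rbd-provisioner:latest
# Add the Ceph nautilus (14.x) yum repo and upgrade ceph-common to match the 14.2.x cluster
RUN yum install -y https://download.ceph.com/rpm-nautilus/el7/noarch/ceph-release-1-1.el7.noarch.rpm && \
    yum update -y ceph-common && \
    yum clean all
EOF
docker build -t registry.example.local/rbd-provisioner:nautilus .
# Then point the Deployment above at the rebuilt image tag.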

2.3.2. Create the StorageClass

Parameters to modify:

  • monitors
  • name
  • pool
  • userId
cat <<EOF > class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rbd
provisioner: kubernetes.io/rbd
parameters:
  monitors: 172.16.2.237:6789,172.16.2.238:6789,172.16.2.239:6789
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: k8s
  userId: admin
  userSecretName: ceph-secret
  userSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"
kubectl apply -f class.yaml

2.3.3. Create the Secret

Parameters to modify:

  • key: the output of ceph auth get-key client.admin | base64
cat <<EOF > secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret
  namespace: kube-system
type: "kubernetes.io/rbd"
data:
  # ceph auth get-key client.admin | base64
  key: QVFEazE4RmdNbzdOQUJBQTR1ZnV6M082LzBsRTR2c1ZnSHMxeVE9PQ==
EOF
kubectl apply -f secret.yaml

2.4. Testing

2.4.1. Create a PVC

cat <<EOF > raw-block-pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd-sc
EOF
kubectl apply -f raw-block-pvc.yaml

Verify that it was created successfully

kubectl get pvc raw-block-pvc
# Output like the following means it worked
raw-block-pvc   Bound    pvc-a0fb907f-e067-443c-89fe-127a794c8f1b   1Gi        RWO            csi-rbd-sc     13h

2.4.2. Create a pod that uses the PVC just created

cat <<EOF > raw-block-pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-raw-block-volume
spec:
  containers:
    - name: fc-container
      image: fedora:26
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeDevices:
        - name: data
          devicePath: /dev/xvda
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: raw-block-pvc
EOF
kubectl apply -f raw-block-pod.yaml

Verify that the pod was created successfully

kubectl get po pod-with-raw-block-volume
# The pod should show STATUS Running

Verify that the volume is attached inside the container

kubectl exec -it pod-with-raw-block-volume lsblk
# Output:
# rbd0 is the PVC that was just provisioned
rbd0   252:0    0    1G  0 disk
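
On the Ceph side you can also confirm that an image was provisioned in the k8s pool; this is a sketch, and the generated image name will differ in your cluster:

rbd ls k8s
# e.g. csi-vol-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
rbd info k8s/$(rbd ls k8s | head -n 1)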

3. Object Storage

Object storage is consumed directly over layer-7 protocols (HTTP/HTTPS), so there is no need for any CSI plumbing here.
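
A minimal sketch, assuming a hypothetical RGW endpoint (rgw.example.local:7480), user and bucket, and that the access/secret keys printed by the user creation are exported as AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY:

# On the Ceph side, create an S3 user and note the keys it prints
radosgw-admin user create --uid=k8s-app --display-name="k8s app"
# From any pod or host, talk to RGW over plain HTTP(S) with a standard S3 client
aws --endpoint-url http://rgw.example.local:7480 s3 mb s3://demo-bucket
aws --endpoint-url http://rgw.example.local:7480 s3 cp /etc/hosts s3://demo-bucket/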

4. Filesystem (CephFS)

4.1. Preparation

4.1.1. Create a filesystem

  • Create two pools, one for filesystem data and one for filesystem metadata
ceph osd pool create cephfs_metadata 32 32
ceph osd pool create cephfs_data 256 256
  • Create the fs

The fs_name will be used later.

# ceph fs new <fs_name> <metadata> <data>
ceph fs new cephfs cephfs_metadata cephfs_data
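
To confirm the filesystem was created and is being served by an MDS (fs name matches the one created above):

ceph fs ls
ceph fs status cephfs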

4.1.2. Get cluster information

  • Get the cluster ID
ceph -s
# Output:
    id:     8cfb6405-d75e-466a-8abf-51ba0480d783
  • Get the secret
ceph auth get client.admin
# Output:
[client.admin]
	key = AQDk18FgMo7NABAA4ufuz3O6/0lE4vsVgHs1yQ==
  • Get the monitor info
ceph mon stat
# Output:
e2: 3 mons at {ceph01=[v2:172.16.2.237:3300/0,v1:172.16.2.237:6789/0],ceph02=[v2:172.16.2.238:3300/0,v1:172.16.2.238:6789/0],ceph03=[v2:172.16.2.239:3300/0,v1:172.16.2.239:6789/0]}, election epoch 10, leader 0 ceph01, quorum 0,1,2 ceph01,ceph02,ceph03

4.2. Deploy ceph-csi-cephfs

For deployment it is easiest to clone the whole git project and work on top of it: git clone git@github.com:ceph/ceph-csi.git -b release-v3.3

After cloning, change into deploy/cephfs/kubernetes.

4.3. configmap

The shipped ConfigMap is only a template; you need to fill it in yourself.

Modify:

  • clusterID
  • monitors
cat <<EOF > csi-config-map.yaml
---
apiVersion: v1
kind: ConfigMap
data:
  config.json: |-
    [
      {
        "clusterID": "8cfb6405-d75e-466a-8abf-51ba0480d783",
        "monitors": [
          "172.16.2.237:6789",
          "172.16.2.238:6789",
          "172.16.2.239:6789"
        ]
      }
    ]
metadata:
  name: ceph-csi-config
EOF
kubectl apply -f csi-config-map.yaml

4.3.1. rbac

kubectl create -f csi-provisioner-rbac.yaml
kubectl create -f csi-nodeplugin-rbac.yaml

4.3.2. provisioner

Deploying the provisioner pulls sidecar containers; by default the provisioner from [kubernetes-csi](https://github.com/kubernetes-csi) uses images at k8s.gcr.io/sig-storage/csi-provisioner, which is unreachable from mainland China, so manually replace k8s.gcr.io/sig-storage with quay.io/k8scsi.

sed -i 's/k8s.gcr.io\/sig-storage/quay.io\/k8scsi/g' csi-cephfsplugin-provisioner.yaml
sed -i 's/k8s.gcr.io\/sig-storage/quay.io\/k8scsi/g' csi-cephfsplugin.yaml
kubectl create -f csi-cephfsplugin-provisioner.yaml
kubectl create -f csi-cephfsplugin.yaml

Also update the cephcsi image version: the default v3.3.0 image is incompatible with Ceph 14 and PVCs cannot be created, so switch to v3.2-canary.

sed -i 's/cephcsi:v3.3.0/cephcsi:v3.2-canary/g' csi-cephfsplugin-provisioner.yaml
sed -i 's/cephcsi:v3.3.0/cephcsi:v3.2-canary/g' csi-cephfsplugin.yaml

4.3.3. Confirm the deployment succeeded

kubectl get po|grep cephfs
# Output:
csi-cephfsplugin-47z57                          3/3     Running   0          66m
csi-cephfsplugin-7ttvw                          3/3     Running   0          65m
csi-cephfsplugin-8g9c2                          3/3     Running   0          66m
csi-cephfsplugin-provisioner-55859c9ff7-g7nx4   6/6     Running   0          64m
csi-cephfsplugin-provisioner-55859c9ff7-hqn2r   6/6     Running   0          64m
csi-cephfsplugin-provisioner-55859c9ff7-vlc6b   6/6     Running   0          64m

At this point the deploy/cephfs/kubernetes part is done; change to examples/cephfs/ to set up the client side.


4.4. Client configuration

4.4.1. secret

Modify:

  • userID
  • userKey
  • adminID
  • adminKey
cat <<EOF > secret.yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: csi-cephfs-secret
  namespace: default
stringData:
  # Required for statically provisioned volumes
  userID: admin
  userKey: AQDk18FgMo7NABAA4ufuz3O6/0lE4vsVgHs1yQ==

  # Required for dynamically provisioned volumes
  adminID: admin
  adminKey: AQDk18FgMo7NABAA4ufuz3O6/0lE4vsVgHs1yQ==
EOF
kubectl apply -f secret.yaml

4.4.2. storageclass

Modify:

  • clusterID
  • fsName
cat <<EOF > storageclass.yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cephfs-sc
provisioner: cephfs.csi.ceph.com
parameters:
  clusterID: 8cfb6405-d75e-466a-8abf-51ba0480d783
  fsName: cephfs
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/controller-expand-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
  - debug
EOF
kubectl apply -f storageclass.yaml

4.5. Testing

4.5.1. Create a PVC

kubectl apply -f pvc.yaml
# After a moment, check
kubectl get pvc csi-cephfs-pvc
# Output:
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
csi-cephfs-pvc   Bound    pvc-cb6988b1-3dde-4227-80ff-5f549709b768   1Gi        RWX            csi-cephfs-sc   62m
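
For reference, examples/cephfs/pvc.yaml looks roughly like this (reconstructed from the output above; check the file in your checkout):

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-cephfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-cephfs-sc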

4.5.2. Create a pod

kubectl apply -f pod.yaml
# Check
kubectl exec -it csi-cephfs-demo-pod df
# Output: the volume is mounted via cephfs
# and the directory is /volumes/csi/csi-vol-0af68b6a-cfff-11eb-b335-fad30d8a412c/c721c454-23d6-4d73-8a4f-6d051e322487
172.16.2.237:6789,172.16.2.238:6789,172.16.2.239:6789:/volumes/csi/csi-vol-0af68b6a-cfff-11eb-b335-fad30d8a412c/c721c454-23d6-4d73-8a4f-6d051e322487   1048576       0   1048576   0% /var/lib/www
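
Similarly, examples/cephfs/pod.yaml is roughly the following (the container image is an assumption; the pod name, claim name and mount path match the output above):

---
apiVersion: v1
kind: Pod
metadata:
  name: csi-cephfs-demo-pod
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: csi-cephfs-pvc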

5. References

  • CSI spec: https://github.com/container-storage-interface/spec/blob/master/spec.md

6. Troubleshooting

6.1. PVC stuck in Pending – CSI containers cannot be scheduled

6.1.1. Symptom

kubectl get pvc
NAME            STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
raw-block-pvc   Pending                                      csi-rbd-sc     8m17s

6.1.2. Troubleshooting

  • describe pvc
kubectl describe pvc raw-block-pvc
# Key error
waiting for a volume to be created, either by external provisioner "rbd.csi.ceph.com" or manually created by system administrator
  • Check the rbdplugin pods
kubectl get po
NAME                                        READY   STATUS    RESTARTS   AGE
csi-rbdplugin-provisioner-9db69594c-gkvfv   0/7     Pending   0          56m
csi-rbdplugin-provisioner-9db69594c-s5sxk   0/7     Pending   0          56m
csi-rbdplugin-provisioner-9db69594c-zc9lt   0/7     Pending   0          56m
  • describe po
kubectl describe po csi-rbdplugin-provisioner-9db69594c-gkvfv
# Key error
0/3 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate

The cause: none of the three nodes had the master taint removed, so there was no node the pods could be scheduled on.

6.1.3. Solution

Remove the taint

kubectl taint nodes --all node-role.kubernetes.io/master-
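
Afterwards the provisioner pods should get scheduled within a minute or so (the label below is the one the upstream manifests use; adjust if yours differs):

kubectl get po -l app=csi-rbdplugin-provisioner -o wide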

6.2. PVC stuck in Pending – provisioner version too old

  • Check the PVC
kubectl describe pvc claim1
# Key error
Failed to provision volume with StorageClass "rbd": failed to create rbd image: executable file not found in $PATH, command output:
  • Check the provisioner
kubectl logs rbd-provisioner-7f85d94d97-mkzdf
# Error messages
W0617 11:31:58.969042       1 controller.go:746] Retrying syncing claim "default/claim1" because failures 3 < threshold 15
E0617 11:31:58.969097       1 controller.go:761] error syncing claim "default/claim1": failed to provision volume with StorageClass "rbd": failed to create rbd image: exit status 13, command output: did not load config file, using default settings.
2021-06-17 11:31:55.915 7fc8835c1900 -1 Errors while parsing config file!
2021-06-17 11:31:55.915 7fc8835c1900 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2021-06-17 11:31:55.915 7fc8835c1900 -1 parse_file: cannot open /root/.ceph/ceph.conf: (2) No such file or directory
2021-06-17 11:31:55.915 7fc8835c1900 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2021-06-17 11:31:55.916 7fc8835c1900 -1 Errors while parsing config file!
2021-06-17 11:31:55.916 7fc8835c1900 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2021-06-17 11:31:55.916 7fc8835c1900 -1 parse_file: cannot open /root/.ceph/ceph.conf: (2) No such file or directory
2021-06-17 11:31:55.916 7fc8835c1900 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
2021-06-17 11:31:55.953 7fc8835c1900 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-06-17 11:31:58.958 7fc8835c1900 -1 monclient: get_monmap_and_config failed to get config
2021-06-17 11:31:58.958 7fc8835c1900 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
rbd: couldn't connect to the cluster!

The error messages above did not give any clear lead.

  • Log into the provisioner pod to debug

Copy the Ceph config and keyring into the pod and run ceph -s; it fails with:

-1 monclient: get_monmap_and_config failed to get config

Enable Ceph debug logging:

Running ceph -s --debug-auth=20/20 shows an authentication error. Checking the Ceph version, rbd-provisioner ships Ceph 13; the differing message structure between versions is the likely cause, so ceph-common was upgraded to 14, which solved the problem.
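
Spelled out, the debug steps above look like this (the pod name is the one from the earlier listing; the source paths assume the admin config and keyring live in /etc/ceph on the host you run this from):

kubectl cp /etc/ceph/ceph.conf rbd-provisioner-7f85d94d97-mkzdf:/etc/ceph/ceph.conf
kubectl cp /etc/ceph/ceph.client.admin.keyring rbd-provisioner-7f85d94d97-mkzdf:/etc/ceph/ceph.client.admin.keyring
# Reproduce the failure with auth debugging turned up
kubectl exec -it rbd-provisioner-7f85d94d97-mkzdf -- ceph -s --debug-auth=20/20
# Compare the client version inside the pod with the cluster version
kubectl exec -it rbd-provisioner-7f85d94d97-mkzdf -- ceph --version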

6.3. CephFS PVC cannot be created

6.3.1. Symptom

The PVC stays in Pending after creation.

kubectl describe pvc csi-cephfs-pvc
# Output:
'ProvisioningFailed' failed to provision volume with StorageClass "csi-cephfs-sc": rpc error: code = InvalidArgument desc = failed to get connection: connecting failed: rados: ret=-13, Permission denied

6.3.2. Analysis

6.3.2.1. PVC creation flow

                                        +----------------+
                                        |ceph mon ip:port|
                   +----------------+   |ceph uuid       |
pvc->storageclass->| csi provisioner |->|ceph user       |->ceph cluster
   |               +------+---------+   |ceph keyring    |
   |                      |             +----------------+
 secret               configmap
  • The PVC references the StorageClass.
  • The StorageClass references the Ceph cluster's secret and cluster ID (uuid).
  • The secret carries the Ceph user name and its keyring.
  • With the secret and cluster ID, the create request reaches the csi-provisioner container of the csi-cephfsplugin-provisioner-xxx pod.
  • That container mounts the configmap, which holds the Ceph cluster ID and the monitor IPs/ports.
  • With these four pieces of information the provisioner talks to the Ceph cluster and creates the filesystem volume.
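
A quick way to inspect each object in this chain (the object names are the ones used in this post):

kubectl get pvc csi-cephfs-pvc -o jsonpath='{.spec.storageClassName}{"\n"}'
kubectl get sc csi-cephfs-sc -o yaml | grep -E 'clusterID|secret-name'
kubectl get cm ceph-csi-config -o jsonpath='{.data.config\.json}{"\n"}'
kubectl get secret csi-cephfs-secret -o jsonpath='{.data.adminID}' | base64 -d; echo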

6.3.2.2. Analysis steps

  • The CSI image does not ship sh or a similar shell, so debugging it from the command line is hard; use other approaches.
  • On a regular host, install a matching ceph-common, copy the config file over, and mount the filesystem with the same user the CSI uses. If that succeeds it proves:
    • the network policy is fine
    • the user is fine
  • Compare the uuid and the mon ip:port in the configmap against the actual cluster.
  • Compare the secret key and user in the secret against the actual cluster.
  • After all of the above, compare the CSI image version. The main thing is to confirm that the Ceph version inside the CSI image matches the Ceph cluster it connects to; the quickest way is to look at the Ceph version in the image's Manifest Layers.
    • For example, https://quay.io/repository/cephcsi/cephcsi/manifest/sha256:dbdff0c77c626a48616e5ad57ac1c6d7e00890b83003604312010a0f1b148676 shows CEPH_POINT_RELEASE=-16.2.5, i.e. Ceph 16.2.5; clusters below version 15 cannot use it, and versions above 17 may not work either.
    • Alternatively, check the corresponding version of the csi source: https://github.com/ceph/ceph-csi/blob/devel/build.env lists BASE_IMAGE=docker.io/ceph/ceph:v16 | CEPH_VERSION=octopus.
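
For the version check, the build.env of the branch you actually deployed can be fetched directly (assuming it sits at the repo root of that branch, as it does on devel):

curl -s https://raw.githubusercontent.com/ceph/ceph-csi/release-v3.3/build.env | grep -E 'BASE_IMAGE|CEPH_VERSION'
# On a Ceph node, the cluster side for comparison
ceph versions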

6.3.3. Appendix

  • Check the image versions
kubectl get pods -l app=csi-cephfsplugin-provisioner -o jsonpath="{.items[*].spec.containers[*].image}" |tr -s '[[:space:]]' '\n' |sort |uniq -c
# For the Ceph 14 setup the image versions are:
      6 quay.io/cephcsi/cephcsi:v3.2-canary
      3 quay.io/k8scsi/csi-attacher:v3.0.2
      3 quay.io/k8scsi/csi-provisioner:v2.0.4
      3 quay.io/k8scsi/csi-resizer:v1.0.1
      3 quay.io/k8scsi/csi-snapshotter:v4.0.0
  • List all containers of a pod
kubectl logs csi-cephfsplugin-provisioner-7fcc78cf84-j5cqm
# Output: the list in square brackets is the container names
error: a container name must be specified for pod csi-cephfsplugin-provisioner-7fcc78cf84-j5cqm, choose one of: [csi-provisioner csi-resizer csi-snapshotter csi-cephfsplugin-attacher csi-cephfsplugin liveness-prometheus]

6.4. Newly added node cannot mount PVCs

6.4.1. Symptom

Key error from describe pod:

Warning FailedMount 19s kubelet     MountVolume.MountDevice failed for volume "pvc-d8273f30-5b17-4151-bb80-053632495303": kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name cephfs.csi.ceph.com not found in the list of registered CSI drivers

6.4.2. Cause

The kubelet data directory configured in the csi-cephfsplugin DaemonSet does not match the one kubelet actually uses on the node.

  • csi-cephfsplugin definition
kubectl get ds csi-cephfsplugin -o yaml
        - mountPath: /opt/kubelet/pods
          mountPropagation: Bidirectional
          name: mountpoint-dir
        - mountPath: /opt/kubelet/plugins
          mountPropagation: Bidirectional
          name: plugin-dir
      volumes:
      - hostPath:
          path: /opt/kubelet/plugins/cephfs.csi.ceph.com/
          type: DirectoryOrCreate
        name: socket-dir
      - hostPath:
          path: /opt/kubelet/plugins_registry/
          type: DirectoryOrCreate
        name: registration-dir
      - hostPath:
          path: /opt/kubelet/pods
          type: DirectoryOrCreate
        name: mountpoint-dir
      - hostPath:
          path: /opt/kubelet/plugins
          type: Directory
        name: plugin-dir
  • kubelet startup configuration
/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.2

Comparing the two: kubelet's --root-dir is not set and defaults to /var/lib/kubelet, while the DaemonSet expects /opt/kubelet. Point kubelet at the matching data directory (or adjust the DaemonSet, whichever fits your situation): --root-dir=/opt/kubelet

After the change, volumes mount normally.

/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.2 --root-dir=/opt/kubelet
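
To double-check on a node, the flag kubelet actually runs with and the CSI drivers registered for that node can be inspected like this (replace <node-name>; no --root-dir output means the default /var/lib/kubelet is in use):

ps -ef | grep -v grep | grep kubelet | tr ' ' '\n' | grep -- --root-dir
kubectl get csinode <node-name> -o jsonpath='{.spec.drivers[*].name}{"\n"}'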

6.4.3. Troubleshooting

CSI deployment layout:

  • masters run csi-cephfsplugin-provisioner
  • nodes run csi-cephfsplugin

Check the kubelet logs:

Oct 30 19:06:01 szidc-dev-ep-k8snode-83-226 kubelet: E1031 03:06:01.800245   28008 reconciler.go:193] operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume "volume-toi4cquf" (UniqueName: "kubernetes.io/csi/cephfs.csi.ceph.com^0001-0024-8cfb6405-d75e-466a-8abf-51ba0480d783-0000000000000001-6b07dbf4-15ee-11ec-a5ca-36135b1d3054") pod "c0a92c7c-2b0f-454a-b914-f16e18193afa" (UID: "c0a92c7c-2b0f-454a-b914-f16e18193afa") : UnmountVolume.NewUnmounter failed for volume "volume-toi4cquf" (UniqueName: "kubernetes.io/csi/cephfs.csi.ceph.com^0001-0024-8cfb6405-d75e-466a-8abf-51ba0480d783-0000000000000001-6b07dbf4-15ee-11ec-a5ca-36135b1d3054") pod "c0a92c7c-2b0f-454a-b914-f16e18193afa" (UID: "c0a92c7c-2b0f-454a-b914-f16e18193afa") : kubernetes.io/csi: unmounter failed to load volume data file [/opt/kubelet/pods/c0a92c7c-2b0f-454a-b914-f16e18193afa/volumes/kubernetes.io~csi/pvc-dc0625c9-e598-48c0-9695-c6e0ee64248f/mount]: kubernetes.io/csi: failed to open volume data file [/opt/kubelet/pods/c0a92c7c-2b0f-454a-b914-f16e18193afa/volumes/kubernetes.io~csi/pvc-dc0625c9-e598-48c0-9695-c6e0ee64248f/vol_data.json]: open /opt/kubelet/pods/c0a92c7c-2b0f-454a-b914-f16e18193afa/volumes/kubernetes.io~csi/pvc-dc0625c9-e598-48c0-9695-c6e0ee64248f/vol_data.json: no such file or directory

The problem can be seen from the log above.

7. Summary

The official documentation is already very thorough; I mainly ran into two problems:

  • The external-storage approach has gone unmaintained for a long time; ceph-common in its image is stuck at version 13 and cannot authenticate against a newer server, so ceph-common inside the container has to be upgraded to 14 by hand.

    • Likewise, if ceph-csi is not working and every piece of configuration has been checked and looks fine, check the versions. The provisioner image used by the CSI cannot be debugged from a command line (I have not found a way yet), but versions can be compared via the Dockerfile: the base image is configured in build.env at the root of the git project (BASE_IMAGE and CEPH_VERSION); if they do not match, look for a matching tag on quay.io.
  • When deploying via CSI the image registries are blocked (from mainland China), so some workaround is needed, for example the ones mentioned in this post:

    • switch to a different image registry
    • many people online solve it with a Docker proxy
    • building the images yourself also works

    Switching the registry is the simplest option.
