Debugging the HAMi Project Locally
1. Prerequisites
A working HAMi installation already exists on the Kubernetes cluster, so much of its configuration can be reused while debugging.
- HAMi version: v2.4.1
- OS: Ubuntu 22.04.4 LTS
- kernel: 5.15.0-125-generic
- nvcc: V11.5.119
- nvidia-smi: 550.54.14
- CUDA: 12.4
nvidia-smi
Fri Dec 20 02:45:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:00:06.0 Off |                  N/A |
|  0%   26C    P8             11W /  370W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
2. Create a local directory for configuration
Running the services locally depends on a few files; keep them together in one folder.
2.1. Create the folder
For example, create a config directory under the project root:
mkdir config
3. Debugging hami-device-plugin locally
3.1. Collect the configuration
The device plugin needs two configuration files:
- config/device-config.yaml
- config/config.json
Fetch them and save them into the local directory:
k exec hami-device-plugin-fn8lc -c device-plugin -- cat /device-config.yaml > config/device-config.yaml
k exec hami-device-plugin-fn8lc -c device-plugin -- cat /config/config.json > config/config.json
3.2. Collect the environment variables
A useful trick: before dumping a pod's environment variables, set enableServiceLinks: false on the pod (for a Deployment or DaemonSet, spec.template.spec.enableServiceLinks: false). Kubernetes then stops injecting the per-service variables automatically, leaving only the ones the pod itself defines.
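For example, the flag can be flipped with a merge patch. The DaemonSet name and namespace below are assumptions based on the default Helm chart; adjust them to your install:

```shell
# Disable the auto-injected service env vars; the pods will be recreated.
kubectl -n kube-system patch daemonset hami-device-plugin --type merge \
  -p '{"spec":{"template":{"spec":{"enableServiceLinks":false}}}}'
```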
k exec hami-device-plugin-lzm4n -- env > start-plugin.sh
Note ⚠️: the dumped file contains a PATH variable. Change it to export PATH=$PATH:xxxxxxxxx (substituting the original output for xxxxxxxxx), and prefix every other line with export. Then drop the entries that embed cluster IPs, keep the port-only ones, and append the dlv invocation so the plugin can be debugged from GoLand. You end up with a file like this:
#!/usr/bin/bash
export PATH=$PATH:/k8s-vgpu/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
export HOSTNAME=hami-device-plugin-lzm4n
export KUBERNETES_SERVICE_PORT=443
export KUBERNETES_SERVICE_PORT_HTTPS=443
export KUBERNETES_PORT=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP_PROTO=tcp
export KUBERNETES_PORT_443_TCP_PORT=443
export KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
export KUBERNETES_SERVICE_HOST=10.233.0.1
export NVARCH=x86_64
export NVIDIA_REQUIRE_CUDA=cuda>=12.6 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551
export NV_CUDA_CUDART_VERSION=12.6.77-1
export CUDA_VERSION=12.6.3
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=utility
export NVIDIA_DISABLE_REQUIRE=true
# add this variable manually
export CONFIG_FILE=config/device-config.yaml
/root/go/bin/dlv debug --headless --listen=:12345 --api-version=2 ./cmd/device-plugin/nvidia/
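The hand edits above (prefixing export, fixing PATH) can also be scripted; a rough sketch, assuming the raw env dump was saved as env.dump:

```shell
# Example: suppose the raw dump contains these lines
printf 'PATH=/k8s-vgpu/bin:/usr/local/nvidia/bin\nNVIDIA_VISIBLE_DEVICES=all\n' > env.dump

# Prefix every assignment with `export`, and make PATH append the container
# paths to the host PATH instead of replacing it.
sed -e 's/^/export /' \
    -e 's|^export PATH=|export PATH=$PATH:|' env.dump > start-plugin.sh

cat start-plugin.sh
# export PATH=$PATH:/k8s-vgpu/bin:/usr/local/nvidia/bin
# export NVIDIA_VISIBLE_DEVICES=all
```

The IP-bearing entries still need to be reviewed by hand, since which ones to keep depends on what you are debugging.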
3.3. Modify the code
The node-side config file path is hard-coded, so the code has to be patched (TODO: this could probably be improved):
diff --git a/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go b/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
--- a/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
+++ b/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
@@ -101,7 +101,7 @@ type NvidiaDevicePlugin struct {
 }
 func readFromConfigFile(sConfig *nvidia.NvidiaConfig) (string, error) {
-       jsonbyte, err := os.ReadFile("/config/config.json")
+       jsonbyte, err := os.ReadFile("config/config.json")
        mode := "hami-core"
        if err != nil {
                return "", err
3.4. Log levels
The plugin code logs through klog's Warning, Error, and Info calls without verbosity gating, so no log configuration is needed: the default verbosity is 0, and all three severities are printed anyway.
The plugin's start command does accept a v flag, but it is not a log level; it is urfave/cli's version flag:
https://github.com/urfave/cli/blob/main/flag.go
// VersionFlag prints the version for the application
var VersionFlag Flag = &BoolFlag{
 Name:        "version",
 Aliases:     []string{"v"},
 Usage:       "print the version",
 HideDefault: true,
 Local:       true,
}
3.5. Stop the existing service
The plugin runs as a DaemonSet, so there are a couple of ways to stop it:
- change the DaemonSet image to an unreachable one, which stops the plugin on every GPU node
- change the labels of the node you are debugging so they no longer match the DaemonSet's selector
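For the second option, something like the following works; the label key and value are assumptions, so check the DaemonSet's actual nodeSelector first:

```shell
# See which labels the DaemonSet schedules on
kubectl -n kube-system get ds hami-device-plugin \
  -o jsonpath='{.spec.template.spec.nodeSelector}'
# Then flip that label on the node you want to debug, e.g.:
kubectl label node <node-name> gpu=off --overwrite
```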
3.6. Start the new service
Here we attach to the server remotely:
- In GoLand, add a Run/Debug Configuration (Go Remote) whose host and port match the ones in start-plugin.sh
- On the server, start the plugin:
bash start-plugin.sh
Now you can set breakpoints and step through the plugin code.
4. Debugging hami-scheduler locally
4.1. Collect the environment variables
k exec hami-scheduler-67fc7ccd55-vjntl -c vgpu-scheduler-extender -- env > start-scheduler.sh
4.1.1. TLS configuration
Kubernetes admission webhooks must be called over HTTPS, so the service we start has to serve TLS. The HAMi Helm chart has already created a certificate, issued for 127.0.0.1, so the local service will live at https://127.0.0.1:xxxx/webhook.
Collect the TLS cert and key:
k get secret hami-scheduler-tls -o jsonpath='{.data.tls\.crt}' |base64 -d > config/tls.crt
k get secret hami-scheduler-tls -o jsonpath='{.data.tls\.key}' |base64 -d > config/tls.key
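Before wiring the webhook up, it is worth checking that the extracted certificate really covers 127.0.0.1:

```shell
# The Subject Alternative Name should list IP Address:127.0.0.1
openssl x509 -in config/tls.crt -noout -text | grep -A1 'Subject Alternative Name'
```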
Write the start command and append it to start-scheduler.sh:
dlv debug --headless --listen=:2345 --api-version=2 ./cmd/scheduler/  -- \
  --device-config-file=config/device-config.yaml \
  -v=10 \
  --scheduler-name=hami-scheduler \
  --http_bind=0.0.0.0:8080 \
  --cert_file=config/tls.crt \
  --key_file=config/tls.key
The final start-scheduler.sh then looks like the following. Note the PATH handling again (change it by hand to PATH=$PATH:xxxxxxxx), and make sure the variables are actually exported to child processes, e.g. with set -a at the top:
set -a  # export every assignment below so dlv (and the scheduler) inherit them
PATH=$PATH:/k8s-vgpu/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=hami-scheduler-67fc7ccd55-vjntl
NVIDIA_MIG_MONITOR_DEVICES=all
HOOK_PATH=/usr/local
KUBERNETES_SERVICE_HOST=10.233.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT=tcp://10.233.0.1:443
KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
NVARCH=x86_64
NVIDIA_REQUIRE_CUDA=cuda>=12.6 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551
NV_CUDA_CUDART_VERSION=12.6.77-1
CUDA_VERSION=12.6.3
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=utility
NVIDIA_DISABLE_REQUIRE=true
# device-config.yaml: reuse the file fetched from the plugin earlier
# -v=10: raise the log verbosity for easier debugging
# --scheduler-name: must stay hami-scheduler
# --http_bind: the address the webhook/extender HTTPS server listens on
dlv debug --headless --listen=:2345 --api-version=2 ./cmd/scheduler/ -- \
  --device-config-file=config/device-config.yaml \
  -v=10 \
  --scheduler-name=hami-scheduler \
  --http_bind=0.0.0.0:8080 \
  --cert_file=config/tls.crt \
  --key_file=config/tls.key
4.2. Modify the webhook
For kube-apiserver to reach our local service when pods are created, the admission webhook configuration has to be changed:
k edit mutatingwebhookconfigurations.admissionregistration.k8s.io hami-webhook
The original configuration:
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    service:
      name: hami-scheduler
      namespace: kube-system
      path: /webhook
      port: 443
Change it to:
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    url: https://127.0.0.1:8080/webhook
4.3. Start an extra HTTP server for the scheduler extender
The TLS setup above is all bound to 127.0.0.1. That is fine for the admission webhook, because kube-apiserver runs on the host network and can reach a host process via the local IP. The extender requests, however, come from the new kube-scheduler, which runs inside a pod on the virtual pod network and therefore cannot reach a process started directly on the host via 127.0.0.1. So hami-scheduler needs an additional HTTP endpoint that pods can reach. There are several options, such as port forwarding or starting a second HTTP server; here we take the second approach, which requires a change to the scheduler code:
diff --git a/cmd/scheduler/main.go b/cmd/scheduler/main.go
index 3c99042..c85c09f 100644
--- a/cmd/scheduler/main.go
+++ b/cmd/scheduler/main.go
@@ -84,12 +85,29 @@ func start() {
        router.GET("/healthz", routes.HealthzRoute())
        klog.Info("listen on ", config.HTTPBind)
        if len(tlsCertFile) == 0 || len(tlsKeyFile) == 0 {
-               if err := http.ListenAndServe(config.HTTPBind, router); err != nil {
-                       klog.Fatal("Listen and Serve error, ", err)
+               go func() {
+                       if err := http.ListenAndServe(config.HTTPBind, router); err != nil {
+                               klog.Fatal("Listen and Serve error, ", err)
+                       }
+               }()
+       } else {
+               go func() {
+                       if err := http.ListenAndServeTLS(config.HTTPBind, tlsCertFile, tlsKeyFile, router); err != nil {
+                               klog.Fatal("Listen and Serve TLS error, ", err)
+                       }
+               }()
+       }
+
+       // Additional HTTP server on a different port
+       additionalBind := "10.10.10.8:8081" 
+       klog.Info("listen on ", additionalBind)
+       if len(tlsCertFile) == 0 || len(tlsKeyFile) == 0 {
+               if err := http.ListenAndServe(additionalBind, router); err != nil {
+                       klog.Fatal("Additional Listen and Serve error, ", err)
                }
        } else {
-               if err := http.ListenAndServeTLS(config.HTTPBind, tlsCertFile, tlsKeyFile, router); err != nil {
-                       klog.Fatal("Listen and Serve error, ", err)
+               if err := http.ListenAndServeTLS(additionalBind, tlsCertFile, tlsKeyFile, router); err != nil {
+                       klog.Fatal("Additional Listen and Serve TLS error, ", err)
                }
        }
 }
This starts two servers on different host:port combinations: the admission webhook uses https://127.0.0.1:xxxx, and the scheduler extender uses https://10.10.10.8:8081/xxx.
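With the patched scheduler running, both endpoints can be smoke-tested via the /healthz route registered in main.go (-k skips verification of the self-signed certificate; 10.10.10.8 is the example host IP from above):

```shell
# From the host (kube-apiserver, on the host network, takes this path too):
curl -sk https://127.0.0.1:8080/healthz
# From inside a pod, e.g. where the kube-scheduler container runs:
curl -sk https://10.10.10.8:8081/healthz
```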
4.3.1. Point the scheduler at the new address
The extender's server side is ready; the client side now needs this change:
k edit cm hami-scheduler-newversion
Change
data:
  config.yaml: |
  ...
    extenders:
    - urlPrefix: "https://127.0.0.1:8081"
so that urlPrefix points at the service we just started:
data:
  config.yaml: |
  ...
    extenders:
    - urlPrefix: "https://10.10.10.8:8081"
4.4. Stop the original hami-scheduler
The hami-scheduler Deployment contains two containers:
- kube-scheduler: extends the stock Kubernetes scheduler and forwards the scheduling of pods whose schedulerName is hami-scheduler to vgpu-scheduler-extender
- vgpu-scheduler-extender: the actual scheduling logic
The kube-scheduler container only registers the extender and forwards requests, so there is little to debug there; keep the existing one running. vgpu-scheduler-extender must be stopped, and the simplest way is again to change that container's image to an unreachable one.
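A sketch of that (any unpullable image reference will do; the container name matches the Deployment described above):

```shell
# Point the extender container at an image that cannot be pulled;
# the kube-scheduler container in the same pod keeps running.
kubectl -n kube-system set image deployment/hami-scheduler \
  vgpu-scheduler-extender=invalid.registry.local/pause:stop
```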
4.5. Start the local scheduler
bash start-scheduler.sh
Add a GoLand configuration similar to the one for hami-plugin and you can debug remotely.
Note that the dlv port must match the one in the start script; if you need several remote debug sessions at the same time, give each one a different port.
5. Testing
Create a pod and check that the scheduling flow works. If the pod schedules successfully and the logs look normal, the setup is good, and you can move on to feature development or bug fixing:
k apply -f examples/nvidia/default_use.yaml
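Then watch the pod come up and confirm it was handled by hami-scheduler (substitute the pod name actually defined in default_use.yaml):

```shell
# The pod should reach Running on a GPU node
kubectl get pod <pod-name> -o wide
# The events should show the pod was bound by hami-scheduler
kubectl describe pod <pod-name> | grep -iA2 scheduled
```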
