Local Debugging of the HAMi Project

Posted by elrond on December 20, 2024


1. Prerequisites

You should already have a working HAMi installation on a Kubernetes cluster; much of its configuration can be reused while debugging.

  • HAMi version: v2.4.1
  • OS: Ubuntu 22.04.4 LTS
  • kernel: 5.15.0-125-generic
  • nvcc: V11.5.119
  • nvidia-smi: 550.54.14
  • CUDA: 12.4
nvidia-smi
Fri Dec 20 02:45:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:00:06.0 Off |                  N/A |
|  0%   26C    P8             11W /  370W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

2. Create a Local Directory for Configuration

Running the components locally depends on a few files; it is convenient to keep them all in one directory

2.1. Create the Directory

For example, create a config directory under the project root

mkdir config

3. Debugging hami-device Locally

3.1. Collect the Configuration

hami-device needs two configuration files:

  • config/device-config.yaml
  • config/config.json

Fetch them and save them into the local directory

k exec hami-device-plugin-fn8lc -c device-plugin -- cat /device-config.yaml > config/device-config.yaml
k exec hami-device-plugin-fn8lc -c device-plugin -- cat /config/config.json > config/config.json

3.2. Collect the Environment Variables

A small trick: before exporting the pod's environment variables, set enableServiceLinks on the workload (e.g. spec.template.spec.enableServiceLinks: false in the Deployment). This strips the variables Kubernetes injects automatically and keeps only those the pod itself defines

k exec hami-device-plugin-lzm4n -- env > start-plugin.sh

Note ⚠️: the exported file contains a PATH variable, which must be rewritten as export PATH=$PATH:xxxxxx with the original output in place of the xxx; prefix every other line with export as well, then debug with dlv and GoLand
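These mechanical edits can be scripted. A possible sketch with sed (a toy input stands in for the real `env` output here; adjust the filter pattern to your cluster, and note PATH still needs the manual fix-up described above):

```shell
# Toy stand-in for the output of `k exec <pod> -- env > start-plugin.sh`
printf 'PATH=/usr/local/bin:/usr/bin\nKUBERNETES_SERVICE_HOST=10.233.0.1\nNVIDIA_VISIBLE_DEVICES=all\n' > /tmp/start-plugin.sh

# In-place: drop Kubernetes-injected variables, prefix the rest with `export`
sed -i -e '/^KUBERNETES_/d' -e 's/^/export /' /tmp/start-plugin.sh

cat /tmp/start-plugin.sh
```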

Remove the entries carrying cluster IPs, keep the port entries, and add the debug invocation; the final file looks like this

#!/usr/bin/bash
export PATH=$PATH:/k8s-vgpu/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
export HOSTNAME=hami-scheduler-858f6b9dcf-4nn58
export KUBERNETES_SERVICE_PORT=443
export KUBERNETES_SERVICE_PORT_HTTPS=443
export KUBERNETES_PORT=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP_PROTO=tcp
export KUBERNETES_PORT_443_TCP_PORT=443
export KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
export KUBERNETES_SERVICE_HOST=10.233.0.1
export NVARCH=x86_64
export NVIDIA_REQUIRE_CUDA=cuda>=12.6 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551
export NV_CUDA_CUDART_VERSION=12.6.77-1
export CUDA_VERSION=12.6.3
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=utility
export NVIDIA_DISABLE_REQUIRE=true
# add this variable manually
export CONFIG_FILE=config/device-config.yaml
/root/go/bin/dlv debug --headless --listen=:12345 --api-version=2 ./cmd/device-plugin/nvidia/

3.3. Adjust the Configuration

The node-side configuration file path is hard-coded, so a small code change is needed (todo: this could probably be improved)

diff --git a/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go b/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
--- a/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
+++ b/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
@@ -101,7 +101,7 @@ type NvidiaDevicePlugin struct {
 }

 func readFromConfigFile(sConfig *nvidia.NvidiaConfig) (string, error) {
-       jsonbyte, err := os.ReadFile("/config/config.json")
+       jsonbyte, err := os.ReadFile("config/config.json")
        mode := "hami-core"
        if err != nil {
                return "", err

3.4. Log Levels

The plugin logs only through klog's Warning, Error, and Info calls, so log levels need no special attention: the default verbosity is 0, all three severities are printed, and no extra logging configuration is required

The plugin's launch command does accept a v flag, but it is not a log level; it is urfave/cli's built-in flag for printing the version

https://github.com/urfave/cli/blob/main/flag.go

// VersionFlag prints the version for the application
var VersionFlag Flag = &BoolFlag{
 Name:        "version",
 Aliases:     []string{"v"},
 Usage:       "print the version",
 HideDefault: true,
 Local:       true,
}
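The klog behavior described above can be illustrated with a minimal stdlib stand-in (this mimics klog's -v semantics, it is not klog itself): unconditional messages always appear, while V(n)-gated messages appear only when n is at or below the configured verbosity, which defaults to 0.

```go
package main

import (
	"flag"
	"fmt"
)

// verbosity mirrors klog's -v flag; klog's default is 0.
var verbosity = flag.Int("v", 0, "log verbosity threshold")

// V reports whether a message at the given level should print,
// analogous to klog.V(level).Enabled().
func V(level int) bool { return level <= *verbosity }

func main() {
	flag.Parse()
	fmt.Println("INFO always printed") // like klog.Info / Warning / Error
	if V(5) {
		fmt.Println("DEBUG printed only with -v=5 or higher") // like klog.V(5).Info
	}
}
```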

3.5. Stop the Existing Service

The plugin runs as a DaemonSet, so it can be stopped in one of the following ways:

  • Change the DaemonSet image to an unreachable one; note this stops the plugin on every GPU node
  • Change the labels of the node being debugged so they no longer match the DaemonSet's selector

3.6. Start the New Service

Here we attach to the server remotely

  • In GoLand, add a Run/Debug Configuration whose host and port match those in start-plugin.sh
  • Start the plugin on the server
bash start-plugin.sh

You can now set breakpoints and debug the plugin code

4. Debugging hami-scheduler Locally

4.1. Collect the Environment Variables of hami-scheduler

k exec hami-scheduler-67fc7ccd55-vjntl -c vgpu-scheduler-extender -- env > start-scheduler.sh

4.1.1. TLS Configuration

Kubernetes admission webhooks must be called over HTTPS, so the local service has to serve HTTPS. A HAMi installation deployed via the Helm chart has already created a certificate, issued for 127.0.0.1, so the local service should listen at https://127.0.0.1:xxxx/webhook

Collect the TLS certificate and key

k get secret hami-scheduler-tls -o jsonpath='{.data.tls\.crt}' |base64 -d > config/tls.crt
k get secret hami-scheduler-tls -o jsonpath='{.data.tls\.key}' |base64 -d > config/tls.key
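It can be worth sanity-checking that the extracted certificate and key actually pair up before handing them to dlv. A sketch of the check, demonstrated on a throwaway self-signed pair (in real use, point the last three commands at config/tls.crt and config/tls.key instead):

```shell
# Generate a throwaway pair just for the demo
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=127.0.0.1" \
  -keyout /tmp/tls.key -out /tmp/tls.crt 2>/dev/null

# A cert and key match iff they carry the same public key
openssl x509 -in /tmp/tls.crt -noout -pubkey > /tmp/crt.pub
openssl pkey -in /tmp/tls.key -pubout > /tmp/key.pub
diff -q /tmp/crt.pub /tmp/key.pub && echo "cert/key match"
```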

Write the launch command and append it to start-scheduler.sh

dlv debug --headless --listen=:2345 --api-version=2 ./cmd/scheduler/  -- \
  --device-config-file=config/device-config.yaml \
  -v=10 \
  --scheduler-name=hami-scheduler \
  --http_bind=0.0.0.0:8080 \
  --cert_file=config/tls.crt \
  --key_file=config/tls.key

The result is a launch script, start-scheduler.sh

Handle PATH as before: manually change it to PATH=$PATH:xxxxxxxx, and, as with the plugin script, prefix each line with export so the variables reach dlv

PATH=$PATH:/k8s-vgpu/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=hami-scheduler-67fc7ccd55-vjntl
NVIDIA_MIG_MONITOR_DEVICES=all
HOOK_PATH=/usr/local
KUBERNETES_SERVICE_HOST=10.233.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT=tcp://10.233.0.1:443
KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
NVARCH=x86_64
NVIDIA_REQUIRE_CUDA=cuda>=12.6 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551
NV_CUDA_CUDART_VERSION=12.6.77-1
CUDA_VERSION=12.6.3
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=utility
NVIDIA_DISABLE_REQUIRE=true

# device-config.yaml: reuse the file exported from the plugin earlier
# -v=10: raise verbosity to make debugging easier
# --scheduler-name: the scheduler name the extender serves
# --http_bind: the address and port the webhook URL will use
dlv debug --headless --listen=:2345 --api-version=2 ./cmd/scheduler/  -- \
  --device-config-file=config/device-config.yaml \
  -v=10 \
  --scheduler-name=hami-scheduler \
  --http_bind=0.0.0.0:8080 \
  --cert_file=config/tls.crt \
  --key_file=config/tls.key

4.2. Modify the Webhook

For kube-apiserver to reach the local service when pods are created, the admission webhook configuration must be changed

k edit mutatingwebhookconfigurations.admissionregistration.k8s.io hami-webhook

The original configuration

webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    service:
      name: hami-scheduler
      namespace: kube-system
      path: /webhook
      port: 443

Change it to

webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    url: https://127.0.0.1:8080/webhook

4.3. Start an Extra HTTP Server for the Extended hami-scheduler

The TLS configuration above is all bound to 127.0.0.1. The admission webhook is invoked from the kube-apiserver pod, which uses the host's network, so it can reach a process listening on the host's loopback address. The new kube-scheduler launched by hami-scheduler, however, runs inside a pod on the virtual pod network and cannot reach a process started directly on the host via 127.0.0.1. We therefore need to give hami-scheduler an additional HTTP endpoint the pod can reach. There are several options, such as port forwarding or starting a second HTTP server; here we take the second approach, which requires a change to the scheduler code

diff --git a/cmd/scheduler/main.go b/cmd/scheduler/main.go
index 3c99042..c85c09f 100644
--- a/cmd/scheduler/main.go
+++ b/cmd/scheduler/main.go
@@ -84,12 +85,29 @@ func start() {
        router.GET("/healthz", routes.HealthzRoute())
        klog.Info("listen on ", config.HTTPBind)
        if len(tlsCertFile) == 0 || len(tlsKeyFile) == 0 {
-               if err := http.ListenAndServe(config.HTTPBind, router); err != nil {
-                       klog.Fatal("Listen and Serve error, ", err)
+               go func() {
+                       if err := http.ListenAndServe(config.HTTPBind, router); err != nil {
+                               klog.Fatal("Listen and Serve error, ", err)
+                       }
+               }()
+       } else {
+               go func() {
+                       if err := http.ListenAndServeTLS(config.HTTPBind, tlsCertFile, tlsKeyFile, router); err != nil {
+                               klog.Fatal("Listen and Serve TLS error, ", err)
+                       }
+               }()
+       }
+
+       // Additional HTTP server on a different port
+       additionalBind := "10.10.10.8:8081" 
+       klog.Info("listen on ", additionalBind)
+       if len(tlsCertFile) == 0 || len(tlsKeyFile) == 0 {
+               if err := http.ListenAndServe(additionalBind, router); err != nil {
+                       klog.Fatal("Additional Listen and Serve error, ", err)
                }
        } else {
-               if err := http.ListenAndServeTLS(config.HTTPBind, tlsCertFile, tlsKeyFile, router); err != nil {
-                       klog.Fatal("Listen and Serve error, ", err)
+               if err := http.ListenAndServeTLS(additionalBind, tlsCertFile, tlsKeyFile, router); err != nil {
+                       klog.Fatal("Additional Listen and Serve TLS error, ", err)
                }
        }
 }

Two servers on different host:port combinations are now running: the admission webhook uses https://127.0.0.1:xxxx, and the scheduler extender uses https://10.10.10.8:8081/xxx

4.3.1. Point the Scheduler Config at the New Address

The extender's server side is ready; the client side needs the following change

k edit cm hami-scheduler-newversion

The configuration before the change

data:
  config.yaml: |
  ...
    extenders:
    - urlPrefix: "https://127.0.0.1:8081"

Change urlPrefix to the address of the server that was just started

data:
  config.yaml: |
  ...
    extenders:
    - urlPrefix: "https://10.10.10.8:8081"

4.4. Stop the Original hami-scheduler

The hami-scheduler Deployment contains two containers:

  • kube-scheduler extends the stock Kubernetes scheduler and forwards the scheduling of pods whose schedulerName is hami-scheduler to vgpu-scheduler-extender
  • vgpu-scheduler-extender implements the actual scheduling logic

The kube-scheduler container only registers the extender and forwards requests, so there is little to debug there and the existing one can be kept. vgpu-scheduler-extender must be stopped; the simplest way is again to change its container image to an unreachable one

4.5. Start the Local Scheduler

bash start-scheduler.sh

Configure GoLand the same way as for hami-plugin, and you can debug remotely

Note that the dlv listen port must match the one in the launch script; if you need several remote debug sessions at the same time, give each one a different port

5. Testing

Create a pod and verify the scheduling flow: if the pod is scheduled successfully and the logs look normal, the test passes, and you can move on to feature development or bug fixes

k apply -f examples/nvidia/default_use.yaml
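For reference, a minimal test pod might look like the sketch below. This is hypothetical: examples/nvidia/default_use.yaml in the repository is the authoritative version, and the resource names depend on your device-config.yaml (the ones shown are HAMi's defaults).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1       # number of vGPUs requested
        nvidia.com/gpumem: 3000 # per-vGPU device memory in MiB
```

If the pod reaches Running and the extender's -v=10 logs show the filter and bind calls, the end-to-end path works.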