Debugging the HAMi Project Locally
1. Prerequisites
A working HAMi installation already exists on the Kubernetes cluster, so much of its configuration can be reused while debugging.
- HAMi version: v2.4.1
- OS: Ubuntu 22.04.4 LTS
- kernel: 5.15.0-125-generic
- nvcc: V11.5.119
- nvidia-smi: 550.54.14
- CUDA: 12.4
nvidia-smi
Fri Dec 20 02:45:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:00:06.0 Off |                  N/A |
|  0%   26C    P8             11W /  370W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
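To compare another debug host against the versions listed above, the driver and CUDA versions can be read from the nvidia-smi banner. The following is a minimal sketch; the helper name smi_versions is made up for illustration and is not part of HAMi.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print
# "<driver version> <CUDA version>" taken from the banner line.
smi_versions() {
    sed -n 's/.*Driver Version: *\([0-9.]*\).*CUDA Version: *\([0-9.]*\).*/\1 \2/p' | head -n 1
}

# Usage on a GPU host:
#   nvidia-smi | smi_versions
```

Only the banner line carries both fields, so the first match is enough.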
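2. Create a local directory for the configuration files
Starting the components locally depends on several files; keep them together in a single folder.
2.1. Create the folder
For example, create a config directory in the project root: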
mkdir config
3. Debugging hami-device locally
3.1. Collect the configuration files
hami-device needs two configuration files:
- config/device-config.yaml
- config.json
Fetch them and save them to the local config directory (k is used here as shorthand for kubectl):
k exec hami-device-plugin-fn8lc -c device-plugin -- cat /device-config.yaml > config/device-config.yaml
k exec hami-device-plugin-fn8lc -c device-plugin -- cat /config/config.json > config/config.json
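The two copy commands can be wrapped so that a failed exec or a wrong path is noticed immediately. This is only a sketch: fetch_plugin_configs is a hypothetical helper name, and it assumes the container name and in-pod paths shown above and that the pod is in the current namespace.

```shell
# Sketch: copy both plugin config files from a running pod into a local
# directory (default: ./config). Assumes the container is called
# device-plugin and the files live at the in-pod paths used in this guide.
fetch_plugin_configs() {
    pod=$1
    outdir=${2:-config}
    mkdir -p "$outdir" || return 1
    kubectl exec "$pod" -c device-plugin -- cat /device-config.yaml > "$outdir/device-config.yaml" || return 1
    kubectl exec "$pod" -c device-plugin -- cat /config/config.json > "$outdir/config.json" || return 1
    # An empty file usually means the exec failed or the path was wrong.
    [ -s "$outdir/device-config.yaml" ] && [ -s "$outdir/config.json" ]
}

# Usage:
#   fetch_plugin_configs hami-device-plugin-fn8lc config
```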
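3.2. Collect the environment variables
Tip: before exporting the pod's environment variables, set enableServiceLinks on the pod, for example spec.template.spec.enableServiceLinks: false in a Deployment. This removes the service information that Kubernetes adds automatically and keeps only the variables configured on the pod itself.
k exec hami-device-plugin-lzm4n -c device-plugin -- env > start-plugin.sh
Note ⚠️: the exported file contains a PATH variable, which has to be changed to
export PATH=$PATH:xxxxxx
where xxxxxx is the value from the original output. Prefix every line with export. The script then starts the plugin under dlv so that it can be debugged from GoLand.
Remove the entries that contain IP addresses, keep the port entries, and add the debug command. The result is a file like the following: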
#!/usr/bin/bash
export PATH=$PATH:/k8s-vgpu/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
export HOSTNAME=hami-scheduler-858f6b9dcf-4nn58
export KUBERNETES_SERVICE_PORT=443
export KUBERNETES_SERVICE_PORT_HTTPS=443
export KUBERNETES_PORT=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP_PROTO=tcp
export KUBERNETES_PORT_443_TCP_PORT=443
export KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
export KUBERNETES_SERVICE_HOST=10.233.0.1
export NVARCH=x86_64
export NVIDIA_REQUIRE_CUDA="cuda>=12.6 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551"
export NV_CUDA_CUDART_VERSION=12.6.77-1
export CUDA_VERSION=12.6.3
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=utility
export NVIDIA_DISABLE_REQUIRE=true
# This variable is added by hand
export CONFIG_FILE=config/device-config.yaml
/root/go/bin/dlv debug --headless --listen=:12345 --api-version=2 ./cmd/device-plugin/nvidia/
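The manual edits described in this section (export prefix, PATH handling, dropping entries with IP addresses) can also be scripted. The sketch below is one possible way to do it, not part of HAMi; env_to_exports is a hypothetical helper, and it assumes every variable fits on one line. Values are single-quoted so that characters such as >, < and spaces in NVIDIA_REQUIRE_CUDA are not interpreted by the shell.

```shell
# Hypothetical helper: turn `env` output (stdin) into export statements.
#  - PATH is appended to the host PATH instead of replacing it
#  - entries whose value contains an IPv4 address are dropped
#  - every other value is single-quoted
env_to_exports() {
    awk '
    {
        eq = index($0, "=")
        if (eq < 2) next                      # not a KEY=VALUE line
        key = substr($0, 1, eq - 1)
        val = substr($0, eq + 1)
        if (key == "PATH") { printf "export PATH=\"$PATH:%s\"\n", val; next }
        if (val ~ /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) next
        out = ""                              # escape embedded single quotes
        while ((i = index(val, "\047")) > 0) {
            out = out substr(val, 1, i - 1) "\047\\\047\047"
            val = substr(val, i + 1)
        }
        printf "export %s=\047%s\047\n", key, out val
    }'
}

# Usage:
#   kubectl exec hami-device-plugin-lzm4n -c device-plugin -- env | env_to_exports > start-plugin.sh
```

The shebang line, CONFIG_FILE and the dlv command still have to be added by hand afterwards.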
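3.3. Change the configuration path
The path of the node configuration file is hard-coded, so the code has to be modified (todo: this could probably be improved):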
diff --git a/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go b/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
--- a/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
+++ b/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
@@ -101,7 +101,7 @@ type NvidiaDevicePlugin struct {
 }
 func readFromConfigFile(sConfig *nvidia.NvidiaConfig) (string, error) {
-    jsonbyte, err := os.ReadFile("/config/config.json")
+    jsonbyte, err := os.ReadFile("config/config.json")
     mode := "hami-core"
     if err != nil {
         return "", err
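As a possible alternative to patching the source, the local config directory could be exposed at the absolute path the plugin expects. This is an untested idea sketched under assumptions: link_config_dir is a hypothetical helper, the default target /config normally requires root, and nothing else on the host may already use that path.

```shell
# Sketch: symlink a local config directory to the path the plugin reads
# (/config by default). Refuses to touch a target that already exists.
link_config_dir() {
    src=$1
    dst=${2:-/config}
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    if [ -e "$dst" ] || [ -L "$dst" ]; then
        echo "$dst already exists" >&2
        return 1
    fi
    ln -s "$(cd "$src" && pwd)" "$dst"
}

# Usage (as root, from the project root):
#   link_config_dir config /config
```

With the link in place, /config/config.json resolves to config/config.json in the project, so the code change would not be needed.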
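3.4. Log level
The plugin code logs through klog at the Warning, Error and Info severities. The default verbosity is 0, so all three are printed and no extra logging configuration is needed.
The plugin's start command accepts a v flag, but it is not a log level: it is the flag that prints the version.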
https://github.com/urfave/cli/blob/main/flag.go
// VersionFlag prints the version for the application
var VersionFlag Flag = &BoolFlag{
    Name:        "version",
    Aliases:     []string{"v"},
    Usage:       "print the version",
    HideDefault: true,
    Local:       true,
}
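3.5. Stop the original service
The plugin runs as a DaemonSet, so it can be stopped in either of the following ways:
- Change the DaemonSet image to an unreachable one, which stops the plugin on all GPU nodes
- Change the labels of the node being debugged so that they no longer match the labels the DaemonSet selects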
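Both options can be expressed as kubectl commands. The sketch below rests on assumptions that should be checked against the actual cluster first: the DaemonSet is taken to be named hami-device-plugin in the kube-system namespace, and its node selector is taken to be gpu=on. The function names are made up for illustration.

```shell
# Assumed namespace; override by setting NS before sourcing this snippet.
NS=${NS:-kube-system}

# Option 1: point the device-plugin container at an image that cannot be pulled.
stop_plugin_by_image() {
    kubectl -n "$NS" set image daemonset/hami-device-plugin \
        device-plugin=unreachable.invalid/hami:none
}

# Option 2: change the node label so the DaemonSet no longer selects the node.
# Assumes the DaemonSet selects nodes labelled gpu=on.
stop_plugin_on_node() {
    kubectl label node "$1" gpu=off --overwrite
}

# Usage:
#   stop_plugin_on_node <node-name>
```

The label approach affects only one node and is undone by setting the label back to its previous value.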
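3.6. Start the new service
Here the debugger connects to the server remotely.
- In GoLand, open Run/Debug Configurations and set the host and port so that they match the ones in start-plugin.sh
- Start the plugin on the server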
bash start-plugin.sh
Breakpoints can now be set in the plugin code.
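Because the GoLand port has to match the one in the start script, it can help to read the port from the script instead of copying it by hand. A small sketch; dlv_listen_port is a hypothetical helper name.

```shell
# Hypothetical helper: print the port from the first --listen=:PORT
# (or --listen=HOST:PORT) option found in a start script.
dlv_listen_port() {
    sed -n 's/.*--listen=[^ ]*:\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Usage: print the port for the GoLand configuration, or attach from a
# terminal with Delve's own client instead of GoLand:
#   dlv_listen_port start-plugin.sh
#   dlv connect "<server-address>:$(dlv_listen_port start-plugin.sh)"
```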
4. Debugging hami-scheduler locally
4.1. Collect the hami-scheduler environment variables
k exec hami-scheduler-67fc7ccd55-vjntl -c vgpu-scheduler-extender -- env > start-scheduler.sh
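4.1.1. TLS configuration
Kubernetes admission webhooks must be called over HTTPS, so the locally started service has to serve HTTPS. A HAMi installation deployed with the Helm chart already has a certificate, issued for 127.0.0.1, so the service must be reachable at https://127.0.0.1:xxxx/webhook
Collect the TLS certificate and key: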
k get secret hami-scheduler-tls -o jsonpath='{.data.tls\.crt}' |base64 -d > config/tls.crt
k get secret hami-scheduler-tls -o jsonpath='{.data.tls\.key}' |base64 -d > config/tls.key
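Before relying on the certificate, it may be worth confirming that it really lists 127.0.0.1 as a subject alternative name. The sketch below assumes openssl is installed on the debug host; cert_has_ip_san is a hypothetical helper that only inspects the text dump.

```shell
# Hypothetical helper: read `openssl x509 -noout -text` output on stdin and
# succeed only if the certificate lists the given IP as a subject alt name.
cert_has_ip_san() {
    ip=$(printf '%s' "$1" | sed 's/\./\\./g')   # escape dots for the regex
    grep -Eq "IP Address:${ip}(,|[[:space:]]|\$)"
}

# Usage:
#   openssl x509 -in config/tls.crt -noout -text | cert_has_ip_san 127.0.0.1 \
#       && echo "certificate covers 127.0.0.1"
```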
Write the start command and append it to start-scheduler.sh:
dlv debug --headless --listen=:2345 --api-version=2 ./cmd/scheduler/ -- \
--device-config-file=config/device-config.yaml \
-v=10 \
--scheduler-name=hami-scheduler \
--http_bind=0.0.0.0:8080 \
--cert_file=config/tls.crt \
--key_file=config/tls.key
The result is the start script start-scheduler.sh.
Mind the PATH handling: change it by hand to PATH=$PATH:xxxxxxxx
export PATH=$PATH:/k8s-vgpu/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
export HOSTNAME=hami-scheduler-67fc7ccd55-vjntl
export NVIDIA_MIG_MONITOR_DEVICES=all
export HOOK_PATH=/usr/local
export KUBERNETES_SERVICE_HOST=10.233.0.1
export KUBERNETES_SERVICE_PORT=443
export KUBERNETES_SERVICE_PORT_HTTPS=443
export KUBERNETES_PORT=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP_PROTO=tcp
export KUBERNETES_PORT_443_TCP_PORT=443
export KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
export NVARCH=x86_64
export NVIDIA_REQUIRE_CUDA="cuda>=12.6 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551"
export NV_CUDA_CUDART_VERSION=12.6.77-1
export CUDA_VERSION=12.6.3
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=utility
export NVIDIA_DISABLE_REQUIRE=true
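# --device-config-file: reuse the device-config.yaml fetched from the plugin earlier
# -v=10: raise the log level to make debugging easier
# --scheduler-name: the scheduler name is required
# --http_bind, --cert_file, --key_file: the HTTPS endpoint used by the webhook
# (comments are kept above the command because a comment after a trailing
# backslash would cut the command short)
dlv debug --headless --listen=:2345 --api-version=2 ./cmd/scheduler/ -- \
    --device-config-file=config/device-config.yaml \
    -v=10 \
    --scheduler-name=hami-scheduler \
    --http_bind=0.0.0.0:8080 \
    --cert_file=config/tls.crt \
    --key_file=config/tls.key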
4.2. Modify the webhook
For kube-apiserver to reach the local service when a pod is created, the admission webhook configuration has to be changed:
k edit mutatingwebhookconfigurations.admissionregistration.k8s.io hami-webhook
Original configuration:
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    service:
      name: hami-scheduler
      namespace: kube-system
      path: /webhook
      port: 443
Change it to:
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    url: https://127.0.0.1:8080/webhook
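The same change can be applied without an interactive editor by sending a JSON patch. This is a sketch under the assumption that the webhook to change is the first entry in the webhooks list, as in the listing above; webhook_url_patch is a hypothetical helper name.

```shell
# Hypothetical helper: build a JSON patch that removes clientConfig.service
# from the first webhook and sets clientConfig.url to the given URL.
webhook_url_patch() {
    printf '[{"op":"remove","path":"/webhooks/0/clientConfig/service"},{"op":"add","path":"/webhooks/0/clientConfig/url","value":"%s"}]' "$1"
}

# Usage:
#   kubectl patch mutatingwebhookconfiguration hami-webhook --type=json \
#       -p "$(webhook_url_patch https://127.0.0.1:8080/webhook)"
```

A clientConfig may carry either service or url, which is why the patch removes one and adds the other in a single request.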
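4.3. Start an HTTP service for the hami-scheduler extender
The TLS configuration above is bound to 127.0.0.1. The admission webhook is called from the kube-apiserver pod, which uses the host's own IP, so that call works. The kube-scheduler container in the hami-scheduler pod, however, runs on the pod's virtual network and cannot reach a process started directly on the host through 127.0.0.1. The scheduler extender therefore needs an additional endpoint that the pod can reach. There are several ways to provide one, such as port forwarding or starting a second HTTP server. The second option is used here, and it requires a change to the scheduler code: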
diff --git a/cmd/scheduler/main.go b/cmd/scheduler/main.go
index 3c99042..c85c09f 100644
--- a/cmd/scheduler/main.go
+++ b/cmd/scheduler/main.go
@@ -84,12 +85,29 @@ func start() {
     router.GET("/healthz", routes.HealthzRoute())
     klog.Info("listen on ", config.HTTPBind)
     if len(tlsCertFile) == 0 || len(tlsKeyFile) == 0 {
-        if err := http.ListenAndServe(config.HTTPBind, router); err != nil {
-            klog.Fatal("Listen and Serve error, ", err)
+        go func() {
+            if err := http.ListenAndServe(config.HTTPBind, router); err != nil {
+                klog.Fatal("Listen and Serve error, ", err)
+            }
+        }()
+    } else {
+        go func() {
+            if err := http.ListenAndServeTLS(config.HTTPBind, tlsCertFile, tlsKeyFile, router); err != nil {
+                klog.Fatal("Listen and Serve TLS error, ", err)
+            }
+        }()
+    }
+
+    // Additional HTTP server on a different port
+    additionalBind := "10.10.10.8:8081"
+    klog.Info("listen on ", additionalBind)
+    if len(tlsCertFile) == 0 || len(tlsKeyFile) == 0 {
+        if err := http.ListenAndServe(additionalBind, router); err != nil {
+            klog.Fatal("Additional Listen and Serve error, ", err)
         }
     } else {
-        if err := http.ListenAndServeTLS(config.HTTPBind, tlsCertFile, tlsKeyFile, router); err != nil {
-            klog.Fatal("Listen and Serve error, ", err)
+        if err := http.ListenAndServeTLS(additionalBind, tlsCertFile, tlsKeyFile, router); err != nil {
+            klog.Fatal("Additional Listen and Serve TLS error, ", err)
         }
     }
 }
This starts two services on different host and port pairs: the admission webhook uses https://127.0.0.1:xxxx
and the scheduler extender uses https://10.10.10.8:8081/xxx
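Once the scheduler is running, both listeners can be checked through the /healthz route that main.go registers. A minimal sketch, assuming curl is available on the host; probe_healthz is a hypothetical helper, and the addresses in the usage lines are the ones used in this guide.

```shell
# Hypothetical helper: print the HTTP status code returned by <base>/healthz.
# -k skips certificate verification, which is only acceptable for a local
# debugging check like this one.
probe_healthz() {
    curl -ks -o /dev/null -w '%{http_code}' --max-time 3 "$1/healthz"
}

# Usage:
#   probe_healthz https://127.0.0.1:8080
#   probe_healthz https://10.10.10.8:8081
```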
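4.3.1. Change the scheduler extender address
With the server side of the extender ready, the client side needs the following change:
k edit cm hami-scheduler-newversion
Find this section: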
data:
  config.yaml: |
    ...
    extenders:
    - urlPrefix: "https://127.0.0.1:8081"
Change urlPrefix to the address of the service that was just started:
data:
  config.yaml: |
    ...
    extenders:
    - urlPrefix: "https://10.10.10.8:8081"
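If the ConfigMap is edited as a saved file rather than interactively, the urlPrefix line can be rewritten with sed. This is a sketch under assumptions: set_url_prefix is a hypothetical helper, the saved YAML keeps config.yaml as a block scalar so the urlPrefix line appears literally, and the new URL contains no | or & characters.

```shell
# Hypothetical helper: replace the quoted value of every urlPrefix key in a
# saved manifest with the given URL.
set_url_prefix() {
    file=$1
    new=$2
    tmp=$(mktemp) || return 1
    sed "s|\(urlPrefix:[[:space:]]*\)\"[^\"]*\"|\1\"$new\"|" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Usage:
#   kubectl get cm hami-scheduler-newversion -o yaml > cm.yaml
#   set_url_prefix cm.yaml https://10.10.10.8:8081
#   kubectl apply -f cm.yaml
```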
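4.4. Stop the original hami-scheduler
The hami-scheduler Deployment has two containers:
- kube-scheduler: extends the stock Kubernetes scheduler and forwards the scheduling of pods whose schedulerName is hami-scheduler to vgpu-scheduler-extender
- vgpu-scheduler-extender: the actual scheduling logic
The kube-scheduler container only registers the extender and forwards requests, so there is little to debug in it and the running one can be reused. vgpu-scheduler-extender has to be stopped, and the simplest way is again to change that container's image to an unreachable one.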
4.5. Start the local scheduler
bash start-scheduler.sh
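Configure GoLand in the same way as for the plugin to debug remotely.
Note that the port in GoLand must match the dlv port in the start script. If several remote debugging processes are needed at the same time, give each one a different port.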
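5. Test
Create a pod and check that the scheduling flow works. If the pod is scheduled and the logs look normal, the setup is working, and feature development or bug fixing can begin.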
k apply -f examples/nvidia/default_use.yaml
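Whether the pod was scheduled can be checked by polling its spec.nodeName. The sketch below is one way to do that; wait_scheduled is a hypothetical helper, and the pod name has to be taken from examples/nvidia/default_use.yaml since it is not stated in this guide.

```shell
# Hypothetical helper: poll until the pod has been bound to a node, then print
# the node name. Gives up after the given number of one-second attempts.
wait_scheduled() {
    pod=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        node=$(kubectl get pod "$pod" -o jsonpath='{.spec.nodeName}' 2>/dev/null)
        if [ -n "$node" ]; then
            echo "$node"
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    # Show the events to see why the pod is still pending.
    kubectl describe pod "$pod" | sed -n '/^Events:/,$p' >&2
    return 1
}

# Usage:
#   wait_scheduled <pod-name>
```

If the helper times out, the events it prints together with the local scheduler log are the first places to look.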