HAMI Project Local Debugging

Posted by elrond on December 20, 2024

1. Prerequisites

Debugging requires a fully installed, working HAMI deployment on a k8s cluster, because many of its configurations are reused locally. The environment used here:

  • HAMI version: v2.4.1
  • OS: Ubuntu 22.04.4 LTS
  • Kernel: 5.15.0-125-generic
  • NVCC: V11.5.119
  • Nvidia-smi: 550.54.14
  • CUDA: 12.4
nvidia-smi
Fri Dec 20 02:45:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:00:06.0 Off |                  N/A |
|  0%   26C    P8             11W /  370W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

2. Create Local Directory for Configuration

Starting the components locally depends on a few files; keeping them in one folder makes them easier to manage.

2.1. Create Folder

For example, create a config directory under the project.

mkdir config

3. Local Debugging of hami-device

3.1. Collect Configuration

hami-device requires two configuration files:

  • config/device-config.yaml
  • config/config.json

Retrieve them from a running device-plugin pod and save them to the local config directory (k is a shell alias for kubectl):

k exec hami-device-plugin-fn8lc -c device-plugin -- cat /device-config.yaml > config/device-config.yaml
k exec hami-device-plugin-fn8lc -c device-plugin -- cat /config/config.json > config/config.json
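
A failed k exec still leaves an empty file behind because the shell creates the redirect target first. The sketch below is a small sanity check for that case; the function name check_config_files is hypothetical, and it assumes the files were saved as config/device-config.yaml and config/config.json, the names used by the scripts later on.

```shell
# check_config_files: return non-zero if any given file is missing or empty.
# The function name is an illustration, not part of HAMI.
check_config_files() {
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      return 1
    fi
  done
}
```

Usage: check_config_files config/device-config.yaml config/config.json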

3.2. Collect Environment Variables

Tip: before exporting the pod's environment variables, set enableServiceLinks to false on the pod, for example spec.template.spec.enableServiceLinks: false in the workload spec. Kubernetes then stops injecting variables for the other Services in the namespace, which leaves the variables configured by the pod itself plus the KUBERNETES_* ones.

k exec hami-device-plugin-lzm4n -c device-plugin -- env > start-plugin.sh

Note: The exported file contains a PATH variable. Change it to export PATH=$PATH:xxxxxx, with xxxxxx replaced by the original value, so that the local PATH is extended rather than overwritten. Add export in front of every line, and quote values that contain spaces or shell metacharacters such as > and <. The final line of the script starts the plugin under dlv so that GoLand can attach to it.

Remove configurations containing IP addresses, retain port configurations, and append the debug command. The resulting file looks like this:

#!/usr/bin/bash
export PATH=$PATH:/k8s-vgpu/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
export HOSTNAME=hami-scheduler-858f6b9dcf-4nn58
export KUBERNETES_SERVICE_PORT=443
export KUBERNETES_SERVICE_PORT_HTTPS=443
export KUBERNETES_PORT=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP_PROTO=tcp
export KUBERNETES_PORT_443_TCP_PORT=443
export KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
export KUBERNETES_SERVICE_HOST=10.233.0.1
export NVARCH=x86_64
export NVIDIA_REQUIRE_CUDA='cuda>=12.6 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551'
export NV_CUDA_CUDART_VERSION=12.6.77-1
export CUDA_VERSION=12.6.3
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=utility
export NVIDIA_DISABLE_REQUIRE=true
# Manually add this variable
export CONFIG_FILE=config/device-config.yaml
/root/go/bin/dlv debug --headless --listen=:12345 --api-version=2 ./cmd/device-plugin/nvidia/
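
The manual edits described above (prefix each line with export, extend PATH instead of replacing it, quote the values) can be automated. The following is a minimal sketch, not part of HAMI: the function name env_to_exports is hypothetical, and it assumes every variable fits on a single KEY=VALUE line, which holds for the output shown here.

```shell
# env_to_exports: read KEY=VALUE lines on stdin and print export statements.
# Values are wrapped in single quotes so that characters such as > and <
# are not treated as redirections when the script is run.
env_to_exports() {
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip anything that is not a KEY=VALUE line (for example blank lines).
    case "$line" in
      [A-Za-z_]*=*) ;;
      *) continue ;;
    esac
    key=${line%%=*}
    value=${line#*=}
    if [ "$key" = "PATH" ]; then
      # Append the pod's PATH to the local PATH instead of replacing it.
      printf 'export PATH="$PATH:%s"\n' "$value"
    else
      # Escape embedded single quotes as '\'' before quoting the value.
      value=$(printf '%s' "$value" | sed "s/'/'\\\\''/g")
      printf "export %s='%s'\n" "$key" "$value"
    fi
  done
}
```

Usage: k exec hami-device-plugin-lzm4n -c device-plugin -- env | env_to_exports > start-plugin.sh, then remove the unwanted entries and append CONFIG_FILE and the dlv command by hand.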

3.3. Modify Configuration

The path of the node configuration file is hardcoded as /config/config.json, so change it to the relative path config/config.json for a local run (todo: this could potentially be optimized).

diff --git a/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go b/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
--- a/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
+++ b/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
@@ -101,7 +101,7 @@ type NvidiaDevicePlugin struct {
 }

 func readFromConfigFile(sConfig *nvidia.NvidiaConfig) (string, error) {
-       jsonbyte, err := os.ReadFile("/config/config.json")
+       jsonbyte, err := os.ReadFile("config/config.json")
        mode := "hami-core"
        if err != nil {
                return "", err

3.4. Adjust Log Level

The plugin code logs through klog at the Warning, Error, and Info severities only, so there is no need to tune the log level. The default verbosity is 0, at which all three severities are printed without additional log configuration.

The v in the plugin startup command is not a log level. It is the alias of the urfave/cli flag that prints the version.

https://github.com/urfave/cli/blob/main/flag.go

// VersionFlag prints the version for the application
var VersionFlag Flag = &BoolFlag{
 Name:        "version",
 Aliases:     []string{"v"},
 Usage:       "print the version",
 HideDefault: true,
 Local:       true,
}

3.5. Stop the Original Service

The plugin runs as a DaemonSet, so there are several ways to stop the service:

  • Change the DaemonSet image to an unreachable image, which stops the plugin on all GPU nodes.
  • Change the label of the node to be debugged so that it no longer matches the DaemonSet's node selector, which stops the plugin on that node only.
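
The second option can be sketched as two helper functions. This assumes the DaemonSet selects nodes with the label gpu=on, which is the Helm chart default as far as I know; check the actual selector with k get daemonset hami-device-plugin -o jsonpath='{.spec.template.spec.nodeSelector}' and adjust GPU_LABEL_KEY and GPU_LABEL_VALUE if it differs. The function and variable names are hypothetical.

```shell
# Assumed defaults: override them if the DaemonSet uses another node selector.
KUBECTL=${KUBECTL:-kubectl}
GPU_LABEL_KEY=${GPU_LABEL_KEY:-gpu}
GPU_LABEL_VALUE=${GPU_LABEL_VALUE:-on}

# detach_node: remove the label so the DaemonSet pod leaves this node.
detach_node() {
  "$KUBECTL" label node "$1" "${GPU_LABEL_KEY}-"
}

# reattach_node: restore the label once debugging is finished.
reattach_node() {
  "$KUBECTL" label node "$1" "${GPU_LABEL_KEY}=${GPU_LABEL_VALUE}" --overwrite
}
```

Usage: detach_node <node-name> before starting the local plugin, and reattach_node <node-name> afterwards.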

3.6. Start the New Service

Connect to the server remotely for debugging:

  • Configure the host and port in GoLand's Run/Debug Configurations to match those in start-plugin.sh.
  • Start the plugin service on the server:

bash start-plugin.sh

This allows for breakpoint debugging of the plugin code.

4. Local Debugging of hami-scheduler

4.1. Collect Environment Variables for hami-scheduler

k exec hami-scheduler-67fc7ccd55-vjntl -c vgpu-scheduler-extender -- env > start-scheduler.sh

4.1.1. TLS Configuration

The kube-apiserver only calls an admission webhook over HTTPS, so the locally started service must serve HTTPS. The hami deployment installed by the Helm chart has already created a certificate, issued for the address 127.0.0.1, so the local service must be reachable at https://127.0.0.1:xxxx/webhook.

Collect the TLS CERT and KEY:

k get secret hami-scheduler-tls -o jsonpath='{.data.tls\.crt}' |base64 -d > config/tls.crt
k get secret hami-scheduler-tls -o jsonpath='{.data.tls\.key}' |base64 -d > config/tls.key
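
Before relying on the certificate, it can help to confirm that it really covers 127.0.0.1. The sketch below assumes the address is recorded as an IP subject alternative name and that openssl is installed locally; the function name cert_covers_ip is hypothetical.

```shell
# cert_covers_ip: succeed if the certificate lists the given IP as a SAN.
# usage: cert_covers_ip CERT_FILE IP
cert_covers_ip() {
  # openssl prints SAN entries as "IP Address:<ip>"; -w avoids prefix matches.
  openssl x509 -in "$1" -noout -text | grep -qwF "IP Address:$2"
}
```

Usage: cert_covers_ip config/tls.crt 127.0.0.1 && echo "certificate is valid for 127.0.0.1"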

Write the startup command and append it to start-scheduler.sh:

dlv debug --headless --listen=:2345 --api-version=2 ./cmd/scheduler/  -- \
  --device-config-file=config/device-config.yaml \
  -v=10 \
  --scheduler-name=hami-scheduler \
  --http_bind=0.0.0.0:8080 \
  --cert_file=config/tls.crt \
  --key_file=config/tls.key

The result is the startup script file start-scheduler.sh.

As with the plugin script, handle PATH manually: change it to PATH=$PATH:xxxxxxxx so that the local PATH is extended rather than overwritten.

export PATH=$PATH:/k8s-vgpu/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
export HOSTNAME=hami-scheduler-67fc7ccd55-vjntl
export NVIDIA_MIG_MONITOR_DEVICES=all
export HOOK_PATH=/usr/local
export KUBERNETES_SERVICE_HOST=10.233.0.1
export KUBERNETES_SERVICE_PORT=443
export KUBERNETES_SERVICE_PORT_HTTPS=443
export KUBERNETES_PORT=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP_PROTO=tcp
export KUBERNETES_PORT_443_TCP_PORT=443
export KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
export NVARCH=x86_64
export NVIDIA_REQUIRE_CUDA='cuda>=12.6 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551'
export NV_CUDA_CUDART_VERSION=12.6.77-1
export CUDA_VERSION=12.6.3
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=utility
export NVIDIA_DISABLE_REQUIRE=true

# Flags passed to the scheduler after "--" (comments cannot sit between
# backslash-continued lines, so they are collected here):
#   --device-config-file  use the previously obtained device-config.yaml
#   -v=10                 increase log level for easier debugging
#   --scheduler-name      scheduler name is required
#   --http_bind           address needed for the webhook URL
dlv debug --headless --listen=:2345 --api-version=2 ./cmd/scheduler/ -- \
  --device-config-file=config/device-config.yaml \
  -v=10 \
  --scheduler-name=hami-scheduler \
  --http_bind=0.0.0.0:8080 \
  --cert_file=config/tls.crt \
  --key_file=config/tls.key

4.2. Modify Webhook

To make the kube-apiserver call the locally started scheduler instead of the pod, modify the admissionWebhook configuration.

k edit mutatingwebhookconfigurations.admissionregistration.k8s.io hami-webhook

Original configuration:

webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    service:
      name: hami-scheduler
      namespace: kube-system
      path: /webhook
      port: 443


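Modify the configuration to point to the new service address. Because the URL uses 127.0.0.1, the local scheduler must run on the same host as the kube-apiserver: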

webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    url: https://127.0.0.1:8080/webhook
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    # service block removed: clientConfig accepts exactly one of url or service
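
The same change can be applied without an interactive edit by sending a JSON Patch. This is a sketch under two assumptions: the webhook to change is the first entry in the webhooks list, and its clientConfig currently contains a service block (the remove operation fails otherwise). The function names are hypothetical.

```shell
KUBECTL=${KUBECTL:-kubectl}

# webhook_patch: print a JSON Patch that drops clientConfig.service and
# sets clientConfig.url, since only one of the two may be present.
webhook_patch() {
  printf '[{"op":"remove","path":"/webhooks/0/clientConfig/service"},{"op":"add","path":"/webhooks/0/clientConfig/url","value":"%s"}]\n' "$1"
}

# point_webhook_at: apply the patch to the hami-webhook configuration.
point_webhook_at() {
  "$KUBECTL" patch mutatingwebhookconfiguration hami-webhook --type=json -p "$(webhook_patch "$1")"
}
```

Usage: point_webhook_at https://127.0.0.1:8080/webhook. Keep a copy of the original configuration so it can be restored after debugging.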


4.3. Start an HTTP Service for the Extended hami-scheduler

Since the TLS certificate is only valid for 127.0.0.1, an additional listener is needed for callers that reach the scheduler on another address. This involves modifying the scheduler code to serve the same routes on a second port.

diff --git a/cmd/scheduler/main.go b/cmd/scheduler/main.go
index 3c99042..c85c09f 100644
--- a/cmd/scheduler/main.go
+++ b/cmd/scheduler/main.go
@@ -84,12 +85,29 @@ func start() {
        router.GET("/healthz", routes.HealthzRoute())
        klog.Info("listen on ", config.HTTPBind)
        if len(tlsCertFile) == 0 || len(tlsKeyFile) == 0 {
-               if err := http.ListenAndServe(config.HTTPBind, router); err != nil {
-                       klog.Fatal("Listen and Serve error, ", err)
+               go func() {
+                       if err := http.ListenAndServe(config.HTTPBind, router); err != nil {
+                               klog.Fatal("Listen and Serve error, ", err)
+                       }
+               }()
+       } else {
+               go func() {
+                       if err := http.ListenAndServeTLS(config.HTTPBind, tlsCertFile, tlsKeyFile, router); err != nil {
+                               klog.Fatal("Listen and Serve TLS error, ", err)
+                       }
+               }()
+       }
+
+       // Additional HTTP server on a different port
+       additionalBind := "0.0.0.0:8081" 
+       klog.Info("listen on ", additionalBind)
+       if len(tlsCertFile) == 0 || len(tlsKeyFile) == 0 {
+               if err := http.ListenAndServe(additionalBind, router); err != nil {
+                       klog.Fatal("Additional Listen and Serve error, ", err)
                }
        } else {
-               if err := http.ListenAndServeTLS(config.HTTPBind, tlsCertFile, tlsKeyFile, router); err != nil {
-                       klog.Fatal("Listen and Serve error, ", err)
+               if err := http.ListenAndServeTLS(additionalBind, tlsCertFile, tlsKeyFile, router); err != nil {
+                       klog.Fatal("Additional Listen and Serve TLS error, ", err)
                }
        }
 }

This modification moves the original listener into a goroutine and starts a second server on 0.0.0.0:8081 with the same router. Like the first, the second server uses HTTPS when a certificate and key are supplied and plain HTTP otherwise.

4.4. Stop the Original hami-scheduler

To debug the new scheduler setup, the original running instance of the hami-scheduler needs to be stopped. This can be done by scaling down the deployment or modifying the deployment to use an image that does not start the scheduler.

kubectl scale deployment hami-scheduler --replicas=0

4.5. Start the Local Scheduler

With the environment prepared, start the local version of the scheduler using the script prepared earlier.

bash start-scheduler.sh

This starts the scheduler under dlv in headless mode, listening on port 2345. Attach GoLand to that port, as for the plugin, to set breakpoints and step through the code.
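
Once the debugger is attached and the program has been resumed, the /healthz route registered in main.go offers a quick liveness check. Note that a headless dlv keeps the program paused until a client connects, so the check fails before that. The sketch below assumes curl is installed and skips certificate verification with --insecure for brevity; the function name check_healthz is hypothetical.

```shell
CURL=${CURL:-curl}

# check_healthz: request /healthz from the given base URL.
# usage: check_healthz https://127.0.0.1:8080
check_healthz() {
  "$CURL" --silent --show-error --insecure --max-time 5 "$1/healthz"
}
```

Usage: check_healthz https://127.0.0.1:8080 for the listener set by --http_bind.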

5. Testing

To verify that the local debugging setup works, deploy a test pod that requests GPU resources, so that it passes through the webhook and is scheduled by hami-scheduler.

kubectl apply -f test-pod.yaml

Monitor the scheduler’s output and the Kubernetes events to ensure the pod is scheduled correctly and that the scheduler behaves as expected.
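
The article does not show test-pod.yaml. The sketch below writes one possible version; the pod name, image, command, and memory size are illustrative assumptions, and nvidia.com/gpu and nvidia.com/gpumem are the resource names I understand HAMI to use by default. Adjust them if the deployment is configured differently.

```shell
# write_test_pod: write a minimal GPU test pod manifest to the given path.
# All values in the manifest are assumptions for illustration.
write_test_pod() {
  cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: hami-debug-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: ubuntu:22.04
      command: ["bash", "-c", "sleep 3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 1024
EOF
}
```

Usage: write_test_pod test-pod.yaml, apply it, then run kubectl get events --field-selector involvedObject.name=hami-debug-test to follow the scheduling events while stepping through the breakpoints.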

With these steps in place, both the device plugin and the scheduler of the HAMI project can be debugged locally with breakpoints, which speeds up development and troubleshooting.