HAMI Project Local Debugging
1. Prerequisites
Debugging requires a fully installed and functional HAMI deployment on a Kubernetes cluster, so that much of its configuration can be reused locally.
- HAMI version: v2.4.1
- OS: Ubuntu 22.04.4 LTS
- Kernel: 5.15.0-125-generic
- NVCC: V11.5.119
- Nvidia-smi: 550.54.14
- CUDA: 12.4
nvidia-smi
Fri Dec 20 02:45:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:00:06.0 Off | N/A |
| 0% 26C P8 11W / 370W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
2. Create Local Directory for Configuration
Local startup depends on some files, which can be stored in a unified folder.
2.1. Create Folder
For example, create a config directory under the project root.
mkdir config
3. Local Debugging of hami-device-plugin
3.1. Collect Configuration
hami-device-plugin requires two configuration files:
- config/device-config.yaml
- config/config.json
Retrieve and save them to the local directory.
k exec hami-device-plugin-fn8lc -c device-plugin -- cat /device-config.yaml > config/device-config.yaml
k exec hami-device-plugin-fn8lc -c device-plugin -- cat /config/config.json > config/config.json
3.2. Collect Environment Variables
Tip: before exporting the pod's environment variables, set enableServiceLinks for the pod, e.g. spec.template.spec.enableServiceLinks: false in the workload spec. This strips the service-link variables Kubernetes injects automatically and keeps only the environment variables configured on the pod itself.
k exec hami-device-plugin-lzm4n -- env > start-plugin.sh
Note: the exported file contains a PATH variable; change it to export PATH=$PATH:xxxxxx, with xxxxxx replaced by the original value, and prefix every line with export. Debugging uses dlv together with GoLand. Remove entries containing cluster IP addresses, keep the port settings, and add the debug startup command to obtain a file like the following:
#!/usr/bin/bash
export PATH=$PATH:/k8s-vgpu/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
export HOSTNAME=hami-scheduler-858f6b9dcf-4nn58
export KUBERNETES_SERVICE_PORT=443
export KUBERNETES_SERVICE_PORT_HTTPS=443
export KUBERNETES_PORT=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP_PROTO=tcp
export KUBERNETES_PORT_443_TCP_PORT=443
export KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
export KUBERNETES_SERVICE_HOST=10.233.0.1
export NVARCH=x86_64
export NVIDIA_REQUIRE_CUDA="cuda>=12.6 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551"
export NV_CUDA_CUDART_VERSION=12.6.77-1
export CUDA_VERSION=12.6.3
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=utility
export NVIDIA_DISABLE_REQUIRE=true
# Manually add this variable
export CONFIG_FILE=config/device-config.yaml
/root/go/bin/dlv debug --headless --listen=:12345 --api-version=2 ./cmd/device-plugin/nvidia/
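The export-prefixing step described above can be sketched as a small shell helper (the function name env_to_exports is made up for illustration). It single-quotes each value, so entries containing spaces or '>' characters, such as NVIDIA_REQUIRE_CUDA, remain valid shell:

```shell
# Sketch: convert `kubectl exec ... env` output into a sourceable script.
# Each VAR=value line becomes an export statement with the value
# single-quoted, so spaces and '>' characters survive unmangled.
env_to_exports() {
  while IFS= read -r line; do
    name=${line%%=*}     # everything before the first '='
    value=${line#*=}     # everything after the first '='
    printf "export %s='%s'\n" "$name" "$value"
  done
}

printf 'FOO=bar baz\n' | env_to_exports
# → export FOO='bar baz'
```

Values that themselves contain single quotes would still need extra escaping; edit such lines by hand, as described above for PATH.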
3.3. Modify Configuration
Since the node’s configuration file path is hardcoded in the code, it needs to be modified (todo: this could potentially be optimized).
diff --git a/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go b/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
--- a/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
+++ b/pkg/device-plugin/nvidiadevice/nvinternal/plugin/server.go
@@ -101,7 +101,7 @@ type NvidiaDevicePlugin struct {
}
func readFromConfigFile(sConfig *nvidia.NvidiaConfig) (string, error) {
- jsonbyte, err := os.ReadFile("/config/config.json")
+ jsonbyte, err := os.ReadFile("config/config.json")
mode := "hami-core"
if err != nil {
return "", err
3.4. Adjust Log Level
The plugin code logs through klog at the Warning, Error, and Info levels. The default verbosity is 0, at which all three levels are emitted, so no additional log configuration is needed.
The v in the plugin startup command is not a log level, but a flag controlling whether to print the version.
https://github.com/urfave/cli/blob/main/flag.go
// VersionFlag prints the version for the application
var VersionFlag Flag = &BoolFlag{
Name: "version",
Aliases: []string{"v"},
Usage: "print the version",
HideDefault: true,
Local: true,
}
3.5. Stop the Original Service
The plugin runs as a DaemonSet, so there are several ways to stop the original service:
- Change the DaemonSet image to an unreachable one, which stops the plugin on every GPU node.
- Change the labels of the node being debugged so that the node no longer matches the DaemonSet's selector.
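For the node-label approach, a hypothetical sketch (the label key and value here are examples only, not something HAMI defines): if the plugin DaemonSet selects nodes via a nodeSelector like the fragment below, removing or changing that label on the debug node stops the in-cluster plugin there while leaving other nodes untouched.

```yaml
# Hypothetical DaemonSet fragment; label key/value are illustrative.
spec:
  template:
    spec:
      nodeSelector:
        gpu: "on"   # nodes without this label no longer run the plugin
```

Removing the label from the debug node (e.g. kubectl label node <node-name> gpu-) then evicts only that node's plugin pod.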
3.6. Start the New Service
Connect to the server remotely for debugging:
- Configure the host and port in Goland's Run/Debug Configurations to match those in start-plugin.sh.
- Start the plugin service on the server.
bash start-plugin.sh
This allows for breakpoint debugging of the plugin code.
4. Local Debugging of hami-scheduler
4.1. Collect Environment Variables for hami-scheduler
k exec hami-scheduler-67fc7ccd55-vjntl -c vgpu-scheduler-extender -- env > start-scheduler.sh
4.1.1. TLS Configuration
The k8s admissionWebhook must be called over HTTPS, so the started service needs to serve HTTPS. The Helm-chart-deployed hami service has already created the certificate, issued for the address 127.0.0.1, so the service must be reachable at https://127.0.0.1:xxxx/webhook.
Collect the TLS CERT and KEY:
k get secret hami-scheduler-tls -o jsonpath='{.data.tls\.crt}' |base64 -d > config/tls.crt
k get secret hami-scheduler-tls -o jsonpath='{.data.tls\.key}' |base64 -d > config/tls.key
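Since the webhook URL will use 127.0.0.1, it is worth confirming that the extracted certificate's SAN actually covers that IP. A sketch of the check (a throwaway certificate is generated here purely for demonstration; against the real files, inspect config/tls.crt instead; assumes OpenSSL >= 1.1.1 for -addext/-ext):

```shell
# Generate a demo certificate whose SAN covers 127.0.0.1.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/O=demo" \
  -addext "subjectAltName=IP:127.0.0.1" \
  -keyout /tmp/demo.key -out /tmp/demo.crt 2>/dev/null

# Print the SAN; it should list IP Address:127.0.0.1.
openssl x509 -in /tmp/demo.crt -noout -ext subjectAltName
```

Run the second command against config/tls.crt to verify the real certificate before editing the webhook.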
Write the startup command and append it to start-scheduler.sh:
dlv debug --headless --listen=:2345 --api-version=2 ./cmd/scheduler/ -- \
--device-config-file=config/device-config.yaml \
-v=10 \
--scheduler-name=hami-scheduler \
--http_bind=0.0.0.0:8080 \
--cert_file=config/tls.crt \
--key_file=config/tls.key
This results in a startup script start-scheduler.sh. As before, note the PATH handling: manually change it to PATH=$PATH:xxxxxxxx (appending the original value) and prefix each line with export.
export PATH=$PATH:/k8s-vgpu/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
export HOSTNAME=hami-scheduler-67fc7ccd55-vjntl
export NVIDIA_MIG_MONITOR_DEVICES=all
export HOOK_PATH=/usr/local
export KUBERNETES_SERVICE_HOST=10.233.0.1
export KUBERNETES_SERVICE_PORT=443
export KUBERNETES_SERVICE_PORT_HTTPS=443
export KUBERNETES_PORT=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP=tcp://10.233.0.1:443
export KUBERNETES_PORT_443_TCP_PROTO=tcp
export KUBERNETES_PORT_443_TCP_PORT=443
export KUBERNETES_PORT_443_TCP_ADDR=10.233.0.1
export NVARCH=x86_64
export NVIDIA_REQUIRE_CUDA="cuda>=12.6 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551"
export NV_CUDA_CUDART_VERSION=12.6.77-1
export CUDA_VERSION=12.6.3
export LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=utility
export NVIDIA_DISABLE_REQUIRE=true
# --device-config-file: the previously obtained device-config.yaml
# -v=10: raise log verbosity for easier debugging
# --scheduler-name: the scheduler name is required
# --http_bind: address the webhook listens on
dlv debug --headless --listen=:2345 --api-version=2 ./cmd/scheduler/ -- \
  --device-config-file=config/device-config.yaml \
  -v=10 \
  --scheduler-name=hami-scheduler \
  --http_bind=0.0.0.0:8080 \
  --cert_file=config/tls.crt \
  --key_file=config/tls.key
4.2. Modify Webhook
So that the kube-apiserver calls the locally started service instead of the in-cluster pod, modify the admissionWebhook configuration.
k edit mutatingwebhookconfigurations.admissionregistration.k8s.io hami-webhook
Original configuration:
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    service:
      name: hami-scheduler
      namespace: kube-system
      path: /webhook
      port: 443
Modify the configuration to point to the new service address. Note that clientConfig accepts either url or service, not both, so the service block is replaced by the url field:
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    url: https://127.0.0.1:8080/webhook
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJkakNDQVJ5Z0F3SUJBZ0lSQUxjd2FQMjUrMlphdGhTTlFMcG1qT0V3Q2dZSUtvWkl6ajBFQXdJd0R6RU4KTUFzR0ExVUVDaE1FYm1sc01UQWdGdzB5TkRFeU1EWXdOekV4TVRWYUdBOHlNVEkwTVRFeE1qQTNNVEV4TlZvdwpEekVOTUFzR0ExVUVDaE1FYm1sc01UQlpNQk1HQnlxR1NNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJDUnlXUDdYCkRmT2N4NEVTMVRYaUs0dnFFU2wrcUFHYjI2YzNrOEdMWlZTL1lHaFpLZVVxaEgydVRhTFdWTW1hZVJFbkxqM0cKSStMVFRVTTR6SVhEUld5alZ6QlZNQTRHQTFVZER3RUIvd1FFQXdJQ0JEQVRCZ05WSFNVRUREQUtCZ2dyQmdFRgpCUWNEQVRBUEJnTlZIUk1CQWY4RUJUQURBUUgvTUIwR0ExVWREZ1FXQkJTcVV4bWpGa29YUlpRK0xXVzBNM1pJCnMzck1wakFLQmdncWhrak9QUVFEQWdOSUFEQkZBaUJSY2VRL2tJVkR2VTV3Vjl0K3NRWm93TmFhTWhIMTV5K2sKT3VrR0FlRGVtQUloQUxDZzFrM0JQZUJBNG8reWY5emxvVjM2VEk2RHUzaGdMT1B3MXhaZkFvcDMKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
4.3. Start an HTTP Service for the Extended hami-scheduler
Since the TLS certificate is only valid for 127.0.0.1, an additional HTTP endpoint reachable from the kube-apiserver is needed. This involves modifying the scheduler code to listen on an extra port.
diff --git a/cmd/scheduler/main.go b/cmd/scheduler/main.go
index 3c99042..c85c09f 100644
--- a/cmd/scheduler/main.go
+++ b/cmd/scheduler/main.go
@@ -84,12 +85,29 @@ func start() {
router.GET("/healthz", routes.HealthzRoute())
klog.Info("listen on ", config.HTTPBind)
if len(tlsCertFile) == 0 || len(tlsKeyFile) == 0 {
- if err := http.ListenAndServe(config.HTTPBind, router); err != nil {
- klog.Fatal("Listen and Serve error, ", err)
+ go func() {
+ if err := http.ListenAndServe(config.HTTPBind, router); err != nil {
+ klog.Fatal("Listen and Serve error, ", err)
+ }
+ }()
+ } else {
+ go func() {
+ if err := http.ListenAndServeTLS(config.HTTPBind, tlsCertFile, tlsKeyFile, router); err != nil {
+ klog.Fatal("Listen and Serve TLS error, ", err)
+ }
+ }()
+ }
+
+ // Additional HTTP server on a different port
+ additionalBind := "0.0.0.0:8081"
+ klog.Info("listen on ", additionalBind)
+ if len(tlsCertFile) == 0 || len(tlsKeyFile) == 0 {
+ if err := http.ListenAndServe(additionalBind, router); err != nil {
+ klog.Fatal("Additional Listen and Serve error, ", err)
}
} else {
- if err := http.ListenAndServeTLS(config.HTTPBind, tlsCertFile, tlsKeyFile, router); err != nil {
- klog.Fatal("Listen and Serve error, ", err)
+ if err := http.ListenAndServeTLS(additionalBind, tlsCertFile, tlsKeyFile, router); err != nil {
+ klog.Fatal("Additional Listen and Serve TLS error, ", err)
}
}
}
This modification starts an additional server on a second port, allowing the kube-apiserver and other components to reach the scheduler alongside the 127.0.0.1-only webhook endpoint.
4.4. Stop the Original hami-scheduler
To debug the new scheduler setup, the original running instance of the hami-scheduler needs to be stopped. This can be done by scaling down the deployment or modifying the deployment to use an image that does not start the scheduler.
kubectl scale deployment hami-scheduler --replicas=0
4.5. Start the Local Scheduler
With the environment prepared, start the local version of the scheduler using the script prepared earlier.
bash start-scheduler.sh
This command starts the scheduler in debug mode, allowing for real-time debugging and testing of changes.
5. Testing
To verify that the local debugging setup works correctly, deploy a test pod that requires scheduling by the hami-scheduler.
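A minimal test-pod.yaml might look like the following sketch (the image tag and resource amounts are arbitrary examples; nvidia.com/gpu and nvidia.com/gpumem follow HAMI's resource-name conventions, but verify them against your device-config):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1       # one vGPU
        nvidia.com/gpumem: 2000 # device memory in MiB
```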
kubectl apply -f test-pod.yaml
Monitor the scheduler’s output and the Kubernetes events to ensure the pod is scheduled correctly and that the scheduler behaves as expected.
By following these steps, local debugging of the HAMI project can be effectively carried out, allowing for rapid development and troubleshooting of the scheduling components.