Kevin Jiang

TIL: CUDA issues

First of all, apologies for the rushed writing; I’m just so happy that I finally resolved my CUDA problem after days of trying. Once I calm down a bit, I may come back and make this more pleasant to read.

I’m trying to run NVIDIA Docker containers on my Linux Mint 21.3 machine. This post covers how I solved the Failed to initialize NVML: Unknown Error problem. Unlike the other posts about this error, where the GPU goes offline after a certain period of time, my GPU wasn’t detected at all!

CUDA works great on bare metal, but once containerized, the GPU is never detected in Docker.

For my solution, check the last entry.

But first: neofetch

            ...-:::::-...                 user@computer
          .-MMMMMMMMMMMMMMM-.              ----------------
      .-MMMM`..-:::::::-..`MMMM-.          OS: Linux Mint 21.3 x86_64
    .:MMMM.:MMMMMMMMMMMMMMM:.MMMM:.        Host: MS-7D51 1.0
   -MMM-M---MMMMMMMMMMMMMMMMMMM.MMM-       Kernel: 5.15.0-113-generic
 `:MMM:MM`  :MMMM:....::-...-MMMM:MMM:`    Uptime: 52 mins
 :MMM:MMM`  :MM:`  ``    ``  `:MMM:MMM:    Packages: 3113 (dpkg)
.MMM.MMMM`  :MM.  -MM.  .MM-  `MMMM.MMM.   Shell: zsh 5.8.1
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM-MMM:   Resolution: 2560x1440
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM:MMM:   DE: Cinnamon 6.0.4
:MMM:MMMM`  :MM.  -MM-  .MM:  `MMMM-MMM:   WM: Mutter (Muffin)
.MMM.MMMM`  :MM:--:MM:--:MM:  `MMMM.MMM.   WM Theme: Mint-Y-Dark-Aqua (Mint-Y)
 :MMM:MMM-  `-MMMMMMMMMMMM-`  -MMM-MMM:    Theme: Mint-Y-Aqua [GTK2/3]
  :MMM:MMM:`                `:MMM:MMM:     Icons: Mint-Y-Sand [GTK2/3]
   .MMM.MMMM:--------------:MMMM.MMM.      Terminal: gnome-terminal
     '-MMMM.-MMMMMMMMMMMMMMM-.MMMM-'       CPU: AMD Ryzen 7 3700X (16) @ 3.600GHz
       '.-MMMM``--:::::--``MMMM-.'         GPU: NVIDIA GeForce RTX 3090
            '-MMMMMMMMMMMMM-'              Memory: 8751MiB / 32004MiB
               ``-:::::-``

What we’re trying to solve:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

>> Failed to initialize NVML: Unknown Error

Ideally, we want it to print out an nvidia-smi screen.

My troubleshooting steps:

Cgroupfs:

Set /etc/docker/daemon.json to the following, then restart the Docker daemon so the change takes effect:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
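
A quick sanity check can catch a typo in daemon.json before blaming the GPU. The sketch below is not part of the original troubleshooting: the function names uses_cgroupfs and defaults_to_nvidia are made up for illustration, and they do plain text matching rather than real JSON parsing. The commented usage assumes Docker is managed by systemd.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any tool).

# Succeed if the given daemon.json selects the cgroupfs cgroup driver.
uses_cgroupfs() {
    grep -q 'native\.cgroupdriver=cgroupfs' "$1"
}

# Succeed if the given daemon.json makes "nvidia" the default runtime.
# This is a text match only; it does not validate the JSON.
defaults_to_nvidia() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Usage on a real machine (needs root; assumes a systemd-managed Docker):
# uses_cgroupfs /etc/docker/daemon.json \
#     && defaults_to_nvidia /etc/docker/daemon.json \
#     && sudo systemctl restart docker \
#     && sudo docker info --format '{{.CgroupDriver}} {{.DefaultRuntime}}'
```

If the last command does not report cgroupfs and nvidia, the daemon most likely did not pick up the file.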

Passing in the devices in Docker Compose:

Add the following to your docker-compose.yaml, under a service:

services:
  ServiceName:
    # ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: ["gpu", "compute", "utility"]
              # "gpu" may or may not be present depending on the video card

Pass in the GPU using --gpus all:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

>> Failed to initialize NVML: Unknown Error

…drat, no change.
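
At this point, one thing worth checking is whether the NVIDIA device nodes that exist on the host are visible inside the container at all. The following is a minimal sketch of that comparison, not something from the original troubleshooting: missing_nodes is a hypothetical helper name, and the commented usage assumes the container image ships sh and ls (the ubuntu image does).

```shell
#!/bin/sh
# Hypothetical helper: print every line of the first list (host nodes)
# that does not appear as a whole line in the second list (container nodes).
missing_nodes() {
    host_list=$1
    container_list=$2
    printf '%s\n' "$host_list" | while IFS= read -r node; do
        # Skip blank lines so an empty host list prints nothing.
        [ -n "$node" ] || continue
        printf '%s\n' "$container_list" | grep -Fxq -- "$node" \
            || printf '%s\n' "$node"
    done
}

# Usage on a real machine (needs Docker and the NVIDIA runtime):
# host=$(ls -d /dev/nvidia* 2>/dev/null)
# container=$(sudo docker run --rm --runtime=nvidia --gpus all ubuntu \
#     sh -c 'ls -d /dev/nvidia* 2>/dev/null')
# missing_nodes "$host" "$container"
```

Any path it prints exists on the host but not in the container, which points toward a device-passing problem rather than a driver problem.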

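Passing in Devices (my problem):

This is what finally worked: create the /dev/char symlinks for the NVIDIA device nodes with nvidia-ctk, then pass the device nodes to the container explicitly.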

sudo nvidia-ctk system create-dev-char-symlinks \
    --create-all
sudo docker run --rm --runtime=nvidia --gpus all \
    --device=/dev/nvidia-uvm-tools --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl --device=/dev/nvidia0 \
    ubuntu nvidia-smi

>>

Tue Jul 23 14:08:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:2D:00.0  On |                  N/A |
| 36%   34C    P2             106W / 350W |    587MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
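
The --device flags in the working command can also be generated rather than typed by hand. This is a rough sketch under a few assumptions: device_flags is a hypothetical helper name, the commented usage passes every NVIDIA character device found directly under /dev (which may be more than the four listed above), and paths containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print one --device=<path> flag per argument,
# space-separated on a single line.
device_flags() {
    flags=""
    for node in "$@"; do
        # Add a separating space only when flags is already non-empty.
        flags="${flags:+$flags }--device=$node"
    done
    printf '%s\n' "$flags"
}

# Usage on a real machine (needs Docker and the NVIDIA runtime):
# nodes=$(find /dev -maxdepth 1 -type c -name 'nvidia*')
# sudo docker run --rm --runtime=nvidia --gpus all \
#     $(device_flags $nodes) ubuntu nvidia-smi
```

The unquoted expansions in the usage lines are deliberate so that each path and each flag becomes its own argument.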