TIL: CUDA issues
First of all, apologies for the rushed writing, but I’m just too happy that I finally resolved my CUDA problem after days of trying. Maybe after I calm down a bit I’ll come back to make it more pleasant to read.
I’m trying to run NVIDIA Docker containers on my Linux Mint 21.3 machine. This post solves the "Failed to initialize NVML: Unknown Error" problem. However, unlike the other posts about "Failed to initialize NVML: Unknown Error", where the GPU goes offline after a certain period of time, my GPU wasn’t detected at all!
CUDA works great on bare metal, but once containerized there’s always a problem of the GPU not being detected in Docker.
For my solution, check the last entry.
But first: neofetch
...-:::::-... user@computer
.-MMMMMMMMMMMMMMM-. ----------------
.-MMMM`..-:::::::-..`MMMM-. OS: Linux Mint 21.3 x86_64
.:MMMM.:MMMMMMMMMMMMMMM:.MMMM:. Host: MS-7D51 1.0
-MMM-M---MMMMMMMMMMMMMMMMMMM.MMM- Kernel: 5.15.0-113-generic
`:MMM:MM` :MMMM:....::-...-MMMM:MMM:` Uptime: 52 mins
:MMM:MMM` :MM:` `` `` `:MMM:MMM: Packages: 3113 (dpkg)
.MMM.MMMM` :MM. -MM. .MM- `MMMM.MMM. Shell: zsh 5.8.1
:MMM:MMMM` :MM. -MM- .MM: `MMMM-MMM: Resolution: 2560x1440
:MMM:MMMM` :MM. -MM- .MM: `MMMM:MMM: DE: Cinnamon 6.0.4
:MMM:MMMM` :MM. -MM- .MM: `MMMM-MMM: WM: Mutter (Muffin)
.MMM.MMMM` :MM:--:MM:--:MM: `MMMM.MMM. WM Theme: Mint-Y-Dark-Aqua (Mint-Y)
:MMM:MMM- `-MMMMMMMMMMMM-` -MMM-MMM: Theme: Mint-Y-Aqua [GTK2/3]
:MMM:MMM:` `:MMM:MMM: Icons: Mint-Y-Sand [GTK2/3]
.MMM.MMMM:--------------:MMMM.MMM. Terminal: gnome-terminal
'-MMMM.-MMMMMMMMMMMMMMM-.MMMM-' CPU: AMD Ryzen 7 3700X (16) @ 3.600GHz
'.-MMMM``--:::::--``MMMM-.' GPU: NVIDIA GeForce RTX 3090
'-MMMMMMMMMMMMM-' Memory: 8751MiB / 32004MiB
``-:::::-``
What we’re trying to solve:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
>> Failed to initialize NVML: Unknown Error
Ideally, we want it to print out an nvidia-smi screen.
My troubleshooting steps:
Cgroupfs:
- This is probably not a problem anymore (fixed), but might as well do it because anything to do with CUDA is black magic
- tl;dr: Set /etc/docker/daemon.json to
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
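Docker only reads daemon.json at startup, so the daemon needs a restart before any of this applies. Here is a rough sketch for checking that the settings actually landed. The helper name check_docker_settings is made up for this post, and the restart command assumes a systemd-based system (which Mint is):

```shell
# Hypothetical helper: compares what the Docker daemon reports against
# the two values set in /etc/docker/daemon.json.
check_docker_settings() {
    runtime="$1"   # reported default runtime
    driver="$2"    # reported cgroup driver
    rc=0
    if [ "$runtime" != "nvidia" ]; then
        echo "default runtime is '$runtime', expected 'nvidia'"
        rc=1
    fi
    if [ "$driver" != "cgroupfs" ]; then
        echo "cgroup driver is '$driver', expected 'cgroupfs'"
        rc=1
    fi
    if [ "$rc" -eq 0 ]; then
        echo "daemon.json settings are active"
    fi
    return "$rc"
}

# Usage (assumes systemd and a running Docker daemon):
#   sudo systemctl restart docker
#   check_docker_settings \
#       "$(sudo docker info --format '{{.DefaultRuntime}}')" \
#       "$(sudo docker info --format '{{.CgroupDriver}}')"
```

If this prints a mismatch, the daemon is probably still running with the old config, or the JSON has a syntax error somewhere.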
Passing in the devices, in docker compose
- tl;dr: Add the following to your docker-compose.yaml, under a service:
services:
  ServiceName:
    # ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: ["gpu", "compute", "utility"]
              # "gpu" may or may not be present depending on the video card
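To try the compose route without touching a real project, something like the following might help. It is only a sketch: write_gpu_test_compose and the gpu-test service name are invented for illustration, and the file simply mirrors the docker run test from earlier (the ubuntu image running nvidia-smi) with the same device reservation.

```shell
# Hypothetical helper: writes a throwaway compose file whose only service
# runs nvidia-smi with the GPU reservation from this section.
write_gpu_test_compose() {
    target_dir="$1"
    mkdir -p "$target_dir"
    cat > "$target_dir/docker-compose.yaml" <<'EOF'
services:
  gpu-test:   # hypothetical service name
    image: ubuntu
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: ["gpu", "compute", "utility"]
EOF
}

# Usage:
#   write_gpu_test_compose /tmp/gpu-test
#   sudo docker compose -f /tmp/gpu-test/docker-compose.yaml up
```

If the container prints the nvidia-smi table, the compose side is fine; if it prints the NVML error, the problem is further down the stack.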
Pass in the GPU using --gpus all
- When you run something in Docker, tack on --gpus all
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
>> Failed to initialize NVML: Unknown Error
…drat no change
Passing in Devices (my problem):
- My problem:
  - For some reason, the symlinks for all the different devices aren’t made
  - Had to add all of --device=/dev/nvidia-uvm-tools --device=/dev/nvidia-modeset --device=/dev/nvidiactl --device=/dev/nvidia0, otherwise the GPU wouldn’t be detected
  - Next up is to try
    sudo nvidia-ctk system create-dev-char-symlinks \
        --create-all
- Test if you have this problem:
  - sudo docker run --rm --runtime=nvidia --gpus all --device=/dev/nvidia-uvm-tools --device=/dev/nvidia-modeset --device=/dev/nvidiactl --device=/dev/nvidia0 ubuntu nvidia-smi
  - I only needed to add --device=/dev/nvidiactl --device=/dev/nvidia0, but ymmv
- It works~
sudo docker run --rm --runtime=nvidia --gpus all --device=/dev/nvidia-uvm-tools --device=/dev/nvidia-modeset --device=/dev/nvidiactl --device=/dev/nvidia0 ubuntu nvidia-smi
>>
Tue Jul 23 14:08:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:2D:00.0 On | N/A |
| 36% 34C P2 106W / 350W | 587MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
- Relevant links:
  - https://github.com/NVIDIA/nvidia-docker/issues/1730#issue-1573551271 (used the repro steps here to finally figure out that the GPU was indeed getting detected)
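Since the set of /dev/nvidia* nodes differs from machine to machine, typing the --device flags by hand gets old fast. Below is a small sh/bash sketch that builds them from whatever is present, plus a check for the /dev/char symlinks. Both function names are made up for this post, and the second one rests on my assumption that the symlinks nvidia-ctk creates live in /dev/char and point at the nvidia device nodes.

```shell
# Hypothetical helper: prints one --device flag per NVIDIA device node
# found in the given directory (defaults to /dev).
nvidia_device_flags() {
    dev_dir="${1:-/dev}"
    flags=""
    for node in "$dev_dir"/nvidia*; do
        # Skip an unmatched glob and directories such as /dev/nvidia-caps.
        [ -e "$node" ] || continue
        if [ -d "$node" ]; then
            continue
        fi
        flags="$flags --device=$node"
    done
    # Strip the leading space before printing.
    echo "${flags# }"
}

# Hypothetical helper: lists symlinks in the given directory (defaults to
# /dev/char) whose target mentions "nvidia". Assumption: an empty result
# corresponds to the missing-symlink situation described in this post.
nvidia_char_symlinks() {
    char_dir="${1:-/dev/char}"
    for link in "$char_dir"/*; do
        [ -L "$link" ] || continue
        case "$(readlink "$link")" in
            *nvidia*) echo "$link" ;;
        esac
    done
}

# Usage:
#   sudo docker run --rm --runtime=nvidia --gpus all \
#       $(nvidia_device_flags) ubuntu nvidia-smi
#   nvidia_char_symlinks
```

The command substitution is left unquoted on purpose so that each flag becomes its own argument. This passes every node it finds, which is more than I needed (nvidiactl and nvidia0 were enough for me), but it is a quick way to test whether missing devices are your problem too.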