One of the most annoying errors while working with CUDA and TensorFlow is also one of the most cryptic: failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
This shows up in TensorFlow 2.1.0 as something like
[root@b6e99c245339 /]# python3
Python 3.6.3 (default, Oct 24 2019, 00:21:12)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.__version__
'2.1.0'
>>> tf.config.list_physical_devices('GPU')
2020-04-30 03:21:28.520077: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-30 03:21:28.524110: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2020-04-30 03:21:28.524189: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: b6e99c245339
2020-04-30 03:21:28.524221: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: b6e99c245339
2020-04-30 03:21:28.524376: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.87.0
2020-04-30 03:21:28.524485: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.87.0
2020-04-30 03:21:28.524523: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 418.87.0
[]
There are quite a few discussions of this error, but they seem to stem from different root causes. Now that I’ve run into it a few times, here’s my checklist of fixes to try:
Rebooting the machine
I’ve been advised to do this before. It seems to help some people, but it has never helped me.
Install nvidia-modprobe
- On Ubuntu:
sudo apt-get install nvidia-modprobe
- On RHEL/CentOS:
yum install nvidia-modprobe
Rebooting the machine may be necessary after running this command.
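After installing (and rebooting if needed), it can be worth confirming that the NVIDIA kernel modules are actually loaded and that the /dev/nvidia* device nodes exist; a quick check might look like
lsmod | grep nvidia     # are the nvidia kernel modules loaded?
ls -l /dev/nvidia*      # did nvidia-modprobe create the device nodes?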
Ensure that LD_LIBRARY_PATH contains correct paths to CUDA
In my case, I have
LD_LIBRARY_PATH="/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/extras/CUPTI/lib64:/usr/local/cuda-10.1/targets/x86_64-linux/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
Yours may differ, but in one instance, I realized that I had inadvertently included stub libs (e.g. /usr/local/cuda/lib64/stubs). In another, there was a version mismatch (note that /usr/local/cuda is symlinked to a specific version, e.g. /usr/local/cuda-10.1).
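A few commands that can help spot these issues (libcudart below is just an example of what to grep for):
echo "$LD_LIBRARY_PATH" | tr ':' '\n'   # inspect each entry; watch for .../stubs
ls -l /usr/local/cuda                   # see which cuda-X.Y the symlink points to
ldconfig -p | grep libcudart            # which CUDA runtime libraries the loader can find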
Running nvidia-cuda-mps-server
The NVIDIA Multi-Process Service (MPS) allows CUDA kernels from multiple processes to run concurrently on the same GPU. In my setup, if I’m running a CUDA application in a Docker container, this server should be running on the host.
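For reference, the server is managed through the MPS control daemon; starting and stopping it looks roughly like this (run it as the user that owns the GPU, typically root):
nvidia-cuda-mps-control -d              # start the control daemon; it spawns nvidia-cuda-mps-server on demand
ps aux | grep mps                       # confirm the daemon/server are running
echo quit | nvidia-cuda-mps-control     # shut MPS down again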
Uninstall/Reinstall CUDA with the correct driver versions
It should be obvious if driver versions are mismatched, but I’ve read that uninstalling and reinstalling sometimes works.
In some cases, modprobe may have detected the incorrect nvidia version. Running sudo modprobe --force-modversion nvidia-<nvidia-version>-uvm could help.
On a related note, though rare, it may be worth checking that your CUDA installation was not corrupted by verifying that your installer(s) match NVIDIA’s published checksums.
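For example (the installer filename below is just a placeholder for whatever you downloaded):
md5sum cuda_10.1.243_418.87.00_linux.run   # compare against the checksum on NVIDIA's download page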
Don’t forget the --gpus flag on docker run
By default, Docker does not include GPU devices in containers without the --gpus flag. To run the TensorFlow Docker container, for example, the command is something like
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu
In older versions of Docker, this flag was --runtime=nvidia.
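A quick way to confirm that containers can see the GPU at all, independent of TensorFlow (the CUDA image tag here is only an example and should roughly match your driver/CUDA setup):
docker run --gpus all --rm nvidia/cuda:10.1-base nvidia-smi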
Set NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES
This one is a bit of a mystery to me, but it came up in a situation where tensorflow-gpu was throwing this CUDA_ERROR_UNKNOWN in a Docker container, when
- TensorFlow could load CUDA libs and use GPUs on my host machine, and
- tools like nvidia-smi and nvcc worked inside the container
Adding the following environment variables to my Dockerfile fixed my problem.
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
According to the nvidia-container-runtime doc, NVIDIA_VISIBLE_DEVICES “controls which GPUs will be made accessible inside the container” and NVIDIA_DRIVER_CAPABILITIES “controls which driver libraries/binaries will be mounted inside the container.” (These variables are already set in the nvidia-docker image.)
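If you’d rather not bake these into your image, the same variables can also be passed at run time; a sketch using the TensorFlow image from above:
docker run --gpus all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -it --rm tensorflow/tensorflow:latest-gpu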
There are also separate environment variables, NVIDIA_REQUIRE_CUDA and CUDA_VERSION, that looked promising, but they weren’t the solution for me.
References
If you’re reading this post, then chances are you’re either me, or you’re facing this error too. Hopefully one of the above works. Otherwise, here are some links that I found useful when googling around.
2 Comments
Thank you so much for this post. I keep banging my head against the wall every time: 1) nvidia drivers get uninstalled on my ubuntu (fixed, my fingers crossed); 2) tf doesn’t see my gpu
Your sanity checklist helped!
Thanks for the post. I’m training a sample model by running it inside a docker container. The process is able to detect the GPU and train successfully when MPS is off on the host. However, cuInit fails when I start MPS daemon on the host. How do I run nvidia-docker along with nvidia-cuda-mps-server?