failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error


One of the most annoying errors while working with CUDA and TensorFlow is also one of the most cryptic: failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

This shows up in TensorFlow 2.1.0 as something like

[root@b6e99c245339 /]# python3
Python 3.6.3 (default, Oct 24 2019, 00:21:12)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.__version__
'2.1.0'
>>> tf.config.list_physical_devices('GPU')
2020-04-30 03:21:28.520077: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-30 03:21:28.524110: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2020-04-30 03:21:28.524189: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: b6e99c245339
2020-04-30 03:21:28.524221: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: b6e99c245339
2020-04-30 03:21:28.524376: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.87.0
2020-04-30 03:21:28.524485: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.87.0
2020-04-30 03:21:28.524523: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 418.87.0
[]

There are quite a few discussions of this error online, but they seem to point to several different root causes. Now that I’ve encountered it a few times, here’s my checklist of fixes to try:

Rebooting the machine

I’ve been advised to do this before. It seems to help some people, but it has never helped me.

Install nvidia-modprobe

  • On Ubuntu: sudo apt-get install nvidia-modprobe
  • On RHEL/CentOS: yum install nvidia-modprobe

Rebooting the machine may be necessary after running this command.
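If you want to verify whether the driver modules and device nodes are actually in place before and after installing nvidia-modprobe, a quick check along these lines may help (a minimal sketch; the exact device file names vary by driver version and GPU count):

# Check that the NVIDIA kernel modules are loaded
lsmod | grep nvidia

# Check that the device nodes exist; nvidia-modprobe can create these
# when they are missing (e.g. /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm)
ls -l /dev/nvidia*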

Ensure that LD_LIBRARY_PATH contains correct paths to CUDA

In my case, I have

export LD_LIBRARY_PATH="/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/extras/CUPTI/lib64:/usr/local/cuda-10.1/targets/x86_64-linux/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

Yours may differ, but in one instance, I realized that I had inadvertently included stub libs (e.g. /usr/local/cuda/lib64/stubs). In another, there was a version mismatch (note that /usr/local/cuda is symlinked to a specific version, e.g. /usr/local/cuda-10.1).
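To spot both problems quickly, it can help to dump the path entries one per line and resolve the symlink (a minimal sketch):

# List each LD_LIBRARY_PATH entry on its own line; watch for .../lib64/stubs
echo "$LD_LIBRARY_PATH" | tr ':' '\n'

# See which version /usr/local/cuda actually points to
readlink -f /usr/local/cuda

# Confirm which libcuda.so.1 the dynamic loader will pick up
ldconfig -p | grep libcuda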

Running nvidia-cuda-mps-server

The NVIDIA Multi-Process Service (MPS) allows CUDA kernels from multiple processes to run concurrently on the same GPU. In my setup, if I’m running a CUDA application in a Docker container, this server should be running.
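MPS is managed through its control daemon; a minimal sketch of starting, checking, and stopping it (assuming you have the necessary permissions on the GPU) looks like:

# Start the MPS control daemon; it spawns nvidia-cuda-mps-server on demand
nvidia-cuda-mps-control -d

# Verify that the daemon (and, once a client connects, the server) is running
ps aux | grep nvidia-cuda-mps

# Shut MPS down again
echo quit | nvidia-cuda-mps-control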

Uninstall/Reinstall CUDA with the correct driver versions

It should be obvious if the driver versions are mismatched, but I’ve read that uninstalling and reinstalling sometimes works; see the version check sketched below.
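A quick way to compare what the kernel module, the management tool, and the CUDA toolkit each report (a minimal sketch; the TensorFlow log above also prints the libcuda and kernel versions):

# Driver version according to the management tool
nvidia-smi

# Driver version according to the kernel module
cat /proc/driver/nvidia/version

# CUDA toolkit version installed alongside the driver
nvcc --version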

In some cases, modprobe may have picked up a module built for the wrong driver version. Running sudo modprobe --force-modversion nvidia-<nvidia-version>-uvm could help.

On a related note, while rare, it may be worth checking that your CUDA installation was not corrupted by verifying that your installer(s) match NVIDIA’s published checksums.
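For a runfile installer, that check is just a checksum comparison (a sketch; the file name here is an example for the CUDA 10.1 runfile, and the expected value comes from the checksum list on NVIDIA’s download page):

# Compute the installer's checksum and compare it against the value
# published alongside the download on NVIDIA's site
md5sum cuda_10.1.243_418.87.00_linux.run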

Don’t forget the --gpus flag on docker run

Docker does not expose GPU devices to containers unless you pass the --gpus flag. To run the TensorFlow Docker container, for example, the command is something like

docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu

In older versions of Docker, this flag was --runtime=nvidia.
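To sanity-check that the flag actually exposes the GPU, it can help to run a throwaway container and list devices from inside it (a sketch reusing the image above; nvidia-smi is mounted into the container when the utility driver capability is enabled):

# The GPU should show up for nvidia-smi inside the container...
docker run --gpus all --rm tensorflow/tensorflow:latest-gpu nvidia-smi

# ...and for TensorFlow itself
docker run --gpus all --rm tensorflow/tensorflow:latest-gpu \
  python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"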

Set NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES

This one is a bit of a mystery to me, but it came up in a situation where tensorflow-gpu was throwing this CUDA_ERROR_UNKNOWN in a Docker container, even though

  • TensorFlow could load the CUDA libs and use the GPUs on my host machine, and
  • tools like nvidia-smi and nvcc worked inside the container

Adding the following environment variables to my Dockerfile fixed my problem.

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility

According to the nvidia-container-runtime doc, NVIDIA_VISIBLE_DEVICES “controls which GPUs will be made accessible inside the container” and NVIDIA_DRIVER_CAPABILITIES “controls which driver libraries/binaries will be mounted inside the container.” (These variables are already set in the nvidia-docker images.)
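If you would rather not bake these into the Dockerfile, the same variables can also be passed at run time; here is a sketch of an equivalent docker run invocation (image name reused from the earlier example):

docker run --gpus all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -it --rm tensorflow/tensorflow:latest-gpu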

There are also the separate environment variables NVIDIA_REQUIRE_CUDA and CUDA_VERSION, which looked promising, but they weren’t the solution for me.

References

If you’re reading this post, then chances are you’re either me, or you’re facing this error too. Hopefully one of the above works. Otherwise, here are some links that I found useful when googling around.

2 Comments

Lara 2020-09-20

Thank you so much for this post. I keep banging my head against the wall every time: 1) nvidia drivers get uninstalled on my ubuntu (fixed, my fingers crossed); 2) tf doesn’t see my gpu

Your sanity checklist helped!

Swaroop 2020-09-28

Thanks for the post. I’m training a sample model by running it inside a docker container. The process is able to detect the GPU and train successfully when MPS is off on the host. However, cuInit fails when I start MPS daemon on the host. How do I run nvidia-docker along with nvidia-cuda-mps-server?
