DL in Docker

(This blog post represents my personal opinion, and should not be interpreted as any official statements from my employer NVIDIA.)

As described in the book Learning Deep Learning, one way of installing Deep Learning (DL) frameworks with GPU support is to use Docker. If you have never used Docker before, all the new concepts can be somewhat overwhelming. The intent of this blog post is to provide some handholding to get you up and running.

To keep things simple, I limit myself to describing how to do this for a specific platform (Ubuntu Linux 20.04), but I provide pointers to the official installation instructions for each component as well, so these instructions might still be useful for users of other platforms.

Warning: Any description like this is bound to become outdated over time, so I recommend checking the official installation instructions to confirm that they are still consistent with what I describe.

Docker enables you to create a container that encapsulates an application’s dependencies. This is done by first creating a Docker image, on which the application, and any libraries it depends on, are preinstalled. Docker then takes this Docker image and creates a Docker container, in which the application runs. You can think of a Docker image as the disk with installed software, and of the Docker container as a machine that has booted and runs from this disk. In order for Docker to work seamlessly with NVIDIA GPUs, NVIDIA has created the NVIDIA Container Toolkit, which in turn relies on the NVIDIA GPU driver already being installed on the system. Additionally, NVIDIA provides images with DL frameworks installed through its NVIDIA GPU Cloud (NGC) catalog. That is, to get up and running, there are four key steps in all:

  • Install NVIDIA GPU driver (likely already installed on your system)
  • Install Docker
  • Install NVIDIA Container Toolkit
  • Pull (download) an image from the NGC catalog

NVIDIA GPU driver

Use the NVIDIA System Management Interface to check if the NVIDIA GPU driver is installed, and if so, what version:

nvidia-smi

If nvidia-smi is installed and can run on your system, then the NVIDIA GPU driver is installed and the first line in the resulting print-out should state the version of the driver. This is what it looked like on my system:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
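
If you just want the driver version, nvidia-smi can also print it by itself using its query interface:

nvidia-smi --query-gpu=driver_version --format=csv,noheader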

If the NVIDIA driver is not already installed, then follow the official instructions here:
https://www.nvidia.com/Download/index.aspx?lang=en-us
https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html

Install Docker

First, remove any older Docker versions. You can skip this step if you have never installed Docker before:

sudo apt-get remove docker docker-engine docker.io containerd runc

The version of Docker that we want to use is not included in the standard Ubuntu apt repository, so we must first add the Docker repository, which requires a handful of steps. Pay attention to, and address, any error messages:

sudo apt-get update

sudo apt-get install ca-certificates curl gnupg lsb-release

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

Now install the Docker engine:

sudo apt-get update

sudo apt-get install docker-ce docker-ce-cli containerd.io
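
You can confirm that the installation succeeded, and see which version was installed, with:

docker --version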

To be able to run docker without prefacing each command with sudo, do the following:

sudo groupadd docker

sudo usermod -aG docker $USER

Restart the computer, and then confirm that you can run Docker without sudo:

docker run hello-world
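
If everything works, the print-out should contain lines similar to the following:

Hello from Docker!
This message shows that your installation appears to be working correctly.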

The steps above originate from the following official instructions:
https://docs.docker.com/engine/install/ubuntu/
https://docs.docker.com/engine/install/linux-postinstall/
Some additional information if you are interested:
https://askubuntu.com/questions/1182820/why-is-docker-not-in-official-ubuntu-repos

Install NVIDIA Container Toolkit

First we need to set up the repository:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

Now install the toolkit:

sudo apt-get update

sudo apt-get install -y nvidia-docker2

sudo systemctl restart docker

Confirm that it works by running nvidia-smi inside a Docker container:

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

The steps above originated from the following official instructions:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installing-on-ubuntu-and-debian

Pull an image from NVIDIA GPU Cloud

NVIDIA provides Docker images with TensorFlow or PyTorch pre-installed. The TensorFlow image is named tensorflow:YY.MM-tf2-py3 and the PyTorch image is named pytorch:YY.MM-py3, where YY and MM designate the year and month when the image was released. E.g., the images released in July 2021 are named tensorflow:21.07-tf2-py3 and pytorch:21.07-py3.

It is important to pull a version of the image that is compatible with your version of the NVIDIA driver. In my case, the NVIDIA driver version is 470.86 (see output from nvidia-smi above). The following page contains information about each monthly image release:
https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html

On that page, in the column for 21.07 we see “Release 21.07 is based on NVIDIA CUDA 11.4.0, which requires NVIDIA Driver release 470 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), or 460.27 (or later R460).”
That should work with my driver (470.86). If, on the other hand, I want to use the 22.01 image, then I would need to update my driver first because the 22.01 image requires a driver version of 495 or later.

We can now download both the TensorFlow and PyTorch images with the following commands:

docker pull nvcr.io/nvidia/tensorflow:21.07-tf2-py3

docker pull nvcr.io/nvidia/pytorch:21.07-py3
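
You can verify that the downloads succeeded by listing the Docker images on your system:

docker images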

Official information about the TensorFlow and PyTorch images can be found here:
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch

You now have everything needed to start a Docker container and run TensorFlow or PyTorch. This is described next.

Running LDL code examples in a Docker container

In addition to the DL framework itself, you also need the code examples that you want to run. They can be downloaded as a zip file from here:
https://github.com/NVDLI/LDL/archive/refs/heads/main.zip
Then unzip it in a suitable location:

unzip LDL-main.zip

The command lines below assume that it is placed in the following location:

/home/USERNAME/LDL-main

We can now start Docker with the following command line:

nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -p 8888:8888 -v /home/USERNAME/LDL-main:/home/LDL nvcr.io/nvidia/tensorflow:21.07-tf2-py3 /bin/bash

In this example we use TensorFlow. See further down for PyTorch.

Some additional information in case you are interested in the details:

  • The command “nvidia-docker” is a script that calls Docker. It is also possible to use “docker” directly, but then you need to add the option “--gpus all” to the command line.
  • The options “--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864” are recommended for these images. If you do not include them on the command line, you will get a warning message.
  • The option “-p 8888:8888” sets up port forwarding so that you can run Jupyter notebooks. You can omit it if you just want to run the plain Python files.
  • The option “-v /home/USERNAME/LDL-main:/home/LDL” mounts the directory /home/USERNAME/LDL-main at the location /home/LDL inside the Docker container.
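
Once inside the container, a quick sanity check is to confirm that TensorFlow can see the GPU. The following standard TensorFlow 2 one-liner should print a non-empty list of GPU devices:

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"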

The Docker images do not have matplotlib or idx2numpy pre-installed, so if you want to run code examples that rely on these modules, install them first:

pip install matplotlib

pip install idx2numpy

Let’s now try a code example:

cd /home/LDL/tf_framework

python c6e1_boston.py

If you instead want to run it as a Jupyter notebook, then do the following:

cd /home/LDL

jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root

You should see something along the following lines being printed in your shell:

To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-362-open.html

Or copy and paste this URL:
http://hostname:8888/?token=80d3a14a3660b254c26e3f983c8cf26e3de0714de3b9830e

You can now copy the URL above and paste it into a browser, but replace “hostname” with “localhost”. This should enable you to run the notebooks in your browser.

If you instead want to run PyTorch, you would use the following Docker command line:

nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -p 8888:8888 -v /home/USERNAME/LDL-main:/home/LDL nvcr.io/nvidia/pytorch:21.07-py3 /bin/bash

However, some of the code examples use utilities from TensorFlow, so you will also need to install TensorFlow in your container:

pip install tensorflow

I have also run into problems with Jupyter notebooks when the notebook package is not upgraded to a newer version, so upgrade it as well:

pip install --upgrade notebook

And as before, we need matplotlib and idx2numpy:

pip install matplotlib

pip install idx2numpy
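
Before trying the code examples, you can do the corresponding sanity check for PyTorch. The following should print True if PyTorch can see the GPU:

python -c "import torch; print(torch.cuda.is_available())"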

Try a code example:

cd /home/LDL/pt_framework

python c6e1_boston.py

Or if you want to run it as a Jupyter notebook, then do the following as described above:

cd /home/LDL

jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root

Extending a Docker Image

(section added November 17, 2022)

In the examples above, we had to install some missing packages inside the container. Instead of doing that each time you start a container, you can create your own Docker image that extends a base image. This is done by using a Dockerfile. The LDL repository contains two such files:

  • Dockerfile_tf – if you want to run the TensorFlow examples
  • Dockerfile_pt – if you want to run the PyTorch examples
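
To give you an idea of what such a file contains, here is a hypothetical minimal version of Dockerfile_tf, based on the packages we installed manually above (the actual file in the LDL repository may differ):

# Hypothetical minimal Dockerfile_tf; the actual file in the LDL repository may differ.
FROM nvcr.io/nvidia/tensorflow:21.07-tf2-py3
RUN pip install matplotlib idx2numpy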

You can create your image using one of the following two commands, run from the LDL-main directory where the Dockerfiles are located, depending on whether you want to run TensorFlow or PyTorch:

docker build -t ldl_tf:v1 -f Dockerfile_tf .
docker build -t ldl_pt:v1 -f Dockerfile_pt .

This will create a new image named ldl_tf:v1 (TensorFlow) or ldl_pt:v1 (PyTorch). You can now start Docker with one of the following two commands, and you will no longer need to install additional packages before running the programming examples:

nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -p 8888:8888 -v /home/USERNAME/LDL-main:/home/LDL ldl_tf:v1 /bin/bash

nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -p 8888:8888 -v /home/USERNAME/LDL-main:/home/LDL ldl_pt:v1 /bin/bash

Hopefully this is helpful. If you find anything that looks wrong, then please submit a report on the errata page: https://ldlbook.com/errata/