How to setup a TensorFlow Cuda/GPU-enabled dev environment in a Docker container

Getting TensorFlow to run in a Docker container with GPU support is no easy task. I think I have it figured out. Here are my steps to create a Docker image. My setup is an Ubuntu 14.04 host running an Ubuntu 14.04 Docker container.

Install the video card (mine is an Nvidia GTX 980). Note that Ubuntu runs an open-source driver by default; we want the Nvidia driver.

Sign up for the Nvidia CUDA developer program (Nvidia takes a day to approve it!)

Install the Nvidia video driver and the CUDA Toolkit

ctrl+alt+f1 (to switch to tty1, in order to stop the X server; the CUDA installer won't run while X is running)
sudo service lightdm stop
sudo ./[cuda installer .run file]
sudo service lightdm start

Install cuDNN v4

tar -xzvf [cudnn tgz file]
cd [cudnn directory]
sudo cp lib* /usr/local/cuda/lib64/ 
sudo cp cudnn.h /usr/local/cuda/include/
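The copy step above can be wrapped in a small function and rehearsed against a scratch prefix before touching the real `/usr/local/cuda`; `install_cudnn` and its arguments are illustrative names of my own, not part of any Nvidia tooling:

```shell
# Sketch: the cuDNN copy step as a function, so it can be tried against a
# scratch prefix first (use sudo when the prefix is the real /usr/local/cuda).
install_cudnn() {
    src=$1     # directory the cuDNN tarball was extracted into
    prefix=$2  # CUDA prefix, e.g. /usr/local/cuda
    mkdir -p "$prefix/lib64" "$prefix/include" &&
    cp "$src"/lib* "$prefix/lib64/" &&
    cp "$src"/cudnn.h "$prefix/include/"
}
```

Running `sudo ldconfig` afterwards is a common extra step so the dynamic linker picks up the new libraries.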

Pull a Docker Ubuntu 14 image

docker pull ubuntu

Create directory for TensorFlow source

I created it in /home/dk1027/git on my host. I also used this to transfer files between my host and the docker container.

Run the docker container

export CUDA_SO=$(\ls /usr/lib/x86_64-linux-gnu/libcuda* | xargs -I{} echo '-v {}:{}')
export DEVICES=$(\ls /dev/nvidia* | xargs -I{} echo '--device {}:{}')
docker run -it --device /dev/mem:/dev/mem -v /lib/modules:/lib/modules --cap-add=ALL --privileged $CUDA_SO $DEVICES -v /home/dk1027/git:/git -v /usr/local/cuda:/usr/local/cuda ubuntu /bin/bash

Some explanations:

$CUDA_SO and $DEVICES expose the GPU device nodes and the CUDA .so files to the Docker container.

/usr/local/cuda is where I installed cuDNN and the CUDA Toolkit.

--device /dev/mem:/dev/mem -v /lib/modules:/lib/modules --cap-add=ALL --privileged make it possible to build the example trainer and to run it later; otherwise you will run into this error: modprobe: ERROR: ../libkmod/libkmod.c:556 kmod_search_moddep() could not open moddep file '/lib/modules/3.19.0-25-generic/modules.dep.bin'
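To see what the $CUDA_SO and $DEVICES expansions above actually produce, the xargs pattern can be run on sample paths (the file names below are just example input):

```shell
# Each input line becomes one repeated docker flag via xargs -I{}.
printf '%s\n' /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 \
    | xargs -I{} echo '-v {}:{}'
# prints:
#   -v /usr/lib/x86_64-linux-gnu/libcuda.so:/usr/lib/x86_64-linux-gnu/libcuda.so
#   -v /usr/lib/x86_64-linux-gnu/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1
```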

Everything that follows happens inside the container.

Install dependencies for building TensorFlow and the example trainer

apt-get update
apt-get install build-essential
apt-get install python-numpy swig python-dev
apt-get install zlib1g-dev

Following the “Install from source” section of the TensorFlow docs:

Setup Bazel

apt-get install software-properties-common
add-apt-repository ppa:webupd8team/java (the PPA that provides oracle-java8-installer)
apt-get update
apt-get install oracle-java8-installer
apt-get install unzip (because the bazel installer needs it)

I first downloaded the version the TensorFlow instructions asked to install, but it wouldn’t build. Downloading the latest Bazel (0.1.5 at the time of writing) fixed it.

chmod +x [bazel installer .sh file]
./[bazel installer .sh file] --user
export PATH="$PATH:$HOME/bin"
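A quick sanity check (a sketch; harmless if bazel isn't installed yet) confirms the user-local install is visible:

```shell
# After the --user install, bazel lands in $HOME/bin; confirm it is on PATH.
export PATH="$PATH:$HOME/bin"
command -v bazel || echo "bazel not on PATH yet"
```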

Build TensorFlow and example trainer

cd /git/tensorflow

Follow instructions on the TensorFlow page for ./configure

Not sure why I got a compiler-internal segfault here. Retried and it succeeded.

Build and run the example trainer

bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu

It got stuck allocating memory; I ran nvidia-smi on the host and saw that GPU memory was in use, but it segfaulted after a while.

Tried again; it took a couple of minutes and then finished successfully!

Create the pip package and install

sudo apt-get install python-pip python-dev
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-0.7.0-cp27-none-linux_x86_64.whl
pip uninstall protobuf
pip install protobuf==3.0.0b2
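Before moving on, it is worth checking that the wheel actually installed; a minimal sketch that doesn't assume a GPU is present:

```shell
# If the install worked, this prints the version; otherwise it says so.
python -c 'import tensorflow as tf; print("TensorFlow", tf.__version__)' \
    || echo "TensorFlow import failed"
```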

Train something!

cd /git/tensorflow/tensorflow/models/image/mnist
python convolutional.py

Go back to the host and, on another terminal, run this command. Observe that Python is using the GPU now!

watch nvidia-smi
Fri Feb 19 01:00:40 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:01:00.0      On |                  N/A |
| 33%   39C    P2   101W / 390W |   3868MiB /  4088MiB |     82%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1265    G   /usr/bin/X                                     109MiB |
|    0      2104    G   compiz                                          12MiB |
|    0     10751    C   python                                        3728MiB |
+-----------------------------------------------------------------------------+

It may be a good idea to commit your container at this point.

Update May 2, 2016

To update from source:

git pull --recurse-submodules
git submodule update --recursive
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-VERSION-none-linux_x86_64.whl
pip uninstall protobuf
pip install protobuf==3.0.0b2
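Since the wheel filename embeds the version, a small sketch (assuming the package was built into /tmp/tensorflow_pkg as above) avoids hard-coding it:

```shell
# Pick up whatever wheel build_pip_package produced, instead of typing the name.
WHEEL=$(ls /tmp/tensorflow_pkg/tensorflow-*.whl 2>/dev/null | head -1)
if [ -n "$WHEEL" ]; then
    pip install "$WHEEL"
else
    echo "no wheel found in /tmp/tensorflow_pkg"
fi
```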