How to Install NVIDIA Drivers for P100 and CUDA on Barreleye G2 / Power9 (Work in Progress)

I’m trying to get an NVIDIA Tesla P100 working with CUDA on a Barreleye G2 server. Barreleye G2 is the Power9 / OpenPOWER server that Rackspace and Google are building.

I’m posting the process I’m using here to help others trying to do the same.

Before installing CUDA, make sure you have the right drivers for the NVIDIA devices.

There are two alternatives for installing the NVIDIA drivers:

1) Install NVIDIA’s Power8 (ppc64el) drivers from the NVIDIA website on the Power9 machine:

wget http://us.download.nvidia.com/tesla/384.59/nvidia-driver-local-repo-ubuntu1604-384.59_1.0-1_ppc64el.deb

sudo dpkg -i nvidia-driver-local-repo-ubuntu1604-384.59_1.0-1_ppc64el.deb

OR

2) Install the APT recommended Drivers:

sudo apt-get install ubuntu-drivers-common
sudo ubuntu-drivers devices

Based on the recommendation in the output of the command above, install the recommended driver. For example:
sudo apt-get install nvidia-384
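
The recommended package can also be picked out of the `ubuntu-drivers devices` output with a small script. This is only a minimal sketch: it assumes the recommended entry is printed in the form `driver   : nvidia-384 - distro non-free recommended`, and `recommended_driver` is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Sketch: extract the recommended driver package from `ubuntu-drivers devices` output.
# Assumption: the recommended entry looks like
#   driver   : nvidia-384 - distro non-free recommended
# recommended_driver is a hypothetical helper name.

recommended_driver() {
    # Reads the command output on stdin and prints the package name from the
    # first "driver" line whose last word is "recommended".
    awk '$1 == "driver" && $NF == "recommended" { print $3; exit }'
}

# Usage on a real machine (commented out so the sketch has no side effects):
#   pkg=$(sudo ubuntu-drivers devices | recommended_driver)
#   [ -n "$pkg" ] && sudo apt-get install "$pkg"
```

If no line is marked as recommended, the function prints nothing, so the guarded install in the usage comment is skipped.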

Installing CUDA on Power9 / Barreleye G2:

Install dependency packages for CUDA

sudo apt-get install build-essential

Install the CUDA repository package specific to ppc64el and your Ubuntu release (16.04 in my case). For other releases, look up the matching deb package on the NVIDIA website:
wget https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el-deb
sudo dpkg -i ./cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el-deb

Update the APT package lists (you need to do this for the repo package above to take effect):

sudo apt-get update

Install CUDA Libraries

sudo apt-get install cuda

Make CUDA Accessible to all users:

echo 'export PATH=$PATH:/usr/local/cuda-8.0/bin' | sudo tee
echo /usr/local/cuda-8.0/lib64 | sudo tee /etc/ld.so.conf.d/cuda-8-0.conf
sudo ldconfig
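
The first command above does not name the file that `tee` should write to, so the following is only a sketch of one way to do both steps. It assumes a profile snippet under `/etc/profile.d/` (the file name `cuda.sh` and the helper `write_cuda_env` are hypothetical) and takes a root prefix so the result can be previewed in a scratch directory before touching the real system.

```shell
#!/bin/sh
# Sketch: make CUDA 8.0 binaries and libraries visible system-wide.
# Assumptions: /etc/profile.d/cuda.sh is a hypothetical file name and
# write_cuda_env is a hypothetical helper. Pass "" as the root for the real
# system (run as root), or a scratch directory to preview the files.

write_cuda_env() {
    root=$1
    cuda_home=/usr/local/cuda-8.0
    mkdir -p "$root/etc/profile.d" "$root/etc/ld.so.conf.d"
    # Login shells source /etc/profile.d/*.sh, which extends PATH for all users.
    printf 'export PATH=$PATH:%s/bin\n' "$cuda_home" > "$root/etc/profile.d/cuda.sh"
    # The dynamic linker picks up extra library directories from ld.so.conf.d.
    printf '%s\n' "$cuda_home/lib64" > "$root/etc/ld.so.conf.d/cuda-8-0.conf"
}

# Preview in a scratch directory.
# On the real system, as root: write_cuda_env "" && ldconfig
scratch=$(mktemp -d)
write_cuda_env "$scratch"
cat "$scratch/etc/profile.d/cuda.sh" "$scratch/etc/ld.so.conf.d/cuda-8-0.conf"
```

Writing into a prefix first makes it easy to inspect the two files before running the same function against `/`.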

Check that your drivers and packages are all working together and that you can see the devices, modules and details of your hardware:

sudo dmesg | grep -i nvidia
sudo lsmod | grep nvidia
nvidia-smi --list-gpus

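If things went okay, all three commands should produce meaningful output: dmesg should show messages from the NVIDIA driver, lsmod should list the nvidia modules, and nvidia-smi should list each GPU.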
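
The module and GPU checks can be turned into simple pass/fail tests. This is a rough sketch: `check_lsmod` and `check_gpu_list` are hypothetical helper names, and the GPU check assumes `nvidia-smi --list-gpus` prints one line per device starting with `GPU <n>:`.

```shell
#!/bin/sh
# Sketch: pass/fail versions of the manual checks.
# check_lsmod and check_gpu_list are hypothetical helper names; each reads
# command output on stdin, so it also works on saved output.

check_lsmod() {
    # Succeeds if any loaded module name starts with "nvidia".
    awk '$1 ~ /^nvidia/ { found = 1 } END { exit !found }'
}

check_gpu_list() {
    # Succeeds if at least one line starts with "GPU <n>:"
    # (assumed format of `nvidia-smi --list-gpus`).
    grep -Eq '^GPU [0-9]+:'
}

# Usage on the server:
#   lsmod | check_lsmod && echo "kernel module loaded"
#   nvidia-smi --list-gpus | check_gpu_list && echo "GPU visible"
```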

Now let’s check whether the CUDA install went okay by building and running a CUDA sample application:

mkdir ~/samples
cp -r /usr/local/cuda-8.0/samples/ ~/samples/
cd ~/samples/samples/7_CUDALibraries/simpleCUFFT
make
./simpleCUFFT

In my case, the CUDA application fails to run with the following errors. I’m trying to work out whether this is a Power9 / PPC OS related error or an NVIDIA device error, and will edit this post with details. The install process should remain the same nevertheless.

root@ubuntu:~/samples/samples/7_CUDALibraries/simpleCUFFT# ./simpleCUFFT
[simpleCUFFT] is starting...
GPU Device 0: "Tesla P100-PCIE-16GB" with compute capability 6.0

[ 3457.484507] Severe Machine check interrupt [Not recovered]
[ 3457.485099] Initiator: CPU
[ 3457.485332] Error type: Real address [Load/Store (foreign)]
[ 3457.485762] Effective address: 00003fff9e49208c
Bus error (core dumped)

 


How to Install Ubuntu Xenial 16.04 LTS on Power9 Machines

If you want to install the original Ubuntu 16.04 LTS (with the initial LTS kernel, 4.4) on Power9, you are out of luck. This is because full kernel support for Power9 only arrived from 4.10 onwards.

But to some jubilation, 16.04.3 LTS was released today (08/03/17) with support for the 17.04 kernel (4.10). So the workaround for installing Ubuntu LTS on Power9 is to use the 16.04.3 LTS HWE kernel / initrd instead of the ones I indicated in my previous blog post:

Kernel:

http://ports.ubuntu.com/ubuntu-ports/dists/xenial-updates/main/installer-ppc64el/current/images/hwe-netboot/ubuntu-installer/ppc64el/vmlinux

Initrd:

http://ports.ubuntu.com/ubuntu-ports/dists/xenial-updates/main/installer-ppc64el/current/images/hwe-netboot/ubuntu-installer/ppc64el/initrd.gz
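
Both files live under the same directory, so fetching them can be scripted. This is a small sketch that only prints the two URLs, with the actual download left commented out; `hwe_url` is a hypothetical helper name and the base URL is taken from the links above.

```shell
#!/bin/sh
# Sketch: build the HWE netboot URLs and optionally fetch both files.
# hwe_url is a hypothetical helper; BASE is the directory shared by the
# kernel and initrd links.

BASE=http://ports.ubuntu.com/ubuntu-ports/dists/xenial-updates/main/installer-ppc64el/current/images/hwe-netboot/ubuntu-installer/ppc64el

hwe_url() {
    # $1 is the file name: vmlinux or initrd.gz
    printf '%s/%s\n' "$BASE" "$1"
}

for f in vmlinux initrd.gz; do
    hwe_url "$f"
    # Uncomment to download on a machine with network access:
    # wget -O "$f" "$(hwe_url "$f")"
done
```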

Again, follow the same procedure as in the previous blog post, BUT with the NEW kernel / initrd links.

Enjoy the first LTS port on Power9.