How to Install NVIDIA Drivers for P100 and CUDA on Barreleye G2 / Power9 (Work in Progress)

I’m trying to get an NVIDIA Tesla P100 working with CUDA on a Barreleye G2 server. Barreleye G2 is a Power9 / OpenPOWER server that Rackspace and Google are building.

I’m posting the process I’m using here to help others who are trying to do the same.

Before installing CUDA, make sure you have the right drivers for your NVIDIA devices.

There are two alternatives for installing the NVIDIA drivers:

1) Install NVIDIA’s Power8 (ppc64el) drivers from the NVIDIA website on the Power9 machine:

wget http://us.download.nvidia.com/tesla/384.59/nvidia-driver-local-repo-ubuntu1604-384.59_1.0-1_ppc64el.deb

sudo dpkg -i nvidia-driver-local-repo-ubuntu1604-384.59_1.0-1_ppc64el.deb

OR

2) Install the drivers recommended by APT:

sudo apt-get install ubuntu-drivers-common
sudo ubuntu-drivers devices

Install the driver that the output of the command above recommends. For example:
sudo apt-get install nvidia-384
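
The recommendation can also be picked out with a small script instead of by eye. This is only a minimal sketch: it assumes that `ubuntu-drivers devices` marks its suggestion with the word "recommended" on a line that starts with "driver", and the function name recommended_driver is my own, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: read `ubuntu-drivers devices` output on stdin and
# print the package name from the first "driver" line tagged "recommended".
# Assumes lines shaped like:
#   driver   : nvidia-384 - distro non-free recommended
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

# Usage (commented out because it changes the system):
#   pkg=$(sudo ubuntu-drivers devices | recommended_driver)
#   [ -n "$pkg" ] && sudo apt-get install "$pkg"
```

If the function prints nothing, no line matched the assumed format, and the package is best chosen by hand.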

Installing CUDA on Power9 / Barreleye G2:

Install dependency packages for CUDA

sudo apt-get install build-essential

Install the CUDA repository package for ppc64el that matches your Ubuntu release (16.04 in my case). For other releases, look up the matching deb package on the NVIDIA website:
wget https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el-deb
sudo dpkg -i ./cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el-deb

Update the APT package lists (you need to do this for the repo package above to take effect):

sudo apt-get update

Install CUDA Libraries

sudo apt-get install cuda

Make CUDA Accessible to all users:

echo 'export PATH=$PATH:/usr/local/cuda-8.0/bin' | sudo tee
echo /usr/local/cuda-8.0/lib64 | sudo tee /etc/ld.so.conf.d/cuda-8-0.conf
sudo ldconfig
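
Note that the first command above pipes into tee without naming a file, so it only prints the line. Here is a minimal sketch of one way to finish the step, assuming a profile snippet such as /etc/profile.d/cuda.sh is acceptable on your system. That file name and the two helper functions are my own assumptions, not taken from the commands above.

```shell
#!/bin/sh
CUDA_HOME=/usr/local/cuda-8.0
# Hypothetical destination; any file sourced by login shells would do.
PROFILE_SNIPPET=/etc/profile.d/cuda.sh

# Print the export line for a given CUDA install directory.
# $PATH stays literal so it is expanded at login time, not now.
cuda_profile_line() {
    printf 'export PATH=$PATH:%s/bin\n' "$1"
}

# Succeed if directory $1 appears as an entry in the PATH-style string $2.
path_has_dir() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# To apply system-wide (commented out because it writes under /etc):
#   cuda_profile_line "$CUDA_HOME" | sudo tee "$PROFILE_SNIPPET"
# After logging in again, confirm the directory is on PATH:
#   path_has_dir "$CUDA_HOME/bin" "$PATH" && echo "CUDA bin is on PATH"
```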

Check that the drivers and packages are working together and that you can see the devices, kernel modules and details of your hardware:

sudo dmesg | grep -i nvidia
sudo lsmod | grep nvidia
nvidia-smi --list-gpus

If everything went well, dmesg should show messages from the NVIDIA driver, lsmod should list the nvidia kernel modules, and nvidia-smi should list your GPU.
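
The module and GPU checks can be folded into one pass/fail script. This is a rough sketch: the function names are mine, and it assumes the driver's kernel modules show up in lsmod with names beginning with "nvidia".

```shell
#!/bin/sh
# Hypothetical helper: succeed if lsmod-style text on stdin lists a module
# whose name starts with "nvidia" (for example nvidia or nvidia_uvm).
has_nvidia_module() {
    grep -q '^nvidia'
}

# Run the checks and report each result; returns non-zero on any failure.
check_nvidia_stack() {
    status=0
    if lsmod | has_nvidia_module; then
        echo "ok: nvidia kernel module loaded"
    else
        echo "FAIL: no nvidia kernel module in lsmod"
        status=1
    fi
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi --list-gpus; then
        echo "ok: nvidia-smi ran successfully"
    else
        echo "FAIL: nvidia-smi missing or returned an error"
        status=1
    fi
    return $status
}

# Usage:
#   check_nvidia_stack
```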

Now let’s check whether the CUDA install went okay by building and running a CUDA sample application:

mkdir ~/samples
cp -r /usr/local/cuda-8.0/samples/ ~/samples/
cd ~/samples/samples/7_CUDALibraries/simpleCUFFT
make
./simpleCUFFT

In my case, the CUDA application fails to run with the following errors. I’m trying to work out whether this is a Power9 / PPC OS-related error or an NVIDIA device error, and will edit this post with details. The install process should remain the same nevertheless.

root@ubuntu:~/samples/samples/7_CUDALibraries/simpleCUFFT# ./simpleCUFFT
[simpleCUFFT] is starting...
GPU Device 0: "Tesla P100-PCIE-16GB" with compute capability 6.0

[ 3457.484507] Severe Machine check interrupt [[Not recovered]
[ 3457.485099] Initiator: CPU
[ 3457.485332] Error type: Real address [Load/Store (foreign)]
[ 3457.485762] Effective address: 00003fff9e49208c
Bus error (core dumped)
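
When reporting a crash like this, it helps to save the kernel’s machine check lines together with a few lines of context. A small sketch, assuming the messages contain the phrase "Machine check" as in the log above; mce_context is a made-up helper name and mce-report.txt is just an example output file.

```shell
#!/bin/sh
# Hypothetical helper: print each kernel log line mentioning a machine check
# plus the three lines that follow it (initiator, error type and address
# in the log above).
mce_context() {
    grep -i -A 3 'machine check'
}

# Usage (dmesg may need root):
#   sudo dmesg | mce_context > mce-report.txt
```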

 
