I’m trying to get an Nvidia Tesla P100 working with CUDA on a Barreleye G2 server. Barreleye G2 is a Power9 / OpenPOWER server that Rackspace and Google are building.
I’m posting the process I’m using here to help others trying to do the same.
Before installing CUDA, make sure you have the right drivers for the Nvidia devices.
There are two alternatives for installing the drivers:
1) Install the Power8 drivers from the NVIDIA website (used here on Power9):
sudo dpkg -i nvidia-driver-local-repo-ubuntu1604-384.59_1.0-1_ppc64el.deb
OR
2) Install the drivers recommended by APT:
sudo apt-get install ubuntu-drivers-common
sudo ubuntu-drivers devices
Install the driver that the above command marks as recommended. For example:
sudo apt-get install nvidia-384
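If you want to script this step, the recommended package name can be pulled out of the ubuntu-drivers output. Below is a minimal sketch. The helper name recommended_driver is my own, and it assumes the output contains lines shaped like "driver : <package> - ... recommended", which may differ between releases.

```shell
#!/bin/sh
# Hypothetical helper (not part of ubuntu-drivers): reads the text printed by
# "ubuntu-drivers devices" on stdin and prints the first package flagged as
# recommended. Assumes lines shaped like:
#   driver   : nvidia-384 - distro non-free recommended
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

# Example use on a real machine (commented out, since it installs a package):
#   pkg=$(ubuntu-drivers devices | recommended_driver)
#   [ -n "$pkg" ] && sudo apt-get install "$pkg"
```

If the function prints nothing, no driver line was flagged as recommended and it is safer to read the output by hand.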
Installing CUDA on Power9 / Barreleye G2:
Install the dependency packages for CUDA:
sudo apt-get install build-essential
Install the CUDA repository package for ppc64el and your Ubuntu release (in my case 16.04). For other releases, look up the matching deb package on the Nvidia website:
wget https://developer.nvidia.com/compute/cuda/8.0/prod/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el-deb
sudo dpkg -i ./cuda-repo-ubuntu1604-8-0-local-ga2_8.0.54-1_ppc64el-deb
Update the APT package lists (you need to do this for the repo package above to take effect):
sudo apt-get update
Install the CUDA libraries:
sudo apt-get install cuda
Make CUDA accessible to all users:
echo 'export PATH=$PATH:/usr/local/cuda-8.0/bin' | sudo tee
echo /usr/local/cuda-8.0/lib64 | sudo tee /etc/ld.so.conf.d/cuda-8-0.conf
sudo ldconfig
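The export line needs to land in a file that login shells read. Here is a minimal sketch of one way to extend PATH without adding the same directory twice. The function name path_with is hypothetical, and the file /etc/profile.d/cuda.sh mentioned in the comment is my assumption, not something taken from the steps above.

```shell
#!/bin/sh
# Hypothetical helper: print a PATH-style list ($1) with directory $2 appended,
# unless $2 is already one of its entries.
path_with() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;
        *)        printf '%s\n' "$1:$2" ;;
    esac
}

# Extend PATH for the current shell. Putting these lines (function included)
# in a file such as /etc/profile.d/cuda.sh is an assumption on my part.
PATH=$(path_with "$PATH" /usr/local/cuda-8.0/bin)
export PATH
```

Sourcing the snippet more than once leaves PATH unchanged after the first time, which keeps it from growing on every nested shell.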
Check that your drivers and packages are all working together and that you can see the devices, modules and details of your hardware:
sudo dmesg | grep -i nvidia
sudo lsmod | grep nvidia
nvidia-smi --list-gpus
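If things went okay, all three commands above should produce meaningful output: kernel messages from the Nvidia driver, the loaded nvidia modules, and a list of your GPUs.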
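To make the lsmod check less of an eyeball test, here is a small sketch that succeeds only when a module whose name starts with nvidia is present. The helper name has_nvidia_module is hypothetical. It reads lsmod-style text on stdin, where the module name is the first column.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the lsmod-style text on stdin lists at least
# one module whose name (first column) starts with "nvidia", else exit 1.
has_nvidia_module() {
    awk '$1 ~ /^nvidia/ { found = 1 } END { exit !found }'
}

# Example use on a real machine:
#   lsmod | has_nvidia_module && echo "nvidia module loaded"
```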
Now let’s check whether the CUDA install went okay by building and running a CUDA application:
mkdir ~/samples
cp -r /usr/local/cuda-8.0/samples/ ~/samples/
cd ~/samples/samples/7_CUDALibraries/simpleCUFFT
make
./simpleCUFFT
In my case, the CUDA application fails to run with the following errors. I’m trying to work out whether this is a Power9 / PPC OS related error or an Nvidia device error, and will edit this post with details. The install process should remain the same nevertheless.
root@ubuntu:~/samples/samples/7_CUDALibraries/simpleCUFFT# ./simpleCUFFT
[simpleCUFFT] is starting…
GPU Device 0: "Tesla P100-PCIE-16GB" with compute capability 6.0
[ 3457.484507] Severe Machine check interrupt [Not recovered]
[ 3457.485099] Initiator: CPU
[ 3457.485332] Error type: Real address [Load/Store (foreign)]
[ 3457.485762] Effective address: 00003fff9e49208c
Bus error (core dumped)
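When scripting runs like this one, the crash shows up as an exit status instead of a message. A shell reports a process killed by a signal as 128 plus the signal number, so a bus error (SIGBUS, signal 7 on Linux) appears as 135. Here is a minimal sketch of decoding that. The function name describe_status is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn an exit status into a short description.
# Statuses above 128 conventionally mean "killed by signal (status - 128)".
describe_status() {
    if [ "$1" -eq 0 ]; then
        echo "ok"
    elif [ "$1" -gt 128 ]; then
        echo "killed by signal $(( $1 - 128 ))"
    else
        echo "failed with exit code $1"
    fi
}

# Example use after running the sample:
#   ./simpleCUFFT; describe_status $?
```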