This GPU performance-predicting tool is more than a benchmark

In this week’s segment on Best Papers, we chose to publish a report on the best paper from the Valuetools 2015 conference, which took place in Berlin, Germany, on 14-16 December. The winning paper is titled "GPU Performance Prediction Through Parallel Discrete Event Simulation and Common Sense", authored by Guillaume Chapuis, Stephan Eidenbenz, and Nandakishore Santhi.

The presented prediction model is a useful tool for software developers. It can reliably compare the performance of different architectures, and can therefore help determine on which architecture a given code will run best, without the need to have the physical GPU at hand. The authors also suggest improvements to the model, such as adding a cache model to increase accuracy.

A Graphics Processing Unit (GPU) is far easier to model than a Central Processing Unit (CPU). There are a few reasons for this, such as the absence of an operating system running in the background and of instruction reordering. The team from Los Alamos National Laboratory came up with a performance prediction toolkit that runs on Simian, a discrete event simulation engine written in Python. Interaction with the toolkit is managed via a task list.

Upon start, the toolkit predicts the time needed to execute the task list. The toolkit was tested on three Nvidia GPUs: the Tesla M2090, the K40, and the Quadro K6000. The models were validated against three benchmark applications, each aimed at measuring a different aspect of GPU performance. Runtime predictions were within 20% of the measured values. Finally, the performance of a next-generation GPU was predicted.
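To give a feel for how such a task-list-driven prediction might work, here is a minimal Python sketch. It does not use Simian's actual API, and the task fields, hardware parameters, and roofline-style latency formula are illustrative assumptions rather than the paper's model.

```python
# Minimal sketch of a task-list-driven runtime prediction.
# NOTE: this is NOT Simian's real API; the task fields and the latency
# formula below are illustrative assumptions, not the paper's model.

# A hypothetical task list: each entry records how many floating-point
# operations and how many bytes of memory traffic a step generates.
task_list = [
    {"name": "copy_to_device", "bytes": 64 * 1024**2,        "flops": 0},
    {"name": "sgemm_kernel",   "bytes": 3 * 4096 * 4096 * 4, "flops": 2 * 4096**3},
    {"name": "copy_to_host",   "bytes": 64 * 1024**2,        "flops": 0},
]

# Assumed hardware parameters (roughly Tesla K40 class, for illustration).
PEAK_FLOPS = 4.29e12       # single-precision FLOP/s
PEAK_BANDWIDTH = 288e9     # bytes/s

def predict_task_time(task):
    """Analytic estimate: a task is limited either by compute throughput
    or by memory bandwidth, whichever is slower."""
    compute_time = task["flops"] / PEAK_FLOPS
    memory_time = task["bytes"] / PEAK_BANDWIDTH
    return max(compute_time, memory_time)

total = sum(predict_task_time(t) for t in task_list)
print(f"Predicted runtime: {total:.4f} s")
```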

From the very beginning, GPU architecture was revolutionary, trading the usual performance parameters (cache size, frequency, latency) for a higher number of processing units.

The paper then goes on to describe the GPU architecture, the chip composition, and the path the code follows inside a GPU. This part closely outlines the advantages of a GPU compared to a CPU. GPU computing relies on a high number of blocks in a grid, which can process code simultaneously as long as there is no dependency on an instruction that has not yet been resolved. GPU blocks act much like cores in a multi-core CPU. It is noted that a merging of the two can be observed in the industry: CPUs are adopting a lot of architectural design from GPUs and vice versa. We recommend taking a look at NVIDIA’s short video demonstration above explaining in simple terms how a GPU works.
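As a rough illustration of the grid/block decomposition just described, the following Python sketch shows how a matrix problem might be split into a grid of thread blocks and how many blocks can run concurrently. The block size, SM count, and residency figures are assumed values for illustration, not figures from the paper.

```python
import math

# Illustrative sketch: decomposing a problem into a grid of thread
# blocks and mapping those blocks onto the GPU's multiprocessors.
# The numbers below are assumptions for illustration only.

MATRIX_N = 4096          # one dimension of an SGEMM-like problem
BLOCK_DIM = 16           # threads per block along each axis (16 x 16 = 256 threads)
NUM_SMS = 15             # streaming multiprocessors on the chip
BLOCKS_PER_SM = 4        # blocks each SM can keep resident at once

# The grid covers the whole output matrix with BLOCK_DIM x BLOCK_DIM tiles.
grid_x = math.ceil(MATRIX_N / BLOCK_DIM)
grid_y = math.ceil(MATRIX_N / BLOCK_DIM)
total_blocks = grid_x * grid_y

# Blocks are independent (no unresolved dependencies), so they execute
# in "waves" of however many blocks the chip can hold at once.
concurrent_blocks = NUM_SMS * BLOCKS_PER_SM
waves = math.ceil(total_blocks / concurrent_blocks)

print(f"grid = {grid_x} x {grid_y} blocks ({total_blocks} total)")
print(f"{concurrent_blocks} blocks run concurrently -> {waves} waves")
```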

The methodology of the simulation relies on the simplicity of the architecture: only a portion of the grid is simulated, and the total runtime is estimated analytically. Before proceeding to the validation results, the paper briefly lists the specifications of the three GPUs used and outlines the main differences between them.
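The idea of simulating only a portion of the grid and extrapolating to the whole could look roughly like the sketch below. The wave-based linear extrapolation and the sample numbers are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of the "simulate a slice, extrapolate the rest" idea.
# The linear wave-based extrapolation is an assumption for illustration,
# not the exact procedure used in the paper.

def extrapolate_runtime(simulated_time, simulated_blocks,
                        total_blocks, concurrent_blocks):
    """Scale the runtime of a simulated subset of blocks to the full grid,
    assuming all blocks behave identically and execute in full waves."""
    simulated_waves = max(1.0, simulated_blocks / concurrent_blocks)
    time_per_wave = simulated_time / simulated_waves
    total_waves = total_blocks / concurrent_blocks
    return time_per_wave * total_waves

# Example: 60 blocks simulated out of 65,536, on a chip that
# keeps 60 blocks resident at a time (made-up values).
estimate = extrapolate_runtime(simulated_time=0.002,
                               simulated_blocks=60,
                               total_blocks=65536,
                               concurrent_blocks=60)
print(f"Estimated full-grid runtime: {estimate:.3f} s")
```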

The results are sorted into three sets of graphs, each showing the performance of one GPU. Each set consists of three graphs, one per benchmark. Every graph has runtime in seconds on the y-axis and volume size (matrix size for the SGEMM benchmark) on the x-axis. The accuracy is quite high; minor divergences appear only at larger volume sizes.
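For reference, the "within 20%" figure quoted earlier corresponds to a simple relative error between predicted and measured runtimes; the sample values below are placeholders, not numbers from the paper.

```python
# Relative prediction error, as meant when we say predictions are
# "within 20%" of measurements. The sample values are placeholders.

def relative_error(predicted, measured):
    return abs(predicted - measured) / measured

print(f"{relative_error(predicted=1.15, measured=1.00):.0%}")  # -> 15%
```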

There is also a graph depicting the performance prediction for the upcoming Pascal generation of Nvidia GPUs, based on the values recorded in the Quadro K6000 test. It cannot yet be validated, as the card is still in development and its specifications are not public.

Get the full paper for free on EUDL.