Authors: Roverso R., Naiem A., El-Beltagy M., El-Ansary S., Haridi, S
The contribution of the paper is threefold: First, we present the process of trial-and-error we went through, in the hope that our experience will be beneficial to adopters of GPU programming for similar problems. Second, we show the modifications needed to parallelize the DGS algorithm. Third, we show the performance gains of our approach compared to both a sequential CPU-based implementation of DGS and a parallel GPU-based implementation of the auctioning algorithm.
The simplest way to obtain an initial solution is by randomly assigning jobs to agents. An alternative is to do a greedy initial assignment where the benefit aij is taken into account. In our experiments with DGS we found that there was not clear advantage to adopt either approach. Since greedy initial assignment takes a bit longer, we opted to use random agent/job assignment for the initial solution.
We show results of the impact of the various steps that we went through to implement it and enhance its performance. The experimental setup for the tests consists of a consumer machine with a 2.4Ghz Core 2 Duo processor equipped with 4GB of DDR3 RAM and NVIDIA GTX 295 graphic card with 1GB of DDR5 on-board memory. The NVIDIA GTX 295 is currently NVIDIA’s top-of-the-line consumer video card and boasts a total number of 30 Scalar Multiprocessors and 240 Processors, 8 for each SM, which run at a clock rate of 1.24 GHz. In the experiments, we use a thread block size 256 when executing kernels which do not make use of Shared Memory, and 16 in the case they do. Concerning the DGS input scenarios, we use dense instances of the GEOM type defined by Bus and Tvrdık, and generated as follows: first we generate n points randomly in a 2D square of dimensions [0, C] × [0, C], then each aij value is set as the Euclidean distance between points i and j from the generated n points. We define the problem size to be equal to the number of agents/jobs. For the sake of simplicity, we make use of problem sizes which are multiples of the thread block size. Note that every experiment is the result of the averaging of a number of runs executed using differently seeded DGS instances.
Difference Evaluation on Global Memory
Global memory is fully addressable by any thread running on the GPU and no special operation is needed to access data on it. Therefore, in the first version of the kernel, we decided to simply upload the full Aij matrix to the GPU memory together with the current agent to job assignments and all the data we needed to run the ADE algorithm on the GPU. Then we let the GPU spawn a thread for each of the agents involved. Consequently, a CUDA thread cti associated with agent i executes the ADE algorithm only for agent i by evaluating all its possible 2-exchanges. The agent-to-thread allocation on the GPU is trivial and is made by assigning the thread identifier cti to agent i.
Difference Evaluation on Shared Memory
Difference Evaluation using Shared Memory assigns one thread to each 2-exchange evaluation for agent i and job j. That implies that the number of created threads equals the number of cells of the Aij matrix. Each thread ctij then proceeds to load in shared memory the data which is needed for the single evaluation between agent i and job j. Once the 2-exchange evaluation is computed, every thread ctij stores the resulting value in a matrix located in global memory in position (i,j). After that, another small kernel is executed which causes a thread for each row i of the resulting matrix to find the best 2-exchange value along that same row for all indexes j. The outcome of this operation represents the best 2-exchange value for agent i. We compare the results obtained by running the two aforementioned Shared Memory GPU kernel implementations and its Global Memory counterpart against the pure C implementation of the Difference Evaluation for different problem sizes. For evaluation purpose, we used a CUDA-enabled version of the DGS where only the Difference Evaluation phase of the algorithm runs on the GPU and can be evaluated separately from the other phase. This implies that we need to upload the input data for the ADE/JDE phase to the GPU at every iteration of the DGS algorithm, as well as we need to download its output in order to provide input for the Switching phase. The aforementioned memory transfers are accounted for in the measurements. As we can observe, there’s a dramatic improvement when passing from the CPU implementation of the difference evaluation to both GPU implementations, even though multiple memory transfers occur. In addition, the Shared Memory version behaves consistently better than the Global Memory one. Furthermore, the trend for increasing problem sizes is linear for both GPU versions of the Difference Evaluation, opposed to the exponential growth of the CPU version curve.
Considering the Switching phase of the DGS algorithm, we found out that in many cases the computational time necessary to apply the best 2-exchanges is fairly high. Our experience is that the switching phase might have a relative impact of between 35% and 60% of the total computation time of the solver. In order to improve the performance of this phase, we modified the Switching algorithm so that a subset of the best 2-exchanges computed in the Difference Evaluation section might be applied concurrently.
In this paper we presented the realization of a GPU-enabled LSAP solver based on the Deep Greedy Switching heuristic algorithm and implemented using the CUDA programming language. We detailed the process of implementation and enhancement of the two main phases of the algorithm: Difference Evaluation and Switching, and we provided results showing the impact of each iteration on the performance. In particular, we showed how parallelizing some parts of the solver with CUDA can lead to substantial speed–ups. We also suggested a modification to the DGS algorithm, in the Switching phase, which enables the solver to run entirely on the GPU. In the last part of the paper, we also show the performance of the final version of the solver compared to a pure C language DGS implementation and to an auction algorithm implementation on GPUs, concluding that the time needed for the DGS solver to reach an outcome is one order of magnitude lower compared to the “C” implementation for big scenarios and three orders of magnitude lower on average compared to the GPU auction solver in almost all problem sizes.