### On the Optimality and Speed of the Deep Greedy Switching Algorithm for Linear Assignment Problems

**Authors: Amgad Naiem, Mohammed El-Beltagy**

**INTRODUCTION**

The linear sum assignment problem (LSAP) is a classical combinatorial optimization problem found in many applications. The assignment problem deals with the question of how to assign $n$ persons to $n$ objects in the best possible way. It is specified by a cost matrix $c_{ij}$ for minimization problems or a benefit matrix $a_{ij}$ for maximization problems; these matrices describe the cost or benefit of assigning object $j$ to person $i$.

The LSAP arises in many real-world problems, from simple personnel-to-task assignment in a factory or an organization to more complex applications such as peer-to-peer (P2P) satellite refueling, and it occurs in fields as varied as bioinformatics and computer vision. The LSAP also appears as a subproblem in a number of network flow and applied combinatorial optimization problems, such as the quadratic assignment problem (QAP), the traveling salesman problem (TSP) and the transportation problem. The assignment problem is also used as a preprocessing step for pivoting methods.

Due to the dynamic nature of many applications, as well as the growth of problem sizes, a solution often needs to be found under tight time constraints, so heuristics that produce solutions close to the optimal one are sought. Our efforts in that regard led us to the Deep Greedy Switching (DGS) algorithm.

One of the first algorithms to solve the assignment problem was the Hungarian algorithm, developed in the 1950s. Many other algorithms have since been developed for the LSAP, among them primal-dual algorithms (e.g. the auction algorithm by Bertsekas), interior point algorithms (e.g. the approximate dual projective algorithm by Ramakrishnan et al.) and algorithms that use forest construction, such as the dual forest algorithm by Achatz et al.

The auction algorithm is considered one of the fastest algorithms for finding the optimal solution of the assignment problem. Even so, for large-scale and complex instances it can take a long time to reach optimality. Ramakrishnan et al. discussed five types of assignment problem instances: Geometric, Fixed-Cost, High-Cost, Low-Cost and Uniformly Random. It was shown that for the first two types the auction algorithm performs poorly in terms of running time compared with the other three. For instances of the Geometric and Fixed-Cost types with n = 10,000, the auction algorithm can take hours to finish. The application that motivated DGS was the centralized P2P live streaming system discussed in “On the feasibility of centrally-coordinated peer-to-peer live streaming”. This application requires solving instances of the assignment problem with n > 10,000 persons and objects in less than one minute. This has to be done periodically and in a reasonable time, as the peers are watching and transmitting a live stream.

The auction algorithm proved unsuitable for our application; even a parallelized version did not meet our requirements. The fact that the auction algorithm only yields a partial assignment if interrupted makes it even less suitable. We concluded that it would be impractical for our system and for other applications where execution time is strictly constrained and a solution must be found before a deadline. DGS was developed to fulfill these requirements; in our context it sacrificed a mere 0.6% of the optimal objective value obtained by the auction algorithm while attaining a substantial speedup in computational time.

A plethora of greedy heuristics have been developed for different optimization problems. Trick applied a greedy heuristic to the generalized assignment problem, showing that some randomization of the greedy approach is required to reach better results than a completely greedy one. Another heuristic approach is the greedy randomized adaptive search procedure (GRASP), which consists of multi-start random initial solutions followed by a local search. DGS, on the other hand, is tailored to the assignment problem: its neighborhood construction is designed specifically around the problem's properties.

DGS starts from a random initial solution and then iteratively tries to find a better solution by searching within a well-defined neighborhood. Our greedy switching considers a switch valid if and only if it improves on the current assignment in terms of the problem's global objective function. DGS is partially greedy in that it ranks the potential switches by the improvement each would provide over the current assignment. It then attempts to apply the switches in that ranked order without recalculating how previously executed switches might have affected the validity of any given unexecuted switch. Hence, it is deep.

A key property of the DGS approach is that the objective function value increases after each iteration, so the algorithm eventually terminates when the objective reaches a local optimum. Unlike Tabu search and population-based heuristics, our approach starts with a single solution and never examines more than one candidate solution per iteration.

We discuss the DGS approach for the LSAP in detail in section II, where we also explore its optimality aspects. We present modifications to the basic DGS for assignment problems with partial connectivity in section III, and a parallelization of DGS on GPUs in section IV. Extensive comparisons between DGS and the auction algorithm on known problem types are conducted in section V. The results show that DGS solutions come very close to those obtained by the auction algorithm while being substantially faster to compute.

*DEEP GREEDY SWITCHING FOR LSAP*

Given $n$ persons $I = \{1,\dots,n\}$ and $n$ objects $J = \{1,\dots,n\}$, the objective in a LSAP is to assign each person to exactly one object and each object to exactly one person at minimum cost. Assigning object $j$ to person $i$ incurs cost $c_{ij}$. Finding the minimum cost assignment is computationally equivalent to finding the maximum benefit assignment, and the cost matrix can be transformed into a benefit matrix by the simple transformation

$$a_{ij} = C - c_{ij}, \qquad \forall i \in I,\ j \in J,$$

where $C = \max_{i,j}\{c_{ij}\}$. Without loss of generality we therefore work on the benefit maximization version of the assignment problem, as this is how the DGS approach was initially developed; any cost minimization instance can be transformed into a maximization instance as above without changing the optimal solution. The LSAP can be expressed as a binary integer program where the decision variable $x_{ij}$ is set to 1 if person $i$ is assigned to object $j$ and to 0 otherwise. The problem can be expressed as
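For completeness, here is the standard textbook LSAP formulation consistent with the definitions above (reconstructed, since the displayed program is the usual binary program over $x_{ij}$):

```latex
\begin{aligned}
\max_{x} \quad & \sum_{i \in I} \sum_{j \in J} a_{ij}\, x_{ij} \\
\text{s.t.} \quad & \sum_{j \in J} x_{ij} = 1, && \forall i \in I, \\
                  & \sum_{i \in I} x_{ij} = 1, && \forall j \in J, \\
                  & x_{ij} \in \{0, 1\},       && \forall i \in I,\ j \in J.
\end{aligned}
```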

In DGS, solutions are best described via two assignment mappings. The mapping $\sigma : J \to I$, written $\sigma(j) = i$, means that object $j$ is assigned to person $i$. Similarly, the mapping $\tau : I \to J$, written $\tau(i) = j$, means that person $i$ is assigned to object $j$. We can construct $\tau$ from $\sigma$ by the mapping function $\tau = M(\sigma)$, and the objective function value $\sum_{i \in I}\sum_{j \in J} a_{ij} x_{ij}$ for a given assignment $\sigma$ is computed by the function $f(\sigma)$. DGS is quite robust with respect to the random feasible initial solution: the terminal objective function values for the same problem under different initial assignments deviate very little from each other, as will be shown in section V.

*LOCAL SEARCH AND NEIGHBORHOOD CONSTRUCTION*

Switching is the core of the local search: the constructed neighborhood is based strictly on switching person and object assignments. Given a person $i$ assigned to object $j$ who needs to switch to object $j'$, the corresponding switch produces a new solution in which $i$ is assigned to $j'$ instead of $j$, and the person previously assigned to $j'$ is assigned to $j$, as described below.

The nature of the LSAP makes it quite efficient to calculate the change in the objective function value due to a switch. It can be expressed simply as

$$f(\mathrm{SWITCH}(i, j', \sigma)) - f(\sigma) = \bigl(a_{ij'} + a_{\sigma(j')\tau(i)}\bigr) - \bigl(a_{i\tau(i)} + a_{\sigma(j')j'}\bigr), \tag{1}$$

where the calculation is simply $O(1)$; in contrast, a full $f(\sigma)$ evaluation is $O(n)$. Given an initial random solution $\sigma$, we find for each person $i$ the object $j_i \neq \tau(i)$ that yields the largest increase in the objective function value if person $i$ switches its assignment to object $j_i$, as follows

$$j_i = \operatorname*{arg\,max}_{j \in J,\ j \neq \tau(i)} \; f(\mathrm{SWITCH}(i, j, \sigma)) - f(\sigma),$$

and we also calculate the value of the change in the objective function if person $i$ switches to object $j_i$, as follows

$$\bar{\delta}_i = \max_{j \in J,\ j \neq \tau(i)} \; f(\mathrm{SWITCH}(i, j, \sigma)) - f(\sigma).$$

Similarly, for each object $j$ we find the person $i_j \neq \sigma(j)$ that yields the largest increase in the objective function value if person $i_j$ switches to object $j$, as follows

$$i_j = \operatorname*{arg\,max}_{i \in I,\ i \neq \sigma(j)} \; f(\mathrm{SWITCH}(i, j, \sigma)) - f(\sigma),$$

and we also calculate the value of the change in the objective function if person $i_j$ switches to object $j$, as follows

$$\delta_j = \max_{i \in I,\ i \neq \sigma(j)} \; f(\mathrm{SWITCH}(i, j, \sigma)) - f(\sigma).$$

The person switches $j_i$ and object switches $i_j$ constitute the neighborhood of the current assignment $\sigma$. From this neighborhood we find the best person switch $i^* = \operatorname*{arg\,max}_{i \in I} \bar{\delta}_i$ and the best object switch $j^* = \operatorname*{arg\,max}_{j \in J} \delta_j$. We then apply whichever of the two yields the larger objective gain, comparing the best person switch value $\bar{\delta}_{i^*}$ with the best object switch value $\delta_{j^*}$. If the best switch still yields an improvement in the objective function value, it is applied; this is called a greedy switch.
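A minimal sketch of this neighborhood construction, assuming a NumPy benefit matrix `a` and index arrays `tau` (person → object) and `sigma` (object → person); all function names are ours, not the paper's:

```python
import numpy as np

def switch_delta(a, sigma, tau, i, j):
    """O(1) change in objective if person i switches to object j,
    swapping objects with the person sigma[j], as in eq. (1)."""
    return (a[i, j] + a[sigma[j], tau[i]]) - (a[i, tau[i]] + a[sigma[j], j])

def best_switches(a, sigma, tau):
    """O(n^2) scan: the best person switch (j_i, delta_bar_i) for every
    person and the best object switch (i_j, delta_j) for every object."""
    n = a.shape[0]
    j_best = np.empty(n, dtype=int); d_person = np.empty(n)
    i_best = np.empty(n, dtype=int); d_object = np.empty(n)
    for i in range(n):
        cand = [(switch_delta(a, sigma, tau, i, j), j)
                for j in range(n) if j != tau[i]]
        d_person[i], j_best[i] = max(cand)
    for j in range(n):
        cand = [(switch_delta(a, sigma, tau, i, j), i)
                for i in range(n) if i != sigma[j]]
        d_object[j], i_best[j] = max(cand)
    return j_best, d_person, i_best, d_object
```

Applying the overall best of the two candidate switches then reduces to comparing `d_person.max()` with `d_object.max()`.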

The current solution is replaced by the new solution after every greedy switch. After replacement, the changes in objective function value for the person and object that were switched are recalculated using (1). The process repeats until no improvement is possible: DGS only applies switches that produce a better solution than the current assignment $\sigma$, and when no further improvement can be found, no recalculation of the best person and object switches is performed and the algorithm terminates.

The greediness of DGS comes from prioritizing the switches by their likely impact on the objective function value. As switches are executed, side effects arise: some of the precomputed best person or object switches are no longer valid, i.e. some $\delta_j$ or $\bar{\delta}_i$ values no longer reflect the true change in objective value of the corresponding switch. To see how this can happen, consider a problem with three persons and three objects ($n = 3$). Initially object $x$ is assigned to person $a$, object $y$ to person $b$, and object $z$ to person $c$. Calculating the best person switches gives $j_a = y$, $j_b = z$ and $j_c = y$. If $i^* = a$, then persons $a$ and $b$ switch objects, and the previously calculated $\bar{\delta}_b$ and $\bar{\delta}_c$ values no longer reflect the increase in the objective function from assigning object $z$ to $b$ or object $y$ to $c$. The algorithm is deep in that it continues to apply switches without updating the switch values until the best switch is no longer beneficial; it thus exploits previously calculated best moves even as they become less relevant. This deepness allows a degree of non-greedy exploration of the search space. When none of the best switches remains beneficial, the algorithm recalculates the switch values for all persons and objects.

Unlike the auction algorithm and other primal-dual algorithms, where only a partial solution is available before termination, DGS can return a solution at any point during its execution. This makes DGS effective under time constraints: it starts from a complete assignment, and at every iteration a complete solution is available, so if a solution is required immediately the algorithm can terminate preemptively and return the current one. Fig. 1 shows the improvement of the solution with the number of attempted switches (i.e. the number of iterations of the while loop) for a test run of DGS on an instance of size n = 10,000 of type GEOM (see section V-A). The objective function is monotonically increasing, and halfway through the execution the solution is already above 95% of the auction algorithm's optimal value. It is also interesting that a single round of the repeat loop suffices to realize the greatest gains; we have seen this pattern in many problem instances.

*Finiteness and Time Complexity of the DGS*

A key property of the DGS approach is that the number of valid switches in the neighborhood constructed from $\sigma_{start}$ keeps decreasing until the switch values are reevaluated. It can be shown that at least one of the persons and objects involved in an applied switch will have no further beneficial switch (i.e. $\exists\, \bar{\delta}_i \leq 0 \ \vee\ \exists\, \delta_j \leq 0$); hence any local exploration (i.e. the while loop) starting from $\sigma_{start}$ involves at most $2n$ switches per iteration, and in practice far fewer. This property allows our approach to finish rapidly on big and complex instances of the LSAP.

In addition, given an initial random solution $\sigma_{start}$, and assuming the entries of the benefit matrix $a_{ij}$ are all integers, each switch must increase the objective function value $f(\sigma)$ by at least 1. Hence the maximum number of possible switches is $\Delta f = f(\sigma_{optimal}) - f(\sigma_{start})$, where $\sigma_{optimal}$ is the optimal assignment.

Calculating the best person and object switches for all persons and objects is an $O(n^2)$ operation. This is the most computationally expensive part of DGS, which is why it is done only sparingly even though the best switches become progressively invalid. Instead, for each solution obtained by a greedy switch we recalculate only the best switches of the two persons and two objects involved: if persons $\{i, i'\}$ and objects $\{j, j'\}$ exchanged assignments, DGS recalculates the best switch values only for them. While this reduction in the number of switch evaluations greatly reduces the running time, it may invalidate some of the $\delta_j$ or $\bar{\delta}_i$ values constructed at initialization. We therefore check both the validity of and the improvement gained by any switch before applying it; if an invalid switch is detected, we disregard it and continue with the next best potential switch.
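Putting the pieces together, here is a compact sketch of the DGS loop under our reading of the above. For brevity it implements only person switches (the DGS-P variant of section V), and all names are illustrative rather than taken from the paper's pseudocode:

```python
import numpy as np

def dgs_p(a):
    """Person-switch-only DGS sketch on an n x n benefit matrix a."""
    n = a.shape[0]
    tau = np.random.permutation(n)            # person i -> object tau[i]
    sigma = np.empty(n, dtype=int)
    sigma[tau] = np.arange(n)                 # object j -> person sigma[j]

    def delta(i, j):                          # eq. (1), O(1)
        return (a[i, j] + a[sigma[j], tau[i]]) - (a[i, tau[i]] + a[sigma[j], j])

    def best(i):                              # best switch for person i
        return max((delta(i, j), j) for j in range(n) if j != tau[i])

    while True:                               # "repeat": full recalculation
        switches = [best(i) for i in range(n)]
        improved = False
        while True:                           # "while": deep greedy switching
            i = max(range(n), key=lambda p: switches[p][0])
            d, j = switches[i]
            if d <= 0:
                break                         # no (possibly stale) gain left
            if delta(i, j) <= 0:              # stale entry: disregard it
                switches[i] = (float("-inf"), j)
                continue
            i2 = sigma[j]                     # person losing object j
            tau[i], tau[i2] = j, tau[i]       # execute the switch
            sigma[tau[i]], sigma[tau[i2]] = i, i2
            switches[i], switches[i2] = best(i), best(i2)  # affected only
            improved = True
        if not improved:
            return tau                        # local optimum reached
```

Note the validity re-check via `delta(i, j)` before a switch is applied, mirroring the check-and-disregard rule described above.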

No more than $2n$ greedy switches can occur within the while loop, so each repeat iteration, including the recalculation of the switch values, involves $O(n^2)$ calculations. At each repeat iteration the objective function value $f(\sigma)$ improves by at least 2. Hence DGS is at most $O(n^2 \Delta f)$. In practice, we have never observed more than 4 repeat iterations for problem sizes up to n = 20,000.

*Optimality Limitation of the DGS*

By definition, the DGS algorithm terminates if and only if no beneficial switch exists for any person or object. The worst possible case for a person $i$ at termination is to be assigned to the object $j$ with the minimum benefit from that person's own perspective (i.e. $j = \operatorname*{arg\,min}_{j' \in J} a_{ij'}$); if DGS terminates in this state, then person $i$ could not find any other object $j'$ for which a switch would improve the solution. Since all the values $a_{ij'}$, $j' \neq j$, are greater than $a_{ij}$ and person $i$ can make no beneficial switch, any other person $i'$ would lose more objective value than person $i$ would gain if the switch were made. This means that none of the other persons is assigned to the object with the minimum benefit from its own perspective. Consider a person $i_2 \neq i$: since no switch with person $i$ is possible, $i_2$ must be assigned to an object $j_2$ such that

$$a_{ij_2} - a_{ij} \leq a_{i_2 j_2} - a_{i_2 j}.$$

Let $d_{1,2} = a_{ij_2} - a_{ij}$; then $d_{1,2}$ must be nonnegative, since $a_{ij_2} \geq a_{ij}$ because person $i$ is assumed to hold its worst possible object. Therefore $a_{i_2 j_2} \geq a_{i_2 j}$, and the worst possible case for person $i_2$, assigned to object $j_2$, is to be assigned to its second worst object in benefit. Now consider a person $i_3 \neq i_2 \neq i$: since no switch is possible with persons $i$ and $i_2$, it must be assigned to an object $j_3$ such that

$$a_{ij_3} - a_{ij} \leq a_{i_3 j_3} - a_{i_3 j},$$

$$a_{i_2 j_3} - a_{i_2 j_2} \leq a_{i_3 j_3} - a_{i_3 j_2}.$$

Let $d_{1,3} = a_{ij_3} - a_{ij}$; then $d_{1,3}$ must be nonnegative, since $a_{ij_3} \geq a_{ij}$. Let $d_{2,3} = a_{i_2 j_3} - a_{i_2 j_2}$; it too must be nonnegative, as $a_{i_2 j_3} \geq a_{i_2 j_2}$. Therefore $a_{i_3 j_3} \geq a_{i_3 j}$ and $a_{i_3 j_3} \geq a_{i_3 j_2}$, and the worst case for person $i_3$ is for object $j_3$ to be its third worst object in benefit. Similarly, in this worst case scenario person $i_4$ is assigned to object $j_4$, its 4th worst object in benefit, and so on, until person $i_n$ is assigned to object $j_n$, the object with the maximum $a_{ij}$ value for that person.

Defining $d_{2,1} = a_{i_2 j_2} - a_{i_2 j}$, $d_{3,1} = a_{i_3 j_3} - a_{i_3 j}$ and $d_{3,2} = a_{i_3 j_3} - a_{i_3 j_2}$, we have shown that $d_{1,2} \leq d_{2,1}$, $d_{1,3} \leq d_{3,1}$, $d_{2,3} \leq d_{3,2}$, and in general $d_{i,j} \leq d_{j,i}$ for any $j > i$. Such a stable (and probably poor) DGS solution can only occur if the initial solution was the worst case itself, within one switch of it, or within a small basin of attraction around it. This has a very small probability of happening, as verified by the empirical results in Fig. 2. The probability of reaching the worst case outcome is even smaller than the probability of reaching the optimal solution, because the greediness of DGS switching tends to increase the objective function value maximally.

To find such a worst case assignment for a given $a_{ij}$ matrix, if it exists, we look for an assignment in which the person holding its worst $a_{ij}$ value has no possible switches, and likewise for the person holding its second worst $a_{ij}$ value, and so on. In practice such a solution may not exist. Finding it can be achieved by solving the following integer programming model

Solving this integer program is very expensive, as the number of constraints is $n^2(n-1)^2 + 2n$; even for a reasonably sized problem this imposes massive memory and processing requirements. We therefore developed a heuristic (the worst case finder) that finds a solution close to that of the integer program, and close to the worst case when no exact worst case initial solution exists. The details of this heuristic are beyond the scope of this paper.

Fig. 2 shows the probability of ending up within a certain percentage of the auction algorithm's optimal solution when running DGS on a GEOM instance with n = 100, over one million random samples of initial assignments. When the worst case initial solution found by the worst case finder heuristic was given to DGS as the starting solution, the outcome still reached 99.57% of optimality.

**ASSIGNMENT PROBLEMS WITH PARTIAL CONNECTIVITY**

The assignment problem can be represented as a bipartite graph $G = (V, W; E)$, where the disjoint vertex sets $V$ and $W$ each have $n$ vertices, representing the persons and objects respectively, and $E$ is the set of edges connecting persons $i \in V$ to objects $j \in W$. A subset $M \subseteq E$ is called a matching if every vertex of the graph is incident with at most one edge of $M$; the cardinality of a matching is at most $n$. A maximum matching is one containing as many edges as possible. If every vertex of $G$ appears in exactly one edge of $M$ (i.e. $|M| = n$), then $M$ is called a perfect matching.

Each edge in $E$ has a weight $a_{ij}$, the weight of the edge between person $i$ and object $j$. The graph $G$ is a complete bipartite graph if $E$ contains all possible assignments between any person in $V$ and any object in $W$ without restriction. In many applications, however, the problem is described by an incomplete bipartite graph in which some edges $e \in E$ are missing from $G$, meaning that certain person-object assignments are not allowed.

Mills-Tettey et al. suggest giving missing edges $(i, j)$ in $G$ a large cost value or, in the maximization case, a weight $a_{ij} = -\infty$. This practice is not applicable in a heuristic approach, as we might end up with a solution containing invalid assignments. Moreover, we require our algorithm to know at any point in time whether the current problem is feasible.

To be able to determine feasibility, DGS instead treats such edges as nonexistent. The challenging part then becomes finding an initial feasible solution, if one exists. Let $J_i$ be the set of objects that person $i$ is allowed to be assigned to. The initial feasible solution is determined by iterating over all persons sequentially and trying to find an unassigned object for each person $i$ within the set of available objects $J_i$. If no unassigned object can be found, we choose an assigned object $j$, based on the best impact on the objective function value of assigning object $j$ to person $i$, and unassign it from its current person $i' = \sigma(j)$. Since person $i'$ is now unassigned, it could be counterproductive for $i'$ to later take object $j$ back by unassigning person $i$, as this would likely lead to the object endlessly changing hands. We therefore maintain for each person a set of banned handovers, denoted $B_i$, and record the handover of object $j$ from $i'$ to $i$. If $|B_i|$ for any person $i$ becomes equal to $|J_i|$, we clear the banned set once (i.e. set $B_i = \emptyset$); if it fills up again, we deduce that all possible switches for this person have been exhausted and that no feasible assignment exists for this person, and hence declare the assignment problem infeasible. A tag $cl(i)$ records whether person $i$'s banned set has already been cleared, so that it is not cleared twice.
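The initialisation above can be sketched as follows. This is our own reading of the procedure: the paper selects the displaced object by its impact on the objective, which we simplify to an arbitrary choice, and all names are illustrative:

```python
def initial_feasible(J, n):
    """Find an initial feasible assignment on a partially connected
    instance, or return None if infeasible. J[i] is the set of objects
    that person i may be assigned to."""
    tau = {}                                  # person -> object
    holder = {}                               # object -> current person
    banned = {i: set() for i in range(n)}     # B_i: objects i may not take
    cleared = [False] * n                     # cl(i): banned set cleared once?
    pending = list(range(n))                  # persons still without an object
    while pending:
        i = pending.pop()
        allowed = [j for j in J[i] if j not in banned[i]]
        if not allowed:
            if cleared[i]:
                return None                   # switches exhausted: infeasible
            banned[i].clear()                 # clear the banned set once
            cleared[i] = True
            pending.append(i)
            continue
        free = [j for j in allowed if j not in holder]
        if free:
            j = free[0]
        else:
            j = allowed[0]                    # paper: pick by objective impact
            i_prev = holder[j]
            del tau[i_prev]                   # displace the current holder
            banned[i_prev].add(j)             # i_prev may not take j back
            pending.append(i_prev)
        tau[i] = j
        holder[j] = i
    return tau
```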

Once an initial feasible solution is found, the neighborhood construction and local search of the DGS approach are applied as described earlier, taking care never to switch to a nonexistent or missing edge. The same stopping criteria for the DGS algorithm apply.

**GPU-BASED PARALLELIZATION OF DGS**

The calculation of the best switches for each person in the DGS algorithm has high parallelization potential, and it was therefore parallelized on Graphics Processing Units (GPUs) using CUDA in [24]. The resulting parallel version of DGS shows a large performance gain over the sequential CPU-based implementation.

The obvious first step in parallelizing DGS is to compute the best switches $i_j$ and $j_i$ in parallel, as these calculations are independent of each other. This is shown in the ParallelDGS algorithm as the lines starting with the letter P.

In the parallel GPU version of DGS, the switching phase was also parallelized: in each iteration the non-conflicting switches are applied in parallel, in addition to the parallel calculation of the best switches, to achieve further speedup. We define two switches as conflicting when a single person is common to both. Persons involved in conflicting switches are detected and added to the set C by the checkConflicts algorithm, while persons with non-conflicting switches are added to the set NC. To ensure greediness, the picked persons in sets A and B are sorted by the improvement their switches would bring to the objective function, using the sortδ̄i and sortδj functions respectively. The parallelized switching procedure thus proceeds with the most promising non-conflicting switches first.
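A sketch of conflict detection along these lines; representing a switch as a `(delta, person_a, person_b)` tuple is an assumption of ours, as is the function name:

```python
def check_conflicts(switches):
    """Partition proposed switches into non-conflicting ones (sorted best
    improvement first, ready for parallel application) and conflicting
    ones. Each switch swaps the objects of two persons; two switches
    conflict when they share a person."""
    by_person = {}
    for idx, (_, a, b) in enumerate(switches):
        by_person.setdefault(a, []).append(idx)
        by_person.setdefault(b, []).append(idx)
    conflicted = {idx for ids in by_person.values() if len(ids) > 1
                  for idx in ids}
    nc = sorted((s for k, s in enumerate(switches) if k not in conflicted),
                key=lambda s: -s[0])          # greedy: best improvement first
    c = [s for k, s in enumerate(switches) if k in conflicted]
    return nc, c
```

In the GPU setting the non-conflicting switches in `nc` can then be applied concurrently, since they touch disjoint persons and objects.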

**COMPUTATIONAL RESULTS**

In this section we present computational results comparing the performance of DGS against the auction algorithm of Bertsekas, a well-known fast algorithm that solves the LSAP with a guarantee of near-optimality. All reported comparisons between the two algorithms are averages over five randomly generated instances of the assignment problem, and both optimality and speed differences are evaluated. All experiments in this paper were conducted on a MacBook Pro with a 2.66 GHz Intel Core 2 Duo processor and 4 GB of memory.

In all experiments we compared the execution time of DGS with our own implementation of the auction algorithm, which behaves the same as standard sequential auction implementations. That our implementation matches the behavior of these sequential implementations across different problem types supports the validity of the obtained results.

*Fully Connected Graphs*

We consider four variations of the DGS in the tests that we ran. They are as follows:

- NGS: Non-greedy switches, where the switches are not prioritized and are carried out in random order.
- GS: Greedy switches applied at every iteration, where a new search for the best switches is performed after every switch. The difference between DGS and GS emphasizes the importance of deep switching.
- DGS-P: Person switches only, where we calculate only the person switches and neglect the object switches in order to reduce the computational time of the algorithm.
- DGS-PO: The standard DGS algorithm with both person and object switches.

The instances used in our experiments are of three types: the first comprises the small classical OR-Library benchmark instances used by Beasley, while the other two, RAND and GEOM, are used by Bus et al. and are generated as follows:

- RAND: Completely dense instances where the $a_{ij}$ values are random integers uniformly distributed between 1 and C, where C is a constant input parameter.
- GEOM: First, 2n points are distributed uniformly at random on a 2D square of dimensions [0, C] × [0, C], with n points representing the persons and n points the objects. Each $a_{ij}$ value is then set to the Euclidean distance between point i of the first n generated points and point j of the second n.
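The two generators can be sketched with NumPy as follows (function names are ours):

```python
import numpy as np

def make_rand(n, C, rng=None):
    """RAND: dense benefit matrix with integer entries uniform in [1, C]."""
    rng = rng if rng is not None else np.random.default_rng()
    return rng.integers(1, C + 1, size=(n, n))

def make_geom(n, C, rng=None):
    """GEOM: 2n random points in [0, C] x [0, C]; a_ij is the Euclidean
    distance between person point i and object point j."""
    rng = rng if rng is not None else np.random.default_rng()
    persons = rng.uniform(0, C, size=(n, 2))
    objects = rng.uniform(0, C, size=(n, 2))
    diff = persons[:, None, :] - objects[None, :, :]   # pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))
```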

The small classical benchmark problems range in size from 100 to 800 persons/objects. On these problems, DGS runs from five different random initial solutions deviate from the optimal solution by 0.6% at worst and by 0.5% on average, showing that the quality of the solution is not sensitive to the initial solution used. These results also show no significant difference in running time between DGS and the auction algorithm on such small problems.

The RAND instances were generated with C = 1,000 for low-cost instances and C = 100,000 for high-cost instances, following the parallel auction experiments in [28]. In our experiments the number of persons/objects n ranged from 1,000 to 20,000. Speedup is calculated as the auction running time divided by the DGS running time on the same machine.

The results show that our approach deviates by no more than 0.18% from the optimal solution reached by the auction algorithm. The maximum standard deviation of the optimality percentage across all runs is 0.015 from the average optimality percentage.

The GEOM instances, like the RAND instances, have C = 1,000 for low-cost and C = 100,000 for high-cost instances. The experiments covered n from 1,000 to 5,000; Table III and Table IV summarize the low-cost and high-cost results respectively. The deviation between the DGS and auction solutions was 0.018% in the worst case and 0.015% on average over five runs. The maximum standard deviation of the optimality percentage across all runs is 0.001 from the average optimality percentage.

The GEOM instances are harder than the RAND instances, and the auction algorithm's running time grows steeply on them, whereas DGS runs at almost the same speed as on RAND instances of similar size. The results clearly show that in most cases the simpler DGS-P gives a higher speedup with solutions that are almost always no worse than those of the other variants.

*Partially Connected Graphs*

The instances used in our partial connectivity experiments fall into four types. The first is the OR-Library classical benchmark instances used by Beasley; the second is gathered from Peerialism's peer-to-peer (P2P) live streaming application. The other two types are generated as follows:

- CDG: Completely dense graphs where the $a_{ij}$ values are uniformly random between 0 and 10,000, with a degree d representing the average percentage of dropped edges per node.
- AFX: Assignment problems with fixed cost, where $a_{ij} = 100 \cdot i \cdot j$ and each node has a degree of n/16 random edges.
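These two generators can be sketched as follows; the interpretation of "degree" in each case is our reading of the description, and the function names are ours:

```python
import numpy as np

def make_cdg(n, d, rng=None):
    """CDG sketch: dense uniform benefits in [0, 10000], then drop roughly
    a fraction d of the edges per node; NaN marks a missing edge."""
    rng = rng if rng is not None else np.random.default_rng()
    a = rng.uniform(0, 10_000, size=(n, n))
    keep = rng.random((n, n)) >= d            # keep an edge with prob. 1 - d
    return np.where(keep, a, np.nan)

def make_afx(n, rng=None):
    """AFX sketch: fixed-cost benefits a_ij = 100 * i * j (1-based
    indices), keeping n/16 random edges per person, as an edge dict."""
    rng = rng if rng is not None else np.random.default_rng()
    deg = max(1, n // 16)
    edges = {}
    for i in range(1, n + 1):
        for j in rng.choice(np.arange(1, n + 1), size=deg, replace=False):
            edges[(i, int(j))] = 100 * i * int(j)
    return edges
```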

The classical benchmark problems range in size from 800 to 5,000 persons/objects; on them the DGS approach shows an average deviation of 2.5% from the optimal solution, while being more than 4 times faster than the auction algorithm.

To generate the P2P live streaming instances, we used MyP2PWorld, a reproducible emulation tool for P2P systems by Roverso et al., to produce instances similar to the real ones used by Peerialism. These instances have high connectivity constraints, and the $a_{ij}$ values lie very close to each other, i.e. the difference between most $a_{ij}$ values is very small. This makes the problem very hard, and it is why the auction algorithm takes a very long time to find the optimal solution for these problems.

In our P2P experiments we tried assignment problem instances of sizes n = 12 to 488. On these instances, DGS achieves results better than the auction algorithm's ε-optimal solution by 0.3% on average over 50 different runs, with floating point $a_{ij}$ values. In computational time, DGS is faster than the auction algorithm by an order of magnitude.

The CDG experiments used problem sizes from n = 1,000 to 30,000, with the degree d estimated so that the resulting partially connected graph is feasible and has a high probability of yielding the same optimal assignment as the fully connected dense graph.

Finally, the characteristics of the fixed-cost instances (AFX) make it very hard for the auction algorithm to find an optimal solution, while DGS found the ε-optimal solution very quickly. Moreover, the solution obtained by DGS is identical to the optimal solution of the auction algorithm.

We tried the DGS-P variant on AFX and CDG problems and found that the solutions obtained were around 1% worse than those obtained by DGS.

*GPU-Based Parallel DGS*

The results of parallelizing DGS on the GPU show that, for big problem sizes, the total time to termination on the GPU is less than the time a sequential implementation needs just to compute the best switches in its first iteration.

The observed results of the GPU parallelization show speedups of up to 10 times over the sequential DGS running time. As the sequential DGS algorithm is already quite fast, the GPU parallelization opens up even more substantial running-time improvements without any additional loss of optimality. In a comparison between the parallel DGS on GPU and the GPU implementation of the auction algorithm by Vasconcelos et al., the parallel DGS algorithm showed speedups of more than 300 times over the parallel auction algorithm.

**CONCLUSIONS**

DGS is a very fast and powerful approach to solving linear sum assignment problems. In this work we have explored optimality aspects of DGS and shown how its variants compare with the auction algorithm. The results suggest that doing only person switches (DGS-P) yields a significant speedup with a minor reduction in solution quality on fully connected assignment problems; on partially connected problems, DGS-P yielded worse solutions with no improvement in speedup. The speedup achieved by DGS can reduce the time required to find a near-optimal solution from hours to just a few seconds on some problem instances. Furthermore, we presented a simple parallelization of DGS that gave rise to dramatic improvements in execution time. Many variants of DGS beyond those presented here are possible, exploring a finer trade-off between greediness and depth or more sophisticated parallelization approaches. Future work will explore the neighborhood structure of DGS in greater depth in order to obtain tighter bounds on its computational complexity.