31 pages

Performance evaluation of load balancing strategies for approximate string matching on a cluster of heterogeneous workstations

Please download to get full document.

View again

of 31
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Performance evaluation of load balancing strategies for approximate string matching on a cluster of heterogeneous workstations
  See discussions, stats, and author profiles for this publication at: Performance evaluation of load balancingstrategies for approximate stringmatching application on an MPI clusterof...  Article   in  Future Generation Computer Systems · October 2003 DOI: 10.1016/S0167-739X(03)00109-2 · Source: DBLP CITATIONS 15 READS 36 2 authors: Panagiotis D MichailidisUniversity of Macedonia 79   PUBLICATIONS   246   CITATIONS   SEE PROFILE Konstantinos G. MargaritisUniversity of Macedonia 411   PUBLICATIONS   1,570   CITATIONS   SEE PROFILE All content following this page was uploaded by Konstantinos G. Margaritis on 11 January 2017. The user has requested enhancement of the downloaded file.  Future Generation Computer Systems 19 (2003) 1075–1104 Performance evaluation of load balancing strategies forapproximate string matching application on an MPIcluster of heterogeneous workstations Panagiotis D. Michailidis ∗ , Konstantinos G. Margaritis Parallel and Distributed Processing Laboratory, Department of Applied Informatics, University of Macedonia,156 Egnatia str., GR-54006 Thessaloniki, Greece Received 4 June 2002; received in revised form 25 February 2003; accepted 11 March 2003 Abstract In this paper, we present three parallel approximate string matching methods on a parallel architecture with heterogeneousworkstations to gain supercomputer power at low cost. The first method is the static master–worker with uniform distri-bution strategy, the second one is the dynamic master–worker with allocation of subtexts and the third one is the dynamicmaster–worker with allocation of text pointers. Further, we propose a hybrid parallel method that combines the advantages of static and dynamic parallel methods in order to reduce the load imbalance and communication overhead. This hybrid methodis based on the following optimal distribution strategy: the text collection is distributed proportional to workstation’s speed.We evaluated and compared the performance of the four methods with clusters one, two, four, six and eight heterogeneousworkstations.Theexperimentalresultsdemonstratethatdynamicallocationoftextpointersandhybridmethodsachievebetterperformance than the two srcinal ones. We also present an analytical performance model for the four methods that confirmsthe actual behaviour of the experimental results.© 2003 Elsevier B.V. All rights reserved. Keywords:  Approximate string matching; Cluster of heterogeneous workstations; Message passing interface; Load balancing strategies;Performance prediction 1. Introduction Approximate string matching is one of the mainproblems in classical string algorithms, with applica-tions to information and multimedia retrieval, com-putational biology, artificial intelligence and patternrecognition, web search engines and text mining. It isdefined as follows: given a large text collection  t   = ∗ Corresponding author. Tel.:  + 30-2310891890;fax:  + 30-2310891842.  E-mail addresses: (P.D. Michailidis), (K.G. Margaritis). t  1 t  2  ··· t  n  of length  n , a short pattern  p  =  p 1 p 2  ··· p m of length  m  and a maximal number of errors allowed k , we want to find all text positions where the patternmatches the text up to  k  errors. Errors can be substi-tuting, deleting, or inserting a character.In the on-line version of the problem, it is possibleto preprocess the pattern but not the text collection.The classical solution involves dynamic program-ming and needs  O( mn )  time [17]. Recently, a numberof sequential algorithms improved the classical timeconsuming one; see, for instance, the surveys [9,12].Some of them are sublinear in the sense that they donot inspect all the characters of the text collection. 0167-739X/$ – see front matter © 2003 Elsevier B.V. All rights reserved.doi:10.1016/S0167-739X(03)00109-2  1076  P.D. Michailidis, K.G. Margaritis/Future Generation Computer Systems 19 (2003) 1075–1104 We are particularly interested in information re-trieval, where current free text collections is normallyso very large that even the fastest on-line sequentialalgorithms are not practical, and therefore the par-allel and distributed processing becomes necessary.There are two basic methods to improve the perfor-mance of approximate string matching on large textcollections: one is based on the fine-grain paralleli-sation of the approximate string matching algorithm[1,6–8,14,16] and the other is based on the distribu-tion of the computation of character comparisons onsupercomputers or network of workstations. As faras the second method, concerned distributed imple-mentations of approximate string matching algorithmare not available in the literature. However, we areaware of few attempts for implementing other similarproblems on a cluster of workstations. In [2] an exact string matching implementation have been proposedand results are reported on a transputer based archi-tecture. In [10,11] an exact string matching algorithm was parallelised and modelled on a homogeneousplatform giving positive experimental results. Further,in [4] the sequence comparison algorithms have been implemented to a variety of parallel computers, i.e.shared and distributed memory architectures. Finally,in [7,21] presented parallelisations of a biological se-quence analysis algorithm on a homogenous cluster of workstations and on an Intel iPSC/860 parallel com-puter, respectively. Some surveys on the fine-grain andcoarse-grain parallelisation of the sequence analysisalgorithms can be found in [15,20].The main contribution of this work is three low-costparallel approximate string matching approaches thatcan search in very large free textbases on inexpensivecluster of heterogeneous PCs or workstations runningLinux operating system. These approaches are basedon master–worker model using static and dynamicallocation of text collection. Further, we propose ahybrid parallel approach that combines the advan-tages of three previous parallel approaches in order toreduce the load imbalance and communication over-head. This hybrid approach is based on the followingoptimal distribution strategy: the text collection is dis-tributed proportional to workstation’s speed. The fourapproaches are implemented using the MPI library[3,13,18,19] over a cluster of heterogeneous worksta-tions. Finally, a framework is presented for the per-formance modelling of the previous four approacheson a cluster of workstations connected by a Fast Eth-ernet network. To the best of our knowledge this isthe first attempt to implement and model the perfor-mance of an approximate string matching applicationusing static and dynamic load balancing strategies ona network of heterogeneous workstations. 2. Heterogeneous computing model A heterogeneous network (HN) can be abstractedas a connected graph HN (M,C) , where •  M   = { M  1 ,M  2 ,...,M  p }  is set of heterogeneousworkstations ( p  is the number of workstations). Thecomputation capacity of each workstation is deter-mined by the power of its CPU, I/O and memoryaccess speed. •  C  is standard interconnection network for worksta-tions, such as Fast Ethernet or an ATM network,where the communication links between any pair of the workstations have the same bandwidth.Based on the above definition, if a network consistsof a set of identical workstations, the system is homo-geneous. Further, a heterogeneous network can be di-vided into two classes: a dedicated system where eachworkstation is dedicated to execute tasks of a parallelcomputation, and non-dedicated system where eachworkstation executes its normal routines (also calledowner workload) and only the idle CPU cycles areused to execute parallel tasks. In this paper, we use adedicated network of heterogeneous workstations. 2.1. Metrics Metrics help to compare and characterise parallelcomputer systems. Metrics cited in this section aredefined and published in previous paper [22]. They can be roughly divided into characterisation metricsand performance metrics. 2.1.1. Characterisation metrics To compute the power weight among workstationsan intuitive metric is defined as follows: W  i (A)  = min pj  = 1 { T(A,M  j  ) } T(A,M  i ),  (1)where  A  is an application and  T(A,M  i )  is the execu-tiontimeforcomputing A onworkstation M  i .Formula  P.D. Michailidis, K.G. Margaritis/Future Generation Computer Systems 19 (2003) 1075–1104  1077 (1) indicates that the power weight of a workstationrefers to its computing speed relative to the fastestworkstation in the network. The value of the powerweight is less than or equal to 1.To calculate the execution time of a computationalsegment, the speed, denoted by  S  f   of the fastest work-station executing basic operations of an application ismeasured by the following equation: S  f   = Θ(c)t  c ,  (2)where  c  is a computational segment,  Θ(c)  is a com-plexity function which gives the number of basicoperations in a computational segment and  t  c  is theexecution time of   c  on the fastest workstation in thenetwork.Using the speed of the fastest workstation,  S  f  , wecan calculate the speeds of the other workstations inthe system, denoted by  S  i  ( i  =  1 ,...,p ), using thecomputing power weight as follows: S  i  =  S  f  W  i , i  =  1 ,...,p  and  i  =  f,  (3)where  W  i  is the computing power weight of   M  i . So,by Eq. (3), the execution time of segment  c  on anydedicated workstation  M  i  (1  ≤  i  ≤  p ) denoted by T  cpu (c,M  i ) , can be represented as T  cpu (c,M  i )  = Θ(c)S  f  W  i = Θ(c)S  i .  (4)Here,  T  cpu  is considered the required CPU time for thesegment.Therefore, the parallel execution time of a segment c  across the heterogeneous network HN, denoted by T  cpu (c, HN ) , can be represented as T  cpu (c, HN )  = Θ(c)  pi = 1  S  i ,  (5)where  pi = 1  S  i  is the computational capacity usedwhich is obtained by summing the individual speedsof the workstations. 2.1.2. Performance metrics Speedup is used to quantify the performancegain from a parallel computation of an application A  over its computation on a single machine on aheterogeneous network system. The speedup of aheterogeneous computation is given bySP (A)  = min pj  = 1 { T(A,M  j  ) } T(A, HN ),  (6)where T(A, HN ) is the total parallel execution time forapplication  A  on HN, and  T(A,M  j  )  is the executiontime for  A  on workstation  M  j  ,  j   =  1 ,...,p .Efficiency or utilisation is a measure of the time per-centage for which a machine is usefully employed inparallel computing. Therefore, the utilisation of paral-lel computing of application A on a dedicated hetero-geneous network is defined as follows: E  = SP (A)  pj  = 1  W  j  .  (7)The previous formula indicates that if the speedup islarger than  pj  = 1  W  j  , the system computing power,the computation presents a superlinear speedup in adedicated heterogeneous network. 3. MPI master–worker implementations of approximate string matching We follow master–worker programming modelto develop our parallel and distributed approximatestring matching implementations under MPI library[3,13,18,19]. This model consists of a master work-station and a collection of worker workstations. Themaster workstation is used to partition a given textcollection into a set of several smaller subtext collec-tions and distribute them to all worker workstationsand collect the local results from the worker worksta-tions. The worker workstations are mainly performeda sequential approximate string matching algorithmon their respective subtext collections [9,12]. Staticand dynamic master–worker strategies are imple-mented and presented in next subsections. 3.1. The MPI static master–worker model3.1.1. Implementation In order to present the static master–worker imple-mentation we make the following assumptions: first,the workstations have an identifier myid and are num-bered from 1 to  p , second the documents of our textcollection are distributed among the various worksta-tions and stored on their local disks, and finally the  1078  P.D. Michailidis, K.G. Margaritis/Future Generation Computer Systems 19 (2003) 1075–1104 pattern and the number of errors  k  are stored in mainmemory to all workstations. The partitioning strategyof this approach is to partition the entire text collec-tion into a number of subtext collections according tothe number of workstations allocated. The size of eachsubtext collection contains   n/p + m − 1 successivecharacters of the complete text collection. There is anoverlap of   m  −  1 pattern characters between succes-sive subtexts, i.e. a reduance of   p(m  −  1 )  characters.Therefore, the static master–worker implementationthat is called P1, is composed of four phases. In firstphase, the master broadcasts the pattern string and thenumber of errors  k  to all the workers. In second phase,each worker reads its subtext collection from the lo-cal disk in main memory. In third phase, each workerperforms character comparisons using a local sequen-tial approximate string matching algorithm to gener-ate the number of occurrences. In fourth phase, themaster collects the number of occurrences from eachworker. This entire implementation is constructed sothat alternative sequential approximate string match-ing algorithms can be substituted quite easily [9,12].In this paper, we use the classical SEL dynamic pro-gramming algorithm [17].The advantage of this simple approach is low com-munication overhead. This advantage was achieved,a priori, by the search computation, assigning eachworker to search its own subtext independently with-out having to communicate with the other workers orthe master. However, the main disadvantage is the pos-sible load imbalance because of the poor partitioningtechnique. In the other words, there is a significant idletime for faster or more lightly loaded workstations ina heterogeneous environment. 3.1.2. Analytical modelling Let  T  a ,  T  b ,  T  c , and  T  d   be the time spent in eachof four phases, respectively. The analytical costs (intime) are as follows.The first phase includes the communication timeto broadcast the pattern string and the number of errors  k  to all workstations involved in processingof string matching. We used the function MPI Bcastto broadcast information that is completed in log 2  p steps. The size of an  m  pattern string is  m  bytesand the number of errors  k  is 1 byte. Therefore,the broadcast transfers  m  +  1 bytes to the other p  −  1 workstations. The time  T  a  for the first phaseis as follows: T  a  =  log 2  p(α  +  (m  +  1 )β).  (8)We know that the total communication time of theparallel algorithm is the summation of two compo-nents: latency time  α  and transmission time  β . Thenthe communication time,  T  comm , to send  P   bytes of data (messages) is defined as T  comm  =  α  +  Pβ.  (9)The second phase is the average I/O time to read therespective subtext collections with size   n/p + m − 1characters from the local disks of the various work-stations. Then, the time  T  b  for the second phase is asfollows: T  b  = p max j  = 1   n/p  +  m  −  1 (S  I / O ) j   ,  (10)where  (S  I / O ) j   is the I/O capacity of the heterogeneousnetwork when  j   workstation is used.The third phase is the average string matching timeacross the heterogeneous network. The string match-ing of an  m  pattern string in a subtext collection withsize   n/p + m − 1 characters requires  (  n/p + m − 1 )m  computation steps for the SEL dynamic program-ming algorithm [17]. Then, the time  T  c  for the stringmatching phase is given by T  c  = p max j  = 1  (  n/p  +  m  −  1 )m(S  search ) j   ,  (11)where  (S  search ) j   is the text searching capacity of theheterogeneous network when  j   workstation is used.Finally, the fourth phase includes the communica-tion time to gather  p  results resulting from the stringmatching carried on the subtext collections by p work-stations concurrently. Each workstation sends back one value (in our case, the number of occurrences).We used the function MPI Reduce to collect the re-sults that is completed in log 2  p  steps. Therefore, time T  d   for this phase is as follows: T  d   =  log 2  p(α  +  β).  (12)The total execution time of our static approximatestring matching implementation,  T  p , using  p  worksta-tions, is the summation of the four terms and is givenby T  p  =  T  a  +  T  b  +  T  c  +  T  d  .  (13)
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!