Data & Analytics

12 pages
3 views

An effective hybrid transactional memory system with strong isolation guarantees

Please download to get full document.

View again

of 12
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Share
Description
An effective hybrid transactional memory system with strong isolation guarantees
Transcript
  An Effective Hybrid Transactional Memory Systemwith Strong Isolation Guarantees Chi Cao Minh, Martin Trautmann, JaeWoong Chung,Austen McDonald, Nathan Bronson, Jared Casper,Christos Kozyrakis, Kunle Olukotun Computer Systems LaboratoryStanford Universityhttp://tcc.stanford.edu {caominh, mat42, jwchung, austenmc, nbronson, jaredc, kozyraki, kunle}@stanford.edu ABSTRACT We propose signature-accelerated transactional memory (SigTM),a hybrid TM system that reduces the overhead of software trans-actions. SigTM uses hardware signatures to track the read-set andwrite-set for pending transactions and perform conflict detectionbetween concurrent threads. All other transactional functionality,including data versioning, is implemented in software. Unlike pre-viously proposed hybrid TM systems, SigTM requires no modifi-cations to the hardware caches, which reduces hardware cost andsimplifies support for nested transactions and multithreaded pro-cessor cores. SigTM is also the first hybrid TM system to providestrong isolation guarantees between transactional blocks and non-transactional accesses without additional read and write barriers innon-transactional code.Usingasetofparallelprogramsthatmakefrequentuseofcoarse-grain transactions, we show that SigTM accelerates software trans-actions by 30% to 280%. For certain workloads, SigTM can matchthe performance of a full-featured hardware TM system, while forworkloads with large read-sets it can be up to two times slower.Overall, we show that SigTM combines the performance character-istics and strong isolation guarantees of hardware TM implementa-tions with the low cost and flexibility of software TM systems. Categories and Subject Descriptors:  C.1.2 [Processor Architec-tures]: Multiple Data Stream Architectures – MIMD processors;D.1.3 [Programming Techniques]: Concurrent Programming – par-allel programming General Terms:  Performance, Design, Languages Keywords:  Transactional Memory, Strong Isolation, Multi-coreArchitectures, Parallel Programming 1. INTRODUCTION Transactional Memory (TM)  [16, 1] is emerging as a promis-ing technology to address the difficulty of parallel programming Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.  ISCA’07,  June 9–13, 2007, San Diego, California, USA.Copyright 2007 ACM 978-1-59593-706-3/07/0006 ...$5.00. for multi-core chips. With TM, programmers simply declare thatcode blocks operating on shared data should execute as  atomic and isolated transactions  with respect to all other code. Concurrencycontrol as multiple transactions execute in parallel is the responsi-bility of the system.Transactional memory can be implemented in hardware (  HTM  )or software ( STM  ). HTM systems use hardware caches to track thedata read or written by each transaction and leverage the cachecoherence protocol to detect conflicts between concurrent transac-tions [13, 21]. The advantage of hardware support is low over-head. Transactional bookkeeping takes place transparently as theprocessor executes load and store instructions. The disadvantageof HTM is complexity and cost, as the caches and the coherenceprotocol must be redesigned. It is also difficult to justify new hard-ware features without significant experience with a large body of transactional software. STM systems implement all bookkeepingin software using instrumentation code (read and write barriers)and software data structures [14, 23, 11, 15]. STM frameworks runon existing hardware and can be flexibly modified to introduce newfeatures, provide language support, or adapt to specific applicationcharacteristics. The disadvantage of the software-only approachis the runtime overhead due to transactional bookkeeping. Eventhough the instrumentation code can be optimized by compilers [2,15], STM can still slow down each thread by 40% or more.Apart from the cost/performancetradeoff, there are important se-mantic differences between HTM and STM. HTM systems support strong isolation , which implies that transactional blocks are iso-lated from non-transactional accesses [18]. There is also a consis-tentorderingbetweencommittedtransactionsandnon-transactionalaccesses. In contrast, high-performance STM systems do not sup-port strong isolation because it requires read and write barriers innon-transactional code and leads to additional runtime overhead.As as result, STM systems may produce incorrect or unpredictableresults even for simple parallel programs that would work correctlywith lock-based synchronization [18, 12, 25].This paper introduces  signature-accelerated TM (SigTM) , a hy-brid transactional memory implementation that reduces the over-head of software transactions and guarantees strong isolation be-tween transactional and non-transactional code. SigTM uses  hard-ware signatures  to conservatively represent the transaction read-setand write-set. Conflict detection and strong isolation are supportedby looking up coherence requests in the hardware signatures [5].All other functionality, including transactional versioning, commit,and rollback, is implemented in software. Unlike previously pro-posed hybrid schemes [17, 10, 24], SigTM requires no modifica-tions to hardware caches, which reduces SigTM’s hardware cost, 69  simplifies crucial features (support for OS events, nested trans-actions, and multithreading), and eliminates negative interferencewith other cache operations (prefetching and speculative accesses).Moreover, SigTM is a stand-alone hybrid TM that does not requiretwo operation modes or two code paths. SigTM is also the first hy-brid TM system to implement strong isolation without additionalbarriers in non-transactional code.The specific contributions of this work are: •  WedescribethehardwareandsoftwarecomponentsofSigTM,a hybrid transactional memory system that uses hardwaresignatures for read-set and write-set tracking. SigTM im-proves the performance of software transactions while pro-viding strong isolation guarantees. •  We compare SigTM to STM and HTM systems using a set of parallel applications that make frequent use of coarse-graintransactions. We show that SigTM outperforms software-only TM by 30% to 280%. While for some workloads itperforms within 10% of HTM systems, for workloads withlarge read-sets, SigTM trails HTM by up to 200%. •  We quantify that relatively small hardware signatures, 128bits for the write-set and 1024 bits for the read-set, are suffi-cient to eliminate false conflicts due to the inexact nature of signature-based conflict detection.The rest of the paper is organized as follows. Sections 2 and 3 re-view STM and HTM systems and their differences in terms of per-formance and isolation guarantees. Section 4 presents the SigTMsystem. Sections 5 and 6 present the experiment methodology andresults respectively. Section 7 discusses related work, and Section8 concludes the paper. 2. SOFTWARE AND HARDWARE TM A TM system provides  version management   for the data writ-ten by transactions and performs  conflict detection  as transactionsexecute concurrently. This section summarizes hardware and soft-ware implementation techniques, focusing on the specific STM andHTM systems we used in this study. There are several alternativeimplementations for both approaches [1, 18]. 2.1 Software Transactional Memory STM systems implement version management and conflict de-tection using software-only techniques. In this work, we use Sun’sTL2 as our base STM [11]. TL2 is a lock-based STM that imple-ments optimistic concurrency control for both read and write ac-cesses and scales well across a range of contention scenarios [12].Algorithm 1 provides a simplified overview of the STM for a C-style programming language. Refer to [11] for a discussion of TL2for object-oriented languages. The STM maintains a global versionclock used to generate timestamps for all data. To implement con-flict detection at word granularity, it also associates a lock to everyword in memory using a hash function. The first bit in the lock word indicates if the corresponding word is currently locked. Theremaining bits are used to store the timestamp generated by the lasttransaction to write the corresponding data.A transaction starts ( STMtxStart ) by taking a checkpoint of the current execution environment using  setjmp  and by readingthe current value of the global clock into variable RV   . A transac-tion updates a word using a write barrier ( STMwriteBarrier ).The barrier first checks for a conflict with other committing or com-mitted transactions using the corresponding lock. A conflict is sig-naled if the word is locked or its timestamp is higher than the value Algorithm 1  Pseudocode for the basic functions in the TL2 STM. procedure  STM TX S TART checkpoint() RV    ← GlobalClock procedure  STM WRITE B ARRIER ( addr ,  data ) bloomFilter .insert( addr ) writeSet .insert( addr ,  data ) procedure  STM READ B ARRIER ( addr ) if   bloomFilter .member( addr ) and  writeSet .member( addr )  thenreturn  writeSet .lookup( addr ) if   locked(addr) or (timeStamp(addr) >  RV    then conflict() readSet .insert( addr ) return  Memory[ addr ] procedure  STM TX C OMMIT for  every  addr  in  writeSet  doif   locked( addr )  then conflict() else lock( addr ) WV    ← Fetch&Increment( GlobalClock ) for  every  addr  in  readSet  doif   locked( addr ) or (timeStamp( addr ) >  RV    )  then conflict() for  every  addr  in  writeSet  do Memory[ addr ] ← writeSetLookup( addr ) for  every  addr  in  writeSet  do timeStamp( addr ) ← WV   unlock( addr ) in RV   . Assuming no conflict, the store address and data are addedto the  write-set  , a hash table that buffers the transaction output un-til it commits. The STM also maintains in software a 32-bit Bloomfilter for the addresses in the write-set. A transaction loads a wordusing a read barrier ( STMreadBarrier ). The barrier first checksif the latest value of the word is available in the write-set, usingthe Bloom filter to reduce the number of hash table lookups. If theaddress is not in the write-set, it checks for a conflict with othercommitting or committed transactions. Assuming no conflict, itinserts the address to the  read-set  , a simple FIFO that tracks readaddresses. Finally, it loads the word from memory and returns itsvalue to the user code.In order to commit its work ( STMtxCommit ), a transaction firstattempts to acquire the lock for all words in its write-set. If it failson any lock then a conflict is signaled. Next, it atomically incre-ments the global version clock using the atomic instructions in theunderlying ISA (e.g.,  cmpxchg  in x86). It also validates all ad-dresses in the read-set by verifying that they are unlocked and thattheir timestamp is not greater than  RV   . At this point, the trans-action is  validated   and guaranteed to complete successfully. Thefinal step is to scan the write-set twice in order to copy the newvalues to memory, update their timestamp to WV   , and release thecorresponding lock.The STM handles conflicts in the following manner. If a trans-action fails to acquire a lock while committing, it first spins for alimited time and then aborts using  longjmp . A transaction abortsin all other conflict cases. To provide liveness, the STM retriesthe transaction after a random backoff delay that is biased by thenumber of aborts thus far. 70  2.2 Hardware Transactional Memory HTM systems implement version management and conflict de-tection by enhancing both caches and the cache coherence protocolin a multi-core system [13, 21]. The HTM used in this study is thehardware equivalent of the TL2 STM. It uses the cache to buffer thewrite-set until the transaction commits. Conflict detection is imple-mentedusingcoherencemessageswhenonetransactionattemptstocommit (optimistic concurrency). In general, our HTM is similar tothe TCC system [13] with two key deviations. First, TCC executesall user code in transactional mode, while our HTM uses transac-tional mechanisms only for user-defined transactions. The rest of the code executes as a conventional multithreaded program withMESI cache coherence and sequential consistency. Second, TCCuses a write-through coherence protocol that provides conflict reso-lution at word granularity. Our HTM uses write-back caches whichrequires conflict detection at the cache line granularity.The HTM extends each cache line with one  read bit (R)  and one write bit (W)  that indicate membership in a transaction’s read-setand write-set respectively. A transaction starts by taking a regis-ter checkpoint using a hardware mechanism. A store writes to thecache and sets the W bit. If there is a cache miss, the cache line isrequested in the shared state. If there is a hit but the line containsmodified data produced prior to the current transaction (modifiedand W bit not set), it first writes back the data to lower levels of thememory hierarchy. A load reads the corresponding word and setsthe R bit if the W bit is not already set. If there is a cache miss forthe load, we retrieve the line in the shared state as well.When a transaction completes, it arbitrates for permission tocommit by acquiring a unique hardware lock. This implies thatonly a single transaction may be committing at any point in time.Parallel commit can be supported using a two-phase protocol [6],but it was not necessary for the system sizes we studied in thispaper. Next, the HTM generates snooping messages to request ex-clusive access for any lines in its write-set (W bit set) that are inshared state. At this point, the transaction is validated. Finally,it commits the write-set atomically by flash resetting all W and Rbits and then releasing the commit lock. All data in the write-setare now modified in the processor’s cache but are non-transactionaland can be read by other processors.An ongoing transaction detects a conflict when it receives a co-herence message with an exclusive request for data in its read-set orwrite-set. Such a message can be generated by a committing trans-action or by a non-transactional store. A software abort handler isinvoked that rolls back the violated transaction by flash invalidatingall lines in the write-set, flash resetting all R and W bits, and restor-ing the register checkpoint [20]. Transaction starvation is avoidedby allowing a transaction that has been retried multiple times to ac-quire the commit lock at its very beginning. Forward progress isguaranteed because a validated transaction cannot abort. To guar-antee this, a validated transaction sends negative acknowledgments(nacks) to all types of coherence requests for data in its write-setand exclusive requests for data in its read-set. Once validated, anHTM transaction must execute just a few instructions to flash resetits W and R bits, hence the window of nacks is extremely short.If a transaction receives a shared request for a line in its read-setor write-set prior to reaching the commit stage, it responds that ithas a shared copy of the line and downgrades its cache line to theshared state if needed. 3. STM - HTM DIFFERENCES This section compares qualitatively the STM and HTM systemsin terms of their performance and strong isolation guarantees. 0.00.51.01.52.02.5 d   e  l   a  u  n  a   y   g  e  n  o  m  e  k  m  e  a  n  s  -  h  i    g  h  k  m  e  a  n  s  -  l   o  w      N  o  r  m  a   l   i  z  e   d   E  x  e  c  u   t   i  o  n   T   i  m  e Busy Commit Write Barrier Read Barrier Other ` 012345678 v  a  c  a  t  i   o  n  -  h  i    g  h  v  a  c  a  t  i   o  n  -  l   o  w   Figure 1.  The STM execution time breakdown for a single proces-sor run. Execution time is normalized to that of the sequential codewithout transaction markers or read/write barriers. 3.1 Performance STM transactions run slower than HTM transactions due to theoverhead of software-based versioning and conflict detection. Eventhoughthetwosystemsmayenablethesamedegreeofconcurrencyand exhibit similar scaling, the latency of individual transactions isa key parameter for the overall performance. Figure 1 shows theexecution time breakdown for STM running on a single processor(no conflicts or aborts). Execution time is normalized to that of thesequential code. STM is slower than sequential code by factors of 1.5 × to 7.2 × . In contrast, the overhead of HTM execution on oneprocessor is less than 10%.The biggest bottleneck for STM is the maintenance and valida-tion of the read-set. For every word read, the STM must execute atleast one read barrier and revalidate its timestamp during the com-mit phase. Even after optimizations, the overhead is significant fortransactions that read non-trivial amounts of data. The next im-portant bottleneck is the three traversals of the write-set needed tovalidate and commit a transaction: first to acquire the locks, thento copy the data values to memory, and finally to release the locks.Lock handling is expensive as it involves atomic instructions andcauses additional cache misses and coherence traffic. The third im-portant factor (not shown in Figure 1) is that an STM transactionmay not detect a conflict until it validates its read-set at the veryend of its execution. This is due to that fact that a read by an ongo-ing transaction is not visible to any other transaction committing anew value for the same word. Invisible reads increase the amountof work wasted on aborted transactions.There are also secondary sources of STM overhead. For exam-ple, loads must first search the write-set for the latest value. Our ex-perienceisthatthesoftwareBloomfiltereliminatesmostredundantsearches. Aborting a transaction on a conflict is also sufficientlyfast for our STM since no data are written to memory before thetransaction is validated. Finally, there can be false conflicts due tothe hash function used to map variable addresses to locks. We usea million locks which makes this case rare.STM overheads can be reduced through manual or compiler- 71  Assume that initially  x==y==0 // Thread 1 atomic {t1=x;...t2=x;} // Thread 2 ......x=100;...... // Thread 1 atomic {t=x;...x=t+1;} // Thread 2 ......x=100;...... // Thread 1 atomic {x+=100;y+=100;...} // Thread 2 ............t=x+y; (a) Non-repeatable Reads (b) Lost Updates (c) Overlapped Writes Figure 2.  Isolation and ordering violation cases for the STM system.basedoptimizations[2, 15]. Theprimarygoalistoeliminateredun-dant or repeated barriers using techniques such as common subex-pression elimination and loop hoisting or by identifying thread-local and immutable data. Nevertheless, even optimized STM coderequires one barrier per unique address as well as the validation andcommit steps at the end of the transaction. A special case is read-only transactions for which there is no need to build or validate theread-set or the write-set [11]. Unfortunately, statically-identified,read-only transactions are not common in most applications.For the HTM, hardware support eliminates most overheads fortransactionalbookkeeping. Noadditionalinstructionsareneededtomaintain the read-set or write-set. Loads check the write-set auto-matically for the latest values. Read-set validation occurs transpar-ently and continuously as coherence requests are processed; hence,conflicts are discovered as soon as a writing transaction commits.The write-set is traversed a single time on commit. On the otherhand, the HTM may experience performance challenges on trans-actions whose read-set and write-set overflow the available cachespace or associativity. There are several proposed techniques thatvirtualize HTM systems using structures in virtual memory [22, 8,7]. In most cases, the HTM performs similarly to an STM systemfor the overflowed transactions due to the overhead of virtualiza-tion. An additional source of overhead for HTM can be false con-flicts due to conflict detection at cache line granularity. The STMuses word granularity conflict detection. If STM-style compiler op-timizations are not used with the HTM system, false conflicts canbe more frequent. For example, if an HTM transaction reads animmutable field in the same cache line with a field written by an-other transaction, there will be a false conflict. Ideally, the compilershould instruct the HTM system to use a non-transactional load forthe immutable field that will not add its address to the read-set [20]. 3.2 Strong Isolation and Ordering The differences between HTM and STM extend beyond per-formance. An important feature for TM systems is  strong iso-lation , which facilitates predictable code behavior [18]. Strongisolation requires that transactional blocks are isolated from non-transactional memory accesses. Moreover, there should be a con-sistent ordering between transactional code and non-transactionalaccesses throughout the system. High-performance STM systemsdo not implement strong isolation because it requires additionalread and write barriers in non-transactional code that exacerbatethe overhead issues. We refer readers to [25] for a thorough discus-sion of isolation and ordering issues in all types of STM systems.Figure 2 presents the three isolation and ordering cases that leadto unpredictable results with our STM. For the case in Figure 2.(a),the non-transactional code in Thread 2 updates  x  while Thread 1runs a transaction. Since Thread 2 does not use a write barrier,Thread 1 has no means of detecting the conflict on  x . The tworeads to  x  from Thread 1  may  return different values (0 and 100respectively). The expected behavior is that either both reads return0 (Thread 1 before Thread 2) or both return 100 (Thread 2 beforeThread1). ForthecaseinFigure2.(b), thelackofconflictdetection // Thread 1 ListNode res;atomic {res = lhead; if  (lhead != null)lhead = lhead.next;}use res multiple times; // Thread 2 atomic {ListNode n = lhead; while  (n != null) {n.val ++;n = n.next;}} Figure 3.  The code for the privatization scenario.between the two threads  may  lead to a lost update. If the write byThread 2 occurs after Thread 1 reads  x  and before it commits itstransaction, the final value of   x  will be 1. The expected behavioris that either  x  has a final value of 100 (Thread 1 before Thread 2)or 101 (Thread 2 before Thread 1). For the case in Figure 2.(c), theproblem arises because Thread 2 does not use a read barrier for  x and  y . Hence, it is possible that, as Thread 1 is in the middle of itstransaction commit, Thread 2 reads the old value of   x  and the newvalue of   y  or vice versa. In other words,  t  may  have value 100,while the expected behavior is that t is either 200 (Thread 1 beforeThread 2) or 0 (Thread 2 before Thread 1).One can dismiss the cases in Figure 2 as examples of data racesthat would lead to unpredictable behavior even with lock-basedsynchronization. Instead of arguing whether or not TM shouldeliminate the data races in lock-based synchronization, we exam-ine the privatization code in Figure 3 [18]. Thread 1 atomicallyremoves an element from a linked-list and uses it multiple times.Thread 2 atomically increments all elements in the list. There aretwo acceptable outcomes for this code: either Thread 1 commits itstransaction first and subsequently uses only the non-incrementedvalue of the first element or Thread 2 commits first and Thread 1subsequently uses only the incremented value of the first element.If we implement the atomic blocks with a single coarse-grain lock,this code will produce one of the expected results as it is race-free.The lack of strong isolation causes this code to behave unpre-dictably with all STM approaches [18]. As the two threads exe-cute concurrently,  lhead  is included in the write-set for Thread 1and the the read-set for Thread 2. Now assume Thread 2 initiatesits transaction commit, where it locks  lhead.val  and it verifiesall variables in its read-set have not changed in memory, including lhead . While Thread 2 is copying its large write-set, Thread 1starts its commit stage. It locks its write-set ( lhead ), validates itsread-set ( lhead  and  lhead.next ), and commits the new valueof   lhead . Subsequently, Thread 1 uses  res  outside of a trans-actional block as it believes that the element has been privatizedby atomically extracting from the list. Depending on the progressrate of Thread 2 with copying its write-set, Thread 1 may get to usethe non-incremented value for res.val a few times before finallyobserving the incremented value committed by Thread 2.The privatization example in Figure 3 is not a unique case of race-free code that performs unpredictably. Another source of cor-rectness issues is that STM systems do not validate their read-set 72  Instruction Description rsSigReset  Reset all bits in read-set or write-set signature wsSigResetrsSigInsert  r1 Insert the address in register r1 in the read-set wsSigInsert  r1 or write-set signature rsSigMember  r1, r2 Set register r2 to 1 if the address in register wsSigMember  r1, r2 r1 hits in the read-set or write-set signature rsSigSave  r1, r2 Save a portion of the read-set or write-set wsSigSave  r1, r2 signature into register r1 rsSigRestore  r1, r2 Restore a portion of the read-set or write-set wsSigRestore  r1, r2 signature from register r1 fetchEx  r1 Prefetch address in register r1 in exclusivestate; if address in cache, upgrade toexclusive state if needed Table 1.  The user-level instructions for management of read-setand write-set signatures in SigTM.until they reach the commit stage. Hence, it is possible to have atransaction use a pointer that has been modified in the meantime byanother transaction, ending up with an infinite loop or a memoryexception [12].Strong isolation can be implemented in an STM using additionalread and write barriers in non-transactional accesses. Shpeisman etal. [25] developed a set of compiler techniques that minimize theirperformance impact by differentiating between private and shareddata, identifying data never accessed within a transaction, and ag-gregating multiple barriers to the same address. For a set of Javaprograms that make infrequent use of transactions, their optimiza-tions reduce the overhead of strong isolation from 180% to 40%.This overhead is significant as it is in addition to the regular over-head of STM instrumentation code. Moreover, the overhead mayactually be higher for applications that make frequent use of trans-actional synchronization or are written in languages like C/C++. If strong isolation forces programmers to pick between performanceand correctness, we have failed to deliver on the basic promise of TM: simple-to-write parallel code that performs well.HTM systems naturally implement strong isolation as all threadinteractions, whether they execute transactions or not, are visiblethrough the coherence protocol and can be properly handled. Forthe cases in Figure 2.(a) and 2.(b), the write to  x  by Thread 2 onour HTM will generate a coherence request for exclusive access. If Thread 1 has already read  x , the coherence request will facilitateconflict detection and will cause Thread 1 to abort and re-executeits transaction. For the case in Figure 2.(c), once the transactionin Thread 1 is validated, it will generate nacks to any incomingcoherence request (shared or exclusive) for an address in its write-set. Since there is one transaction committing at the time, there canbe no deadlock or livelock. Hence, Thread 2 will either read thenew or old values for both  x  and  y . The HTM executes correctlythe privatization code in Figure 3 as Thread 1 cannot read partiallycommitted state from Thread 2. 4. SIGNATURE-ACCELERATED TM This section describes  signature-accelerated transactional mem-ory(SigTM) ,ahybridTMsystemthatreducestheruntimeoverheadof software transactions and transparently provides strong isolationand ordering guarantees. 4.1 Hardware Signatures SigTM enhances the architecture of a TM system with hardwaresignatures for read-set and write-set tracking. Our work was in-spired by the use of signatures for transactional bookkeeping in theBulk HTM [5].A hardware signature is a Bloom filter [3] that conservativelyencodes a set of addresses using a fixed-size register. SigTM usestwo signatures per hardware thread: one for the read-set and theother for the write-set. Table 1 summarizes the instructions usedby software to manage the signatures. Software can reset each sig-nature, insert an address, check if an address hits in the signature,and save/restore its content. To insert address  A  in a signature,hardware first truncates the cache line offset field from the address.Next, it applies one or more deterministic hash functions on the re-maining bits. Each function identifies one bit in the signature to beset to one. To check address A for a hit, the hardware truncates theaddress and applies the same set of hash functions. If all identifiedbits in the signature register are set, A is considered a signature hit.Given the fixed-size of the signature register and the nature of hashfunctions, multiple addresses will map to the same signature bits.Hence, hits are conservative: they will correctly identify addressesthat were previously added to the set but may also identify someaddresses the were never inserted in the signature. We discuss thesignature length and the choice of hash functions in Section 6. Thelast instruction in Table 1 allows software to prefetch or upgradea cache line to one of the exclusive states of the coherence proto-col (E for MESI). If the address is already in the M or E state, theinstruction has no effect.The hardware can also lookup the signatures for hits using ad-dresses from incoming coherence requests. A user-level configu-ration register per hardware thread is used to select if addressesfrom shared and/or exclusive requests should be looked up in theread-set and/or write-set signatures. If an enabled lookup producesa signature hit, a user-level exception is signaled that jumps to apre-registered software handler [20]. Coherence lookups are tem-porarily disabled when the handler is invoked to avoid repeated in-vocations of the handler that prohibit forward progress. The con-figuration register also indicates if coherence requests that hit in thewrite-set signature should be properly acknowledged by the localcache or should receive a nack reply.Apart from hardware signatures and the nack mechanism in thecoherence protocol, SigTM requires no further hardware support.Caches are not modified in any way. 4.2 SigTM Operation Algorithm 2 summarizes the operation of software transactionsunder SigTM. Even though we present a hybrid scheme based onthe TL2 STM, the approach is generally applicable to other STMs(see Section 4.5). SigTM is a standalone implementation. Unlikeother hybrid proposals [17, 10, 24], there is no backup STM, noswitch between HTM and STM modes, and no fast vs. slow codepaths.SigTM eliminates the global version clock and the locks in thebaseSTM.Whileitalsoeliminatesthesoftwareread-set, asoftwarewrite-set is still necessary in order to buffer transactional updatesuntil the transaction commits. The transaction initialization code( SigTMtxStart ) takes a checkpoint and enables read-set signa-ture lookups for exclusive coherence requests. The write barriercode ( SigTMwriteBarrier ) inserts the address in the write-set signature and updates the software write-set. The read barrier( SigTMreadBarrier ) first checks if the address is in the soft-ware write-set, using the hardware write-set signature as a filteringmechanism to avoid most unnecessary write-set lookups. If not, itsimply adds it to the read-set signature and reads the word frommemory. During the transaction execution, if the address from anexclusive coherence request produces a hit in the read-set signa-ture, a conflict is signaled and the transaction aborts. The abortprocess includes resetting both signatures and discarding the soft- 73
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks
SAVE OUR EARTH

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!

x