Government & Politics

5 pages

Vendor Agnostic, High Performance, Double Precision Floating Point Division for FPGAs

of 5
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
—Double precision Floating Point (FP) arithmetic operations are widely used in many applications such as image and signal processing and scientific computing. Field Programmable Gate Arrays (FPGAs) are a popular platform for accelerating such
  Vendor Agnostic, High Performance, DoublePrecision Floating Point Division for FPGAs Xin Fang and Miriam Leeser Dept of Electrical and Computer EngNortheastern UniversityBoston, Massachusetts 02115Email:,  Abstract —Double precision Floating Point (FP) arithmetic op-erations are widely used in many applications such as image andsignal processing and scientific computing. Field ProgrammableGate Arrays (FPGAs) are a popular platform for acceleratingsuch applications due to their relative high performance, flexi-bility and low power consumption compared to general purposeprocessors and GPUs. Increasingly scientists are interested indouble precision FP operations implemented on FPGAs. FPdivision and square root are much more difficult to implementthan addition and multiplication. In this paper we focus on afast divider design for double precision floating point that makesefficient use of FPGA resources including embedded multipliers.The design is table based; we compare it to iterative and digitrecurrence implementations. Our division implementation targetsperformance with balanced latency and high clock frequency. Ourdesign has been implemented on both Xilinx and Altera FPGAs.The table based double precision floating point divider provides agood tradeoff between area and performance and produces goodresults when targeting both Xilinx and Altera FPGAs. I. I NTRODUCTION The IEEE 754 standard [7] specifies the common formatsfor representing floating point numbers, including single anddouble precision. Double precision is widely used in scientificapplications since it can represent a wider range of valueswith greater precision than single precision. There are manycomputations where a list of input values are summed andthen normalized. Even if the input values are single precision,the accumulation and division are done in double precisionto retain accuracy. In hardware, double precision divisionoperates more slowly and requires more hardware resourcesthan single precision. Speed is measured in number of clock cycles to produce a result (latency), clock cycle time, and totaltime for a single operation. Many of these resource issuescompete with one another. For example, a subtractive divider,similar to the long division taught in grade school, uses fewresources and can be implemented with a small clock cycle,but has a long latency. In contrast, a multiplicative divider usesmore resources, a longer clock cycle, and has a shorter latency.The ideal divider optimizes all three of these constraints.In this paper we describe a double precision, IEEE com-pliant floating point divider, adopted from [3], which usesa combination of Look Up Tables (LUTs) and Taylor seriesexpansion, and thus tries to balance the competing design goalsof fast operation with a small amount of hardware. Due toits use of LUTs and multipliers, this design is particularlywell suited for FPGA implementation with their embeddedmultipliers and RAM. The divider that we implement consistsof a reciprocal unit followed by a multiplication unit, so the re-ciprocal unit can be separated into an independent componentin the library.Note that our implementation is designed to handle anywidth of exponent and mantissa, and not just those formatsspecified by the IEEE 754 floating point standard. To supportthis, we make use of parameterized modules, including andgates, or gates, adders and subtractors. In this paper we havefocused on the results for a double precision divider; henceother formats are not covered here.The rest of this paper is organized as follows. Background,including the IEEE double precision format representation aswell as three famous methods for computing floating pointdivision are presented in Section II. Section III describesthe algorithm in detail and Section IV introduces how toimplement the divider. Section V presents the results of ourproject and a comparison with other division implementationson both Altera and Xilinx hardware. The paper ends withconclusions and future work.II. B ACKGROUND  A. Double Precision Format Representation The IEEE 754 double precision standard format is repre-sented using one sign bit, 11 exponent bits, and 52 mantissabits, for a total of 64 bits. The exponent bias is 1023, and fornormalized numbers, the leading 1 is not explicitly stored. Thevalue represented is therefore: ( − 1) sign ∗ 2 exponent − 1023 ∗ 1 .mantissa To implement the division of two floating point numbers, Z   =  X/Y  ; the sign bit, exponent and mantissa of   Z   mustbe calculated and the result normalized. The sign is the XORof the sign bits of   X   and  Y  , while the resulting exponent(before normalization) is the difference between the two inputexponents with the bias suitably handled. The bulk of thecomputation is the division operation, which is the same asinteger division. We will focus on this part for the remainderof the paper.The IEEE standard also specifies other values, includingsubnormal numbers, NaN, and ±∞ . We do not consider thesevalues further. 978-1-4799-1365-7/13/$31.00 ©2013 IEEE   B. Methods For Computing Floating Point Division There are three main methods used to implement floatingpoint division. They are digit recurrence or subtractive,iterative or multiplicative and table-based. Each method isdiscussed in more detail below. 1) Digit Recurrence:  Digit recurrence algorithms computeone digit of the result per clock cycle and are based on thelong division commonly taught in grade school for performingdivision by hand. The most widely used digit recurrencemethod is known as SRT after the three people, Sweeney,Robertson, and Tocher who invented it [11]. The textbook Digital Arithmetic [2] gives more details. Freiman [4] providesa good example of the SRT method based on non-restoringdivision; the radix of the example is 2. For binary arithmetic,the approach used to obtain more than one bit per cycleemploys a radix higher than 2. 2) Iterative Method:  Iterative methods are based on multi-plications that gives you intermediate results that converge tothe highest number of bits of precision required. Two widelyused iterative methods for floating point division are NewtonRaphson and Goldschmidt [13], [14], [12], [5].With Newton-Raphson, the inverse of the divisor is com-puted and then multiplied with the dividend using a doubleprecision multiplier. In order to find  1 /Y  , the iteration isseeded with  S  0 , an estimate to  1 /Y  ; the iteration is:  S  i +1  = S  i ∗ (2 − Y   ∗ S  i ) . The algorithm iterates until S approximatesthe required precision.Goldschmidt’s algorithm also uses an iterative loop tocalculate the quotient. The difference between this andNewton-Raphson is that Goldschmidt’s algorithm multipliesboth the dividend and the divisor. The iteration completeswhen the divisor is approximately equal to one and therequired precision is reached. The value by which thedividend and divisor are multiplied in each iteration is 2 -current value of the divisor. 3) Table based Method:  Two table based methods are dis-cussed here. The first is currently used to implement variableprecision dividers in the VFLOAT library [9], [16], [15]. Thesecond is an improved algorithm whose table size grows moreslowly as the bit width of the inputs grows. These methodsuse a stored value for the initial guess to a result. They useTaylor series expansion to obtain the required number of bitsof precision.The first approach, from [6], is the divider currently usedin VFLOAT. It makes use of two multipliers and one look uptable to implement division. In the single precision floatingpoint divider implementation, the size of the first multiplieris  24 × 24  bits of input with 48 bits of output 48; the secondmultiplier has input  28 × 28  bits with an output of 56 bits. Thelook up table has 12 bits for input, 28 for output, for a totalsize of approximately 16K bytes. However, the disadvantageof this algorithm is that the size of the look up table increasesexponentially with the bit width of the input. It is thereforeimpractical to implement a double precision divider with thisapproach since the LUT would be prohibitively large. For thisreason we have changed to a different divider algorithm thatscales better as the size of the division operation grows [3]. This algorithm requires a smaller look up table and severalsmall multipliers. The remainder of this paper covers detailsof this algorithm and its implementation. The next sectionexplains the algorithm that we use. Section IV covers theimplementation of the divider using VHDL. Experimentalresults obtained from this implementation and a comparisonto previous methods are presented in Section V. Finally weconclude and present future work.III. A LGORITHM  D ESCRIPTION The algorithm we use is from [3]. The discussion below istaken from that paper and more details can be found there. Thisapproach finds  X/Y   by first finding the reciprocal of   Y   andthen multiplying by  X  . The approach is based on a Taylorseries expansion and makes use of LUTs and multipliers,resources readily available in FPGAs. To ensure growth of the look up table is slow, a reduction step is used.  A. Reciprocal Operation To find the reciprocal  1 /Y   the algorithm uses three steps:reduction, evaluation, and post-processing. The top level isshown in Figure 1. a) The reduction step:  From the IEEE standard, weknow that for  Y   normalized,  1 ≤ Y <  2 . Assume  Y   has an mbit significand and  k  is   ( m + 2) / 4  + 1 ;  Y  ( k ) represents thetruncation of Y to  k  bits. In the reduction step, define  ˆ R  asthe reciprocal of   Y  ( k ) .  ˆ R  can be determined using a look uptable with a 14 bit address for double precision. This is dueto the fact that the mantissa is 53 bits with the leading onerepresented, so  k  = 14 . The number of bits used for  ˆ R  is 16.B is defined as the Taylor series expansion of  f  ( A ) = 1 / (1 + A ) where A is defined as  ( Y  × ˆ R ) − 1 . Note that − 2 − k < A <  2 k .For  Z   = 2 − k ,  A  can be represented as: A  =  A 2 z 2 + A 3 z 3 + A 4 z 4 + ... where  | A i |≤ 2 k − 1 . We ignore the smaller terms that do notcontribute to the double precision result. b) In the evaluation step:  using the Taylor series ex-pansion, f  ( A ) =  C  0  + C  1 A + C  2 A 2 + C  3 A 3 + C  4 A 4 + ···≈  C  0  + C  1 ( A 2 z 2 + A 3 z 3 + A 4 z 4 )+ C  2 ( A 2 z 2 + A 3 z 3 + A 4 z 4 )+ C  3 ( A 2 z 2 + A 3 z 3 + A 4 z 4 ) 3 + C  4 ( A 2 z 2 + A 3 z 3 + A 4 z 4 ) 4 ≈  C  0  + C  1 A + C  2 A 22 z 4 + 2 C  2 A 2 A 3 z 5 + C  3 A 32 z 6 Here,  C  i  = 1  when i is even,  C  i  =  − 1  when i is odd.Simplifying we get: B  ≈  1 − A 2 z 2 − A 3 z 3 + ( − A 4  + A 22 ) z 4 +2 A 2 S  3 z 5 − A 32 z 6 ≈  (1 − A ) + A 22 z 4 + 2 A 2 A 3 z 5 − A 32 z 6 The above equation is the one used in the implementation.  Fig. 1. Reciprocal block diagramFig. 2. Black box of design c) In the post-processing step:  , the result of the recip-rocal of Y is given by the product of   ˆ R  and B:  1 /Y   = ˆ RB .  B. Multiplying Reciprocal by Dividend  After obtaining the reciprocal of the divisor, the final step isto multiply the reciprocal with  X   which is the dividend. In ourimplementation, this is done before the number is normalized.Another multiplier is generated for this operation. The finalresult of division is obtained by combining the result of   X/Y  with the sign and exponent; after  X/Y   is normalized androunded the resulting exponent is appropriately adjusted.IV. D IVIDER  I MPLEMENTATION The double precision divider is designed as part of theVFLOAT library available for download from NortheasternUniversity [9], [16]. Components in the VFLOAT library aredesigned to be joined together as a pipeline. Thus they haveinputs READY, STALL, ROUND and EXCEPTION IN andoutputs DONE and EXCEPTION OUT specifically to handlepipelining. The READY and DONE signals are used forknowing when the inputs are ready and when the results areavailable for use. STALL allows a bubble to be inserted intothe pipeline if needed. Round has two modes: round to zero ortruncate, and round to nearest. The exception signals propagatean exception flag with the value that may be incorrect throughthe pipeline. For division, an exception is identified if thedivisor is zero. Otherwise, the exception input is propagatedto the exception output. The complete list of input ports is: Fig. 3. Top level components of divider CLK, RESET, STALL, OP1, OP2, READY, ROUND andEXCEPTION IN. The output ports are DONE, RESULT, andEXCEPTION OUT, as shown in Figure 2.Our implementation of the division algorithm for doubleprecision floating point is written in VHDL and targets thetwo most popular families of commercial FPGAs: Altera andXilinx. For tools we use Altera Quartus II 13.0 and the XilinxISE Design Suite 13.4. The designs make use of embeddedmultipliers and RAMs, which require using the intellectualproperty components provided with each set of tools. ForAltera these are called Megacores; for Xilinx they are calledIP Cores. The two manufacturers also use different simulators;Altera ships a version of Modelsim called ModelSim-Altera10.1d while Xilinx has its own simulator, ISim, that is inte-grated with its tool suite. Both simulators were used to validatethese designs.Figure 3 shows the top level components of the divider.The top level divider consists of computation to denormalizethe floating point input (explicitly represent the hidden one),divider computation which includes a reciprocal and a multipli-cation of the dividend with the reciprocal, and round to normalfor adjusting the result to standard floating point format.  A. Denormalizer  A normalized floating point number in IEEE 754 standardformat has a sign bit, exponent bits and mantissa bits. Asdescribed in Section II, the stored part of the mantissa doesnot include the leading ‘1’; however, this bit is necessaryfor computation. All normalized floating point inputs to acomponent are sent through a denormalizer which adds this‘1’ to the head of the mantissa. For the divider, two such stepsare needed, one for the dividend and one for the divisor.  B. Divider  After obtaining the denormalized number, we need tocalculate the sign bit, exponent and quotient of the result. The  Fig. 4. Reciprocal components sign bit is obtained by XORing the sign bits of the inputs. Theexponent of the result is the exponent of the dividend minusthe exponent of the divisor with the bias suitably handled. Notethat the output of this subtraction might not be the exponentof the final result, because we still need to normalize the resultafter the division operation. The following subsections describethe computation of the reciprocal and the multiplication withthe dividend to obtain the final mantissa. 1) Reciprocal Operation:  Figure 4 shows the hierarchyof structural components used to implement the reciprocalalgorithm described in Section III-A. The design uses thereciprocal table to calculate  ˆ R  and 4 multipliers. The lookuptable has 14 address bits and 16 output bits for a total of 32Kbytes. The four multipliers are the following sizes. MultiplierYR has 17 bit and 58 bit inputs and a 71 bit output; its purposeis to multiply Y and  ˆ R  in order to get A. Multiplier S has two14 bit inputs and a 28 bit output; its purpose is to compute A 2 ∗ A 2  and  A 2 ∗ A 3 . Multiplier M has 28 bit and 14 bit inputsand a 42 bit output; it computes the cube of   A 2 . Multiplier Lhas one 16 bit and one 58 bit input and a 74 bit output; itcomputes  ˆ R ∗ B . In addition, parameterized adders and logicgates from the VFLOAT library are used. 2) Multiplier:  After determining the reciprocal of the di-visor, the next step is to multiply this reciprocal with themantissa of the dividend. This multiplier is called the mantissamultiplier. The inputs are 53 bits and 57 bits, and the outputis 110 bits. The number of pipeline stages for this and theother multipliers in the implementation can be adjusted usingthe parameters available in Xilinx Core Generator or AlteraMegacores. After this multiplication, the last component willround an renormalize the output. C. Round to Normal The IEEE standard specifies four rounding modes: Roundto zero, round to nearest, round to negative infinity and roundto positive infinity. In the VFLOAT library, two roundingoptions are supported: round to zero or truncation and round tonearest. Each component has an input signal, round. If Roundis 0, truncation is implemented, else round to nearest. Thisoperation may change the exponent as well as the mantissa of the result. The round component also removes the leading ‘1’from the mantissa of the result for storage in IEEE floatingpoint format. After this step, the normalized, double precisionfloating point result represented in IEEE standard notation willbe on the output port.V. R ESULTS AND  C OMPARISON We have implemented our double precision floating pointdivider and compared it to other, popular designs. Results areshown in Table I for designs mapped to Xilinx and Table IIfor designs mapped to Altera. As mentioned above, we usedAltera Quartus II 13.0 and the Xilinx ISE Design Suite 13.4to implement our design. CC is the number of clock cycleslatency. All designs can be pipelined with a throughput of one clock cycle. Our design has the advantage of working onboth Altera and Xilinx FPGAs. In contrast, the IP cores inXilinx and Megacores from Altera are vendor specific. Ourdesigns allow the user to adjust the number of clock cycles of latency by adjusting the level of pipelining for each multiplier.Xilinx allows the designer to choose the latency of the divideras a parameter in Core Generator. We chose to compare tolatencies of 8, 14 and 20 clock cycles. For Altera Megacores,the designer is offered a handful of latencies from which tochoose.The latency of our first implementation is 14 clock cy-cles on both vendor’s hardware. The maximum frequencyon a Stratix V device(5SGXB6) from Altera is 121MHz.Implemented on a Xilinx Virtex 6 device (XC6VLX75T), themaximum frequency is 148MHz. Note that a faster clock frequency does not necessarily make a better design. Sinceour goal is to fit the divider into larger pipelines, the goal is tobalance the clock speeds across all components. In addition,there is often a tradeoff between high clock frequency andnumber of clock cycles latency, as can be seen in these results.Computing the result in the same total time with a lower clock frequency dissipates less energy, another design goal. However,too slow of a latency can slow down the entire pipeline.We compared our design to several popular methods forimplementing division, including the floating point cores pro-vided from each manufacturer. Table I shows the synthesisresults of several methods using the Xilinx IDE. Note thatgiven the same latency, our design provides a better maximumfrequency. This faster frequency comes at the cost of morehardware resources being used. The last design is reportedfrom a paper that presents a multiplicative divider [8]. Thisdesign can only support double precision with an error of 2 units in the last place (ulps) while our error is one ulp.Also, this design has long latency although a fast maximumfrequency. These numbers are reported by the authors; we didnot implement their design.Table II shows the synthesis results of several methodsusing the Altera IDE. The Megacore from Altera has onlythree fixed latencies that you can choose from: 10, 24 and61 clock cycles. In this case, the Altera Megacores providethe best overall latency. Note that these designs use moreDSP blocks than our design. Also note that we can changethe latency, as our second implementation shows, by adjustingthe pipeline parameter of the MegaCore multipliers. Thus ourdivider design is more flexible compared to the one availablefrom Altera. The last two designs repeat results previouslypublished in [10]. The Radix 4 digit recurrence implementedby FloPoCo has larger maximum frequency but also longer  Method Device Latency Max Frequency Total Latency ResourceOur Implementation Virtex 6 14 Clk Cycles 148 MHz 95 ns 934 Registers, 6957 Slice LUTsIP Core from Xilinx Virtex 6 8 Clk Cycles 74 MHz 108 ns 722 Registers, 3210 Slice LUTsIP Core from Xilinx Virtex 6 14 Clk Cycles 117 MHz 120 ns 1188 Registers, 3234 Slice LUTsIP Core from Xilinx Virtex 6 20 Clk Cycles 192 MHz 104 ns 2035 Registers, 3216 Slice LUTsMultiplicative Virtex 6 36 Clk Cycles 275 MHz 131 ns 2097 Slices, 118K BRAM, 28 18*18 TABLE I. R ESULTS AND COMPARISON WITH  X ILINX Method Device Latency Max Frequency Total Latency ResourceOur Implementation 1 Stratix V 14 Clk Cycles 121 MHz 116 ns 818 ALMs, 931 Logic Register, 11 DSP block Our Implementation 2 Stratix V 16 Clk Cycles 145 MHz 110 ns 1004 ALMs, 1105 Logic Register, 13 DSP block MegaCore from Altera Stratix V 10 Clk Cycles 176 MHz 57 ns 525 ALMs, 1247 Logic Register, 14 DSP block MegaCore from Altera Stratix V 24 Clk Cycles 237 MHz 101 ns 849 ALMs, 1809 Logic Register, 14 DSP block MegaCore from Altera Stratix V 61 Clk Cycles 332 MHz 184 ns 9379 ALMs, 13493 Logic RegisterRadix 4 Digit Recurrence Stratix V 36 Clk Cycles 219 MHz 164 ns 2605 ALMs, 5473 Logic RegisterPolynomial Approx d =2 Stratix V 18 Clk Cycles 268 MHz 67 ns 444 ALMs, 823 Logic Register, 2 M20K,+ Newton Raphson 9 DSP block  TABLE II. R ESULTS AND COMPARISON WITH  A LTERA latency [1]. The method with Polynomial Approximation (d=2)plus Newton Raphson algorithm [10] has the fastest overalllatency. It also uses significant memory resources comparedto the other designs.VI. C ONCLUSIONS AND  F UTURE  W ORK We have presented a double precision floating point dividerimplementation that produces hardware with a good tradeoff of frequency, latency and resource usage and that can beimplemented on both Altera and Xilinx FPGAs. Our designmakes use of the embedded multipliers and embedded RAMscommonly found in modern FPGA fabric. It is flexible, withthe designer able to adjust the maximum number of clock cycles.In the future, we plan to investigate our divider further. Inparticular, we will focus on improving the frequency by focus-ing on optimizing the critical path and trying different levels of pipelining. In addition we plan to adapt our implementation tosupport variable precision division on both Altera and XilinxFPGAs, and to make it available as part of the VFLOAT library[9], [16].R EFERENCES[1] J. Detrey and F. de Dinechin. A tool for unbiased comparisonbetween logarithmic and floating-point arithmetic.  The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology ,49(1):161–175, 2007.[2] M. D. Ercegovac and T. Lˇang.  Digital arithmetic . Morgan Kaufmann,2003.[3] M. D. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand. Reciproca-tion, square root, inverse square root, and some elementary functionsusing small multipliers.  Computers, IEEE Transactions on , 49(7):628–637, 2000.[4] C. Freiman. Statistical analysis of certain binary division algorithms. Proceedings of the IRE  , 49(1):91–103, 1961.[5] R. Goldberg, G. Even, and P.-M. Seidel. An fpga implementationof pipelined multiplicative division with ieee rounding. In  Field-Programmable Custom Computing Machines, 2007. FCCM 2007. 15th Annual IEEE Symposium on , pages 185–196. IEEE, 2007.[6] P. Hung, H. Fahmy, O. Mencer, and M. J. Flynn. Fast division algorithmwith a small lookup table. In  Signals, Systems, and Computers,1999. Conference Record of the Thirty-Third Asilomar Conference on ,volume 2, pages 1465–1468. IEEE, 1999.[7] Institute of Electrical and Electronics Engineers (IEEE). 754-2008 —IEEE standard for floating-point arithmetic.  IEEE  , pages 1–58, 2008.[8] M. K. Jaiswal and R. C. Cheung. High performance reconfigurablearchitecture for double precision floating point division. In  Reconfig-urable Computing: Architectures, Tools and Applications , pages 302–313. Springer, 2012.[9] Northeastern University Reconfigurable Computing Laboratory. Vfloat:The northeastern variable precision floating point library.[10] B. Pasca. Correctly rounded floating-point division for dsp-enabledfpgas. In  Field Programmable Logic and Applications (FPL), 201222nd International Conference on , pages 249–254. IEEE, 2012.[11] J. Robertson. A new class of digital division methods.  ElectronicComputers, IRE Transactions on , EC-7(3):218–222, 1958.[12] E. Roesler and B. Nelson. Novel optimizations for hardware floating-point units in a modern fpga architecture. In  Field-Programmable Logicand Applications: Reconfigurable Computing Is Going Mainstream ,pages 637–646. Springer, 2002.[13] P. Soderquist and M. Leeser. Area and performance tradeoffs in floating-point divide and square-root implementations.  ACM Computing Surveys(CSUR) , 28(3):518–564, 1996.[14] P. Soderquist and M. Leeser. Division and square root: choosing theright implementation.  Micro, IEEE  , 17(4):56–66, 1997.[15] X. Wang, S. Braganza, and M. Leeser. Advanced components in thevariable precision floating-point library. In  Field-Programmable CustomComputing Machines, 2006. FCCM’06. 14th Annual IEEE Symposiumon , pages 249–258. IEEE, 2006.[16] X. Wang and M. Leeser. Vfloat: A variable precision fixed-andfloating-point library for reconfigurable hardware.  ACM Transactionson Reconfigurable Technology and Systems (TRETS) , 3(3):16, 2010.
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks