Vendor Agnostic, High Performance, Double-Precision Floating Point Division for FPGAs
Xin Fang and Miriam Leeser
Dept. of Electrical and Computer Eng., Northeastern University, Boston, Massachusetts 02115
Email: fang.xi@husky.neu.edu, mel@coe.neu.edu
Abstract—Double precision Floating Point (FP) arithmetic operations are widely used in many applications such as image and signal processing and scientific computing. Field Programmable Gate Arrays (FPGAs) are a popular platform for accelerating such applications due to their relatively high performance, flexibility and low power consumption compared to general purpose processors and GPUs. Increasingly, scientists are interested in double precision FP operations implemented on FPGAs. FP division and square root are much more difficult to implement than addition and multiplication. In this paper we focus on a fast divider design for double precision floating point that makes efficient use of FPGA resources, including embedded multipliers. The design is table based; we compare it to iterative and digit recurrence implementations. Our division implementation targets performance with balanced latency and high clock frequency. Our design has been implemented on both Xilinx and Altera FPGAs. The table based double precision floating point divider provides a good tradeoff between area and performance and produces good results when targeting both Xilinx and Altera FPGAs.
I. INTRODUCTION
The IEEE 754 standard [7] specifies the common formats for representing floating point numbers, including single and double precision. Double precision is widely used in scientific applications since it can represent a wider range of values with greater precision than single precision. There are many computations where a list of input values is summed and then normalized. Even if the input values are single precision, the accumulation and division are done in double precision to retain accuracy. In hardware, double precision division operates more slowly and requires more hardware resources than single precision. Speed is measured in number of clock cycles to produce a result (latency), clock cycle time, and total time for a single operation. Many of these resource issues compete with one another. For example, a subtractive divider, similar to the long division taught in grade school, uses few resources and can be implemented with a small clock cycle time, but has a long latency. In contrast, a multiplicative divider uses more resources and a longer clock cycle, but has a shorter latency. The ideal divider optimizes all three of these constraints.

In this paper we describe a double precision, IEEE compliant floating point divider, adapted from [3], which uses a combination of Look Up Tables (LUTs) and Taylor series expansion, and thus tries to balance the competing design goals of fast operation with a small amount of hardware. Due to its use of LUTs and multipliers, this design is particularly well suited for FPGA implementation, with their embedded multipliers and RAM. The divider that we implement consists of a reciprocal unit followed by a multiplication unit, so the reciprocal unit can be separated into an independent component in the library.

Note that our implementation is designed to handle any width of exponent and mantissa, and not just those formats specified by the IEEE 754 floating point standard.
To support this, we make use of parameterized modules, including and gates, or gates, adders and subtractors. In this paper we have focused on the results for a double precision divider; hence other formats are not covered here.

The rest of this paper is organized as follows. Background, including the IEEE double precision format representation as well as three well-known methods for computing floating point division, is presented in Section II. Section III describes the algorithm in detail and Section IV describes how the divider is implemented. Section V presents the results of our project and a comparison with other division implementations on both Altera and Xilinx hardware. The paper ends with conclusions and future work.

II. BACKGROUND
A. Double Precision Format Representation
The IEEE 754 double precision standard format is represented using one sign bit, 11 exponent bits, and 52 mantissa bits, for a total of 64 bits. The exponent bias is 1023, and for normalized numbers, the leading 1 is not explicitly stored. The value represented is therefore:

(−1)^sign × 2^(exponent − 1023) × 1.mantissa
To implement the division of two floating point numbers, Z = X/Y, the sign bit, exponent and mantissa of Z must be calculated and the result normalized. The sign is the XOR of the sign bits of X and Y, while the resulting exponent (before normalization) is the difference between the two input exponents with the bias suitably handled. The bulk of the computation is the division of the mantissas, which is the same as integer division. We will focus on this part for the remainder of the paper.

The IEEE standard also specifies other values, including subnormal numbers, NaN, and ±∞. We do not consider these values further.
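The decomposition above can be illustrated in software. The following Python sketch (the design itself is written in VHDL; these helper names are ours, not VFLOAT's) extracts the fields of a double and computes the quotient's sign and pre-normalization biased exponent:

```python
import struct

def decode_double(x: float):
    """Split an IEEE 754 double into sign, biased exponent, and the
    52-bit stored mantissa (without the hidden leading one)."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return sign, exponent, mantissa

def quotient_sign_exponent(x: float, y: float):
    """Sign and biased exponent of x/y before normalization: the sign
    is the XOR of the input signs; the exponent is the difference of
    the biased exponents with the bias (1023) added back."""
    sx, ex, _ = decode_double(x)
    sy, ey, _ = decode_double(y)
    return sx ^ sy, ex - ey + 1023
```

For example, 8.0 / −2.0 has sign 1 and biased exponent 1025 (the exponent of 4.0); normalization of the mantissa quotient may still adjust this by one.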
978-1-4799-1365-7/13/$31.00 ©2013 IEEE
B. Methods For Computing Floating Point Division
There are three main methods used to implement floating point division: digit recurrence (subtractive), iterative (multiplicative), and table based. Each method is discussed in more detail below.
1) Digit Recurrence:
Digit recurrence algorithms compute one digit of the result per clock cycle and are based on the long division commonly taught in grade school for performing division by hand. The most widely used digit recurrence method is known as SRT after the three people, Sweeney, Robertson, and Tocher, who invented it [11]. The textbook Digital Arithmetic [2] gives more details. Freiman [4] provides a good example of the SRT method based on non-restoring division; the radix of the example is 2. For binary arithmetic, the approach used to obtain more than one bit per cycle employs a radix higher than 2.
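For illustration, a radix-2 restoring divider (the simplest subtractive scheme; SRT proper uses a redundant quotient-digit set to avoid the restoration step) can be sketched in Python as:

```python
def restoring_divide(dividend: int, divisor: int, bits: int):
    """Radix-2 restoring division: one quotient bit per iteration,
    exactly like binary long division by hand."""
    quotient, remainder = 0, 0
    for i in range(bits - 1, -1, -1):
        # Bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        if remainder >= divisor:
            remainder -= divisor
            quotient = (quotient << 1) | 1
        else:
            quotient <<= 1
    return quotient, remainder
```

One bit per iteration is exactly why subtractive dividers have long latency: a 53-bit mantissa quotient needs on the order of 53 cycles at radix 2.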
2) Iterative Method:
Iterative methods are based on multiplications that produce intermediate results which converge to the required number of bits of precision. Two widely used iterative methods for floating point division are Newton-Raphson and Goldschmidt [13], [14], [12], [5].

With Newton-Raphson, the inverse of the divisor is computed and then multiplied with the dividend using a double precision multiplier. In order to find 1/Y, the iteration is seeded with S_0, an estimate of 1/Y; the iteration is:

S_{i+1} = S_i × (2 − Y × S_i)

The algorithm iterates until S_i approximates 1/Y to the required precision.

Goldschmidt's algorithm also uses an iterative loop to calculate the quotient. The difference between this and Newton-Raphson is that Goldschmidt's algorithm multiplies both the dividend and the divisor. The iteration completes when the divisor is approximately equal to one and the required precision is reached. The value by which the dividend and divisor are multiplied in each iteration is 2 minus the current value of the divisor.
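Both iterations can be sketched in Python; the seed s0 and the iteration counts below are illustrative (in hardware the seed comes from a small table and the count is fixed by the target precision):

```python
def newton_raphson_reciprocal(y: float, s0: float, iterations: int = 5) -> float:
    """Refine an estimate s0 of 1/y via s_{i+1} = s_i * (2 - y * s_i);
    each step roughly doubles the number of correct bits."""
    s = s0
    for _ in range(iterations):
        s = s * (2.0 - y * s)
    return s

def goldschmidt_divide(x: float, y: float, iterations: int = 5) -> float:
    """Multiply dividend and divisor by f = 2 - d each step (0 < y < 2),
    driving the divisor toward one; the dividend converges to x / y."""
    n, d = x, y
    for _ in range(iterations):
        f = 2.0 - d
        n *= f
        d *= f
    return n
```

Note the structural difference: Newton-Raphson's two multiplies per step are serially dependent, while Goldschmidt's two multiplies (of n and d by the same f) are independent and can run in parallel hardware.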
3) Table Based Method: Two table based methods are discussed here. The first is currently used to implement variable precision dividers in the VFLOAT library [9], [16], [15]. The second is an improved algorithm whose table size grows more slowly as the bit width of the inputs grows. These methods use a stored value for the initial guess at a result. They use Taylor series expansion to obtain the required number of bits of precision.

The first approach, from [6], is the divider currently used in VFLOAT. It makes use of two multipliers and one look up table to implement division. In the single precision floating point divider implementation, the first multiplier has 24 × 24 bits of input and 48 bits of output; the second multiplier has 28 × 28 bits of input and an output of 56 bits. The look up table has 12 bits of input and 28 bits of output, for a total size of approximately 16K bytes. However, the disadvantage of this algorithm is that the size of the look up table increases exponentially with the bit width of the input. It is therefore impractical to implement a double precision divider with this approach since the LUT would be prohibitively large. For this reason we have changed to a different divider algorithm that scales better as the size of the division operation grows [3]. This algorithm requires a smaller look up table and several small multipliers. The remainder of this paper covers details of this algorithm and its implementation. The next section explains the algorithm that we use. Section IV covers the implementation of the divider using VHDL. Experimental results obtained from this implementation and a comparison to previous methods are presented in Section V. Finally we conclude and present future work.

III. ALGORITHM DESCRIPTION
The algorithm we use is from [3]. The discussion below is taken from that paper and more details can be found there. This approach finds X/Y by first finding the reciprocal of Y and then multiplying by X. The approach is based on a Taylor series expansion and makes use of LUTs and multipliers, resources readily available in FPGAs. To ensure growth of the look up table is slow, a reduction step is used.
A. Reciprocal Operation
To find the reciprocal 1/Y, the algorithm uses three steps: reduction, evaluation, and post-processing. The top level is shown in Figure 1.
a) The reduction step: From the IEEE standard, we know that for Y normalized, 1 ≤ Y < 2. Assume Y has an m-bit significand and k is ⌊(m + 2)/4⌋ + 1; Y(k) represents the truncation of Y to k bits. In the reduction step, define R̂ as the reciprocal of Y(k). R̂ can be determined using a look up table with a 14 bit address for double precision. This is due to the fact that the mantissa is 53 bits with the leading one represented, so k = 14. The number of bits used for R̂ is 16. B is defined as the Taylor series expansion of f(A) = 1/(1 + A), where A is defined as (Y × R̂) − 1. Note that −2^−k < A < 2^−k. For z = 2^−k, A can be represented as:

A = A_2·z^2 + A_3·z^3 + A_4·z^4 + ...

where |A_i| ≤ 2^k − 1. We ignore the smaller terms that do not contribute to the double precision result.
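In hardware the chunks A_2, A_3 and A_4 are simply adjacent k-bit fields of A. Treating A as an integer scaled by 2^(4k), the split can be sketched as follows (a Python sketch for non-negative A; the hardware handles the signed case):

```python
def split_reduced_argument(a_scaled: int, k: int):
    """Split A (given as the integer A * 2^(4k)) into the k-bit chunks
    of A = A2*z^2 + A3*z^3 + A4*z^4 with z = 2^-k."""
    mask = (1 << k) - 1
    a4 = a_scaled & mask            # lowest k bits: coefficient of z^4
    a3 = (a_scaled >> k) & mask     # next k bits: coefficient of z^3
    a2 = (a_scaled >> (2 * k)) & mask  # top k bits: coefficient of z^2
    return a2, a3, a4
```

No arithmetic is needed for this step; it is pure wiring, which is what makes the chunked evaluation cheap in an FPGA.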
b) The evaluation step: Using the Taylor series expansion,

f(A) = C_0 + C_1·A + C_2·A^2 + C_3·A^3 + C_4·A^4 + ···
     ≈ C_0 + C_1·(A_2·z^2 + A_3·z^3 + A_4·z^4) + C_2·(A_2·z^2 + A_3·z^3 + A_4·z^4)^2 + C_3·(A_2·z^2 + A_3·z^3 + A_4·z^4)^3 + C_4·(A_2·z^2 + A_3·z^3 + A_4·z^4)^4
     ≈ C_0 + C_1·A + C_2·A_2^2·z^4 + 2·C_2·A_2·A_3·z^5 + C_3·A_2^3·z^6

Here, C_i = 1 when i is even and C_i = −1 when i is odd. Simplifying, we get:

B ≈ 1 − A_2·z^2 − A_3·z^3 + (−A_4 + A_2^2)·z^4 + 2·A_2·A_3·z^5 − A_2^3·z^6
  ≈ (1 − A) + A_2^2·z^4 + 2·A_2·A_3·z^5 − A_2^3·z^6

The above equation is the one used in the implementation.
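The reduction-evaluation-postprocessing flow can be checked end to end in exact rational arithmetic. This Python sketch substitutes the plain power series 1 − A + A^2 − A^3 for the chunked evaluation above (the two agree up to the ignored small terms) and computes the table entry directly rather than reading a ROM:

```python
from fractions import Fraction

def table_reciprocal(y: Fraction, k: int) -> Fraction:
    """Approximate 1/y for 1 <= y < 2 via reduction, evaluation,
    and post-processing."""
    # Reduction: truncate y to k fractional bits; R plays the role of
    # the look-up table output (the reciprocal of the truncation).
    y_k = Fraction(int(y * 2**k), 2**k)
    R = 1 / y_k
    # Evaluation: A = y*R - 1 satisfies |A| < 2^-k, so a short
    # Taylor series of 1/(1 + A) suffices.
    A = y * R - 1
    B = 1 - A + A**2 - A**3
    # Post-processing: 1/y = R * B (up to the dropped A^4 term).
    return R * B
```

For y = 13/10 and k = 14, the dropped A^4 term leaves an error of roughly 2^−66 in this example, comfortably below the 2^−53 precision of a double.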
Fig. 1. Reciprocal block diagram
Fig. 2. Black box of design
c) The post-processing step: The result of the reciprocal of Y is given by the product of R̂ and B: 1/Y = R̂ × B.
B. Multiplying Reciprocal by Dividend
After obtaining the reciprocal of the divisor, the final step is to multiply the reciprocal with X, which is the dividend. In our implementation, this is done before the number is normalized. Another multiplier is generated for this operation. The final result of division is obtained by combining the result of X/Y with the sign and exponent; after X/Y is normalized and rounded, the resulting exponent is appropriately adjusted.

IV. DIVIDER IMPLEMENTATION
The double precision divider is designed as part of the VFLOAT library, available for download from Northeastern University [9], [16]. Components in the VFLOAT library are designed to be joined together as a pipeline. Thus they have inputs READY, STALL, ROUND and EXCEPTION_IN and outputs DONE and EXCEPTION_OUT specifically to handle pipelining. The READY and DONE signals are used for knowing when the inputs are ready and when the results are available for use. STALL allows a bubble to be inserted into the pipeline if needed. ROUND has two modes: round to zero (truncate) and round to nearest. The exception signals propagate an exception flag, along with the possibly invalid value, through the pipeline. For division, an exception is identified if the divisor is zero. Otherwise, the exception input is propagated to the exception output. The complete list of input ports is:
Fig. 3. Top level components of divider
CLK, RESET, STALL, OP1, OP2, READY, ROUND and EXCEPTION_IN. The output ports are DONE, RESULT, and EXCEPTION_OUT, as shown in Figure 2.

Our implementation of the division algorithm for double precision floating point is written in VHDL and targets the two most popular families of commercial FPGAs: Altera and Xilinx. For tools we use Altera Quartus II 13.0 and the Xilinx ISE Design Suite 13.4. The designs make use of embedded multipliers and RAMs, which require using the intellectual property components provided with each set of tools. For Altera these are called MegaCores; for Xilinx they are called IP Cores. The two manufacturers also use different simulators; Altera ships a version of ModelSim called ModelSim-Altera 10.1d, while Xilinx has its own simulator, ISim, that is integrated with its tool suite. Both simulators were used to validate these designs.

Figure 3 shows the top level components of the divider. The top level divider consists of computation to denormalize the floating point input (explicitly represent the hidden one), divider computation, which includes a reciprocal and a multiplication of the dividend with the reciprocal, and round to normal for adjusting the result to standard floating point format.
A. Denormalizer
A normalized floating point number in IEEE 754 standard format has a sign bit, exponent bits and mantissa bits. As described in Section II, the stored part of the mantissa does not include the leading '1'; however, this bit is necessary for computation. All normalized floating point inputs to a component are sent through a denormalizer which adds this '1' to the head of the mantissa. For the divider, two such steps are needed, one for the dividend and one for the divisor.
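In software terms the denormalizer is a single OR of the hidden bit; a Python sketch (subnormal inputs, which the divider does not handle, are excluded):

```python
def denormalize(stored_mantissa: int, biased_exponent: int, width: int = 52) -> int:
    """Prepend the implicit leading '1' to the stored mantissa of a
    normalized IEEE 754 number, yielding the full significand
    (53 bits for double precision) used inside the divider."""
    assert biased_exponent != 0, "subnormal inputs are not supported"
    return (1 << width) | stored_mantissa
```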
B. Divider
After obtaining the denormalized number, we need to calculate the sign bit, exponent and quotient of the result. The sign bit is obtained by XORing the sign bits of the inputs. The exponent of the result is the exponent of the dividend minus the exponent of the divisor, with the bias suitably handled. Note that the output of this subtraction might not be the exponent of the final result, because we still need to normalize the result after the division operation. The following subsections describe the computation of the reciprocal and the multiplication with the dividend to obtain the final mantissa.

Fig. 4. Reciprocal components
1) Reciprocal Operation:
Figure 4 shows the hierarchy of structural components used to implement the reciprocal algorithm described in Section III-A. The design uses the reciprocal table to calculate R̂ and four multipliers. The look up table has 14 address bits and 16 output bits, for a total of 32K bytes. The four multipliers are the following sizes. Multiplier YR has 17 bit and 58 bit inputs and a 71 bit output; its purpose is to multiply Y and R̂ in order to get A. Multiplier S has two 14 bit inputs and a 28 bit output; its purpose is to compute A_2 × A_2 and A_2 × A_3. Multiplier M has 28 bit and 14 bit inputs and a 42 bit output; it computes the cube of A_2. Multiplier L has one 16 bit and one 58 bit input and a 74 bit output; it computes R̂ × B. In addition, parameterized adders and logic gates from the VFLOAT library are used.
2) Multiplier:
After determining the reciprocal of the divisor, the next step is to multiply this reciprocal with the mantissa of the dividend. This multiplier is called the mantissa multiplier. The inputs are 53 bits and 57 bits, and the output is 110 bits. The number of pipeline stages for this and the other multipliers in the implementation can be adjusted using the parameters available in Xilinx Core Generator or Altera MegaCores. After this multiplication, the last component rounds and renormalizes the output.
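The mantissa multiply followed by renormalization can be sketched in Python with small illustrative bit widths (the actual design uses the 53 × 57 → 110 bit multiplier described above):

```python
def multiply_and_normalize(mant_x: int, recip_y: int, frac_bits: int):
    """Multiply the dividend's significand by the divisor's reciprocal
    (both fixed-point integers with `frac_bits` fractional bits), then
    shift the product into [1, 2) and report the exponent adjustment."""
    product = mant_x * recip_y            # has 2*frac_bits fractional bits
    assert product > 0
    one = 1 << (2 * frac_bits)            # the value 1.0 at this scale
    exponent_adjust = 0
    while product < one:                  # quotient below 1: shift left
        product <<= 1
        exponent_adjust -= 1
    while product >= 2 * one:             # quotient at or above 2: shift right
        product >>= 1
        exponent_adjust += 1
    return product, exponent_adjust
```

For instance, with 8 fractional bits, 1.5 × 0.5 (i.e. 384 × 128) normalizes to 1.5 with an exponent adjustment of −1, matching 1.5/2 = 0.75.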
C. Round to Normal
The IEEE standard specifies four rounding modes: round to zero, round to nearest, round to negative infinity and round to positive infinity. In the VFLOAT library, two rounding options are supported: round to zero (truncation) and round to nearest. Each component has an input signal, ROUND. If ROUND is 0, truncation is implemented; otherwise, round to nearest is applied. This operation may change the exponent as well as the mantissa of the result. The round component also removes the leading '1' from the mantissa of the result for storage in IEEE floating point format. After this step, the normalized, double precision floating point result represented in IEEE standard notation will be on the output port.

V. RESULTS AND COMPARISON
We have implemented our double precision floating point divider and compared it to other, popular designs. Results are shown in Table I for designs mapped to Xilinx and Table II for designs mapped to Altera. As mentioned above, we used Altera Quartus II 13.0 and the Xilinx ISE Design Suite 13.4 to implement our design. CC is the number of clock cycles of latency. All designs can be pipelined with a throughput of one result per clock cycle. Our design has the advantage of working on both Altera and Xilinx FPGAs. In contrast, the IP Cores from Xilinx and MegaCores from Altera are vendor specific. Our designs allow the user to adjust the number of clock cycles of latency by adjusting the level of pipelining for each multiplier. Xilinx allows the designer to choose the latency of the divider as a parameter in Core Generator. We chose to compare to latencies of 8, 14 and 20 clock cycles. For Altera MegaCores, the designer is offered a handful of latencies from which to choose.

The latency of our first implementation is 14 clock cycles on both vendors' hardware. The maximum frequency on a Stratix V device (5SGXB6) from Altera is 121 MHz. Implemented on a Xilinx Virtex 6 device (XC6VLX75T), the maximum frequency is 148 MHz. Note that a faster clock frequency does not necessarily make a better design. Since our goal is to fit the divider into larger pipelines, the goal is to balance the clock speeds across all components. In addition, there is often a tradeoff between high clock frequency and number of clock cycles of latency, as can be seen in these results. Computing the result in the same total time with a lower clock frequency dissipates less energy, another design goal. However, too long a latency can slow down the entire pipeline.

We compared our design to several popular methods for implementing division, including the floating point cores provided by each manufacturer. Table I shows the synthesis results of several methods using the Xilinx IDE. Note that, given the same latency, our design provides a better maximum frequency. This faster frequency comes at the cost of more hardware resources being used. The last design is reported from a paper that presents a multiplicative divider [8]. This design can only support double precision with an error of 2 units in the last place (ulps), while our error is one ulp. Also, this design has a long latency, although a fast maximum frequency. These numbers are reported by the authors; we did not implement their design.

Table II shows the synthesis results of several methods using the Altera IDE. The MegaCore from Altera has only three fixed latencies from which to choose: 10, 24 and 61 clock cycles. In this case, the Altera MegaCores provide the best overall latency. Note that these designs use more DSP blocks than our design. Also note that we can change the latency, as our second implementation shows, by adjusting the pipeline parameter of the MegaCore multipliers. Thus our divider design is more flexible compared to the one available from Altera. The last two designs repeat results previously published in [10]. The Radix 4 digit recurrence implemented by FloPoCo has a larger maximum frequency but also longer
TABLE I. RESULTS AND COMPARISON WITH XILINX

Method | Device | Latency | Max Frequency | Total Latency | Resources
Our Implementation | Virtex 6 | 14 Clk Cycles | 148 MHz | 95 ns | 934 Registers, 6957 Slice LUTs
IP Core from Xilinx | Virtex 6 | 8 Clk Cycles | 74 MHz | 108 ns | 722 Registers, 3210 Slice LUTs
IP Core from Xilinx | Virtex 6 | 14 Clk Cycles | 117 MHz | 120 ns | 1188 Registers, 3234 Slice LUTs
IP Core from Xilinx | Virtex 6 | 20 Clk Cycles | 192 MHz | 104 ns | 2035 Registers, 3216 Slice LUTs
Multiplicative | Virtex 6 | 36 Clk Cycles | 275 MHz | 131 ns | 2097 Slices, 118K BRAM, 28 18×18
TABLE II. RESULTS AND COMPARISON WITH ALTERA

Method | Device | Latency | Max Frequency | Total Latency | Resources
Our Implementation 1 | Stratix V | 14 Clk Cycles | 121 MHz | 116 ns | 818 ALMs, 931 Logic Registers, 11 DSP blocks
Our Implementation 2 | Stratix V | 16 Clk Cycles | 145 MHz | 110 ns | 1004 ALMs, 1105 Logic Registers, 13 DSP blocks
MegaCore from Altera | Stratix V | 10 Clk Cycles | 176 MHz | 57 ns | 525 ALMs, 1247 Logic Registers, 14 DSP blocks
MegaCore from Altera | Stratix V | 24 Clk Cycles | 237 MHz | 101 ns | 849 ALMs, 1809 Logic Registers, 14 DSP blocks
MegaCore from Altera | Stratix V | 61 Clk Cycles | 332 MHz | 184 ns | 9379 ALMs, 13493 Logic Registers
Radix 4 Digit Recurrence | Stratix V | 36 Clk Cycles | 219 MHz | 164 ns | 2605 ALMs, 5473 Logic Registers
Polynomial Approx (d=2) + Newton Raphson | Stratix V | 18 Clk Cycles | 268 MHz | 67 ns | 444 ALMs, 823 Logic Registers, 2 M20K, 9 DSP blocks
latency [1]. The method with Polynomial Approximation (d=2) plus the Newton-Raphson algorithm [10] has the fastest overall latency. It also uses significant memory resources compared to the other designs.

VI. CONCLUSIONS AND FUTURE WORK
We have presented a double precision floating point divider implementation that produces hardware with a good tradeoff of frequency, latency and resource usage and that can be implemented on both Altera and Xilinx FPGAs. Our design makes use of the embedded multipliers and embedded RAMs commonly found in modern FPGA fabric. It is flexible, with the designer able to adjust the number of clock cycles of latency.

In the future, we plan to investigate our divider further. In particular, we will focus on improving the frequency by optimizing the critical path and trying different levels of pipelining. In addition, we plan to adapt our implementation to support variable precision division on both Altera and Xilinx FPGAs, and to make it available as part of the VFLOAT library [9], [16].

REFERENCES

[1] J. Detrey and F. de Dinechin. A tool for unbiased comparison between logarithmic and floating-point arithmetic. The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 49(1):161–175, 2007.
[2] M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann, 2003.
[3] M. D. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand. Reciprocation, square root, inverse square root, and some elementary functions using small multipliers. IEEE Transactions on Computers, 49(7):628–637, 2000.
[4] C. Freiman. Statistical analysis of certain binary division algorithms. Proceedings of the IRE, 49(1):91–103, 1961.
[5] R. Goldberg, G. Even, and P.-M. Seidel. An FPGA implementation of pipelined multiplicative division with IEEE rounding. In Field-Programmable Custom Computing Machines (FCCM 2007), 15th Annual IEEE Symposium on, pages 185–196. IEEE, 2007.
[6] P. Hung, H. Fahmy, O. Mencer, and M. J. Flynn. Fast division algorithm with a small lookup table. In Conference Record of the Thirty-Third Asilomar Conference on Signals, Systems, and Computers, volume 2, pages 1465–1468. IEEE, 1999.
[7] Institute of Electrical and Electronics Engineers (IEEE). 754-2008 — IEEE standard for floating-point arithmetic. IEEE, pages 1–58, 2008.
[8] M. K. Jaiswal and R. C. Cheung. High performance reconfigurable architecture for double precision floating point division. In Reconfigurable Computing: Architectures, Tools and Applications, pages 302–313. Springer, 2012.
[9] Northeastern University Reconfigurable Computing Laboratory. VFLOAT: The Northeastern variable precision floating point library. http://www.coe.neu.edu/Research/rcl/projects/floatingpoint/index.html.
[10] B. Pasca. Correctly rounded floating-point division for DSP-enabled FPGAs. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, pages 249–254. IEEE, 2012.
[11] J. Robertson. A new class of digital division methods. IRE Transactions on Electronic Computers, EC-7(3):218–222, 1958.
[12] E. Roesler and B. Nelson. Novel optimizations for hardware floating-point units in a modern FPGA architecture. In Field-Programmable Logic and Applications: Reconfigurable Computing Is Going Mainstream, pages 637–646. Springer, 2002.
[13] P. Soderquist and M. Leeser. Area and performance tradeoffs in floating-point divide and square-root implementations. ACM Computing Surveys (CSUR), 28(3):518–564, 1996.
[14] P. Soderquist and M. Leeser. Division and square root: choosing the right implementation. IEEE Micro, 17(4):56–66, 1997.
[15] X. Wang, S. Braganza, and M. Leeser. Advanced components in the variable precision floating-point library. In Field-Programmable Custom Computing Machines (FCCM'06), 14th Annual IEEE Symposium on, pages 249–258. IEEE, 2006.
[16] X. Wang and M. Leeser. VFloat: A variable precision fixed- and floating-point library for reconfigurable hardware. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 3(3):16, 2010.