Energy-Efficient VLSI Realization of Binary64 Division with Redundant Number Systems
Abstract:
VLSI realizations of digit-recurrence binary division usually use redundant representation of partial remainders and quotient digits. The former allows for fast carry-free computation of the next partial remainder, and the latter leads to less number of the required divisor multiples. In studying the previous relevant works, we have noted that the binary carry save (CS) number system is prevalent in the representation of partial remainders, and redundant high radix representation of quotient digits is popular in order to reduce the cycle count. In this paper, we explore a design space containing four division architectures. These are based on binary CS or radix-16 signed digit (SD) representations of partial remainders. On the other hand, they use full or partial pre computation of divisor multiples. The latter uses smaller multiplexer at the cost two extra adders, where one of the operands is constant within all cycles. The quotient digits are represented by radix-16 [−9,9]SDs. Our synthesis-based evaluation of VLSI realizations of the best previous relevant work and the four proposed designs show reduced power and energy figures in the proposed designs at the cost of more silicon area and delay measures. However, our energy-delay product is 26%–35% less than that of the reference work. The proposed architecture of this paper analysis the logic size, area and power consumption using Xilinx 14.2.
Existing System:
We briefly describe some representative binary division hardware designs, but mainly focus on those that internally use radix-16, for better comparison with our radix-16 designs. Taylor offered the aforementioned idea of decomposing the radix-16 quotient digit to two minimally redundant radix-4 digits that are obtained in a time-overlapping manner. However, the gained speed is at the cost of more silicon area and more power dissipation. This overlapping technique has been later used in a number of publications, where the main difference is in partial remainder representation (e.g., CS or BSD). This is summarized in Table I, where other design particularities, such as the employed quotient digit set and the use of overlapping, are also indicated. Nevertheless, each design may have its particular additional contribution.
Use of Signed Digit Number Systems:
Utilization of a high radix SD number system (for partial remainder representation) entails particular carry-free addition, based on (1). This regards a previous work and two of the proposed designs in this paper, which use radix-16 SD representation of partial remainders. The digit set of the former is not indicated, while those of the proposed designs are MRSD. Therefore, an addition of a radix-16 MRSD number (i.e., the shifted partial remainder) with a binary number (i.e., the divisor multiple) is in order. In the previous works that use radix-16 MRSD number system, each radix-16 MRSD digit is represented by a 5-bit two’s complement encoding. The most significant bit is a negabit with arithmetic value−1(0)corresponding to logical status 0(1)[14]. Fig. 1 shows two digit slices of the required carry-free addition, for the latter so-called negabit two’s complement representation, where white (black) dots represent negabits (posibits). One 4-bit adder per radix-16 position, whose input bits are distinguished within dashed boxes, is used to produce the sum digits (surrounded by dashed curves). The 4-bit adders produce the negabit and three most significant posibits of the sum digit, in the same radix-16 position, and the carry-out rests at the least significant position of the next sum digit.
Table 1: taxonomy of binary digit recurrence dividers
Figure 1: Two digit slices of CFA required for PRC
The use of the aforementioned adder is expected to lead to less power dissipation with respect to designs that use CS adder, which is mainly due to the use of 5 bits per digit in the former compared with 8 bits in the latter.
Disadvantages:
- Area coverage is high
Proposed System:
The general digit recurrence division architecture is shown in Fig. 2, where the pertinent discussion regarding the proposed designs will be provided as appropriate. Double precision operands (i.e., binary64 floating point) is assumed for all the proposed architectures as is the case for the main reference work.
Within the framework of Fig. 2, static and semidynamic DMGs and two different representations for partial remainders provide us with a design space based on the following options.
1) Radix-16 Quotient Digit Set:This choice, as in the previous relevant works, leads to the reduced number of cycles versus the direct generation of quotient bits.
2) SD Representation of Quotient Digits:We use[−9,9] radix-16 SD set for the intermediate representation of quotient digits. The 5-bit representation of such digits is the same as the MRSD of Fig. 1. The primary choice would be the minimally redundant[−8,8] digit set that requires the minimum number of divisor multiples, while the other extreme choice would be the maximally redundant[−15,15]digit set. However, another influential measure is the number of fractional digits that are sufficient for truncated comparison of partial remainders with divisor multiples. This is later shown to be 2 in the case of digit sets[−α, α](α∈[9,15]),and3forα=8.
3) Semidynamic DMG: The [−9,9] multiples of divisor that are needed in the PRC are normally obtained within the initialization cycle, where ten-way multiplexer is required for selectingqj+1D. However, besides implementing the latter conventional method, we propose the following method that uses a four-way multiplexer. In the initialization cycle, we generate only {2,3,6}D, and dynamically obtain ±{4,5,7,8,9}D, as±{6D−2D,6D∓D,6D+2D,6D+3D}, respectively, within each recurrence.
4) Use of Redundant Number Systems for PRC:The previous relevant works have opted for CS representation of partial remainders. To be able to independently show the advantage of aforementioned semidynamic DMG, we also use CS as one option, which due to doubling the representation storage does not seem to be a proper choice when lower power dissipation is desirable. Therefore, our other choice is to use higher radix redundant number systems for partial remainder representation. For example, radix-16 maximally redundant number system (MRSD) [5] is a viable choice, with only 25% extra representation storage.
Based on the above discussion, our design space includes four architectures that are designated as CS-10, CS-4, MRSD-10, and MRSD-4, where CS and MRSD regard the partial remainder representation, and 4 and 10 refer to the sizes of utilized multiplexers, corresponding to[0,3]×D or [0,9]×D multiples, respectively.
Quotient Digit Selection:
As was argued above, the quotient digit set of our choice is [−9,9]. In general, the next quotient digit qj+1 should be selected from [−α, α], such that the convergence condition hat is described by (3) (above) holds [1], where in this case (i.e., α=9),ρ=(a/(16−1))=(9/15)=0.6. The convergence condition (−0.6D≤W[j +1]≤0.6D) is partially unfolded as in Fig. 3, which suggests the comparison of the partial remainder with a set of divisor multiples (e.g., 1.6Dand−0.4D,forqj+1=−1)in order to decide the value of the next quotient digit. However, for some W[j] values (e.g., 16W[j] =−.5D), there may be more than one validqj+1. For example, see the overlapped zone between the dashed lines forqj+1=−1andqj+1=0.
To ease the QDS process and reduce the number of comparisons, it is common to precompute a set of comparison constants, as fixed multiples of divisor [as in (2)], to be used for exact QDS. As such, (4) provides the proper range of Mks, for k ∈[−9,9]. Therefore, an ease to compute choice is Mk =(k+0.5)D, which leads to the case that the exact interval of 16W[j] (for a particular value of qj+1 =−1) falls between M−2 =−1.5Dand M−1 =−.5D(see the corresponding bold arrows in Fig. 3).
The overall QDS architecture is shown in Fig. 4, where Wk[j +1]=(16W[j])t −(Mk)t, and the index t denotes tfractional digits.
Partial Remainder Computation:
The aforementioned four proposed architectures are mainly described within the PRC discussion as follows.:
1) CS-10 and MRSD-10 Designs: The straightforward PRC would use a unified carry-free adder/subtractor (CFA/S) and a 10:1 multiplexer, for the quotient digit set [−9,9]. The multiplexer selects one precomputed divisor multiple from [0,9]×D, which are obtained in the initialization phase. Fig. 5 shows the required architecture, for designs CS-10 and MRSD-10, where a digit is meant to denote either an MRSD or four CS digits, and QDS box represents Fig. 4, whose output control signals are shown as “MUX Selector” and “ADD/SUB.”
2) CS-4 and MRSD-4 Designs: In order to utilize a smaller selector of divisor multiples, we propose the architecture of Fig. 6, where again QDS box represents Fig. 4, whose output control signals are shown as “MUX Selector” and “ADD/SUB.” Let qj+1 =6α+β,where α∈[−1,1] and β ∈[0,3]. The overlapped PRC computes W[j +1]’ = 16W[j]+6αD via a CFA and a CFS, since W[j]is represented in binary CS or radix-16 MRSD formats. The operation of this part overlaps with the QDS. The rest of PRC is carried out after QDS, whereβDis selected via the four-input multiplexer, followed by computing W[j +1]=W[j +1]’ +βD, via another CFA.
Advantages:
- Area coverage is low
Software implementation:
- Modelsim
- Xilinx ISE