Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and Throughput Tradeoffs

**Abstract:**

Hybrid floating-point (FP) implementations improve software FP performance without incurring the area overhead of full hardware FP units. The proposed implementations are synthesized in 65-nm CMOS and integrated into small fixed-point processors with a RISC-like architecture. Unsigned, shift carry, and leading zero detection (USL) support is added to a processor to augment an existing instruction set architecture and increase FP throughput with little area overhead. The hybrid implementations with USL support increase software FP throughput per core by 2.18× for addition/subtraction, 1.29× for multiplication, 3.07–4.05× for division, and 3.11–3.81× for square root, and use 90.7–94.6% less area than dedicated fused multiply–add (FMA) hardware. Hybrid implementations with custom FP-specific hardware increase throughput per core over a fixed-point software kernel by 3.69–7.28× for addition/subtraction, 1.22–2.03× for multiplication, 14.4× for division, and 31.9× for square root, and use 77.3–97.0% less area than dedicated FMA hardware. The circuit area and throughput are found for 38 multiply–add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs. Thirty-three multiply–add implementations are presented, which improve throughput per core versus a fixed-point software implementation by 1.11–15.9× and use 38.2–95.3% less area than dedicated FMA hardware. The proposed architecture is analyzed in terms of logic size, area, and power consumption using Xilinx 1.2.

**Existing System:**

Several approaches have been explored for increasing FP throughput and maintaining low area overhead. Fused and cascade multiply–add FPUs improve accuracy and provide computational speedup; however, they introduce large area and power overhead, which are undesirable for simple fixed-point processors. If blocks of data have similar magnitudes, block FP (BFP) can be useful for increasing SNR and dynamic range. Microoperations have been used to create a virtual FPU, which reuse existing fixed-point hardware to emulate an FP datapath for a very long instruction word processor. Hardware prescaling and postscaling has also been used to reduce the required hardware for FP division and square root. The hardware overhead can be reduced by shortening the exponent and mantissa widths for video coding, audio applications, and radar image formation. Some speech recognition and image processing applications have been shown to not require the full mantissa width. Custom FP instructions have also been explored for an FPGA to increase FP throughput with lower area overhead than a full hardware FPU. However, Hockert and Compton did not consider modular FPUs built from standalone addition/subtraction and multiplication designs or the throughput when performing the multiply–add operation, nor did they explore the area and throughput tradeoffs of various division and square root algorithms.

**Disadvantages**:

- Full hardware FPUs incur large area and power overhead

**Proposed System:**

The main contributions of this paper are as follows.

- Eight hybrid implementations with custom FP-specific (CFP) hardware and six with USL support.
- Design and implementation of 38 multiply–add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs. These designs include full software kernels, full hardware modules, hybrid implementations with USL support, and hybrid implementations with CFP hardware. Three different algorithms for division and three for square root are utilized.
- Evaluation of the proposed software kernels, hardware modules, hybrid implementations, and FPUs (i.e., the combination of two or more FP software kernels, hardware modules, or hybrid implementations) in terms of area, throughput, and instruction count when performing FP multiply–add, addition/subtraction, multiplication, division, and square root.

Full hardware modules offer the highest throughput, but require the most area of the designs implemented. These modules are referred to as “full hardware” because all arithmetic is performed on dedicated FP hardware. Since the target platform has a 16-bit datapath, the FP values are first loaded into FP registers. Each value is stored as two 16-bit words.
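The two-word storage described above can be illustrated with a minimal sketch. It assumes the FP values use the IEEE-754 single-precision layout and that the high word holds the sign, exponent, and upper mantissa bits; the function names `float_to_words` and `words_to_float` are hypothetical and do not come from the paper.

```python
import struct

def float_to_words(x):
    """Split a value, encoded as IEEE-754 single precision (assumed), into (high, low) 16-bit words."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 16) & 0xFFFF, bits & 0xFFFF

def words_to_float(hi, lo):
    """Reassemble two 16-bit words into the single-precision value they encode."""
    bits = ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

hi, lo = float_to_words(-2.5)
print(hex(hi), hex(lo))        # the two words a 16-bit datapath would move
print(words_to_float(hi, lo))  # round trip back to the original value
```

On a 16-bit datapath, each operand therefore costs two register transfers before any FP instruction can execute, which is one reason the 32-bit I/O variants are kept for comparison.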

For comparison purposes, a separate version of each module is created, with a 32-bit word size and datapath. The full hardware modules are discussed as follows.

- Fused Multiply–Add Module (Full HW FMA)

The full hardware FMA module uses the FMA instruction, with a two-cycle execution latency. The design of the module matches that of a traditional single-path FMA architecture, similar to the FMA in the IBM RS/6000. The addend is complemented if effective subtraction is performed and right shifted by the exponent difference. The multiplier uses radix-4 Booth encoding with reduced sign extension, limiting the widths of the partial products to 28 and 29 bits. The partial products are then compressed using a Wallace tree into carry-save format. A 3:2 carry-save adder then adds these values and the lower 48 bits of the shifted addend. An end-around carry adder with a carry lookahead adder computes the sum. In parallel, a leading zero anticipator (LZA) determines the number of leading zeros for the result, to within 1 place. The result is complemented if the addend is larger than the product. The result is normalized using the LZA count, followed by a possible 1-bit correction and rounding.

Full HW FMA (32-bit I/O) is created for a 32-bit datapath and word size and uses FMA32 with a two-cycle execution latency. This instruction uses three source operands.
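To make the radix-4 Booth encoding step concrete, the following is a minimal behavioral sketch of the recoding only. It does not model the reduced sign extension, the Wallace tree, or the partial-product widths mentioned above. The function names and the 26-bit recoding width (a 24-bit unsigned mantissa padded with zeros so it reads as a non-negative two's-complement value) are assumptions made for illustration.

```python
def booth_radix4_digits(multiplier, width):
    """Recode a width-bit two's-complement multiplier into radix-4 Booth digits, LSB first.

    Each digit is in {-2, -1, 0, 1, 2}, so each partial product is a shifted,
    possibly negated copy of the multiplicand.
    """
    assert width % 2 == 0, "width must be even so the bits pair up"
    m = multiplier & ((1 << width) - 1)
    digits = []
    prev = 0  # implicit 0 to the right of bit 0
    for i in range(0, width, 2):
        b0 = (m >> i) & 1
        b1 = (m >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits

def booth_multiply(multiplicand, multiplier, width):
    """Sum the Booth partial products; equals multiplicand * multiplier (signed width-bit)."""
    digits = booth_radix4_digits(multiplier, width)
    return sum(d * multiplicand * (4 ** k) for k, d in enumerate(digits))

# Two 24-bit mantissas (1.5 and 1.25 with the hidden bit), recoded at 26 bits.
print(booth_radix4_digits(6, 4))                    # [-2, 2], i.e. -2 + 2*4 = 6
print(hex(booth_multiply(0xC00000, 0xA00000, 26)))
```

Recoding halves the number of partial products relative to a bit-by-bit multiplier, which is what keeps the compression tree that follows it small.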

- Addition/Subtraction Module (Full HW Add/Sub)

This module uses the FPAdd and FPSub instructions with a two-cycle execution latency each.

Full HW Add/Sub (32-bit I/O) is created for a 32-bit datapath and word size and uses FPAdd32 and FPSub32, each of which has a single-cycle execution latency. If operands are read from a processor’s local memory, then a single instruction can perform addition/subtraction.

- Multiplication Module (Full HW Mult)

This module uses the FPMult instruction with a single-cycle execution latency to perform multiplication.

Full HW Mult (32-bit I/O) is created for a 32-bit datapath and word size and uses the FPMult32 instruction to perform multiplication with a single-cycle execution latency. Assuming operands are read from a processor’s local memory, a single instruction can perform multiplication.

- Division Module (Full HW Div)

This module performs the restoring division algorithm using FPDiv. This instruction has a 30-cycle execution latency.

Full HW Div (32-bit I/O) is created for a 32-bit datapath and word size and uses the FPDiv32 instruction with a 30-cycle execution latency. A single instruction can perform division if operands are read from a processor’s local memory.
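As a rough illustration of the restoring division algorithm named above, the sketch below works on plain unsigned integers and produces one quotient bit per iteration. It is a textbook formulation, not the paper's datapath; the function name `restoring_divide` and the operand widths in the usage lines are assumptions.

```python
def restoring_divide(dividend, divisor, bits):
    """Unsigned restoring division of a bits-wide dividend; returns (quotient, remainder)."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        # Shift in the next dividend bit, then try subtracting the divisor.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        trial = remainder - divisor
        if trial >= 0:
            remainder = trial                # subtraction succeeded: quotient bit is 1
            quotient = (quotient << 1) | 1
        else:
            quotient <<= 1                   # "restore": keep the old remainder, bit is 0
    return quotient, remainder

print(restoring_divide(100, 7, 8))

# For mantissas, pre-shifting the dividend yields fractional quotient bits.
# Here 1.5 / 1.25 is computed with 24 fractional bits (widths are illustrative).
q, r = restoring_divide(0xC00000 << 24, 0xA00000, 48)
print(q / (1 << 24))
```

Because each iteration resolves a single quotient bit, the iteration count grows with the mantissa width, which is consistent with division having a far longer latency than multiplication.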

- Square Root Module (Full HW Sqrt)

This module uses FPSqrt with a 26-cycle execution latency to perform the nonrestoring square root algorithm.

Full HW Sqrt (32-bit I/O) is created for a 32-bit datapath and word size and uses the FPSqrt32 instruction to perform square root operations with a 26-cycle execution latency. A single instruction can perform square root operations.
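The nonrestoring square root algorithm can be sketched on unsigned integers as follows. This is a common textbook formulation that consumes two radicand bits and produces one root bit per iteration; it is offered as an illustration under that assumption and is not taken from the paper's hardware. The function name `nonrestoring_sqrt` is hypothetical.

```python
def nonrestoring_sqrt(radicand, bits):
    """Integer square root of a bits-wide unsigned value (bits even).

    Returns (root, remainder) with root*root + remainder == radicand.
    The partial remainder r may go negative; instead of restoring it, the
    next iteration adds rather than subtracts.
    """
    assert bits % 2 == 0 and 0 <= radicand < (1 << bits)
    q = 0  # root bits found so far
    r = 0  # signed partial remainder
    for i in range(bits // 2 - 1, -1, -1):
        pair = (radicand >> (2 * i)) & 3       # next two radicand bits
        if r >= 0:
            r = r * 4 + pair - (q * 4 + 1)     # subtract trial value
        else:
            r = r * 4 + pair + (q * 4 + 3)     # add back in the same step
        q = q * 2 + (1 if r >= 0 else 0)
    if r < 0:
        r += q * 2 + 1                         # single final correction
    return q, r

print(nonrestoring_sqrt(25, 6))
print(nonrestoring_sqrt(24, 6))
```

Only one correction is needed at the end, so every iteration performs exactly one add or subtract, whereas a restoring scheme may need an extra restore step.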

Figure 1: (a) Hardware to implement the FPMult_NormRndCarry instruction for the Hybrid Mult w/ CFP Ver. 1 implementation. FP Reg 1 is loaded with the product of the mantissa multiplication. The rounded result and carry bit are produced. If the carry flag is set, the exponent is incremented in software. (b) Hardware to implement the FPMult_NormRnd instruction for the Hybrid Mult w/ CFP Ver. 2 implementation. FP Reg 1 is loaded with the product and FP Reg 2 is loaded with the sign bits and exponents of both operands. The sign, exponent, rounded result, and zero flag are then produced.
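The split of work described for Ver. 1 in the caption can be modeled behaviorally as below. This is a sketch under several assumptions that the source does not state: 24-bit mantissas with the hidden bit set, round-to-nearest-even, a carry flag that signals either a product of 2.0 or more or a rounding overflow, an exponent bias of 127, and no handling of zeros, subnormals, infinities, NaNs, or exponent overflow. The Python function names are hypothetical stand-ins for the hardware step and the software kernel.

```python
def norm_rnd_carry(product):
    """Model of a normalize-and-round step on a 48-bit mantissa product.

    Returns (24-bit rounded mantissa, carry flag). Rounding mode and carry
    semantics are assumptions, not taken from the paper.
    """
    carry = (product >> 47) & 1            # product >= 2.0 needs one extra right shift
    shift = 24 if carry else 23
    mant = product >> shift
    rem = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and (mant & 1)):
        mant += 1                          # round to nearest, ties to even
    if mant == 1 << 24:                    # rounding overflowed into the next binade
        mant >>= 1
        carry = 1
    return mant, carry

def hybrid_multiply(sign_a, exp_a, mant_a, sign_b, exp_b, mant_b, bias=127):
    """Software side: fixed-point multiply and exponent math around the custom step."""
    product = mant_a * mant_b              # done on the fixed-point multiplier
    mant, carry = norm_rnd_carry(product)  # the custom-hardware step
    exp = exp_a + exp_b - bias
    if carry:
        exp += 1                           # exponent incremented in software
    return sign_a ^ sign_b, exp, mant

# 1.5 * 1.5 = 2.25 = 1.125 * 2^1
print(hybrid_multiply(0, 127, 0xC00000, 0, 127, 0xC00000))
```

The point of the split is that only the bit-level normalization and rounding, which are slow in fixed-point software, move into hardware, while the sign and exponent arithmetic stay on the existing integer datapath.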

**Advantages:**

- Reduces area relative to dedicated FMA hardware

**Software implementation:**

- ModelSim
- Xilinx ISE