High-Speed and Low-Latency ECC ProcessorImplementation Over GF(2^{m})on FPGA

**Abstract:**

In this paper, a novel high-speed elliptic curve cryptography (ECC) processor implementation for point multiplication (PM) on field-programmable gate array (FPGA) is proposed. A new segmented pipelined full-precision multiplier is used to reduce the latency, and the Lopez-Dahab Montgomery PM algorithm is modified for careful scheduling to avoid data dependency resulting in a drastic reduction in the number of clock cycles (CCs) required. The proposed ECC architecture has been implemented on Xilinx FPGAs’ Virtex4, Virtex5, and Virtex7 families. To the best of our knowledge, our single- and three-multiplier-based designs show the fastest performance to date when compared with reported works individually. Our one-multiplier-based ECC processor also achieves the highest reported speed together with the best reported area-time performance on Virtex4 (5.32 µs at 210 MHz), on Virtex5 (4.91µs at 228 MHz), and on the more advanced Virtex7 (3.18 µsat 352 MHz). Finally, the proposed three-multiplier-based ECC implementation is the first work reporting the lowest number of CCs and the fastest ECC processor design on FPGA (450 CCs to get 2.83 µs on Virtex7). The proposed architecture of this paper analysis the logic size, area and power consumption using Xilinx 14.2.

**Existing System:**

Elliptic curve cryptography (ECC) was proposed byKoblitz and Miller in 1985 individually. Publickey cryptography based on ECC provides higher security perbit than its Rivest, Shamir, Adleman counterpart. ECC hassome additional advantages such as a more compact structure,a lower bandwidth, and faster computation that all make ECCusable in both high-speed and low-resource applications. TheNational Institute of Standards and Technology (NIST) hasproposed a number of standard elliptic curves over binaryGalois fields GF(2m). Binary field curves are suitableor hardware implementation as field arithmetic operationsare carry free. Field-programmable gate array (FPGA)-basedECC hardware design is increasingly popular because of its flexibility, shorter development time scales, easy debugging,and continual improvement of the technology (lower powerand higher performance FPGAs)

Many high-performance ECC (HPECC) processor implementations on FPGA have been reported in the literature;the most relevant are presented.The common optimizing technique of high-speed designs isthe reduction of latency [number of clock cycles (CCs)]of a point multiplication (PM). To achieve low latency fora PM, these works adopted either parallel multipliers orlarge-size multipliers at the expense of additional area complexity; pipelining stages are also often used to increaseclock frequency at the expense of few extra CCs and areaoverheads. In addition, the pipelining stages inthe multipliers create idle cycles at the PM level if thereis data dependency in the instructions. As a result, carefulscheduling is required to take full advantage of pipelining. Indeed, recently, Khan and Benaissa havereported the highest throughput and highest speed ECCdesigns on FPGA using novel digit-serial and bit-parallelmultipliers together with efficient scheduling and pipeliningtechniques.

**Disadvantages**:

- Area coverage is high
- Speed is low

**Proposed System:**

ProposedGF2^{m}multiplierwithsegmentedpipelining:

The proposed full-precision GF(2m) field multiplier(including reduction) with segmented pipelining is shownin Fig. 1 and consists of two pipelining stages to improveclock frequency. The first stage pipelining is the proposed segmented pipelining to break the critical pathdelay of the GF2MUL part.In the segmented pipelining, we divide the mbit multiplieroperand intonnumber ofwbit long-digit multiplier operands.Then, we multiply thembit multiplicand by each of thewbit multipliers. The results of the wdigit size multiplier arem+w−1 bit long. We save each of the resultsin the m+w−1 size pipelining register. Here, we save multiplication’s results in thennumber ofm+w−1sizeregisters. The outputs of them+w−1 size registers are alignedby shifting (logically)wbits from each other followed byXORoperations (addition). The result of the addition that is 2m−1bit long is then reduced tombit in the reduction unit. In thereduction unit, we reduce the 2m−1 bits tombit multiplieroutput using a fast irreducible reduction polynomial.The output of the reduction unit is applied to the second stagepipelining register. Thus, there are two pipelining stages, andhence, the proposed multiplier consumes only two CCs as aninitial delay to perform multiplication. The pipelining of themultiplier divides the total critical path delay into two parts:the critical path delay of GF2MUL, TA+(log2(m/n))TX,andthe critical path delay of the reduction part using the fast NISTreduction polynomial (r-nomial),(log2((n+2r))TX,asshown in Table I. Both critical path delays depend on the sizeof the segment,w. Thus, any one of the two critical paths canbe the critical path of the multiplier. The optimum criticalpath can be defined by the optimum size ofwthat can bedetermined by a trial-and-error method.

Figure 1: Proposed segmented full-precision multiplier over GF(2m).

Proposedhigh-performanceeccforpointmultiplication:

We present careful scheduling in the pointaddition and point doubling operations, a novel pipelined fullprecision multiplier, and othersupporting units to achieve highspeed and low latency while optimizing area complexity.

Point Multiplication without Pipelining Delay:

In general, the Montgomery point addition and point doubling in the projective coordinates requires a total of six fieldmultiplication, five field squaring, and four field addition operations’ equivalent latency if implemented serially according toAlgorithm 1. If the field squaring and field addition operations can be concurrently operated with multiplication, thenthe point operations’ latency will be equivalent to the latencyof the six field multiplications. The six multiplications can,for example, be computed in two steps using three multipliersor in three steps using two multipliers or in six steps by serialmultiplications using one multiplier. Again, thedigit size can affect the performance of ECC; for example,a bit-serial implementation takes mcycles, a digit (wbits)serial one takes (m/w)cycles, and a bit-parallel implementation takes a single CC. In the case of high-speeddesign, digit-serial multipliers are considered to reduce latency.The disadvantage of large digit-serial multipliers is lower clockfrequency. Thus, pipelining stages are applied to improve clockfrequency. The clock frequency can be improved with theincrease in the number of pipelining stages in breaking thecritical path delay. The main disadvantages of increasing thenumber of pipelining stages in the high-speed end of the designspace are the increase in the number of CCs per multiplicationand overcoming data dependency. To avoid pipeliningdelay, optimal scheduling of the field operations of the PM isnecessary.

Figure 2: Proposed high-performance ECC processor architecture

Our first proposed ECC processor architecture is shownin Fig. 2. It comprises a full-precision mbit multiplier withtwo pipelining stages, one squaring circuit, one quad-squaringcircuit, and two addition circuits in order to accomplish pointoperations (point addition andpoint doubling) within six CCs.To achieve six CCs-based point operations, we includesome strategies in the point operations of the MontgomeryPM algorithm as shown in Algorithm 2 [24]. In the proposedalgorithm, we combine point addition and point doublingto avoid data dependency. In the PM, a particular loop isoverlapped with its next loop by two CCs due to two-stagepipelining. Thus, state1 (st1) and state2 (st2) depend on theprevious key bit,ki+1.

PROPOSEDLOW-LATENCYECC PROCESSORFORPOINTMULTIPLICATION:

The speed of ECC can be improved for high-speed applications by reducing the latency of the PM. Parallel full-precisionmultipliers can reduce latency to speed up the point operations.We proposed a high-speed ECC processor for PM utilizingthree full-precision multipliers to achieve the lowest latencyhigh-speed ECC as shown in Fig. 3.

Figure 3: ProposedLLECC processor architecture

**Advantages:**

- Area coverage is low
- Speed is high

**Software implementation:**

- Modelsim
- Xilinx ISE