FPGA Realization of Low Register SystolicAll-One-Polynomial Multipliers overGF(2^{m})and Their Applications inTrinomial Multipliers

**Abstract:**

Systolic all-one-polynomial (AOP) multipliers usually suffer from the problem of high register complexity, especially in field-programmable gate array (FPGA) platforms where the register resources are not that abundant. In this paper, we have shown that the AOP-based systolic multipliers can easily achieve low register-complexity implementations and the proposed architectures can be employed as computation cores to derive efficient implementations of systolic Montgomery multipliers based on trinomials. First, we propose a novel data broadcasting scheme in which the register complexity involved within existing AOP-based systolic multipliers is significantly reduced. We have found out that the modified AOP-based structure can be packed as a standard computation core. Next, we propose a novel Montgomery multiplication algorithm that can fully employ the proposed AOP-based computation core. The proposed Montgomery algorithm employs a novel precomputedmodular operation, and the systolic structures based on this algorithm fully inherit the advantages brought from the AOP-based core (low register complexity, low critical-path delay, and low latency) except some marginal hardware overhead brought by a precomputation unit. The proposed architectures are then implemented by Xilinx ISE 14.1 and it is shown that compared with the existing designs, the proposed designs achieve at least 61.8% and 47.6% less area-delay product and powerdelay product than the best of competing designs, respectively. The proposed architecture of this paper analysis the logic size, area and power consumption using Xilinx 14.2.

**Existing System:**

All-one-polynomials (AOPs) and trinomials aretwo of the important irreducible polynomials beingused. The AOP-based multipliers can beused for the nearly AOP, which could be used for efficientrealization of cryptosystems. The AOP-based structurescan be used as a kernel circuit for field exponentiation,inversion, and division architectures, while trinomialbased multipliers are more popular than AOP-based ones,as two trinomials have been recommended by the NationalInstitute of Standards and Technology (NIST) for ECCimplementation. However, because of the complexitydifferences, AOPs and trinomials are not usually consideredtogether in practical field multiplication implementations.There are basically two kinds of structures for multipliersoverGF(2m): systolic esign and nonsystolic design. Systolicmultipliers overGF(2m) based on irreducible polynomials are preferred in high-performance applications due totheir features, such as modularity and regularity.Systolic structures also have high register complexity, sinceall processing elements (PEs) in the systolic array need to useregisters for pipelining, while nonsystolic designs usuallyhave lower complexity with larger critical-path delay.For practical applications, especially in field-programmablegate array (FPGA) platforms, where the register resources arenot that abundant, low register-complexity systolic structuresare required. Many efforts have been reported to reducethe register complexity in systolic multipliers based on irreducible AOPs and trinomials. A bit-parallelAOP-based systolic multiplier has been introduced.Furthermore, another efficient AOP-based design is presented. Moreover, one low-complexity systolic MontgomeryAOP-based multiplier has been proposed. Leeet al.presented a bit-parallel systolic trinomial multiplier. Meher proposed efficient bit-parallel systolic and supersystolicdesigns. Xieet al.introduced a low register-complexitysystolic structure. Very recently, Montgomery systolic multipliers were presented where the register count was efficientlyreduced. Several other works were reported for efficientrealization of finite field Montgomery multiplication overGF(2m).

**Disadvantages**:

- High latency
- High complexity

**Proposed System:**

Lowregister-complexityAOP-basedsystolicmultipliers(AOP-basedcomputationcore:

We briefly review the AOP-basedmultiplication algorithm first, and then present our proposedarchitectures based on the existing structures.

Figure 1: Conventional systolic structure of AOP-based multiplication (structure-I: S-I), where BSC denotes the bit-shifting cell and the black box denotes the registers. (a) Structure. (b) Internal structure of PE-1. (c) Internal structure of regular PE. (d) Internal structure of PE-2.

For the structures of Figs. 1 and 2, we find thatk^{2}registersin the PEs pipeline identical data (in shifted order) to theneighboring PEs. These registers can be removed if we changethe broadcasting strategy. As shown in Figs. 3 and 4, i.e., MS-Iand MS-II, a shifted connection strategy is used in which theinput Ais directly fed to each PE and thus reduces the registersrequired. Moreover, the details of shifted connection are alsoshown in Figs. 3 and 4. To reduce the complexity further, wehave usedNANDandXNORgates to replace the originalANDandXORgates, as depicted in [7] and [11] (the critical-pathdelay is then shortened toTNA+TXN,whereTNAandTXNrepresent the delays of NANDandXNORgates, respectively,as evidenced by the normalized area and delay comparisonof various logic gates shown in Table I). It is noted that forAOP-based multiplication, the last PE (inside the dotted box)can be removed asbk =0. The modified structures involvenearly the same time complexity as the previous ones, but theregister complexity is significantly reduced.

Figure 2: Existing low critical-path structure of [24] for AOP-based multiplication (structure-II: S-II), where the black box denotes the registers. (a) Structure. (b) Internal structure of PE-1. (c) Internal structure of PE-2. (d) Internal structure of regular PE. (e) Internal structure of PE-3.

Low-Latency Implementations:

For practical applications, we can further reduce the latencies of structures shown in Figs. 3 and 4, fork+1=pq+f,where 0≤f ≤q. Without loss of generality, we assumef =0, and then, we can decompose the original systolicarray ofk+1 PEs intopparallel arrays to achieve low-latencyimplementations, as shown in Fig. 5. An extra pipelined addertree consisting of XNORgates and registers is needed to addthe results from parrays together to yield the final resultC.

Figure 3: Modified structure-I (MS-I), where the black box denotes the registers. For AOP implementation, we can remove the PE inside the dotted area since bk =0, but for the formation of standard computation core, this PE will be preserved. (a) MS-I. (b) Internal structure of PE-1. (c) Internal structure of regular PE.

Figure 4: Modified structure-II (MS-II), where the black box denotes the registers. For AOP implementation, we can remove the PE inside the dotted area sincebk =0, but for the formation of standard computation core, this PE will be preserved. (a) MS-II. (b) Internal structure of PE-1. (c) Internal structure of PE-2. (d) Internal structure of regular PE. (e) Internal structure of PE-3.

Figure 5: Low-latency implementation of systolic structure, where the internalPEs can be those of MS-I or MS-II.

Digit-Parallel Structures:

We can combine neighboring PEs in a systolic array intoone PE to reduce the register usage further. Fig. 6 showsan example of combining two neighboring PEs into one PE(based on the PEs from MS-I). The critical-path delay of thenew PE thus turns into(TNA+2TXN). For simplicity, we definethe structure based on new PE in Fig. 6(b) as a digit-levelparallel structure with digit sized=2. If we choose the valueofdappropriately, the proposed architecture can achieve theoptimal area-time complexity for specific applications.

**Advantages:**

- Low latency
- Low complexity

**Software implementation:**

- Modelsim
- Xilinx ISE