Efficient Designs of Multiported Memory on FPGA
Abstract:
The utilization of block RAMs (BRAMs) is a critical performance factor for multiported memory designs on field programmable gate arrays (FPGAs). Not only does excessive demand for BRAMs keep other parts of a design from using them, but the complex routing between BRAMs and logic also limits the operating frequency. This paper first introduces a new perspective and a more efficient way of using a conventional two-read one-write (2R1W) memory as a 2R1W/4R memory. By exploiting the 2R1W/4R as the building block, this paper introduces a hierarchical design of 4R1W memory that requires 25% fewer BRAMs than the previous approach of duplicating the 2R1W module. Memories with more read/write ports can be extended from the proposed 2R1W/4R memory and the hierarchical 4R1W memory. Compared with previous XOR-based and live value table (LVT)-based approaches, the proposed designs can reduce BRAM usage by up to 53% and 69%, respectively, for 4R2W memory designs with 8K-depth. For complex multiported designs, the proposed BRAM-efficient approaches can achieve higher clock frequencies by alleviating the complex routing in an FPGA. For 4R3W memory with 8K-depth, the proposed design can save 53% of BRAMs and enhance the operating frequency by 20%. The logic size, area, and power consumption of the proposed architecture are analyzed using Xilinx 14.2.
Existing System:
To implement a multiported memory on an FPGA, two types of design techniques are required, namely increasing read ports and increasing write ports. Table I lists the techniques proposed by previous works for multiported memories on FPGAs. The replication approach enables multiple read ports by replicating the data on multiple BRAMs. This technique requires only simple control logic, but uses an excessive number of BRAMs. LVT, which is implemented by synthesizing slices on the FPGA, enables multiple write ports by duplicating BRAMs and tracking which BRAM stores the latest value of an address. The other approach to increasing write ports is referred to as XOR-based. Unlike LVT, which uses a table to track the location of the latest value, the XOR-based design duplicates BRAMs and encodes the stored data with XOR operations. The target data can be retrieved by applying the XOR again. In general, the XOR-based approach can achieve a higher operating frequency, but requires more BRAMs than the LVT approach.
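The difference between the two write-port techniques can be sketched with a small behavioral model. This is only an illustration under stated assumptions: the class names `LVTMemory` and `XORMemory` are hypothetical, each write port is given exactly one bank, and the extra BRAM replication that real designs need to provide enough physical ports is not modeled.

```python
class LVTMemory:
    """Behavioral sketch: one bank per write port plus a live value table."""

    def __init__(self, depth, write_ports=2):
        self.banks = [[0] * depth for _ in range(write_ports)]
        self.lvt = [0] * depth  # per address: bank holding the latest value

    def write(self, port, addr, data):
        self.banks[port][addr] = data
        self.lvt[addr] = port  # record which bank is now current

    def read(self, addr):
        return self.banks[self.lvt[addr]][addr]


class XORMemory:
    """Behavioral sketch: one bank per write port, contents XOR-encoded."""

    def __init__(self, depth, write_ports=2):
        self.banks = [[0] * depth for _ in range(write_ports)]

    def write(self, port, addr, data):
        # Encode the new data with the values held by all other banks.
        encoded = data
        for p, bank in enumerate(self.banks):
            if p != port:
                encoded ^= bank[addr]
        self.banks[port][addr] = encoded

    def read(self, addr):
        # XOR-ing all banks cancels the encoding and leaves the latest data.
        value = 0
        for bank in self.banks:
            value ^= bank[addr]
        return value
```

In this sketch the LVT variant needs an extra table but reads a single bank, while the XOR variant needs no table but every read and write touches all banks.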
Note that this paper focuses on architectural solutions to achieve multiple accesses for a general memory that takes requests in the current cycle and returns results in the next cycle. Users of the multiported memory can be completely ignorant of the details of the memory design. There are other works focusing on enabling multiple accesses for specific types of storage elements, such as register files. They enable concurrent reads with an approach similar to replication, but avoid write conflicts by renaming the registers with software approaches, such as a compiler or an assembler. These approaches, which tackle specific storage functions and involve effort from users, are not in the scope of this paper.
The following sections provide more in-depth discussions of the implementations and design concerns of these techniques. To facilitate a more general discussion, the following paragraphs use the term memory bank to refer to a standalone memory module used as a building block to implement a memory system. A memory system usually consists of multiple banks. The memory space, also referred to as memory depth, is distributed across the banks. When designing a memory system on FPGAs, a single BRAM can be used to support the complete memory space. BRAMs can also be deployed as banks to enable a larger memory space or higher access bandwidth.
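As a small illustration of distributing the memory depth across banks, the sketch below maps a flat address to a bank index and an offset inside that bank. The helper name `split_address` is hypothetical, and the contiguous mapping (each bank holding a consecutive address range) is an assumption; the text does not state which address bits select the bank.

```python
def split_address(addr, num_banks, depth):
    """Map a flat address to (bank, offset), assuming contiguous bank ranges."""
    bank_depth = depth // num_banks
    if not 0 <= addr < bank_depth * num_banks:
        raise ValueError("address out of range")
    return addr // bank_depth, addr % bank_depth
```

For example, with an 8K-depth memory split into four banks, address 2048 would be the first entry of bank 1 under this mapping.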
Disadvantages:
- High memory (BRAM) usage
- High slice utilization
Proposed System:
This section proposes efficient solutions to implement multiported memories on FPGAs. Unlike the replication method, the approach proposed in this paper supports multiple reads with XOR operations, while multiple writes can be enabled using additional BRAMs. A remap table is added to track the location of the correct data. The main memory architecture is similar to that of our previous work introduced in [9]. On top of the main architecture, this paper introduces a new perspective of using a 2R1W module as either a 2R1W or a 4R module, denoted as a 2R1W/4R memory. By applying the 2R1W/4R, this paper exploits the versatile usage mode and proposes a hierarchical XOR-based design of 4R1W memory that requires fewer BRAMs than previous designs. Memories with more read/write ports can be supported by extending the proposed 2R1W/4R memory and the hierarchical 4R1W memory.
Techniques to Increase Read Ports:
1) Bank Division With XOR Design Scheme: Bank Division With XOR (BDX) is a previously proposed approach to increasing read ports. Unlike the replication method, BDX avoids replicating the storage elements of the whole memory space. With BDX, multiple reads can be supported by using XOR operations. Note that BDX is different from the XOR-based design. The XOR-based approach uses XOR operations to increase write ports by storing the encoded data to maintain data coherence between memory modules. BDX uses XOR operations to increase read ports by retrieving the target data from the encoded value.
Figure 1: Example of a 2R1W memory implemented by the BDX technique. (a) Supporting multiple reads with XOR operations. (b) Supporting a write request in the two-cycle pipeline architecture.
Fig. 1 illustrates an example of a 2R1W memory implemented with the BDX scheme. As shown in Fig. 1(a), the memory space is distributed to four data banks (banks 0–3). One XOR-bank is added to keep the XOR values of the data banks.
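The read path of the BDX scheme can be sketched as follows. This is a minimal behavioral model, not the paper's RTL: the function names `build_xor_bank` and `read_2r` are hypothetical, the contiguous address-to-bank mapping is an assumption, and each bank is assumed to serve one read per cycle. When two reads target the same data bank, the second value is rebuilt from the other data banks and the XOR-bank.

```python
from functools import reduce


def build_xor_bank(banks):
    """Return the XOR-bank: entry i is the XOR of entry i of every data bank."""
    depth = len(banks[0])
    return [reduce(lambda a, b: a ^ b, (bank[i] for bank in banks))
            for i in range(depth)]


def read_2r(banks, xor_bank, addr0, addr1):
    """Serve two reads in one cycle while reading each bank at most once."""
    bank_depth = len(banks[0])
    b0, o0 = divmod(addr0, bank_depth)
    b1, o1 = divmod(addr1, bank_depth)
    data0 = banks[b0][o0]  # the first read uses its target bank directly
    if b1 != b0:
        return data0, banks[b1][o1]
    # Bank conflict: rebuild the second value from the other data banks
    # and the XOR-bank instead of reading the busy bank a second time.
    data1 = xor_bank[o1]
    for b, bank in enumerate(banks):
        if b != b1:
            data1 ^= bank[o1]
    return data0, data1
```

With four data banks of depth 4, addresses 1 and 2 both fall in bank 0 under the assumed mapping, so the second read exercises the XOR reconstruction path.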
2) 2R1W/4R (An Efficient Two-Mode Memory): To implement hierarchical BDX (HBDX) in an efficient way, this paper introduces a new perspective of using a 2R1W module as either a 2R1W or a 4R module. This new way of using the 2R1W module is denoted as 2R1W/4R. This hybrid module can support either 2R and 1W or 4R. Note that the 2R1W/4R module uses exactly the same design as the 2R1W module introduced in Fig. 1.
Fig. 2 illustrates how the two modes work. Fig. 2(a) shows the 2R1W mode. When there is a write request W0, this design can support up to two conflicting reads. The write request W0 stores the data directly to the target data bank, and reads all the data at the same offset from the other data banks (Rupdate) to update the XOR-bank. Fig. 2(b) shows the 4R mode.
Figure 2: Two modes of 2R1W/4R module. The module is implemented with four data banks and one XOR-bank. (a) 2R1W mode. (b) 4R mode
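The write handling of the 2R1W mode can be sketched with a small behavioral model. The class name `BDX2R1W` and its methods are hypothetical, the contiguous address-to-bank mapping is an assumption, and the two-cycle pipeline of the actual design is collapsed into a single step here.

```python
class BDX2R1W:
    """Behavioral sketch of the write path: data banks plus one XOR-bank."""

    def __init__(self, num_banks=4, bank_depth=4):
        self.bank_depth = bank_depth
        self.banks = [[0] * bank_depth for _ in range(num_banks)]
        self.xor_bank = [0] * bank_depth  # XOR of all-zero banks is zero

    def write(self, addr, data):
        target, offset = divmod(addr, self.bank_depth)
        # Rupdate: read the same offset from every other data bank.
        new_xor = data
        for b, bank in enumerate(self.banks):
            if b != target:
                new_xor ^= bank[offset]
        self.banks[target][offset] = data  # store directly in the target bank
        self.xor_bank[offset] = new_xor    # refresh the XOR-bank entry

    def xor_invariant_holds(self):
        """Check that every XOR-bank entry equals the XOR of the data banks."""
        for offset in range(self.bank_depth):
            acc = 0
            for bank in self.banks:
                acc ^= bank[offset]
            if acc != self.xor_bank[offset]:
                return False
        return True
```

Keeping the XOR-bank consistent after every write is what allows a conflicting read to be reconstructed from the other banks.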
3) HBDX Designs With 2R1W/4R Module: Fig. 3 illustrates a design scheme that can support more read ports by replicating the 2R1W module. However, this design scheme could significantly increase the usage of the limited BRAMs on an FPGA. To achieve a more BRAM-efficient design, this paper proposes HBDX, which adopts a hierarchical structure that organizes the 2R1W modules to achieve 4R1W without replicating the 2R1W module. To further enhance the design, HBDX in this section leverages the 2R1W/4R scheme introduced in the previous section as the basic building module to implement a 4R1W module.
Figure 3: Example of mR1W memory implemented with multiple 2R1W modules.
Fig. 4 illustrates a 4R1W memory design using the HBDX scheme. In this 4R1W design, each basic building block is a 2R1W module of the BDX scheme introduced in Fig. 1.
Figure 4: HBDX 4R1W implemented with 2R1W/4R modules.
Techniques to Increase Write Ports:
Bank division with remap table (BDRT) is a previously proposed approach to increasing write ports. Unlike the LVT design, BDRT avoids replicating the whole memory space and supports multiple writes using additional BRAMs and a remap table to track the location of the latest data. Fig. 5 shows an example of the design for a 1R2W memory. This example consists of two data banks (banks 0 and 1), one bank buffer, and a remap table. The null entries in a memory bank are the entries that do not store any valid data. When W0 and W1 are received, these requests first look up the remap table to identify the correct BRAM that stores the latest data. According to the remap table, W0 and W1 are going to address 0 and address 1 in bank 0, respectively.
Figure 5: Example of a 1R2W memory implemented with the BDRT technique. (a) According to the remap table, both W0 and W1 are going to bank 0. The null entries in BRAMs are the entries that do not store any valid data. (b) Final state of the multiported memory after completing the two writes W0 and W1.
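The remap-table idea can be sketched with a behavioral model of a 1R2W memory. This is a sketch under assumptions rather than the design in Fig. 5: the class name `BDRT1R2W` is hypothetical, addresses are assumed to map contiguously to banks, and the conflict rule (the second write is redirected to whichever BRAM holds the null entry at its offset, and its old location becomes null) is an assumed policy that the text above does not spell out.

```python
NULL = None  # marks an entry that stores no valid data


class BDRT1R2W:
    """Behavioral sketch: data banks, one bank buffer, and a remap table."""

    def __init__(self, num_banks=2, bank_depth=4):
        self.bank_depth = bank_depth
        # Physical BRAMs: the data banks plus one bank buffer (last index).
        self.phys = [[0] * bank_depth for _ in range(num_banks)]
        self.phys.append([NULL] * bank_depth)
        # remap[offset][logical_bank] -> BRAM that holds the latest data.
        self.remap = [list(range(num_banks)) for _ in range(bank_depth)]

    def _null_bram(self, offset):
        """Find the BRAM whose entry at this offset is currently null."""
        used = set(self.remap[offset])
        return next(p for p in range(len(self.phys)) if p not in used)

    def write2(self, w0, w1):
        """Apply two writes, each an (addr, data) pair, issued in one cycle."""
        (a0, d0), (a1, d1) = w0, w1
        l0, o0 = divmod(a0, self.bank_depth)
        l1, o1 = divmod(a1, self.bank_depth)
        p0 = self.remap[o0][l0]  # remap table lookup for W0
        p1 = self.remap[o1][l1]  # remap table lookup for W1
        self.phys[p0][o0] = d0
        if p1 == p0:
            # Conflict (assumed policy): redirect W1 to the null entry.
            spare = self._null_bram(o1)
            self.phys[p1][o1] = NULL
            self.remap[o1][l1] = spare
            p1 = spare
        self.phys[p1][o1] = d1

    def read(self, addr):
        logical, offset = divmod(addr, self.bank_depth)
        return self.phys[self.remap[offset][logical]][offset]
```

In this sketch, writing addresses 0 and 1 in the same cycle sends the first write to bank 0 and the second to the bank buffer, and the remap table then directs later accesses to address 1 to the buffer.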
Advantages:
- Reduced memory (BRAM) usage
- Low slice utilization
Software implementation:
- ModelSim
- Xilinx ISE