ENFIRE: A Spatio-Temporal Fine-Grained Reconfigurable Hardware

Abstract:

Field programmable gate arrays (FPGAs) are well-established as fine-grained reconfigurable computing platforms. However, FPGAs demonstrate poor scalability in advanced technology nodes due to the large negative impact of the elaborate programmable interconnects (PIs). The need for such vast PIs arises from two key factors: 1) fine-grained bit-level data manipulation in the configurable logic blocks and 2) the purely spatial computing model followed in the FPGAs. In this paper, we propose ENFIRE, a novel memory-based spatio-temporal framework designed to provide the flexibility of reconfigurable bit-level information processing while improving scalability and energy efficiency. Dense 2-D memory arrays serve as the main computing elements, storing not only the data to be processed but also the functional behavior of the application mapped into lookup tables. Computing elements are spatially distributed, communicating as needed over a hierarchical bus interconnect, while the functions are evaluated temporally inside each computing element. A custom software framework facilitates application mapping to the framework. By leveraging both spatial and temporal computing, ENFIRE significantly reduces the interconnect overhead when compared with FPGA. Simulation results show an improvement of 7.6× in energy, 1.6× in energy efficiency, 1.1× in leakage, and 5.3× in unified energy efficiency, a metric that considers energy and area together, compared with comparable FPGA implementations. The logic size, area, and power consumption of the proposed architecture are analyzed using Xilinx ISE 14.2.

Existing System:

Traditionally, FPGA architectures employ a purely spatial computing model. Applications are mapped into a set of multiple-input, single-output lookup tables (LUTs) connected by the programmable interconnects (PIs). However, this requires a rather elaborate PI network, which becomes a major performance bottleneck and leads to poor power, performance, and scalability across technology nodes: the PI network alone has been shown to account for 80% of power consumption and 40%–80% of critical path delay. As a result, there is a growing need for alternative reconfigurable computing frameworks that can reduce the PI requirement, improving energy efficiency and technology scalability over the conventional architectures.

In order to improve area and performance, researchers have investigated using an FPGA’s embedded memory arrays for computation when they are not configured as on-chip memory. Time-multiplexed hardware reconfigurable schemes have also been investigated to increase hardware utilization, and therefore improve area and performance. However, when executing a specific application, they are still considered a fully spatial computing model, similar to the traditional FPGAs, and hence incur a large PI overhead. Spatio-temporal reconfigurable hardware schemes have also been proposed to improve the utilization of both PI and computing elements. Instead of using an arithmetic logic unit for multicycle execution as done by PipeRench, MAHA uses a memory-based computing (MBC) scheme to improve energy efficiency. However, these frameworks operate with a granularity of at least 8 bits, and therefore suffer from poor resource efficiency and low energy efficiency when executing bit-level applications. In addition, the Garp architecture was proposed to embed a reconfigurable FPGA-like architecture within a CPU; however, this is essentially a temporal computing element (CPU) using a spatial fabric (FPGA) to accelerate some computations, so function evaluation is not strictly spatio-temporal. Also related is the Tabula architecture, which still performs spatial data processing, though it time-multiplexes the FPGA resources to increase the apparent number of LUTs from the user’s perspective.

Disadvantages:

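  • Energy efficiency is low: the PI network dominates power consumption, and existing spatio-temporal frameworks are inefficient for bit-level applications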

Proposed System:

In particular, the paper makes the following novel contributions.

  • It proposes ENFIRE, a novel MBC framework, which uses a 2-D memory array hybridized with CMOS controlling logic. Unlike previous work in MBC, ENFIRE enables energy-efficient mapping of FPGA-like fine-grained logic that can be used stand-alone as an FPGA replacement, or embedded within a GPP for energy-efficient acceleration of fine-grained workloads. Many applications, including analytics, stream processing, and information encoding/decoding can largely benefit from efficient fine-grained data processing, which is not available in the conventional coarse-grain reconfigurable architectures.
  • It provides the design details of the proposed architecture, including the µ-architecture of the MLBs. It also describes in detail the hierarchical interconnect model between the MLBs.
  • It presents a custom-designed complete software application mapping flow, describing the major steps of the software framework, and also presents several examples of the application mapping.
  • It describes the modeling process of the hardware components, shows the simulation setup, and presents the simulation results. It then demonstrates that the ENFIRE framework can achieve considerable improvement in energy efficiency compared with an FPGA implementation at the cost of increased latency.

The block diagram of a single ENFIRE PE is shown in Fig. 1. Each PE is referred to as an MLB and operates independently. Each contains its own schedule (instruction) table and data memory array, which holds LUT responses in addition to the data for processing. The MLBs are connected with a two-level hierarchical bus interconnect. The first level, referred to as a cluster, contains four MLBs, while the second, called a tile, contains four clusters. In the remainder of this section, we describe in detail the µ-architecture of a single MLB along with the interconnect structure and the communication patterns between the MLBs.

Figure 1: Block diagram of an MLB

MLB Structure:

Fig. 1 shows an MLB block diagram. Each MLB consists of the following components.

  • Program Counter: tracks the current instruction being executed.
  • Schedule Table: memory array, which holds the instructions for the given application.
  • Decoder: responsible for decoding the instructions fetched from the schedule table.
  • Register File: holds the intermediate results from the application during execution.
  • Address Generation and Memory Controller: generates the memory access request and corresponding memory address for LUT operations.
  • Data Memory: large 2-D memory array, which holds the LUTs, as well as the data being processed.
  • Datapath Logic: controlled by the decoded instruction and determines the output destination.

The MLB is more analogous to the complex logic block (CLB) structure in an FPGA than to a standard processor. The primary difference between an MLB and a CLB is that the MLB stores multiple LUTs, which can be dynamically selected at runtime, as opposed to a single, fixed configuration. This allows for a spatio-temporal computing model that enables more efficient resource reuse than a CLB.
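The temporal evaluation described above can be illustrated with a minimal behavioral sketch. It is not the authors' implementation: the instruction fields (lut_id, src_regs, dst_reg), the eight-entry register file, and the full-adder mapping are all assumptions chosen for illustration. The point it shows is that one MLB holds several LUTs in its data memory and evaluates them one per cycle under control of the schedule table, where a purely spatial fabric would dedicate a separate LUT to each function.

```python
from dataclasses import dataclass, field


@dataclass
class Instruction:
    # Hypothetical instruction format, not taken from the paper.
    lut_id: int       # which stored LUT to evaluate in this cycle
    src_regs: tuple   # registers supplying the LUT input bits (MSB first)
    dst_reg: int      # register that receives the 1-bit LUT response


@dataclass
class MLB:
    luts: list        # data memory: each LUT is a list of response bits
    schedule: list    # schedule table: list of Instruction
    regs: list = field(default_factory=lambda: [0] * 8)  # register file
    pc: int = 0       # program counter

    def step(self):
        instr = self.schedule[self.pc]            # fetch and decode
        addr = 0
        for r in instr.src_regs:                  # address generation
            addr = (addr << 1) | self.regs[r]
        self.regs[instr.dst_reg] = self.luts[instr.lut_id][addr]  # LUT read
        self.pc += 1

    def run(self):
        while self.pc < len(self.schedule):
            self.step()


def full_adder(a, b, cin):
    """Evaluate a 1-bit full adder on one MLB in two time steps."""
    xor3 = [0, 1, 1, 0, 1, 0, 0, 1]   # sum bit, indexed by (a, b, cin)
    maj3 = [0, 0, 0, 1, 0, 1, 1, 1]   # carry-out bit, indexed by (a, b, cin)
    mlb = MLB(
        luts=[xor3, maj3],
        schedule=[Instruction(0, (0, 1, 2), 3),   # cycle 0: sum -> r3
                  Instruction(1, (0, 1, 2), 4)],  # cycle 1: carry -> r4
    )
    mlb.regs[0:3] = [a, b, cin]
    mlb.run()
    return mlb.regs[3], mlb.regs[4]
```

Calling full_adder(1, 1, 0) runs two schedule entries on the same block and returns the sum and carry bits; the second LUT reuses the same memory and datapath that evaluated the first.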

Figure 2: Wordline segmentation of a data row in the memory bank

Our design uses 2-kB memory banks (256 rows × 64 bits/row). Rather than read an entire row to perform a small memory access, wordline segmentation is used, which ensures that only the minimum number of required cells are energized during the read. AND gates are inserted in the path of the wordline to ensure that the SELECT signal is received only by the desired cells during a read operation. This scheme, shown in Fig. 2, allows for four LUTs of each size per memory bank. Because of a limitation in instruction encoding, there is a fifth 4-bit wide segment that cannot be used as an LUT, but can be used for general data storage if needed.
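As a quick check of the bank arithmetic and a rough picture of what segmentation buys, the sketch below models one 64-bit row split into independently selectable segments. The segment widths (32, 16, 8, 4, 4) are an assumption made only so that the example sums to 64 bits and ends in a fifth 4-bit segment; the text above does not give the actual layout. The function names are likewise hypothetical.

```python
ROWS = 256
BITS_PER_ROW = 64

# Assumed, illustrative segment widths; only the 64-bit row and the
# existence of a fifth 4-bit segment come from the text.
SEGMENT_WIDTHS = (32, 16, 8, 4, 4)


def bank_bytes(rows=ROWS, bits_per_row=BITS_PER_ROW):
    """Capacity of one memory bank in bytes."""
    return rows * bits_per_row // 8


def segment_bounds(index, widths=SEGMENT_WIDTHS):
    """Half-open column range [start, end) covered by a segment."""
    start = sum(widths[:index])
    return start, start + widths[index]


def read_segment(row_bits, index, widths=SEGMENT_WIDTHS):
    """Read one segment of a row.

    Returns (data, cells_energized). The per-column SELECT models the AND of
    the wordline with the segment enable, so only columns inside the chosen
    segment are activated.
    """
    start, end = segment_bounds(index, widths)
    select = [start <= col < end for col in range(len(row_bits))]
    data = [bit for bit, sel in zip(row_bits, select) if sel]
    return data, sum(select)
```

Under these assumptions, bank_bytes() gives 2048 bytes (2 kB), and reading one of the 4-bit segments energizes 4 cells where an unsegmented read would energize all 64.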

Interconnect Structure:

The statically scheduled random logic functions targeted by ENFIRE eliminate the need for a packet-switched interconnect scheme and the associated power and area overhead for routers. Routing information can be efficiently encoded as a part of the instructions and can be decoded at runtime to enable the inter-MLB communication as needed.
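One way to picture routing carried inside the instruction is a small destination field that the decoder inspects at runtime. The encoding below is purely hypothetical (a 4-bit LUT identifier and a 2-bit destination field; the actual ENFIRE instruction format is not given here). It is meant only to show why no router is needed when the schedule is fixed at mapping time: the destination is already known when the instruction is written.

```python
from enum import IntEnum


class Dest(IntEnum):
    # Hypothetical destination codes, not from the paper.
    REGISTER = 0      # keep the result in the local register file
    CLUSTER_BUS = 1   # drive the intracluster bus
    TILE_BUS = 2      # drive the intercluster bus

DEST_BITS = 2   # assumed width of the destination field
LUT_BITS = 4    # assumed width of the LUT identifier field


def encode(lut_id, dest):
    """Pack a LUT identifier and a destination into one instruction word."""
    if not 0 <= lut_id < (1 << LUT_BITS):
        raise ValueError("lut_id out of range")
    return (lut_id << DEST_BITS) | int(dest)


def decode(word):
    """Unpack an instruction word into (lut_id, Dest)."""
    return word >> DEST_BITS, Dest(word & ((1 << DEST_BITS) - 1))
```

For example, decode(encode(5, Dest.TILE_BUS)) recovers LUT 5 and the intercluster destination without any runtime arbitration or packet header.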

Figure 3: Hierarchical bus-based interconnect structure of (a) single cluster of four MLBs and (b) single tile of four clusters.

Most random logic functions exhibit strong spatial locality of data. To exploit this, ENFIRE utilizes a sparse hierarchical bus interconnect to enable low-overhead communication between separate MLBs. Within each cluster, an 8-bit bus fully connects all the MLBs [Fig. 3(a)]. This yields an available bandwidth of 10.26 GB/s at the maximum operation frequency of 1.3 GHz (Section V-A). Between the clusters, there is a similar 16-bit bus, 4 bits of which are reserved for each MLB in the source cluster [Fig. 3(b)]. When an MLB places data on its intercluster bus, the data are broadcast to all other clusters in the tile.
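The two-level hierarchy can be sketched as follows. The bus widths and the counts of four MLBs per cluster and four clusters per tile come from the text; the flat MLB numbering, the lane ordering (MLB 0 in the least significant 4 bits), and the function names are assumptions made for illustration.

```python
MLBS_PER_CLUSTER = 4
CLUSTERS_PER_TILE = 4
INTRA_BUS_BITS = 8     # intracluster bus width
INTER_BUS_BITS = 16    # intercluster bus width
LANE_BITS = INTER_BUS_BITS // MLBS_PER_CLUSTER  # 4 bits reserved per MLB


def locate(mlb_index):
    """Map a flat MLB index (0..15 within a tile) to (cluster, position)."""
    return divmod(mlb_index, MLBS_PER_CLUSTER)


def drive_intercluster(bus_word, position, value):
    """Place a 4-bit value on the lane reserved for the MLB at `position`.

    Lane ordering is an assumption: position 0 uses bits 3..0.
    """
    shift = position * LANE_BITS
    lane_mask = (1 << LANE_BITS) - 1
    cleared = bus_word & ~(lane_mask << shift)
    return cleared | ((value & lane_mask) << shift)


def broadcast(source_cluster, bus_word):
    """Deliver the source cluster's bus word to every other cluster."""
    return {c: bus_word
            for c in range(CLUSTERS_PER_TILE)
            if c != source_cluster}
```

With this numbering, MLB 6 sits at position 2 of cluster 1; driving 0xA on its lane sets bits 11..8 of the 16-bit word, and broadcast(1, word) delivers that word to clusters 0, 2, and 3.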

Advantages:

  • Energy efficiency is high: simulations show a 1.6× improvement in energy efficiency over comparable FPGA implementations

Software implementation:

  • ModelSim
  • Xilinx ISE