Temporarily Fine-Grained Sleep Technique for Near- and Subthreshold Parallel Architectures

Abstract:

This paper presents a design approach for improving the energy-efficiency and throughput of parallel architectures in near- and subthreshold voltage circuits. The focus is to suppress the leakage energy dissipation of the idle portions of circuits during active modes, which allows us to wholly transform the throughput improvement from parallel architectures into energy savings via deep voltage scaling. We begin by investigating the efficacy of parallel and pipeline architectures in near- and subthreshold circuits. The investigation reveals that active-leakage energy dissipation largely undermines the ability of deep voltage scaling to transform excess throughput into energy savings. Techniques such as power-gating switches (PGSs) can mitigate active-leakage power dissipation; however, the overhead of entering and exiting sleep modes can offset the energy savings provided by sleep mode, particularly if the sleep time is fine-grained for suppressing active leakage. Therefore, in this paper, we propose a PGS design technique, inspired by the so-called zigzag supercutoff CMOS, to optimize the mode-transition overheads of PGS in near- and subthreshold circuits. The proposed technique allows circuits to stay in sleep mode for as short as a single clock cycle with negligible energy and delay overheads. We apply the proposed design to parallel multiplier-based test circuits operating at near- and subthreshold voltages. Simulations show a significant improvement in energy-efficiency over baselines at the same throughput. The proposed architecture is analyzed in terms of logic size, area, and power consumption using the Tanner tool.

Existing System:

One of the most effective approaches to reducing power dissipation is to design digital circuits to operate at supply voltages (VDD) scaled from nominal to near or below the transistor threshold voltage (Vth). This approach is often referred to as near- and subthreshold voltage circuits, and it can provide approximately one or two orders of magnitude savings in energy dissipation. Furthermore, we can employ parallel and pipeline architectures, and by scaling VDD, it is possible to trade off the throughput improvements from those architectural techniques for higher energy-efficiency. Several classic studies show that such combinations can improve throughput, energy-efficiency, or both.

The existing works on parallel and pipelined architectures, however, emphasize nominal-VDD designs and pay little or no attention to a crucial issue that has greater significance in near- and subthreshold circuits: active-leakage dissipation. As VDD is scaled from nominal to near- and subthreshold levels, increasingly slowed-down circuits accumulate more leakage energy per clock cycle. Eventually, leakage energy dissipation starts to offset the quadratic savings in dynamic energy dissipation. Active-leakage energy dissipation, consequently, becomes critical to runtime computing energy-efficiency. The VDD level at which the total energy consumption starts to increase is defined as the energy-optimal voltage (VOPT).
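The existence of an energy-optimal voltage can be illustrated with a first-order model. The sketch below is not taken from the paper: it assumes dynamic energy proportional to VDD squared, a leakage current that does not depend on VDD, and a cycle time that grows exponentially as VDD drops below Vth. Every constant (C_EFF, I_LEAK, VTH, N_VT, T_REF) is a made-up illustrative value.

```python
import math

# All constants are illustrative assumptions, not values from the paper.
C_EFF = 1e-12   # effective switched capacitance per cycle (F)
I_LEAK = 1e-6   # leakage current, treated as VDD-independent (A)
VTH = 0.4       # transistor threshold voltage (V)
N_VT = 0.039    # subthreshold slope factor times thermal voltage (V)
T_REF = 1e-9    # cycle time when VDD equals VTH (s)


def cycle_time(vdd):
    """Crude delay model: the cycle stretches exponentially as VDD falls."""
    return T_REF * math.exp((VTH - vdd) / N_VT)


def energy_per_cycle(vdd):
    """Return (dynamic, leakage) energy per clock cycle in joules."""
    dynamic = C_EFF * vdd ** 2
    leakage = I_LEAK * vdd * cycle_time(vdd)  # leakage power x cycle time
    return dynamic, leakage


def find_vopt(v_lo=0.15, v_hi=1.0, steps=851):
    """Grid-search the VDD that minimizes total energy per cycle."""
    grid = [v_lo + i * (v_hi - v_lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda v: sum(energy_per_cycle(v)))


if __name__ == "__main__":
    vopt = find_vopt()
    dyn, leak = energy_per_cycle(vopt)
    print(f"VOPT ~ {vopt:.3f} V, dynamic {dyn:.2e} J, leakage {leak:.2e} J")
```

Under these assumptions, lowering VDD below the minimum found by find_vopt raises total energy because the leakage term grows faster than the dynamic term shrinks.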

In order to improve computing energy-efficiency beyond the conventional limit, i.e., VOPT, it is of great importance to reduce leakage energy dissipation during active modes. One potential solution is to place idle parts of circuits into a low-leakage (sleep) mode. For example, although not targeting ultralow-voltage (ULV) circuits, Hu et al. and Tschanz et al. have proposed a power-gating switch (PGS) for each function block of an execution stage of a pipelined microprocessor. By opportunistically putting the blocks that perform no useful work into a sleep mode, we can reduce active-leakage waste.

While a low-leakage sleep mode is a valid approach, frequently entering and exiting modes (i.e., mode transitions) to suppress active-leakage dissipation can cause non-negligible energy and delay overheads. The use of PGS, for example, can consume a significant amount of dynamic energy to charge and discharge parasitic capacitances during mode transitions. Furthermore, the delay required for transitioning from a sleep to an active mode can degrade throughput and complicate sleep control.

Disadvantages:

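  • High delay: the sleep-to-active wake-up time degrades throughput and complicates sleep control
  • High power consumption: active leakage grows at scaled VDD, and PGS mode transitions consume significant dynamic energy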

Proposed System:

We apply the proposed PGS strategy to parallel multiplier test circuits operating at near- and subthreshold voltages. The technique allows idle portions of the test circuits to stay in sleep mode for as short as a single clock cycle, with minimal delay and energy overheads. The simulation results show that the parallel architecture employing the proposed temporarily fine-grained PGS design technique can improve energy-efficiency by up to 1.9× to 2.6× over the non-parallelized/pipelined, simply pipelined, and simply parallelized baselines at the same throughput. The contributions of this paper can be summarized as follows.

1) We analyze the challenge of active leakage in parallel architectures in near- and subthreshold voltage circuits.

2) We show that active leakage can be significantly mitigated by employing temporarily fine-grained PGS. Via experiment, we find zigzag supercutoff CMOS (ZSCCMOS) to be a good fit.

3) We experimentally confirm that the combination of parallel architectures and ZSCCMOS can successfully trade off architectural throughput improvements for energy savings in near- and subthreshold digital circuits.

Increasing hardware-level parallelism and pipelining are notable architectural strategies for enhancing computing throughput. These approaches can also provide significant energy-efficiency gains, since the improved throughput can be traded off for energy savings via voltage scaling.

Figure 1: Three test architectures based on a 16-bit multiplier. (a) Baseline, (b) two-stage pipelined, and (c) two-way parallel designs. Dashed lines: boundaries of the equivalent sequencing stage across three designs.

Efficacy of Two-Way Parallel Architecture:

In order to investigate parallelism and pipelining in near- and subthreshold circuits, we use three test architectures based on 16-bit array multipliers in a 65-nm general-purpose CMOS. The baseline version, shown in Fig. 1(a), consists of 32 input D flip-flops and an array multiplier, which operates at the maximum clock frequency (FCLK,BASE) at each VDD. Fig. 1(b) shows the two-stage pipeline architecture, which consists of 32 input and pipeline flip-flops. The two-stage pipeline can halve the critical path delay, which allows us to use the same clock frequency (FCLK,PIPE = FCLK,BASE) at a lower VDD. This low VDD can improve energy-efficiency. Finally, Fig. 1(c) shows the two-way parallel architecture. This design includes a 32-bit 2-to-1 multiplexer to recombine the outputs of the two multipliers (Multipliers 1 and 2). In the parallel architecture, while new inputs arrive at FCLK,BASE, computation is interleaved by clocking the input flip-flops at FCLK,PARA, which is half of FCLK,BASE. Although the clock frequency is reduced, throughput is still maintained. This slack, provided by parallelism, enables us to reduce VDD to increase power and energy savings. The energy dissipation of the output flip-flops is not included.
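The interleaving in the two-way parallel design can be sketched as a purely functional model. This is an illustration rather than the authors' netlist: the function names are hypothetical, and the choice that Multiplier 1 takes the even-indexed inputs is an assumption. It shows only that two half-rate multipliers plus a 2-to-1 multiplexer produce the same result stream as the full-rate baseline.

```python
def baseline(inputs):
    """One multiplier: one product per CLK cycle at FCLK,BASE."""
    return [a * b for a, b in inputs]


def two_way_parallel(inputs):
    """Two multipliers, each receiving every second operand pair.

    Each multiplier is effectively clocked at FCLK,BASE / 2, and the
    2-to-1 multiplexer recombines both streams in arrival order.
    Assumption: Multiplier 1 handles even-indexed cycles.
    """
    m1 = [a * b for a, b in inputs[0::2]]  # Multiplier 1 result stream
    m2 = [a * b for a, b in inputs[1::2]]  # Multiplier 2 result stream
    out = []
    for cycle in range(len(inputs)):
        source = m1 if cycle % 2 == 0 else m2  # multiplexer select
        out.append(source[cycle // 2])
    return out


if __name__ == "__main__":
    pairs = [(3, 5), (7, 11), (2, 9), (65535, 65535), (0, 4)]
    print(two_way_parallel(pairs) == baseline(pairs))
```

Because each multiplier has two CLK periods to finish a product, the extra timing slack is what permits the lower VDD described above.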

Mode Transition Overhead of PGS:

Fig. 2(a) shows the conventional nMOS-based PGS design, and Fig. 2(b) shows the transient behaviors of the virtual ground potential (VVG) and the power dissipation of the main circuits when entering, exercising, and exiting sleep modes. Here, when the SLEEP BAR (SLPB) signal transitions to logic level LOW (i.e., entering sleep mode), the potential of the virtual ground (VG) node starts to rise to a level close to VDD. The elapsed time associated with this transition is defined as time to sleep (T2SLP). After that, the circuits reach deep sleep, in which they consume a small leakage power, referred to as PSLEEP. In order to exit a sleep mode, the SLPB signal is set HIGH, and each node of the main circuits, including VG, returns to its stable state. The transition time from sleep to active mode is defined as wake-up time (T2WKU). The total sleep time, TSLP, is defined as the sum of T2SLP, TSLEEP (the time spent in deep sleep), and T2WKU. The energy dissipated during a mode transition is defined as ETRAN.
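The quantities defined above suggest a simple break-even estimate for when a sleep episode pays off. The sketch below is an assumption-laden illustration, not the paper's analysis: it treats ETRAN as covering all energy spent during T2SLP and T2WKU, and it introduces a hypothetical p_active_leak for the leakage power the block would burn if it stayed awake.

```python
def total_sleep_time(t2slp, t_sleep, t2wku):
    """TSLP = T2SLP + TSLEEP + T2WKU."""
    return t2slp + t_sleep + t2wku


def net_energy_saving(p_active_leak, p_sleep, e_tran, t2slp, t_sleep, t2wku):
    """Energy saved by sleeping instead of idling awake for TSLP.

    Assumption: e_tran already includes everything dissipated while
    entering and exiting sleep, so only deep sleep adds p_sleep * t_sleep.
    """
    stay_awake = p_active_leak * total_sleep_time(t2slp, t_sleep, t2wku)
    go_to_sleep = e_tran + p_sleep * t_sleep
    return stay_awake - go_to_sleep


def break_even_sleep_time(p_active_leak, p_sleep, e_tran, t2slp, t2wku):
    """Smallest deep-sleep duration for which sleeping does not lose energy."""
    if p_active_leak <= p_sleep:
        raise ValueError("sleep mode must leak less than active mode")
    t = (e_tran - p_active_leak * (t2slp + t2wku)) / (p_active_leak - p_sleep)
    return max(t, 0.0)
```

A fine-grained sleep technique is attractive precisely when ETRAN and the transition times are small enough that this break-even duration fits within a single clock cycle.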

Figure 2: (a) Main circuits (two inverters) with an nMOS PGS, showing the critical discharging path during a wake-up process. (b) Timing and energy overheads during a mode transition.

We compare four PGS design techniques: a footer PGS, a footer PGS with gate overdrive voltage, a ZSCCMOS, and a ZSCCMOS with gate overdrive voltage (Fig. 3). In the two footer-only PGS designs, the PGS is sized to ∼10% of the total nFET width of the main circuits (a 16-bit array multiplier). In the two ZSCCMOS designs, the footer is sized to ∼10% of the total nFET width, and the header PGS is sized to ∼10% of the total pFET width. We include the overhead of level converters in our simulations for the designs using overdrive voltage [i.e., Fig. 3(b) and (d)].

Figure 3: Four PGS designs. (a) Footer PGS. (b) Footer PGS with gate overdrive voltage. (c) ZSCCMOS. (d) ZSCCMOS with gate overdrive voltage.

Parallel Architecture with the Temporarily Fine-Grained Sleep Technique:

Fig. 4(a) shows the test circuits that we use in the experiment. They consist of two 16-bit array multipliers (M1 and M2), a 32-bit 2-to-1 multiplexer, input flip-flops (IFF1 and IFF2), and sleep flip-flops (SFF1 and SFF2). M1 and M2 employ the ZSCCMOS technique with no gate overdrive voltage. The multiplexer employs no PGS. CLK is a synchronous clock signal for the nonparallel portions of the circuit, while CLK1 is a 2× slower clock. CLK2 is the inverted version of CLK1. The data input (INPUT) to the test circuits is accompanied by a VALID signal that indicates whether INPUT is valid or not. Both the INPUT and VALID signals are synchronized with CLK. In addition, by AND-ing VALID and CLK1 (or CLK2), we can generate LCLK1 (or LCLK2), which is used in the input flip-flops of M1 (or M2) for clock gating.

We present the detailed operating waveforms of the test circuits in Fig. 4(b). In the waveforms, at clock cycle 1, the first input is not valid (VALID transitions to LOW before the rising clock edge). SFF1, clocked by CLK1, captures VALID and generates the SLPB1=LOW and SLP1=HIGH signals. This puts M1 into sleep mode and minimizes its active-leakage dissipation. IFF1 of M1 is reset, forcing the inputs of M1 to LOW, so that every internal node in M1 settles to a predefined state. IFF1 is also clock-gated, such that its dynamic energy consumption is reduced. Similarly, just before the next rising edge of CLK, another invalid input comes in (VALID is still LOW at the rising clock edge). The VALID signal, this time, is captured by SFF2 at the rising edge of CLK2. This makes SLPB2=LOW and SLP2=HIGH to turn OFF M2 during that cycle. IFF2 of M2 is reset, forcing the internal nodes in M2 to predefined states. In addition, IFF2 is clock-gated by LCLK2 to reduce the dynamic energy consumption.

At cycle 3, VALID becomes HIGH. SFF1 captures the VALID signal at the rising edge of CLK1, which modulates SLPB1 and SLP1 to wake up M1. The wake-up process is finished before the IFFs of M1 sample INPUT at the rising edge of LCLK1 because of a very short T2WKU of <1 FO4 delay and the time difference between the rising edges of CLK1 and LCLK1.
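The cycle-by-cycle behavior described above can be mimicked with a small behavioral model. This is a simplified sketch under stated assumptions, not the actual circuit: cycles are numbered from 1 as in the text, CLK1 rising edges are assumed to coincide with odd cycles and CLK2 rising edges with even cycles, and all sub-cycle timing (T2WKU, the skew between CLK1 and LCLK1) is abstracted away. The function name sleep_schedule and the dictionary keys are hypothetical.

```python
def sleep_schedule(valid):
    """Map per-cycle VALID samples to per-multiplier sleep control.

    valid[i] is VALID as seen at the rising CLK edge of cycle i + 1.
    Returns one record per cycle naming the multiplier served in that
    cycle and the control signals its sleep flip-flop would generate.
    """
    records = []
    for i, v in enumerate(valid):
        cycle = i + 1
        unit = 1 if cycle % 2 == 1 else 2  # assumed CLK1/CLK2 phase
        slpb = bool(v)                     # SFFn captures VALID as SLPBn
        records.append({
            "cycle": cycle,
            "multiplier": f"M{unit}",
            "SLPB": slpb,        # LOW (False) puts the multiplier to sleep
            "SLP": not slpb,     # complementary sleep signal
            "LCLK": slpb,        # LCLKn = VALID AND CLKn pulses only if valid
        })
    return records


if __name__ == "__main__":
    # Two invalid inputs followed by two valid ones, as in the walkthrough.
    for rec in sleep_schedule([0, 0, 1, 1]):
        state = "active" if rec["SLPB"] else "asleep"
        print(rec["cycle"], rec["multiplier"], state)
```

In this model a multiplier sleeps exactly in the cycles where its input is invalid, which is the single-cycle sleep granularity the technique targets.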

Figure 4: (a) Proposed design. A 16-bit two-way parallel multiplier with a temporarily fine-grained PGS technique. (b) Functional waveforms

Advantages:

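  • Low power consumption: active leakage is suppressed by sleep periods as short as a single clock cycle
  • Low delay: mode transitions add minimal delay overhead (T2WKU of <1 FO4 delay)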

Software implementation:

  • Tanner EDA