Hardware-Oriented Compression of Long Short-Term Memory for Efficient Inference
ABSTRACT
Long short-term memory (LSTM) and its variants have been widely adopted for processing sequential data. However, their intrinsically large memory requirement and high computational complexity make them difficult to deploy in embedded systems. This creates the need for model compression and dedicated hardware accelerators for LSTM. In this letter, efficient clipped gating and top-k pruning schemes are introduced to convert the dense matrix computations in LSTM into structured sparse-matrix sparse-vector multiplications. Then, mixed quantization schemes are developed to eliminate most of the multiplications in LSTM. The proposed compression scheme is well suited for efficient hardware implementation. Experimental results show that the model size and the number of matrix operations can be reduced by 32× and 18.5×, respectively, at a cost of less than 1% accuracy loss on a word-level language modeling task.
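To make the notion of a sparse-matrix sparse-vector multiplication concrete, a minimal numpy sketch is given below. The function name spmspv and the small example matrices are illustrative assumptions, not the letter's implementation; the point is only that columns whose input activation is zero and weights that have been pruned to zero contribute no work.

import numpy as np

def spmspv(W, x):
    """Accumulate y = W @ x using only nonzero activations and weights."""
    y = np.zeros(W.shape[0])
    for j in np.nonzero(x)[0]:          # skip columns whose activation is zero
        col = W[:, j]
        nz = np.nonzero(col)[0]         # skip pruned (zero) weights
        y[nz] += col[nz] * x[j]
    return y

W = np.array([[0.0, 1.5, 0.0], [2.0, 0.0, -0.5]])
x = np.array([0.0, 3.0, 4.0])           # sparse activation vector
print(spmspv(W, x), W @ x)              # both give the same result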
EXISTING SYSTEM :
Recurrent neural networks (RNNs) have exhibited powerful capability in processing sequential data. Long-range dependencies in sequences are captured through the RNN's recurrent structure. By introducing gating mechanisms to control the information flow, some RNN variants, such as long short-term memory (LSTM) and the gated recurrent unit (GRU), are able to achieve higher accuracy; thus, those models are more widely used nowadays. However, such gated RNNs usually have high computational complexity and occupy large memory space to store network weights, making them hard to employ in resource-limited applications. Various efforts have been made to alleviate this problem, including quantization of weights and activations, network sparsity, conditional computing, tensor decomposition, etc. For network sparsity, the reduction in model size is usually achieved by first pruning the redundant connections in the network and then storing the resulting sparse weight matrices in special compression formats. The computational complexity can also be reduced by skipping operations involving zeros. When designing a hardware accelerator for a pruned RNN model, the unstructured sparsity pattern of the network weights causes an unbalanced workload during parallel processing, which is known as the load-imbalance problem. Han et al. introduced a load-balance pruning method to tackle this problem, but no details were given in their paper. Another approach was proposed by Wen et al., where group lasso regularization was introduced to force LSTM to learn structured sparse weights.
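As an illustration of the prune-then-compress approach and of where the load-imbalance problem comes from, the following minimal sketch (assuming magnitude pruning with an arbitrary threshold and CSR-style storage; it is not taken from the letter) counts the nonzeros each row of a pruned matrix contributes. When rows are assigned to parallel processing elements, uneven counts are exactly the unbalanced workload described above.

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
W[np.abs(W) < 1.0] = 0.0                 # magnitude pruning; the 1.0 threshold is arbitrary

values, col_idx, row_ptr = [], [], [0]   # CSR-style storage of the sparse matrix
for row in W:
    nz = np.nonzero(row)[0]
    values.extend(row[nz])
    col_idx.extend(nz)
    row_ptr.append(len(values))

work_per_row = np.diff(row_ptr)          # multiply-accumulates each row needs
print("nonzeros per row:", work_per_row) # unstructured pruning -> uneven per-row work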
PROPOSED SYSTEM :
We propose an efficient approach to tackle the load-imbalance problem and aggressively reduce the complexity of LSTM. First, a clipped gating method is presented to enable sparse activations. The sparsity can be easily controlled with an introduced regularizer, and the computations related to zero activations can be skipped. Although activation sparsity has been widely explored for accelerating convolutional neural networks, it has not been explored in LSTM before, since the original form of LSTM cannot produce zero activations. Then, a top-k pruning scheme is introduced, which leads to structured sparse weights and avoids the load-imbalance problem. A judicious grouping method is employed in top-k pruning to maintain network accuracy. Last, mixed quantization schemes are adopted, in which most of the multiplications can be replaced with shift operations. Experimental results show that the reduction in model size and the number of matrix operations can reach 32× and 18.5×, respectively, at a cost of less than 1% accuracy loss on the given task. The final model is highly compact and requires very few multiplication operations.
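A minimal sketch of the three ideas in this paragraph follows. The clipping threshold, group size, value of k, and power-of-two rounding used here are illustrative assumptions rather than the letter's exact formulation; the sketch only shows how clipped gates yield true zeros, how grouped top-k pruning gives every group the same number of nonzeros, and why power-of-two weights turn multiplications into shifts.

import numpy as np

def clipped_gate(x, theta=0.5):
    """Sigmoid gate clamped to exactly zero below a threshold, so the
    computations feeding zero gates can be skipped (theta is arbitrary here)."""
    g = 1.0 / (1.0 + np.exp(-x))
    return np.where(g < theta, 0.0, g)

def topk_prune(W, k, group_size=4):
    """Keep the k largest-magnitude weights inside every group of columns
    (group_size is assumed to divide the column count), so each group holds
    the same number of nonzeros and the parallel workload stays balanced."""
    Wp = W.copy()
    for g in range(0, W.shape[1], group_size):
        block = Wp[:, g:g + group_size]
        drop = np.argsort(np.abs(block), axis=1)[:, :group_size - k]
        np.put_along_axis(block, drop, 0.0, axis=1)
    return Wp

def quantize_pow2(W):
    """Round nonzero weights to signed powers of two; multiplying by such a
    weight reduces to a shift (plus a sign flip) in hardware."""
    sign = np.sign(W)
    exp = np.round(np.log2(np.abs(W) + 1e-12))
    return np.where(W == 0.0, 0.0, sign * np.exp2(exp))

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 8)) * 0.5
h = rng.standard_normal(8)

W_sparse = quantize_pow2(topk_prune(W, k=1))
gate = clipped_gate(W_sparse @ h)
print("nonzeros per row after grouped top-k:", np.count_nonzero(W_sparse, axis=1))
print("gate values (zeros can be skipped downstream):", gate)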
CONCLUSION :
This letter combines clipped gating, top-k pruning, and mixed quantization schemes to reduce the model size and the computational complexity of LSTM. It is shown that a 32× smaller model size and an 18.5× reduction in the number of matrix operations can be achieved. The final model occupies only about 800 kB of memory and requires very few multiplication operations, resulting in a highly compact LSTM model. The proposed compression scheme is well suited for dedicated hardware implementations of LSTM.