FlinkCL: An OpenCL-based In-Memory Computing Architecture on Heterogeneous CPU-GPU Clusters for Big Data
Abstract:
Research on in-memory big data management and processing has been prompted by the growth in main memory capacity and the explosion of big data. By offering an efficient in-memory distributed execution model, existing in-memory cluster computing platforms such as Flink and Spark have proven outstanding for processing big data. This paper proposes FlinkCL, an in-memory computing architecture on heterogeneous CPU-GPU clusters based on OpenCL that enables Flink to utilize the massive parallel processing ability of GPUs. Our proposed architecture employs four techniques: a heterogeneous distributed abstract model (HDST), a Just-In-Time (JIT) compiling scheme, a hierarchical partial reduction (HPR) scheme and a heterogeneous task management strategy. Using FlinkCL, programmers only need to write Java code against simple interfaces; the Java code is compiled to OpenCL kernels and executed on CPUs and GPUs automatically. In the HDST, a novel memory mapping scheme avoids serialization and deserialization between Java Virtual Machine (JVM) objects and OpenCL structs. We have comprehensively evaluated FlinkCL with a set of representative workloads to show its effectiveness. Our results show that FlinkCL improves performance by up to 11× for computationally heavy algorithms while maintaining a minor performance improvement for an I/O-bound algorithm.
Existing System:
By offering an efficient in-memory distributed execution model, existing in-memory cluster computing platforms such as Flink and Spark have proven outstanding for processing big data. However, these platforms execute their operators on CPUs and do not natively exploit the massive parallel processing ability of GPUs available in heterogeneous CPU-GPU clusters.
Proposed System:
Our proposed architecture employs four techniques: a heterogeneous distributed abstract model (HDST), a Just-In-Time (JIT) compiling scheme, a hierarchical partial reduction (HPR) scheme and a heterogeneous task management strategy. Using FlinkCL, programmers only need to write Java code against simple interfaces; the Java code is compiled to OpenCL kernels and executed on CPUs and GPUs automatically. In the HDST, a novel memory mapping scheme avoids serialization and deserialization between Java Virtual Machine (JVM) objects and OpenCL structs. We have comprehensively evaluated FlinkCL with a set of representative workloads to show its effectiveness. Our results show that FlinkCL improves performance by up to 11× for computationally heavy algorithms while maintaining a minor performance improvement for an I/O-bound algorithm.
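To illustrate the programming model described above, the following Java sketch shows what a user-defined function might look like. The interface name HMapFunction, the hMap method, and the Point value class are illustrative assumptions rather than the actual FlinkCL API; the paper only states that users write plain Java against simple interfaces and that value objects are mapped onto OpenCL structs without serialization.

    // Hypothetical sketch of FlinkCL's user-facing programming model;
    // names below are illustrative assumptions, not the actual API.

    // A flat value class whose primitive fields can be mapped one-to-one
    // onto an OpenCL struct, so no JVM serialization/deserialization is
    // needed (the HDST memory mapping idea).
    class Point {
        public float x;
        public float y;
    }

    // A simple map interface of the kind FlinkCL might expose.
    interface HMapFunction<I, O> {
        O hMap(I in);
    }

    // A user-defined function in plain Java; FlinkCL's JIT compiler would
    // translate the body of hMap into an OpenCL kernel and execute it on a
    // CPU or GPU automatically.
    class ScalePoint implements HMapFunction<Point, Point> {
        @Override
        public Point hMap(Point in) {
            Point out = new Point();
            out.x = in.x * 2.0f;  // element-wise arithmetic translates
            out.y = in.y * 2.0f;  // directly into OpenCL kernel code
            return out;
        }
    }

Because the Point fields are primitives laid out flatly, a runtime can copy them between JVM memory and an OpenCL buffer with a direct memory mapping instead of Java object serialization, which is the saving the HDST targets.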
Conclusion:
GPUs have become efficient accelerators for HPC. This paper has proposed FlinkCL, which harnesses the high computational power of GPUs to accelerate in-memory cluster computing with an easy programming model. FlinkCL is based on four proposed core techniques: an HDST, a JIT compiling scheme, an HPR scheme and a heterogeneous task management strategy. By using these techniques, FlinkCL remains compatible with both the compile time and the runtime of the original Flink. To further improve the scalability of FlinkCL, a pipeline scheme similar to those introduced in prior work could be considered; such a pipeline would overlap the communication between cluster nodes with the computation within a node. In addition, by using an asynchronous execution model, transfers over PCIe and executions on GPUs can also be overlapped. In the current implementation, data in GPU memory must be moved into host memory before it can be sent over the network; a future research direction is to enable GPU-to-GPU communication using GPUDirect RDMA to further improve performance. Another optimization is a software cache scheme that caches intermediate data in GPUs to avoid unnecessary data transfers over PCIe. In the current design, the hMap and hReduce functions are compiled to kernels separately; our JIT compiler could fuse these kernels when possible, which would reduce kernel invocation overhead and avoid some data transfers over PCIe.
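To make the proposed kernel-fusion optimization concrete, the sketch below contrasts, in plain Java, an unfused map-then-reduce pass with a fused one: each element is mapped and immediately folded into the accumulator, so the intermediate mapped array is never materialized. The method names and the squared-sum example are assumptions for illustration; in FlinkCL the fusion would happen inside the OpenCL code emitted by the JIT compiler, where the intermediate buffer and the extra kernel launch are what get eliminated.

    // Illustrative sketch of hMap/hReduce kernel fusion (hypothetical names).
    public final class FusedMapReduceSketch {

        // Unfused: two passes and an intermediate array (conceptually, two
        // separate OpenCL kernels plus an intermediate device buffer).
        static float unfused(float[] input) {
            float[] mapped = new float[input.length];
            for (int i = 0; i < input.length; i++) {
                mapped[i] = input[i] * input[i];  // hMap kernel
            }
            float sum = 0.0f;
            for (int i = 0; i < mapped.length; i++) {
                sum += mapped[i];                 // hReduce kernel
            }
            return sum;
        }

        // Fused: one pass, no intermediate buffer, one kernel launch,
        // which is exactly the saving the conclusion describes.
        static float fused(float[] input) {
            float sum = 0.0f;
            for (int i = 0; i < input.length; i++) {
                float v = input[i] * input[i];    // map step
                sum += v;                         // immediate reduce step
            }
            return sum;
        }

        public static void main(String[] args) {
            float[] data = {1f, 2f, 3f, 4f};
            // Both variants compute 30.0; only the execution shape differs.
            System.out.println(unfused(data) + " == " + fused(data));
        }
    }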
REFERENCES:
[1] “Flink programming guide,” http://flink.apache.org/, 2016. Online; accessed 1-November-2016.
[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012, pp. 2–2.
[3] W. Yang, K. Li, Z. Mo, and K. Li, “Performance optimization using partitioned SpMV on GPUs and multicore CPUs,” IEEE Transactions on Computers, vol. 64, no. 9, pp. 2623–2636, 2015.
[4] Z. Zhong, V. Rychkov, and A. Lastovetsky, “Data partitioning on multicore and multi-GPU platforms using functional performance models,” IEEE Transactions on Computers, vol. 64, no. 9, pp. 2506–2518, 2015.
[5] K. Li, W. Yang, and K. Li, “Performance analysis and optimization for SpMV on GPU using probabilistic modeling,” IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 1, pp. 196–205, 2015.
[6] C. Chen, K. Li, A. Ouyang, Z. Tang, and K. Li, “GPU-accelerated parallel hierarchical extreme learning machine on Flink for big data,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 10, pp. 2740–2753, 2017.
[7] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A high-performance, portable implementation of the MPI message passing interface standard,” Parallel Computing, vol. 22, no. 6, pp. 789–828, 1996.
[8] L. Dagum and R. Menon, “OpenMP: An industry-standard API for shared-memory programming,” IEEE Computational Science and Engineering, vol. 5, no. 1, pp. 46–55, 1998.
[9] P. Carbone, G. Fóra, S. Ewen, S. Haridi, and K. Tzoumas, “Lightweight asynchronous snapshots for distributed dataflows,” arXiv preprint arXiv:1506.08603, 2015.
[10] C. Chen, K. Li, A. Ouyang, Z. Tang, and K. Li, “GFlink: An in-memory computing architecture on heterogeneous CPU-GPU clusters for big data,” in International Conference on Parallel Processing, 2016, pp. 542–551.
[11] C. Li, Y. Yang, Z. Lin, and H. Zhou, “Automatic data placement into GPU on-chip memory resources,” in IEEE/ACM International Symposium on Code Generation and Optimization, 2015, pp. 23–33.
[12] N. Fauzia and P. Sadayappan, “Characterizing and enhancing global memory data coalescing on GPUs,” in IEEE/ACM International Symposium on Code Generation and Optimization, 2015, pp. 12–22.
[13] T. Ben-Nun, E. Levy, A. Barak, and E. Rubin, “Memory access patterns: the missing piece of the multi-GPU puzzle,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015, p. 19.
[14] I. J. Sung, G. D. Liu, and W. M. W. Hwu, “DL: A data layout transformation system for heterogeneous computing,” in Innovative Parallel Computing, 2012, pp. 1–11.
[15] “Aparapi,” AMD Developer website, http://developer.amd.com/tools-and-sdks/opencl-zone/aparapi/, 2016. Online; accessed 1-April-2016.