FAULT TOLERANT STENCIL COMPUTATION ON CLOUD-BASED GPU SPOT INSTANCES
ABSTRACT
This paper describes a fault-tolerant framework for distributed stencil computation on cloud-based GPU clusters. It uses pipelining to overlap data movement with computation in the halo region, and it parallelises data movement within the GPUs. Instead of running stencil codes on traditional clusters and supercomputers, the computation is performed on the Amazon Web Services GPU cloud and utilizes its spot instances to improve cost-efficiency. The implementation is based on a low-cost fault-tolerance mechanism to handle the possible termination of spot instances. Coupled with a price-bidding module, our stencil framework optimizes not only for performance but also for cost. Experimental results show that our framework outperforms state-of-the-art solutions, achieving a peak of 25 TFLOPS for 2-D decomposition running on 512 nodes. We also show that the use of spot instances yields good cost-efficiency, increasing the average TFLOPS/USD from 132 to 360.
EXISTING SYSTEM:
In the area of distributed stencil computation, work has focused on hiding the performance gap between network communication and computation, a gap that is even more significant in cloud computing than on conventional clusters, where a high-performance network such as InfiniBand is available. Physis, a distributed stencil code generator developed by Maruyama et al., scales up to 256 GPUs on the TSUBAME2.0 supercomputer. Panda is another code generation framework for MPI+CUDA+CPU clusters; however, it does not have any communication latency hiding strategy. Jin et al. extend a temporal blocking stencil kernel to distributed GPUs. Shimokawabe et al. proposed a framework for executing stencil code on distributed GPU clusters for mesh applications. However, it distributes the workload across the cluster in only two dimensions, which limits scalability. Zhang et al. proposed an auto-generation and auto-tuning framework for GPU clusters of up to 32 nodes. It uses an approach similar to Physis to hide the communication delay. Nukada et al. proposed several methods for reducing communication delays in distributed stencil code; however, these depend heavily on InfiniBand rather than a generic network. Danalis et al. proposed the idea of structuring computation and communication into tiles and executing them in a pipelined fashion. This is a more fine-grained strategy than the state-of-the-art solutions for stencil, which hide communication delay simply with the concurrent computation of the interior region.
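To make the baseline strategy described above concrete, the sketch below shows the widely used pattern of updating the interior region on one CUDA stream while halo rows are staged to the host and exchanged with non-blocking MPI. It is a minimal sketch only: the 1-D decomposition, halo width of one row, neighbour ranks up/down, and the kernel names interior_kernel and boundary_kernel are illustrative assumptions, not details taken from the paper.

// Minimal sketch of interior/halo overlap for a 1-D domain decomposition.
// Assumptions (not from the paper): double-precision field of nx columns by
// ny rows including one halo row at each end, neighbours `up`/`down`, and
// application-provided kernels `interior_kernel` / `boundary_kernel`.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void interior_kernel(double *dst, const double *src, int nx, int ny);
__global__ void boundary_kernel(double *dst, const double *src, int nx, int ny);

void stencil_step(double *d_dst, double *d_src,
                  double *h_send_lo, double *h_send_hi,
                  double *h_recv_lo, double *h_recv_hi,
                  int nx, int ny, int up, int down,
                  cudaStream_t compute, cudaStream_t copy)
{
    const size_t halo_bytes = (size_t)nx * sizeof(double);
    MPI_Request reqs[4];

    // 1. Start the interior update; it touches no halo cells, so it can
    //    run concurrently with the exchange below.
    dim3 block(32, 8), grid((nx + 31) / 32, (ny + 7) / 8);
    interior_kernel<<<grid, block, 0, compute>>>(d_dst, d_src, nx, ny);

    // 2. Stage the boundary rows to the host on a separate stream and
    //    exchange them with the neighbouring ranks using non-blocking MPI.
    cudaMemcpyAsync(h_send_lo, d_src + nx,                      halo_bytes,
                    cudaMemcpyDeviceToHost, copy);
    cudaMemcpyAsync(h_send_hi, d_src + (size_t)(ny - 2) * nx,   halo_bytes,
                    cudaMemcpyDeviceToHost, copy);
    cudaStreamSynchronize(copy);

    MPI_Irecv(h_recv_lo, nx, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(h_recv_hi, nx, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(h_send_lo, nx, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(h_send_hi, nx, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    // 3. Push the received halos back to the device and update the cells
    //    that depend on them; stream order guarantees the copies finish first.
    cudaMemcpyAsync(d_src,                          h_recv_lo, halo_bytes,
                    cudaMemcpyHostToDevice, copy);
    cudaMemcpyAsync(d_src + (size_t)(ny - 1) * nx,  h_recv_hi, halo_bytes,
                    cudaMemcpyHostToDevice, copy);
    boundary_kernel<<<grid, block, 0, copy>>>(d_dst, d_src, nx, ny);

    cudaStreamSynchronize(compute);
    cudaStreamSynchronize(copy);
}

In this coarse pattern the whole halo exchange is one monolithic stage; the tiled, pipelined approach attributed to Danalis et al. above breaks that stage into smaller pieces so that packing, transfer and unpacking of one tile overlap with those of the next.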
PROPOSED SYSTEM:
To summarize, the contributions of this paper include:
- A pipelined implementation of stencil codes that optimally decouples the dependencies between communication and computation;
- A complete code for large-scale distributed stencil computation on the cloud that achieves good performance;
- A low-cost checkpointing mechanism that achieves fault tolerance on cloud spot instances (see the sketch after this list);
- An optimized bidding strategy for spot instances that lowers the cost and improves stability.
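As a rough illustration of the checkpointing contribution, the sketch below snapshots the device state to local storage every few iterations on a stream separate from the compute stream, so that a replacement instance can resume from the last completed iteration after a spot termination. The file layout, the checkpoint interval, and the out-of-band upload to durable storage are our own illustrative assumptions, not the paper's actual mechanism.

// Illustrative periodic checkpoint for a spot-instance run (not the paper's
// exact mechanism). The device field is copied to pinned host memory on a
// dedicated stream, then written atomically so a restart never sees a
// half-written file. Uploading the file to durable storage (e.g. an object
// store) is assumed to happen out-of-band.
#include <cuda_runtime.h>
#include <cstdio>

void checkpoint(const double *d_field, double *h_staging, size_t n_cells,
                int iteration, cudaStream_t copy_stream)
{
    // Asynchronous copy on its own stream: the compute stream keeps running.
    cudaMemcpyAsync(h_staging, d_field, n_cells * sizeof(double),
                    cudaMemcpyDeviceToHost, copy_stream);
    cudaStreamSynchronize(copy_stream);

    // Write to a temporary file, then rename: rename() is atomic on POSIX,
    // so the visible checkpoint is always a complete one.
    FILE *f = fopen("checkpoint.tmp", "wb");
    if (!f) { perror("checkpoint"); return; }
    fwrite(&iteration, sizeof(int), 1, f);
    fwrite(h_staging, sizeof(double), n_cells, f);
    fclose(f);
    rename("checkpoint.tmp", "checkpoint.bin");
}

// In the main loop: checkpoint every K iterations, where K is tuned so the
// expected re-computation after a spot termination stays small:
//   if (iter % K == 0) checkpoint(d_field, h_staging, n_cells, iter, copy_stream);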
CONCLUSION
In this paper, we proposed and implemented a fault-tolerant stencil framework for cloud-based GPU clusters using spot instances that takes advantage of pipelining to hide communication overhead. With pipelining, the dependencies between stencil computation and data movement are decoupled, and hence execution proceeds in parallel regardless of those dependencies. The only issue affecting performance is the balance between stages of the pipeline: a pipeline is only as fast as its slowest component. We investigated the relative as well as scaling relationships between pipeline stages, then modelled and tuned the pipeline to achieve the best possible balance. Our experiments on up to 512 GPUs in the cloud yielded a 1.75× speedup over Physis, a state-of-the-art solution. We also showed that our strategy, when applied to real-world application stencils, was up to 1.51× faster than Physis.

With a checkpointing fault-tolerance mechanism, our stencil code is able to take advantage of AWS spot instances. These are much cheaper than traditional reserved instances, but come with the risk of premature termination. Our checkpointing mechanism is able to recover the computation progress with a different set of resources should spot instances be terminated. Moreover, it does not slow down the regular computation when termination does not happen.

We also investigated bidding strategies for spot instances, and combined real-time price changes and historical price data into a statistical strategy that ensures cost-efficiency as well as stability in most situations.
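The concrete bidding rule is not spelled out above; purely as an illustrative sketch of how real-time and historical prices might be combined statistically, the host-side routine below sets the bid to the larger of the current spot price plus a safety margin and a high quantile of the recent price history, capped at the on-demand price. The function name, the quantile, and the margin are all assumptions for illustration, not the paper's strategy.

// Illustrative spot-bidding rule (an assumption, not the paper's strategy):
// bid = min(on_demand, max(current + margin, q-th quantile of recent prices)).
// Out-bidding typical historical spikes keeps the instance alive (stability),
// while the on-demand cap bounds the worst-case cost (cost-efficiency).
#include <algorithm>
#include <vector>

double choose_bid(double current_price,
                  std::vector<double> history,   // recent spot prices, USD/hour
                  double on_demand_price,
                  double quantile = 0.95,        // fraction of history to out-bid
                  double margin   = 0.01)        // cushion above the live price
{
    double historical_bid = current_price;
    if (!history.empty()) {
        std::sort(history.begin(), history.end());
        size_t idx = static_cast<size_t>(quantile * (history.size() - 1));
        historical_bid = history[idx];
    }
    double bid = std::max(current_price + margin, historical_bid);
    return std::min(bid, on_demand_price);       // never bid above on-demand
}

Under such a rule, only a price spike above the chosen historical quantile terminates the instance, which is exactly the event the checkpointing mechanism is designed to absorb.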
REFERENCES
[1] S. Jeschke, D. Cline, and P. Wonka, "A GPU Laplacian solver for diffusion curves and Poisson image editing," in ACM Transactions on Graphics (TOG), vol. 28, no. 5. ACM, 2009, p. 116.
[2] G. Barozzi, C. Bussi, and M. Corticelli, "A fast Cartesian scheme for unsteady heat diffusion on irregular domains," Numerical Heat Transfer, Part B: Fundamentals, vol. 46, no. 1, pp. 59–77, 2004.
[3] P. Bailey, J. Myre, S. Walsh, D. Lilja, and M. Saar, "Accelerating lattice Boltzmann fluid flow simulations using graphics processors," in Parallel Processing, 2009. ICPP '09. International Conference on, Sept. 2009, pp. 550–557.
[4] R. Abdelkhalek, H. Calandra, O. Coulaud, J. Roman, and G. Latu, "Fast seismic modeling and reverse time migration on a GPU cluster," in High Performance Computing Simulation, 2009. HPCS '09. International Conference on, June 2009, pp. 36–43.
[5] E. Elsen, P. LeGresley, and E. Darve, "Large calculation of the flow over a hypersonic vehicle using a GPU," Journal of Computational Physics, vol. 227, no. 24, pp. 10148–10161, 2008.
[6] S. Venkatasubramanian and R. W. Vuduc, "Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems," in Proceedings of the 23rd International Conference on Supercomputing, ser. ICS '09. New York, NY, USA: ACM, 2009, pp. 244–255.
[7] M. Griebel and P. Zaspel, "A multi-GPU accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations," Computer Science – Research and Development, vol. 25, no. 1-2, pp. 65–73, 2010.
[8] S. Donath, C. Feichtinger, T. Pohl, J. Götz, and U. Rüde, "A parallel free surface lattice Boltzmann method for large-scale applications," Parallel Computational Fluid Dynamics: Recent Advances and Future Directions, p. 318, 2010.
[9] A. Danalis, K.-Y. Kim, L. Pollock, and M. Swany, "Transformations to parallel codes for communication-computation overlap," in Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, 2005, p. 58.
[10] N. Maruyama, T. Nomura, K. Sato, and S. Matsuoka, "Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers," in High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for. IEEE, 2011, pp. 1–12.