Periodic Reporting for period 2 - DAPP (Data-centric Parallel Programming)

Teaser

Modern computing hardware is facing a fundamental crisis: Moore’s Law and Dennard scaling, the two trends that have enabled continuous improvements in hardware performance and energy efficiency, have come to an end. Computing architectures have to be specialized to...

Summary

Modern computing hardware is facing a fundamental crisis: Moore’s Law and Dennard scaling, the two trends that have enabled continuous improvements in hardware performance and energy efficiency, have come to an end. Computing architectures have to be specialized to the workload in order to use transistors and computational power efficiently. Thus, accelerators optimized for different workloads (e.g., throughput-oriented vs. latency-critical tasks) are introduced in the form of GPUs, FPGAs, and more specialized units such as neural network accelerators (TPUs or tensor cores). Traditional CPUs continue to focus on latency-critical workloads with massive out-of-order processing capabilities.

Designing this hardware is challenging but doable, as industry has demonstrated. What remains a large open problem is how to program it. The DAPP project squarely addresses that challenge by developing a data-centric programming language and intermediate representation. DAPP also innovates by splitting the roles of the domain programmer (or scientist) and the performance engineer, defining a clear abstraction between the two.

The project has already demonstrated its applicability by compiling near-optimal programs for CPUs, GPUs, and FPGAs from the same data-centric code base. We will continue to refine the user-facing language as well as the optimization language for performance engineers; these refinements will also affect the internal graph representation. The project aims to deliver a workable implementation that is released to the community for further research and development.

Work performed

The effort in the DAPP project is focused on data-centric programming models, both general-purpose and for specific algorithms. Our main contribution is an Intermediate Representation (IR) that can be generated from different high-level languages and frameworks (Python, MATLAB, TensorFlow). The IR is graph-based and is organized around data movement, specifying dataflow and control-flow explicitly as part of the program. The Stateful DataFlow multiGraph (SDFG) combines fine-grained data dependencies with high-level control flow, and is both expressive and amenable to high-level and reusable program transformations, such as tiling, vectorization, and double buffering. These transformations are then applied to the SDFG in an interactive process (guided by a performance engineer), using extensible pattern matching and graph rewriting. SDFGs successfully map to CPUs, GPUs, and FPGAs, on a wide variety of application motifs: from fundamental computational kernels, through polyhedral applications, to graph analytics. The representation is both expressive and performant, allowing domain scientists to develop applications that can be tuned to approach peak hardware performance without modifying the original scientific code.
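To make this workflow concrete, below is a minimal sketch of what a data-centric Python program and an interactively applied transformation could look like. It assumes the open-source DaCe Python frontend and its transformation API; the specific names used here (dace.program, to_sdfg, MapTiling) are assumptions and may differ between versions.

```python
# Minimal sketch, assuming the DaCe Python frontend; API names may differ.
import numpy as np
import dace
from dace.transformation.dataflow import MapTiling

N = dace.symbol('N')  # symbolic size, bound when the program is called

@dace.program
def saxpy(a: dace.float64, x: dace.float64[N], y: dace.float64[N]):
    # The domain scientist writes NumPy-like code; reads and writes of
    # x and y become explicit dataflow in the resulting SDFG.
    y[:] = a * x + y

# The performance engineer manipulates the SDFG, not the source above.
sdfg = saxpy.to_sdfg()
sdfg.apply_transformations(MapTiling)  # e.g., tile the parallel map

# The original scientific code is untouched; only its IR was transformed.
x = np.random.rand(1024)
y = np.random.rand(1024)
sdfg(a=2.0, x=x, y=y, N=1024)
```

The same SDFG can subsequently be retargeted, e.g., mapped to a GPU or an FPGA, by applying further transformations rather than by rewriting the program.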

To provide a complete picture of the importance of data locality and data-centric programming, we summarized, together with other leading research groups, the collective knowledge on the topic. The resulting review paper, titled “Trends in Data Locality Abstractions for HPC Systems”, was published in IEEE TPDS. For high-performance programming of FPGAs, we wrote a review and in-depth analysis of the techniques used to optimize HPC applications for data movement on dataflow architectures, consolidating the insights and the state of the art in High-Level Synthesis (HLS) for FPGAs.

While general-purpose programming interfaces and IRs can accelerate a wide variety of regular applications, irregular applications, such as graph algorithms, pose additional challenges that we investigated. In particular, we constructed graph representations that incur reduced data movement, such as the Log(Graph) succinct representation; representations that are amenable to efficient processing (e.g., using vectorization) for CPUs and GPUs with SlimSell; and efficient representations for streaming graph algorithms on FPGAs, namely a substream-centric representation for the Maximum Weighted Matching algorithm.
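As a simplified, self-contained illustration of the general idea behind such representations (not the actual Log(Graph) or SlimSell encodings, which are more involved), the sketch below stores adjacency in a compressed-sparse-row layout: one contiguous neighbor array plus per-vertex offsets keeps traversals streaming-friendly and avoids pointer-chasing data movement. The toy graph and names are hypothetical.

```python
# Illustrative sketch only: a compressed-sparse-row (CSR) adjacency layout,
# the common baseline that succinct schemes compress further.
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 3)]  # hypothetical toy graph
n = 4

# Build CSR: a single contiguous neighbor array plus per-vertex offsets.
order = sorted(edges)
neighbors = np.array([v for _, v in order], dtype=np.int32)
offsets = np.zeros(n + 1, dtype=np.int64)
for u, _ in order:
    offsets[u + 1] += 1
offsets = np.cumsum(offsets)

def out_neighbors(u):
    # Neighbors of u occupy one contiguous slice, so traversal is a stream.
    return neighbors[offsets[u]:offsets[u + 1]]

print(out_neighbors(2))  # -> [0 3]
```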

Optimizing communication performance is imperative for large-scale computing, as overheads limit the strong scalability of parallel applications. Today’s network cards also contain rather powerful processors optimized for data movement. As part of DAPP, we developed sPIN, a portable programming model to offload simple packet-processing functions to the network card. The model provides both the simplicity of accelerator languages such as CUDA and the flexibility of directly controlling the network card to optimize collective communication operations and system services by bypassing the CPU.
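The following is a conceptual mock-up only: real sPIN handlers are short functions compiled for and executed on the network card’s packet-processing cores, whereas this Python sketch merely illustrates the per-packet handler idea with a streaming reduction. All names and the packet format are hypothetical.

```python
# Conceptual mock-up of a per-packet handler (hypothetical; sPIN handlers
# actually run on the NIC, not on the host, and are not written in Python).
import struct

result = 0.0

def payload_handler(packet: bytes) -> None:
    """Invoked once per arriving packet; accumulates a reduction so the
    host CPU never has to touch the raw packet stream."""
    global result
    (value,) = struct.unpack('<d', packet[:8])  # hypothetical 8-byte payload
    result += value

# Simulate a stream of packets arriving at the network card.
for v in (1.0, 2.5, 3.5):
    payload_handler(struct.pack('<d', v))
print(result)  # 7.0
```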

Lastly, we leveraged Machine Learning and Deep Learning techniques to statically analyze and comprehend code semantics. In particular, we proposed Neural Code Comprehension, a novel processing pipeline that learns code semantics robustly, based on an embedding space that we call inst2vec. The pipeline analyzes the dataflow of an application (using an IR), and we apply it to a variety of program analysis tasks, including algorithm classification, hardware mapping (i.e., whether a program will run faster on a CPU or a GPU), and prediction of thread workload coarsening factors, setting a new state of the art in accuracy for two of the three tasks.
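A toy, self-contained sketch of such an embedding-based pipeline is shown below. The vocabulary, random vectors, and mean-pooling classifier are stand-ins: the actual inst2vec embeddings are trained from dataflow contexts of IR statements, and the published pipeline uses recurrent neural networks rather than the untrained linear head shown here.

```python
# Toy sketch of an embedding-based code-comprehension pipeline (stand-in
# vocabulary, random embeddings, and an untrained classifier head).
import numpy as np

rng = np.random.default_rng(0)

# 1) Map canonicalized IR statements to embedding vectors.
vocab = {'load': 0, 'fmul': 1, 'fadd': 2, 'store': 3, 'br': 4}
embeddings = rng.normal(size=(len(vocab), 8))  # stand-in for inst2vec vectors

def embed(ir_statements):
    ids = [vocab[s] for s in ir_statements]
    return embeddings[ids]

# 2) Summarize a code region (mean pooling here, instead of an RNN) and
# 3) feed it to a classifier head, e.g., CPU-vs-GPU hardware mapping.
W = rng.normal(size=(8, 2))  # untrained toy classifier weights

def predict_device(ir_statements):
    features = embed(ir_statements).mean(axis=0)
    logits = features @ W
    return ['CPU', 'GPU'][int(np.argmax(logits))]

print(predict_device(['load', 'fmul', 'fadd', 'store']))
```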

Final results

The novel data-centric intermediate representation we designed as part of DAPP inherently maps to different processing architectures, including CPUs, GPUs, and FPGAs. Results are competitive with expert-tuned libraries from Intel and NVIDIA, approaching peak hardware performance, and up to five orders of magnitude faster than naive FPGA code written with High-Level Synthesis, all from the same representation. For the first time, we were able to compile an entire HPC benchmark suite (Polybench), consisting of 30 applications, to an FPGA and produce correct results. Using data-centric transformations, we were able to automatically and easily map programs to GPUs and FPGAs as accelerators; in the former case we outperform a project designed specifically to map CPU applications to GPUs. We believe that the SDFG data-centric IR is the enabler of these results, and we intend to further develop the representation for mapping to distributed systems. We also intend to enhance the interactive part of SDFG manipulation by the performance engineer, both by providing a clear user interface and by guiding the optimization process using Machine Learning and Neural Code Comprehension.

Website & more info

More info: https://spcl.inf.ethz.ch/DAPP/.