ExaTENSOR: Numerical Tensor Algebra Virtual Processor for Large-Scale Heterogeneous HPC systems
Dmitry I. Lyakh
ExaTENSOR is designed as an advanced software library for numerical tensor algebra workloads on large-scale heterogeneous HPC platforms, including HPC clusters and leadership HPC systems, with applications in electronic structure simulations, quantum circuit simulations, and generic data analytics. ExaTENSOR provides a set of user-level API functions as well as an internal programming language (TAProL), which can be used to perform basic tensor algebra operations, e.g., tensor contractions and tensor additions, on distributed HPC architectures equipped with accelerators. Although the immediate focus was specifically on NVIDIA GPU accelerators, the ExaTENSOR design is based on hardware virtualization and on separating the algorithm expression from hardware and system specifics, an approach inspired by prior work (CLUSTER, ACES III/IV). Essentially, the ExaTENSOR parallel runtime is a domain-specific virtual machine (virtual processor) capable of directly interpreting and executing basic tensor algebra operations in a platform-independent way. The ExaTENSOR hardware virtualization mechanism encapsulates the complexity of the node architecture and the system scale, thus, in principle, making it possible to run the same numerical tensor algebra workload efficiently on many different HPC platforms. Internally, the ExaTENSOR parallel runtime implements hierarchical task parallelism, adjusting the task granularity for each computing unit. ExaTENSOR supports accelerators in a plug-and-play way: a new hardware accelerator only requires a single-node library that implements the required tensor algebra primitives. This driver library is then integrated under the hardware-agnostic interface called TAL-SH (shared-memory tensor algebra layer).
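The plug-and-play idea described above can be illustrated with a small C++ sketch. This is not the ExaTENSOR or TAL-SH API: the names `Matrix`, `TensorBackend`, `CpuBackend`, and `run_contraction` are hypothetical, and the only primitive shown is a rank-2 tensor contraction (a matrix product). The sketch assumes that a device-specific library only needs to implement an abstract interface, while the calling code stays hardware agnostic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense rank-2 tensor stored row-major (not an ExaTENSOR type).
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Hypothetical hardware-agnostic interface: each device supplies its own
// implementation of the required tensor algebra primitives.
class TensorBackend {
public:
    virtual ~TensorBackend() = default;
    // Accumulates C(i,j) += sum over k of A(i,k) * B(k,j).
    virtual void contract(const Matrix& a, const Matrix& b, Matrix& c) const = 0;
};

// Reference CPU implementation; an accelerator library would provide
// another class derived from TensorBackend.
class CpuBackend : public TensorBackend {
public:
    void contract(const Matrix& a, const Matrix& b, Matrix& c) const override {
        assert(a.cols == b.rows && c.rows == a.rows && c.cols == b.cols);
        for (std::size_t i = 0; i < a.rows; ++i)
            for (std::size_t k = 0; k < a.cols; ++k)
                for (std::size_t j = 0; j < b.cols; ++j)
                    c.at(i, j) += a.at(i, k) * b.at(k, j);
    }
};

// Caller code depends only on the interface, never on a concrete device.
void run_contraction(const TensorBackend& backend,
                     const Matrix& a, const Matrix& b, Matrix& c) {
    backend.contract(a, b, c);
}
```

With this layout, supporting a new device would mean adding one more class derived from `TensorBackend`; `run_contraction` and everything above it would stay unchanged.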
Algorithm for Long Range Calculations in Classical Molecular Dynamics
In classical molecular dynamics simulations, the electrostatic (Coulomb) potential induces a global interaction between atoms. Calculated directly, this interaction has a computational cost of O(N^2) for N atoms. A common fast algorithm for calculating electrostatic forces is the particle-mesh Ewald (PME) method, which derives its speed from the efficiency of FFT for problems with high uniformity. The recent trend in hardware architectures toward increasing parallelism poses a challenge for these FFT-based algorithms. Therefore, alternative algorithms such as the fast multipole method (FMM) and the multilevel summation method (MSM) are being considered. If we are to transition to such alternatives, a common interface between them must be developed. The developers of NAMD and GROMACS are interested in this approach.
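To make the O(N^2) cost concrete, here is a minimal C++ sketch of the direct pairwise sum. It assumes open (non-periodic) boundaries and units in which the Coulomb constant equals 1. The names `Particle` and `direct_coulomb_energy` are hypothetical and are not taken from NAMD, GROMACS, or any other package.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical point charge: position (x, y, z) and charge q.
struct Particle {
    double x, y, z, q;
};

// Direct Coulomb energy, sum over pairs i < j of q_i * q_j / r_ij.
// The double loop visits N * (N - 1) / 2 pairs, hence the O(N^2) cost.
double direct_coulomb_energy(const std::vector<Particle>& p) {
    double energy = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            const double r2 = dx * dx + dy * dy + dz * dz;
            assert(r2 > 0.0);  // coincident particles would give a division by zero
            energy += p[i].q * p[j].q / std::sqrt(r2);
        }
    }
    return energy;
}
```

Doubling the number of atoms roughly quadruples the number of pair evaluations in this loop, which is the scaling that PME, FMM, and MSM are meant to avoid.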
Current Multilevel Summation Level Solver Activities
We are developing a long-range electrostatic solver that is performance portable and targets HPC centers with hybrid CPU-GPU architectures. The solver uses the Multilevel Summation Method (MSM), which is a local (nearest-neighbor communication), hierarchical, grid-based algorithm.
The current MSM development activities can be grouped into two broad categories. The first category is disentangling the MSM algorithm from the underlying HPC hardware. This is primarily accomplished by software abstraction layers between the MSM algorithm and the CPU and GPU compute devices. This design feature helps performance portability by minimizing the number of code modifications needed for various HPC CPU/GPU architectures and for the ongoing improvements in the GPU hardware and CUDA API.
The second activity is the implementation of a CUDA direct kernel for the direct stencil calculation on the grid hierarchies. This direct kernel uses large stencils with minimum dimensions of 13x13x13 and will explore the use of unified memory. Importantly, the direct kernel is not a simple function but a C++ class that is designed by composition and derivation from an abstract base class. This design structure helps attain performance portability and permits rapid implementation of other types of large stencil calculations.
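The class-based design described above can be sketched in plain host-side C++. This is only an illustration under stated assumptions, not the project's CUDA kernel: the names `Grid3D`, `StencilKernel`, and `BoxStencil` are hypothetical, out-of-range grid points are assumed to read as zero, and the sketch shows only the derivation part of the design. A radius of 6 gives the 13x13x13 stencil size mentioned in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 3D grid of scalars; reads outside the grid return zero.
class Grid3D {
public:
    Grid3D(int nx, int ny, int nz)
        : nx_(nx), ny_(ny), nz_(nz),
          v_(static_cast<std::size_t>(nx) * ny * nz, 0.0) {
        assert(nx > 0 && ny > 0 && nz > 0);
    }
    int nx() const { return nx_; }
    int ny() const { return ny_; }
    int nz() const { return nz_; }
    double get(int i, int j, int k) const {
        return inside(i, j, k) ? v_[index(i, j, k)] : 0.0;
    }
    void set(int i, int j, int k, double value) {
        assert(inside(i, j, k));
        v_[index(i, j, k)] = value;
    }
private:
    bool inside(int i, int j, int k) const {
        return i >= 0 && i < nx_ && j >= 0 && j < ny_ && k >= 0 && k < nz_;
    }
    std::size_t index(int i, int j, int k) const {
        return (static_cast<std::size_t>(i) * ny_ + j) * nz_ + k;
    }
    int nx_, ny_, nz_;
    std::vector<double> v_;
};

// Abstract base class: owns the grid traversal, defers the weights.
class StencilKernel {
public:
    explicit StencilKernel(int radius) : radius_(radius) { assert(radius >= 0); }
    virtual ~StencilKernel() = default;
    int width() const { return 2 * radius_ + 1; }
    // Applies the stencil at every grid point and returns the result grid.
    Grid3D apply(const Grid3D& in) const {
        Grid3D out(in.nx(), in.ny(), in.nz());
        for (int i = 0; i < in.nx(); ++i)
            for (int j = 0; j < in.ny(); ++j)
                for (int k = 0; k < in.nz(); ++k) {
                    double sum = 0.0;
                    for (int dx = -radius_; dx <= radius_; ++dx)
                        for (int dy = -radius_; dy <= radius_; ++dy)
                            for (int dz = -radius_; dz <= radius_; ++dz)
                                sum += weight(dx, dy, dz) *
                                       in.get(i + dx, j + dy, k + dz);
                    out.set(i, j, k, sum);
                }
        return out;
    }
protected:
    // Derived classes define the stencil weight at offset (dx, dy, dz).
    virtual double weight(int dx, int dy, int dz) const = 0;
private:
    int radius_;
};

// Simplest concrete stencil: every offset has weight 1.
class BoxStencil : public StencilKernel {
public:
    using StencilKernel::StencilKernel;
protected:
    double weight(int, int, int) const override { return 1.0; }
};
```

Another large stencil calculation would, in this sketch, only need a new derived class that overrides `weight`; the traversal in `apply` is reused as is.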
Stencil Computation for Weather Prediction