ExaTENSOR: Numerical Tensor Algebra Virtual Processor for Large-Scale Heterogeneous HPC Systems
Dmitry I. Lyakh
ExaTENSOR is designed as an advanced software library for large-scale numerical tensor algebra workloads on heterogeneous HPC platforms, including HPC clusters and leadership-class HPC systems, with applications in electronic structure simulations, quantum circuit simulations, and generic data analytics. ExaTENSOR provides a set of user-level API functions as well as an internal programming language (TAProL) that can be used for performing basic tensor algebra operations, e.g., tensor contractions and tensor additions, on distributed HPC architectures equipped with accelerators. Although the immediate focus is on NVIDIA GPU accelerators, the ExaTENSOR design is based on hardware virtualization and the separation of the algorithm expression from hardware and system specifics, an approach inspired by prior works (CLUSTER, ACES III/IV). Essentially, the ExaTENSOR parallel runtime is a domain-specific virtual machine (virtual processor) capable of directly interpreting and executing basic tensor algebra operations in a platform-independent way. The ExaTENSOR hardware virtualization mechanism encapsulates the complexity of the node architecture and the system scale, thus, in principle, making it possible to run the same numerical tensor algebra workload efficiently on many different HPC platforms. Internally, the ExaTENSOR parallel runtime implements hierarchical task parallelism, properly adjusting the task granularity for each computing unit. ExaTENSOR supports accelerators in a plug-and-play way: a new hardware accelerator only requires a single-node library that implements the required tensor algebra primitives. This driver library is then integrated under the hardware-agnostic interface called TAL-SH (shared-memory tensor algebra layer).
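To make concrete what a single tensor algebra primitive involves, the following is a minimal, unoptimized C++ sketch of a rank-4 tensor contraction, D(a,b,i,j) += sum over c,k of L(a,c,i,k) * R(c,b,k,j); the tensor names, common index extent N, and storage layout are all hypothetical, not ExaTENSOR's actual implementation.

#include <cstddef>
#include <vector>

constexpr std::size_t N = 8;  // hypothetical common extent of every index

// Flatten a rank-4 index (i0,i1,i2,i3) into a linear offset, i3 fastest.
inline std::size_t idx(std::size_t i0, std::size_t i1,
                       std::size_t i2, std::size_t i3) {
  return ((i0 * N + i1) * N + i2) * N + i3;
}

// Reference loop nest for D(a,b,i,j) += sum_{c,k} L(a,c,i,k) * R(c,b,k,j);
// each vector holds N^4 doubles.
void contract(std::vector<double>& D, const std::vector<double>& L,
              const std::vector<double>& R) {
  for (std::size_t a = 0; a < N; ++a)
    for (std::size_t b = 0; b < N; ++b)
      for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
          double sum = 0.0;
          for (std::size_t c = 0; c < N; ++c)
            for (std::size_t k = 0; k < N; ++k)
              sum += L[idx(a, c, i, k)] * R[idx(c, b, k, j)];
          D[idx(a, b, i, j)] += sum;
        }
}

A runtime such as ExaTENSOR interprets an operation like this as a single primitive rather than explicit loops, decomposing it hierarchically across distributed memory and dispatching the node-level work to accelerated kernels.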
Algorithm for Long-Range Calculations in Classical Molecular Dynamics
In classical molecular dynamics simulations, the electrostatic (Coulomb) potential induces a global interaction between atoms. When calculated directly, this incurs a computational cost of O(N^2) for N atoms. A common fast algorithm for calculating electrostatic forces is the particle-mesh Ewald (PME) method, which derives its speed from the efficiency of the FFT for problems with high uniformity. The recent trend in hardware architectures toward increasing parallelism poses a challenge for these FFT-based algorithms. Therefore, alternative algorithms such as the fast multipole method (FMM) and the multilevel summation method (MSM) are being considered. If we are to transition to such alternatives, a common interface to these methods must be developed. The developers of NAMD and GROMACS are interested in this approach.
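For reference, here is a minimal C++ sketch of the direct pairwise evaluation whose O(N^2) cost motivates PME, FMM, and MSM; the Particle struct and function name are hypothetical, and Gaussian units are assumed so the pair energy is simply q_i q_j / r_ij.

#include <cmath>
#include <cstddef>
#include <vector>

struct Particle { double x, y, z, q; };  // position and charge (hypothetical)

// Total electrostatic energy, E = sum over pairs i<j of q_i q_j / r_ij.
double coulombEnergy(const std::vector<Particle>& p) {
  double e = 0.0;
  for (std::size_t i = 0; i < p.size(); ++i)
    for (std::size_t j = i + 1; j < p.size(); ++j) {  // all pairs: O(N^2)
      const double dx = p[i].x - p[j].x;
      const double dy = p[i].y - p[j].y;
      const double dz = p[i].z - p[j].z;
      e += p[i].q * p[j].q / std::sqrt(dx * dx + dy * dy + dz * dz);
    }
  return e;
}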
Current Multilevel Summation Solver Activities
We are developing a long-range electrostatic solver that is performance portable and targets HPC centers with hybrid CPU-GPU architectures. The solver uses the multilevel summation method (MSM), which is a local (nearest-neighbor communication), hierarchical, grid-based algorithm.
The current MSM development activities can be grouped into two broad categories. The first category is disentangling the MSM algorithm from the underlying HPC hardware. This is primarily accomplished by software abstraction layers between the MSM algorithm and the CPU and GPU compute devices. This design feature aids performance portability by minimizing the code modifications needed to accommodate various HPC CPU/GPU architectures and the ongoing changes in GPU hardware and the CUDA API.
The second activity is the implementation of a CUDA kernel for the direct stencil calculation on the grid hierarchies. This direct kernel uses large stencils with minimum dimensions of 13x13x13 and will explore the use of unified memory. Importantly, the direct kernel is not a simple function but a C++ class built by composition and derivation from an abstract base class, as sketched below. This design structure helps attain performance portability and permits rapid implementation of other types of large-stencil calculations.
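The following is a hypothetical C++ sketch, not the actual MSM source, of the composition-and-derivation pattern described above: a concrete stencil kernel derives from an abstract base class (and composes in its weights), so that other large-stencil calculations can slot into the same driver machinery.

#include <cstddef>

class StencilKernel {  // abstract interface shared by all grid stencil kernels
 public:
  virtual ~StencilKernel() = default;
  // Apply the stencil to an input grid of extents nx x ny x nz.
  virtual void apply(const double* in, double* out,
                     std::size_t nx, std::size_t ny, std::size_t nz) = 0;
};

class DirectKernel13 : public StencilKernel {  // 13x13x13 direct stencil
 public:
  explicit DirectKernel13(const double* weights) : w_(weights) {}
  void apply(const double* in, double* out,
             std::size_t nx, std::size_t ny, std::size_t nz) override;
 private:
  static constexpr int R = 6;  // 13 = 2*R + 1 stencil points per dimension
  const double* w_;  // stencil weights composed in rather than owned
};

void DirectKernel13::apply(const double* in, double* out,
                           std::size_t nx, std::size_t ny, std::size_t nz) {
  // Interior points only, for brevity; boundary handling omitted.
  for (std::size_t z = R; z + R < nz; ++z)
    for (std::size_t y = R; y + R < ny; ++y)
      for (std::size_t x = R; x + R < nx; ++x) {
        double acc = 0.0;
        for (int dz = -R; dz <= R; ++dz)
          for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx)
              acc += w_[((dz + R) * 13 + (dy + R)) * 13 + (dx + R)] *
                     in[((z + dz) * ny + (y + dy)) * nx + (x + dx)];
        out[(z * ny + y) * nx + x] = acc;
      }
}

In this pattern, only the apply body would differ for another large-stencil calculation, while a CUDA specialization would override apply with a device kernel launch behind the same interface.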
Ying Wai Li
OWL is a scientific software package for performing large-scale Monte Carlo simulations to study the finite-temperature properties of materials. Originally developed to implement a special Monte Carlo method called Wang-Landau sampling (hence its name OWL: Oak Ridge Wang-Landau), OWL now provides a collection of commonly used parallel, classical Monte Carlo algorithms suitable for running on high-performance computing (HPC) systems.
OWL is written in C++ with an object-oriented, modular software architecture that disentangles the implementation of the physical systems from that of the algorithms. This design not only allows for extension to various modern, parallel Monte Carlo algorithms; more importantly, it provides two modes for calculating the physical observables of the system in question: OWL can be run in a stand-alone mode for user-implemented model Hamiltonians, or in a "driver" mode that drives an external package as a library for energy calculations. This encourages reuse of community codes and is particularly useful when the energies are calculated by first-principles methods such as density functional theory. OWL adopts the heterogeneous "MPI+X" programming model. It has an MPI task manager that allocates compute resources to the different tasks as well as to the external library. While the Monte Carlo algorithms reside at the MPI level and scalability is achieved by employing multiple walkers, energy calculations are parallelized at both the MPI and the "X" (X = OpenMP, CUDA, etc.) levels.
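As a rough illustration of this MPI+X layout (a minimal sketch, not OWL's actual task manager), the ranks can be partitioned into walker groups with MPI_Comm_split; the Monte Carlo logic then operates at the walker level, while each group's sub-communicator is handed to the energy backend for further MPI and OpenMP/CUDA ("X") parallelization. The walker count and variable names here are hypothetical.

#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int world_rank, world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  const int num_walkers = 4;  // hypothetical; assumes it divides world_size
  const int ranks_per_walker = world_size / num_walkers;
  const int walker_id = world_rank / ranks_per_walker;

  MPI_Comm walker_comm;  // one sub-communicator per Monte Carlo walker
  MPI_Comm_split(MPI_COMM_WORLD, walker_id, world_rank, &walker_comm);

  // ... Monte Carlo moves proceed per walker; energy calculations receive
  // walker_comm and parallelize further with MPI plus OpenMP/CUDA ...

  MPI_Comm_free(&walker_comm);
  MPI_Finalize();
  return 0;
}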
As of today, OWL provides interfaces to Quantum Espresso and an ORNL-developed density functional theory code, Locally Self-Consistent Multiple Scattering (LSMS), to perform first-principles-based statistical mechanics simulations. OWL is under active development; support for, and interfaces to, other software packages are on the way. We intend to make OWL available to the community on GitHub, with a website that provides detailed build instructions and documentation.
LSMS is a first-principles, density functional theory based electronic structure code targeted mainly at materials applications. LSMS calculates the local spin density approximation to the diagonal part of the electron Green’s function. The electron/spin density and energy are easily determined once the Green’s function is known. Linear scaling with system size is achieved in LSMS by using several unique properties of the real-space multiple scattering approach to the Green’s function: 1) the Green’s function is “nearsighted”; therefore, each domain, i.e., atom, requires only information from nearby atoms in order to calculate the local value of the Green’s function; 2) the Green’s function is analytic; therefore, the required integral over electron energy levels can be analytically continued onto a contour in the complex plane, where the imaginary part of the energy further restricts its range; and 3) to generate the local electron/spin density, an atom needs only a small amount of information, the phase shifts, from those atoms within the range of the Green’s function. The very compact nature of the information that needs to be passed between processors and the high efficiency of the dense linear algebra algorithms employed to calculate the Green’s function are responsible for the superior performance of the LSMS code.
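Schematically, the underlying relation is the standard expression for the electron density in terms of the Green's function (spin indices suppressed),

n(\mathbf{r}) = -\frac{1}{\pi}\,\operatorname{Im}\int^{E_F} G(\mathbf{r},\mathbf{r};E)\,dE,

where the integral up to the Fermi energy E_F is deformed onto a contour in the complex energy plane; the imaginary part of the energy damps the Green's function and shortens its effective range.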
In addition to non-relativistic and scalar-relativistic calculations, LSMS allows the solution of the fully relativistic Dirac equation for electron scattering. Thus, all relativistic effects, including spin-orbit interactions, are accounted for, which allows the calculation of magnetocrystalline anisotropy energies and Dzyaloshinskii-Moriya antisymmetric exchange interactions. The energies of arbitrary non-collinear magnetic spin configurations can be calculated using self-consistently determined Lagrange multipliers that constrain the local magnetic order.
LSMS utilizes multiple levels of parallelism: 1) distributed-memory parallelism via MPI to parallelize over the atoms in the system; 2) on-node, shared-memory parallelism over both atoms and energy points on the integration contour; and 3) GPU acceleration of the multiple scattering matrix calculation when available.
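As a rough illustration of the second level (a hypothetical sketch, not the LSMS source), the energy points on the complex contour are independent and can be distributed across on-node threads with OpenMP; the dense linear algebra inside each per-energy solve is what the third level offloads to the GPU. The function names and the placeholder integrand are illustrative only.

#include <complex>
#include <cstddef>
#include <vector>

// Stand-in for the per-energy multiple scattering solve; in a real code the
// dense linear algebra inside this call is the GPU-accelerated part.
std::complex<double> solveAtEnergy(const std::complex<double>& e) {
  return 1.0 / e;  // placeholder integrand, not actual physics
}

// Contour quadrature parallelized over the independent energy points.
std::complex<double> integrateContour(
    const std::vector<std::complex<double>>& energies,
    const std::vector<std::complex<double>>& weights) {
  double re = 0.0, im = 0.0;
  #pragma omp parallel for reduction(+ : re, im)
  for (std::size_t i = 0; i < energies.size(); ++i) {
    const std::complex<double> term = weights[i] * solveAtEnergy(energies[i]);
    re += term.real();
    im += term.imag();
  }
  return {re, im};
}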
An additional level of parallelism is provided by the capability to perform Wang-Landau Monte Carlo sampling of magnetic and chemical order. This allows first-principles statistical physics calculations of magnetic and ordering phase transitions. By utilizing multiple Monte Carlo walkers, LSMS scalability is extended by multiple orders of magnitude.
To provide better scalability of the recently developed full-potential version of LSMS, new approaches to solving the Poisson equation are being explored to obtain electrostatic potentials from space-filling charge densities.
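Schematically, the task is to solve Poisson's equation (written here in Gaussian units) for the electrostatic potential \phi given the space-filling charge density \rho,

\nabla^2 \phi(\mathbf{r}) = -4\pi\,\rho(\mathbf{r}).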
Efforts are currently underway to release LSMS under an open-source license and to make it available to the wider scientific community.