The parallel algorithm is based on a hybrid MPI+OpenMP+OpenCL parallelization for modern hybrid supercomputer architectures.
The computational domain is decomposed first among cluster nodes,
then among the MPI processes inside each node,
then among the OpenMP threads of each MPI process.
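The multilevel decomposition described above can be illustrated with a minimal sketch. It assumes the mesh nodes are simply numbered 0..N-1 and split into contiguous, nearly equal ranges at each level; a production code would more likely use a graph partitioner that minimizes communication between subdomains. The function names `split_range` and `decompose` are hypothetical and are not taken from the code described here.

```python
def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous, nearly equal sub-ranges."""
    base, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        # The first `extra` parts receive one additional element.
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def decompose(num_mesh_nodes, cluster_nodes, procs_per_node, threads_per_proc):
    """Three-level decomposition (illustrative assumption, not the real partitioner).

    Returns a dict mapping (cluster node, MPI process, OpenMP thread)
    to the half-open range of mesh nodes owned by that thread.
    """
    owner = {}
    # Level 1: cluster nodes.
    for n, (n_lo, n_hi) in enumerate(split_range(0, num_mesh_nodes, cluster_nodes)):
        # Level 2: MPI processes inside a cluster node.
        for p, (p_lo, p_hi) in enumerate(split_range(n_lo, n_hi, procs_per_node)):
            # Level 3: OpenMP threads of an MPI process.
            for t, t_range in enumerate(split_range(p_lo, p_hi, threads_per_proc)):
                owner[(n, p, t)] = t_range
    return owner


if __name__ == "__main__":
    parts = decompose(100, cluster_nodes=2, procs_per_node=2, threads_per_proc=3)
    for key in sorted(parts):
        print(key, parts[key])
```

Each level only subdivides the range it received from the level above, so the subdomains of one MPI process always lie inside the subdomain of its cluster node.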
Parallel performance in real applications with the EBR5 scheme and implicit time integration:
HPC4 of KIAE, flow around a rotor blade, IDDES, 22M nodes (left);
OpenMP performance on a 24-core CPU (Intel Xeon 8160), a round jet, IDDES, 1.6M nodes (center);
Lomonosov, a 3D cavity, DES, 160M nodes (right).

Parallel performance on hybrid systems in real applications with the EBR5 scheme, implicit time integration, and the IDDES turbulence modeling approach:
K60-GPU, nodes with two 16-core Intel Xeon Gold 6142 CPUs and four NVIDIA V100 GPUs, 80M-node mesh, flow around a turbine blade (left);
Lomonosov 2, nodes with one 14-core Intel Xeon E5-2697v3 CPU and one NVIDIA K40 GPU, 12.5M-node mesh, flow around a cylinder (right).

The code is highly portable and runs on multicore CPUs, including Intel, AMD, IBM, ARM, and Elbrus architectures;
manycore accelerators, such as Intel Xeon Phi; GPUs from various vendors, including NVIDIA, AMD, and Intel; and integrated CPU+GPU devices.
This heterogeneous MPI+OpenMP+OpenCL parallel implementation was funded by the Russian Science Foundation, project 19-11-00299.