Tsubouchi, Miyuki
Normalized to: Tsubouchi, M.
2 article(s) in total. 12 co-authors, from 1 to 2 common article(s). Median position in authors list is 7.0.
[1]
oai:arXiv.org:1907.02289 [pdf] - 1910783
Implementation and Performance of Barnes-Hut N-body algorithm on
Extreme-scale Heterogeneous Many-core Architectures
Iwasawa, Masaki;
Namekata, Daisuke;
Sakamoto, Ryo;
Nakamura, Takashi;
Kimura, Yasuyuki;
Nitadori, Keigo;
Wang, Long;
Tsubouchi, Miyuki;
Makino, Jun;
Liu, Zhao;
Fu, Haohuan;
Yang, Guangwen
Submitted: 2019-07-04
In this paper, we report the implementation and measured performance of our
extreme-scale global simulation code on Sunway TaihuLight and two PEZY-SC2
systems: Shoubu System B and Gyoukou. The numerical algorithm is the parallel
Barnes-Hut tree algorithm, which has been used in many large-scale
astrophysical particle-based simulations. Our implementation is based on our
FDPS framework. However, the extremely large numbers of cores of the systems
used (10M on TaihuLight and 16M on Gyoukou) and their relatively poor memory
and network bandwidth pose new challenges. We describe the new algorithms
introduced to achieve high efficiency on machines with low memory bandwidth.
The measured performance is 47.9 PF, 10.6 PF, and 1.01 PF on TaihuLight, Gyoukou,
and Shoubu System B, respectively (efficiencies of 40%, 23.5%, and 35.5%). The current code is
developed for the simulation of planetary rings, but most of the new algorithms
are useful for other simulations, and are now available in the FDPS framework.
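The performance and efficiency figures quoted above can be related with a short calculation. The following is a minimal sketch, assuming that "efficiency" means measured performance divided by theoretical peak (the abstract does not define it explicitly). The function name `implied_peak_pf` is hypothetical, and the results are only approximations derived from the rounded figures in the abstract, not official system specifications.

```python
# Figures quoted in the abstract: (measured performance in PF, efficiency as a fraction).
QUOTED = {
    "TaihuLight": (47.9, 0.40),
    "Gyoukou": (10.6, 0.235),
    "Shoubu System B": (1.01, 0.355),
}


def implied_peak_pf(measured_pf: float, efficiency: float) -> float:
    """Return the peak (in PF) implied by measured = efficiency * peak.

    Assumption: efficiency is the ratio of measured to theoretical peak.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return measured_pf / efficiency


if __name__ == "__main__":
    for system, (pf, eff) in QUOTED.items():
        print(f"{system}: roughly {implied_peak_pf(pf, eff):.1f} PF implied peak")
```

Because the quoted efficiencies are rounded, the implied peaks should be read as order-of-magnitude checks only.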
[2]
oai:arXiv.org:1907.02290 [pdf] - 2046222
Accelerated FDPS --- Algorithms to Use Accelerators with FDPS
Submitted: 2019-07-04
In this paper, we describe the algorithms we implemented in FDPS to make
efficient use of accelerator hardware such as GPGPUs. We have developed FDPS to
make it possible for many researchers to develop their own high-performance
parallel particle-based simulation programs without spending a large amount of
time on parallelization and performance tuning. The basic idea of FDPS is to
provide a high-performance implementation of parallel algorithms for
particle-based simulations in a "generic" form, so that researchers can define
their own particle data structure and interparticle interaction functions and
supply them to FDPS. FDPS, compiled with the user-supplied data type and interaction
function, provides all necessary functions for parallelization, and using those
functions researchers can write their programs as though they were writing
a simple non-parallel program. It has been possible to use accelerators with
FDPS by writing an interaction function that uses the accelerator. However,
the efficiency was limited by the latency and bandwidth of communication
between the CPU and the accelerator, and also by the mismatch between the
degree of parallelism available in the interaction function and that of the
hardware. We have modified the interface of the user-provided
interaction function so that accelerators are used more efficiently. We also
implemented new techniques that reduce the amount of work on the CPU side
and the amount of communication between the CPU and accelerators. We have measured the
performance of N-body simulations on a system with an NVIDIA Volta GPGPU using
FDPS, and the achieved performance is around 27% of the theoretical
peak. We have constructed a detailed performance model, and found that the
current implementation can achieve good performance on systems with much
smaller memory and communication bandwidth.
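
The "generic" design described in this abstract, where the researcher supplies a particle type and an interaction function and the framework supplies the driver, can be illustrated with a small sketch. This is not the FDPS API: FDPS is a C++ framework, and every name here (`Particle`, `gravity`, `compute_all`) is hypothetical. The driver uses plain O(N^2) direct summation as a stand-in for the tree algorithm and parallelization that the framework would provide, and the softened gravity with G = 1 is an illustrative choice.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vec = Tuple[float, ...]


@dataclass
class Particle:
    """User-defined particle type (hypothetical; not an FDPS class)."""
    mass: float
    pos: Vec


# User-supplied interaction: acceleration on `target` due to `source`.
Interaction = Callable[[Particle, Particle], Vec]


def gravity(target: Particle, source: Particle, eps: float = 1e-3) -> Vec:
    """Softened Newtonian gravity with G = 1 (illustrative assumption)."""
    dx = tuple(s - t for s, t in zip(source.pos, target.pos))
    r2 = sum(d * d for d in dx) + eps * eps
    inv_r3 = r2 ** -1.5
    return tuple(source.mass * d * inv_r3 for d in dx)


def compute_all(particles: Sequence[Particle],
                interaction: Interaction) -> List[Vec]:
    """Generic driver: sums the user-supplied interaction over all pairs.

    Direct summation stands in for the framework's internals here; the
    driver knows nothing about the physics inside `interaction`.
    """
    result: List[Vec] = []
    for i, target in enumerate(particles):
        acc: Vec = (0.0, 0.0, 0.0)
        for j, source in enumerate(particles):
            if i == j:
                continue  # skip self-interaction
            contribution = interaction(target, source)
            acc = tuple(a + c for a, c in zip(acc, contribution))
        result.append(acc)
    return result


if __name__ == "__main__":
    bodies = [Particle(1.0, (0.0, 0.0, 0.0)), Particle(1.0, (1.0, 0.0, 0.0))]
    for body, acc in zip(bodies, compute_all(bodies, gravity)):
        print(body.pos, acc)
```

The point of the separation is that `compute_all` could be replaced by a parallel or accelerator-backed driver without changing the user's `Particle` or `gravity` definitions.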