Normalized to: Nitadori, K.
[1]
oai:arXiv.org:2006.16560 [pdf] - 2125009
PeTar: a high-performance N-body code for modeling massive collisional
stellar systems
Submitted: 2020-06-30
The numerical simulations of massive collisional stellar systems, such as
globular clusters (GCs), are very time-consuming. Until now, only a few
realistic million-body simulations of GCs with a small fraction of binaries
(5%) have been performed by using the NBODY6++GPU code. Such models took half a
year of computational time on a GPU-based supercomputer. In this work, we develop
a new N-body code, PeTar, by combining the methods of Barnes-Hut tree, Hermite
integrator and slow-down algorithmic regularization (SDAR). The code can
accurately handle an arbitrary fraction of multiple systems (e.g. binaries,
triples) while keeping a high performance by using the hybrid parallelization
methods with MPI, OpenMP, SIMD instructions and GPU. A few benchmarks indicate
that PeTar and NBODY6++GPU agree very well on the long-term
evolution of the global structure, binary orbits and escapers. On a highly
configured GPU desktop computer, a million-body simulation
with all stars in binaries runs 11 times faster with PeTar than with
NBODY6++GPU. Moreover, on the Cray XC50 supercomputer, PeTar scales well as the
number of cores increases. The ten million-body problem, which covers the region
of ultra compact dwarfs and nuclear star clusters, thus becomes solvable.
[2]
oai:arXiv.org:2002.07938 [pdf] - 2052130
A slow-down time-transformed symplectic integrator for solving the
few-body problem
Submitted: 2020-02-18
An accurate and efficient method for dealing with few-body dynamics is
important for simulating collisional N-body systems like star clusters and for
following the formation and evolution of compact binaries. We describe such a
method which combines the time-transformed explicit symplectic integrator
(Preto & Tremaine 1999; Mikkola & Tanikawa 1999) and the slow-down method
(Mikkola & Aarseth 1996). The former conserves the Hamiltonian and the angular
momentum for a long-term evolution, while the latter significantly reduces the
computational cost for a weakly perturbed binary. In this work, the Hamilton
equations of this algorithm are analyzed in detail. We mathematically and
numerically show that it correctly reproduces the secular evolution, like the
orbit-averaged method, and also conserves the angular momentum well. For a weakly
perturbed binary, the method can be a few orders of magnitude
faster than the classical algorithm. A code
written in C++, SDAR, is publicly available on GitHub
(https://github.com/lwang-astro/SDAR). It can be used either as a stand-alone
tool or as a library to be plugged into other $N$-body codes. High-precision
floating-point arithmetic of up to 62 digits is also supported.
[3]
oai:arXiv.org:1907.02290 [pdf] - 2046222
Accelerated FDPS --- Algorithms to Use Accelerators with FDPS
Submitted: 2019-07-04
In this paper, we describe the algorithms we implemented in FDPS to make
efficient use of accelerator hardware such as GPGPUs. We have developed FDPS to
make it possible for many researchers to develop their own high-performance
parallel particle-based simulation programs without spending a large amount of
time on parallelization and performance tuning. The basic idea of FDPS is to
provide a high-performance implementation of parallel algorithms for
particle-based simulations in a "generic" form, so that researchers can define
their own particle data structure and interparticle interaction functions and
supply them to FDPS. FDPS compiled with the user-supplied data types and interaction
functions provides all functions necessary for parallelization, and using those
functions researchers can write their programs as though they were writing
a simple non-parallel program. It has been possible to use accelerators with
FDPS by writing an interaction function that uses the accelerator. However,
the efficiency was limited by the latency and bandwidth of communication
between the CPU and the accelerator and also by the mismatch between the
available degree of parallelism of the interaction function and that of the
hardware parallelism. We have modified the interface of the user-provided
interaction function so that accelerators are used more efficiently. We also
implemented new techniques which reduce the amount of work on the CPU side
and the amount of communication between the CPU and accelerators. We have measured the
performance of N-body simulations on a system with an NVIDIA Volta GPGPU using
FDPS, and the achieved performance is around 27% of the theoretical peak
limit. We have constructed a detailed performance model, and found that the
current implementation can achieve good performance on systems with much
smaller memory and communication bandwidth.
[4]
oai:arXiv.org:1907.02289 [pdf] - 1910783
Implementation and Performance of Barnes-Hut N-body algorithm on
Extreme-scale Heterogeneous Many-core Architectures
Iwasawa, Masaki;
Namekata, Daisuke;
Sakamoto, Ryo;
Nakamura, Takashi;
Kimura, Yasuyuki;
Nitadori, Keigo;
Wang, Long;
Tsubouchi, Miyuki;
Makino, Jun;
Liu, Zhao;
Fu, Haohuan;
Yang, Guangwen
Submitted: 2019-07-04
In this paper, we report the implementation and measured performance of our
extreme-scale global simulation code on Sunway TaihuLight and two PEZY-SC2
systems: Shoubu System B and Gyoukou. The numerical algorithm is the parallel
Barnes-Hut tree algorithm, which has been used in many large-scale
astrophysical particle-based simulations. Our implementation is based on our
FDPS framework. However, the extremely large numbers of cores of the systems
used (10M on TaihuLight and 16M on Gyoukou) and their relatively poor memory
and network bandwidth pose new challenges. We describe the new algorithms
introduced to achieve high efficiency on machines with low memory bandwidth.
The measured performance is 47.9 PF, 10.6 PF and 1.01 PF on TaihuLight, Gyoukou
and Shoubu System B, respectively (efficiency 40%, 23.5% and 35.5%). The current code is
developed for the simulation of planetary rings, but most of the new algorithms
are useful for other simulations, and are now available in the FDPS framework.
[5]
oai:arXiv.org:1903.03138 [pdf] - 1868141
A Mean-Field Approach to Simulating the Merging of Collisionless Stellar
Systems Using a Particle-Based Method
Submitted: 2019-03-07
We present a mean-field approach to simulating merging processes of two
spherical collisionless stellar systems. This approach is realized with a
self-consistent field (SCF) method in which the full spatial dependence of the
density and potential of a system is expanded in a set of basis functions for
solving Poisson's equation. In order to apply this SCF method to a merging
situation where two systems are moving in space, we assign the expansion center
to the center of mass of each system, the position of which is followed by a
mass-less particle placed at that position initially. Merging simulations over
a wide range of impact parameters are performed using both an SCF code
developed here and a tree code. The results of each simulation produced by the
two codes show excellent agreement in the evolving morphology of the merging
systems and in the density and velocity dispersion profiles of the merged
systems. However, comparing the results generated by the tree code to those
obtained with the softening-free SCF code, we have found that in large impact
parameter cases, a softening length of the Plummer type introduced in the tree
code has an effect of advancing the orbital phase of the two systems in the
merging process at late times. We demonstrate that the faster orbital phase
originates from the larger convergence length to the pure Newtonian force.
Other application problems suitable to the current SCF code are also discussed.
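
The softening effect mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes the standard Plummer-softened point-mass force law and units with G = 1 and unit masses; the function names are hypothetical and nothing here is taken from the SCF or tree code of the paper.

```python
def newtonian_force(r):
    """Magnitude of the pure Newtonian force between two unit masses (G = 1)."""
    return 1.0 / (r * r)


def plummer_force(r, eps):
    """Magnitude of the Plummer-softened force with softening length eps."""
    return r / (r * r + eps * eps) ** 1.5


def force_ratio(r, eps):
    """Softened force as a fraction of the Newtonian force at separation r.

    Algebraically this equals (1 + eps^2 / r^2) ** (-1.5), so the deficit
    relative to the Newtonian force falls off only as a power law in r.
    """
    return plummer_force(r, eps) / newtonian_force(r)
```

Under these assumptions the softened force is always weaker than the Newtonian one, and the fractional deficit is roughly 1.5 (eps/r)^2 for r much larger than eps. This only illustrates the force law itself, not the orbital-phase analysis reported in the abstract.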
[6]
oai:arXiv.org:1804.08935 [pdf] - 1705276
Fortran interface layer of the framework for developing particle
simulator FDPS
Submitted: 2018-04-24, last modified: 2018-04-25
Numerical simulations based on particle methods have been widely used in
various fields including astrophysics. To date, simulation software has been
developed by individual researchers or research groups in each field, at the cost of a
huge amount of time and effort, even though the numerical algorithms used are very
similar. To improve the situation, we have developed a framework, called FDPS,
which enables researchers to easily develop massively parallel particle
simulation codes for arbitrary particle methods. Until version 3.0, FDPS
provided an API only for the C++ programming language. This limitation comes from the
fact that FDPS is developed using the template feature in C++, which is
essential to support arbitrary particle data types. However, there are many
researchers who use Fortran to develop their codes. Thus, the previous versions
of FDPS required such people to invest much time in learning C++. This is
inefficient. To cope with this problem, we have developed a Fortran interface
layer in FDPS, which provides an API for Fortran. In order to support arbitrary
particle data types in Fortran, we designed the Fortran interface layer as
follows. Based on a given derived data type in Fortran representing a particle, a
Python script provided by us automatically generates a library that manipulates
the C++ core part of FDPS. This library is seen as a Fortran module providing
API of FDPS from the Fortran side and uses C programs internally to
interoperate Fortran with C++. In this way, we have overcome several technical
issues when emulating `template' in Fortran. By using the Fortran interface,
users can develop all parts of their codes in Fortran. We show that the
overhead of the Fortran interface part is sufficiently small and a code written
in Fortran shows a performance practically identical to the one written in C++.
[7]
oai:arXiv.org:1612.06984 [pdf] - 1580972
Unconvergence of Very Large Scale GI Simulations
Submitted: 2016-12-21
The giant impact (GI) is one of the most important hypotheses both in
planetary science and geoscience, since it is related to the origin of the Moon
and also the initial condition of the Earth. A number of numerical simulations
have been done using the smoothed particle hydrodynamics (SPH) method. However,
the GI hypothesis is currently in crisis. The "canonical" GI scenario failed to
explain the identical isotope ratio between the Earth and the Moon. On the
other hand, little has been known about the reliability of the result of GI
simulations. In this paper, we discuss the effect of the resolution on the
results of the GI simulations by varying the number of particles from $3
\times10^3$ to $10^8$. We found that the results do not converge, but show
oscillatory behaviour. We discuss the origin of this oscillatory behaviour.
[8]
oai:arXiv.org:1601.03138 [pdf] - 1422207
Implementation and performance of FDPS: A Framework Developing Parallel
Particle Simulation Codes
Submitted: 2016-01-13, last modified: 2016-04-24
We present the basic idea, implementation, measured performance and
performance model of FDPS (Framework for developing particle simulators). FDPS
is an application-development framework which helps researchers develop
particle-based simulation programs for large-scale distributed-memory parallel
supercomputers. A particle-based simulation program for distributed-memory
parallel computers needs to perform domain decomposition, redistribution of
particles, and gathering of particle information for interaction calculation.
Also, even if distributed-memory parallel computers are not used, in order to
reduce the amount of computation, algorithms such as the Barnes-Hut tree method
should be used for long-range interactions. For short-range interactions, some
methods to limit the calculation to neighbor particles are necessary. FDPS
provides all of these necessary functions for efficient parallel execution of
particle-based simulations as "templates", which are independent of the actual
data structure of particles and the functional form of the interaction. By
using FDPS, researchers can write their programs with the amount of work
necessary to write a simple, sequential and unoptimized program of O(N^2)
calculation cost, and yet the program, once compiled with FDPS, will run
efficiently on large-scale parallel supercomputers. A simple gravitational
N-body program can be written in around 120 lines. We report the actual
performance of these programs and the performance model. The weak scaling
performance is very good, and almost linear speedup was obtained for up to the
full system of K computer. The minimum calculation time per timestep is in the
range of 30 ms (N=10^7) to 300 ms (N=10^9). These are currently limited by the
time for the calculation of the domain decomposition and communication
necessary for the interaction calculation. We discuss how we can overcome these
bottlenecks.
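
As an illustration of the "simple, sequential and unoptimized program of O(N^2) calculation cost" that the abstract refers to, here is a minimal Python sketch of direct-summation gravity. It is not the FDPS API (FDPS is a C++ template framework); the function name, the argument layout and the optional softening parameter `eps` are assumptions made for this example.

```python
def direct_accelerations(masses, positions, eps=0.0, G=1.0):
    """Gravitational acceleration on every particle by direct O(N^2) summation.

    masses    -- list of N particle masses
    positions -- list of N [x, y, z] coordinates
    eps       -- optional Plummer softening length (hypothetical parameter)
    """
    n = len(masses)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a particle exerts no force on itself
            # Separation vector pointing from particle i to particle j.
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps * eps
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += G * masses[j] * d[k] * inv_r3
    return acc
```

The double loop is the pairwise interaction kernel a user would conceptually supply; domain decomposition, particle exchange and tree construction are the parts the abstract says the framework provides.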
[9]
oai:arXiv.org:1504.03687 [pdf] - 1280830
NBODY6++GPU: Ready for the gravitational million-body problem
Submitted: 2015-04-14, last modified: 2015-05-21
Accurate direct $N$-body simulations help to obtain detailed information
about the dynamical evolution of star clusters. They also enable comparisons
with analytical models and Fokker-Planck or Monte-Carlo methods. NBODY6 is a
well-known direct $N$-body code for star clusters, and NBODY6++ is the extended
version designed for large particle number simulations by supercomputers. We
present NBODY6++GPU, an optimized version of NBODY6++ with hybrid
parallelization methods (MPI, GPU, OpenMP, and AVX/SSE) to accelerate large
direct $N$-body simulations, and in particular to solve the million-body
problem. We discuss the new features of the NBODY6++GPU code, benchmarks, as
well as the first results from a simulation of a realistic globular cluster
initially containing a million particles. For million-body simulations,
NBODY6++GPU is $400-2000$ times faster than NBODY6 with 320 CPU cores and 32
NVIDIA K20X GPUs. With this computing cluster specification, the simulations of
million-body globular clusters including $5\%$ primordial binaries require
about an hour per half-mass crossing time.
[10]
oai:arXiv.org:1211.4406 [pdf] - 978407
4.45 Pflops Astrophysical N-Body Simulation on K computer -- The
Gravitational Trillion-Body Problem
Submitted: 2012-11-19, last modified: 2015-04-13
As an entry for the 2012 Gordon-Bell performance prize, we report performance
results of astrophysical N-body simulations of one trillion particles performed
on the full system of K computer. This is the first gravitational trillion-body
simulation in the world. We describe the scientific motivation, the numerical
algorithm, the parallelization strategy, and the performance analysis. Unlike
many previous Gordon-Bell prize winners that used the tree algorithm for
astrophysical N-body simulations, we used the hybrid TreePM method, in which, for a similar
level of accuracy, the short-range force is calculated by the tree
algorithm and the long-range force is solved by the particle-mesh algorithm.
We developed a highly-tuned gravity kernel for short-range forces, and a novel
communication algorithm for long-range forces. The average performance on 24576
and 82944 nodes of the K computer is 1.53 and 4.45 Pflops, respectively, which corresponds to 49%
and 42% of the peak speed.
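
The quoted efficiencies can be cross-checked with simple arithmetic. The Python sketch below uses only the numbers stated in the abstract (sustained Pflops, fraction of peak, node count); the function names are hypothetical and the result is limited by the rounding of the quoted percentages.

```python
def implied_peak_pflops(sustained_pflops, efficiency):
    """Peak speed implied by a sustained speed and a fraction of peak."""
    return sustained_pflops / efficiency


def implied_peak_per_node_gflops(sustained_pflops, efficiency, nodes):
    """Implied peak speed per node in Gflops (1 Pflops = 1e6 Gflops)."""
    return implied_peak_pflops(sustained_pflops, efficiency) * 1.0e6 / nodes


small_run = implied_peak_per_node_gflops(1.53, 0.49, 24576)
full_run = implied_peak_per_node_gflops(4.45, 0.42, 82944)
```

Both runs imply nearly the same per-node peak, which is what one would expect if the two efficiency figures refer to the same hardware.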
[11]
oai:arXiv.org:1412.0659 [pdf] - 904525
24.77 Pflops on a Gravitational Tree-Code to Simulate the Milky Way
Galaxy with 18600 GPUs
Submitted: 2014-12-01
We have simulated, for the first time, the long term evolution of the Milky
Way Galaxy using 51 billion particles on the Swiss Piz Daint supercomputer with
our $N$-body gravitational tree-code Bonsai. Herein, we describe the scientific
motivation and numerical algorithms. The Milky Way model was simulated for 6
billion years, during which the bar structure and spiral arms were fully
formed. This improves upon previous simulations by using 1000 times more
particles, and provides a wealth of new data that can be directly compared with
observations. We also report the scalability on both the Swiss Piz Daint and
the US ORNL Titan. On Piz Daint the parallel efficiency of Bonsai was above
95%. The highest performance was achieved with a 242 billion particle Milky Way
model using 18600 GPUs on Titan, thereby reaching a sustained GPU and
application performance of 33.49 Pflops and 24.77 Pflops respectively.
[12]
oai:arXiv.org:1409.5981 [pdf] - 884029
Particle mesh multipole method: An efficient solver for
gravitational/electrostatic forces based on multipole method and fast
convolution over a uniform mesh
Submitted: 2014-09-21, last modified: 2014-10-17
We propose an efficient algorithm for the evaluation of the potential and its
gradient of gravitational/electrostatic $N$-body systems, which we call
particle mesh multipole method (PMMM or PM$^3$). PMMM can be understood both as
an extension of the particle mesh (PM) method and as an optimization of the
fast multipole method (FMM). In the former viewpoint, the scalar density and
potential held by a grid point are extended to multipole moments and local
expansions in $(p+1)^2$ real numbers, where $p$ is the order of expansion. In
the latter viewpoint, a hierarchical octree structure which brings its
$\mathcal O(N)$ nature, is replaced with a uniform mesh structure, and we
exploit the convolution theorem with fast Fourier transform (FFT) to speed up
the calculations. Hence, independent $(p+1)^2$ FFTs with the size equal to the
number of grid points are performed.
The fundamental idea is common to PPPM/MPE by Shimada et al. (1993) and FFTM
by Ong et al. (2003). PMMM differs from them in supporting both the open and
periodic boundary conditions, and employing an irreducible form where both the
multipole moments and local expansions are expressed in $(p+1)^2$ real numbers
and the transformation matrices in $(2p+1)^2$ real numbers.
The computational complexity is the larger of $\mathcal O(p^2 N)$ and
$\mathcal O(N \log (N/p^2))$, and the memory demand is $\mathcal O(N)$ when the
number of grid points is $\propto N/p^2$.
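
The coefficient counts quoted in this abstract follow from counting real spherical-harmonic terms. As a short worked identity, assuming an expansion with $2\ell+1$ independent real terms at each degree $\ell$ (the index $\ell$ is notation introduced here, not in the abstract):

```latex
\sum_{\ell=0}^{p} (2\ell + 1) = (p+1)^2,
\qquad
\sum_{\ell=0}^{2p} (2\ell + 1) = (2p+1)^2 .
```

The first sum matches the $(p+1)^2$ real numbers per multipole moment or local expansion. The second would match the $(2p+1)^2$ real numbers of the transformation matrices if those involve harmonics up to degree $2p$, which is an assumption of this note rather than a statement from the abstract. For example, $p=4$ gives 25 and 81 real numbers, respectively.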
[13]
oai:arXiv.org:1101.2020 [pdf] - 648806
The Cosmogrid Simulation: Statistical Properties of Small Dark Matter
Halos
Ishiyama, Tomoaki;
Rieder, Steven;
Makino, Junichiro;
Zwart, Simon Portegies;
Groen, Derek;
Nitadori, Keigo;
de Laat, Cees;
McMillan, Stephen;
Hiraki, Kei;
Harfst, Stefan
Submitted: 2011-01-10, last modified: 2013-04-08
We present the results of the "Cosmogrid" cosmological N-body simulation
suites based on the concordance LCDM model. The Cosmogrid simulation was
performed in a 30Mpc box with 2048^3 particles. The mass of each particle is
1.28x10^5 Msun, which is sufficient to resolve ultra-faint dwarfs. We found
that the halo mass function shows good agreement with the Sheth & Tormen
fitting function down to ~10^7 Msun. We have analyzed the spherically averaged
density profiles of the three most massive halos which are of galaxy group size
and contain at least 170 million particles. The slopes of these density
profiles become shallower than -1 at the innermost radius. We also find a
clear correlation of halo concentration with mass. The mass dependence of the
concentration parameter cannot be expressed by a single power law; however, a
simple model based on the Press-Schechter theory proposed by Navarro et al.
gives reasonable agreement with this dependence. The spin parameter does not
show a correlation with the halo mass. The probability distribution functions
for both concentration and spin are well fitted by the log-normal distribution
for halos with masses larger than ~10^8 Msun. The subhalo abundance depends
on the halo mass. Galaxy-sized halos have 50% more subhalos than ~10^{11} Msun
halos have.
[14]
oai:arXiv.org:1203.4037 [pdf] - 1117378
Phantom-GRAPE: numerical software library to accelerate collisionless
$N$-body simulation with SIMD instruction set on x86 architecture
Submitted: 2012-03-19, last modified: 2012-10-09
(Abridged) We have developed a numerical software library for collisionless
N-body simulations named "Phantom-GRAPE" which highly accelerates force
calculations among particles by use of a new SIMD instruction set extension to
the x86 architecture, AVX, an enhanced version of SSE. In our library, not only
the Newton's forces, but also central forces with an arbitrary shape f(r),
which has a finite cutoff radius r_cut (i.e. f(r)=0 at r>r_cut), can be quickly
computed. Using an Intel Core i7--2600 processor, we measure the performance of
our library for both the forces. In the case of Newton's forces, we achieve 2 x
10^9 interactions per second with 1 processor core, which is 20 times higher
than the performance of an implementation without any explicit use of SIMD
instructions, and 2 times higher than that with the SSE instructions. With 4 processor
cores, we obtain the performance of 8 x 10^9 interactions per second. In the
case of the arbitrarily shaped forces, we can calculate 1 x 10^9 and 4 x 10^9
interactions per second with 1 and 4 processor cores, respectively. The
performance with 1 processor core is 6 times and 2 times higher than those of
the implementations without any use of SIMD instructions and with the SSE
instructions. These performances depend weakly on the number of particles. This
is in good contrast to the fact that the performance of force calculations
accelerated by GPUs depends strongly on the number of particles. Such a
weak dependence of the performance on the number of particles is well suited to
collisionless N-body simulations, since these simulations are usually performed
with sophisticated N-body solvers such as Tree and TreePM methods combined
with an individual timestep scheme. Collisionless N-body simulations
accelerated with our library have a significant advantage over those accelerated
by GPUs, especially in massively parallel environments.
[15]
oai:arXiv.org:1205.1222 [pdf] - 1123183
Accelerating NBODY6 with Graphics Processing Units
Submitted: 2012-05-06
We describe the use of Graphics Processing Units (GPUs) for speeding up the
code NBODY6 which is widely used for direct $N$-body simulations. Over the
years, the $N^2$ nature of the direct force calculation has proved a barrier
for extending the particle number. Following an early introduction of force
polynomials and individual time-steps, the calculation cost was first reduced
by the introduction of a neighbour scheme. After a decade of GRAPE computers
which speeded up the force calculation further, we are now in the era of GPUs
where relatively small hardware systems are highly cost-effective. A
significant gain in efficiency is achieved by employing the GPU to obtain the
so-called regular force which typically involves some 99 percent of the
particles, while the remaining local forces are evaluated on the host. However,
the latter operation is performed up to 20 times more frequently and may still
account for a significant cost. This effort is reduced by parallel SSE/AVX
procedures where each interaction term is calculated using mainly single
precision. We also discuss further strategies connected with coordinate and
velocity prediction required by the integration scheme. This leaves hard
binaries and multiple close encounters which are treated by several
regularization methods. The present nbody6-GPU code is well balanced for
simulations in the particle range $10^4-2 \times 10^5$ for a dual GPU system
attached to a standard PC.
[16]
oai:arXiv.org:1203.1623 [pdf] - 1117138
Formation and Hardening of Supermassive Black Hole Binaries in Minor
Mergers of Disk Galaxies
Submitted: 2012-03-07
We model for the first time the complete orbital evolution of a pair of
Supermassive Black Holes (SMBHs) in a 1:10 galaxy merger of two disk dominated
gas-rich galaxies, from the stage prior to the formation of the binary up to
the onset of gravitational wave emission when the binary separation has shrunk
to 1 milliparsec. The high-resolution smoothed particle hydrodynamics (SPH)
simulations used for the first phase of the evolution include star formation,
accretion onto the SMBHs as well as feedback from supernova explosions and
radiative heating from the SMBHs themselves. Using the direct N-body code
\phi-GPU we evolve the system further without including the effect of gas,
which has been mostly consumed by star formation in the meantime. We start at
the time when the separation between two SMBHs is ~ 700 pc and the two black
holes are still embedded in their galaxy cusps. We use 3 million particles to
study the formation and evolution of the SMBH binary till it becomes hard.
After a hard binary is formed, we reduce (reselect) the particles to 1.15
million and follow the subsequent shrinking of the SMBH binary due to 3-body
encounters with the stars. We find approximately constant hardening rates and
that the SMBH binary rapidly develops a high eccentricity. Similar hardening
rates and eccentricity values are reported in earlier studies of SMBH binary
evolution in the merging of dissipation-less spherical galaxy models. The
estimated coalescence time is ~ 2.9 Gyr, significantly smaller than a Hubble
time. We discuss why this timescale should be regarded as an upper limit. Since
1:10 mergers are among the most common interaction events for galaxies at all
cosmic epochs, we argue that several SMBH binaries should be detected with
currently planned space-borne gravitational wave interferometers, whose
sensitivity will be especially high for SMBHs in the mass range considered
here.
[17]
oai:arXiv.org:1201.1694 [pdf] - 1092800
PSDF: Particle Stream Data Format for N-Body Simulations
Submitted: 2012-01-09
We present a data format for the output of general N-body simulations,
allowing the presence of individual time steps. By specifying a standard,
different N-body integrators and different visualization and analysis programs
can all share the simulation data, independent of the type of programs used to
produce the data. Our Particle Stream Data Format, PSDF, is specified in YAML,
based on the same approach as XML but with a simpler syntax. Together with a
specification of PSDF, we provide background and motivation, as well as
specific examples in a variety of computer languages. We also offer a web site
from which these examples can be retrieved, to make it easy to augment
existing codes with the option to produce PSDF output.
[18]
oai:arXiv.org:1104.2700 [pdf] - 1053360
N-body simulation for self-gravitating collisional systems with a new
SIMD instruction set extension to the x86 architecture, Advanced Vector
eXtensions
Submitted: 2011-04-14, last modified: 2011-09-05
We present a high-performance N-body code for self-gravitating collisional
systems accelerated with the aid of a new SIMD instruction set extension of the
x86 architecture: Advanced Vector eXtensions (AVX), an enhanced version of the
Streaming SIMD Extensions (SSE). With one processor core of Intel Core i7-2600
processor (8 MB cache and 3.40 GHz) based on Sandy Bridge micro-architecture,
we implemented a fourth-order Hermite scheme with individual timestep scheme
(Makino and Aarseth, 1992), and achieved the performance of 20 giga floating
point number operations per second (GFLOPS) for double-precision accuracy,
which is two times and five times higher than that of the previously developed
code implemented with the SSE instructions (Nitadori et al., 2006b), and that
of a code implemented without any explicit use of SIMD instructions with the
same processor core, respectively. We have parallelized the code by using
the so-called NINJA scheme (Nitadori et al., 2006a), and achieved 90 GFLOPS for a
system containing more than N = 8192 particles with 8 MPI processes on four
cores. We expect to achieve about 10 tera FLOPS (TFLOPS) for a self-gravitating
collisional system with N ~ 10^5 on massively parallel systems with at most 800
cores with Sandy Bridge micro-architecture. This performance will be comparable
to that of Graphic Processing Unit (GPU) cluster systems, such as the one with
about 200 Tesla C1070 GPUs (Spurzem et al., 2010). This paper offers an
alternative to collisional N-body simulations with GRAPEs and GPUs.
[19]
oai:arXiv.org:1006.4159 [pdf] - 1033223
Astrophysical Weighted Particle Magnetohydrodynamics
Submitted: 2010-06-21
This paper presents applications of a weighted meshless scheme for conservation
laws to the Euler equations and the equations of ideal magnetohydrodynamics.
The divergence constraint of the latter is maintained to the truncation error
by a new meshless divergence cleaning procedure. The physics of the interaction
between the particles is described by a one-dimensional Riemann problem in a
moving frame. As a result, the diffusion required to treat
dissipative processes is added automatically. Consequently, our scheme has no
free parameters that control the physics of inter-particle interaction, with
the exception of the number of interacting neighbours, which controls the
resolution and accuracy. The resulting equations have a form similar to the SPH
equations, and therefore existing SPH codes can be used to implement the
weighted particle scheme. The scheme is validated in several hydrodynamic and
MHD test cases. In particular, we demonstrate for the first time the ability of
a meshless MHD scheme to model magneto-rotational instability in accretion
disks.
[20]
oai:arXiv.org:1001.0773 [pdf] - 1019013
Simulating the universe on an intercontinental grid of supercomputers
Zwart, Simon Portegies;
Ishiyama, Tomoaki;
Groen, Derek;
Nitadori, Keigo;
Makino, Junichiro;
de Laat, Cees;
McMillan, Stephen;
Hiraki, Kei;
Harfst, Stefan;
Grosso, Paola
Submitted: 2010-01-05
Understanding the universe is hampered by the elusiveness of its most common
constituent, cold dark matter. Almost impossible to observe, dark matter can be
studied effectively by means of simulation and there is probably no other
research field where simulation has led to so much progress in the last decade.
Cosmological N-body simulations are an essential tool for evolving density
perturbations in the nonlinear regime. Simulating the formation of large-scale
structures in the universe, however, is still a challenge due to the enormous
dynamic range in spatial and temporal coordinates, and due to the enormous
computer resources required. The dynamic range is generally dealt with by the
hybridization of numerical techniques. We deal with the computational
requirements by connecting two supercomputers via an optical network and making
them operate as a single machine. This is challenging, if only for the fact
that the supercomputers of our choice are separated by half the planet, as one
is located in Amsterdam and the other is in Tokyo. The co-scheduling of the two
computers and the 'gridification' of the code enable us to achieve a 90%
efficiency for this distributed intercontinental supercomputer.
[21]
oai:arXiv.org:0708.0738 [pdf] - 3727
6th and 8th Order Hermite Integrator for N-body Simulations
Submitted: 2007-08-06, last modified: 2008-02-04
We present sixth- and eighth-order Hermite integrators for astrophysical
$N$-body simulations, which use the derivatives of accelerations up to second
order ({\it snap}) and third order ({\it crackle}). These schemes do not
require previous values for the corrector, and require only one previous value
to construct the predictor. Thus, they are fairly easy to implement. The
additional cost of the calculation of the higher order derivatives is not very
high. Even for the eighth-order scheme, the number of floating-point operations
for force calculation is only about two times larger than that for the traditional
fourth-order Hermite scheme. The sixth order scheme is better than the
traditional fourth order scheme for most cases. When the required accuracy is
very high, the eighth-order one is the best. These high-order schemes have
several practical advantages. For example, they allow a larger number of
particles to be integrated in parallel than the fourth-order scheme does,
resulting in higher execution efficiency in both general-purpose parallel
computers and GRAPE systems.
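
For orientation, the "traditional fourth-order Hermite scheme" that the abstract compares against can be sketched as follows. This is a minimal Python sketch of the commonly used predictor-corrector form with acceleration and jerk, not the sixth- or eighth-order schemes of the paper. The fixed shared timestep, the fixed unit point mass at the origin (G = M = 1) and all function names are assumptions chosen to keep the example short.

```python
import math


def kepler_accel_jerk(r, v):
    """Acceleration and jerk of a test particle around a fixed unit point mass."""
    r2 = sum(x * x for x in r)
    inv_r3 = r2 ** -1.5
    rv = sum(x * y for x, y in zip(r, v))
    acc = [-x * inv_r3 for x in r]
    # Time derivative of -r/|r|^3 is -v/|r|^3 + 3 (r.v) r/|r|^5.
    jerk = [-y * inv_r3 + 3.0 * rv / r2 * x * inv_r3 for x, y in zip(r, v)]
    return acc, jerk


def hermite4_step(r, v, dt, force=kepler_accel_jerk):
    """Advance position r and velocity v by one fourth-order Hermite step."""
    a0, j0 = force(r, v)
    # Predictor: Taylor expansion using acceleration and jerk.
    rp = [x + y * dt + a * dt**2 / 2 + j * dt**3 / 6
          for x, y, a, j in zip(r, v, a0, j0)]
    vp = [y + a * dt + j * dt**2 / 2 for y, a, j in zip(v, a0, j0)]
    # Evaluate acceleration and jerk at the predicted state.
    a1, j1 = force(rp, vp)
    # Corrector: Hermite interpolation between the start and end of the step.
    v1 = [y + (a + b) * dt / 2 + (j - k) * dt**2 / 12
          for y, a, b, j, k in zip(v, a0, a1, j0, j1)]
    r1 = [x + (y + z) * dt / 2 + (a - b) * dt**2 / 12
          for x, y, z, a, b in zip(r, v, v1, a0, a1)]
    return r1, v1


def integrate_orbit(r, v, dt, n_steps):
    """Apply n_steps Hermite steps of size dt and return the final state."""
    for _ in range(n_steps):
        r, v = hermite4_step(r, v, dt)
    return r, v


def specific_energy(r, v):
    """Kinetic plus potential energy per unit mass of the test particle."""
    return 0.5 * sum(y * y for y in v) - 1.0 / math.sqrt(sum(x * x for x in r))
```

A circular orbit started at `r = [1.0, 0.0]`, `v = [0.0, 1.0]` and integrated over one period with `integrate_orbit` is a convenient check, since its energy per unit mass should stay close to -0.5. Production codes typically use individual or block timesteps rather than the fixed step assumed here.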
[22]
oai:arXiv.org:astro-ph/0606105 [pdf] - 82545
High-Performance Small-Scale Simulation of Star Clusters Evolution on
Cray XD1
Submitted: 2006-06-06, last modified: 2006-06-07
In this paper, we describe the performance of an $N$-body simulation of a star
cluster with 64k stars on a Cray XD1 system with 400 dual-core Opteron
processors. A number of astrophysical $N$-body simulations were reported in
SCxy conferences. All previous entries for Gordon-Bell prizes used at least
700k particles. The reason for this preference of large numbers of particles is
the parallel efficiency. It is very difficult to achieve high performance on
large parallel machines, if the number of particles is small. However, for many
scientifically important problems the calculation cost scales as $O(N^{3.3})$,
and it is very important to use large machines for a relatively small number of
particles. We achieved 2.03 Tflops, or 57.7% of the theoretical peak
performance, using a direct $O(N^2)$ calculation with the individual timestep
algorithm, on 64k particles. The best efficiency previously reported for a similar
calculation with 64k or fewer particles is 12% (9 Gflops) on a Cray
T3E-600 with 128 processors. Our implementation is based on a highly scalable
two-dimensional parallelization scheme, and the low-latency communication network
of the Cray XD1 turned out to be essential to achieve this level of performance.
[23]
oai:arXiv.org:astro-ph/0511062 [pdf] - 77420
Performance Tuning of N-Body Codes on Modern Microprocessors: I. Direct
Integration with a Hermite Scheme on x86_64 Architecture
Submitted: 2005-11-02
The main performance bottleneck of gravitational N-body codes is the force
calculation between two particles. We have succeeded in speeding up this
pair-wise force calculation by factors between two and ten, depending on the
code and the processor on which the code is run. These speedups were obtained
by writing highly fine-tuned code for x86_64 microprocessors. Any existing
N-body code, running on these chips, can easily incorporate our assembly code
programs.
In the current paper, we present an outline of our overall approach, which we
illustrate with one specific example: the use of a Hermite scheme for a direct
N^2 type integration on a single 2.0 GHz Athlon 64 processor, for which we
obtain an effective performance of 4.05 Gflops, for double precision accuracy.
In subsequent papers, we will discuss other variations, including the
combinations of N log N codes, single precision implementations, and
performance on other microprocessors.