Normalized to: Falcou, J.
[1]
oai:arXiv.org:1106.0159 [pdf] - 646350
Parallel Spherical Harmonic Transforms on heterogeneous architectures
(GPUs/multi-core CPUs)
Submitted: 2011-06-01, last modified: 2013-04-01
Spherical Harmonic Transforms (SHT) are at the heart of many scientific and
practical applications ranging from climate modelling to cosmological
observations. In many of these areas, new, cutting-edge science goals have
recently been proposed that require simulations and analyses of experimental or
observational data at very high resolutions and of unprecedented volumes. Both
of these aspects pose a formidable challenge for existing implementations of
the transforms.
This paper describes parallel algorithms for computing SHT with two variants
of intra-node parallelism appropriate for novel supercomputer architectures:
multi-core processors and Graphics Processing Units (GPUs). It also discusses
their performance, alone and embedded within a top-level, MPI-based
parallelisation layer ported from the S2HAT library, in terms of their
accuracy, overall efficiency and scalability. We show that our inverse SHT run
on GeForce 400 Series GPUs equipped with the latest CUDA architecture ("Fermi")
outperforms the state-of-the-art implementation for a multi-core processor
executed on a current Intel Core i7-2600K. Furthermore, we show that an
MPI/CUDA version of the inverse transform run on a cluster of 128 Nvidia Tesla
S1070 is as much as 3 times faster than the hybrid MPI/OpenMP version executed
on the same number of quad-core Intel Nehalem processors for problem sizes
motivated by our target applications. The performance of the direct transforms
is, however, found to be at best comparable in these cases. We discuss in detail
the algorithmic solutions devised for the major steps involved in the
calculation of the transforms, emphasising those with a major impact on overall
performance, and elucidate the sources of the dichotomy between the direct and
the inverse operations.
[2]
oai:arXiv.org:1010.1260 [pdf] - 514578
Spherical harmonic transform with GPUs
Submitted: 2010-10-06
We describe an algorithm for computing an inverse spherical harmonic
transform suitable for graphics processing units (GPUs). We use CUDA and base our
implementation on a Fortran90 routine included in a publicly available parallel
package, S2HAT. We focus our attention on the two major sequential steps
involved in the computation of the transform, retaining the efficient parallel
framework of the original code. We detail optimization techniques used to
enhance the performance of the CUDA-based code and contrast them with those
implemented in the Fortran90 version. We also present performance comparisons
of a single CPU plus GPU unit with the S2HAT code running on either 1 or
4 processors. In particular, we find that the use of the latest generation of
GPUs, such as NVIDIA GF100 (Fermi), can accelerate the spherical harmonic
transforms by as much as 18 times with respect to S2HAT executed on one core,
and by as much as 5.5 times with respect to S2HAT on 4 cores, with the overall
performance being limited by the Fast Fourier Transforms. The work presented
here has been performed in the context of Cosmic Microwave Background
simulations and analysis. However, we expect that the developed software will be of more
general interest and applicability.