Normalized to: Arnault, C.
[1]
oai:arXiv.org:1807.03078 [pdf] - 1916738
Analyzing billion-objects catalog interactively: Apache Spark for
physicists
Submitted: 2018-07-09, last modified: 2019-07-16
Apache Spark is a Big Data framework for working on large distributed
datasets. Although widely used in the industry, it remains rather limited in
the academic community or often restricted to software engineers. The goal of
this paper is to show with practical uses-cases that the technology is mature
enough to be used without excessive programming skills by astronomers or
cosmologists in order to perform standard analyses over large datasets, as
those originating from future galaxy surveys. To demonstrate it, we start from
a realistic simulation corresponding to 10 years of LSST data taking (6
billions of galaxies). Then, we design, optimize and benchmark a set of Spark
python algorithms in order to perform standard operations as adding photometric
redshift errors, measuring the selection function or computing power spectra
over tomographic bins. Most of the commands execute on the full 110 GB dataset
within tens of seconds and can therefore be performed interactively in order to
design full-scale cosmological analyses. A jupyter notebook summarizing the
analysis is available at https://github.com/astrolabsoftware/1807.03078.
[2]
oai:arXiv.org:1804.07501 [pdf] - 1767405
FITS Data Source for Apache Spark
Submitted: 2018-04-20, last modified: 2018-10-15
We investigate the performance of Apache Spark, a cluster computing
framework, for analyzing data from future LSST-like galaxy surveys. Apache
Spark attempts to address big data problems have hitherto proved successful in
the industry, but its use in the astronomical community still remains limited.
We show how to manage complex binary data structures handled in astrophysics
experiments such as binary tables stored in FITS files, within a distributed
environment. To this purpose, we first designed and implemented a Spark
connector to handle sets of arbitrarily large FITS files, called spark-fits.
The user interface is such that a simple file "drag-and-drop" to a cluster
gives full advantage of the framework. We demonstrate the very high scalability
of spark-fits using the LSST fast simulation tool, CoLoRe, and present the
methodologies for measuring and tuning the performance bottlenecks for the
workloads, scaling up to terabytes of FITS data on the Cloud@VirtualData,
located at Universit\'e Paris Sud. We also evaluate its performance on Cori, a
High-Performance Computing system located at NERSC, and widely used in the
scientific community.