Full-text search for arXiv

2 article(s) in total. 3 co-authors, from 1 to 2 common article(s). Median position in authors list is 2,5.

[1] oai:arXiv.org:1807.03078 [pdf] - 1916738

Analyzing billion-objects catalog interactively: Apache Spark for physicists

Plaszczynski, S.; Peloton, J.; Arnault, C.; Campagne, J. E.

Comments:

Submitted: 2018-07-09, last modified: 2019-07-16

Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in the industry, it remains rather limited in the academic community or often restricted to software engineers. The goal of this paper is to show with practical uses-cases that the technology is mature enough to be used without excessive programming skills by astronomers or cosmologists in order to perform standard analyses over large datasets, as those originating from future galaxy surveys. To demonstrate it, we start from a realistic simulation corresponding to 10 years of LSST data taking (6 billions of galaxies). Then, we design, optimize and benchmark a set of Spark python algorithms in order to perform standard operations as adding photometric redshift errors, measuring the selection function or computing power spectra over tomographic bins. Most of the commands execute on the full 110 GB dataset within tens of seconds and can therefore be performed interactively in order to design full-scale cosmological analyses. A jupyter notebook summarizing the analysis is available at https://github.com/astrolabsoftware/1807.03078.

[2] oai:arXiv.org:1804.07501 [pdf] - 1767405

FITS Data Source for Apache Spark

Peloton, Julien; Arnault, Christian; Plaszczynski, Stéphane

Comments: 9 pages, 6 figures. Package available at https://github.com/astrolabsoftware/spark-fits Accepted in Computing and Software for Big Science

Submitted: 2018-04-20, last modified: 2018-10-15

We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark attempts to address big data problems have hitherto proved successful in the industry, but its use in the astronomical community still remains limited. We show how to manage complex binary data structures handled in astrophysics experiments such as binary tables stored in FITS files, within a distributed environment. To this purpose, we first designed and implemented a Spark connector to handle sets of arbitrarily large FITS files, called spark-fits. The user interface is such that a simple file "drag-and-drop" to a cluster gives full advantage of the framework. We demonstrate the very high scalability of spark-fits using the LSST fast simulation tool, CoLoRe, and present the methodologies for measuring and tuning the performance bottlenecks for the workloads, scaling up to terabytes of FITS data on the Cloud@VirtualData, located at Universit\'e Paris Sud. We also evaluate its performance on Cori, a High-Performance Computing system located at NERSC, and widely used in the scientific community.