Normalized to: Caubet, M.
[1]
oai:arXiv.org:2003.03217 [pdf] - 2061804
CosmoHub: Interactive exploration and distribution of astronomical data
on Hadoop
Tallada, Pau;
Carretero, Jorge;
Casals, Jordi;
Acosta-Silva, Carles;
Serrano, Santiago;
Caubet, Marc;
Castander, Francisco J.;
César, Eduardo;
Crocce, Martín;
Delfino, Manuel;
Eriksen, Martin;
Fosalba, Pablo;
Gaztañaga, Enrique;
Merino, Gonzalo;
Neissner, Christian;
Tonello, Nadia
Submitted: 2020-03-04, last modified: 2020-03-10
We present CosmoHub (https://cosmohub.pic.es), a web application based on
Hadoop to perform interactive exploration and distribution of massive
cosmological datasets. Modern cosmology seeks to unveil the nature of both dark
matter and dark energy by mapping the large-scale structure of the Universe
through the analysis of massive amounts of astronomical data, whose volume has
grown steadily over recent decades, and will keep growing, with the
digitization and automation of experimental techniques.
CosmoHub, hosted and developed at the Port d'Informació Científica (PIC),
provides support to a worldwide community of scientists, without requiring the
end user to know any Structured Query Language (SQL). It serves data from
several large international collaborations such as the Euclid space mission,
the Dark Energy Survey (DES), the Physics of the Accelerating Universe Survey
(PAUS) and the Marenostrum Institut de Ciències de l'Espai (MICE) numerical
simulations. CosmoHub was originally developed as a web frontend to a
PostgreSQL relational database; this work describes its current version, built
on top of Apache Hive, which facilitates scalable reading, writing and
management of huge datasets. As CosmoHub's datasets are seldom modified, Hive
is a better fit.
Over 60 TiB of catalogued information and $50 \times 10^9$ astronomical
objects can be interactively explored using an integrated visualization tool
which includes 1D histogram and 2D heatmap plots. In our current
implementation, online exploration of datasets of $10^9$ objects takes tens of
seconds. Users can also download customized subsets of data in standard
formats, generated in a few minutes.