Normalized to: Aydt, R.
[1]
oai:arXiv.org:astro-ph/0508145 [pdf] - 75034
Optimized Data Loading for a Multi-Terabyte Sky Survey Repository
Submitted: 2005-08-04
Advanced instruments in a variety of scientific domains are collecting
massive amounts of data that must be post-processed and organized to support
scientific research activities. Astronomers have been pioneers in the use of
databases to host highly structured repositories of sky survey data. As more
powerful telescopes come online, the increased volume and complexity of the
data collected pose enormous challenges to state-of-the-art database systems
and data-loading techniques. When the data source is an instrument taking
ongoing samples, the database loading must, at a minimum, keep up with the
data-acquisition rate. These challenges are being faced not only by the
astronomy community, but also by other scientific disciplines interested in
building scalable databases to house multi-terabyte archives of complex
structured data. In this paper we present SkyLoader, our novel framework for
fast and scalable data loading that is being used to populate a multi-table,
multi-terabyte database repository for the Palomar-Quest sky survey. Our
framework consists of an efficient algorithm for bulk loading, an effective
data structure to support data integrity and proper error handling during the
loading process, support for optimized parallelism that matches the number of
concurrent loaders to the database host capabilities, and guidelines for
database and system tuning. Performance studies showing the positive effects of
the adopted strategies are also presented. Our parallel bulk loading with array
buffering technique has made fast population of a multi-terabyte repository a
reality, reducing the loading time for a 40-gigabyte data set from more than 20
hours to less than 3 hours. We believe our framework offers a promising
approach for loading other large and complex scientific databases.
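
The abstract names bulk loading with array buffering and error handling during the load, but gives no implementation details. The following is only a minimal sketch of the general idea, not SkyLoader itself. It assumes Python's built-in sqlite3 module as a stand-in database, a hypothetical table named objects with columns object_id, ra and dec, and a hypothetical helper load_buffered. Rows are accumulated in an in-memory buffer and sent to the database one batch at a time; if a batch violates a constraint, it is rolled back and retried row by row so that only the offending rows are rejected.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical row layout for illustration: (object_id, ra, dec).
Row = Tuple[int, float, float]


def load_buffered(
    conn: sqlite3.Connection,
    rows: Iterable[Row],
    batch_size: int = 1000,
) -> Tuple[int, List[Row]]:
    """Insert rows in buffered batches; return (rows loaded, rows rejected)."""
    sql = "INSERT INTO objects (object_id, ra, dec) VALUES (?, ?, ?)"
    loaded = 0
    rejected: List[Row] = []
    buffer: List[Row] = []

    def flush() -> None:
        nonlocal loaded
        if not buffer:
            return
        try:
            # One transaction per batch: committed on success,
            # rolled back if any row in the batch fails.
            with conn:
                conn.executemany(sql, buffer)
            loaded += len(buffer)
        except sqlite3.IntegrityError:
            # Fall back to row-by-row inserts to isolate the bad rows.
            for row in buffer:
                try:
                    with conn:
                        conn.execute(sql, row)
                    loaded += 1
                except sqlite3.IntegrityError:
                    rejected.append(row)
        buffer.clear()

    for row in rows:
        buffer.append(row)
        if len(buffer) >= batch_size:
            flush()
    flush()  # send whatever remains in the final, partial batch
    return loaded, rejected
```

In this sketch the batch size trades memory for fewer database round trips. The parallel loading described in the abstract would run several such loaders concurrently against a server database, which this single-connection example does not attempt to show.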