Normalized to: Ball, N.
[1]
oai:arXiv.org:1410.2276 [pdf] - 1222480
The Next Generation Virgo Cluster Survey. XV. The photometric redshift
estimation for background sources
Raichoor, A.;
Mei, S.;
Erben, T.;
Hildebrandt, H.;
Huertas-Company, M.;
Ilbert, O.;
Licitra, R.;
Ball, N. M.;
Boissier, S.;
Boselli, A.;
Chen, Y. -T.;
Côté, P.;
Cuillandre, J. -C.;
Duc, P. A.;
Durrell, P. R.;
Ferrarese, L.;
Guhathakurta, P.;
Gwyn, S. D. J.;
Kavelaars, J. J.;
Lançon, A.;
Liu, C.;
MacArthur, L. A.;
Muller, M.;
Muñoz, R. P.;
Peng, E. W.;
Puzia, T. H.;
Sawicki, M.;
Toloba, E.;
Van Waerbeke, L.;
Woods, D.;
Zhang, H.
Submitted: 2014-10-08
The Next Generation Virgo Cluster Survey is an optical imaging survey
covering 104 deg^2 centered on the Virgo cluster. Currently, the complete
survey area has been observed in the u*giz-bands and one third in the r-band.
We present the photometric redshift estimation for the NGVS background sources.
After a dedicated data reduction, we perform accurate photometry, with special
attention to precise color measurements through point spread
function-homogenization. We then estimate the photometric redshifts with the Le
Phare and BPZ codes. We add a new prior which extends to i_AB = 12.5 mag. When
using the u*griz-bands, our photometric redshifts for 15.5 \le i \lesssim 23
mag or zphot \lesssim 1 galaxies have a bias |\Delta z| < 0.02, less than 5%
outliers, and a scatter \sigma_{outl.rej.} and an individual error on zphot
that increase with magnitude (from 0.02 to 0.05 and from 0.03 to 0.10,
respectively). When using the u*giz-bands over the same magnitude and redshift
range, the lack of the r-band increases the uncertainties in the 0.3 \lesssim
zphot \lesssim 0.8 range (-0.05 < \Delta z < -0.02, \sigma_{outl.rej.} ~ 0.06,
10-15% outliers, and zphot.err. ~ 0.15). We also present a joint analysis of
the photometric redshift accuracy as a function of redshift and magnitude. We
assess the quality of our photometric redshifts by comparison to spectroscopic
samples and by verifying that the angular auto- and cross-correlation function
w(\theta) of the entire NGVS photometric redshift sample across redshift bins
is in agreement with the expectations.
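The bias, outlier fraction, and outlier-rejected scatter quoted above can be illustrated with a minimal sketch. The abstract does not define these statistics, so the definitions below are assumptions: \Delta z = (zphot - zspec) / (1 + zspec), outliers as |\Delta z| > 0.15, bias as the median \Delta z of the non-outliers, and scatter as their standard deviation. The function name `photoz_metrics` and the `outlier_cut` parameter are hypothetical.

```python
import statistics


def photoz_metrics(z_phot, z_spec, outlier_cut=0.15):
    """Return (bias, outlier_fraction, sigma_outl_rej) for matched redshifts.

    Assumed definitions (not taken from the abstract):
      dz             = (z_phot - z_spec) / (1 + z_spec)
      outlier        : |dz| > outlier_cut
      bias           : median dz of the non-outliers
      sigma_outl_rej : population standard deviation of dz of the non-outliers
    """
    if len(z_phot) != len(z_spec) or not z_phot:
        raise ValueError("need two non-empty sequences of equal length")
    dz = [(zp - zs) / (1.0 + zs) for zp, zs in zip(z_phot, z_spec)]
    kept = [d for d in dz if abs(d) <= outlier_cut]
    if not kept:
        raise ValueError("every object is an outlier")
    outlier_fraction = 1.0 - len(kept) / len(dz)
    bias = statistics.median(kept)
    sigma = statistics.pstdev(kept)
    return bias, outlier_fraction, sigma
```

Applying the function separately in bins of magnitude or redshift would give the kind of magnitude-dependent trend the abstract describes.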
[2]
oai:arXiv.org:1312.3997 [pdf] - 759635
Focus Demo: CANFAR+Skytree: A Cloud Computing and Data Mining System for
Astronomy
Submitted: 2013-12-13
This is a companion Focus Demonstration article to the CANFAR+Skytree poster
(Ball 2012), demonstrating the usage of the Skytree machine learning software
on the Canadian Advanced Network for Astronomical Research (CANFAR) cloud
computing system. CANFAR+Skytree is the world's first cloud computing system
for data mining in astronomy.
[3]
oai:arXiv.org:1312.3996 [pdf] - 759634
CANFAR+Skytree: A Cloud Computing and Data Mining System for Astronomy
Submitted: 2013-12-13
At the Canadian Astronomy Data Centre, we have combined our cloud computing
system, CANFAR, with the world's most advanced machine learning software,
Skytree, to create the world's first cloud computing system for data mining in
astronomy. CANFAR provides a generic environment for the storage and processing
of large datasets, removing the requirement to set up and maintain a computing
system when implementing an extensive undertaking such as a survey pipeline.
500 processor cores and several hundred terabytes of persistent storage are
currently available to users. The storage is implemented via the International
Virtual Observatory Alliance's VOSpace protocol, and is accessible both
interactively, and to all processing jobs. The user interacts with CANFAR by
utilizing virtual machines, which appear to them as equivalent to a desktop.
Each machine is replicated as desired to perform large-scale parallel
processing. Such an arrangement enables the user to immediately install and run
the same astronomy code that they already utilize, in the same way as on a
desktop. In addition, unlike many cloud systems, batch job scheduling is
handled for the user on multiple virtual machines by the Condor job queueing
system. Skytree is installed and run just as any other software on the system,
and thus acts as a library of command line data mining functions that can be
integrated into one's wider analysis. Thus we have created a generic
environment for large-scale analysis by data mining, in the same way that
CANFAR itself has done for storage and processing. Because Skytree scales to
large data in linear runtime, this allows the full sophistication of the huge
fields of data mining and machine learning to be applied to the hundreds of
millions of objects that make up current large datasets. We demonstrate the
utility of the CANFAR+Skytree system by showing science results obtained.
[Abridged]
[4]
oai:arXiv.org:1110.5688 [pdf] - 431048
Discussion on "Techniques for Massive-Data Machine Learning in
Astronomy" by A. Gray
Submitted: 2011-10-25
Astronomy is increasingly encountering two fundamental truths: (1) The field
is faced with the task of extracting useful information from extremely large,
complex, and high dimensional datasets; (2) The techniques of astroinformatics
and astrostatistics are the only way to make this tractable, and bring the
required level of sophistication to the analysis. Thus, an approach which
provides these tools in a way that scales to these datasets is not just
desirable, it is vital. The expertise required spans not just astronomy, but
also computer science, statistics, and informatics. As a computer scientist and
expert in machine learning, Alex's contribution of expertise and a large number
of fast algorithms designed to scale to large datasets, is extremely welcome.
We focus in this discussion on the questions raised by the practical
application of these algorithms to real astronomical datasets. That is, what is
needed to maximally leverage their potential to improve the science return?
This is not a trivial task. While computing and statistical expertise are
required, so is astronomical expertise. Precedent has shown that, to date, the
collaborations most productive in producing astronomical science results (e.g.,
the Sloan Digital Sky Survey), have either involved astronomers expert in
computer science and/or statistics, or astronomers involved in close, long-term
collaborations with experts in those fields. This does not mean that the
astronomers are giving the most important input, but simply that their input is
crucial in guiding the effort in the most fruitful directions, and coping with
the issues raised by real data. Thus, the tools must be useable and
understandable by those whose primary expertise is not computing or statistics,
even though they may have quite extensive knowledge of those fields.
[5]
oai:arXiv.org:1110.5685 [pdf] - 1085143
Utilizing Astroinformatics to Maximize the Science Return of the Next
Generation Virgo Cluster Survey
Submitted: 2011-10-25
The Next Generation Virgo Cluster Survey is a 104 square degree survey of the
Virgo Cluster, carried out using the MegaPrime camera of the
Canada-France-Hawaii telescope, from semesters 2009A-2012A. The survey will
provide coverage of this nearby dense environment in the universe to
unprecedented depth, providing profound insights into galaxy formation and
evolution, including definitive measurements of the properties of galaxies in a
dense environment in the local universe, such as the luminosity function. The
limiting magnitude of the survey is g_AB = 25.7 (10 sigma point source), and
the 2 sigma surface brightness limit is g_AB ~ 29 mag arcsec^-2. The data
volume of the survey (approximately 50 terabytes of images), while large by
contemporary astronomical standards, is not intractable. This renders the
survey amenable to the methods of astroinformatics. The enormous dynamic range
of objects, from the giant elliptical galaxy M87 at M(B) = -21.6, to the
faintest dwarf ellipticals at M(B) ~ -6, combined with photometry in 5 broad
bands (u* g' r' i' z'), and unprecedented depth revealing many previously
unseen structures, creates new challenges in object detection and
classification. We present results from ongoing work on the survey, including
photometric redshifts, Virgo cluster membership, and the implementation of fast
data mining algorithms on the infrastructure of the Canadian Astronomy Data
Centre, as part of the Canadian Advanced Network for Astronomical Research
(CANFAR).
[6]
oai:arXiv.org:0906.2173 [pdf] - 212295
Data Mining and Machine Learning in Astronomy
Submitted: 2009-06-11, last modified: 2010-08-10
We review the current state of data mining and machine learning in astronomy.
'Data Mining' can have a somewhat mixed connotation from the point of view of a
researcher in this field. If used correctly, it can be a powerful approach,
holding the potential to fully exploit the exponentially increasing amount of
available data, promising great scientific advance. However, if misused, it can
be little more than the black-box application of complex computing algorithms
that may give little physical insight, and provide questionable results. Here,
we give an overview of the entire data mining process, from data collection
through to the interpretation of results. We cover common machine learning
algorithms, such as artificial neural networks and support vector machines,
applications from a broad range of astronomy, emphasizing those where data
mining techniques directly resulted in improved science, and important current
and future directions, including probability density functions, parallel
algorithms, petascale computing, and the time domain. We conclude that, so long
as one carefully selects an appropriate algorithm, and is guided by the
astronomical problem at hand, data mining can be very much the powerful tool,
and not the questionable black box.
[7]
oai:arXiv.org:0903.3121 [pdf] - 1001675
Incorporating Photometric Redshift Probability Density Information into
Real-Space Clustering Measurements
Submitted: 2009-03-18, last modified: 2009-09-14
The use of photometric redshifts in cosmology is increasing. Often, however,
these photo-zs are treated like spectroscopic observations, in that the peak of
the photometric redshift, rather than the full probability density function
(PDF), is used. This overlooks useful information inherent in the full PDF. We
introduce a new real-space estimator for one of the most used cosmological
statistics, the 2-point correlation function, that weights by the PDF of
individual photometric objects in a manner that is optimal when Poisson
statistics dominate. As our estimator does not bin based on the PDF peak, it
substantially enhances the clustering signal by usefully incorporating
information from all photometric objects that overlap the redshift bin of
interest. As a real-world application, we measure QSO clustering in the Sloan
Digital Sky Survey (SDSS). We find that our simplest binned estimator improves
the clustering signal by a factor equivalent to increasing the survey size by a
factor of 2-3. We also introduce a new implementation that fully weights
between pairs of objects in constructing the cross-correlation and find that
this pair-weighted estimator improves clustering signal in a manner equivalent
to increasing the survey size by a factor of 4-5. Our technique uses
spectroscopic data to anchor the distance scale and it will be particularly
useful where spectroscopic data (e.g., from BOSS) overlaps deeper photometry
(e.g., from Pan-STARRS, DES or the LSST). We additionally provide simple,
informative expressions to determine when our estimator will be competitive
with the autocorrelation of spectroscopic objects. Although we use QSOs as an
example population, our estimator can and should be applied to any clustering
estimate that uses photometric objects.
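The idea of weighting by the PDF rather than binning on its peak can be sketched as follows. This is an illustration under stated assumptions, not the estimator from the paper: each PDF is taken to be tabulated as probabilities at redshift grid points, positions are flat-sky (x, y) coordinates, and the function names `bin_weight` and `weighted_pair_count` are hypothetical.

```python
import math


def bin_weight(z_centers, probs, z_lo, z_hi):
    """Fraction of a tabulated redshift PDF that lies inside [z_lo, z_hi)."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("PDF must have positive total probability")
    inside = sum(p for z, p in zip(z_centers, probs) if z_lo <= z < z_hi)
    return inside / total


def weighted_pair_count(spec_xy, phot_xy, phot_weights, r_min, r_max):
    """Sum the photometric weights over all spectroscopic-photometric pairs
    whose flat-sky separation falls in [r_min, r_max)."""
    count = 0.0
    for sx, sy in spec_xy:
        for (px, py), w in zip(phot_xy, phot_weights):
            r = math.hypot(px - sx, py - sy)
            if r_min <= r < r_max:
                count += w
    return count
```

An object whose PDF peaks outside the bin but has, say, 30% of its probability inside it still contributes a weight of 0.3 here, whereas peak-based binning would discard it.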
[8]
oai:arXiv.org:0804.3417 [pdf] - 11961
Robust Machine Learning Applied to Terascale Astronomical Datasets
Submitted: 2008-04-21
We present recent results from the LCDM (Laboratory for Cosmological Data
Mining; http://lcdm.astro.uiuc.edu) collaboration between UIUC Astronomy and
NCSA to deploy supercomputing cluster resources and machine learning algorithms
for the mining of terascale astronomical datasets. This is a novel application
in the field of astronomy, because we are using such resources for data mining,
and not just performing simulations. Via a modified implementation of the NCSA
cyberenvironment Data-to-Knowledge, we are able to provide improved
classifications for over 100 million stars and galaxies in the Sloan Digital
Sky Survey, improved distance measures, and a full exploitation of the simple
but powerful k-nearest neighbor algorithm. A driving principle of this work is
that our methods should be extensible from current terascale datasets to
upcoming petascale datasets and beyond. We discuss issues encountered to date,
and further issues for the transition to petascale. In particular, disk I/O
will become a major limiting factor unless the necessary infrastructure is
implemented.
[9]
oai:arXiv.org:0804.3413 [pdf] - 11960
Robust Machine Learning Applied to Astronomical Datasets III:
Probabilistic Photometric Redshifts for Galaxies and Quasars in the SDSS and
GALEX
Submitted: 2008-04-21
We apply machine learning in the form of a nearest neighbor instance-based
algorithm (NN) to generate full photometric redshift probability density
functions (PDFs) for objects in the Fifth Data Release of the Sloan Digital Sky
Survey (SDSS DR5). We use a conceptually simple but novel application of NN to
generate the PDFs - perturbing the object colors by their measurement error -
and using the resulting instances of nearest neighbor distributions to generate
numerous individual redshifts. When the redshifts are compared to existing SDSS
spectroscopic data, we find that the mean value of each PDF has a dispersion
between the photometric and spectroscopic redshift consistent with other
machine learning techniques, being sigma = 0.0207 +/- 0.0001 for main sample
galaxies to r < 17.77 mag, sigma = 0.0243 +/- 0.0002 for luminous red galaxies
to r < ~19.2 mag, and sigma = 0.343 +/- 0.005 for quasars to i < 20.3 mag. The
PDFs allow the selection of subsets with improved statistics. For quasars, the
improvement is dramatic: for those with a single peak in their probability
distribution, the dispersion is reduced from 0.343 to sigma = 0.117 +/- 0.010,
and the photometric redshift is within 0.3 of the spectroscopic redshift for
99.3 +/- 0.1% of the objects. Thus, for this optical quasar sample, we can
virtually eliminate 'catastrophic' photometric redshift estimates. In addition
to the SDSS sample, we incorporate ultraviolet photometry from the Third Data
Release of the Galaxy Evolution Explorer All-Sky Imaging Survey (GALEX AIS GR3)
to create PDFs for objects seen in both surveys. For quasars, the increased
coverage of the observed frame UV of the SED results in significant improvement
over the full SDSS sample, with sigma = 0.234 +/- 0.010. We demonstrate that
this improvement is genuine. [Abridged]
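The perturbation approach described above can be sketched in a few lines. This is a simplified illustration, assuming Gaussian color errors, a Euclidean distance in color space, and a single nearest neighbor per draw; the function names and the `n_draws` and `seed` parameters are hypothetical and not taken from the paper.

```python
import random


def nearest_redshift(colors, train_colors, train_z):
    """Redshift of the training object closest to `colors` (Euclidean distance)."""
    best = min(
        range(len(train_colors)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(colors, train_colors[i])),
    )
    return train_z[best]


def redshift_samples(colors, color_errors, train_colors, train_z, n_draws=100, seed=0):
    """Draw redshift samples by perturbing the colors within their errors.

    Each draw adds Gaussian noise (sigma = measurement error) to every color
    and looks up the nearest training object. The returned list of redshifts
    approximates the object's photometric redshift PDF.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_draws):
        perturbed = [rng.gauss(c, e) for c, e in zip(colors, color_errors)]
        samples.append(nearest_redshift(perturbed, train_colors, train_z))
    return samples
```

A histogram of the returned samples gives the PDF; a single-peaked histogram would correspond to the subset the abstract singles out as having improved statistics.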
[10]
oai:arXiv.org:astro-ph/0610171 [pdf] - 85595
Galaxy Colour, Morphology, and Environment in the Sloan Digital Sky
Survey
Submitted: 2006-10-05, last modified: 2007-10-24
We use the Fourth Data Release of the Sloan Digital Sky Survey to investigate
the relation between galaxy rest frame u-r colour, morphology, as described by
the concentration and Sersic indices, and environmental density, for a sample
of 79,553 galaxies at z < ~0.1. We split the samples according to density and
luminosity and recover the expected bimodal distribution in the
colour-morphology plane, shown especially clearly by this subsampling. We
quantify the bimodality by a sum of two Gaussians on the colour and morphology
axes and show that, for the red/early-type population, neither colour nor
morphology changes significantly as a function of density. For the
blue/late-type population, with increasing density the colour becomes redder
but the morphology again does not change significantly. Both populations become
monotonically redder and of earlier type with increasing luminosity. There is
no significant qualitative difference between the behaviour of the two
morphological measures. We supplement the morphological sample with 13,655
galaxies assigned Hubble types by an artificial neural network. We find,
however, that the resulting distribution is less well described by two
Gaussians. Therefore, there are either more than two significant morphological
populations, physical processes not seen in colour space, or the Hubble type,
particularly the different subtypes of spirals Sa-Sd, has an irreducible
fuzziness when related to environmental density. For each of the three measures
of morphology, on removing the density relation due to it, we recover a strong
residual relation in colour. However, on similarly removing the colour-density
relation there is no evidence for a residual relation due to morphology.
[Abridged]
[11]
oai:arXiv.org:0710.4482 [pdf] - 6338
Robust Machine Learning Applied to Terascale Astronomical Datasets
Submitted: 2007-10-24
We present recent results from the Laboratory for Cosmological Data Mining
(http://lcdm.astro.uiuc.edu) at the National Center for Supercomputing
Applications (NCSA) to provide robust classifications and photometric redshifts
for objects in the terascale-class Sloan Digital Sky Survey (SDSS). Through a
combination of machine learning in the form of decision trees, k-nearest
neighbor, and genetic algorithms, the use of supercomputing resources at NCSA,
and the cyberenvironment Data-to-Knowledge, we are able to provide improved
classifications for over 100 million objects in the SDSS, improved photometric
redshifts, and a full exploitation of the powerful k-nearest neighbor
algorithm. This work is the first to apply the full power of these algorithms
to contemporary terascale astronomical datasets, and the improvement over
existing results is demonstrable. We discuss issues that we have encountered in
dealing with data on the terascale, and possible solutions that can be
implemented to deal with upcoming petascale datasets.
[12]
oai:arXiv.org:astro-ph/0612471 [pdf] - 316659
Robust Machine Learning Applied to Astronomical Datasets II: Quantifying
Photometric Redshifts for Quasars Using Instance-Based Learning
Submitted: 2006-12-17, last modified: 2007-03-22
We apply instance-based machine learning in the form of a k-nearest neighbor
algorithm to the task of estimating photometric redshifts for 55,746 objects
spectroscopically classified as quasars in the Fifth Data Release of the Sloan
Digital Sky Survey. We compare the results obtained to those from an empirical
color-redshift relation (CZR). In contrast to previously published results
using CZRs, we find that the instance-based photometric redshifts are assigned
with no regions of catastrophic failure. Remaining outliers are simply
scattered about the ideal relation, in a similar manner to the pattern seen in
the optical for normal galaxies at redshifts z < ~1. The instance-based
algorithm is trained on a representative sample of the data and
pseudo-blind-tested on the remaining unseen data. The variance between the
photometric and spectroscopic redshifts is sigma^2 = 0.123 +/- 0.002 (compared
to sigma^2 = 0.265 +/- 0.006 for the CZR), and 54.9 +/- 0.7%, 73.3 +/- 0.6%,
and 80.7 +/- 0.3% of the objects are within delta z < 0.1, 0.2, and 0.3
respectively. We also match our sample to the Second Data Release of the Galaxy
Evolution Explorer legacy data and the resulting 7,642 objects show a further
improvement, giving a variance of sigma^2 = 0.054 +/- 0.005, and 70.8 +/- 1.2%,
85.8 +/- 1.0%, and 90.8 +/- 0.7% of objects within delta z < 0.1, 0.2, and 0.3.
We show that the improvement is indeed due to the extra information provided by
GALEX, by training on the same dataset using purely SDSS photometry, which has
a variance of sigma^2 = 0.090 +/- 0.007. Each set of results represents a
realistic standard for application to further datasets for which the spectra
are representative.
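A k-nearest neighbor redshift estimate and the "fraction within delta z" statistic can be sketched as below. The sketch assumes the estimate is the unweighted mean redshift of the k nearest training objects in color space and that delta z is the absolute difference |zphot - zspec|; neither detail is stated in the abstract, and the function names are hypothetical.

```python
def knn_redshift(colors, train_colors, train_z, k=3):
    """Mean redshift of the k training objects nearest to `colors`."""
    order = sorted(
        range(len(train_colors)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(colors, train_colors[i])),
    )
    nearest = order[:k]
    return sum(train_z[i] for i in nearest) / len(nearest)


def fraction_within(z_phot, z_spec, limit):
    """Fraction of objects with |z_phot - z_spec| < limit."""
    if len(z_phot) != len(z_spec) or not z_phot:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for zp, zs in zip(z_phot, z_spec) if abs(zp - zs) < limit)
    return hits / len(z_phot)
```

Calling `fraction_within` with limits of 0.1, 0.2, and 0.3 on a held-out test set would produce the three percentages in the form quoted above.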
[13]
oai:arXiv.org:astro-ph/0507547 [pdf] - 74716
Bivariate Galaxy Luminosity Functions in the Sloan Digital Sky Survey
Submitted: 2005-07-22, last modified: 2006-09-18
Bivariate luminosity functions (LFs) are computed for galaxies in the New
York Value-Added Galaxy Catalogue, based on the Sloan Digital Sky Survey Data
Release 4. The galaxy properties investigated are the morphological type,
inverse concentration index, Sersic index, absolute effective surface
brightness, reference frame colours, absolute radius, eClass spectral type,
stellar mass and galaxy environment. The morphological sample is flux-limited
to galaxies with r < 15.9 and consists of 37,047 classifications to an RMS
accuracy of +/- half a class in the sequence E, S0, Sa, Sb, Sc, Sd, Im. These
were assigned by an artificial neural network, based on a training set of 645
eyeball classifications. The other samples use r < 17.77 with a median redshift
of z ~ 0.08, and a limiting redshift of z < 0.15 to minimize the effects of
evolution. Other cuts, for example in axis ratio, are made to minimize biases.
A wealth of detail is seen, with clear variations between the LFs according to
absolute magnitude and the second parameter. They are consistent with an early
type, bright, concentrated, red population and a late type, faint, less
concentrated, blue, star forming population. This bimodality suggests two major
underlying physical processes, which in agreement with previous authors we
hypothesize to be merger and accretion, associated with the properties of
bulges and discs respectively. The bivariate luminosity-surface brightness
distribution is fit with the Choloniewski function (a Schechter function in
absolute magnitude and Gaussian in surface brightness). The fit is found to be
poor, as might be expected if there are two underlying processes.
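The functional form described above, a Schechter function in absolute magnitude multiplied by a Gaussian in surface brightness, can be written out as a short sketch. The parameter names are hypothetical, and the `beta` term, which lets the mean surface brightness vary linearly with magnitude, is an assumption based on the commonly used parametrization rather than on the abstract; with `beta=0` the function reduces to the plain product.

```python
import math


def schechter_mag(M, phi_star, M_star, alpha):
    """Schechter luminosity function per unit absolute magnitude."""
    x = 10.0 ** (0.4 * (M_star - M))
    return 0.4 * math.log(10.0) * phi_star * x ** (alpha + 1.0) * math.exp(-x)


def choloniewski(M, mu, phi_star, M_star, alpha, mu_star, sigma_mu, beta=0.0):
    """Bivariate density in absolute magnitude M and surface brightness mu.

    Schechter in M times a unit-normalized Gaussian in mu. The Gaussian mean
    is mu_star + beta * (M - M_star); beta is an assumed extra parameter.
    """
    mean_mu = mu_star + beta * (M - M_star)
    gauss = math.exp(-0.5 * ((mu - mean_mu) / sigma_mu) ** 2) / (
        math.sqrt(2.0 * math.pi) * sigma_mu
    )
    return schechter_mag(M, phi_star, M_star, alpha) * gauss
```

Because the Gaussian is normalized, integrating over surface brightness at fixed magnitude recovers the Schechter function, which is a convenient consistency check on any fit.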
[14]
oai:arXiv.org:astro-ph/0606541 [pdf] - 82981
Robust Machine Learning Applied to Astronomical Datasets I: Star-Galaxy
Classification of the SDSS DR3 Using Decision Trees
Submitted: 2006-06-21
We provide classifications for all 143 million non-repeat photometric objects
in the Third Data Release of the Sloan Digital Sky Survey (SDSS) using decision
trees trained on 477,068 objects with SDSS spectroscopic data. We demonstrate
that these star/galaxy classifications are expected to be reliable for
approximately 22 million objects with r < ~20. The general machine learning
environment Data-to-Knowledge and supercomputing resources enabled extensive
investigation of the decision tree parameter space. This work presents the
first public release of objects classified in this way for an entire SDSS data
release. The objects are classified as either galaxy, star or nsng (neither
star nor galaxy), with an associated probability for each class. To demonstrate
how to effectively make use of these classifications, we perform several
important tests. First, we detail selection criteria within the probability
space defined by the three classes to extract samples of stars and galaxies to
a given completeness and efficiency. Second, we investigate the efficacy of the
classifications and the effect of extrapolating from the spectroscopic regime
by performing blind tests on objects in the SDSS, 2dF Galaxy Redshift and 2dF
QSO Redshift (2QZ) surveys. Given the photometric limits of our spectroscopic
training data, we effectively begin to extrapolate past our star-galaxy
training set at r ~ 18. By comparing the number counts of our training sample
with the classified sources, however, we find that our efficiencies appear to
remain robust to r ~ 20. As a result, we expect our classifications to be
accurate for 900,000 galaxies and 6.7 million stars, and remain robust via
extrapolation for a total of 8.0 million galaxies and 13.9 million stars.
[Abridged]
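Selecting a sample from the class probabilities to a given completeness and efficiency can be illustrated with a minimal sketch. It assumes the usual definitions, which the abstract does not spell out: completeness is the fraction of true galaxies that pass the probability cut, and efficiency is the fraction of selected objects that are true galaxies. The function name and the single-threshold selection are hypothetical simplifications of the three-class probability space.

```python
def completeness_efficiency(p_galaxy, is_galaxy, threshold):
    """Completeness and efficiency of the cut p_galaxy >= threshold.

    p_galaxy  : classifier probability that each object is a galaxy
    is_galaxy : true labels (e.g., from spectroscopy), as booleans
    Returns NaN for a quantity whose denominator is zero.
    """
    selected = [truth for p, truth in zip(p_galaxy, is_galaxy) if p >= threshold]
    n_true = sum(1 for truth in is_galaxy if truth)
    n_selected_true = sum(1 for truth in selected if truth)
    completeness = n_selected_true / n_true if n_true else float("nan")
    efficiency = n_selected_true / len(selected) if selected else float("nan")
    return completeness, efficiency
```

Scanning the threshold from 0 to 1 traces out the trade-off between the two quantities, from which a cut matching a target completeness or efficiency can be read off.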
[15]
oai:arXiv.org:astro-ph/0306390 [pdf] - 57471
Galaxy Types in the Sloan Digital Sky Survey Using Supervised Artificial
Neural Networks
Submitted: 2003-06-19
Supervised artificial neural networks are used to predict useful properties
of galaxies in the Sloan Digital Sky Survey, in this instance morphological
classifications, spectral types and redshifts. By giving the trained networks
unseen data, it is found that correlations between predicted and actual
properties are around 0.9 with rms errors of order ten per cent. Thus, given a
representative training set, these properties may be reliably estimated for
galaxies in the survey for which there are no spectra and without human
intervention.
[16]
oai:arXiv.org:astro-ph/0110492 [pdf] - 45549
Morphological Classification of Galaxies Using Artificial Neural
Networks
Submitted: 2001-10-22
The results of morphological galaxy classifications performed by humans and
by automated methods are compared. In particular, a comparison is made between
the eyeball classifications of 454 galaxies in the Sloan Digital Sky Survey
(SDSS) commissioning data (Shimasaku et al. 2001) with those of supervised
artificial neural network programs constructed using the MATLAB Neural Network
Toolbox package. Networks in this package have not previously been used for
galaxy classification. It is found that simple neural networks are able to
improve on the results of linear classifiers, giving correlation coefficients
of the order of 0.8 +/- 0.1, compared with those of around 0.7 +/- 0.1 for
linear classifiers. The networks are trained using the resilient
backpropagation algorithm, which, to the author's knowledge, has not been
specifically used in the galaxy classification literature. The galaxy
parameters used and the network architecture are both important, and in
particular the galaxy concentration index, a measure of the concentration of
light towards the centre of the galaxy, is the most significant parameter.
Simple networks are briefly applied to 29,429 galaxies with redshifts from the
SDSS Early Data Release. They give an approximate ratio of types E/S0:Sp:Irr of
14 +/- 5 : 86 +/- 12 : 0 +/- 0.1, which broadly agrees with the well known
approximate ratios of 20:80:1 observed at low redshift.
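The resilient backpropagation update mentioned above adapts a separate step size for each weight using only the sign of the gradient. The sketch below shows that update rule in the iRprop- variant, applied to a generic gradient function rather than to a neural network, and it is not the MATLAB toolbox implementation. The function name is hypothetical, and the default constants are the values commonly quoted for Rprop, used here as assumptions.

```python
def rprop_minimize(grad, x0, n_steps=200, step0=0.1,
                   eta_plus=1.2, eta_minus=0.5,
                   step_min=1e-6, step_max=50.0):
    """Minimize a function from its gradient with the Rprop update rule.

    grad : callable returning the gradient as a sequence, one entry per parameter
    x0   : starting parameter values
    """
    x = list(x0)
    steps = [step0] * len(x)
    prev = [0.0] * len(x)
    for _ in range(n_steps):
        g = list(grad(x))
        for i in range(len(x)):
            sign_product = g[i] * prev[i]
            if sign_product > 0:
                # Same sign as last time: accelerate.
                steps[i] = min(steps[i] * eta_plus, step_max)
            elif sign_product < 0:
                # Sign flipped: we overshot, so shrink the step and skip this update.
                steps[i] = max(steps[i] * eta_minus, step_min)
                g[i] = 0.0
            # Move against the gradient by the step size, ignoring its magnitude.
            if g[i] > 0:
                x[i] -= steps[i]
            elif g[i] < 0:
                x[i] += steps[i]
            prev[i] = g[i]
    return x
```

For example, `rprop_minimize(lambda p: [2 * (p[0] - 3), 2 * (p[1] + 1)], [0.0, 0.0])` moves toward the minimum of (x - 3)^2 + (y + 1)^2 at (3, -1). In network training, `grad` would be the gradient of the classification error with respect to the weights.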