Normalized to: Genovese, C.
[1]
oai:arXiv.org:1509.06376 [pdf] - 1530313
Detecting Effects of Filaments on Galaxy Properties in the Sloan Digital
Sky Survey III
Submitted: 2015-09-21, last modified: 2017-01-12
We study the effects of filaments on galaxy properties in the Sloan Digital
Sky Survey (SDSS) Data Release 12 using filaments from the `Cosmic Web
Reconstruction' catalogue (Chen et al. 2016), a publicly available filament
catalogue for SDSS. Since filaments are tracers of medium-to-high density
regions, we expect that galaxy properties associated with the environment are
dependent on the distance to the nearest filament. Our analysis demonstrates
that red or high-mass galaxies tend to reside closer to filaments than blue or
low-mass galaxies. After adjusting for the effect of stellar mass, on
average, early-forming galaxies or large galaxies have a shorter distance to
filaments than late-forming galaxies or small galaxies. For the Main galaxy
sample (MGS), all signals are very significant ($>6\sigma$). For the LOWZ and
CMASS samples, the stellar mass and size signals are significant ($>2 \sigma$). The
filament effects we observe persist until $z = 0.7$ (the edge of the CMASS
sample). Comparing our results to those using the galaxy distances from
redMaPPer galaxy clusters as a reference, we find a similar result between
filaments and clusters. Moreover, we find that the effect of clusters on the
stellar mass of nearby galaxies depends on the galaxy's filamentary
environment. Our findings illustrate the strong correlation of galaxy
properties with proximity to density ridges, strongly supporting the claim that
density ridges are good tracers of filaments.
[2]
oai:arXiv.org:1509.06443 [pdf] - 1447640
Cosmic Web Reconstruction through Density Ridges: Catalogue
Submitted: 2015-09-21
We construct a catalogue for filaments using a novel approach called SCMS
(subspace constrained mean shift; Ozertem & Erdogmus 2011; Chen et al. 2015).
SCMS is a gradient-based method that detects filaments through density ridges
(smooth curves tracing high-density regions). A great advantage of SCMS is its
uncertainty measure, which allows an evaluation of the errors for the detected
filaments. To detect filaments, we use data from the Sloan Digital Sky Survey,
which consist of three galaxy samples: the NYU main galaxy sample (MGS), the
LOWZ sample and the CMASS sample. Each of the three datasets covers a different
redshift region so that the combined sample allows detection of filaments up
to z = 0.7. Our filament catalogue consists of a sequence of two-dimensional
filament maps at different redshifts that provide several useful statistics on
the evolution of the cosmic web. To construct the maps, we select spectroscopically
confirmed galaxies within 0.050 < z < 0.700 and partition them into 130 bins.
For each bin, we ignore the redshift, treating the galaxy observations as 2-D
data, and detect filaments using SCMS. The filament catalogue consists of 130
individual 2-D filament maps, and each map comprises points on the detected
filaments that describe the filamentary structures at a particular redshift. We
also apply our filament catalogue to investigate galaxy luminosity and its
relation with distance to filament. Using a volume-limited sample, we find
strong evidence (6.1$\sigma$ - 12.3$\sigma$) that galaxies close to filaments
are generally brighter than those at significant distance from filaments.
[3]
oai:arXiv.org:1501.05303 [pdf] - 1288321
Cosmic Web Reconstruction through Density Ridges: Method and Algorithm
Submitted: 2015-01-21, last modified: 2015-08-27
The detection and characterization of filamentary structures in the cosmic
web allow cosmologists to constrain parameters that dictate the evolution of
the Universe. While many filament estimators have been proposed, they generally
lack estimates of uncertainty, reducing their inferential power. In this paper,
we demonstrate how one may apply the Subspace Constrained Mean Shift (SCMS)
algorithm (Ozertem & Erdogmus 2011; Genovese et al. 2012) to uncover
filamentary structure in galaxy data. The SCMS algorithm is a gradient ascent
method that models filaments as density ridges, one-dimensional smooth curves
that trace high-density regions within the point cloud. We also demonstrate how
augmenting the SCMS algorithm with bootstrap-based methods of uncertainty
estimation allows one to place uncertainty bands around putative filaments. We
apply the SCMS method to datasets sampled from the P3M N-body simulation, with
galaxy number densities consistent with SDSS and WFIRST-AFTA and to LOWZ and
CMASS data from the Baryon Oscillation Spectroscopic Survey (BOSS). To further
assess the efficacy of SCMS, we compare the relative locations of BOSS
filaments with galaxy clusters in the redMaPPer catalog, and find that
redMaPPer clusters are significantly closer (with p-values $< 10^{-9}$) to
SCMS-detected filaments than to randomly selected galaxies.
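
The ridge-finding idea described in this abstract can be illustrated with a short sketch. This is not the authors' implementation: it assumes 2-D data, a Gaussian kernel with a single bandwidth `h`, and the variant of SCMS that projects the mean-shift vector onto the eigenvector of the kernel density estimate's Hessian with the smaller eigenvalue. The function names `scms_step_2d` and `scms_2d` are hypothetical, and the bootstrap uncertainty bands are not covered.

```python
import math


def scms_step_2d(x, y, data, h):
    """One SCMS update of the point (x, y) against 2-D `data` with bandwidth h."""
    sw = sx = sy = 0.0      # kernel weight sum and weighted coordinate sums
    hxx = hxy = hyy = 0.0   # weighted outer-product sums for the Hessian
    for px, py in data:
        dx, dy = px - x, py - y
        w = math.exp(-0.5 * (dx * dx + dy * dy) / (h * h))
        sw += w
        sx += w * px
        sy += w * py
        hxx += w * dx * dx
        hxy += w * dx * dy
        hyy += w * dy * dy
    # Mean-shift vector: weighted mean of the data minus the current point.
    mx, my = sx / sw - x, sy / sw - y
    # Hessian of the (unnormalised) Gaussian kernel density estimate.
    a = hxx / h**4 - sw / h**2
    b = hxy / h**4
    c = hyy / h**4 - sw / h**2
    # For a symmetric 2x2 matrix, (cos t, sin t) is the eigenvector of the
    # larger eigenvalue, so (-sin t, cos t) belongs to the smaller one.
    t = 0.5 * math.atan2(2.0 * b, a - c)
    vx, vy = -math.sin(t), math.cos(t)
    # Move only along that direction (the "subspace constraint").
    proj = vx * mx + vy * my
    return x + proj * vx, y + proj * vy


def scms_2d(start, data, h, max_iter=200, tol=1e-8):
    """Iterate SCMS updates from `start` until the point stops moving."""
    x, y = start
    for _ in range(max_iter):
        nx, ny = scms_step_2d(x, y, data, h)
        moved = math.hypot(nx - x, ny - y)
        x, y = nx, ny
        if moved < tol:
            break
    return x, y
```

Starting the iteration from each point of a mesh (or from the data points themselves) and collecting the converged positions gives a discrete set of points approximating the density ridges. The bandwidth `h` controls how smooth the recovered curves are.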
[4]
oai:arXiv.org:1508.04149 [pdf] - 1300265
Investigating Galaxy-Filament Alignments in Hydrodynamic Simulations
using Density Ridges
Submitted: 2015-08-17
In this paper, we study the filamentary structures and the galaxy alignment
along filaments at redshift $z=0.06$ in the MassiveBlack-II simulation, a
state-of-the-art, high-resolution hydrodynamical cosmological simulation which
includes stellar and AGN feedback in a volume of (100 Mpc$/h$)$^3$. The
filaments are constructed using the subspace constrained mean shift algorithm
(SCMS; Ozertem & Erdogmus 2011; Chen et al. 2015a). First, we show that
reconstructed filaments using galaxies and reconstructed filaments using dark
matter particles are similar to each other; over $50\%$ of the points on the
galaxy filaments have a corresponding point on the dark matter filaments within
distance $0.13$ Mpc$/h$ (and vice versa) and this distance is even smaller at
high-density regions. Second, we observe the alignment of the major principal
axis of a galaxy with respect to the orientation of its nearest filament and
detect a $2.5$ Mpc$/h$ critical radius for a filament's influence on the
alignment when the subhalo mass of this galaxy is between $10^9M_\odot/h$ and
$10^{12}M_\odot/h$. Moreover, we find the alignment signal to increase
significantly with the subhalo mass. Third, when a galaxy is close to filaments
(less than $0.25$ Mpc$/h$), the galaxy alignment toward the nearest galaxy
group depends on the galaxy subhalo mass. Finally, we find that galaxies close
to filaments or groups tend to be rounder than those away from filaments or
groups.
[5]
oai:arXiv.org:1406.7536 [pdf] - 844312
Estimating the distribution of Galaxy Morphologies on a continuous space
Submitted: 2014-06-29
The incredible variety of galaxy shapes cannot be summarized by human defined
discrete classes of shapes without causing a possibly large loss of
information. Dictionary learning and sparse coding allow us to reduce the high
dimensional space of shapes into a manageable low dimensional continuous vector
space. Statistical inference can be done in the reduced space via probability
distribution estimation and manifold estimation.
[6]
oai:arXiv.org:1404.3168 [pdf] - 809422
Functional Regression for Quasar Spectra
Submitted: 2014-04-11
The Lyman-alpha forest is a portion of the observed light spectrum of distant
galactic nuclei which allows us to probe remote regions of the Universe that
are otherwise inaccessible. The observed Lyman-alpha forest of a quasar light
spectrum can be modeled as a noisy realization of a smooth curve that is
affected by a `damping effect' which occurs whenever the light emitted by the
quasar travels through regions of the Universe with higher matter
concentration. To decode the information conveyed by the Lyman-alpha forest
about the matter distribution, we must be able to separate the smooth
`continuum' from the noise and the contribution of the damping effect in the
quasar light spectra. To predict the continuum in the Lyman-alpha forest, we
use a nonparametric functional regression model in which both the response and
the predictor variable (the smooth part of the damping-free portion of the
spectrum) are function-valued random variables. We demonstrate that the
proposed method accurately predicts the unobservable continuum in the
Lyman-alpha forest both on simulated spectra and real spectra. Also, we
introduce distribution-free prediction bands for the nonparametric functional
regression model that have finite sample guarantees. These prediction bands,
together with bootstrap-based confidence bands for the projection of the mean
continuum on a fixed number of principal components, allow us to assess the
degree of uncertainty in the model predictions.
[7]
oai:arXiv.org:1401.1867 [pdf] - 1202636
Nonparametric 3D map of the IGM using the Lyman-alpha forest
Submitted: 2014-01-08
Visualizing the high-redshift Universe is difficult due to the dearth of
available data; however, the Lyman-alpha forest provides a means to map the
intergalactic medium at redshifts not accessible to large galaxy surveys.
Large-scale structure surveys, such as the Baryon Oscillation Spectroscopic
Survey (BOSS), have collected quasar (QSO) spectra that enable the
reconstruction of HI density fluctuations. The data fall on a collection of
lines defined by the lines-of-sight (LOS) of the QSO, and a major issue with
producing a 3D reconstruction is determining how to model the regions between
the LOS. We present a method that produces a 3D map of this relatively
uncharted portion of the Universe by employing local polynomial smoothing, a
nonparametric methodology. The performance of the method is analyzed on
simulated data that mimics the varying number of LOS expected in real data, and
then is applied to a sample region selected from BOSS. The
reconstruction is assessed by considering various features of the predicted 3D
maps including visual comparison of slices, PDFs, counts of local minima and
maxima, and standardized correlation functions. This 3D reconstruction allows
for an initial investigation of the topology of this portion of the Universe
using persistent homology.
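
As a rough illustration of the local polynomial smoothing mentioned in this abstract, the sketch below fits a weighted straight line around each query point and returns its intercept. It is a 1-D, degree-one simplification with a Gaussian kernel, not the paper's 3-D method. The function name `local_linear` and the bandwidth `h` are hypothetical choices made for the example.

```python
import math


def local_linear(x0, xs, ys, h):
    """Local linear estimate of the regression function at x0.

    Solves the kernel-weighted least-squares problem
        minimise sum_i w_i * (y_i - b0 - b1 * (x_i - x0))**2
    in closed form and returns the intercept b0.
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = math.exp(-0.5 * (d / h) ** 2)   # Gaussian kernel weight
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    denom = s0 * s2 - s1 * s1
    if denom <= 0.0:
        raise ValueError("not enough distinct points near x0 for a local fit")
    return (s2 * t0 - s1 * t1) / denom
```

A useful property of the degree-one fit is that it reproduces exactly linear data without bias, including at the edges of the sampled range, which is one reason local polynomials are often preferred to plain kernel averaging.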
[8]
oai:arXiv.org:1202.2902 [pdf] - 1116614
Regularization Techniques for PSF-Matching Kernels. I. Choice of Kernel
Basis
Submitted: 2012-02-13
We review current methods for building PSF-matching kernels for the purposes
of image subtraction or coaddition. Such methods use a linear decomposition of
the kernel on a series of basis functions. The correct choice of these basis
functions is fundamental to the efficiency and effectiveness of the matching -
the chosen bases should represent the underlying signal using a reasonably
small number of shapes, and/or have a minimum number of user-adjustable tuning
parameters. We examine methods whose bases comprise multiple Gauss-Hermite
polynomials, as well as a form-free basis composed of delta-functions. Kernels
derived from delta-functions are unsurprisingly shown to be more expressive;
they are able to take more general shapes and perform better in situations
where sum-of-Gaussian methods are known to fail. However, due to its many
degrees of freedom (the maximum number allowed by the kernel size) this basis
tends to overfit the problem, and yields noisy kernels having large variance.
We introduce a new technique to regularize these delta-function kernel
solutions, which bridges the gap between the generality of delta-function
kernels, and the compactness of sum-of-Gaussian kernels. Through this
regularization we are able to create general kernel solutions that represent
the intrinsic shape of the PSF-matching kernel with only one degree of freedom,
the strength of the regularization lambda. The role of lambda is effectively to
exchange variance in the resulting difference image with variance in the kernel
itself. We examine considerations in choosing the value of lambda, including
statistical risk estimators and the ability of the solution to predict
solutions for adjacent areas. Both of these suggest moderate strengths of
lambda between 0.1 and 1.0, although this optimization is likely dataset
dependent.
[9]
oai:arXiv.org:1003.5536 [pdf] - 951278
The Geometry of Nonparametric Filament Estimation
Submitted: 2010-03-25, last modified: 2010-12-12
We consider the problem of estimating filamentary structure from planar point
process data. We make some connections with computational geometry and we
develop nonparametric methods for estimating the filaments. We show that, under
weak conditions, the filaments have a simple geometric representation as the
medial axis of the data distribution's support. Our methods convert an
estimator of the support's boundary into an estimator of the filaments. We also
find the rates of convergence of our estimators.
[10]
oai:arXiv.org:1011.4059 [pdf] - 1042031
Image Coaddition with Temporally Varying Kernels
Submitted: 2010-11-17
Large, multi-frequency imaging surveys, such as the Large Synoptic Survey
Telescope (LSST), need to do near-real time analysis of very large datasets.
This raises a host of statistical and computational problems where standard
methods do not work. In this paper, we study a proposed method for combining
stacks of images into a single summary image, sometimes referred to as a
template. This task is commonly referred to as image coaddition. In part, we
focus on a method proposed in previous work, which outlines a procedure for
combining stacks of images in an online fashion in the Fourier domain. We
evaluate this method by comparing it to two straightforward methods through the
use of various criteria and simulations. Note that the goal is not to propose
these comparison methods for use in their own right, but to ensure that
additional complexity also provides substantially improved performance.
[11]
oai:arXiv.org:0910.5449 [pdf] - 30040
Straight to the Source: Detecting Aggregate Objects in Astronomical
Images with Proper Error Control
Submitted: 2009-10-28
The next generation of telescopes will acquire terabytes of image data on a
nightly basis. Collectively, these large images will contain billions of
interesting objects, which astronomers call sources. The astronomers' task is
to construct a catalog detailing the coordinates and other properties of the
sources. The source catalog is the primary data product for most telescopes and
is an important input for testing new astrophysical theories, but to construct
the catalog one must first detect the sources. Existing algorithms for catalog
creation are effective at detecting sources, but do not have rigorous
statistical error control. At the same time, there are several multiple testing
procedures that provide rigorous error control, but they are not designed to
detect sources that are aggregated over several pixels. In this paper, we
propose a technique that does both, by providing rigorous statistical error
control on the aggregate objects themselves rather than the pixels. We
demonstrate the effectiveness of this approach on data from the Chandra X-ray
Observatory Satellite. Our technique effectively controls the rate of false
sources, yet still detects almost all of the sources detected by procedures
that do not have such rigorous error control and have the advantage of
additional data in the form of follow up observations, which will not be
available for upcoming large telescopes. In fact, we even detect a new source
that was missed by previous studies. The statistical methods developed in this
paper can be extended to problems beyond Astronomy, as we will illustrate with
an example from Neuroimaging.
[12]
oai:arXiv.org:0805.4136 [pdf] - 12977
Inference for the dark energy equation of state using Type Ia supernova
data
Submitted: 2008-05-27, last modified: 2009-05-18
The surprising discovery of an accelerating universe led cosmologists to
posit the existence of "dark energy"--a mysterious energy field that permeates
the universe. Understanding dark energy has become the central problem of
modern cosmology. After describing the scientific background in depth, we
formulate the task as a nonlinear inverse problem that expresses the comoving
distance function in terms of the dark-energy equation of state. We present two
classes of methods for making sharp statistical inferences about the equation
of state from observations of Type Ia Supernovae (SNe). First, we derive a
technique for testing hypotheses about the equation of state that requires no
assumptions about its form and can distinguish among competing theories.
Second, we present a framework for computing parametric and nonparametric
estimators of the equation of state, with an associated assessment of
uncertainty. Using our approach, we evaluate the strength of statistical
evidence for various competing models of dark energy. Consistent with current
studies, we find that with the available Type Ia SNe data, it is not possible
to distinguish statistically among popular dark-energy models, and that, in
particular, there is no support in the data for rejecting a cosmological
constant. With much more supernova data likely to be available in coming years
(e.g., from the DOE/NASA Joint Dark Energy Mission), we address the more
interesting question of whether future data sets will have sufficient
resolution to distinguish among competing theories.
[13]
oai:arXiv.org:0809.2800 [pdf] - 16406
Revealing components of the galaxy population through nonparametric
techniques
Submitted: 2008-09-16
The distributions of galaxy properties vary with environment, and are often
multimodal, suggesting that the galaxy population may be a combination of
multiple components. The behaviour of these components versus environment holds
details about the processes of galaxy development. To release this information
we apply a novel, nonparametric statistical technique, identifying four
components present in the distribution of galaxy H$\alpha$ emission-line
equivalent-widths. We interpret these components as passive, star-forming, and
two varieties of active galactic nuclei. Independent of this interpretation,
the properties of each component are remarkably constant as a function of
environment. Only their relative proportions display substantial variation. The
galaxy population thus appears to comprise distinct components which are
individually independent of environment, with galaxies rapidly transitioning
between components as they move into denser environments.
[14]
oai:arXiv.org:0704.2605 [pdf] - 582
Mapping the Cosmological Confidence Ball Surface
Submitted: 2007-04-19
We present a new technique to compute simultaneously valid confidence
intervals for a set of model parameters. We apply our method to the Wilkinson
Microwave Anisotropy Probe's (WMAP) Cosmic Microwave Background (CMB) data,
exploring a seven dimensional space (tau, Omega_DE, Omega_M, omega_DM, omega_B,
f_nu, n_s). We find two distinct regions-of-interest: the standard Concordance
Model, and a region with large values of omega_DM, omega_B and H_0. This second
peak in parameter space can be rejected by applying a constraint (or a prior)
on the allowable values of the Hubble constant. Our new technique uses a
non-parametric fit to the data, along with a frequentist approach and a smart
search algorithm to map out a statistical confidence surface. The result is a
confidence ``ball'': a set of parameter values that contains the true value
with probability at least 1-alpha. Our algorithm performs a role similar to the
often used Markov Chain Monte Carlo (MCMC), which samples from the posterior
probability function in order to provide Bayesian credible intervals on the
parameters. While the MCMC approach samples densely around a peak in the
posterior, our new technique allows cosmologists to perform efficient analyses
around any regions of interest: e.g., the peak itself, or, possibly more
importantly, the 1-alpha confidence surface.
[15]
oai:arXiv.org:astro-ph/0511437 [pdf] - 1592066
Statistical Computations with AstroGrid and the Grid
Submitted: 2005-11-15
We outline our first steps towards marrying two new and emerging
technologies: the Virtual Observatory (e.g., AstroGrid) and the computational
grid. We discuss the construction of VOTechBroker, which is a modular software
tool designed to abstract the tasks of submission and management of a large
number of computational jobs to a distributed computer system. The broker will
also interact with the AstroGrid workflow and MySpace environments. We present
our planned usage of the VOTechBroker in computing a huge number of n-point
correlation functions from the SDSS, as well as fitting over a million CMBfast
models to the WMAP data.
[16]
oai:arXiv.org:astro-ph/0510844 [pdf] - 77340
Massive Science with VO and Grids
Nichol, Robert;
Smith, Garry;
Miller, Christopher;
Freeman, Peter;
Genovese, Chris;
Wasserman, Larry;
Bryan, Brent;
Gray, Alexander;
Schneider, Jeff;
Moore, Andrew
Submitted: 2005-10-31
There is a growing need for massive computational resources for the analysis
of new astronomical datasets. To tackle this problem, we present here our first
steps towards marrying two new and emerging technologies: the Virtual
Observatory (e.g., AstroGrid) and the computational grid (e.g., TeraGrid, COSMOS,
etc.). We discuss the construction of VOTechBroker, which is a modular software
tool designed to abstract the tasks of submission and management of a large
number of computational jobs to a distributed computer system. The broker will
also interact with the AstroGrid workflow and MySpace environments. We discuss
our planned usages of the VOTechBroker in computing a huge number of n-point
correlation functions from the SDSS data and massive model-fitting of millions
of CMBfast models to WMAP data. We also discuss other applications including
the determination of the XMM Cluster Survey selection function and the
construction of new WMAP maps.
[17]
oai:arXiv.org:astro-ph/0510406 [pdf] - 260658
Examining the Effect of the Map-Making Algorithm on Observed Power
Asymmetry in WMAP Data
Submitted: 2005-10-13
We analyze first-year data of WMAP to determine the significance of asymmetry
in summed power between arbitrarily defined opposite hemispheres, using maps
that we create ourselves with software developed independently of the WMAP
team. We find that over the multipole range l=[2,64], the significance of
asymmetry is ~ 10^-4, a value insensitive to both frequency and power spectrum.
We determine the smallest multipole ranges exhibiting significant asymmetry,
and find twelve, including l=[2,3] and [6,7], for which the significance -> 0.
In these ranges there is an improbable association between the direction of
maximum significance and the ecliptic plane (p ~ 0.01). Also, contours of least
significance follow great circles inclined relative to the ecliptic at the
largest scales. The great circle for l=[2,3] passes over previously reported
preferred axes and is insensitive to frequency, while the great circle for
l=[6,7] is aligned with the ecliptic poles. We examine how changing map-making
parameters affects asymmetry, and find that at large scales, it is rendered
insignificant if the magnitude of the WMAP dipole vector is increased by
approximately 1-3 sigma (or 2-6 km/s). While confirmation of this result would
require data recalibration, such a systematic change would be consistent with
observations of frequency-independent asymmetry. We conclude that the use of an
incorrect dipole vector, in combination with a systematic or foreground process
associated with the ecliptic, may help to explain the observed asymmetry.
[18]
oai:arXiv.org:astro-ph/0410140 [pdf] - 1233420
Nonparametric Inference for the Cosmic Microwave Background
Submitted: 2004-10-06
The Cosmic Microwave Background (CMB), which permeates the entire Universe,
is the radiation left over from just 380,000 years after the Big Bang. On very
large scales, the CMB radiation field is smooth and isotropic, but the
existence of structure in the Universe - stars, galaxies, clusters of galaxies
- suggests that the field should fluctuate on smaller scales. Recent
observations, from the Cosmic Background Explorer to the Wilkinson
Microwave Anisotropy Probe, have strikingly confirmed this prediction. CMB
fluctuations provide clues to the Universe's structure and composition shortly
after the Big Bang that are critical for testing cosmological models. For
example, CMB data can be used to determine what portion of the Universe is
composed of ordinary matter versus the mysterious dark matter and dark energy.
To this end, cosmologists usually summarize the fluctuations by the power
spectrum, which gives the variance as a function of angular frequency. The
spectrum's shape, and in particular the location and height of its peaks,
relates directly to the parameters in the cosmological models. Thus, a critical
statistical question is how accurately these peaks can be estimated. We use
recently developed techniques to construct a nonparametric confidence set for
the unknown CMB spectrum. Our estimated spectrum, based on minimal assumptions,
closely matches the model-based estimates used by cosmologists, but we can make
a wide range of additional inferences. We apply these techniques to test
various models and to extract confidence intervals on cosmological parameters
of interest. Our analysis shows that, even without parametric assumptions, the
first peak is resolved accurately with current data but that the second and
third peaks are not.
[19]
oai:arXiv.org:astro-ph/0401121 [pdf] - 61991
Multi-Tree Methods for Statistics on Very Large Datasets in Astronomy
Submitted: 2004-01-08
Many fundamental statistical methods have become critical tools for
scientific data analysis yet do not scale tractably to modern large datasets.
This paper will describe very recent algorithms based on computational geometry
which have dramatically reduced the computational complexity of 1) kernel
density estimation (which also extends to nonparametric regression,
classification, and clustering), and 2) the n-point correlation function for
arbitrary n. These new multi-tree methods typically yield orders of magnitude
in speedup over the previous state of the art for similar accuracy, making
millions of data points tractable on desktop workstations for the first time.
[20]
oai:arXiv.org:astro-ph/0112050 [pdf] - 46423
Non-Parametric Inference in Astrophysics
Submitted: 2001-12-03
We discuss non-parametric density estimation and regression for astrophysics
problems. In particular, we show how to compute non-parametric confidence
intervals for the location and size of peaks of a function. We illustrate these
ideas with recent data on the Cosmic Microwave Background. We also briefly
discuss non-parametric Bayesian inference.
[21]
oai:arXiv.org:astro-ph/0112049 [pdf] - 46422
A Non-parametric Analysis of the CMB Power Spectrum
Submitted: 2001-12-03
We examine Cosmic Microwave Background (CMB) temperature power spectra from
the BOOMERANG, MAXIMA, and DASI experiments. We non-parametrically estimate the
true power spectrum with no model assumptions. This is a significant departure
from previous research which used either cosmological models or some other
parameterized form (e.g. parabolic fits). Our non-parametric estimate is
practically indistinguishable from the best fit cosmological model, thus
lending independent support to the underlying physics that governs these
models. We also generate a confidence set for the non-parametric fit and
extract confidence intervals for the numbers, locations, and heights of peaks
and the successive peak-to-peak height ratios. At the 95%, 68%, and 40%
confidence levels, we find functions that fit the data with one, two, and three
peaks respectively (0 <= l <= 1100). Therefore, the current data prefer two
peaks at the 1 sigma level. However, we also rule out a constant temperature
function at the >8 sigma level. If we assume that there are three peaks in the
data, we find their locations to be within l_1 = (118,300), l_2 = (377,650),
and l_3 = (597,900). We find the ratio of the first peak-height to the second
(Delta T_1)/(Delta T_2)^2= (1.06, 4.27) and the second to the third (Delta
T_2)/(Delta T_3)^2= (0.41, 2.5). All measurements are for 95% confidence. If
the standard errors on the temperature measurements were reduced to a third of
what they are currently, as we expect to be achieved by the MAP and Planck CMB
experiments, we could eliminate two-peak models at the 95% confidence limit.
The non-parametric methodology discussed in this paper has many astrophysical
applications.
[22]
oai:arXiv.org:astro-ph/0110570 [pdf] - 45627
A new source detection algorithm using FDR
Submitted: 2001-10-26
The False Discovery Rate (FDR) method has recently been described by Miller
et al. (2001), along with several examples of astrophysical applications. FDR is
a new statistical procedure due to Benjamini and Hochberg (1995) for
controlling the fraction of false positives when performing multiple hypothesis
testing. The importance of this method to source detection algorithms is
immediately clear. To explore the possibilities offered we have developed a new
task for performing source detection in radio-telescope images, Sfind 2.0,
which implements FDR. We compare Sfind 2.0 with two other source detection and
measurement tasks, Imsad and SExtractor, and comment on several issues arising
from the nature of the correlation between nearby pixels and the necessary
assumption of the null hypothesis. The strong suggestion is made that
implementing FDR as a threshold defining method in other existing
source-detection tasks is easy and worthwhile. We show that the constraint on
the fraction of false detections as specified by FDR holds true even for highly
correlated and realistic images. For the detection of true sources, which are
complex combinations of source-pixels, this constraint appears to be somewhat
less strict. It is still reliable enough, however, for a priori estimates of
the fraction of false source detections to be robust and realistic.
[23]
oai:arXiv.org:astro-ph/0110230 [pdf] - 45288
Computational AstroStatistics: Fast and Efficient Tools for Analysing
Huge Astronomical Data Sources
Nichol, R. C.;
Chong, S.;
Connolly, A. J.;
Davies, S.;
Genovese, C.;
Hopkins, A. M.;
Miller, C. J.;
Moore, A. W.;
Pelleg, D.;
Richards, G. T.;
Schneider, J.;
Szapudi, I.;
Wasserman, L.
Submitted: 2001-10-09
I present here a review of past and present multi-disciplinary research of
the Pittsburgh Computational AstroStatistics (PiCA) group. This group is
dedicated to developing fast and efficient statistical algorithms for analysing
huge astronomical data sources. I begin with a short review of
multi-resolutional kd-trees, which are the building blocks for many of our
algorithms, for example quick range queries and fast n-point correlation
functions. I will present new results from the use of Mixture Models (Connolly
et al. 2000) in density estimation of multi-color data from the Sloan Digital
Sky Survey (SDSS), specifically the selection of quasars and the automated
identification of X-ray sources. I will also present a brief overview of the
False Discovery Rate (FDR) procedure (Miller et al. 2001a) and show how it has
been used in the detection of ``Baryon Wiggles'' in the local galaxy power
spectrum and source identification in radio data. Finally, I will look forward
to new research on an automated Bayes Network anomaly detector and the possible
use of the Locally Linear Embedding algorithm (LLE; Roweis & Saul 2000) for
spectral classification of SDSS spectra.
[24]
oai:arXiv.org:astro-ph/0107034 [pdf] - 43405
Controlling the False Discovery Rate in Astrophysical Data Analysis
Submitted: 2001-07-02
The False Discovery Rate (FDR) is a new statistical procedure to control the
number of mistakes made when performing multiple hypothesis tests, i.e. when
comparing many data against a given model hypothesis. The key advantage of FDR
is that it allows one to a priori control the average fraction of false
rejections made (when comparing to the null hypothesis) over the total number
of rejections performed. We compare FDR to the standard procedure of rejecting
all tests that do not match the null hypothesis above some arbitrarily chosen
confidence limit, e.g. 2 sigma, or at the 95% confidence level. When using FDR,
we find a similar rate of correct detections, but with significantly fewer
false detections. Moreover, the FDR procedure is quick and easy to compute and
can be trivially adapted to work with correlated data. The purpose of this
paper is to introduce the FDR procedure to the astrophysics community. We
illustrate the power of FDR through several astronomical examples, including
the detection of features against a smooth one-dimensional function, e.g.
seeing the ``baryon wiggles'' in a power spectrum of matter fluctuations, and
source pixel detection in imaging data. In this era of large datasets and high
precision measurements, FDR provides the means to adaptively control a
scientifically meaningful quantity -- the number of false discoveries made when
conducting multiple hypothesis tests.
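As a minimal sketch of the kind of procedure this abstract describes, the block below implements the Benjamini-Hochberg step-up rule, which we assume is the FDR procedure meant here; the abstract itself does not spell out the algorithm. The function name `benjamini_hochberg`, the parameter `alpha`, and the example p-values are illustrative and do not come from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Assumed rule (Benjamini-Hochberg step-up): sort the n p-values, find the
    largest rank k with p_(k) <= alpha * k / n, and reject the k smallest.
    """
    n = len(p_values)
    # Indices of the tests, ordered by ascending p-value.
    order = sorted(range(n), key=lambda i: p_values[i])

    # Largest 1-based rank whose p-value falls under the sloped threshold.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / n:
            cutoff_rank = rank

    # Reject every test ranked at or below the cutoff.
    rejected = [False] * n
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    fdr_hits = sum(benjamini_hochberg(p, alpha=0.05))
    fixed_hits = sum(1 for x in p if x <= 0.05)
    print("rejections with FDR control:", fdr_hits)
    print("rejections with a fixed 0.05 cut:", fixed_hits)
```

The threshold adapts to the data: it grows with the rank of each p-value, so the cut is set by how many tests look significant rather than by a single fixed confidence limit.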
[25]
oai:arXiv.org:astro-ph/0012333 [pdf] - 39922
Fast Algorithms and Efficient Statistics: N-point Correlation Functions
Moore, Andrew;
Connolly, Andy;
Genovese, Chris;
Gray, Alex;
Grone, Larry;
Kanidoris, Nick;
Nichol, Robert;
Schneider, Jeff;
Szalay, Alex;
Szapudi, Istvan;
Wasserman, Larry
Submitted: 2000-12-14
We present here a new algorithm for the fast computation of N-point
correlation functions in large astronomical data sets. The algorithm is based
on kd-trees decorated with cached sufficient statistics, thus allowing
for orders of magnitude speed-ups over the naive non-tree-based implementation
of correlation functions. We further discuss the use of controlled
approximations within the computation which allows for further acceleration. In
summary, our algorithm now makes it possible to compute exact, all-pairs,
measurements of the 2, 3 and 4-point correlation functions for cosmological
data sets like the Sloan Digital Sky Survey (SDSS; York et al. 2000) and the
next generation of Cosmic Microwave Background experiments (see Szapudi et al.
2000).
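The idea of a kd-tree with cached sufficient statistics can be sketched for the simplest case, exact pair counts within a radius (the raw ingredient of a 2-point function). This is a simplified single-tree illustration, not the algorithm of the paper: each node caches its point count and bounding box, so whole nodes can be skipped or counted at once. All names (`Node`, `count_within`, `pair_count`, `leaf_size`) are hypothetical, and the input is assumed to be a non-empty list of equal-length coordinate tuples.

```python
import random


class Node:
    """kd-tree node caching its point count and bounding box."""

    def __init__(self, points, leaf_size=8, depth=0):
        dim = len(points[0])
        self.count = len(points)  # cached sufficient statistic
        self.lo = [min(p[d] for p in points) for d in range(dim)]
        self.hi = [max(p[d] for p in points) for d in range(dim)]
        self.points = None
        self.left = self.right = None
        if len(points) <= leaf_size:
            self.points = points  # leaf: keep the raw points
        else:
            axis = depth % dim  # cycle through the coordinate axes
            points = sorted(points, key=lambda p: p[axis])
            mid = len(points) // 2
            self.left = Node(points[:mid], leaf_size, depth + 1)
            self.right = Node(points[mid:], leaf_size, depth + 1)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def count_within(node, q, r2):
    """Count points under `node` with squared distance to q at most r2."""
    dmin = dmax = 0.0
    for x, lo, hi in zip(q, node.lo, node.hi):
        dmin += max(lo - x, 0.0, x - hi) ** 2  # nearest point of the box
        dmax += max(x - lo, hi - x) ** 2       # farthest corner of the box
    if dmin > r2:
        return 0            # box entirely outside the radius: prune
    if dmax <= r2:
        return node.count   # box entirely inside: use the cached count
    if node.points is not None:
        return sum(1 for p in node.points if sq_dist(p, q) <= r2)
    return count_within(node.left, q, r2) + count_within(node.right, q, r2)


def pair_count(points, r):
    """Exact number of unordered pairs separated by at most r."""
    root = Node(points)
    r2 = r * r
    total = sum(count_within(root, q, r2) for q in points)
    return (total - len(points)) // 2  # drop self-matches, halve ordered pairs


def pair_count_naive(points, r):
    """Brute-force reference: test every pair directly."""
    r2 = r * r
    n = len(points)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if sq_dist(points[i], points[j]) <= r2)


if __name__ == "__main__":
    random.seed(0)
    pts = [(random.random(), random.random()) for _ in range(500)]
    print(pair_count(pts, 0.1) == pair_count_naive(pts, 0.1))
```

The tree version returns the same exact count as the brute-force loop; the saving comes from the two early exits, which avoid visiting individual points whenever a node lies wholly inside or outside the search radius.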
[26]
oai:arXiv.org:astro-ph/0011557 [pdf] - 39563
SDSS-RASS: Next Generation of Cluster-Finding Algorithms
Nichol, R.;
Miller, C.;
Connolly, A.;
Chong, S.;
Genovese, C.;
Moore, A.;
Reichart, D.;
Schneider, J.;
Wasserman, L.;
Annis, J.;
Brinkman, J.;
Bohringer, H.;
Castander, F.;
Kim, R.;
McKay, T.;
Postman, M.;
Sheldon, E.;
Szapudi, I.;
Romer, K.;
Voges, W.
Submitted: 2000-11-29
We outline here the next generation of cluster-finding algorithms. We show
how advances in Computer Science and Statistics have helped develop robust,
fast algorithms for finding clusters of galaxies in large multi-dimensional
astronomical databases like the Sloan Digital Sky Survey (SDSS). Specifically,
this paper presents four new advances: (1) A new semi-parametric algorithm -
nicknamed ``C4'' - for jointly finding clusters of galaxies in the SDSS and
ROSAT All-Sky Survey databases; (2) The introduction of the False Discovery
Rate into Astronomy; (3) The role of kernel shape in optimizing cluster
detection; (4) A new determination of the X-ray Cluster Luminosity Function
which has bearing on the existence of a ``deficit'' of high redshift, high
luminosity clusters. This research is part of our ``Computational
AstroStatistics'' collaboration (see Nichol et al. 2000) and the algorithms and
techniques discussed herein will form part of the ``Virtual Observatory''
analysis toolkit.
[27]
oai:arXiv.org:astro-ph/0008187 [pdf] - 37503
Fast Algorithms and Efficient Statistics: Density Estimation in Large
Astronomical Datasets
Submitted: 2000-08-11
In this paper, we outline the use of Mixture Models in density estimation of
large astronomical databases. This method of density estimation has been known
in Statistics for some time but has not been implemented because of the large
computational cost. Herein, we detail an implementation of the Mixture Model
density estimation based on multi-resolutional KD-trees which makes this
statistical technique into a computationally tractable problem. We provide the
theoretical and experimental background for using a mixture model of Gaussians
based on the Expectation Maximization (EM) Algorithm. Applying these analyses
to simulated data sets we show that the EM algorithm - using the AIC penalized
likelihood to score the fit - out-performs the best kernel density estimate of
the distribution while requiring no ``fine-tuning'' of the input algorithm
parameters. We find that EM can accurately recover the underlying density
distribution from point processes thus providing an efficient adaptive
smoothing method for astronomical source catalogs. To demonstrate the general
application of this statistic to astrophysical problems we consider two cases
of density estimation: the clustering of galaxies in redshift space and the
clustering of stars in color space. From these data we show that EM provides an
adaptive smoothing of the distribution of galaxies in redshift space
(describing accurately both the small and large-scale features within the data)
and a means of identifying outliers in multi-dimensional color-color space
(e.g. for the identification of high redshift QSOs). Automated tools such as
those based on the EM algorithm will be needed in the analysis of the next
generation of astronomical catalogs (2MASS, FIRST, PLANCK, SDSS) and ultimately
in the development of the National Virtual Observatory.
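A minimal sketch of the approach outlined in this abstract, EM fitting of a Gaussian mixture scored by the AIC, is given below for one-dimensional data. It is plain EM without the kd-tree acceleration that the paper describes, and the initialisation, the variance floor, the parameter count used in the AIC, and all function names (`fit_gmm_1d`, `aic`, `select_by_aic`) are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, mus, variances):
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances))


def fit_gmm_1d(data, k, n_iter=100, var_floor=1e-6):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Returns (weights, means, variances, log-likelihood).
    """
    n = len(data)
    s = sorted(data)
    # Assumed initialisation: means at evenly spaced quantiles,
    # a shared variance equal to the sample variance, equal weights.
    mus = [s[int((j + 0.5) * n / k)] for j in range(k)]
    mean = sum(data) / n
    var0 = max(sum((x - mean) ** 2 for x in data) / n, var_floor)
    variances = [var0] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, mus, variances)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)  # floor avoids collapse

    loglik = sum(math.log(mixture_pdf(x, weights, mus, variances) or 1e-300)
                 for x in data)
    return weights, mus, variances, loglik


def aic(loglik, k):
    """AIC = 2 * (free parameters) - 2 * log-likelihood; lower is better."""
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    return 2 * n_params - 2 * loglik


def select_by_aic(data, max_k=4):
    """Return the component count in 1..max_k with the lowest AIC."""
    scores = {k: aic(fit_gmm_1d(data, k)[3], k) for k in range(1, max_k + 1)}
    return min(scores, key=scores.get)


if __name__ == "__main__":
    random.seed(1)
    # Synthetic data: two well-separated Gaussian clumps.
    data = ([random.gauss(-3.0, 0.5) for _ in range(150)]
            + [random.gauss(3.0, 0.5) for _ in range(150)])
    print("components chosen by AIC:", select_by_aic(data))
```

The penalty term is what stands in for hand tuning: adding components always raises the likelihood, and the AIC accepts an extra component only when the gain outweighs the cost of its additional parameters.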
[28]
oai:arXiv.org:astro-ph/0007404 [pdf] - 37242
Computational AstroStatistics: Fast Algorithms and Efficient Statistics
for Density Estimation in Large Astronomical Datasets
Submitted: 2000-07-26
We present initial results on the use of Mixture Models for density
estimation in large astronomical databases. We provide herein both the
theoretical and experimental background for using a mixture model of Gaussians
based on the Expectation Maximization (EM) Algorithm. Applying these analyses
to simulated data sets we show that the EM algorithm - using both the AIC and
BIC penalized likelihoods to score the fit - can out-perform the best kernel
density estimate of the distribution while requiring no ``fine-tuning'' of the
input algorithm parameters. We find that EM can accurately recover the
underlying density distribution from point processes thus providing an
efficient adaptive smoothing method for astronomical source catalogs. To
demonstrate the general application of this statistic to astrophysical problems
we consider two cases of density estimation: the clustering of galaxies in
redshift space and the clustering of stars in color space. From these data we
show that EM provides an adaptive smoothing of the distribution of galaxies in
redshift space (describing accurately both the small and large-scale features
within the data) and a means of identifying outliers in multi-dimensional
color-color space (e.g. for the identification of high redshift QSOs).
Automated tools such as those based on the EM algorithm will be needed in the
analysis of the next generation of astronomical catalogs (2MASS, FIRST, PLANCK,
SDSS) and ultimately in the development of the National Virtual Observatory.