Full-text search for arXiv

28 article(s) in total. 61 co-authors, from 1 to 25 common article(s). Median position in authors list is 4.0.

[1] oai:arXiv.org:1509.06376 [pdf] - 1530313

Detecting Effects of Filaments on Galaxy Properties in the Sloan Digital Sky Survey III

Chen, Yen-Chi; Ho, Shirley; Mandelbaum, Rachel; Bahcall, Neta A.; Brownstein, Joel R.; Freeman, Peter E.; Genovese, Christopher R.; Schneider, Donald P.; Wasserman, Larry

Comments: To appear in MNRAS

Submitted: 2015-09-21, last modified: 2017-01-12

We study the effects of filaments on galaxy properties in the Sloan Digital Sky Survey (SDSS) Data Release 12 using filaments from the `Cosmic Web Reconstruction' catalogue (Chen et al. 2016), a publicly available filament catalogue for SDSS. Since filaments are tracers of medium-to-high density regions, we expect that galaxy properties associated with the environment are dependent on the distance to the nearest filament. Our analysis demonstrates that a red galaxy or a high-mass galaxy tends to reside closer to filaments than a blue or low-mass galaxy. After adjusting for the effect of stellar mass, on average, early-forming galaxies or large galaxies have a shorter distance to filaments than late-forming galaxies or small galaxies. For the Main galaxy sample (MGS), all signals are very significant ($>6\sigma$). For the LOWZ and CMASS samples, the stellar mass and size signals are significant ($>2 \sigma$). The filament effects we observe persist until $z = 0.7$ (the edge of the CMASS sample). Comparing our results to those using the galaxy distances from redMaPPer galaxy clusters as a reference, we find a similar result between filaments and clusters. Moreover, we find that the effect of clusters on the stellar mass of nearby galaxies depends on the galaxy's filamentary environment. Our findings illustrate the strong correlation of galaxy properties with proximity to density ridges, strongly supporting the claim that density ridges are good tracers of filaments.

[2] oai:arXiv.org:1509.06443 [pdf] - 1447640

Cosmic Web Reconstruction through Density Ridges: Catalogue

Chen, Yen-Chi; Ho, Shirley; Brinkmann, Jon; Freeman, Peter E.; Genovese, Christopher R.; Schneider, Donald P.; Wasserman, Larry

Comments: 14 pages, 12 figures, 4 tables

Submitted: 2015-09-21

We construct a catalogue for filaments using a novel approach called SCMS (subspace constrained mean shift; Ozertem & Erdogmus 2011; Chen et al. 2015). SCMS is a gradient-based method that detects filaments through density ridges (smooth curves tracing high-density regions). A great advantage of SCMS is its uncertainty measure, which allows an evaluation of the errors for the detected filaments. To detect filaments, we use data from the Sloan Digital Sky Survey, which consist of three galaxy samples: the NYU main galaxy sample (MGS), the LOWZ sample and the CMASS sample. Each of the three datasets covers a different redshift region so that the combined sample allows detection of filaments up to z = 0.7. Our filament catalogue consists of a sequence of two-dimensional filament maps at different redshifts that provide several useful statistics on the evolution of the cosmic web. To construct the maps, we select spectroscopically confirmed galaxies within 0.050 < z < 0.700 and partition them into 130 bins. For each bin, we ignore the redshift, treating the galaxy observations as 2-D data, and detect filaments using SCMS. The filament catalogue consists of 130 individual 2-D filament maps, and each map comprises points on the detected filaments that describe the filamentary structures at a particular redshift. We also apply our filament catalogue to investigate galaxy luminosity and its relation with distance to filament. Using a volume-limited sample, we find strong evidence (6.1$\sigma$ - 12.3$\sigma$) that galaxies close to filaments are generally brighter than those at significant distance from filaments.

[3] oai:arXiv.org:1501.05303 [pdf] - 1288321

Cosmic Web Reconstruction through Density Ridges: Method and Algorithm

Chen, Yen-Chi; Ho, Shirley; Freeman, Peter E.; Genovese, Christopher R.; Wasserman, Larry

Comments: To appear in MNRAS. 18 pages, 19 figures, 1 table

Submitted: 2015-01-21, last modified: 2015-08-27

The detection and characterization of filamentary structures in the cosmic web allows cosmologists to constrain parameters that dictate the evolution of the Universe. While many filament estimators have been proposed, they generally lack estimates of uncertainty, reducing their inferential power. In this paper, we demonstrate how one may apply the Subspace Constrained Mean Shift (SCMS) algorithm (Ozertem and Erdogmus (2011); Genovese et al. (2012)) to uncover filamentary structure in galaxy data. The SCMS algorithm is a gradient ascent method that models filaments as density ridges, one-dimensional smooth curves that trace high-density regions within the point cloud. We also demonstrate how augmenting the SCMS algorithm with bootstrap-based methods of uncertainty estimation allows one to place uncertainty bands around putative filaments. We apply the SCMS method to datasets sampled from the P3M N-body simulation, with galaxy number densities consistent with SDSS and WFIRST-AFTA, and to LOWZ and CMASS data from the Baryon Oscillation Spectroscopic Survey (BOSS). To further assess the efficacy of SCMS, we compare the relative locations of BOSS filaments with galaxy clusters in the redMaPPer catalog, and find that redMaPPer clusters are significantly closer (with p-values $< 10^{-9}$) to SCMS-detected filaments than to randomly selected galaxies.
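
As a rough illustration of the ridge-finding idea described in this abstract, the following is a minimal two-dimensional sketch of a subspace constrained mean shift iteration. It is not the authors' implementation: it assumes a fixed Gaussian kernel of bandwidth `h`, uses the Hessian of the kernel density estimate to define the constrained subspace, and omits the bootstrap uncertainty bands. The names `scms`, `smallest_eigenvector`, and the synthetic `band` data are hypothetical.

```python
import math


def smallest_eigenvector(a, b, c):
    """Unit eigenvector for the smallest eigenvalue of [[a, b], [b, c]]."""
    lam = 0.5 * (a + c) - math.hypot(0.5 * (a - c), b)
    # Two algebraically equivalent candidates; keep the better-conditioned one.
    v1 = (b, lam - a)
    v2 = (lam - c, b)
    vx, vy = v1 if math.hypot(*v1) >= math.hypot(*v2) else v2
    norm = math.hypot(vx, vy)
    if norm == 0.0:  # the matrix is a multiple of the identity
        return 1.0, 0.0
    return vx / norm, vy / norm


def scms(point, data, h, max_iter=200, tol=1e-9):
    """Move `point` onto a density ridge of the Gaussian KDE of `data`."""
    x, y = point
    for _ in range(max_iter):
        sw = sx = sy = 0.0      # kernel weight sum and weighted data sums
        hxx = hxy = hyy = 0.0   # Hessian of the KDE, up to a positive constant
        for px, py in data:
            dx, dy = x - px, y - py
            w = math.exp(-(dx * dx + dy * dy) / (2.0 * h * h))
            sw += w
            sx += w * px
            sy += w * py
            hxx += w * (dx * dx / h**4 - 1.0 / h**2)
            hxy += w * (dx * dy / h**4)
            hyy += w * (dy * dy / h**4 - 1.0 / h**2)
        # Ordinary mean-shift vector: weighted mean of the data minus the point.
        mx, my = sx / sw - x, sy / sw - y
        # Constrain the move to the direction of strongest negative curvature,
        # i.e. the direction normal to a one-dimensional ridge in 2-D.
        vx, vy = smallest_eigenvector(hxx, hxy, hyy)
        step = mx * vx + my * vy
        x, y = x + step * vx, y + step * vy
        if abs(step) < tol:
            break
    return x, y


# Hypothetical toy data: a horizontal band of points symmetric about y = 0.
band = [(0.25 * i, dy) for i in range(41) for dy in (-0.3, -0.15, 0.0, 0.15, 0.3)]
ridge_x, ridge_y = scms((3.1, 0.25), band, h=0.5)
```

In this toy setting the starting point slides perpendicular to the band until it reaches the centre line, while its position along the band is essentially unchanged. That is the behaviour that separates a ridge estimate from an ordinary mean-shift mode estimate.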

[4] oai:arXiv.org:1508.04149 [pdf] - 1300265

Investigating Galaxy-Filament Alignments in Hydrodynamic Simulations using Density Ridges

Chen, Yen-Chi; Ho, Shirley; Tenneti, Ananth; Mandelbaum, Rachel; Croft, Rupert; DiMatteo, Tiziana; Freeman, Peter E.; Genovese, Christopher R.; Wasserman, Larry

Comments: 11 pages, 10 figures

Submitted: 2015-08-17

In this paper, we study the filamentary structures and the galaxy alignment along filaments at redshift $z=0.06$ in the MassiveBlack-II simulation, a state-of-the-art, high-resolution hydrodynamical cosmological simulation which includes stellar and AGN feedback in a volume of (100 Mpc$/h$)$^3$. The filaments are constructed using the subspace constrained mean shift algorithm (SCMS; Ozertem & Erdogmus (2011) and Chen et al. (2015a)). First, we show that reconstructed filaments using galaxies and reconstructed filaments using dark matter particles are similar to each other; over $50\%$ of the points on the galaxy filaments have a corresponding point on the dark matter filaments within distance $0.13$ Mpc$/h$ (and vice versa) and this distance is even smaller in high-density regions. Second, we observe the alignment of the major principal axis of a galaxy with respect to the orientation of its nearest filament and detect a $2.5$ Mpc$/h$ critical radius for the filament's influence on the alignment when the subhalo mass of this galaxy is between $10^9M_\odot/h$ and $10^{12}M_\odot/h$. Moreover, we find the alignment signal to increase significantly with the subhalo mass. Third, when a galaxy is close to filaments (less than $0.25$ Mpc$/h$), the galaxy alignment toward the nearest galaxy group depends on the galaxy subhalo mass. Finally, we find that galaxies close to filaments or groups tend to be rounder than those away from filaments or groups.

[5] oai:arXiv.org:1406.7536 [pdf] - 844312

Estimating the distribution of Galaxy Morphologies on a continuous space

Vinci, Giuseppe; Freeman, Peter; Newman, Jeffrey; Wasserman, Larry; Genovese, Christopher

Comments: 4 pages, 3 figures, Statistical Challenges in 21st Century Cosmology, Proceedings IAU Symposium No. 306, 2014

Submitted: 2014-06-29

The incredible variety of galaxy shapes cannot be summarized by human-defined discrete classes of shapes without causing a possibly large loss of information. Dictionary learning and sparse coding allow us to reduce the high-dimensional space of shapes into a manageable low-dimensional continuous vector space. Statistical inference can be done in the reduced space via probability distribution estimation and manifold estimation.

[6] oai:arXiv.org:1404.3168 [pdf] - 809422

Functional Regression for Quasar Spectra

Ciollaro, Mattia; Cisewski, Jessi; Freeman, Peter; Genovese, Christopher; Lei, Jing; O'Connell, Ross; Wasserman, Larry

Comments:

Submitted: 2014-04-11

The Lyman-alpha forest is a portion of the observed light spectrum of distant galactic nuclei which allows us to probe remote regions of the Universe that are otherwise inaccessible. The observed Lyman-alpha forest of a quasar light spectrum can be modeled as a noisy realization of a smooth curve that is affected by a `damping effect' which occurs whenever the light emitted by the quasar travels through regions of the Universe with higher matter concentration. To decode the information conveyed by the Lyman-alpha forest about the matter distribution, we must be able to separate the smooth `continuum' from the noise and the contribution of the damping effect in the quasar light spectra. To predict the continuum in the Lyman-alpha forest, we use a nonparametric functional regression model in which both the response and the predictor variable (the smooth part of the damping-free portion of the spectrum) are function-valued random variables. We demonstrate that the proposed method accurately predicts the unobservable continuum in the Lyman-alpha forest both on simulated spectra and real spectra. Also, we introduce distribution-free prediction bands for the nonparametric functional regression model that have finite sample guarantees. These prediction bands, together with bootstrap-based confidence bands for the projection of the mean continuum on a fixed number of principal components, allow us to assess the degree of uncertainty in the model predictions.

[7] oai:arXiv.org:1401.1867 [pdf] - 1202636

Nonparametric 3D map of the IGM using the Lyman-alpha forest

Cisewski, Jessi; Croft, Rupert A. C.; Freeman, Peter E.; Genovese, Christopher R.; Khandai, Nishikanta; Ozbek, Melih; Wasserman, Larry

Comments:

Submitted: 2014-01-08

Visualizing the high-redshift Universe is difficult due to the dearth of available data; however, the Lyman-alpha forest provides a means to map the intergalactic medium at redshifts not accessible to large galaxy surveys. Large-scale structure surveys, such as the Baryon Oscillation Spectroscopic Survey (BOSS), have collected quasar (QSO) spectra that enable the reconstruction of HI density fluctuations. The data fall on a collection of lines defined by the lines-of-sight (LOS) of the QSO, and a major issue with producing a 3D reconstruction is determining how to model the regions between the LOS. We present a method that produces a 3D map of this relatively uncharted portion of the Universe by employing local polynomial smoothing, a nonparametric methodology. The performance of the method is analyzed on simulated data that mimics the varying number of LOS expected in real data, and then is applied to a sample region selected from BOSS. Evaluation of the reconstruction is assessed by considering various features of the predicted 3D maps including visual comparison of slices, PDFs, counts of local minima and maxima, and standardized correlation functions. This 3D reconstruction allows for an initial investigation of the topology of this portion of the Universe using persistent homology.
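
To make the phrase "local polynomial smoothing" concrete, here is a minimal one-dimensional sketch of its simplest non-trivial case, local linear regression with a Gaussian kernel. It is only an illustration of the general technique: the paper works in three dimensions on quasar sightlines, and the function name `local_linear`, the bandwidth `h`, and the toy data are assumptions introduced here.

```python
import math


def local_linear(x0, xs, ys, h):
    """Local linear estimate at x0: a kernel-weighted least-squares line fit."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = math.exp(-0.5 * (d / h) ** 2)  # Gaussian kernel weight
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    # Solve the 2x2 weighted normal equations for the intercept at x0.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


# Hypothetical toy data lying exactly on the line y = 2x + 1.
xs = [0.1 * i for i in range(51)]
ys = [2.0 * x + 1.0 for x in xs]
fit = local_linear(2.34, xs, ys, h=0.4)
```

A useful property of the local linear fit is that it reproduces exactly linear data without bias, including between sample points, which is one reason local polynomial methods are often preferred over plain kernel averaging near gaps and boundaries.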

[8] oai:arXiv.org:1202.2902 [pdf] - 1116614

Regularization Techniques for PSF-Matching Kernels. I. Choice of Kernel Basis

Becker, A. C.; Homrighausen, D.; Connolly, A. J.; Genovese, C. R.; Owen, R.; Bickerton, S. J.; Lupton, R. H.

Comments: Submitted to MNRAS; 5 figures

Submitted: 2012-02-13

We review current methods for building PSF-matching kernels for the purposes of image subtraction or coaddition. Such methods use a linear decomposition of the kernel on a series of basis functions. The correct choice of these basis functions is fundamental to the efficiency and effectiveness of the matching - the chosen bases should represent the underlying signal using a reasonably small number of shapes, and/or have a minimum number of user-adjustable tuning parameters. We examine methods whose bases comprise multiple Gauss-Hermite polynomials, as well as a form free basis composed of delta-functions. Kernels derived from delta-functions are unsurprisingly shown to be more expressive; they are able to take more general shapes and perform better in situations where sum-of-Gaussian methods are known to fail. However, due to its many degrees of freedom (the maximum number allowed by the kernel size) this basis tends to overfit the problem, and yields noisy kernels having large variance. We introduce a new technique to regularize these delta-function kernel solutions, which bridges the gap between the generality of delta-function kernels, and the compactness of sum-of-Gaussian kernels. Through this regularization we are able to create general kernel solutions that represent the intrinsic shape of the PSF-matching kernel with only one degree of freedom, the strength of the regularization lambda. The role of lambda is effectively to exchange variance in the resulting difference image with variance in the kernel itself. We examine considerations in choosing the value of lambda, including statistical risk estimators and the ability of the solution to predict solutions for adjacent areas. Both of these suggest moderate strengths of lambda between 0.1 and 1.0, although this optimization is likely dataset dependent.

[9] oai:arXiv.org:1003.5536 [pdf] - 951278

The Geometry of Nonparametric Filament Estimation

Genovese, Christopher R.; Perone-Pacifico, Marco; Verdinelli, Isabella; Wasserman, Larry

Comments: substantial revision

Submitted: 2010-03-25, last modified: 2010-12-12

We consider the problem of estimating filamentary structure from planar point process data. We make some connections with computational geometry and we develop nonparametric methods for estimating the filaments. We show that, under weak conditions, the filaments have a simple geometric representation as the medial axis of the data distribution's support. Our methods convert an estimator of the support's boundary into an estimator of the filaments. We also find the rates of convergence of our estimators.

[10] oai:arXiv.org:1011.4059 [pdf] - 1042031

Image Coaddition with Temporally Varying Kernels

Homrighausen, Darren; Genovese, Christopher; Connolly, Andy; Becker, Andy; Owen, Russell

Comments:

Submitted: 2010-11-17

Large, multi-frequency imaging surveys, such as the Large Synoptic Survey Telescope (LSST), need to do near-real time analysis of very large datasets. This raises a host of statistical and computational problems where standard methods do not work. In this paper, we study a proposed method for combining stacks of images into a single summary image, sometimes referred to as a template. This task is commonly referred to as image coaddition. In part, we focus on a method proposed in previous work, which outlines a procedure for combining stacks of images in an online fashion in the Fourier domain. We evaluate this method by comparing it to two straightforward methods through the use of various criteria and simulations. Note that the goal is not to propose these comparison methods for use in their own right, but to ensure that additional complexity also provides substantially improved performance.

[11] oai:arXiv.org:0910.5449 [pdf] - 30040

Straight to the Source: Detecting Aggregate Objects in Astronomical Images with Proper Error Control

Friedenberg, David A.; Genovese, Christopher R.

Comments:

Submitted: 2009-10-28

The next generation of telescopes will acquire terabytes of image data on a nightly basis. Collectively, these large images will contain billions of interesting objects, which astronomers call sources. The astronomers' task is to construct a catalog detailing the coordinates and other properties of the sources. The source catalog is the primary data product for most telescopes and is an important input for testing new astrophysical theories, but to construct the catalog one must first detect the sources. Existing algorithms for catalog creation are effective at detecting sources, but do not have rigorous statistical error control. At the same time, there are several multiple testing procedures that provide rigorous error control, but they are not designed to detect sources that are aggregated over several pixels. In this paper, we propose a technique that does both, by providing rigorous statistical error control on the aggregate objects themselves rather than the pixels. We demonstrate the effectiveness of this approach on data from the Chandra X-ray Observatory Satellite. Our technique effectively controls the rate of false sources, yet still detects almost all of the sources detected by procedures that do not have such rigorous error control and have the advantage of additional data in the form of follow up observations, which will not be available for upcoming large telescopes. In fact, we even detect a new source that was missed by previous studies. The statistical methods developed in this paper can be extended to problems beyond Astronomy, as we will illustrate with an example from Neuroimaging.

[12] oai:arXiv.org:0805.4136 [pdf] - 12977

Inference for the dark energy equation of state using Type IA supernova data

Genovese, Christopher; Freeman, Peter; Wasserman, Larry; Nichol, Robert; Miller, Christopher

Comments: Published at http://dx.doi.org/10.1214/08-AOAS229 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Submitted: 2008-05-27, last modified: 2009-05-18

The surprising discovery of an accelerating universe led cosmologists to posit the existence of "dark energy"--a mysterious energy field that permeates the universe. Understanding dark energy has become the central problem of modern cosmology. After describing the scientific background in depth, we formulate the task as a nonlinear inverse problem that expresses the comoving distance function in terms of the dark-energy equation of state. We present two classes of methods for making sharp statistical inferences about the equation of state from observations of Type Ia Supernovae (SNe). First, we derive a technique for testing hypotheses about the equation of state that requires no assumptions about its form and can distinguish among competing theories. Second, we present a framework for computing parametric and nonparametric estimators of the equation of state, with an associated assessment of uncertainty. Using our approach, we evaluate the strength of statistical evidence for various competing models of dark energy. Consistent with current studies, we find that with the available Type Ia SNe data, it is not possible to distinguish statistically among popular dark-energy models, and that, in particular, there is no support in the data for rejecting a cosmological constant. With much more supernova data likely to be available in coming years (e.g., from the DOE/NASA Joint Dark Energy Mission), we address the more interesting question of whether future data sets will have sufficient resolution to distinguish among competing theories.
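
For orientation, the relation between the comoving distance and the equation of state mentioned in this abstract can be written in a standard textbook form. The expression below assumes a spatially flat universe containing only matter and dark energy; the symbols $H_0$, $\Omega_m$, and $w$ follow common usage and are not necessarily the paper's own notation.

$$ r(z) = \frac{c}{H_0} \int_0^z \frac{du}{\sqrt{\Omega_m (1+u)^3 + (1-\Omega_m)\exp\left(3\int_0^u \frac{1+w(v)}{1+v}\,dv\right)}} $$

Setting $w(v) = -1$ makes the exponential equal to one and recovers the cosmological-constant case. Because $w$ enters through a nested integral inside a square root, recovering it from distance measurements is a nonlinear inverse problem.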

[13] oai:arXiv.org:0809.2800 [pdf] - 16406

Revealing components of the galaxy population through nonparametric techniques

Bamford, Steven P.; Rojas, Alex L.; Nichol, Robert C.; Miller, Christopher J.; Wasserman, Larry; Genovese, Christopher R.; Freeman, Peter E.

Comments: 12 pages, 10 figures, accepted for publication in MNRAS

Submitted: 2008-09-16

The distributions of galaxy properties vary with environment, and are often multimodal, suggesting that the galaxy population may be a combination of multiple components. The behaviour of these components versus environment holds details about the processes of galaxy development. To release this information we apply a novel, nonparametric statistical technique, identifying four components present in the distribution of galaxy H$\alpha$ emission-line equivalent-widths. We interpret these components as passive, star-forming, and two varieties of active galactic nuclei. Independent of this interpretation, the properties of each component are remarkably constant as a function of environment. Only their relative proportions display substantial variation. The galaxy population thus appears to comprise distinct components which are individually independent of environment, with galaxies rapidly transitioning between components as they move into denser environments.

[14] oai:arXiv.org:0704.2605 [pdf] - 582

Mapping the Cosmological Confidence Ball Surface

Bryan, Brent; Schneider, Jeff; Miller, Christopher J.; Nichol, Robert C.; Genovese, Christopher; Wasserman, Larry

Comments: 41 pages, 12 figures. To appear in ApJ

Submitted: 2007-04-19

We present a new technique to compute simultaneously valid confidence intervals for a set of model parameters. We apply our method to the Wilkinson Microwave Anisotropy Probe's (WMAP) Cosmic Microwave Background (CMB) data, exploring a seven dimensional space (tau, Omega_DE, Omega_M, omega_DM, omega_B, f_nu, n_s). We find two distinct regions-of-interest: the standard Concordance Model, and a region with large values of omega_DM, omega_B and H_0. This second peak in parameter space can be rejected by applying a constraint (or a prior) on the allowable values of the Hubble constant. Our new technique uses a non-parametric fit to the data, along with a frequentist approach and a smart search algorithm to map out a statistical confidence surface. The result is a confidence ``ball'': a set of parameter values that contains the true value with probability at least 1-alpha. Our algorithm performs a role similar to the often used Markov Chain Monte Carlo (MCMC), which samples from the posterior probability function in order to provide Bayesian credible intervals on the parameters. While the MCMC approach samples densely around a peak in the posterior, our new technique allows cosmologists to perform efficient analyses around any regions of interest: e.g., the peak itself, or, possibly more importantly, the 1-alpha confidence surface.

[15] oai:arXiv.org:astro-ph/0511437 [pdf] - 1592066

Statistical Computations with AstroGrid and the Grid

Nichol, Robert C; Smith, Garry; Miller, Christopher J; Genovese, Chris; Wasserman, Larry; Bryan, Brent; Gray, Alexander; Schneider, Jeff; Moore, Andrew W

Comments: Invited talk to appear in "Proceedings of PHYSTAT05: Statistical Problems in Particle Physics, Astrophysics and Cosmology"

Submitted: 2005-11-15

We outline our first steps towards marrying two new and emerging technologies: the Virtual Observatory (e.g., AstroGrid) and the computational grid. We discuss the construction of VOTechBroker, which is a modular software tool designed to abstract the tasks of submission and management of a large number of computational jobs to a distributed computer system. The broker will also interact with the AstroGrid workflow and MySpace environments. We present our planned usage of the VOTechBroker in computing a huge number of n-point correlation functions from the SDSS, as well as fitting over a million CMBfast models to the WMAP data.

[16] oai:arXiv.org:astro-ph/0510844 [pdf] - 77340

Massive Science with VO and Grids

Nichol, Robert; Smith, Garry; Miller, Christopher; Freeman, Peter; Genovese, Chris; Wasserman, Larry; Bryan, Brent; Gray, Alexander; Schneider, Jeff; Moore, Andrew

Comments: Invited talk at ADASSXV conference published as ASP Conference Series, Vol. XXX, 2005 C. Gabriel, C. Arviset, D. Ponz and E. Solano, eds. 9 pages

Submitted: 2005-10-31

There is a growing need for massive computational resources for the analysis of new astronomical datasets. To tackle this problem, we present here our first steps towards marrying two new and emerging technologies: the Virtual Observatory (e.g., AstroGrid) and the computational grid (e.g., TeraGrid, COSMOS, etc.). We discuss the construction of VOTechBroker, which is a modular software tool designed to abstract the tasks of submission and management of a large number of computational jobs to a distributed computer system. The broker will also interact with the AstroGrid workflow and MySpace environments. We discuss our planned uses of the VOTechBroker in computing a huge number of n-point correlation functions from the SDSS data and massive model-fitting of millions of CMBfast models to WMAP data. We also discuss other applications including the determination of the XMM Cluster Survey selection function and the construction of new WMAP maps.

[17] oai:arXiv.org:astro-ph/0510406 [pdf] - 260658

Examining the Effect of the Map-Making Algorithm on Observed Power Asymmetry in WMAP Data

Freeman, P. E.; Genovese, C. R.; Miller, C. J.; Nichol, R. C.; Wasserman, L.

Comments: 45 pages, 16 figures (21 figure files), high-resolution versions of Figures 1-3 at http://www.stat.cmu.edu/~pfreeman, accepted for publication in ApJ

Submitted: 2005-10-13

We analyze first-year data of WMAP to determine the significance of asymmetry in summed power between arbitrarily defined opposite hemispheres, using maps that we create ourselves with software developed independently of the WMAP team. We find that over the multipole range l=[2,64], the significance of asymmetry is ~ 10^-4, a value insensitive to both frequency and power spectrum. We determine the smallest multipole ranges exhibiting significant asymmetry, and find twelve, including l=[2,3] and [6,7], for which the significance -> 0. In these ranges there is an improbable association between the direction of maximum significance and the ecliptic plane (p ~ 0.01). Also, contours of least significance follow great circles inclined relative to the ecliptic at the largest scales. The great circle for l=[2,3] passes over previously reported preferred axes and is insensitive to frequency, while the great circle for l=[6,7] is aligned with the ecliptic poles. We examine how changing map-making parameters affects asymmetry, and find that at large scales, it is rendered insignificant if the magnitude of the WMAP dipole vector is increased by approximately 1-3 sigma (or 2-6 km/s). While confirmation of this result would require data recalibration, such a systematic change would be consistent with observations of frequency-independent asymmetry. We conclude that the use of an incorrect dipole vector, in combination with a systematic or foreground process associated with the ecliptic, may help to explain the observed asymmetry.

[18] oai:arXiv.org:astro-ph/0410140 [pdf] - 1233420

Nonparametric Inference for the Cosmic Microwave Background

Genovese, Christopher R.; Miller, Christopher J.; Nichol, Robert C.; Arjunwadkar, Mihir; Wasserman, Larry

Comments: Invited review for "Statistical Science". Accepted for publication in February 2004 journal

Submitted: 2004-10-06

The Cosmic Microwave Background (CMB), which permeates the entire Universe, is the radiation left over from just 380,000 years after the Big Bang. On very large scales, the CMB radiation field is smooth and isotropic, but the existence of structure in the Universe - stars, galaxies, clusters of galaxies - suggests that the field should fluctuate on smaller scales. Recent observations, from the Cosmic Background Explorer to the Wilkinson Microwave Anisotropy Probe, have strikingly confirmed this prediction. CMB fluctuations provide clues to the Universe's structure and composition shortly after the Big Bang that are critical for testing cosmological models. For example, CMB data can be used to determine what portion of the Universe is composed of ordinary matter versus the mysterious dark matter and dark energy. To this end, cosmologists usually summarize the fluctuations by the power spectrum, which gives the variance as a function of angular frequency. The spectrum's shape, and in particular the location and height of its peaks, relates directly to the parameters in the cosmological models. Thus, a critical statistical question is how accurately these peaks can be estimated. We use recently developed techniques to construct a nonparametric confidence set for the unknown CMB spectrum. Our estimated spectrum, based on minimal assumptions, closely matches the model-based estimates used by cosmologists, but we can make a wide range of additional inferences. We apply these techniques to test various models and to extract confidence intervals on cosmological parameters of interest. Our analysis shows that, even without parametric assumptions, the first peak is resolved accurately with current data but that the second and third peaks are not.

[19] oai:arXiv.org:astro-ph/0401121 [pdf] - 61991

Multi-Tree Methods for Statistics on Very Large Datasets in Astronomy

Gray, Alexander G.; Moore, Andrew W.; Nichol, Robert C.; Connolly, Andrew J.; Genovese, Christopher; Wasserman, Larry

Comments: 4-page conference proceeding based on talk given at ADASS XIII, 13-15 October, 2003, Strasbourg

Submitted: 2004-01-08

Many fundamental statistical methods have become critical tools for scientific data analysis yet do not scale tractably to modern large datasets. This paper will describe very recent algorithms based on computational geometry which have dramatically reduced the computational complexity of 1) kernel density estimation (which also extends to nonparametric regression, classification, and clustering), and 2) the n-point correlation function for arbitrary n. These new multi-tree methods typically yield orders of magnitude in speedup over the previous state of the art for similar accuracy, making millions of data points tractable on desktop workstations for the first time.

[20] oai:arXiv.org:astro-ph/0112050 [pdf] - 46423

Non-Parametric Inference in Astrophysics

Wasserman, Larry; Miller, Christopher J.; Nichol, Robert C.; Genovese, Chris; Jang, Woncheol; Connolly, Andrew J.; Moore, Andrew W.; Schneider, Jeff; group, the PICA

Comments: Invited presentation at "Statistical Challenges in Modern Astronomy III" July 18-21 2001 Penn St. University. See http://www.picagroup.org for more information on the PICA group membership, software and recent papers

Submitted: 2001-12-03

We discuss non-parametric density estimation and regression for astrophysics problems. In particular, we show how to compute non-parametric confidence intervals for the location and size of peaks of a function. We illustrate these ideas with recent data on the Cosmic Microwave Background. We also briefly discuss non-parametric Bayesian inference.

[21] oai:arXiv.org:astro-ph/0112049 [pdf] - 46422

A Non-parametric Analysis of the CMB Power Spectrum

Miller, Christopher J.; Nichol, Robert C.; Genovese, Christopher; Wasserman, Larry

Comments: Uses emulateapj.sty. 4 pages, 1 table, 2 figures. Submitted to ApJ Letters. Our code and best non-parametric fit are available at http://www.picagroup.org

Submitted: 2001-12-03

We examine Cosmic Microwave Background (CMB) temperature power spectra from the BOOMERANG, MAXIMA, and DASI experiments. We non-parametrically estimate the true power spectrum with no model assumptions. This is a significant departure from previous research which used either cosmological models or some other parameterized form (e.g. parabolic fits). Our non-parametric estimate is practically indistinguishable from the best fit cosmological model, thus lending independent support to the underlying physics that governs these models. We also generate a confidence set for the non-parametric fit and extract confidence intervals for the numbers, locations, and heights of peaks and the successive peak-to-peak height ratios. At the 95%, 68%, and 40% confidence levels, we find functions that fit the data with one, two, and three peaks respectively (0 <= l <= 1100). Therefore, the current data prefer two peaks at the 1 sigma level. However, we also rule out a constant temperature function at the >8 sigma level. If we assume that there are three peaks in the data, we find their locations to be within l_1 = (118,300), l_2 = (377,650), and l_3 = (597,900). We find the ratio of the first peak-height to the second (Delta T_1)/(Delta T_2)^2= (1.06, 4.27) and the second to the third (Delta T_2)/(Delta T_3)^2= (0.41, 2.5). All measurements are for 95% confidence. If the standard errors on the temperature measurements were reduced to a third of what they are currently, as we expect to be achieved by the MAP and Planck CMB experiments, we could eliminate two-peak models at the 95% confidence limit. The non-parametric methodology discussed in this paper has many astrophysical applications.

[22] oai:arXiv.org:astro-ph/0110570 [pdf] - 45627

A new source detection algorithm using FDR

Hopkins, A. M.; Miller, C. J.; Connolly, A. J.; Genovese, C.; Nichol, R. C.; Wasserman, L.

Comments: 17 pages, 7 figures, accepted for publication by AJ

Submitted: 2001-10-26

The False Discovery Rate (FDR) method has recently been described by Miller et al (2001), along with several examples of astrophysical applications. FDR is a new statistical procedure due to Benjamini and Hochberg (1995) for controlling the fraction of false positives when performing multiple hypothesis testing. The importance of this method to source detection algorithms is immediately clear. To explore the possibilities offered we have developed a new task for performing source detection in radio-telescope images, Sfind 2.0, which implements FDR. We compare Sfind 2.0 with two other source detection and measurement tasks, Imsad and SExtractor, and comment on several issues arising from the nature of the correlation between nearby pixels and the necessary assumption of the null hypothesis. The strong suggestion is made that implementing FDR as a threshold defining method in other existing source-detection tasks is easy and worthwhile. We show that the constraint on the fraction of false detections as specified by FDR holds true even for highly correlated and realistic images. For the detection of true sources, which are complex combinations of source-pixels, this constraint appears to be somewhat less strict. It is still reliable enough, however, for a priori estimates of the fraction of false source detections to be robust and realistic.

[23] oai:arXiv.org:astro-ph/0110230 [pdf] - 45288

Computational AstroStatistics: Fast and Efficient Tools for Analysing Huge Astronomical Data Sources

Nichol, R. C.; Chong, S.; Connolly, A. J.; Davies, S.; Genovese, C.; Hopkins, A. M.; Miller, C. J.; Moore, A. W.; Pelleg, D.; Richards, G. T.; Schneider, J.; Szapudi, I.; Wasserman, L.

Comments: Invited talk at "Statistical Challenges in Modern Astronomy III" July 18-21 2001. 9 pages

Submitted: 2001-10-09

I present here a review of past and present multi-disciplinary research of the Pittsburgh Computational AstroStatistics (PiCA) group. This group is dedicated to developing fast and efficient statistical algorithms for analysing huge astronomical data sources. I begin with a short review of multi-resolutional kd-trees which are the building blocks for many of our algorithms. For example, quick range queries and fast n-point correlation functions. I will present new results from the use of Mixture Models (Connolly et al. 2000) in density estimation of multi-color data from the Sloan Digital Sky Survey (SDSS). Specifically, the selection of quasars and the automated identification of X-ray sources. I will also present a brief overview of the False Discovery Rate (FDR) procedure (Miller et al. 2001a) and show how it has been used in the detection of ``Baryon Wiggles'' in the local galaxy power spectrum and source identification in radio data. Finally, I will look forward to new research on an automated Bayes Network anomaly detector and the possible use of the Locally Linear Embedding algorithm (LLE; Roweis & Saul 2000) for spectral classification of SDSS spectra.

[24] oai:arXiv.org:astro-ph/0107034 [pdf] - 43405

Controlling the False Discovery Rate in Astrophysical Data Analysis

Miller, Christopher J.; Genovese, Christopher; Nichol, Robert C.; Wasserman, Larry; Connolly, Andrew; Reichart, Daniel; Hopkins, Andrew; Schneider, Jeff; Moore, Andrew

Comments: 15 pages, 9 figures. Submitted to AJ

Submitted: 2001-07-02

The False Discovery Rate (FDR) is a new statistical procedure to control the number of mistakes made when performing multiple hypothesis tests, i.e. when comparing many data against a given model hypothesis. The key advantage of FDR is that it allows one to a priori control the average fraction of false rejections made (when comparing to the null hypothesis) over the total number of rejections performed. We compare FDR to the standard procedure of rejecting all tests that do not match the null hypothesis above some arbitrarily chosen confidence limit, e.g. 2 sigma, or at the 95% confidence level. When using FDR, we find a similar rate of correct detections, but with significantly fewer false detections. Moreover, the FDR procedure is quick and easy to compute and can be trivially adapted to work with correlated data. The purpose of this paper is to introduce the FDR procedure to the astrophysics community. We illustrate the power of FDR through several astronomical examples, including the detection of features against a smooth one-dimensional function, e.g. seeing the ``baryon wiggles'' in a power spectrum of matter fluctuations, and source pixel detection in imaging data. In this era of large datasets and high precision measurements, FDR provides the means to adaptively control a scientifically meaningful quantity -- the number of false discoveries made when conducting multiple hypothesis tests.
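
As an illustration of the kind of thresholding this abstract describes, here is a minimal sketch of the basic Benjamini-Hochberg step-up rule. It is not the authors' code. The function name `benjamini_hochberg` and the parameter `alpha` (the target average fraction of false rejections) are assumptions made for this example, and the sketch covers only the standard rule, without any adjustment for correlated data.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Basic Benjamini-Hochberg step-up rule: sort the p-values, find the
    largest rank k (1-based) with p_(k) <= alpha * k / n, and reject the
    k hypotheses with the smallest p-values.
    """
    n = len(p_values)
    if n == 0:
        return []
    # Indices of the tests, ordered by ascending p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / n:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * n
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

Because the rule is step-up, a p-value that fails its own threshold can still be rejected when a larger p-value passes at a higher rank. A fixed cut such as "2 sigma" has no such dependence on the other tests.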

[25] oai:arXiv.org:astro-ph/0012333 [pdf] - 39922

Fast Algorithms and Efficient Statistics: N-point Correlation Functions

Moore, Andrew; Connolly, Andy; Genovese, Chris; Gray, Alex; Grone, Larry; Kanidoris, Nick; Nichol, Robert; Schneider, Jeff; Szalay, Alex; Szapudi, Istvan; Wasserman, Larry

Comments: To appear in Proceedings of MPA/MPE/ESO Conference "Mining the Sky", July 31 - August 4, 2000, Garching, Germany

Submitted: 2000-12-14

We present here a new algorithm for the fast computation of N-point correlation functions in large astronomical data sets. The algorithm is based on kd-trees decorated with cached sufficient statistics, allowing for orders-of-magnitude speed-ups over the naive non-tree-based implementation of correlation functions. We further discuss the use of controlled approximations within the computation, which allows for further acceleration. In summary, our algorithm now makes it possible to compute exact, all-pairs, measurements of the 2, 3 and 4-point correlation functions for cosmological data sets like the Sloan Digital Sky Survey (SDSS; York et al. 2000) and the next generation of Cosmic Microwave Background experiments (see Szapudi et al. 2000).
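
The idea of a tree with cached statistics can be sketched for the simplest case, counting pairs of points closer than a radius r (the raw ingredient of a 2-point function). This is a simplified single-tree illustration, not the algorithm of the paper. The names `Node`, `count_within` and `pair_count`, and the `leaf_size` parameter, are hypothetical. Each node caches its point count and bounding box, so whole subtrees can be skipped or counted at once.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

class Node:
    """kd-tree node caching its point count and bounding box."""
    def __init__(self, points, depth=0, leaf_size=8):
        dim = len(points[0])
        self.count = len(points)  # cached sufficient statistic
        self.lo = [min(p[d] for p in points) for d in range(dim)]
        self.hi = [max(p[d] for p in points) for d in range(dim)]
        self.points = None
        self.left = self.right = None
        if len(points) <= leaf_size:
            self.points = points  # leaf: keep the raw points
        else:
            axis = depth % dim
            pts = sorted(points, key=lambda p: p[axis])
            mid = len(pts) // 2
            self.left = Node(pts[:mid], depth + 1, leaf_size)
            self.right = Node(pts[mid:], depth + 1, leaf_size)

def count_within(node, q, r):
    """Number of points under `node` within distance r of q."""
    dmin2 = 0.0
    dmax2 = 0.0
    for x, lo, hi in zip(q, node.lo, node.hi):
        near = max(lo - x, 0.0, x - hi)          # gap to the box along this axis
        far = max(abs(x - lo), abs(x - hi))      # farthest box face along this axis
        dmin2 += near * near
        dmax2 += far * far
    if dmin2 > r * r:
        return 0            # box entirely outside the radius: prune
    if dmax2 <= r * r:
        return node.count   # box entirely inside: use the cached count
    if node.points is not None:
        return sum(1 for p in node.points if dist2(p, q) <= r * r)
    return count_within(node.left, q, r) + count_within(node.right, q, r)

def pair_count(points, r):
    """Number of unordered pairs of distinct entries separated by at most r."""
    tree = Node(points)
    total = sum(count_within(tree, p, r) for p in points)
    # Each point finds itself once, and each pair is found from both ends.
    return (total - len(points)) // 2
```

The result is exact. The saving comes from the two early returns, which avoid visiting individual points whenever a node's bounding box lies wholly outside or wholly inside the search radius.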

[26] oai:arXiv.org:astro-ph/0011557 [pdf] - 39563

SDSS-RASS: Next Generation of Cluster-Finding Algorithms

Comments: To appear in Proceedings of MPA/MPE/ESO Conference "Mining the Sky", July 31 - August 4, 2000, Garching, Germany

Submitted: 2000-11-29

We outline here the next generation of cluster-finding algorithms. We show how advances in Computer Science and Statistics have helped develop robust, fast algorithms for finding clusters of galaxies in large multi-dimensional astronomical databases like the Sloan Digital Sky Survey (SDSS). Specifically, this paper presents four new advances: (1) A new semi-parametric algorithm - nicknamed ``C4'' - for jointly finding clusters of galaxies in the SDSS and ROSAT All-Sky Survey databases; (2) The introduction of the False Discovery Rate into Astronomy; (3) The role of kernel shape in optimizing cluster detection; (4) A new determination of the X-ray Cluster Luminosity Function which has bearing on the existence of a ``deficit'' of high redshift, high luminosity clusters. This research is part of our ``Computational AstroStatistics'' collaboration (see Nichol et al. 2000) and the algorithms and techniques discussed herein will form part of the ``Virtual Observatory'' analysis toolkit.

[27] oai:arXiv.org:astro-ph/0008187 [pdf] - 37503

Fast Algorithms and Efficient Statistics: Density Estimation in Large Astronomical Datasets

Connolly, A. J.; Genovese, C.; Moore, A. W.; Nichol, R. C.; Schneider, J.; Wasserman, L.

Comments: This paper is only published here on astro-ph. The paper is still valid. Please contact the authors with any questions and requests for the software

Submitted: 2000-08-11

In this paper, we outline the use of Mixture Models in density estimation of large astronomical databases. This method of density estimation has been known in Statistics for some time but has not been implemented because of the large computational cost. Herein, we detail an implementation of the Mixture Model density estimation based on multi-resolutional KD-trees which makes this statistical technique into a computationally tractable problem. We provide the theoretical and experimental background for using a mixture model of Gaussians based on the Expectation Maximization (EM) Algorithm. Applying these analyses to simulated data sets we show that the EM algorithm - using the AIC penalized likelihood to score the fit - out-performs the best kernel density estimate of the distribution while requiring no ``fine-tuning'' of the input algorithm parameters. We find that EM can accurately recover the underlying density distribution from point processes thus providing an efficient adaptive smoothing method for astronomical source catalogs. To demonstrate the general application of this statistic to astrophysical problems we consider two cases of density estimation: the clustering of galaxies in redshift space and the clustering of stars in color space. From these data we show that EM provides an adaptive smoothing of the distribution of galaxies in redshift space (describing accurately both the small and large-scale features within the data) and a means of identifying outliers in multi-dimensional color-color space (e.g. for the identification of high redshift QSOs). Automated tools such as those based on the EM algorithm will be needed in the analysis of the next generation of astronomical catalogs (2MASS, FIRST, PLANCK, SDSS) and ultimately in the development of the National Virtual Observatory.
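
To make the EM-plus-AIC recipe in this abstract concrete, here is a minimal one-dimensional sketch. It is not the authors' implementation and it omits the KD-tree acceleration entirely. The function names `fit_gmm_1d` and `aic`, the quantile-based initialisation, the variance floor and the fixed iteration count are all assumptions made for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(data, k, n_iter=200):
    """Fit a k-component 1D Gaussian mixture by EM.

    Returns (weights, means, variances, log_likelihood).
    """
    n = len(data)
    ordered = sorted(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    floor = 1e-6 * grand_var + 1e-12  # keeps a component from collapsing
    weights = [1.0 / k] * k
    # Assumed initialisation: means at evenly spaced quantiles of the data.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    variances = [max(grand_var, floor)] * k

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, floor)

    log_lik = 0.0
    for x in data:
        density = sum(w * gaussian_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances))
        log_lik += math.log(max(density, 1e-300))
    return weights, means, variances, log_lik

def aic(log_lik, k):
    """AIC = 2p - 2 ln L, with p = 3k - 1 free parameters in 1D (lower is better)."""
    return 2 * (3 * k - 1) - 2 * log_lik
```

A typical use would fit several values of k and keep the one with the lowest AIC, so the number of components is chosen by the penalized likelihood rather than set by hand.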

[28] oai:arXiv.org:astro-ph/0007404 [pdf] - 37242

Computational AstroStatistics: Fast Algorithms and Efficient Statistics for Density Estimation in Large Astronomical Datasets

Nichol, R. C.; Connolly, A. J.; Moore, A. W.; Schneider, J.; Genovese, C.; Wasserman, L.

Comments: Proceedings from ``Virtual Observatories of the Future'' edited by R. J. Brunner, S. G. Djorgovski, A. Szalay

Submitted: 2000-07-26

We present initial results on the use of Mixture Models for density estimation in large astronomical databases. We provide herein both the theoretical and experimental background for using a mixture model of Gaussians based on the Expectation Maximization (EM) Algorithm. Applying these analyses to simulated data sets we show that the EM algorithm - using both the AIC & BIC penalized likelihood to score the fit - can out-perform the best kernel density estimate of the distribution while requiring no ``fine-tuning'' of the input algorithm parameters. We find that EM can accurately recover the underlying density distribution from point processes thus providing an efficient adaptive smoothing method for astronomical source catalogs. To demonstrate the general application of this statistic to astrophysical problems we consider two cases of density estimation: the clustering of galaxies in redshift space and the clustering of stars in color space. From these data we show that EM provides an adaptive smoothing of the distribution of galaxies in redshift space (describing accurately both the small and large-scale features within the data) and a means of identifying outliers in multi-dimensional color-color space (e.g. for the identification of high redshift QSOs). Automated tools such as those based on the EM algorithm will be needed in the analysis of the next generation of astronomical catalogs (2MASS, FIRST, PLANCK, SDSS) and ultimately in the development of the National Virtual Observatory.