Normalized to: Reis, I.
[1]
oai:arXiv.org:1911.06823 [pdf] - 1999710
Effectively using unsupervised machine learning in next generation
astronomical surveys
Submitted: 2019-11-15
In recent years many works have shown that unsupervised Machine Learning (ML)
can help detect unusual objects and uncover trends in large astronomical
datasets, but a few challenges remain. We show here, for example, that
different methods, or even small variations of the same method, can produce
significantly different outcomes. While intuitively somewhat surprising, this
can naturally occur when applying unsupervised ML to high-dimensional data,
where there can be many reasonable yet different answers to the same question.
In such a case the outcome of any single unsupervised ML method should be
considered a sample from a conceivably wide range of possibilities. We
therefore suggest an approach that eschews finding an optimal outcome, instead
facilitating the production and examination of many valid ones. This can be
achieved by incorporating unsupervised ML into data visualisation portals. We
present here such a portal that we are developing, applied to the sample of
SDSS spectra of galaxies. The main feature of the portal is interactive 2D maps
of the data. Different maps are constructed by applying dimensionality
reduction to different subspaces of the data, so that each map contains
different information that in turn gives a different perspective on the data.
The interactive maps are intuitive to use, and we demonstrate how peculiar
objects and trends can be detected by means of a few button clicks. We believe
that including tools in this spirit in next generation astronomical surveys
will be important for making unexpected discoveries, either by professional
astronomers or by citizen scientists, and will generally enable the benefits of
visual inspection even when dealing with very complex and extensive datasets.
Our portal is available online at galaxyportal.space.
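
As an illustration of the map-building idea in the abstract above, the following minimal Python sketch projects two different feature subsets of the same toy table to 2D. PCA via power iteration is used only as a stand-in: the abstract does not say which dimensionality reduction algorithm the portal uses, and the toy data, column groupings and function names here are all hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the per-column mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iters=500, seed=0):
    """Leading eigenvector and eigenvalue of cov by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in cov]
    for _ in range(iters):
        w = [dot(row, v) for row in cov]
        norm = dot(w, w) ** 0.5
        if norm == 0:  # degenerate matrix: keep the current vector
            break
        v = [x / norm for x in w]
    return v, dot(v, [dot(row, v) for row in cov])


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    x = center(rows)
    cov = covariance(x)
    v1, l1 = top_component(cov, seed=0)
    d = len(cov)
    # Deflate: remove the first component, then find the next one.
    deflated = [[cov[i][j] - l1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    v2, _ = top_component(deflated, seed=1)
    return [(dot(r, v1), dot(r, v2)) for r in x]


def subspace_map(data, columns):
    """Build one 2D map from the chosen feature columns only."""
    return pca_2d([[row[c] for c in columns] for row in data])


def make_toy_data(n=60, seed=42):
    """Hypothetical table: 6 features per object, two correlated groups."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a, b = rng.gauss(0, 3), rng.gauss(0, 1)
        rows.append([a, a + rng.gauss(0, 0.5), rng.gauss(0, 1),
                     b, -b + rng.gauss(0, 0.3), rng.gauss(0, 2)])
    return rows


data = make_toy_data()
map_a = subspace_map(data, [0, 1, 2])  # e.g. one group of measurements
map_b = subspace_map(data, [3, 4, 5])  # e.g. a different group
```

Each object gets a different position in `map_a` and `map_b`, so an object that looks ordinary in one map may stand out in the other, which is the motivation for producing several maps rather than a single "optimal" one.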
[2]
oai:arXiv.org:1811.05994 [pdf] - 1806220
Probabilistic Random Forest: A machine learning algorithm for noisy
datasets
Submitted: 2018-11-14
Machine learning (ML) algorithms are becoming increasingly important in the
analysis of astronomical data. However, since most ML algorithms are not
designed to take data uncertainties into account, ML-based studies are mostly
restricted to data with high signal-to-noise ratio. Astronomical datasets of
such high quality are uncommon. In this work we modify the long-established
Random Forest (RF) algorithm to take into account uncertainties in the
measurements (i.e., features) as well as in the assigned classes (i.e.,
labels). To do so, the Probabilistic Random Forest (PRF) algorithm treats the
features and labels as probability distribution functions, rather than
deterministic quantities. We perform a variety of experiments where we inject
different types of noise into a dataset, and compare the accuracy of the PRF to
that of RF. The PRF outperforms RF in all cases, with a moderate increase in
running time. We find an improvement in classification accuracy of up to 10% in
the case of noisy features, and up to 30% in the case of noisy labels. The PRF
accuracy decreased by less than 5% for a dataset with as many as 45%
misclassified objects, compared to a clean dataset. Apart from improving the
prediction accuracy in noisy datasets, the PRF naturally copes with missing
values in the data, and outperforms RF when applied to a dataset with different
noise characteristics in the training and test sets, suggesting that it can be
used for Transfer Learning.
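
To make the idea of treating features as probability distributions concrete, here is a minimal Python sketch of a single decision tree in which an uncertain measurement is propagated down both branches of every split, weighted by the probability that its true value lies on each side. It assumes Gaussian feature uncertainties and a hand-built toy tree; the tree, the class names and the functions `prob_left` and `soft_predict` are hypothetical and are not the API of the published PRF code, and label uncertainties are not modelled here.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_left(value, sigma, threshold):
    """P(true feature <= threshold) for a measurement value +/- sigma."""
    if sigma == 0:
        # No uncertainty: fall back to the ordinary hard split.
        return 1.0 if value <= threshold else 0.0
    return normal_cdf((threshold - value) / sigma)


# Hypothetical toy tree: internal nodes split on one feature,
# leaves store class probabilities.
TOY_TREE = {
    "feature": 0, "threshold": 1.0,
    "left": {"leaf": {"star": 0.9, "galaxy": 0.1}},
    "right": {
        "feature": 1, "threshold": 0.0,
        "left": {"leaf": {"star": 0.2, "galaxy": 0.8}},
        "right": {"leaf": {"star": 0.5, "galaxy": 0.5}},
    },
}


def soft_predict(node, values, sigmas, weight=1.0, out=None):
    """Accumulate class probabilities over every path through the tree."""
    if out is None:
        out = {}
    if "leaf" in node:
        for label, p in node["leaf"].items():
            out[label] = out.get(label, 0.0) + weight * p
        return out
    f = node["feature"]
    p_left = prob_left(values[f], sigmas[f], node["threshold"])
    soft_predict(node["left"], values, sigmas, weight * p_left, out)
    soft_predict(node["right"], values, sigmas, weight * (1.0 - p_left), out)
    return out


# A precise measurement follows one path; a noisy one is spread over several.
precise = soft_predict(TOY_TREE, [0.0, 0.0], [0.0, 0.0])
noisy = soft_predict(TOY_TREE, [1.0, 0.0], [1.0, 0.5])
```

With zero uncertainty the sketch reduces to an ordinary tree prediction, which is the sense in which such a scheme generalises a standard RF tree rather than replacing it.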
[3]
oai:arXiv.org:1711.00022 [pdf] - 1689724
Detecting outliers and learning complex structures with large
spectroscopic surveys - a case study with APOGEE stars
Submitted: 2017-10-31, last modified: 2018-05-28
In this work we apply and expand on a recently introduced outlier detection
algorithm that is based on an unsupervised random forest. We use the algorithm
to calculate a similarity measure for stellar spectra from the Apache Point
Observatory Galactic Evolution Experiment (APOGEE). We show that the similarity
measure traces non-trivial physical properties and contains information about
complex structures in the data. We use it for visualization and clustering of
the dataset, and discuss its ability to find groups of highly similar objects,
including spectroscopic twins. Searching the dataset with the similarity matrix
lets us find objects that cannot be found using
their best-fitting model parameters. This includes extreme objects for which
the models fail, and rare objects that are outside the scope of the model. We
use the similarity measure to detect outliers in the dataset, and find a number
of previously unknown Be-type stars, spectroscopic binaries, carbon rich stars,
young stars, and a few that we cannot interpret. Our work further demonstrates
the potential for scientific discovery when combining machine learning methods
with modern survey data.
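
The following Python sketch shows one common way a forest-based similarity and an outlier score can be computed, as an aid to reading the abstract above. It assumes the usual unsupervised random forest recipe (train a forest to separate real objects from synthetic ones drawn from the feature marginals, then call two objects similar when they share leaves); the abstract does not spell out the authors' exact variant. The forest itself is omitted: `LEAF_TABLE` is a hypothetical table of leaf indices, one row per object and one column per tree.

```python
import random


def synthetic_from_marginals(rows, seed=0):
    """Shuffle each column independently: marginals kept, correlations lost.

    Assumption: permutation is used here as a simple way to sample from
    the marginal distribution of every feature.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [list(r) for r in zip(*columns)]


def leaf_similarity(leaves_a, leaves_b):
    """Fraction of trees in which two objects end up in the same leaf."""
    same = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return same / len(leaves_a)


def similarity_matrix(leaf_table):
    """Pairwise leaf-sharing similarity for all objects."""
    return [[leaf_similarity(a, b) for b in leaf_table] for a in leaf_table]


def weirdness(sim):
    """Outlier score: mean dissimilarity of each object to all the others."""
    n = len(sim)
    return [sum(1.0 - sim[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]


# Hypothetical leaf indices for 4 objects in a 5-tree forest.
LEAF_TABLE = [
    [3, 7, 1, 4, 2],
    [3, 7, 1, 4, 5],
    [3, 6, 1, 0, 2],
    [9, 8, 0, 1, 6],
]

sim = similarity_matrix(LEAF_TABLE)
scores = weirdness(sim)
most_unusual = max(range(len(scores)), key=lambda i: scores[i])
```

In this toy table the last object never shares a leaf with any other, so it receives the highest score; objects 0 and 1 share four of five leaves and would be candidates for "twins".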
[4]
oai:arXiv.org:1805.09829 [pdf] - 1732695
Redshifted broad absorption line quasars found via machine-learned
spectral similarity
Submitted: 2018-05-24
We report the discovery of 31 new redshifted broad absorption line quasars
(RSBALs) from the Sloan Digital Sky Survey (SDSS). Only 19 such objects
were previously known. The identification of the new objects was enabled by
calculating similarities between quasar spectra in the SDSS. Using these
similarities we look for the objects that are similar to the ones in the
original sample, visually inspecting only hundreds, out of over 160,000 spectra
considered. We compare the performance of several similarity measures, as well
as different methods of employing them, in finding the RSBALs. We find that
decision-tree-based similarities recover the most objects, and that an ensemble
of methods performs better than any single one. As the similarities are not
tailored for the specific problem of finding RSBALs, they could be used for
searching for other types of quasars. The similarities and the code for their
calculation are available online.
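
The search strategy in the abstract above can be sketched in a few lines of Python: rank unlabelled spectra by their similarity to the known examples, keep only the top of each ranking for visual inspection, and merge the shortlists from several similarity measures. The scoring rule (highest similarity to any known object) and the merging rule (union of each top-k list) are assumptions made for illustration, since the abstract does not describe how the ensemble is built; the object names and similarity values are invented.

```python
def rank_candidates(similarity, seeds, candidates):
    """Order candidates by their highest similarity to any known object."""
    score = {c: max(similarity[c][s] for s in seeds) for c in candidates}
    return sorted(candidates, key=lambda c: score[c], reverse=True)


def ensemble_shortlist(rankings, top_k):
    """Union of the top_k candidates from every ranking, in first-seen order."""
    shortlist = []
    for ranking in rankings:
        for candidate in ranking[:top_k]:
            if candidate not in shortlist:
                shortlist.append(candidate)
    return shortlist


# Hypothetical similarities of three candidates (q1..q3) to two known
# objects (s1, s2) under two different similarity measures.
MEASURE_A = {
    "q1": {"s1": 0.9, "s2": 0.2},
    "q2": {"s1": 0.1, "s2": 0.3},
    "q3": {"s1": 0.4, "s2": 0.5},
}
MEASURE_B = {
    "q1": {"s1": 0.2, "s2": 0.1},
    "q2": {"s1": 0.8, "s2": 0.7},
    "q3": {"s1": 0.3, "s2": 0.6},
}

seeds = ["s1", "s2"]
candidates = ["q1", "q2", "q3"]
ranking_a = rank_candidates(MEASURE_A, seeds, candidates)
ranking_b = rank_candidates(MEASURE_B, seeds, candidates)
to_inspect = ensemble_shortlist([ranking_a, ranking_b], top_k=1)
```

Here each measure alone would put a different candidate first, and the merged shortlist contains both, which illustrates why combining measures can recover objects that any single one misses.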