Normalized to: Knowles, J.
[1]
oai:arXiv.org:1603.05166 [pdf] - 1396952
Fifty Years of Pulsar Candidate Selection: From simple filters to a new
principled real-time classification approach
Submitted: 2016-03-16
Improving survey specifications are causing an exponential rise in pulsar
candidate numbers and data volumes. We study the candidate filters used to
mitigate these problems during the past fifty years. We find that some existing
methods, such as applying constraints on the total number of candidates
collected per observation, may have detrimental effects on the success of
pulsar searches. Those methods immune to such effects are found to be
ill-equipped to deal with the problems associated with increasing data volumes
and candidate numbers, motivating the development of new approaches. We
therefore present a new method designed for on-line operation. It selects
promising candidates using a purpose-built tree-based machine learning
classifier, the Gaussian Hellinger Very Fast Decision Tree (GH-VFDT), and a new
set of features for describing candidates. The features have been chosen so as
to i) maximise the separation between candidates arising from noise and those
of probable astrophysical origin, and ii) be as survey-independent as possible.
Using these features, our new approach can process millions of candidates in
seconds (~1 million every 15 seconds), with high levels of pulsar recall
(90%+). This technique is therefore applicable to the large volumes of data
expected to be produced by the Square Kilometre Array (SKA). Use of this
approach has assisted in the discovery of 20 new pulsars in data obtained
during the LOFAR Tied-Array All-Sky Survey (LOTAAS).
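The abstract names the classifier's split measure but does not spell it out. As a minimal sketch, assume each class's values for a single feature are summarised by a univariate Gaussian (a mean and a standard deviation); the Hellinger distance between the two Gaussians then has a standard closed form. The function name and the example statistics below are hypothetical and are not taken from the paper.

```python
import math


def gaussian_hellinger(mu_p, sigma_p, mu_q, sigma_q):
    """Hellinger distance between N(mu_p, sigma_p^2) and N(mu_q, sigma_q^2).

    Assumption: each class is modelled as a univariate Gaussian per feature.
    Returns a value in [0, 1]; larger means better class separation.
    """
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    var_sum = sigma_p ** 2 + sigma_q ** 2
    # Bhattacharyya coefficient of the two Gaussians (1.0 when identical).
    overlap = math.sqrt(2.0 * sigma_p * sigma_q / var_sum) * math.exp(
        -((mu_p - mu_q) ** 2) / (4.0 * var_sum)
    )
    # Clamp guards against a tiny negative value caused by rounding.
    return math.sqrt(max(0.0, 1.0 - overlap))


# Hypothetical per-class statistics for one candidate feature.
noise_like = (0.0, 1.0)
pulsar_like = (10.0, 1.0)
print(gaussian_hellinger(*noise_like, *pulsar_like))
```

Under this assumption, a feature whose two class distributions barely overlap scores close to 1, while a feature that cannot tell the classes apart scores 0.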
[2]
oai:arXiv.org:1405.2278 [pdf] - 821796
Hellinger Distance Trees for Imbalanced Streams
Submitted: 2014-05-09
Classifiers trained on data sets possessing an imbalanced class distribution
are known to exhibit poor generalisation performance. This is known as the
imbalanced learning problem. The problem becomes particularly acute when we
consider incremental classifiers operating on imbalanced data streams,
especially when the learning objective is rare class identification. As
accuracy may provide a misleading impression of performance on imbalanced data,
existing stream classifiers based on accuracy can suffer poor minority class
performance on imbalanced streams, resulting in low minority class
recall rates. In this paper we address this deficiency by proposing the use of
the Hellinger distance measure as a very fast decision tree split criterion.
We demonstrate that, by using the Hellinger distance, a statistically
significant improvement in recall rates on imbalanced data streams can be
achieved, with an acceptable increase in the false positive rate.
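To illustrate why a Hellinger-based split criterion suits imbalanced data, the sketch below uses the count-based form commonly used for scoring a binary split: it compares how each class distributes its own examples across the two branches. This is an assumption for illustration; the abstract does not state the exact formulation used in the paper, and the function name and counts are hypothetical.

```python
import math


def hellinger_split(pos_left, pos_right, neg_left, neg_right):
    """Hellinger distance between the per-class branch distributions.

    Each class is normalised by its own total, so the score depends on
    how a class spreads over the branches, not on how large the class is.
    """
    pos_total = pos_left + pos_right
    neg_total = neg_left + neg_right
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both classes need at least one example")
    total = 0.0
    for pos_branch, neg_branch in ((pos_left, neg_left), (pos_right, neg_right)):
        total += (
            math.sqrt(pos_branch / pos_total) - math.sqrt(neg_branch / neg_total)
        ) ** 2
    return math.sqrt(total)


# Same branch proportions, but the negative class is 1000 times larger.
balanced = hellinger_split(8, 2, 3, 7)
skewed = hellinger_split(8, 2, 3000, 7000)
print(balanced, skewed)
```

Because the two scores agree, the ranking of candidate splits is unaffected by the class ratio, which is the property that makes the measure attractive when the rare class is the one of interest.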
[3]
oai:arXiv.org:1307.8012 [pdf] - 1515668
A Study on Classification in Imbalanced and Partially-Labelled Data
Streams
Submitted: 2013-07-30
The domain of radio astronomy is currently facing significant computational
challenges, foremost amongst which are those posed by the development of the
world's largest radio telescope, the Square Kilometre Array (SKA). Preliminary
specifications for this instrument suggest that the final design will
incorporate between 2000 and 3000 individual 15-metre receiving dishes, which
together can be expected to produce a data rate of many TB/s. Given such a high
data rate, it becomes crucial to consider how this information will be
processed and stored to maximise its scientific utility. In this paper, we
consider one possible data processing scenario for the SKA, for the purposes of
an all-sky pulsar survey. In particular we treat the selection of promising
signals from the SKA processing pipeline as a data stream classification
problem. We consider the feasibility of classifying signals that arrive via an
unlabelled and heavily class imbalanced data stream, using currently available
algorithms and frameworks. Our results indicate that existing stream learners
exhibit unacceptably low recall on real astronomical data when used in their
standard configuration; however, their good false positive performance and
accuracy comparable to that of static learners suggest they have definite
potential as an on-line solution to this particular big data challenge.
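The combination of low recall with good accuracy and false positive performance is what one would expect on a heavily imbalanced stream. The sketch below works this through on made-up labels (10 positives among 10,000 candidates); the counts and the function name are hypothetical and are not results from the paper.

```python
def confusion_metrics(y_true, y_pred):
    """Return accuracy, recall and false positive rate for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one example of each class")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical stream: 10 positives followed by 9,990 negatives.
y_true = [1] * 10 + [0] * 9990
# Hypothetical classifier: finds 2 of the 10 positives, raises 5 false alarms.
y_pred = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 9985

metrics = confusion_metrics(y_true, y_pred)
print(metrics)
```

In this made-up case the classifier misses eight of the ten positives yet is still more than 99% accurate, because the negatives dominate both the accuracy and the false positive rate.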