Full-text search for arXiv

Knowles, J. D.

Normalized to: Knowles, J.

3 article(s) in total. 4 co-authors, from 1 to 3 common article(s). Median position in authors list is 3.0.

[1] oai:arXiv.org:1603.05166 [pdf] - 1396952

Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach

Lyon, R. J.; Stappers, B. W.; Cooper, S.; Brooke, J. M.; Knowles, J. D.

Comments: Accepted for publication in MNRAS, 20 pages, 8 figures. See http://www.jb.man.ac.uk/pulsar/Surveys.html for survey data, and https://dx.doi.org/10.6084/m9.figshare.3080389.v1 for our data

Submitted: 2016-03-16

Improving survey specifications are causing an exponential rise in pulsar candidate numbers and data volumes. We study the candidate filters used to mitigate these problems during the past fifty years. We find that some existing methods, such as applying constraints on the total number of candidates collected per observation, may have detrimental effects on the success of pulsar searches. Those methods immune to such effects are found to be ill-equipped to deal with the problems associated with increasing data volumes and candidate numbers, motivating the development of new approaches. We therefore present a new method designed for on-line operation. It selects promising candidates using a purpose-built tree-based machine learning classifier, the Gaussian Hellinger Very Fast Decision Tree (GH-VFDT), and a new set of features for describing candidates. The features have been chosen so as to i) maximise the separation between candidates arising from noise and those of probable astrophysical origin, and ii) be as survey-independent as possible. Using these features, our new approach can process millions of candidates in seconds (~1 million every 15 seconds), with high levels of pulsar recall (90%+). This technique is therefore applicable to the large volumes of data expected to be produced by the Square Kilometre Array (SKA). Use of this approach has assisted in the discovery of 20 new pulsars in data obtained during the LOFAR Tied-Array All-Sky Survey (LOTAAS).

[2] oai:arXiv.org:1405.2278 [pdf] - 821796

Hellinger Distance Trees for Imbalanced Streams

Lyon, R. J.; Brooke, J. M.; Knowles, J. D.; Stappers, B. W.

Comments: 6 Pages, 2 figures, to be published in Proceedings 22nd International Conference on Pattern Recognition (ICPR) 2014

Submitted: 2014-05-09

Classifiers trained on data sets possessing an imbalanced class distribution are known to exhibit poor generalisation performance. This is known as the imbalanced learning problem. The problem becomes particularly acute when we consider incremental classifiers operating on imbalanced data streams, especially when the learning objective is rare class identification. As accuracy may provide a misleading impression of performance on imbalanced data, existing stream classifiers based on accuracy can suffer poor minority class performance on imbalanced streams, resulting in low minority class recall rates. In this paper we address this deficiency by proposing the use of the Hellinger distance measure as a very fast decision tree split criterion. We demonstrate that by using the Hellinger distance, a statistically significant improvement in recall rates on imbalanced data streams can be achieved, with an acceptable increase in the false positive rate.
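
As an illustration of the split criterion named in this abstract, the following is a minimal sketch of a Hellinger distance score for a candidate split in a two-class problem. It is not the authors' implementation: the function name `hellinger_split_score` and the representation of a split as per-branch class counts are assumptions made here for illustration, and the formula is the standard discrete Hellinger distance between the two class-conditional distributions over the branches.

```python
import math


def hellinger_split_score(pos_counts, neg_counts):
    """Score a candidate split by the Hellinger distance between classes.

    pos_counts[j] and neg_counts[j] are the numbers of minority-class and
    majority-class examples sent to branch j (hypothetical representation).
    The score ranges from 0 (branches do not separate the classes) to
    sqrt(2) (branches separate the classes perfectly).
    """
    total_pos = sum(pos_counts)
    total_neg = sum(neg_counts)
    if total_pos == 0 or total_neg == 0:
        # With only one class present there is nothing to separate.
        return 0.0

    squared = 0.0
    for pos, neg in zip(pos_counts, neg_counts):
        # Compare the fraction of each class that reaches this branch.
        squared += (math.sqrt(pos / total_pos) - math.sqrt(neg / total_neg)) ** 2
    return math.sqrt(squared)
```

Because each class is normalised by its own total, the score depends only on how each class is distributed across the branches and not on the ratio between the classes. Multiplying every majority-class count by the same factor leaves the score unchanged, which is why this kind of criterion is attractive for imbalanced data.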

[3] oai:arXiv.org:1307.8012 [pdf] - 1515668

A Study on Classification in Imbalanced and Partially-Labelled Data Streams

Lyon, R. J.; Brooke, J. M.; Knowles, J. D.; Stappers, B. W.

Comments: 6 Pages, 2 figures, to be published in Proceedings 2013 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

Submitted: 2013-07-30

The domain of radio astronomy is currently facing significant computational challenges, foremost amongst which are those posed by the development of the world's largest radio telescope, the Square Kilometre Array (SKA). Preliminary specifications for this instrument suggest that the final design will incorporate between 2000 and 3000 individual 15-metre receiving dishes, which together can be expected to produce a data rate of many TB/s. Given such a high data rate, it becomes crucial to consider how this information will be processed and stored to maximise its scientific utility. In this paper, we consider one possible data processing scenario for the SKA, for the purposes of an all-sky pulsar survey. In particular we treat the selection of promising signals from the SKA processing pipeline as a data stream classification problem. We consider the feasibility of classifying signals that arrive via an unlabelled and heavily class-imbalanced data stream, using currently available algorithms and frameworks. Our results indicate that existing stream learners exhibit unacceptably low recall on real astronomical data when used in standard configuration; however, their good false positive performance and accuracy comparable to that of static learners suggest they have definite potential as an on-line solution to this particular big data challenge.
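
The combination of low recall and high accuracy described in this abstract can be illustrated with a short sketch. The helper `accuracy_and_recall` and the example class sizes below are hypothetical and are not taken from the paper; they only show how a heavily imbalanced data set lets a classifier score well on accuracy while finding none of the rare class.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels, where 1 is the rare class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (true_pos + true_neg) / len(pairs)
    # Recall is undefined without positives; report 0.0 in that case.
    positives = true_pos + false_neg
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall


# Hypothetical stream: 10 rare-class examples among 10,000 in total.
labels = [1] * 10 + [0] * 9990
# A degenerate classifier that always predicts the majority class.
predictions = [0] * len(labels)

accuracy, recall = accuracy_and_recall(labels, predictions)
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

In this made-up case the classifier is correct on 9,990 of 10,000 examples yet recovers none of the 10 rare-class examples, so accuracy alone gives no indication of rare-class performance.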