Published June 27, 2016
| Version v1
Publication
Data Cleansing Meets Feature Selection: A Supervised Machine Learning Approach
Description
This paper presents a novel procedure that applies sequentially
two data preparation techniques of a different nature: data cleansing
and feature selection. For the former, we experimented
with a partial removal of outliers via the inter-quartile range, whereas for
the latter we chose relevant attributes with two widespread feature
subset selectors, CFS (Correlation-based Feature Selection) and
CNS (Consistency-based Feature Selection), which are founded on correlation
and consistency measures, respectively. Empirical results are outlined on seven
difficult binary and multi-class data sets, that is, data sets with a test error rate of
at least 10%, according to accuracy, with the C4.5 or 1-nearest neighbour
classifiers without any kind of prior data pre-processing.
Non-parametric statistical tests show that combining the aforementioned
two data preparation strategies, using a correlation measure for
feature selection with the C4.5 algorithm, is significantly better, measured with
the ROC measure, than applying the data cleansing approach alone.
Last but not least, a weak learner like PART
achieved promising results with the new proposal based on a consistency
measure and is able to compete with the best configuration of C4.5. To
sum up, under the new approach and for the ROC measure, the PART classifier
with a consistency metric behaves slightly better than C4.5 with a
correlation measure.
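The two-stage idea described above (outlier removal via the inter-quartile range, then correlation-based feature subset selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several choices are assumptions made only for illustration: the conventional 1.5 x IQR fences, dropping a whole row when any single feature falls outside its fences (the paper's "partial" removal criterion is not specified here), absolute Pearson correlation as a stand-in correlation measure, numeric class labels, and a greedy forward search. All function names are hypothetical.

```python
import math
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences for one numeric column (k=1.5 is an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(rows, labels, k=1.5):
    """Drop every row with at least one feature outside its column's fences."""
    columns = list(zip(*rows))
    bounds = [iqr_bounds(col, k) for col in columns]
    kept = [
        (row, y)
        for row, y in zip(rows, labels)
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, bounds))
    ]
    return [r for r, _ in kept], [y for _, y in kept]


def pearson(xs, ys):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy / math.sqrt(sxx * syy))


def cfs_merit(subset, columns, labels):
    """CFS-style merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    r_cf = statistics.fmean([pearson(columns[i], labels) for i in subset])
    pairs = [(i, j) for i in subset for j in subset if i < j]
    r_ff = (
        statistics.fmean([pearson(columns[i], columns[j]) for i, j in pairs])
        if pairs
        else 0.0
    )
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def forward_select(rows, labels):
    """Greedy forward search; stops when the merit no longer improves."""
    columns = list(zip(*rows))
    selected, best = [], 0.0
    while True:
        candidates = [i for i in range(len(columns)) if i not in selected]
        scored = [(cfs_merit(selected + [i], columns, labels), i) for i in candidates]
        if not scored:
            break
        merit, idx = max(scored)
        if merit <= best:
            break
        selected.append(idx)
        best = merit
    return sorted(selected)
```

In this sketch, calling remove_outliers(rows, labels) first and then forward_select on the cleaned rows mirrors the sequential order described in the abstract; the returned indices identify the retained attributes, which could then be passed to any classifier.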
Abstract
MICYT TIN2007-68084-C02-02
MICYT TIN2011-28956-C02-02
Junta de Andalucía P11-TIC-7528
Additional details
Identifiers
- URL
- https://idus.us.es/handle/11441/42752
- URN
- urn:oai:idus.us.es:11441/42752
Origin repository
- Origin repository
- USE