Scaling data mining algorithms: application to instance and feature selection
- Haro García, Aida de (Author)
- Nicolás García-Pedrajas (Director)
Defence university: Universidad de Granada
Date of defence: 29 September 2011
- Francisco Herrera Triguero (Chair)
- José Manuel Benítez Sánchez (Secretary)
- Colin Fyfe (Committee member)
- César García Osorio (Committee member)
- Domingo Ortíz-Boyer (Committee member)
Type: Thesis
Abstract
The main objective of this thesis is the design of a general methodology to scale up data mining algorithms effectively without severe modifications. We consider it essential for the scaling methodology to be simple and generic: a very complex method will not be widely used by the data mining community and will therefore be less useful. Not only is our method simple, but it also treats data mining algorithms as black boxes, so researchers can use previously implemented versions of them without adapting either the algorithms or the scaling methodology to a specific problem. This feature makes our method generalizable to any data mining problem: the algorithm being scaled can be swapped for another without changes, the input format expected by the scaling approach does not change from one problem to another, and the output has the same form regardless of the problem at hand. The only step that varies from one problem to another is the last one, which combines the results of the different independent runs.

Since time and storage resources are not unlimited, we consider it a must that our scaling methodology reduces execution time while maintaining good performance. The efficiency of an algorithm can be measured in terms of execution time and the resources it requires. Scaling up to very large datasets implies, in part, that fast learning algorithms must be developed, so execution time is a key measure of the quality of a scaling method. Regarding resources, almost all existing implementations of learning algorithms operate with the training set entirely in main memory, and many algorithms achieve reduced runtime complexity through bookkeeping that increases the space used. However, whatever the runtime complexity of an algorithm, if exceeding the main memory limit leads to virtual memory thrashing, the algorithm will not scale well \citep{provost96}. We think that processing the data progressively, by applying the data mining algorithms to subsets of the original dataset, may be a solution to these problems.

Finally, the goal of the learning must be considered. Evaluating the performance of a scaling technique becomes complicated if degradation in the quality of the learning is permitted. The vast majority of work on learning algorithms uses classification accuracy as the metric by which different algorithms are compared; in such cases, the most interesting methods are those that scale up without a substantial decrease in accuracy.
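To make the partition-and-combine idea concrete, the sketch below wraps an arbitrary instance selection algorithm in Python. It is a minimal illustration under stated assumptions, not the thesis's actual method: the names `scale_instance_selection`, `select_fn`, `subset_size`, `rounds`, and `vote_threshold` are hypothetical, and the vote-based final combination is just one plausible instantiation of the problem-specific last step described in the abstract.

```python
import numpy as np

def scale_instance_selection(X, y, select_fn, subset_size=1000,
                             rounds=10, vote_threshold=None, rng=None):
    """Partition-and-combine wrapper around a black-box selection algorithm.

    Each round shuffles the data and splits it into disjoint subsets of at
    most `subset_size` instances; `select_fn` is applied to each subset
    unmodified, and every instance it discards receives one removal vote.
    The final combination step keeps the instances whose removal votes
    fall below `vote_threshold` (hypothetical default: simple majority).
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    votes = np.zeros(n, dtype=int)          # removal votes per instance

    for _ in range(rounds):
        order = rng.permutation(n)          # random partition for this round
        for start in range(0, n, subset_size):
            idx = order[start:start + subset_size]
            # Black box: select_fn only ever sees one small subset and
            # returns the subset-relative indices of the instances it keeps.
            kept = np.asarray(select_fn(X[idx], y[idx]), dtype=int)
            removed = np.setdiff1d(np.arange(len(idx)), kept)
            votes[idx[removed]] += 1        # map back to global indices

    if vote_threshold is None:
        vote_threshold = rounds // 2 + 1
    return np.flatnonzero(votes < vote_threshold)

# Sanity check with a trivial "keep everything" selector plugged in as
# the black box; any real instance selection routine could replace it.
X = np.random.rand(10_000, 20)
y = np.random.randint(0, 2, size=10_000)
kept = scale_instance_selection(X, y, select_fn=lambda Xs, ys: range(len(ys)))
assert len(kept) == len(y)
```

Note how the wrapper never hands the selection algorithm more than `subset_size` instances at a time, which reflects the memory argument made in the abstract: the base algorithm's working set stays small regardless of the size of the full dataset.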