Classification methods are used for a growing number of tasks. For many real phenomena, the true decision boundaries between classes change over time, i.e. a real concept drift, a change in p(y|x), can occur. Such changes can be detected by learning from data streams. Hence, incremental machine learning methods have been developed. Some of the most successful of them rely on ensemble models. The objective of the thesis is to analyse how the base learners of these models can be trained, used and replaced to increase the performance of ensemble models. Particular attention will be paid to identifying subsets of instances for which individual learners produce the best predictions, and to using this knowledge to build evolving ensemble models that aim to outperform state-of-the-art methods such as Adaptive Random Forest (ARF) and Streaming Random Patches (SRP).
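To make the notion of real concept drift concrete, the following minimal sketch generates a synthetic stream in which p(x) stays the same while the labelling rule, and hence p(y|x), flips at a chosen instant. The stream generator, the threshold of 0.5 and all function names are illustrative assumptions, not part of the thesis.

```python
import random

def make_stream(n, drift_at, seed=0):
    """Return n (x, y) pairs; x is always uniform on [0, 1),
    but the labelling rule flips at position drift_at."""
    rng = random.Random(seed)
    pairs = []
    for t in range(n):
        x = rng.random()
        if t < drift_at:
            y = int(x > 0.5)    # concept A
        else:
            y = int(x <= 0.5)   # concept B: same p(x), different p(y|x)
        pairs.append((x, y))
    return pairs

def prob_y1_given_high_x(pairs):
    """Empirical estimate of p(y = 1 | x > 0.5)."""
    labels = [y for x, y in pairs if x > 0.5]
    return sum(labels) / len(labels)

stream = make_stream(2000, drift_at=1000)
before = prob_y1_given_high_x(stream[:1000])
after = prob_y1_given_high_x(stream[1000:])
print(before, after)
```

A classifier trained only on the first half of such a stream would be systematically wrong on the second half, which is the situation that incremental and ensemble methods are meant to handle.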
For many real systems, data documenting their behaviour become available gradually, in the form of a potentially unbounded data stream of instances rather than a fixed data set. Over time, the data distribution may change, i.e. concept drift may occur. In particular, drift detectors can be applied to GPS traces of public transport vehicles to detect where statistically significant delays emerge and where they are reduced. This raises the challenge of processing possibly millions of location records that reveal disturbances in the functioning of public transport systems, and of identifying evolving locations in which delays are both significant in value and go beyond ordinary fluctuations. A Big Data processing framework such as Apache Spark will be used to implement such a system. GPS traces from the public transport operator in Warsaw, Poland, documenting up to approx. 2,000 buses and trams operating at the same time, will be used for this work.
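As an illustration of how a drift detector could flag a growing delay at one location, here is a minimal sketch of the classical Page-Hinkley test applied to a synthetic stream of delays in seconds. The choice of Page-Hinkley, the parameter values and the synthetic data are assumptions made for this example; the topic description does not prescribe a particular detector.

```python
import random

class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a numeric stream."""

    def __init__(self, delta=5.0, threshold=200.0, min_instances=30):
        self.delta = delta                  # tolerated deviation (seconds here)
        self.threshold = threshold          # alarm level for the test statistic
        self.min_instances = min_instances  # warm-up before alarms are allowed
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0       # cumulative deviation from the running mean
        self.min_cum = 0.0   # smallest cumulative deviation seen so far

    def update(self, value):
        """Consume one observation; return True if a drift is signalled."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cum += value - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        drift = (self.n >= self.min_instances
                 and self.cum - self.min_cum > self.threshold)
        if drift:
            self.reset()     # start monitoring the new regime
        return drift

# Hypothetical delays at one stop: about 60 s at first, about 120 s later.
rng = random.Random(42)
delays = ([rng.gauss(60, 10) for _ in range(200)]
          + [rng.gauss(120, 10) for _ in range(200)])

detector = PageHinkley()
detections = [i for i, d in enumerate(delays) if detector.update(d)]
print(detections)
```

In a system covering a whole city, one such detector per stop or road segment could be kept, with `delta` acting as the margin that separates ordinary fluctuations from delays of significant value. A reduction in delays could be monitored in the same way by feeding the negated values to a second detector.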
A core part of the thesis is designing the algorithms and developing a system that builds multiple classifiers in parallel, relying on batch and online learning methods applied to different data streams and/or sequences of data sets, referred to as data sources. A Big Data processing framework such as Apache Spark will be used to implement such a system. The primary challenge is to automate the detection of concept drift in each data source and to update the multiple models evolved in parallel based on these detections, while avoiding unnecessary model updates, e.g. those caused by false drift detections or by changes in data that are statistically significant yet minor.
The fact that data are available as data streams suggests the use of online learning methods, which incrementally update models such as classifiers. However, batch learning applied to batches of data, such as sequences of instances from a data stream, could potentially yield better performance, e.g. higher model accuracy. The primary objective of the thesis was to design and implement a library that makes it possible to compare the results of using batch and stream mining methods for data streams. Importantly, the library was expected to include a novel method, proposed in the thesis, that automates model selection and/or uses the underlying batch and online models as members of an ensemble model. The objectives have been fulfilled. A novel hybrid method evolving an ensemble of batch and online learners has been proposed. It has been successfully applied to the problem of travel mode choice prediction and tested with multiple real trip data sets collected under the revealed preferences approach. Results of this work were also presented at the Symposium on Intelligent Data Analysis held in Stockholm, 24-26 April 2024. More details can be found in the paper: Golik, P., Grzenda, M., Sienkiewicz, E., Hybrid Ensemble-Based Travel Mode Prediction, in: Miliou, I., Piatkowski, N., Papapetrou, P. (eds) Advances in Intelligent Data Analysis XXII. IDA 2024. Lecture Notes in Computer Science, vol 14641. Springer, Cham, 2024.
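The general idea of selecting between batch and online models can be sketched as follows. This is a toy illustration only and not the method from the cited paper: the two majority-class learners, the window sizes and the selection rule (follow the learner with the best accuracy over recent instances) are all assumptions chosen to keep the example short, and the feature value x is ignored by both learners.

```python
from collections import Counter, deque

class MajorityOnline:
    """Online learner: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else 0
    def learn(self, x, y):
        self.counts[y] += 1

class MajorityBatch:
    """Batch learner: refitted from scratch on each full window of instances."""
    def __init__(self, window=50):
        self.window = window
        self.buffer = []
        self.label = 0
    def predict(self, x):
        return self.label
    def learn(self, x, y):
        self.buffer.append(y)
        if len(self.buffer) == self.window:
            self.label = Counter(self.buffer).most_common(1)[0][0]
            self.buffer = []

class HybridSelector:
    """Follows whichever member was most accurate on the recent instances."""
    def __init__(self, learners, horizon=30):
        self.learners = learners
        self.hits = [deque(maxlen=horizon) for _ in learners]
    def predict(self, x):
        scores = [sum(h) / len(h) if h else 0.0 for h in self.hits]
        return self.learners[scores.index(max(scores))].predict(x)
    def learn(self, x, y):
        for learner, h in zip(self.learners, self.hits):
            h.append(int(learner.predict(x) == y))  # test first ...
            learner.learn(x, y)                     # ... then train

def prequential_accuracy(model, stream):
    """Test-then-train evaluation over the whole stream."""
    correct = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)
        model.learn(x, y)
    return correct / len(stream)

# Synthetic stream with an abrupt change of the dominant class.
stream = [(0.0, 0)] * 300 + [(0.0, 1)] * 300
hybrid_acc = prequential_accuracy(
    HybridSelector([MajorityOnline(), MajorityBatch()]), stream)
online_acc = prequential_accuracy(MajorityOnline(), stream)
print(hybrid_acc, online_acc)
```

On this stream the windowed batch learner recovers from the change much sooner than the online learner that accumulates all past counts, and the selector switches to it shortly after the change. With real learners, either kind of model may be the better one at a given time, which is what motivates selecting or combining them automatically.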
As a part of the thesis, the student analysed recent developments in concept drift detection methods, with emphasis on their use to trigger updates of classification and/or regression models. The challenge was to avoid detecting drifts of minor magnitude, which do not justify updating a model because the update would not actually improve its performance as quantified by e.g. accuracy or MSE. A part of the thesis was the evaluation of selected drift detectors and their comparison on reference data streams, taking into account the performance of the updated models. These objectives were successfully addressed. A method capable of identifying the features most relevant for detecting drift periods that influence machine learning models was proposed.
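One simple way to avoid updates after minor drifts is to accept a model update only when a candidate model clearly reduces the error on recent data. The sketch below shows this for a trivial regression model that predicts a constant; the function names, the 50/50 split of the recent window and the `min_gain` margin are assumptions for illustration and do not describe the method proposed in the thesis.

```python
def mse(prediction, targets):
    """Mean squared error of a constant prediction."""
    return sum((prediction - y) ** 2 for y in targets) / len(targets)

def should_update(current_prediction, recent_targets, min_gain=0.1):
    """Decide whether a signalled drift justifies replacing the model.

    The candidate model is fitted on the first half of the recent window
    and both models are compared on the second half. The update is accepted
    only if the candidate lowers MSE by more than the fraction min_gain.
    Returns (decision, candidate_prediction).
    """
    half = len(recent_targets) // 2
    train, holdout = recent_targets[:half], recent_targets[half:]
    candidate = sum(train) / len(train)
    old_error = mse(current_prediction, holdout)
    new_error = mse(candidate, holdout)
    return new_error < (1 - min_gain) * old_error, candidate

# Current model predicts 10.0; the targets fluctuate by +/- 1 around a level.
minor_shift = [10.1 + (-1) ** i for i in range(40)]   # level moved to 10.1
major_shift = [15.0 + (-1) ** i for i in range(40)]   # level moved to 15.0

update_minor, _ = should_update(10.0, minor_shift)
update_major, new_level = should_update(10.0, major_shift)
print(update_minor, update_major, new_level)
```

A detector may report both shifts as statistically significant given enough data, but only the second one changes the error enough to make retraining worthwhile under this rule.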
We can propose PhD topics for MSc graduates interested in data analytics, in particular machine learning methods and their applications, and Big Data analytics. Should you want to undertake research within our group and obtain a doctoral degree in Information and Communication Technology, please let us know. Potential candidates can apply for PhD student status at the Doctoral School of the Warsaw University of Technology. For details of the admission process, please check recent updates at the Doctoral School.
Warsaw University of Technology, Faculty of Mathematics and Information Science, Koszykowa 75, Warsaw, Poland
maciej.grzenda at pw.edu.pl
© Stream Mining and Urban Data Systems Group. All Rights Reserved.