Development of a biomedical data classification tool with the use of boosting techniques


Bibliographic record details
Main author: Παναγιωτόπουλος, Κωνσταντίνος
Other authors: Panagiotopoulos, Konstantinos
Language: English
Published: 2022
Subjects:
Available online: https://nemertes.library.upatras.gr/handle/10889/23331
Description
Abstract: High-throughput methods have become standard now that the technology is more available and accessible than ever. The massive amounts of data they produce are of high value to scientists and, combined with clinical data and metadata, can lead to breakthroughs and improvements in personalized medicine and prognosis. Bioinformatics bridges the gap between traditional biology and computer science by developing computational tools that extract useful knowledge from biological data. Very often, though, real-world problems have more than one objective, and some of these objectives conflict with each other. This raises the need for algorithms that can reliably handle multi-objective, high-dimensional problems. The goal of this thesis is to harness multi-objective optimization techniques to optimize the feature subset and the parameters of boosting classifiers applied to two-dimensional quantitative datasets, in an attempt to increase predictive accuracy and decrease the size of the revealed biosignatures. XGBoost was used as the boosting classification algorithm because of the popularity it has gained in recent years and the performance advantages it has demonstrated in machine learning competitions. This type of algorithm has a large number of hyper-parameters, so an evolutionary algorithm is used to handle both parameter optimization and biomarker detection in a vast search space. To evaluate the solutions, a niched Pareto rank scheme was used to avoid premature convergence to a local minimum and to promote exploration of the search space. When the termination criteria are reached, the final population is evaluated and the solutions are ranked by their performance on the multiple objectives. Finally, because the problems addressed are multi-objective, the algorithm returns multiple Pareto-optimal trained models.
For this thesis, two datasets were used to test the performance of the pipeline. The first is the “Ornish” dataset, which consists of gene expression profiles obtained by microarrays from people undergoing an intensive lifestyle intervention, in order to study its effects on weight loss and on lowering cardiovascular disease (CVD) risk. The second comes from the “OPERA” study and consists mostly of nominal features in the form of survey questions; it explores the effects of replacing opioid drugs with topical painkillers on four different outcomes of the study. The machine learning models produced with the presented method significantly improved on the discrimination power of state-of-the-art machine learning methods, which were also deployed for comparison. The results of this work are very encouraging, and the method has the potential to increase predictive accuracy and aid biomarker discovery when applied to personalized medicine and biomedical applications in the future.
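
The Pareto ranking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes only two objectives (classification accuracy to maximize, number of selected features to minimize), uses plain Pareto ranking without the niching step, and the candidate values are invented for illustration. The names `dominates` and `pareto_ranks` are hypothetical helpers.

```python
from typing import List, Tuple

# Assumption: each candidate is (accuracy, n_features).
# Accuracy is maximized; the number of selected features is minimized.
Candidate = Tuple[float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_ranks(population: List[Candidate]) -> List[int]:
    """Rank 1 is the non-dominated front, rank 2 the next front, and so on."""
    ranks = [0] * len(population)
    remaining = set(range(len(population)))
    rank = 1
    while remaining:
        # A candidate is in the current front if nothing still unranked dominates it.
        front = {
            i for i in remaining
            if not any(dominates(population[j], population[i])
                       for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


# Hypothetical (accuracy, number of selected features) pairs, not thesis results.
population = [(0.90, 40), (0.88, 12), (0.85, 12), (0.90, 55), (0.80, 5)]
ranks = pareto_ranks(population)
pareto_front = [c for c, r in zip(population, ranks) if r == 1]
```

Here the rank-1 candidates form the Pareto front: none of them can be improved on one objective without getting worse on the other, which is why a multi-objective search returns several models rather than a single best one. A niched scheme would additionally penalize candidates that crowd the same region of the front; that step is omitted here.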