Modern methods of machine learning with application in bioinformatics

Artificial intelligence (AI) and machine learning are currently advancing at such a fast pace that some even compare it to the growth rate of computational power over the last decades. From prophecies about world domination to predictions about AI catalyzing an industrial revolution, it is the wid...


Bibliographic record details
Main author: Παπαδοκωστάκης, Γεώργιος
Other authors: Papadokostakis, Georgios
Language: English
Published: 2022
Subjects: Generative adversarial nets; RNA; Bioinformatics; Τεχνητές ακολουθίες ncRNA; Βιοπληροφορική
Available online: http://hdl.handle.net/10889/16089
id nemertes-10889-16089
record_format dspace
spelling nemertes-10889-16089 2022-09-05T06:57:06Z Modern methods of machine learning with application in bioinformatics Σύγχρονες μέθοδοι μηχανικής μάθησης με εφαρμογή στην βιοπληροφορική Παπαδοκωστάκης, Γεώργιος Papadokostakis, Georgios Generative adversarial nets RNA Bioinformatics Τεχνητές ακολουθίες ncRNA Βιοπληροφορική Artificial intelligence (AI) and machine learning are currently advancing at such a fast pace that some even compare it to the growth rate of computational power over the last decades. From prophecies about world domination to predictions about AI catalyzing an industrial revolution, it is the widespread consensus that these technologies are definitely here to stay and will affect our lives in unexpected ways. They have been applied to many fields, including biology, which is the focus of this thesis. In the current work we focus on Generative Adversarial Nets (GANs), an algorithm that has boosted the state of the art in artificial data generation and is currently being studied vigorously. Drawing inspiration from previous work, we leverage GANs to generate artificial biological data, more specifically non-coding RNA (ncRNA) sequences of multiple types, while controlling the type and properties of the generated samples. We then demonstrate various ways to evaluate the quality of the generated data by combining domain-specific features of the data with general GAN evaluation metrics. Initially, features of the samples are calculated that involve thermodynamic, structural and sequential properties of ncRNA sequences, and Fréchet Inception Distance (FID) is used as a statistical distance between real and artificial data, serving as a measure of generated data quality. This measure is then used as a fitness function to optimize the model hyperparameters with random search. 
Finally, we explore various domain-specific and general ways to further evaluate the quality of generated data and demonstrate the model’s ability to capture the underlying structure of the ncRNA sequences. These include visually comparing the plotted secondary structures of real and artificial ncRNA sequences, aligning generated sequences against databases of known ncRNAs from humans and other mammalian species, and using a trained classifier to gauge the artificial data quality by analyzing the classifier predictions on generated data. Overall, the evaluation attempts showed great promise in the model’s ability to generate realistic ncRNA sequences with the desired properties for most ncRNA types of the dataset. In this work, we developed a GAN model for generating artificial ncRNA sequences. We showed that, by training the networks with conditional input, we can control the type of the generated sequence. We investigated many different ways of evaluating the quality of the artificial data, such as comparing secondary features and structures of the sequences between the generated data and the training data, as well as comparing against other sequences from known databases. Drawing inspiration from existing work in the field, we showed the ability of GANs to decode structures and properties of this type of data and to produce realistic artificial samples for several of the ncRNA sequence types, given the limitations of the available training set. Further analysis can be carried out in the field to improve the results for certain sequence types that were not adequately represented in the dataset. The ability of GANs to generate nucleotide sequences of other types could also be explored, for example genes or the whole genome of simple organisms such as viruses. 2022-03-17T06:58:41Z 2022-03-17T06:58:41Z 2022-03-16 http://hdl.handle.net/10889/16089 en application/pdf
institution UPatras
collection Nemertes
language English
topic Generative adversarial nets
RNA
Bioinformatics
Τεχνητές ακολουθίες ncRNA
Βιοπληροφορική
spellingShingle Generative adversarial nets
RNA
Bioinformatics
Τεχνητές ακολουθίες ncRNA
Βιοπληροφορική
Παπαδοκωστάκης, Γεώργιος
Modern methods of machine learning with application in bioinformatics
description Artificial intelligence (AI) and machine learning are currently advancing at such a fast pace that some even compare it to the growth rate of computational power over the last decades. From prophecies about world domination to predictions about AI catalyzing an industrial revolution, it is the widespread consensus that these technologies are definitely here to stay and will affect our lives in unexpected ways. They have been applied to many fields, including biology, which is the focus of this thesis. In the current work we focus on Generative Adversarial Nets (GANs), an algorithm that has boosted the state of the art in artificial data generation and is currently being studied vigorously. Drawing inspiration from previous work, we leverage GANs to generate artificial biological data, more specifically non-coding RNA (ncRNA) sequences of multiple types, while controlling the type and properties of the generated samples. We then demonstrate various ways to evaluate the quality of the generated data by combining domain-specific features of the data with general GAN evaluation metrics. Initially, features of the samples are calculated that involve thermodynamic, structural and sequential properties of ncRNA sequences, and Fréchet Inception Distance (FID) is used as a statistical distance between real and artificial data, serving as a measure of generated data quality. This measure is then used as a fitness function to optimize the model hyperparameters with random search. Finally, we explore various domain-specific and general ways to further evaluate the quality of generated data and demonstrate the model’s ability to capture the underlying structure of the ncRNA sequences. 
These include visually comparing the plotted secondary structures of real and artificial ncRNA sequences, aligning generated sequences against databases of known ncRNAs from humans and other mammalian species, and using a trained classifier to gauge the artificial data quality by analyzing the classifier predictions on generated data. Overall, the evaluation attempts showed great promise in the model’s ability to generate realistic ncRNA sequences with the desired properties for most ncRNA types of the dataset.
author2 Papadokostakis, Georgios
author_facet Papadokostakis, Georgios
Παπαδοκωστάκης, Γεώργιος
author Παπαδοκωστάκης, Γεώργιος
author_sort Παπαδοκωστάκης, Γεώργιος
title Modern methods of machine learning with application in bioinformatics
title_short Modern methods of machine learning with application in bioinformatics
title_full Modern methods of machine learning with application in bioinformatics
title_fullStr Modern methods of machine learning with application in bioinformatics
title_full_unstemmed Modern methods of machine learning with application in bioinformatics
title_sort modern methods of machine learning with application in bioinformatics
publishDate 2022
url http://hdl.handle.net/10889/16089
work_keys_str_mv AT papadokōstakēsgeōrgios modernmethodsofmachinelearningwithapplicationinbioinformatics
AT papadokōstakēsgeōrgios synchronesmethodoimēchanikēsmathēsēsmeepharmogēstēnbioplērophorikē
_version_ 1771297173929984000