Modern methods of machine learning with application in bioinformatics
Artificial intelligence (AI) and machine learning are currently advancing at such a fast pace that some even compare it to the growth rate of computational power during the last decades. From prophecies about world domination to predictions about AI catalyzing an industrial revolution, it is the wid...
Main author: | |
---|---|
Other authors: | |
Language: | English |
Published: | 2022 |
Subjects: | |
Available Online: | http://hdl.handle.net/10889/16089 |
id |
nemertes-10889-16089 |
---|---|
record_format |
dspace |
spelling |
nemertes-10889-16089 2022-09-05T06:57:06Z Modern methods of machine learning with application in bioinformatics Σύγχρονες μέθοδοι μηχανικής μάθησης με εφαρμογή στην βιοπληροφορική Παπαδοκωστάκης, Γεώργιος Papadokostakis, Georgios Generative adversarial nets RNA Bioinformatics Artificial ncRNA sequences Artificial intelligence (AI) and machine learning are currently advancing at such a fast pace that some even compare it to the growth rate of computational power during the last decades. From prophecies about world domination to predictions about AI catalyzing an industrial revolution, the widespread consensus is that these technologies are definitely here to stay and will affect our lives in unexpected ways. They have been applied to many fields, including biology, which is the focus of this thesis. In the current work we focus on Generative Adversarial Nets (GANs), an algorithm that has advanced the state of the art in artificial data generation and is currently being studied vigorously. Drawing inspiration from previous work, we leverage GANs to generate artificial biological data, specifically non-coding RNA (ncRNA) sequences of multiple types, while controlling the type and properties of the generated samples. We then demonstrate various ways to evaluate the quality of the generated data by combining domain-specific features of the data with general GAN evaluation metrics. Initially, features of the samples are calculated that involve thermodynamic, structural and sequence-level properties of ncRNA sequences, and the Fréchet Inception Distance (FID) is used as a statistical distance between real and artificial data, serving as a measure of generated data quality. This measure is then used as a fitness function to optimize the model hyperparameters with random search.
Finally, we explore various domain-specific and general ways to further evaluate the quality of generated data and demonstrate the model’s ability to capture the underlying structure of the ncRNA sequences. These include visually comparing the plotted secondary structures of real and artificial ncRNA sequences, aligning generated sequences against databases of known ncRNAs from humans and other mammalian species, and using a trained classifier to gauge the artificial data quality by analyzing the classifier predictions on generated data. Overall, the evaluation attempts showed great promise in the model’s ability to generate realistic ncRNA sequences with the desired properties for most ncRNA types of the dataset. In the present work, we developed a GAN model for generating artificial ncRNA sequences. We showed that, by training the networks with conditional input, we can control the type of the generated sequence. We investigated many different ways of evaluating the quality of the artificial data, such as comparing secondary features and structures of the sequences between the generated data and the training data, as well as comparing against other sequences from known databases. Drawing inspiration from existing work in the field, we showed the ability of GANs to decode structures and properties of this particular type of data and to produce realistic artificial samples for several of the ncRNA sequence types, given the limitations of the available training set. Further analysis can be carried out to optimize the results for certain sequence types that were not adequately represented in the dataset. The ability of GANs to generate nucleotide sequences of other types could also be explored, for example genes or the entire genome of simple organisms such as viruses. 2022-03-17T06:58:41Z 2022-03-17T06:58:41Z 2022-03-16 http://hdl.handle.net/10889/16089 en application/pdf |
institution |
UPatras |
collection |
Nemertes |
language |
English |
topic |
Generative adversarial nets RNA Bioinformatics Artificial ncRNA sequences |
spellingShingle |
Generative adversarial nets RNA Bioinformatics Artificial ncRNA sequences Παπαδοκωστάκης, Γεώργιος Modern methods of machine learning with application in bioinformatics |
description |
Artificial intelligence (AI) and machine learning are currently advancing at such a fast
pace that some even compare it to the growth rate of computational power during the last
decades. From prophecies about world domination to predictions about AI catalyzing an
industrial revolution, the widespread consensus is that these technologies are definitely
here to stay and will affect our lives in unexpected ways. They have been applied to many
fields, including biology, which is the focus of this thesis.
In the current work we focus on Generative Adversarial Nets (GANs), an algorithm
that has advanced the state of the art in artificial data generation and is currently being
studied vigorously. Drawing inspiration from previous work, we leverage GANs to generate
artificial biological data, specifically non-coding RNA (ncRNA) sequences
of multiple types, while controlling the type and properties of the generated samples. We
then demonstrate various ways to evaluate the quality of the generated data by combining
domain-specific features of the data with general GAN evaluation metrics.
Initially, features of the samples are calculated that involve thermodynamic, structural
and sequence-level properties of ncRNA sequences, and the Fréchet Inception Distance (FID)
is used as a statistical distance between real and artificial data, serving as a measure of
generated data quality. This measure is then used as a fitness function to optimize the model
hyperparameters with random search. Finally, we explore various domain-specific and
general ways to further evaluate the quality of generated data and demonstrate the model’s
ability to capture the underlying structure of the ncRNA sequences. These include visually
comparing the plotted secondary structures of real and artificial ncRNA sequences,
aligning generated sequences against databases of known ncRNAs from humans and other
mammalian species, and using a trained classifier to gauge the artificial data quality by
analyzing the classifier predictions on generated data. Overall, the evaluation attempts
showed great promise in the model’s ability to generate realistic ncRNA sequences with
the desired properties for most ncRNA types of the dataset. |
author2 |
Papadokostakis, Georgios |
author_facet |
Papadokostakis, Georgios Παπαδοκωστάκης, Γεώργιος |
author |
Παπαδοκωστάκης, Γεώργιος |
author_sort |
Παπαδοκωστάκης, Γεώργιος |
title |
Modern methods of machine learning with application in bioinformatics |
title_short |
Modern methods of machine learning with application in bioinformatics |
title_full |
Modern methods of machine learning with application in bioinformatics |
title_fullStr |
Modern methods of machine learning with application in bioinformatics |
title_full_unstemmed |
Modern methods of machine learning with application in bioinformatics |
title_sort |
modern methods of machine learning with application in bioinformatics |
publishDate |
2022 |
url |
http://hdl.handle.net/10889/16089 |
work_keys_str_mv |
AT papadokōstakēsgeōrgios modernmethodsofmachinelearningwithapplicationinbioinformatics AT papadokōstakēsgeōrgios synchronesmethodoimēchanikēsmathēsēsmeepharmogēstēnbioplērophorikē |
_version_ |
1771297173929984000 |
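
The description above mentions using a Fréchet distance between real and artificial samples as a quality measure. As a rough illustration only, and not the thesis's own implementation, the sketch below computes the squared Fréchet distance between two Gaussians fitted to per-sequence feature vectors. The function name `frechet_distance` and the idea of passing plain feature matrices (for example GC content or folding energy per sequence) are assumptions made for this example; the record does not say which features or code the author used.

```python
import numpy as np


def frechet_distance(real_feats, fake_feats):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Each argument is an (n_samples, n_features) array of per-sequence
    features. The feature choice is hypothetical and left to the caller.
    """
    real_feats = np.asarray(real_feats, dtype=float)
    fake_feats = np.asarray(fake_feats, dtype=float)

    mu_r = real_feats.mean(axis=0)
    mu_f = fake_feats.mean(axis=0)
    # rowvar=False: rows are samples, columns are features.
    cov_r = np.atleast_2d(np.cov(real_feats, rowvar=False))
    cov_f = np.atleast_2d(np.cov(fake_feats, rowvar=False))

    # Tr((cov_r cov_f)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_f, which are real and non-negative for
    # positive semi-definite inputs; clip tiny negative round-off values.
    eigvals = np.linalg.eigvals(cov_r @ cov_f).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_f) - 2.0 * trace_sqrt)
```

Lower values mean the two feature distributions are closer. A score of this kind could plausibly serve as the objective when comparing hyperparameter settings drawn by random search, as the description outlines, though the exact setup used in the thesis is not given in this record.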