Modern methods of machine learning with application in bioinformatics

Bibliographic record details
Main author: Παπαδοκωστάκης, Γεώργιος
Other authors: Papadokostakis, Georgios
Language: English
Published: 2022
Subjects:
Available online: http://hdl.handle.net/10889/16089
Description
Abstract: Artificial intelligence (AI) and machine learning are currently advancing at such a fast pace that some even compare it to the growth rate of computational power during the last decades. From prophecies about world domination to predictions about AI catalyzing an industrial revolution, there is widespread consensus that these technologies are here to stay and will affect our lives in unexpected ways. They have been applied to many fields, including biology, which is the focus of this thesis. In the current work we focus on Generative Adversarial Nets (GANs), an algorithm that has advanced the state of the art in artificial data generation and is currently being studied vigorously. Drawing inspiration from previous work, we leverage GANs to generate artificial biological data, specifically non-coding RNA (ncRNA) sequences of multiple types, while controlling the type and properties of the generated samples. We then demonstrate various ways to evaluate the quality of the generated data by combining domain-specific features of the data with general GAN evaluation metrics. Initially, features of the samples are calculated that involve thermodynamic, structural and sequential properties of ncRNA sequences, and the Fréchet Inception Distance (FID) is used as a statistical distance between real and artificial samples to measure the quality of the generated data. This measure is then used as a fitness function to optimize the model hyperparameters with random search. Finally, we explore various domain-specific and general ways to further evaluate the quality of the generated data and demonstrate the model's ability to capture the underlying structure of the ncRNA sequences. These include visually comparing the plotted secondary structures of real and artificial ncRNA sequences, aligning generated sequences against databases of known ncRNAs from humans and other mammalian species, and using a trained classifier to gauge the quality of the artificial data by analyzing its predictions on generated data. Overall, the evaluations showed great promise in the model's ability to generate realistic ncRNA sequences with the desired properties for most ncRNA types in the dataset.
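
As an illustration of the FID-style measure mentioned in the abstract, the following is a minimal sketch of the Fréchet distance between two sets of feature vectors, each summarized by a Gaussian (mean and covariance). It is not the thesis's implementation: the function name `frechet_distance`, the feature matrices, and the use of NumPy are assumptions made for illustration, and the actual thermodynamic, structural and sequential features are not reproduced here.

```python
import numpy as np


def frechet_distance(real_features, generated_features):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Both arguments are hypothetical arrays of shape (n_samples, n_features),
    one row per sequence and one column per computed feature.
    """
    mu_r = real_features.mean(axis=0)
    mu_g = generated_features.mean(axis=0)
    cov_r = np.cov(real_features, rowvar=False)
    cov_g = np.cov(generated_features, rowvar=False)

    # Tr(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # covariance matrices (small numerical noise is clipped away).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Toy usage with synthetic features (not real ncRNA data).
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
fake = real + np.array([3.0, 0.0, 0.0, 0.0])  # same spread, shifted mean
score = frechet_distance(real, fake)
```

A lower value means the two feature distributions are closer, so a score of this kind could serve as the fitness value when sampling hyperparameter configurations at random and keeping the best one.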