Abstract: | Artificial intelligence (AI) and machine learning are currently advancing at such a fast
pace that some even compare it to the growth rate of computational power during the last
decades. From prophecies about world domination to predictions about AI catalyzing an
industrial revolution, the widespread consensus is that these technologies are
here to stay and will affect our lives in unexpected ways. They have been applied to many
fields, including biology, which is the focus of this thesis.
In this work we focus on Generative Adversarial Nets (GANs), an algorithm
that has advanced the state of the art in artificial data generation and is currently being
studied intensively. Drawing inspiration from previous work, we leverage GANs to generate
artificial biological data, specifically non-coding RNA (ncRNA) sequences
of multiple types, while controlling the type and properties of the generated samples. We
then demonstrate various ways to evaluate the quality of the generated data by combining
domain-specific features of the data with general GAN evaluation metrics.
First, we compute features of the samples that capture thermodynamic, structural
and sequence-based properties of ncRNA sequences, and use the Fréchet Inception Distance (FID)
as a statistical distance between real and artificial samples to measure generated
data quality. This measure is then used as a fitness function to optimize the model
hyperparameters with random search. Finally, we explore various domain-specific and
general ways to further evaluate the quality of generated data and demonstrate the model’s
ability to capture the underlying structure of the ncRNA sequences. These include visually
comparing the plotted secondary structures of real and artificial ncRNA sequences,
aligning generated sequences against databases of known ncRNAs from humans and other
mammalian species, and using a trained classifier to gauge artificial data quality by
analyzing its predictions on generated data. Overall, the evaluations
indicate that the model shows great promise in generating realistic ncRNA sequences with
the desired properties for most ncRNA types in the dataset.
|
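To make the FID-style measure described in the abstract concrete, the following is a minimal sketch of the Fréchet distance between two Gaussians fitted to feature vectors of real and generated samples. It is not the thesis implementation: the function name `frechet_distance`, the `(n_samples, n_features)` array layout, and the synthetic data are assumptions made for illustration, and the extraction of thermodynamic, structural and sequence-based features is not shown.

```python
import numpy as np


def frechet_distance(real_feats, fake_feats):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    Both inputs are arrays of shape (n_samples, n_features). This layout is an
    assumption of the sketch, not something specified by the source.
    """
    real_feats = np.asarray(real_feats, dtype=float)
    fake_feats = np.asarray(fake_feats, dtype=float)

    # Fit a Gaussian (mean and covariance) to each feature set.
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real_feats, rowvar=False))
    cov_f = np.atleast_2d(np.cov(fake_feats, rowvar=False))

    # Tr((cov_r cov_f)^(1/2)) is the sum of the square roots of the eigenvalues
    # of cov_r @ cov_f, which are real and non-negative for positive
    # semi-definite covariances (up to numerical noise, hence the clipping).
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_f) - 2.0 * trace_sqrt)


# Synthetic stand-ins for feature matrices (illustrative only).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
fake = real + np.array([3.0, 4.0, 0.0])  # same spread, shifted mean

print(frechet_distance(real, real))  # close to 0
print(frechet_distance(real, fake))  # close to 25, the squared mean shift
```

A lower value indicates that the generated feature distribution is closer to the real one, which is why a score of this kind can serve as the quantity to minimize during a hyperparameter search.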