Speech emotion recognition using deep learning

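This thesis aims to build a robust speech emotion recognition system using labeled data and supervised deep-learning techniques for image classification. The core objective of this study is to construct a model that can accurately distinguish between emotional classes for English speakers. The investigation extends to the influence of three specific spectral features. These features are treated as images, each rendered in four different output sizes resulting from a quadratic transformation achieved through bilinear image interpolation. We also evaluate three custom abstract Convolutional Neural Network (CNN) architectures, each composed of three convolutional layers and three fully connected layers, among other components. We use parameter tuning to identify the optimal internal parameters and concretize the CNN structures, and also to adjust the batch size and learning rate to enhance performance. Furthermore, to improve generalization, a custom early-stopping algorithm is integrated with 5-fold cross-validation. Specific pre-processing steps are employed, along with some audio-based techniques for the cross-validation data. An additional objective is to study the impact of the optimized pre-trained English speech emotion recognition model when applied to speech samples from Greek speakers. A limited dataset of Greek speech is used to train, validate, and test the model, while we assess the knowledge embedded in the model's pre-trained layers.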
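The abstract of this thesis describes spectral features that are treated as images and resized through bilinear image interpolation. The following is a minimal sketch of that general technique in plain Python, not the author's implementation. The function name `bilinear_resize`, the corner-aligned sampling convention, and the toy input are assumptions made for illustration; the record does not state the actual feature types or the four output sizes used in the thesis.

```python
def bilinear_resize(feature, out_size):
    """Resize a 2-D feature map (list of equal-length rows) to out_size x out_size.

    Assumption: corner-aligned sampling, i.e. the first and last output
    pixels coincide with the first and last input pixels.
    """
    in_h, in_w = len(feature), len(feature[0])
    out = []
    for i in range(out_size):
        # Map output row i to a fractional source row.
        y = i * (in_h - 1) / (out_size - 1) if out_size > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_size):
            # Map output column j to a fractional source column.
            x = j * (in_w - 1) / (out_size - 1) if out_size > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
            bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Hypothetical 2 x 4 "spectrogram" (2 frequency bins, 4 time frames).
toy_feature = [[0.0, 1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0, 7.0]]
square = bilinear_resize(toy_feature, 3)
print(square[0])  # [0.0, 1.5, 3.0]
```

Resizing every feature map to the same square shape gives a CNN a fixed input size regardless of utterance length. In practice an image library would normally do this step; the explicit loop is shown only to make the interpolation visible.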


Bibliographic Details
Main Author:	Σκουλίδης, Γεώργιος
Other Authors:	Skoulidis, Georgios
Language:	English
Published:	2023
Subjects:	Deep learning; Speech emotion recognition; Image classification; Ekman model; Transfer learning; Convolutional neural networks; Machine learning; Acoustic spectral features; Parameter tuning; Frozen layers; Speech to image; Back-propagation; Activation function; Emotion; Deep neural networks
Online Access:	https://hdl.handle.net/10889/25645

id	nemertes-10889-25645
record_format	dspace
spelling	nemertes-10889-25645 2023-09-16T04:02:07Z Speech emotion recognition using deep learning Αναγνώριση συναισθημάτων ομιλίας με χρήση βαθιάς μάθησης Σκουλίδης, Γεώργιος Skoulidis, Georgios 2023-09-15T12:34:54Z 2023-09-15T12:34:54Z 2023-09-14 https://hdl.handle.net/10889/25645 en Attribution-NonCommercial 3.0 United States http://creativecommons.org/licenses/by-nc/3.0/us/ application/pdf
institution	UPatras
collection	Nemertes
language	English
description	This thesis aims to build a robust speech emotion recognition system using labeled data and supervised deep-learning techniques for image classification. The core objective of this study is to construct a model that can accurately distinguish between emotional classes for English speakers. The investigation extends to the influence of three specific spectral features. These features are treated as images, each rendered in four different output sizes resulting from a quadratic transformation achieved through bilinear image interpolation. We also evaluate three custom abstract Convolutional Neural Network (CNN) architectures, each composed of three convolutional layers and three fully connected layers, among other components. We use parameter tuning to identify the optimal internal parameters and concretize the CNN structures, and also to adjust the batch size and learning rate to enhance performance. Furthermore, to improve generalization, a custom early-stopping algorithm is integrated with 5-fold cross-validation. Specific pre-processing steps are employed, along with some audio-based techniques for the cross-validation data. An additional objective is to study the impact of the optimized pre-trained English speech emotion recognition model when applied to speech samples from Greek speakers. A limited dataset of Greek speech is used to train, validate, and test the model, while we assess the knowledge embedded in the model's pre-trained layers.
author2	Skoulidis, Georgios
author	Σκουλίδης, Γεώργιος
title	Speech emotion recognition using deep learning
publishDate	2023
url	https://hdl.handle.net/10889/25645
