Abstract: In recent years the field of Sentiment Analysis, and by extension Emotion Recognition, has seen increased interest due to the rise of social media. Making machines capable of automatically recognizing emotions will be a vital task, as well as a milestone, in Human-Computer Interaction in the coming years. Most early works focused on data of a single modality, such as a product review or a facial expression. More recent efforts have turned to multimodal fusion, since human emotion is expressed through multiple modalities, namely text, facial expressions, and voice. Recognizing the emotional state of a person can be challenging even for humans, and it is even more complex for automated methods; as a result, building effective Emotion Recognition systems is a demanding task.
In this thesis we study and present the field of Emotion Recognition in depth. Initially, background topics, related works, methods, and approaches are presented for each of the modalities, namely Textual Emotion Recognition and Facial Emotion Recognition. The use of deep learning techniques has dramatically improved the performance of classification methods in the field and is the main direction currently pursued by researchers, introducing a variety of challenges. In terms of the methodology proposed in this work, a wide variety of architectures and approaches are implemented, leading to separate models for the textual and the visual components of the system.
Then, the field of Multimodal Emotion Recognition is presented, including its theory and literature. The main goal is to realize an end-to-end deep learning pipeline that addresses the problem of understanding human emotions and improves accuracy over traditional standalone models. An important aspect of the field that is explored is the fusion of modalities, which is typically performed at the feature level and/or the decision level; a minimal sketch of the two strategies follows. The task at hand is supervised classification. Two additional topics showcased in this work are attention mechanisms and a systematic review of the datasets available in the Emotion Recognition domain.
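To make the distinction concrete, the following minimal Python sketch contrasts the two fusion strategies for a single sample. It is illustrative only: the random linear classifiers, feature sizes, and six-class setup are placeholder assumptions standing in for trained encoders, not the models developed in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical encoded features for one sample (sizes are illustrative).
text_feat = rng.standard_normal(128)    # e.g., output of a text encoder
image_feat = rng.standard_normal(256)   # e.g., output of a facial-image encoder
n_classes = 6                           # e.g., six basic emotion categories

# Feature-level (early) fusion: concatenate the per-modality features,
# then apply a single classifier over the joint representation.
W_joint = rng.standard_normal((n_classes, 128 + 256))
p_early = softmax(W_joint @ np.concatenate([text_feat, image_feat]))

# Decision-level (late) fusion: classify each modality separately,
# then combine the per-modality class probabilities (here, an average).
W_text = rng.standard_normal((n_classes, 128))
W_image = rng.standard_normal((n_classes, 256))
p_late = (softmax(W_text @ text_feat) + softmax(W_image @ image_feat)) / 2

print(p_early.argmax(), p_late.argmax())  # predicted class per strategy
```

Feature-level fusion lets a single classifier learn cross-modal interactions, while decision-level fusion keeps the modalities independent until the final prediction, which is why hybrid schemes combining both are common.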
To explore the performance of the proposed models in recognizing people's emotions, we implement and evaluate them on a variety of real-world datasets. This allows us to draw conclusions regarding their overall emotion recognition accuracy, both in comparison to one another and in comparison to state-of-the-art approaches.
Furthermore, the proposed approach is adapted to a practical setting through the implementation of a novel real-world system for Multimodal Emotion Recognition, in which the user can provide multiple types of input and receive emotion predictions. Overall, we effectively illustrate the different facets of analysis performed in the task of Multimodal Emotion Recognition.
The experimental results show that the proposed models, consisting of Recurrent and Convolutional Neural Networks, achieve very high performance, demonstrating that they are potent and suitable tools for practical real-world emotion recognition.