Prosody modelling using machine learning techniques for neutral and emotional speech synthesis

Bibliographic Details
Main Author:	Λαζαρίδης, Αλέξανδρος
Other Authors:	Φακωτάκης, Νικόλαος
Format:	Thesis
Language:	English
Published:	2011
Subjects:	Phone duration modelling; Prosody modelling; Speech synthesis; Machine learning; Neutral speech; Emotional speech; Support vector machines; 006.31
Online Access:	http://nemertes.lis.upatras.gr/jspui/handle/10889/4553

Description
Summary:	In this doctoral dissertation, three proposed approaches were evaluated on two databases in different languages, one American English and one Greek, and compared with state-of-the-art models on the phone duration modelling task. The support vector regression (SVR) model outperformed all the other individual models evaluated. This is attributed mainly to SVR coping better with high-dimensional feature spaces than the other models used in phone duration modelling, which makes it suitable even when the amount of training data is small relative to the number of features. The proposed fusion scheme exploits the observation that different prediction algorithms perform better under different conditions. When implemented with SVR (SVR-fusion), it improved phone duration prediction accuracy over the best individual model (SVR) and also reduced the number of outliers. The proposed two-stage scheme uses individual phone duration models as feature constructors in the first stage and feature vector extension (FVE) in the second stage. Implemented with SVR (SVR-FVE), it improved prediction accuracy over both the best individual predictor (SVR) and the SVR-fusion scheme, and reduced the outliers relative to the other two proposed schemes (SVR and SVR-fusion). The two-stage scheme thus confirms the advantage of SVR over the other algorithms in coping with high-dimensional feature sets. More accurate phone duration modelling contributes to better control of prosody, and thus to better quality of synthetic speech.
Furthermore, the first proposed method (SVR) was also evaluated on phone duration modelling in emotional speech, where it outperformed all the state-of-the-art models in every emotional category. Finally, perceptual tests were performed to evaluate the impact of the proposed phone duration models on synthetic speech. For both databases, the perceptual tests confirmed the results of the objective tests, showing that the proposed models improve the naturalness of synthesized speech.
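
The two-stage idea in the summary (individual duration models used as feature constructors, followed by feature vector extension) can be pictured with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy data, the two features (phone label and a stress flag), and the simple group-mean predictors that stand in for the individual phone duration models. The record does not give the dissertation's actual features or models, and the second-stage regressor (SVR in the dissertation) is left out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (phone label, stressed flag, duration in ms).
TRAIN = [
    ("a", 1, 110.0), ("a", 0, 90.0),
    ("t", 1, 60.0), ("t", 0, 40.0),
]

def fit_group_mean(rows, key):
    """Stand-in duration model: predict the mean training duration of a group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    overall = mean(row[2] for row in rows)
    means = {k: mean(v) for k, v in groups.items()}
    # Fall back to the overall mean for groups not seen in training.
    return lambda row: means.get(key(row), overall)

# Stage 1: individual duration models act as feature constructors.
by_phone = fit_group_mean(TRAIN, key=lambda r: r[0])
by_stress = fit_group_mean(TRAIN, key=lambda r: r[1])
first_stage = [by_phone, by_stress]

def extend(row):
    """Feature vector extension: original features plus first-stage predictions."""
    return [row[0], row[1]] + [model(row) for model in first_stage]

# Extended vectors that a second-stage regressor would be trained on.
extended = [extend(row) for row in TRAIN]
```

In this sketch the extended vectors would then be passed to a second-stage regressor. In practice, the first-stage predictions for training rows would usually be produced on held-out data, so that the second stage does not learn from predictions made on samples the first stage has already seen.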

Similar Items