Automatic speech processing

EE-554

XTTS TTS Exercise

This page is part of the content downloaded from XTTS TTS Exercise on Wednesday, 25 December 2024, 16:58. Note that some content and any files larger than 50 MB are not downloaded.

Description

The goal of this exercise is to get familiar with modern text-to-speech (TTS) technology and to explore the strengths and limitations of a popular open-source TTS model. You will use the Coqui.ai TTS library, which provides pre-trained neural models that can be downloaded and used with just a few lines of code. We will focus on Coqui's XTTS model, a multilingual TTS model trained on 16 languages. It also supports zero-shot voice cloning: the model can copy a person’s voice after listening to a short sample, even if that voice never appeared in its training data.
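To give a sense of what "a few lines of code" looks like for zero-shot voice cloning, here is a minimal sketch and not the notebook's actual code. It assumes the `TTS` Python package is installed and that the model identifier below matches the XTTS version used in the exercise. The helper name `clone_voice` and the default output path are hypothetical.

```python
from pathlib import Path

# Model identifier as listed in the Coqui TTS model zoo (an assumption here;
# the exercise notebook may pin a different XTTS version).
XTTS_MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def clone_voice(text, speaker_wav, language="en", out_path="output.wav"):
    """Synthesize `text` in the voice heard in the recording `speaker_wav`.

    Hypothetical helper for illustration; returns the path of the written file.
    """
    reference = Path(speaker_wav)
    if not reference.is_file():
        raise FileNotFoundError(f"Reference recording not found: {reference}")

    # Imported lazily: the library is heavy, and the first TTS(...) call
    # downloads the pre-trained model weights.
    from TTS.api import TTS

    tts = TTS(model_name=XTTS_MODEL_NAME)
    tts.tts_to_file(
        text=text,
        speaker_wav=str(reference),  # short sample of the voice to clone
        language=language,           # language code of the text to speak
        file_path=out_path,
    )
    return out_path
```

A call such as `clone_voice("Hello, world.", "my_voice.wav", language="en")` would then write the synthesized audio to `output.wav`, where `my_voice.wav` stands for any short recording of the target speaker.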

The exercise consists of a notebook which you can run on Google Colab.

You will test the model's various capabilities, analyze its strengths and weaknesses, and reason about how to evaluate the model in each case. By the end of this exercise, you will have a better understanding of the current state of TTS technology and be able to identify areas where models such as XTTS could be improved.
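As one illustration of what reasoning about evaluation can involve: the exercise description does not prescribe a metric, but a common objective proxy for intelligibility is to transcribe the synthesized audio with a speech recognizer and compute the word error rate (WER) against the input text. The sketch below assumes the transcript has already been obtained from some ASR system. The function name `word_error_rate` is hypothetical, and the text normalization (lowercasing and whitespace splitting) is deliberately simplistic.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two strings, divided by the
    number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A low WER only indicates that the speech is intelligible to the recognizer. It says nothing about naturalness or about how closely a cloned voice resembles the target speaker, which are typically assessed with listening tests or speaker-similarity measures.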