--- title: "Coqui TTS" resource_files: - audio/sample1.wav - audio/sample2.wav - audio/sample3.wav output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Coqui TTS} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction Coqui TTS is a text-to-speech (TTS) library that enables the conversion of regular text into speech and is completely free to use. This is not true of the other text to speech engines used by `text2speech`. Coqui TTS provides pre-trained tts and vocoder models as part of its package. To get a sense of the best tts and vocoder models, take a look at this [GitHub Discussion post](https://github.com/coqui-ai/TTS/discussions/1891). In the [Coqui TTS Hugging Face Space](https://huggingface.co/spaces/coqui/CoquiTTS), you have the opportunity to experiment with a few of these models by inputting text and receiving corresponding audio output. The underlying technology of text-to-speech is highly intricate and will not be the focus of this vignette. However, if you're interested in delving deeper into the subject, here are some recommended talks: - [Pushing the frontier of neural text to speech](https://youtu.be/MA8PCvmr8B0), a webinar by Xu Tan at Microsoft Research Asia - [Towards End-to-end Speech Synthesis](https://youtu.be/tHAdlv7ThjA), a talk given by Yu Zhang, Research Scientist at Google - [Text to Speech Deep Dive](https://www.youtube.com/watch?v=aLBedWj-5CQ), a talk given during the ML for Audio Study Group hosted by Hugging Face. Coqui TTS includes pre-trained models like [Spectogram models](https://github.com/coqui-ai/TTS#spectrogram-models) (such as Tacotron2 and FastSpeech2), [End-to-End Models](https://github.com/coqui-ai/TTS#end-to-end-models) (including VITS and YourTTS), and [Vocoder models](https://github.com/coqui-ai/TTS#vocoders) (like MelGAN and WaveGRAD). ## Installation To install Coqui TTS, you will need to enter the following command in the terminal: ``` $ pip install TTS ``` **Note**: If you are using a Mac with an M1 chip, initial step is to execute the following command in terminal: ``` $ brew install mecab ``` Afterward, you can proceed to install TTS by executing the following command: ``` $ pip install TTS ``` ## Authentication ```{r setup} library(text2speech) ``` To use Coqui TTS, text2speech needs to know the correct path to the Coqui TTS executable. This path can be obtained through two methods: manual and automatic. ### Manual You have the option to manually specify the path to the Coqui TTS executable in R. This can be done by setting a global option using the `set_coqui_path()` function: ```{r, eval = FALSE} set_coqui_path("your/path/to/tts") ``` To determine the location of the Coqui TTS executable, you can enter the command `which tts` in the terminal. Internally, the `set_coqui_path()` function runs `options("path_to_coqui" = path)` to set the provided path as the value for the `path_to_coqui` global option, as long as the Coqui TTS executable exists at that location. ### Automatic The functions `tts_auth(service = "coqui")`, `tts_voices(service = "coqui")`, and `tts(service = "coqui")` incorporate a way to search through a predetermined list of known locations for the Coqui TTS executable. If none of these paths yield a valid TTS executable, an error message will be generated, directing you to use `set_coqui_path()` to manually set the correct path. 
## List Voices

The function `tts_voices(service = "coqui")` is a wrapper for the system command `tts --list_models`, which lists the released Coqui TTS models.

```{r, eval = FALSE}
tts_voices(service = "coqui")
```

The result is a tibble with the following columns: `language`, `dataset`, `model_name`, and `service`.

- `language`: the language code associated with the speaker.
- `dataset`: the dataset on which the text-to-speech model, denoted by `model_name`, was trained.
- `model_name`: the name of the text-to-speech model.
- `service`: the specific TTS service used (Amazon, Google, Microsoft, or Coqui TTS).

You can find a list of papers associated with some of the models implemented in Coqui TTS [here](https://github.com/coqui-ai/TTS#implemented-models). By providing values from this tibble (`language`, `dataset`, and `model_name`) to `tts()`, you can select the specific voice you want for text-to-speech synthesis.

## Text-to-Speech

To convert text to speech, use the function `tts(text = "Hello world!", service = "coqui")`.

```{r, eval = FALSE}
tts(text = "Hello world!", service = "coqui")
```

The result is a tibble with the following columns: `index`, `original_text`, `text`, `wav`, `file`, `audio_type`, `duration`, and `service`. Some of the noteworthy ones are:

- `text`: if `original_text` exceeds the character limit, `text` contains the pieces that result from splitting `original_text`; otherwise, `text` is the same as `original_text`.
- `file`: the location where the audio output is saved.
- `audio_type`: the format of the audio file, either mp3 or wav.

By default, `tts(service = "coqui")` uses the `tacotron2-DDC_ph` model and the `ljspeech/univnet` vocoder. You can specify a different model with the argument `model_name`, or a different vocoder with the argument `vocoder_name`.

```{r, eval = FALSE}
tts(text = "Hello world, using a different voice!",
    service = "coqui",
    model_name = "fast_pitch",
    vocoder_name = "ljspeech/hifigan_v2")
```

Another default is that `tts(service = "coqui")` saves the audio output in a temporary folder, whose path is shown in the `file` column of the resulting tibble. However, a temporary directory lasts only as long as the current R session, which means that when you restart your R session, that path will no longer exist! A more sustainable workflow is to save the audio output in a local folder.

To save the audio output in a local folder, set the arguments `save_local = TRUE` and `save_local_dest = "/full/path/to/local/folder"`. Make sure to provide the full path to the local folder.

```{r, eval = FALSE}
tts(text = "Hello world! I am saving the audio output in a local folder",
    service = "coqui",
    save_local = TRUE,
    save_local_dest = "/full/path/to/local/folder")
```
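Putting these pieces together, the sketch below shows one possible workflow. It is illustrative only: it assumes that English models are tagged with the language code `"en"` in the `language` column and that a `model_name` value taken from the voices tibble can be passed directly to the `model_name` argument of `tts()`.

```{r, eval = FALSE}
# Illustrative workflow, assuming English models carry the language code "en"
voices <- tts_voices(service = "coqui")
english_voices <- subset(voices, language == "en")
english_voices

# Synthesize with one of the listed models (here, the first English entry)
res <- tts(
  text = "Hello from a Coqui TTS model chosen from the voice list!",
  service = "coqui",
  model_name = english_voices$model_name[1]
)

# The `file` column holds the path(s) to the generated wav file(s)
res$file
```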
## Session Info

```{r, echo = FALSE}
sessionInfo()
```