Using Whisper (speech-to-text) and Tortoise (text-to-speech)

singing squirrel, AI-generated using DALL·E

Introduction

This blog post shows you how to perform speech recognition using Whisper (i.e., producing a written transcript of an audio file) and speech generation using Tortoise (i.e., creating an audio file based on someone’s voice for arbitrary text). First, I’ll demonstrate how to download audio from a YouTube video, and then we’ll use it for these speech tasks. Everything will be written in Python.

Some background information:

  • Whisper is an automatic speech recognition system released by OpenAI in September 2022, trained on 680,000 hours of data (about one-third of its audio is non-English). Whisper is open source: you can find it on GitHub, read the accompanying paper, or check out the OpenAI blog post.
  • Tortoise is a text-to-speech program by James Betker, and was trained on a dataset consisting of audiobooks. Its development prioritized realistic intonation and rhythm in speech as well as multi-voice capabilities. Check out its repo on GitHub.

Getting Audio From YouTube

In order to perform speech tasks, the first step is to download audio from a YouTube video so that we have something to work with.

To do so, I used pytube (docs), which is a dependency-free library for downloading YouTube videos. I installed it using conda: conda install pytube. Once installed, we can import the module in Python and start using it. For this step, I used a Jupyter notebook.

Import pytube and define a YouTube object:

from pytube import YouTube

yt = YouTube('https://www.youtube.com/watch?v=...')

Replace the URL above with the URL of any YouTube video that contains the voice you want to clone. This can be a video of your own voice, for example. A length of 5 to 15 minutes is ideal: enough audio for the speech generation task, but not so much that it slows down the speech recognition task.

Verify that you have the correct video by checking its title:

yt.title
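
You can also confirm the video falls in the suggested 5-to-15-minute range; pytube reports the duration in seconds via the length attribute:

yt.length / 60  # duration in minutes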

Query the audio-only stream:

yt.streams.get_audio_only()
<Stream: itag="140" mime_type="audio/mp4" abr="128kbps" acodec="mp4a.40.2" progressive="False" type="audio">

Note that you can list other streams with audio-only tracks using yt.streams.filter(only_audio=True), and view all available streams with yt.streams.

Looking at the output of the above command, the audio stream has an itag of 140. We’ll use that to identify the correct stream to download.

Download the stream:

stream = yt.streams.get_by_itag(140)
stream.download()
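
By default, download() names the file after the video’s title. Since we’ll reference this file by name in the next section, it’s convenient to pass an explicit filename (a keyword argument that download() accepts):

stream.download(filename='audio_file.mp4')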

And that’s it! If you followed the above steps, you should have a downloaded audio file of your chosen YouTube video. We will use this audio file for the speech tasks in the following sections.

Using Whisper (speech-to-text)

OpenAI has made it very simple to use Whisper; it only takes a few lines of code to get a transcript of an audio file.

The first step is to install Whisper. I installed it on my local machine using pip: pip install git+https://github.com/openai/whisper.git

The next step is to select a model. Whisper’s GitHub provides a table (reproduced below) of the different models, their sizes, and their speed-accuracy tradeoffs. I chose the base model, and given that my task would be English-only, I chose base.en, which typically performs better than the multilingual version.

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x
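
To double-check the model names the package accepts, you can list them; the output should match the table above:

import whisper

whisper.available_models()
['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large']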

Import Whisper and load the model:

import whisper

model = whisper.load_model('base.en')
100%|███████████████████████████████████████| 139M/139M [00:55<00:00, 2.64MiB/s]

Transcribe the audio file:

result = model.transcribe('audio_file.mp4')

result['text']

It took about 1 minute on my CPU to perform inference on a 13-minute audio file. result['text'] contains the transcription.
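
The result dictionary holds more than the raw text: result['segments'] carries per-segment start and end times in seconds. A short sketch that saves the transcript and prints timed segments (the filename transcript.txt is arbitrary):

# Save the full transcript to disk
with open('transcript.txt', 'w') as f:
    f.write(result['text'])

# Print each segment with its start and end times (in seconds)
for segment in result['segments']:
    print(f"[{segment['start']:.1f}s-{segment['end']:.1f}s] {segment['text']}")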

Now that we’ve seen how to use Whisper for speech-to-text, let’s move on to speech generation in the next section.

Using Tortoise (text-to-speech)

Before using Tortoise, we need some short clips of the voice we want to clone, taken from our downloaded audio file. Each clip should be about 6 to 10 seconds long, and I recommend 5 to 10 clips total (I used 8). Pick higher-quality clips without background noise, if possible. If you have existing software on your computer that you prefer, feel free to use it to create these clips. Since I’m on a Mac, I used Apple’s Voice Memos app to trim my audio file into short clips (which are saved in ~/Library/Application\ Support/com.apple.voicememos).

Once you have created these audio clips, convert them to .wav format with a 22,050 Hz sample rate. I used an online M4A to WAV converter that allowed me to specify the sample rate.
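
If you’d rather skip the online converter, pydub (a Python wrapper around ffmpeg) can resample and export in one step. A minimal sketch, assuming your trimmed clips are .m4a files named clip1.m4a through clip8.m4a:

from pydub import AudioSegment  # pip install pydub; requires ffmpeg

for i in range(1, 9):
    clip = AudioSegment.from_file(f'clip{i}.m4a')
    clip = clip.set_frame_rate(22050).set_channels(1)  # 22,050 Hz, mono
    clip.export(f'{i}.wav', format='wav')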

Now we’re ready to use Tortoise! We’ll be running it in inference mode; we won’t be training or fine-tuning. Note that Tortoise is a slow model (hence the name), and since my local computer doesn’t have an NVIDIA GPU, I ran this section’s code in a notebook environment on Google Colab. The added benefit is that I don’t need to change anything on my local computer, such as installing a bunch of dependencies or dealing with installation errors that pop up.

Open a new notebook in Colab, turn on a GPU runtime, and check your GPU:

!nvidia-smi
Mon Oct 24 22:39:39 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P8    13W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Looks good.

Install the latest versions of SciPy and Tortoise, plus its dependencies:

!pip3 install -U scipy

!git clone https://github.com/neonbjb/tortoise-tts.git
%cd tortoise-tts
!pip3 install -r requirements.txt
!python3 setup.py install

These commands should take a bit to run, and will produce a lot of output.

Import the modules we’ll need:

import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F

import IPython

from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices

Download the models used by Tortoise from Hugging Face:

tts = TextToSpeech()

Now we’re ready to generate speech. Think about what you want your cloned voice to say — I chose a poem from The Lord of the Rings. Note that the longer the text, the longer it will take to generate; I suggest starting with something short.

Specify text and preset mode:

text = 'Three Rings for the Elven-kings under the sky, \
Seven for the Dwarf-lords in their halls of stone, \
Nine for Mortal Men doomed to die, \
One for the Dark Lord on his dark throne \
In the Land of Mordor where the Shadows lie. \
One Ring to rule them all, One Ring to find them, \
One Ring to bring them all, and in the darkness bind them, \
In the Land of Mordor where the Shadows lie.'

preset = "fast"

The preset mode determines the quality of the generated audio. The options include ultra_fast, fast, standard, and high_quality.

On Colab, open the Files pane in the left sidebar and locate the tortoise/voices folder. Inside that folder, create a subfolder named after your chosen voice, such as michael, and upload all of your .wav clips into the newly created folder.
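
If you’d rather do this from code than through the Files pane, Colab’s files helper can handle the upload; a minimal sketch, assuming the same michael voice folder:

import os
from google.colab import files

# Create the voice folder (the name 'michael' is just my example)
os.makedirs('tortoise/voices/michael', exist_ok=True)

# Opens a file picker; chosen files are written to the current directory
uploaded = files.upload()

# Move the uploaded clips into the voice folder
for name in uploaded:
    os.rename(name, f'tortoise/voices/michael/{name}')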

List all of the available voices, and display one of your audio clips:

%ls tortoise/voices
angie/      freeman/  michael/  rainbow/       train_daws/     train_kennard/
applejack/  geralt/   mol/      snakes/        train_dotrice/  train_lescault/
daniel/     halle/    myself/   tim_reynolds/  train_dreams/   train_mouse/
deniro/     jlaw/     pat/      tom/           train_empire/   weaver/
emma/       lj/       pat2/     train_atkins/  train_grace/    william/

IPython.display.Audio('tortoise/voices/michael/7.wav')

You can see that Tortoise comes with a number of other voices you can use, if you decide not to use your custom voice.

Edit the path above to display the audio for one of your clips. If you have trouble playing it, it’s possible that your audio clip isn’t in the correct format. If that’s the case, try a different .wav converter and see if that works.

Specify the voice and generate the audio sample:

voice = 'michael'

voice_samples, conditioning_latents = load_voice(voice)
gen = tts.tts_with_preset(text, voice_samples=voice_samples, conditioning_latents=conditioning_latents, preset=preset)
torchaudio.save('generated.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('generated.wav')

This took about 5 minutes on the Colab GPU. Once it’s done, you should see a file called generated.wav in your working directory.
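
To save the result to your local machine, you can download it from Colab with the files helper:

from google.colab import files

files.download('generated.wav')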

This is my generated audio:

This is one of the 8 clips used to generate the cloned voice:

It sounds like a pretty good clone of the original voice, especially considering that I only ran the model in inference mode and did not fine-tune Tortoise on my chosen voice.