Finding words, syllables, and sounds
When you have speech recordings for your research, you may want to know when a target word occurs, or how long the vowels are. These measurements are often done in TextGrids in Praat. When you have a small set of recordings, you can create word-level, syllable-level, or phoneme-level intervals by hand (also see our how-to-annotate-in-praat). However, this is hard work and very time-consuming (and outright boring). So let’s speed things up a bit!
Forced alignment
Forced alignment means automatically matching orthographic transcriptions (text) to speech recordings (audio). The resulting output is sometimes called ‘timestamps’, ‘speech marks’, ‘annotations’, or ‘intervals’, for instance in TextGrids in Praat. There are various forced aligners available nowadays, the most famous being WhisperX, Montreal Forced Aligner, Penn Forced Aligner (P2FA), EasyAlign, and WebMAUS.
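To make the idea of ‘intervals’ a bit more concrete, here is a minimal sketch that reads intervals from a TextGrid in Praat's long text format and prints how long each one lasts. The sample TextGrid, the tier name “words”, and the function name read_intervals are all made up for illustration; a real TextGrid from an aligner will have more tiers, and a dedicated TextGrid library will be more robust than this regular expression.

```python
import re

# A tiny, made-up TextGrid fragment in Praat's long text format (hypothetical data).
SAMPLE_TEXTGRID = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.9
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 0.9
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 0.25
            text = ""
        intervals [2]:
            xmin = 0.25
            xmax = 0.5
            text = "hello"
        intervals [3]:
            xmin = 0.5
            xmax = 0.9
            text = "world"
'''

# One interval = start time, end time, and a label (doubled quotes are escapes).
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (?P<start>[-\d.eE+]+)\s*'
    r'xmax = (?P<end>[-\d.eE+]+)\s*'
    r'text = "(?P<label>(?:[^"]|"")*)"'
)


def read_intervals(textgrid_text):
    """Return (start, end, label) for every non-empty interval, in file order."""
    intervals = []
    for match in INTERVAL_RE.finditer(textgrid_text):
        label = match.group("label").replace('""', '"')
        if label.strip():  # skip silent/empty intervals
            start = float(match.group("start"))
            end = float(match.group("end"))
            intervals.append((start, end, label))
    return intervals


for start, end, label in read_intervals(SAMPLE_TEXTGRID):
    print(f"{label}: starts at {start:.3f} s, lasts {end - start:.3f} s")
```

Once the intervals are in this form, measuring when a word starts or how long a vowel is comes down to simple subtraction.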
WebMAUS
I selected WebMAUS for this how-to because:
- it can handle a large number of languages
- it can do word-level AND syllable-level AND phoneme-level annotations
- it is pretty accurate
- it is accessible through a Python script as well as a fairly user-friendly web interface
- it gives TextGrids as output.
Here’s an example of some WebMAUS output, showing words (in orthography and X-SAMPA), syllables, and phonemes in different tiers in a TextGrid:
Web interface
The web interface of WebMAUS is fairly straightforward. All you need to do is upload a speech recording (e.g., “sentence1.wav”) together with a text file with the exact same name (“sentence1.txt”). This text file should contain the words of the sentence in regular orthography. Select the language, agree to the Terms of Usage, and a few seconds later you can download “sentence1.TextGrid”.
But there’s a catch! WebMAUS Basic does not give you syllable-level annotations by default, so you need a workaround. The regular process, WebMAUS Basic, goes: wav + txt » TextGrid (with only words & phonemes). WebMAUS can give you syllables too, but only through an extra process called Pho2Syl. The trick is to run WebMAUS Basic and Pho2Syl in sequence:
- WebMAUS Basic: wav + txt » BPF .par file
- Pho2Syl: BPF .par file » TextGrid (with words, syllables, phonemes)
Doing this by hand is just a hassle, even if you only have a single recording. So let’s script it!
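To give a rough idea of what such a script involves, here is a minimal sketch of the two-step sequence using the requests module. This is not forced-alignment.py itself. The service URL, the endpoint names (runMAUSBasic, runPho2Syl), the form field names and values, and the shape of the XML response are all my assumptions about the BAS web services behind WebMAUS, so please check them against the service's own help page before relying on them. Depending on the service, Pho2Syl may need extra parameters (for instance the sample rate) that are left out here.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: base URL of the web services that host WebMAUS.
BAS_URL = "https://clarin.phonetik.uni-muenchen.de/BASWebServices/services"

# ASSUMPTION: format/tier values; verify them in the service documentation.
MAUS_OUTFORMAT = "bpf"   # ask WebMAUS Basic for a BPF (.par) file
PHO2SYL_TIER = "MAU"     # tier holding the phoneme segmentation
PHO2SYL_OFORM = "tg"     # ask Pho2Syl for a TextGrid


def parse_download_link(response_xml):
    """Extract the download link from a service response, or raise on failure.

    ASSUMPTION: the response is XML with <success>, <downloadLink> and
    <output> elements directly under the root.
    """
    root = ET.fromstring(response_xml)
    if root.findtext("success") != "true":
        raise RuntimeError(f"Service reported a problem: {root.findtext('output')}")
    return root.findtext("downloadLink")


def align_with_syllables(wav_path, txt_path, language="eng-US"):
    """Run WebMAUS Basic, then Pho2Syl, and return the TextGrid as a string."""
    import requests  # third-party; imported here so the parser above works without it

    # Step 1 (WebMAUS Basic): wav + txt -> BPF .par file
    with open(wav_path, "rb") as wav, open(txt_path, "rb") as txt:
        reply = requests.post(
            f"{BAS_URL}/runMAUSBasic",
            files={"SIGNAL": wav, "TEXT": txt},
            data={"LANGUAGE": language, "OUTFORMAT": MAUS_OUTFORMAT},
            timeout=300,
        )
    reply.raise_for_status()
    bpf = requests.get(parse_download_link(reply.text), timeout=300)
    bpf.raise_for_status()

    # Step 2 (Pho2Syl): BPF .par file -> TextGrid with words, syllables, phonemes
    reply = requests.post(
        f"{BAS_URL}/runPho2Syl",
        files={"i": ("input.par", bpf.content)},
        data={"lng": language, "tier": PHO2SYL_TIER, "oform": PHO2SYL_OFORM},
        timeout=300,
    )
    reply.raise_for_status()
    textgrid = requests.get(parse_download_link(reply.text), timeout=300)
    textgrid.raise_for_status()
    return textgrid.text


# Example use (needs an internet connection and the requests module):
# print(align_with_syllables("sentence1.wav", "sentence1.txt", "eng-US"))
```

The point of the sketch is the chaining: the file downloaded in step 1 is uploaded straight into step 2, so you never have to handle the intermediate .par file yourself.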
Let’s get set up!
- Install Python.
- Install the Python module requests:
- Type “terminal” in the Start Menu in Windows to start Windows PowerShell
- Paste this line and hit ENTER:
pip install requests
- Download forced-alignment.py and store it in a separate new folder.
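If you want to double-check the setup before moving on, a small sketch like the one below reports which Python version is running and whether the requests module can be found. The function name setup_report is just something I made up for this example.

```python
import importlib.util
import sys


def setup_report():
    """Describe whether Python and the requests module look ready to use."""
    return {
        "python_version": sys.version.split()[0],
        # find_spec returns None when the module is not installed
        "requests_installed": importlib.util.find_spec("requests") is not None,
    }


print(setup_report())
```

If requests_installed comes out as False, run the pip install line again in the same terminal you use to start Python.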
Let’s go!
- Select the language of your choice in line 29:
webmaus_language = 'eng-US'
- For a list of languages supported by WebMAUS, check out the drop-down Language menu in the web interface.
- NOTE: the script only accepts language codes, like “eng-US”. These can be found here (…it’s a long page of text, so search for “eng-US” to find the list of languages).
- Select the extension of your audio files in line 39 (wav or mp3):
audio_extension = "wav"
- Copy all your speech recordings with accompanying text files into the same folder as the script.
- NOTE: the script only runs on pairs of files with the same name, like “item3.wav” + “item3.txt”.
- The txt files should contain the words in the speech in regular orthography.
- Run the script! For instance, right-click some empty space inside the folder shown in Windows Explorer, click “Open in Terminal”, type:
python forced-alignment.py
and hit Enter.
- Done!
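Because the script only works on matching pairs of files, it can save some head-scratching to check your folder first. Here is a minimal sketch (not taken from forced-alignment.py) that lists which audio files have a txt file with the same name and which do not. The function name find_pairs is hypothetical; audio_extension mirrors the setting described above.

```python
from pathlib import Path


def find_pairs(folder, audio_extension="wav"):
    """Split the audio files in a folder into those with and without a matching .txt."""
    folder = Path(folder)
    pairs, orphans = [], []
    for audio in sorted(folder.glob(f"*.{audio_extension}")):
        text = audio.with_suffix(".txt")  # e.g. item3.wav -> item3.txt
        if text.is_file():
            pairs.append((audio, text))
        else:
            orphans.append(audio)
    return pairs, orphans


if __name__ == "__main__":
    pairs, orphans = find_pairs(".", "wav")
    print(f"{len(pairs)} pair(s) ready for alignment")
    for audio in orphans:
        print(f"No text file found for {audio.name}")
```

Run it from the folder that holds the script and your recordings; any file listed as missing its text file needs a txt with exactly the same name.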
Happy aligning!