Phonetic Extraction and Alignment of Subtitled YouTube Videos

PEASYV (Phonetic Extraction and Alignment of Subtitled YouTube Videos) is a personal project that aims to collect phonetic data from subtitled videos.

TL;DR: click here for a visualization of vocalic trapezoids generated from the collected data.

Aligning the data

PEASYV uses the subtitles' time-stamps to create interval boundaries in Praat (Boersma and Weenink 2019). The subtitle text is then added to the corresponding intervals.
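As a rough sketch of this step (the real pipeline is Praat-scripted; the function names here are mine), SRT-style subtitle cues can be parsed into (start, end, label) triples, which map directly onto TextGrid interval boundaries:

```python
import re

def srt_time_to_seconds(ts: str) -> float:
    """Convert an SRT time-stamp like '00:00:01,000' to seconds."""
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def srt_to_intervals(srt_text: str):
    """Turn SRT cues into (start, end, label) tuples, i.e. TextGrid intervals."""
    pattern = re.compile(
        r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n(.+?)(?:\n\n|\Z)",
        re.S,
    )
    intervals = []
    for start, end, text in pattern.findall(srt_text):
        label = " ".join(text.strip().splitlines())
        intervals.append((srt_time_to_seconds(start), srt_time_to_seconds(end), label))
    return intervals

cue = "1\n00:00:01,000 --> 00:00:03,500\nHappy Christmas\n\n"
print(srt_to_intervals(cue))  # [(1.0, 3.5, 'Happy Christmas')]
```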

The alignment thus created is split into audio sub-files (one per interval), which are then fed into two aligners, SPPAS (Bigi 2012) and P2FA (Yuan and Liberman 2008). This step confines potential alignment errors to a single interval, and feeds smaller chunks of recording to the aligners in order to increase accuracy. Intervals last a few seconds at most.
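The splitting step can be sketched with Python's standard wave module (the function name, and the assumption that the source audio is a WAV file, are mine):

```python
import wave

def split_interval(in_path: str, out_path: str, start: float, end: float) -> None:
    """Copy the [start, end] slice (in seconds) of a WAV file to out_path.

    One such sub-file is produced per subtitle interval, so an aligner
    error stays confined to that interval.
    """
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        src.setpos(int(start * rate))
        frames = src.readframes(int((end - start) * rate))
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames)  # header's frame count is patched on close
```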

These aligners return two separate segmental alignments, along with MOMEL and INTSINT (Hirst 2007) tiers. These are then fed into a personal database containing the transcription and syllabification of English words, based on the Longman Pronunciation Dictionary (Wells 2008). This step returns a syllable tier for each aligner.
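A toy sketch of how such a syllabification database can turn a phone-level alignment into a syllable tier (the one-entry dictionary and the function are illustrative stand-ins, not the actual database schema):

```python
# word -> list of syllables, each a list of phones (toy stand-in for the
# database derived from Wells 2008)
SYLLABLE_DICT = {"happy": [["h", "{"], ["p", "i"]]}

def syllable_intervals(word, phone_intervals):
    """Group aligned phone intervals into syllable intervals.

    phone_intervals: ordered (start, end, phone) tuples from an aligner.
    Returns one (start, end, syllable_label) tuple per syllable.
    """
    out = []
    i = 0
    for syllable in SYLLABLE_DICT[word]:
        start = phone_intervals[i][0]
        i += len(syllable)
        end = phone_intervals[i - 1][1]
        out.append((start, end, "".join(syllable)))
    return out
```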

An example of what a final Praat TextGrid looks like can be seen here:

Extracting the data

Once the acoustic signal has been aligned to its transcription, the process of collecting data starts.

Since I wrote my PhD (Méli 2018) on the phonological acquisition of English vowels by French L2 learners, the data collection focuses on vowels. However, anybody with Praat-scripting experience can retrieve the data that suits their needs from the TextGrids.

Two data-collecting runs are carried out, one per aligner (SPPAS and P2FA). Each row in the generated spreadsheets corresponds to a vowel, for which 321 datapoints are collected. These include references such as the video in which the vowel was pronounced, the speaker, the word, the syllable, whether that syllable was stressed, the preceding and succeeding phonemes, the duration, the time-stamp… The bulk of the datapoints, however, are formant readings taken at each centile of the vowel's duration, for the first three formants.
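The centile-wise formant sampling can be sketched as follows (a hypothetical helper, not the actual extraction script: `get_formant` stands in for a Praat formant query, and the assumption that the centiles run 1–100 is mine):

```python
def vowel_formant_row(start, end, get_formant):
    """Sample F1-F3 at each centile of a vowel's duration.

    get_formant(t, n) -> frequency of formant n (Hz) at time t; in the
    real pipeline this would be a Praat formant query on the audio.
    Returns a dict of 300 readings, e.g. row["F2_50"] is F2 at mid-vowel.
    """
    row = {}
    duration = end - start
    for k in range(1, 101):          # centiles 1% .. 100%
        t = start + duration * k / 100.0
        for n in (1, 2, 3):          # first three formants
            row[f"F{n}_{k}"] = get_formant(t, n)
    return row
```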

So far, I have managed to collect data for approximately 350,000 vowels.


The following files were generated from The Queen’s Christmas Message 1997¹

I have also devised a way to generate automated reports to analyze monophthongs. I compare the data found with the values published for General American (Hillenbrand et al. 1995) and Received Pronunciation (Deterding 1997).
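A minimal sketch of the kind of comparison such a report performs (the reference numbers below are placeholders, not Deterding's actual values):

```python
# Hypothetical (F1, F2) reference means in Hz; the real reports use the
# published values from Hillenbrand et al. (1995) and Deterding (1997).
RP_REF = {"i:": (280.0, 2250.0), "u:": (320.0, 1190.0)}

def formant_deviation(vowel, f1_mean, f2_mean, ref=RP_REF):
    """Euclidean distance (Hz) in the F1/F2 plane between the measured
    means for a vowel and the published reference means."""
    ref_f1, ref_f2 = ref[vowel]
    return ((f1_mean - ref_f1) ** 2 + (f2_mean - ref_f2) ** 2) ** 0.5
```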

These reports make it easy to assess the accuracy of the alignment and extraction processes. So far, I am all the happier with the results as the values found come from actual connected speech. I believe the avenues of research opened by this kind of data are quite numerous.

An example of the reports that can be produced from the data can be found below. It contains analyses of the video above, along with five other Christmas speeches by The Queen. Don’t forget to download the .pdf version from the page below!

Bigi, B. 2012. “SPPAS: A Tool for the Phonetic Segmentation of Speech.” Proceedings of LREC 2012, Istanbul.
Boersma, Paul, and David Weenink. 2019. “Praat: Doing Phonetics by Computer [Computer Program].” Version 6.1.07, retrieved 26 November 2019.
Deterding, David. 1997. “The Formants of Monophthong Vowels in Standard Southern British English Pronunciation.” Journal of the International Phonetic Association 27 (1-2): 47–55.
Harrington, Jonathan, Sallyanne Palethorpe, and Catherine I. Watson. 2000. “Monophthongal Vowel Changes in Received Pronunciation: An Acoustic Analysis of the Queen’s Christmas Broadcasts.” Journal of the International Phonetic Association 30 (1-2): 63–78.
Hillenbrand, J., L. A. Getty, M. J. Clark, and K. Wheeler. 1995. “Acoustic Characteristics of American English Vowels.” The Journal of the Acoustical Society of America 97 (5): 3099–3111.
Hirst, Daniel. 2007. “A Praat Plugin for Momel and INTSINT with Improved Algorithms for Modelling and Coding Intonation.” Proceedings of the 16th International Congress of Phonetic Sciences.
Méli, Adrien. 2018. “A Longitudinal Study of the Oral Properties of the French-English Interlanguage: A Quantitative Approach of the Acquisition of the /I/-/i:/ and /U/-/u:/ Contrasts.” PhD thesis, Université Paris Diderot.
Wells, J. C. 2008. Longman Pronunciation Dictionary. London: Pearson Longman.
Yuan, J., and M. Liberman. 2008. “Speaker Identification on the SCOTUS Corpus.” The Journal of the Acoustical Society of America 123 (5): 5687.

  1. As a reference and a tribute to Harrington, Palethorpe, and Watson (2000).