PEASYV

From videos to TextGrids

Adrien Méli and Nicolas Ballier

2024-03-28

Basic concept

PEASYV: Phonetic Extraction and Alignment of Subtitled YouTube Videos

flowchart LR
  A(Link to video) --> B(Praat TextGrid)

Key tools

Detailed flowchart

flowchart LR

subgraph S1[Step 1]
direction TB
A(fa:fa-file-lines list of links) --> |yt-dlp| B(fa:fa-video video)
A --> |yt-dlp| C(fa:fa-closed-captioning subtitles)
B -->|ffmpeg| D(fa:fa-file-audio far:fa-square Main Audio TG)
C -->|praat| D
D -->|praat| E1(fa:fa-file-audio far:fa-square)
D -->|praat| E2(fa:fa-file-audio far:fa-square)
D -->|praat| E3(fa:fa-file-audio far:fa-square)
end

subgraph S2[Step 2]
direction LR
F1(fa:fa-file-audio far:fa-square) -->|SPPAS| G1(fa:fa-table-cells-large Segm TG)
F2(fa:fa-file-audio far:fa-square) -->|SPPAS| G2(fa:fa-table-cells-large Segm TG)
F3(fa:fa-file-audio far:fa-square) -->|SPPAS| G3(fa:fa-table-cells-large Segm TG)
F1(fa:fa-file-audio far:fa-square) -->|fa:fa-align P2FA| H1(fa:fa-table-cells-large Segm TG)
F2(fa:fa-file-audio far:fa-square) -->|fa:fa-align P2FA| H2(fa:fa-table-cells-large Segm TG)
F3(fa:fa-file-audio far:fa-square) -->|fa:fa-align P2FA| H3(fa:fa-table-cells-large Segm TG)
G1--> GH(fa:fa-file-audio fa:fa-table-cells-large)
H1--> GH
G2--> GH
H2--> GH
G3--> GH
H3--> GH
end

subgraph S3[Step 3]
direction TB
I(fa:fa-file-audio fa:fa-table-cells-large) -->|praat| K(fa:fa-table-cells Segm Syll TG)
J(fa:fa-book LPD) -->|praat|K
K -->|R| L(fa:fa-file-csv spreadsheets)
L -->|R| M(fa:fa-chart-line vocalic diagnoses)

end

S1 --> S2
S2 --> S3
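
To make Step 1 concrete, here is a minimal command-line sketch in Python (illustrative, not the PEASYV code itself). It assumes yt-dlp and ffmpeg are on the PATH, that the links are stored one per line in a hypothetical links.txt, and that the aligners expect mono 16 kHz WAV input.

# step1_download.py -- sketch of Step 1: fetch each video and its subtitles with yt-dlp,
# then extract a mono 16 kHz WAV with ffmpeg for the forced aligners.
import pathlib
import subprocess

LINKS = pathlib.Path("links.txt")      # hypothetical list of YouTube URLs, one per line
OUTDIR = pathlib.Path("downloads")
OUTDIR.mkdir(exist_ok=True)

for url in LINKS.read_text().split():
    # Download the video and its manual subtitles, named after the YouTube video id.
    subprocess.run(
        ["yt-dlp", "--write-subs", "--sub-langs", "en",
         "-o", str(OUTDIR / "%(id)s.%(ext)s"), url],
        check=True,
    )

# Convert every downloaded video to a mono 16 kHz WAV
# (adjust the glob if yt-dlp delivered .webm or .mkv containers).
for video in OUTDIR.glob("*.mp4"):
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1", "-ar", "16000",
         str(video.with_suffix(".wav"))],
        check=True,
    )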

About the two aligners

Claims

  • Two aligners are better than one:

    • SPPAS
    • P2FA
  • Step 2 prevents cascading alignment errors

  • Added value:

    • low-tech
    • syllabic tiers based on the LPD (Wells 2008)
    • (MIS)MATCHES on the TextGrid (see the sketch below)
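
The MISMATCH idea can be thought of as a direct comparison between the two aligners' phoneme tiers. The sketch below is illustrative only: intervals are written as plain (start, end, label) tuples rather than read from a TextGrid, the 20 ms tolerance is arbitrary, and a real comparison would also have to cope with tiers containing different numbers of intervals.

# Illustrative sketch: flag disagreements between two phoneme tiers.
TOLERANCE = 0.02  # seconds; arbitrary threshold for this sketch

def mismatches(tier_a, tier_b, tol=TOLERANCE):
    """Pair intervals positionally and report label or boundary disagreements larger than tol."""
    flagged = []
    for (s1, e1, lab1), (s2, e2, lab2) in zip(tier_a, tier_b):
        if lab1 != lab2 or abs(s1 - s2) > tol or abs(e1 - e2) > tol:
            flagged.append((lab1, lab2, round(abs(s1 - s2), 3), round(abs(e1 - e2), 3)))
    return flagged

# Toy example: both aligners agree on the labels but place two boundaries > 20 ms apart.
sppas = [(0.00, 0.12, "dh"), (0.12, 0.25, "ax"), (0.25, 0.40, "k")]
p2fa  = [(0.00, 0.13, "dh"), (0.13, 0.31, "ax"), (0.31, 0.40, "k")]
print(mismatches(sppas, p2fa))   # flags "ax" and "k"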

Outputs

The Praat TextGrid

Tiers

  1. Transcription
  2. Momel
  3. INTSINT
  4. SPPAS Word
  5. SPPAS Phoneme
  6. SPPAS LPD Word
  7. SPPAS Syllable
  8. SPPAS LPD Syllable
  9. P2FA Word
  10. P2FA Phoneme
  11. P2FA LPD Word
  12. P2FA Syllable
  13. P2FA LPD Syllable
  14. SPPAS MISMATCH
  15. P2FA MISMATCH
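
As a quick sanity check on the output, the tiers of a finished TextGrid can be listed with any Praat TextGrid reader. The sketch below assumes the third-party textgrid package (pip install textgrid) and an illustrative file name.

import textgrid  # third-party parser for Praat TextGrids (pip install textgrid)

# Open one PEASYV output TextGrid (file name is illustrative) and list its tiers in order.
tg = textgrid.TextGrid.fromFile("example_video.TextGrid")
for i, tier in enumerate(tg, start=1):
    print(f"{i:2d}. {tier.name}")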

Screenshot of a TextGrid

Secondary outputs

Spreadsheets

  • .csv format
  • one per aligner
  • one row per vowel
  • formant readings at each centile of a vowel’s duration
  • i.e. 300 formant-reading columns + 24 additional columns
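
The exact column headers depend on the PEASYV export, so the names used below (vowel, F1_020, F2_080, ...) are hypothetical; the sketch only shows how the per-centile readings in one spreadsheet could be turned into the 20%-to-80% F1/F2 vectors plotted for the diphthongs further down.

import pandas as pd  # assumes pandas is installed

# Hypothetical column names: 'vowel' holds the IPA label, F1_020 is F1 at 20% of duration, etc.
df = pd.read_csv("sppas_vowels.csv")          # illustrative file name: one row per vowel token

diph = df[df["vowel"] == "aɪ"]
onset = diph[["F1_020", "F2_020"]].mean()     # mean F1/F2 at 20% of the diphthong's duration
offset = diph[["F1_080", "F2_080"]].mean()    # mean F1/F2 at 80% of the diphthong's duration

print(f"aɪ: F1 {onset['F1_020']:.0f} -> {offset['F1_080']:.0f} Hz, "
      f"F2 {onset['F2_020']:.0f} -> {offset['F2_080']:.0f} Hz")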

Columns and data

Diagnoses

An example: English Like A Native

SPPAS

Data aligned by SPPAS.

General information

Number of TextGrids: 453

Total length of the videos: 172:39:22 (hh:mm:ss)

Monophthongs

Data on monophthongs.

Distribution

Distribution of monophthongs.

Durations

Per-monophthong boxplots of durations.

Vocalic Trapezoids

Scatterplots

Per-monophthong mean F1/F2 values with error bars (1 SE).

Deterding

Dotted grey line: reported native values; black line: speaker.

References: Deterding (1997)

Hillenbrand

Dotted grey line: reported native values; black line: speaker.

References: Hillenbrand et al. (1995)

Density plots

F1

Per-monophthong F1 density plots.

F2

Per-monophthong F2 density plots.

Formant tracking

Next are the formant tracks for monophthongs.

F1

F2

Diphthongs

Data on diphthongs.

Distribution

Distribution of diphthongs

Durations

Boxplots of diphthong durations.

Vocalic Trapezoids

Diphthong ɪə

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong eɪ

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong aɪ

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong eə

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong əʊ

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong aʊ

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong ɔɪ

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong ʊə

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Overview

Overview of all diphthongs

Formant tracking

Next are the formant tracks for diphthongs.

F1

F2

P2FA

Data aligned by P2FA.

Monophthongs

Data on monophthongs.

Distribution

Distribution of monophthongs.

Durations

Per-monophthong boxplots of durations.

Vocalic Trapezoids

Scatterplots

Per-monophthong mean F1/F2 values with error bars (1 SE).

Deterding

Dotted grey line: reported native values; black line: speaker.

References: Deterding (1997)

Hillenbrand

Dotted grey line: reported native values; black line: speaker.

References: Hillenbrand et al. (1995)

Density plots

F1

Per-monophthong F1 density plots.

F2

Per-monophthong F2 density plots.

Formant tracking

Next are the formant tracks for monophthongs.

F1

F2

Diphthongs

Data on diphthongs.

Distribution

Distribution of diphthongs

Durations

Boxplots of diphthong durations.

Vocalic Trapezoids

Diphthong ɪə

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong eɪ

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong aɪ

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong eə

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong əʊ

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong aʊ

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong ɔɪ

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Diphthong ʊə

Vector of the mean F1/F2 values from 20% to 80% of the diphthong’s duration.

Overview

Overview of all diphthongs

Formant tracking

Next are the formant tracks for diphthongs.

F1

F2

Comparison between the two aligners

Per-diphthong

Let’s now compare the data obtained with the two aligners.

Diphthong ɪə

Diphthong eɪ

Diphthong aɪ

Diphthong eə

Diphthong əʊ

Diphthong aʊ

Diphthong ɔɪ

Diphthong ʊə

Overview

What next?

  • make PEASYV installable

    • but potential issues with Wells (2008)

References

Bigi, B. 2012. “SPPAS: A Tool for the Phonetic Segmentations of Speech.” Istanbul.
Bigi, B., and D. Hirst. 2012. “SPeech Phonetization Alignment and Syllabification (SPPAS): A Tool for the Automatic Analysis of Speech Prosody.” Shanghai.
Boersma, Paul, and David Weenink. 2019. “Praat: Doing Phonetics by Computer [Computer Program]. Version 6.1.07, retrieved 26 November 2019 from http://www.praat.org/.”
Deterding, David. 1997. “The Formants of Monophthong Vowels in Standard Southern British English Pronunciation.” Journal of the International Phonetic Association 27 (1-2): 47–55. https://doi.org/10.1017/s0025100300005417.
FFmpeg Developers. 2021. “FFmpeg Tool (Version be1d324) [Software].” http://ffmpeg.org.
Hillenbrand, J., L. A. Getty, M. J. Clark, and K. Wheeler. 1995. “Acoustic Characteristics of American English Vowels.” The Journal of the Acoustical Society of America 97 (5): 3099–3111.
Lee, A., and T. Kawahara. 2019. https://doi.org/10.5281/zenodo.2530395.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Weide, R. L. 1994. “CMU Pronouncing Dictionary.” http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
Wells, J. C. 2008. Longman Pronunciation Dictionary. London: Pearson Longman.
Young, S. J., G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, et al. 2006. The HTK Book, Version 3.4. Cambridge, UK: Cambridge University Engineering Department.
yt-dlp Developers. 2022. “yt-dlp.” GitHub repository. https://github.com/yt-dlp/yt-dlp.
Yuan, J., and M. Liberman. 2008. “Speaker Identification on the SCOTUS Corpus.” Journal of the Acoustical Society of America 123 (5): 5687.