Low-Burden Data Augmentation for Dysarthric ASR

Project overview

Collecting dysarthric speech data is labor-intensive and expensive. This work investigates zero-shot voice cloning as a scalable alternative. From a single ~7-second reference utterance per speaker (taken from the TORGO dataset), we synthesized 14.94 hours of linguistically diverse speech with Higgs Audio V2.

Key result: Adding just 1.55 hours of real dysarthric speech to our cloned dataset reduced word error rate (WER) by 57.59% relative to the baseline.
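To make the metric behind the key result concrete, here is a minimal sketch of how WER and a relative WER reduction are commonly computed: WER as word-level edit distance divided by the number of reference words, and relative reduction as the drop in WER divided by the baseline WER. The function names and the example values are assumptions chosen for illustration; they are not the project's evaluation code or its reported numbers, and text normalization (casing, punctuation) is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref_words)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in percent: (baseline - new) / baseline * 100."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


if __name__ == "__main__":
    # Illustrative values only, not results from this project.
    wer = word_error_rate("i parked on level one", "i parked on level two")
    print(f"WER: {wer:.2f}")
    print(f"Relative reduction: {relative_wer_reduction(0.50, 0.25):.2f}%")
```

A relative reduction compares the change against the baseline value, so it differs from an absolute drop in percentage points.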

Per-speaker input: 1 reference utterance (about 7 seconds)
Synthetic speech: 14.94 hours (Higgs Audio V2)
Real speech added: 1.55 hours (small, high impact)
WER change: 57.59% (relative reduction)

Results

Two complementary views of Word Error Rate (WER): severity-level and speaker-level.

[Figure: Word Error Rate (WER), shown by severity and by speaker. Baseline vs. fine-tuning with clone-only data (FT–Clone-only) and clone-plus-real data (FT–Clone+Real).]
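The severity-level and speaker-level views can be derived from the same per-utterance error counts by changing only the grouping key. The sketch below assumes pooled WER (total errors divided by total reference words per group); whether the project pools counts or averages per-utterance WER is not stated, so treat that choice as an assumption. The speaker-to-severity mapping follows the sample groupings on this page, while the function names and the records in the example are hypothetical placeholders, not project data.

```python
from collections import defaultdict

# Speaker-to-severity mapping, following the sample sections on this page.
SEVERITY_BY_SPEAKER = {
    "F04": "Mild", "M03": "Mild", "F03": "Mild",
    "M05": "Moderate",
    "F01": "Moderate-Severe", "M01": "Moderate-Severe", "M02": "Moderate-Severe",
    "M04": "Severe",
}


def pooled_wer(records, group_of):
    """Pooled WER per group.

    records: iterable of (speaker, word_errors, reference_word_count).
    group_of: function mapping a speaker ID to its group label.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for speaker, n_errors, n_words in records:
        group = group_of(speaker)
        errors[group] += n_errors
        words[group] += n_words
    # Skip groups without reference words to avoid dividing by zero.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


def wer_by_speaker(records):
    return pooled_wer(records, lambda speaker: speaker)


def wer_by_severity(records):
    return pooled_wer(records, lambda speaker: SEVERITY_BY_SPEAKER[speaker])


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    example = [("F04", 1, 10), ("M03", 3, 10), ("M04", 5, 10)]
    print(wer_by_speaker(example))
    print(wer_by_severity(example))
```

Pooling weights each group by its number of reference words, so speakers with more test material contribute more to a severity-level figure than they would under a simple mean of per-speaker WERs.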

1. Mild Dysarthria

Speakers with mild dysarthria retain relatively clear articulation but exhibit minor prosodic irregularities. Cloning fidelity is generally very high.

Speaker F04

Mild

Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"The princess was the first to speak."

Speaker M03

Mild

Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"Play a Beatles song on Amazon Music."

Speaker F03

Mild

Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"This morning he was feeling very good-natured."

2. Moderate Dysarthria

Speakers in this group show consistent dysarthric patterns but remain more intelligible than severe cases. The model begins to capture more distinct pathological traits.

Speaker M05

Moderate

Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"Brown and day had asked him to call again."

3. Moderate-Severe Dysarthria

Prosody becomes significantly labored, with noticeable breathiness, pauses, and articulation errors.

Speaker F01

Moderate-Severe

Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"Your son told me you were ill and I came right over."

Speaker M01

Moderate-Severe

Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"What is on my calendar tomorrow?"

Speaker M02

Moderate-Severe

Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"I parked on level one."

4. Severe Dysarthria

Speech in this group shows high variability, significant pauses, slurring, and unstable phonation. Cloning fidelity is most challenging here, yet significant ASR gains were achieved.

Speaker M04

Severe

Reference (Input)

"The quick brown fox jumps..."

Cloned Output (Zero-Shot)

"I will explain to his lordship."