ICASSP 2026 • Audio Samples

Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

A lightweight approach that uses a single short reference utterance per speaker to generate diverse synthetic speech, improving ASR performance with minimal real data.

Satwinder Singh*, Qianli Wang*, Zihan Zhong*, Clarion Mendes†, Mark Hasegawa-Johnson†, Waleed Abdulla*, Seyed Reza Shahamiri*

*University of Auckland, New Zealand  |  †University of Illinois Urbana-Champaign, USA

Project overview

Collecting dysarthric speech data is labor-intensive and expensive. This work investigates Zero-Shot Voice Cloning as a scalable solution. Using only a single ~7-second reference utterance per speaker (from the TORGO dataset), we synthesized 14.94 hours of linguistically diverse speech using Higgs Audio V2.
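As a rough illustration of how one reference per speaker can fan out into many synthetic utterances, the sketch below pairs each speaker's reference clip with every text prompt to form a list of cloning jobs. This is not the authors' pipeline: the `CloneJob` class, the `build_jobs` and `total_hours` helpers, and the file paths are all hypothetical, and the call to the voice-cloning model itself is deliberately left out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CloneJob:
    """One synthesis request: a speaker's reference clip plus the text to speak."""
    speaker_id: str
    reference_wav: str  # hypothetical path to the ~7-second reference utterance
    text: str


def build_jobs(references: dict[str, str], prompts: list[str]) -> list[CloneJob]:
    """Pair every speaker's single reference clip with every text prompt."""
    return [
        CloneJob(speaker_id, wav_path, text)
        for (speaker_id, wav_path), text in product(references.items(), prompts)
    ]


def total_hours(durations_sec: list[float]) -> float:
    """Convert per-utterance durations in seconds into total hours."""
    return sum(durations_sec) / 3600.0


# Hypothetical inputs: the paths are placeholders, not the real TORGO file layout.
references = {"F04": "refs/F04.wav", "M04": "refs/M04.wav"}
prompts = ["I parked on level one.", "What is on my calendar tomorrow?"]

jobs = build_jobs(references, prompts)
print(len(jobs))  # 2 speakers x 2 prompts = 4 jobs
```

Each job would then be passed to the cloning model, and `total_hours` can be applied to the resulting clip durations to track how much synthetic speech has been produced.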

Key result: Adding just 1.55 hours of real dysarthric speech to our cloned dataset reduced WER by 57.59% relative to the baseline.
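For readers unfamiliar with the metric, the following is a minimal sketch of how WER and a relative WER reduction are commonly computed. It assumes the standard word-level edit distance definition with simple lowercasing and whitespace tokenization; the paper's exact text normalization is not stated on this page, and the numbers in the usage lines are illustrative only, not results from this work.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Reduction in WER expressed as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (not taken from the paper).
wer = word_error_rate("i parked on level one", "i park on level")
reduction = relative_wer_reduction(0.50, 0.25)
```

In the first usage line the hypothesis contains one substitution and one deletion against five reference words. A relative reduction is measured against the baseline WER, so it differs from a drop in absolute percentage points.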

Per speaker input: 1 reference (about 7 seconds)
Synthetic speech: 14.94 hours (Higgs Audio V2)
Real speech added: 1.55 hours (small, high impact)
WER change: 57.59% (relative reduction)

Results

Two complementary views of Word Error Rate (WER): severity-level and speaker-level.

[Figure: Word Error Rate (WER) for the baseline versus fine-tuning with clone-only and clone-plus-real data. Series: Baseline, FT–Clone-only, FT–Clone+Real. Views: by severity, by speaker.]
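The two views can be derived from the same per-utterance error counts by changing only the grouping key. The sketch below assumes pooled WER (total errors divided by total reference words within each group); the page does not state whether the authors pool counts or average per-utterance WER, so treat that choice as an assumption. The speaker-to-severity mapping follows the sample groups listed on this page, while `pooled_wer` and the counts in `rows` are hypothetical.

```python
from collections import defaultdict

# Severity labels as listed for the audio samples on this page.
SEVERITY = {
    "F04": "Mild", "M03": "Mild", "F03": "Mild",
    "M05": "Moderate",
    "F01": "Moderate-Severe", "M01": "Moderate-Severe", "M02": "Moderate-Severe",
    "M04": "Severe",
}


def pooled_wer(rows, key):
    """Group (speaker, n_errors, n_ref_words) rows and return WER per group."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for speaker, n_errors, n_ref_words in rows:
        group = key(speaker)
        errors[group] += n_errors
        words[group] += n_ref_words
    return {group: errors[group] / words[group] for group in errors}


# Hypothetical per-utterance counts, for illustration only.
rows = [("F04", 1, 10), ("M03", 3, 10), ("M04", 6, 10), ("M04", 4, 10)]

by_speaker = pooled_wer(rows, key=lambda speaker: speaker)
by_severity = pooled_wer(rows, key=lambda speaker: SEVERITY[speaker])
```

Pooling weights longer utterances more heavily than averaging per-utterance WER would, which is one reason the two aggregation choices can give different numbers.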

1. Mild Dysarthria

Speakers with mild dysarthria retain relatively clear articulation but exhibit minor prosodic irregularities. Cloning fidelity is generally very high.

Speaker F04
Mild
Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"The princess was the first to speak."

Speaker M03
Mild
Reference (Input)

"The quick brown fox jumps over the lazy dog."

Cloned Output

"Play a Beatles song on Amazon Music."

Speaker F03
Mild
Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"This morning he was feeling very good-natured."

2. Moderate Dysarthria

Speakers with moderate dysarthria show consistent dysarthric patterns but remain more intelligible than severe cases. The model begins to capture more distinct pathological traits.

Speaker M05
Moderate
Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"Brown and day had asked him to call again."

3. Moderate-Severe Dysarthria

Prosody becomes significantly labored, with noticeable breathiness, pauses, and articulation errors.

Speaker F01
Moderate-Severe
Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"Your son told me you were ill and I came right over."

Speaker M01
Moderate-Severe
Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"What is on my calendar tomorrow?"

Speaker M02
Moderate-Severe
Reference (Input)

"The quick brown fox jumps..."

Cloned Output

"I parked on level one."

4. Severe Dysarthria

Speech shows high variability, significant pauses, slurring, and unstable phonation. Faithful cloning is hardest here, yet significant ASR gains were still achieved.

Speaker M04
Severe
Reference (Input)

"The quick brown fox jumps..."

Cloned Output (Zero-Shot)

"I will explain to his lordship."