A lightweight approach that uses a single short reference utterance per speaker to generate diverse synthetic speech, improving ASR performance with minimal real data.
Collecting dysarthric speech data is labor-intensive and expensive. This work investigates Zero-Shot Voice Cloning as a scalable solution. Using only a single ~7-second reference utterance per speaker (from the TORGO dataset), we synthesized 14.94 hours of linguistically diverse speech using Higgs Audio V2.
Key result: Adding just 1.55 hours of real dysarthric speech to our cloned dataset reduced WER by 57.59% compared to the baseline.
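1 reference utterance per speaker
14.94 hours of synthesized speech
1.55 hours of real dysarthric speech added
57.59% WER reduction vs. baseline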
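For readers unfamiliar with the metric, here is a minimal sketch of how WER and a reduction percentage can be computed. It is illustrative only, not the evaluation code used in this work. The page does not say whether 57.59% is a relative or an absolute reduction; the `relative_reduction` helper assumes the relative reading. All function names are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] is the edit distance between the reference prefix seen so far
    # and hyp[:j]; start with the empty reference prefix.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    errors, n_words = word_errors(reference, hypothesis)
    if n_words == 0:
        raise ValueError("reference must contain at least one word")
    return errors / n_words


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed (assumed reading of 'reduced by')."""
    return (baseline_wer - new_wer) / baseline_wer


# One dropped word out of seven reference words.
example = wer("The princess was the first to speak",
              "the princess was first to speak")
```

Under the relative reading, a hypothetical drop from a WER of 0.50 to 0.25 would count as a 50% reduction.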
Two complementary views of Word Error Rate (WER): severity-level and speaker-level.
Baseline vs fine-tuning with clone-only and clone-plus-real data.
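The two views can be produced from the same per-utterance error counts by changing only the grouping key. The sketch below uses made-up records; the speaker IDs, severity labels, and counts are placeholders, not results from this work. It pools errors and reference words within each group before dividing, which is one common convention. Averaging per-utterance WERs instead would give different numbers.

```python
from collections import defaultdict

# Hypothetical records: (speaker_id, severity, word_errors, reference_words).
records = [
    ("spk_a", "mild", 1, 10),
    ("spk_a", "mild", 0, 8),
    ("spk_b", "severe", 6, 9),
    ("spk_c", "severe", 3, 11),
]


def pooled_wer(rows, key_index):
    """Sum errors and reference words per group, then divide."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for row in rows:
        key = row[key_index]
        errors[key] += row[2]
        words[key] += row[3]
    return {key: errors[key] / words[key] for key in errors}


by_speaker = pooled_wer(records, 0)   # speaker-level view
by_severity = pooled_wer(records, 1)  # severity-level view
```

The severity-level view can hide large differences between speakers in the same group, which is why both views are useful.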
Speakers with mild dysarthria retain relatively clear articulation but exhibit minor prosodic irregularities. Cloning fidelity is generally very high.
"The quick brown fox jumps..."
"The princess was the first to speak."
"The quick brown fox jumps over the lazy dog."
"Play a Beatles song on Amazon Music."
"The quick brown fox jumps..."
"This morning he was feeling very good-natured."
Consistent patterns of dysarthria but with better intelligibility than severe cases. The model begins to capture more distinct pathological traits.
"The quick brown fox jumps..."
"Brown and day had asked him to call again."
Prosody becomes significantly labored, with noticeable breathiness, pauses, and articulation errors.
"The quick brown fox jumps..."
"Your son told me you were ill and I came right over."
"The quick brown fox jumps..."
"What is on my calendar tomorrow?"
"The quick brown fox jumps..."
"I parked on level one."
High variability, significant pauses, slurring, and unstable phonation. High cloning fidelity is hardest to achieve here, yet significant ASR gains were achieved.
"The quick brown fox jumps..."
"I will explain to his lordship."