Srivastava Manu, Ferro Marcello, Pirrelli Vito, Coro Gianpaolo
Automatic Speech Recognition, Statistical analysis, Disfluencies, Voice Activity Detection
This paper presents an open-source Automatic Speech Recognition (ASR) pipeline optimised for disfluent Italian read speech, designed to enhance both transcription accuracy and token boundary precision in low-resource settings. The study aims to address the difficulty that conventional ASR systems face in capturing the temporal irregularities of disfluent reading, which are crucial for psycholinguistic and clinical analyses of fluency. Building upon the WhisperX framework, the proposed system replaces the neural Voice Activity Detection module with an energy-based segmentation algorithm designed to preserve prosodic cues such as pauses and hesitations. A dual-alignment strategy integrates two complementary phoneme-level ASR models to correct onset–offset asymmetries, while a bias-compensation post-processing step mitigates systematic timing errors. Evaluation on the READLET (child read speech) and CLIPS (adult read speech) corpora shows consistent improvements over baseline systems, confirming enhanced robustness in boundary detection and transcription under disfluent conditions. The results demonstrate that the proposed architecture provides a general, language-independent framework for accurate alignment and disfluency-aware ASR. The approach can support downstream analyses of reading fluency and speech planning, contributing to both computational linguistics and clinical speech research.
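The abstract names an energy-based segmentation step and a bias-compensation step without giving their details. The following is only a minimal sketch of how such steps are commonly built, not the authors' implementation: the function names, the 25 ms / 10 ms framing, the -40 dB threshold, the 150 ms minimum pause, and the use of a mean signed error as the bias estimate are all illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (onset, offset) in seconds


def frame_energies_db(samples: Sequence[float], frame_len: int, hop: int) -> List[float]:
    """Short-time log energy (dB) of each full frame."""
    energies = []
    for start in range(0, max(len(samples) - frame_len + 1, 0), hop):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(mean_sq + 1e-12))  # floor avoids log(0)
    return energies


def energy_segments(samples: Sequence[float], sample_rate: int,
                    frame_ms: float = 25.0, hop_ms: float = 10.0,
                    threshold_db: float = -40.0,
                    min_pause_ms: float = 150.0) -> List[Segment]:
    """Hypothetical energy-threshold segmenter (all defaults are assumptions).

    Frames above threshold_db count as speech. Gaps shorter than min_pause_ms
    are bridged; longer silences are kept as pauses between segments.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    hop = max(1, int(sample_rate * hop_ms / 1000))
    active = [e > threshold_db for e in frame_energies_db(samples, frame_len, hop)]

    # Collect runs of consecutive active frames as [first, last + 1) index pairs.
    runs, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(active)))

    # Convert frame indices to seconds, then bridge short gaps.
    merged: List[Segment] = []
    for first, end in runs:
        onset = first * hop / sample_rate
        offset = ((end - 1) * hop + frame_len) / sample_rate
        if merged and onset - merged[-1][1] < min_pause_ms / 1000:
            merged[-1] = (merged[-1][0], offset)
        else:
            merged.append((onset, offset))
    return merged


def mean_signed_error(predicted: List[Segment], reference: List[Segment]) -> Segment:
    """Average (predicted - reference) error for onsets and offsets."""
    n = len(predicted)
    onset_bias = sum(p[0] - r[0] for p, r in zip(predicted, reference)) / n
    offset_bias = sum(p[1] - r[1] for p, r in zip(predicted, reference)) / n
    return onset_bias, offset_bias


def compensate_bias(segments: List[Segment], onset_bias: float,
                    offset_bias: float) -> List[Segment]:
    """Subtract a previously estimated systematic timing error."""
    return [(on - onset_bias, off - offset_bias) for on, off in segments]
```

In this sketch the biases would be estimated once on manually aligned development data and then subtracted from boundaries predicted on new recordings. The threshold and minimum pause would need tuning per corpus and recording condition.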
Source: Intelligent Systems with Applications, vol. 29
@article{oai:iris.cnr.it:20.500.14243/561481,
title = {Enhancing token boundary detection in disfluent speech},
author = {Srivastava, Manu and Ferro, Marcello and Pirrelli, Vito and Coro, Gianpaolo},
doi = {10.1016/j.iswa.2025.200614},
year = {2026}
}