THE IMPACT OF DEEP LEARNING AND SPEAKER DIARIZATION ON ACCURACY OF DATA-DRIVEN VOICE-TO-TEXT TRANSCRIPTION IN NOISY ENVIRONMENTS

Authors

  • Abdulla Mamun, Business Data Analyst, Moment A/S, Copenhagen, Denmark
  • Alifa Majumder Nijhum, Master in Digital Marketing, St. Francis College, NY, USA

DOI:

https://doi.org/10.63125/rpjwke42

Keywords:

Deep Learning ASR, Speaker Diarization, Noisy Environments, Transcription Accuracy, Enterprise Voice-To-Text

Abstract

This quantitative, cross-sectional, case-based study investigated why cloud and enterprise voice-to-text deployments still produce variable transcription accuracy in noisy, multi-speaker settings, and whether Deep Learning ASR Capability (DL) and Speaker Diarization Quality (SD) act as complementary drivers of perceived Transcription Accuracy in Noise (TA). The research problem centers on the persistent gap between real-world acoustic difficulty and dependable transcript quality for analytics, compliance, and decision support. Using a one-time Likert-scale survey (1 = strongly disagree to 5 = strongly agree), the study retained N = 156 usable responses from users and reviewers embedded in operational enterprise and cloud transcription cases. In the model, DL and SD were the core independent variables, TA was the dependent variable, and Noise Severity and Overlap Frequency were included as contextual controls to separate technical effects from environmental difficulty. The analysis plan applied composite scoring, reliability testing, Pearson correlations, and multiple regression.

Descriptively, the case context was adverse, with Noise Severity M = 3.94 (SD = 0.71) and Overlap Frequency M = 3.52 (SD = 0.82), while outcomes remained only moderate (TA M = 3.46, SD = 0.64). DL was rated moderately high (M = 3.62, SD = 0.59) and Speaker Diarization Quality moderate (M = 3.38, SD = 0.67); inside these parentheses, SD denotes the standard deviation. Measurement quality supported inferential testing, with strong internal consistency (DL α = 0.88, SD α = 0.85, TA α = 0.90). Correlation results showed strong positive relationships between DL and TA (r = 0.61, p < .001) and between SD and TA (r = 0.55, p < .001), alongside negative associations between TA and noise (r = −0.31, p < .001) and between TA and overlap (r = −0.28, p < .01).

Regression findings confirmed joint predictive power. The model was significant (F(4,151) = 39.18, p < .001) and explained 51% of TA variance (R² = 0.51; Adjusted R² = 0.49). DL was the strongest positive predictor (B = 0.47, β = 0.43, p < .001), and SD added an independent positive contribution (B = 0.34, β = 0.31, p < .001), while noise (B = −0.11, p = .046) and overlap (B = −0.10, p = .021) reduced accuracy. A robustness check using TA groups reinforced the pattern: 29.5% of respondents fell in the low TA group (≤ 3.0), 52.6% in the moderate group, and 17.9% in the high group (> 3.75), with monotonic increases in DL and SD means across groups (DL 3.18 → 3.63 → 4.12; SD 2.97 → 3.36 → 3.98). Overall, the findings imply that enterprise teams should optimize ASR and diarization together, prioritize overlap-aware diarization improvements, and treat noise and overlap profiling as first-class deployment controls to improve transcript trust and downstream usability.
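
For readers who want to relate the reported regression statistics to one another, the following is a minimal Python sketch, not the authors' analysis code. It uses only the rounded figures quoted in the abstract (N = 156, four predictors, R² = 0.51, and the three TA group percentages) and the standard textbook formulas for the overall F statistic and adjusted R². The function names are hypothetical and chosen for illustration. Because the inputs are rounded, the recomputed values (F of roughly 39.3 and adjusted R² of roughly 0.497) can differ slightly from the published ones.

```python
def f_statistic(r2: float, n: int, k: int) -> float:
    """Overall F for a linear model with k predictors and n observations."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared, penalising R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def group_counts(n: int, percents: list[float]) -> list[int]:
    """Approximate respondent counts implied by reported group percentages."""
    return [round(n * p / 100.0) for p in percents]


# Rounded values as reported in the abstract (assumption: no further precision available).
N, K, R2 = 156, 4, 0.51

f_value = f_statistic(R2, N, K)
adj = adjusted_r2(R2, N, K)
counts = group_counts(N, [29.5, 52.6, 17.9])  # low, moderate, high TA groups

print(f"F({K},{N - K - 1}) recomputed from rounded R2: {f_value:.2f}")
print(f"Adjusted R2 recomputed from rounded R2: {adj:.3f}")
print(f"Implied group sizes (low, moderate, high): {counts}")
```

The degrees of freedom in F(4,151) follow from N − k − 1 = 156 − 4 − 1 = 151, and the three group percentages imply counts that sum back to the full sample of 156.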

Published

2023-12-28

How to Cite

Abdulla Mamun, & Alifa Majumder Nijhum. (2023). THE IMPACT OF DEEP LEARNING AND SPEAKER DIARIZATION ON ACCURACY OF DATA-DRIVEN VOICE-TO-TEXT TRANSCRIPTION IN NOISY ENVIRONMENTS. American Journal of Scholarly Research and Innovation, 2(02), 415–448. https://doi.org/10.63125/rpjwke42