This page categorizes papers by research topic. For a “clean” list of publications, please refer to my Google Scholar profile.

The (Un)usefulness of Attention Mechanisms for Speech Recognition


Although attention-based models, especially Transformer variants, have been shown to provide outstanding performance in automatic speech recognition (ASR), it is questionable whether attention is necessary for ASR. The reasoning is straightforward: when recognizing the current word, humans do not rely on words spoken tens of seconds later by the speaker, yet attention mechanisms do. So far, our linear models have outperformed both Branchformer and Conformer models, achieving state-of-the-art (SOTA) ASR results across five different languages and three distinct acoustic conditions.

  1. T. Parcollet^, R. van Dalen^, S. Zhang^, S. Bhattacharya (^ equal contribution). “Sumformer: A Linear-Complexity Alternative to Self-Attention for Speech Recognition” (preprint 2023. Paper to be updated soon.)
    TL;DR: A linear alternative to self-attention halves VRAM usage, cuts training/inference time by up to 28%, and outperforms SOTA speech recognition models across five different languages under three different acoustic conditions. The benefits also carry over to speech understanding.

  2. S. Zhang, E. Loweimi, P. Bell, S. Renals. “On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers.” IEEE Spoken Language Technology Workshop (SLT) 2021
    TL;DR: We show that in Transformer-based speech recognition systems, some self-attention layers behave similarly to linear layers and can be replaced by them (a minimal sketch of such a replacement follows this list). Subsequent work has observed a similar effect for SOTA speech recognition systems based on Conformer and Branchformer.

  3. S. Zhang, E. Loweimi, P. Bell, S. Renals. “Stochastic Attention Head Removal: A Simple and Effective Method for Improving Automatic Speech Recognition with Transformers.” INTERSPEECH 2021
    TL;DR: We show that not all attention heads are necessary in Transformer-based speech recognition systems. Removing attention heads at random during training improves the performance of SOTA Transformer and Conformer speech recognition models (see the head-removal sketch after this list).

  4. S. Zhang, E. Loweimi, P. Bell, S. Renals. “Windowed Attention Mechanisms for Speech Recognition.” ICASSP 2019
    TL;DR: For attention-based encoder-decoder speech recognition systems, we demonstrate that restricting the attention mechanism to a window of input frames, with trained window length and shift, improves accuracy and reduces time complexity (a simplified sketch with a fixed window follows this list).

  5. S. Zhang, E. Loweimi, Y. Xu, P. Bell, S. Renals. “Trainable Dynamic Subsampling for End-to-End Speech Recognition.” INTERSPEECH 2019
    TL;DR: We demonstrate that a fully trainable input-sequence subsampling strategy for attention-based ASR models outperforms conventional subsampling methods with fixed sampling rates.
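
A minimal PyTorch sketch of the linear replacement discussed in paper 2 above. This is my own illustration rather than the exact configuration from the paper: a standard Transformer encoder block in which the multi-head self-attention sub-layer is swapped for a single position-wise linear layer, while the residual connections, layer normalization, and feed-forward sub-layer are kept unchanged (the dimensions are arbitrary).

import torch
import torch.nn as nn

class LinearReplacementBlock(nn.Module):
    """Encoder block whose attention sub-layer is replaced by a linear layer."""

    def __init__(self, d_model: int = 256, d_ff: int = 1024, dropout: float = 0.1):
        super().__init__()
        # Position-wise linear map standing in for multi-head self-attention.
        self.token_mixer = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); pre-norm residual structure.
        x = x + self.dropout(self.token_mixer(self.norm1(x)))
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x

block = LinearReplacementBlock()
print(block(torch.randn(8, 100, 256)).shape)  # torch.Size([8, 100, 256])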
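
The stochastic head removal of paper 3 can be sketched as follows. The head count, dimensions, and removal probability here are illustrative, and the exact recipe (e.g. whether surviving heads are rescaled, or which layers the method is applied to) is described in the paper, not in this snippet: during training, each attention head's output is zeroed out independently with some probability, so the model cannot become overly reliant on any single head.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MHAWithHeadRemoval(nn.Module):
    """Multi-head self-attention with random whole-head removal during training."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, p_remove: float = 0.2):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head, self.p_remove = n_heads, d_model // n_heads, p_remove
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape  # x: (batch, time, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each to (batch, heads, time, d_head).
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v  # per-head outputs: (batch, heads, time, d_head)
        if self.training and self.p_remove > 0:
            # Drop entire heads independently per example.
            keep = (torch.rand(b, self.n_heads, 1, 1, device=x.device) > self.p_remove).float()
            heads = heads * keep
        return self.out(heads.transpose(1, 2).reshape(b, t, -1))

layer = MHAWithHeadRemoval().train()
print(layer(torch.randn(2, 50, 256)).shape)  # torch.Size([2, 50, 256])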
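
Windowed attention (paper 4) can be emulated with a local attention mask, as in the sketch below. Here the window length is a fixed hyper-parameter and the window is centred on the current frame; the paper instead learns the window length and shift.

import torch
import torch.nn as nn

def local_attention_mask(t: int, window: int = 8) -> torch.Tensor:
    """Boolean (t, t) mask; True marks frames a query may NOT attend to."""
    idx = torch.arange(t)
    dist = (idx[None, :] - idx[:, None]).abs()
    return dist > window // 2

mask = local_attention_mask(100, window=8)
mha = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
x = torch.randn(2, 100, 256)  # (batch, time, features)
out, _ = mha(x, x, x, attn_mask=mask)  # each frame attends to +/-4 neighbouring frames only
print(out.shape)  # torch.Size([2, 100, 256])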

Self-supervised Learning for Speech Processing


Self-supervised learning (SSL) typically requires extremely large amounts of computational resources, which can hinder progress in this area. Training such models also often results in a large carbon footprint. We aim to make SSL training more accessible and environmentally friendly. So far, we have sped up the training of SOTA SSL models by up to 3.7x and reduced the GPU requirement from A100s to RTX 3090s.

  1. T. Parcollet, H. Nguyen, S. Evain, M. Zanon Boito, A. Pupier, S. Mdhaffar, H. Le, S. Alisamir, N. Tomashenko, M. Dinarelli, S. Zhang, A. Allauzen, M. Coavoux, Y. Esteve, M. Rouvier, J. Goulian, B. Lecouteux, F. Portet, S. Rossato, F. Ringeval, D. Schwab, L. Besacier. “LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech” (preprint 2023. Paper to be updated soon.)
    TL;DR: We conducted benchmarks and open-sourced self-supervised learning models for French speech, which encompassed data ranging from 1,000 to 14,000 hours and model sizes from 26 million to 1 billion parameters. Additionally, we explored the pre-training and fine-tuning techniques for downstream tasks and their impact on the carbon footprint.

  2. T. Parcollet, S. Zhang, R. van Dalen, AGCP. Ramos, S. Bhattacharya. “On the (In)Efficiency of Acoustic Feature Extractors for Self-Supervised Speech Representation Learning.” INTERSPEECH 2023
    TL;DR: We carefully design the feature extractor frontend for self-supervised speech processing models, successfully reducing the minimum hardware requirements for training state-of-the-art (SOTA) models from A100 GPUs to RTX 3090 GPUs.

Noise Robust Speech Recognition


We address noisy inputs with two approaches: (1) training a speech enhancement front-end to remove the noise, and (2) using transfer learning to enable the ASR model to learn noise-invariant features.

  1. S. Zhang, M. Chadwick, AGCP. Ramos, T. Parcollet, R. van Dalen, S. Bhattacharya. “Real-Time Personalised Speech Enhancement Transformers with Dynamic Cross-attended Speaker Representation.” INTERSPEECH 2023
    TL;DR: We demonstrate that the cross-attention mechanism, despite its apparent computational expense, is more effective than conventional static single-vector approaches at learning speaker representations. Consequently, cross-attention-based speech enhancement systems outperform traditional models, achieving better performance with a smaller model size and, in turn, faster inference.

  2. S. Zhang, CT. Do, R. Doddipatla, E. Loweimi, P. Bell, S. Renals. “Train Your Classifier First: Cascade Neural Networks Training from Upper Layers to Lower Layers.” ICASSP 2021
    TL;DR: We demonstrate that training neural networks layer by layer, starting from the upper layers (near the output) and progressing to the lower layers (near the input), yields substantial and consistent performance gains on speech recognition, image classification, and language modeling tasks (a minimal sketch of the schedule follows this list).

  3. S. Zhang, CT. Do, R. Doddipatla, S. Renals. “Learning Noise Invariant Features Through Transfer Learning for Robust End-to-End Speech Recognition.” ICASSP 2020
    TL;DR: We show that the upper layers (near the output) of neural speech recognition models are more robust to noise and can therefore be used to guide the lower layers of the network to learn noise-invariant features.

  4. CT. Do, S. Zhang, T. Hain. “Selective Adaptation of End-to-End Speech Recognition using Hybrid CTC/Attention Architecture for Noise Robustness.” European Signal Processing Conference (EUSIPCO) 2020
    TL;DR: With a constraint on the computational budget, we benchmark which part of an LSTM-based speech recognition system is the most effective to adapt for noise robustness using a very limited amount of data (about 2.4 minutes).
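
A minimal sketch of the cascade training schedule from paper 2 above, under my own simplifying assumptions (a toy fully connected model, dummy data, and an arbitrary stage split rather than the tasks and schedule used in the paper): only the layers closest to the output are updated in the first stage, and lower layers are progressively unfrozen in later stages.

import torch
import torch.nn as nn

model = nn.Sequential(          # index 0 is closest to the input
    nn.Linear(80, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),         # classifier / output layer
)
layers = [m for m in model if isinstance(m, nn.Linear)]

def run_stage(trainable, steps: int = 100):
    """Train for a few steps while updating only the layers in `trainable`."""
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad = layer in trainable
    optim = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
    for _ in range(steps):
        x, y = torch.randn(32, 80), torch.randint(0, 10, (32,))  # dummy batch
        loss = nn.functional.cross_entropy(model(x), y)
        optim.zero_grad()
        loss.backward()
        optim.step()

run_stage(trainable=layers[-1:])   # stage 1: output layer only
run_stage(trainable=layers[-2:])   # stage 2: add the next-lower layer
run_stage(trainable=layers)        # final stage: train the whole network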

Streaming Transformer ASR Model


  1. M. Li, S. Zhang, C. Zorila, R. Doddipatla. “Transformer-based Streaming ASR with Cumulative Attention.” ICASSP 2022
    TL;DR: We enhance the performance of Transformer-based encoder-decoder streaming speech recognition systems, achieving improvements in both speed and accuracy.