# Modeling strategies for speech enhancement in the latent space of a neural audio codec
This repository provides the official model checkpoints for the paper *Modeling strategies for speech enhancement in the latent space of a neural audio codec*, authored by Sofiene Kammoun, Xavier Alameda-Pineda, and Simon Leglaive, and published at IEEE ICASSP 2026.
We explore different modeling strategies (autoregressive vs. non-autoregressive) and representation spaces (discrete vs. continuous) for speech enhancement using neural audio codecs and Conformer-based architectures.
arXiv | Code and audio examples | BibTeX
## Overview
Our work introduces and compares a family of speech enhancement models that vary systematically along two main axes:
- **Representation type**
  - Discrete tokens
  - Continuous latent vectors
- **Modeling strategy**
  - **Autoregressive (AR)**: sequential prediction of the clean speech representation
  - **Non-autoregressive (NAR)**: parallel prediction of the clean speech representation
The current release includes the following models:
| Model Name | Modeling Strategy | Input Representation | Output Representation | Model Checkpoint |
|---|---|---|---|---|
| D-AR | Autoregressive | Discrete | Discrete | `D-AR_ckpt_300.pt` |
| D-NAR | Non-autoregressive | Discrete | Discrete | `D-NAR_ckpt_300.pt` |
| D-NAR* | Non-autoregressive | Continuous | Discrete | `D-NAR_star_ckpt_300.pt` |
| C-AR | Autoregressive | Continuous | Continuous | `C-AR_ckpt_300.pt` |
| C-NAR | Non-autoregressive | Continuous | Continuous | `C-NAR_ckpt_300.pt` |
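The checkpoints in the table are standard PyTorch `.pt` files, so loading one might look like the sketch below. The `"state_dict"` wrapper key and the `load_checkpoint` helper name are assumptions for illustration, not something documented by this repository; inspect the file contents if the layout differs.

```python
import torch


def load_checkpoint(path: str):
    """Load a .pt checkpoint on CPU and return its state dict.

    Assumes the file is either a raw state dict or a dict wrapping one
    under a "state_dict" key (an assumption, not a documented layout).
    """
    # map_location="cpu" lets the checkpoint load on machines without a GPU.
    ckpt = torch.load(path, map_location="cpu")
    if isinstance(ckpt, dict) and "state_dict" in ckpt:
        return ckpt["state_dict"]
    return ckpt
```

A typical usage, assuming the matching Conformer model has been instantiated, would be `model.load_state_dict(load_checkpoint("C-NAR_ckpt_300.pt"))`.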
Additional models:
- **C-FT** (`C-FT-encoder_ckpt_300.pt`) and **D-FT** (`D-FT-encoder_ckpt_300.pt`), where only the NAC's encoder is fine-tuned, with an MSE loss and a cross-entropy loss, respectively.
- **STFT-NAR** (`STFT_NAR_Mask_ckpt_300.pt`), which operates on STFT representations instead of the NAC's embeddings and is trained to output an STFT mask.
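For the STFT-based variant, masking means scaling each time-frequency bin of the noisy spectrogram before inverting back to a waveform. The sketch below shows this mechanism with illustrative STFT settings (`n_fft=512`, `hop=128` are assumptions, not the paper's configuration), not the actual STFT-NAR inference code.

```python
import torch


def apply_stft_mask(noisy: torch.Tensor, mask: torch.Tensor,
                    n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Apply a [0, 1] mask to the STFT of a noisy waveform and resynthesize.

    `noisy` is a 1-D waveform; `mask` must match the STFT shape
    (n_fft // 2 + 1 frequency bins, num_frames). The STFT parameters
    here are illustrative, not the paper's settings.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy, n_fft, hop_length=hop, window=window,
                      return_complex=True)   # complex (freq, frames)
    enhanced_spec = mask * spec              # element-wise bin scaling
    return torch.istft(enhanced_spec, n_fft, hop_length=hop,
                       window=window, length=noisy.shape[-1])
```

With an all-ones mask this reduces to an STFT round trip, which is a quick sanity check that the analysis/synthesis settings are consistent.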