LocalVQE


Local Voice Quality Enhancement: a compact neural model for joint acoustic echo cancellation (AEC), noise suppression, and dereverberation of 16 kHz speech, designed to run on commodity CPUs in real time.

  • 1.3 M parameters (~5 MB F32)
  • ~1.66 ms per 16 ms frame on Zen4 (24 threads), ≈9.6× realtime
  • Causal, streaming: 256-sample hop, 16 ms algorithmic latency
  • F32 reference inference in C++ via GGML; PyTorch reference included for verification and research
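
The timing figures in the list above follow from simple arithmetic. A minimal sketch, assuming only the numbers quoted there; the variable names are illustrative and not taken from the LocalVQE source:

```python
# Illustrative arithmetic only; names are not taken from the LocalVQE source.
SAMPLE_RATE_HZ = 16_000      # the model operates on 16 kHz speech
HOP_SAMPLES = 256            # samples consumed per streaming step
COMPUTE_MS_PER_HOP = 1.66    # reported figure on Zen4 with 24 threads

# Duration of audio covered by one hop, in milliseconds.
hop_ms = 1000.0 * HOP_SAMPLES / SAMPLE_RATE_HZ

# Real-time factor: how many times faster than playback the model runs.
realtime_factor = hop_ms / COMPUTE_MS_PER_HOP

print(f"hop = {hop_ms:.1f} ms, realtime factor = {realtime_factor:.1f}x")
```

Substituting a different per-hop compute time gives the real-time factor for other hardware.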

Try it live: https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo.

This page is the Hugging Face model card; it hosts the published weights. Source code, build system, tests, and training pipeline live in the GitHub repository: https://github.com/localai-org/LocalVQE.

The current release is v1.1, which fixes intermittent crackling the previous release produced under heavy background noise.

The technical report describing the architecture, streaming-state contract, and streaming-causal normalisation operator is included in this repo as localvqe-technical-report.pdf. We would like to publish it to arXiv (eess.AS / cs.SD) but need an endorsement from an existing author in those categories. If you can endorse, please reach out via the GitHub repo.

Authors:

  • Richard Palethorpe (richiejp)
  • Claude (Anthropic)

LocalVQE is a derivative of DeepVQE (Indenbom et al., Interspeech 2023, "DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation", arXiv:2306.03177): smaller, GGML-native, and tuned for streaming CPU inference. The architecture is documented in the technical report linked above.

A concrete example

Picture a video call from a laptop. Your microphone picks up three things alongside your voice:

  1. The remote participant's voice, played back through your speakers and caught again by your mic. This is the echo. Without cancellation they hear themselves a fraction of a second later.
  2. Your own voice bouncing off walls, desk, and monitor before reaching the mic. This is reverberation, the "tunnel" or "bathroom" sound that makes you feel far away from the listener.
  3. A fan, keyboard clatter, a dog barking, or traffic outside: plain background noise.

LocalVQE removes all three in a single causal pass, frame by frame, on the CPU, so only your voice reaches the far end.

Why this, and not a classical AEC/NS stack?

Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per frame and remain a strong baseline when the acoustic path is benign. LocalVQE is interesting when you want:

  • Robustness to non-linear echo paths (small loudspeakers, handheld devices, plastic laptop chassis) where linear AEC leaves residual echo.
  • Non-stationary noise suppression (babble, keyboards, fans changing speed) that energy-based noise estimators struggle with.
  • One model, many conditions: no per-device tuning of step sizes, forgetting factors, or VAD thresholds.
  • A single deterministic causal pass: no double-talk detector, no adaptation state that can diverge.

The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE ~1–2 ms/frame. On anything larger than a microcontroller that is still only about 6–13% of the 16 ms frame budget.

Why this, and not DeepVQE?

Microsoft never released DeepVQE: no weights, no reference implementation, no streaming runtime. We re-implemented it from the paper as a GGML graph at richiejp/deepvqe-ggml (the full-width 7.5 M-parameter version) before starting LocalVQE. LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters (5 MB F32), small enough to run on commodity CPUs in real time.

Files in this repository

| File | Size | Description |
|---|---|---|
| localvqe-v1.1-1.3M.pt | 11 MB | PyTorch checkpoint: DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
| localvqe-v1.1-1.3M-f32.gguf | 5 MB | GGML F32 export, the file the C++ inference engine loads. |

Only F32 GGUF is published today. A quantize tool is included in the C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been released.

Validation Results

Full 800-clip eval on the ICASSP 2022 AEC Challenge blind test set (real recordings, not synthetic mixes).

| Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
|---|---|---|---|---|---|
| doubletalk | 115 | 4.70 | 2.35 | 8.4 dB | 2.85 |
| doubletalk-with-movement | 185 | 4.63 | 2.35 | 8.3 dB | 2.80 |
| farend-singletalk | 107 | 2.98 | 4.91 | 44.7 dB | 1.93 |
| farend-singletalk-with-movement | 193 | 3.40 | 4.95 | 45.0 dB | 1.91 |
| nearend-singletalk | 200 | 4.99 | 4.05 | 2.5 dB | 3.13 |

  • AECMOS (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC quality predictor. "Echo" rates how well echo was removed; "degradation" rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
  • Blind ERLE is 10·log10(E[mic²] / E[enh²]). Only meaningful on far-end single-talk where the input is echo-only; on scenes with active near-end speech it understates echo removal because both numerator and denominator are dominated by speech.
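
The blind ERLE definition above can be sketched in a few lines. This is an illustrative implementation of the stated formula only, not the project's evaluation script; the function name and the `eps` guard against silent input are assumptions added here.

```python
import math

def blind_erle_db(mic, enh, eps=1e-12):
    """Blind ERLE in dB: 10*log10(mean(mic^2) / mean(enh^2)).

    `mic` and `enh` are equal-length sequences of float samples.
    `eps` is an assumption added here to avoid dividing by zero on silence.
    """
    if len(mic) != len(enh) or len(mic) == 0:
        raise ValueError("mic and enh must be non-empty and equal length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(x * x for x in enh) / len(enh)
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))

# Toy check: scaling every sample by 0.1 is a 20 dB power reduction.
mic = [math.sin(0.1 * n) for n in range(1600)]
enh = [0.1 * x for x in mic]
print(round(blind_erle_db(mic, enh), 1))
```

Because the measure compares total power before and after, it cannot tell removed echo from removed speech, which is why it is only informative on far-end single-talk clips.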

Building the C++ Inference Engine

Source, build system, and tests live at https://github.com/localai-org/LocalVQE. Requires CMake ≥ 3.20 and a C++17 compiler. A Nix flake is provided:

git clone --recursive https://github.com/localai-org/LocalVQE.git
cd LocalVQE

# With Nix:
nix develop
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

# Without Nix: install cmake, gcc/clang, pkg-config, libsndfile, then:
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

Binaries land in ggml/build/bin/. The CPU build produces multiple libggml-cpu-*.so variants (SSE4.2 / AVX2 / AVX-512) selected at runtime. Keep the binaries and .so files together.

Vulkan backend (embedded / integrated-GPU targets)

Add -DLOCALVQE_VULKAN=ON to the configure step. This composes with the CPU build: an additional libggml-vulkan.so is produced in ggml/build/bin/ and the runtime loader picks it up when a Vulkan ICD is present, otherwise it falls back to the CPU variants.

cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
cmake --build ggml/build -j$(nproc)

The Nix flake's dev shell already includes vulkan-loader, vulkan-headers, and shaderc. Without Nix, install the equivalents from your distro (Debian: libvulkan-dev vulkan-headers glslc/shaderc).

Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)

Measured with bench on Zen4 desktop (Ryzen 9 7900). Each hop is a full ggml_backend_graph_compute.

| Backend | Threads | p50 | p99 | max |
|---|---|---|---|---|
| CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms |
| CPU | 2 | 2.07 ms | 2.25 ms | 3.65 ms |
| CPU | 4 | 1.32 ms | 1.57 ms | 6.91 ms |
| Vulkan (AMD iGPU, RADV) | - | 4.43 ms | 4.62 ms | 5.07 ms |
| Vulkan (NVIDIA RTX 5070 Ti) | - | 1.79 ms | 3.41 ms | 4.14 ms |

Vulkan p50 and p99 are tight, but worst-case single-hop latency on a shared desktop is sensitive to external GPU clients (display compositor, browser). On a dedicated embedded device with no compositor contending for the queue, expect the quieter end of the range.
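
To compare your own per-hop timings against the 16 ms budget, a summary like the one in the table can be computed as sketched below. This is an illustration under stated assumptions: it uses nearest-rank percentiles, which may differ from how the bench tool computes them, and the sample timings are hypothetical rather than measured.

```python
HOP_BUDGET_MS = 16.0  # 256 samples at 16 kHz

def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def summarize(latencies_ms):
    """Return p50, p99, max, and the number of hops that overran the budget."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
        "max": max(latencies_ms),
        "overruns": sum(1 for t in latencies_ms if t > HOP_BUDGET_MS),
    }

# Hypothetical timings (not measured data): mostly ~1.3 ms with one slow hop.
example = [1.3] * 98 + [1.6, 6.9]
print(summarize(example))
```

For streaming use the max and the overrun count matter more than the median, since a single hop that exceeds the budget is audible as a dropout.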

Running Inference

Download localvqe-v1.1-1.3M-f32.gguf from this repository (the file list above) either via huggingface-cli, the Hub web UI, or hf_hub_download from huggingface_hub. Then:
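
A minimal sketch of the hf_hub_download route, assuming the huggingface_hub package is installed. The repo id below is an assumption inferred from this model card's location on the Hub; adjust it if the weights live elsewhere. The helper name download_gguf is hypothetical.

```python
# Assumption: the repo id is inferred from this model card's Hub location.
REPO_ID = "LocalAI-io/LocalVQE"
FILENAME = "localvqe-v1.1-1.3M-f32.gguf"

def download_gguf(repo_id=REPO_ID, filename=FILENAME):
    """Fetch the GGUF file into the local Hugging Face cache and return its path."""
    # Imported lazily so this module loads even where huggingface_hub is absent.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=filename)
```

Calling download_gguf() returns a local file path that can be passed to the CLI as the model argument.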

CLI

./ggml/build/bin/localvqe localvqe-v1.1-1.3M-f32.gguf \
    --in-wav mic.wav ref.wav \
    --out-wav enhanced.wav

Expects 16 kHz mono PCM for both mic and far-end reference.
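
Before running the CLI it can help to confirm that both inputs match this format. A small sketch using Python's standard wave module; the helper name check_wav is hypothetical, and the wave module only reads integer PCM WAV files, so other encodings raise wave.Error.

```python
import wave

def check_wav(path, rate=16_000, channels=1):
    """Return a list of problems found; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {rate}")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
    return problems
```

For example, check_wav("mic.wav") and check_wav("ref.wav") should both return an empty list; anything else needs resampling or downmixing first.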

Benchmark

./ggml/build/bin/bench localvqe-v1.1-1.3M-f32.gguf \
    --in-wav mic.wav ref.wav --iters 10 --profile

Shared Library (C API)

cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)

Produces liblocalvqe.so with the API in ggml/localvqe_api.h. See ggml/example_purego_test.go in the GitHub repo for a Go / purego integration.

Quantizing (experimental)

Calibrated Q4_K / Q8_0 weights are not yet published. The quantize tool in the C++ build can produce GGUF variants from the F32 reference for experimentation:

./ggml/build/bin/quantize localvqe-v1.1-1.3M-f32.gguf localvqe-v1.1-1.3M-q8.gguf Q8_0

Expect end-to-end quality loss until proper per-tensor selection and calibration have been worked through.
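
To give a feel for where that loss comes from, here is a simplified sketch of Q8_0-style block quantization: each block of 32 weights shares one scale, and each weight is rounded to a signed 8-bit integer. This is an illustration, not the quantize tool's code. It keeps the scale as a full-precision float, whereas GGML stores it in half precision, and the weights are made up.

```python
BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32

def quantize_block_q8_0(block):
    """Quantize one block to (scale, int8 values), symmetric around zero."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0
    if scale == 0.0:
        return 0.0, [0] * len(block)
    return scale, [round(x / scale) for x in block]

def dequantize_block(scale, quants):
    """Reconstruct approximate float weights from a quantized block."""
    return [scale * q for q in quants]

# Hypothetical weights, for illustration only.
weights = [((i * 37) % 64 - 32) / 100.0 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block_q8_0(weights)
restored = dequantize_block(scale, quants)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f} max_abs_error={max_err:.6f}")
```

The per-weight error is bounded by half the block scale, so a single large weight in a block coarsens the resolution of the other 31. That is one reason per-tensor selection matters.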

PyTorch Reference

localvqe-v1.1-1.3M.pt is the PyTorch checkpoint used to produce the GGUF export. It is provided for verification, ablation, and downstream research, not for end-user inference, which should go through the GGML build above. The model definition lives under pytorch/ in the GitHub repo:

git clone https://github.com/localai-org/LocalVQE.git
cd LocalVQE/pytorch
pip install -r requirements.txt

Citing LocalVQE

If you use LocalVQE in academic work, please cite the repository via the CITATION.cff at https://github.com/localai-org/LocalVQE. GitHub renders a "Cite this repository" button that produces APA and BibTeX entries automatically.

For a DOI, we recommend citing a specific release via Zenodo, which mints a DOI per GitHub release. Please also cite the upstream DeepVQE paper:

@inproceedings{indenbom2023deepvqe,
  title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
               Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
  author    = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin
               and Saabas, Ando and P{\"a}rnamaa, Tanel
               and Gu{\v{z}}vin, Jegor and Cutler, Ross},
  booktitle = {Interspeech},
  year      = {2023},
  doi       = {10.21437/Interspeech.2023-2176}
}

Dataset Attribution

Published weights are trained on data from the ICASSP 2023 Deep Noise Suppression Challenge (Microsoft, CC BY 4.0) and fine-tuned on the ICASSP 2022/2023 Acoustic Echo Cancellation Challenge.

Safety Note

Training data was filtered by DNSMOS perceived-quality scores, which can misclassify distressed speech (screaming, crying) as noise. LocalVQE may attenuate or distort such signals and must not be relied upon for emergency call or safety-critical applications.

License

Apache License 2.0.
