DeepSeek-V4-Flash Q8_0 GGUF

Near-lossless Q8_0 GGUF conversion of deepseek-ai/DeepSeek-V4-Flash (284B total parameters, 13B active).

Use cases

Inference: ~282 GB of weights. Suitable for rigs with combined VRAM ≥ 320 GB (e.g., 4× A100 80GB, 4× H100 80GB, or 4× RTX Pro 6000 96GB) using llama.cpp's multi-GPU split modes. Quality is essentially indistinguishable from the original FP8/FP4 source; Q8_0 is widely treated as a near-lossless quantization.

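A minimal multi-GPU launch sketch for a 4-GPU rig like the ones above. The file name, shard count, context size, and port are placeholders (the actual GGUF in this repo may be split differently); the flags themselves are standard llama.cpp options:

```bash
# Offload all layers and split tensors row-wise across 4 GPUs.
# Adjust --tensor-split to match your actual VRAM layout.
./llama-server \
  -m DeepSeek-V4-Flash-Q8_0.gguf \
  -ngl 999 \
  --split-mode row \
  --tensor-split 1,1,1,1 \
  -c 8192 \
  --port 8080
```
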
Calibration source: serves as the input for importance-matrix generation and downstream IQ-quant creation via llama-quantize --imatrix. If you want to make your own custom quants, this is the right starting point; see the sketch below.

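A sketch of that workflow, assuming a llama.cpp build with deepseek4 support (see Compatibility); calibration.txt, the output names, and the IQ4_XS target are placeholders you would swap for your own:

```bash
# 1. Compute an importance matrix from this Q8_0 over a calibration corpus.
./llama-imatrix -m DeepSeek-V4-Flash-Q8_0.gguf \
  -f calibration.txt -o imatrix.dat -ngl 999

# 2. Requantize to a smaller IQ type using that imatrix.
./llama-quantize --imatrix imatrix.dat \
  DeepSeek-V4-Flash-Q8_0.gguf DeepSeek-V4-Flash-IQ4_XS.gguf IQ4_XS
```
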
Provenance

Converted from the original FP8/FP4 mixed-precision safetensors via convert_hf_to_gguf.py from nisparks/llama.cpp wip/deepseek-v4-support (PR #22378).

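For reference, a conversion sketch along those lines, assuming the branch is checked out and the original safetensors live in a local directory; the paths and output name are placeholders:

```bash
# Convert the HF safetensors to a Q8_0 GGUF in one pass.
python convert_hf_to_gguf.py /path/to/DeepSeek-V4-Flash \
  --outtype q8_0 \
  --outfile DeepSeek-V4-Flash-Q8_0.gguf
```
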
Compatibility

Requires llama.cpp built from PR #22378 (or from mainline once it is merged); the new deepseek4 architecture is not yet in any stable release. See nisparks's branch for build instructions.

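A build sketch, assuming the fork is hosted at github.com/nisparks/llama.cpp and the branch name is wip/deepseek-v4-support as stated above; check the PR for the current instructions:

```bash
git clone --branch wip/deepseek-v4-support https://github.com/nisparks/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON        # enable CUDA for multi-GPU offload
cmake --build build --config Release -j
```
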
Related artifacts

Model tree: Preyazz/DeepSeek-V4-Flash-Q8_0-GGUF is one of the quantizations of deepseek-ai/DeepSeek-V4-Flash; one further quantization has been derived from this Q8_0.