# DeepSeek-V4-Flash Q8_0 GGUF
Near-lossless Q8_0 GGUF conversion of deepseek-ai/DeepSeek-V4-Flash (284B params, 13B active).
## Use cases
**Inference:** ~282 GB of weights. Suitable for rigs with combined VRAM ≥ 320 GB (e.g., 4× A100 80GB, 4× H100 80GB, or 4× RTX Pro 6000 96GB) using llama.cpp's tensor-parallel paths. Quality is essentially indistinguishable from the original FP8/FP4 source; Q8_0 is the gold-standard near-lossless quantization.
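For example, a minimal multi-GPU serving sketch; the shard filename, GPU count, and context size are assumptions, and llama.cpp loads the remaining shards automatically when pointed at the first one:

```bash
# Sketch: serve the Q8_0 model across four GPUs with row-wise tensor splitting.
# Filename and split values are placeholders; adjust to your download and rig.
./llama-server \
  -m DeepSeek-V4-Flash-Q8_0-00001-of-00007.gguf \
  --n-gpu-layers 999 \
  --split-mode row \
  --tensor-split 1,1,1,1 \
  --ctx-size 8192 \
  --port 8080
```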
**Calibration source:** serves as the input for downstream IQ-quant generation via `llama-quantize --imatrix`. If you want to make your own custom quants, this is the right starting point.
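A sketch of that workflow, assuming llama.cpp binaries built as described under Compatibility; the calibration corpus and output filenames are placeholders, not artifacts shipped with this repo:

```bash
# 1. Compute an importance matrix over a calibration corpus.
#    calibration.txt is a placeholder; supply your own text.
./llama-imatrix \
  -m DeepSeek-V4-Flash-Q8_0.gguf \
  -f calibration.txt \
  -o imatrix.dat

# 2. Generate a smaller IQ quant from the Q8_0 source.
#    --allow-requantize is needed because the input is already quantized.
./llama-quantize \
  --allow-requantize \
  --imatrix imatrix.dat \
  DeepSeek-V4-Flash-Q8_0.gguf \
  DeepSeek-V4-Flash-IQ4_XS.gguf \
  IQ4_XS
```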
## Provenance
Converted from the original FP8/FP4 mixed-precision safetensors with `convert_hf_to_gguf.py` from the nisparks/llama.cpp branch `wip/deepseek-v4-support` (PR #22378).
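For reproducibility, a sketch of that conversion step; the local paths are placeholders:

```bash
# Convert the HF safetensors checkpoint directly to Q8_0 GGUF.
# Run from a checkout of the wip/deepseek-v4-support branch.
python convert_hf_to_gguf.py \
  --outtype q8_0 \
  --outfile DeepSeek-V4-Flash-Q8_0.gguf \
  /path/to/DeepSeek-V4-Flash
```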
## Compatibility
Requires llama.cpp built from PR #22378 (or mainline once the PR is merged); the new `deepseek4` architecture is not yet in any stable release. See the nisparks branch for build instructions.
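A minimal build sketch, assuming the branch lives at the usual GitHub path and a CUDA rig; check the PR thread for current instructions:

```bash
# Build llama.cpp from the branch carrying deepseek4 support.
git clone --branch wip/deepseek-v4-support https://github.com/nisparks/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```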
## Related artifacts
- `Preyazz/DeepSeek-V4-Flash-GGUF`: derived K-quants and (forthcoming) IQ-quants for smaller deployments
- `Preyazz/DeepSeek-V4-Flash-imatrix`: importance matrix for IQ quantization (private)