# DeepSeek-V4-Flash Q8_0 GGUF
Near-lossless Q8_0 GGUF conversion of deepseek-ai/DeepSeek-V4-Flash (284B params, 13B active).
## Use cases
**Inference:** ~282 GB of weights. Suitable for rigs with combined VRAM ≥ 320 GB (e.g., 4× A100 80GB, 4× H100 80GB, or 4× RTX Pro 6000 96GB) using llama.cpp's tensor-parallel paths. Quality is essentially indistinguishable from the original FP8/FP4 source; Q8_0 is the gold-standard near-lossless quantization.
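For example, a minimal multi-GPU serving sketch; the shard filename, GPU count, and context size are assumptions, and llama.cpp loads the remaining shards automatically when pointed at the first one:

```bash
# Sketch: serve the Q8_0 model across four GPUs with row-wise tensor splitting.
# Filename and split values are placeholders; adjust to your download and rig.
./llama-server \
  -m DeepSeek-V4-Flash-Q8_0-00001-of-00007.gguf \
  --n-gpu-layers 999 \
  --split-mode row \
  --tensor-split 1,1,1,1 \
  --ctx-size 8192 \
  --port 8080
```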
**Calibration source:** serves as the input for downstream IQ-quant generation via `llama-quantize --imatrix`. If you want to make your own custom quants, this is the right starting point.
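A sketch of that workflow, assuming llama.cpp binaries built as described under Compatibility; the calibration corpus and output filenames are placeholders, not artifacts shipped with this repo:

```bash
# 1. Compute an importance matrix over a calibration corpus.
#    calibration.txt is a placeholder; supply your own text.
./llama-imatrix \
  -m DeepSeek-V4-Flash-Q8_0.gguf \
  -f calibration.txt \
  -o imatrix.dat

# 2. Generate a smaller IQ quant from the Q8_0 source.
#    --allow-requantize is needed because the input is already quantized.
./llama-quantize \
  --allow-requantize \
  --imatrix imatrix.dat \
  DeepSeek-V4-Flash-Q8_0.gguf \
  DeepSeek-V4-Flash-IQ4_XS.gguf \
  IQ4_XS
```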
## Provenance
Converted from the original FP8/FP4 mixed-precision safetensors with `convert_hf_to_gguf.py` from the nisparks/llama.cpp branch `wip/deepseek-v4-support` (PR #22378).
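For reproducibility, a sketch of that conversion step; the local paths are placeholders:

```bash
# Convert the HF safetensors checkpoint directly to Q8_0 GGUF.
# Run from a checkout of the wip/deepseek-v4-support branch.
python convert_hf_to_gguf.py \
  --outtype q8_0 \
  --outfile DeepSeek-V4-Flash-Q8_0.gguf \
  /path/to/DeepSeek-V4-Flash
```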
## Compatibility
Requires llama.cpp built from PR #22378 (or mainline once the PR is merged); the new `deepseek4` architecture is not yet in any stable release. See the nisparks branch for build instructions.
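A minimal build sketch, assuming the branch lives at the usual GitHub path and a CUDA rig; check the PR thread for current instructions:

```bash
# Build llama.cpp from the branch carrying deepseek4 support.
git clone --branch wip/deepseek-v4-support https://github.com/nisparks/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```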
## Related artifacts
- `Preyazz/DeepSeek-V4-Flash-GGUF`: derived K-quants and (forthcoming) IQ-quants for smaller deployments
- `Preyazz/DeepSeek-V4-Flash-imatrix`: importance matrix for IQ quantization (private)