Model Overview

  • Model Architecture: gpt-oss-120b
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI350/MI355
  • ROCm: 7.0
  • Operating System(s): Linux
  • Inference Engine: vLLM
  • Model Optimizer: AMD-Quark
    • Weight quantization: OCP MXFP4, Static
    • Activation quantization: FP8, Dynamic
    • KV cache and attention quantization: FP8
  • Calibration Dataset: Pile

This model was built from openai/gpt-oss-120b by applying AMD-Quark for MXFP4 weight and FP8 activation quantization.
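
For intuition on the weight format: OCP MXFP4 stores each 32-element block with one shared power-of-two (E8M0) scale and one 4-bit E2M1 code per element. The following is a minimal illustrative sketch of that format, not AMD-Quark's implementation:

import numpy as np

# Positive E2M1 values; the full signed grid is symmetric around zero.
POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.unique(np.concatenate([-POS, POS]))

def mxfp4_quantize_block(block):
    """Quantize one 32-element block to a shared E8M0 scale + E2M1 codes."""
    assert block.size == 32
    amax = np.abs(block).max()
    if amax == 0:
        return 1.0, np.zeros_like(block)
    # The shared scale is a power of two chosen so the block fits the E2M1
    # range (largest E2M1 exponent is 2, i.e. max representable magnitude 6.0).
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    # Round each scaled element to the nearest representable E2M1 value
    # (values beyond +/-6 saturate to the grid endpoints).
    idx = np.abs(block / scale - E2M1_GRID[:, None]).argmin(axis=0)
    return scale, E2M1_GRID[idx]

w = np.random.randn(32).astype(np.float32)
scale, q = mxfp4_quantize_block(w)
print("max abs reconstruction error:", np.abs(w - scale * q).max())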

Model Quantization

The model was quantized from openai/gpt-oss-120b using AMD-Quark. The weights were quantized to MXFP4, and the activations were quantized to FP8. Attention and its KV cache were also quantized to FP8.
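
The activation side can be sketched similarly. Below is a minimal example of dynamic per-tensor FP8 (E4M3) quantization using PyTorch's float8_e4m3fn dtype; AMD-Quark's actual kernels and scaling granularity may differ:

import torch

# E4M3 max representable value is 448.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def fp8_dynamic_quant(x):
    """Per-tensor dynamic FP8: derive the scale from the tensor at runtime."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

x = torch.randn(4, 16)                     # stand-in for an activation tensor
x_fp8, scale = fp8_dynamic_quant(x)
x_hat = x_fp8.to(torch.float32) * scale    # dequantize
print("max abs error:", (x - x_hat).abs().max().item())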

Quantization Instructions:

Downloading base model:

hf download openai/gpt-oss-120b --local-dir /path/to/openai-gpt-oss-120b 
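
The same download can be done from Python with huggingface_hub, if preferred:

from huggingface_hub import snapshot_download

# Download the base checkpoint to the same placeholder path used above.
snapshot_download(
    repo_id="openai/gpt-oss-120b",
    local_dir="/path/to/openai-gpt-oss-120b",
)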

quantization_command.sh:

#!/bin/bash
# Keep the LM head and the MoE router layers in their original precision
# ("no_lmhead_router" in the output name).
exclude_layers="*lm_head* *router*"

# Quantize weights to MXFP4 and activations, attention, and KV cache to FP8,
# calibrating on 512 samples from the Pile dataset.
python3 quantize_quark.py \
    --model_dir /path/to/openai-gpt-oss-120b \
    --quant_scheme mxfp4_fp8 \
    --kv_cache_dtype fp8 \
    --attention_dtype fp8 \
    --exclude_layers $exclude_layers \
    --num_calib_data 512 \
    --output_dir /path/to/gpt-oss-120b-w-mxfp4-a-fp8-kv-fp8-fp8attn-no_lmhead_router \
    --model_export hf_format \
    --multi_gpu
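
The --exclude_layers patterns keep the LM head and MoE router layers un-quantized; they are presumably matched glob-style against module names, as in this hypothetical illustration (the layer names below are made up):

from fnmatch import fnmatch

patterns = ["*lm_head*", "*router*"]
layers = [
    "model.layers.0.mlp.router",       # matches "*router*"
    "model.layers.0.self_attn.q_proj", # matches nothing -> quantized
    "lm_head",                         # matches "*lm_head*"
]
for name in layers:
    excluded = any(fnmatch(name, p) for p in patterns)
    print(name, "-> excluded" if excluded else "-> quantized")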

Installing AMD-Quark and running the quantization:

wget https://download.amd.com/opendownload/Quark/amd_quark-0.11.1-py3-none-any.whl
pip install amd_quark-0.11.1-py3-none-any.whl
wget https://download.amd.com/opendownload/Quark/amd_quark-0.11.1.zip
unzip amd_quark-0.11.1.zip
cd amd_quark-0.11.1/examples/torch/language_modeling/llm_ptq
chmod +x quantization_command.sh
./quantization_command.sh
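
Once the script finishes, a quick sanity check can confirm the export. This assumes the hf_format export writes a quantization_config entry into the output config.json:

import json

# Adjust to match the --output_dir used above.
path = "/path/to/gpt-oss-120b-w-mxfp4-a-fp8-kv-fp8-fp8attn-no_lmhead_router/config.json"
with open(path) as f:
    config = json.load(f)

print(json.dumps(config.get("quantization_config", "missing"), indent=2))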

Evaluation

The model was evaluated on the AIME25 and GPQA Diamond benchmarks with low reasoning effort.

Accuracy

| Benchmark    | gpt-oss-120b | gpt-oss-120b-w-mxfp4-a-fp8-kv-fp8-fp8attn-no_lmhead_router (this model) | Recovery |
|--------------|--------------|-------------------------------------------------------------------------|----------|
| AIME25       | 65.25        | 47.91                                                                   | 71.37%   |
| GPQA Diamond | 51.67        | 64.64                                                                   | 125.10%  |
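
The Recovery column is, presumably, the quantized score expressed as a percentage of the BF16 baseline, e.g. for GPQA Diamond:

def recovery(quantized: float, baseline: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline

print(f"{recovery(64.64, 51.67):.2f}%")  # GPQA Diamond -> 125.10%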

Reproduction

The AIME25 and GPQA Diamond results were obtained using gpt_oss.evals with the low reasoning-effort setting and the vLLM Docker image rocm/vllm-dev:nightly.

Launching the server

vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-kv-fp8-fp8attn-no_lmhead_router \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --no-enable-prefix-caching \
  --max-num-batched-tokens 1024
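
Before launching the evals, the server can be sanity-checked from Python (vLLM exposes an OpenAI-compatible API on port 8000 by default):

import requests

# List the models the server is serving; the quantized model ID should appear.
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])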

Evaluating the model in a new terminal

python -m gpt_oss.evals --model amd/gpt-oss-120b-w-mxfp4-a-fp8-kv-fp8-fp8attn-no_lmhead_router --eval aime25,gpqa --reasoning-effort low --n-threads 128

License

Modifications Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.
