Qwen3.6-27B Opus 4.6 Reasoning MLX 4-bit

This repository is an MLX 4-bit conversion of a merged Qwen3.6 vision-language model: the kai-os/Qwen3.6-27b-Opus4.6-reasoning LoRA adapter merged into the unsloth/Qwen3.6-27B base checkpoint.

The adapter was merged into the base Hugging Face checkpoint, then the merged checkpoint was converted to MLX and quantized to 4-bit affine weights.

Model Details

  • Architecture: qwen3_5
  • Task: image-text-to-text
  • Format: MLX safetensors
  • Quantization: 4-bit affine
  • Quantization group size: 64
  • Uploaded weight shards: 3
  • Total indexed weight storage: ~16.05 GB
  • Conversion date: 2026-04-24

The final MLX config includes:

"quantization": {
  "group_size": 64,
  "bits": 4,
  "mode": "affine"
}
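The affine mode above means each group of 64 weights shares one scale and one bias (zero point), with each weight stored as a 4-bit code. A minimal illustrative sketch of this scheme in NumPy (not MLX's actual kernel, just the idea):

```python
import numpy as np

def quantize_group(w, bits=4):
    """Affine-quantize one group of weights: 4-bit codes + scale + bias."""
    qmax = 2**bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((w - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo  # stored per group: codes, scale, bias

def dequantize_group(q, scale, bias):
    """Reconstruct approximate weights from codes, scale, and bias."""
    return q.astype(np.float32) * scale + bias

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)  # one group of size 64
q, scale, bias = quantize_group(w)
w_hat = dequantize_group(q, scale, bias)
# Round-to-nearest keeps the per-weight error within about scale/2.
print(float(np.abs(w - w_hat).max()))
```

Because the scale and bias are chosen from the group's min and max, the reconstruction error is bounded by roughly half a quantization step per weight.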

Conversion Process

The conversion pipeline was:

  1. Load unsloth/Qwen3.6-27B with Transformers.
  2. Load kai-os/Qwen3.6-27b-Opus4.6-reasoning as a PEFT LoRA adapter.
  3. Merge the adapter into the base model with merge_and_unload.
  4. Save the merged Hugging Face checkpoint as safetensors.
  5. Copy tokenizer, image processor, video processor, and chat-template assets.
  6. Convert the merged checkpoint with mlx_vlm.convert.
  7. Quantize to 4-bit affine weights.
  8. Upload the resulting MLX checkpoint to this repository.
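Steps 1-5 above can be sketched as a single helper. This is a hypothetical illustration of the merge, not the exact script used; `merge_adapter` and its arguments are names chosen here, and the auto classes assume a recent Transformers/PEFT install:

```python
def merge_adapter(base_id: str, adapter_id: str, out_dir: str) -> None:
    """Merge a PEFT LoRA adapter into its base VLM and save safetensors.

    Hypothetical helper illustrating steps 1-5 of the pipeline above.
    """
    # Imports kept local so the sketch can be read without the deps installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor
    from peft import PeftModel

    base = AutoModelForImageTextToText.from_pretrained(
        base_id, torch_dtype="bfloat16"
    )
    # Fold the LoRA deltas into the base weights, dropping the adapter wrapper.
    merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()
    merged.save_pretrained(out_dir, safe_serialization=True)
    # Copy tokenizer / image-processor / chat-template assets alongside.
    AutoProcessor.from_pretrained(base_id).save_pretrained(out_dir)

# merge_adapter("unsloth/Qwen3.6-27B",
#               "kai-os/Qwen3.6-27b-Opus4.6-reasoning",
#               "merged-checkpoint")
```

The merged `out_dir` is then what `mlx_vlm.convert` consumes in steps 6-7.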

The conversion was run on Google Colab Pro using an A100 runtime. The MLX conversion used the Linux CUDA backend (mlx[cuda12]) with mlx-vlm 0.4.4.

Files

Key files in this repo:

  • config.json: MLX/VLM architecture and quantization configuration.
  • model.safetensors.index.json: weight index, including total quantized storage size.
  • model-00001-of-00003.safetensors, model-00002-of-00003.safetensors, model-00003-of-00003.safetensors: quantized MLX weight shards.
  • tokenizer.json, tokenizer_config.json, special_tokens_map.json, added_tokens.json: tokenizer assets.
  • preprocessor_config.json, processor_config.json, video_preprocessor_config.json: VLM processor assets.
  • chat_template.jinja: chat template copied from the source model assets.
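The index file is also a convenient way to verify the download: its `metadata.total_size` field is the ~16.05 GB figure quoted above, and its `weight_map` names the three shards. A small sketch (the real call would pass the parsed `model.safetensors.index.json`; the `demo` dict below is synthetic):

```python
import json

def index_summary(index: dict):
    """Total byte size and sorted shard list from a safetensors index."""
    total = index["metadata"]["total_size"]
    shards = sorted(set(index["weight_map"].values()))
    return total, shards

# Real usage, once the repo is downloaded locally:
#   with open("model.safetensors.index.json") as f:
#       total, shards = index_summary(json.load(f))
# Tiny synthetic index for illustration only:
demo = {
    "metadata": {"total_size": 16_050_000_000},
    "weight_map": {
        "a.weight": "model-00001-of-00003.safetensors",
        "b.weight": "model-00002-of-00003.safetensors",
        "c.weight": "model-00003-of-00003.safetensors",
    },
}
total, shards = index_summary(demo)
print(f"{total / 1e9:.2f} GB across {len(shards)} shards")
```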

Usage

Install mlx-vlm:

pip install -U mlx-vlm

Run an image-text prompt:

python -m mlx_vlm.generate \
  --model micic-mihajlo/Qwen3.6-27B-Opus4.6-Reasoning-MLX-4bit \
  --prompt "Describe this image." \
  --image /path/to/image.jpg \
  --max-tokens 256 \
  --temperature 0.0

Run a text-only prompt through the VLM entrypoint:

python -m mlx_vlm.generate \
  --model micic-mihajlo/Qwen3.6-27B-Opus4.6-Reasoning-MLX-4bit \
  --prompt "Solve: If x^2 - 5x + 6 = 0, what are the possible values of x?" \
  --max-tokens 256 \
  --temperature 0.0

Python example:

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "micic-mihajlo/Qwen3.6-27B-Opus4.6-Reasoning-MLX-4bit"
model, processor = load(model_path)

messages = [
    {
        "role": "user",
        "content": "Explain the difference between LoRA merging and quantization.",
    }
]

prompt = apply_chat_template(processor, model.config, messages)
response = generate(model, processor, prompt, max_tokens=256, temperature=0.0)
print(response)

The `generate` keyword arguments shown here (`max_tokens`, `temperature`) match recent mlx-vlm releases; older or newer versions may name them differently.

If the Python API changes in a future mlx-vlm release, prefer the CLI example above or consult the installed mlx-vlm documentation for the current load and generate signatures.

Intended Use

This model is intended for local MLX inference on Apple Silicon or compatible MLX environments. It should be useful for:

  • Reasoning-heavy chat and instruction following.
  • Math/code-style reasoning inherited from the LoRA adapter.
  • Image-text tasks supported by the Qwen3.6 VLM architecture.
  • Experiments comparing the merged adapter against the base Qwen3.6 model.

Limitations

  • This is a quantized 4-bit model. Quality may differ from the merged BF16 checkpoint.
  • This is a community conversion and has not been exhaustively benchmarked.
  • The upstream adapter metadata includes reasoning/math/code tags, but no new evaluation suite is included in this repository.
  • The model can still hallucinate and should not be used as an authoritative source for high-stakes decisions.
  • The model inherits the base model's multimodal behavior and the adapter's fine-tuning behavior; review both upstream repos for more context.

Verification Performed

During conversion, the following checks all passed:


  • The LoRA adapter loaded successfully on top of unsloth/Qwen3.6-27B.
  • The adapter merge completed successfully.
  • The merged checkpoint was saved as Hugging Face safetensors.
  • mlx-vlm loaded the merged checkpoint for conversion.
  • 4-bit affine quantization completed.
  • The Hub upload completed successfully.
  • The uploaded repo contains the expected config, tokenizer, processor, and three safetensor shard files.

The conversion log reported:

Quantized model with 4.695 bits per weight.
Upload successful
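The reported 4.695 bits per weight can be sanity-checked with quick arithmetic. Assuming (as is typical for MLX affine quantization) a 16-bit scale and a 16-bit bias stored per group of 64 weights:

```python
# Back-of-the-envelope check of the reported bits-per-weight figure.
bits, group_size = 4, 64
# Scale + bias metadata amortized over the group (assumption: 16-bit each).
overhead = (16 + 16) / group_size
per_weight = bits + overhead
print(per_weight)  # 4.5 bits for the quantized tensors themselves
```

The quantized tensors land at 4.5 bits per weight; the reported 4.695 is slightly higher because some tensors (e.g. norms) are kept in BF16, as reflected in the tensor types of the uploaded shards.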

After upload, the model was smoke-tested locally on Apple Silicon with mlx-vlm, torch, and torchvision installed for processor support:

  • Text-only reasoning prompt: solved a chickens/cows system correctly as 15 chickens and 11 cows.
  • Image-text prompt: correctly identified a generated test image containing a red square and a blue circle.
  • Observed local peak memory: about 16.4 GB for text-only generation and about 16.8 GB for the simple image-text test.
  • Observed local generation speed: about 16 tokens/sec for these short smoke tests.

Lineage and Licensing

This is a derived model built from:

  • Qwen/Qwen3.6-27B / unsloth/Qwen3.6-27B, licensed under Apache-2.0.
  • kai-os/Qwen3.6-27b-Opus4.6-reasoning, a PEFT LoRA adapter with MIT metadata.

Because the base model is Apache-2.0 and the adapter is MIT, downstream users should comply with the Apache-2.0 terms of the base model and the MIT terms of the adapter. This model card uses apache-2.0 in the Hub metadata to reflect the base model license.

Acknowledgements

Thanks to the Qwen, Unsloth, kai-os, MLX, and mlx-vlm maintainers. This repository is only a merged and quantized MLX port; the original modeling and adapter work belongs to the upstream authors.
