# Qwen3.6-27B Opus 4.6 Reasoning MLX 4-bit
This repository is an MLX 4-bit conversion of a merged Qwen3.6 vision-language model:
- Base model: `unsloth/Qwen3.6-27B`
- Upstream base family: `Qwen/Qwen3.6-27B`
- LoRA adapter: `kai-os/Qwen3.6-27b-Opus4.6-reasoning`
- MLX target library: `mlx-vlm`
The adapter was merged into the base Hugging Face checkpoint, then the merged checkpoint was converted to MLX and quantized to 4-bit affine weights.
## Model Details
- Architecture: `qwen3_5`
- Task: image-text-to-text
- Format: MLX safetensors
- Quantization: 4-bit affine
- Quantization group size: 64
- Uploaded weight shards: 3
- Total indexed weight storage: ~16.05 GB
- Conversion date: 2026-04-24
The final MLX config includes:

```json
"quantization": {
  "group_size": 64,
  "bits": 4,
  "mode": "affine"
}
```
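As a back-of-envelope check of what this configuration costs, each group of 64 weights stores its own scale and bias on top of the 4-bit values. Assuming 16-bit scales and biases (a sketch, not a dump of the actual MLX storage layout), the overhead works out as follows:

```python
# Amortized bits per weight for 4-bit affine quantization, group size 64.
# Assumes one 16-bit scale and one 16-bit bias per group; the real MLX
# storage layout may differ slightly.
bits, group_size = 4, 64
overhead = (16 + 16) / group_size  # scale + bias amortized over the group
print(bits + overhead)  # 4.5 bits per weight for the quantized layers
```

The conversion log's figure of 4.695 bits per weight (see Verification Performed below) is a little higher, which is consistent with some tensors being kept in higher precision.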
## Conversion Process
The conversion pipeline was (a code sketch of the merge steps follows the list):

- Load `unsloth/Qwen3.6-27B` with Transformers.
- Load `kai-os/Qwen3.6-27b-Opus4.6-reasoning` as a PEFT LoRA adapter.
- Merge the adapter into the base model with `merge_and_unload`.
- Save the merged Hugging Face checkpoint as safetensors.
- Copy the tokenizer, image processor, video processor, and chat-template assets.
- Convert the merged checkpoint with `mlx_vlm.convert`.
- Quantize to 4-bit affine weights.
- Upload the resulting MLX checkpoint to this repository.
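For reference, a minimal sketch of the merge steps, assuming the standard Transformers and PEFT APIs; the exact auto class for the Qwen3.6 VLM architecture may differ, so treat the class name below as a placeholder:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

base_id = "unsloth/Qwen3.6-27B"
adapter_id = "kai-os/Qwen3.6-27b-Opus4.6-reasoning"

# Load the base VLM in bf16, attach the LoRA adapter, and fold the
# adapter weights into the base weights.
base = AutoModelForImageTextToText.from_pretrained(base_id, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()

# Save the merged checkpoint as safetensors, plus the processor assets
# (tokenizer, image/video processors, chat template).
merged.save_pretrained("merged-bf16", safe_serialization=True)
AutoProcessor.from_pretrained(base_id).save_pretrained("merged-bf16")
```

The merged directory was then passed to `mlx_vlm.convert` with 4-bit quantization enabled; check `python -m mlx_vlm.convert --help` for the exact quantization flags in your installed version.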
The conversion was run on Google Colab Pro using an A100 runtime. The MLX conversion used the Linux CUDA backend (`mlx[cuda12]`) with `mlx-vlm` 0.4.4.
## Files
Key files in this repo:

- `config.json`: MLX/VLM architecture and quantization configuration.
- `model.safetensors.index.json`: weight index, including the total quantized storage size.
- `model-00001-of-00003.safetensors`, `model-00002-of-00003.safetensors`, `model-00003-of-00003.safetensors`: quantized MLX weight shards.
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `added_tokens.json`: tokenizer assets.
- `preprocessor_config.json`, `processor_config.json`, `video_preprocessor_config.json`: VLM processor assets.
- `chat_template.jinja`: chat template copied from the source model assets.
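To verify the reported storage figure locally, the total size can be read straight from the weight index; the `metadata.total_size` field is the standard Hugging Face safetensors index layout, assumed here to apply to this repo's index as well:

```python
import json

# Read the sharded-weights index and report the total quantized size.
with open("model.safetensors.index.json") as f:
    index = json.load(f)

total_bytes = index["metadata"]["total_size"]
print(f"{total_bytes / 1024**3:.2f} GiB")  # should come out near the ~16.05 GB listed above
```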
## Usage
Install `mlx-vlm`:

```bash
pip install -U mlx-vlm
```
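Because this conversion was produced and smoke-tested with `mlx-vlm` 0.4.4, it can be worth confirming which version is installed before debugging any API mismatch:

```python
# Print the installed mlx-vlm version; this repo was converted with 0.4.4.
import importlib.metadata

print(importlib.metadata.version("mlx-vlm"))
```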
Run an image-text prompt:

```bash
python -m mlx_vlm.generate \
  --model micic-mihajlo/Qwen3.6-27B-Opus4.6-Reasoning-MLX-4bit \
  --prompt "Describe this image." \
  --image /path/to/image.jpg \
  --max-tokens 256 \
  --temperature 0.0
```
Run a text-only prompt through the VLM entrypoint:

```bash
python -m mlx_vlm.generate \
  --model micic-mihajlo/Qwen3.6-27B-Opus4.6-Reasoning-MLX-4bit \
  --prompt "Solve: If x^2 - 5x + 6 = 0, what are the possible values of x?" \
  --max-tokens 256 \
  --temperature 0.0
```
Python example:

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "micic-mihajlo/Qwen3.6-27B-Opus4.6-Reasoning-MLX-4bit"
model, processor = load(model_path)

messages = [
    {
        "role": "user",
        "content": "Explain the difference between LoRA merging and quantization.",
    }
]

prompt = apply_chat_template(processor, model.config, messages)
response = generate(model, processor, prompt, max_tokens=256, temperature=0.0)
print(response)
```
If the Python API changes in a future mlx-vlm release, prefer the CLI example above or consult the installed mlx-vlm documentation for the current load and generate signatures.
## Intended Use
This model is intended for local MLX inference on Apple Silicon or compatible MLX environments. It should be useful for:
- Reasoning-heavy chat and instruction following.
- Math/code-style reasoning inherited from the LoRA adapter.
- Image-text tasks supported by the Qwen3.6 VLM architecture.
- Experiments comparing the merged adapter against the base Qwen3.6 model.
## Limitations
- This is a quantized 4-bit model. Quality may differ from the merged BF16 checkpoint.
- This is a community conversion and has not been exhaustively benchmarked.
- The upstream adapter metadata includes reasoning/math/code tags, but no new evaluation suite is included in this repository.
- The model can still hallucinate and should not be used as an authoritative source for high-stakes decisions.
- The model inherits the base model's multimodal behavior and the adapter's fine-tuning behavior; review both upstream repos for more context.
## Verification Performed
During conversion, the following checks completed:
- The LoRA adapter loaded successfully on top of `unsloth/Qwen3.6-27B`.
- The adapter merge completed successfully.
- The merged checkpoint was saved as Hugging Face safetensors.
- `mlx-vlm` loaded the merged checkpoint for conversion.
- 4-bit affine quantization completed.
- The Hub upload completed successfully.
- The uploaded repo contains the expected config, tokenizer, processor, and three safetensors shard files.
The conversion log reported:

```
Quantized model with 4.695 bits per weight.
Upload successful
```
After upload, the model was smoke-tested locally on Apple Silicon with `mlx-vlm`, `torch`, and `torchvision` installed for processor support:

- Text-only reasoning prompt: solved a chickens/cows system correctly as 15 chickens and 11 cows.
- Image-text prompt: correctly identified a generated test image containing a red square and a blue circle.
- Observed local peak memory: about 16.4 GB for text-only generation and about 16.8 GB for the simple image-text test.
- Observed local generation speed: about 16 tokens/sec for these short smoke tests.
## Lineage and Licensing
This is a derived model built from:

- `Qwen/Qwen3.6-27B` / `unsloth/Qwen3.6-27B`, licensed under Apache-2.0.
- `kai-os/Qwen3.6-27b-Opus4.6-reasoning`, a PEFT LoRA adapter with MIT metadata.
Because the base model is Apache-2.0 and the adapter is MIT, downstream users should comply with the Apache-2.0 terms of the base model and the MIT terms of the adapter. This model card uses `apache-2.0` in the Hub metadata to reflect the base model license.
## Acknowledgements
Thanks to the Qwen, Unsloth, kai-os, MLX, and mlx-vlm maintainers. This repository is only a merged and quantized MLX port; the original modeling and adapter work belongs to the upstream authors.