arxiv:2509.24244

Model Merging Scaling Laws in Large Language Models

Published on May 11 · Submitted by Yuanyi Wang on May 12
Abstract

Empirical scaling laws for language model merging reveal power-law relationships between model size, expert count, and cross-entropy performance, enabling predictive planning for optimal model composition.

AI-generated summary

We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget, turning merging from heuristic practice into a computationally efficient, plannable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
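The predictive-planning idea can be sketched with a floor-plus-tail form in which gains fall as 1/k, as the abstract describes. The exact functional form and all numeric coefficients below are illustrative assumptions, not the paper's fitted values:

```python
import math

def merged_loss(k: int, floor: float, tail: float) -> float:
    """Hypothetical floor-plus-tail law: predicted cross-entropy after
    merging k experts. `floor` is the size-dependent floor, `tail`
    controls the 1/k diminishing-returns term. Coefficients are made up."""
    if k < 1:
        raise ValueError("need at least one expert")
    return floor + tail / k

def experts_needed(target_loss: float, floor: float, tail: float):
    """Smallest k whose predicted loss is at or below target_loss,
    or None if the target lies below the floor and is unreachable."""
    if target_loss <= floor:
        return None  # adding experts can never push loss below the floor
    return max(1, math.ceil(tail / (target_loss - floor)))

# Hypothetical coefficients for one base model size:
floor, tail = 2.0, 0.5
print(merged_loss(1, floor, tail))          # 2.5   (one expert)
print(merged_loss(4, floor, tail))          # 2.125 (diminishing returns)
print(experts_needed(2.25, floor, tail))    # 2
print(experts_needed(1.9, floor, tail))     # None  (below the floor)
```

Under this form, most of the gain comes from the first few experts (the 1/k tail flattens quickly), which is the planning question the law answers: when to stop adding experts versus scaling the base model.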

Community

Paper author Paper submitter

Can we predict the returns of language model merging before trying every expert combination?

We study scaling laws for LLM merging and find a compact floor-plus-tail law that predicts merged-model cross-entropy from base model size and the number of merged experts.

Across 10,866 merged models, 0.5B–72B base sizes, nine domains, and four merging methods (Average, TA, TIES, DARE), we observe consistent regularities: most gains come from the first few experts, larger models are easier to merge, variance contracts with more experts, and method differences shrink at scale.

This turns model merging from a mostly empirical trial-and-error procedure into a predictable, budget-aware alternative to multitask fine-tuning.


