1. M2N2: Mathematical Foundations
Suppose you have \(K\) seed models, each with parameter vector \(\theta_i\), \(i=1, ..., K\) (for example, these could be all the weights of a neural network flattened into a vector).
Goal: Find merged model parameters \(\theta^*\) that maximize fitness over a dataset: \(\theta^* = \arg\max_\theta \sum_{j=1}^N s(x_j \mid \theta)\) where \(s(x_j \mid \theta)\) is a “score” (like accuracy on sample \(x_j\)), and \(N\) is the number of examples.
Merging is parameterized by a merging function \(h_w\) that defines how parameters are split and combined: \(\theta = h_w(\theta_1, ..., \theta_K)\).
1.1. Dynamic Merging Boundaries
Instead of hand-selecting which layers or regions to merge, M2N2 evolves both the fusion coefficients and the splitting boundaries. For two models \(\theta_A, \theta_B\), a mixing coefficient \(w_m \in [0, 1]\), and a split index \(w_s\):[1]
\[h_\text{M2N2}(\theta_A, \theta_B, w_m, w_s) = \text{concat}\Big( f_{w_m}(\theta_A^{<w_s}, \theta_B^{<w_s}),\ f_{1-w_m}(\theta_A^{\ge w_s}, \theta_B^{\ge w_s}) \Big)\]
- \(\theta^{<w_s}\): model parameters before split-point \(w_s\)
- \(\theta^{\ge w_s}\): model parameters after split-point \(w_s\)
- \(f_t(\theta_A, \theta_B)\): Spherical Linear Interpolation (SLERP) between \(\theta_A\) and \(\theta_B\) with weight \(t\)
SLERP (Spherical Linear Interpolation)
Given vectors \(a\), \(b\) and interpolation parameter \(t\): \(\operatorname{SLERP}(a, b, t) = \frac{\sin((1-t)\Omega)}{\sin \Omega}\, a + \frac{\sin(t\Omega)}{\sin \Omega}\, b\), where \(\Omega = \arccos\left(\frac{a^\top b}{\|a\|\,\|b\|}\right)\) is the angle between the vectors. Unlike plain linear interpolation, SLERP follows the arc between \(a\) and \(b\), so intermediate points do not collapse toward the origin when the endpoints are far apart.
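As a quick sanity check, take the orthogonal unit vectors \(a = (1, 0)\) and \(b = (0, 1)\), so \(\Omega = \pi/2\):

\[\operatorname{SLERP}\!\left(a, b, \tfrac12\right) = \frac{\sin(\pi/4)}{\sin(\pi/2)}\, a + \frac{\sin(\pi/4)}{\sin(\pi/2)}\, b = \left(\tfrac{\sqrt{2}}{2},\ \tfrac{\sqrt{2}}{2}\right),\]

which still has unit norm, whereas the linear midpoint \(\left(\tfrac12, \tfrac12\right)\) has norm \(\tfrac{\sqrt{2}}{2} \approx 0.71\).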
1.2. Diversity by Resource Competition
Introduce diversity by controlling how much “reward” any given data point can contribute, simulating competition for a limited resource: let \(z_j = \sum_{k=1}^P s(x_j|\theta_k)\) for population size \(P\), and let \(c_j\) be the “capacity” of point \(j\) (e.g., 1 for a binary score).
\[\text{Fitness:}\qquad f(\theta_i) = \sum_{j=1}^N \frac{ s(x_j|\theta_i) } {z_j^\alpha + \epsilon }\, c_j\]
- \(\alpha\): competition intensity; \(\epsilon\) is a small constant that prevents division by zero.
- When \(\alpha = 0\): no competition
- When \(\alpha = 1\): fixed capacity per data point
- When \(\alpha > 1\): “fighting” scenario; higher competition.
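To see the effect of \(\alpha\) concretely, here is a toy sketch with hypothetical binary scores for three models on four data points (the numbers are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical score matrix: s[k, j] = 1 if model k solves data point j.
s = np.array([
    [1, 1, 1, 0],   # model 0: generalist, solves the three easy points
    [1, 1, 0, 0],   # model 1: overlaps with model 0
    [0, 0, 0, 1],   # model 2: the only model solving the hard point
])
c = np.ones(4)            # capacity c_j = 1 per data point
z = s.sum(axis=0)         # z_j = total population score on point j
eps = 1e-8

def fitness(k, alpha):
    # Competition-discounted fitness of model k
    return float(np.sum(s[k] / (z ** alpha + eps) * c))
```

With \(\alpha = 0\) the generalist dominates (raw accuracy), but with \(\alpha = 1\) the shared points are split among their solvers, so the specialist (model 2) ties with model 1 despite solving fewer points. This is exactly the niching pressure that keeps diverse specialists alive.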
1.3. Attraction: Pair Selection for Merging
When merging, don’t pick parents at random: pair “complementary” models (those strong where the other is weak):
\[g(\theta_A, \theta_B) = \sum_{j=1}^N \frac{c_j}{z_j + \epsilon} \cdot \max\big( s(x_j|\theta_B) - s(x_j|\theta_A),\ 0 \big)\]
This biases selection toward partners \(\theta_B\) that score well on exactly the samples where \(\theta_A\) is weak, focusing merging pressure on complementarity.
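A toy sketch of the attraction score, again with hypothetical binary scores (three models, four data points), showing that a fully complementary partner attracts more strongly than a half-overlapping one:

```python
import numpy as np

# Hypothetical score matrix: rows = models, columns = data points.
s = np.array([
    [1, 1, 0, 0],   # theta_A: strong on the first two points
    [0, 0, 1, 1],   # theta_B: strong exactly where A is weak
    [1, 0, 1, 0],   # theta_C: half-overlaps A
])
c = np.ones(4)
z = s.sum(axis=0)         # z_j = total population score on point j
eps = 1e-8

def attraction(a, b):
    # g(theta_a, theta_b): capacity-weighted gain of b over a
    gain = np.maximum(s[b] - s[a], 0.0)
    return float(np.sum(c / (z + eps) * gain))
```

Here `attraction(0, 1)` exceeds `attraction(0, 2)`, so \(\theta_B\) is the preferred merge partner for \(\theta_A\).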
2. Pseudocode Outline
Here is a high-level Python-like pseudocode for M2N2:
```python
import numpy as np

def slerp(a, b, t):
    # Spherical linear interpolation between a and b
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b)
    dot = np.clip(np.dot(a_norm, b_norm), -1.0, 1.0)
    omega = np.arccos(dot)
    if np.isclose(omega, 0):
        return (1 - t) * a + t * b  # vectors (nearly) parallel: fall back to lerp
    return (np.sin((1 - t) * omega) / np.sin(omega)) * a + (np.sin(t * omega) / np.sin(omega)) * b

def merge_models(theta_A, theta_B, w_m, w_s):
    # w_s: int split index; w_m: float in [0, 1]
    merged = np.empty_like(theta_A)
    merged[:w_s] = slerp(theta_A[:w_s], theta_B[:w_s], w_m)
    merged[w_s:] = slerp(theta_A[w_s:], theta_B[w_s:], 1 - w_m)
    return merged

def fitness(theta, X, s, population, alpha, c):
    # X: data points, s: scoring function, population: list of models
    score = 0.0
    for j, xj in enumerate(X):
        z_j = sum(s(xj, p) for p in population)
        score += s(xj, theta) / (z_j ** alpha + 1e-8) * c[j]
    return score

def attraction(theta_A, theta_B, X, s, population, c):
    # How much theta_B gains on points where theta_A is weak
    g = 0.0
    for j, xj in enumerate(X):
        z_j = sum(s(xj, p) for p in population)
        d = max(s(xj, theta_B) - s(xj, theta_A), 0)
        g += c[j] / (z_j + 1e-8) * d
    return g
```
3. Algorithm: Full Evolutionary Loop
Input: List of initial models \(\{\theta_i\}\), dataset \(X\), score function \(s\), population size \(P\), number of generations \(T\), capacity list \(c\), hyperparams \((\alpha, ...)\).
- Initialize population:
- Randomly initialize \(P\) models.
- For each generation (or until convergence):
- For each candidate offspring:
- Select Parent A: Sample from population weighted by fitness.
- Select Parent B: Sample from population using attraction score relative to Parent A.
- Choose merging parameters: Randomly pick the mixing coefficient \(w_m \in [0, 1]\) and the split point \(w_s \in [1, N_\text{params} - 1]\).[1]
- Merge:
- \(\theta_\text{child} = \text{merge\_models}(\theta_A, \theta_B, w_m, w_s)\).
- Evaluate fitness: On held-out data, or using Eq. above.
- Archive update: If \(\theta_\text{child}\) outperforms the worst member, insert (possibly removing weak models).
- Periodically, update population diversity statistics, archive, and hyperparameters if desired.
- Output:
- Archive with diverse and high-fitness models; best model(s) as per evaluation.
4. Minimal Example: Evolving MNIST Classifiers
Suppose each θ is the weights of a small MLP classifier on MNIST.
Score function:
Let’s define \(s(x_j | \theta) = 1\) if \(\operatorname{argmax}_k f_\theta(x_j) = y_j\) (correct label), else 0.
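A minimal sketch of this 0/1 score, assuming a hypothetical `forward(theta, x)` helper that returns class logits for input `x` under parameters `theta`:

```python
import numpy as np

def score(x, y, theta, forward):
    # Returns 1.0 if the predicted class matches the label y, else 0.0.
    # `forward` is a hypothetical model-evaluation function.
    logits = forward(theta, x)
    return 1.0 if int(np.argmax(logits)) == int(y) else 0.0
```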
Merge operation:
```python
# theta_A, theta_B: network parameters as flat numpy arrays
# w_m: e.g., 0.5 for midpoint interpolation
# w_s: split at 60% of the parameter vector
w_s = int(0.6 * theta_A.shape[0])
child = merge_models(theta_A, theta_B, w_m, w_s)
```
Evolutionary loop (simplified):
```python
population = [random_init_theta() for _ in range(P)]
for gen in range(T):
    parent_A = select_parent_fitness_weighted(population, X, s, ...)
    parent_B = select_parent_attraction_weighted(population, parent_A, X, s, ...)
    w_m, w_s = np.random.uniform(), np.random.randint(1, theta_size)
    child = merge_models(parent_A, parent_B, w_m, w_s)
    if fitness(child, X, s, population, alpha, c) > worst_fitness_in_population:
        replace_worst(child)
```
5. Special Cases & Real-World Scaling
- For LLM/Diffusion merging, operate on flat vectors or logical blocks (e.g., transformer layers, U-Net blocks), merge accordingly.
- For large pretrained models, skip the mutation step: random perturbations tend to break pretrained weights, so variation comes from merging alone.
- Archive size and competition parameter (\(\alpha\)) adjust diversity/exploration.
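For block-wise merging, one way to apply the same split-and-SLERP recipe over named parameter blocks (e.g., transformer layers) instead of a flat vector; the dict layout and helper names here are illustrative, not the official implementation:

```python
import numpy as np

def slerp(a, b, t):
    # Spherical interpolation on arrays, with a lerp fallback for parallel vectors
    a_norm, b_norm = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_norm.ravel(), b_norm.ravel()), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

def merge_blockwise(state_A, state_B, w_m, split_block):
    # state_*: dict of block name -> np.ndarray, in a fixed layer order.
    # Blocks before split_block use mixing w_m; the rest use 1 - w_m,
    # mirroring the flat-vector merge_models above.
    merged = {}
    for i, name in enumerate(state_A):
        t = w_m if i < split_block else 1 - w_m
        merged[name] = slerp(state_A[name], state_B[name], t)
    return merged
```

With `w_m = 0` and `split_block = 1`, the first block comes entirely from model A and the rest entirely from model B, i.e., a pure layer swap.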
6. References & Real Implementation
- Official code: SakanaAI/natural_niches (GECCO 2025 code for M2N2)
- Paper: arXiv:2508.16204
The mathematical details, pseudocode, and algorithm design above should give you a precise starting point for implementing or extending M2N2 in any deep learning or neuroevolution project.