1. M2N2: Mathematical Foundations

Suppose you have \(K\) seed models, each with parameter vector \(\theta_i\), \(i=1, ..., K\) (for example, these could be all the weights of a neural network flattened into a vector).

Goal: Find merged model parameters \(\theta^*\) that maximize fitness over a dataset: \(\theta^* = \arg\max_\theta \sum_{j=1}^N s(x_j \mid \theta)\) where \(s(x_j \mid \theta)\) is a “score” (like accuracy on sample \(x_j\)), and \(N\) is the number of examples.

Merging is parameterized by a merging function \(h_w\) that defines how parameters are split and combined: \(\theta = h_w(\theta_1, \ldots, \theta_K)\).


1.1. Dynamic Merging Boundaries

Instead of hand-selecting which layers or regions to merge, M2N2 evolves both the fusion coefficients and the splitting boundaries. For two models \(\theta_A, \theta_B\), a mixing coefficient \(w_m \in [0, 1]\), and a split index \(w_s\):[1]

\[h_\text{M2N2}(\theta_A, \theta_B, w_m, w_s) = \text{concat}\Big( f_{w_m}(\theta_A^{<w_s}, \theta_B^{<w_s}),\ f_{1-w_m}(\theta_A^{\ge w_s}, \theta_B^{\ge w_s}) \Big)\]

where \(f_t\) is an elementwise interpolation function with coefficient \(t\) (SLERP, defined next), \(\theta^{<w_s}\) denotes the parameters before the split index, and \(\theta^{\ge w_s}\) those at or after it.
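To make the split-and-mix operation concrete, here is a tiny numeric illustration, with plain linear interpolation standing in for \(f\) (the values and the linear choice are illustrative only, not part of M2N2):

```python
import numpy as np

theta_A = np.array([1.0, 1.0, 1.0, 1.0])
theta_B = np.array([0.0, 0.0, 0.0, 0.0])
w_m, w_s = 0.25, 2  # mixing coefficient and split index

# First segment mixed with coefficient w_m, second with 1 - w_m
child = np.concatenate([
    (1 - w_m) * theta_A[:w_s] + w_m * theta_B[:w_s],
    w_m * theta_A[w_s:] + (1 - w_m) * theta_B[w_s:],
])
print(child)  # [0.75 0.75 0.25 0.25]
```

Note how the two halves of the child lean toward different parents; evolving \(w_s\) lets the algorithm discover useful boundaries instead of fixing them at layer edges.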

SLERP (Spherical Linear Interpolation)

Given vectors \(a\), \(b\) and interpolation parameter \(t\): \(\operatorname{SLERP}(a, b, t) = \frac{\sin((1-t)\Omega)}{\sin \Omega}\, a + \frac{\sin(t\Omega)}{\sin \Omega}\, b\), where \(\Omega = \arccos\left(\frac{a^\top b}{\|a\|\|b\|}\right)\). Unlike linear interpolation, which shrinks the norm of non-parallel vectors, SLERP follows the sphere between them, so the magnitude of the merged weights stays stable.
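A quick numeric check of that norm-preservation property (illustrative only; the two-dimensional vectors are hypothetical):

```python
import numpy as np

def slerp(a, b, t):
    # Spherical linear interpolation, falling back to LERP for (near-)parallel vectors
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if np.isclose(omega, 0):
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid_slerp = slerp(a, b, 0.5)
mid_lerp = 0.5 * a + 0.5 * b
print(np.linalg.norm(mid_slerp))  # ≈ 1.0 (norm preserved)
print(np.linalg.norm(mid_lerp))   # ≈ 0.707 (norm shrinks)
```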


1.2. Diversity by Resource Competition

Introduce diversity by controlling how much “reward” any given data point can contribute (simulating competition for limited resources). Let \(z_j = \sum_{k=1}^P s(x_j \mid \theta_k)\) for population size \(P\), and let \(c_j\) be the “capacity” of sample \(j\) (e.g., 1 for a binary score).

\[\text{Fitness:}\qquad f(\theta_i) = \sum_{j=1}^N \frac{ s(x_j|\theta_i) } {z_j^\alpha + \epsilon }\, c_j\]
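To make the competition term concrete, here is a toy computation (the binary scores are hypothetical): a sample solved by many models is worth a fraction of its capacity to each, while a sample solved by only one model pays that model in full.

```python
import numpy as np

# Binary scores s(x_j | theta_k) for P=2 models over N=3 samples (hypothetical)
S = np.array([[1, 1, 0],   # model 0
              [1, 0, 1]])  # model 1
alpha, eps = 1.0, 1e-8
c = np.ones(3)             # capacity c_j = 1 per sample

z = S.sum(axis=0)          # z_j: total score earned on sample j -> [2, 1, 1]
fitness = (S / (z**alpha + eps) * c).sum(axis=1)
print(fitness)  # ≈ [1.5, 1.5]: shared sample worth 0.5 each, unique samples worth 1.0
```

Because uncontested samples pay more, models that specialize on hard, rarely-solved inputs survive alongside generalists.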

1.3. Attraction: Pair Selection for Merging

When merging, don’t pick parents at random: pair “complementary” models (those strong where the other is weak):

\(g(\theta_A, \theta_B) = \sum_{j=1}^N \frac{c_j}{z_j + \epsilon} \cdot \max\big( s(x_j|\theta_B) - s(x_j|\theta_A),\ 0 \big)\). This biases selection toward partners \(\theta_B\) that succeed on samples where \(\theta_A\) fails, concentrating merging pressure on complementarity; samples that few models in the population solve (small \(z_j\)) count for more.
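A self-contained toy computation of the attraction score (hypothetical binary scores, with a population of just the two candidates):

```python
import numpy as np

# Hypothetical binary scores for two candidate parents over N=4 samples
s_A = np.array([1, 1, 0, 0])
s_B = np.array([1, 0, 1, 1])
z = s_A + s_B              # population-wide score per sample (population = {A, B} here)
c, eps = np.ones(4), 1e-8

# g(A, B): how much B succeeds exactly where A fails, discounted on contested samples
g_AB = np.sum(c / (z + eps) * np.maximum(s_B - s_A, 0))
g_BA = np.sum(c / (z + eps) * np.maximum(s_A - s_B, 0))
print(g_AB, g_BA)  # ≈ 2.0, 1.0: B complements A on two samples, A complements B on one
```

Note the asymmetry: \(g\) is directional, so it ranks candidate partners for a given parent A rather than scoring pairs symmetrically.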


2. Pseudocode Outline

Here is a high-level Python-like pseudocode for M2N2:

import numpy as np

def slerp(a, b, t):
    # Spherical linear interpolation between a and b
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b)
    dot = np.clip(np.dot(a_norm, b_norm), -1.0, 1.0)
    omega = np.arccos(dot)
    if np.isclose(omega, 0):
        return (1-t)*a + t*b  # fallback to lerp
    return (np.sin((1-t)*omega)/np.sin(omega))*a + (np.sin(t*omega)/np.sin(omega))*b

def merge_models(theta_A, theta_B, w_m, w_s):
    # w_s = int split index; w_m = float in [0,1]
    merged = np.empty_like(theta_A)
    merged[:w_s] = slerp(theta_A[:w_s], theta_B[:w_s], w_m)
    merged[w_s:] = slerp(theta_A[w_s:], theta_B[w_s:], 1-w_m)
    return merged

def fitness(theta, X, s, population, alpha, c):
    # X: data points, s: scoring function, population: list of models
    score = 0
    for j, xj in enumerate(X):
        z_j = sum(s(xj, p) for p in population)
        score += s(xj, theta) / (z_j**alpha + 1e-8) * c[j]
    return score

def attraction(theta_A, theta_B, X, s, population, c):
    g = 0
    for j, xj in enumerate(X):
        z_j = sum(s(xj, p) for p in population)
        d = max(s(xj, theta_B) - s(xj, theta_A), 0)
        g += c[j] / (z_j + 1e-8) * d
    return g

3. Algorithm: Full Evolutionary Loop

Input: List of initial models \(\{\theta_i\}\), dataset \(X\), score function \(s\), population size \(P\), number of generations \(T\), capacity list \(c\), hyperparams \((\alpha, ...)\).

  1. Initialize population:
    • Seed the population with the \(K\) initial models (adding random initializations if \(K < P\)).
  2. For each generation (or until convergence):
    • For each candidate offspring:
      1. Select Parent A: Sample from population weighted by fitness.
      2. Select Parent B: Sample from population using attraction score relative to Parent A.
      3. Choose merging parameters: Randomly pick the mixing coefficient \(w_m \in [0, 1]\) and the split point \(w_s \in [1, N_\text{params} - 1]\).[1]
      4. Merge:
        • \(\theta_\text{child} = \text{merge\_models}(\theta_A, \theta_B, w_m, w_s)\).
      5. Evaluate fitness: On held-out data, or using Eq. above.
      6. Archive update: If \(\theta_\text{child}\) outperforms the worst member, insert (possibly removing weak models).
    • Periodically, update population diversity statistics, archive, and hyperparameters if desired.
  3. Output:
    • Archive with diverse and high-fitness models; best model(s) as per evaluation.

4. Minimal Example: Evolving MNIST Classifiers

Suppose each \(\theta\) is the flattened weight vector of a small MLP classifier on MNIST.

Score function:
Let’s define \(s(x_j | \theta) = 1\) if \(\operatorname{argmax}_k f_\theta(x_j) = y_j\) (correct label), else 0.
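A minimal sketch of that score function, assuming a generic classifier that returns per-class logits (the helper name and signature are hypothetical):

```python
import numpy as np

def binary_score(logits, y_true):
    # s(x_j | theta) = 1 if the argmax prediction matches the label, else 0
    return int(np.argmax(logits) == y_true)

print(binary_score(np.array([0.1, 2.3, -0.5]), 1))  # 1: class 1 has the highest logit
print(binary_score(np.array([0.1, 2.3, -0.5]), 0))  # 0
```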

Merge operation:

# theta_A, theta_B: network parameters as flat numpy arrays
w_m = 0.5                          # midpoint interpolation
w_s = int(0.6 * theta_A.shape[0])  # split at 60% of the parameter vector
child = merge_models(theta_A, theta_B, w_m, w_s)

Evolutionary loop (simplified):

population = [random_init_theta() for _ in range(P)]
theta_size = population[0].shape[0]  # number of parameters per model
for gen in range(T):
    parent_A = select_parent_fitness_weighted(population, X, s, ...)
    parent_B = select_parent_attraction_weighted(population, parent_A, X, s, ...)
    w_m, w_s = np.random.uniform(), np.random.randint(1, theta_size)
    child = merge_models(parent_A, parent_B, w_m, w_s)
    if fitness(child, X, s, population, alpha, c) > worst_fitness_in_population:
        replace_worst(population, child)
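The two selection helpers are left abstract in the loop above. One hedged sketch of the proportional sampling they could share (the function name and the positivity shift are assumptions, not part of M2N2):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_proportional(items, weights):
    # Sample one item with probability proportional to its weight;
    # shift weights so they are strictly positive before normalizing
    w = np.asarray(weights, dtype=float)
    w = w - w.min() + 1e-8
    return items[rng.choice(len(items), p=w / w.sum())]

# Hypothetical usage: fitness-weighted parent A, attraction-weighted parent B
# parent_A = sample_proportional(population, [fitness(t, ...) for t in population])
# parent_B = sample_proportional(population, [attraction(parent_A, t, ...) for t in population])
demo = sample_proportional(np.array([10, 20, 30]), [0.0, 0.0, 5.0])
print(demo)  # 30: essentially all probability mass is on the last item
```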

5. Special Cases & Real-World Scaling


6. References & Real Implementation

The mathematical details, pseudocode, and algorithm design above should give you a precise enough handle to implement or extend M2N2 in any deep learning or neuroevolution framework.