LOLM Scaling Projections

Empirical results (20.5M-1.57B) and projected behavior (7B, 13B) based on observed scaling trends from the LOLM framework.

| Metric | 304M (actual) | 1.57B (actual) | 7B (projected) | 13B (projected) |
|---|---|---|---|---|
| Gate equilibrium | 0.83 | 0.72 | 0.60-0.65 | 0.55-0.60 |
| Latent contribution | 17% | 29% | 35-40% | 40-45% |
| Regimes alive | 32/32 | 64/64 | 128/128 | 128/128 |
| Dependency inversion | 744× | 14M× | stronger | stronger |
| Compute efficiency | — | ~2× | ≥2× | ≥2× |
| PPL advantage | — | 15% | 15-25% | 15-25% |

All projected values are hypotheses to be tested. The 7B and 13B columns represent expected behavior extrapolated from the directional trends observed between 304M and 1.57B. TPU-scale runs will confirm or refute these projections; negative results are equally valuable.

Gate Equilibrium Across Scale

Lower gate = more latent contribution. The model allocates increasingly more to the SSM at larger scale.

The initial rise from 0.27 → 0.83 (20.5M → 304M) reflects the architecture maturing: gradient isolation wasn't implemented until 304M. The meaningful trend therefore starts at 304M, where 0.83 → 0.72 shows the model allocating more to the latent path at scale. Projected to continue: 0.625 at 7B, 0.575 at 13B. If the gate approaches 0.50, surface and latent become equal partners.
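The 7B and 13B projections are consistent with a simple log-linear extrapolation in parameter count from the two post-isolation measurements. A minimal sketch of that arithmetic (the log-linear form is an assumption; two data points cannot establish the functional form):

```python
import math

# Measured gate equilibria after gradient isolation (the meaningful trend).
measured = {304e6: 0.83, 1.57e9: 0.72}

# Fit gate = slope * log10(params) + intercept through the two points.
(x1, y1), (x2, y2) = [(math.log10(n), g) for n, g in sorted(measured.items())]
slope = (y2 - y1) / (x2 - x1)

def project_gate(n_params: float) -> float:
    """Extrapolate the gate equilibrium to a larger model size."""
    return y2 + slope * (math.log10(n_params) - x2)

print(round(project_gate(7e9), 3))   # ≈ 0.62, inside the projected 0.60-0.65 band
print(round(project_gate(13e9), 3))  # ≈ 0.58, inside the projected 0.55-0.60 band
```

The extrapolated values land near the 0.625 and 0.575 midpoints quoted above, so the stated projections amount to assuming the 304M → 1.57B trend continues linearly per decade of parameters.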

Latent Contribution (%) Across Scale

How much of the fused representation comes from the SSM latent path. Rising = architecture becoming more hybrid.

20.5M and 149M showed high latent percentages due to immature training (no gradient isolation). From 304M onward the trend is clear: 17% → 29%. Projected: 37.5% at 7B, 42.5% at 13B. At 45%+, the architecture would be a genuine dual-stream system rather than a Transformer with an auxiliary module.

Dependency Inversion Intensity (log scale)

PPL multiplier when latent path is removed (gate=1.0). Higher = deeper integration.

| 304M (measured) | 1.57B (measured) | 7B (projected) | 13B (projected) |
|---|---|---|---|
| 744× | 14M× | 500M× | 5B× |

The trend from 304M to 1.57B is extreme: the log₁₀ multiplier jumps from 2.87 to 7.15, more than four orders of magnitude over a roughly 5× increase in parameters. If this trajectory continues, surface-only collapse at 7B and 13B may produce effectively infinite perplexity: the model would output near-random tokens. At that point the dependency inversion is total: the Transformer is no longer a language model without the SSM.
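The log₁₀ figures can be checked directly from the measured multipliers. A short sketch (the per-decade growth rate assumes the trend is linear in log-parameters, which two points cannot confirm):

```python
import math

# Measured PPL multipliers when the latent path is ablated (gate = 1.0).
collapse = {304e6: 744, 1.57e9: 14e6}  # model size -> PPL multiplier

logs = {n: math.log10(m) for n, m in collapse.items()}
print(round(logs[304e6], 2), round(logs[1.57e9], 2))  # ≈ 2.87 and 7.15

# Growth of the log-multiplier per decade of parameters over the measured span.
rate = (logs[1.57e9] - logs[304e6]) / (math.log10(1.57e9) - math.log10(304e6))
print(round(rate, 1))  # ≈ 6.0 orders of magnitude per 10x parameters
```

Note that continuing this fitted rate to 7B would overshoot the 500M× shown in the chart above, so the charted projections are, if anything, conservative relative to a naive extrapolation.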

PPL Advantage vs Matched Baseline

Percentage lower perplexity than a decoder-only baseline at matched compute.

| 1.57B (controlled) | 7B (projected) | 13B (projected) |
|---|---|---|
| 15% | 15-25% | 15-25% |

This is the most uncertain projection. We only have one controlled data point (1.57B). The 15-25% range at 7B and 13B is conservative. It assumes the advantage holds but doesn't necessarily grow. If it exceeds 25% at 7B, that's the strongest possible signal that the architecture's advantage compounds with scale.

Compute Efficiency vs Baseline

The factor by which LOLM needs fewer training steps to reach baseline quality; 2× = half the compute.

At 1.57B, LOLM at step 10K matches baseline at step 20K, achieving 2× compute efficiency. At 7B with more latent contribution (projected 35-40%), the SSM's inductive bias should provide an even stronger early advantage. Conservative projection: 2-3× at 7B and 13B. In dollar terms, a model that costs $10M to train conventionally could reach equivalent quality for $3.3-5M with LOLM's architecture.
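The dollar range follows directly from the projected efficiency band (the $10M baseline is illustrative, not a measured cost):

```python
baseline_cost = 10_000_000  # illustrative $10M conventional training run

# A 2-3x compute efficiency translates one-for-one into training cost,
# assuming cost scales linearly with training steps.
for efficiency in (2.0, 3.0):
    print(f"{efficiency}x -> ${baseline_cost / efficiency / 1e6:.1f}M")
# 2.0x -> $5.0M
# 3.0x -> $3.3M
```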