Empirical results (20.5M-1.57B) and projected behavior (7B, 13B), based on scaling trends observed in the LOLM framework.
| Metric | 304M | 1.57B | 7B (projected) | 13B (projected) |
|---|---|---|---|---|
| Gate equilibrium | 0.83 | 0.72 | 0.60-0.65 | 0.55-0.60 |
| Latent contribution | 17% | 29% | 35-40% | 40-45% |
| Regimes alive | 32/32 | 64/64 | 128/128 | 128/128 |
| Dependency inversion | 744× | 14M× | stronger | stronger |
| Compute efficiency | — | ~2× | ≥2× | ≥2× |
| PPL advantage | — | 15% | 15-25% | 15-25% |
All projected values are hypotheses to be tested. The 7B and 13B columns represent expected behavior based on directional trends observed from 304M to 1.57B. TPU-scale runs will confirm or refute these projections. Negative results are equally valuable.
Lower gate = more latent contribution. The model allocates increasingly to the SSM at larger scale.
The initial rise from 0.27 → 0.83 (20.5M → 304M) reflects the architecture maturing: gradient isolation wasn't implemented until 304M. The meaningful trend starts at 304M, where 0.83 → 0.72 shows the model allocating more to the latent path at scale. Projected to continue: 0.625 at 7B and 0.575 at 13B (the midpoints of the projected ranges). If the gate approaches 0.50, surface and latent become equal partners.
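A minimal sketch of one way such a gate could fuse the surface (Transformer) and latent (SSM) streams is below; the sigmoid parameterization and the `GatedFusion` / `gate_logit` names are illustrative assumptions, not LOLM's actual fusion module.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative sketch (not LOLM's actual module): fuse the surface
    (Transformer) and latent (SSM) streams with a learned per-dimension gate.
    gate -> 1.0 means pure surface; gate -> 0.5 means equal partners."""

    def __init__(self, d_model: int):
        super().__init__()
        # Sigmoid keeps the gate in (0, 1); logit 0 initializes it at 0.5.
        self.gate_logit = nn.Parameter(torch.zeros(d_model))

    def forward(self, surface: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logit)
        # Convex mix: g weights the surface stream, (1 - g) the latent stream.
        return g * surface + (1.0 - g) * latent

    def gate_equilibrium(self) -> float:
        # The scalar reported in the table: the mean gate value after training.
        return torch.sigmoid(self.gate_logit).mean().item()
```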
How much of the fused representation comes from the SSM latent path. Rising = architecture becoming more hybrid.
20.5M and 149M showed high latent percentages due to immature training (no gradient isolation). From 304M onward the trend is clear: 17% → 29%. Projected: 37.5% at 7B and 42.5% at 13B. At 45%+ the architecture would be a genuine dual-stream system rather than a Transformer with an auxiliary module.
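One plausible way to compute this share, assuming a gated fusion like the sketch above, is the fraction of the fused representation's norm contributed by the latent term; the actual definition behind the 17% and 29% figures is not specified here.

```python
import torch

def latent_share(surface: torch.Tensor, latent: torch.Tensor, gate: torch.Tensor) -> float:
    """Assumed measurement: fraction of the fused representation attributable
    to the latent (SSM) term, by norm. One plausible definition only."""
    surface_part = (gate * surface).norm()
    latent_part = ((1.0 - gate) * latent).norm()
    return (latent_part / (surface_part + latent_part)).item()

# With a gate near 0.72 and streams of comparable magnitude, this lands around
# 28%, roughly consistent with the 29% reported at 1.57B.
```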
PPL multiplier when the latent path is removed (gate=1.0). Higher = deeper integration.
The trend from 304M to 1.57B is superexponential: the log₁₀ PPL multiplier itself jumps from 2.87 to 7.15 (roughly 744× → 14M×). If this trajectory continues, surface-only collapse at 7B and 13B may produce effectively infinite perplexity: the model would output near-random tokens. At that point the dependency inversion is total, and the Transformer is no longer a language model without the SSM.
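The ablation behind this metric can be sketched as a small evaluation harness: measure perplexity with the full model, then again with the gate pinned to 1.0. The `force_surface_only` hook and the model call signature are assumptions, not LOLM's real API.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def surface_only_ppl_multiplier(model, batches, force_surface_only) -> float:
    """Assumed ablation harness: PPL multiplier when the latent path is removed.
    `force_surface_only(model, flag)` is a hypothetical hook that pins the gate
    to 1.0; `model(input_ids)` is assumed to return next-token logits."""

    def perplexity() -> float:
        total_nll, total_tokens = 0.0, 0
        for input_ids, labels in batches:
            logits = model(input_ids)  # (batch, seq, vocab)
            nll = F.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1), reduction="sum")
            total_nll += nll.item()
            total_tokens += labels.numel()
        return math.exp(total_nll / total_tokens)

    full_ppl = perplexity()
    force_surface_only(model, True)    # gate = 1.0: Transformer stream only
    surface_ppl = perplexity()
    force_surface_only(model, False)
    return surface_ppl / full_ppl      # per the table: 744x at 304M, ~14M x at 1.57B
```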
Percentage lower perplexity than the decoder-only baseline at matched compute.
This is the most uncertain projection. We only have one controlled data point (1.57B). The 15-25% range at 7B and 13B is conservative. It assumes the advantage holds but doesn't necessarily grow. If it exceeds 25% at 7B, that's the strongest possible signal that the architecture's advantage compounds with scale.
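For clarity, the advantage is just the relative perplexity gap at matched compute; the numbers in the example below are illustrative, not measured values.

```python
def ppl_advantage_pct(lolm_ppl: float, baseline_ppl: float) -> float:
    """Percentage lower perplexity than the decoder-only baseline
    at matched compute."""
    return 100.0 * (baseline_ppl - lolm_ppl) / baseline_ppl

# Illustrative only: a baseline at PPL 20.0 vs LOLM at PPL 17.0 is a 15% advantage.
assert abs(ppl_advantage_pct(17.0, 20.0) - 15.0) < 1e-9
```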
How many fewer training steps LOLM needs to reach baseline quality. 2× = half the compute.
At 1.57B, LOLM at step 10K matches the baseline at step 20K, achieving 2× compute efficiency. At 7B, with more latent contribution (projected 35-40%), the SSM's inductive bias should provide an even stronger early advantage. Conservative projection: 2-3× at 7B and 13B. In dollar terms, a model that costs $10M to train conventionally could reach equivalent quality for $3.3-5M with LOLM's architecture.
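The dollar figures follow directly from the efficiency multiplier, assuming cost scales linearly with training steps:

```python
def cost_at_matched_quality(baseline_cost_usd: float, efficiency: float) -> float:
    """Cost to reach baseline quality, assuming cost scales linearly with the
    number of training steps. 2x efficiency = half the steps = half the cost."""
    return baseline_cost_usd / efficiency

# The $10M example from the text: 2-3x efficiency gives $5M down to ~$3.33M.
print(cost_at_matched_quality(10e6, 2.0))  # 5000000.0
print(cost_at_matched_quality(10e6, 3.0))  # ~3333333.33
```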