Interactive 3D visualization of the Latent Order Language Model. Hover over components for details.
16 Transformer layers with RoPE, 16 attention heads (d_k=64), d_ff=4096 with GELU. Handles fast token-level prediction. Output h ∈ (T, 1024).
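The RoPE scheme used in this stack can be sketched as below: a minimal NumPy version for a single d_k=64 head vector, assuming the standard interleaved-pair formulation with base 10000 (the base and pairing convention are assumptions, not specified here).

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding: rotate consecutive pairs of the
    head dimensions by position-dependent angles.
    x: (d,) with d even; pos: integer token position."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # (d/2,) per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                   # interleaved (even, odd) pairs
    out = np.empty_like(x, dtype=float)
    out[0::2] = x1 * cos - x2 * sin             # 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is a pure rotation, norms are preserved and query–key dot products depend only on the relative offset between positions, which is what makes RoPE attractive for attention.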
4 selective Mamba layers, d_inner=2048, d_state=32. Captures slow discourse-level structure. Trained via a CPC loss (InfoNCE). Output z ∈ (T, 1024).
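The CPC (InfoNCE) objective can be sketched as below: a minimal NumPy version where each latent z_t is paired with a matched future summary and the other rows of the batch act as negatives. The cosine normalization, the temperature value, and the way future summaries are built are assumptions, not specified here.

```python
import numpy as np

def info_nce(z, future, temperature=0.1):
    """InfoNCE (CPC) loss: each latent should score its own matched
    future summary higher than the other rows in the batch.
    z, future: (N, D) arrays of matched positive pairs."""
    # Cosine-similarity logits between every latent and every future summary.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    f = future / np.linalg.norm(future, axis=1, keepdims=True)
    logits = (z @ f.T) / temperature              # (N, N)
    # Cross-entropy with the diagonal (matched pair) as the target class.
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

The loss is near zero when each latent is far more similar to its own future than to any other row, and grows toward log N when the pairing carries no information.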
32 Gumbel-Softmax codes, d_r=128, τ=0.5. Detects discourse transitions. Gradient detached before fusion. Changepoint + diversity losses.
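Sampling one of the 32 codes with Gumbel-Softmax at τ=0.5 can be sketched as below; this is a forward-pass-only NumPy version (the straight-through estimator, the gradient detach, and the changepoint/diversity losses are omitted).

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Relaxed sample over discrete codes: add Gumbel noise to the
    logits, then take a temperature-controlled softmax. Returns a
    probability vector that approaches one-hot as tau -> 0."""
    rng = rng or np.random.default_rng()
    # Standard Gumbel(0, 1) noise via the inverse-CDF trick.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = y - y.max()                   # numerical stability
    e = np.exp(y)
    return e / e.sum()
```

At τ=0.5 the sample is a soft mixture over codes; annealing τ downward sharpens it toward a hard code assignment while keeping the sampling step differentiable with respect to the logits.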
3 banks (episodic, semantic, self), 128 slots each, d_s=1024. Chunked read/write (C=4). Gated update per slot. Output m ∈ (T, 1024).
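The gated per-slot update can be sketched as below: a minimal NumPy version for one bank, where a sigmoid gate computed from the current slot content and the candidate write decides, per slot and per dimension, how much to overwrite. The gate parameterization (a single linear layer over the concatenation) and the names `gate_w`, `gate_b` are assumptions.

```python
import numpy as np

def gated_slot_update(slots, write, gate_w, gate_b):
    """Gated memory write: each of the S slots blends its old content
    with a candidate write vector via a sigmoid gate.
    slots, write: (S, D); gate_w: (2*D, D); gate_b: (D,)."""
    pre = np.concatenate([slots, write], axis=1) @ gate_w + gate_b
    u = 1.0 / (1.0 + np.exp(-pre))        # per-slot, per-dimension gate in (0, 1)
    return u * write + (1.0 - u) * slots   # u=0 keeps the slot, u=1 overwrites it
```

Driving the gate toward 0 preserves a slot untouched; driving it toward 1 replaces the slot with the new write, so the bank can protect stable semantic entries while refreshing episodic ones.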
Per-dimension g ∈ [0,1]^1024. 2-layer MLP conditioned on h, z, m, r. Learns a contextual surface/latent blend per token position.
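The 2-layer gate MLP can be sketched as below: a minimal NumPy version that concatenates the per-token inputs, applies a GELU hidden layer (tanh approximation), and squashes the output through a sigmoid so every dimension lands in [0, 1]. The hidden width, the GELU nonlinearity, and the weight names are assumptions.

```python
import numpy as np

def blend_gate(h, z, m, r, W1, b1, W2, b2):
    """2-layer MLP gate conditioned on h, z, m, r.
    h, z, m: (T, D); r: (T, D_r); returns g: (T, D) in [0, 1]."""
    x = np.concatenate([h, z, m, r], axis=-1)    # per-token conditioning vector
    hid = x @ W1 + b1
    # GELU, tanh approximation.
    hid = 0.5 * hid * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (hid + 0.044715 * hid ** 3)))
    return 1.0 / (1.0 + np.exp(-(hid @ W2 + b2)))   # sigmoid -> [0, 1] per dimension
```

Because the sigmoid is applied element-wise, each of the 1024 output dimensions can independently lean toward the surface (transformer) or latent (Mamba) stream at every token position.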
g⊙LN(W_h · h) + (1-g)⊙LN(W_z · z) + W_m · m + W_r · r̄. Per-dimension gated blend plus additive memory and regime contributions.