Independent AI Research Lab
QIRA is an independent research lab developing hybrid Transformer-SSM architectures. At 1.57B parameters on an NVIDIA H200, LOLM achieves 15% lower perplexity than a matched baseline. On TPU, LOLM converges up to 43% faster than a parameter-matched baseline during early training. Validated from 20.5M to 1.57B parameters across NVIDIA H200 GPUs and Google Cloud TPUs. Patent pending. Open code and weights.
Why It Matters
Most large language models treat all of language as one prediction problem. But language operates on multiple levels: fast token-level patterns and slow discourse-level structure like planning, topic shifts, and coherence. By explicitly modeling this separation, we can build models that learn richer representations with fewer parameters, reducing the compute barrier to capable AI.
Verified Results
LOLM reaches perplexity 33.2 versus 39.1 for a matched decoder-only baseline, a 15% improvement. Same data, same batch size, same hyperparameters. The gain is purely architectural.
Full architecture validated on both GPU and TPU under identical controlled conditions. On TPU, LOLM converges up to 43% faster than a parameter-matched baseline during the first 15K steps, confirming the convergence advantage is hardware-agnostic.
Removing the latent SSM path, which carries 29% of the fused signal, causes perplexity to explode from 34.5 to 485 million. The two representation streams are deeply integrated; neither can function alone.
Validated on FineWeb-Edu, C4, and The Pile at 20.5M, 149M, 304M, and 1.57B parameters, with 7B in progress. Downstream benchmarks include HellaSwag, WikiText-103, and LAMBADA.
Impact
LOLM achieves faster convergence and competitive language modeling quality through a hybrid architecture. At 1.57B on H200, LOLM achieves 15% lower perplexity than a matched baseline at step 24K. On TPU at 300M, LOLM converges up to 43% faster during early training, reducing the compute needed to reach a given quality level. The architecture is hardware-agnostic, open source, and patent pending in the U.S.
Research Focus
Designing models that combine multiple representation streams, such as Transformers and state space models, to capture both surface-level and latent language dynamics.
Building models that achieve stronger performance with fewer parameters by rethinking how representations are structured and combined.
Exploring how language operates at multiple timescales, from fast token-level prediction to slow discourse-level planning, and modeling these processes explicitly.
Studying how architectural innovations behave across model sizes, from 20M to 1.57B parameters, and how component contributions change with scale.
About QIRA
QIRA was founded on a specific thesis: that natural language has distinct surface-level and latent-level structure. Token sequences sit on the surface; deeper processes like planning, discourse tracking, and topic management run underneath. We believe modeling this separation explicitly is the path to more capable and more efficient language systems.
We operate independently, publish all findings with open code and weights, and design every experiment for reproducibility. Our goal is not just to build better models, but to advance the community's collective understanding of how language models work.
Prove that smarter architectures can replace brute-force scaling. We build models that do more with less: fewer parameters, less compute, lower energy. We separate what language shows from what language means.
A future where capable AI doesn't require datacenter-scale resources. Hybrid architectures that achieve frontier-level understanding at a fraction of the energy cost, putting serious research within reach of independent labs, not just corporations.
Every watt matters. We design for efficiency as a moral position, not just an engineering goal. All research is published openly with code, weights, and full reproducibility. Progress locked behind closed doors isn't progress.
Research
Four interconnected lines of investigation, all aimed at understanding and improving how language models represent and process information.
Designing models that combine the strengths of Transformer attention with state space model efficiency. We study how hybrid designs can capture both local dependencies and long-range context more effectively than either approach alone.
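The complementarity can be seen in a minimal sketch. The recurrence below is a plain linear state space model, not LOLM's selective SSM (selection and discretization are omitted); it illustrates why the SSM path costs only linear time in sequence length, in contrast to attention's quadratic cost.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear SSM recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    One pass over the sequence, so cost is O(T) in sequence length,
    which is what makes SSMs a cheap complement to attention for
    long-range context.
    """
    T = x.shape[0]
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(T):
        h = A @ h + B @ x[t]      # state update carries long-range context
        ys.append(C @ h)          # per-step readout
    return np.stack(ys)

rng = np.random.default_rng(0)
d_in, d_state, d_out, T = 4, 6, 4, 10
A = 0.9 * np.eye(d_state)                      # stable state transition
B = rng.normal(scale=0.1, size=(d_state, d_in))
C = rng.normal(scale=0.1, size=(d_out, d_state))
x = rng.normal(size=(T, d_in))

y = ssm_scan(x, A, B, C)
print(y.shape)
```

Attention, by contrast, recomputes pairwise token interactions at every step; the hybrid designs studied here run both paths and fuse their outputs.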
Exploring the hypothesis that language has distinct surface-level and latent-level structures. We investigate how explicitly separating these representations (surface token prediction and deeper discourse-level modeling) can improve language understanding.
Studying how model capabilities change across scales, from 20.5M to 1.57B parameters and beyond. At 304M the latent path contributes 17% of the signal; at 1.57B it contributes 29%. Removing it causes a 14,000,000x perplexity explosion, despite the latent path comprising a minority of the fused representation.
Developing training paradigms that optimize for multiple objectives simultaneously, combining standard language modeling with auxiliary tasks that encourage richer internal representations and more robust generalization.
Introduces a hybrid architecture that augments a Transformer decoder with four parallel subsystems (selective SSM, persistent memory, regime layer, manifestation gate), achieving 15% lower perplexity than a controlled baseline at 1.57B on H200 and up to 43% faster convergence on TPU at 300M under identical conditions. Gate ablation reveals a 14,000,000x dependency inversion. Validated across three datasets (FineWeb-Edu, C4, Pile) and two hardware platforms (NVIDIA H200, Google TPU v4). U.S. Patent Pending (#64/002,166).
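A per-dimension sigmoid gate is one plausible reading of the manifestation gate described above; the sketch below is illustrative, not LOLM's published implementation, and the projection `W` and the latent-share readout are assumptions for the example.

```python
import numpy as np

def manifestation_gate(surface, latent, W, b):
    """Per-dimension gate blending two representation streams.

    surface, latent: (seq_len, d_model) activations from the Transformer
    path and the latent SSM path. W: (2*d_model, d_model), b: (d_model,).
    Returns the fused representation and the gate values.
    """
    # Gate is computed from both streams, one value per dimension per token.
    z = np.concatenate([surface, latent], axis=-1) @ W + b
    g = 1.0 / (1.0 + np.exp(-z))              # sigmoid, in (0, 1)
    fused = g * surface + (1.0 - g) * latent  # convex per-dimension blend
    return fused, g

rng = np.random.default_rng(0)
d = 8
surface = rng.normal(size=(4, d))
latent = rng.normal(size=(4, d))
W = rng.normal(scale=0.1, size=(2 * d, d))
b = np.zeros(d)

fused, g = manifestation_gate(surface, latent, W, b)
# Mean weight on the latent stream, analogous to the "29% of signal" figure.
latent_share = float((1.0 - g).mean())
print(fused.shape, 0.0 < latent_share < 1.0)
```

Because the blend is per-dimension rather than per-layer, a stream carrying a minority of the averaged signal can still be load-bearing for specific dimensions, which is consistent with the dependency-inversion ablation result.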
Experiments & Projects
What we're building and releasing right now.
Training and evaluating hybrid Transformer-SSM language models across scales from 20.5M to 1.57B parameters. At 1.57B on H200, LOLM achieves 15% lower perplexity than a matched baseline. On TPU at 300M, up to 43% faster convergence than a parameter-matched baseline during early training. 50K-step runs completed for Full LOLM, the Matched Baseline, the No-SSM ablation, and cross-dataset (Pile) validation.
Training with seven complementary loss terms: token cross-entropy, contrastive predictive coding, changepoint alignment, regime diversity, competitive gate, memory focus, and gate regularization. Systematic ablations confirm all components contribute.
Releasing trained model checkpoints, code, and evaluation results for the research community. All LOLM model weights and training code are publicly available.
Interactive visualizations of LOLM's scaling behavior: gate trajectories, latent contribution, dependency inversion, PPL advantage, and compute efficiency across model sizes.
See how LOLM's manifestation gate blends surface and latent representations per token, visualizing when the model leans on syntax versus discourse structure.
Compute & Infrastructure
All LOLM models through 1.57B were trained on a single NVIDIA H200 GPU (141GB), achieving 15% lower perplexity than a matched decoder-only baseline at step 24K. Cross-hardware validation on Google Cloud TPU v4 confirmed up to 43% faster convergence during early training under controlled conditions, with 50K-step runs completed for multiple configurations. QIRA holds U.S. Provisional Patent Application #64/002,166 (filed March 10, 2026) covering the hybrid architecture.
All models through 1.57B trained on a single NVIDIA H200 (141GB). Cross-hardware validation completed on Google Cloud TPU v4-8 with 50K-step runs for Full LOLM, the Matched Baseline (317M), the No-SSM ablation, and Pile cross-dataset validation. Downstream benchmarks (HellaSwag, WikiText-103, LAMBADA) completed on TPU with parameter-matched comparison.
Systematic logging of hyperparameters, loss curves, and evaluation metrics across all training runs. Every published result can be independently reproduced.
Automated infrastructure for training at multiple scales (20.5M, 149M, 304M, 1.57B) with consistent evaluation. Three datasets (FineWeb-Edu, C4, Pile), downstream benchmarks (HellaSwag, WikiText-103, LAMBADA), and ablation studies with 50K-step runs on TPU v4-8.
Blog & Research Notes
Technical deep dives and research notes from our ongoing work.
Combining Transformers and state space models for next-generation language modeling. What each brings to the table, why neither alone is sufficient, and how LOLM's per-dimension gating enables dependency inversion.
Emergent behaviors, gate trajectory reversals, regime code collapse solutions, and 2x compute efficiency. Observations from training LOLM across four model scales.
An introduction to dual-stream language modeling. Why a single representation is unnecessarily lossy, and how explicitly separating surface and latent streams enables representations that single-stream models cannot learn.
Team
Independent researchers building at the frontier of language model architecture.
Co-Founder / AI Researcher
Builds the groundwork that makes LOLM possible. Owns training infrastructure, experiment pipelines, and day-to-day model runs, from setting up multi-GPU environments to debugging loss curves at 3 AM. Responsible for scaling experiments across all five model sizes.
Co-Founder / AI Researcher
The architect behind LOLM's design. Drives the big-picture research direction: hybrid architecture design, data strategy, and the surface-latent separation thesis. Translates theoretical insight into model architecture and defines what QIRA builds next.
QIRA operates as a focused two-person research lab. We believe meaningful AI breakthroughs come from depth of investigation, not size of team. All our work is published openly with code and weights.
Contact
We're actively seeking research partnerships, compute grants, and institutional collaborations to accelerate our work on hybrid language model architectures.
We're pursuing grants to scale our hybrid architecture research beyond 1.57B parameters. Our work is open access and reproducible by design.
Researchers working on architecture design, scaling laws, or multi-objective training. We welcome collaboration and co-authorship.
Cloud providers, GPU sponsors, and compute grant programs. Additional resources directly translate to larger scale experiments and faster progress.