Independent AI Research Lab

Language has hidden structure.
We're building models that see it.

QIRA is an independent research lab developing hybrid Transformer-SSM architectures. At 1.57B on H200, LOLM achieves 15% lower perplexity than a matched baseline. On TPU, LOLM converges up to 43% faster than a parameter-matched baseline during early training. Validated from 20.5M to 1.57B parameters across NVIDIA H200 and Google Cloud TPU. Patent pending. Open code and weights.


Current language models waste parameters on a single objective.

Most large language models treat all of language as one prediction problem. But language operates on multiple levels: fast token level patterns and slow discourse level structure like planning, topic shifts, and coherence. By explicitly modeling this separation, we can build models that learn richer representations with fewer parameters, reducing the compute barrier to capable AI.

15% Better PPL at 1.57B (H200)
14M× Dependency Inversion
43% Faster Convergence on TPU
2 Hardware Platforms Validated

Proven across scales, datasets, and hardware

15% Better at 1.57B Scale

LOLM achieves PPL 33.2 vs a matched decoder-only baseline at PPL 39.1. Same data, same batch size, same hyperparameters. The improvement is purely architectural.
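The 15% figure follows directly from the two reported perplexities. The snippet below is a quick sanity check; the `mean_nll` line only illustrates the standard PPL-from-loss relationship and is not a reported loss value.

```python
import math

# Perplexity is exp(mean per-token negative log-likelihood), so a PPL of 33.2
# corresponds to a mean NLL of ln(33.2).
mean_nll = math.log(33.2)
assert abs(math.exp(mean_nll) - 33.2) < 1e-9

# Reported eval perplexities at 1.57B on H200 (same data, batch size, and
# hyperparameters, per the comparison above).
lolm_ppl, baseline_ppl = 33.2, 39.1
relative_improvement = (baseline_ppl - lolm_ppl) / baseline_ppl
print(f"{relative_improvement:.1%}")  # → 15.1%
```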

Cross-Hardware Validation

Full architecture validated on both GPU and TPU under identical controlled conditions. On TPU, LOLM converges up to 43% faster than a parameter-matched baseline during the first 15K steps, confirming the convergence advantage is hardware-agnostic.

14,000,000× Dependency Inversion

Removing the latent SSM path (29% of signal) causes perplexity to explode from 34.5 to 485 million. The two representation streams are deeply integrated. Neither can function alone.
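For scale, the quoted dependency-inversion factor is just the ratio of those two perplexities:

```python
# Reported perplexities from the ablation above.
ppl_full = 34.5        # full model, latent SSM path intact
ppl_no_ssm = 485e6     # latent SSM path removed
factor = ppl_no_ssm / ppl_full
print(f"{factor:,.0f}x")  # → 14,057,971x (≈ 14,000,000×)
```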

3 Datasets, 5 Scales

Validated on FineWeb-Edu, C4, and The Pile across 20.5M, 149M, 304M, 1.57B, and 7B (in progress). Downstream benchmarks include HellaSwag, WikiText-103, and LAMBADA.

Smarter architectures mean less compute for the same capability.

LOLM achieves faster convergence and competitive language modeling quality through a hybrid architecture. At 1.57B on H200, LOLM achieves 15% lower perplexity than a matched baseline at step 24K. On TPU at 300M, LOLM converges up to 43% faster during early training, reducing the compute needed to reach a given quality level. The architecture is hardware-agnostic, open source, and patent pending in the U.S.

43% Faster Convergence vs Matched Baseline (TPU)
Open Code, Weights, Paper & Proof Pack

Core areas of investigation

Hybrid Architectures

Designing models that combine multiple representation streams, such as Transformers and state space models, to capture both surface level and latent language dynamics.

Parameter Efficiency

Building models that achieve stronger performance with fewer parameters by rethinking how representations are structured and combined.

Multi Scale Representations

Exploring how language operates at multiple timescales, from fast token level prediction to slow discourse level planning, and modeling these processes explicitly.

Scaling Behavior

Studying how architectural innovations behave across model sizes, from 20M to 1.57B parameters, and how component contributions change with scale.

LOLM: Scaling to 1.57B parameters and beyond.

Our flagship framework, LOLM (Latent Order Language Model), separates surface token prediction from latent discourse modeling in a unified hybrid architecture. Validated across four scale points (20.5M, 149M, 304M, 1.57B) on NVIDIA H200 and Google Cloud TPU. At 1.57B on H200, LOLM achieves 15% lower perplexity than a matched baseline. On TPU at 300M, LOLM converges up to 43% faster than a parameter-matched baseline during early training (50K steps completed for 4 of 5 experiment configurations). U.S. Patent Application #64/002,166 filed March 2026.

Independent research,
open by default.

QIRA was founded on a specific thesis: that natural language has distinct surface level and latent level structure. Token sequences on the surface, and deeper processes like planning, discourse tracking, and topic management underneath. We believe modeling this separation explicitly is the path to more capable and more efficient language systems.

We operate independently, publish all findings with open code and weights, and design every experiment for reproducibility. Our goal is not just to build better models, but to advance the community's collective understanding of how language models work.

Mission

Prove that smarter architectures can replace brute force scaling. We build models that do more with less: fewer parameters, less compute, lower energy. We separate what language shows from what language means.

Vision

A future where capable AI doesn't require datacenter scale resources. Hybrid architectures that achieve frontier level understanding at a fraction of the energy cost, putting serious research within reach of independent labs, not just corporations.

Ethics & Approach

Every watt matters. We design for efficiency as a moral position, not just an engineering goal. All research is published openly with code, weights, and full reproducibility. Progress locked behind closed doors isn't progress.

Research areas

Four interconnected lines of investigation, all aimed at understanding and improving how language models represent and process information.

01

Hybrid Transformer-SSM Architectures

Designing models that combine the strengths of Transformer attention with state space model efficiency. We study how hybrid designs can capture both local dependencies and long range context more effectively than either approach alone.

Hybrid Models · State Space Models · Attention Mechanisms

02

Surface & Latent Representations

Exploring the hypothesis that language has distinct surface level and latent level structures. We investigate how explicitly separating these representations (surface token prediction and deeper discourse level modeling) can improve language understanding.

Multi Scale Modeling · Latent Representations · Discourse Structure

03

Scaling Behavior & Efficiency

Studying how model capabilities change across scales, from 20.5M to 1.57B parameters and beyond. At 304M the latent path contributes 17% of the signal. At 1.57B it contributes 29%. Removing it causes a 14,000,000× perplexity explosion, despite comprising the minority of the fused representation.

Scaling Laws · Parameter Efficiency · Compute Optimization

04

Multi Objective Training

Developing training paradigms that optimize for multiple objectives simultaneously, combining standard language modeling with auxiliary tasks that encourage richer internal representations and more robust generalization.

Training Paradigms · Auxiliary Objectives · Generalization

Publications

2026

LOLM: Language Modeling Beyond the Surface with Hybrid Transformer-SSM Latent Order Fields

Bryan Leonard & Brandyn Leonard, Qira LLC

Introduces a hybrid architecture that augments a Transformer decoder with four parallel subsystems (selective SSM, persistent memory, regime layer, manifestation gate), achieving 15% improvement over a controlled baseline at 1.57B on H200, and up to 43% faster convergence on TPU at 300M under identical conditions. Gate ablation reveals a 14,000,000× dependency inversion. Validated across 3 datasets (FineWeb-Edu, C4, Pile) and 2 hardware platforms (NVIDIA H200, Google TPU v4). U.S. Patent Pending (#64/002,166).

Active projects

What we're building and releasing right now.

Active

LOLM Model Family

Training and evaluating hybrid Transformer-SSM language models across scales from 20.5M to 1.57B parameters. At 1.57B on H200, 15% improvement over a matched baseline. On TPU at 300M, up to 43% faster convergence than a parameter-matched baseline during early training. 50K-step runs completed for Full LOLM, Matched Baseline, No-SSM ablation, and cross-dataset (Pile) validation.

Hybrid Architecture · Scaling
Active

Multi Scale Training Experiments

Training with seven complementary loss terms: token cross entropy, contrastive predictive coding, changepoint alignment, regime diversity, competitive gate, memory focus, and gate regularization. Systematic ablations confirm all components contribute.

Training · Multi Objective
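As a rough illustration of how a seven-term objective reduces to a single scalar training loss, the sketch below forms a weighted sum over the terms named above. Every weight and loss value here is a hypothetical placeholder, not LOLM's published configuration.

```python
# Hypothetical per-step values for the seven LOLM loss terms (illustrative only).
loss_terms = {
    "token_cross_entropy": 3.20,
    "contrastive_predictive_coding": 0.85,
    "changepoint_alignment": 0.40,
    "regime_diversity": 0.12,
    "competitive_gate": 0.30,
    "memory_focus": 0.22,
    "gate_regularization": 0.05,
}

# One scalar weight per objective; here the auxiliary terms are simply
# down-weighted by 0.1 relative to the main cross-entropy term (an assumption).
weights = {name: (1.0 if name == "token_cross_entropy" else 0.1)
           for name in loss_terms}

total_loss = sum(weights[name] * value for name, value in loss_terms.items())
print(round(total_loss, 3))  # → 3.394
```

In practice such weights are tuned (or scheduled) so auxiliary objectives shape representations without dominating the language modeling signal.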
Active

Open Model & Weight Releases

Releasing trained model checkpoints, code, and evaluation results for the research community. All LOLM model weights and training code are publicly available.

Open Source · Reproducibility
Interactive

Scaling Projections Dashboard

Interactive visualizations of LOLM's scaling behavior: gate trajectories, latent contribution, dependency inversion, PPL advantage, and compute efficiency across model sizes.

Data Visualization · Scaling Analysis
Interactive

Gate Behavior Explorer

See how LOLM's manifestation gate blends surface and latent representations per token, visualizing when the model leans on syntax vs discourse structure.

Manifestation Gate · Token Analysis
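The blending idea the explorer visualizes can be sketched in a few lines. This assumes a scalar sigmoid gate per token mixing the two streams, which is an illustrative simplification rather than LOLM's actual manifestation-gate parameterization; the function `manifest` and its inputs are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def manifest(surface: list[float], latent: list[float], gate_logit: float) -> list[float]:
    """Blend surface and latent features for one token with a scalar gate.

    Computes g * surface + (1 - g) * latent, where g = sigmoid(gate_logit):
    a high gate leans on surface (syntax) features, a low gate on latent
    (discourse) features.
    """
    g = sigmoid(gate_logit)
    return [g * s + (1.0 - g) * l for s, l in zip(surface, latent)]

# A gate logit of 0 gives an even 50/50 blend of the two streams.
print(manifest([1.0, 0.0], [0.0, 1.0], gate_logit=0.0))  # → [0.5, 0.5]
```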

Research at meaningful scale.

All LOLM models through 1.57B were trained on a single NVIDIA H200 GPU (140GB), achieving 15% lower perplexity than a matched decoder-only baseline at step 24K. Cross-hardware validation on Google Cloud TPU v4 confirmed up to 43% faster convergence during early training under controlled conditions, with 50K-step runs completed for multiple configurations. QIRA holds U.S. Provisional Patent Application #64/002,166 (filed March 10, 2026) covering the hybrid architecture.

Supported by

Single GPU to TPU Scaling

All models through 1.57B trained on a single NVIDIA H200 (140GB). Cross-hardware validation completed on Google Cloud TPU v4-8 with 50K-step runs for Full LOLM, Matched Baseline (317M), No-SSM ablation, and Pile cross-dataset validation. Downstream benchmarks (HellaSwag, WikiText-103, LAMBADA) completed on TPU with parameter-matched comparison.

Experiment Tracking & Reproducibility

Systematic logging of hyperparameters, loss curves, and evaluation metrics across all training runs. Every published result can be independently reproduced.

Scaling Pipeline

Automated infrastructure for training at multiple scales (20.5M, 149M, 304M, 1.57B) with consistent evaluation. Three datasets (FineWeb-Edu, C4, Pile), downstream benchmarks (HellaSwag, WikiText-103, LAMBADA), and ablation studies with 50K-step runs on TPU v4-8.

Insights from the lab

Technical deep dives and research notes from our ongoing work.

The people behind QIRA

Independent researchers building at the frontier of language model architecture.

Bryan Leonard

Co-Founder / AI Researcher

Builds the groundwork that makes LOLM possible. Owns training infrastructure, experiment pipelines, and day to day model runs, from setting up multi GPU environments to debugging loss curves at 3 AM. Responsible for scaling experiments across all five model sizes.

Brandyn Leonard

Co-Founder / AI Researcher

The architect behind LOLM's design. Drives the big picture research direction: hybrid architecture design, data strategy, and the surface latent separation thesis. Translates theoretical insight into model architecture and defines what QIRA builds next.

QIRA operates as a focused two person research lab. We believe meaningful AI breakthroughs come from depth of investigation, not size of team. All our work is published openly with code and weights.

Work with us.

We're actively seeking research partnerships, compute grants, and institutional collaborations to accelerate our work on hybrid language model architectures.

Grant & Funding Bodies

We're pursuing grants to scale our hybrid architecture research beyond 1.57B parameters. Our work is open access and reproducible by design.

Academic Collaborators

Researchers working on architecture design, scaling laws, or multi objective training. We welcome collaboration and coauthorship.

Compute Partners

Cloud providers, GPU sponsors, and compute grant programs. Additional resources directly translate to larger scale experiments and faster progress.
