Kimi-Linear: An Expressive, Efficient Attention Architecture
A linear-time scaling architecture has long been the goal. Kimi-Linear takes a major step toward achieving efficient, ultra-long context modeling without trading off quality.
Introduction
Every major advance from LLMs to multimodal systems and even biological modeling traces back to one breakthrough: attention. It lets models focus selectively, deciding which parts of input truly matter. Before it, CNNs and RNNs could only reason locally, missing global relationships. Self-attention changed that, enabling direct interaction between any tokens or features, unlocking true contextual reasoning and transforming how architectures are built.
But the very mechanism that made Transformers powerful also became their constraint. As context lengths and model sizes grow, attention’s quadratic compute and memory cost turns into a bottleneck, making large-scale reasoning expensive and inefficient.
This tension between expressivity and efficiency has defined the next generation of architecture research (we covered many of these advancements in detail in our previous blog post).
The industry’s trajectory has shifted from making attention more accurate to making it more efficient.
And that’s where Kimi Linear comes in: a promised breakthrough architecture that redefines what efficient attention can look like. Kimi Linear combines the structural expressivity of full attention with the speed and efficiency of linear attention mechanisms. It introduces a hybrid design where linear recurrence meets selective full attention, achieving a balance that neither side managed alone.
In this article, we’ll unpack the story and methodology behind it, starting from why attention matters, exploring why it breaks at scale, revisiting earlier attempts like Linformer, Longformer, Mamba, and GDN, and finally dissecting how Kimi Linear rewires attention using Kimi Delta Attention (KDA) and some cool linear algebra.
By the end, you’ll not only understand how Kimi Linear works but also why it might signal the next paradigm shift in long-context, reasoning-capable AI.
Why Does Attention Have a Computational Bottleneck?
As this article builds on our previous attempt to understand the issues and computational intricacies of attention, you can go through the first two sections of that article; those are the ones I would recommend.
(You can learn all of these topics from the sections I pointed to.)
Let me just give you a quick taste of the issue, in case you don't want to switch tabs.
The Root of the Problem: Quadratic Complexity
The core attention operation requires computing a similarity matrix between all pairs of tokens: \(A = QK^\top\).
If your sequence has n tokens, Q and K are both n×d, meaning A is an n×n matrix (here d is the embedding dimension, typically 512, 1024, or similar; on the order of 10^2 or 10^3).
That’s O(n^2) time and O(n^2) memory: both scale quadratically with sequence length.
This means:
A 1,000-token sequence needs 1,000,000 attention weights.
A 32,000-token context (like modern long-context LLMs) explodes to over 1 billion weights per layer, per head.
And that’s before we even multiply by the number of layers or heads. Even with half-precision floats (16-bit), that’s gigabytes of activation memory per layer, making full attention infeasible for long sequences.
The Compute-Memory Tradeoff
The major bottleneck isn’t just FLOPs, it’s memory bandwidth. Every forward pass requires:
Computing QK^⊤ (matrix multiplication),
Storing the full (n×n) attention map,
Applying softmax and multiplying with V.
At inference, this becomes a latency nightmare. At training, it becomes a memory bottleneck, leading to “OOM” (Out of Memory) errors even on top-tier GPUs. This is why scaling from 4k → 32k context requires dozens of A100s or H100s working in parallel.
A simple fix for this compute bottleneck during inference is KV caching. During inference, the model doesn’t recompute attention from scratch at every token; instead, it caches the keys and values from previous steps. When generating token t, it only computes attention between the current query q_t and the cached (K, V) pairs.
This reduces the time complexity per token from O(n^2) to O(n). But memory still scales as O(n), because every past key and value must be stored. In other words, KV caching helps with inference speed but worsens the memory footprint, especially at longer contexts (like 128k tokens). You can read much more on this in the attached article.
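To make this concrete, here's a minimal NumPy sketch of decoding with a KV cache. It is only an illustration of the mechanism; the dimension d, the random projection matrices, and the toy generation loop are made-up stand-ins, not any model's actual implementation.

```python
import numpy as np

d = 64                      # embedding / head dimension (toy value)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))

K_cache, V_cache = [], []   # grows by one (k, v) pair per token -> O(n) memory

def decode_step(x_t):
    """One decoding step: attend the current query against all cached keys/values."""
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    K_cache.append(k)
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)   # (t, d)
    scores = K @ q / np.sqrt(d)                   # O(t) work per token instead of O(t^2)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                            # attention output for the current token

for t in range(5):                                # toy "generation" loop
    out = decode_step(rng.standard_normal(d))
print(len(K_cache), out.shape)                    # 5 (64,)
```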
Earlier Attempts at Efficient Attention
Before we understand why Kimi Linear matters, it’s essential to look at what came before. Over the past few years, multiple architectures have attempted to address the computational explosion of full attention — each from a different philosophical angle. Some compress, some sparsify, and some entirely rethink what “attention” even means.
Let’s walk through a few key milestones that shaped this evolution.
1. RNNs and LSTMs: Sequential Context Before Attention
Before the age of Transformers, we had Recurrent Neural Networks (RNNs) and LSTMs.
Their attention-free design made them naturally linear in time: each step depends only on the previous one (a one-step Markov chain), \(h_t = f(h_{t-1}, x_t)\).
This recurrence provided a memory chain through time. I won’t go very deep into this, as there is plenty of content out there. But you can think of it as a small whiteboard on which you write new information read from the current token (t) while also erasing/forgetting information already on the board. The larger the whiteboard, the more content you can accommodate, but it has its own issues (think about LSTMs here).
Complexity:
Time: O(n) (sequential, not parallelizable)
Memory: O(n)
Recurrent architectures have linear scaling in theory and are excellent at temporal continuity. But they are:
Hard to parallelize; hence, slow in practice.
Prone to vanishing/exploding gradients.
Limited in their ability to represent diverse contexts (only a single evolving state vector; our whiteboard).
RNNs inspired later linear attention designs like DeltaNets and State space models like Mamba, where memory updates become differentiable and parallelizable approximations of recurrence.
2. Linformer: Low-Rank Projections for Linear Complexity
The Linformer (Wang et al., 2020) started with a very elegant hypothesis: “Maybe the attention matrix doesn’t actually need to be full rank.” In standard attention, we compute \(\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V\),
where both Q and K are of size n×d. This creates a quadratic memory footprint (O(n^2)) because every token interacts with every other.
Linformer proposes that this full attention map can be approximated by a low-rank matrix; effectively saying the contextual dependencies across tokens live in a lower-dimensional subspace.
So, instead of K and V spanning n tokens, they are projected down to k rows (where k << n) using learned linear projections E_K and E_V: \(\mathrm{softmax}\!\left(\frac{Q\,(E_K K)^\top}{\sqrt{d}}\right)(E_V V)\).
Hence, instead of attending to all tokens, each token attends to a compressed “summary” of them.
Complexity:
Time: O(n*k)
Memory: O(n*k)
Though the complexity is reduced, the linear projection is static (not content-aware). Hence, performance drops on highly context-dependent or long-range reasoning tasks. The compression may also oversimplify fine-grained interactions, leading to context collapse/loss of long-term information.
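Here's a minimal sketch of the Linformer idea in NumPy, with randomly initialized E_K and E_V standing in for the learned projections; the sizes n, d, and k are arbitrary toy values.

```python
import numpy as np

n, d, k = 1024, 64, 128          # sequence length, head dim, projected length (k << n)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E_K, E_V = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))

K_proj, V_proj = E_K @ K, E_V @ V            # (k, d): compress n tokens into k "summaries"
scores = Q @ K_proj.T / np.sqrt(d)           # (n, k) instead of (n, n)
scores -= scores.max(axis=-1, keepdims=True)
A = np.exp(scores)
A /= A.sum(axis=-1, keepdims=True)           # row-wise softmax over the k summaries
out = A @ V_proj                             # (n, d), computed in O(n*k) time and memory
print(A.shape, out.shape)                    # (1024, 128) (1024, 64)
```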
3. Longformer: Sparse Patterns for Structured Context
While Linformer reduced dimensionality, Longformer (Beltagy et al., 2020) took another path: it re-used the low-rank intuition in a more subtle way, through sparsity. It observed that in most real-world tasks, tokens only need to attend to their local neighborhood or to specific global anchors (like CLS tokens).
So, instead of full attention, it defines a local sliding window of size w and a few global tokens that attend everywhere.
Here each token looks at the nearby w tokens (for local coherence) and at a few global ones (for context).
Complexity:
Time: O(nw)
Memory: O(nw)
Longformer is efficient for long text (documents, scientific papers) and keeps interpretability (local windows are intuitive to understand), but it has a few major issues:
Long-term dependencies still limited by window size.
Not ideal for tasks requiring deep hierarchical or global reasoning.
Fixed sparsity patterns are not adaptive to content.
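To make the sparsity pattern concrete, here's a small sketch that builds a Longformer-style attention mask (local window of size w plus a handful of global tokens). The window size and global indices are illustrative choices, not the paper's configuration.

```python
import numpy as np

def longformer_mask(n: int, w: int, global_idx=(0,)) -> np.ndarray:
    """Boolean (n, n) mask: True where token i is allowed to attend to token j."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - w // 2), min(n, i + w // 2 + 1)
        mask[i, lo:hi] = True                # local sliding window
    for g in global_idx:
        mask[g, :] = True                    # a global token attends everywhere
        mask[:, g] = True                    # and every token attends to it
    return mask

m = longformer_mask(n=12, w=4, global_idx=(0,))
print(m.sum(), "allowed pairs out of", m.size)   # ~O(n*w) entries instead of n^2
```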
4. Interleaved Attention: Balancing Full and Sparse
While models like Longformer used fixed sparsity, newer hybrid designs explore a smarter balance (Kimi Linear also uses this): what if we don’t need every layer to be full attention?
That’s the idea behind Interleaved Attention; alternating efficient and expressive layers in a fixed ratio, often written as k:1, e.g., 3:1 between lightweight (linear or sparse) attention layers and one full attention layer. The lightweight layers handle local reasoning and memory compression, while the full attention layer occasionally “refreshes” global coherence. This technique mirrors how the brain alternates between focused local thought and occasional global reflection.
Kimi uses 3 KDA (Kimi Delta Attention) layers for every 1 full Multi-Head Latent Attention (MLA) layer. This 3:1 interleave offers the best tradeoff between cost and expressivity.
Complexity:
Average Time: ≈ O(n) per KDA layer, O(n^2) for MLA.
Effective scaling ≈ sub-quadratic with near full-attention accuracy.
Advantages:
Significant compute and memory savings.
Retains global awareness while most computation remains linear.
Naturally compatible with hybrid architectures.
But finding the right k:1 ratio requires empirical tuning: too few full-attention layers leads to context drift, while too many erodes the efficiency gains.
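A tiny sketch of what a k:1 interleaved stack looks like as a layer plan; the layer names are placeholders, and the 3:1 ratio mirrors the Kimi Linear choice mentioned above.

```python
def build_interleaved_stack(num_layers: int, ratio: int = 3):
    """Return a layer plan: `ratio` lightweight layers, then one full-attention layer."""
    plan = []
    for i in range(num_layers):
        # every (ratio+1)-th layer is full attention, the rest are linear/KDA-style
        plan.append("FullAttn(MLA)" if (i + 1) % (ratio + 1) == 0 else "LinearAttn(KDA)")
    return plan

print(build_interleaved_stack(8))
# ['LinearAttn(KDA)', 'LinearAttn(KDA)', 'LinearAttn(KDA)', 'FullAttn(MLA)',
#  'LinearAttn(KDA)', 'LinearAttn(KDA)', 'LinearAttn(KDA)', 'FullAttn(MLA)']
```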
5. Mamba: Continuous-Time Selective State Space Models
Transformers, for all their power, treat text as discrete jumps token by token.
But what if we could make the flow of information continuous, just like signals evolving over time in a physical system?
That’s exactly the idea behind Mamba, a state-space model (SSM) that models sequential data not as discrete transitions but as a differential system: \(h'(t) = A\,h(t) + B\,x(t), \quad y(t) = C\,h(t)\).
Here:
x(t) is the input signal (tokens),
h(t) is the latent hidden state (memory),
A,B,C are learned continuous operators.
Each token influences the hidden state, and this state continuously evolves accumulating memory smoothly rather than resetting per token. That’s fundamentally different from Transformers, which rebuild context every forward pass.
Mamba’s recurrent dynamics let it carry context seamlessly across thousands of tokens efficiently (similar to RNNs), but instead of storing all information for each state, they operate on delta space (RNNs store states, Mamba models the deltas between states, but those deltas are learned, continuous, and adaptive, not just raw differences).
Complexity:
Time: O(n)
Memory: O(1)
Advantages:
Handles ultra-long sequences effortlessly.
Naturally preserves temporal and causal structure.
Extremely memory efficient, which makes it suitable for edge and streaming setups.
Limitations:
Limited cross-token expressivity (great at remembering, less so at reasoning across multiple entities).
Captures sequential flow well, but lacks flexible, relational representation like attention.
Also, it’s difficult to parallelize fully, as updates are inherently sequential in structure.
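For intuition, here's a heavily simplified, discretized sketch of an SSM recurrence. Real Mamba makes A, B, C input-dependent (selective) and uses careful discretization, so treat the matrices and sizes below as toy stand-ins.

```python
import numpy as np

d_state, d_in, n = 16, 8, 100
rng = np.random.default_rng(0)
A = np.eye(d_state) * 0.95                    # decay-like state transition (toy)
B = rng.standard_normal((d_state, d_in)) * 0.1
C = rng.standard_normal((d_in, d_state)) * 0.1

h = np.zeros(d_state)                         # constant-size hidden state -> O(1) memory
ys = []
for x_t in rng.standard_normal((n, d_in)):    # O(n) time: one small update per token
    h = A @ h + B @ x_t                       # state evolves smoothly, carrying context
    ys.append(C @ h)                          # readout
print(np.stack(ys).shape)                     # (100, 8)
```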
6. Gated Delta Networks (GDN): Recurrent Memory Inside Attention
After Mamba, another line of thinking emerged: “What if we don’t throw away attention, but reformulate it as a recurrent update?” That’s where Gated Delta Networks (GDN) step in; a clever bridge between full attention and state-space models.
Attention as Incremental Memory
In standard attention, we compute \(\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V\).
Every token interacts with every other; powerful but expensive (O(n²)). GDN rethinks this interaction as a series of incremental updates to a shared memory matrix S: \(S_t = (I - \beta_t k_t k_t^\top)\,S_{t-1} + \beta_t k_t v_t^\top\).
Here:
St is the global memory at step t,
kt,vt are the key and value vectors,
βt is a gating coefficient controlling how much new information overwrites the old.
Instead of re-computing full pairwise attention, GDN maintains a running summary of past keys and values (a low-rank memory that’s updated on the fly).
Each step asks: “How should the new token reshape my understanding of the past?” That’s the delta rule: the change between two states encodes new learning. You can actually interpret GDN’s recurrence as online gradient descent on a reconstruction loss \(\mathcal{L}_t = \tfrac{1}{2}\lVert S_{t-1}^\top k_t - v_t\rVert^2\):
At every token, GDN minimizes the mismatch between predicted and true values, updating the memory St just like a tiny optimizer would. This is what makes it self-correcting; every token is a small, local learning step.
The gating term β_t (which we saw above) is shared per attention head, not per feature channel. That means all dimensions within a head forget or remember at the same rate; a coarse form of control. Kimi Linear’s Delta Attention (KDA) fixes this by introducing a channel-wise gate (α_t), allowing each feature dimension to evolve independently, which is a huge leap in representational fidelity. Hence, in terms of time and memory complexity:
Time: O(n)
Memory: O(1)
Parallelism: chunkwise sequential
Because each update is rank-1 (only one new key–value pair at a time), per-token cost stays constant and total compute grows only linearly with sequence length, while the memory state itself stays a fixed size; no quadratic blowup.
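Here's a minimal sketch of this delta-rule recurrence with a scalar β per step; the dimensions and gating value are toy choices meant to show the fixed-size state, not GDN's actual kernel.

```python
import numpy as np

d_k, d_v, n = 16, 16, 64
rng = np.random.default_rng(0)
S = np.zeros((d_k, d_v))                          # running memory (fixed size)

for t in range(n):
    k = rng.standard_normal(d_k); k /= np.linalg.norm(k)
    v = rng.standard_normal(d_v)
    beta = 0.5                                    # gating / learning-rate-like coefficient
    # delta rule: erase the old content along k, then write the new (k, v) association
    S = (np.eye(d_k) - beta * np.outer(k, k)) @ S + beta * np.outer(k, v)

q = rng.standard_normal(d_k)
print((S.T @ q).shape)                            # (16,) readout, no n x n attention map
```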
Limitations
Lacks fine-grained per-feature memory gating.
No inherent position encoding as it relies on external mechanisms like RoPE (this is also fixed in Kimi-Linear).
Slightly less expressive than full attention for relational understanding.
Why Positional Embeddings Are a Problem?
Transformers, unlike RNNs, have no inherent notion of order as they treat inputs as sets, not sequences. To fix this, we inject positional information externally, but that workaround comes with hidden tradeoffs.
Fixed (Sinusoidal) Embeddings : The earliest approach used sinusoidal patterns (like waves of different frequencies) to encode position. These embeddings are deterministic, memory-free, and extrapolate to longer sequences, but they’re rigid. The model learns correlations tied to specific frequency combinations, making it hard to generalize when positional dependencies shift or extend beyond trained lengths.
Learned (Absolute) Embeddings : Then came learned positional embeddings, where each position had a trainable vector. This improves flexibility, but locks the model to the maximum sequence length it was trained for. Worse, it consumes parameter space and breaks when extrapolating beyond that length.
Rotary Position Embeddings (RoPE) : RoPE introduced a clever trick: instead of adding position vectors, it rotates the query-key pairs in a continuous, frequency-dependent way. This preserves relative distance and supports extrapolation better than fixed embeddings. But RoPE still couples position and token meaning too tightly; at high sequence lengths, the rotations can distort similarity structure, leading to attention drift and unstable gradients.
All these methods bolt position onto the model rather than letting it emerge from computation itself. They’re syntactic fixes for a structural problem; Transformers should understand order intrinsically, not by addition or rotation.
This is why newer architectures like Kimi Linear embed positional dynamics within their recurrence updates. Instead of appending position as metadata, the model’s decay and gating mechanisms inherently reflect recency, sequence flow, and structural memory; hence making positional awareness part of how it thinks, not what it remembers.
The Ideal Case
In the ideal world of attention, computation wouldn’t scale quadratically, memory wouldn’t explode with context length, and positional information wouldn’t need to be patched in as an afterthought; it would simply emerge from how the model processes sequences.
At its best, an attention mechanism should:
Preserve long-range context without truncation.
Scale linearly (or near-linearly) with sequence length.
Dynamically allocate compute, spending more on reasoning-heavy or high-signal tokens and less on trivial ones.
Encode positional bias natively within its architecture, not through external embeddings or fixed rotations.
In this perfect setup, the boundary between efficiency and expressivity disappears and models can think deeply without paying the exponential compute tax.
This is exactly where Kimi Linear fits in. It brings us a step closer to that equilibrium where memory, positional understanding, and attention all blend into a single, continuous mechanism. By embedding recency and decay directly into its update rules, Kimi Linear makes positional information part of the computation itself (as done in RNNs due to their sequential nature) not a separate add-on.
In essence, it’s what attention should have been from the start: efficient, dynamic, and inherently aware of order and context. With that said, let’s understand the methodology behind Kimi-Linear (how it achieves the goals above).
Methodology
At the heart of Kimi-Linear lies a simple question: can we retain the expressiveness of full attention while achieving linear time and memory complexity?
To get there, the authors combine insights from state-space models (like Mamba) and efficient attention architectures (like GDN) to create a hybrid mechanism that balances structure, scalability, and stability.
But before diving into the advancements, let’s see what the training and inference pipelines look like; this will also give us a high-level idea of each component and how they relate to each other.
Training Pipeline
Input & Chunking
Text is tokenized, embedded, and split into chunks for efficient parallel processing.
(Each chunk holds a few hundred tokens.)

Projection & Gating
For each token:
\(q_t, k_t, v_t = W_q E(x_t),\ W_k E(x_t),\ W_v E(x_t)\)
\(\alpha_t = f_a(k_t),\ \beta_t = f_b(k_t)\)
The channel-wise decay α_t and update strength β_t are computed via lightweight MLPs (f_a and f_b).

KDA State Update (core recurrence)
\(S_t = (I - \beta_t k_t k_t^\top)\,\mathrm{Diag}(\alpha_t)\,S_{t-1} + \beta_t k_t v_t^\top\)
Each chunk updates its own compact memory S using rank-1 operations (linear time).

Chunk Compression (DPLR trick)
Within a chunk, multiple rank-1 updates are compressed into a single low-rank form using the Diagonal-Plus-Low-Rank (DPLR) and WY representations. This enables chunk-level parallelism and constant memory.

Interleaved Full Attention (MLA)
After every 3 KDA layers, a Multi-Head Latent Attention layer re-introduces global context. (No positional embeddings are required, as KDA’s α acts as a learned positional decay.)

Loss & Backprop
Outputs go through the LM head for next-token prediction or reasoning-based objectives. Gradients flow through compressed operators; memory grows linearly with sequence length.
Now that we have seen all the required components at an abstract level, let's dive into each one.
1. From GDN to Kimi Delta Attention (KDA)
Gated Delta Networks (GDN) were designed as a bridge between recurrent updates and attention-style information aggregation.
They operate using a forget gate that controls how much of the previous memory to retain at each step which is similar in spirit to GRUs or LSTMs, but formulated in continuous time.
However, GDN uses a head-wise gating mechanism: each attention head has a single forget coefficient.
That’s a very coarse form of control: you can’t decide which feature channels (dimensions) within the head should decay faster or slower.
Kimi Delta Attention (KDA) refines this by introducing channel-wise gating, allowing each dimension in the feature space to have its own independent forgetting rate.
This flexibility makes it far more expressive as certain semantic dimensions (like syntax, recency, topic cues) can persist longer, while noisy or irrelevant channels can decay rapidly.
2. The KDA Update Rule
Formally, KDA defines a recursive update on a state matrix \(S_t \in \mathbb{R}^{d_k \times d_v}\) as:
\(S_t = (I - \beta_t k_t k_t^\top)\,\mathrm{Diag}(\alpha_t)\,S_{t-1} + \beta_t k_t v_t^\top\)
Here’s what each term does:
kt,vt: key and value vectors for the current token
βt: learning rate-like term controlling update strength
αt: per-channel forget gate (a vector)
Diag(αt): diagonal matrix that scales each feature channel independently
Intuitively, this is an online gradient descent step (as we saw above for GDN), minimizing a local reconstruction loss \(\mathcal{L}_t = \tfrac{1}{2}\lVert S_{t-1}^\top k_t - v_t\rVert^2\),
meaning the model updates its memory so that \(S^\top k_t\) better reconstructs \(v_t\) (our target).
The forget term acts as a soft weight decay, keeping the memory bounded and stable.
We can think of this as three analytic components (in KDA update recurrence rule):
Decay / shrinkage: Diag(αt) multiplies each channel (row) of St−1 by a channel-specific factor (very similar to formulation of regularization)
Rank-1 forget/update: (I−βtktkt⊤) is a low-rank transform that modulates memory along the key direction kt.
Rank-1 write: βtktvt⊤ is a simple outer product that writes the new key/value into memory.
These combine to produce an efficient online update that (a) keeps memory bounded, (b) allows per-feature persistence, and (c) makes the memory amenable to fast low-rank algebra.
This gives KDA two crucial properties:
Linear-time recurrence : since the update depends only on St−1, not all previous tokens.
Learnable memory decay : the model controls how long past information stays relevant.
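Putting the pieces together, here's a hedged NumPy sketch of a single KDA-style update step (per-channel α, scalar β, following the \(S_t \in \mathbb{R}^{d_k \times d_v}\) convention above); the shapes and gate values are illustrative only, not the actual kernel.

```python
import numpy as np

def kda_step(S, k, v, alpha, beta):
    """One KDA-style update: S_t = (I - beta k k^T) Diag(alpha) S_{t-1} + beta k v^T."""
    d_k = k.shape[0]
    decayed = alpha[:, None] * S                                 # Diag(alpha) S_{t-1}: per-channel decay
    forgot = (np.eye(d_k) - beta * np.outer(k, k)) @ decayed     # rank-1 forget along k
    return forgot + beta * np.outer(k, v)                        # rank-1 write of the new association

rng = np.random.default_rng(0)
d_k, d_v = 8, 8
S = np.zeros((d_k, d_v))
for _ in range(32):
    k = rng.standard_normal(d_k); k /= np.linalg.norm(k)         # L2-normalized key
    v = rng.standard_normal(d_v)
    alpha = 1 / (1 + np.exp(-rng.standard_normal(d_k)))          # channel-wise forget gates in (0, 1)
    S = kda_step(S, k, v, alpha, beta=0.5)
print(S.shape)                                                   # (8, 8): fixed-size memory
```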
But what does Diag(α_t) really do?
Intuition first: Diag(α_t) is a per-channel multiplicative gate, unlike the per-head gate we saw in GDN. If the memory S_{t−1} has d_k rows (each row ↔ a feature channel), then α_t[i] ∈ (0,1) scales row i of S_{t−1} before we apply the rank-1 forget and write. This single operation has several connected interpretations:
Soft selection / retention. If α_t[i] ≈ 1, channel i is preserved; if α_t[i] ≈ 0, channel i is effectively zeroed out (forgotten). So Diag(α_t) realizes a soft, differentiable selection across channels, providing finer-grained control than a head-wise forget gate.
Anisotropic regularization (shrinkage). Multiplicative decay is mathematically equivalent to a structured weight decay on the memory: repeatedly multiplying by numbers <1 progressively shrinks certain channels. This is analogous to L2-style shrinkage but applied online across time, with the shrinkage amount learned and data-dependent.
Interference control. Channels in S capture different semantic/functional components (syntax, topic, recency signals, etc.). A channel-wise gate prevents strong signals in some channels from being overwritten or overwhelmed by noisy updates in other channels, so it reduces catastrophic interference.
Stability & eigenvalue control. Since the memory is repeatedly multiplied, keeping channel gains controlled avoids explosive eigenvalues (which cause unstable training). A per-channel α allows selective damping where necessary.
Algorithmic consequence. Because α_t is computed from the input (via a small network / low-rank projection of q/k/v), the model learns when to remember which feature channels, making memory usage adaptive to context. This (as we saw above) is better than head-wise gates, which apply the same scalar to all channels in a head; Diag(α_t) gives d_k different scalars, which is a huge jump in representational flexibility.
3. The Role of DPLR Matrices
The above recurrence (KDA update rule of St) is elegant but not directly parallelizable as you’d have to process one token at a time. To solve this, the authors introduce a chunkwise parallelization strategy using Diagonal-Plus-Low-Rank (DPLR) matrices.
A general DPLR matrix looks like \(A_t = D_t + a_t b_t^\top\),
where D_t is diagonal and \(a_t b_t^\top\) is a rank-1 correction term. If a_t and b_t are independent learned vectors, applying a sequence of updates or combining chunked results often requires multiple matrix multiplies and extra reductions. Naively implementing many rank-1 updates sequentially leads to multiple GEMMs (GEneral Matrix Multiplications) and lots of memory traffic.
Kimi ties the DPLR low-rank vectors to the key vector, i.e., both a_t and b_t are derived from k_t (via simple linear projections or the identity). Why does this help?
Fewer free vectors to handle: rank-1 updates now share structure across tokens (they all come from k-space), so when you compress a chunk of updates you can represent the whole chunk with fewer aggregated factors.
Eliminates redundant chunking steps: with general DPLR you’d need two levels of chunk reduction (collect a, b per block, then form final products). With k-binding you can fold reductions algebraically (the sums factor through k), saving intermediate matmuls.
Concrete operator savings: this simplification allows fast multiplication via the WY representation and UT transforms (likely; this is not spelled out explicitly in the paper), compressing several rank-1 updates into a compact form. As a result, the authors report roughly three fewer matrix multiplications per chunk and about a 100% improvement (roughly 2×) in operator efficiency compared to a generic DPLR implementation.
Put differently: for large chunk sizes the asymptotic number of GEMMs drops significantly, which matters on GPU kernels where GEMM is the bottleneck.
This is where hardware-awareness really shows; as the DPLR formulation converts what would be a sequential recurrence into parallel chunks that can be computed efficiently across GPU threads.
(If this is not satisfactory, scroll down to the optional/additional section below to find me struggling through a longer, and hopefully better, attempt at explaining the above.)
4. Architectural Changes : a Hybrid, Hierarchical Design
At its heart, Kimi Linear adopts a hybrid attention architecture, interleaving linear and full attention layers to combine long-range efficiency with global synchronization.
(a) Interleaved Attention Layout
The architecture alternates KDA (Kimi Delta Attention) layers with MLA (Multi-head Latent Attention) layers.
The empirical best ratio is 3:1, meaning every three lightweight KDA layers are followed by one full-attention MLA layer.
Why?
Linear layers (KDA) are efficient but somewhat local in their context propagation, while full-attention layers (MLA) reintroduce global communication.
This interleaving allows:
KDA layers to handle local recurrence and memory consolidation.
MLA layers to periodically re-align global dependencies and long-term semantics.
(b) (NoPE) No Positional Encoding
Kimi Linear eliminates explicit positional embeddings (like Rotary or Absolute encodings).
Instead, the gating and decay mechanisms inside KDA implicitly encode position and recency bias.
This shift is huge as it removes positional encoding as a preprocessing step and makes positional information an emergent property of the architecture/modelling choice itself (similar to recurrent architecture).
The forget rate (αt) acts as a continuous, learnable positional decay, recent tokens have stronger influence, older ones fade gracefully, without discrete embedding overheads or hard orthogonal constraints like RoPE.
(c) Normalized Query/Key Vectors
Each q_t, k_t vector is L2-normalized, stabilizing the spectral behavior of the update matrix S_t. This, combined with Diag(α_t), ensures that the recurrent memory doesn’t explode or vanish, allowing longer unrolls.
(d) Low-Rank Parameterization of Forget Gates
α_t (the channel-wise decay vector) isn’t learned directly per token; it’s produced through a low-rank projection (a small down-projection followed by an up-projection).
This acts as an efficient bottleneck layer, reducing compute while keeping the gate expressive.
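A sketch of what such a low-rank gate parameterization could look like; the exact inputs, the projection names (W_down, W_up), and the sigmoid are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

d_model, r, d_k = 512, 32, 128          # r << d_model: the low-rank bottleneck
rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, r)) / np.sqrt(d_model)
W_up   = rng.standard_normal((r, d_k)) / np.sqrt(r)

def forget_gate(x_t):
    """Channel-wise decay alpha_t in (0, 1), produced through a cheap low-rank projection."""
    z = x_t @ W_down @ W_up             # d_model -> r -> d_k instead of a full d_model x d_k map
    return 1.0 / (1.0 + np.exp(-z))     # sigmoid keeps each channel's gate bounded

alpha_t = forget_gate(rng.standard_normal(d_model))
print(alpha_t.shape, alpha_t.min() > 0, alpha_t.max() < 1)   # (128,) True True
```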
(e) Output Gating
A Sigmoid-activated output gate modulates final token activations, functioning like a selective readout which is crucial when mixing multiple timescales of attention (linear + full).
In summary (architecture-level):
Hybrid KDA/MLA stacking (3:1 ratio)
Implicit positional encoding via αₜ gating
Per-channel forget gates and normalized q/k
Low-rank gate generation for efficiency
Lightweight output gating for stability
5. Training-time Changes
Kimi Linear is explicitly designed to scale to long contexts and RL-style fine-tuning. That requires a few critical training modifications.
(a) Stable Recurrence Unrolling
Because KDA has a recurrent state St, it’s trained with truncated backpropagation through time (BPTT), similar to RNNs.
However, the diagonal decay (αt) and normalization of q,k make this recurrence numerically stable even over thousands of steps.
KDA’s St is the compressed, learned summary of past K/V pairs. Storing K/V explicitly (like standard Transformers) is unnecessary and memory-heavy. S functions as the write-once, read-many store.
(b) NoPE + Hybrid Regularization
Since positional embeddings are gone, training loss includes regularizers on αₜ to prevent collapse (e.g., biasing toward mid-range decay values early in training).
This ensures that the network learns to use αₜ for positional coding naturally, rather than collapsing all gates to 1 (no decay) or 0 (total forget).
Now that we have a better understanding of each component than when we started, let’s see what the inference step looks like:
Inference Pipeline (summary)
State Initialization
Initialize the compact state S_0; no KV cache.

Per-Token Update (O(1) cost)
For each new token:
Compute q_t, k_t, v_t, α_t, β_t
Update memory:
\(S_t = (I - \beta_t k_t k_t^\top)\,\mathrm{Diag}(\alpha_t)\,S_{t-1} + \beta_t k_t v_t^\top\)

Readout
Predicted hidden state \(h_t = S_t^\top q_t\); this is projected to logits to sample the next token.

Periodic MLA Correction
Every few layers/tokens, run a compact MLA block for fine-grained global attention.

Output Generation
Since no KV cache or quadratic softmax is used, compute and memory scale linearly with context.
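Pulling the steps together, here's a compact sketch of the per-token inference loop for a single KDA-style head (toy projections, no MLA correction, no real tokenizer or LM head); it only illustrates that per-token compute and memory stay constant.

```python
import numpy as np

d_model, d_k, d_v = 64, 16, 16
rng = np.random.default_rng(0)
Wq, Wk = (rng.standard_normal((d_model, d_k)) * d_model**-0.5 for _ in range(2))
Wv = rng.standard_normal((d_model, d_v)) * d_model**-0.5

S = np.zeros((d_k, d_v))                       # S_0: the only state we carry, no KV cache
for step in range(10):                         # toy "token stream"
    x = rng.standard_normal(d_model)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    q /= np.linalg.norm(q); k /= np.linalg.norm(k)            # normalized q/k for stability
    alpha = 1 / (1 + np.exp(-rng.standard_normal(d_k)))       # stand-in for the learned gate
    beta = 0.5
    S = (np.eye(d_k) - beta * np.outer(k, k)) @ (alpha[:, None] * S) + beta * np.outer(k, v)
    h = S.T @ q                                # readout h_t = S_t^T q_t -> logits -> next token
print(S.shape, h.shape)                        # (16, 16) (16,): constant regardless of length
```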
In Summary
At its core, Kimi Linear reimagines attention as a linear, recurrent process; one that captures long-term context efficiently without the quadratic overhead of full attention.
It builds on Gated DeltaNet (GDN) but introduces a more expressive Kimi Delta Attention (KDA), where forgetting happens per channel instead of per head. This channel-wise gating (Diag(αₜ)) lets each feature control its own memory decay, allowing finer temporal precision and stability.
To make this efficient, Kimi Linear uses a Diagonal-Plus-Low-Rank (DPLR) structure that packs a sequence of rank-1 updates into a compact form, removing redundant computations. This is further optimized through a UT transform, which replaces expensive matrix inversions with lightweight triangular updates while keeping everything GPU-friendly.
Architecturally, it alternates three KDA layers with one full Multi-Head Latent Attention (MLA) layer; hence; balancing efficiency with global context. The design uses no positional encodings; instead, positional awareness naturally emerges from KDA’s decay dynamics.
Finally, small details like L2-normalized keys and queries, low-rank projections for decay gates, and ShortConv activations stabilize training and maintain smooth information flow.
The result is a linear attention system that behaves like full attention, but at a fraction of the cost.
(Table: a quick summary of the key updates introduced in the Kimi-Linear paper.)
Detailed walkthrough on the parallel chunking of KDA by DPLR (additional + optional; I am not sure how to make this collapsible :-D)
The core challenge
Let’s recall the KDA recurrence: \(S_t = (I - \beta_t k_t k_t^\top)\,\mathrm{Diag}(\alpha_t)\,S_{t-1} + \beta_t k_t v_t^\top\).
If you expand this over many time steps, you’ll see that each update depends on every previous key-value pair. Naively doing this for a full sequence of thousands of tokens would still be sequential and slow. The authors rework the same recurrence into a chunkwise formulation that can be parallelized efficiently while still preserving the causal, step-by-step structure.
1. From sequential to chunkwise
Equation (2) expands the recurrence into chunks of r steps (instead of a single step).
Formally/simply: \(S_{r[t]} = P_{r[t]}\,S_{r[t-1]} + H_{r[t]}\)
What’s happening here:
\(P_{r[t]}\) captures how past states decay and interact within a chunk (it’s like the “propagation” part).
\(H_{r[t]}\) accumulates all the new information injected by tokens inside the chunk.
So rather than updating token-by-token, we compute all of them together for the chunk using products of small rank-1 transformations like \((I - \beta_i k_i k_i^\top)\).
This means, Kimi-Linear splits the sequence into smaller “chunks” (say, 256 tokens).
Each chunk is processed independently and in parallel, and then combined recurrently.
This is where the chunkwise formulation enters; this makes the math chunk-parallel but logically identical to the original recurrence.
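To convince yourself that the chunkwise view is exactly equivalent to the token-by-token recurrence, here is a small numerical check. It uses naive dense products rather than the WY/UT-compressed kernels, so it demonstrates only the algebra (chunk-local operators P and H combined recurrently across chunks), not the speed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n, r = 6, 5, 24, 8                      # 24 tokens, chunks of 8
ks = rng.standard_normal((n, d_k)); ks /= np.linalg.norm(ks, axis=1, keepdims=True)
vs = rng.standard_normal((n, d_v))
alphas = 1 / (1 + np.exp(-rng.standard_normal((n, d_k))))
betas = np.full(n, 0.5)

def step_op(t):
    """Per-token transition M_t = (I - beta k k^T) Diag(alpha) and injection b_t = beta k v^T."""
    M = (np.eye(d_k) - betas[t] * np.outer(ks[t], ks[t])) @ np.diag(alphas[t])
    return M, betas[t] * np.outer(ks[t], vs[t])

# 1) plain sequential recurrence, one token at a time
S_seq = np.zeros((d_k, d_v))
for t in range(n):
    M, b = step_op(t)
    S_seq = M @ S_seq + b

# 2) chunkwise: each chunk independently reduces to (P, H), then chunks combine recurrently
S_chunk = np.zeros((d_k, d_v))
for c in range(0, n, r):
    P, H = np.eye(d_k), np.zeros((d_k, d_v))      # chunk-local, independent of S -> parallelizable
    for t in range(c, c + r):
        M, b = step_op(t)
        P, H = M @ P, M @ H + b                   # fold this token into the chunk operator
    S_chunk = P @ S_chunk + H                     # one cross-chunk update per chunk

print(np.allclose(S_seq, S_chunk))                # True: same math, chunk-parallel structure
```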
2. Packing updates compactly (WY Representation)
Each token update \(S_t = (I - \beta_t k_t k_t^\top)\,\mathrm{Diag}(\alpha_t)\,S_{t-1} + \beta_t k_t v_t^\top\)
is a rank-1 modification to the running state matrix S_t. Over a chunk of 256 tokens, that’s 256 tiny rank-1 updates; a perfect setup for low-rank compression.
The WY representation compresses all these small updates into a single, compact form, avoiding repeated multiplications or inversions.
Here’s the intuition:
The Diagonal part \((\mathrm{Diag}(\gamma_{r[t]}))\) captures forgetting; a kind of decay over time controlled by α.
The Low-Rank part \((k_i w_i^\top)\) captures the accumulated effect of all keys within the chunk.
Together, this means instead of storing a full large matrix, you only store a few compact rank-1 terms.
That’s where the hardware win comes from.
This is like compressing hundreds of small linear transformations into a single efficient operator that GPUs can chew through in one go.
3. The role of wt and ut
The auxiliary vectors wt and ut are not arbitrary; they recursively capture how each new token’s key-value pair interacts with the previous memory within a chunk:
Intuitively:
wt tracks key-wise interactions (structure of attention).
ut tracks value-wise accumulation (content or meaning).
Both act like memory traces; storing compressed summaries of all intra-chunk dependencies without explicitly materializing full attention matrices.
4. The UT transform (hardware optimization)
Next, the UT transform steps in to make this structure hardware-efficient, since matrix multiplications are expensive if not batched well. The UT transform cleverly reformulates the update so that:
Most operations can be done as triangular solves (fast on GPUs, since triangular/diagonal matrices are much cheaper to work with; see the notion of similar matrices),
And the number of non-matmul FLOPs drops drastically.
The transform is essentially an in-place preconditioning; its task is to turn a series of dependent operations into a form that can be executed efficiently by the GPU’s Tensor Cores.
The result is a set of compressed representations of all the keys and values in the chunk, ready for parallel processing.
5. Chunkwise state update
After all this transformation, the memory update becomes beautifully compact:
Intuitively:
The first term is memory decay (old state fades by γ).
The second term adds new contextual information computed from all keys and values in the chunk.
This update is now fully parallel within chunks, while causal across chunks.
6. Output stage — inter-block recurrence, intra-block parallelism
Finally, the output computation (Eq. 9) combines two paths:
During the output stage, authors adopt an inter-block recurrent and intra-block parallel strategy to maximize matrix multiplication throughput, thereby fully utilizing the computational potential of Tensor Cores.
Think of it like this:
Inter-chunk recurrence keeps the long-term dependency chain alive.
Intra-chunk parallelism maximizes throughput inside each block using matrix operations that fully exploit GPU Tensor Cores.
The second term (called “pseudo-value”) acts like a correction term ensuring local consistency as it refines the output using recent tokens’ context.
In short
The entire chunkwise algorithm is an elegant hardware-aware reformulation of linear attention:
It replaces serial token updates with parallel chunk updates.
It compresses many rank-1 updates via DPLR structure and WY representation.
It uses UT transforms to align operations with GPU efficiency.
It preserves causality and context fidelity, while scaling linearly with sequence length.
So rather than an approximation of attention, this is a reconstruction of the same reasoning dynamics, reshaped to match the physics of computation.
Results & Outcomes
1. Synthetic Tasks
On classic sequence reasoning benchmarks like Palindrome, Multi-Query Associative Recall (MQAR), and Stack, Kimi Delta Attention (KDA) not only converged faster but also achieved higher accuracy than both Gated DeltaNet (GDN) and Mamba2.
These results validate that fine-grained channel-wise decay (αₜ) leads to more expressive and controllable memory updates compared to GDN’s coarse head-wise forget gate.
Put simply: Kimi doesn’t just remember better — it remembers smarter.
2. Ablations
Through extensive ablation studies, several architectural decisions were confirmed as crucial:
3:1 KDA-to-MLA ratio emerged as the sweet spot, balancing reasoning depth with global context modeling.
Sigmoid output gating stabilized gradient flow and improved representational sharpness.
ShortConv layers were shown to enrich local feature mixing.
No Position Encoding (NoPE) outperformed traditional RoPE when integrated with KDA layers, proving that positional awareness can emerge naturally through gated delta recurrences.
3. Scaling Behavior
Under compute-optimal scaling, Kimi Linear achieved 1.16× higher computational efficiency than Multi-Head Latent Attention (MLA) baselines.
This scaling law behavior mirrors that of full transformers but with significantly lower FLOPs per token, confirming that the architecture generalizes efficiently as it scales.
4. Main Results (1.4T Tokens)
When pretrained on 1.4 trillion tokens, Kimi Linear consistently outperformed strong baselines both during pretraining and supervised fine-tuning (SFT):
The training and test accuracy curves for Kimi Linear@1.4T and MLA@1.4T during Math RL training. Kimi Linear consistently outperforms the full attention baseline by a sizable margin during the whole RL process.
When scaled to 5.7T tokens, Kimi Linear continued to outperform powerful baselines like Moonlight, achieving a RULER@1M score of 94.8; one of the highest reported for any efficient attention model to date. This result confirms that its hybrid recurrence–attention framework doesn’t saturate early; it scales smoothly with more data and compute.
5. Efficiency and Hardware Metrics
At the hardware level, Kimi Linear sets a new bar for throughput and memory efficiency:
2× faster KDA kernels than generic DPLR kernels (for up to 64k sequences).
Negligible prefill latency, matching GDN-H and far surpassing MLA at scale:
2.3× faster at 512k tokens.
2.9× faster at 1M tokens.
Decoding Efficiency: Up to 6.3× faster Time Per Output Token (TPOT) than MLA at 1M context, on par with GDN-H.
KV Cache: Up to 75% reduction in memory footprint, since KDA maintains a fixed-size state matrix of shape (dₖ × dᵥ) instead of context-proportional buffers.
These improvements make Kimi Linear not just theoretically elegant but practically deployable; which is a rare achievement in long-context modeling. By combining linear recurrence, fine-grained gating, and hardware-aware chunking, it bridges the gap between expressivity and scalability.
Complexity Analysis
1. Time Complexity
Traditional full attention requires computing the similarity matrix \(QK^\top \in \mathbb{R}^{n \times n}\),
making it O(n² · d) in time, hence quadratic in sequence length. Kimi Linear, by replacing this dense operation with rank-1 updates and chunkwise recurrences, reduces the cost to roughly \(O(n \cdot d_k \cdot d_v)\),
which is effectively linear in sequence length. This happens because each new token only updates a compressed state \(S_t \in \mathbb{R}^{d_k \times d_v}\) via the KDA rule rather than recomputing all pairwise attentions; remember also that d_k and d_v are fixed. Hence, the complexity can be rewritten as \(O(n \cdot c^2)\),
where c² is a constant (c being on the order of d_k and d_v), hence effectively O(n) complexity. The hardware-efficient chunking further amortizes the computation, grouping rank-1 updates and processing them in parallel. In practice, KDA kernels are ~2× faster than generic DPLR kernels, and up to 6.3× faster than full attention during decoding at 1M context length.
2. Space Complexity
Full attention stores the complete key–value cache, which grows as \(O(n \cdot d)\)
per layer (n keys plus n values of dimension d). For long-context tasks, this is the main bottleneck. Kimi Linear’s fixed-size memory state S_t is independent of sequence length; this brings the cache down to \(O(d_k \cdot d_v)\),
meaning the cache never expands with input length: our S_t is simply a d_k × d_v matrix (which is fixed). Empirically, KV cache usage drops by up to 75%, enabling 1M+ token contexts without memory overflow.
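A quick back-of-the-envelope comparison of the two, using made-up but plausible numbers (fp16, 64 heads of dimension 128, a single layer); the actual Kimi Linear dimensions are not asserted here.

```python
bytes_fp16 = 2
n_ctx, n_heads, d_head = 1_000_000, 64, 128

# Full attention: cache one key and one value vector per token, per head, for this layer
kv_cache = n_ctx * n_heads * d_head * 2 * bytes_fp16
# KDA-style layer: one fixed d_k x d_v state per head, regardless of context length
kda_state = n_heads * d_head * d_head * bytes_fp16

print(f"KV cache : {kv_cache / 1e9:.1f} GB per layer at 1M tokens")    # ~32.8 GB
print(f"KDA state: {kda_state / 1e6:.1f} MB per layer, independent of context")  # ~2.1 MB
```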
Thoughts
Attention with memory (RNNs inside Transformers):
Kimi Linear feels like an RNN nested inside an attention head. The gated recurrence acts as an implicit memory controller, blending the selective focus of transformers with the continuous temporal modeling of RNNs.

Channel-wise decay as micro-attention:
Each channel’s independent forgetting rate is like having attention within attention itself (sort of like a weighted feature selector); a localized, learnable dropout that dynamically regulates what matters and what fades.

Bridging modalities through persistence:
The decay-driven design can easily generalize beyond text. Audio, time-series, and even video frames share the same need for gradual forgetting with continuity, making Kimi Linear a strong blueprint for cross-modal architectures.

NoPE as architectural positional encoding:
Removing explicit positional embeddings and relying on KDA’s internal dynamics feels like an architectural realization of position, where time itself is baked into the recurrence, not appended as metadata (here time is treated as an index set).

A path toward continuous transformers:
If traditional attention is discrete sampling, Kimi Linear is closer to a continuous flow; something between sequence modeling and dynamical systems.
Conclusion
At its essence, Kimi Linear isn’t just a faster transformer; it’s a rethinking of what attention is supposed to be. Instead of treating every token as an equal participant in a quadratic dance of pairwise similarities, it brings attention closer to how memory evolves in time: stateful, selective, and efficient.
We started with the age-old problem; the quadratic cost of attention. The very mechanism that made Transformers so powerful also limited their scalability. From RNNs to Linformer and Longformer, we saw multiple attempts at trading context length for efficiency. Each helped in some way but brought its own compromises like sparse patterns, fixed receptive fields, or loss of fine-grained context.
Then we discussed methods like Mamba and GDN, which brought recurrence and memory back into attention. They hinted at something deeper, maybe attention doesn’t need to attend everywhere; maybe it just needs to remember better.
From there we explored the ideal situation and how Kimi-Linear attempts to reach it: we first saw important shifts like KDA and the use of DPLR transformations for compute efficiency, along with architectural (NoPE) and objective-level changes. Eventually we took a deep dive into how DPLR actually works here, and then discussed the results and outcomes along with a complexity analysis.
Kimi-Linear feels like an RNN as a sub-module inside a transformer, without the parallelization bottlenecks and ambiguities. With this, we come to the end of this pretty long article.
Some interesting References:
Kimi-Linear paper : https://arxiv.org/pdf/2510.26692
https://www.reddit.com/r/LocalLLaMA/comments/1oorvd0/in_light_of_kimi_linear_reposting_minimaxs/
https://www.themoonlight.io/en/review/kimi-linear-an-expressive-efficient-attention-architecture
Setup/code : https://github.com/MoonshotAI/Kimi-Linear
That’s all for today.
Follow me on LinkedIn and Substack for more insightful posts, till then happy Learning. Bye👋