mHC: Generalized Residual Connections in DeepSeek-V4
The residual connection is one of the most important inventions in modern deep learning. Before ResNet popularized it in 2015, training networks much deeper than about 20 layers was impractical: gradients tended to explode or vanish on their way back through the network, and adding layers often made optimization worse. The ResNet fix was simple. Instead of asking each layer to compute the next hidden state from scratch, let it compute a correction and add that to the input.
\[x_{l+1} = x_l + F_l(x_l).\]
This additive structure gives gradients a direct path from the loss back to early layers, bypassing all the intermediate nonlinearities. It works so well that nearly every modern transformer still uses it essentially unchanged.
However, residual connections have a fundamental bottleneck. Numbering the layers $1, \dots, L$ and unrolling the recurrence, the hidden state after the last layer is $x_{L+1} = x_1 + \sum_{l=1}^{L} F_l(x_l)$: the input plus every layer's output, all summed into one vector. Since every layer reads from and writes to that same vector, a layer cannot selectively preserve its output for specific later layers or selectively read from specific previous layers.
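To make the unrolled sum concrete, here is a minimal sketch in plain Python. The `make_layer` helper is a hypothetical stand-in for $F_l$ (any function from a $d$-vector to a $d$-vector would do); the point is only that the final state equals the input plus the sum of all layer outputs.

```python
# Unroll a stack of residual layers and check that the final state equals
# the input plus the sum of every layer's output.

def make_layer(scale):
    # Hypothetical stand-in for F_l: a simple elementwise nonlinearity.
    def layer(x):
        return [scale * v / (1.0 + abs(v)) for v in x]
    return layer

def residual_stack(x, layers):
    outputs = []                                # F_l(x_l) for each layer, in order
    for f in layers:
        out = f(x)
        outputs.append(out)
        x = [a + b for a, b in zip(x, out)]     # x_{l+1} = x_l + F_l(x_l)
    return x, outputs

layers = [make_layer(s) for s in (0.5, -0.3, 0.8)]
x1 = [1.0, -2.0, 0.25]
x_final, outputs = residual_stack(x1, layers)

# Input plus the coordinate-wise sum of all layer outputs.
unrolled = [x1[i] + sum(out[i] for out in outputs) for i in range(len(x1))]
```

Every layer's contribution lands in the same `x`, which is exactly the bottleneck: there is no way to tell later layers which part of the sum came from where.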
This could be solved if each layer read from and wrote to a matrix instead of a vector, with learnable gating rules to filter information. ByteDance proposed this idea as Hyper-Connections (HC). DeepSeek extended it with a manifold constraint in Manifold-Constrained Hyper-Connections (mHC), and DeepSeek-V4 showed mHC works at scale. In this post, I will try to motivate the idea and explain how it works.
Notation. We denote the hidden dimension by $d$ and the number of rows in the expanded hidden state by $n$. The hidden state at layer $l$ is $X_l \in \mathbb{R}^{n \times d}$. Each layer function $F_l$ takes a single $d$-dimensional vector as input and returns a $d$-dimensional vector as output. This includes both attention and feed-forward layers. The all-ones vector of length $n$ is written $\mathbf{1}_n$, and the set of $n \times n$ doubly stochastic matrices is denoted $\mathcal{M}$.
From Vector to Matrix
As noted before, residual connections accumulate all layer outputs into one vector. In Hyper-Connections, we instead pass $n$ vectors in $\mathbb{R}^d$ between layers, stacked as the rows of a matrix $X_l \in \mathbb{R}^{n \times d}$. (DeepSeek-V4 uses $n = 4$.) At the input to the first layer, all rows are identical copies of the embedding vector. As different layers learn to write to different rows, the rows begin to carry different information. A later layer can then read from whichever rows are relevant to its computation. Three small, learnable matrices control the information flow.
The Update Equation
At layer $l$, the update rule is
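\[X_{l+1} = B_l X_l + C_l F_l(A_l X_l)\]
where $A_l \in \mathbb{R}^{1 \times n}$, $B_l \in \mathbb{R}^{n \times n}$, and $C_l \in \mathbb{R}^{n \times 1}$. The shapes work out as follows: $A_l X_l \in \mathbb{R}^{1 \times d}$ is the single vector fed to the layer, $F_l(A_l X_l) \in \mathbb{R}^{1 \times d}$ is the layer output, and $C_l F_l(A_l X_l) \in \mathbb{R}^{n \times d}$ spreads that output across the $n$ rows.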
This has the same structure as an RNN hidden state update. The first term $B_l X_l$ carries forward the existing state and determines what the network remembers. The second term is the new input. $A_l$ selects what to feed from the hidden state to a layer, and $C_l$ determines where to write the layer output back. When $n = 1$, all three become scalars, and setting them to 1 recovers the standard residual connection.
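The update rule can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not DeepSeek's implementation: `F` is a hypothetical stand-in for an attention or feed-forward layer, and the matrices are hand-picked rather than learned.

```python
def matmul(P, Q):
    # Multiply two matrices stored as lists of rows.
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def hc_update(X, A, B, C, F):
    # X: n x d, A: 1 x n, B: n x n, C: n x 1; F maps a d-vector to a d-vector.
    layer_in = matmul(A, X)[0]        # read: mix the n rows into one d-vector
    layer_out = [F(layer_in)]         # 1 x d layer output
    carried = matmul(B, X)            # remember: mix the existing rows
    written = matmul(C, layer_out)    # write: n x d, one scaled copy per row
    return [[c + w for c, w in zip(c_row, w_row)]
            for c_row, w_row in zip(carried, written)]

def F(x):
    # Hypothetical stand-in for a real layer.
    return [2.0 * v for v in x]

# n = 2, d = 3: read only row 0, carry both rows unchanged, write only to row 1.
X = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
A = [[1.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
C = [[0.0], [1.0]]
X_next = hc_update(X, A, B, C, F)

# n = 1 with A = B = C = 1 reduces to the standard residual x + F(x).
x_next = hc_update([[1.0, 2.0, 3.0]], [[1.0]], [[1.0]], [[1.0]], F)
```

In the first call, row 0 is left untouched while row 1 receives the layer output, which is the kind of selective routing a single shared vector cannot express.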
Constraining $B_l$ to Preserve Identity Mapping
The reason residual connections work so well is the identity mapping. In $x_{l+1} = x_l + F_l(x_l)$, the hidden state $x_l$ is carried forward to every later layer unchanged regardless of depth, with layer outputs only added on top. HC breaks this. The residual term is now $B_l X_l$ instead of $I \cdot X_l$, so across $L$ layers the hidden state gets multiplied by $B_L B_{L-1} \cdots B_1$. If the $B_l$ are unconstrained, this product can amplify or attenuate signals arbitrarily, destroying the stability that the identity mapping provided.
mHC constrains $B_l$ to be a doubly stochastic matrix. A matrix $M \in \mathbb{R}^{n \times n}$ is doubly stochastic if all entries are non-negative, every row sums to 1, and every column sums to 1.
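\[\mathcal{M} = \left\{ M \in \mathbb{R}^{n \times n} \;\middle|\; M \geq 0, \; M \mathbf{1}_n = \mathbf{1}_n, \; \mathbf{1}_n^\top M = \mathbf{1}_n^\top \right\}\]
While this does not restore the identity exactly, it keeps the properties that matter. Because $\mathbf{1}_n^\top M = \mathbf{1}_n^\top$, multiplying by $M$ leaves the column sums of $X_l$ unchanged, so the mean across rows is preserved. Because $M \mathbf{1}_n = \mathbf{1}_n$, a state whose rows are all equal passes through untouched. Every doubly stochastic matrix is a convex combination of permutation matrices (the Birkhoff-von Neumann theorem), so its spectral norm is exactly 1 and it can never amplify the signal. And since $\mathcal{M}$ is closed under multiplication, the product $B_L B_{L-1} \cdots B_1$ remains doubly stochastic regardless of depth.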
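Two of these properties, closure under multiplication and preservation of the row mean, are easy to check numerically. The sketch below uses two hand-picked $3 \times 3$ doubly stochastic matrices; the matrices and helper names are illustrative assumptions, not values from any model.

```python
def matmul(P, Q):
    # Multiply two matrices stored as lists of rows.
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def is_doubly_stochastic(M, tol=1e-9):
    # Non-negative entries, every row sums to 1, every column sums to 1.
    n = len(M)
    nonneg = all(v >= 0 for row in M for v in row)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in M)
    cols_ok = all(abs(sum(M[i][j] for i in range(n)) - 1.0) < tol
                  for j in range(n))
    return nonneg and rows_ok and cols_ok

def row_mean(X):
    # Average the n rows of X into a single d-vector.
    n = len(X)
    return [sum(X[i][j] for i in range(n)) / n for j in range(len(X[0]))]

B1 = [[0.50, 0.50, 0.0],
      [0.25, 0.25, 0.5],
      [0.25, 0.25, 0.5]]
B2 = [[0.0, 1.0, 0.0],
      [0.5, 0.0, 0.5],
      [0.5, 0.0, 0.5]]

product = matmul(B2, B1)            # two layers' worth of residual mixing
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 12.0]]
mixed = matmul(product, X)

closed = is_doubly_stochastic(product)
mean_before, mean_after = row_mean(X), row_mean(mixed)
```

Here `closed` is true and the two means agree, even though the individual rows of `mixed` differ from those of `X`.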
Constraining $B_l$ like this also prevents gradient explosion along the residual path, the same problem that plagued RNNs for years and that LSTMs and GRUs addressed with gating. The vanishing problem is handled by the additive term $C_l F_l(A_l X_l)$, which injects fresh signal at every layer.
Projecting $B_l$ During Training
One concern with $B_l$ is that gradient descent does not respect the doubly stochastic constraint. So mHC learns an unconstrained matrix $\tilde{B}_l$ and, in every forward pass, maps it onto $\mathcal{M}$ using the Sinkhorn-Knopp algorithm: first exponentiate all entries to make them positive, then alternate between normalizing columns to sum to 1 and normalizing rows to sum to 1.
\[M^{(0)} = \exp(\tilde{B}_l), \qquad M^{(t)} = T_r(T_c(M^{(t-1)}))\]
Here $T_c$ divides each column by its sum and $T_r$ divides each row by its sum. After enough iterations (DeepSeek-V4 uses 20), the result is approximately doubly stochastic, and the final iterate is used as $B_l$. The entire operation is differentiable since it only involves exponentials and divisions.
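The iteration is short enough to write out directly. The following is a minimal sketch in plain Python, assuming a fixed iteration count and an arbitrary example matrix `B_raw`; a real implementation would run on tensors inside an autodiff framework.

```python
import math

def sinkhorn_knopp(B_raw, iters=20):
    n = len(B_raw)
    # M^(0): exponentiate so every entry is strictly positive.
    M = [[math.exp(v) for v in row] for row in B_raw]
    for _ in range(iters):
        # T_c: divide each column by its sum.
        col_sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
        M = [[M[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        # T_r: divide each row by its sum.
        M = [[v / sum(row) for v in row] for row in M]
    return M

# Arbitrary unconstrained 4 x 4 matrix (illustrative values only).
B_raw = [[0.2, -1.0, 0.5, 0.0],
         [1.5, 0.3, -0.7, 0.1],
         [-0.4, 0.8, 0.0, 1.2],
         [0.0, -0.2, 0.9, -1.1]]
B = sinkhorn_knopp(B_raw)

# Largest deviation of any row or column sum from 1.
row_err = max(abs(sum(row) - 1.0) for row in B)
col_err = max(abs(sum(B[i][j] for i in range(4)) - 1.0) for j in range(4))
```

Because the loop ends on a row normalization, the row sums are exact up to rounding, while the column sums are only approximately 1; that residual column error is what "approximately doubly stochastic" refers to.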
Input-Dependent Parameters
In practice, $A_l$, $B_l$, and $C_l$ are not static, but are functions of the current hidden state.
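\[A_l, B_l, C_l = f(X_l; \theta_l)\]
where $f$ is a small learned projection (flatten $X_l$, apply RMSNorm, then a linear layer) with per-layer parameters $\theta_l$, and the $B_l$ output is then mapped onto $\mathcal{M}$ by Sinkhorn-Knopp as described above. Since $n$ is small, the overhead is small compared with the cost of the layer itself.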
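As a rough illustration of the flatten, RMSNorm, linear pipeline, here is a sketch in plain Python. Several details are my assumptions rather than anything stated above: a single linear layer produces all $n^2 + 2n$ numbers at once, the outputs are split in the order $A_l$, $\tilde{B}_l$, $C_l$, the RMSNorm has no learned gain, and no extra activation is applied. The weights `W` and `bias` are arbitrary placeholders.

```python
import math

def rms_norm(v, eps=1e-6):
    # Scale v to roughly unit root-mean-square (no learned gain in this sketch).
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def hc_params(X, W, bias):
    # X: n x d hidden state. W: (n*n + 2n) x (n*d) weights. bias: n*n + 2n values.
    n = len(X)
    flat = rms_norm([v for row in X for v in row])        # flatten, then RMSNorm
    out = [sum(w * x for w, x in zip(w_row, flat)) + b    # linear layer
           for w_row, b in zip(W, bias)]
    A = [out[:n]]                                         # 1 x n read weights
    B_raw = [out[n + i * n: n + (i + 1) * n]              # n x n, still unconstrained
             for i in range(n)]
    C = [[v] for v in out[n + n * n:]]                    # n x 1 write weights
    return A, B_raw, C

n, d = 2, 3
n_out = n * n + 2 * n
# Placeholder weights; in a real model these are the learned theta_l.
W = [[0.1 * ((i + j) % 3 - 1) for j in range(n * d)] for i in range(n_out)]
bias = [0.0] * n_out

A1, B1_raw, C1 = hc_params([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], W, bias)
A2, B2_raw, C2 = hc_params([[6.0, 5.0, 4.0], [3.0, 2.0, 1.0]], W, bias)
```

The two calls share the same weights but return different matrices, which is the sense in which the connection pattern depends on the input. `B_raw` would still need to go through Sinkhorn-Knopp before being used as $B_l$.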
Conclusion
Manifold-Constrained Hyper-Connections replace the residual connection’s single shared vector with a matrix of $n$ separate rows. Constraining $B_l$ to be doubly stochastic does not restore the identity mapping exactly, but it preserves the stability that the identity mapping gives residual connections. The layers themselves are unchanged. mHC only changes how information flows between them.
