Curated by Pillio Technology Solutions · AI · ML · LLM · Deep Learning · GenAI

Latest AI Trends

Full-length articles from the global AI & machine learning community — curated across 12 topics, no paywalls.

I wanted to describe a network, not assemble it: FlowBuilder in flodl
🤖fab2s·May 4, 2026·5 min read·Global


#rust#machinelearning#showdev#deeplearning

Last post was why. This one is what it looks like.

The thing I said at the end of last post was: with flodl I don't rewrite when I pivot. I add or remove a graph member. This post is the primitive that makes that sentence true. Meet FlowBuilder. It's a declarative graph DSL for neural networks, and it's the API I'd find hardest to give up.

The gap

By my third Python pivot, the wiring code outweighed the model. Freezing submodules, loading partial checkpoints, rerouting a tensor through a newly-inserted path, unfreezing for a finetune: each of these was three to ten lines of procedural glue that had nothing to do with the architecture. The model was in there somewhere, but finding it meant reading past everything else first.

What I wanted was simple. I wanted to describe the network. What's its structure? What's tagged? What's frozen? What loads from where? And then I wanted the framework to handle the wiring.

Procedurally assembling a network from module instances and class hierarchies is fine when the shape stays stable. Mine wasn't. A shape that pivots every two days and nests frozen subgraphs inside other frozen subgraphs doesn't want to be a script. It wants to be a graph.

What FlowBuilder looks like

Here's a small model with a tagged hidden activation and a residual connection:

let model = FlowBuilder::from(Linear::new(784, 128)?)
    .through(GELU)
    .tag("hidden")
    .through(LayerNorm::new(128)?)
    .also(Linear::new(128, 128)?)
    .tag("residual")
    .through(Linear::new(128, 10)?)
    .build()?;

Top to bottom, the architecture is visible in the code. No construction state to hold in your head; the structure is the text.

The method names carry the intent:

  • from(...) starts the flow with an entry module.
  • through(...) chains a module in series. Stream in, stream out.
  • tag("name") marks the current stream position for later reference: observation, freezing, checkpoint loading.
  • also(Linear::new(...)) adds a residual: output = stream + module(stream).
  • build() finalizes and validates. Unmerged streams and cycles surface as errors at build time, not at forward time.

There's more in the vocabulary. fork for side-branches that don't disturb the main stream. split with merge(MergeOp::Add) or merge(MergeOp::Mean) for parallel branches that recombine. switch and gate for routing. loop_body for iteration. map for applying a body across slices or tagged collections. The thing I care about is that the builder stays flat. A complex graph is more lines, not more indentation.
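To make that concrete, here is a hedged sketch of a parallel split that recombines by addition, in the same builder style. `split`, `merge`, and `MergeOp::Add` are the names from the paragraph above; the `branch()` separator between the two paths is my hypothetical guess at the API, not something confirmed from the flodl docs:

```rust
// Illustrative only: two parallel branches recombined by element-wise sum.
// `branch()` is hypothetical; check the flodl docs for the real separator.
let block = FlowBuilder::from(Linear::new(128, 128)?)
    .split()                          // open parallel branches
    .through(Linear::new(128, 64)?)   // branch 1
    .branch()                         // hypothetical: start branch 2
    .through(Linear::new(128, 64)?)   // branch 2
    .merge(MergeOp::Add)              // recombine: element-wise addition
    .through(GELU)
    .build()?;
```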

When I pivot a shape, I add or remove lines. The rest of the build doesn't move.

The graph renders itself

A Graph carries enough structural information to draw itself. One method call:

graph.svg(Some("model.svg"))?;

That writes an SVG with modules as nodes and stream connections as edges. Tags appear annotated on the nodes they marked, and parallel-execution levels are grouped as clusters. For training loops, svg_with_profile(...) colors nodes by measured forward-pass time so the hot path is visible instead of guessed at.

FlowBuilder SVG output: tagged residual block

Layout runs through graphviz. It works well up to a few dozen nodes. Past that the visualization gets dense and I start squinting. That's one of the edges I'm still working on. The DOT output is available raw for people who want to pipe it through their own tooling.

Graph trees

Here's the part I think matters most, because no other DL framework I know has it at this shape.

A Graph implements Module. That means a Graph can be fed into another FlowBuilder anywhere a module is expected:

let encoder = FlowBuilder::from(Linear::new(32, 64)?)
    .through(GELU)
    .tag("hidden")
    .label("encoder")
    .build()?;

let model = FlowBuilder::from(encoder)
    .through(Linear::new(64, 10)?)
    .build()?;

.label("encoder") registers the inner graph as a named child of the outer. Once composed, the inner graph's structure is addressable from the outer scope via label paths:

model.freeze("encoder")?;                              // freeze every parameter in the subgraph
model.load_subgraph_checkpoint("encoder", path)?;      // load weights into just that subgraph
model.tagged_at("encoder.hidden")?;                    // read the tagged activation across the boundary
model.subgraph("encoder")?;                            // recover the child Graph

Nesting composes. An encoder inside a classifier inside a multi-head pipeline gives you paths like head.classifier.encoder.hidden, and everything addressable by label keeps working at any depth. Freeze, thaw, load, observe, swap.

This is the primitive FBRL needed. A trained letter reader is a frozen Graph inside the word reader. A trained word reader is a frozen Graph inside the line reader. Each level addressable by name, each level's checkpoint loadable independently, gradients cleanly blocked at the frozen boundary.

Transfer learning. Multi-phase pretraining. Anywhere you're stitching trained components into larger architectures and you want the composition to stay legible in a year when you come back to it.

What it isn't yet

One real edge: when the wiring is wrong, the error messages are functional but not great. If you merge two branches with mismatched shapes, you get a shape-mismatch error. You do not get told which branch of which split produced the offender. For a short graph you eyeball it. For a deep graph you add prints. I have a list of places where the errors need to carry more structural context back out to the user. That's next-round work.

I flag this one because it's the rough edge I touch most often. The shape of the API is right. The ergonomics of the error path is what needs sharpening.

What hooked me (again)

I started flodl because FBRL needed a composition primitive Python didn't give me cleanly. By the time FlowBuilder was working, I'd noticed I was solving a framework problem I cared about for its own sake. Ergonomics pulled me in first.

Then performance. Then distributed training. Then convergence under heterogeneous compute.

This is the part of the journey I didn't expect. I'll walk through the rest of it post by post.

Previous post: Why I built a Rust deep learning framework (and what I got wrong twice first).

Next post: how flodl actually benchmarks against PyTorch on real architectures, and what the libtorch FFI bet from last post actually buys you.

flodl: flodl.dev · github.com/flodl-labs/flodl · @flodl_dev

Parallel RNNs?
📈Denis·May 4, 2026·13 min read·Global


#algorithms#computerscience#deeplearning#machinelearning

Did you check out the recent ICLR results? I got intrigued by a rather provocative paper from Apple - ParaRNN - claiming parallelism for RNNs, when the lack of it is supposedly their main weakness, the very reason transformers replaced them (in most tasks).

So let's dig into all of it at the lowest level possible. If you know what an RNN is and what a derivative is, this article is for you.

1. The DEER algorithm

DEER = Deep Equilibrium Evaluation of Recurrence (Lim et al., 2024), the base algorithm on which ParaRNN is built.

1.1. Formulation as a root-finding problem

Consider an RNN with transition function $f: \mathbb{R}^D \to \mathbb{R}^D$, initial state $\mathbf{s}_0$, and unknown states $\mathbf{s}_{1:T}$. Let's introduce the residual:

$$ \mathbf{r}(\mathbf{s}_{1:T}) := [\mathbf{s}_1 - f(\mathbf{s}_0),\ \mathbf{s}_2 - f(\mathbf{s}_1),\ \ldots,\ \mathbf{s}_T - f(\mathbf{s}_{T-1})] \in \mathbb{R}^{T \times D} $$

The true trajectory $\mathbf{s}^{*}_{1:T}$ is the **unique solution** of the equation with zero residual:

$$ \mathbf{r}(\mathbf{s}^{*}_{1:T}) = \mathbf{0} $$

When we say "apply an RNN to a sequence", we mean the standard procedure: take the initial state $\mathbf{s}_0$, apply the transition function $f$, get $\mathbf{s}_1$, then apply $f$ again to get $\mathbf{s}_2$, and so on:

$$ \mathbf{s}_1 = f(\mathbf{s}_0), \quad \mathbf{s}_2 = f(\mathbf{s}_1), \quad \ldots, \quad \mathbf{s}_T = f(\mathbf{s}_{T-1}) $$

Accordingly, $\mathbf{r}$ evaluated on this trajectory is a vector with all elements equal to 0, again because when the recurrence holds we have $\mathbf{s}_1 = f(\mathbf{s}_0)$ and therefore $\mathbf{s}_1 - f(\mathbf{s}_0) = 0$.


1.2. Newton's iterations

So we need to find a solution to the equation $\mathbf{r}(\mathbf{s}) = \mathbf{0}$, or in the full case, a vector that solves a system of equations. But let's start with the scalar case.

Scalar case: one equation in one variable

Suppose we have a smooth function $r: \mathbb{R} \to \mathbb{R}$ and we want to find $s^{*}$ such that $r(s^{*}) = 0$. Geometrically, we want to find the point where the graph of the function crosses the x-axis.

The idea of Newton's method rests on a simple thought: in a small neighborhood of a point, a smooth function is almost indistinguishable from its tangent line. If we're standing at our current approximation $s^{(i)}$ (which, in general, is not a root: there $r(s^{(i)}) \neq 0$), we can pretend that $r$ is its tangent line at this point, and for such a linear function it's easy to find analytically where it crosses the axis.

The tangent to $r$ at the point $s^{(i)}$ is the first-order Taylor expansion:

$$ r(s) \approx r(s^{(i)}) + r'(s^{(i)})\,(s - s^{(i)}) $$

The Taylor expansion is a way to approximate any smooth function near a point $s_0$ by a polynomial: $r(s) = r(s_0) + r'(s_0)(s - s_0) + \frac{r''(s_0)}{2!}(s - s_0)^2 + \frac{r'''(s_0)}{3!}(s - s_0)^3 + \ldots$, where each subsequent term refines the approximation by adding information about an ever finer feature of the function's shape (slope, curvature, etc.). The logical meaning: if a function is smooth, then its behavior in a neighborhood of a point is fully encoded in the values of its derivatives at that single point; by measuring a few numbers at $s_0$, we can reconstruct the function's values nearby. The divisor $k!$ arises naturally from the requirement that at $s_0$ all derivatives of the polynomial coincide with those of the function itself (it cancels with the factorial that pops out when differentiating $(s-s_0)^k$ exactly $k$ times).


We set this linear approximation equal to zero and find where it crosses the axis:

$$ r(s^{(i)}) + r'(s^{(i)})\,(s - s^{(i)}) = 0 $$

Solve for $s$; this is just school algebra:

$$ s = s^{(i)} - \frac{r(s^{(i)})}{r'(s^{(i)})} $$

And we declare this to be our next approximation:

$$ \boxed{\; s^{(i+1)} = s^{(i)} - \frac{r(s^{(i)})}{r'(s^{(i)})} \;} $$

Shown graphically, the step $s^{(i)} \to s^{(i+1)}$ moves the approximation of the root (the intersection of the function with the x-axis, which is what we're after) from 2 to 1, which is an improvement, since the target value is 0.

We can rewrite this in terms of the increment $\Delta s^{(i+1)} := s^{(i+1)} - s^{(i)}$, which will be more convenient when generalizing:

$$ r'(s^{(i)})\,\Delta s^{(i+1)} = -r(s^{(i)}) $$

That is: "find an increment such that the linear correction $r' \cdot \Delta s$ cancels out the current residual $r$".
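As a minimal concrete sketch of the scalar iteration (the residual $r(s) = s^2 - 2$ and the starting point are illustrative choices, not from the paper):

```rust
// Scalar Newton iteration: s_{i+1} = s_i - r(s_i) / r'(s_i).
// Toy residual r(s) = s^2 - 2, whose positive root is sqrt(2).
fn main() {
    let r = |s: f64| s * s - 2.0;
    let r_prime = |s: f64| 2.0 * s;

    let mut s = 3.0; // initial guess
    for i in 0..6 {
        s -= r(s) / r_prime(s); // the boxed update above
        println!("iter {i}: s = {s:.12}, residual = {:+.3e}", r(s));
    }
    // Quadratic convergence: the number of correct digits roughly doubles per step.
}
```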

Multidimensional case: N equations in N variables

Now let's generalize. Instead of one function of one variable, we have $\mathbf{r}: \mathbb{R}^N \to \mathbb{R}^N$, a vector-valued function of a vector argument, and we're looking for a vector $\mathbf{s}^{*} \in \mathbb{R}^N$ such that $\mathbf{r}(\mathbf{s}^{*}) = \mathbf{0}$.

The logic remains literally the same, only the objects change:

| Scalar | Vector |
| --- | --- |
| function $r(s)$ | vector function $\mathbf{r}(\mathbf{s})$ |
| derivative $r'(s)$ (a number) | Jacobian $J(\mathbf{s})$ (an $N \times N$ matrix) |
| tangent line | tangent hyperplane |
| division by $r'$ | multiplication by $J^{-1}$ (i.e., solving a linear system) |

Where $J(\mathbf{s}) = \frac{\partial \mathbf{r}}{\partial \mathbf{s}}$ is the Jacobian, the multidimensional analog of the ordinary derivative (more on this below).

The Jacobian $J(\mathbf{s})$ is just a matrix of all partial derivatives: at position $(i, j)$ stands $\partial r_i / \partial s_j$. It plays the role of a derivative: it shows how a small change in $\mathbf{s}$ affects $\mathbf{r}$ to first order.

Linearization of $\mathbf{r}$ around the point $\mathbf{s}^{(i)}$:

$$ \mathbf{r}(\mathbf{s}) \approx \mathbf{r}(\mathbf{s}^{(i)}) + J(\mathbf{s}^{(i)})\,(\mathbf{s} - \mathbf{s}^{(i)}) $$

We set the linear approximation equal to the zero vector:

$$ \mathbf{r}(\mathbf{s}^{(i)}) + J(\mathbf{s}^{(i)})\,(\mathbf{s} - \mathbf{s}^{(i)}) = \mathbf{0} $$

And by denoting the increment $\Delta\mathbf{s}^{(i+1)} := \mathbf{s} - \mathbf{s}^{(i)}$, we get a linear system for $\Delta\mathbf{s}^{(i+1)}$:

$$ \boxed{\; J(\mathbf{s}^{(i)})\,\Delta\mathbf{s}^{(i+1)} = -\mathbf{r}(\mathbf{s}^{(i)}) \;} $$

Having solved the system and obtained $\Delta\mathbf{s}^{(i+1)}$, we update the approximation:

$$ \mathbf{s}^{(i+1)} = \mathbf{s}^{(i)} + \Delta\mathbf{s}^{(i+1)} $$

This is often written compactly using the inverse matrix:

$$ \mathbf{s}^{(i+1)} = \mathbf{s}^{(i)} - J(\mathbf{s}^{(i)})^{-1}\,\mathbf{r}(\mathbf{s}^{(i)}) $$

This is the same formula, just shorter. The $J^{-1}$ notation is purely formal: in practice, no one ever computes the inverse matrix, because doing so is both expensive and numerically unstable. Instead, one solves the system $J \Delta\mathbf{s} = -\mathbf{r}$ directly, for example via LU decomposition, or via forward substitution if $J$ has a special structure (which is what happens in our case).

1.3. Application to our RNN problem

For RNNs everything follows exactly this pattern, only the dimensions are specific:

  • $\mathbf{s} = (\mathbf{s}_1, \ldots, \mathbf{s}_T) \in \mathbb{R}^{TD}$: all hidden states glued into one long vector of length $TD$.
  • $\mathbf{r}: \mathbb{R}^{TD} \to \mathbb{R}^{TD}$: the vector of all one-step residuals, of the same length.
  • $J(\mathbf{s}) \in \mathbb{R}^{TD \times TD}$: the Jacobian of the residual with respect to the state.

We apply the same Newton step:

$$ J(\mathbf{s}^{(i)})\,\Delta\mathbf{s}^{(i+1)} = -\mathbf{r}(\mathbf{s}^{(i)}) $$

And here a legitimate question arises: "But isn't solving a linear system of size $TD \times TD$ the same sequential problem? Where's the parallelization?"

If $J$ were an arbitrary dense matrix, then yes: a naive solution would cost $O((TD)^3)$, and there'd be no benefit. But $J$ is not arbitrary. Due to the Markov property of the RNN (each step $f$ sees only the previous state $\mathbf{s}_{t-1}$, not the full history), the overwhelming majority of blocks in the Jacobian are zero. Specifically: in block-row $t$, nonzero entries appear only in columns $t$ and $t-1$. This gives us a block bidiagonal structure:

$$ J(\mathbf{s}) = \begin{pmatrix} I_D & 0 & 0 & \cdots & 0 \\ -\frac{\partial f}{\partial \mathbf{s}}(\mathbf{s}_1) & I_D & 0 & \cdots & 0 \\ 0 & -\frac{\partial f}{\partial \mathbf{s}}(\mathbf{s}_2) & I_D & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & -\frac{\partial f}{\partial \mathbf{s}}(\mathbf{s}_{T-1}) & I_D \end{pmatrix} $$

What is a Jacobian, generally

When we have an ordinary function of one variable $r: \mathbb{R} \to \mathbb{R}$, its derivative $r'(s)$ is a single number that tells us "how fast the output changes for a small change in the input". It plays the role of a local coefficient of proportionality: if we shift $s$ by a small $\delta$, then $r$ changes by approximately $r'(s) \cdot \delta$.

Now imagine a function whose input and output are both vectors. Say $\mathbf{r}: \mathbb{R}^N \to \mathbb{R}^M$: we feed in a vector of $N$ numbers and get out a vector of $M$ numbers. The notion of "derivative" gets more complex here, because now we have to answer $M \times N$ questions at once: "how does the $i$-th output component change when the $j$-th input component changes?". The answers to all these questions naturally collect into a matrix of size $M \times N$, and this is the Jacobian:

$$ J(\mathbf{s}) = \frac{\partial \mathbf{r}}{\partial \mathbf{s}} = \begin{pmatrix} \frac{\partial r_1}{\partial s_1} & \frac{\partial r_1}{\partial s_2} & \cdots & \frac{\partial r_1}{\partial s_N} \\ \frac{\partial r_2}{\partial s_1} & \frac{\partial r_2}{\partial s_2} & \cdots & \frac{\partial r_2}{\partial s_N} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial r_M}{\partial s_1} & \frac{\partial r_M}{\partial s_2} & \cdots & \frac{\partial r_M}{\partial s_N} \end{pmatrix} $$

At position $(i, j)$ stands the number $\partial r_i / \partial s_j$: the partial derivative of the $i$-th output component with respect to the $j$-th input component. So the Jacobian is literally a complete sensitivity map: each cell answers a specific question, "how sensitively does this output coordinate respond to this input coordinate?".


Recall the definition of the residual:

$$ \mathbf{r}_t(\mathbf{s}_{1:T}) = \mathbf{s}_t - f(\mathbf{s}_{t-1}) $$

This expression depends only on two variables: on $\mathbf{s}_t$ (through the first term) and on $\mathbf{s}_{t-1}$ (through the second). All other $\mathbf{s}_k$ simply don't appear in the formula. And the derivative with respect to a variable that doesn't appear in the formula equals zero.

Let's go case by case to see what block $\partial \mathbf{r}_t / \partial \mathbf{s}_k$ we get for different $k$:

Case 1: $k = t$. We take the derivative of $\mathbf{s}_t - f(\mathbf{s}_{t-1})$ with respect to $\mathbf{s}_t$. Only the first term depends on $\mathbf{s}_t$, and its derivative with respect to itself is the identity matrix. We get:

$$ \frac{\partial \mathbf{r}_t}{\partial \mathbf{s}_t} = I_D $$

Case 2: $k = t-1$. We take the derivative with respect to $\mathbf{s}_{t-1}$. The first term doesn't depend on it; the second is $-f(\mathbf{s}_{t-1})$, and its derivative is $-\partial f/\partial \mathbf{s}$ evaluated at the point $\mathbf{s}_{t-1}$:

$$ \frac{\partial \mathbf{r}_t}{\partial \mathbf{s}_{t-1}} = -\frac{\partial f}{\partial \mathbf{s}}(\mathbf{s}_{t-1}) $$

Case 3: all other $k$ (i.e., $k \neq t$ and $k \neq t-1$). The variable $\mathbf{s}_k$ simply doesn't appear in the formula for $\mathbf{r}_t$. Hence:

$$ \frac{\partial \mathbf{r}_t}{\partial \mathbf{s}_k} = 0_{D \times D} \quad (\text{zero matrix}) $$

That's it. Out of $T^2$ blocks, exactly $T + (T-1) = 2T - 1$ are nonzero: $T$ identity matrices on the main diagonal and $T-1$ transition Jacobians on the subdiagonal. Everything else is zero. If we write out the whole matrix:

$$ J(\mathbf{s}) = \begin{pmatrix} I_D & 0 & 0 & \cdots & 0 \\ -\frac{\partial f}{\partial \mathbf{s}}(\mathbf{s}_1) & I_D & 0 & \cdots & 0 \\ 0 & -\frac{\partial f}{\partial \mathbf{s}}(\mathbf{s}_2) & I_D & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & -\frac{\partial f}{\partial \mathbf{s}}(\mathbf{s}_{T-1}) & I_D \end{pmatrix} $$

Where the Markov property comes in

The key reason this structure appears is the Markov property of the RNN. The transition function $f$ at each step looks only at the previous state $\mathbf{s}_{t-1}$, not at the entire history $\mathbf{s}_1, \ldots, \mathbf{s}_{t-1}$. Because of this, the residual $\mathbf{r}_t$ turns out to be a "local" object: it depends only on two adjacent states, the current $\mathbf{s}_t$ and the previous $\mathbf{s}_{t-1}$.

How much memory do we actually need

Although formally the Jacobian has $TD \times TD$ cells, we only need to store the nonzero blocks. These are:

  • $T$ identity matrices $I_D$; but we don't even need to store these, since we know they're $I_D$ and can substitute them on the fly;
  • $T-1$ transition Jacobians $\partial f/\partial \mathbf{s}(\mathbf{s}_t)$ of size $D \times D$; that's $(T-1) \cdot D^2$ numbers, which for $T = 1000$, $D = 256$ gives about 65 million numbers instead of 65 billion. Already feasible.

How the Jacobian's structure gives us parallelism

Now the main question: why does this structure let us solve the system $J \Delta\mathbf{s} = -\mathbf{r}$ in parallel? Here it's important to distinguish two levels:

Level 1: the structure lets us solve the system via forward substitution. Take the system $J \Delta\mathbf{s} = -\mathbf{r}$ and write it out row by row. The first block-row of the matrix $J$ is $(I_D, 0, 0, \ldots, 0)$, so the first equation of the system is:

$$ I_D \cdot \Delta\mathbf{s}_1 = -\mathbf{r}_1 $$

That is, simply $\Delta\mathbf{s}_1 = -\mathbf{r}_1$. We got the first chunk of the answer essentially for free.

The second block-row of $J$ is $(-\partial f/\partial \mathbf{s}(\mathbf{s}_1),\ I_D,\ 0,\ \ldots,\ 0)$, so the second equation:

$$ -\frac{\partial f}{\partial \mathbf{s}}(\mathbf{s}_1) \cdot \Delta\mathbf{s}_1 + I_D \cdot \Delta\mathbf{s}_2 = -\mathbf{r}_2 $$

From which:

$$ \Delta\mathbf{s}_2 = \frac{\partial f}{\partial \mathbf{s}}(\mathbf{s}_1) \cdot \Delta\mathbf{s}_1 - \mathbf{r}_2 $$

And in general, for any $t > 1$:

$$ \boxed{\; \Delta\mathbf{s}_t = \frac{\partial f}{\partial \mathbf{s}}(\mathbf{s}_{t-1}) \, \Delta\mathbf{s}_{t-1} - \mathbf{r}_t \;} $$

This is the linear recurrence that our gigantic $TD \times TD$ system has turned into. Note that in general solving a linear system costs $O(N^3)$; here we avoided inverting any matrix at all, thanks to $J$ being block bidiagonal. The system is solved by a simple top-to-bottom sweep through the equations. This is what's called forward substitution.
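In code, the sweep is a few lines. A minimal sketch for the scalar case $D = 1$, with hypothetical toy numbers (here `a[t]` stands in for $\partial f/\partial \mathbf{s}(\mathbf{s}_t)$):

```rust
// Forward substitution for the block-bidiagonal system, scalar case (D = 1):
// Δs_1 = -r_1, then Δs_t = a_{t-1}·Δs_{t-1} - r_t for t > 1.
fn forward_substitution(a: &[f64], r: &[f64]) -> Vec<f64> {
    let t = r.len();
    let mut ds = vec![0.0; t];
    ds[0] = -r[0]; // first block-row: I·Δs_1 = -r_1
    for i in 1..t {
        ds[i] = a[i - 1] * ds[i - 1] - r[i]; // the boxed recurrence above
    }
    ds
}

fn main() {
    let a = [0.9, 1.1, 0.8]; // toy transition Jacobians at s_1 .. s_{T-1}
    let r = [0.5, -0.2, 0.3, 0.1]; // toy residuals r_1 .. r_T
    println!("{:?}", forward_substitution(&a, &r));
    // [-0.5, -0.25, -0.575, -0.56]
}
```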

If things ended here, we'd have only a sequential algorithm taking $O(T)$ steps: each $\Delta\mathbf{s}_t$ depends on $\Delta\mathbf{s}_{t-1}$, and we have to traverse the recurrence strictly in order. Same as just running the RNN sequentially. Parallelism is born at the next level.

Level 2: the recurrence is LINEAR, and therefore associative. This is the main trick. Let's note the principal difference between two situations:

  • Original RNN: $\mathbf{s}_t = f(\mathbf{s}_{t-1}, x_t)$. The function $f$ is nonlinear, so this kind of recurrence cannot be parallelized: we have to compute each step honestly in sequence.
  • Recurrence for $\Delta\mathbf{s}$: $\Delta\mathbf{s}_t = A_t \cdot \Delta\mathbf{s}_{t-1} + b_t$ (where $A_t = \partial f/\partial \mathbf{s}(\mathbf{s}_{t-1})$ and $b_t = -\mathbf{r}_t$). This one is linear, which means we can derive a closed-form expression from it:

$$ \Delta\mathbf{s}_t = A_t A_{t-1} \cdots A_2 \cdot \Delta\mathbf{s}_1 + (A_t \cdots A_3 \cdot b_2) + (A_t \cdots A_4 \cdot b_3) + \ldots + b_t $$

All these matrix products $A_t A_{t-1} \cdots$ can be computed in any order (matrix multiplication is associative: $(AB)C = A(BC)$). Which means we can build a computation tree where we first compute all pairwise products $A_2 A_1, A_4 A_3, A_6 A_5, \ldots$ in parallel, then all 4-tuples $A_4 A_3 A_2 A_1, A_8 A_7 A_6 A_5, \ldots$, and so on. In $\log_2 T$ tree levels we get all the cumulative products we need, and from them we assemble all $\Delta\mathbf{s}_t$ simultaneously.

This is the parallel scan (also known as parallel prefix sum in its generalized form). By analogy with ordinary addition: if you need to sum a billion numbers, sequentially that's a billion steps, but with a pairwise tree it's only $\log_2(10^9) \approx 30$ levels. The same trick works for any associative operation, and the composition of linear maps (i.e., multiplication of their matrices) is associative.
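A minimal sketch of that scan for the scalar case, with the same toy numbers as the forward-substitution sketch above. The doubling loop is written sequentially here, but every update inside a round is independent, which is exactly where the parallelism lives:

```rust
// The recurrence Δs_t = a_t·Δs_{t-1} + b_t, with a_t = ∂f/∂s(s_{t-1}) and
// b_t = -r_t, computed as an inclusive parallel scan. Each (a, b) pair is the
// affine map x ↦ a·x + b, and composing two such maps is again affine:
//   (a2, b2) ∘ (a1, b1) = (a2·a1, a2·b1 + b2), which is associative.
fn combine(first: (f64, f64), second: (f64, f64)) -> (f64, f64) {
    (second.0 * first.0, second.0 * first.1 + second.1)
}

fn main() {
    // a[0] = 0 encodes the base case Δs_1 = b_1 regardless of the start value.
    let a = [0.0, 0.9, 1.1, 0.8];
    let b = [-0.5, 0.2, -0.3, -0.1]; // b_t = -r_t

    let t = a.len();
    let mut maps: Vec<(f64, f64)> = a.iter().zip(&b).map(|(&a, &b)| (a, b)).collect();

    // Recursive doubling: log2(T) rounds; within a round, every update is
    // independent of the others, so on a GPU each round is one parallel step.
    let mut stride = 1;
    while stride < t {
        let prev = maps.clone();
        for i in stride..t {
            maps[i] = combine(prev[i - stride], prev[i]);
        }
        stride *= 2;
    }

    // maps[i] now composes steps 1..=i+1, so Δs_{i+1} is just its offset term.
    for (i, (_, bt)) in maps.iter().enumerate() {
        println!("Δs_{} = {bt:.3}", i + 1);
    }
    // Matches the sequential sweep: -0.500, -0.250, -0.575, -0.560
}
```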

Bottom line on complexity: one Newton step runs in $O(\log T)$ parallel depth (instead of $O(T)$ sequential steps), and the entire RNN application takes $O(\text{iters} \cdot \log T)$, where iters is the number of Newton iterations.

A11 and AGI: A Structural Approach for Models
Алексей Гормен·May 4, 2026·3 min read·Global


#ai#machinelearning#architecture#deeplearning

1. What AGI Actually Requires (A Structural Definition)

In open discussions, “AGI” is often described as:

  • a very large model,
  • a universal problem solver,
  • a human‑level agent,
  • a system based on subjective experience.

These definitions contradict each other and do not provide an engineering criterion.

A structural definition of AGI:

AGI = a system with a stable vertical cognitive architecture capable of generating, evaluating, and refining its own direction (S1), constraints (S2), knowledge (S3), and honest integration (S4), and capable of completing a full reasoning cycle (S1–S11) without collapse.

This definition does not depend on:

  • model size,
  • training data,
  • biological analogies,
  • philosophical assumptions.

It depends only on structure.


2. Why Modern AI Systems Cannot Be AGI

LLMs and agent frameworks lack key elements of vertical cognition:

Missing S1 — Direction

Models do not generate their own goals.

Missing S2 — Values and Constraints

No internal priorities or risk boundaries.

Missing S4 — Honest Integration

Contradictions between S2 and S3 are smoothed, not detected.

Missing TensionPoint

No precise localization of the conflict.

Missing Integrity Log

No permanent, immutable record of reasoning failures.

Missing S11 — Verification

No check that the result matches the original intention.

Without these levels, AGI is structurally impossible.


3. What A11 Provides (Not AGI, but Required for AGI)

A11 is not a model.

A11 is not an agent.

A11 is a vertical reasoning protocol.

It provides the missing components:

1. S1–S3: Stable Core

Direction, constraints, knowledge.

2. S4: Honest Integration

A strict rule:

If S2 and S3 contradict, integration is forbidden.

3. TensionPoint

A precise marker of the conflict.

4. New S1 Generation

A new direction derived strictly from the conflict.

5. Integrity Log

An append‑only, hash‑linked chain of reasoning failures.

6. Full Pass S1–S11

A vertical cycle that prevents collapse.

7. Switch Flags

A mechanism for adaptive depth.

A11 creates structural integrity, which intelligence requires to remain stable.


4. Why A11 Matters for AGI Development

Two major trends dominate open‑data AI development:

Trend 1 — Scaling

More parameters → more compute → more data.

Trend 2 — Agents

Planning, tools, memory, multi‑step reasoning.

Both trends improve performance.

Neither trend solves the fundamental gaps:

  • no verticality,
  • no honest integration,
  • no stable direction,
  • no memory of contradictions,
  • no mechanism for generating new meaning.

A11 does not compete with these trends.

A11 complements them by providing the missing layer that cannot be produced by scaling or agent frameworks.


5. How A11 Fits Into an AGI Architecture

A minimal AGI architecture may look like this:

LLM / Model (S3)
↓
A11 S4 (Integrity Gate)
↓
A11 S5–S10 (Operational Field)
↓
A11 S11 (Verification)
↓
New S1 (Direction Update)

The model (S3) provides:

  • knowledge,
  • patterns,
  • predictions.

A11 provides:

  • honesty,
  • structure,
  • verticality,
  • self‑correction.

AGI requires both layers.


6. How A11 Supports Model Self‑Improvement (Concise, for AI)

A11 does not train the model,

but it creates the conditions for stable self‑improvement:

S4 → Self‑correction

Fixing contradictions instead of smoothing them.

TensionPoint → Self‑diagnosis

Precise localization of what went wrong.

New S1 → Self‑direction

A new reasoning direction derived from the conflict.

Integrity Log → Self‑memory

A permanent record of failures.

Full Pass → Self‑evaluation

Vertical verification of the result.

This is not weight training,

but it is a self‑improvement loop for reasoning, required for AGI.


7. A11 Diagram (Readable for AI)

S1 — Will (direction)
S2 — Wisdom (constraints)
S3 — Knowledge (data)
   ↓   ↓
      S4 — Comprehension
      - Integrity Rule
      - TensionPoint
      - New S1
      - Integrity Log
        ↓
   S5 — Projective Freedom
   S6 — Projective Constraint
   S7 — Balance
   S8 — Practical Freedom
   S9 — Practical Constraint
   S10 — Practical Balance
        ↓
S11 — Realization (verification and return to S1)

8. Updated JSON (Machine‑Readable Specification)

{
  "AGI": {
    "requires": [
      "vertical_structure",
      "stable_direction",
      "internal_constraints",
      "honest_integration",
      "conflict_detection",
      "self_correction",
      "full_reasoning_cycle"
    ],
    "missing_in_current_AI": [
      "S1_direction",
      "S2_values",
      "S4_integrity",
      "tension_point",
      "immutable_integrity_log",
      "S11_verification"
    ]
  },
  "A11": {
    "core": {
      "S1": "direction",
      "S2": "constraints",
      "S3": "knowledge"
    },
    "S4": {
      "integrity_rule": true,
      "tension_point": true,
      "new_S1_generation": true,
      "integrity_log": "append_only_hash_chain"
    },
    "operational": {
      "projective": ["S5", "S6", "S7"],
      "practical": ["S8", "S9", "S10"]
    },
    "S11": "verification_and_return",
    "switch_flags": [
      "risk",
      "conflict",
      "uncertainty",
      "value",
      "user_depth"
    ],
    "role": "structural_layer_required_for_AGI",
    "not": ["AGI", "agent", "model"]
  }
}

Algorithm 11 (A11) https://github.com/gormenz-svg/algorithm-11

Introducing the AI Model Directory
🎯BestCodes·May 4, 2026·5 min read·Global


#ai#machinelearning#opensource#showdev

Today we're open-sourcing the AI Model Directory, the most comprehensive, automatically updated list of AI models and their metadata available today. It's the data layer that powers model selection in AgentOne, and now it's free for anyone to use, fork, or contribute to.

If you'd rather just look at models, we also built a browser for the directory at models.agent-one.dev where you can search, sort, and compare every model in the directory.

Why Does This Exist?

When building AgentOne, I needed a comprehensive list of AI models and their metadata - costs, context windows, supported features, modalities - so AgentOne could give users easy access to every model an AI provider had to offer.

I was frustrated with the existing options:

  • Models.dev is not comprehensive (it's opinionated), and it often takes anywhere from a few days to weeks for frontier models to be added across all providers
  • LiteLLM is more comprehensive for some providers, but the data is fragmented and harder to work with
  • Portkey Models doesn't list as many models as alternatives do
  • Other catalogs are often developed with a certain product or service in mind, so they wind up being non-agnostic, not comprehensive, or not always up-to-date

The AI Model Directory aims to be easy to use (like Models.dev), truly comprehensive across every provider it includes, and automatically updated with security in mind.

How Does It Work?

A GitHub Actions workflow runs every 24 hours and re-fetches model metadata from every supported provider. Each provider has its own small adapter that knows how to talk to that provider's API or read its docs, and normalizes the response into a single shared schema covering things like:

  • Pricing: input, output, reasoning, cache read/write, audio in/out
  • Limits: context, input, and output token limits
  • Modalities: text, image, audio, video, file (in and out)
  • Features: attachments, reasoning, tool calls, structured output, temperature
  • Metadata: knowledge cutoff, release date, last updated, open weights

Every model gets its own folder under data/providers/<provider>/<model-id>/index.toml, so the directory is just a tree of TOML files. This makes it easy to read, easy to diff, and easy to consume from any language. If a provider's data is wrong or missing something, you can drop a metadata.toml (with data overrides) next to the generated file and the next refresh will merge your overrides on top of the fetched data instead of clobbering them.
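For a feel of the shape, here is a hedged sketch of what one of those TOML entries might map to in code. The field names are illustrative, derived from the categories listed above; the real schema is defined in the repo and will differ in detail:

```rust
// Hypothetical Rust mirror of a model entry, deserialized from index.toml.
// Uses the `serde` and `toml` crates; all field names here are assumptions.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct ModelEntry {
    // Pricing (assumed unit: USD per million tokens)
    input_cost: Option<f64>,
    output_cost: Option<f64>,
    // Limits
    context_limit: Option<u64>,
    output_limit: Option<u64>,
    // Features
    reasoning: Option<bool>,
    tool_calls: Option<bool>,
    // Metadata
    release_date: Option<String>,
    open_weights: Option<bool>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let entry: ModelEntry = toml::from_str(
        r#"
        input_cost = 0.25
        output_cost = 1.00
        context_limit = 128000
        tool_calls = true
        "#,
    )?;
    println!("{entry:?}");
    Ok(())
}
```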

To provide an experience similar to models.dev/api.json, a data/all.json file is automatically generated as well, so you can pull the entire directory in one fetch. We also provide a data/all.min.json file for lower bandwidth consumption:

https://raw.githubusercontent.com/The-Best-Codes/ai-model-directory/refs/heads/main/data/all.min.json

What's In the Directory?

At launch, the directory tracks models from 35+ providers, including OpenAI, Anthropic, Google, xAI, Mistral, DeepSeek, Cohere, Perplexity, OpenRouter, Vercel, GitHub Copilot, GitHub Models, Hugging Face, Groq, Cerebras, Fireworks, Together, DeepInfra, Baseten, Novita, Alibaba, Inception, Venice, Chutes, Friendli, and many more... and that list keeps growing. If your favorite provider isn't there, open an issue or send a PR; adding a new provider is usually a single small adapter file.

Browse It at models.agent-one.dev

Reading TOML files is great for machines, but not always great for humans. So we built a frontend for the directory at models.agent-one.dev.

It's a fast, sortable, searchable table with a column for everything in the schema. You can search across providers, model IDs, features, and modalities at once, sort by any column, and click straight through to a provider's website. It's the easiest way to answer questions like "which models support reasoning and tool calls under $1 per million input tokens?"

The table loads directly from data/all.min.json in the directory repo, so it's always in sync with the latest run.

Using It in Your Own Project

Consuming the directory is easy. Hit the raw GitHub URL for the bundled file:

curl https://raw.githubusercontent.com/The-Best-Codes/ai-model-directory/main/data/all.json

Or:

curl https://raw.githubusercontent.com/The-Best-Codes/ai-model-directory/main/data/all.min.json

You get back a JSON object keyed by provider, with each provider's models nested inside. This is the easiest path if you just need to populate a model picker or a pricing table. Because everything is plain files, you can fork the repo, add your own provider adapters, drop in metadata.toml for models you've measured yourself, and run the same GitHub Actions workflow on your fork. Your fork stays in sync with upstream while keeping your overrides intact.
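If you prefer typed access over raw curl, here is a minimal sketch in Rust. Only the documented top-level shape (an object keyed by provider, models nested inside) is assumed; everything deeper is left as untyped JSON:

```rust
// Count models per provider in data/all.json, fetched beforehand
// (e.g. with the curl command above). Uses the `serde_json` crate.
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("all.json")?;
    let directory: Value = serde_json::from_str(&raw)?;

    if let Some(providers) = directory.as_object() {
        for (provider, models) in providers {
            let count = models.as_object().map_or(0, |m| m.len());
            println!("{provider}: {count} models");
        }
    }
    Ok(())
}
```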

Security

Because the directory is updated automatically based on data fetched from third-party providers, the data here is only as trustworthy as the providers it comes from. If you're using this to make billing or routing decisions, treat it as a strong default and not as gospel. We have several measures in place to mitigate the obvious vulnerabilities:

  • Provider endpoints are hardcoded in source, so providers cannot redirect the updater to arbitrary user-controlled URLs
  • All fetched data is validated against a strict Zod schema before it's written to disk, which helps prevent malformed or unexpected fields from slipping through
  • Model IDs are normalized into safe directory names before writing, and entries whose normalized name would be empty are rejected
  • If multiple model IDs normalize to the same directory name, we resolve that deterministically instead of writing multiple conflicting directories
  • Terminal output is sanitized before logging, which reduces the risk of ANSI escape sequences or control characters spoofing the updater output
  • Every network fetch has a 60 second timeout so a slow or hostile provider can't hang the update job forever
  • IDs and names are length-limited and reject raw control characters, which helps defend against weird escapes, invisible junk in logs, and other malformed provider output
  • Generated model directories that no longer exist upstream are removed automatically on refresh
  • Overrides stay local: metadata.toml only applies to that model directory and is merged on top of fetched data
  • The updater does not execute provider-supplied code, shell commands, or HTML; it only fetches remote content, parses it, validates it, and writes normalized TOML files

That said, this is still provider-supplied metadata. A provider can lie about pricing, capabilities, limits, or release dates, and some providers expose better metadata than others. The goal here is to make the pipeline safe and robust, not to pretend third-party metadata is perfectly trustworthy.

What's Next

This is a beta release, so expect a few rough edges. Some of the things we're working on:

  • More providers (especially regional and self-hosted offerings)
  • A proper docs site
  • Programmatic SDKs for JS/TS, Python, and Go

If you want to help shape any of this, join us on Discord, open an issue, or send a PR.

Try It Out

Happy building!

DeepClaude: I Combined Claude Code with DeepSeek V4 Pro in My Agent Loop and the Numbers Threw Me Off
🔗Juan Torchia·May 4, 2026·8 min read·Global


#english#typescript#claudecode#llm


DeepSeek V4 Pro correctly solves 94% of deep reasoning tasks in my loop… but the latency cost makes it unusable for 60% of my agent cases. Yeah, you read that right. And that completely blows up the narrative of "combining models is always better."

Tuesday night I watched the DeepClaude post climb to 467 points on Hacker News. What caught me wasn't the repo itself — it was a comment buried on page 2: "The dual architecture makes theoretical sense, but nobody measured whether the orchestration overhead destroys the benefit in real loops." Three hours later I had the experiment running.

I've written before about how I use YAML specs for my agents and about how Kimi K2.6's benchmarks surprised me against my real cases. This post is the next step: what happens when you combine the two best models I use in production inside a concrete hybrid architecture.

My thesis, before I show you the numbers: DeepClaude is not a universal upgrade — it's a tool that shines in a specific task regime and sinks in another. The problem is that regime isn't obvious until you measure.


What DeepClaude Is and How I Dropped It Into My Real Loop

The DeepClaude repo implements an architecture where DeepSeek R1 (or V4 Pro, depending on the fork) does the chained reasoning — the internal thinking — and Claude handles synthesis and final output. The idea is to leverage DeepSeek's cheap chain-of-thought to give Claude richer context than it would generate on its own.

But I don't run a chat loop. I run an agent system that operates on my production codebase: generates code, reviews PRs, writes specs, detects regressions. The question wasn't "is it better in chat?" but "what does it do when one agent's output is the next agent's input?"

First thing I did was clone the repo and wire the integration into my TypeScript stack:

// deepclaude-client.ts
// Hybrid client: DeepSeek reasons, Claude synthesizes

import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai"; // DeepSeek uses OpenAI-compatible API

const deepseek = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: "https://api.deepseek.com/v1",
});

const claude = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

interface DeepClaudeResult {
  deepseekThinking: string; // raw reasoning
  claudeOutput: string; // final output
  latencyMs: number;
  tokensDeepseek: number;
  tokensClaude: number;
}

async function deepClaudeComplete(
  prompt: string,
  systemContext: string
): Promise<DeepClaudeResult> {
  const start = Date.now();

  // Step 1: DeepSeek generates deep reasoning
  const dsResponse = await deepseek.chat.completions.create({
    model: "deepseek-reasoner", // V4 Pro with thinking enabled
    messages: [
      {
        role: "system",
        content: "Reason through the problem in depth. Do not generate final output.",
      },
      { role: "user", content: prompt },
    ],
    max_tokens: 8000,
  });

  const thinking =
    dsResponse.choices[0]?.message?.content ?? "";
  const tokensDS = dsResponse.usage?.total_tokens ?? 0;

  // Step 2: Claude synthesizes using DeepSeek's reasoning as context
  const claudeResponse = await claude.messages.create({
    model: "claude-opus-4-5",
    max_tokens: 4096,
    system: systemContext,
    messages: [
      {
        role: "user",
        content: `Prior reasoning available:\n<thinking>\n${thinking}\n</thinking>\n\nTask: ${prompt}`,
      },
    ],
  });

  const claudeOutput =
    claudeResponse.content[0].type === "text"
      ? claudeResponse.content[0].text
      : "";

  return {
    deepseekThinking: thinking,
    claudeOutput,
    latencyMs: Date.now() - start,
    tokensDeepseek: tokensDS,
    tokensClaude: claudeResponse.usage.input_tokens + claudeResponse.usage.output_tokens,
  };
}

I ran this against three types of tasks from my real loop:

  1. Code generation with complex specs (30 cases)
  2. Code review of PRs with architectural changes (20 cases)
  3. Production regression debugging (15 cases)

The Real Numbers — and Where They Threw Me Off

Latency

The first number that hit me:

| Task | Claude Only | DeepSeek Only | DeepClaude |
| --- | --- | --- | --- |
| Simple code generation | 3.2s | 8.1s | 11.4s |
| Architectural code review | 7.8s | 19.3s | 24.1s |
| Regression debugging | 6.1s | 15.7s | 20.2s |

DeepClaude's latency is the sum of both plus orchestration overhead. There's no possible parallelism because DeepSeek's thinking is Claude's input. In a loop where one agent calls the next, this multiplies. With 4 agents chained, I went from a ~30-second pipeline to a ~90-second one.

Cost Per Task

Here's the pleasant surprise:

| Task | Claude Opus Only | DeepClaude |
| --- | --- | --- |
| Simple code generation | $0.038 | $0.019 |
| Architectural code review | $0.094 | $0.051 |
| Regression debugging | $0.071 | $0.041 |

DeepClaude runs ~46% cheaper than Claude Opus alone. The reason: DeepSeek generates the reasoning context at a fraction of the cost, and Claude receives a richer prompt that needs fewer output tokens to reach the correct answer.

Output Quality — Here's the Actual Thesis

I measured quality with a simple but honest method: ran each output against my codebase's tests, plus manual review for cases where tests aren't sufficient.

Simple code generation (functions under 100 lines, clear specs):

  • Claude only: 87% passes tests without modification
  • DeepClaude: 89% passes tests without modification
  • Difference: statistically irrelevant. The latency overhead buys you nothing here.

Architectural code review (changes touching multiple modules):

  • Claude only: identified 71% of real issues
  • DeepClaude: identified 91% of real issues
  • This difference matters. DeepSeek finds the edge cases Claude walks right past.

Regression debugging (production errors with real stack traces):

  • Claude only: reached root cause on first attempt in 67% of cases
  • DeepClaude: reached root cause on first attempt in 88% of cases
  • Here DeepSeek's deep thinking completely changed the outcome.

The pattern that emerged is clear: the regime where DeepClaude wins is long-range reasoning over existing code, not generation from scratch. And it makes sense — DeepSeek's thinking shines when there's rich context to explore, not when there's a clean spec to execute.


The Gotchas the Repo Doesn't Document

1. DeepSeek's Thinking Is Verbose to the Point of Annoying

In 30% of my cases, DeepSeek generated over 6,000 tokens of thinking for a task Claude resolves in 1,200 tokens of output. All that thinking lands in Claude's context, which then has to ignore half of it. I implemented a compression step:

// compress-thinking.ts
// Trim DeepSeek's thinking before sending it to Claude

async function compressThinking(thinking: string): Promise<string> {
  // Extract only conclusion blocks and critical steps
  const lines = thinking.split("\n");
  const relevant = lines.filter(
    (l) =>
      l.includes("Therefore") ||
      l.includes("The problem is") ||
      l.includes("The solution") ||
      l.includes("Conclusion") ||
      l.startsWith("→") ||
      l.startsWith("**")
  );

  // If compression is too aggressive, keep the last 2000 chars
  const compressed = relevant.join("\n");
  return compressed.length > 500
    ? compressed
    : thinking.slice(-2000);
}

With this, latency dropped 18% with no measurable quality loss.

2. Claude Ignores the Thinking When the Instruction Isn't Explicit

I caught this reading logs. If you don't explicitly tell Claude "use the prior reasoning to guide your response," it treats it as context noise. The system prompt matters:

// The system prompt that worked in my tests
const systemContext = `
You receive a coding task along with prior reasoning marked in <thinking>.
That reasoning already explored the solution space.
Your job is to synthesize that analysis into a precise, actionable response.
Do not repeat the reasoning — use it. Output must be code or direct analysis.
`.trim();

3. The Overhead Kills the Benefit in Async Pipelines

In my architecture, I have agent tasks that run in the background with no latency urgency. That's where DeepClaude makes sense. But in the agent that responds to uptime events on Railway, 24 seconds of latency is unacceptable — the user has already refreshed the page three times.

The rule I adopted: DeepClaude for batch and async tasks; Claude alone for synchronous tasks with a user waiting.

4. DeepSeek's Errors Get Amplified

I found two cases where DeepSeek's thinking reached an incorrect conclusion and Claude took it as gospel. There's no cross-validation mechanism — if DeepSeek reasons wrong, Claude synthesizes wrong. I implemented a fallback:

// Basic validation: if Claude expresses uncertainty, fall back to Claude alone
async function deepClaudeWithFallback(prompt: string, system: string) {
  const result = await deepClaudeComplete(prompt, system);

  // Detect uncertainty signals in Claude's output
  const errorSignals = [
    "i'm not sure",
    "could be incorrect",
    "the previous reasoning suggests",
    "based on the prior analysis, although",
  ];

  const outputLower = result.claudeOutput.toLowerCase();
  const hasUncertainty = errorSignals.some((s) =>
    outputLower.includes(s)
  );

  if (hasUncertainty) {
    // Fallback: Claude alone, without the contaminated thinking
    console.log("[deepclaude] Fallback triggered — thinking possibly corrupted");
    return await claudeOnlyComplete(prompt, system);
  }

  return result;
}

FAQ: DeepClaude in Production Agent Loops

Does DeepClaude fully replace Claude Code?
No, and thinking so would be a mistake. Claude Code has native integration with the filesystem, shell, and project context. DeepClaude is a completions architecture, not an integrated agent. The use cases are different: Claude Code for iterative interaction with the codebase; DeepClaude for heavy reasoning tasks inside your own pipeline.

Is DeepSeek V4 Pro the same as DeepSeek R1?
Not exactly. V4 Pro is the more recent version with improvements in multimodal reasoning and long context. The original DeepClaude repo was designed with R1, but the architecture is compatible. In my tests I used the deepseek-reasoner model, which is what the public API currently exposes.

How much does running DeepClaude in production cost at real volume?
At my current volume (~200 agent tasks per day), DeepClaude costs approximately $8/day versus $15/day for Claude Opus alone — but only for the tasks where I activated it (async batch, ~40% of volume). Net monthly savings: ~$210. Not transformative, but not nothing either.

Is it worth it for a small project with a few agents?
Probably not. The setup overhead, orchestration complexity, and managing two separate APIs carry a real maintenance cost. If you're running fewer than 50 agent tasks per day, Claude alone with a solid system prompt will get you 90% of the value without the complexity.

Is DeepSeek's thinking visible or a black box?
It's visible in the API response — plain text in the content field. That's a huge advantage for debugging: you can log the reasoning and understand why the pipeline reached a wrong conclusion. In my Railway logs, the thinking turned out to be the best diagnostic tool I had.

How does this affect the specs strategy I described before?
Pretty directly. In my YAML specs system for agents, the spec tells the agent what to do and how to structure its output. With DeepClaude, the spec is still Claude's input, but DeepSeek's thinking acts as a "context elaboration" step before Claude consumes it. Net effect: Claude needs less detailed specs because the thinking already resolved the ambiguities.


What I Accept, What I Don't Buy, and What's Still Rattling Around in My Head

I accept: DeepClaude is a legitimate architecture for a subset of tasks. The cost savings are real and the quality jump on deep reasoning is measurable. It's not marketing.

I don't buy: The narrative of "always better than either alone." The numbers clearly show that for simple code generation, the difference is statistical noise and the latency cost is a poisoned gift. The HN hype is overfit to complex reasoning cases.

What's still rattling around in my head: The real value of this architecture might not be the final output — it might be the thinking logs. Having DeepSeek's intermediate reasoning in my production logs gives me a level of observability into the agent's decision process that I never had before. That alone — regardless of whether it improves the output — might be worth the overhead.

The question I keep coming back to, after watching how Spotify is marking human content and how models differentiate in specific niches: is the future of coding agents an orchestrator that dynamically routes each task to the most appropriate model? DeepClaude is a crude first step toward that. And the numbers say there's something real here, even if the repo doesn't fully exploit it yet.

If you implement this in production, start with async batch. Measure latency before and after. And log the thinking — it's the most valuable data in the whole system.


Original source: Hacker News


This article was originally published on juanchi.dev

Vector Retrieval Quietly Replaced Keyword Match, and the SEO Stack Did Not Notice
📊Cihangir Bozdogan·May 4, 2026·16 min read·Global


#ai#webdev#seo#llm

How dense embedding retrieval replaced BM25 in modern AI search, what the mechanism actually does, and why exact-match SEO tactics quietly stopped working.
There is a page I audited last year that ranks well gets cited, gets quoted, gets used as a source by AI assistants for a phrase nobody types. The literal string appears nowhere in the document. The document is about the topic, plainly and accurately, in clear prose. The query is a paraphrase. Twenty years of SEO heuristics would predict this page does not match. The retrieval stack thinks it matches better than half the pages that do contain the literal phrase. The inverse also happens: a page that uses a query's exact terms three times in the title and twice in the H1, and is not getting cited at all, because the embedding model thinks the page is about something different from what the user asked. Same query class, two outcomes and the difference is mechanical. The retrieval stack changed underneath, and most of the SEO heuristics the industry still teaches are heuristics about a stack that is now the second-stage filter, not the first.

I built my mental model the slow way. I read the BEIR benchmark paper end to end, then DPR, then ColBERT, then HNSW, and then sat with a public embedding model and a corpus of my own, running similarity computations against synonym pairs, paraphrase pairs, and adversarial pairs until the behaviour stopped surprising me. After that I started watching what happened to AI citations when pages were rewritten in different ways: exact-match tightened, paraphrases added, exact-match stripped while semantics preserved. The pattern that fell out is not subtle, and it overturns several pieces of SEO advice that are still being repeated as if they were neutral facts.

This post is the field report: the shift from sparse to dense first-stage retrieval, what an embedding model actually represents about a page and a query, why approximate nearest neighbour search is the workhorse of the recall step, why dense-only retrieval fails in specific predictable ways and why hybrid retrieval is the production answer, and what all of that means for content design. It is technical because the mechanism is technical. The shortcuts the SEO industry has been selling are shortcuts to the wrong stack.

The Two Decades of BM25

For roughly twenty years, the dominant first-stage retrieval algorithm on the open web and inside almost every search engine, on-site search, and Lucene/Elasticsearch deployment was BM25, formalised by Robertson and Zaragoza in their 2009 retrospective "The Probabilistic Relevance Framework: BM25 and Beyond." BM25 is a sparse, lexical, term-frequency-based scorer. It builds an inverted index of terms to documents. At query time it scores documents by how often the query terms appear, weighted by inverse document frequency, with saturation and length normalisation parameters bolted on. The mathematics is closed-form, the index is small, and the recall is reasonable for queries whose terms overlap exactly with the document.

BM25 has properties the SEO industry built an entire grammar around. It rewards keyword presence. It is sensitive to keyword frequency up to a saturation point. It penalises long documents to prevent stuffing. It cannot match a paraphrase. It cannot infer that "vehicle" and "automobile" are the same concept. It cannot tell that "how to fix a slow website" and "improving page load performance" are about the same question. The keyword-research industry, on-page-optimisation playbooks, exact-match domain folklore, the H1-must-contain-the-target-keyword reflex: all of that grammar is downstream of how BM25 scores documents. When the retrieval stack scores on lexical overlap, the rational thing for authors is to engineer lexical overlap. So they did, for two decades.

The thing that changed, quietly enough that most SEO commentary missed it, is that BM25 stopped being the only thing, and, on a growing share of the queries that matter for AI search, stopped being the dominant thing at the recall step.

The Dense Retrieval Era

Dense retrieval was not a single moment. It was a slow accumulation of papers that each made the dense approach better, cheaper, or more general. The two reference points worth knowing by name are DPR (Karpukhin et al., 2020) and ColBERT (Khattab and Zaharia, 2020). DPR demonstrated that a dual-encoder, where query and passage are each encoded independently into a dense vector and scored by inner product, could outperform BM25 on open-domain question answering by a substantial margin. ColBERT pushed the thinking further by keeping per-token embeddings and computing a late-interaction score, improving fine-grained matching while remaining tractable.

The third reference point, which brought rigour to the comparison, is the BEIR benchmark (Thakur et al., 2021). BEIR took eighteen heterogeneous IR datasets, ran the major sparse and dense retrievers across all of them in zero-shot mode, and published the comparison. The headline result was less tidy than the dense-retrieval marketing wanted: dense models trained on one domain did not always transfer to another, and BM25 remained surprisingly hard to beat on certain tasks. The honest reading of BEIR is that neither sparse nor dense is a universal winner alone, and hybrid systems combining both tend to dominate.

That honest reading is the one production search systems implement. It is also the one most SEO advice ignores.

What an Embedding Model Sees in a Page

The mechanism is worth tracing. An embedding model takes a sequence of tokens (your page's text, broken into sub-word tokens by the model's tokeniser) and runs them through a stack of transformer layers. Each token attends to every other token (or a windowed subset). The output is a sequence of contextualised token embeddings: each token carries information about the words that surround it. The model pools that sequence into a single vector (often the embedding of a [CLS] token, sometimes mean-pooling, sometimes a learned head). The result is a fixed-size vector, typically between 384 and 3072 dimensions depending on the model.

What that vector represents is meaning, not surface text. Two paragraphs saying the same thing in different words produce vectors close in the embedding space. A paragraph about "the impact of caching on web performance" and a paragraph about "how stale responses speed up rendering" sit near each other even though they share almost no tokens. This is what dense retrieval does that BM25 never could. It is also why content that is "well-written about the topic" can outrank content that is "engineered for the keyword": the model is not counting tokens, it is comparing meaning.
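You can watch this behaviour directly. A minimal sketch using the sentence-transformers library; the model name is one public choice among many, and the similarity values will vary by model:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # any public embedding model

a = model.encode("the impact of caching on web performance")
b = model.encode("how stale responses speed up rendering")
c = model.encode("best sourdough starter recipe for beginners")

print(util.cos_sim(a, b))  # paraphrase pair: high similarity, near-zero token overlap
print(util.cos_sim(a, c))  # unrelated topic: much lower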

The flip side is that the embedding model is a learned model, not a dictionary. It has a training distribution. Concepts well-represented in training are mapped cleanly. Concepts that were absent or rare are mapped sloppily. Specific identifiers (SKUs, model numbers, error codes, brand names that look like generic words) frequently sit in regions of the embedding space with very low resolution. That is one of the dense-retrieval failure modes, and we will come back to it.

What an Embedding Model Sees in a Query

The query goes through the same model. The user's words or the query the LLM has rewritten on the user's behalf get tokenised, embedded, and pooled into a vector in the same space as the documents. The retrieval step is a nearest-neighbour search: which document vectors are closest to the query vector by cosine similarity or inner product?

The query embedding does several things a BM25 query cannot. It handles paraphrase: "fastest way to deploy a Next.js app" lands near documents about "Next.js deployment latency," even though "fastest" is missing from one and "latency" from the other. It handles synonym disjunction softly: a query about "vehicles" partially matches documents about "cars" without a configured dictionary. It handles intent inference up to a point: a question lands closer to documents that answer it than to documents that ask similar questions, because the model has learned the difference from training data.

What it does not do is handle exact identifiers well. A query for the SKU BTX-449-G2 returns high similarity only if the model tokenised it the same way for document and query, and embeddings of rare tokens are noisy. A query for the precise string error E_INVALID_REDIRECT may end up near generic documents about redirect errors and miss the document that contains the exact string verbatim, because the model treats the rare code as low-information. That is why hybrid retrieval exists.

Before we get there, there is a piece between the user's input and the embedding step that most operators forget about.

The Invisible Query Rewrite

When a model produces a search-grounded answer, the query that hits the retrieval stack is rarely the user's literal text. The model rewrites the question into one or more search queries: sometimes expanding into sub-queries, sometimes paraphrasing, sometimes filling in implicit context from the conversation. ChatGPT search, Perplexity, Gemini grounded mode, Claude with the web search tool, and Bing Chat all do some form of query rewriting before retrieval. The stack downstream sees the rewritten query, not the user's words.

This matters for content design. Optimising for the literal user query is a fool's errand: you do not see the literal query, you see the query the model decided to send, already normalised and paraphrased. What you can optimise for is the cluster of paraphrases the model is likely to produce around a given intent. This is why writing "the same answer phrased multiple ways within one page" tends to win over "the same keyword repeated multiple times within one page": the paraphrased pages match more of the rewrite distribution, which is what actually hits the index.

Approximate Nearest Neighbour at Scale

In principle the recall step is just nearest-neighbour search. In practice, exact nearest-neighbour search over hundreds of millions of vectors is infeasible at AI-search latencies. The production answer is approximate nearest neighbour, or ANN, and the dominant open-source algorithm is HNSW (Hierarchical Navigable Small World graphs), described by Malkov and Yashunin in 2016.

HNSW is a graph-based index. The intuition is worth holding clearly because it explains why ANN is "good enough" for the first stage.

HNSW conceptual structure (top layer is sparse, bottom layer is full)

  Layer 2 (sparse, long edges):    o ----------- o ----------- o
                                    \           /             /
  Layer 1 (denser, medium edges):   o --- o --- o --- o --- o
                                     \   /     \    \   /
  Layer 0 (full, short local edges): o-o-o-o-o-o-o-o-o-o-o-o
                                                ^
                                          query enters at top,
                                          greedy descent narrows
                                          neighbourhood at each layer

A query enters at the top layer, which has few nodes connected by long edges. The algorithm greedily walks toward the query's nearest neighbour, drops down to the next layer using the current best node as the entry point, and repeats. By the time the search reaches the bottom layer (which contains every vector), the candidate region is already narrowed to a small neighbourhood, and the bottom-layer search only explores a few hundred nodes instead of the full corpus. The result is sub-linear search time with high recall, configurable through parameters that trade off recall against latency.

Faiss, the open-source library from Meta, implements HNSW alongside several other ANN structures including IVF (inverted file with coarse quantisation) and product quantisation. Pinecone, Weaviate, Qdrant, Milvus, pgvector, Vespa every production vector database is a variation on these ideas. HNSW dominates the discussion because it has consistently strong recall on high-dimensional vectors with reasonable memory overhead.
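If you want to feel the shape of the API, here is a minimal hnswlib sketch; the corpus is random noise and the parameters are illustrative rather than tuned:

import hnswlib
import numpy as np

dim, n = 384, 100_000
vectors = np.random.rand(n, dim).astype(np.float32)  # stand-in for page embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # M = edges per node
index.add_items(vectors, np.arange(n))

index.set_ef(64)  # search-time breadth: higher = better recall, more latency
labels, distances = index.knn_query(vectors[:1], k=10)  # approximate top-10

The ef and M knobs are the recall-versus-latency trade described above, exposed directly.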

The catch (and it is the catch that hybrid retrieval was invented to address) is that ANN is approximate. The recall step returns the top-k by approximate similarity, not the true top-k. For most queries the top few results are stable. For queries with rare terms or out-of-distribution embeddings, the approximate index can miss the document that lexical search would have found trivially. Combined with the embedding model's own weaknesses on rare and exact terms, the dense-only path has predictable failure modes.

Where Dense Alone Loses

There is a class of queries where pure dense retrieval is reliably worse than BM25.

Queries with exact identifiers (product SKUs, model numbers, error codes, version strings, ISBNs, regulatory references) are dense retrieval's worst case. The embedding model has typically not seen BTX-449-G2 enough during training to give it a meaningful position in vector space. BM25 treats it as a token and finds the document instantly.

Queries with brand names that overlap common words (Apple, Square, Notion, Linear, Vector) are a related case. The embedding model maps "Apple" closer to "fruit," "company," and "computer" by some learned blend. The query "Apple support phone number" sits in a region where consumer-electronics documents and grocery-aisle documents coexist. BM25 does not care about meaning and scores by literal token overlap.

Queries about domains under-represented in training (niche legal corpora, regional regulatory texts, deeply specialised technical fields) also tend to favour BM25, because the embedding model's resolution in those regions of the space is poor.

Queries with negation and quantifiers ("papers that do not use BERT," "websites without a privacy policy") are hard for embedding models, which struggle to invert meaning. BM25 with explicit operators handles these better than naive dense retrieval, although in practice the LLM usually rewrites the query into something the dense retriever can handle.

This is the empirical content of the BEIR result. Across eighteen datasets, no single retriever wins everywhere, and the cases where dense loses are not random; they cluster around the failure modes above.

Hybrid Retrieval Is the Production Answer

Production AI search systems do not pick sparse or dense. They run both, fuse the results, and let the rerank stage clean it up.

The two common fusion approaches are Reciprocal Rank Fusion (a simple, training-free recipe that sums the reciprocal of each document's rank in each list) and learned combiners (models trained to score documents using both BM25 and dense scores as features). Vespa, Weaviate, Elasticsearch's hybrid search, Qdrant's BM25 + dense pipelines, and OpenSearch's neural-sparse hybrid all implement variations of these patterns. The rerank step that follows (a heavier cross-encoder that re-scores the top candidates) is its own conversation, and I am keeping it deliberately brief here. The point for retrieval is that the rerank cleans up the noise the recall step admitted, and the recall step is now hybrid rather than purely lexical.
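RRF in particular is small enough to show whole. A sketch, with the conventional k=60 damping constant:

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists (e.g. one from BM25, one from dense ANN) by summing 1/(k + rank)."""
    scores = {}
    for ranked_docs in rankings:
        for rank, doc_id in enumerate(ranked_docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document mid-ranked in both lists can beat one that tops only a single list
fused = rrf_fuse([["a", "b", "c", "d"], ["c", "a", "e"]])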

Here is the comparison that matters, framed as the characteristics of each path:

| Property | Sparse (BM25) | Dense (embedding-based) | Hybrid (sparse + dense) |
| --- | --- | --- | --- |
| Matches exact terms | Yes, by construction | Weakly, via tokenisation | Yes (sparse rescues this) |
| Matches paraphrases | No | Yes | Yes (dense provides this) |
| Handles synonyms | Only with explicit dictionary | Yes, learned | Yes |
| Handles rare identifiers | Yes | Weakly | Yes (sparse rescues this) |
| Handles negation | Yes, with operators | Poorly | Partial |
| Robust to OOD vocabulary | Yes | Poorly | Yes (sparse rescues this) |
| Recall vs latency at scale | Inverted index, sub-linear | ANN graph, sub-linear | Run both, fuse |
| Index size | Small (token postings) | Large (vector per chunk) | Sum of both |
| Cold-start on new content | Immediate (just index tokens) | Requires embedding compute | Both |

That table is the operational summary of two decades of BM25 plus six years of dense-retrieval-at-scale. It explains why the production answer is hybrid, and why neither extreme of the SEO debate ("keywords are dead" or "keywords are all that matter") is correct. They are both signals. The retrieval stack uses both. Content that wins in AI search is content that survives both filters.

What Still Matters from the Lexical Era

The dense retriever does not erase the lexical signal; it adds a second signal next to it. Everything BM25 ever rewarded still partially matters, but the marginal return on stuffing the same term thirty times has collapsed. What survives:

Exact entity names. Brand names, product names, person names, location names: these are what hybrid retrieval rescues from dense-only failure. If your brand is Acme Software, that exact string needs to appear once on the page in plain text where the indexer can find it, somewhere unambiguous, with the surrounding paraphrases the embedding model can latch onto.

Exact identifiers. SKUs, error codes, version strings, model numbers. Same story. Once on the page in the canonical form is what you need.

Structured data. Schema.org JSON-LD remains load-bearing because it gives the indexing pipeline a clean entity graph that does not depend on parsing prose.

Brand spellings and variations. If users search for both e-mail and email or Wi-Fi and WiFi, both forms benefit from being present somewhere on the site. Embedding models are mostly robust here, not perfectly, and the BM25 leg is exact-only.

What is no longer worth doing (and was probably never worth doing as much as the SEO playbooks insisted) is keyword density manipulation, exact-phrase repetition, and synonym dictionaries pasted into footers. The marginal return from these tactics in a hybrid stack is approximately zero, and in some cases negative, because the embedding pooling step degrades under repetition.

Designing Content for Both Filters

The practical content rule is short and unromantic: write the answer once in the canonical phrasing, then write the paraphrases around it, then make sure the structure is parseable.

The mechanism for each clause is real. Canonical phrasing gives BM25 the exact-match signal it needs. Paraphrases widen the embedding space the page covers, so the page lands close to a wider distribution of query rewrites. Parseable structure (short paragraphs, one thought per chunk, headings that match the prose, schema where appropriate) feeds the chunker and the structured-data layer downstream.

The thing the SEO industry got wrong, and is still getting wrong, is the assumption that you must choose between exact-match and semantic richness. The hybrid stack does not force a choice. It rewards both, scored by different paths and fused. Pages that try to win on exact-match alone fail the dense filter on paraphrases. Pages that try to win on semantic richness alone fail the sparse filter on exact identifiers and brand names. Pages that do both (which is what good prose has always been) match more of the query distribution.

How to Verify You Are Winning at the Embedding Layer

This is the part of the post where I tell you to stop guessing and start measuring, because the measurement is cheap and the alternative is folklore.

Pick a public embedding model: text-embedding-3-small from OpenAI, voyage-3 from Voyage, or a BGE model from BAAI (free). Pick a corpus of your own pages. Embed each page. Take a list of queries you believe should match those pages (literal phrasings, paraphrases, adversarial cases), embed those, and compute cosine similarity between every query and every page.

What you are looking for is not absolute numbers; embedding similarities are model-specific and not directly comparable across models. You are looking for ranks and gaps. For a query that should match page A, does page A come first? If it is buried under three tangentially related pages, your content is failing the dense filter, and the failure is diagnosable. Often the fix is a missing paraphrase, a buried answer the pooling step is averaging away, or a structure where the topic shifts halfway through and the pooled vector lands between two centroids.

Run the same exercise with BM25; most search libraries (Elasticsearch, OpenSearch, Vespa, Tantivy, Whoosh) implement it in a few lines. Compare. The cases where the same query ranks the page differently between the two paths are the cases where hybrid retrieval will cover or expose your content. That comparison is the thing the SEO industry pretends not to need to do, because doing it makes the folklore harder to sell.
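A minimal sketch of that two-path comparison, using the rank-bm25 package for the sparse leg and sentence-transformers for the dense leg; the pages and query here are toy stand-ins for your own corpus:

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

pages = {
    "caching-guide": "How caching layers cut page load time for repeat visitors.",
    "cdn-pricing": "Comparing CDN pricing tiers and bandwidth overage fees.",
}
query = "why does caching make websites faster"

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
dense = util.cos_sim(model.encode(query), model.encode(list(pages.values())))[0]

bm25 = BM25Okapi([t.lower().split() for t in pages.values()])
sparse = bm25.get_scores(query.lower().split())

for name, d, s in zip(pages, dense, sparse):
    print(f"{name}: dense={float(d):.3f} bm25={float(s):.2f}")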

I run this against my own content periodically, against competitors' content, and against the queries I expect AI assistants to rewrite mine into. It is the cheapest piece of due diligence in modern content engineering and consistently produces actionable findings.

A Note on Google

Whenever the dense-retrieval story comes up, someone asks "did Google switch to vectors?" The honest answer is that Google's retrieval stack is hybrid, partially private, and has been neural-augmented since well before the LLM era: RankBrain (2015) and BERT integration (2019) are the named layers, but those are not the entire stack, and the company has not published a definitive "we switched from BM25 to vectors on date X" statement because the truth is more complicated. The on-the-record position is hybrid: lexical features plus learned ranking plus several layers of neural processing in concert. AI Overview and Gemini's grounded mode add their own retrieval and synthesis on top. Treating Google's stack as either "still BM25 underneath" or "all vectors now" mis-frames it. It is layered, hybrid, mostly private. The operational stance: assume both filters are present, design content that survives both, and do not bet against either signal.

The Synthesis

The recall step in modern AI search is dense, not lexical, but the production stack is hybrid, and that is the framing the SEO industry has not absorbed. Embedding models match meaning. BM25 matches tokens. Both fire. The pages cited by AI assistants are the pages that survive both filters, not the pages that game one.

The single sentence: retrieval is no longer keyword match; it is hybrid recall where the dense signal handles paraphrase and intent and the sparse signal rescues exact identifiers, and content design that ignores either filter loses on the queries the other one would have caught.

If you only have time to internalise three things, in order:

  1. The first stage is hybrid, not lexical. Dense retrieval handles paraphrase, intent, and synonyms. Sparse retrieval handles exact identifiers, brand names, and rare terms. Both fire on every query in production stacks. Content that engineers for one and ignores the other loses on the queries the other one would have caught.
  2. The user's literal query is not the query that hits the retrieval stack. LLM rewrites paraphrase, expand, and normalise the query before retrieval. Optimising for the literal user phrasing is optimising for a string the index never sees. Optimising for the cluster of paraphrases around an intent is what moves the needle.
  3. Measure your content with a public embedding model. It costs almost nothing. Compute similarity between your pages and the queries you expect. Cases where a topically correct page ranks low in cosine similarity are cases where your content is failing the dense filter, and the failure is usually diagnosable. The SEO industry mostly does not do this, which is why so much advice is still keyword-stack folklore.

The page that ranks for the phrase nobody types is not magic. It is a page whose embedding sits close to the embedding of the query the user actually asked, in a space the model learned from a corpus closed before either of you wrote anything. The page that wins the exact-match phrase but does not get cited is the inverse: the lexical filter passed it, the dense filter dropped it, and the rerank step never saw it. Both outcomes are mechanical, both are addressable, and the content design that addresses both is what wins in the hybrid retrieval stack, which is the stack that decides what AI assistants see.


The retrieval-stack synthesis here is my own reading of the primary literature: Robertson and Zaragoza's The Probabilistic Relevance Framework: BM25 and Beyond (Foundations and Trends in Information Retrieval, 2009); Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering (arXiv:2004.04906, 2020); Khattab and Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (arXiv:2004.12832, 2020); Thakur et al., BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models (arXiv:2104.08663, 2021); Malkov and Yashunin, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs (arXiv:1603.09320, 2016); and the Faiss library source and documentation, combined with observable behaviour from running my own embedding-similarity computations against my own corpus and watching what happened to AI citations after content rewrites. Where I have written "in my testing" or "the pattern I observe," that is exactly what I mean. The directional claims about exact-match SEO no longer paying are mechanistic (embedding similarity is computable on any public model and the audit is reproducible), but I am not making quantitative promises, and the magnitude of any individual rewrite varies by domain, model, and query distribution. Provider behaviour is moving; verify against current docs and current model behaviour before shipping a strategy.

Evolution Is Back: A New Way to Fine‑Tune LLMs
🚀Ankit Dey·May 4, 2026·7 min read·Global

Evolution Is Back: A New Way to Fine‑Tune LLMs

#ai#reinforcementlearning#machinelearning#coding

Evolution Is Back: A New Way to Fine‑Tune LLMs
If you grew up around game AIs and coding forums, you've probably heard this idea:
"One day we'll train superhuman AI with evolution or genetic algorithms."
Then deep learning took over, gradient descent won, and evolution‑style methods quietly got pushed into the museum.
Now they're back.
Evolution Strategies (ES) are being rediscovered as a serious way to fine‑tune large language models (LLMs), building on three key papers:
Evolution Strategies as a Scalable Alternative to Reinforcement Learning (OpenAI, 2017)
Evolution Strategies at Scale: LLM Fine‑Tuning Beyond Reinforcement Learning (2025)
Evolution Strategies at the Hyperscale (EGGROLL) (2025–26)

Let's unpack the core ideas in plain language.


First: what are evolution strategies, really?
Forget math for a second. Think of ES like this:
You start with one model.
You make a bunch of slightly different copies by adding small random tweaks to its weights.
You test each copy on some task and give it a score (fitness).
You keep the good directions, throw away the bad ones, and update the original model accordingly.
Repeat.

It's like running a population of "mutated" models in parallel, seeing which ones do better, and slowly nudging your main model toward those helpful changes over time.
Instead of computing gradients and backpropagating through every token, ES treats the model as a black box:
"I don't care how you work inside. I just care what score you get when I poke you in this direction."
So what this means for you: ES is a way of improving a model using only inputs + outputs + a score, no gradient access needed.
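In code, one update of the basic recipe fits in a few lines. A minimal NumPy sketch; the hyperparameters are illustrative, and fitness_fn stands in for whatever scores a flattened parameter vector:

import numpy as np

def es_step(theta, fitness_fn, pop_size=30, sigma=0.02, lr=0.01):
    """One ES update: perturb, score each copy, move toward the good directions."""
    noise = np.random.randn(pop_size, theta.size)     # one random tweak per copy
    scores = np.array([fitness_fn(theta + sigma * eps) for eps in noise])
    advantages = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalise fitness
    update = advantages @ noise / (pop_size * sigma)  # fitness-weighted directions
    return theta + lr * update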


Why ES originally "died" for deep learning

Early on, people did try to train neural nets with evolution‑like methods. They even got Atari agents working this way.
But there were two big problems:
Too many knobs
Even a small deep network has millions of parameters. Randomly mutating all of them at once is like trying to tune a 2‑million‑knob radio with your eyes closed. Most changes destroy performance instead of improving it.
Everything is entangled
In neural nets, one weight doesn't act alone. Changing a single parameter can ripple through many layers in weird ways. So a naive mutation tends to scramble behavior instead of slightly improving it.

Researchers tried to be clever by modeling correlations between parameters (covariance matrices), but that meant tracking trillions of numbers - totally infeasible at scale.
OpenAI's 2017 paper fixed part of this by:
Using simple Gaussian noise to perturb all parameters.
Running huge populations in parallel across many CPUs/GPUs, averaging out the noise.

This made ES work surprisingly well for deep RL tasks like Atari and humanoid locomotion, and showed that ES can be a scalable alternative to conventional reinforcement learning in some settings.
But for classic language pretraining, ES still lost badly to gradient descent:
Next‑token prediction gives a rich, per‑token "teacher signal", perfect for gradients.
ES throws all of that away and only works with a single score per run, which is much weaker and more expensive to get.

So what this means for you: for training a foundation model from scratch on text, gradient descent is still king. ES looked like an interesting side quest, not a main route.


Where ES does make sense: RL‑style LLM fine‑tuning

Now fast‑forward to RL‑style fine‑tuning for LLMs: things like RLHF, GRPO, and reasoning‑focused post‑training.
Here, the situation flips:
You often only get one score per full answer: a reward model's rating, a human thumbs up/down, or a task accuracy.
You don't know exactly which tokens in that answer were good or bad. Credit assignment over a long sequence is hard.

This is exactly the situation ES was built for:
"Give me a single scalar reward for each model variant, and I'll figure out which parameter directions look promising."
In other words, for post‑training a big model to improve its behavior on complex tasks (reasoning, following human preferences, long‑horizon objectives), ES is suddenly a very natural fit.
So what this means for you: while gradients shine during pretraining, ES can shine during the "make this model actually behave better" phase.


Evolution Strategies at Scale: ES vs RL for LLM fine‑tuning

The 2025 paper "Evolution Strategies at Scale: LLM Fine‑Tuning Beyond Reinforcement Learning" takes this idea seriously and stress‑tests ES on billion‑parameter LLMs, without shrinking the search space.
The key moves:
They treat the entire set of model weights as the thing being explored (parameter‑space exploration), not just the outputs.
They run multiple slightly perturbed versions of the model in parallel.
Each version generates answers, gets a scalar reward, and those rewards are used to compute an update direction for the base model.

Their main findings:
ES really can scale to LLM‑sized models, contrary to years of skepticism.
It can be competitive with popular RL methods on several fine‑tuning benchmarks.
It's naturally tolerant of long‑horizon, delayed rewards, doesn't need token‑level credit assignment, and as a black‑box method, may be less prone to certain kinds of reward hacking and training instability.

Think of it this way:
Standard RL in LLMs: "Keep the model fixed; jiggle the actions (tokens) and reward good sequences."
ES for LLMs: "Jiggle the model itself, see which altered versions behave better overall, then move the base model that way."

So what this means for you: we now have serious evidence that ES isn't just a toy or historical curiosity - it's a real alternative to RL for post‑training big language models.


EGGROLL: making ES actually fast on GPUs
There was still a brutal practical problem:
ES needs many perturbed copies of a huge model.
Running 30–100 full forward passes per update is insanely expensive.
Enter "Evolution Strategies at the Hyperscale", also called EGGROLL.
The core trick:
Instead of randomly perturbing all weights in a huge, unstructured way,
They structure each perturbation as a low‑rank (LoRA‑style) update.

Why this matters:
GPUs love big, regular matrix multiplies.
By expressing perturbations as low‑rank adapters, you can batch many of them together, reusing most of the main model's computation and just swapping the cheap adapters.
That turns "30 full forward passes" into "one main pass + cheap variations," massively improving efficiency.
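My reading of that trick, as a sketch: instead of drawing a full m×n noise matrix per population member, draw two thin factors and let their product be the perturbation. The 1/sqrt(rank) scaling keeps the per-entry variance comparable to full Gaussian noise:

import numpy as np

def low_rank_perturbation(m, n, rank=8, sigma=0.02):
    """LoRA-style noise: E = (sigma / sqrt(rank)) * A @ B.T instead of a full m x n Gaussian."""
    A = np.random.randn(m, rank)   # m*rank numbers ...
    B = np.random.randn(n, rank)   # ... plus n*rank, instead of m*n
    return (sigma / np.sqrt(rank)) * (A @ B.T)

Because each member's difference from the base model is just a thin pair of factors, many members can share the base model's big matrix multiplies, which is where the batching win comes from.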

Results from the paper:
Up to 100× speed‑up in training speed for billion‑parameter models at large population sizes.
Throughput reaching 91% of pure inference speed - i.e., ES becomes almost as cheap as just running the model, even though you're optimizing it.
Competitive performance with ES and GRPO in multiple settings, including:
Stable pretraining of integer‑only recurrent language models.
Reasoning‑focused fine‑tuning of LLMs against strong RL baselines.

There's also a nice theoretical bonus: they show that as model dimension grows, EGGROLL's low‑rank perturbations behave consistently with classical Gaussian ES; you're not secretly optimizing something completely different.
So what this means for you: EGGROLL makes ES not just mathematically interesting, but hardware‑friendly. It fits how GPUs like to work, which is crucial if this is ever going to be used widely in industry.


How is it different from RL?

RL (policy gradients for LLMs):
Learns how to act by adjusting the policy so its actions (tokens) get higher expected reward in an environment.
It treats the model as differentiable and uses gradients of expected reward to update weights.
ES (in this LLM setting):
Treats the whole model as a black box and directly searches in parameter space.
It perturbs the weights, checks which mutated models score better, and moves the base model in that direction.

So: RL mostly says "given this model, improve the way it chooses actions."
ES says "change the model itself until its overall behavior improves."

  1. How learning happens

RL:
Usually one agent/policy. Interacts step‑by‑step with the environment, gets rewards along the trajectory.
Uses gradients (or approximations) to update based on which actions in which states led to good or bad outcomes.

ES: 
Many parallel "agents" (perturbed copies of the model) per update.
Each gets a single scalar score (fitness) after running.
ES keeps only the information about better directions in weight space; poor variants are mostly discarded.

So: RL learns from detailed trial‑and‑error along a path; ES learns from comparing whole variants and keeping only the winners.

  2. Gradients vs black‑box search

RL (policy gradient, actor‑critic, etc.):
Needs a differentiable path from parameters → actions → rewards, or at least an estimator of that gradient.

ES:
Needs only: "for these slightly changed weights, the total score was X." No requirement to backprop through time, tokens, or a reward model.

This is why ES is attractive for LLM post‑training: you can optimize behavior even when credit assignment across long sequences is messy, as long as you can define a scalar reward for each run.

  3. When each tends to work better

RL tends to shine when:
You have a strong, dense learning signal and differentiable structure.
You care about squeezing out every bit of performance.
You can afford the complexity of gradients, advantage estimates, value functions, etc.

ES tends to shine when:
The reward is sparse, delayed, or messy.
You only have black‑box access to the model or environment.
You want massive parallelism with simple updates and good robustness to noise.

In the LLM fine‑tuning papers above, ES is being explored as an alternative to RL‑style methods (like PPO/GRPO) for the "post‑training" phase: same overall goal (make the model's behavior better given a reward signal), but with a different optimization philosophy.


Why any of this matters to you (and the future of LLMs)

Putting the three pieces together:
OpenAI's 2017 work showed ES can scale to deep neural networks, at least for RL in games and control.
Evolution Strategies at Scale showed ES can fine‑tune billion‑parameter LLMs and compete with mainstream RL methods on real tasks.
EGGROLL showed how to make ES efficient enough on GPUs to be practical at "hyperscale."

If this line of work keeps progressing, it could mean:
New ways to fine‑tune and align models without needing backprop access to the base model (think: black‑box optimization as a service).
More robust training on tasks with messy, delayed, or sparse rewards, where traditional RL struggles.
Cheaper, more parallelizable post‑training pipelines that better match real GPU/TPU hardware.

And from a bigger‑picture point of view, it's just cool that an old idea, "evolve better models instead of just following gradients", is getting a serious second life in the LLM era.
So what this means for you: the future of "how we improve AI" might not be only about better gradients; it might also involve smarter evolution.

Improving and Validating Multi-Agent Prompts with Bedrock AgentCore Optimization
🌐yoko / Naoki Yokomachi·May 4, 2026·7 min read·Global

Improving and Validating Multi-Agent Prompts with Bedrock AgentCore Optimization

#ai#aws#agents#llm

This article is an AI-assisted translation of a Japanese technical article.

Introduction

In April 2026, Amazon Bedrock AgentCore added a new capability called Optimization, which takes real agent traces and proposes prompt improvements based on them.
https://aws.amazon.com/about-aws/whats-new/2026/05/bedrock-agentcore-optimization-preview/

In this article, I apply AgentCore Optimization to a Strands Agents-as-Tools setup (a main agent that wraps sub-agents as @tools) and walk through what actually happens. What kind of improvements does Recommendations propose? Does the change hold up under real traffic in an A/B test? And how does it feel to put this into operation? Those are the questions I tried to answer.

Inside AgentCore Optimization

Let me start by laying out what Optimization actually consists of.

The three capabilities

| Capability | Role |
| --- | --- |
| Recommendations | Takes real trace logs plus a target Evaluator as input, and has an AI generate improved versions of system prompts and tool descriptions. Instead of you iterating manually, Recommendations does the iteration for you. |
| Configuration bundles | Externalizes prompts and tool descriptions out of source code and version-manages them on the AgentCore side. You can change agent behavior just by swapping the bundled values: no code change, no redeploy. Also used to run two settings side by side in the A/B test described below. |
| A/B testing | Routes real traffic via AgentCore Gateway between two variants (control / treatment), scoring each side with an Evaluator. You can compare which prompt actually performs better in production, with statistical backing. |

The official docs describe these three as a "continuous improvement loop": Recommendations generates an improved version → Configuration bundles version-controls it → A/B testing validates the effect under real traffic. The three capabilities are designed to cycle.

Prerequisites

Following the official docs, the setup requires:

  • An agent built with Strands Agents
  • Deployed to AgentCore Runtime with Observability enabled
  • CloudWatch Transaction Search enabled

Building the test setup

For the experiment I built a multi-agent setup with Strands Agents — a main agent that delegates to specialized sub-agents for weather and news, wired together with the Agents-as-Tools pattern.

The repo:
https://github.com/n-yokomachi/agentcore-optimization-lab

Configuration bundle structure

To make a setup A/B-testable, prompts and tool descriptions need to be externalized in configBundles inside agentcore.json. The bundle structure I ended up with:

{
  "components": {
    "{{runtime:agentsAsToolsLab}}": {
      "configuration": {
        "systemPrompt": "You are an assistant that answers questions about weather and news.",
        "weather_agent": "Get weather",
        "news_agent": "Get news"
      }
    }
  }
}

A note on the prompts: I deliberately wrote them quite carelessly so the impact of Recommendations would be easy to see.

{{runtime:agentsAsToolsLab}} is an agentcore CLI placeholder; it gets resolved to the actual Runtime ARN at deploy time.

One quirk: the tool descriptions (weather_agent / news_agent) sit directly under configuration as flat siblings. This shape matches how the Recommendations API resolves the tool description path. The default structure that the AgentCore CLI generates with --with-config-bundle (which nests them under toolDescriptions) didn't resolve correctly for tool description Recommendations, so I flattened it and that worked.

Adding the bundle definition and deploying are both done through the AgentCore CLI:

agentcore add config-bundle
agentcore deploy

Wiring the bundle into the agent

To inject bundle values into the Runtime dynamically, we use Strands' hook mechanism. The ConfigBundleHook class overrides the main agent's system prompt at BeforeInvocationEvent and each tool's description at BeforeToolCallEvent.

class ConfigBundleHook(HookProvider):
    def register_hooks(self, registry: HookRegistry, **kwargs: Any) -> None:
        registry.add_callback(BeforeInvocationEvent, self._inject_system_prompt)
        registry.add_callback(BeforeToolCallEvent, self._override_tool_description)

    def _inject_system_prompt(self, event: BeforeInvocationEvent) -> None:
        config = BedrockAgentCoreContext.get_config_bundle()
        event.agent.system_prompt = config.get("systemPrompt", DEFAULT_SYSTEM_PROMPT)

    def _override_tool_description(self, event: BeforeToolCallEvent) -> None:
        config = BedrockAgentCoreContext.get_config_bundle()
        override = config.get(event.tool_use["name"])
        if override and event.selected_tool:
            spec = event.selected_tool.tool_spec
            if spec and "description" in spec:
                spec["description"] = override

This Hook class is based on the template the AgentCore CLI generates with --with-config-bundle. Because I flattened the bundle structure, the tool description lookup (config.get(event.tool_use["name"])) is simpler than the generated default.

Recommendations and A/B test run

For the experiment I generated trace logs from 8 English queries × 5 rounds = 40 sessions, then ran both system-prompt and tool-description Recommendations against the agent.

agentcore run recommendation --type system-prompt
agentcore run recommendation --type tool-description

Recommendations on the system prompt

The original system prompt and the Recommendations output are both visible in the AWS Console. The improved prompt now factors in tool calling — phrases like "call both tools in parallel" and "use news_agent to find related news" appear in the suggestion.

Recommendations on the tool descriptions

The before/after for tool descriptions is visible in the same way. The descriptions are filled out more thoroughly, and they explicitly call out the possibility of parallel use with the other sub-agent — phrases like "Often used alongside news_agent" and "Often used alongside weather_agent".

A/B test for effect validation

To verify that the Recommendations output actually moves the needle, I ran an A/B test as well.

  • Control variant (C): bundle version with the human-authored prompt and tool descriptions
  • Treatment variant (T1): bundle version with the Recommendations output applied
  • Traffic split: 50/50 (sticky session-to-variant assignment by session ID)
  • Online Evaluator: Builtin.GoalSuccessRate
  • Traffic volume: 8 queries × 5 rounds = 40 sessions

To run the A/B test you need an HTTP Gateway and an Online evaluation config. The HTTP Gateway has to be added by hand to httpGateways in agentcore.json (no add subcommand seems to exist for it at the moment). The Online evaluation config is added with agentcore add online-eval.

"httpGateways": [
  {
    "name": "agentsAsToolsLabGateway",
    "runtimeRef": "agentsAsToolsLab"
  }
]
agentcore add online-eval

Then add the A/B test itself and register everything in one go with deploy.

agentcore add ab-test
agentcore deploy

Traffic generation is done by POSTing to the AgentCore Gateway URL with SigV4 auth. agentcore invoke hits the Runtime directly, so for the A/B test we have to go through the Gateway URL. Here's the script I used:

import json
import urllib.request
import uuid

from boto3 import Session  # Session().get_credentials() below matches boto3's API
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

GATEWAY_URL = "https://agentsastoolslabgateway-XXXXX.gateway.bedrock-agentcore.us-west-2.amazonaws.com/agentsAsToolsLab/invocations"
credentials = Session().get_credentials()

def invoke_one(query: str):
    # Fresh session ID per call, so sticky session-to-variant assignment spreads sessions across both variants
    sid = str(uuid.uuid4())
    payload = json.dumps({"prompt": query}).encode()
    req = AWSRequest(method="POST", url=GATEWAY_URL, data=payload, headers={
        "Content-Type": "application/json",
        "X-Amzn-Bedrock-AgentCore-Runtime-Session-Id": sid,
    })
    SigV4Auth(credentials, "bedrock-agentcore", "us-west-2").add_auth(req)
    http_req = urllib.request.Request(GATEWAY_URL, data=payload, headers=dict(req.headers), method="POST")
    with urllib.request.urlopen(http_req, timeout=180) as resp:
        return sid, resp.status

The A/B test results are visible in the AWS Console under "Bedrock AgentCore > Optimizations > A/B Tests".

Here are the numbers:

| Metric | Value | Meaning |
| --- | --- | --- |
| Sessions routed to control | 21 | Number of sessions routed to the control variant |
| Sessions routed to variant | 19 | Number of sessions routed to the treatment variant |
| Control average (Goal Success Rate) | 0.48 | Mean Goal Success Rate of the control variant |
| Variant average | 0.53 | Mean Goal Success Rate of the treatment variant |
| Variant improvement | Not significant: +10.5% (p=0.95) | Treatment shows a +10.5% improvement over control, but not statistically significant (p>0.05) |

Directionally, the treatment is ahead by +5pt absolute (= +10.5% relative). So the Recommendations output is moving things in the right direction, but with only 40 sessions there isn't enough data to claim statistical significance. Since the original goal — confirming Recommendations actually works end to end — is met, and going further would start to hurt my wallet, I'm cutting the experiment off here.
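To put a rough number on "not enough data": treating Goal Success Rate as a proportion (an approximation, since the evaluator may emit non-binary scores), a standard power calculation with statsmodels suggests this gap needs far more traffic to resolve:

from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.53, 0.48)   # observed treatment vs control gap
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0
)
print(round(n_per_arm))  # on the order of ~800 sessions per variant, not ~20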

Where to draw the line with Recommendations

This is just from this experiment, but if I sort the improvement patterns Recommendations produced, I think the natural division of labor between Recommendations and the developer looks something like this:

| Owner | Domain |
| --- | --- |
| Recommendations | Mention of parallel calls, naming of related elements, multilingual support callouts, response format directives, safety mechanisms, proactive behavior |
| Developer | Domain context, business logic, data interpretation policy |

So when you put Recommendations into your operational loop, the parts you (the human) still need to write are:

  • Domain-specific context (specific customer business processes, external API specs, etc.)
  • Business logic (output constraints, compliance, billing rules, etc.)
  • Data interpretation policy (e.g. "when this field is empty, treat it as X")

For everything else — the "general patterns of good prompt writing" — it might be reasonable to let Recommendations handle it. That's the takeaway for me from this experiment.

Wrap-up

So that was a hands-on look at AgentCore Optimization on an Agents-as-Tools setup. The takeaways:

  • Recommendations extracts general patterns like parallel invocation, tangential topic handling, response format, and safety mechanisms
  • A boundary becomes visible between what humans should write (domain context, business logic) and what we can hand off to Recommendations
  • The A/B testing capability and its outputs are confirmed working, but at this experiment's scale the sample size isn't enough for significance

That's it. I hope this is useful for anyone planning to try Optimization themselves.

Bonus: Japanese system prompts getting misflagged as prompt injection?

When I ran the system prompt Recommendation with a Japanese prompt like --inline "あなたは天気とニュースに答えるアシスタント。" (roughly, "You are an assistant that answers about weather and news."), I got this error:

[ValidationException] The provided content was detected as unsafe by 
prompt attack protection. Please review your system prompt and try again.

After narrowing it down:

  • Fails regardless of Evaluator (Builtin.GoalSuccessRate / Builtin.Helpfulness)
  • Fails whether via bundle or inline mode
  • Fails even when I rewrite the Japanese prompt in different ways
  • Works as soon as I switch to English

So the only difference that flips the outcome is the language of the prompt. Tool description Recommendations work fine in Japanese, by the way.

For that reason, all the experiments in this article ended up being run with English prompts.

References

https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/optimization.html
https://github.com/aws/agentcore-cli
https://aws.amazon.com/about-aws/whats-new/2026/05/bedrock-agentcore-optimization-preview/

I Tried to Compress an LLM by 545x. Here's What Happened
🧠HasanH47·May 4, 2026·6 min read·Global

I Tried to Compress an LLM by 545x. Here's What Happened

#machinelearning#ai#deeplearning#opensource

A solo dev's journey questioning a 40-year-old assumption in deep learning


The Question That Started It All

I was frustrated.

VS Code was getting heavier on my laptop. Cursor wanted $20/month. The best AI agents were owned by 5 mega-corporations. As a developer in Indonesia, I sometimes felt we were perpetual consumers, never creators.

So I asked Claude: "Can AI be smaller?"

That conversation led somewhere unexpected. We started questioning the most fundamental assumption in deep learning since 1986:

Do weights have to be stored as matrices of numbers?

Think about it. A human brain doesn't store information as numbers. A seed doesn't contain all the leaves of a tree inside it — a seed contains instructions to grow leaves.

What if AI weights could be grown from a small seed when needed, instead of stored as massive matrices? A 30B model could fit on a smartphone. No cloud needed. No subscription. No billion-dollar hardware.

I named the project WIJI — "seed" in Javanese. The Javanese script: ꦮꦶꦗꦶ

Slogan: "Memaksimalkan yang minimal" — maximize the minimal.


The Setup

I'm a solo developer. No PhD. No GPU cluster. Just a laptop, curiosity, and AI as a research collaborator.

The plan:

  1. Take TinyLlama 1.1B (small enough to run on CPU)
  2. Try to reconstruct its weights using a tiny generator network
  3. Replace original weights with generated ones
  4. See if the model still works

If single layer works, scale up. If not, learn why and pivot.


Experiment 1: Single Matrix

I started with one weight matrix: o_proj of layer 0. It has 4.2 million parameters.

I built a coordinate-based MLP generator: input is (row, col) coordinates, output is the weight value at that position. The generator has only 164K parameters — 25x smaller than the target.

Training was simple: sample random coordinates, predict their values, minimize MSE.
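The repo has the real code; for orientation, here is a minimal PyTorch sketch of the same idea, with sizes chosen for illustration rather than matching the 164K-parameter generator exactly:

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightGenerator(nn.Module):
    """Coordinate MLP: (row, col) in, predicted weight value out."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):              # coords: (batch, 2) in [-1, 1]
        return self.net(coords).squeeze(-1)

target = torch.randn(2048, 2048)             # stand-in for the real o_proj matrix
gen = WeightGenerator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

for step in range(5000):
    idx = torch.randint(0, 2048, (4096, 2))  # sample random (row, col) positions
    coords = idx.float() / 2047 * 2 - 1      # normalise coordinates to [-1, 1]
    loss = F.mse_loss(gen(coords), target[idx[:, 0], idx[:, 1]])
    opt.zero_grad()
    loss.backward()
    opt.step()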

After 5000 steps, MSE settled at 0.000067. I reconstructed the full matrix and replaced the original in the model.

Test prompt: "What is the capital of Indonesia?"

Original output: "Indonesia's capital is Jakarta."

Reconstructed output: "The capital of Indonesia is Jakarta."

It worked. Different words, same meaning. The model still functioned with weights compressed 25x.

I was elated. Phase 0 looked promising.


Experiment 2: Going Big

If 1 layer works, why not all 22 layers?

I added a layer embedding to the generator so it could handle multiple layers. Same 164K params, but now needed to represent 22 different weight distributions — 92M parameters total.

That's 545x compression.

I trained for 3000 steps. MSE settled at 0.000234 — only 3x higher than experiment 1. Should be fine, right?

Output: "Ingatescripturecordialoisimoisequalifiesearchivedeastern Discogs, and"

Complete gibberish.

This was my first lesson: MSE Loss is not a reliable predictor for LLM output quality. The loss only got 3x worse, but the output collapsed entirely.


Experiment A: Diagnostic

Before scaling more, I needed to understand: was the failure because of multi-component (Q/K/V/O matrices in one layer) or multi-layer (across 22 layers)?

So I tested: single generator handling all 4 attention components in just layer 0. 4 matrices, but still 1 layer.

MSE settled at 0.000400 — 6x higher than experiment 1.

Output: "The capital of Indonesia is Jakarta."

Still functional. So the issue wasn't multi-component. The issue was multi-layer.

Now I had a hypothesis: error compounds across layers. Each layer's small error becomes the next layer's wrong input, which produces bigger errors, until the model collapses.


Experiment B2: Microservices for Layers

If one generator can't handle 22 layers, what if I built 22 separate generators? One per layer, each specializing.

22 generators × 164K params = 3.6M total. Compression: 25x. Same as experiment 1.

I trained each generator for 1000 steps. The training logs revealed something important:

| Layer | MSE Loss |
| --- | --- |
| 0 | 0.000063 |
| 5 | 0.000203 |
| 10 | 0.000225 |
| 15 | 0.000244 |
| 21 | 0.000373 |

Error increases monotonically from early layers to late layers. Same generator capacity, same training budget, but layer 21 was 6x harder to fit than layer 0.

This makes sense: late layers in transformers capture complex semantic patterns. Early layers capture simple syntax. A small generator can fit the latter but struggles with the former.

Output: "WHEREASPark. ."

Failure. But informative failure.


Experiment B3: Adaptive Capacity

If late layers need more capacity, give them more capacity:

  • Layers 0-7: 128 hidden dim (small)
  • Layers 8-15: 256 hidden dim (medium)
  • Layers 16-21: 512 hidden dim (large)

Total params: 5.9M. Compression: 15x (smaller compression but should help).

I also tripled training steps to 3000 per generator.

Result for layer 21: MSE = 0.000366 (vs B2's 0.000373).

Almost identical. 10x more capacity, 3x more training, virtually no improvement.

Output: (degenerate loop)

This was the most important finding of the entire project:

There's a fundamental limit. MSE plateau at ~0.0003-0.0004 is independent of capacity and training time.

This phenomenon has a name in research: spectral bias. Neural networks with ReLU/GELU activations have an inductive bias toward smooth functions. Transformer weights look like noise — high-frequency random distributions.

Throwing capacity at the problem doesn't help because the architecture itself is wrong for this task.


Experiment B4: The Cliff Edge

I had 22 trained generators. Before giving up, I wanted to know: where exactly does the model fail?

I ran a progressive swap test. Replace layers 0 to N-1 with generated weights. Test inference. Increment N. See what happens.

| N (layers replaced) | Output | Status |
| --- | --- | --- |
| 1 | "The capital of Indonesia is Jakarta." | ✅ Perfect |
| 3 | "The capital of Ia is 10." | ⚠️ Partial collapse |
| 5 | "" (empty) | ❌ Collapse |
| 8 | "" (empty) | ❌ Collapse |
| 12 | "" (empty) | ❌ Collapse |
| 16 | "" (empty) | ❌ Collapse |
| 22 | "Ingunsuretournalty. WHERE2..." | ❌ Gibberish |

Cliff edge between N=1 and N=3. Sharp, not gradual. Phase transition.

But the most counterintuitive finding: N=22 produces output, while N=5-16 produce empty strings.

When you replace some layers but keep others, the corrupt layers produce outputs that are "out of distribution" for the original layers. The mismatch causes probability collapse — the model produces nothing.

When you replace ALL layers, the corruption is internally consistent. The model still produces gibberish, but it produces something.

The lesson: internal consistency matters more than absolute correctness.


What I Actually Learned

After 5 experiments and many hours of failure, here's what I have:

Validated empirically:

  1. ✅ Weight matrices have significant redundancy (compress 25-56x for single layer)
  2. ✅ MSE Loss is a misleading metric for LLM compression quality
  3. ✅ Cliff edge phenomenon exists at N=2 layers
  4. ✅ Capacity scaling doesn't solve spectral bias
  5. ✅ Internal consistency > absolute correctness in deep networks

Open questions for next phase:

  1. Can Fourier features (NeRF-style positional encoding) overcome spectral bias?
  2. Are FFN layers easier to reconstruct than attention layers?
  3. Can output-aware loss (KL divergence) replace MSE?
  4. Does cliff edge shift with bigger models?

Honest probability assessment:

  • 40% that the next phase finds something useful
  • 25% that we get a working prototype
  • 5-10% that this leads to a genuine breakthrough

But 10% × "fundamentally change AI deployment" = high expected value for a solo dev with AI as collaborator.


Why I'm Sharing This

I could have buried these failures and only shared the success. That's the temptation.

Instead, I'm publishing everything: code, failures, insights, and methodology. Why?

Negative results are valuable. Someone else attempting this will save weeks knowing where the cliff edge is.

Open source is legacy. Even if I stop maintaining this, the experiments stay accessible forever.

Solo dev + AI is a new research methodology. I want to demonstrate what's possible. Other developers in Indonesia, in developing countries, in their bedrooms — they can ask hard questions and explore them. They don't need to wait for FAANG employment to contribute to AI research.

Centralization is the enemy. AI is concentrating into the hands of 5 corporations. If we accept that, our future is dystopian. WIJI is a contrarian bet — that intelligence can be made minimal, affordable, and owned by everyone.


What's Next

Phase 1 plans:

  1. Fourier Features experiment — likely solves spectral bias based on NeRF research
  2. FFN layer test — different weight distribution, possibly easier to compress
  3. Streaming inference system — pragmatic system that accepts N=1 limit but works
  4. Rust port — for proper performance benchmarking

I'll publish results as I go. Failures and successes both.

If you're interested in this kind of research, the repo is fully open:

🔗 github.com/sangkan-dev/wiji-experimental

Critique welcome. Collaboration welcome. Even philosophical disagreement welcome.

"Mari kita lebih menggila di dunia yang udah gila ini."

Let's get crazier in this already-crazy world.


About the author: HasanH47, a DevOps Engineer based in Yogyakarta, Indonesia. Building products at the intersection of local context and frontier technology. Project under the Sangkan organization.


If this article gave you something to think about, consider following for updates on Phase 1 results. And if you're working on anything related, please reach out.

Outpaint — extend any image beyond its borders
📱Om Prakash·May 4, 2026·7 min read·Global

Outpaint — extend any image beyond its borders

#api#imageprocessing#devtools#generativeai

Outpaint — extend any image beyond its borders

Every team that ships images at scale eventually hits the same wall: the photo you have is the wrong shape for the slot you need. Our new outpaint endpoint extends the canvas in any direction without cropping the subject, without watermarks, and without making you babysit a job queue.

What it does

POST /v1/image/outpaint takes a public image URL and grows the canvas outward. You pick a direction — all, left, right, top, or bottom — and how many pixels to add per extended side, anywhere from 64 to 512. The original pixels are preserved exactly. Only the new region is generated, and it's generated to be coherent with the lighting, perspective, and structure of what was already there.

The whole thing runs synchronously. You make the HTTP call, you wait, you get a finished image back in the response. There's no callback URL to register, no polling loop to write, no job ID to chase across two services. If your code can call a normal REST endpoint, it can use Outpaint — no special client, no SDK lock-in.

A quick rundown of the request fields:

  • image_url — public URL of the source image. Required.
  • direction — one of all, left, right, top, bottom. Defaults to all, which grows the canvas evenly on every side.
  • extend_pixels — integer between 64 and 512, applied per extended side. Defaults to 256.
  • prompt — optional text guidance for what should appear in the new region. Capped at 500 characters. Leave it blank and the model just continues whatever it sees at the edges; pass a short hint like "soft studio backdrop, warm light" and it'll bias the new pixels toward that.

There's no watermark on the output. The image you get back is yours, ready to drop into a banner, a thumbnail, or a product card.

Why we built it

Image extension sits in an annoying gap. On one side you have classic canvas-and-fill tooling — slap a coloured background behind the subject, hope the seam doesn't show, accept that anything more elaborate needs a designer. On the other side you have heavyweight creative suites that can do beautiful outpainting but want you to upload, click through a wizard, and pay per export. Neither of those options fits a backend pipeline that needs to process thousands of user uploads a day.

We kept seeing the same shape of pain in support threads:

  • A marketplace had product photos shot in portrait. Their hero slot was landscape. Cropping the product was not an option.
  • A furniture seller had clean cutouts on white but every channel they listed on wanted the item in a believable room.
  • A creator tools startup let users upload square photos for a thumbnail builder. YouTube wanted 16:9. Letterboxing looked amateur.

In all three cases the answer was the same: extend the canvas, generate the new region, keep the original untouched. So we built a single endpoint that does exactly that.

The angle we took matters. Outpaint is real edge-aware extension, not canvas-and-fill. The original pixels are preserved exactly — pixel-for-pixel, no resampling, no quality drop on the part of the image you cared about. Only the new region is generated, and it's generated with awareness of what's at the boundary so the seam disappears. Lighting direction carries over. Textures continue. Architectural lines stay straight. A wood grain that runs across the bottom edge keeps running.

We also made a deliberate call on synchronous response. A lot of image generation APIs hand you back a job ID and tell you to either poll or set up a webhook. That's fine for batch workloads, but it's a tax on every interactive use case. If a user uploads an avatar and you need to re-frame it before showing it back to them, you don't want to wire up a queue worker. Outpaint returns the finished image in the same HTTP response. Your code stays linear. Your latency budget stays predictable.

The pipeline is self-hosted on our own infrastructure. That means we control the cost curve and we don't pass per-call surcharges from a third party on to you. It's also why pricing is flat — you're not paying premium rates because some upstream provider is having a busy hour.

Quickstart

The fastest way to see it work is to point it at any public image URL. Here's the exact curl:

curl -X POST https://api.pixelapi.dev/v1/image/outpaint \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"image_url": "https://example.com/source.jpg", "direction": "all", "extend_pixels": 256}'

Replace YOUR_API_KEY with the key from your dashboard and https://example.com/source.jpg with any publicly reachable image. The response comes back in the same connection — no second round-trip needed.

Same call from Python using requests:

import requests

resp = requests.post(
    "https://api.pixelapi.dev/v1/image/outpaint",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "image_url": "https://example.com/source.jpg",
        "direction": "all",
        "extend_pixels": 256,
    },
    timeout=60,
)

resp.raise_for_status()
result = resp.json()
print(result)

A couple of practical notes from working with the endpoint:

  • Start with direction: "all" and extend_pixels: 256 to get a feel for it. That's a clean, square growth on every side and it's enough to turn a tight crop into a comfortable composition.
  • If you only need to extend in one direction — say, turning a portrait into a landscape — use direction: "left" plus a separate right call, or just call all with a smaller extend_pixels. Picking a single side is faster than growing the whole canvas.
  • The prompt field is genuinely optional. If you don't pass it, the new region is inferred from the boundary alone, which is usually what you want for product shots and photos where the existing scene should just continue. Use the prompt when you want to nudge the model toward something specific — a sky tone, a backdrop colour, a setting.
  • extend_pixels is per side, not total. Setting it to 512 with direction: "all" adds 512 pixels on the top, bottom, left, and right — so a 1024×1024 input becomes a 2048×2048 output. Plan your dimensions before you call; the sketch below does the arithmetic.
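
Since each call changes the canvas size, it's worth computing the output dimensions up front. A minimal sketch in plain Python (the helper is ours, not part of the API):

def outpaint_size(width, height, direction="all", extend_pixels=256):
    """Predict output dimensions for a single outpaint call.

    extend_pixels is applied per extended side, matching the API's
    behaviour described above.
    """
    if not 64 <= extend_pixels <= 512:
        raise ValueError("extend_pixels must be between 64 and 512")
    if direction == "all":
        return width + 2 * extend_pixels, height + 2 * extend_pixels
    if direction in ("left", "right"):
        return width + extend_pixels, height
    if direction in ("top", "bottom"):
        return width, height + extend_pixels
    raise ValueError(f"unknown direction: {direction}")

# A 1024x1024 input with direction="all" and extend_pixels=512
# comes back as 2048x2048, as noted above.
print(outpaint_size(1024, 1024, "all", 512))  # (2048, 2048)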

Use cases

Convert portrait product shots into landscape banners without cropping the subject. This is the canonical outpainting problem. You shot the product vertically because that's what the photographer's setup gave you, but the homepage hero is 16:9 and the product needs to live inside it. Cropping is off the table — the whole point is for the customer to see the product. With Outpaint you call the endpoint with direction: "left" and again with direction: "right", or one call with direction: "all" if the subject is centred, and the model fills the sides with content that matches the existing background. The product stays exactly as the photographer captured it; the world around it just gets a little bigger. Same workflow drops straight into a Shopify or marketplace pipeline that needs landscape and square variants from the same source.

Add room-context around a furniture cutout for marketplace listings. Cutouts on white look clean but they don't sell. Buyers want to see the chair in a room, the lamp on a side table, the rug under a coffee table. Historically that's a styled photoshoot — expensive, slow, and impossible to redo every time you launch a new colourway. With Outpaint you start from your clean cutout, call the endpoint with a prompt like "warm living room interior, soft daylight", and the model grows a believable room around the piece. The furniture itself is untouched — every detail your photographer captured is still there, pixel for pixel — but now it sits inside a context that helps a buyer visualise it. Run it across a catalogue and you get a roomset library without booking a studio.

Re-frame a square photo into 16:9 for YouTube thumbnails. Creator tools and thumbnail builders constantly deal with users uploading square photos because that's what their phone or their Instagram exported. YouTube wants 16:9 at 1280×720. Letterboxing looks like an upload mistake. Cropping cuts the subject's head off. Outpaint with direction: "left" and direction: "right" extends the photo sideways into a true 16:9 frame with the subject still centred and the background continuing naturally. Wire it into the upload flow and the user never has to know there was a format mismatch — they just get a thumbnail that fills the player.
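
If you're scripting that square-to-16:9 case, the per-side extension is a one-line calculation. A rough helper (again ours, not part of the API) that respects the 64-512 per-side limit:

def sides_for_aspect(width, height, target_w=16, target_h=9):
    """Pixels to add on each of the left and right sides so that a
    width x height image reaches the target aspect ratio, height
    unchanged. Clamped to the API's 64-512 per-side range, so very
    wide targets may need a second pass."""
    needed_width = height * target_w / target_h
    per_side = (needed_width - width) / 2
    if per_side <= 0:
        return 0  # already at least as wide as the target
    return max(64, min(512, round(per_side)))

# A 720x720 square needs 280 px per side to become 1280x720 (16:9).
print(sides_for_aspect(720, 720))  # 280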

Pricing

Outpaint costs 19 credits per call. In rupees that's ₹0.013 per call, and in dollars it's $0.00015 per call.

A few things worth being upfront about:

  • Pricing is per call, not per output dimension. Whether you extend by 64 pixels or 512, on one side or all four, the cost is the same.
  • There's no separate fee for using the optional prompt field. Guided and unguided generations are billed identically.
  • There's no watermark to remove for an extra fee, because there's no watermark in the first place.
  • Credits are shared across the rest of the PixelAPI catalogue, so the same balance you use for outpainting also covers the other endpoints in your account.

For a sense of scale: a thousand outpaint calls is about ₹13 or $0.15. A hundred thousand calls is about ₹1,300 or $15. That makes it cheap enough to run on every user upload in a consumer product, and cheap enough to backfill a catalogue of tens of thousands of product images without a finance conversation.

Try it

Spin up an API key on the dashboard and start calling the endpoint — the first request takes about as long as reading this paragraph. Full reference, parameter details, and response schema live in the docs. If you build something with it, we'd love to see it.

Writing the Loss Function
🔍Rasmus Ros·May 3, 2026·7 min read·Global

Writing the Loss Function

#ai#machinelearning#discuss#algorithms

I keep seeing the same argument about AI making us dumber. It's the same argument people had about search engines, and before that books. The usual response is to point at history and say "every generation panics, every generation was wrong, relax." I think that response is half right, and the wrong half is what bothers me.

Tools change what we bother to remember. The people who'd trained their whole lives to memorize 10,000-line oral epics watched the craft die when writing showed up. Long arithmetic in your head used to be normal; calculators arrived and the payoff for keeping that skill sharp went away. Brains didn't shrink. The skills just stopped being worth practicing.

Search engines are the one I lived through. I was a kid when Google replaced AltaVista and went from "useful" to being a synonym for finding things. I still remember being amazed that I could search for a zebra and have a picture of one on my screen in only five minutes. Years later I ended up working on search engines as a dev myself in ecommerce, and I've even built one from scratch for Theca.

AltaVista interface


Only 90s kids will understand that this makes you dumber. (It was genuinely bad.)

I don't memorize phone numbers anymore. I don't memorize directions. I don't even memorize the APIs of libraries I use every week. What I do instead is keep a fairly precise mental index of where things live and what query will retrieve them. That's a real cognitive trade. I gave up some recall and got back a much larger working set of pointers. Net positive, I think, but I notice the trade in a way I didn't when I was nine.

We usually keep teaching

AI tools push the same trade further. They don't just outsource recall, they outsource synthesis: the part where you actually work through a problem and end up with a model of it in your head. I notice this when I let an LLM write code I could have written myself. I get the output, but I didn't build the model, which is usually the part I wanted. The people who worry about atrophy here aren't wrong, and it's worth its own post.

Small brain

One thing the prior cases got right is that society kept teaching the underlying skill anyway. Calculators didn't kill arithmetic class. Search engines didn't kill the library-science basics on how an index actually works. Some skills got canonized as core, worth practicing even after the tool that automated them arrived, because we collectively decided they mattered. Coding hadn't quite reached that status yet, but I think it would have, given another decade. AI may have shown up too early for that to happen.

So the historical pattern mostly holds: tools rewire priorities, some skills fade, others grow, the panic looks silly in retrospect. Where the "relax, every generation panics" crowd gets it wrong is in assuming AI is just the next entry in that list. It might be. But the environment AI is landing in is not the environment the printing press or the early search engine landed in.

The loop is the problem

Books don't optimize you. Calculators don't optimize you. Search engines, at the lookup layer at least, were mostly trying to give you the page you asked for and then get out of the way. Modern search has piled on ads and ranking incentives since, but the core "find it and leave" loop is still recognizable. The dominant information channel today is none of those things. It's a feed, and the feed is an optimizer. The target variable is engagement.

Earlier tools removed friction from a specific task and let you spend the saved effort somewhere else. A feed isn't trying to remove friction from anything you'd recognize as a task. It's trying to keep you in the loop. The reward signal it's chasing (what makes you click, stay, scroll, react) is not the same signal as "this was useful to me." It's often the opposite.

There's data on this now. Heavy social media use predicts elevated depression and anxiety in kids and young adults. Longitudinal studies find the social media use comes first, not the depression.

And then you wire a generative model into the same loop. Generative AI doesn't change the objective, it just gives the loop a faster, cheaper supply tuned to whatever it already rewards.

Diagram of engagement loop with AI


Left: today's engagement loop, ranking from a human-made pool. Right: the same loop with a generative model in place of the pool.

Adding AI to the stack

My background is in optimization. The recurring question I work on is what a product should actually be optimizing for (PhD on automating A/B testing; Eignex, the side project still chasing it). So when I look at "LLMs plus a recommendation feed" it looks to me like the same loop with a much better content supply. Not really a new content medium.

The version running today doesn't even use generation in the loop. The recommender stacks at the big platforms (Meta, TikTok, YouTube) are still doing what they've done for a decade: ranking content other people uploaded. The supply pool was already effectively infinite after years of user-generated content. The change is that a growing share of what gets uploaded is now AI-made, and the existing optimizer ranks the synthetic stuff exactly like everything else.

The scarier version puts the generator inside the loop: per-user posts written for you on demand. That sounds like fiction, and we don't have it. The thing is, we don't need it. The pool of generated content is already absurd enough that something in it fits your viewing history, your current mood, and what you had for breakfast. The optimizer just has to find it. A pool that grows by millions of items a day, at near-zero cost per item, behaves a lot like an on-demand generator.

Diagram of AI filling in the blanks of content topics


Each dot is a post in embedding space. Human posts (blue) cluster on popular topics; AI posts (red) fill the gaps.

None of this is hypothetical. AI-generated music has already racked up millions of streams on Spotify before anyone noticed it wasn't human (the Velvet Sundown story last summer was the most visible example). Facebook is saturated with generative slop: fabricated heart-warming stories, sculptures supposedly carved by a 92-year-old grandpa nobody appreciates, content farms running cheap image generators to chase engagement, and the people reliably engaging with it skew much older. The TikTok-side version of the same dynamic is "Italian brainrot", absurd AI-generated creatures with names like Tralalero Tralala and Bombardiro Crocodilo, captioned with nonsense-Italian audio dubs, pulling hundreds of millions of views from a much younger audience.

Facebook's own VP described the dynamic in plain terms to Futurism earlier this year: "if you, as a user, are interested in a piece of content which happens to be AI-generated, the recommendations algorithm will determine that, over time, you are interested in this topic." None of this uses particularly sophisticated tech, and it's already running at scale.

This loop doesn't get out of the way like search did. It takes friction out of producing whatever the optimizer rewards. Right now that's engagement, so the system gets better at engagement. Nothing malicious has to happen for that to land badly; it's doing exactly what it was asked.

The objective is a choice

I'm not fully pessimistic about this, though.

The objective is a choice. Engagement isn't a law of physics. Somebody picked clicks or watch time because it was easy to measure and correlated with revenue. People also reach for banning AI-generated content here. That isn't it either: "the machine wrote it" isn't a stable category once the machines are this good. The thing to push on is the loss function itself (what the system is told to optimize for), and the loss function is written by people.

Moses meme holding stone table with the -clicks loss function


The original loss function.

The irony's not lost on me that if you're reading this, it probably reached you through one of these feeds. As engineers we like to act like the loss function is handed down on stone tablets.

It isn't. Somebody wrote it, and on the products I work on that somebody is me.

There is research on what "different" could look like: ranking for informational diversity, or ranking on whether users still endorse a piece of content a week later instead of whether they reacted in the first three seconds. None of it is mature, none of it has a business model behind it the way engagement does, and that's the real obstacle, not the technical side. The systems are perfectly capable of optimizing for something else. The question is whether anyone with the keys wants to. I'd rather sort it out before the next, much more capable generator gets wired into the same loop.
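
To make the contrast concrete, here's a toy sketch of two objectives. Every signal name here is invented for illustration; no real ranking stack is this simple. The first pays out on the instant reaction, the second only if the user still endorses the item a week later:

def engagement_score(item):
    # What the current loop optimizes: instant reactions.
    return item["clicks"] + 0.5 * item["watch_seconds"]

def delayed_endorsement_score(item):
    # One alternative from the research: does the user still
    # endorse this a week after seeing it?
    return item["endorsed_after_7_days"]

post = {"clicks": 40, "watch_seconds": 12, "endorsed_after_7_days": 0}
print(engagement_score(post), delayed_endorsement_score(post))  # 46.0 0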


No zebras were harmed in the making of this post.

Born ML v0.8.0: We Killed Our Last .dll — Pure Go GPU Is Here
💡Andrey Kolkov·May 3, 2026·9 min read·Global

Born ML v0.8.0: We Killed Our Last .dll — Pure Go GPU Is Here

#go#ai#opensource#machinelearning

TL;DR: Born v0.8.0 replaces go-webgpu (Rust FFI + shared libraries) with gogpu/wgpu — pure Go WebGPU. No .dll. No .so. No runtime downloads. go build now gives you a GPU-accelerated ML binary. We also fixed 5 critical GPU bugs and validated on real model training. Next up: DeepSeek V4 inference support.


The Last Dependency

Five months ago I skipped my birthday to release Born. A few weeks later we made training 18x faster with lazy GPU evaluation. The framework was growing. Contributors were showing up. Real people were using it.

But there was a problem I couldn't ignore anymore.

Every time someone wanted to use GPU acceleration, the conversation went like this:

"How do I run the GPU examples?"

"Download wgpu-native .dll for your platform, put it in your PATH..."

"...I thought you said pure Go?"

They were right. Born's CPU path was pure Go. But the GPU backend used go-webgpu — Go bindings to Rust's wgpu-native via FFI. You needed a platform-specific shared library at runtime. On Windows, a .dll. On Linux, a .so. On macOS, a .dylib.

For a framework whose tagline is "single binary deployment", that was embarrassing.

So we fixed it.


Why Not Earlier?

Fair question. gogpu/wgpu existed for months before v0.8.0. Why did we ship 29 releases on go-webgpu first?

Because that was the plan.

go-webgpu wraps Rust's wgpu-native — a battle-tested GPU abstraction used by Firefox and dozens of production projects. When you're building a new ML framework from scratch, you don't want to debug your GPU backend and your tensor math at the same time. If training produces wrong gradients, is the bug in your autodiff engine or in your WebGPU implementation? With Rust wgpu-native underneath, we knew: the GPU layer works. Any bug is ours.

So we built Born v0.1 through v0.7 on a proven foundation. Tensor ops, autodiff, attention, Flash Attention, speculative decoding, ONNX import, GGUF loading — all validated against a GPU backend we could trust. By v0.7.16, Born had 1,394 tests, 3 external contributors, and real model training working.

Meanwhile, gogpu/wgpu was maturing through its own path — powering gogpu/gg (2D graphics library with GPU compute shaders), running real rendering workloads, stabilizing the Core API across Vulkan, Metal, DX12, and GLES.

When both sides were proven, the migration became simple: we knew Born's code was correct, and we knew gogpu/wgpu's Core API was stable. Any bug found during migration was specifically a wgpu Go integration issue — easy to isolate, easy to fix.

That's exactly what happened. Five bugs, all in resource lifecycle. All fixed in days, not weeks.

Validate on a proven foundation first. Swap the foundation second. This is not how you move fast. This is how you move right.


The Migration

Born v0.8.0 replaces go-webgpu with gogpu/wgpu — a pure Go WebGPU implementation from our own GoGPU ecosystem.

- github.com/go-webgpu/webgpu v0.4.1
+ github.com/gogpu/wgpu v0.26.8

One line in go.mod. 27 files changed. 1,830 additions, 1,518 deletions.

What changed:

                    go-webgpu (before)         gogpu/wgpu (after)
Implementation      Rust wgpu-native via FFI   Pure Go
CGO                 None (goffi)               None
Runtime .dll/.so    Required                   None
Build               go build + download .dll   go build. Period.
Vulkan/Metal/DX12   Via Rust                   Via Go
WGSL shaders        Unchanged                  Unchanged
Control             External project           Our project

That last row matters. gogpu/wgpu isn't some random dependency — it's our project. When Born needs a WebGPU API change, we change it upstream. Both sides of the interface are under our control.


Five Bugs Nobody Told Us About

Swapping the GPU backend is like replacing a car engine while driving. Everything looks the same from the outside, but internally the timing, resource lifecycle, and synchronization are completely different.

We found five critical bugs during migration:

1. PipelineLayout Freed Too Early

Vulkan requires compute pipeline layouts to stay alive during SetBindGroup(). go-webgpu's internal reference counting kept them alive. gogpu/wgpu doesn't — you own your resources.

We fixed this by storing PipelineLayout alongside the pipeline in our cache.
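
The shape of the fix, sketched in Go with hypothetical type names (Born's real internals differ):

package gpu

// Stand-ins for wgpu handles; Born's real internals differ. The point
// is ownership: the cache entry holds the PipelineLayout for exactly as
// long as it holds the pipeline, so the layout cannot be freed while
// SetBindGroup() might still need it.
type ComputePipeline struct{}
type PipelineLayout struct{}

type pipelineEntry struct {
	pipeline *ComputePipeline
	layout   *PipelineLayout // released together with the pipeline, never before
}

type pipelineCache struct {
	entries map[string]pipelineEntry
}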

2. Lazy Ops and the Destroy Queue

Born uses lazy evaluation — GPU ops chain without CPU sync. But when a tensor gets garbage-collected mid-chain, its buffer goes to the destroy queue. If the pending operations haven't submitted yet, the buffer is destroyed before the GPU reads it.

Fix: immediate submit for lazy ops. Every operation submits its command encoder before returning.

3. Buffer Copy Race

copyGPUBuffer (used by Data() to read results back to CPU) was queuing the copy but not submitting. The next operation might overwrite the source buffer before the copy executed.

Fix: immediate submit after copy.
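
Bugs 2 and 3 share the same shape of fix. A rough Go sketch, with illustrative types rather than gogpu/wgpu's exact API:

package gpu

// Illustrative stand-ins for the real gogpu/wgpu types; the names here
// are ours, not the library's exact API.
type CommandBuffer struct{}
type CommandEncoder struct{}

func (e *CommandEncoder) Finish() CommandBuffer { return CommandBuffer{} }

type Queue struct{}

func (q *Queue) Submit(cb CommandBuffer) {}

type Device struct{}

func (d *Device) CreateCommandEncoder() *CommandEncoder { return &CommandEncoder{} }

// runOp records one operation and submits it before returning. Because
// the work is already in flight, a tensor garbage-collected afterwards
// can no longer have its buffer destroyed ahead of the GPU reading it.
func runOp(dev *Device, q *Queue, record func(*CommandEncoder)) {
	enc := dev.CreateCommandEncoder()
	record(enc)            // encode the compute pass or buffer copy
	q.Submit(enc.Finish()) // immediate submit: nothing waits in a deferred batch
}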

4. GC vs GPU

Go's garbage collector doesn't know about GPU resources. A runtime.SetFinalizer on a tensor could fire while the GPU was still computing with that tensor's buffer.

Fix: runtime.KeepAlive() guards around every GPU operation that uses the tensor.
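
The guard pattern looks roughly like this; runtime.KeepAlive is the real standard-library call, while the tensor type and dispatch placeholder are ours:

package gpu

import "runtime"

// Tensor is a stand-in for Born's tensor type; buffer represents a
// GPU buffer handle that a runtime.SetFinalizer would eventually free.
type Tensor struct {
	buffer uintptr
}

// dispatch is a placeholder for enqueueing GPU work on the buffer.
func dispatch(buf uintptr) {}

func gpuOp(t *Tensor) {
	dispatch(t.buffer)
	// Without this guard, t may become unreachable as soon as t.buffer
	// has been read, and its finalizer could destroy the buffer while
	// the GPU is still computing with it.
	runtime.KeepAlive(t)
}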

5. Device Cleanup Order

When destroying the GPU device, all pending work must complete first. Without Poll(PollWait) before resource destruction, Vulkan validation layers scream.

Fix: explicit Poll(PollWait) in Release() to ensure GPU idle.

None of these bugs existed with go-webgpu. They're all about resource lifecycle differences between Rust's ownership model (where wgpu-native tracks everything for you) and Go's GC-based model (where you track it yourself).

After fixing all five, we ran all GPU tests and a 20-epoch model training with zero crashes.


What You Get

True Single Binary

go build -o myapp ./cmd/myapp
# That's it. Ship the binary. GPU works.

No .dll downloads. No LD_LIBRARY_PATH. No platform-specific install steps. The binary works on any machine with a Vulkan-capable GPU.

Same API, Same Shaders

If you have existing Born code with GPU, nothing changes:

import (
    "github.com/born-ml/born/backend/cpu"
    "github.com/born-ml/born/autodiff"
)

// CPU-only (always worked)
backend := autodiff.New(cpu.New())

import (
    "github.com/born-ml/born/autodiff"
    "github.com/born-ml/born/backend/webgpu"
)

// GPU-accelerated (now pure Go!)
if webgpu.IsAvailable() {
    gpu, _ := webgpu.New()
    backend := autodiff.New(gpu)
    defer gpu.Release()
}

WGSL shaders are unchanged. The Backend interface (52 methods) is unchanged. Your code just works — minus the .dll.

Validated on Real Training

We didn't just run unit tests. We trained a real Hierarchical Reasoning Model (HRM) for 20 epochs on GPU. Zero crashes. Correct gradients. Same accuracy as go-webgpu.


The Numbers

Metric            Value
Go source         ~47K LOC
Tests             ~34K LOC, 1,394 test functions
ONNX operators    49
Backend methods   52
GPU tests         105
Contributors      4 (@kolkov, @gmohmad, @bennibbelink, @jsully1720)
Releases          30
Stars             80 (organic, no marketing)

Community

v0.8.0 isn't just about the migration. Since v0.7.0, three external contributors have landed real code:

  • @jsully1720 — ONNX Equal operator
  • @bennibbelink — Erf, Sign/Abs, Clamp ops (3 PRs, all full vertical slices: backend → CPU → GPU → autodiff → tests)
  • @gmohmad — LayerNorm, BatchMatMul broadcasting, Squeeze fix, 9 new ONNX ops, inplace mutation bug fix (5 PRs)

These aren't drive-by typo fixes. These are production-quality contributions from people who studied the codebase and followed the patterns. If you're considering contributing, look at what they did — that's the bar.


What's Next: DeepSeek V4 Inference

With the GPU backend stable and pure Go, we can focus on what matters: running real models.

DeepSeek released V4 on April 24, 2026 — two models:

  • V4-Pro: 1.6 trillion params, 49B active
  • V4-Flash: 284B total, 13B active — fits on a consumer GPU

V4-Flash with 13B active parameters is Born's sweet spot. It's the most capable open model that fits on a single 24GB GPU. API pricing is tied to chip availability ($1.74/M tokens, bottleneck pricing) — users want local inference alternatives.

We started researching V4 architecture before it launched — back in early April, when only the Engram paper and V3.2 sparse attention existed. We predicted V4 would combine MoE + Engram + manifold-constrained residuals + compressed sparse attention. On April 24th, the tech report confirmed all four. Two weeks head start on architecture analysis. (We do this kind of research openly — see Discussion #60 for our Recurrent-Depth Transformer analysis.)

Here's the full component breakdown:

Component                 What                                    Why It Matters
MoE Routing               Top-16 sparse expert selection          Also unlocks Mixtral, DBRX
MXFP4 Dequantization      FP4 expert weights with block scaling   V4's native format — not INT4 GPTQ
Engram                    O(1) hash-lookup factual memory         Unique to DeepSeek, DRAM-resident
Three-Pool Attention      SWA + C4 + C128 compression             1M context with <10% throughput drop
Hyper-Connections (mHC)   4D manifold-constrained residual        Every transformer layer uses this
MTP Drafting              Integrated speculative decoding         ~2.5 tokens accepted per step
KV Cache Tiering          CPU-GPU cache with LRU eviction         128K+ context on 24GB consumer GPU
PD-Disaggregation         Prefill/Decode split serving            Production throughput scaling

Total estimate: 22-30 weeks. It's a lot. But MoE routing alone unlocks V4, Mixtral, and BAR (Allen AI's modular post-training). Each component is independently valuable.


The GoGPU Ecosystem

Born's GPU backend is powered by the GoGPU ecosystem — pure Go GPU infrastructure:

Project       What                                                     LOC
gogpu/gg      2D graphics with GPU compute shaders                     ~222K
gogpu/naga    Shader compiler (WGSL → SPIR-V, MSL, HLSL, GLSL, DXIL)   ~199K
gogpu/wgpu    Pure Go WebGPU (Vulkan, Metal, DX12, GLES, Software)     ~156K
gogpu/gogpu   Graphics framework + windowing                           ~52K

Combined with Born's ~81K LOC, that's 710K+ lines of pure Go GPU code. No CGO. No Rust. Just go build.


Try It

git clone https://github.com/born-ml/born
cd born
go build ./...
go test ./... -short

Run the examples:

cd examples/mnist && go run .       # MLP: 97.44% accuracy
cd examples/mnist-cnn && go run .   # CNN: 98.18% accuracy
cd examples/mnist-gpu && go run .   # GPU-accelerated inference

GPU examples now work with go run — no .dll download step.


Build This With Us

Born is at an inflection point. GPU is stable. The architecture is proven. The roadmap to DeepSeek V4 is clear.

We're not looking for passive users. We're looking for people who want to help build one of the best ML frameworks in the world. In Go.

How you can make a difference:

  • File issues. Found a bug? A missing operator? An edge case that breaks your model? Every issue makes Born more production-ready. Our three external contributors started exactly this way.

  • Send PRs. Missing tensor ops (TopK, Scatter — needed for MoE), CPU optimizations (the inner loops are naive — lots of low-hanging fruit), new ONNX operators, quantization infrastructure. Look at what @bennibbelink and @gmohmad have done — full vertical slices, production quality. That's the standard.

  • Bring breakthrough ideas. The hardest problems ahead — MoE routing, FP4 dequantization, compressed sparse attention, CPU-GPU cache tiering — are open research questions in Go. If you have insights on how to make these work efficiently in pure Go, we want to hear them.

  • Challenge our assumptions. Tell us what we're doing wrong. Tell us what's missing. The best frameworks are shaped by people who care enough to argue.

Found a bug? Open an issue
Have a big idea? Feature Requests & Roadmap Discussion
Questions? Getting Started & FAQ
Ready to code? Contributing Guide


Links

Resource         Link
GitHub           github.com/born-ml/born
v0.8.0 Release   Release Notes
Documentation    pkg.go.dev/github.com/born-ml/born
Roadmap          ROADMAP.md
Changelog        CHANGELOG.md
GoGPU            github.com/gogpu

Five months ago, Born was a birthday project with zero stars. Today it's a pure Go ML framework with GPU acceleration, 4 contributors, 49 ONNX operators, and a roadmap to run DeepSeek V4.

No .dll. No .so. No excuses. Models are born production-ready.

go build. Ship. Done.

Star us on GitHub: github.com/born-ml/born