
Google Opens the Hood of LLMs



In a recent paper (“Learning without training: The implicit dynamics of in-context learning”), researchers at Google take a mathematical dive into the inner workings of Transformers.


Quick technical reminder

An LLM consists of stacked Transformer layers. In each layer, we find two main blocks:

- a self-attention mechanism, which lets each token draw information from the rest of the context;
- a feed-forward network (FFN), applied to each position independently, which transforms the representation of each token.

But what’s fascinating is that the interaction between these two blocks isn’t limited to simple sequential processing.
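
To make this concrete, here is a minimal numpy sketch of a single Transformer layer: one attention head, no layer norm, no masking, and illustrative names throughout; real implementations add much more.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Block 1: each token mixes information from the whole context.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V

def ffn(X, W1, b1, W2, b2):
    # Block 2: a position-wise MLP applied to every token independently.
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

def transformer_layer(X, p):
    # Attention mixes the tokens, the FFN transforms each one;
    # both are wrapped in residual connections.
    X = X + self_attention(X, p["Wq"], p["Wk"], p["Wv"])
    return X + ffn(X, p["W1"], p["b1"], p["W2"], p["b2"])

rng = np.random.default_rng(0)
n, d, d_ff = 5, 16, 64
p = {"Wq": rng.normal(size=(d, d)), "Wk": rng.normal(size=(d, d)),
     "Wv": rng.normal(size=(d, d)),
     "W1": rng.normal(size=(d, d_ff)), "b1": np.zeros(d_ff),
     "W2": rng.normal(size=(d_ff, d)), "b2": np.zeros(d)}
out = transformer_layer(rng.normal(size=(n, d)), p)  # shape (5, 16)
```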


The discovery: dynamic plasticity

The paper demonstrates that passing through the FFN, combined with the context brought by attention, is mathematically equivalent to temporarily creating a weight matrix specific to the input: a low-rank update generated by the prompt itself. In other words, the architecture allows dynamic adaptation to each situation, a form of temporary plasticity, triggered only by the prompt.

Even with fixed weights, the model seems to adapt in real-time to each new input. A form of contextual intelligence!
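
As I read the paper, the mechanism behind this claim is a rank-1 update: shifting the attention output a(x) by a context-dependent vector δ before the FFN’s first weight matrix W is exactly equivalent to leaving the input alone and perturbing W itself. Here is a minimal numpy sketch of that identity (variable names are mine, not the paper’s):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

W = rng.normal(size=(d, d))   # frozen FFN weight matrix
a = rng.normal(size=d)        # attention output for the query, without context
delta = rng.normal(size=d)    # shift that the context adds to that output

# Rank-1 "virtual" update built from the context shift.
dW = np.outer(W @ delta, a) / (a @ a)

# Feeding the context-augmented activation through the frozen weights ...
with_context = W @ (a + delta)
# ... gives the same result as feeding the context-free activation
# through the temporarily updated weights.
as_weight_update = (W + dW) @ a

print(np.allclose(with_context, as_weight_update))  # True
```

The “learning” here costs one outer product and evaporates as soon as the prompt changes: no gradients, no training, just temporary plasticity.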

The demonstration (not yet peer-reviewed) aims to better understand this black box, in order to eventually develop diagnostic techniques, explainability methods, and even targeted optimization.


The perspectives this opens

This formal framework opens several concrete paths:

- diagnostics: inspecting the implicit weight updates a prompt induces;
- explainability: tracing model behavior back to an explicit mathematical description;
- targeted optimization: shaping how a model exploits its context.

However, be careful not to confuse mathematical description with deep understanding: these observations should be the starting point, not the conclusion, of our understanding of LLMs.


Conclusion

In short, the models are powerful, but it will still take human intelligence to extract their full potential.


The experts’ corner 🤓

Mixtures-of-Experts (MoE)

Technical definition: An architecture in which multiple sub-models (“experts”) are available, but only a few are activated for each input according to a selection mechanism (gating), allowing computation to specialize while keeping costs under control.

Advantages:

- a large total parameter count, but only a fraction is active for each input, which keeps inference costs under control;
- specialization: each expert can focus on a particular type of input or domain;
- easier scaling: capacity grows by adding experts rather than by making every computation bigger.

Concrete example: an MoE model could have experts specialized in different domains (medicine, code, literature) and activate only the most relevant ones depending on the task.
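
To make the gating mechanism concrete, here is a minimal numpy sketch of top-k routing. Everything is illustrative: each expert is reduced to a single linear map, whereas in a real model it would be a full FFN.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, experts, W_gate, k=2):
    # The gating network scores every expert for this particular input.
    gate_logits = W_gate @ x
    top_k = np.argsort(gate_logits)[-k:]      # keep only the k best experts
    weights = softmax(gate_logits[top_k])     # renormalize over the selection
    # Only the selected experts run; the rest cost nothing for this input.
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, M=M: M @ x for M in expert_mats]
W_gate = rng.normal(size=(n_experts, d))

y = moe_forward(rng.normal(size=d), experts, W_gate, k=2)
```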

Hyper-Networks

Technical definition: A network that doesn’t make predictions directly, but dynamically generates the weights of another main network, allowing the latter to adapt to each task or context without additional training.

Potential applications:

- rapid adaptation to new tasks or contexts without retraining the main network;
- personalization, where a user or task vector generates dedicated weights;
- meta-learning, where one network learns how to configure another.

Analogy: A chef who adapts their recipe in real-time according to available ingredients, without needing to relearn how to cook.
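
And a minimal numpy sketch of the idea, purely illustrative: the hyper-network is reduced to a single linear map H that turns a context vector into the main network’s weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
ctx_dim, in_dim, out_dim = 4, 3, 2

# Hyper-network: maps a task/context vector to the main network's weights.
H = rng.normal(size=(out_dim * in_dim, ctx_dim))

def main_network(x, context):
    # Generate the weight matrix on the fly from the context ...
    W = (H @ context).reshape(out_dim, in_dim)
    # ... then use it for the actual prediction; H itself never changes.
    return W @ x

x = rng.normal(size=in_dim)
print(main_network(x, context=np.array([1.0, 0.0, 0.0, 0.0])))  # "task A"
print(main_network(x, context=np.array([0.0, 1.0, 0.0, 0.0])))  # "task B"
```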


References

“Learning without training: The implicit dynamics of in-context learning”, Google, 2025 (preprint).

