Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.
To create an artificial intelligence (AI) language model, researchers build a system that learns from vast amounts of data without human guidance. As a result, the inner workings of language models are often a mystery, even to the researchers who train them. Mechanistic interpretability is a research field that focuses on decoding these inner workings. Researchers in this field use sparse autoencoders as a kind of “microscope” that lets them look inside a language model and get a better sense of how it works.
Today we are announcing Gemma Scope, a new set of tools designed to help researchers understand the inner workings of Gemma 2, our lightweight family of open models. Gemma Scope is a collection of hundreds of freely available, open sparse autoencoders (SAEs) for Gemma 2 9B and Gemma 2 2B. We are also open sourcing Mishax, a tool we developed that enabled much of the interpretability work behind Gemma Scope.
We hope that today's release enables more ambitious interpretability research. Further research has the potential to help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents, such as deception or manipulation.
Try our interactive Gemma Scope demo, courtesy of Neuronpedia.
Interpreting what happens within a language model
When you ask a language model a question, your text input is converted into a series of “activations.” These activations map the relationships between the words you enter, helping the model make connections between different words, which it uses to write a response.
As the model processes the text input, activations at different layers of its neural network represent increasingly advanced concepts, known as “features.”
For example, a model's earlier layers might learn to recall facts, such as that Michael Jordan plays basketball, while later layers might recognize more complex concepts, such as the factuality of the text.
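To make this concrete, here is a minimal sketch (using the Hugging Face transformers library; the model identifier and prompt are just examples) that prints the shape of the activations produced at each layer as the model reads an input:

```python
# Minimal sketch: inspect per-layer activations of a causal language model.
# The model identifier is illustrative; any Hugging Face causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

inputs = tokenizer("Michael Jordan plays basketball.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states holds one tensor per layer (plus the embeddings),
# each of shape (batch, sequence_length, hidden_size) -- these are the activations.
for layer_index, activations in enumerate(outputs.hidden_states):
    print(layer_index, activations.shape)
```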
However, interpretability researchers face a central problem: the model's activations are a mixture of many different features. In the early days of mechanistic interpretability, researchers hoped that the features in a neural network's activations would correspond to individual neurons, i.e., nodes of information. Unfortunately, in practice, neurons are active for many unrelated features. This means there is no obvious way to tell which features are part of a given activation.
This is where sparse autoencoders come into play.
A given activation will only be a mixture of a small number of features, even though the language model is likely capable of detecting millions or even billions of them, i.e., the model uses features sparsely. For example, a language model will consider relativity when responding to a question about Einstein and will consider eggs when writing about omelettes, but will probably not consider relativity when writing about omelettes.
Sparse autoencoders exploit this fact to discover a set of candidate features and decompose each activation into a small number of them. The hope is that the best way for the sparse autoencoder to accomplish this task is to find the actual underlying features that the language model uses.
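As a rough illustration of the idea (a generic ReLU-and-L1 sparse autoencoder, not the architecture released with Gemma Scope), the sketch below maps an activation into a much wider, mostly-zero feature vector and reconstructs the activation from it; the dimensions and sparsity coefficient are placeholders:

```python
# Sketch of a generic sparse autoencoder: wide, mostly-zero features + reconstruction.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activation -> wide feature space
        self.decoder = nn.Linear(d_features, d_model)  # features -> reconstructed activation

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # mostly zero after training
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature activations to zero.
    recon_loss = (reconstruction - activations).pow(2).mean()
    sparsity_loss = features.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss

# Example: decompose 2304-dimensional activations into 16,384 candidate features.
sae = SparseAutoencoder(d_model=2304, d_features=16_384)
acts = torch.randn(8, 2304)  # stand-in for real model activations
feats, recon = sae(acts)
loss = sae_loss(acts, feats, recon)
```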
Importantly, at no point in this process do we, the researchers, tell the sparse autoencoder which features to look for. This enables us to discover rich structures that we did not foresee. However, because we don't immediately know the meaning of the discovered features, we look for meaningful patterns in text samples where the sparse autoencoder says the feature “fires.”
Here's an example where the tokens that trigger the feature are highlighted in blue gradients according to their strength:
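One way to search for such patterns, sketched below under the assumption that you already have a model loaded with output_hidden_states=True and a sparse autoencoder trained on the chosen layer (as in the earlier sketches; the layer and feature indices are arbitrary), is to score a set of text snippets and list the tokens on which the feature activates most strongly:

```python
# Sketch: surface the tokens on which one SAE feature fires most strongly.
import torch

def top_activating_tokens(model, tokenizer, sae, texts, layer, feature, k=5):
    records = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).hidden_states[layer][0]  # (seq_len, d_model)
            features, _ = sae(hidden)                         # (seq_len, d_features)
        for pos, token_id in enumerate(inputs["input_ids"][0]):
            records.append((features[pos, feature].item(), tokenizer.decode(token_id), text))
    # The strongest activations hint at what concept the feature responds to.
    return sorted(records, key=lambda r: r[0], reverse=True)[:k]

# e.g. top_activating_tokens(model, tokenizer, sae,
#          ["Einstein developed relativity.", "Whisk the eggs for the omelette."],
#          layer=12, feature=4242)
```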
What makes Gemma Scope unique
Previous research with sparse autoencoders has focused primarily on examining the inner workings of tiny models or a single layer in larger models. However, more ambitious interpretability research involves decoding multi-layered, complex algorithms in larger models.
To build Gemma Scope, we trained sparse autoencoders on every layer and sublayer output of Gemma 2 2B and 9B, producing more than 400 sparse autoencoders with more than 30 million learned features in total (although many features likely overlap). This tool lets researchers examine how features evolve throughout the model, and how they interact and combine to form more complex features.
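To give a flavour of how the release can be used, the sketch below downloads the parameters of a single sparse autoencoder from Hugging Face; the repository name and file path are illustrative and may not match the final layout, so check the Gemma Scope documentation for the exact names:

```python
# Sketch: fetch one released SAE and inspect its parameters.
# Repository id and file path are illustrative; see the release docs for exact paths.
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",                   # residual-stream SAEs for Gemma 2 2B
    filename="layer_20/width_16k/average_l0_71/params.npz",   # one SAE per layer / width / sparsity
)
params = np.load(path)
for name in params.files:
    print(name, params[name].shape)  # encoder/decoder weights, biases, per-feature thresholds
```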
Gemma Scope is also trained with our new, state-of-the-art JumpReLU SAE architecture. The original sparse autoencoder architecture struggled to balance the twin goals of detecting which features are present and estimating their strength. The JumpReLU architecture makes it easier to strike this balance appropriately, significantly reducing error.
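Concretely, JumpReLU replaces the standard ReLU with a gate at a learned, per-feature threshold: pre-activations below the threshold are zeroed out, while values above it pass through unchanged, so deciding whether a feature is present is decoupled from estimating how strong it is. A minimal sketch of the nonlinearity, with placeholder thresholds:

```python
# Sketch: the JumpReLU nonlinearity versus a plain ReLU.
import torch

def jump_relu(pre_acts: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # Zero everything below the (learned, per-feature) threshold theta; keep the rest as-is.
    return pre_acts * (pre_acts > theta)

pre = torch.tensor([-0.5, 0.05, 0.2, 1.3])
theta = torch.full((4,), 0.1)          # placeholder thresholds; learned per feature in practice
print(torch.relu(pre))                 # keeps the tiny 0.05 activation
print(jump_relu(pre, theta))           # drops it: only values above the threshold survive
```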
Training so many sparse autoencoders was a major engineering challenge that required a lot of computing power. We used about 15% of the training compute of Gemma 2 9B (excluding the compute for generating distillation labels), saved about 20 pebibytes (PiB) of activations to disk (roughly equivalent to a million copies of English Wikipedia), and produced hundreds of billions of sparse autoencoder parameters in total.
Advancing the field
With the release of Gemma Scope, we hope to make Gemma 2 the best model family for open mechanistic interpretability research and accelerate the community's work in this area.
To date, the interpretability community has made great progress in understanding small models with sparse autoencoders and developing relevant techniques, such as causal interventions, automatic circuit analysis, feature interpretation, and the evaluation of sparse autoencoders. We hope that the community will use Gemma Scope to scale these techniques to modern models, analyze more complex capabilities such as chain-of-thought reasoning, and find real-world applications of interpretability, such as tackling problems like hallucinations and jailbreaks that only arise in larger models.
Acknowledgments
Gemma Scope was a joint effort by Tom Lieberum, Sen Rajamanoharan, Arthur Conmy, Lewis Smith, Nic Sonnerat, Vikrant Varma, Janos Kramar and Neel Nanda, advised by Rohin Shah and Anca Dragan. We would especially like to thank Johnny Lin, Joseph Bloom, and Curt Tigges from Neuronpedia for their support with the interactive demo. We are grateful for the help and contributions of Phoebe Kirk, Andrew Forbes, Arielle Bier, Aliya Ahmad, Yotam Doron, Tris Warkentin, Ludovic Peran, Kat Black, Anand Rao, Meg Risdal, Samuel Albanie, Dave Orr, Matt Miller, Alex Turner, Tobi Ijitoye, Shruti Sheth, Jeremy Sie, Alex Tomala, Javier Ferrando, Oscar Obeso, Kathleen Kenealy, Joe Fernandez, Omar Sanseviero and Glenn Cameron.