Gemma Scope: helping the safety community shed light on the inner workings of language models


Authors

Language Model Interpretability Team

Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.

To create an artificial intelligence (AI) language model, researchers build a system that learns from vast amounts of data without human guidance. As a result, the inner workings of language models are often a mystery, even to the researchers who train them. Mechanistic interpretability is a research field focused on deciphering these inner workings. Researchers in this field use sparse autoencoders as a kind of “microscope” that lets them look inside a language model and get a better sense of how it works.

Today we are announcing Gemma Scope, a new set of tools designed to help researchers understand the inner workings of Gemma 2, our lightweight family of open models. Gemma Scope is a collection of hundreds of freely available, open sparse autoencoders (SAEs) for Gemma 2 9B and Gemma 2 2B. We are also open sourcing Mishax, a tool we developed that enabled much of the interpretability work behind Gemma Scope.

We hope that today's release enables more ambitious interpretability research. Further research has the potential to help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents, such as deception or manipulation.

Try our interactive Gemma Scope demo, courtesy of Neuronpedia.

Interpreting what happens within a language model

When you ask a language model a question, your text input is converted into a series of “activations.” These activations map the relationships between the words you enter, helping the model make connections between different words, which it uses to write a response.
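To make this concrete, here is a minimal sketch, assuming a Hugging Face-style causal language model, of how a researcher might capture these intermediate activations with a PyTorch forward hook. The checkpoint name and layer index are illustrative assumptions, not part of Gemma Scope itself.

```python
# Minimal sketch: capturing a model's intermediate activations with a forward
# hook. The checkpoint name is illustrative; any Hugging Face causal LM with
# accessible decoder layers would work similarly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"  # assumed checkpoint name, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        # For decoder layers, the output is typically a tuple; the hidden
        # states (the "activations") are the first element.
        captured[name] = output[0].detach()
    return hook

# Register a hook on one decoder layer (the layer index is arbitrary here).
layer_idx = 12
model.model.layers[layer_idx].register_forward_hook(save_activation(f"layer_{layer_idx}"))

inputs = tokenizer("The City of Light is", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

print(captured[f"layer_{layer_idx}"].shape)  # (batch, sequence_length, hidden_dim)
```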


As the model processes text input, activations at different layers in the model's neural network represent multiple, increasingly advanced concepts known as “features.”

For example, a model's early layers might learn to recall facts, such as that Michael Jordan plays basketball, while later layers might recognize more complex concepts, such as the factuality of the text.

A stylized representation of using a sparse autoencoder to interpret a model's activations as it recalls the fact that the City of Light is Paris. We see that French-related concepts are present, while unrelated concepts are not.

However, interpretability researchers face a central problem: a model's activations are a mixture of many different features. In the early days of mechanistic interpretability, researchers hoped that the features in a neural network's activations would line up with individual neurons, i.e., nodes of information. Unfortunately, in practice, neurons are active for many unrelated features. This means there is no obvious way to tell which features are part of a given activation.

This is where sparse autoencoders come into play.

A given activation will only be a mixture of a small number of features, even though the language model is likely capable of recognizing millions or even billions of them, i.e., the model uses features sparsely. For example, a language model will consider relativity when responding to a query about Einstein and consider eggs when writing about omelettes, but will probably not consider relativity when writing about omelettes.

Sparse autoencoders exploit this fact to discover a set of candidate features and decompose each activation into a small number of them. The researchers hope that the best way for the sparse autoencoder to accomplish this task is to find the actual underlying features that the language model uses.
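As an illustration, here is a minimal sketch of a sparse autoencoder in PyTorch: a dense activation vector is encoded into a much wider, mostly-zero feature vector, then reconstructed from those few active features. The dimensions and the simple L1 sparsity penalty are illustrative assumptions, not Gemma Scope's exact training recipe (which uses the JumpReLU architecture described below).

```python
# Minimal sketch of a sparse autoencoder (SAE): it encodes a dense activation
# vector into a much wider, mostly-zero feature vector, then reconstructs the
# original activation from those few active features. Dimensions and the L1
# sparsity penalty are illustrative, not Gemma Scope's exact recipe.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=2304, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activation):
        # Feature activations: ReLU keeps them non-negative; the sparsity
        # penalty below pushes most of them to exactly zero.
        features = torch.relu(self.encoder(activation))
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
activation = torch.randn(8, 2304)            # a batch of model activations
reconstruction, features = sae(activation)

# Training objective: reconstruct the activation faithfully while keeping
# the feature vector sparse (few non-zero entries per activation).
recon_loss = (reconstruction - activation).pow(2).mean()
sparsity_loss = features.abs().sum(dim=-1).mean()
loss = recon_loss + 1e-3 * sparsity_loss
```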


Importantly, at no point in this process do we, the researchers, tell the sparse autoencoder which features to look for. This lets us discover rich structures that we did not foresee. However, since we don't immediately know the meaning of the discovered features, we look for meaningful patterns in text examples where the sparse autoencoder indicates that a feature “fires.”

Here's an example in which the tokens that trigger the feature are highlighted in shades of blue according to their strength:

Example activations for a feature found by our sparse autoencoders. Each bubble is a token (a word or word fragment), and the varying blue color shows how strongly the feature is present. In this case, the feature is apparently related to idioms.
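Visualizations like this can be approximated in a few lines: run text through the model, encode the captured activations with a trained sparse autoencoder, and rank each token by how strongly a chosen feature fires on it. This sketch reuses the hypothetical `tokenizer`, `captured`, and `sae` objects from the earlier sketches; the feature index is arbitrary.

```python
# Sketch: ranking tokens by how strongly a single SAE feature fires on them.
# Reuses the hypothetical `tokenizer`, `inputs`, `captured`, and `sae` objects
# from the sketches above; in practice `sae` would be a trained SAE for this
# layer, and the feature index is arbitrary.
feature_idx = 4242

token_ids = inputs["input_ids"][0]
acts = captured[f"layer_{layer_idx}"][0]   # (sequence_length, hidden_dim)
_, features = sae(acts)                    # (sequence_length, d_features)
strengths = features[:, feature_idx]

for token_id, strength in zip(token_ids, strengths):
    token = tokenizer.decode(token_id)
    print(f"{token!r:>12}  {strength.item():.3f}")
```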

What makes Gemma Scope unique

Previous research with sparse autoencoders has focused primarily on examining the inner workings of tiny models or a single layer in larger models. However, more ambitious interpretability research involves decoding multi-layered, complex algorithms in larger models.

To build Gemma Scope, we trained sparse autoencoders on every layer and sublayer output of Gemma 2 2B and 9B, producing more than 400 sparse autoencoders with more than 30 million learned features in total (although many features likely overlap). This tool lets researchers examine how features evolve throughout the model, and how they interact and combine to form more complex features.
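Conceptually, this means a sparse autoencoder is available for each layer's output, so a single prompt can be traced through the whole model. The sketch below reuses the illustrative `SparseAutoencoder` class from above (with untrained stand-ins) to show one way a researcher might compare feature activity across layers; the per-layer activation dictionary is hypothetical.

```python
# Sketch: with one SAE per layer, the same prompt can be traced through the
# whole model. `saes` is a hypothetical dict mapping layer index -> an SAE for
# that layer's residual-stream output (untrained stand-ins here), and
# `captured_all` would hold per-layer activations collected via hooks.
n_layers = 26  # Gemma 2 2B has 26 decoder layers
saes = {i: SparseAutoencoder() for i in range(n_layers)}

def active_feature_counts(captured_all, threshold=0.0):
    """For each layer, count features firing above `threshold` at each token."""
    counts = {}
    for layer, acts in captured_all.items():         # acts: (seq, hidden_dim)
        _, feats = saes[layer](acts)
        counts[layer] = (feats > threshold).sum(dim=-1)  # (seq,)
    return counts
```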

Gemma Scope is also trained with our new, state-of-the-art JumpReLU SAE architecture. The original sparse autoencoder architecture struggled to balance the twin goals of detecting which features are present and estimating their strength. The JumpReLU architecture makes it easier to strike this balance appropriately, significantly reducing error.
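For intuition, here is a sketch of the JumpReLU activation itself: instead of ReLU's fixed cutoff at zero, each feature has a learned threshold, and a feature only counts as present when its pre-activation exceeds that threshold, at which point its value passes through unchanged. The threshold values here are made up, and training details (such as how gradients flow through the thresholds) are omitted.

```python
# Sketch of the JumpReLU activation: each feature has a learned threshold
# theta. A feature is "present" only if its pre-activation exceeds theta, and
# when present its value passes through unchanged, separating detection
# (is the feature there?) from estimation (how strong is it?).
import torch

def jump_relu(pre_activation: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    return pre_activation * (pre_activation > theta)

z = torch.tensor([0.1, 0.4, 1.7, 0.3])
theta = torch.tensor([0.5, 0.2, 0.5, 0.5])  # illustrative learned thresholds
print(jump_relu(z, theta))                  # tensor([0.0, 0.4, 1.7, 0.0])
```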


Training so many sparse autoencoders was a significant engineering challenge and required a lot of computing power. We used about 15% of the training compute of Gemma 2 9B (excluding the compute for generating distillation labels), saved about 20 pebibytes (PiB) of activations to disk (about as much as a million copies of English Wikipedia), and produced hundreds of billions of sparse autoencoder parameters in total.
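As a rough sanity check of that comparison, assuming the figure refers to the text of English Wikipedia at roughly 20 gigabytes per copy:

```python
# Back-of-envelope check of the storage comparison: 20 PiB spread over a
# million copies works out to roughly 20 GB per copy, which is about the
# size of an English Wikipedia text dump.
pebibyte = 2**50                       # bytes in one pebibyte (PiB)
total_bytes = 20 * pebibyte            # ~2.25e16 bytes of stored activations
per_copy = total_bytes / 1_000_000     # bytes per "copy of Wikipedia"
print(f"{per_copy / 1e9:.1f} GB per copy")   # ~22.5 GB
```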

Advancing the field

With the release of Gemma Scope, we hope to make Gemma 2 the best model family for open mechanistic interpretability research and accelerate the community's work in this area.

To date, the interpretability community has made great progress in understanding small models with sparse autoencoders and in developing related techniques such as causal interventions, automatic circuit analysis, feature interpretation, and the evaluation of sparse autoencoders. We hope the community will use Gemma Scope to scale these techniques to modern models, analyze more complex features like chain of thought, and find real-world applications of interpretability, such as tackling problems like hallucinations and jailbreaks that only arise with larger models.

Acknowledgments

Gemma Scope was a joint effort by Tom Lieberum, Sen Rajamanoharan, Arthur Conmy, Lewis Smith, Nic Sonnerat, Vikrant Varma, Janos Kramar and Neel Nanda, advised by Rohin Shah and Anca Dragan. We would especially like to thank Johnny Lin, Joseph Bloom, and Curt Tigges from Neuronpedia for their support with the interactive demo. We are grateful for the help and contributions of Phoebe Kirk, Andrew Forbes, Arielle Bier, Aliya Ahmad, Yotam Doron, Tris Warkentin, Ludovic Peran, Kat Black, Anand Rao, Meg Risdal, Samuel Albanie, Dave Orr, Matt Miller, Alex Turner, Tobi Ijitoye, Shruti Sheth, Jeremy Sie, Alex Tomala, Javier Ferrando, Oscar Obeso, Kathleen Kenealy, Joe Fernandez, Omar Sanseviero and Glenn Cameron.
