
TRACR-Supported Mechanistic Interpretability

We compare a Transformer compiled from a RASP algorithm using TRACR to a Transformer trained on the same task with AdamW using the TransformerLens library. We find that (1) compiled Transformers are significantly more interpretable due to their on/off activation patterns, (2) a compiled and a trained toy-model Transformer learn the same type of circuits, and (3) using RASP and TRACR might provide a path towards automated circuit identification by compiling to causal graphs and utilising causal scrubbing for algorithmic matching.

Keywords: Mechanistic interpretability, ML safety

Introduction

In this study, we investigate the potential of algorithms compiled into Transformer networks as a tool for interpretability. To this end, we employ the RASP programming language (Weiss et al., 2021) and the TRACR compiler (Lindner et al., 2023), which compiles RASP code into Transformer neural networks. By comparing the mechanistic circuits generated by the TRACR compiler to those learned by a Transformer trained on the same task, we aim to observe the differences between the two approaches and assess their potential for interpretability.

Our hypotheses are: (I) the compiled algorithm will exhibit a higher degree of symbolic circuits in its attention heads and MLP layers, while the learned algorithm will be more ambiguous in its activation patterns; (II) the algorithms we write are explicitly learnable by a Transformer network; and (III) the same algorithms will be learned, albeit with different representations in the network.

Methods

We use TRACR to compile a reversal algorithm written in RASP to a 4-layer Transformer with four attention blocks and visualize the residual stream activation patterns, attention patterns, and MLP activations.
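A minimal sketch of this compilation step is shown below. It assumes the public tracr API (in particular tracr.compiler.lib.make_reverse and tracr.compiler.compiling.compile_rasp_to_model); the exact calls in our own code may differ slightly.

    from tracr.rasp import rasp
    from tracr.compiler import compiling, lib

    # RASP program that reverses the input: each output position aggregates
    # from the opposite input position (length - index - 1).
    reverse_program = lib.make_reverse(rasp.tokens)

    # Compile to a Transformer over the vocabulary {1, 2, 3} plus a "BOS"
    # token, with at most five content tokens, matching the setup above.
    assembled_model = compiling.compile_rasp_to_model(
        reverse_program,
        vocab={1, 2, 3},
        max_seq_len=5,
        compiler_bos="BOS",
    )

    out = assembled_model.apply(["BOS", 1, 2, 3, 3, 3])
    print(out.decoded)  # expected: ["BOS", 3, 3, 3, 2, 1]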

We convert the compiled model to the HookedTransformer class of the TransformerLens library, reinitialize the weights, and retrain it on a generated dataset with a vocabulary of four tokens (“BOS”, 1, 2, and 3), where each training example consists of “BOS” followed by up to five tokens (e.g. “BOS, 2, 1, 3, 3, 3”). The model is trained on 25,000 randomly generated samples to reverse the token sequence in the same way as the RASP-compiled Transformer. After training, we compare the results with the compiled Transformer.
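A condensed sketch of this training setup follows. The model dimensions, batch size, learning rate, and token encoding (“BOS” mapped to id 0) are illustrative assumptions rather than the exact values used, and the example length is fixed at five tokens for brevity.

    import torch
    from transformer_lens import HookedTransformer, HookedTransformerConfig

    # Assumed layout: 4 layers with one head each, echoing the compiled model.
    # Bidirectional attention lets every position see the whole sequence,
    # which the position-wise reversal task requires.
    cfg = HookedTransformerConfig(
        n_layers=4, n_heads=1, d_model=32, d_head=8, d_mlp=64,
        d_vocab=4, n_ctx=6, act_fn="relu", attention_dir="bidirectional",
    )
    model = HookedTransformer(cfg)  # freshly initialised weights, not the compiled ones

    def make_batch(batch_size: int, seq_len: int = 5):
        """Random digit sequences with a BOS token (id 0); targets are the reversal."""
        digits = torch.randint(1, 4, (batch_size, seq_len))   # tokens 1-3
        bos = torch.zeros(batch_size, 1, dtype=torch.long)    # "BOS" -> id 0
        tokens = torch.cat([bos, digits], dim=1)
        targets = torch.cat([bos, digits.flip(dims=[1])], dim=1)
        return tokens, targets

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    batch_size = 64
    for step in range(25_000 // batch_size):   # roughly 25,000 samples in total
        tokens, targets = make_batch(batch_size)
        logits = model(tokens)                 # [batch, pos, d_vocab]
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, cfg.d_vocab), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()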

Results

Figure 1: Residual stream of the compiled token reversal algorithm. The compiled algorithm reverses [“BOS”, 1, 2, 3, 3, 3] to [“BOS”, 3, 3, 3, 2, 1]

Figure 2: Residual stream of the learned token reversal algorithm. The learned algorithm reverses [“BOS”, 1, 2, 3, 3, 3] to [“BOS”, 3, 3, 3, 2, 1]

Figure 3: Attention patterns in the four layers of the compiled algorithm. The compiled algorithm reverses [“BOS”, 1, 2, 3, 3, 3] to [“BOS”, 3, 3, 3, 2, 1]

Figure 4: Attention patterns in the four layers of the learned algorithm. The learned algorithm reverses [“BOS”, 1, 2, 3, 3, 3] to [“BOS”, 3, 3, 3, 2, 1]

Figure 5: Activation patterns of the MLPs (post-non-linearity) of the compiled algorithm (left) and the learned algorithm (right). Both algorithms reverse [“BOS”, 1, 2, 3, 3, 3] to [“BOS”, 3, 3, 3, 2, 1]

Discussion and Conclusion

Figures 1 and 2 present the residual streams of the compiled and the learned token reversal algorithms. The compiled model exhibits significantly more binary (on/off) activations than the learned model, supporting our first hypothesis.
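One crude way to quantify this visual impression is to count how many residual stream entries lie close to 0 or 1. The metric below is a hypothetical illustration; model and make_batch refer to the training sketch above, and the same check can be run on the converted compiled model.

    def near_binary_fraction(model, tokens, tol: float = 0.1) -> float:
        """Fraction of residual stream entries within `tol` of 0 or 1."""
        _, cache = model.run_with_cache(tokens)
        resid = torch.stack(
            [cache["resid_post", layer] for layer in range(model.cfg.n_layers)])
        mask = (resid.abs() < tol) | ((resid - 1).abs() < tol)
        return mask.float().mean().item()

    tokens, _ = make_batch(8)
    print(f"near-binary residual entries: {near_binary_fraction(model, tokens):.2f}")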

Figures 3 and 4 depict the attention patterns in the four layers of the two models; the patterns of the last layer are consistent between the two algorithms. There is no attention to index 0, which holds the “BOS” token, while the rest of the prompt is attended to in reverse order (output position 5 attends to input position 1, and so on). This is exactly what one would predict for a reversal algorithm. The fact that both final layers are so similar indicates the potential for creating a library of compiled Transformers against which learned Transformers could be compared and matched, which could partly automate mechanistic interpretability.
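This prediction is straightforward to verify programmatically. The sketch below again reuses model and make_batch from the training sketch; layers are 0-indexed, so layer 3 is the fourth layer.

    tokens, _ = make_batch(1, seq_len=5)
    _, cache = model.run_with_cache(tokens)

    # Attention pattern of head 0 in the final layer: [query_pos, key_pos].
    pattern = cache["pattern", 3][0, 0]

    # For a reversal circuit, output position q should attend to position
    # 6 - q (positions 1..5 hold the digits; position 0 is "BOS").
    for q in range(1, 6):
        print(q, "->", pattern[q].argmax().item())  # expect 5, 4, 3, 2, 1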

These results partially confirm our second hypothesis; however, the first three layers do not appear to contribute significantly to this specific algorithm. The TRACR compiler compiles this RASP program into a 4-layer Transformer. The authors describe a compression stage in their paper that is not implemented in the public code and that appears to degrade performance. We find that the only relevant attention heads seem to be in layer four, while the attention in the first three layers appears redundant.

In Figure 5, we observe that the MLP (multi-layer perceptron) layers 2 and 3 seem relevant for the compiled reversal algorithm, while the learned Transformer’s MLP layers are significantly more challenging to interpret. This supports our claim that the compiled algorithm exhibits a more interpretable, symbolic representation, though hypothesis II remains in doubt until we use activation patching to trace back the circuit.
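Such an activation patching experiment might look roughly as follows. This is a sketch using TransformerLens hooks, reusing model and make_batch from the training sketch; patching layer 2's MLP output is only an example of one candidate component.

    from transformer_lens import utils

    clean_tokens, _ = make_batch(1)
    corrupt_tokens, _ = make_batch(1)
    _, clean_cache = model.run_with_cache(clean_tokens)

    def patch_mlp_out(activation, hook):
        # Replace the corrupted run's activation with the clean run's activation.
        return clean_cache[hook.name]

    # Run the corrupted prompt with layer 2's MLP output patched in from the
    # clean prompt; comparing the resulting logits against the clean and
    # corrupted baselines indicates how much that MLP layer matters for the
    # reversal behaviour.
    patched_logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(utils.get_act_name("mlp_out", 2), patch_mlp_out)],
    )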

The Road to Automation

In mechanistic interpretability, mechanistic anomaly detection (Christiano, 2022) and unsupervised latent knowledge discovery (Burns et al., 2022) are two promising approaches to, respectively, summarising where models go wrong and understanding what knowledge they have that they do not reveal.

By using Transformer-compiled RASP programs, we can get a better understanding of what a specific algorithm in a model might look like. Our results indicate that comparing the attention patterns of learned models to those of compiled Transformers is particularly informative about the learned algorithms, whereas the compiled MLP activations provide less information.

Extensions of this work might include:

  1. Exploring whether learned Transformers can be compressed into a TRACR-like model, i.e. a significantly easier-to-interpret model, which might make neural behaviours more understandable (see the contrast between the compiled and learned models in Figures 1–5).
  2. Using activation patching to interpret the circuits learned by the TransformerLens model and comparing the functionality of its attention heads and MLP layers to the compiled model for various RASP programs.
  3. Exploring how to generate causal graphs of RASP algorithms for causal scrubbing, to automatically search for a given algorithm in a larger network; for example, the string reversal algorithm is present in a few-shot GPT-3 model (Peter Welinder [@npew], 2022).

If we achieve these three goals, RASP will become a powerful tool for quickly writing and interpreting algorithmic circuits (Nanda, 2022) beyond toy models.

References

See the code in Colab at https://ais.pub/tracr.


Burns, C., Ye, H., Klein, D., & Steinhardt, J. (2022). Discovering Latent Knowledge in Language Models Without Supervision (arXiv:2212.03827). arXiv. https://doi.org/10.48550/arXiv.2212.03827

Christiano, P. (2022, November 25). Mechanistic anomaly detection and ELK. Medium. https://ai-alignment.com/mechanistic-anomaly-detection-and-elk-fb84f4c6d0dc

Lindner, D., Kramár, J., Rahtz, M., McGrath, T., & Mikulik, V. (2023). Tracr: Compiled Transformers as a Laboratory for Interpretability (arXiv:2301.05062). arXiv. http://arxiv.org/abs/2301.05062

Nanda, N. (2022). 200 COP in MI: Interpreting Algorithmic Problems. https://www.alignmentforum.org/posts/ejtFsvyhRkMofKAFy/200-cop-in-mi-interpretin...

Peter Welinder [@npew]. (2022, May 15). GPT-3 is amazing at complex tasks like creative writing and summarizing. But it’s surprisingly bad at reversing words. 🤔 The reason is that GPT-3 doesn’t see the world the way we humans do. 👀 If you teach it to reason, it can get around its limitations to get really good. 💡 https://t.co/Cnd9iN87oq [Tweet]. Twitter. https://twitter.com/npew/status/1525900849888866307

Weiss, G., Goldberg, Y., & Yahav, E. (2021). Thinking Like Transformers (arXiv:2106.06981). arXiv. http://arxiv.org/abs/2106.06981
