Masked Hard-Attention Transformers

A schematic of a Masked Hard-Attention Transformer alongside the program trace of a B-RASP program recognizing Dyck-1 of depth 2

The Paper

Masked Hard-Attention Transformers

The Slides

Talk slides - I don't know when the YouTube video will be up

The Simulator

Check out the B-RASP simulator

What's the Deal?

Hard-attention transformers are a model of transformers used for theoretical analysis, in which each attention head either concentrates all of its weight on a single position (unique hard attention) or spreads it uniformly over the positions tied for the highest attention score (average hard attention). In prior research, Elena Voita et al. observed that in trained transformers, attention really does concentrate on just a few positions, which suggests this theoretical model can provide insight into real-life transformer models.
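To make "unique hard attention with future masking" concrete, here is a minimal NumPy sketch of a single hard-attention head. It is purely illustrative: the function and variable names are mine, and the masking and tie-breaking conventions used here (positions j ≤ i, ties broken toward the leftmost position) are simplifying assumptions rather than the paper's formal definition.

```python
import numpy as np

def hard_attend(scores: np.ndarray, values: np.ndarray) -> np.ndarray:
    """For each query position i, look only at positions j <= i (future
    masking) and copy the value at the single best-scoring such j, breaking
    ties toward the leftmost position (unique hard attention)."""
    n = scores.shape[0]
    out = np.empty_like(values)
    for i in range(n):
        masked = scores[i, : i + 1]      # future masking: only j <= i survive
        j = int(np.argmax(masked))       # one winner; argmax keeps the first tie
        out[i] = values[j]               # all attention weight goes to position j
    return out

# Toy run: 4 positions with 1-dimensional values.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))         # scores[i, j]: query at i vs. key at j
values = np.arange(4.0).reshape(4, 1)    # the value vector held at each position
print(hard_attend(scores, values))
```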

More specifically, this work analyzes unique hard-attention transformers with future masking. The heart of the analysis is a little programming language Dana came up with called Boolean RASP (B-RASP), inspired by Gail Weiss's RASP. This language lets us prove that masked hard-attention transformers are exactly as expressive as first-order logic, linear temporal logic, and counter-free automata (these link to some nice definitions).
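To give a feel for the style of program involved, below is a toy Python sketch in the spirit of B-RASP (it is not the language itself): Boolean predicates at each position, plus "attention" operations that read a Boolean value from the leftmost masked position satisfying a score predicate, falling back to a default when no position qualifies. The helper name, the strict look-back masking, and the acceptance condition (check an output predicate at the last position) are all my simplifying assumptions; the example recognizes the star-free language a*b*.

```python
from typing import Callable

def leftmost_before(score: Callable[[int, int], bool],
                    value: Callable[[int], bool],
                    default: bool, n: int) -> list[bool]:
    """For each position i, hard-attend to the leftmost position j < i where
    score(i, j) holds and read the Boolean value(j); if no such j exists,
    fall back to the default."""
    out = []
    for i in range(n):
        j = next((j for j in range(i) if score(i, j)), None)
        out.append(value(j) if j is not None else default)
    return out

def in_a_star_b_star(w: str) -> bool:
    """Recognize the star-free language a*b* using two attention operations
    plus position-wise Boolean logic."""
    n = len(w)
    is_a = [c == "a" for c in w]
    is_b = [c == "b" for c in w]
    # seen_b(i): some position strictly before i carries a 'b'.
    seen_b = leftmost_before(lambda i, j: is_b[j], lambda j: True, False, n)
    # bad(i): position i is an 'a' that appears after some 'b'.
    bad = [is_a[i] and seen_b[i] for i in range(n)]
    # any_bad(i): some position strictly before i is bad.
    any_bad = leftmost_before(lambda i, j: bad[j], lambda j: True, False, n)
    # Accept iff the last position sees no bad position (including itself).
    return n == 0 or not (any_bad[n - 1] or bad[n - 1])

print([in_a_star_b_star(w) for w in ["aabb", "abab", "", "bbaa"]])
```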

*LIMITATIONS MAY APPLY*
In reality, transformers can attend uniformly across all positions, a behavior that hard-attention models cannot express. These results therefore give a strict lower bound on the actual expressivity of transformers, and the gap is sizeable.
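To make that gap concrete: MAJORITY (strings of bits with more 1s than 0s), a standard example in this literature, is not even regular, so in particular it is not star-free and, by the paper's equivalence, lies beyond masked hard-attention transformers; a single step of uniform averaging attention decides it easily. A minimal sketch, assuming only NumPy (the function name and setup are mine):

```python
import numpy as np

def majority_via_uniform_attention(bits: str) -> bool:
    """Return True iff the bit string contains more 1s than 0s."""
    values = np.array([1.0 if b == "1" else 0.0 for b in bits])
    # Uniform attention: every position gets weight 1/n, so the attention
    # output is just the mean of the values, i.e. the fraction of 1s.
    weights = np.full(len(values), 1.0 / len(values))
    fraction_of_ones = float(weights @ values)
    return fraction_of_ones > 0.5

print([majority_via_uniform_attention(w) for w in ["1101", "0101", "0001"]])
```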

Despite these limitations, I think this work still provides valuable insight as a "base case" for studying the formal expressivity of transformers. As one of the only known exact correspondences between a transformer variant and a logic, it places the study of transformer expressivity on firm foundations. Furthermore, the proof techniques can provide inspiration for future work - in particular, B-RASP and its possible extensions hold a lot of potential for simplifying the analysis of more complicated transformer models. I hope to see what comes next in this line of research!