19 May 2021 • M. Lautaro Hickmann, Fabian Wurzberger, Megi Hoxhalli, Arne Lochner, Jessica Töllich, Ansgar Scherp
We observe a high correlation between the attention weights and this reference metric, especially in the later decoder layers of the transformer architecture.
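A layer-wise comparison of this kind could be sketched as follows; the `layerwise_correlation` helper, the array shapes, and the toy data are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def layerwise_correlation(attn: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Pearson correlation between attention weights and a reference
    metric, computed separately for each decoder layer.

    attn: (num_layers, seq_len) attention weights (hypothetical shape)
    ref:  (seq_len,) reference-metric value per token
    """
    return np.array([np.corrcoef(layer, ref)[0, 1] for layer in attn])

# Toy data: later layers are constructed to track the reference metric
# more closely (decreasing noise), mimicking the reported trend.
rng = np.random.default_rng(0)
ref = rng.random(50)
attn = np.stack([ref + rng.normal(0.0, s, 50) for s in (1.0, 0.5, 0.1)])

corrs = layerwise_correlation(attn, ref)
```

With this toy setup, the correlation increases toward the last layer, which is the qualitative pattern the sentence above describes.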