no code implementations • 20 Mar 2024 • Yuseung Lee, Minhyuk Sung
Our analysis of a pretrained image diffusion model that integrates gated self-attention into the U-Net reveals that spatial grounding often outweighs textual grounding due to the sequential flow from gated self-attention to cross-attention.
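The ordering described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head, no learned projections, and a tanh-gated residual for the gated self-attention step. All function and argument names (`attention`, `gated_self_attention`, `sequential_block`, `gate`) are hypothetical. The point is only that cross-attention receives features already altered by the spatial grounding tokens.

```python
import math


def attention(queries, context):
    """Single-head scaled dot-product attention without learned projections (a simplification)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * c[i] for w, c in zip(weights, context)) / total
                    for i in range(dim)])
    return out


def add(x, y, scale=1.0):
    """Residual connection: x + scale * y, element-wise over token lists."""
    return [[a + scale * b for a, b in zip(r, s)] for r, s in zip(x, y)]


def gated_self_attention(visual, grounding, gate):
    """Assumed form: attend over visual + grounding tokens, keep the visual
    outputs, and add them back scaled by tanh(gate)."""
    joint = visual + grounding  # concatenate the two token lists
    attended = attention(joint, joint)[:len(visual)]
    return add(visual, attended, math.tanh(gate))


def sequential_block(visual, grounding, text, gate):
    """Self-attention, then gated self-attention, then cross-attention."""
    h = add(visual, attention(visual, visual))     # plain self-attention
    h = gated_self_attention(h, grounding, gate)   # spatial grounding is injected here
    return add(h, attention(h, text))              # text attends to already grounded features
```

With `gate` set to 0 the spatial branch contributes nothing, so in this sketch the block reduces to self-attention followed by cross-attention. Any nonzero gate changes the queries that cross-attention uses, which is one way to picture how spatial grounding can come to dominate the text condition.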