Spatially Aware Multimodal Transformers for TextVQA
Yash Kant1
Dhruv Batra1,2
Peter Anderson1,3
Alex Schwing5
Devi Parikh1,2
Jiasen Lu1,4
Harsh Agrawal1
1Georgia Tech
3Google AI
4Allen AI
Published at ECCV, 2020
We construct a spatial graph that encodes the different spatial relationships between pairs of visual entities, and use it to guide the self-attention layers in multimodal transformer architectures.


Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop assistive technology that can read such cues, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer in which each visual entity only looks at neighboring entities defined by a spatial graph. Further, each head in our multi-head self-attention layer focuses on a different subset of relations. Our approach has two advantages: (1) each head considers local context instead of dispersing its attention among all visual entities; (2) we avoid learning redundant features. We show that our model improves the absolute accuracy of current state-of-the-art methods on TextVQA by 2.2% overall over an improved baseline, and by 4.62% on questions that involve spatial reasoning and can be answered correctly using OCR tokens. Similarly, on ST-VQA, we improve the absolute accuracy by 4.2%. We further show that spatially aware self-attention improves visual grounding.


We extend the vanilla self-attention layer to utilize a graph over the input tokens. Instead of looking at the entire global context, an entity attends only to its neighboring entities as defined by a relationship graph. Moreover, different heads consider different types of relations, which encodes different context in each head and avoids learning redundant features. We define a heterogeneous graph over tokens from multiple modalities, which are connected by different edge types. While our framework is general and easily extensible to other tasks, we present our approach for the TextVQA task.
(a) The spatially aware attention layer uses a spatial graph to guide the attention in each head of the self-attention layer. (b) The spatial graph is represented as a stack of adjacency matrices. (c) Each head indexed by $h$ looks at a subset of relationships ${T}^h$ whose size is set by the context size ($c=2$ here), e.g. $\texttt{head}_1$ looks at two types of relations (${T}^1 = \{t_1, t_2\}$).
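The layer described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the spatial graph is given as a boolean stack of adjacency matrices of shape (T, N, N), that head $h$ takes $c$ consecutive relation types (wrapping around), and that every entity may always attend to itself so that no attention row is empty. The function names `head_relation_subsets` and `spatially_aware_attention` are hypothetical.

```python
import numpy as np


def head_relation_subsets(num_heads, num_relations, context):
    """Assumed assignment: head h gets `context` consecutive relation types."""
    return [
        [(h + j) % num_relations for j in range(context)]
        for h in range(num_heads)
    ]


def spatially_aware_attention(q, k, v, adjacency, subsets):
    """Multi-head attention restricted by a spatial graph.

    q, k, v   : float arrays of shape (H, N, d), one slice per head
    adjacency : bool array of shape (T, N, N); adjacency[t, i, j] is True
                when relation type t holds from entity i to entity j
    subsets   : list of H lists of relation indices (one subset per head)
    """
    num_heads, num_entities, dim = q.shape

    # Standard scaled dot-product scores, shape (H, N, N).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)

    # Entity i may attend to j in head h if any relation in that head's
    # subset connects them.
    allowed = np.stack([adjacency[s].any(axis=0) for s in subsets])

    # Assumption: always allow self-attention so each row has an entry.
    allowed |= np.eye(num_entities, dtype=bool)

    # Disallowed pairs get -inf, which becomes zero weight after softmax.
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights
```

With $c=2$ and three relation types, `head_relation_subsets(2, 3, 2)` assigns relations 0 and 1 to the first head and relations 1 and 2 to the second, mirroring the ${T}^1 = \{t_1, t_2\}$ example in the caption (with zero-based indices).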

Pre-recorded Talk delivered at ECCV, 2020

Qualitative Samples

We flip the spatial words in questions from the TextVQA dataset and find that SA-M4C adapts to this change!
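As a rough illustration of what flipping spatial words could look like, the sketch below swaps each spatial word in a question with its opposite. The word pairs in `FLIP` and the function name `flip_spatial_words` are assumptions made for this example. The page does not list the words that were actually flipped.

```python
import re

# Hypothetical word pairs; the actual list used for the samples is not given.
FLIP = {
    "left": "right", "right": "left",
    "top": "bottom", "bottom": "top",
    "above": "below", "below": "above",
}

# Match whole words only, so that e.g. "copyright" is left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(FLIP) + r")\b", re.IGNORECASE)


def flip_spatial_words(question):
    """Replace every spatial word in `question` with its opposite."""
    def swap(match):
        word = match.group(0)
        flipped = FLIP[word.lower()]
        # Preserve a leading capital letter if the original had one.
        return flipped.capitalize() if word[0].isupper() else flipped

    return _PATTERN.sub(swap, question)
```

For example, `flip_spatial_words("what is written on the left sign?")` returns `"what is written on the right sign?"`, and applying the function twice gives back the original question.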

Paper and Bibtex

Kant, Y., Batra, D., Anderson, P., Schwing, A., Parikh, D., Lu, J., & Agrawal, H. 2020. Spatially Aware Multimodal Transformers for TextVQA. In ECCV.

  title={Spatially Aware Multimodal Transformers for TextVQA},
  author={Kant, Yash and Batra, Dhruv and Anderson, Peter 
          and Schwing, Alexander and Parikh, Devi and Lu, Jiasen
          and Agrawal, Harsh},



We thank Abhishek Das and Abhinav Moudgil for their feedback. The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, and Amazon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.
