The EST-VQA dataset differs from other text VQA datasets in one key respect: existing datasets concentrate on the question answering part and almost ignore the OCR part in both the training and the evaluation of the model.


Image Collection

Images in the EST-VQA dataset are collected from publicly available scene text detection and recognition datasets, which consist of everyday scenes in both indoor and outdoor settings.


All images in EST-VQA are annotated with at least one QA pair and a bounding box, where the answer must be text that appears in the image. QA pairs can be cross-language: for example, an English question may ask for the name of a Chinese restaurant, so that the answer is Chinese text, and vice versa for Chinese questions.
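As a rough illustration of the annotation described above, one QA record could be modeled as follows. This is a minimal sketch, not the dataset's actual file format: the class name `QAAnnotation`, its field names, the axis-aligned `(x_min, y_min, x_max, y_max)` box layout, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QAAnnotation:
    """One hypothetical QA annotation; field names are illustrative only."""
    image_id: str
    question: str
    answer: str  # must be text that is visible in the image
    # Assumed layout: (x_min, y_min, x_max, y_max) around the answer text.
    bbox: Tuple[float, float, float, float]

    def __post_init__(self):
        x_min, y_min, x_max, y_max = self.bbox
        if not self.answer:
            raise ValueError("answer must be non-empty")
        if x_max <= x_min or y_max <= y_min:
            raise ValueError("bbox must have positive width and height")


# Made-up cross-language pair: English question, Chinese answer.
example = QAAnnotation(
    image_id="img_0001",
    question="What is the name of the restaurant?",
    answer="老北京炸酱面",
    bbox=(120.0, 45.0, 380.0, 110.0),
)
```

Validating the box at construction time keeps malformed evidence regions out of any later IoU computation.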

Evidence-based Evaluation (EvE) Metric

The Evidence-based Evaluation (EvE) metric requires a VQA model to provide evidence that supports its predicted answer. It first checks the answer and then checks the evidence. The answer is scored with the normalized Levenshtein similarity, and the widely used IoU metric determines whether the evidence is sufficient or insufficient.
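The two-stage check described above can be sketched as a small decision function. The source does not state how the two scores are thresholded or combined, so the function name `eve_check`, both threshold parameters, their default values of 0.5, and the returned labels are assumptions for illustration rather than the official EvE definition. The function takes an already computed answer similarity and evidence IoU as inputs.

```python
def eve_check(answer_similarity, evidence_iou,
              answer_threshold=0.5, iou_threshold=0.5):
    """Two-stage check: answer first, then evidence.

    Both thresholds are hypothetical defaults, not values from the source.
    """
    # Stage 1: is the predicted answer close enough to the ground truth?
    if answer_similarity < answer_threshold:
        return "wrong answer"
    # Stage 2: does the predicted box overlap the ground-truth box enough?
    if evidence_iou >= iou_threshold:
        return "correct answer, sufficient evidence"
    return "correct answer, insufficient evidence"


print(eve_check(0.9, 0.7))
print(eve_check(0.9, 0.2))
print(eve_check(0.1, 0.9))
```

Checking the answer before the evidence mirrors the order given in the text: a box is only judged as evidence once the answer it is meant to support has been accepted.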

Cross Language Challenge

This challenge explores a model’s ability to extract common knowledge shared between languages. The EST-VQA dataset contains questions in both English and Chinese, and QA pairs can be cross-language, e.g., an English question with a Chinese answer.

Localization Challenge

This challenge requires the VQA model to predict, given a question, the spatial location in the image where the answer is most likely to appear. The IoU between the predicted and ground-truth bounding boxes is employed as the performance metric.
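For reference, IoU (intersection over union) for two boxes can be computed as in the sketch below. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the source does not specify the box format, and the function name `iou` is illustrative.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
ground_truth = (1.0, 1.0, 3.0, 3.0)
print(iou(predicted, ground_truth))  # overlap 1, union 7
```

In the example, the boxes overlap in a 1 by 1 square and each has area 4, so the IoU is 1 / (4 + 4 - 1) = 1/7.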

Traditional Challenge

This challenge does not consider the evidence for the predicted answers. The normalized Levenshtein similarity score between the prediction and ground-truth is employed as the metric for this challenge.
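A normalized Levenshtein similarity can be sketched as follows. The source does not give the exact normalization, so this uses one common definition as an assumption: one minus the edit distance divided by the length of the longer string. The function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein_similarity(prediction, ground_truth):
    """1 - distance / max length; assumed normalization, 1.0 means identical."""
    longest = max(len(prediction), len(ground_truth))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(prediction, ground_truth) / longest


print(normalized_levenshtein_similarity("kitten", "sitting"))  # distance 3, length 7
```

Because the distance is computed per character, the same function works for both English and Chinese answers without any tokenization.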