EST-VQA is Bilingual

EST-VQA is the first evidence-based scene text VQA dataset, with English and Chinese question and answer pairs. (CVPR 2020)

Abstract

Visual Question Answering (VQA) methods have made incredible progress, but suffer from a failure to generalize. This is visible in the fact that they are vulnerable to learning coincidental correlations in the data rather than deeper relations between image content and ideas expressed in language. We present a dataset that takes a step towards addressing this problem in that it contains questions expressed in two languages, and an evaluation process that co-opts a well understood image-based metric to reflect the method’s ability to reason. Measuring reasoning directly encourages generalization by penalizing answers that are coincidentally correct. The dataset reflects the scene-text version of the VQA problem, and the reasoning evaluation can be seen as a text-based version of a referring expression challenge. Experiments and analyses are provided that show the value of the dataset.

Requiring VQA models to provide evidence for their decisions encourages the development of generalized approaches that depend on reasoning.

EST-VQA


The EST-VQA dataset stands out among text VQA datasets because existing datasets focus primarily on the question-answering component, while the OCR (text-reading) component is largely ignored in both the training and evaluation of models.

  • 17,047 training images
  • 19,362 training Q&A pairs
  • 4,000 testing images
  • 4,525 testing Q&A pairs

Image Collection

Images in the EST-VQA dataset are collected from publicly available scene text detection and recognition datasets, which comprise daily scenes in both indoor and outdoor settings.

Annotations

All images in EST-VQA are annotated with at least one QA pair and a bounding box, where the answer must be text that appears in the image. QA pairs can be cross-lingual, e.g., an English question may ask the name of a Chinese restaurant, so the answer is Chinese text, and vice versa for a Chinese question.
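For concreteness, a single annotation ties a question to its answer text and the bounding box of the supporting evidence. The sketch below is purely illustrative: the field names and values are hypothetical and do not reflect the official release format.

# Hypothetical illustration of one EST-VQA annotation record (not the official schema).
example_annotation = {
    "image": "scene_000123.jpg",        # image drawn from a public scene-text dataset
    "language": "en",                    # question language: "en" or "zh"
    "question": "What is the name of the restaurant?",
    "answer": "聚福楼",                   # the answer is text that appears in the image
    "answer_box": [412, 85, 637, 148],   # [x_min, y_min, x_max, y_max] evidence region
}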

Evidence-based Evaluation (EvE) Metric

The Evidence-based Evaluation (EvE) metric requires a VQA model to provide evidence supporting its predicted answers. It first checks the answer and then checks the evidence: the answer is scored with the normalized Levenshtein similarity, and the widely used IoU metric determines whether the provided evidence is sufficient or insufficient.
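As a rough illustration of how such a check can be implemented, the Python sketch below pairs a normalized Levenshtein similarity for the answer with an IoU test for the evidence box. The function names and threshold values are assumptions for illustration and are not taken from the official evaluation code.

# Minimal sketch of an EvE-style evaluation (not the official script).

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def normalized_similarity(pred: str, gt: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]; 1.0 is an exact match."""
    if not pred and not gt:
        return 1.0
    return 1.0 - levenshtein(pred, gt) / max(len(pred), len(gt))

def iou(box_a, box_b) -> float:
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def eve_score(pred_answer, pred_box, gt_answer, gt_box,
              sim_thresh=0.5, iou_thresh=0.5) -> float:
    """Credit an answer only when its supporting evidence is sufficient.
    Thresholds here are placeholders, not the paper's official values."""
    sim = normalized_similarity(pred_answer, gt_answer)   # step 1: answer check
    if sim < sim_thresh:
        return 0.0
    if iou(pred_box, gt_box) < iou_thresh:                # step 2: evidence check
        return 0.0
    return sim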

Cross Language Challenge

This challenge explores a model's ability to extract common knowledge across different languages. The EST-VQA dataset contains questions in both English and Chinese, and QA pairs can be cross-lingual, e.g., an English question with a Chinese answer.

Localization Challenge

This challenge requires the VQA model to predict the spatial location in the image where the answer is most likely to appear, given a question. The IoU between the predicted and ground-truth bounding boxes is used as the performance metric.

Traditional Challenge

This challenge does not consider evidence for the predicted answers. The normalized Levenshtein similarity score between the prediction and the ground truth is used as the metric for this challenge.

News

  • January 2021 – After careful checking and re-annotation, the EST-VQA dataset is now available for download and evaluation.
  • April 2020 – Website created.
  • March 2020 – Paper accepted to CVPR 2020.

Citation

@inproceedings{wang2020general,
    title={On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering},
    author={Wang, Xinyu and Liu, Yuliang and Shen, Chunhua and Ng, Chun Chet and Luo, Canjie and Jin, Lianwen 
    and Chan, Chee Seng and Hengel, Anton van den and Wang, Liangwei},
    booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2020}
}

Our Team


This is a collaborative project between researchers from the University of Adelaide, South China University of Technology, and the University of Malaya.

  • Xinyu Wang, University of Adelaide
  • Yuliang Liu, South China University of Technology
  • Chunhua Shen, University of Adelaide
  • Chun Chet Ng, University of Malaya
  • Canjie Luo, South China University of Technology
  • Lianwen Jin, South China University of Technology
  • Chee Seng Chan, University of Malaya
  • Anton van den Hengel, University of Adelaide

For any questions or suggestions about the EST-VQA dataset, please contact Xinyu Wang at xinyu.wang02@adelaide.edu.au.