EST-VQA is Bilingual

EST-VQA is the first scene text VQA dataset with evidence, built on English and Chinese question-and-answer pairs. (CVPR 2020)

Abstract

Visual Question Answering (VQA) methods have made incredible progress, but suffer from a failure to generalize. This is visible in the fact that they are vulnerable to learning coincidental correlations in the data rather than deeper relations between image content and ideas expressed in language. We present a dataset that takes a step towards addressing this problem in that it contains questions expressed in two languages, and an evaluation process that co-opts a well understood image-based metric to reflect the method’s ability to reason. Measuring reasoning directly encourages generalization by penalizing answers that are coincidentally correct. The dataset reflects the scene-text version of the VQA problem, and the reasoning evaluation can be seen as a text-based version of a referring expression challenge. Experiments and analyses are provided that show the value of the dataset.

Requiring VQA models to provide evidence for their decisions encourages the development of generalized approaches that depend on reasoning.

EST-VQA


The EST-VQA dataset stands out among text VQA datasets because existing datasets focus largely on the question-answering component, while the OCR component is almost ignored in both the training and evaluation of models.

Training images: 20,757

Training Q&A pairs: 23,144

Testing images: 4,482

Testing Q&A pairs: 5,014

Image Collection

Images in the EST-VQA dataset are collected from publicly available scene text detection and recognition datasets and comprise daily scenes covering both indoor and outdoor settings.

Annotations

All images in EST-VQA are annotated with at least one QA pair and a bounding box, and the answer must be text that appears in the image. QA pairs can also be cross-lingual: for example, an English question may ask for the name of a Chinese restaurant, so that the answer is Chinese text, and vice versa for Chinese questions.
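As a rough illustration, a single annotation record might take a shape like the one sketched below. The field names and values are assumptions made for illustration only; the official release format is not described on this page.

# Hypothetical shape of one EST-VQA annotation record (Python).
# All field names below are illustrative assumptions, not the official schema.
example_annotation = {
    "image": "scene_00123.jpg",                          # image drawn from a public scene-text dataset
    "question": "What is the name of this restaurant?",  # English question
    "answer": "老四川",                                   # answer is text appearing in the image (here, Chinese)
    "answer_bbox": [412, 158, 655, 214],                 # evidence box, assumed (x1, y1, x2, y2) convention
    "question_language": "en",                           # language of the question
}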

Evidence-based Evaluation (EvE) Metric

The Evidence-based Evaluation (EvE) metric requires a VQA model to provide evidence to support its predicted answers. It first checks the answer and then checks the evidence: for the former, we use the normalized Levenshtein similarity score between the prediction and the ground truth, and for the latter, we adopt the widely used IoU metric to determine whether the provided evidence is sufficient or insufficient.
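The sketch below shows one way such an evidence-aware score could be computed. The helper names, the 0.5 thresholds, and the exact way the answer check and the evidence check are combined are assumptions for illustration; this is not the official evaluation code.

def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n]

def normalized_levenshtein_similarity(pred: str, gt: str) -> float:
    """1 - edit_distance / max_length, in [0, 1]."""
    if not pred and not gt:
        return 1.0
    return 1.0 - levenshtein_distance(pred, gt) / max(len(pred), len(gt))

def iou(box_a, box_b) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def eve_score(pred_answer, pred_box, gt_answer, gt_box,
              sim_thresh=0.5, iou_thresh=0.5) -> float:
    """Returns 0 unless both the answer check and the evidence check pass."""
    sim = normalized_levenshtein_similarity(pred_answer, gt_answer)
    if sim < sim_thresh or iou(pred_box, gt_box) < iou_thresh:
        return 0.0
    return sim

Under this kind of check, a prediction whose text matches the ground truth exactly but whose evidence box barely overlaps the annotated region receives no credit, so answers that are textually correct but not grounded in the right region are penalized.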

Cross Language Challenge

This challenge explores a model's ability to extract common knowledge across different languages. Under this challenge, candidates are requested to submit results predicted by both a monolingual and a bilingual model built on an identical framework. EvE will be used as the primary metric.

Localization Challenge

This challenge requires the VQA model to provide the spatial location where the answer is most likely to appear in the image, given the question. The IoU between the predicted and ground-truth bounding boxes is employed as the performance metric.

Traditional Challenge

This challenge does not consider the evidence for the predicted answers. The normalized Levenshtein similarity score between the prediction and the ground truth is employed as the metric for this challenge.

News


The EST-VQA dataset will be released here soon. Check out the timeline of our work.

Citation

@inproceedings{wang2020general,
    title={On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering},
    author={Wang, Xinyu and Liu, Yuliang and Shen, Chunhua and Ng, Chun Chet and Luo, Canjie and Jin, Lianwen 
    and Chan, Chee Seng and Hengel, Anton van den and Wang, Liangwei},
    booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2020}
}

Our Team


This is a collaborative project between researchers from the University of Adelaide, the South China University of Technology, the University of Malaya, and Huawei Noah's Ark Lab.

Xinyu Wang
University of Adelaide

Yuliang Liu
South China University of Technology

Chunhua Shen
University of Adelaide

Chun Chet Ng
University of Malaya

Canjie Luo
South China University of Technology

Lianwen Jin
South China University of Technology

Chee Seng Chan
University of Malaya

Anton van den Hengel
University of Adelaide

Xinyu Wang and Yuliang Liu contributed equally. Yuliang Liu's contribution was made while visiting the University of Adelaide.

Professor Chunhua Shen is the corresponding author, reach him via e-mail at chunhua.shen@adelaide.edu.au.

Get In Touch


Feel free to drop us a message if you have any questions, suggestions, or feedback about our work.

You can reach us through:

Email Address

enquiry@est-vqa.org