The 1st Video to Language Challenge




Challenge


To further motivate and challenge the academic and industrial research communities, Microsoft will release Microsoft Research - Video to Text (MSR-VTT), a large-scale video benchmark for bridging video and language, to the public. The dataset contains 41.2 hours of video and 200K clip-sentence pairs in total, covering a comprehensive set of categories and diverse visual content, and is the largest dataset of its kind in terms of the number of sentences and vocabulary size. The dataset can be used to train and evaluate video to language systems and, in the near future, other tasks as well (e.g., video retrieval, event detection, and video categorization). This challenge allows external data to be used to train and tune the parameters of algorithms. Therefore, each submission must explicitly state the kind of external data used in the submission file.



Task Description


This year we will focus on the video to language task. Given an input video clip, the goal is to automatically generate a complete and natural sentence that describes the video content, ideally capturing its most informative dynamics.

Contestants are asked to develop a video to language system based on the MSR-VTT dataset provided by the Challenge (as training data) and on any other public or private data, in order to recognize a wide range of objects, scenes, events, etc., in images and videos. For evaluation purposes, a contesting system must produce at least one sentence for each of the test videos. Accuracy will be evaluated against human-generated reference sentence(s) during the evaluation stage.



Important Dates


· April 15, 2016: Dataset available for download (training and validation set)

· May 31, 2016: Test set available for download

· June 12, 2016: Results and one-page notebook paper submission

· June 13 ~ June 25, 2016: Objective evaluation and human evaluation

· June 30, 2016: Evaluation results announced

· July 6, 2016: Paper submission deadline (please follow the instructions on the main conference website)



Submission File


To enter the competition, you need to create an account on the Evaluation Server. This account allows you to upload your results to the server. Each team is allowed to submit the results of at most three runs and must select one run as the primary run of the submission, which will be used for performance comparison across teams. Each run must be formatted as a JSON file as follows:

{
  "version": "VERSION 1.2",
  "result":[
  {
    "video_id": "video2218",
    "caption": "a panel of people talk about things"
  },
  ...
  {
    "video_id": "video1835",
    "caption": "a person is playing the piano"
  }
  ],
  "external_data":{
    "used": "true", # Boolean flag. True indicates used of external data.
    "details": "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set" # String with details of your external data.
  }
}


Note: the inline comments (after the # characters) are illustrative and only serve to provide detailed explanations. Please omit them from your submissions, as they are not valid JSON.

To help clarify the format of the submission file, a sample submission can be seen here. Participants should strictly follow this submission format.
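As a quick sanity check before uploading, a run file can be validated against the schema shown above. The following is a minimal, illustrative sketch only (it is not the official validator), and the file name result1(primary).json is just an example:

# Minimal format check for one run file, based on the schema shown above.
# Illustrative sketch only, not the official validator.
import json

def check_run(path):
    with open(path) as f:
        run = json.load(f)
    # Top-level keys required by the submission format.
    assert {"version", "result", "external_data"} <= set(run)
    for item in run["result"]:
        assert set(item) == {"video_id", "caption"}, item
        assert isinstance(item["caption"], str) and item["caption"].strip()
    ext = run["external_data"]
    assert {"used", "details"} <= set(ext)
    print(f"{path}: {len(run['result'])} captions, external data used: {ext['used']}")

check_run("result1(primary).json")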


All results should be zipped into a single file named result.zip. Within the zipped archive, results from different runs should be placed in separate files, and one of them should be marked as the primary run in its file name (e.g., result1(primary).json, result2.json, result3.json).
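For reference, the snippet below sketches one way to write the run files and package them into result.zip. It is only a sketch under the assumptions stated in its comments; the caption dictionaries are made-up placeholders, and a real run must cover every test video.

import json
import zipfile

def write_run(path, captions, used_external, details):
    """Write one run in the submission format; captions maps video_id -> sentence."""
    run = {
        "version": "VERSION 1.2",
        "result": [{"video_id": vid, "caption": cap} for vid, cap in captions.items()],
        "external_data": {"used": str(used_external).lower(), "details": details},
    }
    with open(path, "w") as f:
        json.dump(run, f, indent=2)

# Hypothetical captions used for all three runs in this sketch.
captions = {"video2218": "a panel of people talk about things",
            "video1835": "a person is playing the piano"}
runs = {"result1(primary).json": captions,   # the primary run
        "result2.json": captions,
        "result3.json": captions}

for path, caps in runs.items():
    write_run(path, caps, used_external=True,
              details="First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set")

with zipfile.ZipFile("result.zip", "w") as zf:
    for path in runs:
        zf.write(path)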



Evaluation Metric


The evaluation code provided here can be used to obtain results on the MSR-VTT test set. It computes several common metrics: BLEU@4, METEOR, ROUGE-L, and CIDEr-D.
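For offline development, these metrics can be approximated with the open-source MS COCO caption evaluation code (the pycocoevalcap package); whether the official evaluation server runs exactly this code is not specified here, and the file names below (references.json, result1(primary).json) are hypothetical.

import json

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor   # calls out to a Java jar
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def load_run(path):
    """Map video_id -> [candidate caption] from a submission run file."""
    with open(path) as f:
        run = json.load(f)
    return {item["video_id"]: [item["caption"]] for item in run["result"]}

# references.json is assumed to map video_id -> list of reference sentences.
# The official pipeline tokenizes sentences (PTBTokenizer) first; skipping that
# step changes the numbers slightly.
with open("references.json") as f:
    gts = json.load(f)
res = load_run("result1(primary).json")

for name, scorer in [("BLEU@4", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE-L", Rouge()), ("CIDEr-D", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    if isinstance(score, list):   # Bleu(4) returns BLEU@1..BLEU@4
        score = score[-1]
    print(f"{name}: {score:.4f}")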

In addition, we will carry out a human evaluation of the submitted systems on a subset of the test set. Human judges are asked to rate the generated sentences of each team's primary run, together with a reference sentence, from 1 to 5 (lower is better) with respect to the following criteria.

    · Coherence: how logical and readable is the sentence?

    · Relevance: does the sentence capture the most relevant and important objects/actions/events in the video clip?

    · Helpful for Blind (additional criterion): how helpful would the sentence be for a blind person trying to understand what is happening in this video clip?



Ranking


The ranking for this year's competition is based on both objective evaluation and human evaluation. For the objective evaluation, a ranked list of teams is produced by sorting their scores on each metric separately. The final rank of a team combines its positions across the four ranked lists and is defined as:

R(team) = R(team)@BLEU@4 + R(team)@METEOR + R(team)@ROUGE-L + R(team)@CIDEr-D,

where R(team)@metric is the team's rank position on that metric; for example, if a team achieves the best performance in terms of BLEU@4, then R(team)@BLEU@4 is 1. The smaller the final value of R(team), the better the performance.

In a similar spirit, we will linearly fuse each team's human-evaluation scores on Coherence, Relevance, and Helpful for Blind (each on a scale of 1 to 5). The final score of each team is given by:

S(team) = S(team)@Coherence + S(team)@Relevance + S(team)@Helpful for Blind.

The larger the score, the better the performance.

Finally, all participants will be ranked in two separate lists, one in terms of R(team) and the other in terms of S(team).
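The fusion described above can be made concrete with a short sketch. The team names and scores below are made up for illustration, and how ties are broken is not specified here.

OBJECTIVE_METRICS = ["BLEU@4", "METEOR", "ROUGE-L", "CIDEr-D"]
HUMAN_CRITERIA = ["Coherence", "Relevance", "Helpful for Blind"]

def objective_rank(scores):
    """scores: {team: {metric: value}} -> {team: R(team)}; smaller is better."""
    total = {team: 0 for team in scores}
    for metric in OBJECTIVE_METRICS:
        # Sort teams by this metric; the best score gets rank position 1.
        ordered = sorted(scores, key=lambda t: scores[t][metric], reverse=True)
        for position, team in enumerate(ordered, start=1):
            total[team] += position
    return total

def human_score(scores):
    """scores: {team: {criterion: value in 1..5}} -> {team: S(team)}; larger is better."""
    return {team: sum(scores[team][c] for c in HUMAN_CRITERIA) for team in scores}

# Hypothetical objective scores for two teams.
obj = {"teamA": {"BLEU@4": 0.39, "METEOR": 0.27, "ROUGE-L": 0.59, "CIDEr-D": 0.44},
       "teamB": {"BLEU@4": 0.41, "METEOR": 0.26, "ROUGE-L": 0.60, "CIDEr-D": 0.42}}
print(objective_rank(obj))   # {'teamA': 6, 'teamB': 6} -> a tie in this toy example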



Participation


The Challenge is a team-based contest. Each team can have one or more members, and an individual cannot be a member of multiple teams.

At the end of the Challenge, all teams will be ranked based on both the objective and human evaluations described above. The top three teams will receive award certificates and/or cash prizes (prize amounts TBD). In addition, all accepted submissions qualify for the conference's grand challenge award competition.



Citations


If you are reproducing or comparing against the baselines run on our MSR-VTT dataset, please refer to this supplementary material and the updated performance numbers reported there. Please cite our CVPR paper if you use MSR-VTT as your dataset.

The references are as follows:

@inproceedings{Xu:CVPR16,
  title     = {{MSR-VTT}: A Large Video Description Dataset for Bridging Video and Language},
  author    = {Jun Xu and Tao Mei and Ting Yao and Yong Rui},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2016}
}

@inproceedings{Pan:CVPR16,
  title     = {Jointly Modeling Embedding and Translation to Bridge Video and Language},
  author    = {Yingwei Pan and Tao Mei and Ting Yao and Houqiang Li and Yong Rui},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2016}
}