The 2nd Video to Language Challenge


To further motivate and challenge the academic and industrial research community, Microsoft will release Microsoft Research - Video to Text (MSR-VTT), a large-scale video benchmark for bridging video and language, to the public. The dataset contains about 50 hours of video and 260K clip-sentence pairs in total, covering comprehensive categories and diverse visual content, and represents the largest dataset of its kind in terms of sentences and vocabulary. The dataset can be used to train and evaluate video-to-language systems and, in the near future, other tasks as well (e.g., video retrieval, event detection, video categorization). The challenge allows the use of external data to train models and tune algorithm parameters; therefore, each submission must explicitly state the external data used in the submission file.

Task Description

This year we will focus on the video-to-language task. Given an input video clip, the goal is to automatically generate a complete and natural sentence describing the video content, ideally capturing its most informative dynamics.

Contestants are asked to develop a video-to-language system based on the MSR-VTT dataset provided by the Challenge (as training data), optionally supplemented by any other public or private data, to recognize a wide range of objects, scenes, events, etc., in images and videos. For evaluation purposes, a contesting system must produce at least one sentence for each test video. Accuracy will be evaluated against human-generated reference sentences during the evaluation stage.

Important Dates

· April 18, 2017: Dataset available for download (training set)

· June 1, 2017: Test set available for download

· June 15, 2017: Results and one-page notebook paper submission

· June 16 ~ June 28, 2017: Objective evaluation and human evaluation

· July 3, 2017: Evaluation results announced

· July 14, 2017: Paper submission deadline (please follow the instructions on the main conference website)

Submission File

To enter the competition, you need to create an account on the Evaluation Server. This account allows you to upload your results to the server. Each team may submit the results of at most three runs and must select one run as the primary run of the submission, which will be used for performance comparison across teams. Each run must be formatted as a JSON file as follows:

  "version": "VERSION 1.2",
    "video_id": "video2218",
    "caption": "a panel of people talk about things"
    "video_id": "video1835",
    "caption": "a person is playing the piano"
    "used": "true", # Boolean flag. True indicates used of external data.
    "details": "First fully-connected layer from VGG-16 pre-trained on ILSVRC-2012 training set" # String with details of your external data.


Note: the comments after # are illustrative and provide inline explanations. Please omit them from your submissions.

To help with understanding the format of the submission file, a sample submission can be seen here. Participants should strictly follow this format.

All results should be zipped into a single file. Within the zipped folder, results from different runs should be placed in separate files, and one of them should be marked as the primary run in its file name (e.g., result1(primary).json, result2.json, result3.json).
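As a convenience, the submission file described above can be assembled programmatically. The sketch below, using only the Python standard library, builds a run dictionary and serializes it to JSON; the key layout mirrors the sample above, and the helper name build_submission is illustrative, not part of any official tooling.

```python
import json

def build_submission(captions, external_details=None):
    """Assemble one run in the submission format shown above.

    captions: dict mapping video_id -> generated caption.
    external_details: description of external data, or None if unused.
    """
    return {
        "version": "VERSION 1.2",
        "result": [
            {"video_id": vid, "caption": cap}
            for vid, cap in sorted(captions.items())
        ],
        "external_data": {
            # The sample encodes the flag as the string "true"/"false".
            "used": "true" if external_details else "false",
            "details": external_details or "",
        },
    }

run = build_submission(
    {"video2218": "a panel of people talk about things",
     "video1835": "a person is playing the piano"},
    external_details="First fully-connected layer from VGG-16 "
                     "pre-trained on ILSVRC-2012 training set",
)
print(json.dumps(run, indent=2))
```

Writing each of the (up to three) runs with json.dump to its own file, then zipping them together, yields an archive in the expected layout.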

Every team is also required to upload a one-page notebook paper that briefly describes its system. The paper should follow the ACM proceedings style.

Evaluation Metric

The evaluation provided here can be used to obtain results on the testing set of MSR-VTT. It computes multiple common metrics, including BLEU@4, METEOR, ROUGE-L, and CIDEr.
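To make the objective metrics concrete, here is a stdlib-only sketch of sentence-level BLEU@4 (clipped n-gram precision with a brevity penalty). The official evaluation uses corpus-level scripts with smoothing (e.g., the MS COCO caption evaluation code), so the numbers here are illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, references):
    """Sentence-level BLEU@4: geometric mean of clipped 1- to 4-gram
    precisions, times a brevity penalty against the closest-length
    reference."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, 5):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            return 0.0
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        if clipped == 0:
            return 0.0
        log_prec += 0.25 * math.log(clipped / sum(cand_ngrams.values()))
    # Brevity penalty against the reference closest in length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(log_prec)

print(bleu4("a person is playing the piano",
            ["a person is playing the piano"]))  # identical sentence -> 1.0
```

METEOR, ROUGE-L, and CIDEr follow the same candidate-vs-references pattern but score word alignment, longest common subsequence, and TF-IDF-weighted n-gram consensus, respectively.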

In addition, we will carry out a human evaluation of the submitted systems on a subset of the testing set. Human evaluators will be asked to grade the generated sentences of the primary run from each team, together with a reference sentence, on a scale of 1 to 5 (higher is better) with respect to the following criteria:

    · Coherence: the logic and readability of the sentence.

    · Relevance: whether the sentence contains the most relevant and important objects/actions/events in the video clip.

    · Helpfulness for the blind (additional criterion): how helpful the sentence would be for a blind person to understand what is happening in the video clip.


The ranking for this year's competition is based on both objective evaluation and human evaluation. For the objective evaluation, a ranked list of teams is produced by sorting their scores on each metric. The final rank of a team combines its positions in the four ranked lists and is defined as:

R(team) = R(team)@BLEU@4 + R(team)@METEOR + R(team)@ROUGE-L + R(team)@CIDEr.

where R(team)@metric is the rank position of the team under that metric; e.g., if a team achieves the best BLEU@4 performance, then R(team)@BLEU@4 is 1. The smaller the final value of R(team), the better the performance.
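The rank fusion above can be sketched in a few lines of Python. The scores below are made-up placeholder values, and tie handling is not specified by the challenge description, so Python's stable sort decides ties here.

```python
def metric_ranks(scores):
    """scores: dict team -> metric value (higher is better).
    Returns dict team -> rank position (1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {team: i + 1 for i, team in enumerate(ordered)}

def final_rank_scores(per_metric_scores):
    """per_metric_scores: dict metric -> {team: score}.
    R(team) = sum of the team's rank positions over all metrics;
    smaller is better."""
    totals = {}
    for scores in per_metric_scores.values():
        for team, rank in metric_ranks(scores).items():
            totals[team] = totals.get(team, 0) + rank
    return totals

# Hypothetical scores for three teams on the four objective metrics.
scores = {
    "BLEU@4":  {"A": 0.40, "B": 0.38, "C": 0.35},
    "METEOR":  {"A": 0.28, "B": 0.29, "C": 0.25},
    "ROUGE-L": {"A": 0.62, "B": 0.61, "C": 0.58},
    "CIDEr":   {"A": 0.45, "B": 0.44, "C": 0.40},
}
print(final_rank_scores(scores))  # A: 1+2+1+1 = 5, B: 7, C: 12
```

The human-evaluation score S(team) is fused the same way, except raw 1-to-5 grades are summed directly instead of rank positions, so there larger is better.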

Similarly, we will linearly fuse the human-evaluation scores on Coherence, Relevance, and Helpfulness for the Blind (each on a scale of 1 to 5) for each team. The final score of each team is given by:

S(team) = S(team)@Coherence + S(team)@Relevance + S(team)@Helpful for Blind.

The larger the score, the better the performance.

We will finally rank all participants in two separate lists, one by R(team) and the other by S(team).


The Challenge is a team-based contest. Each team can have one or more members, and an individual cannot be a member of multiple teams.

At the end of the Challenge, all teams will be ranked based on both objective evaluation and human evaluation described above. The top three teams will receive award certificates. At the same time, all accepted submissions are qualified for ACM MM 2017 Challenge award competition.


If you intend to publish results that use the data and resources provided by this challenge, please include the following references:

@inproceedings{xu2016msrvtt,
  title={MSR-VTT: A Large Video Description Dataset for Bridging Video and Language},
  author={Jun Xu and Tao Mei and Ting Yao and Yong Rui},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2016}
}

@inproceedings{pan2016jointly,
  title={Jointly Modeling Embedding and Translation to Bridge Video and Language},
  author={Yingwei Pan and Tao Mei and Ting Yao and Houqiang Li and Yong Rui},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2016}
}