In the 2nd MSR Video to Language Challenge, we have combined the training, validation, and test sets from the 1st MSR Video to Language Challenge into the new training data. An additional test set of around 3K video clips will be released on June 1st as the final evaluation set. As such, we have 10K video clips for training and 3K video clips for testing this year. Each video is annotated with 20 natural-language sentences.

* In the MSR-VTT dataset, we provide category information for each video clip, and each clip includes its audio track as well.

All video information and caption sentences are stored in a JSON file with the following format:

  "info": {
    "year": str,
    "version": str,
    "description": str,
    "contributor": str,
    "data_created": str
  },
  "videos": {
    "id": int,
    "video_id": str,
    "category": int,
    "url": str,
    "start time": float,
    "end time": float,
    "split": str
  },
  "sentences": {
    "sen_id": int,
    "video_id": str,
    "caption": str
  }
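
As an illustration of how this format might be consumed, the following minimal Python sketch groups captions by `video_id` and selects videos by their `split` field. It assumes that `"videos"` and `"sentences"` each hold a list of records with the fields shown above (the listing gives the fields of a single record). The helper names, the sample records, and the split value `"train"` are hypothetical and chosen only for the example.

```python
import json
from collections import defaultdict


def load_annotations(path):
    """Read the annotation JSON file; `path` is whatever the download is saved as."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def group_captions(annotations):
    """Map each video_id to the list of its caption strings, in file order."""
    captions = defaultdict(list)
    # Assumption: "sentences" is a list of {"sen_id", "video_id", "caption"} records.
    for sentence in annotations["sentences"]:
        captions[sentence["video_id"]].append(sentence["caption"])
    return dict(captions)


def videos_in_split(annotations, split):
    """Return the video_id of every video whose "split" field equals `split`."""
    # Assumption: "videos" is a list of records with the fields listed above.
    return [v["video_id"] for v in annotations["videos"] if v["split"] == split]


# Hypothetical in-memory sample that mimics the documented fields.
sample = {
    "videos": [
        {"id": 0, "video_id": "video0", "category": 9, "url": "",
         "start time": 0.0, "end time": 10.0, "split": "train"},
        {"id": 1, "video_id": "video1", "category": 3, "url": "",
         "start time": 5.0, "end time": 20.0, "split": "test"},
    ],
    "sentences": [
        {"sen_id": 0, "video_id": "video0", "caption": "a man is driving a car"},
        {"sen_id": 1, "video_id": "video1", "caption": "a woman is singing"},
        {"sen_id": 2, "video_id": "video0", "caption": "a car drives down a road"},
    ],
}

captions_by_video = group_captions(sample)
train_ids = videos_in_split(sample, "train")
```

With a real download, `load_annotations` would be called on the saved file and its result passed to the same two helpers. Note that the `"start time"` and `"end time"` keys contain a space, so they must be accessed with that exact spelling.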


You can download training video URLs and their associated sentences here.

[New] The test video URLs are available here. For more details, please refer to Prof. Xinmei Tian's Lab Page.

Note that during the competition, the testing data will be released ONLY to participants who have registered for the challenge. After the challenge completes, we will make the testing data publicly available to the whole research community.