The dataset is based on MSR-VTT, and we split the data 65%:30%:5% into training, testing, and validation sets, respectively.

The table below shows the statistics of the MSR-VTT dataset.

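| Dataset | Context | Sentence source | #Video | #Clip | #Sentence | #Word | Vocabulary | Duration (hr) |
|---------|---------|-----------------|--------|-------|-----------|-------|------------|---------------|
| MSR-VTT | 20 categories | AMT workers | 7,180 | 10,000 | 200,000 | 1,856,523 | 29,316 | 41.2 |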

* In the MSR-VTT dataset, we provide category information for each video clip, and each clip also includes its audio track.
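As a quick sanity check on the table above, the following sketch derives a few per-clip averages and the nominal split sizes implied by the 65%:30%:5% ratio. The variable names are illustrative only, and the split sizes are computed from the ratio, so the counts in the released files may not match them exactly.

```python
# Figures copied from the statistics table above.
num_clips = 10_000
num_sentences = 200_000
num_words = 1_856_523
duration_hours = 41.2

# Simple averages derived from the table.
sentences_per_clip = num_sentences / num_clips
words_per_sentence = num_words / num_sentences
seconds_per_clip = duration_hours * 3600 / num_clips

# Nominal split sizes implied by the stated 65%:30%:5% ratio.
# Assumption: the ratio is applied to clips; released counts may differ slightly.
split_ratios = {"train": 0.65, "test": 0.30, "validation": 0.05}
nominal_split_sizes = {
    name: round(num_clips * ratio) for name, ratio in split_ratios.items()
}

print(f"sentences per clip: {sentences_per_clip:.1f}")
print(f"words per sentence: {words_per_sentence:.2f}")
print(f"seconds per clip:   {seconds_per_clip:.1f}")
print(f"nominal split sizes: {nominal_split_sizes}")
```

This works out to 20 sentences per clip, roughly 9.3 words per sentence, and just under 15 seconds per clip on average.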

All video information and caption sentences are stored in a single JSON file with the following format:

{
  "info": {
    "year": str,
    "version": str,
    "description": str,
    "contributor": str,
    "data_created": str
  },
  "videos": {
    "id": int,
    "video_id": str,
    "category": int,
    "url": str,
    "start time": float,
    "end time": float,
    "split": str
  },
  "sentences": {
    "sen_id": int,
    "video_id": str,
    "caption": str
  }
}
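The schema above can be consumed with a few lines of Python. The following is a minimal sketch, not an official loader: the helper names `as_records` and `group_captions` are hypothetical, the sample values (such as `"video0"` and the split name `"train"`) are made up for illustration, and the code accepts either a single object or a list of objects under `"videos"` and `"sentences"` because the schema does not say which form the file uses.

```python
import json
from collections import defaultdict


def as_records(value):
    """Accept either a single object or a list of objects."""
    return value if isinstance(value, list) else [value]


def group_captions(annotation):
    """Group captions by video_id and video ids by split.

    `annotation` is the parsed JSON document described above.
    """
    captions_by_video = defaultdict(list)
    for sentence in as_records(annotation["sentences"]):
        captions_by_video[sentence["video_id"]].append(sentence["caption"])

    video_ids_by_split = defaultdict(list)
    for video in as_records(annotation["videos"]):
        video_ids_by_split[video["split"]].append(video["video_id"])

    return dict(captions_by_video), dict(video_ids_by_split)


# Tiny in-memory example; every value here is invented for illustration.
sample = json.loads("""
{
  "info": {"year": "", "version": "", "description": "",
           "contributor": "", "data_created": ""},
  "videos": [
    {"id": 0, "video_id": "video0", "category": 3, "url": "",
     "start time": 0.0, "end time": 10.0, "split": "train"}
  ],
  "sentences": [
    {"sen_id": 0, "video_id": "video0", "caption": "a person is cooking"},
    {"sen_id": 1, "video_id": "video0", "caption": "someone prepares food"}
  ]
}
""")

captions, splits = group_captions(sample)
print(captions["video0"])
print(splits)
```

To use it on the real annotation file, replace the `json.loads(...)` call with `json.load(open(path))`, where `path` points to wherever the downloaded file was saved.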


You can download the video URLs and their associated sentences here. The test data (including test annotations) is available here. For more details, please refer to Prof. Xinmei Tian's Lab Page.

Note that during the competition, the testing data will ONLY be released to participants who have registered for the challenge. Once the challenge is complete, we will make the testing data publicly available to the whole research community.