In the first track, pre-training for video captioning, we have released the ACTION dataset for vision-language pre-training. The table below shows the statistics of the ACTION dataset:
|ACTION||multi-category||Automatic crawling from web||213,078||224,989||2,291,565||50,039|
To formalize the task of pre-training for video captioning, we provide three datasets to the participants:
• A pre-training dataset of ~220K GIF videos in ACTION. Each GIF video is paired with one caption.
• A training dataset of ~9.5K videos in MSR-VTT. Each video is annotated with 20 captions.
• A validation dataset of ~0.5K videos in MSR-VTT. Each video is annotated with 20 captions.
In addition to the datasets above, we will include a testing set.
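As a rough illustration of the scale of the three captioning splits listed above, the following minimal sketch multiplies the approximate video counts by the number of captions per video. The class name `Split`, the list `CAPTIONING_SPLITS`, and the helper `caption_totals` are hypothetical names introduced for this example only. The video counts are the rounded figures quoted above, so the totals are estimates rather than official statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """One dataset split (hypothetical structure, for illustration only)."""
    name: str
    source: str
    approx_videos: int        # rounded figure from the text, not an exact count
    captions_per_video: int

    @property
    def approx_captions(self) -> int:
        # Estimated number of captions = videos x captions per video.
        return self.approx_videos * self.captions_per_video


# Rounded sizes as stated in the track description (assumed approximate).
CAPTIONING_SPLITS = [
    Split("pre-training", "ACTION", 220_000, 1),
    Split("training", "MSR-VTT", 9_500, 20),
    Split("validation", "MSR-VTT", 500, 20),
]


def caption_totals(splits):
    """Map each split name to its estimated caption count."""
    return {s.name: s.approx_captions for s in splits}


if __name__ == "__main__":
    for name, total in caption_totals(CAPTIONING_SPLITS).items():
        print(f"{name}: ~{total:,} captions")
```

Under these rounded figures, the MSR-VTT training split contributes roughly as many captions as the ACTION pre-training split, even though it contains far fewer videos, because each MSR-VTT video carries 20 captions.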
In the second track, pre-training for video categorization, we are finalizing the pre-training video dataset (the Weakly-Supervised dataset). The statistics of the Weakly-Supervised dataset are as follows:
|the Weakly-Supervised dataset||multi-faceted||Automatic crawling from web||√||2,015||2,958,092|
To formalize the task of pre-training for video categorization, we provide four datasets to the participants:
• A pre-training dataset of ~3M videos in the Weakly-Supervised dataset.
• A training dataset of ~50K videos in Downstream.
• A validation dataset of ~25K videos in Downstream.
• A testing dataset of ~30K videos in Downstream.