The Leaderboard is available. Congratulations to all top performers!

Rank  Team         Organization                       BLEU@4  METEOR  CIDEr-D  ROUGE-L
1     RUC+CMU_V2T  RUC & CMU                          0.390   0.255   0.315    0.542
3     NII          National Institute of Informatics  0.359   0.234   0.231    0.514
4     MIC_TJU      Tongji University                  0.351   0.226   0.236    0.509
5     Illusion     IIT Delhi                          0.304   0.213   0.206    0.494
6     LVIC_AS      CEA LIST                           0.289   0.203   0.175    0.487
7     TJU-NUS      TJU & NUS                          0.265   0.191   0.151    0.456

Rank  Team         Organization                       C1     C2     C3
1     RUC+CMU_V2T  RUC & CMU                          4.437  3.437  3.567
2     NII          National Institute of Informatics  4.078  3.359  3.570
4     MIC_TJU      Tongji University                  3.844  2.789  2.978
4     Illusion     IIT Delhi                          4.042  2.583  2.921
6     TJU-NUS      TJU & NUS                          3.762  2.364  2.376
8     LVIC_AS      CEA LIST                           3.477  2.322  2.321


We computed multiple common metrics, including BLEU@4, METEOR, ROUGE-L, and CIDEr-D. The performance of each team's primary run is used for comparison across teams. The results of all runs can be downloaded here.

In addition, we carried out a human evaluation of the systems submitted to this challenge on a subset of the testing set. Human judges were asked to rank the generated sentences of the primary run from each team, together with a reference sentence, from 1 to 5 (lower is better) with respect to the following criteria.

      ·    Coherence: the logic and readability of the sentence.

      ·    Relevance: whether the sentence contains the most relevant and important objects, actions, and events in the video clip.

      ·    Helpful for Blind (additional criterion): how helpful the sentence would be for a blind person trying to understand what is happening in the video clip.

M2: Human evaluation of the captions in terms of Coherence, Relevance, and Helpful for Blind, on a scale of 1-5 (lower is better).


Teams are ranked separately on M1 (automatic metrics) and M2 (human evaluation). For M1, a ranked list of teams is produced for each evaluation metric by sorting the teams' scores on that metric. The final rank of a team is obtained by combining its rank positions in the four ranked lists and is defined as:

R(team) = R(team)@BLEU@4 + R(team)@METEOR + R(team)@ROUGE-L + R(team)@CIDEr-D.

where R(team)@metric is the rank position of the team on that metric; e.g., if the team achieves the best performance in terms of BLEU@4, then R(team)@BLEU@4 is "1". The smaller the final ranking, the better the performance.
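The rank-sum rule above can be sketched in Python as follows. This is a minimal illustration, not the organizers' scoring script: the function names are hypothetical, ties are assumed to share the better rank position (the text does not say how ties are handled), and the scores are the six primary-run rows shown in the leaderboard, so rank positions are computed among those rows only.

```python
from typing import Dict

METRICS = ("BLEU@4", "METEOR", "CIDEr-D", "ROUGE-L")

# Primary-run scores copied from the leaderboard rows listed on this page.
SCORES: Dict[str, Dict[str, float]] = {
    "RUC+CMU_V2T": {"BLEU@4": 0.390, "METEOR": 0.255, "CIDEr-D": 0.315, "ROUGE-L": 0.542},
    "NII":         {"BLEU@4": 0.359, "METEOR": 0.234, "CIDEr-D": 0.231, "ROUGE-L": 0.514},
    "MIC_TJU":     {"BLEU@4": 0.351, "METEOR": 0.226, "CIDEr-D": 0.236, "ROUGE-L": 0.509},
    "Illusion":    {"BLEU@4": 0.304, "METEOR": 0.213, "CIDEr-D": 0.206, "ROUGE-L": 0.494},
    "LVIC_AS":     {"BLEU@4": 0.289, "METEOR": 0.203, "CIDEr-D": 0.175, "ROUGE-L": 0.487},
    "TJU-NUS":     {"BLEU@4": 0.265, "METEOR": 0.191, "CIDEr-D": 0.151, "ROUGE-L": 0.456},
}


def rank_positions(metric_scores: Dict[str, float]) -> Dict[str, int]:
    """Rank teams on one metric: the highest score gets position 1.

    Assumption: tied teams share the better (smaller) position.
    """
    ordered = sorted(metric_scores.values(), reverse=True)
    return {team: ordered.index(score) + 1 for team, score in metric_scores.items()}


def rank_sum(scores: Dict[str, Dict[str, float]]) -> Dict[str, int]:
    """Compute R(team) as the sum of the team's rank positions over all metrics."""
    totals = {team: 0 for team in scores}
    for metric in METRICS:
        per_metric = {team: row[metric] for team, row in scores.items()}
        for team, position in rank_positions(per_metric).items():
            totals[team] += position
    return totals


R = rank_sum(SCORES)
final_order = sorted(R, key=R.get)  # smaller R(team) is better
```

Among these six rows, RUC+CMU_V2T is first on every metric, so its R(team) is 4, the smallest possible value.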

In a similar spirit, we linearly fuse the human evaluation scores on Coherence, Relevance, and Helpful for Blind (each on a scale of 1-5) for each team. The final score of each team is given by:

S(team) = S(team)@Coherence + S(team)@Relevance + S(team)@Helpful for Blind.

The larger the score, the better the performance.
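The fusion of human scores can be sketched in the same way. This is only an illustration: it assumes that the leaderboard columns C1, C2, and C3 correspond to Coherence, Relevance, and Helpful for Blind in that order, which the page does not state explicitly, and the function name is hypothetical.

```python
from typing import Dict, Tuple

# (C1, C2, C3) per team, copied from the human-evaluation leaderboard on this page.
# Assumption: C1 = Coherence, C2 = Relevance, C3 = Helpful for Blind.
HUMAN_SCORES: Dict[str, Tuple[float, float, float]] = {
    "RUC+CMU_V2T": (4.437, 3.437, 3.567),
    "NII":         (4.078, 3.359, 3.570),
    "MIC_TJU":     (3.844, 2.789, 2.978),
    "Illusion":    (4.042, 2.583, 2.921),
    "TJU-NUS":     (3.762, 2.364, 2.376),
    "LVIC_AS":     (3.477, 2.322, 2.321),
}


def fused_score(scores: Dict[str, Tuple[float, float, float]]) -> Dict[str, float]:
    """Compute S(team) as the unweighted sum of the three human scores."""
    return {team: round(sum(triple), 3) for team, triple in scores.items()}


S = fused_score(HUMAN_SCORES)
ranking = sorted(S, key=S.get, reverse=True)  # larger S(team) is better
```

Sorting by S(team) in descending order reproduces the row order of the human-evaluation table for the six listed teams.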

Finally, we rank all participants in two separate lists, one in terms of R(team) and the other in terms of S(team).

M1: R(team) = R(team)@BLEU@4 + R(team)@METEOR + R(team)@ROUGE-L + R(team)@CIDEr-D
M2: S(team) = S(team)@Coherence + S(team)@Relevance + S(team)@Helpful for Blind