| Rank | Team | Organization | BLEU@4 | METEOR | CIDEr-D | ROUGE-L |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | RUC+CMU_V2T | RUC & CMU | 0.390 | 0.255 | 0.315 | 0.542 |
| 3 | NII | National Institute of Informatics | 0.359 | 0.234 | 0.231 | 0.514 |
| 7 | TJU-NUS | TJU & NUS | 0.265 | 0.191 | 0.151 | 0.456 |
| Rank | Team | Organization | Coherence | Relevance | Helpful for Blind |
| --- | --- | --- | --- | --- | --- |
| 1 | RUC+CMU_V2T | RUC & CMU | 4.437 | 3.437 | 3.567 |
| 2 | NII | National Institute of Informatics | 4.078 | 3.359 | 3.570 |
| 6 | TJU-NUS | TJU & NUS | 3.762 | 2.364 | 2.376 |
We computed several common metrics: BLEU@4, METEOR, ROUGE-L, and CIDEr-D. The performance of the primary run from each team is reported for comparison across teams. The results of all runs can be downloaded here.
In addition, we carried out a human evaluation of the systems submitted to this challenge on a subset of the testing set. Human judges were asked to rate the generated sentence of the primary run from each team, as well as a reference sentence, on a scale of 1 to 5 (higher is better) with respect to the following criteria.
· Coherence: judge the logic and readability of the sentence.
· Relevance: judge whether the sentence contains the most relevant and important objects/actions/events in the video clip.
· Helpful for Blind (additional criterion): judge how helpful the sentence would be for a blind person to understand what is happening in the video clip.
| Measure | Description |
| --- | --- |
| M1 | BLEU@4, METEOR, ROUGE-L, and CIDEr-D |
| M2 | Human evaluation of the captions in terms of Coherence, Relevance, and Helpful for Blind on a scale of 1-5 (higher is better) |
The ranking for the competition is based on the results from M1 and M2, respectively. Specifically, for M1, a ranked list of teams is produced by sorting their scores on each evaluation metric. The final rank of a team combines its positions in the four ranked lists and is defined as:
R(team) = R(team)@BLEU@4 + R(team)@METEOR + R(team)@ROUGE-L + R(team)@CIDEr-D.
where R(team)@metric denotes the rank position of the team on that metric; e.g., if the team achieves the best performance in terms of BLEU@4, then R(team)@BLEU@4 is 1. The smaller the final value of R(team), the better the performance.
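The rank fusion above can be sketched in a few lines of Python; the team names and metric values below are purely illustrative, not the official challenge results:

```python
def fuse_ranks(scores):
    """Sum each team's rank positions across the four M1 metrics.

    scores: {team: {metric: value}}, where a higher metric value is better.
    Returns {team: R(team)}, where a smaller R(team) is better.
    """
    metrics = ["BLEU@4", "METEOR", "ROUGE-L", "CIDEr-D"]
    fused = {team: 0 for team in scores}
    for m in metrics:
        # Sort teams by this metric, best first; rank positions start at 1.
        ranked = sorted(scores, key=lambda t: scores[t][m], reverse=True)
        for pos, team in enumerate(ranked, start=1):
            fused[team] += pos
    return fused


# Illustrative scores for two hypothetical teams.
scores = {
    "TeamA": {"BLEU@4": 0.40, "METEOR": 0.30, "ROUGE-L": 0.55, "CIDEr-D": 0.50},
    "TeamB": {"BLEU@4": 0.35, "METEOR": 0.25, "ROUGE-L": 0.50, "CIDEr-D": 0.60},
}
# TeamA is best on three metrics (rank 1) and second on CIDEr-D (rank 2),
# so R(TeamA) = 1 + 1 + 1 + 2 = 5; likewise R(TeamB) = 2 + 2 + 2 + 1 = 7.
```

Note that ties on a metric would need an explicit tie-breaking rule; the sketch simply relies on the sort order.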
In a similar spirit, we linearly fuse the human evaluation scores on Coherence, Relevance, and Helpful for Blind (on a scale of 1-5) for each team. The final score of each team is given by:
S(team) = S(team)@Coherence + S(team)@Relevance + S(team)@Helpful for Blind.
The larger the score, the better the performance.
Finally, we rank all the participants in two separate lists, one in terms of R(team) and the other in terms of S(team).
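A minimal sketch of the M2 score fusion, assuming each team's per-criterion score is the mean of the individual judges' 1-5 ratings (the judge ratings below are made up for illustration; the source does not specify how per-judge ratings are aggregated):

```python
def fuse_human_scores(ratings):
    """Sum each team's mean rating over the three human-evaluation criteria.

    ratings: {team: {criterion: [per-judge ratings on a 1-5 scale]}}.
    Returns {team: S(team)}, where a larger S(team) is better.
    """
    criteria = ["Coherence", "Relevance", "Helpful for Blind"]
    return {
        team: sum(sum(r[c]) / len(r[c]) for c in criteria)
        for team, r in ratings.items()
    }


# Made-up ratings from two hypothetical judges for one hypothetical team.
ratings = {
    "TeamA": {
        "Coherence": [4, 5],
        "Relevance": [3, 4],
        "Helpful for Blind": [4, 4],
    },
}
# Mean scores are 4.5, 3.5, and 4.0, so S(TeamA) = 12.0.
```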
| Measure | Ranking formula |
| --- | --- |
| M1 | R(team) = R(team)@BLEU@4 + R(team)@METEOR + R(team)@ROUGE-L + R(team)@CIDEr-D |
| M2 | S(team) = S(team)@Coherence + S(team)@Relevance + S(team)@Helpful for Blind |