Early Embedding and Late Reranking for Video Captioning

We will present our video2text work in the Multimedia Grand Challenge session of the forthcoming ACM Multimedia 2016 conference in Amsterdam.

This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to the LSTM with tag embeddings. The other is late reranking, which re-scores generated sentences in terms of their relevance to a specific video. The modules are inspired by recent works on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of these two modules adds a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the organizers' blind test. Our system ranks 4th in terms of overall performance, while scoring the best CIDEr-D, which measures the human-likeness of generated captions.
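To make the two modules concrete, below is a minimal Python sketch of how early embedding and late reranking could be wired together. This is not the paper's implementation: the dimensions, feature names, relevance scores, and the interpolation weight alpha are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the two modules described above,
# using NumPy with made-up dimensions and inputs for illustration only.
import numpy as np

def early_embedding(convnet_feat, tag_probs, tag_embeddings):
    """Enrich the visual input to the LSTM with a tag embedding.

    convnet_feat:   (d_v,) ConvNet feature of the video (assumed precomputed)
    tag_probs:      (n_tags,) predicted tag relevance scores for the video
    tag_embeddings: (n_tags, d_t) one embedding vector per tag
    Returns the concatenated vector fed to the LSTM.
    """
    # Weighted average of tag embeddings, weighted by predicted relevance.
    tag_vec = tag_probs @ tag_embeddings / (tag_probs.sum() + 1e-8)
    return np.concatenate([convnet_feat, tag_vec])

def late_reranking(candidates, log_probs, relevance_scores, alpha=0.5):
    """Re-score generated sentences by their relevance to the video.

    candidates:       list of generated sentences
    log_probs:        (n,) language-model log-probabilities of the sentences
    relevance_scores: (n,) sentence-video relevance (e.g. from a cross-modal model)
    alpha:            interpolation weight (a hypothetical choice)
    Returns candidates sorted from best to worst under the combined score.
    """
    combined = alpha * np.asarray(log_probs) + (1 - alpha) * np.asarray(relevance_scores)
    order = np.argsort(-combined)
    return [candidates[i] for i in order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lstm_input = early_embedding(rng.normal(size=2048),
                                 rng.random(size=100),
                                 rng.normal(size=(100, 300)))
    print(lstm_input.shape)  # (2348,)
    print(late_reranking(["a man is cooking", "a dog runs"],
                         log_probs=[-4.2, -5.1],
                         relevance_scores=[0.7, 0.2]))
```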

Jianfeng Dong, Xirong Li, Weiyu Lan, Yujia Huo, Cees G. M. Snoek (2016): Early Embedding and Late Reranking for Video Captioning. In: ACM Multimedia, 2016. (Grand Challenge Award)

TagBook for video event detection with few or zero examples

Our work on video event detection, TagBook: A Semantic Video Representation without Supervision for Event Detection, has been published as a regular paper in the July issue of IEEE Transactions on Multimedia.

We consider the problem of event detection in video for scenarios where only a few, or even zero, examples are available for training. For this challenging setting, the prevailing solutions in the literature rely on a semantic video representation obtained from thousands of pre-trained concept detectors. Different from existing work, we propose a new semantic video representation that is based on freely available socially tagged videos only, without the need for training any intermediate concept detectors. We introduce a simple algorithm that propagates tags from a video's nearest neighbors, similar in spirit to the ones used for image retrieval, but redesign it for video event detection by including video source set refinement and varying the video tag assignment. We call our approach TagBook and study its construction, descriptiveness and detection performance on the TRECVID 2013 and 2014 multimedia event detection datasets and the Columbia Consumer Video dataset. Despite its simple nature, the proposed TagBook video representation is remarkably effective for few-example and zero-example event detection, even outperforming very recent state-of-the-art alternatives building on supervised representations.
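The core idea, propagating tags from a video's nearest neighbors, can be sketched in a few lines. The snippet below is not the published implementation: the cosine-similarity choice, the similarity-weighted accumulation, and all names and dimensions are illustrative assumptions.

```python
# Minimal sketch (not the published implementation) of neighbor-based tag
# propagation, the core idea behind the TagBook representation.
import numpy as np

def tagbook_representation(query_feat, source_feats, source_tags, vocab, k=5):
    """Build a semantic vector for a video by propagating neighbor tags.

    query_feat:   (d,) visual feature of the query video
    source_feats: (n, d) features of socially tagged source videos
    source_tags:  list of n tag sets, one per source video
    vocab:        list of tags defining the TagBook vocabulary
    k:            number of nearest neighbors to propagate from
    Returns a (len(vocab),) vector of propagated tag scores.
    """
    # Cosine similarity between the query and every source video.
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    s = source_feats / (np.linalg.norm(source_feats, axis=1, keepdims=True) + 1e-8)
    sims = s @ q
    neighbors = np.argsort(-sims)[:k]

    tag_index = {t: i for i, t in enumerate(vocab)}
    scores = np.zeros(len(vocab))
    for n_idx in neighbors:
        for tag in source_tags[n_idx]:
            if tag in tag_index:
                # Weight each propagated tag by the neighbor's similarity.
                scores[tag_index[tag]] += sims[n_idx]
    # Normalize so videos with different neighborhoods remain comparable.
    return scores / (scores.sum() + 1e-8)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    vocab = ["birthday", "parade", "dog", "skate"]
    rep = tagbook_representation(rng.normal(size=128),
                                 rng.normal(size=(10, 128)),
                                 [{"dog", "parade"} if i % 2 else {"birthday"} for i in range(10)],
                                 vocab, k=3)
    print(dict(zip(vocab, rep.round(3))))
```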

Masoud Mazloom, Xirong Li, Cees G. M. Snoek (2016): TagBook: A Semantic Video Representation Without Supervision for Event Detection. In: IEEE Transactions on Multimedia (TMM), 18 (7), pp. 1378-1388, 2016.