Detecting Violence in Video using Subclasses

Our work on video violence detection is to appear as a short paper in the forthcoming ACM Multimedia 2016 conference.

This paper attacks the challenging problem of violence detection in videos. Different from existing works focusing on combining multi-modal features, we go one step further by adding and exploiting subclasses visually related to violence. We enrich the MediaEval 2015 violence dataset by manually labeling violence videos with respect to the subclasses. Such fine-grained annotations not only help understand what have impeded previous efforts on learning to fuse the multi-modal features, but also enhance the generalization ability of the learned fusion to novel test data. The new subclass based solution, with AP of 0.303 and P100 of 0.55 on the MediaEval 2015 test set, outperforms the state-of-the-art. Notice that our solution does not require fine-grained annotations on the test set, so it can be directly applied on novel and fully unlabeled videos. Interestingly, our study shows that motion related features (MBH, HOG and HOF), though being essential part in previous systems, are seemingly dispensable. Data is available at

Xirong Li, Yujia Huo, Qin Jin, Jieping Xu (2016): Detecting Violence in Video using Subclasses. In: ACM Multimedia, 2016.

Improving Image Captioning by Concept-based Sentence Reranking

We presented our image2text work (Best Paper Runner-Up) at the Pacific-Rim Conference on Multimedia (PCM) 2016 today.

This paper describes our winning entry in the ImageCLEF 2015 image sentence generation task. We improve Google’s CNN-LSTM model by introducing concept-based sentence reranking, a data-driven approach which exploits the large amounts of concept-level annotations on Flickr. Different from previous usage of concept detection that is tailored to specific image captioning models, the propose approach reranks predicted sentences in terms of their matches with detected concepts, essentially treating the underlying model as a black box. This property makes the approach applicable to a number of existing solutions. We also experiment with fine tuning on the deep language model, which improves the performance further. Scoring METEOR of 0.1875 on the ImageCLEF 2015 test set, our system outperforms the runner-up (METEOR of 0.1687) with a clear margin.

 Examples illustrating concept-based sentence re-ranking for improving image captioning.

Xirong Li, Qin Jin (2016): Improving Image Captioning by Concept-based Sentence Reranking. In: The 17th Pacific-Rim Conference on Multimedia (PCM), pp. 231-240, 2016, (Best Paper Runner-up).

Early Embedding and Late Reranking for Video Captioning

We are going to present our video2text work in the Multimedia Grand Challenge session of the forthcoming ACM Multimedia 2016 Conference at Amsterdam.

This paper describes our solution for the MSR Video to Language
Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to LSTM by tag embeddings. The other is late reranking, for re-scoring generated sentences in terms of their relevance to a specific video. The modules are inspired by recent works on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of these two modules add a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the blind test by the organizers. Our system is ranked at the 4th place in terms of overall performance, while scoring the best CIDEr-D, which measures the human-likeness of generated captions.

Early embedding and late reranking for video captioning

Jianfeng Dong, Xirong Li, Weiyu Lan, Yujia Huo, Cees G. M. Snoek (2016): Early Embedding and Late Reranking for Video Captioning. ACM Multimedia, 2016, (Grand Challenge Award).

TagBook for video event detection with few or zero example

Our work on video event detection, TagBook: A Semantic Video Representation without Supervision for Event Detection, has been published as a regular paper in the July issue of IEEE Transactions on Multimedia.

tagbook-frameworkWe consider the problem of event detection in video for scenarios where only few, or even zero examples are available for training. For this challenging setting, the prevailing solutions in the literature rely on a semantic video representation obtained from thousands of pre-trained concept detectors. Different from existing work, we propose a new semantic video representation that is based on freely available social tagged videos only, without the need for training any intermediate concept detectors. We introduce a simple algorithm that propagates tags from a video’s nearest neighbors, similar in spirit to the ones used for image retrieval, but redesign it for video event detection by including video source set refinement and varying the video tag assignment. We call our approach TagBook and study its construction, descriptiveness and detection performance on the TRECVID 2013 and 2014 multimedia event detection datasets and the Columbia Consumer Video dataset. Despite its simple nature, the proposed TagBook video representation is remarkably effective for few-example and zero-example event detection, even outperforming very recent state-of-the-art alternatives building on supervised representations.

Masoud Mazloom, Xirong Li, Cees G. M. Snoek (2016): TagBook: A Semantic Video Representation Without Supervision for Event Detection. In: IEEE Transactions on Multimedia (TMM), 18 (7), pp. 1378-1388, 2016.