CVPR2019: Dual Encoding for Zero-Example Video Retrieval

Our CVPR paper on zero-example video retrieval is online. Data and source code are publicly available on GitHub.

This paper attacks the challenging problem of zero-example video retrieval. In this retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described in natural language, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, effective sequence-to-sequence cross-modal matching is required. The majority of existing methods are concept based, extracting relevant concepts from queries and videos and accordingly establishing associations between the two modalities. In contrast, this paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Dual encoding is conceptually simple, practically effective and trained end-to-end. As experiments on three benchmarks, i.e., MSR-VTT and the TRECVID 2016 and 2017 Ad-hoc Video Search tasks, show, the proposed solution establishes a new state-of-the-art for zero-example video retrieval.
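The core idea of dual encoding, stripped to its essentials, is that each modality gets its own encoder mapping a variable-length sequence to a fixed vector in a common space where similarity can be computed. The sketch below keeps only a mean-pooling level (the paper's encoders additionally stack biGRU and CNN levels); the projection matrices stand in for learned parameters.

```python
import numpy as np

def encode_sequence(features, W):
    """Encode a variable-length feature sequence (T x d) into a fixed
    vector: mean-pool over time, then project with a learned matrix W.
    (Simplification: the paper's dual encoder also stacks biGRU and
    CNN encoding levels on top of mean pooling.)"""
    pooled = features.mean(axis=0)   # (d,)
    return W @ pooled                # (k,) vector in the common space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cross_modal_similarity(video_frames, query_words, W_v, W_t):
    """Similarity between a video (frame features) and a query (word
    vectors), computed in the learned common space."""
    v = encode_sequence(video_frames, W_v)
    t = encode_sequence(query_words, W_t)
    return cosine(v, t)
```

At retrieval time, every video in the collection is encoded once; a query is encoded on the fly and videos are ranked by this similarity.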

Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, Xun Wang (2019): Dual Encoding for Zero-Example Video Retrieval. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

MMM2019: Four Models for Automatic Recognition of Left and Right Eye in Fundus Images

Our MMM2019 paper on recognizing Left / Right Eye in Fundus Images is online.

Fundus image analysis is crucial for eye condition screening and diagnosis, and consequently for personalized health management in the long term. This paper targets left and right eye recognition, a basic module for fundus image analysis. We study how to automatically assign left-eye / right-eye labels to fundus images of the posterior pole. For this under-explored task, four models are developed. Two of them are based on optic disc localization, using an extremely simple max-intensity rule and the more advanced Faster R-CNN, respectively. The other two models require no localization, but perform holistic image classification using classical Local Binary Patterns (LBP) features and a fine-tuned ResNet-18, respectively. The four models are tested on a real-world set of 1,633 fundus images from 834 subjects. The fine-tuned ResNet-18 has the highest accuracy of 0.9847. Interestingly, the LBP based model, with the trick of left-right contrastive classification, performs closely to the deep model, with an accuracy of 0.9718.
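To make the max-intensity baseline concrete: the optic disc is typically the brightest region of a posterior-pole fundus image, so its horizontal position alone can suggest the eye side. The sketch below assumes the common non-mirrored convention in which the disc falls in the right half of the image for a right eye; the grid size and the laterality convention are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def predict_eye_max_intensity(image, grid=8):
    """Max-intensity baseline for left/right eye recognition.
    `image` is a 2-D grayscale array. Patch means over a coarse grid
    suppress isolated bright pixels; the brightest patch approximates
    the optic disc location."""
    h, w = image.shape
    ph, pw = h // grid, w // grid
    blocks = (image[:ph * grid, :pw * grid]
              .reshape(grid, ph, grid, pw)
              .mean(axis=(1, 3)))                      # (grid, grid) patch means
    col = np.unravel_index(np.argmax(blocks), blocks.shape)[1]
    # Assumed convention: disc on the right half -> right eye.
    return "right" if col >= grid / 2 else "left"
```

The paper's Faster R-CNN variant replaces the brightest-patch heuristic with a learned optic disc detector but uses the position in the same way.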

Xin Lai, Xirong Li, Rui Qian, Dayong Ding, Jun Wu, Jieping Xu (2019): Four Models for Automatic Recognition of Left and Right Eye in Fundus Images. In: The 25th International Conference on MultiMedia Modeling (MMM), 2019.


ACCV2018: Laser Scar Detection in Fundus Images using Convolutional Neural Networks

We are going to present our work on detecting laser scars in color fundus images at the 14th Asian Conference on Computer Vision (ACCV 2018) in Perth, Australia. This is joint work with Vistel Inc. and Peking Union Medical College Hospital.

In diabetic eye screening programmes, a special pathway is designed for those who have received laser photocoagulation treatment. The treatment leaves behind circular or irregular scars in the retina. Laser scar detection in fundus images is thus important for automated diabetic retinopathy (DR) screening. Despite its importance, the problem is understudied in terms of both datasets and methods. This paper makes the first attempt to detect laser-scar images by deep learning. To that end, we contribute to the community Fundus10K, a large-scale expert-labeled dataset for training and evaluating laser scar detectors. We study in this new context major design choices of state-of-the-art Convolutional Neural Networks, including Inception-v3, ResNet and DenseNet. For more effective training we exploit transfer learning, passing on the trained weights of ImageNet models to their laser-scar counterparts. Experiments on the new dataset show that our best model detects laser-scar images with a sensitivity of 0.962, specificity of 0.999, precision of 0.974, AP of 0.988 and AUC of 0.999. The same model is tested on the public LMD-BAPT test set, obtaining a sensitivity of 0.765, specificity of 1, precision of 1, AP of 0.975 and AUC of 0.991, outperforming the state-of-the-art by a large margin. Data is available at
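The weight-transfer step amounts to copying every ImageNet-trained layer whose shape still matches and re-initializing only the classification head (1000 ImageNet classes vs. 2 laser-scar classes). A minimal sketch, with models represented as name-to-array dicts and the head name `"fc"` as an illustrative assumption:

```python
import numpy as np

def transfer_weights(pretrained, model, head_prefix="fc"):
    """Copy pretrained ImageNet weights into the laser-scar model.
    Layers whose name and shape match are copied; the classification
    head (hypothetically named with `head_prefix`) is skipped, keeping
    its fresh 2-way initialization for fine-tuning."""
    for name, w in pretrained.items():
        if name.startswith(head_prefix):
            continue  # head shapes differ between 1000-way and 2-way tasks
        if name in model and model[name].shape == w.shape:
            model[name] = w.copy()
    return model
```

After the transfer, the whole network is fine-tuned on Fundus10K rather than training only the new head.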

Qijie Wei, Xirong Li, Hao Wang, Dayong Ding, Weihong Yu, Youxin Chen (2018): Laser Scar Detection in Fundus Images using Convolutional Neural Networks. Asian Conference on Computer Vision (ACCV), 2018.

Feature Re-Learning with Data Augmentation for Content-based Video Recommendation

We are going to present our work on content-based video recommendation in the Multimedia Grand Challenge session of the forthcoming ACM Multimedia 2018 Conference in Seoul. Source code will be available shortly at

This paper describes our solution for the Hulu Content-based Video Relevance Prediction Challenge. Noting the deficiency of the original features, we propose feature re-learning to improve video relevance prediction. To generate more training instances for supervised learning, we develop two data augmentation strategies, one for frame-level features and the other for video-level features. In addition, late fusion of multiple models is employed to further boost the performance. Evaluation conducted by the organizers shows that our best run outperforms the Hulu baseline, obtaining relative improvements of 26.2% and 30.2% on the TV-shows track and the Movies track, respectively, in terms of recall@100. The results clearly justify the effectiveness of the proposed solution.
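The two augmentation strategies can be pictured as follows: at the frame level, one sequence of frame features yields several shorter sequences by skip sampling at different offsets; at the video level, a single feature vector yields extra instances by small perturbation. This is a hedged sketch of the idea; the exact stride and noise scale below are illustrative choices, not values from the paper.

```python
import numpy as np

def augment_frame_level(frames, stride):
    """Skip sampling: from one frame-feature sequence (T x d), derive
    `stride` shorter sequences by taking every `stride`-th frame at
    each possible offset."""
    return [frames[offset::stride] for offset in range(stride)]

def augment_video_level(feature, n, sigma=0.01, seed=0):
    """Perturbation: jitter one video-level feature vector with small
    Gaussian noise to create `n` extra training instances (sigma is an
    illustrative choice)."""
    rng = np.random.default_rng(seed)
    return [feature + rng.normal(scale=sigma, size=feature.shape)
            for _ in range(n)]
```

Each augmented instance keeps the label of its source video, multiplying the supervision available for feature re-learning.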

Jianfeng Dong, Xirong Li, Chaoxi Xu, Gang Yang, Xun Wang (2018): Feature Re-Learning with Data Augmentation for Content-based Video Recommendation. ACM Multimedia, 2018, (Grand challenge paper).

Predicting Visual Features from Text for Image and Video Caption Retrieval

Our Word2VisualVec work has been accepted for publication as a REGULAR paper in the IEEE Transactions on Multimedia. Source code is available at

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both 3-D convolutional neural network features as well as a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TRECVID challenge for video caption retrieval detail Word2VisualVec’s properties, its benefit over textual embeddings, the potential for multimodal query composition and its state-of-the-art results.
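The text-to-visual mapping itself is deliberately simple: a sentence embedding pushed through a multi-layer perceptron whose output lives in the visual feature space, where retrieval is done by cosine similarity against the image's actual feature. The sketch below reduces the multi-scale sentence vectorization to a mean of word embeddings (the paper also concatenates bag-of-words and GRU encodings), and shows the forward pass only.

```python
import numpy as np

def sentence_vector(word_vectors):
    """Simplified textual encoding: mean of word embeddings.
    (The paper's multi-scale vectorization additionally concatenates
    bag-of-words and GRU codes.)"""
    return np.mean(word_vectors, axis=0)

def word2visualvec(word_vectors, W1, b1, W2, b2):
    """Two-layer MLP mapping the textual embedding into the visual
    feature space; candidate sentences are ranked by cosine similarity
    between their predicted visual vectors and the image's feature."""
    t = sentence_vector(word_vectors)
    h = np.maximum(0.0, W1 @ t + b1)   # ReLU hidden layer
    return W2 @ h + b2                 # predicted visual feature
```

Training minimizes the distance between the predicted vector and the ground-truth visual feature of the captioned image, so no joint subspace is ever learned.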

Jianfeng Dong, Xirong Li, Cees G. M. Snoek (2018): Predicting Visual Features from Text for Image and Video Caption Retrieval. In: IEEE Transactions on Multimedia (TMM), 20 (12), pp. 3377-3388, 2018.


Cross-Media Similarity Evaluation for Web Image Retrieval in the Wild

Our work on cross-media similarity computation for web image retrieval has been accepted for publication as a REGULAR paper in the IEEE Transactions on Multimedia.

In order to retrieve unlabeled images by textual queries, cross-media similarity computation is a key ingredient. Although novel methods are continuously introduced, little has been done to evaluate these methods together with large-scale query log analysis. Consequently, how far these methods have brought us in answering real-user queries is unclear. Given baseline methods that use relatively simple text/image matching, how much progress advanced models have made is also unclear. This paper takes a pragmatic approach to answering these two questions. Queries are automatically categorized according to the proposed query visualness measure, and later connected to the evaluation of multiple cross-media similarity models on three test sets. Such a connection reveals that the success of the state-of-the-art is mainly attributed to their good performance on visual-oriented queries, which account for only a small part of real-user queries. To quantify the current progress, we propose a simple text2image method, representing a novel query by a set of images selected from a large-scale query log. Consequently, computing cross-media similarity between the query and a given image boils down to comparing the visual similarity between the given image and the selected images. Image retrieval experiments on the challenging Clickture dataset show that the proposed text2image is a strong baseline, comparing favorably to recent deep learning alternatives.
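Once a query is represented by images from the log, the cross-media similarity is purely visual. A minimal sketch, assuming mean-of-cosines aggregation over the selected images (the aggregation choice here is illustrative):

```python
import numpy as np

def text2image_similarity(image_feat, query_image_feats):
    """text2image baseline: a textual query is represented by the visual
    features of images previously clicked for that query in a large
    query log; cross-media similarity then reduces to averaging visual
    similarities between the candidate image and those images."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return float(np.mean([cos(image_feat, q) for q in query_image_feats]))
```

Because the query-side representation comes from real user behavior rather than a learned embedding, the baseline needs no cross-modal training at all.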


Jianfeng Dong, Xirong Li, Duanqing Xu (2018): Cross-Media Similarity Evaluation for Web Image Retrieval in the Wild. In: IEEE Transactions on Multimedia (TMM), 20 (9), pp. 2371-2384, 2018.

Fluency-Guided Cross-Lingual Image Captioning

Our MM2017 paper on cross-lingual image captioning is online.  We have also released code, data and pre-trained models at


Image captioning has so far been explored mostly in English, as most available datasets are in this language. However, the application of image captioning should not be restricted by language. Only a few studies have been conducted for image captioning in a cross-lingual setting. Different from these works, which manually build a dataset for a target language, we aim to learn a cross-lingual captioning model fully from machine-translated sentences. To overcome the lack of fluency in the translated sentences, we propose in this paper a fluency-guided learning framework. The framework comprises a module to automatically estimate the fluency of the sentences and another module to utilize the estimated fluency scores to effectively train an image captioning model for the target language. As experiments on two bilingual (English-Chinese) datasets show, our approach improves both fluency and relevance of the generated captions in Chinese, without using any manually written sentences from the target language.
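The interplay of the two modules can be pictured as weighting: each machine-translated training caption contributes to the captioning loss in proportion to its estimated fluency, so disfluent translations are down-weighted rather than discarded. This is a sketch of the principle; the paper's exact weighting and rejection scheme may differ.

```python
import numpy as np

def fluency_weighted_loss(per_sentence_losses, fluency_scores):
    """Fluency-guided training in a nutshell: the captioning loss of each
    machine-translated sentence is weighted by its estimated fluency
    score in [0, 1], then normalized by the total weight."""
    losses = np.asarray(per_sentence_losses, dtype=float)
    weights = np.asarray(fluency_scores, dtype=float)
    return float((weights * losses).sum() / (weights.sum() + 1e-12))
```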

Weiyu Lan, Xirong Li, Jianfeng Dong (2017): Fluency-Guided Cross-Lingual Image Captioning. In: ACM Multimedia, 2017.

Harvesting Deep Models for Cross-Lingual Image Annotation

Our CBMI2017 paper on cross-lingual image annotation is online.

This paper considers cross-lingual image annotation, harvesting deep visual models from one language to annotate images with labels from another language. This task cannot be accomplished by machine translation, as labels can be ambiguous and a translated vocabulary leaves us limited freedom to annotate images with appropriate labels. Given non-overlapping vocabularies between two languages, we formulate cross-lingual image annotation as a zero-shot learning problem. For cross-lingual label matching, we adapt zero-shot learning by replacing its monolingual semantic embedding space with a bilingual alternative. In order to reduce both label ambiguity and redundancy, we propose a simple yet effective approach called label-enhanced zero-shot learning. Using three state-of-the-art deep visual models, i.e., ResNet-152, GoogleNet-Shuffle and OpenImages, experiments on the test set of Flickr8k-CN demonstrate the viability of the proposed approach for cross-lingual image annotation.
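The zero-shot matching step can be sketched as score propagation through a bilingual word embedding: a target-language label is scored by how strongly the visual model's source-language predictions point at it in the shared embedding space. The labels and the sum-of-cosines scoring below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def crosslingual_annotate(pred_scores, src_emb, tgt_emb):
    """Zero-shot cross-lingual label matching.
    pred_scores: {source_label: visual score} from the harvested model.
    src_emb / tgt_emb: label -> vector in a shared bilingual embedding
    space. Returns target-language labels ranked by visual score
    propagated through embedding similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scores = {t: sum(s * cos(src_emb[l], v) for l, s in pred_scores.items())
              for t, v in tgt_emb.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The label-enhanced variant additionally cleans the target vocabulary to reduce ambiguous and redundant labels before this matching.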


Qijie Wei, Xiaoxu Wang, Xirong Li (2017): Harvesting Deep Models for Cross-Lingual Image Annotation. In: The 15th International Workshop on Content-Based Multimedia Indexing (CBMI), 2017.

Tag Relevance Fusion for Social Image Retrieval

My work on image tag relevance learning, Tag Relevance Fusion for Social Image Retrieval, has been published as a special issue paper in the Multimedia Systems journal.

Due to the subjective nature of social tagging, measuring the relevance of social tags with respect to the visual content is crucial for retrieving the increasing amounts of images shared on social networks. Observing the limits of any single measurement of tag relevance, we introduce in this paper tag relevance fusion as an extension to methods for tag relevance estimation. We present a systematic study, covering tag relevance fusion in early and late stages, and in supervised and unsupervised settings. Experiments on a large present-day benchmark set show that tag relevance fusion leads to better image retrieval. Moreover, unsupervised tag relevance fusion is found to be practically as effective as supervised tag relevance fusion, but without the need for any training effort. This finding suggests the potential of tag relevance fusion for real-world deployment.
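Unsupervised late fusion, in its simplest form, just brings the individual estimators onto a common scale and averages them, so no training data is needed. A minimal sketch with min-max normalization (the normalization choice is illustrative; the paper studies several fusion schemes):

```python
import numpy as np

def late_fuse_tag_relevance(score_lists):
    """Unsupervised late fusion of tag relevance estimators: min-max
    normalize each estimator's scores over the same tag/image list to
    [0, 1], then average, so estimators on different scales contribute
    equally."""
    fused = np.zeros(len(score_lists[0]))
    for scores in score_lists:
        s = np.asarray(scores, dtype=float)
        span = s.max() - s.min()
        fused += (s - s.min()) / span if span > 0 else np.full_like(s, 0.5)
    return fused / len(score_lists)
```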

Xirong Li (2017): Tag Relevance Fusion for Social Image Retrieval. In: Multimedia Systems, 23 (1), pp. 29–40, 2017, ISSN: 1432-1882.

Detecting Violence in Video using Subclasses

Our work on video violence detection is to appear as a short paper in the forthcoming ACM Multimedia 2016 conference.

This paper attacks the challenging problem of violence detection in videos. Different from existing works focusing on combining multi-modal features, we go one step further by adding and exploiting subclasses visually related to violence. We enrich the MediaEval 2015 violence dataset by manually labeling violence videos with respect to the subclasses. Such fine-grained annotations not only help understand what has impeded previous efforts on learning to fuse the multi-modal features, but also enhance the generalization ability of the learned fusion to novel test data. The new subclass based solution, with an AP of 0.303 and P100 of 0.55 on the MediaEval 2015 test set, outperforms the state-of-the-art. Notice that our solution does not require fine-grained annotations on the test set, so it can be directly applied to novel and fully unlabeled videos. Interestingly, our study shows that motion related features (MBH, HOG and HOF), though an essential part of previous systems, are seemingly dispensable. Data is available at
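At test time the subclass idea reduces to aggregating the scores of per-subclass classifiers into one violence score, which is why no fine-grained test annotations are needed. A hedged sketch, with max aggregation and the subclass names as illustrative assumptions:

```python
def violence_score(subclass_scores):
    """Aggregate per-subclass classifier scores into a single violence
    score for a video. Max is used here for illustration; the paper's
    learned fusion may weight subclasses differently."""
    return max(subclass_scores.values())
```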

Xirong Li, Yujia Huo, Qin Jin, Jieping Xu (2016): Detecting Violence in Video using Subclasses. In: ACM Multimedia, 2016.