ICMR2020: iCap: Interactive Image Captioning with Predictive Text

Our ICMR’20 paper on interactive image captioning is online.

In this paper we study a brand new topic of interactive image captioning with a human in the loop. Different from automated image captioning where a given test image is the sole input in the inference stage, we have access to both the test image and a sequence of (incomplete) user-input sentences in the interactive scenario. We formulate the problem as Visually Conditioned Sentence Completion (VCSC). For VCSC, we propose ABD-Cap, asynchronous bidirectional decoding for image caption completion. With ABD-Cap as the core module, we build iCap, a web-based interactive image captioning system capable of predicting new text with respect to live input from a user. A number of experiments covering both automated evaluations and real user studies show the viability of our proposals.

Zhengxiong Jia, Xirong Li: iCap: Interactive Image Captioning with Predictive Text. In: ACM International Conference on Multimedia Retrieval (ICMR), 2020.

MM2019: W2VV++: Fully Deep Learning for Ad-hoc Video Search

Our ACMMM’19 paper on ad-hoc video search is online. Source code and data are accessible via https://github.com/li-xirong/w2vvpp.

Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. Different from previous concept-based methods, we propose an end-to-end deep learning method for query representation learning. The proposed method requires no concept modeling, matching and selection. The backbone of our method is the proposed W2VV++ model, a super version of Word2VisualVec (W2VV) previously developed for visual-to-text matching. W2VV++ is obtained by tweaking W2VV with a better sentence encoding strategy and an improved triplet ranking loss. With these simple changes, W2VV++ brings in a substantial improvement in performance. As our participation in the TRECVID 2018 AVS task and retrospective experiments on the TRECVID 2016 and 2017 data show, our best single model, with an overall inferred average precision (infAP) of 0.157, outperforms the state-of-the-art. The performance can be further boosted by model ensemble using late average fusion, reaching a higher infAP of 0.163. With W2VV++, we establish a new baseline for ad-hoc video search.
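
The late average fusion mentioned above can be illustrated with a minimal sketch. It assumes each model outputs one relevance score per video for a given query and that the scores are directly comparable; the function names and toy scores are hypothetical and are not taken from the W2VV++ code base.

```python
from typing import Dict, List


def late_average_fusion(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-video relevance scores over several models (assumed comparable)."""
    if not runs:
        raise ValueError("need at least one run")
    video_ids = set(runs[0])
    for run in runs[1:]:
        if set(run) != video_ids:
            raise ValueError("all runs must score the same videos")
    return {vid: sum(run[vid] for run in runs) / len(runs) for vid in video_ids}


def rank(scores: Dict[str, float]) -> List[str]:
    """Sort video ids by descending score, breaking ties by id."""
    return sorted(scores, key=lambda vid: (-scores[vid], vid))


# Toy scores from two hypothetical models for one query.
model_a = {"v1": 0.9, "v2": 0.2, "v3": 0.4}
model_b = {"v1": 0.1, "v2": 0.3, "v3": 0.8}
fused = late_average_fusion([model_a, model_b])
ranking = rank(fused)
```

In this toy case the two models disagree on the top video, and the fused ranking places v3 first because its average score is the highest.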

Xirong Li, Chaoxi Xu, Gang Yang, Zhineng Chen, Jianfeng Dong: W2VV++: Fully Deep Learning for Ad-hoc Video Search. In: ACM Multimedia, 2019.

T-MM 2019: COCO-CN for Cross-Lingual Image Tagging, Captioning, and Retrieval

Our work on cross-lingual image tagging, captioning and retrieval has been published as a regular paper in the September issue of the IEEE Transactions on Multimedia (Impact factor: 5.452). Data and code are available at https://github.com/li-xirong/coco-cn.

This paper contributes to cross-lingual image annotation and retrieval in terms of data and baseline methods. We propose COCO-CN, a novel dataset enriching MS-COCO with manually written Chinese sentences and tags. For effective annotation acquisition, we develop a recommendation-assisted collective annotation system, automatically providing an annotator with several tags and sentences deemed to be relevant with respect to the pictorial content. Having 20,342 images annotated with 27,218 Chinese sentences and 70,993 tags, COCO-CN is currently the largest Chinese–English dataset that provides a unified and challenging platform for cross-lingual image tagging, captioning, and retrieval. We develop conceptually simple yet effective methods per task for learning from cross-lingual resources. Extensive experiments on the three tasks justify the viability of the proposed dataset and methods.

Xirong Li, Chaoxi Xu, Xiaoxu Wang, Weiyu Lan, Zhengxiong Jia, Gang Yang, Jieping Xu: COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval. In: IEEE Transactions on Multimedia, vol. 21, no. 9, pp. 2347-2360, 2019.

MICCAI2019: Fully Deep Learning for Slit-lamp Photo based Nuclear Cataract Grading

Our MICCAI2019 paper on automated nuclear cataract grading is online.

Age-related cataract is a priority eye disease, with nuclear cataract as its most common type. This paper aims for automated nuclear cataract grading based on slit-lamp photos. Different from previous efforts which rely on traditional feature extraction and grade modeling techniques, we propose in this paper a fully deep learning based solution. Given a slit-lamp photo, we localize its nuclear region by Faster R-CNN, followed by a ResNet-101 based grading model. In order to alleviate the issue of imbalanced data, a simple batch balancing strategy is introduced for improving the training of the grading network. Tested on a clinical dataset of 157 slit-lamp photos from 39 female and 31 male patients, the proposed solution outperforms the state-of-the-art, reducing the mean absolute error from 0.357 to 0.313. In addition, our solution processes a slit-lamp photo in approximately 0.1 second, which is two orders of magnitude faster than the state-of-the-art. With its effectiveness and efficiency, the new solution is promising for automated nuclear cataract grading.
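
The abstract does not spell out the batch balancing strategy, so the following is only a sketch of one common way to balance a mini-batch: draw the same number of samples from each grade, sampling with replacement so that rare grades can fill their quota. The discrete grade labels, the function name and the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_batch(labels, batch_size, rng):
    """Return sample indices with an equal count per label (assumed scheme)."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    classes = sorted(by_label)
    per_class = batch_size // len(classes)
    if per_class == 0:
        raise ValueError("batch_size must be >= number of classes")
    batch = []
    for label in classes:
        # Sample with replacement so under-represented grades can fill their quota.
        batch.extend(rng.choices(by_label[label], k=per_class))
    rng.shuffle(batch)
    return batch


# Toy, heavily imbalanced label list: 8 / 3 / 1 samples for grades 1 / 2 / 3.
labels = [1] * 8 + [2] * 3 + [3] * 1
batch = balanced_batch(labels, batch_size=6, rng=random.Random(0))
counts = {grade: sum(labels[i] == grade for i in batch) for grade in set(labels)}
```

Each grade contributes two samples to the batch regardless of how often it occurs in the full label list.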


Chaoxi Xu, Xiangjia Zhu, Wenwen He, Yi Lu, Xixi He, Zongjiang Shang, Jun Wu, Keke Zhang, Yinglei Zhang, Xianfang Rong, Zhennan Zhao, Lei Cai, Dayong Ding, Xirong Li: Fully Deep Learning for Slit-lamp Photo based Nuclear Cataract Grading. In: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2019, (early accept).

MICCAI 2019: Two-Stream CNN with Loose Pair Training for Multi-modal AMD Categorization

Our MICCAI’19 paper on multi-modal age-related macular degeneration (AMD) categorization is online.

This paper studies automated categorization of age-related macular degeneration (AMD) given a multi-modal input, which consists of a color fundus image and an optical coherence tomography (OCT) image from a specific eye. Previous work uses a traditional method, comprising feature extraction and classifier training, which cannot be optimized jointly. By contrast, we propose a two-stream convolutional neural network (CNN) that is end-to-end. The CNN’s fusion layer is tailored to the need of fusing information from the fundus and OCT streams. For generating more multi-modal training instances, we introduce Loose Pair training, where a fundus image and an OCT image are paired based on class labels rather than eyes. Moreover, for a visual interpretation of how the individual modalities make contributions, we extend the class activation mapping technique to the multi-modal scenario. Experiments on a real-world dataset collected from an outpatient clinic justify the viability of our proposal for multi-modal AMD categorization.
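
The pairing rule behind Loose Pair training, as described above, can be sketched as follows: a fundus image and an OCT image form a training pair whenever they share a class label, whether or not they come from the same eye. The image ids, the label names and the function name are hypothetical; how the paper samples from these pairs during training is not covered here.

```python
def loose_pairs(fundus, oct_scans):
    """Pair every fundus image with every OCT image of the same class label.

    Both arguments are lists of (image_id, class_label) tuples.
    """
    oct_by_label = {}
    for oct_id, label in oct_scans:
        oct_by_label.setdefault(label, []).append(oct_id)
    pairs = []
    for fundus_id, label in fundus:
        for oct_id in oct_by_label.get(label, []):
            pairs.append((fundus_id, oct_id, label))
    return pairs


# Hypothetical toy data: three fundus images and three OCT images.
fundus = [("f1", "dry"), ("f2", "wet"), ("f3", "dry")]
octs = [("o1", "dry"), ("o2", "wet"), ("o3", "dry")]
pairs = loose_pairs(fundus, octs)
```

Strict per-eye pairing would yield at most three instances from this toy data, whereas label-based pairing yields five.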

Weisen Wang, Zhiyan Xu, Weihong Yu, Jianchun Zhao, Jingyuan Yang, Feng He, Zhikun Yang, Di Chen, Dayong Ding, Youxin Chen, Xirong Li: Two-Stream CNN with Loose Pair Training for Multi-modal AMD Categorization. In: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2019, (early accept).

CVPR2019: Dual Encoding for Zero-Example Video Retrieval

Our CVPR paper on zero-example video retrieval is online. Data and source code are publicly available on GitHub.

This paper attacks the challenging problem of zero-example video retrieval. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described in natural language text with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is required. The majority of existing methods are concept based, extracting relevant concepts from queries and videos and accordingly establishing associations between the two modalities. In contrast, this paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Dual encoding is conceptually simple, practically effective and end-to-end. As experiments on three benchmarks, i.e., MSR-VTT, TRECVID 2016 and 2017 Ad-hoc Video Search show, the proposed solution establishes a new state-of-the-art for zero-example video retrieval.

Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, Xun Wang: Dual Encoding for Zero-Example Video Retrieval. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

MMM2019: Four Models for Automatic Recognition of Left and Right Eye in Fundus Images

Our MMM2019 paper on recognizing left and right eyes in fundus images is online.

Fundus image analysis is crucial for eye condition screening and diagnosis, and consequently for personalized health management in the long term. This paper targets left and right eye recognition, a basic module for fundus image analysis. We study how to automatically assign left-eye/right-eye labels to fundus images of the posterior pole. For this under-explored task, four models are developed. Two of them are based on optic disc localization, using an extremely simple max-intensity method and the more advanced Faster R-CNN, respectively. The other two models require no localization, but perform holistic image classification using classical Local Binary Patterns (LBP) features and a fine-tuned ResNet-18, respectively. The four models are tested on a real-world set of 1,633 fundus images from 834 subjects. The fine-tuned ResNet-18 has the highest accuracy of 0.9847. Interestingly, the LBP based model, with the trick of left-right contrastive classification, performs close to the deep model, with an accuracy of 0.9718.
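
The max-intensity idea can be sketched in a few lines. The sketch assumes that the optic disc is the brightest spot in a grayscale fundus image and that, in a non-mirrored photo of the posterior pole, the disc lies in the right half of the image for a right eye and in the left half for a left eye. Both assumptions, the function names and the toy image are illustrative and do not reproduce the paper's implementation.

```python
def brightest_pixel(image):
    """Return (row, col) of the maximum-intensity pixel in a 2-D list of numbers."""
    best_value, best_row, best_col = None, 0, 0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value, best_row, best_col = value, r, c
    return best_row, best_col


def eye_side(image):
    """Guess laterality from the horizontal position of the brightest pixel."""
    _, col = brightest_pixel(image)
    width = len(image[0])
    # Assumed rule: disc in the right half of the image -> right eye.
    return "right" if col >= width / 2 else "left"


# Toy 3x4 "image" whose brightest pixel sits in the last column.
toy = [
    [10, 12, 11, 20],
    [11, 13, 40, 250],
    [10, 12, 15, 30],
]
mirrored = [list(reversed(row)) for row in toy]
```

Calling `eye_side(toy)` gives "right", and the horizontally mirrored image gives "left". A real image would need smoothing first, since a single bright artifact would otherwise mislead the rule.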

Xin Lai, Xirong Li, Rui Qian, Dayong Ding, Jun Wu, Jieping Xu: Four Models for Automatic Recognition of Left and Right Eye in Fundus Images. The 25th International Conference on MultiMedia Modeling (MMM), 2019.

ACCV2018: Laser Scar Detection in Fundus Images using Convolutional Neural Networks

We are going to present our work on detecting laser scars in color fundus images at the 14th Asian Conference on Computer Vision (ACCV 2018) in Perth, Australia. This is joint work with Vistel Inc. and Peking Union Medical College Hospital.

In diabetic eye screening programmes, a special pathway is designed for those who have received laser photocoagulation treatment. The treatment leaves behind circular or irregular scars in the retina. Laser scar detection in fundus images is thus important for automated diabetic retinopathy (DR) screening. Despite its importance, the problem is understudied in terms of both datasets and methods. This paper makes the first attempt to detect laser-scar images by deep learning. To that end, we contribute to the community Fundus10K, a large-scale expert-labeled dataset for training and evaluating laser scar detectors. We study in this new context major design choices of state-of-the-art Convolutional Neural Networks including Inception-v3, ResNet and DenseNet. For more effective training we exploit transfer learning that passes on trained weights of ImageNet models to their laser-scar counterparts. Experiments on the new dataset show that our best model detects laser-scar images with sensitivity of 0.962, specificity of 0.999, precision of 0.974, AP of 0.988 and AUC of 0.999. The same model is tested on the public LMD-BAPT test set, obtaining sensitivity of 0.765, specificity of 1, precision of 1, AP of 0.975 and AUC of 0.991, outperforming the state-of-the-art by a large margin. Data is available at https://github.com/li-xirong/fundus10k/
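
For readers less familiar with the reported metrics, the sketch below computes sensitivity, specificity and precision from the four confusion-matrix counts of a binary detector, using their standard definitions. The counts are made up for illustration and are unrelated to Fundus10K or LMD-BAPT.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard definitions for a binary detector (positive = laser-scar image)."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of positive images that are detected
        "specificity": tn / (tn + fp),  # fraction of negative images correctly rejected
        "precision": tp / (tp + fp),    # fraction of detections that are correct
    }


# Hypothetical counts: 10 positive and 90 negative test images.
metrics = binary_metrics(tp=9, fp=2, tn=88, fn=1)
```

Under these definitions, a specificity of 1 and a precision of 1, as reported on LMD-BAPT, both correspond to zero false positives.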

Qijie Wei, Xirong Li, Hao Wang, Dayong Ding, Weihong Yu, Youxin Chen: Laser Scar Detection in Fundus Images using Convolutional Neural Networks. Asian Conference on Computer Vision (ACCV), 2018.

Feature Re-Learning with Data Augmentation for Content-based Video Recommendation

We are going to present our work on content-based video recommendation in the Multimedia Grand Challenge session of the forthcoming ACM Multimedia 2018 Conference in Seoul. Source code will be available shortly at https://github.com/danieljf24/cbvr.

This paper describes our solution for the Hulu Content-based Video Relevance Prediction Challenge. Noting the deficiency of the original features, we propose feature re-learning to improve video relevance prediction. To generate more training instances for supervised learning, we develop two data augmentation strategies, one for frame-level features and the other for video-level features. In addition, late fusion of multiple models is employed to further boost the performance. Evaluation conducted by the organizers shows that our best run outperforms the Hulu baseline, obtaining relative improvements of 26.2% and 30.2% on the TV-shows track and the Movies track, respectively, in terms of recall@100. The results clearly justify the effectiveness of the proposed solution.
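
As a rough illustration of the evaluation, the sketch below computes recall@k for one ranked list and the relative improvement of one score over another. It assumes the usual definition of recall@k (relevant items found in the top k, divided by the number of relevant items); the challenge's official evaluation script may differ in detail, and the toy data and function names are hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear among the top-k ranked items."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def relative_improvement(ours, baseline):
    """Improvement of `ours` over `baseline`, as a fraction of the baseline."""
    return (ours - baseline) / baseline


# Toy example: one of the three relevant videos is ranked in the top 3.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "z"}
r_at_3 = recall_at_k(ranked, relevant, k=3)
gain = relative_improvement(ours=0.30, baseline=0.25)
```

Here recall@3 is one third, and a score of 0.30 against a baseline of 0.25 is a relative improvement of 20%.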

Jianfeng Dong, Xirong Li, Chaoxi Xu, Gang Yang, Xun Wang: Feature Re-Learning with Data Augmentation for Content-based Video Recommendation. ACM Multimedia, 2018, (Grand challenge paper).