MMM2019: Four Models for Automatic Recognition of Left and Right Eye in Fundus Images

Our MMM2019 paper on recognizing the left / right eye in fundus images is online.

Fundus image analysis is crucial for eye condition screening and diagnosis, and consequently for personalized health management in the long term. This paper targets left and right eye recognition, a basic module for fundus image analysis. We study how to automatically assign left-eye/right-eye labels to fundus images of the posterior pole. For this under-explored task, four models are developed. Two of them are based on optic disc localization, using an extremely simple max-intensity method and the more advanced Faster R-CNN, respectively. The other two models require no localization, but perform holistic image classification using classical Local Binary Patterns (LBP) features and a fine-tuned ResNet-18, respectively. The four models are tested on a real-world set of 1,633 fundus images from 834 subjects. The fine-tuned ResNet-18 has the highest accuracy of 0.9847. Interestingly, the LBP-based model, with the trick of left-right contrastive classification, performs close to the deep model, with an accuracy of 0.9718.
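
To make the simplest of the four ideas concrete, here is a minimal sketch of max-intensity optic disc localization followed by a left/right decision. It is not the paper's implementation. The function names, the block-averaging step and the block size are hypothetical, and the decision rule assumes a macula-centered photograph in which the optic disc lies in the right half of the image for a right eye.

```python
import numpy as np

def brightest_column(gray: np.ndarray, block: int = 8) -> int:
    """Return the x-coordinate of the centre of the brightest image block.

    Averaging over non-overlapping blocks is an assumption of this sketch;
    it keeps a single noisy pixel from being picked as the optic disc.
    """
    h, w = gray.shape
    hb, wb = h // block, w // block
    # Crop to a multiple of the block size, then average each block.
    cropped = gray[:hb * block, :wb * block]
    blocks = cropped.reshape(hb, block, wb, block).mean(axis=(1, 3))
    _, bx = np.unravel_index(np.argmax(blocks), blocks.shape)
    return int(bx) * block + block // 2

def guess_eye(gray: np.ndarray, block: int = 8) -> str:
    """Label the image 'right' if the brightest block is in the right half.

    Assumed convention: in a macula-centered fundus photograph the optic
    disc appears on the right side for a right eye.
    """
    x = brightest_column(gray, block)
    return "right" if x >= gray.shape[1] / 2 else "left"
```

Calling `guess_eye` on a grayscale array returns `"left"` or `"right"`. Bright lesions or uneven illumination would mislead such a rule, which is one plausible reason to compare it against learned models.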

Xin Lai, Xirong Li, Rui Qian, Dayong Ding, Jun Wu, Jieping Xu (2019): Four Models for Automatic Recognition of Left and Right Eye in Fundus Images. The 25th International Conference on MultiMedia Modeling (MMM), 2019.


ACCV2018: Laser Scar Detection in Fundus Images using Convolutional Neural Networks

We are going to present our work on detecting laser scars in color fundus images at the 14th Asian Conference on Computer Vision (ACCV 2018) in Perth, Australia. This is joint work with Vistel Inc. and Peking Union Medical College Hospital.

In diabetic eye screening programmes, a special pathway is designed for those who have received laser photocoagulation treatment. The treatment leaves behind circular or irregular scars in the retina. Laser scar detection in fundus images is thus important for automated diabetic retinopathy (DR) screening. Despite its importance, the problem is understudied in terms of both datasets and methods. This paper makes the first attempt to detect laser-scar images by deep learning. To that end, we contribute to the community Fundus10K, a large-scale expert-labeled dataset for training and evaluating laser scar detectors. We study in this new context major design choices of state-of-the-art Convolutional Neural Networks including Inception-v3, ResNet and DenseNet. For more effective training we exploit transfer learning that passes on trained weights of ImageNet models to their laser-scar counterparts. Experiments on the new dataset show that our best model detects laser-scar images with sensitivity of 0.962, specificity of 0.999, precision of 0.974, AP of 0.988 and AUC of 0.999. The same model is tested on the public LMD-BAPT test set, obtaining sensitivity of 0.765, specificity of 1, precision of 1, AP of 0.975 and AUC of 0.991, outperforming the state of the art by a large margin. Data is available at
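
For readers less familiar with the detection metrics quoted in the abstract, the sketch below shows how sensitivity, specificity and precision follow from the four confusion-matrix counts. The function name and the example counts are hypothetical; the abstract does not give the underlying counts.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute sensitivity, specificity and precision from confusion counts.

    tp/fn count laser-scar images that were detected/missed;
    tn/fp count scar-free images that were correctly passed/falsely flagged.
    """
    def ratio(num: int, den: int) -> float:
        # Return NaN rather than raising when a class is absent.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # recall on laser-scar images
        "specificity": ratio(tn, tn + fp),  # recall on scar-free images
        "precision": ratio(tp, tp + fp),    # share of detections that are correct
    }
```

A specificity of exactly 1 means there are no false positives, so precision is also 1 whenever at least one scar image is detected. That is consistent with the LMD-BAPT figures above, where specificity and precision are both 1 while sensitivity is lower.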

Qijie Wei, Xirong Li, Hao Wang, Dayong Ding, Weihong Yu, Youxin Chen (2018): Laser Scar Detection in Fundus Images using Convolutional Neural Networks. Asian Conference on Computer Vision (ACCV), 2018.

Predicting Visual Features from Text for Image and Video Caption Retrieval

Our Word2VisualVec work has been accepted for publication as a REGULAR paper in the IEEE Transactions on Multimedia. Source code is available at

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Unlike existing works, which rely on a joint subspace for image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both 3-D convolutional neural network features and a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec’s properties, its benefit over textual embeddings, the potential for multimodal query composition and its state-of-the-art results.
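
The idea of retrieving captions in the visual space can be sketched as follows: each candidate sentence is mapped to a predicted visual feature by a small multi-layer perceptron, and sentences are ranked by cosine similarity to the feature of the query image. This is an illustration only, not the released Word2VisualVec code. The function names, the single hidden layer with ReLU and the use of cosine similarity are assumptions of the sketch, and the sentence vectors are taken as given.

```python
import numpy as np

def mlp_predict(text_vecs: np.ndarray, w1: np.ndarray, b1: np.ndarray,
                w2: np.ndarray, b2: np.ndarray) -> np.ndarray:
    """Map sentence vectors (n, d_text) to predicted visual features (n, d_vis).

    A one-hidden-layer perceptron with ReLU; the weights are assumed to have
    been trained elsewhere to regress the visual feature of the captioned image.
    """
    hidden = np.maximum(0.0, text_vecs @ w1 + b1)
    return hidden @ w2 + b2

def rank_sentences(image_feat: np.ndarray, predicted_feats: np.ndarray):
    """Rank sentences by cosine similarity to the image feature, best first."""
    eps = 1e-12  # guards against division by zero for all-zero vectors
    img = image_feat / max(np.linalg.norm(image_feat), eps)
    norms = np.linalg.norm(predicted_feats, axis=1, keepdims=True)
    preds = predicted_feats / np.maximum(norms, eps)
    scores = preds @ img
    order = np.argsort(-scores)  # indices of sentences, most similar first
    return order, scores
```

Because matching happens entirely in the visual feature space, the image side needs no learned projection: any precomputed visual feature of the chosen type can be compared directly against the predictions.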

Jianfeng Dong, Xirong Li, Cees G. M. Snoek (2018): Predicting Visual Features from Text for Image and Video Caption Retrieval. In: IEEE Transactions on Multimedia (TMM), 20 (12), pp. 3377-3388, 2018.