TPAMI 2023: MVSS-Net: Multi-View Multi-Scale Supervised Networks for Image Manipulation Detection

Our paper on deep-learning-based image manipulation detection has been published as a regular paper in the March 2023 issue of the IEEE Transactions on Pattern Analysis and Machine Intelligence journal (impact factor: 24.314). Source code is available at https://github.com/dong03/MVSS-Net.
As manipulating images by copy-move, splicing and/or inpainting may lead to misinterpretation of the visual content, detecting these sorts of manipulations is crucial for media forensics. Given the variety of possible attacks on the content, devising a generic method is nontrivial. Current deep learning based methods are promising when training and test data are well aligned, but perform poorly on independent tests. Moreover, due to the absence of authentic test images, their image-level detection specificity is in doubt. The key question is how to design and train a deep neural network capable of learning generalizable features that are sensitive to manipulations in novel data, whilst specific enough to prevent false alarms on authentic images. We propose multi-view feature learning to jointly exploit tampering boundary artifacts and the noise view of the input image. As both clues are meant to be semantic-agnostic, the learned features are thus generalizable. To learn effectively from authentic images, we train with multi-scale (pixel / edge / image) supervision. We term the new network MVSS-Net and its enhanced version MVSS-Net++. Experiments are conducted in both within-dataset and cross-dataset scenarios, showing that MVSS-Net++ performs the best and exhibits better robustness against JPEG compression, Gaussian blur and screenshot-based image re-capturing.
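
For readers curious how multi-scale supervision can be wired up in practice, below is a minimal PyTorch-style sketch that combines pixel-, edge- and image-scale losses. The plain BCE losses and the weights w_pixel / w_edge / w_image are illustrative placeholders, not the exact formulation or values used in the paper.

    import torch.nn.functional as F

    def multi_scale_loss(pixel_logits, edge_logits, image_logit,
                         pixel_gt, edge_gt, image_gt,
                         w_pixel=1.0, w_edge=0.5, w_image=0.1):
        # Pixel scale: dense manipulation mask (authentic images use an all-zero mask).
        loss_pixel = F.binary_cross_entropy_with_logits(pixel_logits, pixel_gt)
        # Edge scale: thin boundary map around tampered regions, a semantic-agnostic clue.
        loss_edge = F.binary_cross_entropy_with_logits(edge_logits, edge_gt)
        # Image scale: authentic vs. manipulated label, so fully authentic images
        # also contribute a training signal (the specificity side).
        loss_image = F.binary_cross_entropy_with_logits(image_logit, image_gt)
        return w_pixel * loss_pixel + w_edge * loss_edge + w_image * loss_image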

Chengbo Dong, Xinru Chen, Ruohan Hu, Juan Cao, Xirong Li: MVSS-Net: Multi-View Multi-Scale Supervised Networks for Image Manipulation Detection. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023.

PR2022: 3D Object Detection for Autonomous Driving: A Survey

Our survey paper on 3D object detection for autonomous driving has been published online as a regular paper in the Pattern Recognition journal (impact factor: 7.740). Source code is available at https://github.com/rui-qian/SoTA-3D-Object-Detection.
Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of the perception stack, especially for path planning, motion prediction, and collision avoidance. Taking a quick glance at the progress made so far, we attribute the main challenges to visual appearance recovery in the absence of depth information from images, representation learning from partially occluded unstructured point clouds, and semantic alignment over heterogeneous features from cross modalities. Despite existing efforts, 3D object detection for autonomous driving is still in its infancy. Recently, a large body of literature has been devoted to this 3D vision task. Nevertheless, few investigations have looked into collecting and structuring this growing knowledge. We therefore aim to fill this gap in a comprehensive survey, encompassing all the main concerns including sensors, datasets, performance metrics and the recent state-of-the-art detection methods, together with their pros and cons. Furthermore, we provide quantitative comparisons with the state of the art. A case study of fifteen selected representative methods is presented, covering runtime analysis, error analysis, and robustness analysis. Finally, we provide concluding remarks after an in-depth analysis of the surveyed works and identify promising directions for future work.

Rui Qian, Xin Lai, Xirong Li: 3D Object Detection for Autonomous Driving: A Survey. In: Pattern Recognition, 2022.

J-BHI 2022: Learning Two-Stream CNN for Multi-Modal Age-related Macular Degeneration Categorization

Our work on multi-modal AMD categorization has been published online as a regular paper in IEEE Journal of Biomedical and Health Informatics (impact factor: 5.772). Source code is available at https://github.com/li-xirong/mmc-amd.
Figure: Proposed end-to-end solution for multi-modal AMD categorization.
Weisen Wang, Xirong Li, Zhiyan Xu, Weihong Yu, Jianchun Zhao, Dayong Ding, Youxin Chen: Learning Two-Stream CNN for Multi-Modal Age-related Macular Degeneration Categorization. In: IEEE Journal of Biomedical and Health Informatics (J-BHI), 2022.

MM2021: Multi-Modal Multi-Instance Learning for Retinal Disease Recognition

Our ACMMM’2021 paper on multi-modal retinal disease recognition is online, with a pre-recorded video presentation available on YouTube.
Figure: Proposed multi-modal retinal disease classification network in its inference mode.
This paper attacks an emerging challenge of multi-modal retinal disease recognition. Given a multi-modal case consisting of a color fundus photo (CFP) and an array of OCT B-scan images acquired during an eye examination, we aim to build a deep neural network that recognizes multiple vision-threatening diseases for the given case. As the diagnostic efficacy of CFP and OCT is disease-dependent, the network’s ability to be both selective and interpretable is important. Moreover, as both data acquisition and manual labeling are extremely expensive in the medical domain, the network has to be relatively lightweight for learning from a limited set of labeled multi-modal samples. Prior art on retinal disease recognition focuses either on a single disease or on a single modality, leaving multi-modal fusion largely underexplored. We propose in this paper Multi-Modal Multi-Instance Learning (MM-MIL) for selectively fusing CFP and OCT modalities. Its lightweight architecture (as compared to current multi-head attention modules) makes it suited for learning from relatively small-sized datasets. For an effective use of MM-MIL, we propose to generate a pseudo sequence of CFPs by over-sampling a given CFP. The benefits of this tactic include balancing the number of instances across modalities, increasing the resolution of the CFP input, and identifying regions of the CFP most relevant to the final diagnosis. Extensive experiments on a real-world dataset consisting of 1,206 multi-modal cases from 1,193 eyes of 836 subjects demonstrate the viability of the proposed model.
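
As a rough illustration of the over-sampling and multi-instance fusion ideas described above, here is a PyTorch-style sketch in which a single CFP is turned into a pseudo sequence of random crops and a simple attention-based MIL layer pools a mixed bag of instance features. Crop sizes, class counts and layer names are assumptions for illustration; this is not the exact MM-MIL architecture.

    import torch
    import torch.nn as nn

    def oversample_cfp(cfp, num_crops=8, crop_size=448):
        # Turn one color fundus photo (C, H, W) into a pseudo sequence of crops,
        # roughly balancing the CFP side against the array of OCT B-scans.
        # Assumes the CFP is at least crop_size x crop_size pixels.
        _, h, w = cfp.shape
        crops = []
        for _ in range(num_crops):
            top = torch.randint(0, h - crop_size + 1, (1,)).item()
            left = torch.randint(0, w - crop_size + 1, (1,)).item()
            crops.append(cfp[:, top:top + crop_size, left:left + crop_size])
        return torch.stack(crops)  # (num_crops, C, crop_size, crop_size)

    class AttentionMILFusion(nn.Module):
        # Toy attention-based MIL pooling over a bag of CFP-crop and OCT-B-scan features.
        def __init__(self, feat_dim=512, hidden=128, num_classes=3):
            super().__init__()
            self.attn = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, 1))
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, instance_feats):  # (num_instances, feat_dim)
            weights = torch.softmax(self.attn(instance_feats), dim=0)  # instance weights
            bag_feat = (weights * instance_feats).sum(dim=0)           # bag-level feature
            return self.classifier(bag_feat), weights.squeeze(-1)

The per-instance attention weights returned by the forward pass are what make such a fusion both selective and, to some extent, interpretable.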

Xirong Li, Yang Zhou, Jie Wang, Hailan Lin, Jianchun Zhao, Dayong Ding, Weihong Yu, Youxin Chen: Multi-Modal Multi-Instance Learning for Retinal Disease Recognition. In: ACM Multimedia, 2021.

ICCV2021: Image Manipulation Detection by Multi-View Multi-Scale Supervision

Our ICCV’21 paper on image manipulation detection is online, with code and models released at https://github.com/dong03/MVSS-Net.
Figure: Pixel-level manipulation detection results of MVSS-Net in varied setups.
The key challenge of image manipulation detection is how to learn generalizable features that are sensitive to manipulations in novel data, whilst specific enough to prevent false alarms on authentic images. Current research emphasizes the sensitivity, with the specificity overlooked. In this paper we address both aspects by multi-view feature learning and multi-scale supervision. By exploiting noise distribution and boundary artifacts surrounding tampered regions, the former aims to learn semantic-agnostic and thus more generalizable features. The latter allows us to learn from authentic images, which are nontrivial for current semantic-segmentation-based methods to take into account. These ideas are realized in a new network which we term MVSS-Net. Extensive experiments on five benchmark sets justify the viability of MVSS-Net for both pixel-level and image-level manipulation detection.
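
The noise view mentioned above is commonly obtained by high-pass filtering that suppresses image semantics and keeps noise residuals. The sketch below uses a single fixed SRM-style kernel purely for illustration; the filters actually used inside MVSS-Net may differ.

    import torch
    import torch.nn.functional as F

    # A classic SRM high-pass kernel used in media forensics to expose noise residuals.
    SRM_KERNEL = torch.tensor([[-1.,  2.,  -2.,  2., -1.],
                               [ 2., -6.,   8., -6.,  2.],
                               [-2.,  8., -12.,  8., -2.],
                               [ 2., -6.,   8., -6.,  2.],
                               [-1.,  2.,  -2.,  2., -1.]]) / 12.0

    def noise_view(rgb):
        # rgb: (B, 3, H, W). Depthwise high-pass filtering suppresses semantic content
        # and keeps the high-frequency residuals where splicing or inpainting tends
        # to leave traces.
        weight = SRM_KERNEL.to(rgb).view(1, 1, 5, 5).repeat(3, 1, 1, 1)
        return F.conv2d(rgb, weight, padding=2, groups=3)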

Xinru Chen, Chengbo Dong, Jiaqi Ji, Juan Cao, Xirong Li: Image Manipulation Detection by Multi-View Multi-Scale Supervision. In: International Conference on Computer Vision (ICCV), 2021.

ICPR2020: Deep Multiple Instance Learning with Spatial Attention for ROP Case Classification, Instance Selection and Abnormality Localization

Our ICPR2020 oral paper on AI-based Retinopathy of Prematurity (ROP) diagnosis is online, along with the slides.
This paper tackles automated screening of Retinopathy of Prematurity (ROP), one of the most common causes of visual loss in childhood. Clinically, ROP screening per case requires multiple color fundus images capturing different zones of the premature retina. A desirable model shall not only make a decision at the case level, but also pinpoint which instances and what part of the instances are responsible for the decision. This paper makes the first attempt to accomplish three tasks, i.e., ROP case classification, instance selection and abnormality localization, in a unified framework. To that end, we propose a new model that effectively combines instance-attention based deep multiple instance learning (MIL) and spatial attention (SA). The proposed model, which we term MIL-SA, identifies positive instances in light of their contributions to the case-level decision. Meanwhile, abnormal regions in the identified instances are automatically localized by the SA mechanism. Moreover, MIL-SA is learned from case-level binary labels exclusively, and in an end-to-end manner. Experiments on a large clinical dataset of 2,186 cases with 11,053 fundus images show the viability of the proposed model for all three tasks.
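
To make the role of spatial attention concrete, here is a small PyTorch-style sketch of spatial-attention pooling, where the attention map that weights the feature map for classification doubles as an abnormality heatmap. It is an illustrative toy, not the exact MIL-SA design; instance-level attention over the pooled features would then yield the case-level decision.

    import torch
    import torch.nn as nn

    class SpatialAttentionPool(nn.Module):
        def __init__(self, channels=512):
            super().__init__()
            self.score = nn.Conv2d(channels, 1, kernel_size=1)

        def forward(self, feat):  # feat: (B, C, H, W) per-instance feature maps
            attn = torch.sigmoid(self.score(feat))            # (B, 1, H, W) heatmap
            pooled = (attn * feat).sum(dim=(2, 3)) / attn.sum(dim=(2, 3)).clamp(min=1e-6)
            return pooled, attn   # instance feature for MIL + localization map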

Xirong Li, Wencui Wan, Yang Zhou, Jianchun Zhao, Qijie Wei, Junbo Rong, Pengyi Zhou, Limin Xu, Lijuan Lang, Yuying Liu, Chengzhi Niu, Dayong Ding, Xuemin Jin: Deep Multiple Instance Learning with Spatial Attention for ROP Case Classification, Instance Selection and Abnormality Localization. In: 25th International Conference on Pattern Recognition (ICPR), 2020 (oral).

Two papers accepted at ICPR2020

We have two papers accepted at ICPR2020. Now in its 25th edition, the International Conference on Pattern Recognition (ICPR) is a premier international conference on pattern recognition.
  1. Wei et al., Learn to Segment Retinal Lesions and Beyond, ICPR 2020
  2. Li et al., Deep Multiple Instance Learning with Spatial Attention for ROP Case Classification, Instance Selection and Abnormality Localization, ICPR 2020

ICMR2020: iCap: Interactive Image Captioning with Predictive Text

Our ICMR’20 paper on interactive image captioning is online.

In this paper we study the brand new topic of interactive image captioning with a human in the loop. Different from automated image captioning, where a given test image is the sole input in the inference stage, in the interactive scenario we have access to both the test image and a sequence of (incomplete) user-input sentences. We formulate the problem as Visually Conditioned Sentence Completion (VCSC). For VCSC, we propose ABD-Cap, asynchronous bidirectional decoding for image caption completion. With ABD-Cap as the core module, we build iCap, a web-based interactive image captioning system capable of predicting new text with respect to live input from a user. A number of experiments covering both automated evaluations and real user studies show the viability of our proposals.
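
At inference time, visually conditioned sentence completion boils down to letting a captioning decoder continue from the user’s prefix. The loop below is a minimal greedy sketch under the assumption that decoder is some autoregressive captioning model returning next-word logits; it is a placeholder, not the ABD-Cap module itself.

    import torch

    def complete_caption(decoder, image_feat, prefix_ids, eos_id, max_len=20):
        # Greedily extend the user's (incomplete) prefix, conditioned on the image,
        # until an end-of-sentence token or the length limit is reached.
        tokens = list(prefix_ids)
        for _ in range(max_len):
            inp = torch.tensor(tokens).unsqueeze(0)   # (1, t)
            logits = decoder(image_feat, inp)         # (1, t, vocab_size)
            next_id = int(logits[0, -1].argmax())
            if next_id == eos_id:
                break
            tokens.append(next_id)
        return tokens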

Zhengxiong Jia, Xirong Li: iCap: Interactive Image Captioning with Predictive Text. In: ACM International Conference on Multimedia Retrieval (ICMR), 2020.

MM2019: W2VV++: Fully Deep Learning for Ad-hoc Video Search

Our ACMMM’19 paper on ad-hoc video search is online. Source code and data are accessible via https://github.com/li-xirong/w2vvpp.

Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. Different from previous concept-based methods, we propose an end-to-end deep learning method for query representation learning. The proposed method requires no concept modeling, matching or selection. The backbone of our method is the proposed W2VV++ model, a super version of Word2VisualVec (W2VV) previously developed for visual-to-text matching. W2VV++ is obtained by tweaking W2VV with a better sentence encoding strategy and an improved triplet ranking loss. With these simple changes, W2VV++ brings in a substantial improvement in performance. As our participation in the TRECVID 2018 AVS task and retrospective experiments on the TRECVID 2016 and 2017 data show, our best single model, with an overall inferred average precision (infAP) of 0.157, outperforms the state-of-the-art. The performance can be further boosted by model ensemble using late average fusion, reaching a higher infAP of 0.163. With W2VV++, we establish a new baseline for ad-hoc video search.
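
The improved triplet ranking loss mentioned above is a margin-based ranking loss; a commonly used improved variant mines the hardest negatives within a mini-batch. The sketch below shows such a formulation for cross-modal text-to-video matching; it is illustrative and not necessarily the exact W2VV++ implementation.

    import torch

    def triplet_ranking_loss(video_emb, text_emb, margin=0.2):
        # video_emb, text_emb: (B, D) L2-normalized embeddings of matching pairs.
        sims = text_emb @ video_emb.t()                 # (B, B) similarity matrix
        pos = sims.diag()                               # matching pairs on the diagonal
        mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
        hardest_video = sims.masked_fill(mask, -1e9).max(dim=1).values  # per text query
        hardest_text = sims.masked_fill(mask, -1e9).max(dim=0).values   # per video
        loss = (margin + hardest_video - pos).clamp(min=0) \
             + (margin + hardest_text - pos).clamp(min=0)
        return loss.mean()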

Xirong Li, Chaoxi Xu, Gang Yang, Zhineng Chen, Jianfeng Dong: W2VV++: Fully Deep Learning for Ad-hoc Video Search. In: ACM Multimedia, 2019.

T-MM 2019: COCO-CN for Cross-Lingual Image Tagging, Captioning, and Retrieval

Our work on cross-lingual image tagging, captioning and retrieval has been published as a regular paper in the September issue of the IEEE Transactions on Multimedia (impact factor: 5.452). Data and code are available at https://github.com/li-xirong/coco-cn.

This paper contributes to cross-lingual image annotation and retrieval in terms of data and baseline methods. We propose COCO-CN, a novel dataset enriching MS-COCO with manually written Chinese sentences and tags. For effective annotation acquisition, we develop a recommendation-assisted collective annotation system, automatically providing an annotator with several tags and sentences deemed to be relevant with respect to the pictorial content. Having 20,342 images annotated with 27,218 Chinese sentences and 70,993 tags, COCO-CN is currently the largest Chinese–English dataset that provides a unified and challenging platform for cross-lingual image tagging, captioning, and retrieval. We develop conceptually simple yet effective methods per task for learning from cross-lingual resources. Extensive experiments on the three tasks justify the viability of the proposed dataset and methods.

Xirong Li, Chaoxi Xu, Xiaoxu Wang, Weiyu Lan, Zhengxiong Jia, Gang Yang, Jieping Xu: COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval. In: IEEE Transactions on Multimedia, vol. 21, no. 9, pp. 2347-2360, 2019.