Xirong Li

CVPR2025: PhD for Evaluating Large Models’ Visual Hallucination

Our CVPR2025 paper on MLLM visual hallucination evaluation is online, with PhD a new VHE benchmark available at hugging face.

Multimodal Large Language Models (MLLMs) hallucinate, resulting in an emerging topic of visual hallucination evaluation (VHE). This paper contributes a ChatGPT-Prompted visual hallucination evaluation Dataset (PhD) for objective VHE at a large scale. The essence of VHE is to ask an MLLM questions about specific images to assess its susceptibility to hallucination. Depending on what to ask (objects, attributes, sentiment, etc.) and how the questions are asked, we structure PhD along two dimensions, i.e. task and mode. Five visual recognition tasks, ranging from low-level (object / attribute recognition) to middle-level (sentiment / position recognition and counting), are considered. Besides a normal visual QA mode, which we term PhD-base, PhD also asks questions with specious context (PhD-sec) or with incorrect context (PhD-icc), or with AI-generated counter common sense images (PhD-ccs). We construct PhD by a ChatGPT-assisted semi-automated pipeline, encompassing four pivotal modules: task-specific hallucinatory item (hitem) selection, hitem-embedded question generation, specious / incorrect context generation, and counter-common-sense (CCS) image generation. With over 14k daily images, 750 CCS images and 102k VQA triplets in total, PhD reveals considerable variability in MLLMs’ performance across various modes and tasks, offering valuable insights into the nature of hallucination. As such, PhD stands as a potent tool not only for VHE but may also play a significant role in the refinement of MLLMs.

Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, Xirong Li: PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset. In: CVPR, 2025.

CVPR2024: Holistic Features are almost Sufficient for Text-to-Video Retrieval

Our CVPR2024 paper on text-to-video retrieval is online. Source code will be available shortly at github.

For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods currently lead the way. Compared to CLIP4Clip which is efficient and compact, state-of-the-art models tend to compute video-text similarity through fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR applications into doubt. We propose TeachCLIP, enabling a CLIP4Clip based student network to learn from more advanced yet computationally intensive models. In order to create a learning channel to convey fine-grained cross-modal knowledge from a heavy model to the student, we add to CLIP4Clip a simple Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage / computation overhead at the retrieval stage. Frame-text relevance scores calculated by the teacher network are used as soft labels to supervise the attentive weights produced by AFA. Extensive experiments on multiple public datasets justify the viability of the proposed method. TeachCLIP has the same efficiency and compactness as CLIP4Clip, yet has near-SOTA effectiveness.

Kaibin Tian, Ruixiang Zhao, Zijie Xin, Bangxiang Lan, Xirong Li: Holistic Features are almost Sufficient for Text-to-Video Retrieval. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

TPAMI 2023: MVSS-Net: Multi-View Multi-Scale Supervised Networks for Image Manipulation Detection

Our paper on deep learning based image manipulation detection has been published online as a regular paper in the March 2023 Issue of the IEEE Transactions on Pattern Analysis and Machine Intelligence journal (impact factor: 24.314). Source code is available at https://github.com/dong03/MVSS-Net

As manipulating images by copy-move, splicing and/or inpainting may lead to misinterpretation of the visual content, detecting these sorts of manipulations is crucial for media forensics. Given the variety of possible attacks on the content, devising a generic method is nontrivial. Current deep learning based methods are promising when training and test data are well aligned, but perform poorly on independent tests. Moreover, due to the absence of authentic test images, their image-level detection specificity is in doubt. The key question is how to design and train a deep neural network capable of learning generalizable features sensitive to manipulations in novel data, whilst specific to prevent false alarms on the authentic. We propose multi-view feature learning to jointly exploit tampering boundary artifacts and the noise view of the input image. As both clues are meant to be semantic-agnostic, the learned features are thus generalizable. For effectively learning from authentic images, we train with multi-scale (pixel / edge / image) supervision. We term the new network MVSS-Net and its enhanced version MVSS-Net++. Experiments are conducted in both within-dataset and cross-dataset scenarios, showing that MVSS-Net++ performs the best, and exhibits better robustness against JPEG compression, Gaussian blur and screenshot based image re-capturing.

Chengbo Dong, Xinru Chen, Ruohan Hu, Juan Cao, Xirong Li: MVSS-Net: Multi-View Multi-Scale Supervised Networks for Image Manipulation Detection. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.

AE for IET Computer Vision

I am delighted to serve as an Associate Editor of IET Computer Vision. The journal publishes original research papers in a wide range of areas of computer vision.

PR2022: 3D Object Detection for Autonomous Driving: A Survey

Our survey paper on 3D object detection for autonomous driving has been published online as a regular paper in the Pattern Recognition journal (impact factor: 7.740). Source code is available at https://github.com/rui-qian/SoTA-3D-Object-Detection.

Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of perception stack especially for the sake of path planning, motion prediction, and collision avoidance etc.. Taking a quick glance at the progress we have made, we attribute challenges to visual appearance recovery in the absence of depth information from images, representation learning from partially occluded unstructured point clouds, and semantic alignments over heterogeneous features from cross modalities. Despite existing efforts, 3D object detection for autonomous driving is still in its infancy. Recently, a large body of literature have been investigated to address this 3D vision task. Nevertheless, few investigations have looked into collecting and structuring this growing knowledge. We therefore aim to fill this gap in a comprehensive survey, encompassing all the main concerns including sensors, datasets, performance metrics and the recent state-of-the-art detection methods, together with their pros and cons. Furthermore, we provide quantitative comparisons with the state of the art. A case study on fifteen selected representative methods is presented, involved with runtime analysis, error analysis, and robustness analysis. Finally, we provide concluding remarks after an in-depth analysis of the surveyed works and identify promising directions for future work.

Rui Qian, Xin Lai, Xirong Li: 3D Object Detection for Autonomous Driving: A Survey. In: Pattern Recognition, 2022.

J-BHI 2022: Learning Two-Stream CNN for Multi-Modal Age-related Macular Degeneration Categorization

Our work on multi-modal AMD categorization has been published online as a regular paper in IEEE Journal of Biomedical and Health Informatics (Impact factor: 5.772). Source code is available at https://github.com/li-xirong/mmc-amd.

Proposed end-to-end solution for multi-modal AMD categorization

Weisen Wang, Xirong Li, Zhiyan Xu, Weihong Yu, Jianchun Zhao, Dayong Ding, Youxin Chen: Learning Two-Stream CNN for Multi-Modal Age-related Macular Degeneration Categorization. In: IEEE Journal of Biomedical and Health Informatics (J-BHI), 2022.

MM2021: Multi-Modal Multi-Instance Learning for Retinal Disease Recognition

Our ACMMM’2021 paper on multi-modal retinal disease recognition is online, with pre-recorded video presentation available at YouTube.

Proposed multi-modal retinal disease classification network in its inference mode.

This paper attacks an emerging challenge of multi-modal retinal disease recognition. Given a multi-modal case consisting of a color fundus photo (CFP) and an array of OCT B-scan images acquired during an eye examination, we aim to build a deep neural network that recognizes multiple vision-threatening diseases for the given case. As the diagnostic efficacy of CFP and OCT is disease-dependent, the network’s ability of being both selective and interpretable is important. Moreover, as both data acquisition and manual labeling are extremely expensive in the medical domain, the network has to be relatively lightweight for learning from a limited set of labeled multi-modal samples. Prior art on retinal disease recognition focuses either on a single disease or on a single modality, leaving multi-modal fusion largely underexplored. We propose in this paper Multi-Modal Multi-Instance Learning (MM-MIL) for selectively fusing CFP and OCT modalities. Its lightweight architecture (as compared to current multi-head attention modules) makes it suited for learning from relatively small-sized datasets. For an effective use of MM-MIL, we propose to generate a pseudo sequence of CFPs by over sampling a given CFP. The benefits of this tactic include well balancing instances across modalities, increasing the resolution of the CFP input, and finding out regions of the CFP most relevant with respect to the final diagnosis. Extensive experiments on a real-world dataset consisting of 1,206 multi-modal cases from 1,193 eyes of 836 subjects demonstrate the viability of the proposed model.

Xirong Li, Yang Zhou, Jie Wang, Hailan Lin, Jianchun Zhao, Dayong Ding, Weihong Yu, Youxin Chen: Multi-Modal Multi-Instance Learning for Retinal Disease Recognition. In: ACM Multimedia, 2021.

ICCV2021: Image Manipulation Detection by Multi-View Multi-Scale Supervision

Our ICCV’21 paper on image manipulation detection is online, with code and models released at https://github.com/dong03/MVSS-Net.

Pixel-level manipulation detection results of *MVSS-Net* in varied setups.

The key challenge of image manipulation detection is how to learn generalizable features that are sensitive to manipulations in novel data, whilst specific to prevent false alarms on authentic images. Current research emphasizes the sensitivity, with the specificity overlooked. In this paper we address both aspects by multi-view feature learning and multi-scale supervision. By exploiting noise distribution and boundary artifact surrounding tampered regions, the former aims to learn semantic-agnostic and thus more generalizable features. The latter allows us to learn from authentic images which are nontrivial to taken into account by current semantic segmentation network based methods. Our thoughts are realized by a new network which we term MVSS-Net. Extensive experiments on five benchmark sets justify the viability of MVSS-Net for both pixel-level and image-level manipulation detection.

Xinru Chen, Chengbo Dong, Jiaqi Ji, Juan Cao, Xirong Li: Image Manipulation Detection by Multi-View Multi-Scale Supervision. In: International Conference on Computer Vision (ICCV) , 2021.

ICPR2020: Multiple Instance Learning with Spatial Attention for ROP Case Classification, Instance Selection and Abnormality Localization

Our ICPR2020 oral paper on AI-based Retinopathy of Prematurity (ROP) diagnosis is online, and the slides as well.

This paper tackles automated screening of Retinopathy of Prematurity (ROP), one of the most common causes of visual loss in childhood. Clinically, ROP screening per case requires multiple color fundus images capturing different zones of the premature retina. A desirable model shall not only make a decision at the case level, but also pinpoint which instances and what part of the instances are responsible for the decision. This paper makes the first attempt to accomplish three tasks, i.e, ROP case classification, instance selection and abnormality localization in a unified framework. To that end, we propose a new model that effectively combines instance-attention based deep multiple instance learning (MIL) and spatial attention (SA). The propose model, which we term MIL-SA, identifies positive instances in light of their contributions to case-level decision. Meanwhile, abnormal regions in the identified instances are automatically localized by the SA mechanism. Moreover, MIL-SA is learned from case-level binary labels exclusively, and in an end-to-end manner. Experiments on a large clinical dataset of 2,186 cases with 11,053 fundus images show the viability of the proposed model for all the three tasks.

Xirong Li, Wencui Wan, Yang Zhou, Jianchun Zhao, Qijie Wei, Junbo Rong, Pengyi Zhou, Limin Xu, Lijuan Lang, Yuying Liu, Chengzhi Niu, Dayong Ding, Xuemin Jin: Deep Multiple Instance Learning with Spatial Attention for ROP Case Classification, Instance Selection and Abnormality Localization . In: 25th International Conference on Pattern Recognition (ICPR2020), 2020, (oral).