AAAI2025: Enriching E-commerce Videos with Sound Effects

Our AAAI2025 paper on enriching E-commerce videos with sound effects is online, with code and data released at https://rucmm.github.io/VDSFX/.

Videos showcasing specific products are increasingly important for E-commerce. Key moments arise naturally in such videos: the first appearance of a specific product, the presentation of its distinctive features, the display of a buying link, etc. Adding proper sound effects (SFX) to such moments, or video decoration with SFX (VDSFX), is crucial for enhancing the user experience. Previous work adds SFX to videos by video-to-SFX matching at a holistic level, lacking the ability to add SFX to a specific moment. Meanwhile, previous studies on video highlight detection or video moment retrieval consider only moment localization, leaving moment-to-SFX matching untouched. By contrast, we propose in this paper D&M, a unified method that accomplishes key moment detection and moment-to-SFX matching simultaneously. Moreover, for the new VDSFX task we build SFX-Moment, a large-scale dataset collected from an E-commerce video creation platform. For a fair comparison, we build competitive baselines by extending a number of current video moment detection methods to the new task. Extensive experiments on SFX-Moment show the superior performance of the proposed method over the baselines.
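For a rough idea of how a unified detect-and-match pipeline could operate at inference time, here is a minimal PyTorch sketch. The names (DnMSketch, detect_and_match) and the particular design, per-frame key-moment scoring plus cosine similarity in a joint video-SFX embedding space, are our illustrative assumptions, not the paper's actual architecture; see the paper and released code for the real D&M method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DnMSketch(nn.Module):
    """Toy stand-in for a unified model: one head scores each frame as a
    key moment, another projects frames into a joint video-SFX space."""
    def __init__(self, dim=512):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)    # key-moment confidence per frame
        self.embed_head = nn.Linear(dim, dim)  # projection to the joint space

    def forward(self, video_feats):                        # video_feats: (T, dim)
        scores = self.score_head(video_feats).squeeze(-1)  # (T,)
        embs = self.embed_head(video_feats)                # (T, dim)
        return scores, embs

def detect_and_match(video_feats, sfx_feats, model, top_k=5, threshold=0.3):
    """Return (frame_index, sfx_index) pairs for confidently matched key moments."""
    scores, embs = model(video_feats)
    key_idx = scores.topk(min(top_k, scores.numel())).indices  # detected moments
    moments = F.normalize(embs[key_idx], dim=-1)               # (top_k, dim)
    sfx = F.normalize(sfx_feats, dim=-1)                       # (N, dim)
    best_sim, best_sfx = (moments @ sfx.t()).max(dim=-1)       # best SFX per moment
    return [(int(t), int(j)) for t, j, s in zip(key_idx, best_sfx, best_sim)
            if s >= threshold]

# Example with random features standing in for real video/SFX embeddings.
matches = detect_and_match(torch.randn(100, 512), torch.randn(200, 512), DnMSketch())
```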

Jingyu Liu, Minquan Wang, Ye Ma, Bo Wang, Aozhu Chen, Quan Chen, Peng Jiang, Xirong Li: D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching. In: The 39th Annual AAAI Conference on Artificial Intelligence (AAAI), 2025.

J-BHI 2022: Learning Two-Stream CNN for Multi-Modal Age-related Macular Degeneration Categorization

Our work on multi-modal AMD categorization has been published online as a regular paper in IEEE Journal of Biomedical and Health Informatics (Impact factor: 5.772). Source code is available at https://github.com/li-xirong/mmc-amd.
Figure: The proposed end-to-end solution for multi-modal AMD categorization.
Weisen Wang, Xirong Li, Zhiyan Xu, Weihong Yu, Jianchun Zhao, Dayong Ding, Youxin Chen: Learning Two-Stream CNN for Multi-Modal Age-related Macular Degeneration Categorization. In: IEEE Journal of Biomedical and Health Informatics (J-BHI), 2022.
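As a rough illustration of the two-stream idea, the sketch below pairs one CNN backbone per modality and fuses their features before classification. The ResNet-18 backbones, 256-D features, concatenation fusion, and 4-way output are illustrative assumptions rather than the paper's exact configuration, and the OCT input is treated as a single image for simplicity; see the paper and the mmc-amd code for the actual design.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TwoStreamSketch(nn.Module):
    """Hypothetical two-stream fusion: one CNN per modality, with the two
    feature vectors concatenated and fed to a linear classifier."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.cfp_stream = resnet18(num_classes=256)  # color fundus photo stream
        self.oct_stream = resnet18(num_classes=256)  # OCT stream
        self.classifier = nn.Linear(2 * 256, num_classes)

    def forward(self, cfp, oct_img):                 # each: (B, 3, H, W)
        feats = torch.cat([self.cfp_stream(cfp), self.oct_stream(oct_img)], dim=1)
        return self.classifier(feats)                # (B, num_classes)

# Example forward pass with dummy images.
logits = TwoStreamSketch()(torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224))
```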

MM2021: Multi-Modal Multi-Instance Learning for Retinal Disease Recognition

Our ACMMM’2021 paper on multi-modal retinal disease recognition is online, with a pre-recorded video presentation available on YouTube.
Figure: The proposed multi-modal retinal disease classification network in its inference mode.
This paper attacks an emerging challenge of multi-modal retinal disease recognition. Given a multi-modal case consisting of a color fundus photo (CFP) and an array of OCT B-scan images acquired during an eye examination, we aim to build a deep neural network that recognizes multiple vision-threatening diseases for the given case. As the diagnostic efficacy of CFP and OCT is disease-dependent, the network’s ability to be both selective and interpretable is important. Moreover, as both data acquisition and manual labeling are extremely expensive in the medical domain, the network has to be relatively lightweight for learning from a limited set of labeled multi-modal samples. Prior art on retinal disease recognition focuses either on a single disease or on a single modality, leaving multi-modal fusion largely underexplored. We propose in this paper Multi-Modal Multi-Instance Learning (MM-MIL) for selectively fusing the CFP and OCT modalities. Its lightweight architecture (compared to current multi-head attention modules) makes it suitable for learning from relatively small datasets. For an effective use of MM-MIL, we propose to generate a pseudo sequence of CFPs by oversampling a given CFP. The benefits of this tactic include balancing instances across the two modalities, increasing the resolution of the CFP input, and identifying the CFP regions most relevant to the final diagnosis. Extensive experiments on a real-world dataset consisting of 1,206 multi-modal cases from 1,193 eyes of 836 subjects demonstrate the viability of the proposed model.
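The following sketch illustrates the two tactics the abstract describes: oversampling a single CFP into a pseudo sequence of instances, and attention-based multi-instance pooling over instance embeddings. The crop count and size, embedding dimension, and the exact pooling form are illustrative assumptions, not the actual MM-MIL implementation; consult the paper for the real design.

```python
import torch
import torch.nn as nn

def oversample_cfp(cfp, num_crops=6, crop=448):
    """Turn one CFP (3, H, W) into a pseudo sequence of random crops, so a
    single photo yields multiple high-resolution instances (crop count and
    size are assumptions)."""
    _, H, W = cfp.shape
    ys = torch.randint(0, H - crop + 1, (num_crops,)).tolist()
    xs = torch.randint(0, W - crop + 1, (num_crops,)).tolist()
    return torch.stack([cfp[:, y:y + crop, x:x + crop] for y, x in zip(ys, xs)])

class MILAttentionPool(nn.Module):
    """Attention-based MIL pooling: learn a weight per instance embedding and
    take the weighted sum, letting the most diagnosis-relevant instances
    (CFP crops or OCT B-scans) dominate the bag-level representation."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))

    def forward(self, inst_embs):                       # (num_instances, dim)
        w = torch.softmax(self.attn(inst_embs), dim=0)  # attention over instances
        return (w * inst_embs).sum(dim=0)               # (dim,) bag embedding

# Example: six 448x448 crops from a 3x1024x1024 fundus photo.
crops = oversample_cfp(torch.rand(3, 1024, 1024))       # (6, 3, 448, 448)
```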

Xirong Li, Yang Zhou, Jie Wang, Hailan Lin, Jianchun Zhao, Dayong Ding, Weihong Yu, Youxin Chen: Multi-Modal Multi-Instance Learning for Retinal Disease Recognition. In: ACM Multimedia, 2021.