MM2019: W2VV++: Fully Deep Learning for Ad-hoc Video Search

Our ACMMM’19 paper on ad-hoc video search is online. Source code and data are accessible via https://github.com/li-xirong/w2vvpp.

Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. Different from previous concept-based methods, we propose an end-to-end deep learning method for query representation learning. The proposed method requires no concept modeling, matching, or selection. The backbone of our method is the proposed W2VV++ model, a super version of Word2VisualVec (W2VV) previously developed for visual-to-text matching. W2VV++ is obtained by tweaking W2VV with a better sentence encoding strategy and an improved triplet ranking loss. With these simple changes, W2VV++ brings a substantial improvement in performance. As our participation in the TRECVID 2018 AVS task and retrospective experiments on the TRECVID 2016 and 2017 data show, our best single model, with an overall inferred average precision (infAP) of 0.157, outperforms the state-of-the-art. The performance can be further boosted by model ensemble using late average fusion, reaching a higher infAP of 0.163. With W2VV++, we establish a new baseline for ad-hoc video search.
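
For readers curious about the training objective, below is a minimal PyTorch sketch of a triplet ranking loss with hardest-negative mining, the general idea behind the improved loss mentioned above. The margin value and the use of plain cosine similarity are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a triplet ranking loss with hardest-negative mining.
# Margin and similarity choices are illustrative, not the paper's exact setup.
import torch
import torch.nn.functional as F

def triplet_ranking_loss(video_emb, query_emb, margin=0.2):
    """video_emb, query_emb: (batch, dim) L2-normalized embeddings where
    row i of the two matrices is a matching video/query pair."""
    # Cosine similarity matrix: sim[i, j] = sim(query_i, video_j)
    sim = query_emb @ video_emb.t()
    pos = sim.diag().view(-1, 1)                       # similarities of true pairs

    # Mask the diagonal so a pair is never treated as its own negative.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))

    # Hardest negative per query (row-wise) and per video (column-wise).
    hardest_vid = sim.max(dim=1).values.view(-1, 1)    # wrong video for each query
    hardest_qry = sim.max(dim=0).values.view(-1, 1)    # wrong query for each video

    loss = F.relu(margin + hardest_vid - pos) + F.relu(margin + hardest_qry - pos)
    return loss.mean()
```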

Xirong Li, Chaoxi Xu, Gang Yang, Zhineng Chen, Jianfeng Dong: W2VV++: Fully Deep Learning for Ad-hoc Video Search. In: ACM Multimedia, 2019.

T-MM 2019: COCO-CN for Cross-Lingual Image Tagging, Captioning, and Retrieval

Our work on cross-lingual image tagging, captioning and retrieval has been published as a regular paper in the September issue of the IEEE Transactions on Multimedia (Impact factor: 5.452). Data and code are available at https://github.com/li-xirong/coco-cn.

This paper contributes to cross-lingual image annotation and retrieval in terms of data and baseline methods. We propose COCO-CN, a novel dataset enriching MS-COCO with manually written Chinese sentences and tags. For effective annotation acquisition, we develop a recommendation-assisted collective annotation system, automatically providing an annotator with several tags and sentences deemed to be relevant with respect to the pictorial content. Having 20,342 images annotated with 27,218 Chinese sentences and 70,993 tags, COCO-CN is currently the largest Chinese–English dataset that provides a unified and challenging platform for cross-lingual image tagging, captioning, and retrieval. We develop conceptually simple yet effective methods per task for learning from cross-lingual resources. Extensive experiments on the three tasks justify the viability of the proposed dataset and methods.
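
To make the cross-lingual annotations concrete, here is a hypothetical data structure for a single COCO-CN example. All field names are assumptions for illustration only; the actual file layout in the repository may differ.

```python
# Hypothetical representation of one COCO-CN example; field names are
# assumptions and the actual data files in the repository may be organized differently.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CocoCnExample:
    image_id: str                                            # MS-COCO image the annotations attach to
    zh_sentences: List[str] = field(default_factory=list)    # manually written Chinese captions
    zh_tags: List[str] = field(default_factory=list)         # manually assigned Chinese tags
    en_sentences: List[str] = field(default_factory=list)    # original MS-COCO English captions

example = CocoCnExample(
    image_id="COCO_val2014_000000000042",
    zh_sentences=["一只狗在草地上奔跑。"],
    zh_tags=["狗", "草地"],
    en_sentences=["A dog running on the grass."],
)
```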

Xirong Li, Chaoxi Xu, Xiaoxu Wang, Weiyu Lan, Zhengxiong Jia, Gang Yang, Jieping Xu: COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval. In: IEEE Transactions on Multimedia, vol. 21, no. 9, pp. 2347-2360, 2019.

CVPR2019: Dual Encoding for Zero-Example Video Retrieval

Our CVPR'19 paper on zero-example video retrieval is online. Data and source code are publicly available on GitHub.

This paper attacks the challenging problem of zero-example video retrieval. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described in natural language text with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is required. The majority of existing methods are concept based, extracting relevant concepts from queries and videos and accordingly establishing associations between the two modalities. In contrast, this paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Dual encoding is conceptually simple, practically effective and end-to-end. As experiments on three benchmarks (MSR-VTT and the TRECVID 2016 and 2017 Ad-hoc Video Search tasks) show, the proposed solution establishes a new state-of-the-art for zero-example video retrieval.
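
As a rough illustration of the dual encoding idea, the sketch below maps a variable-length sequence (video frames or query words) to a single dense vector in a shared space, with one encoder per modality. The real network uses richer multi-level encoders, and the layer sizes here are illustrative assumptions.

```python
# Much-simplified sketch of dual encoding: two branches that each turn a
# sequence into one dense vector for cosine matching. Layer sizes are illustrative.
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    def __init__(self, feat_dim, hidden_dim, out_dim):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Concatenate a global mean-pooled encoding with the biGRU summary, then project.
        self.proj = nn.Linear(feat_dim + 2 * hidden_dim, out_dim)

    def forward(self, seq):                  # seq: (batch, steps, feat_dim)
        mean_pool = seq.mean(dim=1)          # global encoding
        rnn_out, _ = self.rnn(seq)           # temporal encoding
        rnn_pool = rnn_out.mean(dim=1)
        emb = self.proj(torch.cat([mean_pool, rnn_pool], dim=1))
        return nn.functional.normalize(emb, dim=1)   # unit length for cosine similarity

# One encoder per modality, trained so that matching video/query pairs are close.
video_encoder = SequenceEncoder(feat_dim=2048, hidden_dim=512, out_dim=1024)  # CNN frame features
query_encoder = SequenceEncoder(feat_dim=300, hidden_dim=512, out_dim=1024)   # word embeddings
```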

Jianfeng Dong, Xirong Li, Chaoxi Xu, Shouling Ji, Yuan He, Gang Yang, Xun Wang: Dual Encoding for Zero-Example Video Retrieval. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

T-MM 2018: Predicting Visual Features from Text for Image and Video Caption Retrieval

Our Word2VisualVec work has been accepted for publication as a regular paper in the IEEE Transactions on Multimedia. Source code is available at https://github.com/danieljf24/w2vv.

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both 3-D convolutional neural network features as well as a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the recent NIST TRECVID challenge for video caption retrieval detail Word2VisualVec’s properties, its benefit over textual embeddings, the potential for multimodal query composition and its state-of-the-art results.
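
The core of Word2VisualVec can be sketched as a sentence vectorizer followed by an MLP that regresses a deep visual feature. The sketch below assumes a simple vectorization (mean word embedding plus bag-of-words) and illustrative layer sizes; the paper's multi-scale vectorization also includes a recurrent sentence encoding, so treat this as a simplified illustration rather than the released implementation.

```python
# Minimal sketch of the Word2VisualVec idea: map a sentence vector to a
# visual feature with an MLP and regress the paired image/video feature.
# Vectorization choice and layer sizes are simplified assumptions.
import torch
import torch.nn as nn

class Word2VisualVec(nn.Module):
    def __init__(self, text_dim, visual_dim, hidden_dim=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, visual_dim),   # predicted visual feature
        )

    def forward(self, sentence_vec):             # (batch, text_dim)
        return self.mlp(sentence_vec)

# Training target: the deep visual (or visual-audio) feature of the paired image/video.
model = Word2VisualVec(text_dim=300 + 10000, visual_dim=2048)  # mean word2vec + bag-of-words
criterion = nn.MSELoss()                                       # regress the visual feature
# loss = criterion(model(sentence_vec), visual_feature)
```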

Jianfeng Dong, Xirong Li, Cees G. M. Snoek: Predicting Visual Features from Text for Image and Video Caption Retrieval. In: IEEE Transactions on Multimedia (TMM), vol. 20, no. 12, pp. 3377-3388, 2018.