Bridging the Ultimate Semantic Gap: A Semantic Search Engine for Internet Videos
Lu Jiang, Shoou-I Yu, Deyu Meng, Teruko Mitamura and Alex Hauptmann
(Carnegie Mellon University and Xi'an Jiaotong University)

Paper, Supplementary materials

Semantic Search in Internet Videos?

    Semantic Query for Birthday Party

    Semantic Search

      Search the multimodal semantic content in video:
    • No user-generated metadata
    • Content understanding
    • Multimodal semantic query
    • No example videos (also called Zero-Example search)

    Relevant Videos

What's in this web page?

This page provides the features for the two benchmarks, MED13 and MED14, used in our paper [15], together with the ranked lists returned by our system. MED16 Train and Test features are also available; see the MED16 features section below. The shared data are expected to help:
1) reproduce our state-of-the-art results;
2) benefit related tasks such as video recommendation, hyperlinking and recounting.


MED16 features
Semantic Concatenated [3,0]     features, dictionary  features, dictionary
Improved Dense Trajectory [15]  features              features

[0] Junwei Liang, Lu Jiang, Deyu Meng, Alexander Hauptmann. Learning to Detect Concepts from Webly-Labeled Video Data. In IJCAI, 2016.

*Please cite the corresponding papers for using our features (32,000 Internet videos).
Semantic Features                   MED13Test                                MED14Test
ASR [1]                             raw features, dictionary, sparse matrix  raw features, dictionary, sparse matrix
OCR [15, 2]                         raw features, dictionary, sparse matrix  raw features, dictionary, sparse matrix
YFCC100M (609 concepts) [3,4]       features, dictionary for all semantic concepts
Google Sports (478 concepts) [3,5]  features, dictionary for all semantic concepts
IACC (346 concepts) [3,6]           features, dictionary for all semantic concepts
DIY (1601 concepts) [3,7]           features, dictionary for all semantic concepts

Low-level features               MED13Test  MED14Test
Improved Dense Trajectory [2,8]  features   features
MFCC [2]                         features   features

[1] Y. Miao, F. Metze, and S. Rawat. Deep maxout networks for low-resource speech recognition. In ASRU, 2013.
[2] S.-I. Yu, L. Jiang, Z. Xu, et al. CMU-informedia@TRECVID 2014. In TRECVID, 2014.
[3] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. G. Hauptmann. Self-paced learning with diversity. In NIPS, 2014.
[4] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
[5] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[6] P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders, W. Kraaij, A. F. Smeaton, and G. Quénot. TRECVID 2014 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID, 2014.
[7] S.-I. Yu, L. Jiang, and A. Hauptmann. Instructional videos for unsupervised harvesting and learning of action examples. In MM, 2014.
[8] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.

Retrieved Ranked List:

*The ranked lists are specified in NIST's standard CSV format.
Runs            MED13Test             MED14Test
ASR System      E006-E015, E021-E030  E021-E040
OCR System      E006-E015, E021-E030  E021-E040
Visual System   E006-E015, E021-E030  E021-E040
AutoSQG System  E006-E015, E021-E030  E021-E040
Full System     E006-E015, E021-E030  E021-E040
PRF System      E006-E015, E021-E030  E021-E040

Published Results on the MED13Test dataset:

Method                      MAP (x100)
Composite Concepts [9]      6.4
Tag Propagation [10]        9.6
MMPRF [11]                  10.1
Clauses [12]                11.2
Multimodal Fusion [13]      12.6
SPaR [14]                   12.9
E-Lamp AutoSQG System [15]  12.0
E-Lamp Visual System [15]   18.3
E-Lamp Full System [15]     20.7
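
For readers who want to score their own ranked lists against the numbers above, the following is a minimal sketch of mean average precision (MAP) as it is commonly defined in retrieval evaluation. It is not the official NIST scoring tool, and the in-memory input layout (event ID mapped to a ranked list of video IDs, and event ID mapped to a set of relevant video IDs) is an assumption made for illustration; the video IDs are hypothetical.

```python
from typing import Dict, List, Set


def average_precision(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list.

    Precision is measured at every rank that holds a relevant video,
    and the sum is divided by the total number of relevant videos.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, video_id in enumerate(ranked_ids, start=1):
        if video_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], truth: Dict[str, Set[str]]
) -> float:
    """Mean of the per-event average precision over all events in `runs`."""
    if not runs:
        return 0.0
    scores = [
        average_precision(ranked, truth.get(event, set()))
        for event, ranked in runs.items()
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy data: two events, a handful of made-up video IDs.
    runs = {"E006": ["v1", "v2", "v3", "v4"], "E007": ["v5", "v6", "v7"]}
    truth = {"E006": {"v1", "v3"}, "E007": {"v6"}}
    # The table above reports MAP multiplied by 100.
    print(round(100 * mean_average_precision(runs, truth), 1))
```

Relevant videos that never appear in the ranked list still count in the denominator, so a truncated list is penalized rather than ignored.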

[9] A. Habibian, T. Mensink, and C. G. Snoek. Composite concept discovery for zero-shot video event detection. In ICMR, 2014.
[10] M. Mazloom, X. Li, and C. G. Snoek. Few-example video event retrieval using tag propagation. In ICMR, 2014.
[11] L. Jiang, T. Mitamura, S.-I. Yu, and A. G. Hauptmann. Zero-example event search using multimodal pseudo relevance feedback. In ICMR, 2014.
[12] H. Lee. Analyzing complex events and human actions in "in-the-wild" videos. In UMD Ph.D Theses and Dissertations, 2014.
[13] S. Wu, S. Bondugula, F. Luisier, X. Zhuang, and P. Natarajan. Zero-shot event detection using multi-modal fusion of weakly supervised concepts. In CVPR, 2014.
[14] L. Jiang, D. Meng, T. Mitamura, and A. G. Hauptmann. Easy samples first: Self-paced reranking for zero-example multimedia search. In MM, 2014.
[15] L. Jiang, S.-I. Yu, D. Meng, T. Mitamura, and A. G. Hauptmann. Bridging the Ultimate Semantic Gap: A Semantic Search Engine for Internet Videos. In ICMR, 2015.

Recommendations for building a state-of-the-art system [15]:

  1. Training concept detectors on big data sets is ideal. However, given limited resources, building more detectors of reasonable accuracy seems to be a sensible strategy. Merely increasing the number of low-quality concepts may not improve performance.
  2. PRF (or reranking) is an effective approach to improving the search results.
  3. Retrieval models may have a substantial impact on the search results. A reasonable strategy is to incorporate multiple models and apply them to their appropriate features/modalities.
  4. Automatic query generation for queries in the form of event-kit descriptions is still very challenging. Combining mapping results from various mapping algorithms and applying manual examination afterward is the best strategy known so far.
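
To make recommendation 2 concrete, here is a minimal sketch of pseudo-relevance feedback over concept-score vectors. It uses a generic Rocchio-style query expansion, which is a stand-in chosen for brevity and not the MMPRF [11] or SPaR [14] methods used in our system. The function names, the `top_k`, `alpha` and `beta` parameters, the three-concept vocabulary and the video IDs are all hypothetical.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank(query: List[float], videos: Dict[str, List[float]]) -> List[str]:
    """Rank video IDs by the inner product of query and concept scores."""
    return sorted(videos, key=lambda vid: dot(query, videos[vid]), reverse=True)


def prf_rerank(
    query: List[float],
    videos: Dict[str, List[float]],
    top_k: int = 2,
    alpha: float = 1.0,
    beta: float = 0.5,
) -> List[str]:
    """Rerank by treating the top_k initial results as pseudo-positives.

    The query is moved toward the centroid of the pseudo-positives
    (Rocchio-style, with no negative term) and the collection is ranked again.
    """
    initial = rank(query, videos)
    pseudo_positives = initial[:top_k]
    if not pseudo_positives:
        return initial
    centroid = [
        sum(videos[vid][i] for vid in pseudo_positives) / len(pseudo_positives)
        for i in range(len(query))
    ]
    expanded = [alpha * q + beta * c for q, c in zip(query, centroid)]
    return rank(expanded, videos)


if __name__ == "__main__":
    # Hypothetical concept order: [cake, candle, dog]; the query asks for "cake".
    query = [1.0, 0.0, 0.0]
    videos = {
        "a": [0.9, 0.8, 0.0],
        "b": [0.8, 0.7, 0.1],
        "c": [0.5, 0.0, 0.9],
        "d": [0.4, 0.9, 0.0],
    }
    print(rank(query, videos))        # initial ranking
    print(prf_rerank(query, videos))  # ranking after feedback
```

In this toy example the top two results also score high on "candle", so the expanded query promotes video "d" above video "c" even though "d" has the weaker "cake" score. The same mechanism can hurt when the initial top results are wrong, which is why the choice of pseudo-positives matters.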

Screenshot of our Prototype System [16]:

*Please contact us if you would like to access our prototype system.
[16] S. Xu, H. Li, X. Chang, S.-I. Yu, X. Du, X. Li, L. Jiang, Z. Mao, Z. Lan, S. Burger, and A. Hauptmann. Incremental multimodal query construction for video search. In ICMR, 2015.


Lu Jiang, Shoou-I Yu, Deyu Meng, Teruko Mitamura, Alexander Hauptmann. Bridging the Ultimate Semantic Gap: A Semantic Search Engine for Internet Videos.
In ACM International Conference on Multimedia Retrieval (ICMR). 2015. [BibTex | supplementary materials]

(C) COPYRIGHT 2015, Carnegie Mellon University All Rights Reserved.