Bridging the Ultimate Semantic Gap: A Semantic Search Engine for Internet Videos
Lu Jiang, Shoou-I Yu, Deyu Meng, Teruko Mitamura and Alex Hauptmann
(Carnegie Mellon University and Xi'an Jiaotong University)

Paper, Supplementary materials


Semantic Search in Internet Videos?


    [Figure: a semantic query (e.g., "Birthday Party") is matched against the multimodal semantic content of videos to return a ranked list of relevant videos.]

    Semantic search means searching the multimodal semantic content in video:
    • No user-generated metadata
    • Content understanding
    • Multimodal semantic query
    • No example videos (also called zero-example search)

What's in this web page?

This page contains the features on two benchmarks, MED13 and MED14, used in our paper [15], as well as the ranked lists returned by our system. The shared data are expected to help:
1) reproduce our state-of-the-art results;
2) benefit related tasks such as video recommendation, hyperlinking, and recounting.

MED16 Train and Test features are also available. See details below.



Features:

MED16 features
Features                        MED16Train            MED16Test
Semantic Concatenated [3,0]     features, dictionary  features, dictionary
Improved Dense Trajectory [15]  features              features
VGG-19                          features              features

[0] Junwei Liang, Lu Jiang, Deyu Meng, Alexander Hauptmann. Learning to Detect Concepts from Webly-Labeled Video Data. In IJCAI, 2016.

*Please cite the corresponding papers when using our features (32,000 Internet videos).
Semantic Features                   MED13Test                                MED14Test
ASR [1]                             raw features, dictionary, sparse matrix  raw features, dictionary, sparse matrix
OCR [15,2]                          raw features, dictionary, sparse matrix  raw features, dictionary, sparse matrix
YFCC100M (609 concepts) [3,4]       features, dictionary for all semantic concepts
Google Sports (478 concepts) [3,5]  features, dictionary for all semantic concepts
IACC (346 concepts) [3,6]           features, dictionary for all semantic concepts
DIY (1601 concepts) [3,7]           features, dictionary for all semantic concepts

Low-level Features               MED13Test  MED14Test
Improved Dense Trajectory [2,8]  features   features
MFCC [2]                         features   features
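The semantic features above are released as sparse matrices of concept scores per video, which is exactly what a zero-example search needs: a query expressed as weighted concepts can be matched against each video's concept vector directly. The sketch below illustrates this matching with cosine similarity. The dict-based sparse representation, the video IDs, and the concept names are illustrative assumptions, not the exact on-disk format of the released files.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_videos(query_vec, video_vecs):
    """Return (video_id, score) pairs sorted by descending similarity."""
    scored = [(vid, cosine(query_vec, vec)) for vid, vec in video_vecs.items()]
    return sorted(scored, key=lambda x: -x[1])

# Toy zero-example query for "birthday party" over three videos
# (hypothetical concept scores, for illustration only).
query = {"birthday": 1.0, "cake": 1.0, "singing": 0.5}
videos = {
    "v1": {"cake": 0.9, "birthday": 0.8, "park": 0.1},
    "v2": {"dog": 0.7, "park": 0.9},
    "v3": {"singing": 0.6, "cake": 0.2},
}
ranked = rank_videos(query, videos)
```

Because both the query and the videos live in the same concept vocabulary (see the dictionaries above), no example videos are needed at query time.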

[1] Y. Miao, F. Metze, and S. Rawat. Deep maxout networks for low-resource speech recognition. In ASRU, 2013.
[2] S.-I. Yu, L. Jiang, Z. Xu, et al. CMU-informedia@TRECVID 2014. In TRECVID, 2014.
[3] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. G. Hauptmann. Self-paced learning with diversity. In NIPS, 2014.
[4] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817, 2015.
[5] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[6] P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders, W. Kraaij, A. F. Smeaton, and G. Quénot. TRECVID 2014 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID, 2014.
[7] S.-I. Yu, L. Jiang, and A. Hauptmann. Instructional videos for unsupervised harvesting and learning of action examples. In MM, 2014.
[8] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.


Retrieved Ranked List:

*The ranked lists are specified in NIST's standard CSV format (http://www.nist.gov/itl/iad/mig/med14.cfm).
Runs            MED13Test             MED14Test
ASR System      E006-E015, E021-E030  E021-E040
OCR System      E006-E015, E021-E030  E021-E040
Visual System   E006-E015, E021-E030  E021-E040
AutoSQG System  E006-E015, E021-E030  E021-E040
Full System     E006-E015, E021-E030  E021-E040
PRF System      E006-E015, E021-E030  E021-E040


Published Results on the MED13Test dataset:

Method                       MAP (x100)
Composite Concepts [9]       6.4
Tag Propagation [10]         9.6
MMPRF [11]                   10.1
Clauses [12]                 11.2
Multimodal Fusion [13]       12.6
SPaR [14]                    12.9
E-Lamp AutoSQG System [15]   12.0
E-Lamp Visual System [15]    18.3
E-Lamp Full System [15]      20.7
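The table reports mean average precision (MAP, scaled by 100), the standard metric for this task: average precision is computed per event query from the ranked list, then averaged over all events. A minimal sketch of that computation (the event IDs and video IDs below are toy placeholders, not MED data):

```python
def average_precision(ranked_ids, relevant):
    """AP of one ranked list: mean of precision@k at each relevant hit,
    divided by the total number of relevant videos."""
    hits, precisions = 0, []
    for k, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over events: runs maps event id -> (ranked list, relevant set)."""
    aps = [average_precision(r, rel) for r, rel in runs.values()]
    return sum(aps) / len(aps) if aps else 0.0

# Toy example with two events.
runs = {
    "E006": (["v1", "v2", "v3", "v4"], {"v1", "v3"}),  # hits at ranks 1 and 3
    "E007": (["v5", "v6", "v7"], {"v6"}),              # hit at rank 2
}
map_score = mean_average_precision(runs)  # (5/6 + 1/2) / 2 = 2/3
```

Multiplying `map_score` by 100 gives the "MAP (x100)" values reported above.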

[9] A. Habibian, T. Mensink, and C. G. Snoek. Composite concept discovery for zero-shot video event detection. In ICMR, 2014.
[10] M. Mazloom, X. Li, and C. G. Snoek. Few-example video event retrieval using tag propagation. In ICMR, 2014.
[11] L. Jiang, T. Mitamura, S.-I. Yu, and A. G. Hauptmann. Zero-example event search using multimodal pseudo relevance feedback. In ICMR, 2014.
[12] H. Lee. Analyzing complex events and human actions in "in-the-wild" videos. Ph.D. thesis, University of Maryland, 2014.
[13] S. Wu, S. Bondugula, F. Luisier, X. Zhuang, and P. Natarajan. Zero-shot event detection using multi-modal fusion of weakly supervised concepts. In CVPR, 2014.
[14] L. Jiang, D. Meng, T. Mitamura, and A. G. Hauptmann. Easy samples first: Self-paced reranking for zero-example multimedia search. In MM, 2014.
[15] L. Jiang, S.-I Yu, D. Meng, T. Mitamura, A. G. Hauptmann. Bridging the Ultimate Semantic Gap: A Semantic Search Engine for Internet Videos. In ICMR 2015.


Recommendations for building a state-of-the-art system [15]:

  1. Training concept detectors on big datasets is ideal. However, given limited resources, building more detectors of reasonable accuracy seems to be a sensible strategy. Merely increasing the number of low-quality concepts may not improve performance.
  2. PRF (or reranking) is an effective approach to improving the search results.
  3. Retrieval models can have a substantial impact on the search results. A reasonable strategy is to incorporate multiple models and apply each to its appropriate features/modalities.
  4. Automatic query generation for queries in the form of event-kit descriptions is still very challenging. Combining the mapping results of various mapping algorithms and applying manual examination afterward is the best strategy known so far.
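Recommendation 3 — combining multiple retrieval models over their appropriate modalities — is commonly realized as weighted late fusion: normalize each modality's scores to a common range, then take a weighted average. The sketch below is one simple way to do this; the modality names, weights, and scores are illustrative assumptions, not the exact fusion scheme of our system.

```python
def minmax_normalize(scores):
    """Min-max normalize a video->score dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {vid: (s - lo) / span for vid, s in scores.items()}

def late_fusion(modality_scores, weights):
    """Weighted average of per-modality normalized scores.
    Videos missing from a modality simply contribute 0 for it."""
    fused = {}
    for name, scores in modality_scores.items():
        w = weights.get(name, 1.0)
        for vid, s in minmax_normalize(scores).items():
            fused[vid] = fused.get(vid, 0.0) + w * s
    total = sum(weights.get(n, 1.0) for n in modality_scores)
    return {vid: s / total for vid, s in fused.items()}

# Toy example: fuse ASR and visual scores, weighting visual more heavily.
asr = {"v1": 0.2, "v2": 0.9, "v3": 0.5}
visual = {"v1": 0.8, "v2": 0.1, "v3": 0.6}
fused = late_fusion({"asr": asr, "visual": visual},
                    {"asr": 0.3, "visual": 0.7})
top = max(fused, key=fused.get)
```

Normalizing before averaging matters because raw scores from different modalities (e.g., ASR text match vs. visual concept scores) are on incomparable scales.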


Screenshot of our Prototype System [16]:


*Please contact us if you would like to access our prototype system.
[16] S. Xu, H. Li, X. Chang, S.-I. Yu, X. Du, X. Li, L. Jiang, Z. Mao, Z. Lan, S. Burger, and A. Hauptmann. Incremental multimodal query construction for video search. In ICMR, 2015.


Citation:

Lu Jiang, Shoou-I Yu, Deyu Meng, Teruko Mitamura, Alexander Hauptmann. Bridging the Ultimate Semantic Gap: A Semantic Search Engine for Internet Videos.
In ACM International Conference on Multimedia Retrieval (ICMR), 2015. [BibTeX | supplementary materials]

(C) COPYRIGHT 2015, Carnegie Mellon University All Rights Reserved.