2021
Journal article (Open Access)

The VISIONE video search system: exploiting off-the-shelf text search engines for large-scale video retrieval

Amato G., Bolettieri P., Carrara F., Debole F., Falchi F., Gennaro C., Vadicamo L., Vairo C.

Keywords: content-based video retrieval; video search; image search; known item search; ad-hoc video search; surrogate text representation; multimedia and multimodal retrieval; multimedia information systems; information systems applications; retrieval models and ranking; users and interactive retrieval; Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

This paper describes in detail VISIONE, a video search system that allows users to search for videos using textual keywords, the occurrence of objects and their spatial relationships, the occurrence of colors and their spatial relationships, and image similarity. These modalities can be combined to express complex queries and meet users' needs. The distinctive feature of our approach is that all the information extracted from the keyframes, such as visual deep features, tags, and color and object locations, is encoded in a convenient textual representation that is indexed in a single text retrieval engine. This offers great flexibility when results corresponding to the various parts of a query (visual, text, and locations) need to be merged. In addition, we report an extensive analysis of the system's retrieval performance, based on the query logs generated during the Video Browser Showdown (VBS) 2019 competition. This analysis allowed us to fine-tune the system by choosing the optimal parameters and strategies among those we tested.

Source: Journal of Imaging 7 (2021). doi:10.3390/jimaging7050076
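To make the key idea of the abstract concrete, here is a minimal sketch of a surrogate text representation: each component of a deep feature vector is mapped to a synthetic term repeated in proportion to the component's magnitude, so that the term-frequency-based scoring of an off-the-shelf text engine (e.g., Lucene or Elasticsearch) approximates the dot product between vectors. The quantization factor, the term naming scheme, and the use of Python are illustrative assumptions, not the exact encoding used by VISIONE.

import numpy as np

def surrogate_text(features: np.ndarray, scale: int = 30) -> str:
    """Encode a non-negative feature vector as a space-separated 'document'.

    Dimension i becomes the synthetic codeword 'f<i>', repeated
    round(features[i] * scale) times, so term frequency tracks activation.
    (The scale factor 30 is an illustrative choice, not the paper's value.)
    """
    terms = []
    for i, value in enumerate(features):
        repetitions = int(round(value * scale))  # quantize the activation into a term count
        terms.extend([f"f{i}"] * repetitions)
    return " ".join(terms)

# Example: encode a keyframe descriptor (ReLU activations are non-negative)
rng = np.random.default_rng(0)
keyframe_vector = np.maximum(rng.normal(size=8), 0.0)
doc = surrogate_text(keyframe_vector)
print(doc)  # e.g. "f0 f0 f0 f2 f5 f5 ..." -- ready to index as a text field

A query image can be encoded the same way and submitted as a free-text query, so the engine's inverted index and ranking model (e.g., TF-IDF or BM25) perform the similarity search; tags, colors, and object locations can be stored as additional text fields of the same document and combined with boolean operators, which is what makes merging the visual, textual, and spatial parts of a query straightforward.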


BibTeX entry
@article{oai:it.cnr:prodotti:456298,
	title = {The VISIONE video search system: exploiting off-the-shelf text search engines for large-scale video retrieval},
	author = {Amato G. and Bolettieri P. and Carrara F. and Debole F. and Falchi F. and Gennaro C. and Vadicamo L. and Vairo C.},
	doi = {10.3390/jimaging7050076},
	eprint = {2008.02749},
	archiveprefix = {arXiv},
	journal = {JOURNAL OF IMAGING},
	volume = {7},
	year = {2021}
}

Project: AI4Media (A European Excellence Centre for Media, Society and Democracy)

