Messina N., Sedmidubský J., Falchi F., Rebok T.
Human motion data, Skeleton sequences, CLIP, BERT, Deep language models, ViViT, Motion retrieval, Cross-modal retrieval
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
Source: SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2420–2425, Taipei, Taiwan, 23-27/07/2023
Publisher: ACM - Association for Computing Machinery, New York, USA
@inproceedings{oai:it.cnr:prodotti:486043, title = {Text-to-motion retrieval: towards joint understanding of human motion data and natural language}, author = {Messina N. and Sedmidubsk{\'y} J. and Falchi F. and Rebok T.}, publisher = {ACM - Association for Computing Machinery, New York, USA}, doi = {10.1145/3539618.3592069}, booktitle = {SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2420–2425, Taipei, Taiwan, 23-27/07/2023}, year = {2023} }