2023
Conference article  Open Access

Text-to-motion retrieval: towards joint understanding of human motion data and natural language

Messina N., Sedmidubský J., Falchi F., Rebok T.

Human motion data, Skeleton sequences, CLIP, BERT, Deep language models, ViViT, Motion retrieval, Cross-modal retrieval

Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
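
The retrieval setup described in the abstract can be illustrated with a minimal sketch, assuming that text and motion have already been encoded into fixed-size vectors in a shared space. The abstract does not name the two metric-learning losses, so the symmetric InfoNCE-style contrastive loss below is only one common choice used here for illustration, and recall@k is assumed as a typical retrieval metric rather than taken from the paper. All function names and the temperature value are hypothetical and do not come from the authors' repository.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def similarity_matrix(text_emb, motion_emb):
    """Cosine similarity between every text (rows) and every motion (columns)."""
    return l2_normalize(text_emb) @ l2_normalize(motion_emb).T


def symmetric_info_nce(sim, temperature=0.07):
    """Illustrative contrastive loss (an assumption, not the paper's stated loss).

    Entry (i, i) of `sim` is treated as the matching text-motion pair, and
    every other entry in the same row or column as a negative.
    """
    logits = sim / temperature

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text-to-motion and motion-to-text directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def recall_at_k(sim, k):
    """Fraction of text queries whose matching motion is ranked in the top k."""
    ranking = np.argsort(-sim, axis=1)  # most similar motion first
    targets = np.arange(sim.shape[0])[:, None]
    hits = (ranking[:, :k] == targets).any(axis=1)
    return float(hits.mean())


if __name__ == "__main__":
    # Toy embeddings: four texts, each perfectly aligned with its own motion.
    text_emb = np.eye(4)
    motion_emb = np.eye(4)
    sim = similarity_matrix(text_emb, motion_emb)
    print("loss:", symmetric_info_nce(sim))
    print("recall@1:", recall_at_k(sim, k=1))
```

In this sketch, ranking the motions for a text query amounts to sorting one row of the similarity matrix, which is why the same matrix serves both the training loss and the evaluation metric.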

Source: SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2420–2425, Taipei, Taiwan, 23-27/07/2023

Publisher: ACM - Association for Computing Machinery, New York, USA


BibTeX entry
@inproceedings{oai:it.cnr:prodotti:486043,
	title = {Text-to-motion retrieval: towards joint understanding of human motion data and natural language},
	author = {Messina N. and Sedmidubsk{\'y} J. and Falchi F. and Rebok T.},
	publisher = {ACM - Association for Computing Machinery, New York, USA},
	doi = {10.1145/3539618.3592069},
	booktitle = {SIGIR '23: The 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2420–2425, Taipei, Taiwan, 23-27/07/2023},
	year = {2023}
}

AI4Media
A European Excellence Centre for Media, Society and Democracy


OpenAIRE