2025
Journal article  Open Access

Joint-dataset learning and cross-consistent regularization for text-to-motion retrieval

Messina N., Sedmidubsky J., Falchi F., Rebok T.

Information Retrieval (cs.IR)  Cross-modal retrieval  Computer Vision and Pattern Recognition (cs.CV)  3D Human motion  FOS: Computer and information sciences  Computer Science - Information Retrieval  Multimedia (cs.MM)  Multi-modal understanding  Computer Science - Multimedia  Text-motion retrieval  Computer Science - Computer Vision and Pattern Recognition 

Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are the most relevant to a specified natural-language textual description (text-to-motion) and vice-versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning – where we train on multiple text-motion datasets simultaneously – together with the introduction of a Cross-Consistent Contrastive Loss function (CCCL), which regularizes the learned text-motion common space by imposing uni-modal constraints that augment the representation ability of the trained network. To learn a proper motion representation, we also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process skeleton data sequences. We demonstrate the benefits of the proposed approaches on the widely-used KIT Motion-Language and HumanML3D datasets, including also some results on the recent Motion-X dataset. We perform detailed experimentation on joint-dataset learning and cross-dataset scenarios, showing the effectiveness of each introduced module in a carefully conducted ablation study and, in turn, pointing out the limitations of state-of-the-art methods. The code for reproducing our results is available here: .

Source: ACM TRANSACTIONS ON MULTIMEDIA COMPUTING, COMMUNICATIONS AND APPLICATIONS


BibTeX entry
@article{oai:iris.cnr.it:20.500.14243/554744,
	title = {Joint-dataset learning and cross-consistent regularization for text-to-motion retrieval},
	author = {Messina, N. and Sedmidubsky, J. and Falchi, F. and Rebok, T.},
	doi = {10.1145/3744565},
	note = {arXiv preprint DOI: 10.48550/arxiv.2407.02104},
	year = {2025}
}