Published December 15, 2021
| Version v1
Conference paper
Self-Supervised Video Pose Representation Learning for Occlusion-Robust Action Recognition
Contributors
Others:
- Spatio-Temporal Activity Recognition Systems (STARS) ; Inria Sophia Antipolis - Méditerranée (CRISAM) ; Institut National de Recherche en Informatique et en Automatique (Inria)
- Université Côte d'Azur (UCA)
- Toyota Motor Europe
- ANR-19-P3IA-0002, 3IA@cote d'azur, 3IA Côte d'Azur (2019)
Description
Action recognition based on human pose has witnessed increasing attention due to its robustness to changes in appearance, environment, and viewpoint. Despite this progress, one remaining challenge is occlusion in real-world videos, which hinders the visibility of all joints. Such occlusion impedes the representation of these scenes by models trained on full-body pose data obtained in laboratory conditions with specific sensors. To address this, as a first contribution, we introduce OR-VPE, a novel video pose embedding network designed to learn an occlusion-robust representation for pose sequences in videos. To enable our embedding network to handle partially visible joints, we incorporate a sub-graph data augmentation mechanism, which simulates occlusions during training, into a video pose encoder based on Graph Convolutional Networks (GCNs). As a second contribution, we apply a contrastive learning module to train the video pose representation in a self-supervised manner, without the need for action annotations. This is achieved by maximizing the mutual information between different spatio-temporal sub-graphs pruned from the same pose sequence. Experimental analyses show that, compared to training the same encoder from scratch, our proposed OR-VPE, pre-trained on the large-scale NTU-RGB+D 120 dataset, improves downstream action classification performance on the Toyota Smarthome, N-UCLA and Penn Action datasets.
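The two mechanisms described above can be illustrated with a minimal PyTorch sketch. This is not the authors' released code: the function names (`subgraph_augment`, `contrastive_loss`, `train_step`), the NT-Xent-style objective, the spatial-only joint masking, and the `keep_ratio` and `temperature` values are all assumptions made here, only to show how sub-graph augmentation and contrastive training over two views of the same pose sequence could fit together.

```python
import torch
import torch.nn.functional as F

def subgraph_augment(pose, keep_ratio=0.7):
    """Simulate occlusion by zeroing out a random subset of joints
    (a spatial sub-graph) for the whole sequence.
    pose: (batch, channels, frames, joints) skeleton tensor."""
    B, C, T, V = pose.shape
    mask = (torch.rand(B, 1, 1, V, device=pose.device) < keep_ratio).float()
    return pose * mask

def contrastive_loss(z1, z2, temperature=0.1):
    """NT-Xent-style loss: embeddings of two sub-graph views of the
    same sequence are positives; other sequences in the batch act as
    negatives. Minimizing this maximizes a lower bound on the mutual
    information between the two views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def train_step(encoder, pose_batch, optimizer):
    """One self-supervised step; `encoder` stands in for the GCN-based
    video pose encoder, whose architecture is not specified here."""
    view1 = subgraph_augment(pose_batch)
    view2 = subgraph_augment(pose_batch)
    loss = contrastive_loss(encoder(view1), encoder(view2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For brevity the mask above prunes joints only spatially; the method as described prunes spatio-temporal sub-graphs, which would additionally mask frame ranges.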
Additional details
Identifiers
- URL
- https://hal.archives-ouvertes.fr/hal-03476564
- URN
- urn:oai:HAL:hal-03476564v1
Origin repository
- UNICA