MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection
- Others:
- Spatio-Temporal Activity Recognition Systems (STARS) ; Inria Sophia Antipolis - Méditerranée (CRISAM) ; Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)
- Université Côte d'Azur (UCA)
- Stony Brook University [SUNY] (SBU) ; State University of New York (SUNY)
- This work was also supported in part by the National Science Foundation (IIS-2104404 and CNS-2104416).
- ANR-19-P3IA-0002,3IA@cote d'azur,3IA Côte d'Azur(2019)
Description
Action detection is a significant and challenging task, especially in densely-labelled datasets of untrimmed videos. Such data consist of complex temporal relations including composite or co-occurring actions. To detect actions in these complex settings, it is critical to capture both shortterm and long-term temporal information efficiently. To this end, we propose a novel 'ConvTransformer' network for action detection: MS-TCT 1. This network comprises of three main components: (1) a Temporal Encoder module which explores global and local temporal relations at multiple temporal resolutions, (2) a Temporal Scale Mixer module which effectively fuses multi-scale features, creating a unified feature representation, and (3) a Classification module which learns a center-relative position of each action instance in time, and predicts frame-level classification scores. Our experimental results on multiple challenging datasets such as Charades, TSU and MultiTHUMOS, validate the effectiveness of the proposed method, which outperforms the state-of-the-art methods on all three datasets.
Abstract
International audience
Additional details
- URL
- https://hal.inria.fr/hal-03682969
- URN
- urn:oai:HAL:hal-03682969v1
- Origin repository
- UNICA