Spatio-Temporal Action Detection with a Motion Sense and Semantic Correction Framework

Published in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

Accurately distinguishing action-related features from non-action-related features is crucial in spatio-temporal action detection, and the calibration and fusion of information across different modalities remain challenging. This paper proposes a novel Motion Sense and Semantic Correction (MS-SC) framework to address these issues. The MS-SC framework achieves accurate detection by fusing features from images (the spatial dimension) and videos (the spatio-temporal dimension). A Motion Sense Module (MSM) is proposed to enlarge the distance between action and non-action features in the semantic space, enhancing feature discriminability. Exploiting the complementary nature of information across modalities, an efficient Semantic Correction Fusion Module (SFM) is introduced to enable interaction between features of distinct modalities and fully integrate their complementary information. To evaluate the MS-SC framework, extensive experiments were conducted on two challenging datasets, UCF101-24 and AVA. The results demonstrate the effectiveness of the MS-SC framework on spatio-temporal action detection tasks.
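The paper's implementation is not reproduced here, but the two ideas can be illustrated with a minimal, hypothetical PyTorch sketch (not the authors' code): `MotionSenseLoss` stands in for the MSM objective of pushing action and non-action features apart in the semantic space via a margin penalty, and `CrossModalFusion` stands in for the SFM idea of cross-modal interaction, realized here with generic cross-attention. All class names, dimensions, and the margin formulation are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionSenseLoss(nn.Module):
    """Margin-based separation of action vs. non-action features.

    A stand-in for the MSM objective: penalize action/non-action
    feature pairs whose cosine similarity exceeds (1 - margin).
    """

    def __init__(self, margin: float = 0.5):
        super().__init__()
        self.margin = margin

    def forward(self, action_feats: torch.Tensor,
                nonaction_feats: torch.Tensor) -> torch.Tensor:
        # Normalize so similarities live on the unit hypersphere.
        a = F.normalize(action_feats, dim=-1)       # (N, D)
        n = F.normalize(nonaction_feats, dim=-1)    # (M, D)
        sim = a @ n.t()                             # (N, M) pairwise cosine sim
        # Hinge: only pairs that are still too close contribute.
        return F.relu(sim - (1.0 - self.margin)).mean()


class CrossModalFusion(nn.Module):
    """Cross-attention fusion of image (spatial) and video
    (spatio-temporal) features; a generic stand-in for the SFM."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.img_to_vid = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vid_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, img_feats: torch.Tensor, vid_feats: torch.Tensor):
        # Each modality queries the other to obtain a "correction".
        img_corr, _ = self.img_to_vid(img_feats, vid_feats, vid_feats)
        vid_corr, _ = self.vid_to_img(vid_feats, img_feats, img_feats)
        # Merge each stream with its cross-modal correction.
        img_out = self.proj(torch.cat([img_feats, img_corr], dim=-1))
        vid_out = self.proj(torch.cat([vid_feats, vid_corr], dim=-1))
        return img_out, vid_out


if __name__ == "__main__":
    loss_fn = MotionSenseLoss(margin=0.5)
    print(loss_fn(torch.randn(8, 256), torch.randn(8, 256)))

    fusion = CrossModalFusion(dim=256)
    img = torch.randn(2, 49, 256)   # e.g. 7x7 spatial tokens
    vid = torch.randn(2, 16, 256)   # e.g. 16 spatio-temporal tokens
    img_out, vid_out = fusion(img, vid)
    print(img_out.shape, vid_out.shape)  # (2, 49, 256) (2, 16, 256)
```

Cross-attention is one common way to let each modality attend to the other; the paper's actual correction and fusion mechanism may differ in detail.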

Recommended citation: Zhang Y, Yu C, Fu C, et al. Spatio-Temporal Action Detection with a Motion Sense and Semantic Correction Framework[C]//ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024: 3645-3649.
Download Paper