SEMA: Semantic Attention for Capturing Long-Range Dependencies in Egocentric Lifelogs


Abstract

The Transformer architecture is the de-facto standard for modeling global dependencies in long sequences. However, the quadratic space and time complexity of self-attention prevents transformers from scaling to extremely long sequences (> 10k). Low-rank decomposition of self-attention via non-negative matrix factorization (NMF) achieves remarkable performance at linear space and time complexity, with strong theoretical guarantees. However, our analysis reveals that NMF-based approaches struggle to capture the rich spatio-temporal visual cues scattered across the long sequences that arise from egocentric lifelogs.

To capture such cues, we propose a novel attention mechanism named SEMantic Attention (SEMA), which factorizes the self-attention matrix into a semantically meaningful subspace. We demonstrate SEMA in a representation learning setting, aiming to recover activity patterns in extremely long (weeks-long) egocentric lifelogs using a novel self-supervised training pipeline. Compared to the current state of the art, we report significant improvements in NMI, AMI, and F-score on the EgoRoutine, UTE, and Epic Kitchens datasets. Furthermore, to underscore the efficacy of SEMA, we extend its application to conventional video tasks such as online action detection, video recognition, and action localization.
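To make the complexity argument concrete, the sketch below shows a generic low-rank attention layer in PyTorch, in which tokens attend through a small set of learned basis vectors rather than all pairs, so the O(n²) attention matrix is never materialized. The module name, the number of bases, and the two-softmax routing are illustrative assumptions, not the authors' SEMA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankSemanticAttention(nn.Module):
    """Illustrative low-rank attention: tokens attend through r learned
    'semantic' bases instead of all pairs, reducing O(n^2) cost to O(n*r).
    A generic sketch, NOT the authors' SEMA implementation."""

    def __init__(self, dim: int, num_bases: int = 64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # r learnable basis vectors spanning the attention subspace
        self.bases = nn.Parameter(torch.randn(num_bases, dim))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); seq_len may be very large (> 10k)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # route tokens to bases: (batch, n, r)
        token_to_base = F.softmax(q @ self.bases.t() * self.scale, dim=-1)
        # route bases back to tokens: (batch, r, n)
        base_to_token = F.softmax(self.bases @ k.transpose(1, 2) * self.scale, dim=-1)
        # aggregate values into r basis slots, then redistribute: O(n*r*d)
        return token_to_base @ (base_to_token @ v)

x = torch.randn(2, 10_000, 256)              # a long lifelog as 10k tokens
attn = LowRankSemanticAttention(256, num_bases=64)
out = attn(x)                                # (2, 10000, 256); no n x n matrix formed
```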


Architecture


Figure 1: Proposed approach for SEMA
Figure 2: Proposed SEMAFormer


Datasets

This work evaluates SEMA's performance on several benchmark datasets, including EgoRoutine, UTE, and Epic Kitchens.


Code

The code implementation is available on GitHub and is based on the PyTorch library.


Supplementary Material

The supplementary material can be found here.


Quantitative Comparison

Performance comparison of SEMA with existing methods on multiple datasets:

Methods             |        c = 12          |        c = 13          |        c = 15
                    | F1↑    AMI↑    NMI↑    | F1↑    AMI↑    NMI↑    | F1↑    AMI↑    NMI↑
SR-clustering [13]  | 0.3044 0.0913  0.0924  | 0.2697 0.1294  0.1312  | 0.2614 0.1537  0.1557
TW-FINCH [51]       | 0.3132 0.1548  0.1603  | 0.3259 0.1649  0.1655  | 0.3072 0.1530  0.1545
SeLa [2]            | 0.6642 0.6291  0.6299  | 0.6662 0.6150  0.6158  | 0.5855 0.5954  0.5963
DAPC + bi-GRU [3]   | 0.7135 0.6129  0.6135  | 0.6152 0.6040  0.6048  | 0.6343 0.6080  0.6089
GALA [48]           | 0.6357 0.6079  0.6085  | 0.6458 0.6084  0.6093  | 0.5381 0.5932  0.5941
CARL [7]            | 0.5551 0.5219  0.5253  | 0.5847 0.5258  0.5262  | 0.5721 0.5139  0.5144
SEMA                | 0.7482 0.6510  0.6515  | 0.7976 0.6837  0.6842  | 0.7960 0.6806  0.6814

Table 1. Comparison between various SOTA approaches for subject S1 in the EgoRoutine dataset. Here, c denotes the number of clusters.
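Table 1 reports F1, AMI, and NMI, all standard clustering metrics. As a quick reference, the snippet below computes them with scikit-learn; since the exact F-score variant is not specified here, the pairwise F-measure derived from the pair confusion matrix is assumed.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score
from sklearn.metrics.cluster import pair_confusion_matrix

def clustering_scores(labels_true, labels_pred):
    """Compute (F1, AMI, NMI) for a predicted clustering against ground truth.
    F1 is assumed to be the pairwise F-measure; the paper's variant may differ."""
    nmi = normalized_mutual_info_score(labels_true, labels_pred)
    ami = adjusted_mutual_info_score(labels_true, labels_pred)
    # pairwise precision/recall from the 2x2 pair confusion matrix
    (tn, fp), (fn, tp) = pair_confusion_matrix(labels_true, labels_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, ami, nmi

# toy example: frame-level activity labels vs. predicted cluster assignments
true = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 0, 1, 2, 2, 2])
print(clustering_scores(true, pred))
```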


Visualization


A visualization comparing the predicted classes with ground truth across multiple days for different activities in the EgoRoutine dataset.
Figure 3: Visualization of a comparison between the predicted class and ground truth for different days

If you use this work, please cite:

@InProceedings{Nagar_2024_WACV,
    author = {Nagar, Pravin and Shastry, K.N. Ajay and Chaudhari, Jayesh and Arora, Chetan},
    title = {SEMA: Semantic Attention for Capturing Long-Range Dependencies in Egocentric Lifelogs},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month = {January},
    year = {2024},
    pages = {7025-7035}
}