SEMA: Semantic Attention for Capturing Long-Range Dependencies in Egocentric Lifelogs


Abstract

The Transformer architecture is the de-facto standard for modeling global dependencies in long sequences. However, the quadratic space and time complexity of self-attention prevents transformers from scaling to extremely long sequences (> 10k). Low-rank decomposition of self-attention via non-negative matrix factorization (NMF) achieves remarkable performance at linear space and time complexity, with strong theoretical guarantees. However, our analysis reveals that NMF-based approaches struggle to capture the rich spatio-temporal visual cues scattered across the long sequences that arise from egocentric lifelogs.

To capture such cues, we propose a novel attention mechanism named SEMantic Attention (SEMA), which factorizes the self-attention matrix into a semantically meaningful subspace. We demonstrate SEMA in a representation learning setting, aiming to recover activity patterns in extremely long (weeks-long) egocentric lifelogs using a novel self-supervised training pipeline. Compared to the current state of the art, we report significant improvements in NMI, AMI, and F-score on the EgoRoutine, UTE, and Epic Kitchens datasets. Furthermore, to underscore the efficacy of SEMA, we extend its application to conventional video tasks such as online action detection, video recognition, and action localization.
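To make the complexity argument concrete, the sketch below shows a generic low-rank attention layer in PyTorch, in which tokens attend through a small set of learned basis vectors rather than all pairs, so the O(n²) attention matrix is never materialized. The module name, the number of bases, and the two-softmax routing are illustrative assumptions, not the authors' SEMA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankSemanticAttention(nn.Module):
    """Illustrative low-rank attention: tokens attend through r learned
    'semantic' bases instead of all pairs, reducing O(n^2) cost to O(n*r).
    A generic sketch, NOT the authors' SEMA implementation."""

    def __init__(self, dim: int, num_bases: int = 64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # r learnable basis vectors spanning the attention subspace
        self.bases = nn.Parameter(torch.randn(num_bases, dim))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); seq_len may be very large (> 10k)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # route tokens to bases: (batch, n, r)
        token_to_base = F.softmax(q @ self.bases.t() * self.scale, dim=-1)
        # route bases back to tokens: (batch, r, n)
        base_to_token = F.softmax(self.bases @ k.transpose(1, 2) * self.scale, dim=-1)
        # aggregate values into r basis slots, then redistribute: O(n*r*d)
        return token_to_base @ (base_to_token @ v)

x = torch.randn(2, 10_000, 256)              # a long lifelog as 10k tokens
attn = LowRankSemanticAttention(256, num_bases=64)
out = attn(x)                                # (2, 10000, 256); no n x n matrix formed
```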


Architecture


Figure 1: Proposed approach for SEMA
Figure 2: Proposed SEMAFormer


Datasets

This work evaluates SEMA's performance on several benchmark datasets, including EgoRoutine, UTE, and Epic Kitchens.


Code

The code implementation is available on GitHub and is based on the PyTorch library.


Supplementary Material

The supplementary material can be found here.


Quantitative Comparison

Performance comparison of SEMA with existing methods on multiple datasets:

Methods             |        c = 12          |        c = 13          |        c = 15
                    | F1↑    AMI↑    NMI↑    | F1↑    AMI↑    NMI↑    | F1↑    AMI↑    NMI↑
SR-clustering [13]  | 0.3044 0.0913  0.0924  | 0.2697 0.1294  0.1312  | 0.2614 0.1537  0.1557
TW-FINCH [51]       | 0.3132 0.1548  0.1603  | 0.3259 0.1649  0.1655  | 0.3072 0.1530  0.1545
SeLa [2]            | 0.6642 0.6291  0.6299  | 0.6662 0.6150  0.6158  | 0.5855 0.5954  0.5963
DAPC + bi-GRU [3]   | 0.7135 0.6129  0.6135  | 0.6152 0.6040  0.6048  | 0.6343 0.6080  0.6089
GALA [48]           | 0.6357 0.6079  0.6085  | 0.6458 0.6084  0.6093  | 0.5381 0.5932  0.5941
CARL [7]            | 0.5551 0.5219  0.5253  | 0.5847 0.5258  0.5262  | 0.5721 0.5139  0.5144
SEMA                | 0.7482 0.6510  0.6515  | 0.7976 0.6837  0.6842  | 0.7960 0.6806  0.6814

Table 1. Comparison between various SOTA approaches for subject S1 in the EgoRoutine dataset. Here, c denotes the number of clusters.
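Table 1 reports F1, AMI, and NMI, all standard clustering metrics. As a quick reference, the snippet below computes them with scikit-learn; since the exact F-score variant is not specified here, the pairwise F-measure derived from the pair confusion matrix is assumed.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score
from sklearn.metrics.cluster import pair_confusion_matrix

def clustering_scores(labels_true, labels_pred):
    """Compute (F1, AMI, NMI) for a predicted clustering against ground truth.
    F1 is assumed to be the pairwise F-measure; the paper's variant may differ."""
    nmi = normalized_mutual_info_score(labels_true, labels_pred)
    ami = adjusted_mutual_info_score(labels_true, labels_pred)
    # pairwise precision/recall from the 2x2 pair confusion matrix
    (tn, fp), (fn, tp) = pair_confusion_matrix(labels_true, labels_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, ami, nmi

# toy example: frame-level activity labels vs. predicted cluster assignments
true = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 0, 1, 2, 2, 2])
print(clustering_scores(true, pred))
```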


Visualization


A visualization comparing the predicted classes with ground truth across multiple days for different activities in the EgoRoutine dataset.
Figure 3: Visualization of a comparison between the predicted class and ground truth for different days

If you use this work, please cite:

@InProceedings{Nagar_2024_WACV,
    author = {Nagar, Pravin and Shastry, K.N. Ajay and Chaudhari, Jayesh and Arora, Chetan},
    title = {SEMA: Semantic Attention for Capturing Long-Range Dependencies in Egocentric Lifelogs},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month = {January},
    year = {2024},
    pages = {7025-7035}
}