Diffusion-based framework for weakly-supervised temporal action localization

Yuanbing Zou; Qingjie Zhao; Prodip Kumar Sarker; Shanshan Li; Lei Wang; Wangwang Liu

doi:10.1016/j.patcog.2024.111207

Corresponding author: Qingjie Zhao

School of Computer Science

Research output: Contribution to journal › Article › Peer-reviewed

Abstract

Weakly supervised temporal action localization aims to localize action instances with only video-level supervision. Because frame-level annotations are absent, effectively separating action snippets from background in semantically ambiguous features is a major challenge for this task. To address this issue from a generative modeling perspective, we propose a novel two-stage diffusion-based network. In the first stage, we design a local masking module that learns local semantic information and generates binary masks, which (1) are used to perform action-background separation and (2) serve as the pseudo-ground truth required by the diffusion module. In the second stage, we propose a diffusion module that generates high-quality action predictions under pseudo-ground-truth supervision. In addition, we further optimize the new-refining operation in the local masking module to improve its efficiency. Experimental results demonstrate that the proposed method achieves promising performance on the publicly available mainstream datasets THUMOS14 and ActivityNet. The code is available at http://github.com/Rlab123/action_diff.

Original language	English
Article number	111207
Journal	Pattern Recognition
Volume	160
DOI	http://doi.org/10.1016/j.patcog.2024.111207
Publication status	Published - Apr 2025

Cite this

@article{7aa5bff6132d41168990268948fa0a8f,

title = "Diffusion-based framework for weakly-supervised temporal action localization",

abstract = "Weakly supervised temporal action localization aims to localize action instances with only video-level supervision. Because frame-level annotations are absent, effectively separating action snippets from background in semantically ambiguous features is a major challenge for this task. To address this issue from a generative modeling perspective, we propose a novel two-stage diffusion-based network. In the first stage, we design a local masking module that learns local semantic information and generates binary masks, which (1) are used to perform action-background separation and (2) serve as the pseudo-ground truth required by the diffusion module. In the second stage, we propose a diffusion module that generates high-quality action predictions under pseudo-ground-truth supervision. In addition, we further optimize the new-refining operation in the local masking module to improve its efficiency. Experimental results demonstrate that the proposed method achieves promising performance on the publicly available mainstream datasets THUMOS14 and ActivityNet. The code is available at http://github.com/Rlab123/action\_diff.",

keywords = "Diffusion, Mask learning, Temporal action localization, Weakly-supervised learning",

author = "Yuanbing Zou and Qingjie Zhao and Sarker, {Prodip Kumar} and Shanshan Li and Lei Wang and Wangwang Liu",

note = "Publisher Copyright: {\textcopyright} 2024",

year = "2025",

month = apr,

doi = "10.1016/j.patcog.2024.111207",

language = "English",

volume = "160",

journal = "Pattern Recognition",

issn = "0031-3203",

publisher = "Elsevier Ltd.",

}

TY - JOUR

T1 - Diffusion-based framework for weakly-supervised temporal action localization

AU - Zou, Yuanbing

AU - Zhao, Qingjie

AU - Sarker, Prodip Kumar

AU - Li, Shanshan

AU - Wang, Lei

AU - Liu, Wangwang

PY - 2025/4

Y1 - 2025/4

AB - Weakly supervised temporal action localization aims to localize action instances with only video-level supervision. Because frame-level annotations are absent, effectively separating action snippets from background in semantically ambiguous features is a major challenge for this task. To address this issue from a generative modeling perspective, we propose a novel two-stage diffusion-based network. In the first stage, we design a local masking module that learns local semantic information and generates binary masks, which (1) are used to perform action-background separation and (2) serve as the pseudo-ground truth required by the diffusion module. In the second stage, we propose a diffusion module that generates high-quality action predictions under pseudo-ground-truth supervision. In addition, we further optimize the new-refining operation in the local masking module to improve its efficiency. Experimental results demonstrate that the proposed method achieves promising performance on the publicly available mainstream datasets THUMOS14 and ActivityNet. The code is available at http://github.com/Rlab123/action_diff.

KW - Diffusion

KW - Mask learning

KW - Temporal action localization

KW - Weakly-supervised learning

UR - http://www.scopus.com/pages/publications/85210403810

U2 - 10.1016/j.patcog.2024.111207

DO - 10.1016/j.patcog.2024.111207

M3 - Article

AN - SCOPUS:85210403810

SN - 0031-3203

VL - 160

JO - Pattern Recognition

JF - Pattern Recognition

M1 - 111207

ER -
