Tencent AI Lab Official Website
Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
Abstract
Dense video captioning is a newly emerging topic that aims at both localizing and describing all events in a video. We identify and tackle two challenges in this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context. We introduce a bidirectional proposal method that effectively utilizes both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable in previous works, resulting in identical captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents (e.g., C3D features). We further propose a novel context gating mechanism to dynamically balance the contributions from the current event and its surrounding contexts. We empirically show that our attentively fused event representation performs better than the proposal hidden states or video contents alone. By coupling the proposal and captioning modules into one unified framework, our model outperforms the state-of-the-art method on the ActivityNet Captions dataset with a relative gain of over 100% (METEOR score from 4.82 to 9.65).
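To make the idea of context gating concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified element-wise gate: a sigmoid over a weighted sum of the event and context features decides, per dimension, how much of the event feature versus the context feature is kept. The function name `context_gate` and the parameters `w_event`, `w_context`, and `bias` are hypothetical and introduced only for illustration; the abstract does not specify the exact formulation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def context_gate(event, context, w_event, w_context, bias):
    """Blend an event vector with its context vector through a gate.

    Assumption: per-dimension (element-wise) gate parameters. A real model
    would likely use learned weight matrices; this is only an illustration.
    Returns the fused vector and the gate values that produced it.
    """
    fused, gates = [], []
    for e, c, we, wc, b in zip(event, context, w_event, w_context, bias):
        g = sigmoid(we * e + wc * c + b)    # gate value in (0, 1)
        gates.append(g)
        fused.append(g * e + (1.0 - g) * c)  # convex mix of event and context
    return fused, gates


# Usage: with all-zero parameters the gate is 0.5, so each fused
# dimension is the plain average of the event and context features.
event = [1.0, 2.0]
context = [3.0, 4.0]
zeros = [0.0, 0.0]
fused, gates = context_gate(event, context, zeros, zeros, zeros)
```

A gate close to 1 lets the event feature dominate, while a gate close to 0 favors the surrounding context, which is one simple way to read "balancing the contributions dynamically".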
Venue
CVPR 2018
Publication Time
2018
Authors
Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, Yong Xu