Tencent AI Lab Website

Papers

Learning Domain-Sensitive and Sentiment-Aware Word Embeddings

Word embeddings have been widely used in sentiment classification because of their efficacy for semantic representations of words. Given reviews from different domains, some existing methods for word embeddings exploit sentiment information, but they cannot produce domain-sensitive embeddings. On the other hand, some existing methods can generate domain-sensitive word embeddings, but they cannot distinguish words with similar contexts but opposite sentiment polarity. We propose a new method for learning domain-sensitive and sentiment-aware embeddings that simultaneously capture the sentiment semantics and domain sensitivity of individual words. Our method can automatically determine and produce domain-common embeddings and domain-specific embeddings. The differentiation of domain-common and domain-specific words makes it possible to take advantage of data augmentation of common semantics from multiple domains while capturing the varied semantics of specific words across domains. Experimental results show that our model provides an effective way to learn domain-sensitive and sentiment-aware word embeddings that benefit sentiment classification at both the sentence level and the lexicon term level.

ACL 2018

Transformation Networks for Target-Oriented Sentiment Classification

Target-oriented sentiment classification aims at classifying sentiment polarities over individual opinion targets in a sentence. An RNN with attention seems a good fit for the characteristics of this task, and indeed it achieves the state-of-the-art performance. After re-examining the drawbacks of the attention mechanism and the obstacles that prevent CNNs from performing well in this classification task, we propose a new model to overcome these issues. Instead of attention, our model employs a CNN layer to extract salient features from the transformed word representations originating from a bi-directional RNN layer. Between the two layers, we propose a component to generate target-specific representations of words in the sentence, while incorporating a mechanism for preserving the original contextual information from the RNN layer. Experiments show that our model achieves a new state-of-the-art performance on a few benchmarks.

ACL 2018

Automatic Article Commenting: the Task and Dataset

Comments on online articles provide extended views and improve user engagement. Automatically generating comments thus becomes a valuable functionality for online forums, intelligent chatbots, etc. This paper proposes the new task of automatic article commenting, and introduces a large-scale Chinese dataset with millions of real comments and a human-annotated subset characterizing the comments' varying quality. Incorporating human assessments of comment quality, we further develop automatic metrics that generalize a broad set of popular reference-based metrics and exhibit greatly improved correlations with human evaluations.

ACL 2018

hyperdoc2vec: Distributed Representations of Hypertext Documents

Hypertext documents, such as web pages and academic papers, are of great importance in delivering information in our daily life. Although effective on plain documents, conventional text embedding methods suffer from information loss if directly adapted to hyper-documents. In this paper, we propose a general embedding approach for hyper-documents, namely hyperdoc2vec, along with four criteria characterizing the necessary information that hyper-document embedding models should preserve. Systematic comparisons are conducted between hyperdoc2vec and several competitors on two tasks, i.e., paper classification and citation recommendation, in the academic paper domain. Analyses and experiments both validate the superiority of hyperdoc2vec over other models w.r.t. the four criteria.

ACL 2018

Towards Robust Neural Machine Translation

Small perturbations in the input can severely distort intermediate representations and thus impact the translation quality of neural machine translation (NMT) models. In this paper, we propose to improve the robustness of NMT models with adversarial stability training. The basic idea is to make both the encoder and decoder in NMT models robust against input perturbations by enabling them to behave similarly for the original input and its perturbed counterpart. Experimental results on Chinese-English, English-German and English-French translation tasks show that our approaches not only achieve significant improvements over strong NMT systems but also improve the robustness of NMT models.
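
As a rough illustration of the idea described in this abstract, the sketch below combines translation losses on a clean source sentence and its perturbed copy with a term that pushes their encoder representations to agree. The model interface, the loss weight, and the plain L2 agreement term are assumptions for illustration only; the paper realizes the agreement objective adversarially.

```python
# A simplified sketch, not the authors' implementation: `model` is a hypothetical
# NMT module that returns (encoder_states, negative_log_likelihood) for a batch.
import torch

def stability_loss(model, src, src_perturbed, tgt, lam=1.0):
    enc_clean, nll_clean = model(src, tgt)
    enc_noisy, nll_noisy = model(src_perturbed, tgt)
    # Encourage the encoder to behave similarly on the original and the perturbed
    # input (the paper uses an adversarial discriminator instead of this L2 term).
    invariance = ((enc_clean - enc_noisy) ** 2).mean()
    return nll_clean + nll_noisy + lam * invariance
```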

ACL 2018

Dual Skipping Networks

Inspired by the recent neuroscience studies on the left-right asymmetry of the human brain in processing low and high spatial frequency information, this paper introduces a dual skipping network which carries out coarse-to-fine object categorization. Such a network has two branches to simultaneously deal with both coarse and fine-grained classification tasks. Specifically, we propose a layer-skipping mechanism that learns a gating network to predict which layers to skip in the testing stage. This layer-skipping mechanism endows the network with good flexibility and capability in practice. Evaluations are conducted on several widely used coarse-to-fine object categorization benchmarks, and promising results are achieved by our proposed network model.
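
To make the layer-skipping idea concrete, here is a hedged PyTorch sketch in which a small gating network scores a shape-preserving block and the block is bypassed at test time when its gate is low. The gate architecture, the threshold, and all names are illustrative assumptions, not the paper's exact design.

```python
# A minimal sketch (assumptions only) of gated layer skipping at test time.
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """Wraps a shape-preserving block with a learned skip gate (illustrative)."""
    def __init__(self, block, channels, threshold=0.5):
        super().__init__()
        self.block = block
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, 1), nn.Sigmoid())
        self.threshold = threshold

    def forward(self, x):
        g = self.gate(x)                             # (B, 1) keep-probability per sample
        if not self.training and torch.all(g < self.threshold):
            return x                                 # skip the block entirely at test time
        return x + g.view(-1, 1, 1, 1) * self.block(x)

# Example: wrap a 3x3 conv block operating on 64-channel feature maps.
block = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
layer = SkippableBlock(block, channels=64)
out = layer(torch.randn(2, 64, 32, 32))              # -> (2, 64, 32, 32)
```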

CVPR 2018

Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi-Supervised Semantic Segmentation

Despite remarkable progress, weakly supervised segmentation methods are still inferior to their fully supervised counterparts. We observe that the performance gap mainly comes from the inability to produce dense and integral pixel-level object localization for training images with only image-level labels. In this work, we revisit the dilated convolution proposed in [1] and shed light on how it enables the classification network to generate dense object localization. By substantially enlarging the receptive fields of convolutional kernels with different dilation rates, the classification network can localize object regions even when they are not highly discriminative for classification, and finally produces reliable object regions that benefit both weakly- and semi-supervised semantic segmentation. Despite the apparent simplicity of dilated convolution, we are able to obtain superior performance for semantic segmentation tasks. In particular, it achieves 60.8% and 67.6% mean Intersection-over-Union (mIoU) on the Pascal VOC 2012 test set in the weakly-supervised (only image-level labels are available) and semi-supervised (1,464 segmentation masks are available) settings, respectively, which are new state-of-the-art results.
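
For intuition about how different dilation rates enlarge the receptive field without extra parameters, below is a minimal PyTorch sketch of classification heads with several dilation rates whose class activation maps are averaged. The class count, the rates, and the fusion rule are illustrative assumptions; the exact configuration follows the paper.

```python
# A minimal sketch (not the authors' code) of multi-dilation classification heads.
import torch
import torch.nn as nn

class MultiDilationHead(nn.Module):
    def __init__(self, in_channels=512, num_classes=20, rates=(1, 3, 6, 9)):
        super().__init__()
        # One 3x3 classifier branch per dilation rate; padding=rate keeps the
        # spatial size unchanged so the branch outputs can be averaged.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, num_classes, kernel_size=3,
                      padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, features):
        # Average the per-branch class activation maps; larger rates respond to
        # object regions farther from the most discriminative part.
        cams = [branch(features) for branch in self.branches]
        return torch.stack(cams, dim=0).mean(dim=0)

head = MultiDilationHead()
cam = head(torch.randn(1, 512, 41, 41))   # -> (1, 20, 41, 41)
```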

CVPR 2018

Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

Dense video captioning is a newly emerging topic that aims at both localizing and describing all events in a video. We identify and tackle two challenges in this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context. We introduce a bidirectional proposal method that effectively utilizes both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable in previous works, resulting in identical captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents (e.g., C3D features). We further propose a novel context gating mechanism to dynamically balance the contributions from the current event and its surrounding contexts. We empirically show that our attentively fused event representation performs better than the proposal hidden states or video contents alone. By coupling the proposal and captioning modules into one unified framework, our model outperforms the state-of-the-art method on the ActivityNet Captions dataset with a relative gain of over 100% (METEOR score from 4.82 to 9.65).

CVPR 2018

Gated Fusion Network for Single Image Dehazing

In this paper, we propose an efficient algorithm to directly restore a clear image from a hazy input. The proposed algorithm hinges on an end-to-end trainable neural network that consists of an encoder and a decoder. The encoder is exploited to capture the context of the derived input images, while the decoder is employed to estimate the contribution of each input to the final dehazed result using the learned representations attributed to the encoder. The constructed network adopts a novel fusion-based strategy which derives three inputs from an original hazy image by applying White Balance (WB), Contrast Enhancing (CE) and Gamma Correction (GC). We compute pixel-wise confidence maps based on the appearance differences between these different inputs to blend the information of the derived inputs and preserve the regions with pleasant visibility. The final dehazed image is yielded by gating the important features of the derived inputs. To train the network, we introduce a multi-scale based approach so that the halo artifacts can be avoided. Extensive experimental results on both synthetic and real-world images demonstrate that the proposed algorithm performs favorably against the state-of-the-art algorithms.
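
The pixel-wise blending step described here can be summarized in a short sketch: the three derived inputs (white-balanced, contrast-enhanced, gamma-corrected) are combined with confidence maps predicted by the network. The softmax normalization and all names below are assumptions for illustration, not the released model.

```python
# A minimal sketch of confidence-map gating over the three derived inputs
# (illustrative only; the confidence maps would come from the encoder-decoder).
import torch

def gated_fusion(i_wb, i_ce, i_gc, conf):
    """i_*: (B, 3, H, W) derived inputs; conf: (B, 3, H, W) per-input confidences."""
    weights = torch.softmax(conf, dim=1)            # normalize the three maps per pixel
    w_wb, w_ce, w_gc = weights.chunk(3, dim=1)      # each (B, 1, H, W)
    return w_wb * i_wb + w_ce * i_ce + w_gc * i_gc  # gated, dehazed estimate
```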

CVPR 2018

Reconstruction Network for Video Captioning

In this paper, the problem of describing the visual content of a video sequence with natural language is addressed. Unlike previous video captioning work that mainly exploits the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder makes use of the forward flow to produce the sentence description based on the encoded video semantic features. Two types of reconstructors are customized to employ the backward flow and reproduce the video features based on the hidden state sequence generated by the decoder. The generation loss yielded by the encoder-decoder and the reconstruction loss introduced by the reconstructor are jointly used to train the proposed RecNet in an end-to-end fashion. Experimental results on benchmark datasets demonstrate that the proposed reconstructor can boost the encoder-decoder models and leads to significant gains in video captioning accuracy.
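
The joint objective described above can be summarized in a short sketch: the usual word-level cross-entropy from the encoder-decoder plus a reconstruction penalty on the video features rebuilt by the reconstructor. The function names and the trade-off weight are illustrative assumptions, not the authors' released code.

```python
# A schematic sketch of a generation-plus-reconstruction objective (assumed names).
import torch
import torch.nn.functional as F

def recnet_loss(word_logits, word_targets, rebuilt_feats, video_feats, lam=0.2):
    # Forward flow: cross-entropy over the generated sentence.
    gen_loss = F.cross_entropy(
        word_logits.reshape(-1, word_logits.size(-1)),
        word_targets.reshape(-1))
    # Backward flow: how well the reconstructor reproduces the video features.
    rec_loss = F.mse_loss(rebuilt_feats, video_feats)
    return gen_loss + lam * rec_loss
```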

CVPR 2018

Regularizing RNNs for Caption Generation by Reconstructing The Past with The Present

Recently, caption generation with an encoder-decoder framework has been extensively studied and applied in different domains, such as image captioning, code captioning, and so on. In this paper, we propose a novel architecture, namely the Auto-Reconstructor Network (ARNet), which, coupled with the conventional encoder-decoder framework, works in an end-to-end fashion to generate captions. ARNet aims at reconstructing the previous hidden state from the present one, in addition to the input-dependent transition operator. Therefore, ARNet encourages the current hidden state to embed more information from the previous one, which helps regularize the transition dynamics of recurrent neural networks (RNNs). Extensive experimental results show that our proposed ARNet boosts the performance over the existing encoder-decoder models on both image captioning and source code captioning tasks. Additionally, ARNet remarkably reduces the discrepancy between the training and inference processes for caption generation. Furthermore, the performance on permuted sequential MNIST demonstrates that ARNet can effectively regularize RNNs, especially in modeling long-term dependencies.
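
A minimal sketch of the regularizer the abstract describes: a small network attempts to rebuild the previous decoder hidden state from the current one, and its error is added to the captioning loss. The projection architecture and the weight are assumptions for illustration.

```python
# A hedged sketch of hidden-state reconstruction as an RNN regularizer.
import torch
import torch.nn as nn

class AutoReconstructor(nn.Module):
    def __init__(self, hidden_size=512):
        super().__init__()
        # Assumed two-layer projection; the paper's exact reconstructor may differ.
        self.project = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.Tanh(),
            nn.Linear(hidden_size, hidden_size))

    def forward(self, hidden_states):
        """hidden_states: (T, B, H) decoder states h_1..h_T.
        Returns the mean error of rebuilding h_{t-1} from h_t."""
        current, previous = hidden_states[1:], hidden_states[:-1]
        return ((self.project(current) - previous) ** 2).mean()

# Usage (alpha is a hypothetical weight):
# total_loss = caption_loss + alpha * AutoReconstructor(512)(decoder_states)
```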

CVPR 2018

Tagging like Humans: Diverse and Distinct Image Annotation

In this work, we propose a new automatic image annotation model, dubbed diverse and distinct image annotation (D2IA). The generative model D2IA is inspired by the ensemble of human annotations, which produces semantically relevant, distinct and diverse tags. In D2IA, we generate a relevant and distinct tag subset, in which the tags are relevant to the image contents and semantically distinct from each other, using sequential sampling from a determinantal point process (DPP) model. Multiple such tag subsets that cover diverse semantic aspects or diverse semantic levels of the image contents are generated by randomly perturbing the DPP sampling process. We leverage a generative adversarial network (GAN) model to train D2IA. We perform extensive experiments, including quantitative and qualitative comparisons as well as human subject studies, on two benchmark datasets to demonstrate that the proposed model can produce more diverse and distinct tags than state-of-the-art methods.

CVPR 2018

Left-right Comparative Recurrent Model for Stereo Matching

Leveraging the disparity information from both the left and right views is crucial for stereo disparity estimation. The left-right consistency check is an effective way to enhance disparity estimation by referring to the information from the opposite view. However, the conventional left-right consistency check is an isolated post-processing step and heavily hand-crafted. This paper proposes a novel left-right comparative recurrent model to perform left-right consistency checking jointly with disparity estimation. At each recurrent step, it produces disparity results for both views, and then performs an online left-right comparison to identify the mismatched regions that are likely to contain erroneously labeled pixels. A soft attention mechanism is introduced that leverages the learned error maps to better guide the model to selectively focus on refining the unreliable regions at the next recurrent step. In this way, the generated disparity maps are progressively improved by the left-right comparative recurrent model. Extensive evaluations on the KITTI 2015, Scene Flow and Middlebury benchmarks validate the effectiveness of the proposed model, and demonstrate that state-of-the-art stereo disparity estimation results can be achieved by the new model.
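
For readers unfamiliar with the conventional check that this model replaces with a learned, recurrent comparison, here is a plain NumPy sketch of the classic left-right consistency test; the threshold and function name are illustrative.

```python
# The conventional, hand-crafted left-right check: a left-view disparity is
# trusted only if the corresponding right-view disparity agrees with it.
import numpy as np

def lr_consistency_mask(disp_left, disp_right, threshold=1.0):
    """disp_left, disp_right: (H, W) disparity maps. Returns a boolean mask of
    pixels whose left and right estimates are mutually consistent."""
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)           # column index per pixel
    # Pixel (y, x) in the left image corresponds to (y, x - d) in the right image.
    matched_x = np.clip(np.round(xs - disp_left).astype(int), 0, w - 1)
    disp_right_warped = np.take_along_axis(disp_right, matched_x, axis=1)
    return np.abs(disp_left - disp_right_warped) <= threshold
```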

CVPR 2018

Learning to See in the Dark

Imaging in low light is challenging due to low photon count and low SNR. Short-exposure images suffer from noise, while long exposure can lead to blurry images and is often impractical. A variety of denoising, deblurring, and enhancement techniques have been proposed, but their effectiveness is limited in extreme conditions, such as video-rate imaging at night. To support the development of learning-based pipelines for low-light image processing, we introduce a dataset of raw short-exposure night-time images, with corresponding long-exposure reference images. Using the presented dataset, we develop a pipeline for processing low-light images, based on end-to-end training of a fully-convolutional network. The network operates directly on raw sensor data and replaces much of the traditional image processing pipeline, which tends to perform poorly on such data. We report promising results on the new dataset, analyze factors that affect performance, and highlight opportunities for future work.

CVPR 2018

End-to-End Learning of Motion Representation for Video Understanding

Despite the recent success of end-to-end learned representations, hand-crafted optical flow features are still widely used in video analysis tasks. To fill this gap, we propose TVNet, a novel end-to-end trainable neural network to learn optical-flow-like features from data. TVNet subsumes a specific optical flow solver, the TV-L1 method, and is initialized by unfolding its optimization iterations as neural layers. TVNet can therefore be used directly without any extra learning. Moreover, it can be naturally concatenated with other task-specific networks to formulate an end-to-end architecture, thus making our method more efficient than current multi-stage approaches by avoiding the need to pre-compute and store features on disk. Finally, the parameters of TVNet can be further fine-tuned by end-to-end training. This enables TVNet to learn richer and task-specific patterns beyond exact optical flow. Extensive experiments on two action recognition benchmarks verify the effectiveness of the proposed approach. Our TVNet achieves better accuracies than all compared methods, while being competitive with the fastest counterpart in terms of feature extraction time.

CVPR 2018

Geometry-Guided CNN for Self-supervised Video Representation learning

It is often laborious and costly to manually annotate videos for training high-quality video recognition models, so there has been some work and interest in exploring alternative, cheap, and yet often noisy and indirect training signals for learning video representations. However, these signals are still coarse, supplying supervision at the whole video frame level, and subtle, sometimes forcing the learning agent to solve problems that are even hard for humans. In this paper, we instead explore geometry, a brand-new type of auxiliary supervision for the self-supervised learning of video representations. In particular, we extract pixel-wise geometry information as flow fields and disparity maps from synthetic imagery and real 3D movies. Although geometry and high-level semantics are seemingly distant topics, surprisingly, we find that the convolutional neural networks pre-trained with the geometry cues can be effectively adapted to semantic video understanding tasks. In addition, we also find that a progressive training strategy can foster a better neural network for the video recognition task than blindly pooling the distinct sources of geometry cues together. Extensive results on video dynamic scene recognition and action recognition tasks show that our geometry-guided networks significantly outperform competing methods that are trained with other types of labeling-free supervision signals.

CVPR 2018

Deep Face Detector Adaptation without Negative Transfer or Catastrophic Forgetting

Arguably, no single face detector fits all real scenarios. It is often desirable to have some built-in schemes for a face detector to automatically adapt, e.g., to a particular user's photo album (the target domain). We propose a novel face detector adaptation approach that works as long as there are representative images of the target domain, regardless of whether they are labeled, and without the need to access the training data of the source domain. Our approach explicitly accounts for the notorious negative transfer caveat in domain adaptation thanks to a residual loss by design. It does not incur catastrophic interference with the knowledge learned from the source domain and, therefore, the adapted face detectors maintain about the same performance as the old detectors in the original source domain. As such, our adaptation approach for face detectors is analogous to the popular interpolation techniques for language models; it may open a new direction for progressively training face detectors domain by domain. We report extensive experimental results to verify our approach on two massively benchmarked face detectors.

CVPR 2018

End-to-end Convolutional Semantic Embeddings

Semantic embeddings for images and sentences have been widely studied recently. The ability of deep neural networks to learn rich and robust visual and textual representations offers the opportunity to develop effective semantic embedding models. Currently, state-of-the-art approaches in semantic learning first employ deep neural networks to encode images and sentences into a common semantic space. Then, the learning objective is to ensure a larger similarity between matching image-sentence pairs than randomly sampled pairs. Usually, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are employed for learning image and sentence representations, respectively. On one hand, CNNs are known to produce robust visual features at different levels and RNNs are known for capturing dependencies in sequential data. Therefore, this simple framework can be sufficiently effective in learning visual and textual semantics. On the other hand, different from CNNs, RNNs cannot produce middle-level (e.g., phrase-level in text) representations. As a result, only global representations are available for semantic learning. This could potentially limit the performance of the model due to the hierarchical structures in images and sentences. In this work, we apply Convolutional Neural Networks to process both images and sentences. Consequently, we can employ mid-level representations to assist global semantic learning by introducing a new learning objective on the convolutional layers. The experimental results show that our proposed textual CNN models with the new learning objective lead to better performance than state-of-the-art approaches.
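
As background for the "larger similarity for matching pairs than for randomly sampled pairs" objective mentioned above, here is a hedged sketch of a standard bidirectional margin ranking loss over a batch of paired embeddings. The paper's additional objective on convolutional mid-level features is not reproduced; the margin value and names are illustrative.

```python
# A hedged sketch of a bidirectional margin ranking loss for image-sentence matching.
import torch

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings of matching pairs."""
    scores = img_emb @ txt_emb.t()                        # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)                       # matched-pair scores
    cost_txt = (margin + scores - pos).clamp(min=0)       # image vs. wrong sentences
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # sentence vs. wrong images
    eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(eye, 0)               # ignore the diagonal
    cost_img = cost_img.masked_fill(eye, 0)
    return cost_txt.sum() + cost_img.sum()
```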

CVPR 2018

Image Correction via Deep Reciprocating HDR Transformation

Image correction aims to adjust an input image into a visually pleasing one with the detail in the under/over-exposed regions recovered. However, existing image correction methods are mainly based on image pixel operations, and attempting to recover the lost detail from these under/over-exposed regions is challenging. We therefore revisit the image formation procedure and notice that the detail is contained in the high dynamic range (HDR) light intensities (which can be perceived by human eyes) but is lost during the nonlinear imaging process of the camera in the low dynamic range (LDR) domain. Inspired by this observation, we formulate the image correction problem as a Deep Reciprocating HDR Transformation (DRHT) process and propose a novel approach to first reconstruct the lost detail in the HDR domain and then transfer it back to the LDR image as the output image with the recovered detail preserved. To this end, we propose an end-to-end DRHT model, which contains two CNNs, one for HDR detail reconstruction and the other for LDR detail correction. Experiments on standard benchmarks demonstrate the effectiveness of the proposed method compared with state-of-the-art image correction methods.

CVPR 2018

Dynamic Scene Deblurring Using Spatially Variant Recurrent Neural Networks

Dynamic scene blur is spatially variant which is usually caused by camera shake, scene depth and object motions. Conventional methods either use heuristic image priors or deep neural networks with large models which cannot effectively handle this problem and are computationally expensive. In contrast to existing algorithms, we propose a spatially variant neural network to solve dynamic scene deblurring. The proposed algorithm contains three deep convolutional neural networks (CNNs) and a recurrent neural network (RNN), where the CNNs are used to extract features, learn RNN weights as well as reconstruct images and the RNN is used to restore clear images under the guidance of confidence features from a CNN. Our analysis shows that the proposed algorithm has a large receptive field but with smaller model size. In addition, we analyze the relationship between the proposed spatially variant RNN and the deconvolution process to show that the spatially variant RNN is able to model the deblurring process. By training in an end-to-end manner, the proposed method has much smaller model size and is much faster in comparison with existing methods. Both quantitative and qualitative evaluations on the benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art algorithms in terms of accuracy, speed, and model size.

CVPR 2018

Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval

Thanks to the success of deep learning, cross-modal retrieval has gained significant improvements recently. However, there still exists a crucial bottleneck, i.e., how to bridge the modality gap to further promote retrieval accuracy. In this paper, we propose a Self-Supervised Adversarial Hashing (SSAH) approach, which is among the early attempts to incorporate adversarial learning into cross-modal hashing in a self-supervised manner. The primary contribution of this work is that we employ a couple of adversarial networks to maximize the semantic correlation and representation consistency between different modalities. In addition, we harness a self-supervised semantic network to discover high-level semantic information in the form of multi-label annotations, which guides the feature learning process to preserve modality relationships in both the common semantic space and the Hamming space. Extensive experiments on three benchmark datasets show that the proposed SSAH outperforms state-of-the-art methods.

CVPR 2018

Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks

We propose a novel framework called Semantics-Preserving Adversarial Embedding Network (SP-AEN) for zero-shot visual recognition (ZSL), where testing images and their classes are both unseen during training. SP-AEN aims to tackle the inherent problem --- semantic loss --- in the prevailing family of embedding-based ZSL, where some semantics would be discarded during training if they are non-discriminative for training classes, but informative for testing classes. Specifically, SP-AEN prevents the semantic loss by introducing an independent visual-to-semantic space embedder which disentangles the semantic space into two subspaces for the two arguably conflicting objectives: classification and reconstruction. Through adversarial learning of the two subspaces, SP-AEN can transfer the semantics from the reconstructive subspace to the discriminative one, accomplishing improved zero-shot recognition on unseen classes. Compared against prior works, SP-AEN can not only improve classification but also generate photo-realistic images, demonstrating the effectiveness of semantic preservation. SP-AEN considerably outperforms the state-of-the-art ZSL methods by absolute 12.2%, 9.3%, 4.0%, and 3.6% in harmonic mean values on CUB, AWA, SUN and aPY, respectively.

CVPR 2018

VITAL: VIsual Tracking via Adversarial Learning

The tracking-by-detection framework consists of two stages, i.e., drawing samples around the target object in the first stage and classifying each sample as the target object or as background in the second stage. The performance of existing tracking-by-detection trackers using deep classification networks is limited by two aspects. First, the positive samples in each frame are highly spatially overlapped, and they fail to capture rich appearance variations. Second, there exists severe class imbalance between positive and negative samples. This paper presents the VITAL algorithm to address these two problems via adversarial learning. To augment positive samples, we use a generative network to randomly generate masks, which are applied to input features to capture a variety of appearance changes. With the use of adversarial learning, our network identifies the mask that maintains the most robust features of the target objects over a long temporal span. In addition, to handle the issue of class imbalance, we propose a high-order cost sensitive loss to decrease the effect of easy negative samples to facilitate training the classification network. Extensive experiments on benchmark datasets demonstrate that the proposed tracker performs favorably against state-of-the-art approaches.

CVPR 2018

Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks

Taking a photo outside, can we predict the immediate future, e.g., how the clouds will move in the sky? We address this problem by presenting a generative adversarial network (GAN) based two-stage approach to generating realistic time-lapse videos of high resolution. Given the first frame, our model learns to generate long-term future frames. The first stage generates videos with realistic content for each frame. The second stage refines the generated video from the first stage by enforcing it to be closer to real videos with regard to motion dynamics. To further encourage vivid motion in the final generated video, a Gram matrix is employed to model the motion more precisely. We build a large-scale time-lapse dataset, and test our approach on this new dataset. Using our model, we are able to generate realistic videos of up to 128x128 resolution for 32 frames. Quantitative and qualitative experimental results demonstrate the superiority of our model over state-of-the-art models.
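
For reference, the Gram matrix mentioned above can be computed as in the short sketch below; how it enters the training objective (and which features it is applied to) follows the paper and is not assumed here.

```python
# A minimal sketch of a channel-wise Gram matrix over video features.
import torch

def gram_matrix(feats):
    """feats: (B, C, T, H, W) features of a generated or real video clip.
    Returns (B, C, C) channel correlations, which summarize motion/texture
    statistics independently of exact spatial or temporal position."""
    b, c = feats.shape[:2]
    flat = feats.reshape(b, c, -1)
    return flat @ flat.transpose(1, 2) / flat.size(-1)
```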

CVPR 2018

CNN in MRF: Video Object Segmentation via Inference in A CNN-Based Higher-Order Spatio-Temporal MRF

This paper addresses the problem of video object segmentation, where the initial object mask is given in the first frame of an input video. We propose a novel spatio-temporal Markov Random Field (MRF) model defined over pixels to tackle the problem. Unlike conventional MRF models, the spatial dependencies among pixels in our model are encoded by a Convolutional Neural Network (CNN). Specifically, for a given object, the probability of a labeling to a set of spatially neighboring pixels can be predicted by a pre-trained CNN for this specific object. As a result, higher-order, richer dependencies among pixels in the set can be implicitly modeled by the CNN. With temporal dependencies established by optical flow, the resulting MRF model combines both spatial and temporal cues for tackling video object segmentation. However, performing inference in the MRF model is very difficult due to the very high-order dependencies. To this end, we propose a novel CNN-embedded approximate algorithm to efficiently perform inference in the MRF model. This algorithm proceeds by alternating between a temporal fusion step and a feed-forward CNN step. When initialized with a simple appearance-based one-shot segmentation CNN, our model outperforms the winning entries of the DAVIS 2017 Challenge, without resorting to model ensembling or any dedicated detectors.

CVPR 2018

CosFace: Large Margin Cosine Loss for Deep Face Recognition

Face recognition has achieved revolutionary advancement owing to the progress of deep convolutional neural networks (CNNs). The central task of face recognition, including face verification and identification, involves face feature discrimination. However, the traditional softmax loss of deep CNNs usually lacks discriminative power. To address this problem, several loss functions such as central loss [1], large margin softmax loss [2], and angular softmax loss [3] have recently been proposed. All these improved algorithms share the same idea: maximizing inter-class variance and minimizing intra-class variance. In this paper, we design a novel loss function, namely the large margin cosine loss (LMCL), to realize this idea from a different perspective. More specifically, we reformulate the softmax loss as a cosine loss by L2-normalizing both features and weight vectors to remove radial variation, based on which a cosine margin term m is introduced to further maximize the decision margin in angular space. As a result, minimum intra-class variance and maximum inter-class variance are achieved by normalization and cosine decision margin maximization. We refer to our model trained with LMCL as CosFace. To test our approach, extensive experimental evaluations are conducted on the most popular public-domain face recognition datasets such as the MegaFace Challenge, YouTube Faces (YTF) and Labeled Faces in the Wild (LFW). We achieve state-of-the-art performance on these benchmarks, which confirms the effectiveness of our approach.
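
The large margin cosine loss described above is compact enough to sketch directly: features and class weight vectors are L2-normalized, the margin m is subtracted from the target-class cosine, and the logits are scaled by s before the softmax cross-entropy. The hyperparameter values below are placeholders rather than the paper's settings.

```python
# A minimal sketch of a large margin cosine loss (placeholder hyperparameters).
import torch
import torch.nn.functional as F

def lmcl(features, weights, labels, s=30.0, m=0.35):
    """features: (B, D), weights: (num_classes, D), labels: (B,) class indices."""
    # Cosine similarities between normalized features and normalized class weights.
    cos = F.normalize(features) @ F.normalize(weights).t()   # (B, num_classes)
    onehot = F.one_hot(labels, num_classes=weights.size(0)).float()
    logits = s * (cos - m * onehot)      # subtract the margin only for the true class
    return F.cross_entropy(logits, labels)
```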

CVPR 2018