Tencent AI Lab Embedding Corpus for Chinese Words and Phrases


A corpus of continuous distributed representations of Chinese words and phrases.

Introduction

This corpus provides 200-dimensional vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases, pre-trained on large-scale, high-quality data. These vectors capture the semantic meanings of Chinese words and phrases and can be widely applied in downstream Chinese processing tasks (e.g., named entity recognition and text classification) and in further research.

Data Description

Download the corpus from: Tencent_AILab_ChineseEmbedding.tar.gz.

The pre-trained embeddings are in Tencent_AILab_ChineseEmbedding.txt. The first line gives the total number of embeddings and their dimension size, separated by a space. In each subsequent line, the first column is a Chinese word or phrase, followed by a space and its embedding; the embedding's values in different dimensions are separated by spaces.
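As an illustration, this format can be parsed with a few lines of Python (a minimal sketch; it assumes the file sits in the working directory):

# A minimal sketch of reading the plain-text format described above.
with open('Tencent_AILab_ChineseEmbedding.txt', encoding='utf-8') as f:
    num_embeddings, dim = map(int, f.readline().split())  # header: count and dimension
    first_entry = f.readline().rstrip('\n').split(' ')
    word, vector = first_entry[0], [float(v) for v in first_entry[1:]]
    assert len(vector) == dim  # e.g., dim == 200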

Highlights

In comparison with existing embedding corpora for Chinese, the main advantages of our corpus lie in its coverage, freshness, and accuracy.

  • Coverage. Our corpus contains a large number of domain-specific words and slang terms, such as “喀拉喀什河”, “皇帝菜”, “不念僧面念佛面”, “冰火两重天”, “煮酒论英雄”, which are not covered by most existing embedding corpora.
  • Freshness. Our corpus contains words that have recently appeared or become popular, such as “恋与制作人”, “三生三世十里桃花”, “打call”, “十动然拒”, “因吹斯汀”, etc.
  • Accuracy. Our embeddings better reflect the semantic meanings of Chinese words and phrases, thanks to the large-scale training data and a well-designed training algorithm.

Training

To ensure the coverage, freshness, and accuracy of our corpus, we carefully designed the data preparation and training process, from large-scale data collection and vocabulary construction to the training algorithm (the directional skip-gram model; see the Citation section).

Simple Cases

To exemplify the learned representations, below we show the most similar words for some sample words. Cosine similarity between embeddings is used to measure how close two words/phrases are.

Input: 喀拉喀什河
Top similar words: 墨玉河, 和田河, 玉龙喀什河, 白玉河, 喀什河, 叶尔羌河, 克里雅河, 玛纳斯河

Input: 因吹斯汀
Top similar words: 一颗赛艇, 因吹斯听, 城会玩, 厉害了word哥, emmmmm, 扎心了老铁, 神吐槽, 可以说是非常爆笑了

Input: 刘德华
Top similar words: 刘天王, 周润发, 华仔, 梁朝伟, 张学友, 古天乐, 张家辉, 张国荣

Input: 自然语言处理
Top similar words: 自然语言理解, 计算机视觉, 自然语言处理技术, 深度学习, 机器学习, 图像识别, 语义理解, 语音识别
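These neighbors can be reproduced with gensim once the embeddings are loaded as in Q1 of the FAQ below; a minimal sketch:

# Print the nearest neighbors of a query word by cosine similarity.
for word, score in wv_from_text.most_similar('刘德华', topn=8):
    print(word, round(score, 3))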

FAQ

Q1: Why do we encounter errors when reading the Tencent AI Lab embeddings with Google’s word2vec or gensim’s Word2Vec?

Our data file is encoded in UTF-8 and stored in the plain-text word2vec format. If you are using gensim, you can load our embeddings with the script below:

from gensim.models import KeyedVectors
# binary=False because the file is UTF-8 plain text, not the binary word2vec format
wv_from_text = KeyedVectors.load_word2vec_format('Tencent_AILab_ChineseEmbedding.txt', binary=False)
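Note that loading all 8 million-plus entries takes considerable time and memory. If a subset suffices for quick experiments (an assumption that depends on your task), load_word2vec_format also accepts a limit argument:

# Read only the first 500,000 entries to save memory.
wv_small = KeyedVectors.load_word2vec_format('Tencent_AILab_ChineseEmbedding.txt', binary=False, limit=500000)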

Q2: How did you segment Chinese words when processing the training data? What should we do to make our segmentation results similar to yours?

You might not be able to make full use of our embeddings if you only apply publicly available toolkits for Chinese word segmentation, because most of these toolkits segment phrases and entities into fine-grained elements. For some tasks, such fine-grained segmentation in preprocessing leads to worse model performance than coarse-grained segmentation, while for other tasks the opposite holds.

Currently, we are testing our word segmentation toolkit on diverse NLP tasks and further improving its performance; once ready, it will be released for public use. At the current stage, as a quick start, you can simply use an open-source toolkit for Chinese word segmentation and then combine adjacent words into phrases that appear in our vocabulary, as sketched below. For certain tasks, you may also consider using both fine-grained words (obtained from word segmentation) and coarse-grained phrases (obtained from word combination).
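A rough sketch of this word-combination step (not our official toolkit; segment_with_phrases and max_phrase_words are illustrative names, and jieba stands in for any open-source segmenter):

import jieba

def segment_with_phrases(text, vocab, max_phrase_words=4):
    # Segment with an off-the-shelf toolkit, then greedily merge adjacent
    # words into the longest phrase present in the embedding vocabulary.
    words = list(jieba.cut(text))
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(max_phrase_words, len(words) - i), 0, -1):
            phrase = ''.join(words[i:i + n])
            if n == 1 or phrase in vocab:  # fall back to the single word
                tokens.append(phrase)
                i += n
                break
    return tokens

# e.g., segment_with_phrases(text, wv_from_text.key_to_index)  # .vocab in gensim 3.x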

Q3: Why are there stop words (e.g., “的” and “是”), digits, and punctuation marks (e.g., “,” and “。”) in the vocabulary of the Tencent AI Lab embeddings?

We did not remove these tokens, to ensure the coverage of our vocabulary and the general applicability of our embeddings in diverse scenarios. Though not useful in many applications, stop words, digits, and punctuation marks can be informative for certain tasks, such as named entity recognition and part-of-speech tagging. To better adapt our embeddings to your specific task, you may build a custom vocabulary and ignore the words or phrases absent from it.
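For instance, a minimal sketch of such filtering, assuming wv_from_text was loaded as in Q1 and custom_vocab is your own task-specific set of words:

def lookup(token, wv, custom_vocab):
    # Return the embedding only for tokens in your own vocabulary, so stop
    # words, digits, and punctuation can be skipped for your task.
    if token in custom_vocab and token in wv.key_to_index:  # .vocab in gensim 3.x
        return wv[token]
    return None  # treat as out-of-vocabulary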

Final Words

Thanks for using our corpus! Please don't forget to let us know if our embeddings advance the state of the art in your Chinese natural language processing task.

Citation

If you use our corpus, please cite: Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings. NAACL 2018 (Short Paper).

Contacts

If you have any questions, please contact us at nlu@tencent.com.

You can also visit our lab's official website: Tencent AI Lab

Disclaimer

This corpus is for research purposes only and is released under a Creative Commons Attribution 3.0 Unported License (http://creativecommons.org/licenses/by/3.0/).