A corpus of continuous distributed representations of Chinese words and phrases.
The latest version of this corpus provides 100-dimension and 200-dimension vector representations, a.k.a. embeddings, for over 12 million Chinese words and phrases, pre-trained on large-scale, high-quality data. These vectors capture the semantic meanings of Chinese words and phrases and can be widely applied in downstream Chinese processing tasks (e.g., named entity recognition and text classification) and in further research.
Please go to the download page to get the embedding data. The data format of each file is as follows: the first line gives the total number of embeddings and their dimension size, separated by a space. Each subsequent line starts with a word or phrase, followed by a space and its embedding; the values of the embedding in different dimensions are also separated by spaces.
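For illustration, a 200-dimension file might begin as follows; the count and the values here are made-up placeholders, not taken from the actual corpus:

12000000 200
的 0.209092 -0.165459 0.058054 ...
自然语言处理 0.021055 0.236170 -0.130972 ...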
In comparison with existing embedding corpora for Chinese, the main advantages of our corpus are its coverage, freshness, and accuracy. To ensure these three properties, we carefully designed our data preparation and training process.
To exemplify the learned representations, below we show the most similar words for several sample words. Here, the cosine distance between embeddings is used to measure the distance between two words/phrases.
Input | Top similar words
新冠病毒 | 新冠肺炎病毒, 新型冠状病毒, 新冠状病毒, 肺炎病毒, covid-19病毒, 新冠, 新型病毒, 冠状病毒
煮酒论英雄 | 青梅煮酒论英雄, 曹操煮酒论英雄, 青梅煮酒, 关羽温酒斩华雄, 桃园三结义, 温酒斩华雄, 三英战吕布, 桃园结义
流金岁月 | 半生缘, 大江大河2, 你迟到的许多年, 风再起时, 情深缘起, 外滩钟声, 亲爱的自己, 了不起的女孩
刘德华 | 华仔, 张学友, 张国荣, 梁朝伟, 谭咏麟, 周润发, 刘天王, 古天乐
自然语言处理 | 自然语言理解, 计算机视觉, 自然语言处理技术, nlp, 机器学习, 语义理解, 深度学习, nlp技术
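Nearest neighbors like those above can be reproduced with gensim's most_similar, which ranks candidates by cosine similarity. A minimal sketch, assuming the data has been downloaded (the file name is a placeholder):

from gensim.models import KeyedVectors

# Load the plain-text embeddings (see Q1 below).
wv = KeyedVectors.load_word2vec_format("tencent_ailab_embedding.txt", binary=False)

# Print the eight nearest neighbors of a query word by cosine similarity.
print(wv.most_similar("自然语言处理", topn=8))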
Q1: Why do we encounter errors when reading Tencent AI Lab embeddings with Google’s word2vec or gensim’s Word2Vec?
Our data file is encoded in UTF-8. If you are using gensim, you can load our embeddings as follows:

from gensim.models import KeyedVectors

# The corpus is a plain-text file in word2vec format, so set binary=False;
# `file` is the path to the downloaded embedding file.
wv_from_text = KeyedVectors.load_word2vec_format(file, binary=False)
Q2: How did you segment Chinese words when processing the training data? What should we do to make our segmentation results similar to yours?
You might not be able to make full use of our embeddings if you only apply publicly available toolkits for Chinese word segmentation, because most of these toolkits segment phrases and entities into fine-grained elements. For some tasks, fine-grained word segmentation in preprocessing results in worse model performance than coarse-grained segmentation, while for other tasks fine-grained segmentation may perform better.
Currently, we are testing our word segmentation toolkit on diverse NLP tasks and further improving its performance. Once ready, the toolkit will be released for public use. At the current stage, as a quick start, you can simply use an open-source toolkit for Chinese word segmentation and then combine consecutive words into phrases that appear in our vocabulary, as in the sketch below. You may also consider using both fine-grained words (obtained by word segmentation) and coarse-grained phrases (obtained by word combination) for certain tasks.
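A minimal sketch of this two-step preprocessing, assuming jieba as the open-source segmenter; segment_and_combine and max_phrase_len are illustrative names, not part of any released Tencent toolkit:

import jieba

def segment_and_combine(text, vocab, max_phrase_len=5):
    # Step 1: fine-grained segmentation with an open-source toolkit.
    tokens = list(jieba.cut(text))
    # Step 2: greedily merge adjacent tokens into the longest phrase
    # that appears in the embedding vocabulary.
    merged, i = [], 0
    while i < len(tokens):
        for span in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            candidate = "".join(tokens[i:i + span])
            if span == 1 or candidate in vocab:
                merged.append(candidate)
                i += span
                break
    return merged

# The vocabulary can be built from the loaded embeddings, e.g. (gensim 4.x):
# vocab = set(wv.key_to_index)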
Q3: Why are there stop words (e.g., “的” and “是”), digits, and punctuation marks (e.g., “,” and “。”) in the vocabulary of Tencent AI Lab embeddings?
We did not remove these tokens, so as to ensure the coverage of our vocabulary and the general applicability of our embeddings in diverse scenarios. Though not useful in many applications, stop words, digits, and punctuation marks can be informative for certain tasks, such as named entity recognition and part-of-speech tagging. To better adapt our embeddings to your specific task, you may customize your own vocabulary and ignore the words or phrases absent from it, as in the sketch below.
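As a hedged illustration of such filtering (custom_vocab is a hypothetical task-specific set, and wv is the KeyedVectors object loaded as above):

# Keep only the embeddings needed for your task; everything else is ignored.
custom_vocab = {"自然语言处理", "命名实体识别"}  # hypothetical example entries
filtered = {w: wv[w] for w in custom_vocab if w in wv.key_to_index}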
If you use our corpus, please cite: Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings. NAACL 2018 (Short Paper).
Please feel free to contact us for any questions or comments: nlu@tencent.com
You can also visit the NLP research page of Tencent AI Lab.
This corpus is for research purposes only and is released under a Creative Commons Attribution 3.0 Unported License (http://creativecommons.org/licenses/by/3.0/).