The corpora on continuous distributed representations of Chinese and English words and phrases.
The latest version of these corpora provides 100-dimension and 200-dimension vector representations, a.k.a. embeddings, for Chinese and English. Specifically, there are over 12 million Chinese words and phrases and 6.5 million English words and phrases, which are pre-trained on large-scale high-quality data. These vectors, capturing semantic meanings for words and phrases, can be widely applied in many downstream tasks (e.g., named entity recognition and text classification) and in further research.
Please go to the download page to get the embedding data. The data format of each file is as follows:
The first line shows the total number of embeddings and their dimension size, separated by a space. In each line below, the first column indicates a word or phrase, followed by a space and its embedding. For each embedding, its values in different dimensions are separated by spaces.
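The format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `read_embeddings` and the in-memory sample are hypothetical, and the two sample vectors are made-up values used only to show the layout.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_embeddings(stream: TextIO) -> Tuple[int, int, Dict[str, List[float]]]:
    """Parse the text format: a 'count dim' header, then 'token v1 ... v_dim' lines."""
    header = stream.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) < dim + 1:
            continue  # skip blank or malformed lines
        # The last `dim` fields are the values; whatever precedes them is the token.
        token = " ".join(parts[:-dim])
        vectors[token] = [float(x) for x in parts[-dim:]]
    return count, dim, vectors


# Toy input with invented values; a real file would be opened with
# open(path, encoding="utf-8") instead of io.StringIO.
sample = io.StringIO("2 3\n的 0.1 0.2 0.3\nnlp -0.5 0.0 0.25\n")
count, dim, vectors = read_embeddings(sample)
print(count, dim, vectors["nlp"])
```

Holding millions of vectors in a plain dictionary is memory-hungry, so for the full files a streaming or array-based loader is likely a better fit.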
In comparison with existing embedding corpora, the superiority of our corpora mainly lies in coverage, freshness, and accuracy.
To ensure the coverage, freshness, and accuracy of our corpora, we carefully design our data preparation and training process in terms of the following aspects:
To illustrate the learned representations, we show below the most similar words for some sample words. Here, cosine similarity between embeddings is used to measure how close two words or phrases are.
For Chinese:
Input | Top similar words
新冠病毒 | 新冠肺炎病毒, 新型冠状病毒, 新冠状病毒, 肺炎病毒, covid-19病毒, 新冠, 新型病毒, 冠状病毒
煮酒论英雄 | 青梅煮酒论英雄, 曹操煮酒论英雄, 青梅煮酒, 关羽温酒斩华雄, 桃园三结义, 温酒斩华雄, 三英战吕布, 桃园结义
流金岁月 | 半生缘, 大江大河2, 你迟到的许多年, 风再起时, 情深缘起, 外滩钟声, 亲爱的自己, 了不起的女孩
刘德华 | 华仔, 张学友, 张国荣, 梁朝伟, 谭咏麟, 周润发, 刘天王, 古天乐
自然语言处理 | 自然语言理解, 计算机视觉, 自然语言处理技术, nlp, 机器学习, 语义理解, 深度学习, nlp技术
For English:
Input | Top similar words
covid-19 | covid, the coronavirus, corona virus, covid-19 virus, covid-19 delta variant, sars-cov2
metaverse | the metaverse, decentraland, blockchain gaming, virtual world, nfts, play-to-earn
russo-ukrainian war | donbas region, crimean crisis, war in ukraine, conflict in ukraine, the annexation of crimea, donbas war
iphone 14 | apple watch series 7, galaxy fold 2, samsung galaxy s22, iphone 13, iphone 12s, airpods 3
natural language processing | natural language understanding, language processing, natural language generation, text analytics, text understanding, nlp applications
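Nearest-neighbour lists like the ones above can be produced by ranking every other entry by the cosine between its embedding and the query's. The sketch below is illustrative only: `cosine_similarity` and `most_similar` are hypothetical helper names, and the two-dimensional toy vectors are invented, not taken from the corpus.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: str, vectors: Dict[str, Sequence[float]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k entries most similar to `query`, excluding the query itself."""
    q = vectors[query]
    scored = [(w, cosine_similarity(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Invented toy vectors, for illustration only.
toy = {
    "covid-19": [1.0, 0.0],
    "covid": [0.9, 0.1],
    "metaverse": [0.0, 1.0],
}
neighbours = most_similar("covid-19", toy, top_k=2)
print(neighbours)
```

This brute-force scan is linear in the vocabulary size per query; with millions of entries, a vectorised or approximate nearest-neighbour index would usually be preferred.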
Q1: Why do we encounter errors when reading Tencent AI Lab embeddings with Google’s word2vec or gensim’s Word2Vec?
Our data file is encoded in UTF-8. If you are using gensim, you can use the script below to read our embeddings:
from gensim.models import KeyedVectors
# `file` is the path to the downloaded embedding file (text format)
wv_from_text = KeyedVectors.load_word2vec_format(file, binary=False)
Q2: How did you segment Chinese words when processing the training data? What should we do to make our segmentation results similar to yours?
You might not be able to make full use of our embeddings if you only apply publicly available toolkits for Chinese word segmentation. The reason is that most of these toolkits further segment phrases or entities into fine-grained elements. Which granularity works better depends on the task: for some tasks, fine-grained segmentation in preprocessing leads to worse model performance than coarse-grained segmentation, while for others it performs better.
Currently, we are testing our word segmentation toolkit on diverse NLP tasks and further improving its performance. Once ready, the toolkit will be released for public use. At the current stage, as a quick start, you can simply use an open-source toolkit for Chinese word segmentation. Some of the resulting words can then be combined into phrases based on our vocabulary. In addition, you may consider both fine-grained words (obtained from word segmentation) and coarse-grained phrases (obtained from word combination) when tackling certain tasks.
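One simple way to combine segmented words into phrases found in the vocabulary is greedy longest-match merging. The sketch below is an assumption about how this could be done, not the authors' method: `merge_phrases`, its `max_len` and `joiner` parameters, and the tiny vocabulary are all hypothetical.

```python
from typing import List, Set


def merge_phrases(
    tokens: List[str], vocab: Set[str], max_len: int = 4, joiner: str = ""
) -> List[str]:
    """Greedily merge adjacent tokens into the longest phrase present in `vocab`."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to spans of two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = joiner.join(tokens[i:i + n])
            if candidate in vocab:
                merged.append(candidate)
                i += n
                break
        else:
            # No multi-token phrase starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical vocabulary and segmenter output, for illustration only.
vocab = {"自然语言处理", "自然语言", "技术"}
tokens = ["自然", "语言", "处理", "技术"]
phrases = merge_phrases(tokens, vocab)
print(phrases)
```

For Chinese the tokens are joined with an empty string; for space-separated languages a different `joiner` would be needed. Keeping both `tokens` and `phrases` gives the fine-grained and coarse-grained views mentioned above.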
Q3: Why are there stop words (e.g., “的” and “是”), digits, and punctuation marks (e.g., “,” and “。”) in the vocabulary of Tencent AI Lab embeddings?
We did not remove these words, in order to ensure the coverage of our vocabulary and the general applicability of our embeddings in diverse scenarios. Though not useful in many applications, stop words, digits, and punctuation marks might be informative for certain tasks, such as named entity recognition and part-of-speech tagging. To better adapt our embeddings to your specific task, you may customize your own vocabulary and ignore the words or phrases absent from it.
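Restricting the embeddings to a task-specific vocabulary can be sketched as a simple filter. The helper name `restrict_to_vocabulary`, the toy vectors, and the extra list of task words without an embedding are assumptions added for illustration.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def restrict_to_vocabulary(
    vectors: Dict[str, Sequence[float]], task_vocab: Iterable[str]
) -> Tuple[Dict[str, Sequence[float]], List[str]]:
    """Keep only embeddings whose key is in `task_vocab`; also report uncovered words."""
    kept: Dict[str, Sequence[float]] = {}
    missing: List[str] = []
    # dict.fromkeys removes duplicates while preserving the original order.
    for word in dict.fromkeys(task_vocab):
        if word in vectors:
            kept[word] = vectors[word]
        else:
            missing.append(word)
    return kept, missing


# Invented toy vectors, for illustration only.
vectors = {"的": [0.1, 0.2], ",": [0.0, 0.3], "腾讯": [0.5, 0.6], "3": [0.7, 0.1]}
kept, missing = restrict_to_vocabulary(vectors, ["腾讯", "的", "深圳"])
print(sorted(kept), missing)
```

Whether stop words and punctuation belong in `task_vocab` depends on the task; a tagging task would probably keep them, while a topic classifier might not.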
Q4: How can a URL-encoded phrase in the English corpus be parsed into its original form?
In English, a phrase is usually written as multiple words separated by spaces. However, the embedding files also use spaces to separate the values of different dimensions. To avoid ambiguity, phrases in the English corpus are URL-encoded. For example, the phrase "go to school" is encoded as "go%20to%20school". A URL-encoded phrase can be decoded with the following code (in Python, for example):
from urllib.parse import unquote
phrase = unquote(phrase)
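For completeness, here is a small sketch that decodes the phrase from a whole line and builds the encoded key for a lookup. The helper names are hypothetical and the sample values are invented. Using `urllib.parse.quote` with its default settings is an assumption: it reproduces the "go%20to%20school" example, but the corpus may escape other characters differently.

```python
from typing import List, Tuple
from urllib.parse import quote, unquote


def split_encoded_line(line: str) -> Tuple[str, List[float]]:
    """Split one line of the English file into (decoded phrase, vector values)."""
    token, *values = line.rstrip().split(" ")
    return unquote(token), [float(v) for v in values]


def encode_phrase(phrase: str) -> str:
    """Build the URL-encoded form of a phrase, e.g. to use as a lookup key."""
    # Assumption: default quote() behaviour matches the encoding used in the corpus.
    return quote(phrase)


# Invented sample line, for illustration only.
phrase, values = split_encoded_line("go%20to%20school 0.1 -0.2\n")
print(phrase, values, encode_phrase("go to school"))
```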
If you use our corpora, please cite: Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings. NAACL 2018 (Short Paper).
Please feel free to contact us for any questions or comments: nlu@tencent.com
You can also visit the NLP research page of Tencent AI Lab.
This corpus is for research purposes only and is released under a Creative Commons Attribution 3.0 Unported License (http://creativecommons.org/licenses/by/3.0/).