Tencent AI Lab Embedding Corpora for Chinese and English Words and Phrases

These corpora provide continuous distributed representations of Chinese and English words and phrases.

News

2022-9-15: Version 0.1.0 of our English corpus is available for download.
2021-12-24: Version 0.2.0 of our Chinese corpus is available for download.

Introduction

The latest version of these corpora provides 100-dimension and 200-dimension vector representations, a.k.a. embeddings, for Chinese and English. Specifically, there are over 12 million Chinese words and phrases and 6.5 million English words and phrases, which are pre-trained on large-scale high-quality data. These vectors, capturing semantic meanings for words and phrases, can be widely applied in many downstream tasks (e.g., named entity recognition and text classification) and in further research.

Data Description

Please go to the download page to get the embedding data. Each file has the following format:

The first line shows the total number of embeddings and their dimension size, separated by a space. In each line below, the first column indicates a word or phrase, followed by a space and its embedding. For each embedding, its values in different dimensions are separated by spaces.
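The format described above can be read with a few lines of plain Python. The following is a minimal sketch, not an official loader: the function name `read_embeddings` and the toy sample are hypothetical, and it assumes that the last `dim` space-separated fields of each line are the vector values.

```python
import io

def read_embeddings(lines):
    """Parse the text format: one header line, then one entry per line."""
    lines = iter(lines)
    # Header: "<number of embeddings> <dimension>"
    count, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        parts = line.split(" ")
        # Assumption: the last `dim` fields are the vector,
        # everything before them is the word or phrase.
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
    return count, dim, vectors

# Toy data in the same layout (values are made up for illustration).
sample = io.StringIO("2 3\n的 0.1 0.2 0.3\nnlp -0.4 0.5 0.6\n")
count, dim, vectors = read_embeddings(sample)
```

For a real file, pass an open file handle instead of the toy sample, e.g. `open(path, encoding="utf-8")`. Since the full corpora contain millions of entries, loading everything into a Python dictionary may need a lot of memory.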

Highlights

Compared with existing embedding corpora, the main advantages of our corpora are coverage, freshness, and accuracy.

Coverage. Our corpora contain a large number of domain-specific words and slang terms in both the Chinese and English vocabularies. The Chinese vocabulary includes “喀拉喀什河”, “皇帝菜”, “不念僧面念佛面”, “冰火两重天”, and “煮酒论英雄”, which are not covered by most existing Chinese embedding corpora. The English vocabulary covers phrases such as “machine learning and natural language processing”, “budget deficit”, “foreign exchange reserves”, “hit the books”, and “go cold turkey”.
Freshness. Our corpora contain new words that have appeared or become popular recently, such as “新冠病毒”, “元宇宙”, “了不起的儿科医生”, “流金岁月”, “凡尔赛文学”, and “yyds” in the Chinese vocabulary, and “covid-19”, “metaverse”, “russo-ukrainian war”, and “iphone 14” in the English vocabulary.
Accuracy. Our embeddings better reflect the semantic meaning of words and phrases, owing to the large-scale training data and the well-designed training algorithm.

Training

To ensure the coverage, freshness, and accuracy of our corpora, we carefully design our data preparation and training process in terms of the following aspects:

Data collection. Our training data contains large-scale text collected from news, webpages, and novels. Text data from diverse domains enables the coverage of various types of words and phrases. Moreover, the recently collected webpages and news data enable us to learn the semantic representations of fresh words.
Vocabulary building. To enrich our Chinese vocabulary, we include phrases from Wikipedia and Baidu Baike. For the English vocabulary, we include phrases from Wikipedia. We also apply the phrase discovery approach from Corpus-based Semantic Class Mining: Distributional vs. Pattern-Based Approaches, which enhances the coverage of emerging phrases.
Training algorithm. Our corpora are trained with Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings, which is based on word co-occurrence and the directions of word pairs, i.e., which word is on the left, in a context window.
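To make the idea of direction-aware co-occurrence concrete, the sketch below enumerates (word, context, direction) triples inside a context window. It only illustrates the kind of signal described above. It is not the training objective or implementation of the Directional Skip-Gram paper, and the function name `directional_pairs` and the window size are hypothetical choices.

```python
def directional_pairs(tokens, window=2):
    """Yield (word, context, direction) for each pair within the window.

    direction is "left" if the context word precedes the word,
    and "right" if it follows the word.
    """
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a word is not its own context
            yield word, tokens[j], "left" if j < i else "right"

pairs = list(directional_pairs(["we", "study", "word", "embeddings"], window=1))
```

A plain skip-gram model would use only the (word, context) part of each triple. The extra direction label is what allows a model to distinguish left context from right context.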

Simple Cases

To illustrate the learned representations, we show below the most similar words for several sample inputs. The distance between two words or phrases is measured by the cosine distance between their embeddings.

For Chinese:

Input	`新冠病毒`	`煮酒论英雄`	`流金岁月`	`刘德华`	`自然语言处理`
Top similar words	`新冠肺炎病毒, 新型冠状病毒, 新冠状病毒, 肺炎病毒, covid-19病毒, 新冠, 新型病毒, 冠状病毒`	`青梅煮酒论英雄, 曹操煮酒论英雄, 青梅煮酒, 关羽温酒斩华雄, 桃园三结义, 温酒斩华雄, 三英战吕布, 桃园结义`	`半生缘, 大江大河2, 你迟到的许多年, 风再起时, 情深缘起, 外滩钟声, 亲爱的自己, 了不起的女孩`	`华仔, 张学友, 张国荣, 梁朝伟, 谭咏麟, 周润发, 刘天王, 古天乐`	`自然语言理解, 计算机视觉, 自然语言处理技术, nlp, 机器学习, 语义理解, 深度学习, nlp技术`

For English:

Input	`covid-19`	`metaverse`	`russo-ukrainian war`	`iphone 14`	`natural language processing`
Top similar words	`covid, the coronavirus, corona virus, covid-19 virus, covid-19 delta variant, sars-cov2`	`the metaverse, decentraland, blockchain gaming, virtual world, nfts, play-to-earn`	`donbas region, crimean crisis, war in ukraine, conflict in ukraine, the annexation of crimea, donbas war`	`apple watch series 7, galaxy fold 2, samsung galaxy s22, iphone 13, iphone 12s, airpods 3`	`natural language understanding, language processing, natural language generation, text analytics, text understanding, nlp applications`
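Rankings like those in the tables above can be reproduced in principle by sorting the vocabulary by cosine similarity to the query (equivalently, by ascending cosine distance, which is one minus the similarity). The sketch below is a brute-force illustration with made-up two-dimensional toy vectors. The names `cosine_similarity` and `most_similar` are hypothetical, and all vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=3):
    """Return the topn (word, similarity) pairs closest to the query word."""
    q = vectors[query]
    scored = [
        (word, cosine_similarity(q, vec))
        for word, vec in vectors.items()
        if word != query  # exclude the query itself
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]

# Toy vectors, made up for illustration only.
toy = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "car": [0.0, 1.0],
}
neighbors = most_similar("cat", toy, topn=2)
```

With millions of entries, a linear scan per query is slow. In practice a vectorized library or an approximate nearest-neighbor index would likely be a better fit.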

FAQ

Q1: Why do we encounter errors when reading Tencent AI Lab embeddings with Google’s word2vec or gensim’s Word2Vec?

Our data files are encoded in UTF-8. If you are using gensim, you can use the script below to read our embeddings:

from gensim.models import KeyedVectors
# `file` is the path to the downloaded embedding text file
wv_from_text = KeyedVectors.load_word2vec_format(file, binary=False)

Q2: How did you segment Chinese words when processing the training data? What should we do to make our segmentation results similar to yours?

You might not be able to make full use of our embeddings if you rely only on publicly available toolkits for Chinese word segmentation. The reason is that most of these toolkits further segment phrases or entities into fine-grained elements. For some tasks, fine-grained word segmentation in preprocessing results in worse model performance than coarse-grained segmentation, while for others it may perform better.

Currently, we are testing our word segmentation toolkit on diverse NLP tasks and further improving its performance. Once ready, the toolkit will be released for public use. For now, as a quick start, you can simply use an open-source toolkit for Chinese word segmentation and then combine adjacent words into phrases based on our vocabulary. In addition, you may consider using both fine-grained words (obtained from word segmentation) and coarse-grained phrases (obtained from word combination) for certain tasks.
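One simple way to combine words into phrases based on a vocabulary is greedy longest-match merging over the output of a segmenter. The sketch below is one possible approach under stated assumptions, not the method used to build the corpora: the function `merge_into_phrases`, the `max_len` limit, and the toy vocabulary are all hypothetical.

```python
def merge_into_phrases(tokens, vocab, max_len=4, joiner=""):
    """Greedily merge adjacent tokens into the longest phrase found in vocab."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink toward a single token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = joiner.join(tokens[i:i + n])
            # A single token is always kept, even if it is out of vocabulary.
            if n == 1 or candidate in vocab:
                merged.append(candidate)
                i += n
                break
    return merged

# Toy vocabulary; in practice this would be the set of words in the corpus.
vocab = {"自然语言处理", "自然语言", "技术"}
tokens = ["自然", "语言", "处理", "技术"]  # fine-grained segmenter output
result = merge_into_phrases(tokens, vocab)
```

Here `joiner=""` suits Chinese, where words are concatenated directly. For the English corpus, where spaces inside phrases are URL-encoded (see Q4), `joiner="%20"` would be the corresponding choice.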

Q3: Why are there stop words (e.g., “的” and “是”), digits, and punctuation marks (e.g., “，” and “。”) in the vocabulary of Tencent AI Lab embeddings?

We did not remove these words, in order to ensure the coverage of our vocabulary and the general applicability of our embeddings in diverse scenarios. Though not useful in many applications, stop words, digits, and punctuation marks might be informative for certain tasks, such as named entity recognition and part-of-speech tagging. To better adapt our embeddings to your specific task, you may define your own vocabulary and ignore the words or phrases that are absent from it.
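Restricting the embeddings to a custom vocabulary can be done in a single streaming pass over the file. The sketch below is a minimal illustration with a hypothetical function name (`filter_embeddings`) and toy data. It assumes the file format described under Data Description and rewrites the header so that the count matches the kept entries.

```python
import io

def filter_embeddings(lines, keep):
    """Return (header, kept_lines) for entries whose word or phrase is in keep."""
    lines = iter(lines)
    _, dim = next(lines).split()  # original header: "<count> <dimension>"
    kept = []
    for line in lines:
        line = line.rstrip()
        # Split off the last `dim` fields (the vector) from the right;
        # whatever remains on the left is the word or phrase.
        word = line.rsplit(" ", int(dim))[0]
        if word in keep:
            kept.append(line)
    header = f"{len(kept)} {dim}"  # new header with the reduced count
    return header, kept

# Toy data, made up for illustration only.
sample = io.StringIO("3 2\n的 0.1 0.2\n刘德华 0.3 0.4\nnlp 0.5 0.6\n")
header, kept = filter_embeddings(sample, keep={"刘德华", "nlp"})
```

Writing `header` followed by the `kept` lines to a new UTF-8 file yields a smaller file in the same format.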

Q4: How can a URL-encoded phrase in the English corpus be decoded into its original form?

In English, a phrase is usually written as multiple words separated by spaces. However, in the embedding files, spaces are also used to separate the values of different dimensions. To avoid ambiguity, we URL-encode the phrases in the English corpus. For example, the phrase "go to school" is encoded as "go%20to%20school". A URL-encoded phrase can be decoded easily, e.g., with the following Python code:

from urllib.parse import unquote
phrase = unquote(phrase)
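The reverse direction is useful when looking up a phrase in the English corpus: the query has to be encoded the same way as the stored keys. The sketch below uses `urllib.parse.quote` for this. It is an assumption that the default behavior of `quote` matches the corpus encoding for characters other than spaces, and the helper name `encode_phrase` is hypothetical.

```python
from urllib.parse import quote, unquote

def encode_phrase(phrase):
    """Percent-encode a phrase so that it contains no literal spaces."""
    # Assumption: the corpus encodes spaces as %20, as in the example above.
    return quote(phrase)

encoded = encode_phrase("go to school")
decoded = unquote(encoded)
```

A lookup would then use `encoded` as the key, and `unquote` restores the readable form for display.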

Citation

If you use our corpora, please cite: Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings. NAACL 2018 (Short Paper).

Related Systems

Effidit: A writing assistant.
TexSmart: A text understanding toolkit and service.
TranSmart: An interactive translation system.

Contacts

Please feel free to contact us with any questions or comments: nlu@tencent.com

You can also visit the NLP research page of Tencent AI Lab.

Disclaimer

This corpus is for research purposes only and is released under a Creative Commons Attribution 3.0 Unported License (http://creativecommons.org/licenses/by/3.0/).