Tencent AI Lab Embedding Corpora for Chinese and English Words and Phrases


These corpora provide continuous distributed representations of Chinese and English words and phrases.

News

  • 2022-9-15: Version 0.1.0 of our English embedding corpus is available for download
  • 2021-12-24: Version 0.2.0 of our Chinese embedding corpus is available for download

Introduction

The latest version of these corpora provides 100-dimensional and 200-dimensional vector representations, a.k.a. embeddings, for Chinese and English. Specifically, the corpora cover over 12 million Chinese words and phrases and 6.5 million English words and phrases, with embeddings pre-trained on large-scale, high-quality data. These vectors capture the semantic meanings of words and phrases and can be applied in many downstream tasks (e.g., named entity recognition and text classification) and in further research.

Data Description

Please go to the download page to get the embedding data. The data format of each file is as follows:

The first line shows the total number of embeddings and their dimension size, separated by a space. In each line below, the first column indicates a word or phrase, followed by a space and its embedding. For each embedding, its values in different dimensions are separated by spaces.
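The format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function name `read_embeddings` and the tiny in-memory sample are made up for illustration, and the sample values are not real embeddings.

```python
import io

def read_embeddings(lines):
    """Parse the text format: a "count dim" header, then one "token v1 ... vd" row per line."""
    it = iter(lines)
    # Header line: total number of embeddings and their dimension size.
    count, dim = (int(x) for x in next(it).split())
    vectors = {}
    for line in it:
        # Split from the right so that the last `dim` fields are the vector values.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors

# Tiny made-up sample in the same layout (illustrative values only).
sample = io.StringIO("2 3\n的 0.1 0.2 0.3\nnlp -0.5 0.0 0.25\n")
count, dim, vectors = read_embeddings(sample)
print(count, dim, vectors["nlp"])
```

For a real file, the same function could be given a handle opened with `open(path, encoding="utf-8")`, where `path` is wherever the downloaded file was saved. Holding millions of vectors in Python lists is memory-hungry, so this sketch is mainly useful for inspecting a slice of the data.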

Highlights

Compared with existing embedding corpora, the main advantages of our corpora are coverage, freshness, and accuracy.

  • Coverage. Our corpora contain a large number of domain-specific words and slang terms in both the Chinese and English vocabularies. The Chinese vocabulary includes “喀拉喀什河”, “皇帝菜”, “不念僧面念佛面”, “冰火两重天”, and “煮酒论英雄”, which are not covered by most existing Chinese embedding corpora. The English vocabulary covers phrases such as “machine learning and natural language processing”, “budget deficit”, “foreign exchange reserves”, “hit the books”, and “go cold turkey”.
  • Freshness. Our corpora contain words that have appeared or become popular recently, such as “新冠病毒”, “元宇宙”, “了不起的儿科医生”, “流金岁月”, “凡尔赛文学”, and “yyds” in the Chinese vocabulary, and “covid-19”, “metaverse”, “russo-ukrainian war”, and “iphone 14” in the English vocabulary.
  • Accuracy. Our embeddings better reflect the semantic meanings of words and phrases, owing to the large-scale training data and the well-designed training algorithm.

Training

To ensure the coverage, freshness, and accuracy of our corpora, we carefully design our data preparation and training process in terms of the following aspects:

Simple Cases

To illustrate the learned representations, we show below the most similar words for some sample words. The cosine distance between embeddings is used as the distance between two words or phrases.
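As a rough illustration of how such a ranking can be computed, the sketch below scores every other entry by cosine similarity (cosine distance is one minus this value, so the most similar words have the smallest distance). The helper names and the two-dimensional toy vectors are invented for the example, and vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, top_k=3):
    """Return the top_k entries closest to `query`, excluding the query itself."""
    q = vectors[query]
    scored = [
        (cosine_similarity(q, vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(reverse=True)  # highest similarity (smallest cosine distance) first
    return [word for _, word in scored[:top_k]]

# Toy vectors, not taken from the corpora.
toy = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "dog": [0.5, 0.5],
    "car": [0.0, 1.0],
}
print(most_similar("cat", toy, top_k=2))
```

A brute-force scan like this is linear in the vocabulary size for every query; with millions of entries, a vectorized or approximate nearest-neighbor search would be a more practical choice.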

For Chinese:

Input, followed by its top similar words:

  • 新冠病毒: 新冠肺炎病毒, 新型冠状病毒, 新冠状病毒, 肺炎病毒, covid-19病毒, 新冠, 新型病毒, 冠状病毒
  • 煮酒论英雄: 青梅煮酒论英雄, 曹操煮酒论英雄, 青梅煮酒, 关羽温酒斩华雄, 桃园三结义, 温酒斩华雄, 三英战吕布, 桃园结义
  • 流金岁月: 半生缘, 大江大河2, 你迟到的许多年, 风再起时, 情深缘起, 外滩钟声, 亲爱的自己, 了不起的女孩
  • 刘德华: 华仔, 张学友, 张国荣, 梁朝伟, 谭咏麟, 周润发, 刘天王, 古天乐
  • 自然语言处理: 自然语言理解, 计算机视觉, 自然语言处理技术, nlp, 机器学习, 语义理解, 深度学习, nlp技术

For English:

Input, followed by its top similar words:

  • covid-19: covid, the coronavirus, corona virus, covid-19 virus, covid-19 delta variant, sars-cov2
  • metaverse: the metaverse, decentraland, blockchain gaming, virtual world, nfts, play-to-earn
  • russo-ukrainian war: donbas region, crimean crisis, war in ukraine, conflict in ukraine, the annexation of crimea, donbas war
  • iphone 14: apple watch series 7, galaxy fold 2, samsung galaxy s22, iphone 13, iphone 12s, airpods 3
  • natural language processing: natural language understanding, language processing, natural language generation, text analytics, text understanding, nlp applications

FAQ

Q1: Why do we encounter errors when reading Tencent AI Lab embeddings with Google’s word2vec or gensim’s Word2Vec?

Our data file is encoded in UTF-8. If you are using gensim, you can use the following script to read our embeddings:

from gensim.models import KeyedVectors
# `file` is the path to the downloaded embedding text file (UTF-8 encoded).
wv_from_text = KeyedVectors.load_word2vec_format(file, binary=False)

Q2: How did you segment Chinese words when processing the training data? What should we do to make our segmentation results similar to yours?

You might not be able to make full use of our embeddings if you only apply publicly available toolkits for Chinese word segmentation. The reason is that most of these toolkits further segment phrases or entities into fine-grained elements. For some tasks, fine-grained word segmentation in preprocessing results in worse model performance than coarse-grained segmentation, while for others it may perform better.

Currently, we are testing our word segmentation toolkit on diverse NLP tasks and further improving its performance. Once ready, the toolkit will be released for public use. For now, as a quick start, you can simply use an open-source toolkit for Chinese word segmentation. Adjacent words can then be combined into phrases based on our vocabulary. In addition, you may consider both fine-grained words (obtained in word segmentation) and coarse-grained phrases (obtained in word combination) when tackling certain tasks.
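One possible way to combine fine-grained words into phrases based on the vocabulary is greedy longest-match merging. The sketch below is only an illustration under that assumption and is not the segmentation procedure used to build the corpora; the function name, the `max_len` window, and the toy vocabulary are all hypothetical.

```python
def merge_into_phrases(tokens, vocab, max_len=4, joiner=""):
    """Greedily merge adjacent tokens into the longest phrase found in vocab."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, shrinking until a vocabulary hit
        # or until a single token remains (which is always kept as-is).
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = joiner.join(tokens[i:i + n])
            if n == 1 or candidate in vocab:
                merged.append(candidate)
                i += n
                break
    return merged

# Toy vocabulary standing in for the set of words and phrases in the embedding file.
vocab = {"自然语言处理", "自然语言", "处理", "技术"}
tokens = ["自然", "语言", "处理", "技术"]  # output of a fine-grained segmenter
print(merge_into_phrases(tokens, vocab))
```

For English text, `joiner=" "` joins words with spaces before the vocabulary check. Keeping both the original tokens and the merged phrases gives the fine-grained and coarse-grained views mentioned above.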

Q3: Why are there stop words (e.g., “的” and “是”), digits, and punctuation marks (e.g., “,” and “。”) in the vocabulary of Tencent AI Lab embeddings?

We did not remove these words, in order to ensure the coverage of our vocabulary and the general applicability of our embeddings in diverse scenarios. Though not useful in many applications, stop words, digits, and punctuation marks might be informative for certain tasks, such as named entity recognition and part-of-speech tagging. To better adapt our embeddings to your specific task, you may customize your own vocabulary and ignore the words or phrases absent from it.
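Restricting the embeddings to a task-specific vocabulary can be as simple as a dictionary filter. This is a minimal sketch with made-up vectors; `restrict_to_vocabulary` and `task_vocab` are illustrative names, not part of any released tooling.

```python
def restrict_to_vocabulary(vectors, task_vocab):
    """Keep only entries whose word or phrase appears in task_vocab."""
    return {word: vec for word, vec in vectors.items() if word in task_vocab}

# Made-up two-dimensional vectors, including a stop word and a punctuation mark.
vectors = {
    "的": [0.1, 0.2],
    ",": [0.0, 0.3],
    "机器学习": [0.7, 0.4],
    "nlp": [0.6, 0.5],
}
# Vocabulary of the downstream task; "深度学习" has no vector in this toy example.
task_vocab = {"机器学习", "nlp", "深度学习"}
filtered = restrict_to_vocabulary(vectors, task_vocab)
print(sorted(filtered))
```

Words in the task vocabulary that have no embedding are simply absent from the result, so the downstream model still needs its own policy for out-of-vocabulary items.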

Q4: How can a URL-encoded phrase in the English corpus be parsed into its original form?

In English, a phrase is usually expressed as multiple words separated by spaces. However, in the embedding corpora, spaces are also used to separate the values in different dimensions. To avoid ambiguity, phrases in the English corpus are URL-encoded. For example, the phrase "go to school" is encoded as "go%20to%20school". Parsing a URL-encoded phrase is straightforward with the following code (e.g., in Python):

from urllib.parse import unquote
# e.g., "go%20to%20school" becomes "go to school"
phrase = unquote(phrase)
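Going the other way, a space-separated phrase has to be encoded before it can be looked up in the English corpus. The sketch below assumes the corpus keys match the output of Python's `urllib.parse.quote` (which turns a space into `%20`); that holds for the "go to school" example above but is not confirmed for every character. The `lookup` helper and the toy vectors are hypothetical.

```python
from urllib.parse import quote, unquote

def lookup(phrase, vectors):
    """Look up a space-separated phrase under its URL-encoded key; None if absent."""
    return vectors.get(quote(phrase))

# Toy stand-in for embeddings loaded from the English corpus (keys are URL-encoded).
vectors = {"go%20to%20school": [0.1, 0.2], "school": [0.3, 0.4]}

print(lookup("go to school", vectors))
# Decode every key back to its readable form.
print([unquote(key) for key in vectors])
```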

Citation

If you use our corpora, please cite: Yan Song, Shuming Shi, Jing Li, and Haisong Zhang. Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings. NAACL 2018 (Short Paper). [pdf] [bib]

Related Systems

  • Effidit: A writing assistant.
  • TexSmart: A text understanding toolkit and service.
  • TranSmart: An interactive translation system.

Contacts

Please feel free to contact us for any questions or comments: nlu@tencent.com

You can also visit the NLP research page of Tencent AI Lab.

Disclaimer

This corpus is for research purposes only and is released under a Creative Commons Attribution 3.0 Unported License (http://creativecommons.org/licenses/by/3.0/).