
TexSmart Usage Instructions


TexSmart is a text understanding system built by the NLP Team at Tencent AI Lab and AI Platform. It analyzes the morphology, syntax, and semantics of text in both Chinese and English. It provides basic natural language understanding functions such as word segmentation, part-of-speech tagging, named entity recognition (NER), and semantic expansion, and in particular supports several key functions including fine-grained named entity recognition, semantic expansion, and deep semantic expression for specific entities.

System Description

The system is provided in two forms: an offline toolkit (SDK) and an HTTP online service. The two forms compare as follows:


Function                     | Offline toolkit                                | HTTP online service
Word segmentation            | Unsupervised word segmentation algorithm       | Unsupervised word segmentation algorithm
POS tagging                  | Algorithm: log_linear                          | Algorithms: log_linear, crf, dnn
Coarse-grained NER           | Algorithm: coarse.crf                          | Algorithms: coarse.crf, coarse.dnn, coarse.lua
Fine-grained NER             | Algorithm: fine.std                            | Algorithms: fine.std, fine.high_acc
Constituency parsing         | Unsupported                                    | Supported
Semantic role labeling (SRL) | Unsupported                                    | Supported
Semantic expansion           | std algorithm based on a small knowledge base  | std algorithm based on a large knowledge base
Deep semantic expression     | Context-free grammar (CFG)                     | Context-free grammar (CFG)
Text classification          | Unsupported                                    | Default algorithm
Text matching                | linkage algorithm                              | Both esim and linkage algorithms
Text graph                   | Unsupported                                    | Supported
Size of knowledge base       | Small                                          | Large
Supported languages          | Chinese and English                            | Chinese and English

The HTTP online service can be called from various programming languages on various operating systems (API call: Text Understanding | Text Matching | Text Graph).
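As an illustration, a minimal Python client for the online service might look like the sketch below. The endpoint URL and the `str` request field are assumptions; check the official API documentation before use.

```python
import json
import urllib.request

# Assumed endpoint; verify against the official API documentation.
API_URL = "https://texsmart.qq.com/api"

def build_request(text: str) -> bytes:
    # Serialize the input as a JSON body; the "str" field name is an assumption.
    return json.dumps({"str": text}).encode("utf-8")

def analyze(text: str) -> dict:
    # POST the text to the online service and parse the JSON reply.
    req = urllib.request.Request(
        API_URL,
        data=build_request(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```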
The scope of application of the offline development toolkit is as follows:

Natural Language Understanding (NLU) Tasks

In the current version, TexSmart supports multiple functions, including word segmentation, part-of-speech tagging, (coarse- and fine-grained) named entity recognition, constituency parsing, semantic role labeling, semantic expansion, deep semantic expression (for some specific entities), text matching, text classification, and text graph.

Word Segmentation

The word segmentation task cuts an input string into a sequence of words separated by spaces. Word segmentation is a basic task of natural language processing, because its results serve as input for many downstream tasks. To suit different application scenarios, TexSmart provides word segmentation at two granularities: basic and compound. The differences between them can be seen in the example in Figure 3. At the basic granularity, “30 (sanshi)” and “号 (hao)” are split into two words; at the compound granularity, “30号 (sanshihao)”, a temporal noun (NT), is not divided further and is treated as a single word. Note that TexSmart also supports English word segmentation. At the basic granularity the result is identical to the English input, because English text is already segmented by spaces; the compound-granularity result, however, is unique to TexSmart, as shown in Figure 4.

Figure 3: Examples of Chinese word segmentation and part-of-speech tagging tasks

Figure 4: Examples of English word segmentation and part-of-speech tagging tasks
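Assuming the service returns both granularities in one JSON reply (the field names "word_list" and "phrase_list" below are hypothetical), the two segmentations of the Figure 3 example could be read out like this:

```python
import json

# A truncated reply in an assumed shape: "word_list" holds the basic-granularity
# segmentation and "phrase_list" the compound-granularity one.
sample = json.loads("""
{
  "word_list":   [{"str": "30"}, {"str": "号"}],
  "phrase_list": [{"str": "30号"}]
}
""")

basic = [w["str"] for w in sample["word_list"]]       # basic granularity
compound = [p["str"] for p in sample["phrase_list"]]  # compound granularity
```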

Part-of-Speech Tagging

A part-of-speech (POS) tag marks the syntactic role of each word in a sentence; parts of speech are also known as word classes or syntactic categories. The most common part-of-speech tags are noun, verb, adjective, etc. Knowing the part of speech of a word reveals its usage rules and collocation habits in context; for example, if a word is a verb, it can be inferred that the word in front of it is most likely a noun or pronoun.

The POS tagging task assigns a part-of-speech tag to each word in the input sentence. Figure 3, in addition to the word segmentation results, also shows the part-of-speech tag of each word under the different segmentation granularities. For example, at the basic granularity, “30” and “号 (hao)” are tagged as cardinal number (CD) and measure word (M) respectively, while at the compound granularity, “30号 (30hao)” is tagged as temporal noun (NT). The Chinese and English part-of-speech tag sets supported by TexSmart are CTB and PTB, respectively; see Table 2 and Table 3 for details.

TexSmart implements three part-of-speech tagging algorithms: a log-linear model, a conditional random field (CRF), and a deep neural network (DNN). In the web demo, users can switch between the algorithms through the “part-of-speech tagging algorithm” drop-down box shown in Figure 3; the algorithm can also be chosen when calling the HTTP API. The names of the three algorithms are log_linear, crf, and dnn.
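Since the HTTP API lets the caller choose among the three taggers, a small request builder can validate the algorithm name up front. The shape of the `options` object below is a hypothetical request layout; only the algorithm names log_linear, crf, and dnn come from the text above.

```python
def build_pos_request(text: str, alg: str = "log_linear") -> dict:
    # Only the three algorithm names mentioned in the documentation are valid.
    if alg not in ("log_linear", "crf", "dnn"):
        raise ValueError(f"unknown POS tagging algorithm: {alg}")
    # The "options"/"pos_tagging" structure is an assumed request layout.
    return {"str": text, "options": {"pos_tagging": {"alg": alg}}}
```

For example, `build_pos_request("他是诗人", "crf")` would select the CRF tagger.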

Named Entity Recognition (NER)

A named entity is an entity with a name in a natural language sentence or text; roughly speaking, a person, a location, or an organization corresponds to its person name, location name, or organization name. The named entity recognition (NER) task recognizes these different types of entities; specifically, it tags the input word sequence with different categories of named entities. NER is indispensable for text analysis, information extraction, knowledge graph construction, and other downstream tasks. It has been widely deployed in information retrieval and e-commerce, and it has also been applied in many frontier scientific fields, such as information processing and knowledge mining over biological and medical big data.

TexSmart provides named entity recognition in two modes: coarse-grained and fine-grained. Coarse-grained NER uses supervised learning and provides two different models (CRF and DNN) to recognize persons, locations, organizations, and other types of named entities (see Table 4). Fine-grained NER combines a supervised algorithm with an unsupervised one and is able to recognize about one thousand entity types.

Information about all the entity types supported by TexSmart can be downloaded from:

Please note that the contents of these two compressed files are exactly the same, so it is fine to choose either of them.

Figure 5: Coarse-grained named entity recognition

Figure 6: Fine-grained named entity recognition

Figure 5 shows the coarse-grained named entity recognition result for an example sentence, in which the location “Los Angeles” is recognized. Figure 6 shows the fine-grained result for the same sentence. More diverse entity types are recognized, such as “city”, “movie”, and “time”; in addition, TexSmart performs entity expansion, providing related named entities such as “Spider-Man” and “Iron Man” for “Captain Marvel”.
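A fine-grained NER reply can be pictured as a list of entities, each carrying a type name and, where semantic expansion applies, a list of related entities. The field names and type labels below are illustrative, not the service's actual schema:

```python
import json

# Truncated, assumed reply shape for the Figure 6 example.
sample = json.loads("""
{
  "entity_list": [
    {"str": "Los Angeles",    "type": {"name": "loc.city"}},
    {"str": "Captain Marvel", "type": {"name": "work.movie",
                                       "related": ["Spider-Man", "Iron Man"]}}
  ]
}
""")

# Collect (entity, type) pairs and any expansion results.
entities = [(e["str"], e["type"]["name"]) for e in sample["entity_list"]]
expansions = {e["str"]: e["type"].get("related", []) for e in sample["entity_list"]}
```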

Constituency Parsing

Based on the phrase structure grammar proposed by Chomsky, constituency parsing combines the input word sequence into a phrase structure tree. Examples of phrase structures include the noun phrase (NP) “Captain Marvel” and the verb phrase (VP) “premiered in Los Angeles 14 months ago” in Figure 7.

Figure 7: Constituency parsing

The leaf nodes of the constituency tree are the words of the input sentence; the layer above the leaves holds the part-of-speech tag of each word; and the nodes from there up to the root name the phrase structures they constitute. The Chinese and English phrase structure labels (CTB and PTB standards) are shown in Table 6 and Table 7. Constituency parsing in TexSmart uses the self-attention model published by UC Berkeley at ACL 2018, which achieves excellent performance with low search complexity. We use a pre-trained BERT model at the bottom layer to provide features and further enhance the robustness of the model.
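Parse trees like the one in Figure 7 are commonly serialized as bracketed strings. The sketch below is a generic reader for that notation (not TexSmart's own output format) that turns a bracketed tree into nested (label, children) tuples:

```python
def parse_tree(s: str):
    # Tokenize "(S (NP Captain Marvel) ...)" into parentheses and words.
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())   # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1                          # consume ")"
        return (label, children)

    return node()

tree = parse_tree("(S (NP Captain Marvel) (VP premiered))")
```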

Semantic Role Labeling

The semantic understanding of natural language often involves analyzing the main elements that make up an event, such as the behavior, subject, and object, as well as adjuncts such as the time, location, and manner of the event. In event semantics, the elements that make up an event are called semantic roles; the semantic role labeling task identifies all the events in a sentence together with their components, for instance the behavior (often the predicate of the sentence), subject, object, time, and location. In Figure 8, the two events in the example, “看 (kan)” and “吃 (chi)”, are both identified; for the event “看 (kan)”, the subject “南昌的王先生 (nanchangdewangxiansheng)”, the object “流浪地球 (liulangdiqiu)”, the adjunct time “上个月30号 (shanggeyue30hao)”, and the adjunct location “在自己家里 (zaizijijiali)” are all accurately labeled. Semantic role labeling supports many downstream tasks, such as deeper semantic analysis (AMR parsing, CCG parsing, etc.), intent recognition in task-oriented dialogue systems, and entity scoring in fact-based question answering systems.

Figure 8: Semantic role labeling

TexSmart also supports semantic role labeling on Chinese and English texts; the Chinese label set is shown in Table 8 and the English one in Table 9. TexSmart uses a BERT-based self-attention neural network model for semantic role labeling.

Text Classification

Text classification aims to assign a label from a predefined label set to an input text. It is a classical NLP task and has been widely used in many applications such as email filtering and sentiment analysis. Assume the input text is:

TensorFlow (released as open source in November 2015) is a library developed by Google to accelerate deep learning research.

TexSmart assigns a corresponding label to the input, and Figure 9 shows the detailed result:

Figure 9. Text classification

The label of the input is tc.technology, with a model score (or confidence) of 0.198. The predefined label set for text classification in TexSmart is presented in label set.
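When a reply carries scores for several candidate labels, the reported label is simply the highest-scoring one. A small sketch with an assumed reply layout; the tc.technology label and its 0.198 score come from the example above, while the runner-up entry is made up for illustration:

```python
# Assumed reply shape; the second entry is a fabricated runner-up.
sample = {"classification_list": [
    {"label": "tc.technology", "score": 0.198},
    {"label": "tc.education", "score": 0.054},
]}

# The reported label is the candidate with the highest score.
top = max(sample["classification_list"], key=lambda c: c["score"])
```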

Text Matching

Text matching evaluates the semantic similarity between a pair of inputs. It is a core task in natural language understanding, and many NLP tasks can in fact be framed as text matching: information retrieval as the matching between a query and a candidate document, question answering as the matching between a question and a candidate answer, and dialogue as the matching between a query and its response. TexSmart implements two text matching algorithms: ESIM, which is based on neural networks, and Linkage, an unsupervised algorithm based on synonyms and word embeddings. As shown in Figure 10, for the input sentence pair "A person pours cooking oil into a pot." and "A person is pouring olive oil into a pot on the stove.", the esim algorithm returns a similarity score of 0.608917 and the linkage algorithm a score of 0.876792. When calling the HTTP API, users can select either algorithm.

Figure 10. Text Matching
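A matching request needs two input texts plus the chosen algorithm. The field names below are hypothetical; only the algorithm names esim and linkage come from the text:

```python
def build_match_request(text_a: str, text_b: str, alg: str = "esim") -> dict:
    # Only the two algorithms described above are accepted.
    if alg not in ("esim", "linkage"):
        raise ValueError(f"unknown text matching algorithm: {alg}")
    # "str_a"/"str_b" are assumed request field names.
    return {"str_a": text_a, "str_b": text_b, "options": {"alg": alg}}
```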

Text Graph

Text Graph is a structured text knowledge base that supports knowledge queries such as "synonyms", "antonyms", "hypernyms", "hyponyms", and "similar words". The knowledge in Text Graph provides rich background knowledge for text understanding tasks (such as entity recognition, relation extraction, reading comprehension, paraphrase detection, and text matching). We provide an example:

Figure 11. Text Graph

Text Normalization (Case Recovery for English)

Text normalization aims to normalize an input text, i.e., to convert an unnormalized text into a normalized one. It is the basis of many NLP tasks, since many of them require their input text to be normalized. Text normalization covers several subtasks, such as case recovery for English and spelling and grammar correction. Currently, TexSmart supports case recovery for English. For example, suppose the input text is:

john smith stayed in san francisco last month.

TexSmart is able to obtain its normalized text as follows:

John Smith stayed in San Francisco last month.
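As a toy illustration of what case recovery does (TexSmart uses a trained model, not a lookup table like this), one can restore the casing of words found in a small proper-noun list and capitalize the sentence start:

```python
# Tiny, hypothetical proper-noun table covering the example sentence only.
PROPER = {"john": "John", "smith": "Smith", "san": "San", "francisco": "Francisco"}

def recover_case(sentence: str) -> str:
    out = []
    for i, token in enumerate(sentence.split()):
        word = token.rstrip(".")
        punct = token[len(word):]            # keep trailing periods
        word = PROPER.get(word, word)
        if i == 0:                           # capitalize the sentence start
            word = word[:1].upper() + word[1:]
        out.append(word + punct)
    return " ".join(out)
```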