Technical Solution for TexSmart

For more mature NLP tasks such as word segmentation, part-of-speech tagging, and syntactic analysis, TexSmart implements a variety of representative methods[1-4]. The following sections briefly introduce how its features are implemented.

Coarse-grained Named Entity Recognition

TexSmart adds a new coarse-grained entity recognition algorithm based on Lexical Unit Analysis (LUA). The details are presented in [14].

Fine-grained Named Entity Recognition

Existing named entity recognition (NER) systems mostly rely on a manually annotated data set with coarse-grained entity types as the training set. However, TexSmart has thousands of entity types, and manually annotating a training set that covers all of them would be very time-consuming. To reduce the amount of manual annotation, this module uses a hybrid approach that combines the following three methods.

1) An unsupervised fine-grained entity recognition method based on two types of data: the first is a mapping table from entity names to types, derived from the knowledge graph TopBase[5] maintained by Tencent AI Lab; the second is relation information in word contexts, extracted from large-scale text data with the unsupervised methods described in [6-7].

2) A supervised sequence tagging model, trained on a manually annotated data set that contains dozens of coarse-grained entity types.

3) The entity linking method with which Tencent AI Lab won an international competition[8].

Used alone, each of these three methods produces some errors and has its own shortcomings. However, experiments show that combining them achieves better results.
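The text above does not say how the three outputs are merged. As a purely illustrative sketch, one simple possibility is to accept a type when at least two methods agree and otherwise fall back to the most trusted method that produced a label. The function name `combine`, the method names, the priority order, and the type labels below are all hypothetical and are not taken from TexSmart.

```python
from collections import Counter

def combine(predictions, priority):
    """Merge entity-type predictions from several recognizers (hypothetical rule).

    predictions: {method_name: {mention: entity_type}}
    priority: method names, most trusted first (an assumption, not from TexSmart)
    """
    mentions = set().union(*(p.keys() for p in predictions.values()))
    result = {}
    for mention in sorted(mentions):
        votes = Counter(p[mention] for p in predictions.values() if mention in p)
        best, count = votes.most_common(1)[0]
        if count > 1:
            # At least two methods agree on this type.
            result[mention] = best
        else:
            # No agreement: use the most trusted method that labeled the mention.
            for method in priority:
                if mention in predictions[method]:
                    result[mention] = predictions[method][mention]
                    break
    return result

# Toy predictions with made-up type labels.
predictions = {
    "unsupervised": {"Paris": "loc.city", "Apple": "food.fruit"},
    "supervised": {"Paris": "loc.city", "Apple": "org.company"},
    "linking": {"Apple": "org.company", "Jordan": "person.athlete"},
}
merged = combine(predictions, priority=["linking", "supervised", "unsupervised"])
```

In this toy run, "Paris" and "Apple" are decided by agreement between two methods, while "Jordan" is labeled by only one method and is therefore taken from the priority fallback.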

Semantic Expansion

Context-aware semantic expansion (CASE) is a new NLP task[9] abstracted from industrial applications by Tencent AI Lab. The main challenge of this task is the lack of annotated training data. This module uses two methods to build a semantic expansion model. The first method combines word vectors, distributional similarity, and template matching to generate a semantic similarity graph[10-12], and then uses the similarity graph together with context information to generate sets of related entities. The second method creates a pseudo-annotated data set of comparable scale from large-scale unstructured data, and trains a neural network model that takes the context fully into account.
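The first method can be illustrated with a much-simplified sketch. It builds a similarity graph from word vectors only (distributional similarity and template matching are left out), then ranks a seed entity's neighbors by their edge weight plus their average similarity to the context words. The two-dimensional vectors, the threshold, the scoring rule, and the function names are assumptions made for illustration, not TexSmart's actual model.

```python
import math

# Toy 2-d vectors (made up): first axis ~ "fruit", second axis ~ "company".
VECTORS = {
    "apple": (0.7, 0.7), "banana": (1.0, 0.1), "pear": (0.9, 0.0),
    "google": (0.1, 1.0), "microsoft": (0.0, 0.9),
    "juice": (1.0, 0.0), "shares": (0.0, 1.0),  # context words
}
ENTITIES = ["apple", "banana", "pear", "google", "microsoft"]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_graph(entities, threshold=0.5):
    """Connect two entities when their cosine similarity reaches the threshold."""
    graph = {e: {} for e in entities}
    for a in entities:
        for b in entities:
            if a != b:
                sim = cosine(VECTORS[a], VECTORS[b])
                if sim >= threshold:
                    graph[a][b] = sim
    return graph

def expand(seed, context, graph, top_k=2):
    """Rank the seed's graph neighbors; context must be a non-empty word list."""
    def score(candidate):
        ctx = sum(cosine(VECTORS[candidate], VECTORS[w]) for w in context)
        return graph[seed][candidate] + ctx / len(context)
    return sorted(graph[seed], key=score, reverse=True)[:top_k]

graph = build_graph(ENTITIES)
fruit_like = expand("apple", ["juice"], graph)     # context about drinks
company_like = expand("apple", ["shares"], graph)  # context about stocks
```

With these toy vectors, the same seed "apple" expands to fruit names in the "juice" context and to company names in the "shares" context, which is the behavior that makes the expansion context-aware.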

Deep Semantic Expression for Specific Types of Entities

For time and quantity entities, TexSmart can derive their specific semantic expressions. Some NLP tools use regular expressions or supervised sequence tagging methods to recognize time and quantity entities. However, it is difficult for those methods to derive the structured semantic information of an entity. To overcome this problem, this module uses a context-free grammar (CFG), which is more expressive than regular expressions. The basic process has three steps: first, write CFG productions according to the natural language expression format of a specific entity type; second, parse the natural language text representing the entity into a syntax tree with the Earley algorithm[13]; finally, generate the deep semantic expression of the entity by traversing the syntax tree.
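The three steps can be sketched on a toy example. The grammar below covers only simple durations such as "2 hours 30 minutes"; it, the token conventions, and the output format are assumptions chosen for illustration and are not TexSmart's actual productions. The parser is a minimal Earley implementation that assumes no empty productions and stores child subtrees inside each chart state, which is adequate for a small unambiguous grammar.

```python
# Step 1: CFG productions (hypothetical). Symbols absent from GRAMMAR are terminals.
GRAMMAR = {
    "DURATION": [["PART"], ["PART", "DURATION"]],
    "PART": [["NUMBER", "UNIT"]],
    "UNIT": [["hours"], ["minutes"], ["seconds"]],
}

def matches(symbol, token):
    # NUMBER is a terminal class matching digit strings; other terminals are literal.
    return token.isdigit() if symbol == "NUMBER" else symbol == token

def earley_parse(tokens, start="DURATION"):
    """Step 2: return a syntax tree (label, children) or None if parsing fails."""
    # chart[i] holds states (lhs, rhs, dot, origin, children) ending at position i.
    chart = [[] for _ in range(len(tokens) + 1)]

    def add(i, state):
        if state not in chart[i]:
            chart[i].append(state)

    for rhs in GRAMMAR[start]:
        add(0, (start, tuple(rhs), 0, 0, ()))
    for i in range(len(tokens) + 1):
        for lhs, rhs, dot, origin, kids in chart[i]:  # list may grow while iterating
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in GRAMMAR:  # predict
                    for r in GRAMMAR[sym]:
                        add(i, (sym, tuple(r), 0, i, ()))
                elif i < len(tokens) and matches(sym, tokens[i]):  # scan
                    add(i + 1, (lhs, rhs, dot + 1, origin, kids + (tokens[i],)))
            else:  # complete: advance every state that was waiting for lhs
                node = (lhs, kids)
                for p_lhs, p_rhs, p_dot, p_origin, p_kids in chart[origin]:
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        add(i, (p_lhs, p_rhs, p_dot + 1, p_origin, p_kids + (node,)))
    for lhs, rhs, dot, origin, kids in chart[len(tokens)]:
        if lhs == start and dot == len(rhs) and origin == 0:
            return (lhs, kids)
    return None

def to_semantics(node):
    """Step 3: traverse the tree and build a structured value."""
    label, kids = node
    if label == "PART":
        number, unit = kids          # unit is ("UNIT", (word,))
        return {unit[1][0]: int(number)}
    merged = {}
    for kid in kids:                 # DURATION: merge the values of its children
        merged.update(to_semantics(kid))
    return merged

tree = earley_parse("2 hours 30 minutes".split())
value = to_semantics(tree)           # {'hours': 2, 'minutes': 30}
```

A duration is simple enough that a regular expression could also recognize it; the grammar-based version is shown because the syntax tree gives each number and unit an explicit structural role, which is what the traversal step relies on.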

References