TexSmart: A Text Understanding Toolkit and Service


TexSmart is a text understanding system built by the NLP Team at Tencent AI Lab. It analyzes the morphology, syntax and semantics of text in both Chinese and English. Besides common features such as word segmentation, part-of-speech tagging, named entity recognition, syntactic analysis, semantic role labeling, text classification, text matching and text normalization (case recovery for English), TexSmart provides special features such as fine-grained named entity recognition, semantic expansion and deep semantic expression. TexSmart also includes a text graph module, which retrieves knowledge about many important relations for short texts or words. Text understanding technologies, which are used for structural analysis and processing of natural language, have been widely applied in search engines, personalized recommendation systems, advertisement matching, intelligent dialogue systems and beyond. For more details about TexSmart, please check the technical report.

Key Features of TexSmart

Compared with other public natural language processing tools, TexSmart offers the following distinguishing features.

Key Feature 1: Fine-grained Named Entity Recognition

TexSmart supports one thousand entity types organized in a hierarchy, while other tools currently support only a few to a dozen (coarse-grained) entity types such as person, location and organization. Large-scale fine-grained entity types are expected to provide richer semantic information for downstream NLP applications.

Tables 1 and 2 compare TexSmart with an existing open-source text understanding tool on an example English sentence.

Captain Marvel¹ was premiered in Los Angeles² 14 months ago³.

Table 1 The function of TexSmart in fine-grained named entity recognition and enhanced semantic understanding (May 2020)
No.	Entity	Type id	Type Name	Semantic
1	Captain Marvel	work.movie	Movie	{"related":["Spider-Man","Iron Man","Batman","Hulk","Captain America","Thor","Superman","X-Men","Wolverine","Avengers"]}
2	Los Angeles	loc.city	City	{"related":["Toronto","Chicago","Montreal","Boston","London","Vancouver","Paris","Ottawa","Seattle","Sydney"]}
3	14 months ago	time.generic	Time	{"value":[2019,3]}

Captain Marvel¹ was premiered in Los Angeles² 14 months ago.

Table 2 The result of traditional tools in named entity recognition (NER)
No.	Entity	Type Name
1	Marvel	Person
2	Los Angeles	Location

The input text: “Captain Marvel was premiered in Los Angeles 14 months ago.”

It can be seen that TexSmart recognizes more types of entities (for instance movie, etc.), and supports more fine-grained entity typing (such as refining the entity type of “Los Angeles” from “loc” to “loc.city”).

TexSmart recognizes up to one thousand entity types, including person, location, organization, product, trademark, work, time, numerical value, living creature, food, medicine, disease, subject, language, celestial body, organ, event, activity and so on. Within the common coarse-grained entity types of person, location, organization, etc., TexSmart can further recognize many fine-grained sub-types such as actor, politician, athlete, country, city, company, university, financial institution and so on.
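
The dotted type ids in Table 1 (`work.movie`, `loc.city`, `time.generic`) suggest one way a downstream application might work with the type hierarchy. The following is a minimal sketch, assuming that a dot separates a parent type from its sub-type; the helper names `ancestors` and `is_subtype` are hypothetical and not part of TexSmart.

```python
def ancestors(type_id: str) -> list[str]:
    """Return the type id and all of its parents, most specific first.

    Assumes a dot separates hierarchy levels, as in "loc.city".
    """
    parts = type_id.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def is_subtype(type_id: str, parent: str) -> bool:
    """Return True if type_id equals parent or is nested under it."""
    return type_id == parent or type_id.startswith(parent + ".")


# Entities and type ids copied from Table 1.
entities = [
    ("Captain Marvel", "work.movie"),
    ("Los Angeles", "loc.city"),
    ("14 months ago", "time.generic"),
]

# Select every entity whose type falls under the coarse-grained "loc" type.
locations = [name for name, type_id in entities if is_subtype(type_id, "loc")]
print(locations)
print(ancestors("loc.city"))
```

Matching on the prefix plus a dot, rather than on the bare prefix, keeps an unrelated id that merely starts with the same letters from being treated as a sub-type.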

Key Feature 2: Enhanced Semantic Understanding

In addition to fine-grained named entity recognition, TexSmart provides two enhanced semantic understanding functions: semantic expansion and deep semantic expression for specific types of entities. These two functions are not available in most existing open-source text understanding systems.

1) Semantic Expansion

The function of semantic expansion is to provide a list of related entities for the entities in the input sentence. Semantic expansion is a way to enhance the understanding of each entity’s semantics. It has wide applications in industry, for instance in search engines and recommendation systems. In the example above, TexSmart can associate “Captain Marvel” with other movies such as “Spider-Man”, “Iron Man”, etc.; and associate “Los Angeles” with other cities such as “Toronto”, “Chicago”, etc.
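
As one illustration of how a search application might consume semantic expansion, the sketch below widens a query with related entities. The `related` lists are copied from the Semantic column of Table 1 (truncated to three items each); the function `expand_query` and its `max_related` parameter are hypothetical and not part of TexSmart.

```python
import json

# "Semantic" column of Table 1, truncated to the first three related entities.
semantic = {
    "Captain Marvel": json.loads('{"related":["Spider-Man","Iron Man","Batman"]}'),
    "Los Angeles": json.loads('{"related":["Toronto","Chicago","Montreal"]}'),
}


def expand_query(entity: str, max_related: int = 2) -> list[str]:
    """Return the entity followed by up to max_related related entities."""
    related = semantic.get(entity, {}).get("related", [])
    return [entity] + related[:max_related]


print(expand_query("Captain Marvel"))
print(expand_query("Unknown Entity"))
```

An entity without any expansion data falls back to a single-item list, so the caller can treat both cases the same way.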

2) Deep Semantic Expression for Specific Types of Entities

For time, quantity and other specific types of entities, TexSmart can analyze their underlying structure to derive their precise semantics. For example, in Table 1 the deep semantic expression that TexSmart gives for “14 months ago” is a structured JSON string with the precise date: {"value":[2019, 3]}. This kind of deep semantic understanding is essential for certain NLP applications. For example, suppose a user of an intelligent dialogue system sends the bot this request on May 2, 2020: “Please help me book an air ticket to Beijing at 4 pm the day after tomorrow.” The bot needs to know not only that “at 4 pm the day after tomorrow” is a time entity, but also that this entity refers to “4 pm, May 4, 2020”. At present, most public NLP tools do not provide deep semantic expression of this kind, so it has to be implemented in the application layer.
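
The arithmetic behind these two examples can be sketched as follows. This only illustrates the expected results using Python's standard `datetime` module; it is not TexSmart's implementation. The function names are hypothetical, and the reference date for Table 1 is assumed to be May 2020, as its caption indicates.

```python
from datetime import date, datetime, time, timedelta


def months_ago(reference: date, months: int) -> list[int]:
    """Return [year, month] for the month lying `months` before `reference`."""
    # Count months from year 0 so that subtraction carries across years.
    total = reference.year * 12 + (reference.month - 1) - months
    year, month_index = divmod(total, 12)
    return [year, month_index + 1]


def day_after_tomorrow_at(reference: date, hour: int) -> datetime:
    """Return the datetime two days after `reference` at the given hour."""
    return datetime.combine(reference + timedelta(days=2), time(hour))


# "14 months ago", relative to May 2020 (Table 1).
print(months_ago(date(2020, 5, 1), 14))

# "4 pm the day after tomorrow", relative to May 2, 2020.
print(day_after_tomorrow_at(date(2020, 5, 2), 16))
```

Both results depend on the reference date, which is why the same phrase resolves differently depending on when the request is made.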

Key Feature 3: Designed for multi-dimensional application requirements

Application scenarios in academia and industry differ in their requirements for speed, precision and timeliness, and it is often difficult to achieve both speed and precision. TexSmart aims to address these three aspects as much as possible within one system.

First, TexSmart implements several algorithms and models with different speed and precision trade-offs for a given function (such as part-of-speech tagging and named entity recognition), so that applications built on top of it can choose the one that fits the diverse needs of different scenarios in industry and academia.

Second, TexSmart takes advantage of large-scale unstructured data and unsupervised or weakly-supervised methods. On the one hand, this unstructured data covers a large number of recently emerged words and entities (such as “Captain Marvel” above); on the other hand, unsupervised or weakly-supervised methods allow the system to be updated at low cost, which keeps it up to date.

Table 3 Features of TexSmart
	Existing tools	TexSmart
Entity granularity	About a dozen coarse-grained entity types, such as person, location, organization	One thousand entity types, including coarse-grained types like product, work, food, time, quantity, and fine-grained types like actor, politician, athlete, city, etc.
Semantic expansion	Mostly unsupported	Implements semantic expansion, for instance, Los Angeles -> Toronto, Chicago and so on.
Entity semantic expression	Mostly unsupported	Rich semantic representation for specific types of entities (e.g., time and quantities)

Versions

The latest version of TexSmart is v0.3.7. It includes the following updates:

The vocabulary and model are updated to March 2023 for Chinese and Dec. 2022 for English
Fixed some bugs

Instructions

TexSmart users can choose to call HTTP APIs or download and use the offline SDK version. Note that for the same input text, the results of the HTTP API and the offline SDK may be different, because the HTTP API version employs a larger knowledge base and supports more text understanding tasks and algorithms. For more details, please check TexSmart instructions.

Quick Links

TexSmart Online Demo
TexSmart HTTP API: Text Understanding | Text Matching | Text Graph
Offline Toolkit (SDK) Download

Techniques

Techniques for implementing TexSmart can be found here.

Citing TexSmart

If you use TexSmart for research, please cite the following technical report:


@article{texsmart2020,
  title={TexSmart: A Text Understanding System for Fine-Grained NER and Enhanced Semantic Analysis}, 
  author={Haisong Zhang and Lemao Liu and Haiyun Jiang and Yangming Li and Enbo Zhao and Kun Xu and
   Linfeng Song and Suncong Zheng and Botong Zhou and Jianchen Zhu and  Xiao Feng and  Tao Chen and 
   Tao Yang and Dong Yu and Feng Zhang and Zhanhui Kang and Shuming Shi},
  journal={arXiv preprint arXiv:2012.15639},
  year={2020}
}

@inproceedings{texsmart2021,
  title={TexSmart: A System for Enhanced Natural Language Understanding}, 
  author={Lemao Liu and Haisong Zhang and Haiyun Jiang and Yangming Li and Enbo Zhao and Kun Xu and
   Linfeng Song and Suncong Zheng and Botong Zhou and Jianchen Zhu and Xiao Feng and Tao Chen and Tao
    Yang and Dong Yu and Feng Zhang and Zhanhui Kang and Shuming Shi},
  booktitle={The Joint Conference of the 59th Annual Meeting of the Association for Computational
   Linguistics and the 11th International Joint Conference on Natural Language Processing 
   (ACL-IJCNLP): System Demonstrations},
  year={2021}
}

Frequently Asked Questions (FAQ)

The answers to frequently asked questions about TexSmart are here.

Related Resources (Systems and Datasets)

Chinese and English Word Embedding including 8 million words
Tencent Interactive Translation System TranSmart

Contacts

Should you have any questions, please contact us via texsmart@tencent.com or join the QQ group for API and toolkit trials (see below).