Compared with other public natural language processing tools, TexSmart offers the following distinctive features.
TexSmart supports one thousand entity types organized in a hierarchy, while other tools currently support only a few to a dozen coarse-grained entity types such as person, location and organization. Large-scale fine-grained entity types are expected to provide richer semantic information for downstream NLP applications.
Tables 1 and 2 compare the output of TexSmart (Table 1) with that of an existing open-source text understanding tool (Table 2) on an example English sentence.
No. | Entity | Type id | Type Name | Semantic |
---|---|---|---|---|
1 | Captain Marvel | work.movie | Movie | {"related":["Spider-Man","Iron Man","Batman","Hulk","Captain America","Thor","Superman","X-Men","Wolverine","Avengers"]} |
2 | Los Angeles | loc.city | City | {"related":["Toronto","Chicago","Montreal","Boston","London","Vancouver","Paris","Ottawa","Seattle","Sydney"]} |
3 | 14 months ago | time.generic | Time | {"value":[2019,3]} |

No. | Entity | Type Name |
---|---|---|
1 | Marvel | Person |
2 | Los Angeles | Location |

The input text: “Captain Marvel was premiered in Los Angeles 14 months ago.”
The tables show that TexSmart recognizes more types of entities (for instance, movies) and supports finer-grained entity typing (such as refining the entity type of “Los Angeles” from “loc” to “loc.city”).
TexSmart recognizes up to one thousand entity types, including person, location, organization, product, trademark, work, time, numerical value, living creature, food, medicine, disease, subject, language, celestial body, organ, event, activity and so on. Within the common coarse-grained entity types such as person, location and organization, TexSmart can further recognize many fine-grained sub-types such as actor, politician, athlete, country, city, company, university, financial institution and so on.
In addition to fine-grained named entity recognition, TexSmart also provides two enhanced semantic understanding functions: semantic expansion, and deep semantic expression for specific types of entities. These two functions are not available in most existing open-source text understanding systems.
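As a rough illustration of how a downstream application might use the hierarchical type ids in Table 1, the sketch below assumes that a dot in a type id separates a coarse type from its sub-type (so “loc.city” is a sub-type of “loc”). The helper names `type_ancestors` and `is_subtype` and the dictionary layout are hypothetical and are not part of the TexSmart API.

```python
def type_ancestors(type_id: str) -> list[str]:
    """Return the type chain from coarsest to finest for a dotted type id.

    Assumption: each dot-separated prefix of the id names an ancestor type.
    """
    parts = type_id.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def is_subtype(type_id: str, ancestor: str) -> bool:
    """Check whether type_id equals ancestor or lies below it in the hierarchy."""
    return ancestor in type_ancestors(type_id)


# Entities and type ids copied from Table 1; the dict layout is illustrative only.
entities = [
    {"entity": "Captain Marvel", "type_id": "work.movie"},
    {"entity": "Los Angeles", "type_id": "loc.city"},
    {"entity": "14 months ago", "type_id": "time.generic"},
]

# A coarse-grained consumer can still filter on the top-level type.
locations = [e["entity"] for e in entities if is_subtype(e["type_id"], "loc")]
print(locations)
```

With this kind of prefix check, an application that only understands coarse types can keep working while finer-grained consumers read the full id.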
The function of semantic expansion is to provide a list of related entities for the entities in the input sentence. Semantic expansion is a way to enhance the understanding of each entity’s semantics. It has wide applications in industry, for instance in search engines and recommendation systems. In the example above, TexSmart can associate “Captain Marvel” with other movies such as “Spider-Man”, “Iron Man”, etc.; and associate “Los Angeles” with other cities such as “Toronto”, “Chicago”, etc.
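To make the search-engine use case concrete, here is a minimal sketch that reads the “related” list from the Semantic column of Table 1 and uses it to expand a query. The JSON string is copied from the table; the functions `related_entities` and `expand_query` and the `top_k` parameter are hypothetical names introduced for illustration, not TexSmart functions.

```python
import json

# Semantic field for "Los Angeles", copied from Table 1.
semantic_field = (
    '{"related":["Toronto","Chicago","Montreal","Boston","London",'
    '"Vancouver","Paris","Ottawa","Seattle","Sydney"]}'
)


def related_entities(semantic_json: str, top_k: int = 3) -> list[str]:
    """Return up to top_k related entities from a semantic JSON string."""
    data = json.loads(semantic_json)
    return data.get("related", [])[:top_k]


def expand_query(entity: str, semantic_json: str, top_k: int = 3) -> list[str]:
    """Build an expanded term list: the entity itself plus related entities."""
    return [entity] + related_entities(semantic_json, top_k)


print(expand_query("Los Angeles", semantic_field, top_k=2))
```

Whether related entities should be added to a query at full weight is an application decision; a recommender might use them as candidates, while a search engine might down-weight them.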
For time, quantity and other specific types of entities, TexSmart can analyze their underlying structured expressions in order to derive the precise semantics of these entities. For example, in Table 1, the deep semantic expression given by TexSmart for “14 months ago” is a structured string in JSON format that contains the precise date: {"value":[2019, 3]}. This kind of deep semantic understanding is essential for certain NLP applications. For example, in an intelligent dialogue system, a user sends the following request to the bot on May 2, 2020: “Please help me book an air ticket to Beijing at 4 pm the day after tomorrow.” The bot needs to know not only that “at 4 pm the day after tomorrow” is a time entity, but also that this entity refers to “4 pm, May 4, 2020”. At present, most public NLP tools do not provide this kind of deep semantic expression, so it has to be implemented by the application layer itself.
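The date arithmetic behind the two examples above can be sketched with Python's standard `datetime` module. This is not how TexSmart computes its output; it only shows what resolving a relative time expression against a reference date involves. The function names are hypothetical, and the reference date of May 2, 2020 for “14 months ago” is an assumption taken from the dialogue example (it is consistent with the value [2019, 3] in Table 1).

```python
from datetime import date, datetime, time, timedelta


def months_ago(reference: date, months: int) -> list[int]:
    """Resolve "<months> months ago" to [year, month] relative to reference."""
    # Count months from year 0 so that borrowing across years is handled.
    index = reference.year * 12 + (reference.month - 1) - months
    return [index // 12, index % 12 + 1]


def day_after_tomorrow_at(reference: date, hour: int) -> datetime:
    """Resolve "at <hour> the day after tomorrow" relative to reference."""
    return datetime.combine(reference + timedelta(days=2), time(hour=hour))


# Assumed reference date: the day the user talks to the bot.
today = date(2020, 5, 2)

print(months_ago(today, 14))                          # [2019, 3]
print(day_after_tomorrow_at(today, 16).isoformat())   # 2020-05-04T16:00:00
```

The harder part in practice is recognizing the expression and mapping it to such a function call; the arithmetic itself is simple once the reference date is known.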
Application scenarios in academia and industry place different requirements on speed, precision and timeliness, and it is often difficult to achieve high speed and high precision at the same time. TexSmart aims to balance these three aspects as far as possible within one system.
First, TexSmart implements several algorithms and models with different speed and precision trade-offs for a given function (such as part-of-speech tagging and named entity recognition), so that upper-layer applications can choose among them and meet the diverse needs of different scenarios in industry and academia.
Second, TexSmart takes advantage of large-scale unstructured data and unsupervised or weakly-supervised methods. On the one hand, such data covers a large number of timely words and entities (such as “Captain Marvel” above); on the other hand, unsupervised or weakly-supervised methods allow the system to be updated at low cost, which keeps it up to date.
Feature | Existing tools | TexSmart |
---|---|---|
Entity granularity | About a dozen coarse-grained entity types, such as person, location, organization | One thousand entity types, including coarse-grained types like product, work, food, time, quantity, and fine-grained types like actor, politician, athlete, city, etc. |
Semantic expansion | Mostly unsupported | Supported; for instance, Los Angeles -> Toronto, Chicago and so on. |
Entity semantic expression | Mostly unsupported | Rich semantic representation for specific types of entities (e.g., time and quantities) |
@article{texsmart2020,
title={TexSmart: A Text Understanding System for Fine-Grained NER and Enhanced Semantic Analysis},
author={Haisong Zhang and Lemao Liu and Haiyun Jiang and Yangming Li and Enbo Zhao and Kun Xu and
Linfeng Song and Suncong Zheng and Botong Zhou and Jianchen Zhu and Xiao Feng and Tao Chen and
Tao Yang and Dong Yu and Feng Zhang and Zhanhui Kang and Shuming Shi},
journal={arXiv preprint arXiv:2012.15639},
year={2020}
}
@inproceedings{texsmart2021,
title={TexSmart: A System for Enhanced Natural Language Understanding},
author={Lemao Liu and Haisong Zhang and Haiyun Jiang and Yangming Li and Enbo Zhao and Kun Xu and
Linfeng Song and Suncong Zheng and Botong Zhou and Jianchen Zhu and Xiao Feng and Tao Chen and Tao
Yang and Dong Yu and Feng Zhang and Zhanhui Kang and Shuming Shi},
booktitle={The Joint Conference of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing
(ACL-IJCNLP): System Demonstrations},
year={2021}
}
Should you have any questions, please contact us via texsmart@tencent.com or join the QQ group for API and toolkit trials (see below).