TexSmart HTTP API


The TexSmart API consists of three parts: the Text Understanding API, the Text Matching API and the Text Graph API; this page introduces the Text Understanding API. The API is accessed via HTTP POST, and its URL is https://texsmart.qq.com/api.
The input POST data must be in JSON format, and the output is also in JSON format. Postman is recommended for testing.

Here is a simple example request (as HTTP-POST body):
  {"str":"He stayed in San Francisco."}
Example code for calling the API: Python Code | Java Code | C++ Code | C# Code

The result returned by the API is also in JSON format, as follows:
{
  "header":{"time_cost_ms":1.18,"time_cost":0.00118,
            "core_time_cost_ms":1.139,"ret_code":"succ"},
  "norm_str":"He stayed in San Francisco.",
  "word_list":[
    {"str":"He","hit":[0,2,0,1],"tag":"PRP"},
    {"str":"stayed","hit":[3,6,1,1],"tag":"VBD"},
    {"str":"in","hit":[10,2,2,1],"tag":"IN"},
    {"str":"San","hit":[13,3,3,1],"tag":"NNP"},
    {"str":"Francisco","hit":[17,9,4,1],"tag":"NNP"},
    {"str":".","hit":[26,1,5,1],"tag":"NFP"}
  ],
  "phrase_list":[
    {"str":"He","hit":[0,2,0,1],"tag":"PRP"},
    {"str":"stayed","hit":[3,6,1,1],"tag":"VBD"},
    {"str":"in","hit":[10,2,2,1],"tag":"IN"},
    {"str":"San Francisco","hit":[13,13,3,2],"tag":"NNP"},
    {"str":".","hit":[26,1,5,1],"tag":"NFP"}
  ],
  "entity_list":[
    {"str":"San Francisco","hit":[13,13,3,2],"tag":"loc.city","tag_i18n":"city",
     "meaning":{"related":["Los Angeles", "San Diego", "San Jose", "Santa Clara", "Palo Alto",
                           "Santa Cruz", "Sacramento", "San Mateo", "Santa Barbara", "Oakland"]}}
  ],
  "syntactic_parsing_str":"",
  "srl_str":""
}
The field “header” gives auxiliary information (time cost, return code, etc.) about this API call; the field “norm_str” gives the result of text normalization; the field “word_list” contains the results of basic-granularity word segmentation and part-of-speech tagging; the field “phrase_list” contains the results of compound-granularity word segmentation and part-of-speech tagging; the field “entity_list” gives all the recognized entities and their types; and the fields “syntactic_parsing_str” and “srl_str” hold the constituency parse tree and the semantic role labeling results respectively. In this example, “syntactic_parsing_str” and “srl_str” are both empty strings, because “syntactic_parsing” and “srl” are not activated by default.

Note that compound-granularity word segmentation and part-of-speech tagging depend on the results of entity recognition. Therefore, when the entity recognition function is activated or deactivated (via the option "ner":{"enable":true/false}), the compound-granularity word segmentation and part-of-speech tagging results may differ, as the sketch below illustrates.
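For illustration, here is a minimal Python sketch (using the requests module, as in the code examples below) that sends the same sentence twice, once with NER enabled and once with it disabled, and compares the resulting phrase lists:

# -*- coding: utf8 -*-
import json
import requests

url = "https://texsmart.qq.com/api"
text = "He stayed in San Francisco."

for ner_enabled in (True, False):
    # Only the "ner" option is set here; all other options keep their defaults.
    obj = {"str": text, "options": {"ner": {"enable": ner_enabled}}}
    r = requests.post(url, data=json.dumps(obj).encode('utf-8'))
    r.encoding = "utf-8"
    res = json.loads(r.text)
    phrases = [p["str"] for p in res.get("phrase_list", [])]
    print("ner enabled:", ner_enabled, "->", phrases)

With NER enabled, "San Francisco" is expected to appear as a single phrase (as in the example output above); with it disabled, the compound-granularity result may split it into separate words.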

Instructions on Input and Output Format

The fields of the input JSON object are described below:

str (string)
    The input text to be analyzed.
options (JSON Object)
    Option information, mainly used to specify which functions to call and which algorithm each function should use. More details can be found in the section "More Ways to Call the API".
echo_data (JSON Object)
    A JSON object defined by the user; the TexSmart service returns the same object in the response as an echo. Users can use this field to record identity information of the current request.
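As a small sketch of these input fields, the snippet below attaches an illustrative request_id in echo_data; assuming the echoed object comes back under the same echo_data key in the response, a client can use it to match responses to requests:

# -*- coding: utf8 -*-
import json
import requests

req = {
    "str": "He stayed in San Francisco.",
    "echo_data": {"request_id": 12345}  # illustrative identity information
}
r = requests.post("https://texsmart.qq.com/api", data=json.dumps(req).encode('utf-8'))
res = json.loads(r.text)
# Assumed: the echoed object is returned under the same "echo_data" key.
print(res.get("echo_data"))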

The fields of the output JSON object are described below:

header (JSON Object)
    Auxiliary information about the API call and its execution, with the following fields:
    time_cost_ms: total time spent processing the request, in milliseconds (ms).
    time_cost: total time spent processing the request, in seconds (s).
    ret_code: return code. "succ" denotes success; other values are error codes, including the following cases:
        error.invalid_request_format: the request format is invalid (for example, it is not valid JSON);
        error.timeout: the request timed out;
        error.busy: the service is busy (it is handling other requests);
        error.too_long_text: the input text is too long (the length limit is 8192 characters).
norm_str (string)
    Normalization result of the input text.
word_list (JSON array)
    Results of basic-granularity word segmentation and part-of-speech tagging. Each element has the following fields:
    hit: a JSON array whose first number is the position of the word within norm_str and whose second number is the length of the word; the last two numbers can be ignored. Position and length are measured in characters rather than bytes: a Chinese character, a digit, a punctuation mark, or a space each count as one character.
    tag: the POS tag of the word.
phrase_list (JSON array)
    Results of compound-granularity word segmentation and part-of-speech tagging (all fields have the same meaning as in word_list).
entity_list (JSON array)
    Information about the recognized entities. Each element has the following fields:
    hit: same as in word_list.
    type: a JSON object with the following fields:
        name: standard name of this entity type;
        i18n: natural-language expression (Chinese or English) of this type;
        flag: indicates whether this entity mention is an instance or a sub-type (1: instance, 2: sub-type, 0: unknown); this field may be absent when flag = 0;
        path: path of this type in the TexSmart ontology (from the root to the direct super-type). The TexSmart ontology can be downloaded from the download page.
    meaning: semantic information of the entity, represented as a JSON object whose specific format depends on the entity type.
    tag: [deprecated; please use the type field instead] the standard name of the entity type.
    tag_i18n: [deprecated; please use the type field instead] the natural-language expression (Chinese or English) of the entity type.
syntactic_parsing_str (string)
    Results of constituency parsing.
srl_str (string)
    Results of semantic role labeling.
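To make the hit convention concrete, here is a minimal Python sketch that slices norm_str with each word's offset and length (which should reproduce the word itself) and prints the recognized entities using the tag fields shown in the example output above:

# -*- coding: utf8 -*-
import json
import requests

obj = {"str": "He stayed in San Francisco."}
r = requests.post("https://texsmart.qq.com/api", data=json.dumps(obj).encode('utf-8'))
res = json.loads(r.text)

if res["header"]["ret_code"] == "succ":
    norm = res["norm_str"]
    for word in res["word_list"]:
        offset, length = word["hit"][0], word["hit"][1]
        # Positions are measured in characters, so Python slicing applies directly.
        assert norm[offset:offset + length] == word["str"]
    for ent in res["entity_list"]:
        # tag/tag_i18n match the example output above; newer responses may also carry "type".
        print(ent["str"], "->", ent.get("tag"), "/", ent.get("tag_i18n"))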

More Ways to Call the API

Input Option Settings

More generally, the input JSON can also include options that control which functions are run (word segmentation, part-of-speech tagging, named entity recognition, syntactic parsing, semantic role labeling, etc.) and which algorithm is used for each function. The input JSON of the simple example above contains no options; its returned results are similar to those of the following JSON input.

{
  "str":"he stayed in San Francisco.",
  "options":
  {
    "input_spec":{"lang":"auto"},
    "word_seg":{"enable":true},
    "pos_tagging":{"enable":true,"alg":"log_linear"},
    "ner":{"enable":true,"alg":"fine.std"},
    "syntactic_parsing":{"enable":false},
    "srl":{"enable":false}
  },
  "echo_data":{"request_id":12345}
}

Specifically, the field “input_spec” specifies the input language; its “lang” value has three options: “auto” (detect the input language automatically), “chs” (Chinese) and “en” (English). The field “enable” can be true or false, indicating whether to activate the corresponding function. The field “alg” specifies the algorithm that the corresponding function should use. There are three alternatives for “alg” in “pos_tagging” (“crf”, “dnn” and “log_linear”) and five in “ner” (“coarse.crf”, “coarse.dnn”, “coarse.lua”, “fine.std” and “fine.high_acc”), where “coarse” and “fine” denote coarse-grained and fine-grained NER respectively. The options “syntactic_parsing”, “srl” and “text_cat” control constituency parsing, semantic role labeling and text classification respectively; all three are disabled by default. The value of “echo_data” can be customized by the user, for example to record identity information of the current request (such as a “request_id”), which can be useful in asynchronous calls and other scenarios. The sketch below shows a request that enables the two functions disabled by default.
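For example, the following minimal Python sketch enables "syntactic_parsing" and "srl" and prints their string-valued results:

# -*- coding: utf8 -*-
import json
import requests

obj = {
    "str": "He stayed in San Francisco.",
    "options": {
        # Both functions are disabled by default, so turn them on explicitly.
        "syntactic_parsing": {"enable": True},
        "srl": {"enable": True}
    }
}
r = requests.post("https://texsmart.qq.com/api", data=json.dumps(obj).encode('utf-8'))
res = json.loads(r.text)
print(res.get("syntactic_parsing_str", ""))
print(res.get("srl_str", ""))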

Batch Call

TexSmart also supports batch calls: a single JSON input can carry multiple (Chinese or English) sentences to be analyzed. Here is an example input in JSON format:

{
  "str":[
         "上个月30号,南昌王先生在自己家里边看流浪地球边吃煲仔饭。",
         "2020年2月7日,经中央批准,国家监察委员会决定派出调查组赴湖北省武汉市,就群众反映的涉及李文亮医生的有关问题作全面调查。",
         "John Smith stayed in San Francisco last month."
        ]
}
Note that the output format of a batch call is slightly different from that of an ordinary call: the results of all sentences are returned as a JSON array, which is the value of the "res_list" field, as in the sketch below.
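A minimal Python sketch of a batch call, iterating over res_list (one element per input sentence):

# -*- coding: utf8 -*-
import json
import requests

obj = {"str": ["He stayed in San Francisco.",
               "John Smith stayed in San Francisco last month."]}
r = requests.post("https://texsmart.qq.com/api", data=json.dumps(obj).encode('utf-8'))
res = json.loads(r.text)
for i, item in enumerate(res.get("res_list", [])):
    # Each element has the same structure as an ordinary (single-sentence) result.
    entities = [e["str"] for e in item.get("entity_list", [])]
    print(i, entities)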


Code Example

API Call with Python

Code Example 1 (with http.client):
# -*- coding: utf8 -*-
import json
import http.client


obj = {"str": "he stayed in San Francisco."}
# Serialize the request object; json.dumps escapes non-ASCII characters by
# default, and encoding explicitly keeps the POST body safe in all cases.
req_str = json.dumps(obj).encode('utf-8')

conn = http.client.HTTPSConnection("texsmart.qq.com")
conn.request("POST", "/api", req_str)
response = conn.getresponse()
print(response.status, response.reason)
res_str = response.read().decode('utf-8')
print(res_str)
#print(json.loads(res_str))
Code Example 2 (with the requests module installed):
# -*- coding: utf8 -*-
import json
import requests

obj = {"str": "he stayed in San Francisco."}
req_str = json.dumps(obj).encode()

url = "https://texsmart.qq.com/api"
r = requests.post(url, data=req_str)
r.encoding = "utf-8"
print(r.text)
#print(json.loads(r.text))

API Call with Java

[TBD]

API Call with C++

[TBD]

API Call with C#

[TBD]

About TexSmart

TexSmart is a text understanding system built by the NLP team at Tencent AI Lab. It analyzes the morphology, syntax and semantics of both Chinese and English text, providing basic natural language understanding functions such as word segmentation, part-of-speech tagging, named entity recognition (NER) and semantic expansion, and in particular supports key functions including fine-grained named entity recognition, semantic expansion, and deep semantic representation for specific entities.

Experience Demo | System Introduction