Semarchy Text Tokens enricher
The Semarchy Text Tokens enricher replaces tokens in a text using a dictionary of strings or patterns.
Description
This enricher replaces tokens in a text using an ordered dictionary of strings or patterns.
Dictionary definition
The dictionary is an ordered list of entries, each with a "from" and a "to" value. A token matching the "from" value is replaced by the corresponding "to" value. In a dictionary entry:
- "from" values may be plain strings, wildcard patterns, or regular expressions.
- "to" values may be plain strings, possibly containing capture groups when the corresponding "from" is a regular expression.
The order of the dictionary entries matters: each token is replaced using the first matching entry in the dictionary.
For example, with the dictionary below, the Company token would be transformed to Co.
[
{"from":"Limited", "to":"Ltd."},
{"from":"Incorporated", "to":"Inc."},
{"from":"Company", "to":"Co."}
]
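As an illustration of the first-match rule, the following Python sketch parses the JSON dictionary above and looks up a token. It is not the enricher's implementation: the `first_match` helper is a hypothetical name, and plain string equality is assumed for matching.

```python
import json

# The JSON dictionary from the example above; json.loads preserves the array order.
JSON_DICTIONARY = """
[
  {"from": "Limited", "to": "Ltd."},
  {"from": "Incorporated", "to": "Inc."},
  {"from": "Company", "to": "Co."}
]
"""


def first_match(entries, token):
    """Return the "to" value of the first entry whose "from" equals the token.

    Hypothetical helper: returns None when no entry matches.
    """
    for entry in entries:
        if entry["from"] == token:
            return entry["to"]
    return None


entries = json.loads(JSON_DICTIONARY)
print(first_match(entries, "Company"))
print(first_match(entries, "Acme"))
```

Because the lookup stops at the first matching entry, an earlier entry always takes precedence over a later one with the same "from" value.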
There are multiple ways to configure and store the dictionary. The dictionary may be:
- Provided as a static JSON array in the JSON Dictionary parameter.
- Stored in a database and loaded at run-time from a Datasource:
  - either from a Dictionary Table, which must have a Token Pattern Column (From) and a Token Replacement Column (To). An optional Sort Column defines the order of the dictionary,
  - or from a Dictionary Query select statement, which returns the ordered list of dictionary entries.

You can only configure one dictionary storage option. JSON Dictionary, Dictionary Query and Dictionary Table are mutually exclusive.
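For the database-backed options, the Python sketch below shows one way an ordered dictionary could be read from a table. Everything in it is an assumption made for illustration: SQLite stands in for the configured datasource, the table and column names (`TOKEN_DICT`, `TOKEN_FROM`, `TOKEN_TO`, `SORT_ORDER`) are hypothetical, and the query shape is a guess at the general form, not the statement the enricher generates.

```python
import sqlite3


def load_dictionary(conn, table, from_col, to_col, sort_col=None):
    """Load ordered dictionary entries from a table (illustrative only).

    When no sort column is given, the "from" column is used for ordering.
    """
    order_col = sort_col or from_col
    # Identifiers are interpolated directly because they come from trusted
    # configuration in this sketch; never do this with user-supplied names.
    query = (
        f'SELECT "{from_col}", "{to_col}" FROM "{table}" '
        f'ORDER BY "{order_col}" ASC'
    )
    return [{"from": f, "to": t} for f, t in conn.execute(query)]


# Hypothetical dictionary table held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TOKEN_DICT (TOKEN_FROM TEXT, TOKEN_TO TEXT, SORT_ORDER INTEGER)"
)
conn.executemany(
    "INSERT INTO TOKEN_DICT VALUES (?, ?, ?)",
    [("Company", "Co.", 3), ("Limited", "Ltd.", 1), ("Incorporated", "Inc.", 2)],
)

entries = load_dictionary(conn, "TOKEN_DICT", "TOKEN_FROM", "TOKEN_TO", "SORT_ORDER")
print([entry["from"] for entry in entries])
```

An explicit sort column keeps the entry order independent of the pattern text, which matters as soon as two patterns can match the same token.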
Tokenization and replacement process
This section describes how this enricher processes an input text.
First, the input text is tokenized according to the Tokenize On parameter: on whitespace, on non-alphanumeric characters, or on the regular expression given in Token Separator Regex.
The tokenization creates an Original list of tokens. This list is then transformed, using the dictionary, in order to build:
- a Transformed list containing these tokens, possibly replaced by the matching entries in the dictionary.
- a Matched list containing the tokens with a matching entry in the dictionary.
To build these lists, each token in the Original token list is matched against the ordered dictionary entries.
For the first dictionary entry for which the "from" value matches the token:
- The token is added to the Matched list.
- The corresponding "to" is added to the Transformed list.
If a token has no match in the dictionary, it is added as is to the Transformed list, and not to the Matched list.
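The process described above can be sketched in Python as follows. This is a simplified illustration, not the plugin's code: the `process` function is a hypothetical name, tokenization is assumed to be on whitespace, and matching is assumed to be exact string comparison.

```python
def process(text, entries):
    """Build the Original, Transformed and Matched token lists for a text."""
    original = text.split()  # assumption: tokenize on whitespace
    transformed, matched = [], []
    for token in original:
        for entry in entries:
            if entry["from"] == token:  # first matching entry wins
                matched.append(token)
                transformed.append(entry["to"])
                break
        else:
            # No dictionary entry matched: keep the token as is.
            transformed.append(token)
    return original, transformed, matched


entries = [
    {"from": "Limited", "to": "Ltd."},
    {"from": "Incorporated", "to": "Inc."},
    {"from": "Company", "to": "Co."},
]

original, transformed, matched = process("Acme Company Limited", entries)
print(original)
print(transformed)
print(matched)
```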
Outputs
When all tokens in the Original list are processed, the outputs are built:
- Transformed Text contains the list of Transformed tokens, separated by the Transformed Text Separator and possibly sorted alphabetically.
- Matched Tokens contains the list of Matched tokens, separated by the Matched Token List Separator and possibly sorted alphabetically.
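A minimal Python sketch of how such an output string could be assembled from a token list is given below. The `build_output` function and its arguments are hypothetical names. The case-insensitive sort follows the parameter descriptions in this page, while the exact collation used by the enricher is an assumption.

```python
def build_output(tokens, separator=" ", sort=False):
    """Join tokens with a separator, optionally sorting them first."""
    if sort:
        # Case-insensitive alphabetical sort (assumed collation: str.casefold).
        tokens = sorted(tokens, key=str.casefold)
    return separator.join(tokens)


print(build_output(["Acme", "Co.", "Ltd."]))
print(build_output(["Limited", "company"], separator=",", sort=True))
```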
Regular expressions and wildcards
The Match/Replace Mode parameter defines how the from and to entries in the dictionary should be processed.
EXACT_STRING (default) matches tokens exactly against the "from" string and replaces them with the corresponding "to" string, as illustrated in the example below.
[
{"from":"Limited", "to":"Ltd."},
{"from":"Incorporated", "to":"Inc."},
{"from":"Company", "to":"Co."}
]
WILDCARDS assumes that the "from" string contains a wildcard pattern to match ("?" represents one character, "*" represents any number of characters).
[
{"from":"AZ-*", "to":"Arizona"},
...
{"from":"CA-*", "to":"California"}
...
]
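Wildcard matching of this kind can be sketched in Python by translating each pattern into a regular expression. This is only an illustration under stated assumptions: the helper names are hypothetical, "*" is assumed to also match zero characters, and the pattern is assumed to apply to the whole token.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern ("?" and "*") into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # any number of characters (assumed: zero or more)
        elif ch == "?":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def replace_wildcard(token, entries):
    """Return the "to" value of the first matching entry, else the token."""
    for entry in entries:
        if wildcard_to_regex(entry["from"]).fullmatch(token):
            return entry["to"]
    return token


entries = [
    {"from": "AZ-*", "to": "Arizona"},
    {"from": "CA-*", "to": "California"},
]

print(replace_wildcard("AZ-99577", entries))
print(replace_wildcard("NY-10001", entries))
```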
REGEX matches tokens against the "from" Java regular expression and replaces them with the "to" value, supporting capture group replacement. The example below handles tokens such as AZ-99577 or AZ 99577, keeping only the two-character prefix captured by the first group.
[
{"from":"(\w{2})\s\d{4,5}", "to":"$1"},
{"from":"(\w{2})-\d{4,5}", "to":"$1"}
]
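The capture group replacement can be approximated in Python as sketched below. The enricher uses Java regular expressions, so this is an analogy and not the plugin's behavior: Java's `$1` group reference is converted to Python's `\g<1>` syntax by a hypothetical helper, the pattern is assumed to match the whole token, and escaped dollar signs in the "to" value are not handled.

```python
import re


def java_to_python_replacement(replacement):
    """Convert Java-style $1 group references to Python's \\g<1> syntax.

    Simplification: escaped dollar signs in the replacement are not handled.
    """
    return re.sub(r"\$(\d+)", r"\\g<\1>", replacement)


def replace_regex(token, entries):
    """Apply the first entry whose "from" regex matches the whole token."""
    for entry in entries:
        match = re.fullmatch(entry["from"], token)
        if match:
            return match.expand(java_to_python_replacement(entry["to"]))
    return token


entries = [
    {"from": r"(\w{2})\s\d{4,5}", "to": "$1"},
    {"from": r"(\w{2})-\d{4,5}", "to": "$1"},
]

print(replace_regex("AZ-99577", entries))
print(replace_regex("AZ 99577", entries))
```

Note that a token such as AZ 99577 only reaches the dictionary as a single token when the tokenization does not split on whitespace.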
Using regular expressions may expose you to Regular expression Denial of Service (ReDoS) attacks, particularly when the dictionary is stored in a database table whose data can be modified by users.
Plugin parameters
The following table lists the plugin parameters.
Parameter name | Mandatory | Type | Description |
---|---|---|---|
Dictionary Definition | | | |
JSON Dictionary | No | String | Dictionary in JSON format. This is a JSON array of ordered entries. Each entry is a JSON object with "from" and "to" properties. When this parameter is not set, the dictionary is loaded from a datasource using the table or query. |
Datasource | No | String | Name of the datasource containing the dictionary data. This datasource must be configured in the platform. If this parameter is not set, the enricher uses the data location datasource. |
Dictionary Query | No | String | Custom select query returning the ordered dictionary of token replacement patterns. This query should return two columns corresponding to the dictionary's "From" and "To" properties. Leave this parameter empty to use a SQL query generated from the Dictionary Table, Token Pattern Column (From), Token Replacement Column (To) and Sort Column parameters. |
Dictionary Table | No | String | Physical name of the table containing the token replacement patterns. |
Token Pattern Column (From) | No | String | Column in the dictionary table containing the token string or pattern to detect. This column corresponds to the "From" property of a replacement pattern. |
Token Replacement Column (To) | No | String | Column in the dictionary table containing the replacement string or pattern. This column corresponds to the "To" property of a replacement pattern. |
Sort Column | No | String | Column in the dictionary table used to order the dictionary entries (ascending) for token detection. If not set, the Token Pattern Column (From) is used. |
Tokenization and Replacement | | | |
Tokenize On | No | String | Defines how the input text is tokenized: on whitespace, on non-alphanumeric characters, or on the Token Separator Regex regular expression. |
Token Separator Regex | No | String | Java regular expression pattern used to tokenize the input text when Tokenize On is set to tokenize on a regular expression. |
Match/Replace Mode | No | String | Defines how tokens are matched against the "From" values and replaced with the "To" values defined in the dictionary: EXACT_STRING (default), WILDCARDS, or REGEX. |
Outputs Configuration | | | |
Transformed Text Separator | No | String | Token separator in the transformed text. The default value is a space (" "). |
Sort Replaced Tokens | No | Boolean | Set to true to sort the replaced tokens alphabetically in the transformed text. Default value is false. Note that sorting is case-insensitive. |
Return Input Token List if Null | No | Boolean | Set to true (default) to return the original token list in the replaced tokens when all tokens are replaced by nulls. |
Matched Token List Separator | No | String | Token separator in the Matched Tokens list. The default value is a space (" "). |
Sort Matched Tokens | No | Boolean | Set to true to sort the Matched Tokens list alphabetically. Default value is false. Note that sorting is case-insensitive. |
Max. Matched Tokens | No | String | Maximum number of tokens to return in the Matched Tokens list. Defaults to no limit. |