Semarchy Text Tokens enricher

The Semarchy Text Tokens enricher replaces tokens in a text using a dictionary of strings or patterns.

Plugin ID

Semarchy Text Tokens Enricher - com.semarchy.engine.plugins.convergence.texttokens

Description

This enricher replaces tokens in a text using an ordered dictionary of strings or patterns.

Dictionary definition

The dictionary is an ordered list of entries, each with a "from" and a "to" value. A token matching the "from" value is replaced by the corresponding "to" value. In a dictionary entry:

  • "from" values may be plain strings, wildcard patterns or regular expressions.

  • "to" values may be plain strings, possibly containing capture group references when the corresponding "from" is a regular expression.

The order of the dictionary entries matters: each token is replaced using the first matching entry in the dictionary.

For example, with the dictionary below, the Company token would be transformed to Co.

Example 1. JSON dictionary example
[
    {"from":"Limited", "to":"Ltd."},
    {"from":"Incorporated", "to":"Inc."},
    {"from":"Company", "to":"Co."}
]
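The first-match rule can be illustrated with a minimal Python sketch. It is not the plugin's implementation: the replace_token helper is hypothetical, and the last dictionary entry (a second "Company" entry) is added only to show that an earlier entry takes precedence.

```python
import json

# Dictionary from Example 1, plus one extra (hypothetical) entry at the end
# to show that only the first matching entry is used.
DICTIONARY_JSON = """
[
    {"from": "Limited", "to": "Ltd."},
    {"from": "Incorporated", "to": "Inc."},
    {"from": "Company", "to": "Co."},
    {"from": "Company", "to": "Comp."}
]
"""


def replace_token(token, entries):
    """Return the "to" value of the first entry whose "from" equals the token.

    Hypothetical helper: a token without a matching entry is returned unchanged.
    """
    for entry in entries:
        if entry["from"] == token:
            return entry["to"]
    return token


entries = json.loads(DICTIONARY_JSON)
print(replace_token("Company", entries))  # first "Company" entry wins
print(replace_token("Acme", entries))     # no match, token kept as is
```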

There are multiple ways to configure and store the dictionary. The dictionary may be:

  • Provided as a static JSON object in the JSON Dictionary parameter.

  • Stored in a database and loaded at run-time from a Datasource:

    • either from a Dictionary Table, which must have a Token Pattern Column (From) and a Token Replacement Column (To). An optional Sort Column defines the order of the dictionary,

    • or from a Dictionary Query select statement, which returns the ordered list of dictionary entries.

You can only configure one dictionary storage option. JSON Dictionary, Dictionary Query and Dictionary Table are mutually exclusive.
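As an illustration of the database option, the following Python sketch loads an ordered dictionary from a table. It uses an in-memory SQLite database as a stand-in for the datasource, and the table and column names (token_dictionary, token_from, token_to, sort_order) are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the configured datasource.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table token_dictionary ("
    "token_from text, token_to text, sort_order integer)"
)
conn.executemany(
    "insert into token_dictionary values (?, ?, ?)",
    [
        ("Company", "Co.", 30),
        ("Limited", "Ltd.", 10),
        ("Incorporated", "Inc.", 20),
    ],
)


def load_dictionary(connection):
    """Return the dictionary entries in the order defined by the sort column."""
    rows = connection.execute(
        "select token_from, token_to from token_dictionary order by sort_order"
    )
    return [{"from": token_from, "to": token_to} for token_from, token_to in rows]


entries = load_dictionary(conn)
for entry in entries:
    print(entry["from"], "->", entry["to"])
```

The order by clause matters here: without it, the database may return the rows in any order, and the first-match rule would give unpredictable results.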

Tokenization and replacement process

This section describes how this enricher processes an input text.

First, the input text is tokenized according to the Tokenize On parameter: on whitespaces, on non-alphanumeric characters, or on the regular expression defined in Token Separator Regex.
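The three tokenization modes can be sketched in Python as follows. This is an approximation under stated assumptions: the tokenize function is hypothetical, consecutive separators are collapsed and empty tokens dropped, NON_ALPHANUM is limited to ASCII letters and digits, and Python's re module stands in for Java regular expressions.

```python
import re


def tokenize(text, tokenize_on="WHITESPACES", separator_regex=None):
    """Split text into tokens according to the selected mode (hypothetical helper)."""
    if tokenize_on == "WHITESPACES":
        pattern = r"\s+"
    elif tokenize_on == "NON_ALPHANUM":
        # Assumption: only ASCII letters and digits count as alphanumeric.
        pattern = r"[^0-9A-Za-z]+"
    elif tokenize_on == "REGEX":
        if not separator_regex:
            raise ValueError("separator_regex is required when tokenize_on is REGEX")
        # Note: re.split also returns text captured by groups in the pattern,
        # so the separator pattern should not contain capture groups.
        pattern = separator_regex
    else:
        raise ValueError("unknown tokenize_on value: " + tokenize_on)
    # Assumption: empty tokens (for example at the start of the text) are dropped.
    return [token for token in re.split(pattern, text) if token]


print(tokenize("Acme Company Limited"))
print(tokenize("AZ-99577, USA", "NON_ALPHANUM"))
print(tokenize("line one\r\nline two", "REGEX", r"\r?\n"))
```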

The tokenization creates an Original list of tokens. This list is then transformed, using the dictionary, in order to build:

  • a Transformed list containing these tokens, possibly replaced by the matching entries in the dictionary.

  • a Matched list containing the tokens with a matching entry in the dictionary.

To build these lists, each token in the Original token list is matched against the ordered dictionary entries.
For the first dictionary entry for which the "from" value matches the token:

  • The token is added to the Matched list.

  • The corresponding "to" is added to the Transformed list.

If a token has no match in the dictionary, it is added as is to the Transformed list, and not to the Matched list.
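The process described in this section can be sketched in Python as shown below, assuming EXACT_STRING matching. The process_tokens function is a hypothetical name, and the sketch is not the plugin's actual code.

```python
def process_tokens(original, entries):
    """Build the Transformed and Matched lists from the Original token list."""
    transformed = []
    matched = []
    for token in original:
        for entry in entries:
            if entry["from"] == token:
                # First matching entry: record the token and its replacement.
                matched.append(token)
                transformed.append(entry["to"])
                break
        else:
            # No entry matched: the token is kept as is.
            transformed.append(token)
    return transformed, matched


entries = [
    {"from": "Limited", "to": "Ltd."},
    {"from": "Incorporated", "to": "Inc."},
    {"from": "Company", "to": "Co."},
]
transformed, matched = process_tokens(["Acme", "Company", "Limited"], entries)
print(transformed)
print(matched)
```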

Outputs

When all tokens in the Original list are processed, the outputs are built:

  • Transformed Text contains the list of Transformed tokens, separated by the Transformed Text Separator and possibly sorted alphabetically.

  • Matched Tokens contains the list of Matched tokens, separated by the Matched Token List Separator and possibly sorted alphabetically.
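The way a token list is rendered into an output string can be sketched as follows. The render function is hypothetical. It joins tokens with a separator and optionally applies a case-insensitive alphabetical sort, which approximates the sorting described for the output parameters.

```python
def render(tokens, separator=" ", sort=False):
    """Join tokens into an output string, optionally sorted case-insensitively."""
    if sort:
        tokens = sorted(tokens, key=str.lower)
    return separator.join(tokens)


transformed = ["Acme", "Co.", "Ltd."]
matched = ["Company", "Limited"]

print(render(transformed))                      # default separator: a space
print(render(matched, separator=", "))          # custom separator
print(render(["beta", "Alpha", "co."], sort=True))
```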

Regular expressions and wildcards

The Match/Replace Mode parameter defines how the "from" and "to" values in the dictionary are processed.

EXACT_STRING (default) matches tokens exactly against the "from" string and replaces them with the corresponding "to" string, as illustrated in the example below.

Example 2. JSON dictionary example for EXACT_STRING
[
    {"from":"Limited", "to":"Ltd."},
    {"from":"Incorporated", "to":"Inc."},
    {"from":"Company", "to":"Co."}
]

WILDCARDS assumes that the "from" string contains a wildcard pattern to match ("?" representing one character, "*" representing any number of characters).

Example 3. JSON dictionary example for WILDCARDS: Transform postal codes to state names
[
    {"from":"AZ-*", "to":"Arizona"},
    ...
    {"from":"CA-*", "to":"California"}
    ...
]
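Wildcard matching can be approximated in Python by translating the pattern into a regular expression. This sketch assumes that "*" also matches zero characters and that the pattern must match the whole token. The function names are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern ("?" and "*") into a compiled regex."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")   # any number of characters (assumed to include zero)
        elif char == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(char))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern, token):
    """Return True when the whole token matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(token) is not None


print(wildcard_match("AZ-*", "AZ-99577"))
print(wildcard_match("AZ-*", "CA-90210"))
print(wildcard_match("A?-*", "AZ-99577"))
```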

REGEX matches tokens against the "from" Java regular expression and replaces them with the "to" value, supporting capture group replacement.

Example 4. JSON dictionary example for REGEX: Extract a state code from a postal code token such as AZ-99577 or AZ 99577.
[
    {"from":"(\\w{2})\\s\\d{4,5}", "to":"$1"},
    {"from":"(\\w{2})-\\d{4,5}", "to":"$1"}
]

Using regular expressions, particularly with a dictionary stored in a database table whose data users can modify, may expose you to Regular expression Denial of Service (ReDoS) attacks.
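Regular expression replacement with capture groups can be sketched in Python as shown below. Python's re module stands in for Java regular expressions, so "$1" references are converted to Python's "\g<1>" syntax. The sketch assumes that a pattern must match the whole token, which may differ from the plugin's behavior, and the function names are hypothetical.

```python
import re


def to_python_template(java_replacement):
    """Convert Java-style group references ($1) to Python's \\g<1> syntax."""
    escaped = java_replacement.replace("\\", "\\\\")  # keep literal backslashes
    return re.sub(r"\$(\d+)", r"\\g<\1>", escaped)


def regex_replace(token, entries):
    """Replace the token using the first entry whose pattern matches it."""
    for entry in entries:
        match = re.fullmatch(entry["from"], token)  # assumption: whole-token match
        if match:
            return match.expand(to_python_template(entry["to"]))
    return token  # no entry matched: the token is kept as is


# Patterns from Example 4, written as Python raw strings.
entries = [
    {"from": r"(\w{2})\s\d{4,5}", "to": "$1"},
    {"from": r"(\w{2})-\d{4,5}", "to": "$1"},
]

print(regex_replace("AZ-99577", entries))
print(regex_replace("AZ 99577", entries))
print(regex_replace("Phoenix", entries))
```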

Plugin parameters

The following table lists the plugin parameters.

Parameter name | Mandatory | Type | Description

Dictionary Definition
The following parameters define how the dictionary of replacement patterns is stored.

JSON Dictionary

No

String

Dictionary in JSON format. This is a JSON array of ordered entries. Each entry is a JSON object with "from" and "to" properties. When this parameter is not set, the dictionary is loaded from a datasource using the table or query.

Datasource

No

String

Name of the datasource containing the dictionary data. This datasource must be configured in the platform. If this parameter is not set, the enricher uses the data location datasource.

Dictionary Query

No

String

Custom SELECT query returning the ordered dictionary of token replacement patterns. This query should return two columns corresponding to the dictionary's "From" and "To" properties. For example:

select
    <token_pattern_column_from>,
    <token_replacement_column_to>
from <dictionary_table>
where ...
order by <sort_column>

Leave this parameter empty to use a SQL query generated from the Dictionary Table, Token Pattern Column (From), Token Replacement Column (To) and Sort Column parameters.

Dictionary Table

No

String

Physical name of the table containing the token replacement patterns.

Token Pattern Column (From)

No

String

Column in the dictionary table containing the token string or pattern to detect. This column corresponds to the "From" property of a replacement pattern.

Token Replacement Column (To)

No

String

Column in the dictionary table containing the replacement string or pattern. This column corresponds to the "To" property of a replacement pattern.

Sort Column

No

String

Column in the dictionary table used to sort the dictionary entries, in ascending order, for token detection. If not set, the Token Pattern Column (From) is used.

Tokenization and Replacement
The following parameters define how the Input Text is tokenized and how token replacement takes place.

Tokenize On

No

String

Defines how the input text is tokenized:

  • WHITESPACES (default) splits the input text on whitespaces.

  • NON_ALPHANUM splits the input text on all non-alphanumeric characters.

  • REGEX splits the input text using the Token Separator Regex regular expression.

Token Separator Regex

No

String

Java regular expression pattern used to tokenize the input text when Tokenize On is set to REGEX. Any subsequence matching this pattern is considered a separator. For example, "\r?\n" tokenizes the input text on the line terminators.

Match/Replace Mode

No

String

Defines how tokens are matched against the "from" values and replaced with the "to" values defined in the dictionary:

  • EXACT_STRING (default) matches tokens exactly against the "from" string and replaces them with the corresponding "to" string.

  • WILDCARDS assumes that the "from" string contains a wildcard pattern to match ("?" representing one character, "*" representing any number of characters).

  • REGEX matches tokens against the "from" Java regular expression and replaces them with the "to" value, supporting capture group replacement.

Outputs Configuration
The following parameters define how the outputs are rendered.

Transformed Text Separator

No

String

Token separator in the transformed text. The default value is a space (" ").

Sort Replaced Tokens

No

Boolean

Set to true to sort the replaced tokens alphabetically in the transformed text. Default value is false. Note that sorting is case-insensitive.

Return Input Token List if Null

No

Boolean

Set to true (default) to return the original token list as the transformed text when all tokens are replaced by nulls.

Matched Token List Separator

No

String

Token separator in the Matched Tokens list. The default value is a space (" ").

Sort Matched Tokens

No

Boolean

Set to true to sort the Matched Tokens list alphabetically. Default value is false. Note that sorting is case-insensitive.

Max. Matched Tokens

No

String

Maximum number of tokens to return in the Matched Tokens list. Defaults to no limit.

Plugin inputs

The following table lists the plugin inputs.

Input name | Mandatory | Type | Description
Input Text | Yes | String | Text to transform

Plugin outputs

The following table lists the plugin outputs.

Output name | Type | Description
Transformed Text | String | Transformed text with the tokens replaced
Matched Tokens | String | The list of tokens found and replaced