Semarchy Text enricher
The Semarchy Text enricher applies normalization, transliteration and phonetic transformations to text strings.
Description
This enricher applies normalization, transliteration and phonetic transformations to text strings. It takes an Input Text and applies an Input Filter to this text, for example to remove all characters but letters. Then it applies a series of transformations defined in the Transformation parameter and returns a Transformed Text.
This plugin is thread-safe and supports parallel execution.
Plugin parameters
The following table lists the plugin parameters.
Parameter name | Mandatory | Type | Description |
---|---|---|---|
Input Filter | No | String | Filter applied to the input text before the transformation. Valid values are NONE, LETTERS and STANDARD. See the Input filters section for a description of each filter. |
Transformation | Yes | String | A pipe-separated sequence of transformation definitions. Transformations include NORMALIZE, PHONETIC, TRANSLITERATE, BEIDERMORSE and DOUBLEMETAPHONE. See the Transformations section for a detailed description of each transformation. |
Synonyms Separator | No | String | Separator used between the synonyms returned by the enricher. Default value is a pipe (|). |
Plugin inputs
The following table lists the plugin inputs.
Input name | Mandatory | Type | Description |
---|---|---|---|
Input Text | Yes | String | Text to transform. |
Plugin outputs
The following table lists the plugin outputs.
Output name | Type | Description |
---|---|---|
Transformed Text | String | Filtered and transformed text. |
Secondary Transformed Text | String | Secondary transformed text. This text may contain the secondary result of a Beidermorse or Double Metaphone transformation. See Other transformations for more information. |
Input filters
The following input filters are supported by the enricher:
- NONE: No filter is applied to the input text.
- LETTERS: Removes all non-letter characters from the input text.
- STANDARD: Breaks words in the input text according to the rules of the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
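As an illustration of the NONE and LETTERS filters, the following Python sketch shows one way such a filter could behave. It is not the enricher's implementation: the function name `apply_input_filter` is hypothetical, "letter" is assumed to mean any Unicode letter, and the STANDARD filter is left out because Unicode word segmentation is beyond a short example.

```python
def apply_input_filter(text: str, input_filter: str) -> str:
    """Illustrative sketch of the NONE and LETTERS input filters.

    Assumption: a "letter" is any character for which str.isalpha() is true.
    """
    name = input_filter.strip().upper()
    if name == "NONE":
        # No filter: the text is returned unchanged.
        return text
    if name == "LETTERS":
        # Drop everything that is not a letter (digits, spaces, punctuation).
        return "".join(ch for ch in text if ch.isalpha())
    raise ValueError(f"Filter not covered by this sketch: {input_filter}")
```

With this sketch, `apply_input_filter("O'Brien-Smith 42", "LETTERS")` keeps only the letters of the name, while the NONE filter leaves the text as it is.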
Transformations
The following transformation definitions are supported by the enricher:
- Normalization
  - NORMALIZE: Performs a normalization.
- Phonetic transformation
  - PHONETIC [SOUNDEX | REFINEDSOUNDEX | METAPHONE [<max_code_length>] | DOUBLEMETAPHONE [<max_code_length>] | CAVERPHONE | CAVERPHONE1 | NYSIIS | MRA | COLOGNE | BEIDERMORSE]: Applies a phonetic transformation.
- Other transformations
  - BEIDERMORSE [<split>] [<rule_type>] [<max_phonems>] [<name_type>]
  - DOUBLEMETAPHONE [<max_code_length>] [<split>]
- Transliteration
  - TRANSLITERATE [<ID>]: Applies a transliteration to the string. The transliteration is identified by an ID. If no ID is provided, the Any-Latin transliteration is used.
It is possible to sequence transformations. Successive transformations are separated by a pipe (|) sign.
Examples of transformations:
- Normalize and apply Phonetic Soundex:
  NORMALIZE | PHONETIC SOUNDEX
- Normalize and then transliterate to Latin script:
  NORMALIZE | TRANSLITERATE Any-Latin
- Normalize, transliterate to Latin script and then apply Metaphone with a maximum resulting length of 5 characters:
  NORMALIZE | TRANSLITERATE Any-Latin | PHONETIC METAPHONE 5
- Perform an approximate BEIDERMORSE transformation on generic name types, returning at most 10 synonyms without splitting them into the secondary output:
  BEIDERMORSE FALSE APPROX 10 GENERIC
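To make the structure of the Transformation parameter concrete, the following Python sketch splits a pipe-separated sequence into ordered steps, each with a keyword and its arguments. The function name `parse_transformations` is hypothetical and the parsing rules are an assumption based on the examples above, not the enricher's actual parser.

```python
from typing import List, Tuple


def parse_transformations(spec: str) -> List[Tuple[str, List[str]]]:
    """Split a transformation sequence into (keyword, arguments) steps.

    Assumption: steps are separated by "|" and the words inside a step
    are separated by whitespace.
    """
    steps = []
    for part in spec.split("|"):
        words = part.split()
        if not words:
            continue  # ignore empty segments such as a trailing pipe
        steps.append((words[0].upper(), words[1:]))
    return steps
```

For the sequence `NORMALIZE | TRANSLITERATE Any-Latin | PHONETIC METAPHONE 5`, the sketch yields three steps in order: NORMALIZE with no argument, TRANSLITERATE with the ID Any-Latin, and PHONETIC with the arguments METAPHONE and 5.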
Normalization
The NORMALIZE transformation normalizes the string by applying a series of foldings that map similar characters to a common target, so that certain distinctions between similar characters are ignored. This includes accent removal, case folding, and so on.
Examples of transformations:
Original Text | Normalized Text | Comments |
---|---|---|
‒ – — ― | - - - - | Four different dashes converted to four identical hyphens |
AbSoLuteLy TRUE | absolutely true | Case folding |
… | ... | Ellipsis character converted to three dots |
½ Tsp | 1/2 tsp | Symbol folding |
Æsop | aesop | |
Äsop | asop | |
Dürst | durst | |
Encyclopædia | encyclopaedia | |
œuvre | oeuvre | |
poſt | post | |
résumé français | resume francais | Accent removal and case folding |
Straße | strasse | |
٣ is a magic number | 3 is a magic number | Native digit folding |
The complete list of transformations is given below:
Accent removal | Hebrew Alternates folding | Overline folding | Suzhou Numeral folding |
Case folding | Jamo folding | Positional forms folding | Symbol folding |
Canonical duplicates folding | Letterforms folding | Small forms folding | Underline folding |
Dashes folding | Math symbol folding | Space folding | Vertical forms folding |
Diacritic removal (including stroke, hook, descender) | Multigraph Expansions: All | Spacing Accents folding | Width folding |
Greek letterforms folding | Native digit folding | Subscript folding | Han Radical folding |
For more information about these transformations, see UTR#30: Character Foldings.
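Two of the foldings listed above, accent removal and case folding, can be approximated with the Python standard library. The sketch below is only an illustration of the idea: the function name `fold_accents_and_case` is hypothetical, it covers a small subset of the foldings, and it does not reproduce the enricher's behavior for cases such as multigraph expansion (Æ to ae) or native digit folding.

```python
import unicodedata


def fold_accents_and_case(text: str) -> str:
    """Approximate accent removal and case folding with the standard library."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping the combining marks removes the accents.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a stronger form of lower-casing (for example ß becomes ss).
    return stripped.casefold()
```

Applied to `résumé français`, `Dürst` and `Straße`, this sketch produces the same results as the corresponding rows of the example table.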
Phonetic transformations
A phonetic transformation converts the string into a code that represents its pronunciation. The default phonetic transformation is PHONETIC METAPHONE.
Phonetic transformations include:
- PHONETIC SOUNDEX and PHONETIC REFINEDSOUNDEX: Phonetic algorithms for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. More information about Soundex.
- PHONETIC METAPHONE and PHONETIC DOUBLEMETAPHONE: Algorithms for indexing words by their English pronunciation. They are suitable for use with most English words, not just names. Double Metaphone can return both a primary and a secondary code for an input string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. These algorithms support a Max Code Length parameter, which defines the maximum length of the encoded result. This value defaults to 4. More details about Metaphone.
- PHONETIC CAVERPHONE and PHONETIC CAVERPHONE1: Algorithms for data matching for electoral rolls, optimized for accents present in parts of New Zealand. More details about Caverphone and Caverphone 1.
- PHONETIC NYSIIS: New York State Identification and Intelligence System (NYSIIS), which maps similar phonemes to the same letter. The result is a string that can be pronounced by the reader without decoding. More details about NYSIIS.
- PHONETIC MRA: Match Rating Approach developed by Western Airlines. This algorithm has an encoding and range comparison technique. More details about MRA.
- PHONETIC COLOGNE: Phonetic algorithm optimized for the German language. See Kölner Phonetik.
- PHONETIC BEIDERMORSE: Phonetic algorithm supporting greater accuracy in matching Slavic and Yiddish surnames with similar pronunciation but differences in spelling. It returns a list of tokens (separated by the string specified in the Synonyms Separator parameter): first the transformed input text, then the transformed synonyms of the input text. More information about Beidermorse.
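To show how a phonetic algorithm encodes homophones to the same representation, here is a minimal Python sketch of the classic American Soundex rules. It is illustrative only: the function name `soundex` is hypothetical, and the enricher's own Soundex implementation may differ in edge cases.

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    # Keep only the unaccented letters A to Z, in upper case.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = codes.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code to ""
        if len(result) == 4:
            break
    return result.ljust(4, "0")  # pad short codes with zeros
```

With this sketch, `Robert` and `Rupert` both encode to `R163`, so the two spellings can be matched on their code.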
Other transformations
These other transformations return a list of tokens which can be split into the Transformed Text and Secondary Transformed Text outputs.
These transformations should preferably be used at the end of the transformation sequence, as their secondary transformed text is not processed by subsequent transformations in the sequence.
Other transformations include:
- BEIDERMORSE [<split>] [<rule_type>] [<max_phonems>] [<name_type>]: The Beidermorse transformation returns a list of tokens: first the transformed input text, then the transformed synonyms of the input text. Beidermorse supports the following parameters:
  - split: If this parameter is set to true, all synonyms after the first one are concatenated in the Secondary Transformed Text output. If this parameter is set to false (default value), all synonyms are appended to the first token in the Transformed Text output.
  - rule_type: EXACT for exact or APPROX for approximate phonetic transformation.
  - max_phonems: The maximum number of synonyms returned. Default is 20.
  - name_type: Default value is GENERIC. Use ASHKENAZI or SEPHARDIC if you specifically want phonetic encodings optimized for Ashkenazi or Sephardic Jewish family names.
- DOUBLEMETAPHONE [<max_code_length>] [<split>]: This transformation encodes the input string with the Double Metaphone algorithm and returns a primary code and a secondary code. If split is set to true, then the secondary code is pushed to the Secondary Transformed Text output. Otherwise, it is concatenated to the primary code in the Transformed Text output.
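The effect of the split parameter on the two outputs can be sketched in Python as follows. This is an illustration under stated assumptions, not the enricher's code: the function name `route_tokens` is hypothetical, and joining tokens with the Synonyms Separator in both outputs is an assumption (the text above only states that the tokens are concatenated).

```python
from typing import List, Tuple


def route_tokens(tokens: List[str], split: bool, separator: str = "|") -> Tuple[str, str]:
    """Return (transformed_text, secondary_transformed_text) for a token list.

    Assumption: concatenated tokens are joined with the synonyms separator.
    """
    if not tokens:
        return "", ""
    if split:
        # First token goes to the main output, the rest to the secondary output.
        return tokens[0], separator.join(tokens[1:])
    # No split: every token ends up in the main output.
    return separator.join(tokens), ""
```

For a hypothetical token list `["a", "b", "c"]`, a split yields `a` in Transformed Text and `b|c` in Secondary Transformed Text, whereas without a split Transformed Text holds `a|b|c` and the secondary output stays empty.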
Transliteration
The TRANSLITERATE transformation converts text from one script to another. For example, Traditional to Simplified Chinese, Japanese Hiragana to Katakana, or Cyrillic to Latin script.
Each source/target transliteration is identified by an ID. The supported transliteration IDs are listed below. If no ID is provided, the Any-Latin transliteration is used.
Each ID represents a transliteration from one script/language to another. For example: Katakana-Latin, Latin-Thai, etc. The special tag Any stands for any script/language. For example, Any-Latin converts any input script to Latin script.
Accents-Any | Any-Name | Devanagari-Bengali | Han-Latin | Latin-Greek | Pinyin-NumericPinyin |
Amharic-Latin/BGN | Any-NFC | Devanagari-Gujarati | Han-Latin/Names | Latin-Greek/UNGEGN | pl_FONIPA-ja |
Any-Accents | Any-NFD | Devanagari-Gurmukhi | Hangul-Latin | Latin-Gujarati | pl-ja |
Any-am | Any-NFKC | Devanagari-Kannada | Hans-Hant | Latin-Gurmukhi | pl-pl_FONIPA |
Any-Arabic | Any-NFKD | Devanagari-Latin | Hant-Hans | Latin-Han | Publishing-Any |
Any-Armenian | Any-Null | Devanagari-Malayalam | Hebrew-Latin | Latin-Hangul | ro_FONIPA-ja |
Any-Bengali | Any-Oriya | Devanagari-Oriya | Hebrew-Latin/BGN | Latin-Hebrew | ro-ja |
Any-Bopomofo | Any-pl_FONIPA | Devanagari-Tamil | Hex-Any | Latin-Hiragana | ro-ro_FONIPA |
Any-CaseFold | Any-Publishing | Devanagari-Telugu | Hex-Any/C | Latin-Jamo | ru-ja |
Any-cs_FONIPA | Any-Remove | Digit-Tone | Hex-Any/Java | Latin-Kannada | ru-zh |
Any-Cyrillic | Any-ro_FONIPA | es_419-ja | Hex-Any/Perl | Latin-Katakana | Russian-Latin/BGN |
Any-Devanagari | Any-ru | es_419-zh | Hex-Any/Unicode | Latin-Malayalam | Serbian-Latin/BGN |
Any-es_419_FONIPA | Any-sk_FONIPA | es_FONIPA-am | Hex-Any/XML | Latin-NumericPinyin | Simplified-Traditional |
Any-es_FONIPA | Any-Syriac | es_FONIPA-es_419_FONIPA | Hex-Any/XML10 | Latin-Oriya | sk_FONIPA-ja |
Any-FCC | Any-Tamil | es_FONIPA-ja | Hiragana-Katakana | Latin-Syriac | sk-ja |
Any-FCD | Any-Telugu | es_FONIPA-zh | Hiragana-Latin | Latin-Tamil | sk-sk_FONIPA |
Any-Georgian | Any-Thaana | es-am | IPA-XSampa | Latin-Telugu | Syriac-Latin |
Any-Greek | Any-Thai | es-es_FONIPA | it-am | Latin-Thaana | Tamil-Bengali |
Any-Greek/UNGEGN | Any-Title | es-ja | it-ja | Latin-Thai | Tamil-Devanagari |
Any-Gujarati | Any-Upper | es-zh | ja_Latn-ko | Macedonian-Latin/BGN | Tamil-Gujarati |
Any-Gurmukhi | Any-zh | Fullwidth-Halfwidth | ja_Latn-ru | Malayalam-Bengali | Tamil-Gurmukhi |
Any-Han | Arabic-Latin | Georgian-Latin | Jamo-Latin | Malayalam-Devanagari | Tamil-Kannada |
Any-Hangul | Arabic-Latin/BGN | Georgian-Latin/BGN | JapaneseKana-Latin/BGN | Malayalam-Gujarati | Tamil-Latin |
Any-Hans | Armenian-Latin | Greek-Latin | Kannada-Bengali | Malayalam-Gurmukhi | Tamil-Malayalam |
Any-Hant | Armenian-Latin/BGN | Greek-Latin/BGN | Kannada-Devanagari | Malayalam-Kannada | Tamil-Oriya |
Any-Hebrew | ASCII-Latin | Greek-Latin/UNGEGN | Kannada-Gujarati | Malayalam-Latin | Tamil-Telugu |
Any-Hex | Azerbaijani-Latin/BGN | Gujarati-Bengali | Kannada-Gurmukhi | Malayalam-Oriya | Telugu-Bengali |
Any-Hex/C | Belarusian-Latin/BGN | Gujarati-Devanagari | Kannada-Latin | Malayalam-Tamil | Telugu-Devanagari |
Any-Hex/Java | Bengali-Devanagari | Gujarati-Gurmukhi | Kannada-Malayalam | Malayalam-Telugu | Telugu-Gujarati |
Any-Hex/Perl | Bengali-Gujarati | Gujarati-Kannada | Kannada-Oriya | Maldivian-Latin/BGN | Telugu-Gurmukhi |
Any-Hex/Plain | Bengali-Gurmukhi | Gujarati-Latin | Kannada-Tamil | Mongolian-Latin/BGN | Telugu-Kannada |
Any-Hex/Unicode | Bengali-Kannada | Gujarati-Malayalam | Kannada-Telugu | Name-Any | Telugu-Latin |
Any-Hex/XML | Bengali-Latin | Gujarati-Oriya | Katakana-Hiragana | NumericPinyin-Latin | Telugu-Malayalam |
Any-Hex/XML10 | Bengali-Malayalam | Gujarati-Tamil | Katakana-Latin | NumericPinyin-Pinyin | Telugu-Oriya |
Any-Hiragana | Bengali-Oriya | Gujarati-Telugu | Kazakh-Latin/BGN | Oriya-Bengali | Telugu-Tamil |
Any-ja | Bengali-Tamil | Gurmukhi-Bengali | Kirghiz-Latin/BGN | Oriya-Devanagari | Thaana-Latin |
Any-Kannada | Bengali-Telugu | Gurmukhi-Devanagari | Korean-Latin/BGN | Oriya-Gujarati | Thai-Latin |
Any-Katakana | Bopomofo-Latin | Gurmukhi-Gujarati | Latin-Arabic | Oriya-Gurmukhi | Tone-Digit |
Any-ko | Bulgarian-Latin/BGN | Gurmukhi-Kannada | Latin-Armenian | Oriya-Kannada | Traditional-Simplified |
Any-Latin (default) | cs_FONIPA-ja | Gurmukhi-Latin | Latin-ASCII | Oriya-Latin | Turkmen-Latin/BGN |
Any-Latin/BGN | cs_FONIPA-ko | Gurmukhi-Malayalam | Latin-Bengali | Oriya-Malayalam | Ukrainian-Latin/BGN |
Any-Latin/Names | cs-cs_FONIPA | Gurmukhi-Oriya | Latin-Bopomofo | Oriya-Tamil | Uzbek-Latin/BGN |
Any-Latin/UNGEGN | cs-ja | Gurmukhi-Tamil | Latin-Cyrillic | Oriya-Telugu | XSampa-IPA |
Any-Lower | cs-ko | Gurmukhi-Telugu | Latin-Devanagari | Pashto-Latin/BGN | zh_Latn_PINYIN-ru |
Any-Malayalam | Cyrillic-Latin | Halfwidth-Fullwidth | Latin-Georgian | Persian-Latin/BGN | |
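The structure of a transliteration ID, a source and a target separated by a hyphen, can be picked apart with a few lines of Python. The sketch below is illustrative: the function name `parse_transliteration_id` is hypothetical, and reading the part after a slash (for example BGN or UNGEGN) as a variant of the transliteration is an assumption based on the IDs listed above.

```python
from typing import Dict, Optional


def parse_transliteration_id(transliteration_id: str = "Any-Latin") -> Dict[str, Optional[str]]:
    """Split an ID such as "Greek-Latin/UNGEGN" into its parts.

    The default argument mirrors the documented default, Any-Latin.
    Assumption: the optional part after "/" names a variant.
    """
    pair, _, variant = transliteration_id.partition("/")
    source, _, target = pair.partition("-")
    return {"source": source, "target": target, "variant": variant or None}
```

For example, `Greek-Latin/UNGEGN` splits into the source Greek, the target Latin and the variant UNGEGN, while `Any-Latin` has the source Any, the target Latin and no variant.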