Functions and other constructs

SemQL supports built-in or customized functions within expressions and conditions, enabling them to return a value.

Functions differ from comparison operators as they return a non-boolean value. They cannot be used as is in conditions unless used with a comparison operator. For example, TO_CHAR(IsValidCustomer) is a valid expression, but not a valid condition. TO_CHAR(IsValidCustomer)='1' is a valid condition.

Built-in functions

The functions available in Semarchy xDM include functions in the following categories:

  • Strings

  • Comparison

  • Conversion

  • Date and time

  • Matching

  • Miscellaneous

  • Null management

  • Numeric

For the complete set of built-in functions with their description, see SemQL function list.

When using a function, Semarchy xDM executes it with the connection information of the data location’s datasource:

  • For Oracle, the function is looked for in the default schema accessed by this connection. If the function does not exist in that schema, then the public synonym for the function is used.

  • For PostgreSQL, the function is looked for in the search path of the user configured for the JNDI datasource. This will determine where the function is called from.

  • For SQL Server, a function called in SemQL is automatically prefixed with dbo.

Useful and noteworthy functions

The following list contains noteworthy functions and expressions:

  • TO_CHAR, TO_DATE, and TO_NUMBER to perform conversions across data types.

  • TRIM, LTRIM, RTRIM, PAD LPAD, and RPAD to trip or pad with blanks.

  • SUBSTR to retrieve a part of a string.

  • REPLACE and REGEXP_REPLACE to replace parts of a string.

  • INSTR to find the location of a substring in a string.

  • NULLIF, COALESCE, and NVL to handle null values.

  • GREATEST and LEAST to return the greatest and least of a list of expressions.

  • SYSDATE to retrieve the system date.

Combining functions to extract a substring (PostgreSQL)

Consider a StoreLocation attribute containing values such as '5433 - Midtown'. To extract the 'Midtown' store name, use the following function combination in an enricher:

SUBSTR(StoreLocation, STRPOS(StoreLocation, ' - ') + 3)

Functions for matching

Certain functions are key in a fuzzy matching process.

The functions below can be used for normalizing or transforming values to reduce noise during fuzzy matching:

  • UPPER, LOWER, and INITCAP absorb the case-sensitivity differences in strings.

  • SOUNDEX, METAPHONE, and DMETAPHONE return phonetic representations (phonetization) of strings, absorbing typos.

  • SEM_NORMALIZE returns a string with non-ASCII characters transformed to ASCII-equivalent or a blank.

Soundex is not recommended as a general-purpose method for phonetizing strings. Phonetization methods such as CAVERPHONE or METAPHONE for person names and METAPHONE or REFINEDSOUNDEX for organization names give better results for matching. These methods are available as functions for certain databases, and in the Text Normalization and Transliteration plugin.

The functions below implement fuzzy-matching capabilities:

  • SEM_EDIT_DISTANCE and SEM_EDIT_DISTANCE_SIMILARITY respectively return the distance and percentage of similarity between two strings according to the Levenshtein distance algorithm.

  • SEM_JARO_WINKLER and SEM_JARO_WINKLER_SIMILARITY respectively return the distance and percentage of similarity between two strings according to the Jaro-Winkler distance algorithm.

  • SEM_NGRAMS_SIMILARITY returns the percentage of similarity of two strings according to the Dice’s coefficient similarity measure applied to the n-grams of the strings.

The *_SIMILARITY functions return a value between 0 (no match) and 100 (perfect match). If one or both strings are null, the returned value is 0.

With Oracle and PostgreSQL data locations, matching functions rely on database native capabilities. For SQL Server, matching functions rely on Transact-SQL implementations, which do not provide the same performances as native capabilities.

For large data volumes, it is recommended to use third-party common language runtime (CLR) implementations of these functions for better performance. For example the Fastenshtein implementation of the Levenshtein algorithm. These functions must be installed in the SQL Server instance, and then declared or used as custom functions.

Custom functions

SemQL allows to use custom database functions implemented in the database instance that hosts the hub.

These functions must be declared in the model to appear in the list of functions. For more information on how to declare customized functions, see Database functions and procedures.

Functions that are not declared can still be used in SemQL, but will not be recognized by the SemQL parser and will trigger validation warnings.

Call these functions as regular functions by prefixing them with their schema and optionally their package name: <schema>.<package>.<function>.

Call a CUSTFUNC() function, stored in a CUST001 package, in a COMMON_TOOLS schema
COMMON_TOOLS.CUST001.CUSTFUNC(CustomerName)
The database user of the schema hosting the hub must have sufficient privileges to execute customized functions.

Database functions use the database engine to process data. For complex processing involving algorithms, libraries, or services that are not easily implemented with the database capabilities, it is advised to opt for the Java plugin or REST client option.

Other constructs

CASE expression

The CASE expression selects a result from one or more options and returns this result.

This syntax returns the first result for which the expression matches the selector. If none match, the default result is returned.

CASE selector
    WHEN expression_1 THEN result_1
    ...
    WHEN expression_n THEN result_n
    [ELSE default_result]
END

This syntax returns the first result for which the condition is true. If none is true, the default result is returned.

CASE
    WHEN condition_1 THEN result_1
    ...
    WHEN condition_n THEN result_n
    [ELSE default_result]
END
Enricher expression to transform the CustomerName attribute according to the records' publisher
CASE PublisherID
    WHEN 'CRM' THEN Upper(CustomerName)
    WHEN 'MKT' THEN Upper(Replace(CustomerName, '-', ' '))
    ELSE CustomerName
END
Same as above, with the second syntax
CASE
    WHEN PublisherID='CRM' THEN Upper(CustomerName)
    WHEN PublisherID='MKT' THEN Upper(Replace(CustomerName, '-', ' '))
    ELSE CustomerName
END

Table functions

SemQL supports searching for an expression’s value in the values returned by a table function, using the following syntax:

expression IN table_function(parameter_1, parameter_2 ...)

The following condition uses a table function named SEARCH_FOR_IDS that returns a list of IDs from a customer name:

CUSTOMER_ID in SEARCH_FOR_IDS(CUSTOMER_NAME)