Configure the database schemas

This page explains how to configure databases and schemas for the xDM repository and data locations.

For Semarchy xDM to install and boot properly, the following four elements must be set on your database:

  • A repository storage—that is, internal storage used by Semarchy xDM to store the design-time and run-time information (recommended name: semarchy_repository)

  • A repository user—that is, a user to log in to the repository storage with write privileges (recommended name: semarchy_repository).

  • A repository read-only user—that is, a user to log in to the repository storage with read-only privileges (recommended name: semarchy_repository_readonly). It is used by Semarchy xDM Discovery’s built-in profile dashboards.

  • At least one data location storage that includes the database tables and other objects generated from the model edition of your application (e.g., customer_mdm, product_data_location).

Configure the repository storage

Repository storage

Before installing Semarchy xDM, you must create storage for the repository. This can be done either manually or using your database administration interface. Below are sample scripts for creating storage. Make sure to adapt the script according to your database configuration.

Oracle

Create the repository schema
CREATE USER <repository_user> IDENTIFIED BY <repository_password>
 DEFAULT TABLESPACE USERS TEMPORARY TABLESPACE TEMP;
GRANT CONNECT, RESOURCE TO <repository_user>;

-- Use the following command for Oracle 12c and above
GRANT UNLIMITED TABLESPACE TO <repository_user>;

PostgreSQL

Create a database and the repository schema
-- Create a database for the repository and data locations
CREATE DATABASE <postgresql_database_name> WITH ENCODING 'UTF8';

-- Disconnect and then reconnect using one of the following options:
-- the JDBC URL: jdbc:postgresql://<host>:<port>/<postgresql_database_name>
-- psql with the following command: psql -U postgres <postgresql_database_name>

CREATE SCHEMA extensions;
GRANT USAGE ON SCHEMA extensions TO PUBLIC;
ALTER DEFAULT PRIVILEGES IN SCHEMA extensions GRANT EXECUTE ON FUNCTIONS TO PUBLIC;
ALTER DATABASE <postgresql_database_name> SET SEARCH_PATH TO "$user",public,extensions;

CREATE EXTENSION IF NOT EXISTS "uuid-ossp"     with schema extensions;
CREATE EXTENSION IF NOT EXISTS "fuzzystrmatch" with schema extensions;

-- Create the repository user and schema
CREATE USER <repository_user> WITH PASSWORD '<repository_password>';

-- Use the following syntax for PostgreSQL 9
-- CREATE USER <repository_user> WITH UNENCRYPTED PASSWORD '<repository_password>';

-- The following command is required only for PostgreSQL running on Amazon RDS.
-- It grants access to the repository to the RDS superuser.
-- GRANT <repository_user> TO <rds_superuser_name>

CREATE SCHEMA <repository_user> AUTHORIZATION <repository_user>;

SQL Server

Create a repository database, login, and user
-- Create a database for the repository
CREATE DATABASE <repository_database_name>
GO

-- Configure the database
ALTER DATABASE <repository_database_name> SET READ_COMMITTED_SNAPSHOT ON;
GO

ALTER DATABASE <repository_database_name> SET QUOTED_IDENTIFIER ON;
GO

-- Create a login to connect the database
CREATE LOGIN <repository_user> WITH PASSWORD='<repository_password>', DEFAULT_DATABASE=<repository_database_name>
GO

-- Add a user for that login in the database
USE <repository_database_name>
GO

CREATE USER <repository_user> FOR LOGIN <repository_user>
GO

-- Make this user database owner
ALTER ROLE db_owner ADD MEMBER <repository_user>
GO
Store the values of the <repository_user>, <repository_password>, and <repository_database_name> (for SQL Server), as you will need them later for the startup configuration.

Repository read-only user

In addition to the repository storage and credentials with full access to the repository schema, a user with limited access to the repository must be created.

This user is used by xDM Discovery’s built-in profile dashboards and should have at least read-only access to the tables in the repository schema that stores the profiling metrics. These tables have the PRF_ prefix.

Since the repository tables are not created yet, you must run the GRANT statements for this user on the repository tables after creating the repository. These grants are also provided in the post-installation steps.

Oracle

Create the repository read-only user
-- Create the repository read-only user
CREATE USER <repository_readonly_user>
 IDENTIFIED BY <repository_readonly_password>
 DEFAULT TABLESPACE USERS TEMPORARY TABLESPACE TEMP;

-- Grant minimum access to the user
GRANT CONNECT TO <repository_readonly_user>;

-- Run the following commands after the repository creation.
-- Grant select privileges on the profiling tables
GRANT SELECT ON <repository_user>.PRF_PROFILING TO <repository_readonly_user>;
GRANT SELECT ON <repository_user>.PRF_TABLE TO <repository_readonly_user>;
GRANT SELECT ON <repository_user>.PRF_COLUMN TO <repository_readonly_user>;
GRANT SELECT ON <repository_user>.PRF_DIST_VALS TO <repository_readonly_user>;
GRANT SELECT ON <repository_user>.PRF_DIST_PATTERNS TO <repository_readonly_user>;

-- The <repository_readonly_user> user must use the following statement to default their schema to the repository.
-- ALTER SESSION SET CURRENT_SCHEMA = <repository_user>;

PostgreSQL

Create the repository read-only user
-- Create the repository read-only user
CREATE USER <repository_readonly_user> WITH PASSWORD '<repository_readonly_password>';

-- Use the following syntax for PostgreSQL 9
-- CREATE USER <repository_readonly_user> WITH UNENCRYPTED PASSWORD '<repository_readonly_password>';

GRANT CONNECT ON DATABASE <postgresql_database_name> to <repository_readonly_user>;

-- Set the search path to include the repository
ALTER ROLE <repository_readonly_user> SET SEARCH_PATH TO "$user", <repository_user>,public,extensions;

-- Run the following commands after the repository creation
-- Grant select privileges on the profiling tables
GRANT USAGE ON SCHEMA <repository_user> TO <repository_readonly_user>;
GRANT SELECT ON TABLE
    <repository_user>.PRF_PROFILING,
    <repository_user>.PRF_TABLE,
    <repository_user>.PRF_COLUMN,
    <repository_user>.PRF_DIST_VALS,
    <repository_user>.PRF_DIST_PATTERNS
TO <repository_readonly_user>;

SQL Server

Create the repository read-only user
-- Create the repository read-only user
CREATE LOGIN <repository_readonly_user> WITH PASSWORD='<repository_readonly_password>', DEFAULT_DATABASE=<repository_database_name>
GO

-- Add a user for that login in the database
USE <repository_database_name>
GO

CREATE USER <repository_readonly_user> FOR LOGIN <repository_readonly_user>;
GO

-- Grant minimum access to the user
GRANT CONNECT TO <repository_readonly_user>;

-- Run the following commands after the repository creation
-- Grant select privileges on the profiling tables
GRANT SELECT ON PRF_PROFILING TO <repository_readonly_user>;
GRANT SELECT ON PRF_TABLE TO <repository_readonly_user>;
GRANT SELECT ON PRF_COLUMN TO <repository_readonly_user>;
GRANT SELECT ON PRF_DIST_VALS TO <repository_readonly_user>;
GRANT SELECT ON PRF_DIST_PATTERNS TO <repository_readonly_user>;
Store the values of the <repository_readonly_user> and <repository_readonly_password>, as you will need them later for the startup configuration.

Configure the data location storage

Data locations' databases or schemas do not necessarily have to be created at the time of installation, but it is recommended to anticipate their creation as part of the installation and configuration efforts. You can create them manually or use your database administration interface for this purpose.
Below are sample scripts for creating a data location database or schema. Make sure to adapt this script to your database configuration and duplicate it to create storage for all data locations.

Oracle

Create a data location schema
CREATE USER <data_location_user_name> IDENTIFIED BY <data_location_user_password>
 DEFAULT TABLESPACE USERS TEMPORARY TABLESPACE TEMP;
GRANT CONNECT, RESOURCE, CREATE VIEW TO <data_location_user_name>;

-- Use the following command for Oracle 12c and above
GRANT UNLIMITED TABLESPACE TO <data_location_user_name>;

-- Grant data-loading privileges to the data location owner
GRANT EXECUTE ON <repository_database_schema>.INTEGRATION_LOAD TO <data_location_user_name>;
The CREATE VIEW privilege is required to deploy database views.

PostgreSQL

Create a data location schema
CREATE USER <data_location_user_name> WITH PASSWORD '<data_location_user_password>';

-- Use the following syntax for PostgreSQL 9
-- CREATE USER <data_location_user_name> WITH UNENCRYPTED PASSWORD '<data_location_user_password>';

-- The following command is required only for PostgreSQL running on Azure or Amazon RDS.
-- It grants access to the data location to the Azure or RDS superuser.
-- GRANT <data_location_user_name> TO <azure|rds_superuser_name>

CREATE SCHEMA <data_location_user_name> AUTHORIZATION <data_location_user_name>;

-- Grant data-loading privileges to the data location owner
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA <repository_schema> TO <data_location_user_role>;

Snowflake (preview)

Snowflake support in xDM is a preview feature. It is not suitable for production environments. Use this feature for testing and evaluation purposes only.
Create a data location database, login, and user
-- Create a role for managing access privileges
CREATE OR REPLACE ROLE <data_location_user_role>;

-- Create the data location schema
CREATE OR REPLACE SCHEMA <database_schema_name>;

-- Grant all privileges on the schema to the specified role
GRANT ALL ON SCHEMA <database_schema_name> TO ROLE <data_location_user_role>;

-- Create a user with a specified password and other default settings
CREATE USER <data_location_user_name>
PASSWORD = '<password>'
TYPE = LEGACY_SERVICE   -- Specifies that this is a service account
DEFAULT_WAREHOUSE = <warehouse_name>
DEFAULT_ROLE = <data_location_user_role>
DEFAULT_NAMESPACE = <data_location_database_name>.<database_schema_name>
-- (Optional) Set the user's timezone (default is 'America/Los_Angeles')
TIMEZONE = '<timezone name>';   -- Adjust to the desired timezone if necessary

-- Assign the role to the created user and grant them the role's privileges
GRANT ROLE <data_location_user_role> TO USER <data_location_user_name>;

-- Grant usage privileges on the database to the role and allow access to the database
GRANT USAGE ON <data_location_database_name> TO ROLE <data_location_user_role>;

SQL Server

Create a data location database, login, and user
-- Create a database for the data location
CREATE DATABASE <data_location_database_name>
GO

-- Configure the database
ALTER DATABASE <data_location_database_name> SET READ_COMMITTED_SNAPSHOT ON;
GO

ALTER DATABASE <data_location_database_name> SET QUOTED_IDENTIFIER ON;
GO

-- Create a login to connect the database
CREATE LOGIN <data_location_user_name> WITH PASSWORD='<data_location_user_password>', DEFAULT_DATABASE=<data_location_database_name>
GO

-- Add a user for that login in the database
USE <data_location_database_name>
GO

CREATE USER <data_location_user_name> FOR LOGIN <data_location_user_name>
GO

-- Make this user database owner
ALTER ROLE db_owner ADD MEMBER <data_location_user_name>
GO

-- Grant data-loading privileges to the data location owner
USE <repository_database_name>;
GRANT EXECUTE ON SCHEMA :: dbo TO <data_location_user_name>;
Store the values of the <data_location_user_name>, <data_location_user_password>, and <data_location_database> (for SQL Server), as you will need them later for creating the data location datasource.

Database-specific considerations

General considerations

Configure the database with a charset that supports all languages, such as AL32UTF8 for Oracle and UTF8 for PostgreSQL. Semarchy xDM uses specific characters for storing internal information. Applications in Semarchy xDM natively support multilingual data without preventing users from entering accented characters (or Cyrillic, Arabic, or Chinese characters). A database configured for a specific language or with a limited charset may not function optimally with Semarchy xDM.

Oracle

Repositories and data locations should be located in separate schemas. However, they do not necessarily need to be located in the same database.

Semarchy xDM ships with an Oracle JDBC driver (ojdbc8.jar) for Oracle Database 12c Release 2 (12.2.x). This driver is strongly recommended for all recent database versions. If you are using an older Oracle version (11g), it is recommended to review the compatibility of this driver with your Oracle database version and possibly install an older driver version instead (ojdbc6 or ojdbc7).

To increase the maximum string length from 4,000 to 32,767 bytes on an Oracle database, platform administrators can set the MAX_STRING_SIZE parameter to EXTENDED in the Oracle database settings, and add com.semarchy.mdm.supportExtendedMaxStringSize to xDM system properties.

Once set to EXTENDED, the MAX_STRING_SIZE parameter cannot be set back to STANDARD. Likewise, the com.semarchy.mdm.supportExtendedMaxStringSize property cannot be reverted to false once it has been set to true.

PostgreSQL

Repositories and data locations should be located in separate schemas.

Snowflake (preview)

Snowflake support in xDM is a preview feature. It is not suitable for production environments. Use this feature for testing and evaluation purposes only.

Hosting recommendations

xDM does not support the installation of the repository on Snowflake; it can only use Snowflake to host data locations. Therefore, the xDM repository and data location should be hosted on separate databases.

For better performance, we recommend deploying xDM within the same cloud provider region as your Snowflake database. Additionally, optimizing network routes can help reduce latency and enhance data transfer performance between xDM and Snowflake.

Hybrid tables

The deployment of xDM requires the use of hybrid tables, enabled by Snowflake Unistore. During the model deployment process, hybrid tables are automatically generated by xDM in the designated Snowflake data location.

Timezone considerations

By default, Snowflake uses the America/Los_Angeles timezone. However, to ensure consistency between xDM and your Snowflake data location, the timezone should match that of the xDM server.

When creating users, include the appropriate timezone setting in the script. For example, if the xDM server is configured to use UTC, add TIMEZONE='UTC' to your Snowflake user creation script.

SQL Server

The configuration described above uses logins defined within the database. You can modify this script to accomodate Windows or Active Directory logins as needed.

xDM does not support schemas for SQL Server instances. A separate database is necessary for each repository and data location, and each user accessing these databases should have the database owner role.

For the following reasons, exercise caution when configuring the database instance’s collation:

  • Collation determines the code page, case sensitivity (CS/CI), and accent sensitivity (AS/AI). It significantly affects comparison functions, ORDER BY clauses, and execution performance. During comparisons, any character outside the defined collation code page is treated as an "unknown character." Since two different unknown characters are consistently considered distinct, this may cause unexpected complications during comparisons.

  • To ensure support for all characters you intend to use or store in the hub, the collation for the repository’s hosting database should be selected judiciously. Semarchy xDM internally utilizes the following special characters (identified by their Unicode number): £ (U+00A3), $ (U+0024), ¤ (U+00A4), • (U+2022). These characters should be also supported by the collation.

Consider using a UTF-8 collation if supported by your SQL Server version.
  • It is recommended to select a case-sensitive collation for the repository and data locations database. Using a case-insensitive collation may cause unexpected results in search operations.

the SEM_NORMALIZE function used is made collation-proof and forces the Latin1_General_100_CS_AS_KS_WS_SC collation.

Additionally, SQL Server repository and data location databases should be configured as follows for Semarchy xDM:

  • Set QUOTED_IDENTIFIER to ON to force SQL Server to follow the ISO rules for identifiers and literal values quoting, using the following command:

    ALTER DATABASE <database_name> SET QUOTED_IDENTIFIER ON;
  • Set READ_COMMITTED_SNAPSHOT to ON to enable connections to access the previous (committed) version of the records being modified, rather than of waiting for the records to be unlocked. This can be done using the following command:

    ALTER DATABASE <database_name> READ_COMMITTED_SNAPSHOT ON;

Size and maintain the databases and schemas

Repository

The following considerations should be taken into account when sizing repository databases and schemas:

  • The repository’s overall space is distributed across several components, including model-related data—​whose volume is relatively modest and depends on the model size; the execution log—​which includes data on jobs, job instances, and steppers; and data from the deployment history and model editions, encompassing even closed model editions.

  • Determining the storage size for a repository involves considering factors such as the number of batches processed, the volume of data generated during model executions, and the dynamic nature of development activities. As a standard recommendation, we suggest allocating a minimum of 20 GB of disk space for the repository schema. While hosting models, applications, and dashboard metadata may not require that much space initially, it allows for proper monitoring setup and potential future expansion. Moreover, real-world observations show that, over time, repositories often exceed this size significantly, especially with continuous development activities. Therefore, it is advisable to consider a more substantial storage allocation to accommodate potential growth.

  • To maintain the execution log at a manageable volume, it is recommended to perform regular purges. Configuring sensible retention policies within models and scheduling periodic data purges in data locations helps control the repository’s overall size.

Data location

When sizing the data location databases or schemas, the following considerations should be taken into account:

  • For each entity of the deployed model, the data location contains several tables respresenting data at various stages of the certification process. Some of these tables also store the history of previous process iterations.

  • The required volume in the data location depends on the following factors:

    • The number of entities.

    • The number of source records pushed for these entities.

    • The number of new or updated source records pushed for these entities.

The recommended original sizing involves adding the source data volume pushed for each entity by all publishers, along with one instance of data authoring (the overall input), and multiply it by a factor of 10. After establishing the initial sizing, it is recommended to regularly monitor the size of the data location during normal operations and adjust the sizing as needed.

The same sizing guidelines apply to both the data and temporary tablespaces or filegroups when considering data locations, as the database engine handles most of the processing effort during the certification process.

You can create data retention policies to specify the volume of data to retain in the data locations, and schedule data purges to initiate the pruning of unnecessary data.