Integration jobs

Overview

An Integration Job is a job executed by Semarchy xDM to integrate and certify source data into golden records. This job uses the rules defined as part of the certification process, and contains a sequence of Tasks that run these rules. Each task addresses one entity and performs several processes (Enrichment, Validation, etc.) for that entity.

Once rules are defined in the model, one or more Integration Jobs can be defined for the model.
An integration job runs to perform the certification process, using the hub’s database engine for most of the processing (including SemQL processing) and Semarchy xDM for running plugin code.

Integration job triggers

Integration jobs are triggered to integrate data published in batch by data integration/ETL tools, or to process data managed by users in applications.

When pushing data into the hub, a data integration or ETL product performs the following steps (sketched in the example after this list):

  1. It requests a Load ID to identify the data load and initiate a transaction with Semarchy xDM.

  2. It loads data into the landing tables of Semarchy xDM, possibly from several sources identified as Publishers.

  3. It submits the load identified by the Load ID, providing the name of the Integration Job that must be executed to process the incoming data.
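
The following Python sketch illustrates this load lifecycle, assuming a PostgreSQL-backed hub and the GET_NEW_LOADID and SUBMIT_LOAD repository functions of the Semarchy xDM integration API. The connection settings, schema names, landing table, publisher code, and job name used below (semarchy_repository, semarchy_data, sd_customer, CRM, INTEGRATE_CUSTOMERS) are placeholders for illustration only; replace them with the names defined in your hub.

    import psycopg2  # PostgreSQL driver; adapt to your hub's database

    conn = psycopg2.connect("host=localhost dbname=hub user=etl_user password=etl")
    cur = conn.cursor()

    # 1. Request a Load ID to identify the data load and initiate the
    #    external load (data location, program name, description, user).
    cur.execute("""
        SELECT semarchy_repository.get_new_loadid(
                   'CustomerHubLocation',    -- data location name (placeholder)
                   'nightly_etl',            -- program name, for auditing
                   'Nightly customer load',  -- load description
                   'etl_user')               -- user initiating the load
    """)
    load_id = cur.fetchone()[0]

    # 2. Load records into the SD_ landing table of the target entity,
    #    tagging each row with the Load ID and the Publisher code.
    cur.execute("""
        INSERT INTO semarchy_data.sd_customer
            (b_loadid, b_classname, b_pubid, id, customer_name)
        VALUES (%s, 'Customer', 'CRM', %s, %s)
    """, (load_id, '1234', 'ACME Corp'))

    # 3. Submit the load, naming the integration job that must process
    #    the incoming data. The call returns the ID of the created batch.
    cur.execute(
        "SELECT semarchy_repository.submit_load(%s, 'INTEGRATE_CUSTOMERS', 'etl_user')",
        (load_id,))
    batch_id = cur.fetchone()[0]

    conn.commit()
    cur.close()
    conn.close()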

Similarly, when a user starts a stepper:

  1. A transaction, identified by a Load ID, is created and attached to the stepper instance.

  2. The user performs data authoring operations in the graphical user interface. All data manipulations are performed within this transaction; the user can save data in the transaction and resume the changes later.

  3. When the stepper is finished, the transaction is submitted. This triggers the Integration Job specified in the stepper definition.

Create integration jobs

To create a job:

  1. Right-click the Jobs node and select Add Job…. The Create New Job wizard opens.

  2. In the Create New Job wizard, check the Auto Fill option and then enter the following values:

    • Name: Internal name of the object.

    • Description: Optionally enter a description for the Job.

    • Queue Name: Name of the queue that will contain this job.

  3. Click Next.

  4. In the Tasks page, select the Available Entities you want to process in this job and click the Add >> button to add them to the Selected Entities.

  5. Click Finish to close the wizard. The Job editor opens.

  6. Select Tasks in the editor sidebar. The list of Tasks shows the entities involved in each task, as well as the processes (Enrichers, Matchers, etc.) that will run for these entities.

  7. Use the Move Up and Move Down buttons to order the tasks.

  8. To edit the processes involved in one task:

    1. Double-click the entity Name in the Tasks table. The editor switches to the Task editor.

    2. Select the process you want to enable for this task.

    3. Use the editor breadcrumb to go back to the Job editor.

  9. Press Control+S (or Command+S on macOS) to save the editor.

  10. Close the editor.

Configure job parameters

Jobs can be parameterized to customize or optimize their execution.

To change a job parameter:

  1. In the job editor, select Job Parameters in the editor sidebar.

  2. In the Job Parameters table, click the Add Parameter button. The Create New Job Parameter wizard opens.

  3. In the Name field, enter the name of the parameter.

  4. In the Value field, enter the value for this parameter.

  5. Click Finish to close the wizard.

  6. Press Control+S (or Command+S on macOS) to save the editor.

  7. Close the editor.

See job parameters for a list of the parameters available to customize and optimize the execution of integration jobs.

Best practices for job design

Jobs sequence and parallelism

A Job is a sequence of tasks. These tasks must be ordered to preserve referential integrity. For example, if you process the Contact entity and then the Customer entity, you may process new contacts attached to customers that do not exist yet. You should process the customers first, then the contacts.

Jobs themselves are executed sequentially within their Queues, in FIFO (First-In First-Out) order.
Jobs that should be able to run simultaneously must be placed in different queues. For example, if two jobs address two different areas of the same model, these jobs can run simultaneously in different queues.

It is not recommended to configure jobs processing the same entity to run in different queues. Instances of these jobs running simultaneously in two different queues may write conflicting changes to this entity, causing major data inconsistencies, SQL exceptions, and database deadlocks.

Parallel processing is not supported for fuzzy-matched entities.

Design jobs for data integration

Data published in batches may target several entities.

It is recommended to define jobs specific to the data loads targeting the hub:

  • Such jobs should include all the entities loaded by the data load process.

    • In addition, such jobs should include the entities referencing the entities loaded by the data load process.

Design jobs for authoring

A stepper (used directly or in a workflow) manipulates data in several entities.

If a stepper (or the workflow using it) is created without a reference to a job, a job is automatically generated and attached to that stepper when the model is deployed. This generated job processes all the entities involved in that stepper.

It is also possible to define a specific job for the stepper:

    • Such a job should process all the entities involved in the stepper.

    • In addition, if some of these entities are fuzzy-matched, the job should also process all the entities referencing them.

Design jobs for duplicate management

A duplicate management action handles duplicates detected on a specific entity.

It is possible to define a specific job for each duplicate management action:

    • Such a job should process the entity involved in the duplicate management workflow.

    • In addition, if that entity is fuzzy-matched, the job should also process all the entities referencing it.