Configure data retention
This page describes how to define the retention policies for historical and lineage data.
Introduction to data retention
The data hub stores the lineage and history of the certified golden data, that is, the data that led to the current state of the golden data:
-
The built-in lineage traces the whole certification process from the golden data back to the sources. It traces the source data changes that were either pushed to the hub through external loads or performed in the hub, for example using steppers.
-
Data historization traces the changes made to the golden and master data.
Preserving the lineage and history is a master data governance requirement and a key regulatory compliance focus. However, keeping this information may also create a large volume of data in the hub storage.
To keep a reasonable volume of information, data location managers schedule data purges.
To make sure lineage and history are preserved according to the data governance and compliance requirements, model designers can apply a data retention policy to a model.
Data retention policies
There are two types of data retention policies:
-
The model data retention policy defines the retention duration for history and lineage data in the hub. This policy applies by default to all entities with no specific policy.
-
Entity data retention policies can be specified to override the model retention policy for specific entities.
For example:
-
The hub is configured to retain no history at all. This is the general policy.
-
Employee data is retained for 10 years.
-
Product data is retained forever.
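The override behavior in this example can be sketched in a few lines of Python. This is only an illustration of how an entity policy takes precedence over the model policy. The names `MODEL_POLICY_YEARS`, `ENTITY_POLICIES_YEARS`, and `effective_retention_years` are hypothetical, as is the encoding (`0` for no retention, `None` for forever). None of them is part of the product.

```python
from typing import Optional

# Hypothetical encoding, for illustration only:
#   0    -> retain no history
#   None -> retain forever
#   n    -> retain for n years
MODEL_POLICY_YEARS: Optional[int] = 0  # general policy: no history at all

ENTITY_POLICIES_YEARS = {
    "Employee": 10,   # retained for 10 years
    "Product": None,  # retained forever
}


def effective_retention_years(entity: str) -> Optional[int]:
    """Return the entity policy when one exists, otherwise the model policy."""
    if entity in ENTITY_POLICIES_YEARS:
        return ENTITY_POLICIES_YEARS[entity]
    return MODEL_POLICY_YEARS
```

An entity without its own policy, such as a hypothetical `Customer` entity, falls back to the model policy and retains no history.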
When a workflow runs, metadata from the workflow instances and their related stepper and branch instances, obsolete work items, attachments, and datasets may be stored in the database schema.
The retention and purge of workflow metadata are distinct considerations that should be addressed by workflow designers. For this reason, the retention policy of a specific workflow must be configured as a workflow definition property within the Workflow Builder.
Data purge
Data purges take place in the deployed hub, according to the retention policy defined for the model.
The purge deletes the following elements of the lineage and history:
-
Source data published to the hub via external loads.
-
Data authored (created, modified or overridden) in the hub.
-
Traces of deleted data.
-
Golden and master data history (if historization is configured).
-
Errors detected on the source and authored data by the integration job.
-
Duplicate choices made by users in duplicate managers.
Note that duplicate-management decisions still apply after a purge, but information about the time of the decision and the decision maker is deleted.
The purges only impact the history and lineage of the data in the data location. They do not delete actual golden and master data.
Optionally, the following repository artifacts can also be deleted as part of the purge process:
-
Job logs, batches and loads for which all the processed data has been purged.
-
Direct authoring, duplicate manager and workflow instances for which all the changed data has been purged.
Job logs, batches, loads, direct authoring, duplicate manager and workflow instances are purged when all their data has been purged. Therefore, they are purged based on the longest retention policy of all the entities that they manage.
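As a rough illustration of the "longest retention policy" rule, the following sketch computes the retention that effectively applies to an artifact from the entities it manages. The function name, the entity names, and the use of `math.inf` to represent "forever" are assumptions made for the example, not product behavior.

```python
import math

FOREVER = math.inf  # hypothetical marker for "retained forever"


def artifact_retention_years(entity_retention_years, managed_entities):
    """An artifact outlives its data, so the longest entity retention wins."""
    return max(entity_retention_years[entity] for entity in managed_entities)


# Hypothetical retention durations, in years, per entity.
retention = {"Customer": 0, "Employee": 10, "Product": FOREVER}

# A load that processed Customer and Employee data follows the 10-year policy.
load_retention = artifact_retention_years(retention, ["Customer", "Employee"])

# A load that also processed Product data is never purged in this sketch.
mixed_load_retention = artifact_retention_years(retention, ["Employee", "Product"])
```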
Deploy purge jobs
When a model is deployed, a purge job is automatically created in the deployment data location. This job purges data and artifacts according to the retention policy defined in the deployed model.
Purge jobs are scheduled by the data location manager as part of the data location configuration. For more information, see Configure data purge.
Regardless of the frequency of the purge schedule, the data history is retained as defined by the model designer in the data retention policies.
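The sketch below illustrates why the schedule frequency does not shorten retention. It assumes, for illustration only, that a purge run deletes a change only when the change is older than the retention period at the time of the run. The function `is_purgeable` and the dates are hypothetical.

```python
from datetime import datetime, timedelta


def is_purgeable(changed_at: datetime, purge_run_at: datetime,
                 retention: timedelta) -> bool:
    """A change is eligible only once it is older than the retention period."""
    return changed_at < purge_run_at - retention


retention = timedelta(days=365)
changed_at = datetime(2024, 1, 15)

# Thirty daily purge runs within the retention period: none removes the change.
daily_runs = [datetime(2024, 6, 1) + timedelta(days=i) for i in range(30)]
purged_early = any(is_purgeable(changed_at, run, retention) for run in daily_runs)
```

Running the purge more often only affects how soon eligible data is removed after its retention period has elapsed.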
Define a default retention policy
The model retention policy applies to all entities that are not subject to a specific retention policy.
To define the default data retention policy:
-
In the Application Builder, open the model edition for which you wish to define a retention policy.
-
In the Model Design view, double-click the Retention Policies node. The Data Retention Policy editor opens.
-
In the Data Retention Policy editor, in the Data Retention Policy section, set the properties for each of the following types of data:
-
Source Data
-
Source Errors
-
History
-
Deletions
-
(Optional) In the Description field, enter a description for the retention policy.
-
Press Control+S (or Command+S on macOS) to save your changes.
Define an entity data retention policy
The default retention policy defined in the previous section applies to all entities. You can also define entity-specific retention policies to override the default retention policy.
To define an entity data retention policy:
-
In the Data Retention Policy editor, click Add Entity Retention Policy. The Create New Entity Retention Policy wizard opens.
-
In the Create New Entity Retention Policy wizard, in the Entity field, select the entity for which you want to define a retention policy.
-
Set the properties for each of the following types of data:
-
Source Data
-
Source Errors
-
History
-
Deletions
-
Click Finish to close the wizard.
-
Press Control+S (or Command+S on macOS) to save your changes.
You can only have one entity retention policy per entity of the model.
The retention policy has no effect unless the model is deployed and a purge schedule is configured.
Data retention properties
The following table lists the properties used for defining the retention policy for different types of data. In the property names, <DataType> stands for one of the data types set in the procedures above: Source Data, Source Errors, History, or Deletions.
Property | Description |
---|---|
<DataType> Retention Type | Defines how long the data should be retained. Period is one of the possible values. |
<DataType> Time Duration | Only editable if the retention type is set to Period. Number representing the duration for which the data should be retained. |
<DataType> Time Unit | Only editable if the retention type is set to Period. Unit of the duration. |
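The dependency between these properties can be sketched as a small validated data structure. This is a hypothetical model, not the product's API. Only the `Period` value comes from the table above. The class name, the `Forever` and `Year` values used in the usage note, and the rule that the duration must be positive are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetentionSetting:
    """Hypothetical model of the three properties for one data type."""
    retention_type: str                  # "Period" is the only value taken from the table
    time_duration: Optional[int] = None  # only meaningful for "Period"
    time_unit: Optional[str] = None      # only meaningful for "Period"

    def __post_init__(self):
        if self.retention_type == "Period":
            # Assumption: a period needs a positive duration and a unit.
            if self.time_duration is None or self.time_duration <= 0:
                raise ValueError("Period requires a positive time duration")
            if not self.time_unit:
                raise ValueError("Period requires a time unit")
        elif self.time_duration is not None or self.time_unit is not None:
            # Mirrors "only editable if the retention type is set to Period".
            raise ValueError("Duration and unit only apply to Period")
```

For example, `RetentionSetting("Period", 10, "Year")` is accepted, while `RetentionSetting("Forever", 10, "Year")` raises a `ValueError` because a duration makes no sense outside a period.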