Transformers
Transformers modify and enrich the metadata collected by the harvesting process. They run between the source and the sink, as part of that process.
Manage ownership
Using transformers, you can manage dataset ownership.
transformers:
  - type: "simple_add_dataset_ownership" (1)
    config:
      owner_urns: (2)
        - "urn:li:corpuser:john.doe" # User URN
        - "urn:li:corpGroup:sales" # Group URN
      ownership_type: "TECHNICAL_OWNER" (3)
1 | Transformer type. |
2 | List of owner URNs to assign to the assets. These URNs may be user or group URNs. |
3 | Ownership type for these owners. |
transformers:
  - type: "pattern_add_dataset_ownership" (1)
    config:
      owner_pattern: (2)
        rules:
          # Assign owners to the assets
          # whose URN contains 'gd'
          ".*gd.*": ["urn:li:corpuser:john.doe", "urn:li:corpuser:jane.doe"]
      ownership_type: "TECHNICAL_OWNER" (3)
1 | Transformer type. |
2 | List of regular expression patterns, each followed by a list of owner URNs. When a dataset URN matches a pattern, the owners listed for that pattern are assigned to that dataset. |
3 | Ownership type for these owners. |
Add tags
Using transformers, you can assign tags to datasets and dataset schema fields.
transformers:
  - type: "simple_add_dataset_tags" (1)
    config:
      tag_urns: (2)
        - "urn:li:tag:ToDo"
        - "urn:li:tag:Review"
1 | Transformer type. |
2 | List of tag URNs to assign to the assets. |
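The `transformers` key holds a YAML list, so several transformers can presumably be declared side by side. The following is a minimal sketch, assuming the harvester applies every entry in the list; it reuses only the transformer types and URNs from the examples above.

```yaml
transformers:
  # First entry: assign an owner to the harvested datasets
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpGroup:sales"
      ownership_type: "TECHNICAL_OWNER"
  # Second entry: tag the same datasets (assumes entries are cumulative)
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:Review"
```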
transformers:
  - type: "pattern_add_dataset_tags" (1)
    config:
      tag_pattern: (2)
        rules:
          # Assign the Temporary and Review tags to the assets
          # whose URN contains 'tmp'
          ".*tmp.*": ["urn:li:tag:Temporary", "urn:li:tag:Review"]
          # Assign the Obsolete tag to the assets
          # whose URN contains 'old'
          ".*old.*": ["urn:li:tag:Obsolete"]
1 | Transformer type. |
2 | List of regular expression patterns, each followed by a list of tag URNs. When a dataset URN matches a pattern, the tags listed for that pattern are assigned to that dataset. |
transformers:
  - type: "pattern_add_dataset_schema_tags" (1)
    config:
      tag_pattern: (2)
        rules:
          # Assign the Review and Quality tags to the schema fields
          # whose field path contains 'email'
          ".*email.*": ["urn:li:tag:Review", "urn:li:tag:Quality"]
          # Assign the Obsolete tag to the schema fields
          # whose field path contains 'old'
          ".*old.*": ["urn:li:tag:Obsolete"]
1 | Transformer type. |
2 | List of regular expression patterns, each followed by a list of tag URNs. When a schema field path matches a pattern, the tags listed for that pattern are assigned to that schema field. |
Note: Only the tags from the first matching pattern are applied; tags from subsequent matching patterns are not.
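To illustrate the first-match behavior described above, here is a sketch with two rules that can both match the same field. The field path `old_email` is hypothetical, and the sketch assumes that "first" refers to the order in which the rules are listed.

```yaml
transformers:
  - type: "pattern_add_dataset_schema_tags"
    config:
      tag_pattern:
        rules:
          # A hypothetical field path such as 'old_email' matches both rules.
          # This rule is listed first, so its tags are the ones applied
          # (assumed ordering).
          ".*email.*": ["urn:li:tag:Review", "urn:li:tag:Quality"]
          # Also matches 'old_email', but its tags are not applied
          # once the rule above has matched.
          ".*old.*": ["urn:li:tag:Obsolete"]
```

If the Obsolete tag should win for such fields, listing the `.*old.*` rule first would be the natural thing to try.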
Add glossary terms
Using transformers, you can assign glossary terms to datasets and dataset schema fields.
transformers:
  - type: "simple_add_dataset_terms" (1)
    config:
      term_urns: (2)
        - "urn:li:glossaryTerm:GoldenData"
        - "urn:li:glossaryTerm:Regulated"
1 | Transformer type. |
2 | List of glossary term URNs to assign to the assets. |
transformers:
  - type: "pattern_add_dataset_terms" (1)
    config:
      term_pattern: (2)
        rules:
          # Assign the GoldenData and Certified terms to the assets
          # whose URN contains 'gd'
          ".*gd.*": ["urn:li:glossaryTerm:GoldenData", "urn:li:glossaryTerm:Certified"]
          # Assign the MasterData term to the assets
          # whose URN contains 'md'
          ".*md.*": ["urn:li:glossaryTerm:MasterData"]
1 | Transformer type. |
2 | List of regular expression patterns, each followed by a list of term URNs. When a dataset URN matches a pattern, the terms listed for that pattern are assigned to that dataset. |
transformers:
  - type: "pattern_add_dataset_schema_terms" (1)
    config:
      term_pattern: (2)
        rules:
          # Assign the PII and Email terms to the schema fields
          # whose field path contains 'email'
          ".*email.*": ["urn:li:glossaryTerm:PII", "urn:li:glossaryTerm:Email"]
          # Assign the Confidential term to the schema fields
          # whose field path contains 'internal'
          ".*internal.*": ["urn:li:glossaryTerm:Confidential"]
1 | Transformer type. |
2 | List of regular expression patterns, each followed by a list of term URNs. When a schema field path matches a pattern, the terms listed for that pattern are assigned to that schema field. |
Edit domains
Using transformers, you can assign domains to datasets.
transformers:
  - type: "simple_add_dataset_domain" (1)
    config:
      domains: (2)
        - "urn:li:domain:sales"
1 | Transformer type. |
2 | List of domain URNs to assign to the assets. You can also use the domain name instead of the URN, for example, "sales". |
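As noted in the callout above, the domain name can stand in for the URN. A minimal sketch of that variant, assuming a domain named "sales" already exists:

```yaml
transformers:
  - type: "simple_add_dataset_domain"
    config:
      domains:
        # Domain name used instead of "urn:li:domain:sales"
        - "sales"
```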
transformers:
  - type: "pattern_add_dataset_domain" (1)
    config:
      domain_pattern: (2)
        rules:
          # Assign the sales domain to the assets
          # whose URN matches the pattern
          # (single quotes keep the regex backslashes literal in YAML)
          'urn:li:dataset:\(urn:li:dataPlatform:postgres,postgres\.public\.n.*': ["urn:li:domain:sales"]
1 | Transformer type. |
2 | List of regular expression patterns, each followed by a list of domain URNs or names. When a dataset URN matches a pattern, the domains listed for that pattern are assigned to that dataset. |
Set properties
Using transformers, you can set properties on datasets.
transformers:
  - type: "simple_add_dataset_properties" (1)
    config:
      properties: (2)
        confidential: value_1
        property_2: value_2
1 | Transformer type. |
2 | List of properties to set on the dataset, with their values. |
Assign to initiatives
Using transformers, you can assign datasets to initiatives.
transformers:
  - type: "add_to_xdg_initiative" (1)
    config:
      initiative_urn: "urn:li:dataProduct:dataQuality" (2)
      xdg_backend_url: 'https://<your-tenant-name>.semarchy.net/api/xdg/v1' (3)
1 | Transformer type. |
2 | URN of the initiative to which the assets are assigned. |
3 | URL of the Semarchy xDG tenant. |
Replace existing values
When using transformers, you can optionally define how the values set by the transformer (tags, terms, and so on) interact with the values collected from the source and with those already present on the assets stored in Semarchy xDG.
You can set two optional properties on each transformer to define this behavior:
- replace_existing: When set to true, the values produced by the transformer replace those collected from the source instead of being added to them. This property defaults to false.
- semantics: When set to OVERWRITE (default value), the transformer overwrites all the values stored in Semarchy xDG with those produced by the transformer. When set to PATCH, it adds the values to those in Semarchy xDG.
transformers:
  - type: "simple_add_dataset_tags" (1)
    config:
      tag_urns:
        - "urn:li:tag:ToDo"
        - "urn:li:tag:Review"
      replace_existing: true (2)
      semantics: OVERWRITE (3)
1 | Transformer adding tags to assets. |
2 | Replace all tags that may be present in the harvested assets. |
3 | Overwrite all tags for these assets in Semarchy xDG. |
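For contrast with the example above, a sketch of the additive case: `replace_existing` is left at its default (false) and `semantics` is set to PATCH, so, going by the property descriptions above, the tag would be added to those harvested from the source and to those already stored in Semarchy xDG.

```yaml
transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:Review"
      # replace_existing omitted: defaults to false, harvested tags are kept
      # PATCH: add to the tags already stored in Semarchy xDG
      semantics: PATCH
```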