Match and merge

The match and merge process detects the duplicates in order to consolidate them into a single golden record.

Capabilities per entity type

The matching process differs based on the entity type—​whether fuzzy-matched or ID-matched.

  • Fuzzy-matched entities rely on a matcher to automatically identify duplicates records and group them into clusters. The values are then consolidated into a golden record using survivorship rules.

  • ID-matched entities perform an exact match using a unique, user-provided ID. Since the ID serves as a primary key across systems, no matcher is required. However, survivorship rules are still applied to merge the values into a golden record.

  • Basic entities do not support the match and merge process.

Basic and ID-matched entities do not support fuzzy matching within the certification process.
Nevertheless, matchers can still be defined for these entities to detect duplicates during record creation. When the DETECT_DUPS validation is enabled, the matcher checks for similar records to alert the user if a new record matches an existing one.

ID matching vs. fuzzy matching

ID matching produces highly predictable results, as all records sharing the same ID will match and merge.

When using fuzzy matching, the outcome of the matching process is more complex and flexible, as it is based on fuzzy matching rules and algorithms tailored to the specific data.

Creating match groups

The initial phase of the match and merge process is to identify duplicate records and create match groups, or duplicate clusters, which will later be consolidated (i.e., merged) into golden records.

This phase is not applicable for ID-matching entities, as records are already grouped by their ID.

Matching record pairs

The first step in matching is to identify duplicate record pairs—that is, records that appear similar.

The match rules within a matcher define the conditions under which two records are considered duplicates. These rules generate a matching score, reflecting the level of confidence in the match based on the rule applied.

Match rules and scores

For example, the following rule (MATCH_RULE_1) considers two businesses identical, with a score of 100:

Record1.CustomerName = Record2.CustomerName and
Record1.InputAddress.Address = Record2.InputAddress.Address and
...
Record1.InputAddress.City = Record2.InputAddress.City

The following rule (MATCH_RULE_2) might assign a score of 85, as it detects businesses with numerous similarities:

SEM_EDIT_DISTANCE_SIMILARITY( Record1.CustomerName, Record2.CustomerName ) > 85 and
SEM_EDIT_DISTANCE_SIMILARITY( Record1.InputAddress.Address, Record2.InputAddress.Address ) > 65 and
SEM_EDIT_DISTANCE_SIMILARITY( Record1.InputAddress.City, Record2.InputAddress.City ) > 65

Another rule (MATCH_RULE_3) would assign a lower score, such as 20, as it identifies businesses with somewhat similar names in the same city:

SEM_EDIT_DISTANCE_SIMILARITY( Record1.CustomerName, Record2.CustomerName ) > 50
Record1.InputAddress.City = Record2.InputAddress.City

When two records are considered a match, they receive a match score equal to the highest score among all the rules that apply (indicating the highest level of confidence).

For example, if two records match using both MATCH_RULE_2 (score 85) and MATCH_RULE_3 (score 20), the final match score will be 85, reflecting the highest confidence score between the two.

Creating match groups

Match groups are formed by re-grouping duplicate records that have been identified as matches.

By default, the initial grouping mechanism is transitive and coarse-grained. It groups all records connected through a match rule.
In simple terms, if record A matches record B and record B matches record C, then records A, B, and C will belong to the same match group.

Transitive grouping

The example below illustrates how transitive grouping works:

Match Group

In this case:

  • Jane Smith (j.smith@acme.com) matches Jane Smith (jane@goliath.com) based on the Same Name rule,

  • Jane Smith (jane@goliath.com) matches Janet Jones (jane@goliath.com) based on the Same Email rule,

  • Jane Smith (j.smith@acme.com) does not match Janet Jones (jane@goliath.com) because their names and emails differ.

However, due to the transitive nature of grouping, Jane Smith (j.smith@acme.com), Jane Smith (jane@goliath.com), and Janet Jones (jane@goliath.com) are placed in the same initial match group.

The multi-iterating grouping algorithm offers a more refined approach for clustering beyond the initial coarse-grained grouping mechanism.

Confidence score

A match group is assigned a confidence score, which indicates how confident the system is in the accuracy of the matching records within the group. This score is derived from the scores of the individual match pairs that make up the group.

There are two methods for calculating a group’s confidence score: direct scoring and transitive scoring. These methods influence the final confidence score of the match clusters.

Merging groups into golden records

Depending on the confidence score, you may allow the matcher to automatically merge the match group. This process creates a golden record from the group and applies the consolidation rule to determine which values from the group will be consolidated into the golden record.

If the confidence score is not high enough to trigger an automatic merge, the match group is flagged as a merge suggestion. In this scenario:

  • Incoming records are kept as singleton golden records.

  • Existing record groups and golden records remain unchanged.

Data stewards can review merge suggestions using duplicate management tools to decide whether or not to merge the groups and create golden records.

The merge policy set when creating a matcher defines the confidence score thresholds required for automatic merging in different situations. For more information about merge policies, see Automate merge and confirmation.

Confirming golden records

As values in the source records are updated, match groups and golden records can evolve.

Impact of value changes on unconfirmed records

For instance, renaming a business from "Micro Soft Incorporated" to "Micro Soft" may cause it to match and merge with an existing "Microsoft" record if fuzzy matching is applied to business names. In this case, the original "Micro Soft Incorporated" golden record would be merged into the "Microsoft" record and would no longer exist as a separate entity.

Confirming a golden record involves "locking" the match group to prevent it from being reevaluated each time the source data changes.

This action is typically performed by a data steward using a duplicate management tool. The steward manually confirms accurate match groups and corrects any incorrect ones.

Depending on the confidence score of a match group, you may want to set up automatic confirmation to reduce the need for data stewards to review all records manually.

The auto-confirm policy, configured when creating a matcher, determines the confidence score required to automatically confirm golden records. This policy also allows you to specify whether singletons (i.e., golden records based on a single master record) should be confirmed automatically.

Over time, records may fall into one of the following confirmation statuses:

  • Not confirmed: the golden record has not been confirmed by either the matcher or a user.

  • Confirmed: the golden record has been fully confirmed by either the matcher or a user.

  • Partially confirmed: part of the match group forming the golden record has been confirmed by a user, but some master records in the group remain unconfirmed.

  • Previously confirmed: the record was confirmed, but its group has since been modified by a user.

Regardless of a record’s confirmation status, a data steward can always manually split and merge duplicates using a duplicate management tool.

Consolidating and overriding values

When multiple master records are merged into a golden record, their data is consolidated into the golden record. This process may also involve overrides—​that is, values manually entered by data stewards in the data hub.

Survivorship determines which data prevails in the golden record when multiple records are merged.

A survivorship rule specifies how golden record values are calculated for attributes in both fuzzy-matched and ID-matched entities. It consists of two key components:

  • A _consolidation rule_that defines how values from duplicate records (identified by the matcher) are combined into a single (golden) record.

  • An override rule that determines how user-authored values, when provided, take precedence over the consolidated values in the golden record.

Each survivorship rule applies to a specific attribute or set of attributes within an entity. For more information about survivorship rules, see Survivorship.

Golden records changes

Once a golden record is created, it may evolve over time as new master records are added, merged with other golden records, or become part of a suggestion. These group changes may trigger the consolidation and survivorship processes, resulting in updates to the golden record’s values.

The life cycle of a golden record is managed automatically based on the matcher and survivorship rules configured for each entity.

While the automated matching process merges groups of records, it does not automatically split existing groups.

Data stewards can use duplicate management tools to manually review, confirm, merge, or split match groups and suggestions. They also have the ability to override golden record values in line with the configured override rules.