Advanced matching algorithms

The match and merge process offers two advanced algorithms, transitive scoring and multi-iteration grouping, for more accurate scoring and grouping. Both are optional and can be configured when creating a matcher.

Transitive scoring

There are two methods for computing the confidence score of a group:

  • Direct scoring: this method only takes into account the direct matches found with the rules. The group score is the average of these match scores. Pairs in the group that have not been directly matched by any rule are considered as having a score of zero.

  • Transitive scoring: this method takes into account the direct matches found with the rules, plus computed indirect (transitive) matches. The group score is also the average of the match scores in the group. The major difference is that the pairs in the group that have not been matched by any rule are given a transitive score, which is the best combination of scores linking one record to the other.

Understanding transitive scoring

Transitive scoring interprets the match score as the likelihood of similarity between two records, which can be combined with other probabilities.

For example, if record A matches record B with a probability of 0.5 and record B matches record C with a probability of 0.8, this method estimates that record A matches record C with a combined probability of (0.5 * 0.8) = 0.4.
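The combination described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming scores are expressed as probabilities between 0 and 1; the function name chain_probability is hypothetical and is not part of the product.

```python
import math


def chain_probability(link_probabilities):
    """Combine the match probabilities along a chain of records.

    Each value is the probability that two consecutive records in the
    chain match; the combined probability is their product.
    """
    return math.prod(link_probabilities)


# A matches B with 0.5, B matches C with 0.8.
a_to_c = chain_probability([0.5, 0.8])
print(round(a_to_c, 2))  # 0.4
```

Because every factor is at most 1, a longer chain can never produce a higher combined probability than any of its individual links.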

Direct vs. transitive scoring

The following example illustrates direct scoring and transitive scoring. In this example, the match rules have a match score of 90.


If using direct scoring:

  • Jane Smith (j.smith@acme.com) matches Jane Smith (jane@goliath.com) according to the Same Name rule with a match score of 90.

  • Jane Smith (jane@goliath.com) matches Janet Jones (jane@goliath.com) according to the Same Email rule with a match score of 90.

  • Jane Smith (j.smith@acme.com) did not match Janet Jones (jane@goliath.com) since they have a different name and email, so their direct match score is 0.

When computed with the direct scoring method, the confidence score of this group is: (90 + 90 + 0) / 3 = 60.
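The direct scoring calculation above can be reproduced with a short Python sketch. The record labels, the matches dictionary layout, and the function name direct_group_score are assumptions made for illustration only.

```python
from itertools import combinations


def direct_group_score(records, matches):
    """Average the direct match scores over every pair in the group.

    `matches` maps frozenset({a, b}) to a rule score (0-100).
    Pairs that no rule matched count as 0.
    """
    pairs = list(combinations(records, 2))
    if not pairs:
        raise ValueError("a group needs at least two records")
    total = sum(matches.get(frozenset(pair), 0) for pair in pairs)
    return total / len(pairs)


records = ["jane_smith_acme", "jane_smith_goliath", "janet_jones_goliath"]
matches = {
    frozenset({"jane_smith_acme", "jane_smith_goliath"}): 90,      # Same Name
    frozenset({"jane_smith_goliath", "janet_jones_goliath"}): 90,  # Same Email
}
print(direct_group_score(records, matches))  # 60.0
```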

If using transitive scoring:

  • Jane Smith (j.smith@acme.com) did not directly match Janet Jones (jane@goliath.com), since they have a different name and email. However, because they are linked via the other record, their transitive match probability is (0.90 * 0.90) = 0.81, which corresponds to a score of 81.

When computed with the transitive scoring method, the confidence score of this group is: (90 + 90 + 81) / 3 = 87.

Transitive scoring evaluates all possible connections between records, calculates all potential score combinations, and selects the highest transitive score.

A transitive score can surpass a direct match score. Two records that do not match well directly may still be strongly linked through other records.
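One way to picture the search for the best transitive score is as a maximum-product path problem over the match graph. The sketch below uses a Floyd-Warshall style loop for this; it is an assumption about how such a computation could be done, not a description of the product's internal implementation, and the function names are hypothetical.

```python
from itertools import combinations


def best_transitive_scores(records, matches):
    """Return best[a][b]: the highest product of match probabilities
    along any chain of direct matches linking a and b (0.0 if none).

    `matches` maps frozenset({a, b}) to a rule score (0-100).
    """
    best = {r: {s: 0.0 for s in records} for r in records}
    for pair, score in matches.items():
        a, b = pair
        best[a][b] = best[b][a] = score / 100.0
    # Allow each record k in turn to act as an intermediate link.
    for k in records:
        for i in records:
            for j in records:
                if i != j:
                    via_k = best[i][k] * best[k][j]
                    if via_k > best[i][j]:
                        best[i][j] = via_k
    return best


def transitive_group_score(records, matches):
    """Average the best (direct or transitive) score over all pairs."""
    best = best_transitive_scores(records, matches)
    pairs = list(combinations(records, 2))
    return sum(best[a][b] * 100 for a, b in pairs) / len(pairs)


records = ["jane_smith_acme", "jane_smith_goliath", "janet_jones_goliath"]
matches = {
    frozenset({"jane_smith_acme", "jane_smith_goliath"}): 90,
    frozenset({"jane_smith_goliath", "janet_jones_goliath"}): 90,
}
print(round(transitive_group_score(records, matches), 2))  # 87.0
```

If a weak direct match is added between the first and last record (for example, a score of 50), the loop keeps the stronger chained value of 0.81 for that pair, which shows how a transitive score can exceed a direct one.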

When data stewards review matches using the graph available in the Explain Record view or in the duplicate managers, the best transitive matches appear as gray edges.

For performance reasons, the transitive scoring mode is automatically disabled for clusters larger than 300 records. These large clusters fall back on direct scoring.
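The fallback rule can be expressed as a simple guard. The sketch below only restates the rule from the text; the constant name, the function name, and the mode strings are hypothetical and do not correspond to actual configuration settings.

```python
# Limit taken from the text above; the name is illustrative only.
TRANSITIVE_MAX_CLUSTER_SIZE = 300


def effective_scoring_mode(cluster_size, configured_mode):
    """Return the scoring mode actually applied to a cluster.

    Clusters larger than the limit use direct scoring even when
    transitive scoring is configured.
    """
    if configured_mode == "transitive" and cluster_size > TRANSITIVE_MAX_CLUSTER_SIZE:
        return "direct"
    return configured_mode


print(effective_scoring_mode(300, "transitive"))  # transitive
print(effective_scoring_mode(301, "transitive"))  # direct
```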

Multi-iteration grouping

The simplest method for merging groups into golden records consists of taking the initial largest match group, which contains all records related by any match rule, computing the overall score for that group (by either direct or transitive scoring), and deciding whether to merge that group.

However, in certain situations, you may want to consider possible sub-groups in the largest group.

Multi-iteration grouping provides the capability to process sub-groups within the coarse-grained group. It merges records by iterations, taking into account rules sharing the same score at each iteration, before moving to the next lower score. At each iteration, the merge policy thresholds apply to decide whether to create golden records or suggestions.

Multi-iteration grouping

The following example illustrates a multi-iteration grouping.

Iterative Grouping

In this example:

  • Janet Smith (j.smith@acme.com) matches J Smith (j.smith@acme.com) according to the Same Email rule with a match score of 95.

  • Jean Smith (jane.jones@goliath.com) matches J Jones (jane.jones@goliath.com) according to the Same Email rule with a match score of 95.

  • Janet Smith (j.smith@acme.com) matches Jean Smith (jane.jones@goliath.com) according to a Fuzzy Name rule with a match score of 15.

The overall group score with direct scoring is (95 + 95 + 15) / 3 = 68.3 but with transitive scoring, the group score is (95 + 95 + 81) / 3 = 90.3.

Intuitively, you would like to merge the first pair, then merge the second pair, since each pair matches very well. You would then let the data steward decide whether the two resulting golden records should merge. If you rely only on the overall group score, you cannot achieve this outcome.

Multi-iteration grouping first takes into account only the rules sharing the highest score (95) and merges the groups. It then takes the rules sharing the next-highest score (15), and so on.

In the example, if you configure a merge policy to create golden records for scores greater than or equal to 95, and have suggestions raised to the data stewards for groups with lower scores, you achieve the following:

  • Janet Smith (j.smith@acme.com) and J Smith (j.smith@acme.com) merge into a first golden record.

  • Jean Smith (jane.jones@goliath.com) and J Jones (jane.jones@goliath.com) merge into a second golden record.

  • The two golden records are then grouped in a suggestion.
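The outcome above can be reproduced with a simplified Python sketch of iterative grouping. It makes several assumptions that are not stated in the text: it uses a union-find structure, it compares the rule score of each iteration directly against a single golden-record threshold (rather than computing a direct or transitive group score), and all names are hypothetical.

```python
from collections import defaultdict


def multi_iteration_grouping(records, matches, golden_threshold):
    """Group records by descending rule score.

    `matches` is a list of (record_a, record_b, rule_score) tuples.
    Returns (golden_groups, suggestions).
    """
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pending = []  # links scoring below the threshold, left for review
    # One iteration per distinct rule score, highest first.
    for score in sorted({s for _, _, s in matches}, reverse=True):
        for a, b, s in matches:
            if s != score:
                continue
            if score >= golden_threshold:
                parent[find(a)] = find(b)  # merge into one golden record
            else:
                pending.append((a, b, score))

    groups = defaultdict(list)
    for r in records:
        groups[find(r)].append(r)
    golden = sorted(sorted(members) for members in groups.values())

    suggestions = []
    for a, b, score in pending:
        if find(a) != find(b):
            group_a = tuple(sorted(groups[find(a)]))
            group_b = tuple(sorted(groups[find(b)]))
            suggestions.append((group_a, group_b, score))
    return golden, suggestions


records = ["Janet Smith", "J Smith", "Jean Smith", "J Jones"]
matches = [
    ("Janet Smith", "J Smith", 95),     # Same Email
    ("Jean Smith", "J Jones", 95),      # Same Email
    ("Janet Smith", "Jean Smith", 15),  # Fuzzy Name
]
golden, suggestions = multi_iteration_grouping(records, matches, 95)
print(golden)
# [['J Jones', 'Jean Smith'], ['J Smith', 'Janet Smith']]
print(suggestions)
# [(('J Smith', 'Janet Smith'), ('J Jones', 'Jean Smith'), 15)]
```

The two pairs matched at 95 become separate golden records in the first iteration, and the weaker Fuzzy Name link only produces a suggestion between them.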