Matching records
The matching process involves comparing pairs of records using multiple match rules to determine if they are duplicates.
The actual matching phase may be preceded by a binning phase.
Match rules operate on two pseudo-records: Record1 and Record2. In the context of the certification process, Record1 is an alias for any master record either created or updated by the current data load, while Record2 is an alias for any master record, including those created or updated by the current load.
Understanding the difference between Record1 and Record2 helps in designing match rules that either maintain consistency or require additional safeguards to ensure reliable data matching and consolidation. For more information, see Advanced match rules.
Binning
Binning groups records into smaller, more manageable sets, or bins. This approach reduces the number of comparisons needed by narrowing down the search scope.
It is achieved through multiple SemQL expressions specified in a match rule. Records for which all binning expressions produce identical results are placed in the same bin.
While binning is a valuable tool that makes the matching logic conceptually simpler, it does not necessarily enhance performance. Its effectiveness relies on using selective binning attributes and having custom indexes on the related columns in the database.
|
Effective use of binning
Instead of matching all records globally, binning allows for:
-
Matching records based on a specific attribute (e.g., matching customers only when they are located in the same country).
-
Further narrowing the binning results by combining attributes or specific functions (e.g., matching customers within the same country and sales region).
To bin customers according to their country and sales region in the GeocodedAddress
complex field, you could use the following expressions:
-
Binning expression #1:
GeocodedAddress.Country
-
Binning expression #2:
GeocodedAddress.Region
Smaller bins lead to faster processing, but designers should make sure that binning does not exclude potential matches. Example. Incorrect binning
Using the first four letters of a last name for binning can separate similar records into different bins, leading to missed matches. In such a case, To prevent this issue, designers could use more appropriate attributes or apply phonetization techniques like SOUNDEX or METAPHONE for binning. These techniques help capture similar-sounding names, ensuring that potential matches are not overlooked due to minor variations in spelling. |
Matching
During the matching phase, a SemQL condition compares all the pairs of records within a bin. This condition uses Record1 and Record2, which corresponds to the two records being matched. If the SemQL condition returns true, records are considered duplicates.
The following matching condition matches pairs of customers who meet one of these criteria:
-
Their names sound the same in English (they are phonetized using PostgreSQL’s built-in double metaphone function), or
-
They have multiple elements of similarity:
-
Their names are over 80% similar (using Semarchy edit distance function), and
-
Their city name and address are more than 65% similar.
-
( DMETAPHONE(Record1.CustomerName) = DMETAPHONE(Record2.CustomerName)
OR
SEM_EDIT_DISTANCE_SIMILARITY(Record1.CustomerName,Record2.CustomerName) > 80 )
and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.Address, Record2.InputAddress.Address) > 65
and SEM_EDIT_DISTANCE_SIMILARITY(Record1.InputAddress.City, Record2.InputAddress.City) > 65
The level of overmatching and undermatching that is acceptable depends on the requirements of each data management project and should be carefully considered, as it can have a significant impact in terms of performance and efficiency.
-
Undermatching happens when matching conditions are too strict, which can result in an excessive amount of duplicate records. While not causing technical problems, actual matches may be overlooked in the process, ultimately affecting data quality.
-
Overmatching happens when matching conditions are too lax, which results in excessively large clusters of matching records that are impossible to manage, either manually or automatically.
Overmatching is a common cause of performance bottlenecks during the matching phase, and can trigger overflow errors due to database limitations during the consolidation process. For this reason, defining match rules that would combine an excessive number of master records (in the range of 1,000) into a single golden record is highly discouraged. Additionally, match groups containing several thousand records are considered excessively large clusters and are likely to cause out-of-memory errors. These errors may disrupt the matching or consolidation processes, leading to task failures. To maintain reliable operation and prevent system strain, it is crucial to design match rules that generate manageable match groups and align with available system resources, such as memory and database capacity. |