Address lists used for marketing campaigns often combine addresses from different sources. Even when all the addresses come from the same source, some of them are very likely to appear two or three times.
There are different causes for duplicates in address lists:
- When compiling addresses from different sources, duplicates are almost inevitable; the address lists being merged rarely have no overlap.
- Different employees have different ideas as to the best way to record an address, for example, whether the word 'Street' should be spelled out in the street name. Even with just one employee, the recorded addresses can look quite different. For example, addresses recorded under time pressure may only contain the bare minimum of information.
- If the program used to record the addresses is not flexible enough, addresses may be recorded in duplicate for the simple reason that the program does not allow you to record more than one contact person at a single address.
- If the program used to record the addresses is not designed to prevent duplicates during the actual data entry, or if the function intended for this purpose is not effective enough, the employee whose job it is to enter the new addresses may not even realize that these addresses are already on the address list.
It is almost impossible to prevent the occurrence of multiple entries in address lists. This makes it all the more important to search for duplicate addresses in your address lists from time to time. Many of the proposed approaches, and the simple functions for this purpose that are integrated in address administration programs, only solve part of the problem, because two addresses that are actually identical can look quite different:
- The first name could be written in front of the surname in one address, and behind the surname in the other.
- The first name and other address components could be abbreviated.
- Particularly with company names, individual parts of the name might not have been recorded, for example, when 'BMW' was recorded instead of 'BMW Group'.
- Single letters could be missing, switched with the neighbouring letter or entered incorrectly, for example, when an 'i' was typed instead of a 'j'.
- The use of upper and lower case can be different. For example, people entering their address on the internet often do so without capitalization, using only lower case letters.
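Some of the variations listed above can be removed before any comparison takes place. The following Python sketch is only an illustration under stated assumptions: the function name `normalize_name` is hypothetical and is not part of any of the programs mentioned in this article. It ignores upper and lower case, punctuation and the order of the words. It does not deal with abbreviations, missing name parts or typing errors, which need a fuzzy comparison.

```python
import re


def normalize_name(raw: str) -> str:
    """Return a comparison key that ignores case, punctuation and word order.

    Hypothetical helper for illustration only.
    """
    # casefold() is a more thorough lower() (for example, it maps 'ß' to 'ss').
    text = raw.casefold()
    # Replace every run of non-word characters (and underscores) with a space.
    text = re.sub(r"[\W_]+", " ", text)
    # Sort the words so that 'Einstein Albert' and 'Albert Einstein' agree.
    return " ".join(sorted(text.split()))


# Two spellings of the same name produce the same key.
print(normalize_name("Einstein, Albert") == normalize_name("albert EINSTEIN"))
```

Entries with the same key can then be treated as candidates for exact duplicates, while the remaining variations are left to a similarity measure.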
The name 'Albert Einstein', for example, could be recorded in a number of different ways:
- (100%) Einstein Albert
- (95%) A. Einstein
- (98%) Albert Einssein
- (87%) Abert Meinstein
Software designed especially for this purpose solves this problem by calculating the degree of matching between two words. In the previous example, the percentages in brackets are the degrees of matching calculated by DataQualityTools. With such programs, the user can usually set a threshold that determines how much discrepancy is allowed between addresses that are still recognized as duplicates. The lower the threshold value, the greater the allowed discrepancy between two addresses, and the greater the probability that the program will deliver hits that are not actually duplicate addresses. Ideally, the user can still verify the results of the matching and remove individual hits from the results before the program deletes the addresses it has recognized as duplicates from the address list.
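The idea of a degree of matching combined with a threshold can be sketched in a few lines of Python. This is not the algorithm used by DataQualityTools or DedupeWizard, and it will not reproduce the percentages shown in brackets above; it assumes the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. The names `normalize`, `match_score` and `find_duplicates` are hypothetical and chosen for this illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations
import re


def normalize(raw: str) -> str:
    # Ignore case, punctuation and word order before comparing.
    words = re.sub(r"[\W_]+", " ", raw.casefold()).split()
    return " ".join(sorted(words))


def match_score(a: str, b: str) -> float:
    """Degree of matching between two entries as a percentage (0 to 100)."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(entries, threshold=90.0):
    """Return (entry, entry, score) for every pair at or above the threshold."""
    hits = []
    for a, b in combinations(entries, 2):
        score = match_score(a, b)
        if score >= threshold:
            hits.append((a, b, round(score, 1)))
    return hits


names = ["Albert Einstein", "Einstein Albert", "A. Einstein", "Marie Curie"]

# A strict threshold reports only near-identical entries ...
strict = find_duplicates(names, threshold=90.0)
# ... while a lower threshold allows more discrepancy and returns more hits.
lenient = find_duplicates(names, threshold=75.0)

print(strict)
print(lenient)
```

The hits are returned as a list instead of being deleted straight away, so that they can be reviewed first. Comparing every pair is only practical for small lists, because the number of comparisons grows quadratically with the number of entries.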
Two programs that are well suited for this job are DedupeWizard and DataQualityTools:
- You can read about how to use DedupeWizard to search for duplicates within a table in the article 'Remove Duplicates in Excel'.
- You can read about how to use DataQualityTools to search for duplicates between two tables in the article 'Find Duplicates between two Tables in Access'.