Fuzzy Matching for Addresses: Finding Duplicates Despite Typos

Two records, the same person, but the computer sees two strangers. "John Smith, 42 Main St" and "Jonathan Smyth, 42 Main Street" are clearly the same individual to a human reader. For a character-by-character comparison, they are completely different entries.
This is precisely the problem fuzzy matching solves. Instead of demanding exact matches, it calculates a similarity score between two strings. The result is not a simple yes or no, but a percentage: 95% similar, 72% similar, 41% similar. Above a defined threshold, a pair is flagged as a duplicate.
For businesses that regularly process address data, fuzzy matching is not optional. It is the difference between a duplicate detection rate of 20% (exact matching) and 85% or more.
How Fuzzy Matching Differs from Exact Comparison
An exact comparison works in binary: two strings are identical or they are not. This works well when data is perfectly maintained, which in practice never happens.
Exact comparison:
"John Smith" vs. "John Smith" → Match ✓
"John Smith" vs. "Jon Smith" → No match ✗
"John Smith" vs. "john smith" → No match ✗
"John Smith" vs. "John Smith" → No match ✗ (double space)
Fuzzy matching calculates a similarity score instead:
Fuzzy matching:
"John Smith" vs. "Jon Smith" → 91% similarity → Duplicate
"John Smith" vs. "john smith" → 100% (after normalization)
"John Smith" vs. "John Smith" → 95% → Duplicate
"John Smith" vs. "Jane Miller" → 35% → No duplicate
The key advantage: fuzzy matching catches typos, spelling variants, abbreviations, and formatting differences, exactly the errors that occur in large quantities in real address databases.
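The contrast is easy to demonstrate with Python's standard library, whose `difflib.SequenceMatcher` returns a graded similarity ratio instead of a boolean (a minimal sketch; real systems normalize the strings first):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 (no overlap) and 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()

# Exact comparison is binary; the fuzzy score is graded.
print("John Smith" == "Jon Smith")                        # False: exact match fails
print(round(similarity("John Smith", "Jon Smith"), 2))    # high score despite the typo
print(round(similarity("John Smith", "Jane Miller"), 2))  # low score for different people
```

The exact percentages differ slightly between algorithms and libraries; what matters is that a typo drops the score only a little, while genuinely different records score far below any sensible threshold.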
The Five Most Important Algorithms
Behind the term fuzzy matching are several algorithms that calculate similarity in different ways. Each has specific strengths and weaknesses.
Levenshtein Distance
The most widely known algorithm counts the minimum number of single-character operations (insert, delete, replace) needed to transform one string into another.
"Smith" → "Smyth"
Step 1: replace i → y
Distance: 1
"Main Street" → "Main St."
Step 1: replace r → .
Step 2: delete e
Step 3: delete e
Step 4: delete t
Distance: 4
The distance is converted into a similarity score: 1 - (distance / length of the longer string). For "Smith" (5 characters) and "Smyth" (5 characters), that yields 1 - 1/5 = 80%.
Strength: Intuitive, reliably detects typos. Weakness: Treats all positions in a string equally, so an error at the start carries the same weight as one at the end.
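The distance and the score conversion described above can be sketched in a few lines of Python (an illustrative dynamic-programming version, not an optimized implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def lev_similarity(a: str, b: str) -> float:
    """1 - distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

levenshtein("Smith", "Smyth")        # 1
lev_similarity("Smith", "Smyth")     # 0.8
levenshtein("Main Street", "Main St.")  # 4
```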
Jaro-Winkler Similarity
This algorithm was designed specifically for comparing personal names. It considers two factors: the number of matching characters and their order. The Winkler extension adds a bonus when the first characters match.
Jaro-Winkler examples:
"Johnson" vs. "Jonhson" → 96% similarity
"Meyer" vs. "Meier" → 93% similarity
"Stephen" vs. "Steven" → 90% similarity
"Hans" vs. "Jens" → 60% similarity
Strength: Particularly precise for short strings like first and last names. Accounts for the fact that typos at the beginning of a word are rarer. Weakness: Less suitable than Levenshtein for long strings like full addresses.
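A minimal sketch of the standard Jaro-Winkler definition follows; note that the exact percentages in the examples above depend on implementation details and rounding, so treat them as approximate:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matching characters within a window, penalized for transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):          # count matched characters that are out of order
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro plus a bonus for a common prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

round(jaro_winkler("MARTHA", "MARHTA"), 3)  # 0.961, a classic textbook pair
```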
N-Gram Comparison
This approach splits both strings into fragments of fixed length (N-grams) and compares the overlap.
Bi-grams (N=2) of "Smith":
{Sm, mi, it, th}
Bi-grams of "Smyth":
{Sm, my, yt, th}
Shared bi-grams: {Sm, th} = 2
Union: {Sm, mi, it, my, yt, th} = 6
Jaccard similarity: 2/6 = 33%
Strength: Robust against transpositions and rearrangements within a string. Scales well for large datasets. Weakness: Less meaningful for very short strings.
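The bi-gram calculation above maps directly to set operations (a minimal sketch using character n-grams and Jaccard similarity):

```python
def ngrams(s: str, n: int = 2) -> set:
    """Set of character n-grams of s, e.g. bi-grams for n=2."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a: str, b: str, n: int = 2) -> float:
    """|intersection| / |union| of the two n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

ngrams("Smith")            # {'Sm', 'mi', 'it', 'th'} (set order may vary)
jaccard("Smith", "Smyth")  # 2 shared / 6 total = 0.33...
```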
Phonetic Algorithms (Soundex and Cologne Phonetics)
Phonetic algorithms convert strings into sound codes. Names that sound alike receive the same code, regardless of spelling.
Soundex:
"Smith" → S530
"Smyth" → S530
→ Same code despite different spelling
Cologne Phonetics (optimized for German):
"Mueller" → 657
"Müller" → 657
"Miller" → 657
→ All three are treated as identical
Strength: Catches variants that text-based algorithms miss, such as "Schmitz" and "Schmits" in German data. Weakness: Provides only yes/no (same code or not), not a similarity score. Can produce false positives with short or unusual names.
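American Soundex is simple enough to sketch directly (Cologne Phonetics follows the same code-the-sounds idea with German-specific rules and a variable-length code, and is omitted here for brevity):

```python
def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits for the following consonants."""
    digits = {**{c: "1" for c in "BFPV"},
              **{c: "2" for c in "CGJKQSXZ"},
              **{c: "3" for c in "DT"},
              "L": "4",
              **{c: "5" for c in "MN"},
              "R": "6"}
    name = name.upper()
    encoded = name[0]
    prev = digits.get(name[0], "")
    for c in name[1:]:
        code = digits.get(c, "")
        if code and code != prev:   # skip adjacent letters with the same code
            encoded += code
        if c not in "HW":           # H and W do not separate duplicate codes
            prev = code             # vowels reset prev, allowing repeats again
    return (encoded + "000")[:4]    # pad with zeros, truncate to 4 characters

soundex("Smith")  # S530
soundex("Smyth")  # S530 — same code despite different spelling
```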
TF-IDF Weighted Comparison
This approach weights rare components of an address more heavily than common ones. A match on a rare surname is a stronger signal for a duplicate than a match on "Smith" or "Johnson."
| Component | Frequency | Weight |
|---|---|---|
| "Smith" (surname) | Very common | Low |
| "Zbigniew" (first name) | Rare | High |
| "Chestnut Lane" (street) | Medium | Medium |
| "12a" (house number) | - | Standard |
Strength: Drastically reduces false positives for common names. Weakness: Requires a reference database for frequency calculation.
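The weighting idea can be illustrated with inverse document frequency: the rarer a value is in the reference data, the more a match on it counts. A toy sketch (the name counts below are invented for illustration):

```python
import math
from collections import Counter

def idf_weights(values: list) -> dict:
    """Weight each value by log(N / frequency): rare values get high weights."""
    counts = Counter(values)
    n = len(values)
    return {value: math.log(n / count) for value, count in counts.items()}

# Hypothetical reference sample of surnames.
surnames = ["Smith"] * 8 + ["Johnson"] * 5 + ["Zbigniew"]
weights = idf_weights(surnames)
# A match on the rare name carries far more weight than a match on "Smith".
```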
How Fuzzy Matching Works on Addresses in Practice
In practice, fuzzy matching is not applied to the entire address as a single string. Instead, each field is compared separately, and the individual results are combined into an overall score.
Record A: John Smith | 42 Main St. | 10001 New York
Record B: Jonathan Smyth | 42 Main Street | 10001 New York
Field comparison:
First name: "John" vs. "Jonathan" → Jaro-Winkler: 82%
Last name: "Smith" vs. "Smyth" → Levenshtein: 80%, Phonetic: Match
Street: "Main St." vs. "Main Street" → Normalized: 100%
House number: "42" vs. "42" → Exact: 100%
Postal code: "10001" vs. "10001" → Exact: 100%
City: "New York" vs. "New York" → Exact: 100%
Weighted overall score:
(82% x 0.15) + (80% x 0.30) + (100% x 0.20) + (100% x 0.15) + (100% x 0.15) + (100% x 0.05)
= 12.30 + 24.00 + 20.00 + 15.00 + 15.00 + 5.00
= 91.30% → Above threshold (80%) → Duplicate detected
The field weighting is critical. The last name typically receives the highest weight (30%) because it provides the strongest identification. Postal code and house number each get 15% as they are usually unambiguous when entered correctly.
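The calculation above maps directly to code. In this sketch the per-field scores are taken as given rather than computed, and the weights mirror the worked example:

```python
WEIGHTS = {
    "first_name":   0.15,
    "last_name":    0.30,  # strongest identifier, highest weight
    "street":       0.20,
    "house_number": 0.15,
    "postal_code":  0.15,
    "city":         0.05,
}

def overall_score(field_scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-field similarity scores (each on a 0-100 scale)."""
    return sum(field_scores[field] * w for field, w in weights.items())

pair = {"first_name": 82, "last_name": 80, "street": 100,
        "house_number": 100, "postal_code": 100, "city": 100}
score = overall_score(pair)        # 91.3
is_duplicate = score >= 80         # threshold from the example
```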
Common Error Types in Address Data and How Fuzzy Matching Handles Them
Not every discrepancy in address data is a typo. Many differences arise from varying data entry standards, system migrations, or simply different staff members entering data.
| Error Type | Example | Best Algorithm |
|---|---|---|
| Abbreviations | St. vs. Street, Ave. vs. Avenue | Normalization (dictionary) |
| Typos | "Schmdit" instead of "Schmidt" | Levenshtein |
| Name variants | "John" vs. "Jonathan" | Jaro-Winkler + nickname list |
| Field order | "Smith, John" vs. "John Smith" | N-Gram + parsing |
| Missing fields | Postal code present, city missing | Weighted comparison (skip field) |
| Phonetic variants | "Meyer" vs. "Meier" vs. "Maier" | Cologne Phonetics |
| Character encoding | "Muller" vs. "Mueller" vs. "Müller" | Normalization + Phonetics |
Professional systems combine several of these algorithms. They first normalize the data (expand abbreviations, standardize special characters), then apply fuzzy matching field by field, and calculate a weighted overall score.
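The normalization step that precedes the field-by-field comparison can be sketched like this (the abbreviation dictionary is a tiny illustrative sample; production systems use much larger ones):

```python
import re

# Small sample dictionary; real systems ship hundreds of entries per language.
ABBREVIATIONS = {"st.": "street", "ave.": "avenue", "rd.": "road"}

def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, transliterate umlauts, expand abbreviations."""
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace
    for umlaut, ascii_form in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        text = text.replace(umlaut, ascii_form)
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]
    return " ".join(words)

normalize("Main  St.")  # "main street"
normalize("Müller")     # "mueller"
```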
Choosing the Right Threshold
The threshold determines at what similarity level two records are flagged as duplicates. It is the most important configuration parameter, and also the one most frequently set incorrectly.
Too low (e.g. 60%): Finds many duplicates but also many false positives. "John Smith, Stuttgart" and "Jane Miller, Strasbourg" might pass as a duplicate.
Too high (e.g. 95%): Only catches obvious duplicates, misses subtler variants. "John Smith" and "Jonathan Smyth" would not be detected.
Practical recommendation: Start at 80% and manually review 50 to 100 results. If too many false positives appear, increase to 85%. If obvious duplicates slip through, lower to 75%.
Threshold impact on 10,000 records:
| Threshold | Duplicates Found | False Positives | Missed Duplicates |
|---|---|---|---|
| 95% | 120 | ~2% | ~40% |
| 85% | 280 | ~5% | ~15% |
| 80% | 350 | ~8% | ~8% |
| 70% | 520 | ~20% | ~3% |
A threshold around 80% offers the best trade-off for most use cases. For particularly sensitive data (e.g. in medical contexts), a lower value with subsequent manual review is advisable.
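Putting scoring and threshold together, a flagging pass over all record pairs looks like this. The sketch uses `difflib`'s ratio as a stand-in for a combined field score, and the sample records are invented; note the quadratic number of comparisons:

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_duplicates(records: list, threshold: float = 0.80) -> list:
    """Flag every pair of records whose similarity meets the threshold."""
    flagged = []
    for a, b in combinations(records, 2):  # n * (n - 1) / 2 comparisons
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            flagged.append((a, b, round(score, 2)))
    return flagged

records = ["John Smith, 42 Main St",
           "Jonathan Smyth, 42 Main Street",
           "Jane Miller, 7 Oak Ave"]
find_duplicates(records, threshold=0.75)  # flags only the Smith/Smyth pair
```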
Why Spreadsheets Are Not Enough for Fuzzy Matching
Excel offers MATCH and COUNTIF functions for exact duplicate searches. Fuzzy matching, however, is only possible through workarounds, such as VBA macros or Microsoft's Fuzzy Lookup add-in.
The limitations of spreadsheets for approximate matching:
- No built-in fuzzy functions: Levenshtein, Jaro-Winkler, or N-grams must be implemented as VBA code
- Performance: With 10,000 records, every entry must be compared against every other, nearly 50 million comparisons. Excel becomes extremely slow or crashes
- No normalization: Abbreviations, special characters, and formatting must be cleaned manually
- No field weighting: Excel cannot weight individual fields differently
- No phonetic comparison: Cologne Phonetics or Soundex are not available
For lists with fewer than 500 entries, a manual approach in Excel may work. Beyond 1,000 records it becomes impractical, beyond 10,000 impossible. For more on why spreadsheets fall short at address cleaning, see our article Removing Duplicate Addresses in Excel: Why Spreadsheets Fall Short.
Specialized tools like ListenFix implement all common fuzzy matching algorithms and combine them automatically. Processing runs locally on your computer, without transmitting address data to external servers. For 10,000 records, the analysis takes seconds instead of hours. On top of that, automatic normalization, postal code validation for 29 countries, and household detection go well beyond simple duplicate finding.
Fuzzy Matching in Practice: Three Scenarios
Scenario 1: Annual CRM Cleanup
A mid-sized company with 25,000 customer addresses performs an annual cleanup. Typical duplicate rate: 8 to 12%.
Cost example: With 10% duplicates and a direct mail campaign at EUR 0.28 per piece, deduplication saves 2,500 x EUR 0.28 = EUR 700 per mailing. With four mailings per year, that adds up to EUR 2,800, not counting the cost of returns and customer irritation.
Scenario 2: Merging Two Databases
After a company merger, two CRM systems need to be consolidated. Without fuzzy matching, thousands of phantom duplicates emerge, hampering sales and marketing for years.
Scenario 3: Real-Time Duplicate Check During Data Entry
Every new address entry is checked against the existing database in real time. This prevents duplicates from being created in the first place, the most effective approach, though technically the most demanding.
For more on which methods suit which use case, see our overview Detecting Duplicates: 7 Methods for Clean Address Data.
Getting Started with Fuzzy Matching
Fuzzy matching does not require specialized expertise, but it does require the right tools. The key steps to get started:
1. Normalize your data: Before applying fuzzy matching, bring your data into a consistent format. Expand abbreviations, standardize special characters, clean up whitespace.
2. Choose the right algorithm: For personal names, Jaro-Winkler works best. For street names and longer strings, Levenshtein is the better choice. Ideally, combine multiple algorithms.
3. Calibrate the threshold: Start at 80% and manually review 50 to 100 results. Adjust until the balance between detection rate and false-positive rate is right.
4. Review results: No algorithm is perfect. Plan time for spot checks, especially during the first run.
5. Repeat regularly: Data quality is not a one-time project. Schedule at least an annual cleanup, ideally supplemented by real-time checks during data entry.
The effort pays off. Clean address data means fewer returns, more accurate analytics, and more professional customer communication, with every single mailing.
Clean your mailing list — try it now
ListenFix uses fuzzy matching to find significantly more duplicates than Excel. 100% offline, GDPR-compliant.
Try for free