Saving Time with Levenshtein Distance: An Overview and Personal Experience.
Introduction
Time is a precious commodity. Whether you're a software developer, a data analyst, or anyone dealing with textual data, finding efficient ways to save time and streamline processes is crucial.
An algorithm I find interesting in this regard is the Levenshtein Distance. It measures the difference between two sequences(strings). basically, it calculates the minimum number of edits required to change one word into another and it's named after the Soviet mathematician Vlamdir Levenshtein.
How Levenshtein Distance Works
The algorithm operates on the principle of dynamic programming, breaking down the problem into smaller subproblems and solving them iteratively, by building a matrix to represent the relationships between characters in two strings.
// Function to calculate the edit distance between two strings
const editDistance = (s1, s2) => {
// Convert strings to lowercase for case-insensitive comparison
s1 = s1.toLowerCase();
s2 = s2.toLowerCase();
// Array to store the costs of edit operations
const costs = new Array();
// Loop through the characters of the first string
for (let i = 0; i <= s1.length; i++) {
let lastValue = i;
// Loop through the characters of the second string
for (let j = 0; j <= s2.length; j++) {
// If at the beginning of the first string
if (i === 0) {
// Initialize the cost array for the second string
costs[j] = j;
} else {
// If not at the beginning of the first string
if (j > 0) {
let newValue = costs[j - 1];
// If the current characters are different, calculate the cost
if (s1.charAt(i - 1) !== s2.charAt(j - 1)) {
newValue = Math.min(
Math.min(newValue, lastValue),
costs[j]
) + 1;
}
// Update the cost values for the next iteration
costs[j - 1] = lastValue;
lastValue = newValue;
}
}
}
// Update the cost for the last character of the second string
if (i > 0) {
costs[s2.length] = lastValue;
}
}
// Return the final edit distance
return costs[s2.length];
};
// Example usage
const s1 = "book";
const s2 = "talk";
const editD = editDistance(s1, s2);
console.log(editD) // 3
The Levenshtein distance between "book" and "talk" is three because it can only be achieved with the following three alterations, which transform one into the other:
book → took: Substitution of "b" for "t"
took → taok: Substitution of "a" for "o"
taok → talk: Substitution of "l" for "o"
We can further complement the editDistance
function to return the percentage of similarity between the two strings by implementing a similarity
function as follows:
// Function to calculate the similarity between two strings
const similarity = (s1, s2) => {
// Determine the longer and shorter strings
let longer = s1;
let shorter = s2;
if (s1.length < s2.length) {
longer = s2;
shorter = s1;
}
// Get the length of the longer string
const longerLength = longer.length;
// If the longer string is empty, the similarity is 100
if (longerLength === 0) return 100
// Calculate the similarity based on edit distance
const editDist = editDistance(longer, shorter);
const similarityRatio = (longerLength - editDist) / parseFloat(longerLength);
// Return the similarity ratio as a percentage (multiply by 100)
return similarityRatio * 100;
};
// Example usage
const s1 = "book";
const s2 = "talk";
const sim = similarity(s1, s2);
console.log(sim) // 25
Looking at "book" and "talk," they have four letters each. Comparing them, only one letter is the same, which is 25% of the total length. This example helps validate the effectiveness of our function.
Applications of Levenshtein Distance
One of the most common applications of Levenshtein Distance is in spell-checking systems. The algorithm can suggest corrections based on the minimum edit distance by comparing a user's input to a dictionary of correctly spelt words. This improves the accuracy of autocorrect features and enhances the user experience by saving time that would otherwise be spent manually correcting mistakes.
In text analysis and data cleaning, Levenshtein Distance is also heavily used. It can be useful when dealing with messy or misspelt data, allowing for the identification of potential duplicates or matches that might have been overlooked using exact matching methods.
My Personal Experience with Levenshtein Distance
User verification: Levenshtein distance implementation was used in the user verification system I developed. This feature enables a smoother verification experience by accommodating minor typos or variations in the user's entered data and if the traditional equal to "=="
comparison is strictly applied, such variations could lead to unnecessary verification failures.
Here is an example snippet written regarding this:
const threshold = 80
const userIdentity = getIdentity(identityCode);
const enteredName = "Rapheal Akinola";
const storedName = userIdentity.name // Raphael Akinola
// "ea" => "ae" variation
const similarityRatio = similarity(enteredName, storedName); // 86.67%
if (similarityRatio >= threshold) {
console.log(`User verification successful! Similarity ratio: ${similarityRatio}`);
} else {
console.log(`User verification failed! Similarity ratio: ${similarityRatio}`);
}
Specifically, the system compares the information provided by the user, with the data associated with their provided identity code, such as a national identity number or driver's license number. The capability to adjust to minor misspellings or errors improves the efficiency of the verification process, eliminating the need for manual intervention.
Data processing: There was a process that required me to organize information from various JSON files, but the challenge was that the keys weren't always the same and were often misspelt and inconsistent. I used Levenshtein distance to get the closest match for each key even if it's a bit off in spelling or wording. This dynamic method helped me grab the right data from the JSON files, despite their inconsistencies.
// Function to find the closest matching key using Levenshtein Distance
const findClosestMatch = (targetKey, keysArray) => {
let closestMatch = "";
let maxSimilarity = 0;
keysArray.forEach((key) => {
const similarityRatio = similarity(targetKey, key);
if (similarityRatio > maxSimilarity) {
maxSimilarity = similarityRatio;
closestMatch = key;
}
});
return closestMatch;
};
// Function to process JSON files with inconsistent keys
const processJSONFiles = (targetKey, jsonFiles) => {
jsonFiles.forEach((object) => {
// Extract keys from the current JSON file
const fileKeys = Object.keys(object);
// Find the closest matching key for the target key
const closestMatchKey = findClosestMatch(targetKey, fileKeys);
// Access the value associated with the closest matching key
const targetValue = object[closestMatchKey];
console.log(`For target key '${targetKey}', closest matching key: '${closestMatchKey}', value: ${targetValue}`);
});
};
// Example usage
const targetKey = "userName";
const jsonFilesDataMock = [
{ "username": "JohnDoe", "age": 30, "email": "john@example.com" },
{ "usrname": "Alice", "age": 25, "email": "alice@example.com" },
{ "user": "Bob", "age": 35, "email": "bob@example.com" },
{ "user_nm": "Eve", "age": 28, "email": "eve@example.com" }
];
processJSONFiles(targetKey, jsonFilesDataMock);
// For target key 'userName', closest matching key: 'username', value: JohnDoe
// For target key 'userName', closest matching key: 'usrname', value: Alice
// For target key 'userName', closest matching key: 'user', value: Bob
// For target key 'userName', closest matching key: 'user_nm', value: Eve
As seen in the example usage I was able to get the value for userName
even when the spelling wasn't consistent across the JSON objects. This saved me a lot of time and the pain of having to manually edit hundreds of files.
Conclusion
The Levenshtein Distance algorithm stands out as a versatile and invaluable tool across various applications. Drawing from my personal experience, incorporating this algorithm into your workflow proves to be a time-saving strategy, by automating processes and mitigating the need for manual processes or interventions, it can also contribute to increased productivity and efficiency in your workflow.
For a deeper dive into the Levenshtein Distance algorithm and its applications, check out the following resources: