Saving Time with Levenshtein Distance: An Overview and Personal Experience.

Saving Time with Levenshtein Distance: An Overview and Personal Experience.

Introduction

Time is a precious commodity. Whether you're a software developer, a data analyst, or anyone dealing with textual data, finding efficient ways to save time and streamline processes is crucial.

An algorithm I find interesting in this regard is the Levenshtein Distance. It measures the difference between two sequences(strings). basically, it calculates the minimum number of edits required to change one word into another and it's named after the Soviet mathematician Vlamdir Levenshtein.

How Levenshtein Distance Works

The algorithm operates on the principle of dynamic programming, breaking down the problem into smaller subproblems and solving them iteratively, by building a matrix to represent the relationships between characters in two strings.

// Function to calculate the edit distance between two strings
const editDistance = (s1, s2) => {
    // Convert strings to lowercase for case-insensitive comparison
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    // Array to store the costs of edit operations
    const costs = new Array();

    // Loop through the characters of the first string
    for (let i = 0; i <= s1.length; i++) {
        let lastValue = i;

        // Loop through the characters of the second string
        for (let j = 0; j <= s2.length; j++) {
            // If at the beginning of the first string
            if (i === 0) {
                // Initialize the cost array for the second string
                costs[j] = j;
            } else {
                // If not at the beginning of the first string
                if (j > 0) {
                    let newValue = costs[j - 1];

                    // If the current characters are different, calculate the cost
                    if (s1.charAt(i - 1) !== s2.charAt(j - 1)) {
                        newValue = Math.min(
                            Math.min(newValue, lastValue),
                            costs[j]
                        ) + 1;
                    }

                    // Update the cost values for the next iteration
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }

        // Update the cost for the last character of the second string
        if (i > 0) {
            costs[s2.length] = lastValue;
        }
    }

    // Return the final edit distance
    return costs[s2.length];
};

// Example usage
const s1 = "book";
const s2 = "talk";

const editD = editDistance(s1, s2);
console.log(editD) // 3

The Levenshtein distance between "book" and "talk" is three because it can only be achieved with the following three alterations, which transform one into the other:

  1. book → took: Substitution of "b" for "t"

  2. took → taok: Substitution of "a" for "o"

  3. taok → talk: Substitution of "l" for "o"

We can further complement the editDistance function to return the percentage of similarity between the two strings by implementing a similarity function as follows:

// Function to calculate the similarity between two strings
const similarity = (s1, s2) => {
    // Determine the longer and shorter strings
    let longer = s1;
    let shorter = s2;
    if (s1.length < s2.length) {
        longer = s2;
        shorter = s1;
    }

    // Get the length of the longer string
    const longerLength = longer.length;

    // If the longer string is empty, the similarity is 100
    if (longerLength === 0) return 100

    // Calculate the similarity based on edit distance
    const editDist = editDistance(longer, shorter);
    const similarityRatio = (longerLength - editDist) / parseFloat(longerLength);

    // Return the similarity ratio as a percentage (multiply by 100)
    return similarityRatio * 100;
};

// Example usage
const s1 = "book";
const s2 = "talk";

const sim = similarity(s1, s2);
console.log(sim) // 25

Looking at "book" and "talk," they have four letters each. Comparing them, only one letter is the same, which is 25% of the total length. This example helps validate the effectiveness of our function.

Applications of Levenshtein Distance

One of the most common applications of Levenshtein Distance is in spell-checking systems. The algorithm can suggest corrections based on the minimum edit distance by comparing a user's input to a dictionary of correctly spelt words. This improves the accuracy of autocorrect features and enhances the user experience by saving time that would otherwise be spent manually correcting mistakes.

In text analysis and data cleaning, Levenshtein Distance is also heavily used. It can be useful when dealing with messy or misspelt data, allowing for the identification of potential duplicates or matches that might have been overlooked using exact matching methods.

My Personal Experience with Levenshtein Distance

User verification: Levenshtein distance implementation was used in the user verification system I developed. This feature enables a smoother verification experience by accommodating minor typos or variations in the user's entered data and if the traditional equal to "==" comparison is strictly applied, such variations could lead to unnecessary verification failures.

Here is an example snippet written regarding this:

const threshold = 80

const userIdentity = getIdentity(identityCode);
const enteredName = "Rapheal Akinola";
const storedName = userIdentity.name // Raphael Akinola
// "ea" => "ae" variation

const similarityRatio = similarity(enteredName, storedName); // 86.67%

if (similarityRatio >= threshold) {
    console.log(`User verification successful! Similarity ratio: ${similarityRatio}`);
} else {
    console.log(`User verification failed! Similarity ratio: ${similarityRatio}`);
}

Specifically, the system compares the information provided by the user, with the data associated with their provided identity code, such as a national identity number or driver's license number. The capability to adjust to minor misspellings or errors improves the efficiency of the verification process, eliminating the need for manual intervention.

Data processing: There was a process that required me to organize information from various JSON files, but the challenge was that the keys weren't always the same and were often misspelt and inconsistent. I used Levenshtein distance to get the closest match for each key even if it's a bit off in spelling or wording. This dynamic method helped me grab the right data from the JSON files, despite their inconsistencies.

// Function to find the closest matching key using Levenshtein Distance
const findClosestMatch = (targetKey, keysArray) => {
    let closestMatch = "";
    let maxSimilarity = 0;

    keysArray.forEach((key) => {
        const similarityRatio = similarity(targetKey, key);

        if (similarityRatio > maxSimilarity) {
            maxSimilarity = similarityRatio;
            closestMatch = key;
        }
    });

    return closestMatch;
};

// Function to process JSON files with inconsistent keys
const processJSONFiles = (targetKey, jsonFiles) => {
    jsonFiles.forEach((object) => {
        // Extract keys from the current JSON file
        const fileKeys = Object.keys(object);

        // Find the closest matching key for the target key
        const closestMatchKey = findClosestMatch(targetKey, fileKeys);

        // Access the value associated with the closest matching key
        const targetValue = object[closestMatchKey];

        console.log(`For target key '${targetKey}', closest matching key: '${closestMatchKey}', value: ${targetValue}`);
    });
};

// Example usage
const targetKey = "userName";

const jsonFilesDataMock = [
    { "username": "JohnDoe", "age": 30, "email": "john@example.com" },
    { "usrname": "Alice", "age": 25, "email": "alice@example.com" },
    { "user": "Bob", "age": 35, "email": "bob@example.com" },
    { "user_nm": "Eve", "age": 28, "email": "eve@example.com" }
];

processJSONFiles(targetKey, jsonFilesDataMock);
// For target key 'userName', closest matching key: 'username', value: JohnDoe
// For target key 'userName', closest matching key: 'usrname', value: Alice
// For target key 'userName', closest matching key: 'user', value: Bob
// For target key 'userName', closest matching key: 'user_nm', value: Eve

As seen in the example usage I was able to get the value for userName even when the spelling wasn't consistent across the JSON objects. This saved me a lot of time and the pain of having to manually edit hundreds of files.

Conclusion

The Levenshtein Distance algorithm stands out as a versatile and invaluable tool across various applications. Drawing from my personal experience, incorporating this algorithm into your workflow proves to be a time-saving strategy, by automating processes and mitigating the need for manual processes or interventions, it can also contribute to increased productivity and efficiency in your workflow.

For a deeper dive into the Levenshtein Distance algorithm and its applications, check out the following resources:

Wikipedia - Levenshtein Distance

Stack Overflow - Levenshtein Distance Implementation