Revisiting an Anagram Detector

The following example is taken from Problem Solving with Algorithms and Data Structures using Python. We wrote this example for the first edition of the book 10 years ago! Our thinking at the time was to write Python in a way that would prepare students for Java and/or C++. In addition, the algorithms were written to illustrate specific "Big-O" runtimes.

In addition to the code, I'm making use of the %timeit magic in Jupyter notebooks. Even though this is a simple example timeit runs the example thousands of times and takes the average of the best 3 times.

In [56]:
def anagramSolution4(s1,s2):
    c1 = [0]*26
    c2 = [0]*26

    for i in range(len(s1)):
        pos = ord(s1[i])-ord('a')
        c1[pos] = c1[pos] + 1

    for i in range(len(s2)):
        pos = ord(s2[i])-ord('a')
        c2[pos] = c2[pos] + 1

    j = 0
    stillOK = True
    while j<26 and stillOK:
        if c1[j]==c2[j]:
            j = j + 1
        else:
            stillOK = False

    return stillOK

original = '''Thats one small step for a man one giant leap for mankind Neil Armstrong'''.lower().replace(' ','')
anagram = '''thin man ran makes a large stride left planet pins flag on moon on to Mars'''.lower().replace(' ','')

%timeit anagramSolution4('apple','pleap')
%timeit anagramSolution4(original,anagram)
100000 loops, best of 3: 6.85 µs per loop
10000 loops, best of 3: 31.1 µs per loop

A better solution from github?

Just yesterday I received a bug report about the example above on github, which claimed that the following code was better and faster. There are two important points to make:

  1. The following example is incorrect for strings that are not the same length! The Python zip function simply throws away extra values when zipping two lists together. This is easily fixed by checking the lengths of the strings.
  2. The point of the example is to illustrate big O algorithm analysis. In that sense the two algorithms are both linear. In the case of Algorithm analysis two sequential loops are the same as one loop that iterates in parallel. Indexing and hashing are both constant time operations. Testing whether or not two dictionaries are equal is also going to be linear.

However, the bug report did get me thinking about the style that this original example was written in.

In [ ]:
def compare(str1, str2):
    time1= time.time()
    dict1 = {}
    dict2 = {}

    list1 = list(str1)
    list2 = list(str2)

    for i, j in zip(list1, list2):
        if i in dict1:
            dict1[i] += 1
        else:
            dict1[i] = 1

        if j in dict2:
            dict2[j] += 1
        else:
            dict2[j] = 1

    if dict1 == dict2:
        return True
    else:
        return False

Becoming more Pythonic

Since 2005 my thinking has evolved, and especially for a second course in CS, I think I should be writing in a more Pythonic style. So lets take my original anagramSolution4 and modernize it.

  1. There is no reason to use range and indexing when we can simply iterate over the characters in each string.
  2. Use the += operator
  3. Calculate the ordinal offset from 'a' once rather than many times.
In [46]:
def anagramSolution4(s1,s2):
    
    if len(s1) != len(s2):
        return False
    
    c1 = [0]*26
    c2 = [0]*26

    ord_offset = ord('a')
    for i in s1:
        pos = ord(i)-ord_offset
        c1[pos] += 1

    for i in s2:
        pos = ord(i)-ord_offset
        c2[pos] += 1

    j = 0
    stillOK = True
    while j<26 and stillOK:
        if c1[j]==c2[j]:
            j = j + 1
        else:
            stillOK = False

    return stillOK

%timeit anagramSolution4('apple','pleap')
%timeit anagramSolution4(original, anagram)
100000 loops, best of 3: 5.29 µs per loop
100000 loops, best of 3: 19.6 µs per loop

Well, that is quite a bit faster for the longer anagram, and less code is always nicer than more in terms of making things easier to understand.

Now lets see if the fancy use of zip makes our soution any faster that the two loops run in serial?

In [49]:
def anagramSolution4(s1,s2):
    
    if len(s1) != len(s2):
        return False
    
    c1 = [0]*26
    c2 = [0]*26

    ord_offset = ord('a')
    for i,j in zip(s1,s2):
        c1[ord(i)-ord_offset] += 1
        c2[ord(j)-ord_offset] += 1

    j = 0
    stillOK = True
    while j<26 and stillOK:
        if c1[j]==c2[j]:
            j = j + 1
        else:
            stillOK = False

    return stillOK

%timeit anagramSolution4('apple','pleap')
%timeit anagramSolution4(original,anagram)
100000 loops, best of 3: 5.6 µs per loop
10000 loops, best of 3: 20.8 µs per loop

Wow, there is nothing gained, in fact we lose a microsecond!

What can we do to make that last while loop more pythonic?

We have two options. If we want to be really explicit about what we are doing we could do something like this return all(map(lambda x,y: x==y, c1,c2)) You can break this apart on your own to see that map creates a list of True or False values. all simply checks to see that all of the values in the provided list are true.

But the simplest solution is just to let Python decide if the two lists are equal.

In [72]:
def anagramSolution4(s1,s2):
    
    if len(s1) != len(s2):
        return False
    
    c1 = [0]*26
    c2 = [0]*26

    ord_offset = ord('a')
    for i,j in zip(s1,s2):
        c1[ord(i)-ord_offset] += 1
        c2[ord(j)-ord_offset] += 1

    return c1 == c2

print(anagramSolution4(original,anagram))
%timeit anagramSolution4('apple','pleap')
%timeit anagramSolution4(original,anagram)
True
100000 loops, best of 3: 2.78 µs per loop
100000 loops, best of 3: 18 µs per loop

That gives us some really nice improvements over our original solution!

For the short anangram the new solution is 2.5 times faster, for the longer anagram the new solution is 1.7 times faster.

Are dictionaries better than lists for this problem?

In [60]:
def compare(str1, str2):

    # Check lengths first, then rest of solution is valid
    if len(str1) != len(str2):
        return False

    dict1 = {}
    dict2 = {}
    
    list1 = list(str1)
    list2 = list(str2)

    for i, j in zip(list1, list2):
        if i in dict1:
            dict1[i] += 1
        else:
            dict1[i] = 1

        if j in dict2:
            dict2[j] += 1
        else:
            dict2[j] = 1

    if dict1 == dict2:
        return True
    else:
        return False

%timeit compare("apple","peapl")
%timeit compare(anagram,original)
The slowest run took 4.11 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 2.43 µs per loop
100000 loops, best of 3: 16.7 µs per loop

The suggested function is indeed faster than our published solution, but is pretty close to our cleaned up solution.

Now lets do a couple of cleanups on the compare function:

In [70]:
from collections import defaultdict

def compare(str1, str2):

    if len(str1) != len(str2):
        return False

    dict1 = defaultdict(int)
    dict2 = defaultdict(int)

    for i, j in zip(str1,str2):
        dict1[i] += 1
        dict2[j] += 1

    return dict1 == dict2


%timeit compare("apple","peapl")
%timeit compare(anagram,original)
100000 loops, best of 3: 3.25 µs per loop
100000 loops, best of 3: 17.9 µs per loop

The use of a defaultdict allows us to eliminate all the checks for whether a key is in the dictionary yet. We can just assume it is in there, and let the default of 0 kick in when necessary. This is very similar to using the get method on a regular dictionary and having it return 0 when the key is not already in the dictionary. For example we could write dict1[i] = dict1.get(i,0) + 1 If i is not in dict1 then get will return 0 (rather than throwing an exception) and we can add 1 to it to initialize the key value pair.

Finally lets compare our two cleaned up solutions on the short anagram.

In [63]:
%timeit anagramSolution4("apple","pleap")
%timeit compare("apple","pleap")
The slowest run took 12.64 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 2.73 µs per loop
100000 loops, best of 3: 2.37 µs per loop

After all that there is not much difference, a few tenths of a microsecond. How about on our longer anagram?

In [64]:
%timeit anagramSolution4(original,anagram)
%timeit compare(original,anagram)
10000 loops, best of 3: 18.1 µs per loop
100000 loops, best of 3: 16.4 µs per loop

Some really short solutions that don't help the educational intent of the section

Of course there are other solutions that really make use of the power of Python, but are not necessarily interesting in the context of providing algorithms for Analysis. Here are two:

The first solution makes use of the standard library counter, and essentially does exactly what compare does, but making use of a standard library data structure.

The second makes use of sorting, which at best is $O(n \log{n})$

In [68]:
from collections import Counter

def pythonic1(str1,str2):
    return Counter(str1) == Counter(str2)

def pythonic2(str1,str2):
    return sorted(str1) == sorted(str2)

%timeit pythonic1("apple","pleap")
%timeit pythonic2("apple","pleap")
print("----")
%timeit pythonic1(original,anagram)
%timeit pythonic2(original,anagram)
100000 loops, best of 3: 9.77 µs per loop
The slowest run took 4.60 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.13 µs per loop
----
100000 loops, best of 3: 15.4 µs per loop
100000 loops, best of 3: 13.9 µs per loop


Comments

comments powered by Disqus