Python effective TLD library update

    The effective TLD library is now being used for a couple of projects of mine, but I've had some trouble with it being almost unusably slow. I woke up this morning with the revelation that the problem is that I use regexps to match domain names, but a failed match is only detected at the end of the string. That means that the FSA has to scan the entire string before it can decide that it isn't a match. That's expensive.

    I ran some tests on tweaks to try and fix this. Without any changes, scanning 1,000 semi-random domain names took 6.941666 seconds. I then tweaked the implementation to reverse the strings it was scanning, and that roughly halved the run time of the test to 3.212203 seconds. That's a big improvement, but still way too slow. The next thing I tried was adding buckets of rules on top of those reverse matches. In other words, the code now assumes that anything after the last dot is some form of TLD approximation, and only executes rules which also have that string after the last dot. This was a massive improvement, with 1,000 domains taking only 0.026120 seconds.
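
    A minimal sketch of the reverse-and-bucket idea, not the actual etld.py code. The rule list, `build_buckets` and `split_domain` are hypothetical names made up for illustration, and wildcard, exception and rule-precedence handling are deliberately left out.

```python
import re
from collections import defaultdict

# Hypothetical sample rules in the style of effective_tld_names.dat.
RULES = ["com", "net", "co.uk", "org.uk"]


def build_buckets(rules):
    """Group reversed rule patterns by the label after the last dot."""
    buckets = defaultdict(list)
    for rule in rules:
        tld = rule.rsplit(".", 1)[-1]
        # Reverse the rule so the regexp starts matching at the TLD end
        # of the (also reversed) domain name and can fail early.
        pattern = re.compile(re.escape(rule[::-1]) + r"\.(.+)")
        buckets[tld].append((rule, pattern))
    return buckets


def split_domain(name, buckets):
    """Return (domain, etld), or None if no rule in the bucket matches."""
    tld = name.rsplit(".", 1)[-1]
    reversed_name = name[::-1]
    # Only rules that share the text after the last dot are tried.
    for rule, pattern in buckets.get(tld, []):
        match = pattern.match(reversed_name)
        if match:
            return match.group(1)[::-1], rule
    return None


BUCKETS = build_buckets(RULES)
print(split_domain("www.example.co.uk", BUCKETS))
```

    With this layout a name ending in `.uk` is only ever tested against the `uk` bucket, which is where the bulk of the saving would come from.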

    I've updated the code at http://www.stillhq.com/python/etld/etld.py.

posted at: 09:45 | path: /python/etld

    ### Darryl

    Hi,

    Coming from the DNS world, the notion of "effective" TLDs is a little foreign to me. That aside, I see some results that look a little odd.

    example 1. this works as expected:
    something.org.uk ==> etld='org.uk' domain='something'

    example 2. not what I expected
    www.something.org.uk ==> etld='something.org.uk' domain='www'

    I would have expected that the etld would be the same as the first
    example and the domain would change. Am I missing something?

    Thanks,
    Darryl


    ### Chris Hills

    There seems to be a problem with wildcards.

    abc.def.example.com -> ('abc.def.example', 'com')
    abc.def.example.co.uk -> ('abc', 'def.example.co.uk')

    I had to sort the list like so to get most queries working as expected. For instance, blah.auto.pl would give "pl" rather than "auto.pl".

    grep -v '//' effective_tld_names.dat | grep ^\. | awk '{print length"\t"$0}'|sort -nr|cut -f2-
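
    For anyone who would rather do this in Python, a rough equivalent of that pipeline might look like the sketch below. `sort_rules_longest_first` is a hypothetical helper, not part of etld.py, and it assumes the rules have already been read into a list of lines.

```python
def sort_rules_longest_first(lines):
    """Drop comment and blank lines, then order rules longest first."""
    rules = [
        line.strip()
        for line in lines
        # Like `grep -v '//'`, skip any line containing a comment marker.
        if line.strip() and "//" not in line
    ]
    # Longest rules first, so "auto.pl" is tried before "pl".
    return sorted(rules, key=len, reverse=True)


sample = ["// Poland", "", "pl", "auto.pl", "com"]
print(sort_rules_longest_first(sample))
```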

    ### Bunker

    In truth, I didn't understand the essence at first. But after re-reading, it all became clear at once.

    ### Chris Hills

    To fix the wildcard problem replace the line:

    line = line[::-1].replace('.', '\\.').replace('*', '.*').replace('!', '')

    with:

    line = line[::-1].replace('.', '\\.').replace('*', '[^\\.]*').replace('!', '')
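
    A small sketch of why the replacement matters. `rule_to_pattern` is a hypothetical wrapper around the line above, with the wildcard replacement passed in as a parameter so both variants can be compared. The rule `*.uk` is used purely as an example.

```python
import re


def rule_to_pattern(line, wildcard):
    # Same transformation as the quoted line, with the '*' replacement
    # supplied by the caller (hypothetical parameter for comparison).
    return line[::-1].replace('.', '\\.').replace('*', wildcard).replace('!', '')


name = "abc.def.example.co.uk"[::-1]

greedy = re.compile(rule_to_pattern("*.uk", ".*"))
fixed = re.compile(rule_to_pattern("*.uk", "[^\\.]*"))

# '.*' also matches dots, so the wildcard swallows every remaining label.
greedy_match = greedy.match(name).group(0)[::-1]
# '[^\\.]*' stops at the next dot, so the wildcard covers one label only.
fixed_match = fixed.match(name).group(0)[::-1]

print(greedy_match)
print(fixed_match)
```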

    ### Michael Still

    Thanks for all your super helpful comments. I'll tweak the source now and do another "release". I also think it's a good idea to add unit tests, so I'll work on that as soon as I can.
