Content here is by Michael Still mikal@stillhq.com. All opinions are my own.


Tue, 24 Nov 2009



Python effective TLD library bug fix

posted at: 13:57 | path: /python/etld | permanent link to this entry


Sun, 01 Nov 2009



Python effective TLD library update

    The effective TLD library is now being used for a couple of my projects, but I've had some trouble with it being almost unusably slow. I woke up this morning with the revelation that the problem is that I use regexps to match domain names, but a failed match is only detected at the end of the string. That means that the FSA has to scan the entire string before it can decide that it isn't a match. That's expensive.

    I ran some tests on tweaks to try to fix this. Without any changes, scanning 1,000 semi-random domain names took 6.941666 seconds. I then tweaked the implementation to reverse the strings it was scanning, and that more than halved the run time of the test, to 3.212203 seconds. That's a big improvement, but still way too slow. The next thing I tried was adding buckets of rules on top of those reverse matches. In other words, the code now assumes that anything after the last dot is some form of TLD approximation, and only executes rules which also have that string after the last dot. This was a massive improvement, with 1,000 domains taking only 0.026120 seconds.
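
    To make the reverse-and-bucket idea concrete, here is a minimal sketch. It is an illustration only, not the code in etld.py: the names RULES, build_buckets and split_host are made up for this example, the rule list is a tiny hand-picked subset, and the wildcard and exception rules in Mozilla's real list are ignored.

```python
import re
from collections import defaultdict

# Hypothetical, hand-picked subset of rules (assumption: the real
# library loads the full Mozilla list instead).
RULES = ["com", "org", "com.au", "edu.au", "co.uk"]


def build_buckets(rules):
    """Compile one reversed regexp per rule, grouped by the final label."""
    buckets = defaultdict(list)
    for rule in rules:
        labels = rule.split(".")
        # Reverse the labels so the TLD comes first: "com.au" -> "au.com"
        reversed_rule = ".".join(reversed(labels))
        # Anchor at the start; capture the label that follows the rule
        pattern = re.compile("^" + re.escape(reversed_rule) + r"\.([^.]+)")
        buckets[labels[-1]].append((len(labels), pattern, rule))
    # Within a bucket, try the rules with the most labels first
    for entries in buckets.values():
        entries.sort(key=lambda entry: entry[0], reverse=True)
    return buckets


def split_host(host, buckets):
    """Return (domain label, effective TLD), or None if no rule matches."""
    labels = host.lower().split(".")
    reversed_host = ".".join(reversed(labels))
    # Only the rules sharing the host's last label are ever executed
    for _, pattern, rule in buckets.get(labels[-1], []):
        match = pattern.match(reversed_host)
        if match:
            return match.group(1), rule
    return None


BUCKETS = build_buckets(RULES)
print(split_host("www.example.com.au", BUCKETS))
```

    With the strings reversed, a non-matching rule fails within the first few characters instead of at the end, and the bucket lookup means most rules are never tried at all.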

    I've updated the code at http://www.stillhq.com/python/etld/etld.py.

    Tags for this post: python etld effective tld performance regexp

posted at: 09:45 | path: /python/etld | permanent link to this entry


Mon, 26 Oct 2009



Python effective TLD library

    I recently needed a library which would take a host name and return the domain-specific portion of the name, and the effective TLD being used. "Effective TLD" is a term coined by the Mozilla project for something which acts like a TLD. For example, .com is a TLD and has domains allocated under it. However, .au is a TLD with no domains allocated directly under it. The effective TLDs for the .au domain are things like .com.au and .edu.au. Whilst there are libraries for other languages, I couldn't find anything for Python.
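
    As a rough illustration of what "effective TLD" means in practice, the sketch below does a plain set lookup over host name suffixes. This is a deliberately simplified stand-in and not how etld.py works: EFFECTIVE_TLDS and find_effective_tld are hypothetical names, and the set holds only the handful of suffixes mentioned above rather than Mozilla's full list.

```python
# Hypothetical, tiny subset of effective TLDs (assumption for this example)
EFFECTIVE_TLDS = {"com", "com.au", "edu.au"}


def find_effective_tld(host):
    """Return (domain label, effective TLD), or None if nothing matches."""
    labels = host.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first, but never the whole host name
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in EFFECTIVE_TLDS:
            # The label just before the suffix is the domain-specific part
            return labels[i - 1], candidate
    return None


print(find_effective_tld("www.stillhq.com"))
print(find_effective_tld("www.example.com.au"))
print(find_effective_tld("example.au"))
```

    The last call returns None because "au" on its own is not in the set, which mirrors the point that nothing is allocated directly under .au.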

    I therefore wrote one. It's very simple, and not optimal. For example, I could do most of the processing with a single regexp if Python supported more than 100 match groups in a regexp, but it doesn't. I'm sure I'll end up revisiting this code sometime in the future. Additionally, the code ended up being much easier to write than I expected, mainly because the Mozilla project has gone to the trouble of building a list of rules to determine the effective TLD of a host name. This is awesome, because it saved me heaps and heaps of work.

    The code is at http://www.stillhq.com/python/etld/etld.py if you're interested.

    Tags for this post: python etld effective tld mozilla

posted at: 06:42 | path: /python/etld | permanent link to this entry