Python effective TLD library update

    The effective TLD library is now being used for a couple of projects of mine, but I've had some trouble with it being almost unusably slow. I woke up this morning with the revelation that the problem is that I use regexps to match domain names, but a failed match is only detected at the end of the string. That means that the FSA has to scan the entire string before it can decide that it isn't a match. That's expensive.

    I ran some tests on tweaks to try and fix this. Without any changes, scanning 1,000 semi-random domain names took 6.941666 seconds. I then tweaked the implementation to reverse the strings it was scanning, and that more than halved the run time of the test, to 3.212203 seconds. That's a big improvement, but still way too slow. The next thing I tried was adding buckets of rules on top of those reverse matches. In other words, the code now assumes that anything after the last dot is some form of TLD approximation, and only executes rules which also have that string after the last dot. This was a massive improvement, with 1,000 domains taking only 0.026120 seconds.
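
    To make the two tweaks concrete, here is a minimal sketch of the idea. It is not the code from etld.py. The rule list, the build_buckets and registered_domain names, and the "longest rule wins" tie-break are all assumptions made for illustration, and wildcard and exception rules are ignored entirely.

```python
import re
from collections import defaultdict

# Hypothetical, tiny rule set for illustration only. A real effective TLD
# list is far larger and also contains wildcard and exception rules.
RULES = ["com", "org", "au", "com.au", "uk", "co.uk"]


def build_buckets(rules):
    """Group reversed-string regexps by the text after the rule's last dot."""
    buckets = defaultdict(list)
    for rule in rules:
        tld = rule.rsplit(".", 1)[-1]
        # Reverse the rule so the suffix sits at the start of the pattern,
        # then require one more label (the registered name) after it.
        pattern = re.compile(re.escape(rule[::-1]) + r"\.[^.]+")
        buckets[tld].append((len(rule), pattern))
    return buckets


def registered_domain(domain, buckets):
    """Return the effective TLD plus one label, or None if no rule matches."""
    domain = domain.lower().rstrip(".")
    tld = domain.rsplit(".", 1)[-1]
    reversed_domain = domain[::-1]

    best = None
    # Only rules in the bucket for this last label are ever executed.
    for rule_len, pattern in buckets.get(tld, ()):
        # match() anchors at the start of the reversed string, so a
        # mismatch is detected within the first few characters.
        found = pattern.match(reversed_domain)
        if found and (best is None or rule_len > best[0]):
            best = (rule_len, found.group(0)[::-1])
    return best[1] if best else None


BUCKETS = build_buckets(RULES)
print(registered_domain("www.example.co.uk", BUCKETS))
print(registered_domain("a.b.example.com", BUCKETS))
```

    In this sketch a lookup for www.example.co.uk only ever runs the two rules in the "uk" bucket, and each of those either matches or fails near the start of the reversed string rather than at the end.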

    I've updated the code at http://www.stillhq.com/python/etld/etld.py.

    Tags for this post: python etld effective tld performance regexp

posted at: 09:45 | path: /python/etld | permanent link to this entry