Python effective TLD library update
The effective TLD library is now being used for a couple of projects of mine, but I've had some trouble with it being almost unusably slow. I woke up this morning with the revelation that the problem is that I use regexps to match domain names, but a failed match is only detected at the end of the string. That means the FSA has to scan the entire string before it can decide that it isn't a match. That's expensive.
I ran some tests on tweaks to try to fix this. Without any changes, scanning 1,000 semi-random domain names took 6.941666 seconds. I then tweaked the implementation to reverse the strings it was scanning, which more than halved the run time of the test, to 3.212203 seconds. That's a big improvement, but still way too slow. The next thing I tried was adding buckets of rules on top of those reversed matches. In other words, the code now assumes that anything after the last dot is some form of TLD approximation, and only executes rules which also have that string after the last dot. This was a massive improvement, with 1,000 domains taking only 0.026120 seconds.
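The reverse-and-bucket idea described above can be sketched roughly like this. It is a minimal illustration under stated assumptions, not the actual etld.py code: the names `SAMPLE_RULES`, `build_buckets` and `split_domain` are hypothetical, and the rule list is a tiny hand-picked sample rather than the real effective TLD data file.

```python
import re
from collections import defaultdict

# Hypothetical sample; the real rules come from the effective TLD data file.
SAMPLE_RULES = ["com", "uk", "co.uk", "org.uk"]


def build_buckets(rules):
    """Group reversed-rule regexes by the label after the last dot."""
    buckets = defaultdict(list)
    for rule in rules:
        last_label = rule.rsplit(".", 1)[-1]
        # Reverse the rule so a mismatch shows up in the first few
        # characters instead of at the end of the string.
        pattern = re.compile(re.escape(rule[::-1]) + r"\.(.+)$")
        buckets[last_label].append((rule, pattern))
    return buckets


def split_domain(hostname, buckets):
    """Return (rest, etld) for the longest matching rule, or None."""
    last_label = hostname.rsplit(".", 1)[-1]
    reversed_host = hostname[::-1]
    best = None
    # Only rules sharing the final label are ever executed.
    for rule, pattern in buckets.get(last_label, ()):
        m = pattern.match(reversed_host)
        if m and (best is None or len(rule) > len(best[1])):
            best = (m.group(1)[::-1], rule)
    return best


buckets = build_buckets(SAMPLE_RULES)
print(split_domain("www.something.org.uk", buckets))
```

With this layout a lookup for a `.uk` name never touches the `.com` rules at all, which is presumably where most of the speedup comes from once the rule list grows to thousands of entries.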
I've updated the code at http://www.stillhq.com/python/etld/etld.py.
posted at: 09:45 | path: /python/etld
-
###
Darryl
Hi,
Coming from the DNS world, the notion of "effective" TLDs is a little foreign to me. That aside, I see some results that look a little odd.
example 1. this works as expected:
something.org.uk ==> etld='org.uk' domain='something'
example 2. not what I expected
www.something.org.uk ==> etld='something.org.uk' domain='www'
I would have expected that the etld would be the same as the first
example and the domain would change. Am I missing something?
Thanks,
Darryl
-
###
Chris Hills
There seems to be a problem with wildcards.
abc.def.example.com -> ('abc.def.example', 'com')
abc.def.example.co.uk -> ('abc', 'def.example.co.uk')
I had to sort the list like so to get most queries working as expected. For instance, blah.auto.pl would give "pl" rather than "auto.pl".
grep -v '//' effective_tld_names.dat | grep ^\. | awk '{print length"\t"$0}'|sort -nr|cut -f2-
-
###
Bunker
To be honest, I didn't understand the essence at first. But after re-reading, it all became clear at once.
-
###
Chris Hills
To fix the wildcard problem replace the line:
line = line[::-1].replace('.', '\\.').replace('*', '.*').replace('!', '')
with:
line = line[::-1].replace('.', '\\.').replace('*', '[^\\.]*').replace('!', '')
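A small sketch of why the narrower character class matters. Only the two `replace` chains come from the lines above; the `rule_to_regex` helper and the `*.uk` rule are assumptions made for illustration, and the real library may anchor or group its patterns differently.

```python
import re


def rule_to_regex(line, wildcard):
    """Reverse a rule and expand '*' using the given wildcard pattern."""
    return line[::-1].replace('.', '\\.').replace('*', wildcard).replace('!', '')


rule = '*.uk'  # hypothetical wildcard rule
greedy = re.compile(rule_to_regex(rule, '.*'))
single_label = re.compile(rule_to_regex(rule, '[^\\.]*'))

host = 'abc.def.example.co.uk'[::-1]  # 'ku.oc.elpmaxe.fed.cba'

# '.*' runs across dots and swallows the whole reversed hostname.
greedy_match = greedy.match(host).group(0)[::-1]
# '[^\.]*' stops at the next dot, so the wildcard covers exactly one label.
label_match = single_label.match(host).group(0)[::-1]

print(greedy_match)  # abc.def.example.co.uk
print(label_match)   # co.uk
```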
-
###
Michael Still
Thanks for all your super helpful comments. I'll tweak the source now and do another "release". I also think it's a good idea to add unit tests, so I'll work on that as soon as I can.
