stillhq.com : Mikal, a geek from Canberra living in Silicon Valley (no blather posts) http://www.stillhq.com The life, times, travel and software of Michael Still (no blather posts) en Copyright (c) Michael Still 2000 - 2006 blosxom simplerss20 v20050208hh 180 http://blogs.law.harvard.edu/tech/rss Noticed that smtpsurvey.stillhq.com is down? /research/smtp/survey Thu, 07 Aug 2008 09:11:00 GMT smtpsurvey.stillhq.com has been down for a couple of days now. This is because the machine at ANU which hosts the data has a hardware fault, and service techs have not yet arrived on site. The ever-helpful admins at ANU are aware of the problem, and are pursuing it as rapidly as they can. <br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) smtp(<a href="http://www.stillhq.com/smtp"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) survey(<a href="http://www.stillhq.com/survey"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/smtp/survey/000004.commentform.html">Comment</a> http://www.stillhq.com/research/smtp/survey/000004.html http://www.stillhq.com/research/smtp/survey/000004.html RemoteWorker v74 /research Sun, 15 Jun 2008 09:44:00 GMT This is the next public release of remoteworker, <a href="http://www.stillhq.com/research/remoteworker-v70.html">my distributed internet measurement system</a> used for my <a href="http://smtpsurvey.stillhq.com">survey of internet SMTP servers</a>. I don't intend to do a public release every time the version number increments, because I'm still actively working on the code, as can be seen from the way I've been through four versions in the last fourteen days. <br/><br/> That said, this release is a big improvement over the previous one. 
Changes include: <br/><br/> <ul> <li> No longer force the use of Python 2.4 <li> Small TLS probe bug fix (a missing newline in the output) <li> Probe the DNS servers in /etc/resolv.conf and find one which appears to work. Previously only the first entry in /etc/resolv.conf was used, which would mean DNS jobs failed on machines with a bad first entry. <li> The service thread will no longer crash if an async job which isn't a SMTP probe exists <li> TLS errors are no longer reported as probe errors, and are instead reported as TLS errors. This means that good probe values aren't clobbered by subsequent TLS errors <li> DNS lookups (DNS-LOOKUP, DNS-REVERSE, and IPV6-DNS-LOOKUP) are now done asynchronously. This is much faster -- with 10,000 lookups taking about seven minutes, even with the default rate limiting <li> Added a user agent for command fetches as well as HTTP-FETCHes <li> Command fetches now support gzip encoding, saving a lot of central server bandwidth. Your central server will also need to support gzip encoding for this to work too</ul> <br/><br/> The source code is <a href="http://www.stillhq.com/research/remoteworker-v74.tgz">here</a>. <br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/remoteworker-v74.commentform.html">Comment</a> http://www.stillhq.com/research/remoteworker-v74.html http://www.stillhq.com/research/remoteworker-v74.html Dear Lazyweb: how do I check SSL keys for vulnerability? /research Sun, 08 Jun 2008 21:03:00 GMT Based on conversations on the Freenode channel #linux.conf.au, I modified my <a href="http://smtpsurvey.stillhq.com">survey of mail servers</a> to attempt a STARTTLS command, and collect SSL key fingerprints from the mail servers which have a valid response. I now have a collection of SSL keys "from the wild". 
Interestingly, the distribution is decidedly non-random, with <i>5c4b1e60f69c168d40ad648017f8856a7d3816c7</i> appearing more than 7,000 times in my dataset. <br/><br/> I've had a quick look at the openssl-blacklist package on Ubuntu, and it's not immediately obvious how I can efficiently feed a large list of SSL key fingerprints to openssl-vulnkey to determine which ones are vulnerable. It occurs to me that someone must have already thought about this. Does that person want to save me some time? <br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/000005.commentform.html">Comment</a> http://www.stillhq.com/research/000005.html http://www.stillhq.com/research/000005.html RemoteWorker v70 /research Sun, 01 Jun 2008 14:03:00 GMT I've been meaning for a while to release the RemoteWorker code I use for my <a href="http://smtpsurvey.stillhq.com">SMTP server survey</a> as open source. I finally got around to it. <br/><br/> You can find the source code <a href="http://www.stillhq.com/research/remoteworker-v70.tgz">here</a>. Here is what the README file in that tarball has to say for itself: <br/><br/> <blockquote> RemoteWorker is licensed under the terms of the GNU GPL v2 and is Copyright (C) Michael Still (mikal@stillhq.com) 2007 and 2008. Note that several of the dependencies have their own licenses, as described at the end of this file. <br/><br/> This is version v70 of RemoteWorker, a system intended for measuring the Internet by running on many machines at once. Setup of RemoteWorker is not simple, because everyone's use of the system will be different. You should expect to have to write your own bootstrap script and your own SSH babysitter, and you will probably need to implement more commands in RemoteWorker itself to meet your measurement needs. <br/><br/> RemoteWorker is very simple. 
It takes a list of filenames on the command line, and executes the commands from each of these files in the order they are listed on the command line. The commands currently implemented are: <br/><br/> <table> <tr><td>START: </td><td>Denotes the start of a command file. Not required.</td></tr> <tr><td>END: </td><td>Denotes the end of a command file. Not required.</td></tr> <tr><td>COMMAND-FETCH: </td><td>Uses HTTP to fetch a new command file. The new commands are stored in a file named todo.new in the current working directory</td></tr> <tr><td>SLEEP: </td><td>Sleep for the named number of seconds</td></tr> <tr><td>SERVICE-THREAD: </td><td>Run a service thread which can be connected to on the port specified by the command. If you then telnet to this port, you'll be told the currently running command, and the amount of memory in use by RemoteWorker</td></tr> <tr><td>HEARTBEAT-THREAD: </td><td>Run a heartbeat thread which will regularly log that the RemoteWorker process is still running</td></tr> <tr><td>SMTP-PROBE: </td><td>Probe a remote SMTP server on TCP port 25. This will perform a traceroute (and return traceroute results), and a SMTP probe if the remote server is not hosted on, or routed via, an AS which has asked to not receive probe traffic.</td></tr> <tr><td>DEST-AS-CHECK: </td><td>Check if named IP is hosted by an AS which doesn't like probe traffic.</td></tr> <tr><td>TRACEROUTE: </td><td>Perform a traceroute to the IP, including AS information</td></tr> <tr><td>MX-LOOKUP: </td><td>Perform a DNS MX lookup for a named domain</td></tr> <tr><td>DNS-LOOKUP: </td><td>Perform a forward DNS lookup</td></tr> <tr><td>DNS-REVERSE: </td><td>Perform a reverse DNS lookup</td></tr> <tr><td>IPv6-DNS-LOOKUP: </td><td>Perform an IPv6 DNS lookup</td></tr> <tr><td>HTTP-FETCH: </td><td>Fetch the named URL, and log the HTML of the page. 
The HTML will be truncated if it is too long</td></tr> </table> <br/><br/> A sample command file might look like this: <br/><br/> <pre> SMTP-PROBE 1.2.3.4 DNS-REVERSE www.stillhq.com HTTP-FETCH http://www.stillhq.com/research/http_test.txt </pre> <br/><br/> Note that RemoteWorker needs to be run as root for the traceroute implementation to work. Run that command file to see what happens! <br/><br/> There are a few other things to note: <br/><br/> <b>Dependencies</b> <br/><br/> You will need to install the gflags module from http://code.google.com/p/google-gflags/ before some of the helper scripts will work. The worker nodes do not, however, need gflags. <br/><br/> <b>Command and control</b> <br/><br/> My workers run a simple shell script which sits in a loop (sitter.sh). This script runs a series of command files. One of these command files uses the COMMAND-FETCH command to contact a CGI script and download more work. Your RemoteWorkers will need to do something similar, but there is nothing forcing you to deliver the work via HTTP. You could for example just write a simple script which calculates what work to perform in some manner, and then runs that. If you specify a file named "-" on the command line, then RemoteWorker will run in interactive mode and read its commands from stdin. <br/><br/> <b>Bootstrapping</b> <br/><br/> Not all PlanetLab nodes are created equal, and I therefore need a bootstrap script which checks the capabilities of a node. This script then writes out the command file which fetches new commands. The capability checks are in the command files named sample and sample-smtpprobes. <br/><br/> <b>Collecting logs</b> <br/><br/> RemoteWorker doesn't return its results in real time; it just writes them to stdout. My sitter script then writes these to log files on disk, and I use an SSH-based babysitter script to download and process them. 
This is useful because it means that the server handing out work doesn't need to be up when the commands are executed, and they just sit around on disk waiting for me to collect them later. <br/><br/> <b>SMTP probes, traceroutes, and the need for AS maps</b> <br/><br/> If you intend to use the SMTP probing or tracerouting functionality, then you need to have a recent mapping of IP ranges to AS numbers. I don't however include this map in this download because it is quite large and becomes out of date rapidly. You can create your own mapping trivially, by running these commands: <br/><br/> <pre> ./fetchnewasmap.sh ./preprocessmap.py </pre> <br/><br/> This will download an AS map from the U Washington Computer Science department, and then preprocess it into the correct format for the workers. However, you should only do the download once, and then push the processed file to the workers, as processing the file into the correct format is quite expensive. <br/><br/> <b>Dependency credits</b> <br/><br/> RemoteWorker is intended to be installed on worker nodes with a minimal python installation without needing other things to be installed on the system itself. There are therefore several dependencies which are shipped with RemoteWorker, but weren't written by me. I take no credit for these dependencies, and it should be noted that they each have their own license and usage rules. Please check that you are obeying the licenses for these dependencies. <br/><br/> Dependencies currently shipped with RemoteWorker are: <br/><br/> <ul> <li>pydns: used for DNS lookups <li>IPy: used for IP address manipulation <li>tlslite: a light weight SSL implementation used for SMTP TLS checks </ul> </blockquote> <br/><br/> So there you are. 
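<br/><br/> As a rough illustration of the command file format the README describes, here is a minimal sketch of a parser. It is not RemoteWorker's actual code: the function name parse_commands, the one-command-per-line layout, the optional trailing colon, and the case-insensitive matching are all assumptions made for this example. Only the command names come from the README.

```python
import io

# Command names as listed in the README, stored upper-case.
KNOWN_COMMANDS = frozenset([
    "START", "END", "COMMAND-FETCH", "SLEEP", "SERVICE-THREAD",
    "HEARTBEAT-THREAD", "SMTP-PROBE", "DEST-AS-CHECK", "TRACEROUTE",
    "MX-LOOKUP", "DNS-LOOKUP", "DNS-REVERSE", "IPV6-DNS-LOOKUP", "HTTP-FETCH",
])


def parse_commands(stream):
    """Yield (command, argument) pairs in the order they appear.

    Assumption: one command per line, with the command name separated
    from its argument by a single space.
    """
    for raw_line in stream:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        name, _, argument = line.partition(" ")
        name = name.rstrip(":").upper()
        if name not in KNOWN_COMMANDS:
            raise ValueError("unknown command: %s" % name)
        yield name, argument.strip()


# The sample command file from the README, assumed to be one command per line.
SAMPLE = (
    "SMTP-PROBE 1.2.3.4\n"
    "DNS-REVERSE www.stillhq.com\n"
    "HTTP-FETCH http://www.stillhq.com/research/http_test.txt\n"
)

commands = list(parse_commands(io.StringIO(SAMPLE)))
for name, argument in commands:
    print(name, argument)
```

A real worker would dispatch each pair to a handler instead of printing it. Rejecting unknown names early is a design choice of this sketch, not something the README specifies.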
<br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/remoteworker-v70.commentform.html">Comment</a> http://www.stillhq.com/research/remoteworker-v70.html http://www.stillhq.com/research/remoteworker-v70.html Are license tags common in web pages? /research Wed, 21 May 2008 02:30:00 GMT <a href="http://www.cyberlawcentre.org/unlocking-ip/blog/2008/05/night-of-analysing-data.html">Ben Bildstein</a> talks about his attempts to determine if license tags are common on web pages. This seems like a perfect use of <a href="http://www.planet-lab.org">PlanetLab</a> to me, where downloading a few million web pages and performing an analysis isn't hard. For example, <a href="http://www.stillhq.com/research/smtp/mxes_feb2008.html">I downloaded over a million web pages in a few days</a> a little while ago. <br/><br/> Ben's problem seems easier than the parking analysis though, as I presume that he doesn't need to actually store the downloaded pages. If a simple regexp check of the content is sufficient, then storage (which is the slow bit) goes away as an issue. <br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/000004.commentform.html">Comment</a> http://www.stillhq.com/research/000004.html http://www.stillhq.com/research/000004.html The web is probably parkier than it seems /research Tue, 01 Apr 2008 20:08:00 GMT I've been reading academic papers again (it tends to happen in batches) -- this time I've been focusing on the papers from the <a href="http://www.imconf.net/imc-2007/">Usenix Internet Measurement Conference (IMC) last year</a>. 
One of the more immediately interesting papers presented was <a href="http://www.imconf.net/imc-2007/papers/imc124.pdf">The Web is Smaller than it Seems</a> (<a href="http://portal.acm.org/citation.cfm?id=1298306.1298324&coll=Portal&dl=ACM&CFID=22497112&CFTOKEN=21988624#">bibtex</a>). <br/><br/> The paper discusses a measurement of the size of "the web" based on a scan of domain names listed in either the <a href="http://www.dmoz.org">DMOZ open directory</a> or the <a href="http://www.stillhq.com/research/domain_access.html">.com and .net TLD zone files</a>. You'll note that is a very similar technique to one of those that I use to acquire domain names for my <a href="http://smtpsurvey.stillhq.com">survey of Internet mail servers</a>, which is what originally interested me in this paper. The domains had their www hostname looked up, and then the number of domain names per IP address was used to create an estimate of the total number of web servers present on the Internet. It is of course a little bit more complicated than that, but you can read the paper for more details if you really want. <br/><br/> The paper's findings are interesting: <br/><br/> <blockquote> We find that as much as 60% of the Web servers are co-hosted with 10, 000 or more other Web servers, indicating that the Internet contains many small co-hosted Web servers. Likewise, more than 95% of Web servers share their AS with 1000 or more other Web servers. We additionally find that heavily co-hosted Web servers contribute much less traffic than Web servers that are not co-hosted, confirming that popular servers are not co-located, while less popular servers co-locate more frequently. When considering block lists, we find the vast majority of blocked Web servers are hosted on IPs hosting 100 Web servers or more. This indicates there may be a great deal of collateral damage with IP blocking. 
Finally, when looking at authoritative DNS servers, we see a high degree of co-location on a very small number of DNS servers, which may result in the Web being fragile from a DNS perspective. </blockquote> <br/><br/> That's a pretty interesting result. Unfortunately, I think that the researchers missed an opportunity here. While they determined that a small number of IP addresses host a large number of web sites, they didn't attempt to determine how many of those domains are just parked content. Now that would have been something interesting to know. Specifically, I've poked around a little with <a href="http://www.stillhq.com/research/smtp/mxes_feb2008.html">the parking behaviour of domains</a> via the result of MX record look ups, which leads me to suspect that a large number of those heavily co-located domain names are simply parked, and not adding any interesting content to the Internet. <br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/websize_apr2008.commentform.html">Comment</a> http://www.stillhq.com/research/websize_apr2008.html http://www.stillhq.com/research/websize_apr2008.html The Internet is a strange place /research Sun, 23 Mar 2008 14:14:00 GMT As mentioned <a href="http://www.stillhq.com/research/smtp/mxes_feb2008.html">previously</a>, I've been downloading HTTP pages as part of my <a href="http://smtpsurvey.stillhq.com">survey of Internet mail servers</a> in order to detect domain parking behaviour. I should have thought a bit harder about that code though, because the implementation is a bit naive. Specifically, the code downloads the source of the web page (to RAM), and then base64 encodes it (to RAM), and finally writes it to the log file. That means that there is a little bit more than two copies of a given page's source in RAM before the operation is complete. 
However, it hadn't occurred to me that sites such as <a href="http://sixela.com/">http://sixela.com/</a> would exist. That URL results in an endless stream of the word "blah". It took me three worker deaths before I figured out what the problem was, mainly because when workers use too much RAM their slice is killed, and often the log files are lost. <br/><br/> So the moral of this tale? Don't trust the Internets. <br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/000003.commentform.html">Comment</a> http://www.stillhq.com/research/000003.html http://www.stillhq.com/research/000003.html Normalising mail server package names /research/smtp/survey Fri, 21 Mar 2008 18:37:00 GMT While starting to look at mail server deployment trends, it came to my attention that I needed to normalise the names used for various mail servers across the mail server surveys for which I have data. In some cases the other guys' name for a given mail server was more accurate than mine, so you might notice over the next couple of days that mail server names are a bit variable in the results I have online. 
<br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) smtp(<a href="http://www.stillhq.com/smtp"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) survey(<a href="http://www.stillhq.com/survey"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/smtp/survey/000003.commentform.html">Comment</a> http://www.stillhq.com/research/smtp/survey/000003.html http://www.stillhq.com/research/smtp/survey/000003.html Announcing early results of my survey of SMTP servers /research/smtp/survey Wed, 19 Mar 2008 08:54:00 GMT Since June 2007 I have been building as close to an exhaustive survey of SMTP servers connected to the Internet as possible. This has involved coming up with a method for finding IP addresses to probe, probing those IP addresses, and generating results from the data collected. That code has been "finished" for a while now, and I am now ready to make it available to the public. <br/><br/> The current data set includes 46,135,101 IP addresses, with 1,942,603 successfully identified servers. The <a href="http://smtpsurvey.stillhq.com/smtp-survey.cgi?dashboard=1">results for the survey are online</a>, as well as <a href="http://smtpsurvey.stillhq.com/smtp-survey.cgi?dashboard=2">status information for the machines running the measurement system</a> (<a href="http://smtpsurvey.stillhq.com/smtp-survey.cgi?dashboard=3">a different view of that data is available as well</a>). You can even <a href="http://smtpsurvey.stillhq.com/smtp-survey.cgi?dashboard=4">look up your favourite domain name to see what software it's running</a>. <br/><br/> This is the most recent open survey of SMTP servers that I am aware of. <a href="http://smtpsurvey.stillhq.com/">There have been other surveys</a>, but they are either quite old or don't make their data publicly accessible. 
It's quite possible there are bugs in the web site which displays the data, so <a href="mailto:mikal@stillhq.com?subject=SMTP Survey">please let me know if you find one</a>. Apart from that, I hope this data is useful to others. <br/><br/> <form name="lookup" action="http://smtpsurvey.stillhq.com/smtp-survey.cgi" method="post">Use this form to look up what mail server software a given domain is using. Remember to enter a domain name (like ibm.com), not a hostname (like www.ibm.com).<br/><br/>Lookup: <input type="text" name="lookup" size=50> <input type="submit" value="Submit"></form> <br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) smtp(<a href="http://www.stillhq.com/smtp"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) survey(<a href="http://www.stillhq.com/survey"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/smtp/survey/000002.commentform.html">Comment</a> http://www.stillhq.com/research/smtp/survey/000002.html http://www.stillhq.com/research/smtp/survey/000002.html What is the definition of publication? /research Tue, 18 Mar 2008 09:11:00 GMT Here's my quandary for the day -- I have an online interface for my SMTP survey data that allows users to do things like see current overall results, investigate the state of the worker nodes used to perform the measurements, and perform lookups against the data. However, I'm worried that I can't put this interface online because that might count as prior publication, and therefore preclude me from presenting information derived from the survey at academic conferences. I have a similar question about blog posts as well -- does blogging about something I am investigating mean that I can't publish it later? <br/><br/> This must be a common question. How do other researchers handle these sorts of publication issues? 
<br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/000002.commentform.html">Comment</a> http://www.stillhq.com/research/000002.html http://www.stillhq.com/research/000002.html Compendium of TLD domain access agreements /research Sun, 16 Mar 2008 10:12:00 GMT One of the things I need to further my SMTP survey is lists of domain names. Lots of domain names. I'm sure I'm not the only one who is interested in such things, so I figured I'd take notes on what the process was to get access to that sort of information for research. This post is a "living document", and I'll update it as I find details of new registrars, or actually experience their service. <br/><br/> Note that almost all of these TLDs use basically identical access agreements. Access is free, but you agree to not hammer the registrar's servers, and not to do something like spamming with the data. Additionally, the domain names can only be redistributed as part of a value-added product, and without the ability to dump all of the data the registrar provided in one pass. I need to think about that bit more, because it probably restricts the way I allow use of my dataset. <br/><br/> <table width=100%> <tr><td><b>Access</b></td><td><b>TLD</b></td><td><b>Registrar</b></td><td><b>Procedure</b></td><td><b>Thanks to</b></td></tr> <tr> <td>Yes</td> <td>.com<br/>.net<br/>.arpa<br/>.root</td> <td>Verisign</td> <td>To get access to the list of domains in these four TLDs, you need to sign the <a href="http://www.verisign.com/information-services/naming-services/com-net-registry/page_001052.html">domain access agreement</a> and fax the contract to Verisign. If accepted (they accepted me as an individual with no real problems), they say they will fax login details back to you. However, they snail mailed mine to my contact address, for reasons I can't explain. 
Access is provided by FTP, and you must connect from a static IP which is defined in the access agreement.</td> <td>&nbsp;</td> </tr> <tr bgcolor="#EEEEEE"> <td>No</td> <td>.edu</td> <td>eduCause</td> <td>Access is not available to the list of registered domains, as per US Department of Commerce requirements. The registrar is seeking to be allowed to provide this data in the future, so it might be worth using the <a href="http://www.educause.edu/edudomain/contact_us.asp">contact form</a> to check if things have changed.</td> <td>&nbsp;</td> </tr> <tr> <td>Partial</td> <td>.info<br/>.asia</td> <td>Afilias</td> <td>Afilias is an outsourced TLD registrar. They have different access agreement forms for each of the TLDs they manage: <br/><br/> .info: Print out <a href="http://www.afilias.info/faqs/for_registrars/Zone_File_Access_Agreement.pdf">the domain access agreement</a> and fax it to Afilias at +1 215 706 5701. My request for access was not responded to. <br/><br/> .asia: Print out <a href="http://www.dotasia.org/info/DAO.ZONE-2007-10-24.pdf">the domain access agreement</a> and fax it to Afilias at +1 215 706 5701. Access was granted via an email, the data is exposed over FTP. My request was responded to after a several week delay.</td> <td>&nbsp;</td> </tr> <tr bgcolor="#EEEEEE"> <td>Yes</td> <td>.mobi</td> <td>dotMobi</td> <td>Print out <a href="http://mtld.mobi/system/files/mobi_zone_file_access_agreement_0.pdf">the domain access agreement</a>, fill it out, and then scan / email it to <a href="mailto:operations@mtld.mobi">operations@mtld.mobi</a>. They replied within a week or so via email, providing a FTP username and password. 
Note that they only allow you to download the zone files once a day.</td> <td>&nbsp;</td> </tr> <tr> <td>&nbsp;</td> <td>.org</td> <td>&nbsp;</td> <td>Application form is at <a href="http://www.pir.org/PDFs/zone_file_access_agreement.pdf">http://www.pir.org/PDFs/zone_file_access_agreement.pdf</a></td> <td>Andy Warner</td> </tr> <tr bgcolor="#EEEEEE"> <td>&nbsp;</td> <td>.biz</td> <td>&nbsp;</td> <td>Application form is at <a href="https://www.neulevel.biz/zonefile/">https://www.neulevel.biz/zonefile/</a></td> <td>Andy Warner</td> </tr> </table> <br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/domain_access.commentform.html">Comment</a> http://www.stillhq.com/research/domain_access.html http://www.stillhq.com/research/domain_access.html Mikal, tell me something I didn't know about SMTP servers on the Internet /research/smtp Fri, 14 Mar 2008 20:14:00 GMT As part of my <a href="http://www.stillhq.com/research/smtp/survey/poster-lisa2007.html">survey of SMTP servers on the Internet</a> (a graphical representation of the results from that post is <a href="http://www.stillhq.com/research/smtp/survey/000001.html">here</a>), I need to find SMTP servers to survey. One of the ways I've been doing that is by performing large numbers of DNS Mail eXchanger (MX) lookups and then probing the SMTP servers identified by those lookups. I haven't been able to perform those lookups on every domain registered, because not all registrars make their zone files available to researchers. I have a <a href="http://www.stillhq.com/research/domain_access.html">compendium of what I've learnt about zone file access agreements online if you're interested</a>. 
<br/><br/> Specifically, I performed the following lookups: <br/><br/> <table> <tr><td><i>Zone</i></td><td>&nbsp;</td><td><i>Number of lookups</i></td></tr> <tr><td>.arpa</td><td>&nbsp;</td><td>5</td></tr> <tr><td>.asia</td><td>&nbsp;</td><td>9,044</td></tr> <tr><td>.com</td><td>&nbsp;</td><td>72,529,657</td></tr> <tr><td>.mobi</td><td>&nbsp;</td><td>819,849</td></tr> <tr><td>.net</td><td>&nbsp;</td><td>10,734,157</td></tr> <tr><td>.root</td><td>&nbsp;</td><td>281</td></tr> <tr><td></td><td>&nbsp;</td><td><b>84,092,993</b></td></tr> </table> <br/><br/> For each of these domains a DNS MX record lookup was performed using <a href="http://www.planet-lab.org">around 100 machines</a>, and the results stored in a series of <a href="http://www.datacenterknowledge.com/archives/2007/Apr/27/database_sharding_helps_high-traffic_sites.html">sharded tables in a MySQL database</a>. <br/><br/> In aggregate, the results look like this: <br/><br/> <table> <tr><td>Total (IP, domain) tuples:</td><td>72,863,506</td></tr> <tr><td>Total unique IPs:</td><td>2,136,511</td></tr> <tr><td>Total unique domains:</td><td>46,993,011</td></tr> </table> <br/><br/> There are some interesting things to be found in the MX record data. For example, only 55.8% of the domains I scanned have an MX record at all. That might seem a bit counterintuitive, but when you take into account that a lot of domain names are unused or used simply for a web site, I guess it's not that surprising. I would like to spend some more time verifying that this isn't a bug in my survey code, but I haven't gotten around to doing that yet. <br/><br/> Another interesting fact is that <a href="http://www.godaddy.com">GoDaddy</a> appears to be hosting a very large number of domains. Specifically, I found 12,105,590 domains which had one of just two IP addresses owned by GoDaddy as their MX record. That's 25.76% of all of my results. 
This means that GoDaddy's domain hosting business is massive -- certainly much larger than I realized previously. <br/><br/> The IP addresses in question are 64.202.166.11 and 64.202.166.12. Some detail: <br/><br/> <table> <tr><td>IP</td><td>DNS Reverse</td></tr> <tr><td>64.202.166.11</td><td>mailstore1.secureserver.net</td></tr> <tr><td>64.202.166.12</td><td>smtp.secureserver.net</td></tr> </table> <br/><br/> secureserver.net is a domain registered to "Wild West Domains, Inc.", who appear to be part of the GoDaddy family (<a href="http://help.godaddy.com/topic/612/article/45">according to this GoDaddy help page, secureserver.net is used for GoDaddy DNS servers among other things</a>). To determine how many of these domains are parked, I fired off some download jobs to download the top level page of each domain. At the moment, 1,087,885 of those downloads are complete. <br/><br/> Domains parked with GoDaddy issue an HTTP 302 redirect from the top level page to a URL which is the domain name followed by a short identifier. For example, rastegarenterprises.net 302 redirects to rastegarenterprises.net/?bdb1d640 -- which is a page displaying advertising. Of the sites I have tested so far, 714,455 are parked in this manner. <br/><br/> If that ratio (about 65.7% of the pages downloaded so far) holds across all 12,105,590 domains, GoDaddy currently has approximately 7,950,196 domains parked. That's around 9.4% of all the domains I have scanned! <br/><br/> Based on looking at IPs serving as MX for an unusual number of domains, the only other immediately obvious entry is that 184,213 domains point to 127.0.0.1. That seems a little bit odd to me. <br/><br/> I'm sure there is other interesting information in this MX data, but I think I'll leave it here for now. 
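<br/><br/> The parking check and the extrapolation in this post can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not the survey's actual code. The function names looks_parked and extrapolate_parked are made up for this example. The regular expression assumes the "short identifier" is a run of 4 to 16 hex digits, which is a guess based on the single example given. The extrapolation assumes the completed downloads are a representative sample.

```python
import re

# Figures quoted in the post.
TOTAL_LOOKUPS = 84092993       # domains scanned across all zones
GODADDY_MX_DOMAINS = 12105590  # domains whose MX is one of the two GoDaddy IPs
DOWNLOADS_COMPLETE = 1087885   # top level pages downloaded so far
PARKED_IN_SAMPLE = 714455      # downloads showing the parking redirect

# Assumption: the identifier is lower-case hex, as in "bdb1d640".
PARKING_ID = re.compile(r"^[0-9a-f]{4,16}$")


def looks_parked(domain, location):
    """Return True if a 302 Location header is '<domain>/?<identifier>'."""
    target = re.sub(r"^https?://", "", location.strip().lower())
    prefix = domain.lower() + "/?"
    if not target.startswith(prefix):
        return False
    return bool(PARKING_ID.match(target[len(prefix):]))


def extrapolate_parked(parked, sampled, population):
    """Scale the parked count seen in the sample up to the whole population."""
    return parked * population // sampled


estimate = extrapolate_parked(PARKED_IN_SAMPLE, DOWNLOADS_COMPLETE,
                              GODADDY_MX_DOMAINS)
share_of_scanned = 100.0 * estimate / TOTAL_LOOKUPS

print(looks_parked("rastegarenterprises.net",
                   "rastegarenterprises.net/?bdb1d640"))
print(estimate)
print(share_of_scanned)
```

Integer arithmetic keeps the estimate free of floating point rounding. It gives 7,950,196, the figure quoted in the post.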
<br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) smtp(<a href="http://www.stillhq.com/smtp"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/smtp/mxes_feb2008.commentform.html">Comment</a> http://www.stillhq.com/research/smtp/mxes_feb2008.html http://www.stillhq.com/research/smtp/mxes_feb2008.html Initial SMTP survey poster results in a pie chart /research/smtp/survey Thu, 06 Dec 2007 16:31:00 GMT <div align=center> <img src="http://chart.apis.google.com/chart?cht=p3&chd=t:31,19,15,9,5,56&chs=600x300&chl=Exchange|Postfix|Sendmail|Anonymous|Exim|Other&chco=0000ff"> </div> <br/><br/> Graph generated with <a href="http://code.google.com/apis/chart/">Google Chart API</a>, which <a href="http://google-code-updates.blogspot.com/2007/12/embed-charts-in-webpages-with-one-of.html">was announced today</a>. <br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) smtp(<a href="http://www.stillhq.com/smtp"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) survey(<a href="http://www.stillhq.com/survey"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/smtp/survey/000001.commentform.html">Comment</a> http://www.stillhq.com/research/smtp/survey/000001.html http://www.stillhq.com/research/smtp/survey/000001.html Interesting paper: "YouTube Traffic Characterization: A View From the Edge" /research Wed, 05 Dec 2007 12:46:00 GMT <a href="http://www.imconf.net/imc-2007/papers/imc78.pdf">YouTube Traffic Characterization: A View From the Edge"</a> from <a href="http://www.imconf.net/imc-2007/">IMC 2007</a> is quite interesting. 
It tracks use of <a href="http://www.youtube.com">YouTube</a> from the University of Calgary. Interesting random quote: <br/><br/> <blockquote> In total we recorded 23,250,438 valid (i.e., non-failed) HTTP transactions (i.e., request/response pairs). These transactions account for approximately 6.54 TB of data transfer. Only 3% of the HTTP requests were for video files; however, the corresponding HTTP responses accounted for 99% of the total bytes transferred. </blockquote> <br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/000001.commentform.html">Comment</a> http://www.stillhq.com/research/000001.html http://www.stillhq.com/research/000001.html Microsoft Exchange the most popular SMTP server on the Internet? /research/smtp/survey Sat, 01 Dec 2007 15:27:00 GMT <a href="http://cs.anu.edu.au/~Eric.McCreath/">Eric McCreath</a> from the <a href="http://cs.anu.edu.au">Department of Computer Science</a> at the <a href="http://www.anu.edu.au">Australian National University</a> and I presented a poster entitled "Inferring Relative Popularity of SMTP Servers" at <a href="http://www.usenix.org">USENIX</a> <a href="http://www.usenix.org/event/lisa07/">LISA 2007</a>. This blog post is a brief discussion of the content of the poster, as well as a landing page for both the <a href="http://www.stillhq.com/research/smtp/survey/poster-lisa2007.pdf">paper version of the poster</a> and <a href="http://www.stillhq.com/research/smtp/survey/poster-lisa2007-poster.pdf">the PDF of the actual poster</a>. For more detail on the measurement techniques used, please check out the complete paper. <br/><br/> We conducted this research because there is little data on the relative popularity of the various available SMTP server implementations. 
This data is of interest because it aids the development of systems which interact with these servers. For example, a potential DDoS protection system should be tested with the most common SMTP servers, as these are the ones that it is most likely to encounter in everyday use. <br/><br/> Many businesses rely on email of some form for their day-to-day operation. This is especially true for product support organisations, who are largely unable to perform their role in the company if their in-boxes are unavailable. Allman in "<a href="http://doi.acm.org/10.1145/945131.945157">Spam, Spam, Spam, Spam, Spam, the FTC, and Spam</a>" states that Nuclear Research studies estimate that spam costs US businesses $87 billion a year. It seems reasonable to assume that if a low level attack is costing that much, then a complete outage would impose an even greater burden on an enterprise. <br/><br/> There has been little research conducted into the current state of SMTP servers on the Internet, perhaps because this area of research has not been particularly fashionable in comparison to the HTTP metrics which are commonly collected. This is an important area of research, however, given that the level of traffic served by these systems has been growing for years. Barracuda Networks cite Radicati research which indicates that in 2009, 228 billion emails will be sent per day, with the vast majority being spam (see <a href="http://www.barracudanetworks.com/ns/products/spam_features.php">Barracuda's site for more details</a>). Afergan and Beverly in "<a href="http://doi.acm.org/10.1145/1052812.1052822">The state of the email address</a>" evaluate the state of email servers in an attempt to determine how SMTP servers are coping with the growth in traffic. Their approach involved sending out probe emails to a variety of domains. Each email was crafted to have a strong assurance of bouncing because it was not addressed to a valid address. The authors then monitored the bounce traffic. 
They concluded that corporate SMTP servers are under surprising levels of strain and do not bounce undeliverable emails in a predictable manner. <br/><br/> We have therefore started to undertake research into SMTP servers as they appear on the Internet, with our first study being a simple survey of which SMTP implementations are most commonly deployed. Our poster discussed the current state of that survey, and provided some early results. <br/><br/> The challenge with determining the popularity of various SMTP server implementations is twofold -- firstly, not all of the SMTP servers which interact with the Internet can be probed from the public Internet (for example SMTP routers which route email that came from the Internet, but are not themselves accessible from the Internet); and secondly, the sheer number of SMTP servers connected to the network. We have therefore used both passive and active measurements to survey these servers. Each of these measurement techniques is described below. <br/><br/> Bearing in mind that our survey is quite new, and that only 34.6 million IP addresses have been probed so far, the initial results are quite interesting. <br/><br/> <div align=center> <img src="/research/smtp/survey/smaller-poster-lisa2007-graph.png"> </div> <br/><br/> You can see from the graph that the most popular SMTP server in our dataset is Microsoft Exchange, followed by Postfix and then Sendmail. <br/><br/> Additional analysis of our existing data, as well as further development of the email parser, will improve the accuracy of our survey, and will also increase the number of machines included in the survey. The survey also needs a wider set of inputs for possible IP addresses to probe -- one example of another possible source of probable SMTP servers is MX records for registered domain names. 
The distributed probing system needs further development to handle the scale of the probing required for a large number of SMTP servers to be included in the survey, and improvements to the reliability of the central server are also required. <br/><br/> This SMTP survey is in its early stages, and there is much work still to do. However, research of this nature is likely to produce results which are of interest to the research community as well as to software developers and systems administrators. So far a small dataset has been analysed, which has resulted in a reasonably robust distributed probing system being constructed. Further work on the survey will continue in the future, with updated results being published from time to time. <br/><br/><i>Tags for this post: research(<a href="http://www.stillhq.com/research"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) smtp(<a href="http://www.stillhq.com/smtp"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) survey(<a href="http://www.stillhq.com/survey"><img src="http://www.stillhq.com/favicon.png" border="0" alt="S"></a>) </i> <br/><br/> <a href="http://www.stillhq.com/research/smtp/survey/poster-lisa2007.commentform.html">Comment</a> http://www.stillhq.com/research/smtp/survey/poster-lisa2007.html http://www.stillhq.com/research/smtp/survey/poster-lisa2007.html