Methodology for my SMTP survey

    Most of the questions raised by yesterday's initial post about my SMTP survey were about methodology. Unfortunately, the questions are all on Facebook, which means they're not publicly visible, and I'm not keen on making other people's comments visible to y'all without permission. However, I already have a lot written about the methodology which I can include here. Hopefully it will either address most questions, or lead us off in an interesting direction.

    I'll start off by talking about the other surveys I have discovered, although it should be noted that I became aware of another, possibly larger survey yesterday. I have some questions about that new survey that I will address today, so let's ignore it for this discussion until I know more. The Bernstein surveys are the oldest SMTP survey data available. These surveys have used a variety of methodologies for selecting servers to probe:

    • the first survey selected 500,000 random IP addresses from a combination of a DNS walk, glue records from zone files, and MX lookups from zone files;
    • the second survey similarly used 200,000 random addresses from a DNS walk;
    • the third survey performed an MX lookup on .net domains, and then probed the resulting IP addresses;
    • the fourth survey did the same as the third, but only looked up 1/256th of .com;
    • the fifth survey looked up PTR records for 1 million random IP addresses, and then attempted to connect to those which had a valid PTR record;
    • and finally, the sixth survey followed the same selection methodology as the fifth.

    The Credentia surveys, which are more recent, have probed relatively small random samples of the IP address space. This results in extremely small numbers of actual SMTP servers being found and reported on, which makes these surveys less reliable than those with larger samples. The MailChannels survey, which was run around the time that I started developing my own survey implementation, used a list of 400,000 domain names registered by large companies. This introduces a selection bias in their results towards solutions favored by large, for-profit companies.

    All of the surveys I have seen so far use the same method to determine the SMTP software in use -- the 220 status line is logged, and then run through a series of rules which attempt to determine what software is running. It's surprising how many implementations will just tell you what they are, and some even include the version of the software. Even if they don't tell you outright, they often use slightly different banner strings than other implementations, which can be used as a fingerprint. All of the various survey implementations use their own rule sets, although I am willing to publish mine if anyone is interested. I also became aware today of smtpscan, a Perl script which implements the same form of fingerprinting. I need to dig through its source and see how its rules compare with mine.
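
    To make this concrete, here's a minimal Python sketch of the probe-and-classify step. The rule set shown is hypothetical and heavily abridged -- my real rules are far more numerous -- but it illustrates the style of matching:

        import re
        import socket

        # A hypothetical, heavily abridged rule set -- illustrative only.
        # The first matching pattern wins.
        BANNER_RULES = [
            (re.compile(r'\bSendmail\b'), 'Sendmail'),
            (re.compile(r'\bPostfix\b'), 'Postfix'),
            (re.compile(r'\bExim\b'), 'Exim'),
            (re.compile(r'Microsoft ESMTP MAIL Service'), 'Microsoft Exchange'),
            (re.compile(r'\bqmail\b'), 'qmail'),
        ]

        def grab_banner(ip, timeout=10):
            """Connect to port 25 and return the 220 greeting, or None on failure."""
            try:
                with socket.create_connection((ip, 25), timeout=timeout) as sock:
                    sock.settimeout(timeout)
                    banner = sock.recv(1024).decode('ascii', errors='replace').strip()
            except OSError:
                return None
            return banner if banner.startswith('220') else None

        def classify(banner):
            """Run a 220 banner through the rules to guess the SMTP software."""
            for pattern, software in BANNER_RULES:
                if pattern.search(banner):
                    return software
            return 'unknown'

    Storing the raw banner, rather than just the classification, is what makes it possible to rerun new rules over old data, as described in the aside below.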

    (As an aside, I was keeping complete logs of my scans until I had a very embarrassing disk failure a few months ago. Whilst I no longer have complete logs, I do have complete sets of the collected data, which makes it possible for me to tweak the detection rules I use and then rerun the processing to see how that changes the results over time.)

    On to my target selection methodology. My survey tracks IP addresses, not domain names. This means that if many domain names are hosted on a single IP address, only one result is counted for all of those domain names. This probably skews my results away from large installations, because they often have many servers sitting behind a single shared IP address to handle load; in that case, only one of the servers in their cluster will be scanned and counted by my survey. One challenge with my probing technique is that it is not practical to simply scan all 4,294,967,296 IPv4 addresses. Instead, it is important to only scan IP addresses likely to be running an SMTP server. To that end, I have experimented with six techniques to determine candidate IP addresses for scanning:

    • Entries in the RBL (http://dsbl.org/main) and CBL (http://cbl.abuseat.org/) real-time spam blacklists were surveyed in the April 2008, July 2008 and October 2008 surveys. These blacklists contain lists of IP addresses suspected of running open relays or mail proxies, and are aggregated into one list of IP addresses per source. At its peak, this technique yielded 105 million IP addresses.
    • IP addresses were also collected from the Received headers of an email collection consisting of emails sent to Gmane (http://www.gmane.org), mail-archive.com, and my personal mail boxes. Overall, 1,101,066 IP addresses were collected with this method.
    • A number of domain registrars make their zone files available to researchers. These zone files contain lists of the domains currently registered in their top level or second level domains. I currently have access to the zone files for .com, .net, .root, .arpa, .mobi, and .asia. These zone files are used to perform Domain Name System queries for Mail Exchanger (MX) entries, which domains use to advertise to email senders which servers handle email for a given domain. The IP addresses of these servers are then added to the list to be probed as part of my survey (the code sketch after this list shows roughly how these lookups work). So far over 95 million domains have been scanned in this manner, resulting in 2.1 million unique mail server IP addresses. It should be noted that there is some overlap between the mail servers discovered by my RBL analysis and the mail servers identified by these MX lookups. The table below provides a comparison of the various zone files I have used over time. This is important, as .com and .net are significantly larger than the other zones.
    • One of the limitations of the MX lookups described above is that I do not possess a complete list of all domains registered on the Internet, because many domain registrars do not make their DNS zone files available for analysis. I have therefore developed two other methods for locating mail exchangers. The first is to use the DNS reverse lookup results for IP addresses I have surveyed -- these DNS names are recursively queried for MX records. The second technique performs DNS reverse lookups on IPs on the boundary of IP allocations, on the theory that an IP allocation is likely to share a common domain name across its IPs. The results of these lookups are then recursively queried for MX records as well; this recursion is also covered in the sketch below.
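
    In code terms, the discovery techniques from the last two bullet points look roughly like the sketch below. It uses the dnspython library for illustration rather than my actual implementation, and the walk-up-the-labels loop is just one way of interpreting the recursive MX queries:

        import dns.exception
        import dns.resolver
        import dns.reversename

        def mx_candidates(domain):
            """Return the IP addresses of a domain's mail exchangers."""
            ips = set()
            try:
                for mx in dns.resolver.resolve(domain, 'MX'):
                    host = str(mx.exchange).rstrip('.')
                    for answer in dns.resolver.resolve(host, 'A'):
                        ips.add(answer.address)
            except dns.exception.DNSException:
                pass
            return ips

        def ptr_mx_candidates(ip):
            """Reverse-resolve an IP address, then query the resulting name
            (and its parent domains) for MX records."""
            try:
                answers = dns.resolver.resolve(dns.reversename.from_address(ip), 'PTR')
                name = str(answers[0]).rstrip('.')
            except dns.exception.DNSException:
                return set()
            ips = set()
            labels = name.split('.')
            # Query progressively shorter suffixes of the PTR name for MX
            # records, e.g. mail.example.com, then example.com.
            for i in range(len(labels) - 1):
                ips |= mx_candidates('.'.join(labels[i:]))
            return ips

    Feeding every domain from the zone files through something like mx_candidates() is what produced the 2.1 million unique mail server addresses mentioned above.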


    Here's that table of zone file sizes as promised in a bullet point above:

    Date         arpa  asia     com         mobi     net         root
    19 Dec 2007  5     ...      71,545,277  783,199  10,660,216  281
    29 Mar 2008  5     112,366  74,996,409  892,916  11,174,045  281
    14 Jul 2008  5     194,824  76,312,322  928,736  11,591,319  280
    29 Sep 2008  5     227,451  77,661,007  954,888  11,822,954  280
    1 Jan 2009   5     241,131  78,701,644  866,590  11,960,758  280
    1 Apr 2009   5     244,164  80,324,644  840,019  12,170,629  280
    2 Oct 2009   5     212,054  82,259,663  858,503  12,433,457  280


    Again, there is much more to say here -- for example, a discussion of the effectiveness of each IP discovery technique would be useful. However, I'm once again over my 1,000-word limit for a single post, so I'll leave that discussion for later.

    Tags for this post: research smtp popularity internet measurement remoteworker
    Related posts: Measuring the popularity of SMTP server implementations on the Internet; Interesting paper: "YouTube Traffic Characterization: A View From the Edge"; Initial SMTP survey poster results in a pie chart; RemoteWorker v74; First paper published; Announcing early results of my survey of SMTP servers

posted at: 15:50 | path: /research | permanent link to this entry