Content here is by Michael Still. All opinions are my own.

Thu, 10 Feb 2011

First paper published

posted at: 15:14 | path: /research/smtp | permanent link to this entry

Tue, 09 Feb 2010

Methodology for my SMTP survey

    Most of the questions raised by yesterday's initial post about my SMTP survey were about methodology. Unfortunately, the questions are all on Facebook, which means they're not publicly visible and I'm not keen on making other people's comments visible to y'all without permission. However, I already have a lot written about the methodology which I can include here. Hopefully it will either address most questions, or lead us off in an interesting direction.

    I'll start off by talking about the other surveys I have discovered, although it should be noted that I became aware of another, possibly larger, survey yesterday. I have some questions about that new survey that I will address today, so let's ignore it for this discussion until I know more. The Bernstein surveys are the oldest SMTP survey data available. These surveys have used a variety of methodologies for selecting servers to probe:

    • The first survey selected 500,000 random IP addresses from a combination of a DNS walk, glue records from zone files, and MX lookups from zone files.
    • The second survey similarly used 200,000 random addresses from a DNS walk.
    • The third survey performed an MX lookup on .net domains, and then probed the resulting IP addresses.
    • The fourth survey did the same as the third, but only looked up 1/256th of .com.
    • The fifth survey looked up PTR records for 1 million random IP addresses, and then attempted to connect to those which had a valid PTR record.
    • The sixth survey followed the same selection methodology as the fifth.

    The Credentia surveys, which are more recent, have probed relatively small random samples of the IP address space. This has resulted in extremely small numbers of actual SMTP servers being reported on, which makes these surveys less reliable than those with larger samples. The MailChannels survey, which was run at the time that I started developing my own survey implementation, used a list of 400,000 domain names registered by large companies. This has resulted in a selection bias in their results towards solutions favored by large for-profit companies.

    All of the surveys I have seen so far use the same method to determine the SMTP software in use -- the 220 status line is logged, and then run through a series of rules to attempt to determine what software is running. It's surprising how many implementations will just tell you what they are, and some even include the version of the software. Even if they don't tell you outright, often they use slightly different banner strings than other implementations, which can be used as a fingerprint. All of the various survey implementations use their own rule sets, although I am willing to publish mine if anyone is interested. I also today became aware of smtpscan, which is a Perl script that implements the same form of fingerprinting. I need to dig through its source and see how its rules compare with mine.
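    As a rough illustration of rule-based banner fingerprinting, here is a minimal Python sketch. The rule list and the classify_banner name are invented for this example and are not the survey's actual rule set. The patterns are just banner substrings that these implementations commonly emit, and a real rule set would need many more entries.

```python
import re

# Illustrative rules only: (implementation name, pattern). These are NOT the
# survey's real rules, just substrings commonly seen in 220 banners.
RULES = [
    ("Microsoft Exchange", re.compile(r"Microsoft ESMTP MAIL Service", re.I)),
    ("Sendmail", re.compile(r"\bSendmail\b", re.I)),
    ("Postfix", re.compile(r"\bPostfix\b", re.I)),
    ("Exim", re.compile(r"\bExim\b", re.I)),
]

# A dotted version number such as 4.69 or 8.13.8.
VERSION = re.compile(r"\b(\d+(?:\.\d+)+)\b")


def classify_banner(banner):
    """Return (implementation, version) guessed from a 220 banner line."""
    for name, pattern in RULES:
        match = pattern.search(banner)
        if match:
            # Only look for a version after the implementation name, so
            # digits in the hostname are not mistaken for one.
            version = VERSION.search(banner, match.end())
            return (name, version.group(1) if version else None)
    return ("unknown", None)
```

    Because the first matching rule wins, the order of the list matters. More specific patterns would need to come before more general ones.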

    (As an aside, I was keeping complete logs of my scans until I had a very embarrassing disk failure a few months ago. Whilst I no longer have complete logs, I do have complete sets of the collected data, which makes it possible for me to tweak the detection rules I use and then rerun the processing to see how that changes the results over time.)

    Onto my target selection methodology. My survey tracks IP addresses, not domain names. This means if many domain names are hosted on a single IP address, then there is still only one result counted for those domain names. This probably skews my results away from large installations, because they often have many servers sitting behind a single shared IP address to handle load. In that case, only one of the servers in their cluster will be scanned and counted by my survey. One challenge with my probing technique is that it is not possible to simply scan all 4,294,967,296 IPv4 addresses. Instead, it is important to only scan IP addresses likely to be running an SMTP server. To that end, I have experimented with six techniques to determine candidate IP addresses for scanning:

    • Entries in the RBL and CBL real time spam blacklists were surveyed in the April 2008, July 2008 and October 2008 surveys. These blacklists contain lists of IP addresses suspected to run open relays or mail proxies, and are aggregated into one list of IP addresses per source. At its peak, this technique yielded 105 million IP addresses.
    • IP addresses were also collected from the Received headers of an email collection consisting of emails sent to Gmane and my personal mail boxes. Overall, 1,101,066 IP addresses were collected with this method.
    • A number of domain registrars make their zone files available to researchers. These zone files contain lists of domains currently registered in their top level or second level domains. I currently have access to the zone files for .com, .net, .root, .arpa, .mobi, and .asia. These zone files are used to perform Domain Name System queries for Mail Exchanger (MX) entries. These entries are used by domains to advertise to email senders which servers handle email for a given domain. The IP addresses of these servers are then added to the list to be probed as part of my survey. So far over 95 million domains have been scanned in this manner, resulting in 2.1 million unique mail server IP addresses. It should be noted that there is some overlap between the mail servers discovered by my RBL analysis, and the mail servers identified by this MX lookup. The table below provides a comparison of the various zone files we have used over time. This is important, as .com and .net are significantly larger than the other zones.
    • One of the limitations of the MX lookups described above is that I do not possess a complete list of all domains registered on the Internet. This is because many domain registrars do not make their DNS zone files available for analysis. I have therefore developed two other methods for locating mail exchangers. The first is to use the DNS reverse lookup results for IP addresses I have surveyed -- these DNS names are recursively queried for MX records. The second technique performs DNS reverse lookups on IPs on the boundary of IP allocations, on the theory that an IP allocation is likely to share a common domain name across its IPs. The results of these lookups are then recursively queried for MX records as well.
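    The last technique in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: it reads "boundary" as the first and last host addresses of a CIDR block, and the function names boundary_addresses and reverse_name are hypothetical. The names returned would then be fed into MX lookups, which are not shown here.

```python
import ipaddress
import socket


def boundary_addresses(cidr):
    """Return the first and last host addresses of an allocation as strings."""
    network = ipaddress.ip_network(cidr, strict=False)
    if network.num_addresses <= 2:
        # Tiny blocks (/31, /32) have no separate network/broadcast address.
        return [str(address) for address in network]
    # Skip the network and broadcast addresses at either end of the block.
    return [str(network[1]), str(network[-2])]


def reverse_name(ip):
    """Return the PTR name for an IP address, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None
```

    For example, boundary_addresses("192.0.2.0/24") yields the two addresses 192.0.2.1 and 192.0.2.254, so each allocation costs two reverse lookups rather than one per address.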

    Here's that table of zone file sizes as promised in a bullet point above:

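    Date        | .arpa | .asia   | .com       | .mobi   | .net       | .root
    19 Dec 2007 | 5     | ...     | 71,545,277 | 783,199 | 10,660,216 | 281
    29 Mar 2008 | 5     | 112,366 | 74,996,409 | 892,916 | 11,174,045 | 281
    14 Jul 2008 | 5     | 194,824 | 76,312,322 | 928,736 | 11,591,319 | 280
    29 Sep 2008 | 5     | 227,451 | 77,661,007 | 954,888 | 11,822,954 | 280
    1 Jan 2009  | 5     | 241,131 | 78,701,644 | 866,590 | 11,960,758 | 280
    1 Apr 2009  | 5     | 244,164 | 80,324,644 | 840,019 | 12,170,629 | 280
    2 Oct 2009  | 5     | 212,054 | 82,259,663 | 858,503 | 12,433,457 | 280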

    Again, there is much more to say here -- for example a discussion of the effectiveness of each IP discovery technique would be useful. However, I'm once again over my 1,000 word limit for a single post, so I'll leave that discussion for later.

    Tags for this post: research smtp popularity internet measurement remoteworker
    Related posts: Measuring the popularity of SMTP server implementations on the Internet; Interesting paper: "YouTube Traffic Characterization: A View From the Edge"; Initial SMTP survey poster results in a pie chart; The witty worm with Vern Paxson; Announcing early results of my survey of SMTP servers; RemoteWorker v74

posted at: 15:50 | path: /research | permanent link to this entry

Mon, 08 Feb 2010

Measuring the popularity of SMTP server implementations on the Internet

    I'm interested in measuring the performance of SMTP servers connected to the Internet. Before I can poke around inside an SMTP implementation, I want to ensure that I am using one which lots of people use. To that end I have been running a series of SMTP server surveys for the last several years. This work has been alluded to in the past, but I haven't published any results. This has mainly been because while I have written a number of papers on the topic, I am yet to have one accepted by an academic conference. I've been hesitant to comment about my results because of the requirement that academic publications not be previously published work.

    I've decided to change that policy. I'm going to reserve a lot of the deeper analysis for academic publication (if I can make such a thing happen), but I am going to start talking about the work I am doing more in public. To start that off, I should mention what I've been doing...

    There have been a number of previous surveys of SMTP servers connected to the Internet, with each survey using a different methodology. So although these results are not directly comparable, a comparison still provides some insight into how the server landscape has changed over the last 12 years. A comparison of published surveys is presented in the table below. Each survey in this table shows: the sample size, which is the number of IP addresses surveyed; the sample approach, which is the methodology used to determine which IP addresses to sample, and which adds bias into the sampling; and the number of responses, which is the number of SMTP servers that responded. The majority of these surveys have relied on random sampling of the IP address space, perhaps with a selection algorithm to limit the results selected. Few of the more recent surveys provide complete information on their probing implementation or the rules they used to identify specific implementations from their observations. It should be noted that non-response from a surveyed IP generally indicates that it is not in fact running an SMTP server accessible from the Internet.

    Date        | Surveyor     | Sample size | Sample method          | Responses
    7 Nov 1996  | Bernstein    | 500,000     | Selective random       | 25,121
    14 Aug 1997 | Bernstein    | 200,000     | Selective random       | 8,056
    11 May 1998 | Bernstein    | 20,310      | MX walk                | 17,592
    2 Apr 2000  | Bernstein    | 12,595      | Selective random       | 10,087
    5 Oct 2000  | Bernstein    | 25,777      | Random                 | 859
    27 Sep 2001 | Bernstein    | 39,206      | Random                 | 937
    1 Dec 2002  | Credentia    | 4,096       | Random                 | 1,837
    1 Jan 2003  | Credentia    | 30,000      | Random                 | 17,540
    1 Apr 2003  | Credentia    | 37,563      | Random                 | 20,410
    1 May 2007  | MailChannels | 400,000     | Corporate domain names | 254,400

    The surveys that I have been running with the assistance of my ever patient PhD supervisor Dr Eric McCreath have, by contrast, been quite a bit larger. Note that larger isn't necessarily better with these sorts of surveys, but my methodology attempts to aim for completeness, and the relative power of PlanetLab makes these computations surprisingly cheap. Details of my surveys so far:

    Date         | Surveyor         | Sample size | Sample method | Responses
    January 2008 | Still / McCreath | 46,136,113  | Exhaustive    | 1,973,748
    April 2008   | Still / McCreath | 92,286,998  | Exhaustive    | 1,609,111
    July 2008    | Still / McCreath | 97,545,668  | Exhaustive    | 1,579,507
    October 2008 | Still / McCreath | 109,661,889 | Exhaustive    | 1,801,081
    January 2009 | Still / McCreath | 110,397,428 | Exhaustive    | 1,916,719
    April 2009   | Still / McCreath | 110,706,130 | Exhaustive    | 1,925,760
    October 2009 | Still / McCreath | 111,209,212 | Exhaustive    | 1,800,573

    Our survey is implemented by attempting to identify the MTA software running on an SMTP server using the SMTP connection banner. In other words, a collection of IP addresses are connected to on the SMTP port (TCP 25), and an attempt is made from the early stages of the SMTP protocol interaction to determine what SMTP server software is running on that host. An SMTP server will often reply to the connection with a status 220 line, referred to as the SMTP banner, which tells the connecting client that the server is ready. The SMTP banner also frequently states what software the server is running. Even if the software in use isn't explicitly named, it is often a string which is unique to a given SMTP implementation. This technique simply connects on the SMTP port, and logs any lines starting with 220. The connection is then closed, with no attempt to transfer an email occurring.
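    The probe described above is small enough to sketch in a few lines of Python. This is a minimal illustration rather than the RemoteWorker implementation: the function names, the ten second timeout and the 4096 byte read size are all assumptions made for the example.

```python
import socket


def extract_banner_lines(data):
    """Return the lines of a server greeting that start with status 220."""
    text = data.decode("ascii", errors="replace")
    return [line for line in text.splitlines() if line.startswith("220")]


def grab_banner(host, port=25, timeout=10.0):
    """Connect, read the greeting, and close without sending any mail."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        data = b""
        # Multi-line greetings use "220-" on every line except the last,
        # which uses "220 ", so keep reading until that final line arrives.
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
            lines = data.splitlines()
            if data.endswith(b"\n") and lines and lines[-1][3:4] != b"-":
                break
    return extract_banner_lines(data)
```

    A host that refuses the connection or times out raises OSError from grab_banner, which a caller would treat as a non-response. A host that greets with a status other than 220 yields an empty list.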

    So what results have I found so far? I'm trying to keep these blog posts to less than 1,000 words each, so that's too big a question to answer here. I've found some quite unexpected things along the way, such as an accurate technique for measuring the occurrence of domain parking on the Internet, and I'll discuss those in future posts. Instead, let me leave you with this short graphical summary of the results so far:

    This is the history of the five currently most popular implementations over time. You can see that Sendmail has fallen from a position of market dominance, and Exim is currently the most popular SMTP server implementation.

    I have a lot more to say about all this work, but as I mentioned earlier I want to keep the length of these posts down. I'll say more in future posts.

    Tags for this post: research smtp popularity internet measurement remoteworker
    Related posts: Methodology for my SMTP survey; Initial SMTP survey poster results in a pie chart; Interesting paper: "YouTube Traffic Characterization: A View From the Edge"; Announcing early results of my survey of SMTP servers; RemoteWorker v74; The witty worm with Vern Paxson

posted at: 21:29 | path: /research | permanent link to this entry

Mon, 08 Dec 2008

Parked domains

posted at: 10:21 | path: /research | permanent link to this entry

Thu, 07 Aug 2008

Noticed that is down?

posted at: 09:11 | path: /research/smtp/survey | permanent link to this entry

Sun, 15 Jun 2008

RemoteWorker v74

posted at: 09:44 | path: /research | permanent link to this entry

Sun, 08 Jun 2008

Dear Lazyweb: how do I check SSL keys for vulnerability?

posted at: 21:03 | path: /research | permanent link to this entry

Sun, 01 Jun 2008

RemoteWorker v70

    I've been meaning for a while to release the RemoteWorker stuff I use for my SMTP server survey as open source. I finally got around to it.

    You can find the source code here. Here is what the README file in that tarball has to say for itself:

    RemoteWorker is licensed under the terms of the GNU GPL v2 and is Copyright (C) Michael Still 2007 and 2008. Note that several of the dependencies have their own licenses, as described at the end of this file.

    This is version v70 of RemoteWorker, a system intended for measuring the Internet by running on many machines at once. Setup of RemoteWorker is not simple, because everyone's use of the system will be different. You should expect to have to write your own bootstrap script, your own SSH babysitter, as well as probably needing to implement more commands in RemoteWorker itself to meet your measurement needs.

    RemoteWorker is very simple. It takes a list of filenames on the command line, and executes the commands from each of these files in the order they are listed on the command line. The commands currently implemented are:

    START: Denotes the start of a command file. Not required.
    END: Denotes the end of a command file. Not required.
    COMMAND-FETCH: Uses HTTP to fetch a new command file. The new commands are stored in a file named in the current working directory
    SLEEP: Sleep for the named number of seconds
    SERVICE-THREAD: Run a service thread which can be connected to on the port specified by the command. If you then telnet to this port, you'll be told the currently running command, and the amount of memory in use by RemoteWorker
    HEARTBEAT-THREAD: Run a heartbeat thread which will regularly log that the RemoteWorker process is still running
    SMTP-PROBE: Probe a remote SMTP server on TCP port 25. This will perform a traceroute (and return traceroute results), and an SMTP probe if the remote server is not hosted on, or routed via, an AS which has asked to not receive probe traffic.
    DEST-AS-CHECK: Check if named IP is hosted by an AS which doesn't like probe traffic.
    TRACEROUTE: Perform a traceroute to the IP, including AS information
    MX-LOOKUP: Perform a DNS MX lookup for a named domain
    DNS-LOOKUP: Perform a forward DNS lookup
    DNS-REVERSE: Perform a reverse DNS lookup
    IPv6-DNS-LOOKUP: Perform an IPv6 DNS lookup
    HTTP-FETCH: Fetch the named URL, and log the HTML of the page. The HTML will be truncated if it is too long
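    To make the "run each file's commands in order" model concrete, here is a small Python sketch of a dispatcher. It is not RemoteWorker's code. The line syntax (a command name, a space, then one argument), the handler table and all function names are assumptions made for illustration, and only START, END and SLEEP are implemented.

```python
import time


def do_sleep(argument):
    """SLEEP: sleep for the named number of seconds."""
    time.sleep(float(argument))


def do_nothing(argument):
    """START and END only mark the boundaries of a command file."""


# Hypothetical handler table; a real worker would register every command.
HANDLERS = {"START": do_nothing, "END": do_nothing, "SLEEP": do_sleep}


def parse_line(line):
    """Split an assumed 'NAME argument' line; return None for a blank line."""
    line = line.strip()
    if not line:
        return None
    name, _, argument = line.partition(" ")
    return (name, argument.strip())


def run_files(filenames):
    """Run each command file in the order given and return what was executed."""
    executed = []
    for filename in filenames:
        with open(filename) as handle:
            for line in handle:
                parsed = parse_line(line)
                if parsed is None or parsed[0] not in HANDLERS:
                    continue  # skip blank lines and unimplemented commands
                HANDLERS[parsed[0]](parsed[1])
                executed.append(parsed)
    return executed
```

    Calling run_files(sys.argv[1:]) from a script would mirror the command line behaviour described above. Adding a command means adding one function and one entry in the handler table.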

    A sample command file might look like this:


    Note that RemoteWorker needs to be run as root for the traceroute implementation to work. Run that command file to see what happens!

    There are a few other things to note:


    You will need to install the gflags module before some of the helper scripts will work. The worker nodes do not however need gflags.

    Command and control

    My workers run a simple shell script which sits in a loop. This script runs a series of command files. One of these command files uses the COMMAND-FETCH command to contact a CGI script and download more work. Your RemoteWorkers will need to do something similar, but there is nothing forcing you to deliver the work via HTTP. You could for example just write a simple script which calculates what work to perform in some manner, and then runs that. If you specify a file named "-" on the command line, then RemoteWorker will run in interactive mode and read its commands from stdin.


    Not all PlanetLab nodes are created equal, and I therefore need a bootstrap script which checks the capabilities of a node. This script then writes out the command file which fetches new commands. The capability checks are in the command files named sample and sample-smtpprobes.

    Collecting logs

    RemoteWorker doesn't return its results in real time; it just writes them to stdout. My sitter script then writes these to log files on disk, and I use an SSH-based babysitter script to download and process them. This is useful because it means that the server handing out work doesn't need to be up when the commands are executed, and the results just sit around on disk waiting for me to collect them later.

    SMTP probes, traceroutes, and the need for AS maps

    If you intend to use the SMTP probing or tracerouting functionality, then you need to have a recent mapping of IP ranges to AS numbers. I don't however include this map in this download because it is quite large and becomes out of date rapidly. You can create your own mapping trivially, by running these commands:


    This will download an AS map from the U Washington Computer Science department, and then preprocess it into the correct format for the workers. However, you should only do the download once, and then push the processed file to the workers, as processing the file into the correct format is quite expensive.
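    For readers wondering what such a map is used for, here is a minimal Python sketch of looking up the AS for an IP address. The input format (one "prefix AS-number" pair per line) is an assumption for the example and is not the format RemoteWorker actually uses. The function names are hypothetical, and the AS numbers and prefixes in the usage note come from the ranges reserved for documentation.

```python
import ipaddress


def load_as_map(lines):
    """Parse assumed 'prefix asn' lines into (network, asn) pairs."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # ignore blank or malformed lines
        entries.append((ipaddress.ip_network(fields[0]), int(fields[1])))
    # Longest prefixes first, so the first match is the most specific one.
    entries.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return entries


def lookup_asn(entries, ip):
    """Return the AS number of the most specific prefix covering ip, or None."""
    address = ipaddress.ip_address(ip)
    for network, asn in entries:
        if address in network:
            return asn
    return None
```

    With the two lines "192.0.2.0/24 64500" and "192.0.0.0/16 64501" loaded, looking up 192.0.2.10 returns 64500 because the /24 is more specific. The linear scan keeps the sketch short; a full routing table would want a prefix tree instead. A check like DEST-AS-CHECK could then compare the result against a set of AS numbers which have asked not to be probed.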

    Dependency credits

    RemoteWorker is intended to be installed on worker nodes with a minimal Python installation without needing other things to be installed on the system itself. There are therefore several dependencies which are shipped with RemoteWorker, but weren't written by me. I take no credit for these dependencies, and it should be noted that they each have their own license and usage rules. Please check that you are obeying the licenses for these dependencies.

    Dependencies currently shipped with RemoteWorker are:

    • pydns: used for DNS lookups
    • IPy: used for IP address manipulation
    • tlslite: a lightweight SSL implementation used for SMTP TLS checks

    So there you are.

    Tags for this post: research remoteworker release
    Related posts: RemoteWorker v74; Methodology for my SMTP survey; Announcing early results of my survey of SMTP servers; Measuring the popularity of SMTP server implementations on the Internet; MythNetTV release 6; MythIPTV Beta 2

posted at: 14:03 | path: /research | permanent link to this entry

Wed, 21 May 2008

Are license tags common in web pages?

posted at: 02:30 | path: /research | permanent link to this entry

Tue, 01 Apr 2008

The web is probably parkier than it seems

    I've been reading academic papers again (it tends to happen in batches) -- this time I've been focusing on the papers from the Usenix Internet Measurement Conference (IMC) last year. One of the more immediately interesting papers presented was The Web is Smaller than it Seems (bibtex).

    The paper discusses a measurement of the size of "the web" based on a scan of domain names listed in either the DMOZ open directory or the .com and .net TLD zone files. You'll note that is a very similar technique to one of those that I use to acquire domain names for my survey of Internet mail servers, which is what originally interested me in this paper. The domains had their www hostname looked up, and then the number of domain names per IP address was used to create an estimate of the total number of web servers present on the Internet. It is of course a little bit more complicated than that, but you can read the paper for more details if you really want.

    The paper's findings are interesting:

    We find that as much as 60% of the Web servers are co-hosted with 10,000 or more other Web servers, indicating that the Internet contains many small co-hosted Web servers. Likewise, more than 95% of Web servers share their AS with 1000 or more other Web servers. We additionally find that heavily co-hosted Web servers contribute much less traffic than Web servers that are not co-hosted, confirming that popular servers are not co-located, while less popular servers co-locate more frequently. When considering block lists, we find the vast majority of blocked Web servers are hosted on IPs hosting 100 Web servers or more. This indicates there may be a great deal of collateral damage with IP blocking. Finally, when looking at authoritative DNS servers, we see a high degree of co-location on a very small number of DNS servers, which may result in the Web being fragile from a DNS perspective.

    That's a pretty interesting result. Unfortunately, I think that the researchers missed an opportunity here. While they determined that a small number of IP addresses host a large number of web sites, they didn't attempt to determine how many of those domains are just parked content. Now that would have been something interesting to know. Specifically, I've poked around a little with the parking behaviour of domains via the result of MX record lookups, which leads me to suspect that a large number of those heavily co-located domain names are simply parked, and not adding any interesting content to the Internet.

    Tags for this post: research domain parking http
    Related posts: The Internet is a strange place; Mikal, tell something I didn't know about SMTP servers on the Internet; Interesting paper: "YouTube Traffic Characterization: A View From the Edge"; Redirect to a file:// URL?; The witty worm with Vern Paxson; MelbourneIT are into search engine optimisation?

posted at: 20:08 | path: /research | permanent link to this entry

Sun, 23 Mar 2008

The Internet is a strange place

posted at: 14:14 | path: /research | permanent link to this entry

Fri, 21 Mar 2008

Normalising mail server package names

posted at: 18:37 | path: /research/smtp/survey | permanent link to this entry

Wed, 19 Mar 2008

Announcing early results of my survey of SMTP servers

posted at: 08:54 | path: /research/smtp/survey | permanent link to this entry

Tue, 18 Mar 2008

What is the definition of publication?

posted at: 09:11 | path: /research | permanent link to this entry

Sun, 16 Mar 2008

Compendium of TLD domain access agreements

    One of the things I need to further my SMTP survey is lists of domain names. Lots of domain names. I'm sure I'm not the only one who is interested in such things, so I figured I'd take notes on what the process was to get access to that sort of information for research. This post is a "living document", and I'll update it as I find details of new registrars, or actually experience their service.

    Note that almost all of these TLDs use basically identical access agreements. Access is free, but you agree to not hammer the registrar's servers, and not to do something like spamming with the data. Additionally, the domain names can only be redistributed as part of a value-added product, and without the ability to dump all of the data the registrar provided in one pass. I need to think about that bit more, because it probably restricts the way I allow use of my dataset.

    Access | TLD | Registrar | Procedure | Thanks to
    Yes | .com | Verisign | To get access to the list of domains in these four TLDs, you need to sign the domain access agreement and fax the contract to Verisign. If accepted (they accepted me as an individual with no real problems), they say they will fax login details back to you. However, they snail mailed mine to my contact address, for reasons I can't explain. Access is provided by FTP, and you must connect from a static IP which is defined in the access agreement. |
    No | .edu | eduCause | Access is not available to the list of registered domains, as per US Department of Commerce requirements. The registrar is seeking to be allowed to provide this data in the future, so it might be worth using the contact form to check if things have changed. |
    Partial | .info | Afilias | Afilias is an outsourced TLD registrar. They have different access agreement forms for each of the TLDs they manage. .info: Print out the domain access agreement and fax it to Afilias at +1 215 706 5701. My request for access was not responded to. .asia: Print out the domain access agreement and fax it to Afilias at +1 215 706 5701. Access was granted via an email, and the data is exposed over FTP. My request was responded to after a several week delay. |
    Yes | .mobi | dotMobi | Print out the domain access agreement, fill it out, and then scan and email it to them. They replied within a week or so via email, providing an FTP username and password. Note that they only allow you to download the zone files once a day. |
     | .org | | Application form is at | Andy Warner
     | .biz | | Application form is at | Andy Warner

    Tags for this post: research zone_file dns
    Related posts: Parked domains; Normalising mail server package names; The web is probably parkier than it seems; Nerd link of the day; First paper published; The Accidental Time Machine

posted at: 10:12 | path: /research | permanent link to this entry

Fri, 14 Mar 2008

Mikal, tell something I didn't know about SMTP servers on the Internet

    As part of my survey of SMTP servers on the Internet (a graphical representation of the results from that post is here), I need to find SMTP servers to survey. One of the ways that I've been doing that is by performing large numbers of DNS Mail eXchanger (MX) lookups and then probing the SMTP servers identified by those lookups. I haven't been able to perform those lookups on every domain registered, because not all registrars make their zone files available to researchers. I have a compendium of what I've learnt about zone file access agreements online if you're interested.

    Specifically, I performed the following lookups:

    Zone Number of lookups
    .arpa 5
    .asia 9,044
    .com 72,529,657
    .mobi 819,849
    .net 10,734,157
    .root 281

    For each of these domains a DNS MX record lookup was performed using around 100 machines, and the results stored in a series of sharded tables in a MySQL database.

    In aggregate, the results look like this:

    Total (IP, domain) tuples: 72,863,506
    Total unique IPs: 2,136,511
    Total unique domains: 46,993,011

    There are some interesting things to be found in the MX record data. For example, only 55.8% of the domains I scanned have an MX record at all. That might seem a bit counterintuitive, but when you take into account that a lot of domain names are unused or used simply for a web site, I guess it's not that surprising. I would like to spend some more time verifying that this isn't a bug in my survey code, but I haven't gotten around to doing that yet.

    Another interesting fact is that GoDaddy appears to be hosting a very large number of domains. Specifically, I found 12,105,590 domains which had one of just two IP addresses owned by GoDaddy as their MX record. That's 25.76% of all of my results. This means that GoDaddy's domain hosting business is massive -- certainly much larger than I realized previously.

    The IP addresses in question are and Some detail:

    IPDNS Reverse is a domain registered to "Wild West Domains, Inc.", who appear to be part of the GoDaddy family (according to this GoDaddy help page, is used for GoDaddy DNS servers among other things). To determine how many of these domains are parked, I fired off some download jobs to download the top level page of each domain. At the moment, 1,087,885 of those downloads are complete.

    Domains parked with GoDaddy HTTP 302 redirect from the top level page to a URL which is the domain name followed by a short identifier. For example, 302 redirects to -- which is a page displaying advertising. Of the sites I have tested so far, 714,455 are parked in this manner.

    That means GoDaddy currently has approximately 7,950,196 domains parked. That's around 9.4% of all the domains I have scanned!
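    The arithmetic behind these percentages can be checked with a few lines of Python. All of the inputs are figures quoted in this post. The reading that "domains scanned" means the sum of the per-zone lookups, and that "my results" means the unique domains with an MX record, is an assumption of this sketch.

```python
# Figures quoted in this post.
LOOKUPS = {".arpa": 5, ".asia": 9044, ".com": 72529657,
           ".mobi": 819849, ".net": 10734157, ".root": 281}
DOMAINS_WITH_MX = 46993011      # total unique domains in the MX results
GODADDY_MX_DOMAINS = 12105590   # domains using one of the two GoDaddy IPs
DOWNLOADS_COMPLETE = 1087885    # top level pages fetched so far
DOWNLOADS_PARKED = 714455       # of those, pages that 302 to a parking page

total_lookups = sum(LOOKUPS.values())
mx_fraction = DOMAINS_WITH_MX / total_lookups
godaddy_share = GODADDY_MX_DOMAINS / DOMAINS_WITH_MX

# Extrapolate the sampled parking rate to every GoDaddy-hosted domain,
# using integer arithmetic so the estimate is not subject to float rounding.
estimated_parked = GODADDY_MX_DOMAINS * DOWNLOADS_PARKED // DOWNLOADS_COMPLETE
parked_share = estimated_parked / total_lookups

print("total lookups:", total_lookups)
print("domains with an MX record: %.2f%%" % (100 * mx_fraction))
print("GoDaddy share of MX results: %.2f%%" % (100 * godaddy_share))
print("estimated parked domains:", estimated_parked)
print("parked share of all lookups: %.2f%%" % (100 * parked_share))
```

    The extrapolation assumes that the 1,087,885 completed downloads are representative of all 12,105,590 domains. If the download order was not random, the real parked count could differ.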

    Based on looking at IPs serving as MX for an unusual number of domains, the only other immediately obvious entry is that 184,213 domains point to That seems a little bit odd to me.

    I'm sure there is other interesting information in this MX data, but I think I'll leave it here for now.

    Tags for this post: research smtp godaddy hosting domain mx
    Related posts: The web is probably parkier than it seems; Normalising mail server package names; Noticed that is down?; Microsoft Exchange the most popular SMTP server on the Internet?; First paper published; Measuring the popularity of SMTP server implementations on the Internet

posted at: 20:14 | path: /research/smtp | permanent link to this entry

Thu, 06 Dec 2007

Initial SMTP survey poster results in a pie chart

posted at: 16:31 | path: /research/smtp/survey | permanent link to this entry

Wed, 05 Dec 2007

Interesting paper: "YouTube Traffic Characterization: A View From the Edge"

posted at: 12:46 | path: /research | permanent link to this entry

Sat, 01 Dec 2007

Microsoft Exchange the most popular SMTP server on the Internet?

    Eric McCreath from the Department of Computer Science at the Australian National University and I presented a poster entitled "Inferring Relative Popularity of SMTP Servers" at USENIX LISA 2007. This blog post is a brief discussion of the content of the poster, as well as a landing page for the paper version of the poster and the PDF of the actual poster. For more detail on the measurement techniques used, please check out the complete paper.

    We conducted this research because there is little data on the relative popularity of the various available SMTP server implementations. This data is of interest because it aids the development of systems which interact with these servers. For example, a potential DDoS protection system should be tested with the most common SMTP servers, as these are the ones that it is most likely to encounter in everyday use.

    Many businesses rely on email of some form for their day to day operation. This is especially true for product support organisations, who are largely unable to perform their role in the company if their in-boxes are unavailable. Allman in "Spam, Spam, Spam, Spam, Spam, the FTC, and Spam" states that Nucleus Research studies estimate that spam costs US businesses $87 billion a year. It seems reasonable to assume that if a low level attack is costing that much, then a complete outage would impose an even greater burden on an enterprise.

    There has been little research conducted into the current state of SMTP servers on the Internet, perhaps because this area of research has not been particularly fashionable in comparison to the HTTP metrics which are commonly collected. This is an important area of research, however, given that the level of traffic served by these systems has been growing for years. Barracuda Networks cite Radicati research which indicates that in 2009, 228 billion emails will be sent per day, with the vast majority being spam (see Barracuda's site for more details). Afergan and Beverly in "The state of the email address" evaluate the state of email servers in an attempt to determine how SMTP servers are coping with the growth in traffic. Their approach involved sending out probe emails to a variety of domains. The email was crafted to have a strong assurance of bouncing because of not being addressed to a valid address. The authors then monitored the bounce traffic. They concluded that corporate SMTP servers are under surprising levels of strain and do not bounce undeliverable emails in a predictable manner.

    We have therefore started to undertake research into SMTP servers as they appear on the Internet, with our first study being a simple survey of which SMTP implementations are most commonly deployed. Our poster discussed the current state of that survey, and provided some early results.

    The challenge with determining the popularity of various SMTP server implementations is twofold -- firstly, not all of the SMTP servers which interact with the Internet are able to be probed from the public Internet (for example SMTP routers which route email that came from the Internet, but are not themselves accessible from the Internet); and secondly the sheer number of SMTP servers connected to the network. We have therefore used both passive and active measurements to survey these servers. Each of these measurement techniques is described below.

    Bearing in mind that our survey is quite new, and that only 34.6 million IP addresses have been probed so far, the initial results are quite interesting.

    You can see from the graph that the most popular SMTP server in our dataset is Microsoft Exchange, followed by Postfix and then Sendmail.

    Additional analysis of our existing data, as well as further development of the email parser will improve the accuracy of our survey, which will also increase the number of machines included in the survey. The survey also needs a wider set of inputs for possible IP addresses to probe -- one example of another possible source of probable SMTP servers is MX records for registered domain names. The distributed probing system needs further development to handle the scale of the probing required for a large number of SMTP servers to be included in the survey, and improvements to the reliability of the central server are also required.

    This SMTP survey is in its early stages, and there is much work still to do. However, research of this nature is likely to produce results which are of interest to the research community, as well as to software developers and systems administrators. So far a small dataset has been analysed, which has resulted in a reasonably robust distributed probing system being constructed. Further work on the survey will continue in the future, with updated results being published from time to time.

    Tags for this post: research smtp survey microsoft exchange postfix sendmail
    Related posts: Announcing early results of my survey of SMTP servers; Initial SMTP survey poster results in a pie chart; Normalising mail server package names; Noticed that is down?; Long time not much write; Methodology for my SMTP survey

posted at: 15:27 | path: /research/smtp/survey | permanent link to this entry