Measuring the popularity of SMTP server implementations on the Internet

    I'm interested in measuring the performance of SMTP servers connected to the Internet. Before I poke around inside an SMTP implementation, I want to be sure that I am using one which lots of people use. To that end I have been running a series of SMTP server surveys for the last several years. This work has been alluded to in the past, but I haven't published any results. That is mainly because, while I have written a number of papers on the topic, I am yet to have one accepted by an academic conference. I've been hesitant to comment on my results because academic venues require that submitted work not be previously published.

    I've decided to change that policy. I'm going to reserve a lot of the deeper analysis for academic publication (if I can make such a thing happen), but I am going to start talking more publicly about the work I am doing. To start that off, I should mention what I've been doing...

    There have been a number of previous surveys of SMTP servers connected to the Internet, each using a different methodology. Although their results are therefore not directly comparable, a comparison still provides some insight into how the server landscape has changed over the last 12 years. The published surveys are compared in the table below. For each survey the table shows the sample size, which is the number of IP addresses surveyed; the sample method, which is the approach used to choose which IP addresses to sample, and which introduces its own bias; and the number of responses, which is the number of SMTP servers that responded. The majority of these surveys relied on random sampling of the IP address space, perhaps with a selection algorithm to limit the results selected. Few of the more recent surveys provide complete information on their probing implementation or on the rules they used to identify specific implementations from their observations. Note that non-response from a surveyed IP address generally indicates that it is not running an SMTP server accessible from the Internet.

    Date         Surveyor      Sample size  Sample method           Responses
    7 Nov 1996   Bernstein     500,000      Selective random        25,121
    14 Aug 1997  Bernstein     200,000      Selective random        8,056
    11 May 1998  Bernstein     20,310       MX walk                 17,592
    2 Apr 2000   Bernstein     12,595       Selective random        10,087
    5 Oct 2000   Bernstein     25,777       Random                  859
    27 Sep 2001  Bernstein     39,206       Random                  937
    1 Dec 2002   Credentia     4,096        Random                  1,837
    1 Jan 2003   Credentia     30,000       Random                  17,540
    1 Apr 2003   Credentia     37,563       Random                  20,410
    1 May 2007   MailChannels  400,000      Corporate domain names  254,400
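
    The random sampling approach used by most of the surveys above can be sketched roughly as follows. This is only an illustration and not any surveyor's actual code: the function name random_ipv4_sample is made up for this example, and skipping non-global addresses (private, loopback, multicast and so on) is my assumption about what a sensible selection filter might look like.

```python
import ipaddress
import random


def random_ipv4_sample(count, seed=None):
    """Draw `count` distinct IPv4 addresses uniformly at random.

    Assumption for this sketch: addresses that are not globally routable
    are skipped, since they cannot host an Internet-facing SMTP server.
    """
    rng = random.Random(seed)
    sample = set()
    while len(sample) < count:
        # Every IPv4 address is a 32-bit integer, so draw 32 random bits.
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global:
            sample.add(addr)
    return sorted(sample)


if __name__ == "__main__":
    for addr in random_ipv4_sample(5, seed=2009):
        print(addr)
```

    Passing a seed makes a sample repeatable, which is handy if the same addresses need to be probed again later.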


    The surveys that I have been running, with the assistance of my ever-patient PhD supervisor Dr Eric McCreath, have been quite a bit larger. Larger isn't necessarily better with these sorts of surveys, but my methodology aims for completeness, and the relative power of PlanetLab makes these computations surprisingly cheap. Details of my surveys so far:

    Date          Surveyor          Sample size  Sample method  Responses
    January 2008  Still / McCreath  46,136,113   Exhaustive     1,973,748
    April 2008    Still / McCreath  92,286,998   Exhaustive     1,609,111
    July 2008     Still / McCreath  97,545,668   Exhaustive     1,579,507
    October 2008  Still / McCreath  109,661,889  Exhaustive     1,801,081
    January 2009  Still / McCreath  110,397,428  Exhaustive     1,916,719
    April 2009    Still / McCreath  110,706,130  Exhaustive     1,925,760
    October 2009  Still / McCreath  111,209,212  Exhaustive     1,800,573
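
    One quick way to read the table above is to work out what percentage of the probed addresses answered in each survey. The snippet below is just that arithmetic, using the figures from the table; "response rate" is my label for the ratio here, not a metric taken from the surveys themselves.

```python
# (survey date, addresses probed, responses), copied from the table above.
SURVEYS = [
    ("January 2008", 46_136_113, 1_973_748),
    ("April 2008", 92_286_998, 1_609_111),
    ("July 2008", 97_545_668, 1_579_507),
    ("October 2008", 109_661_889, 1_801_081),
    ("January 2009", 110_397_428, 1_916_719),
    ("April 2009", 110_706_130, 1_925_760),
    ("October 2009", 111_209_212, 1_800_573),
]


def response_rate(sample_size, responses):
    """Percentage of probed addresses that answered with an SMTP banner."""
    return 100.0 * responses / sample_size


for date, sample_size, responses in SURVEYS:
    print(f"{date:<14}{response_rate(sample_size, responses):5.2f}%")
```

    The January 2008 survey works out to roughly 4.3%, and every later survey to somewhere under 2%.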


    Our survey attempts to identify the MTA software running on an SMTP server from its SMTP connection banner. In other words, we connect to each IP address in a collection on the SMTP port (TCP 25), and try to determine from the early stages of the SMTP protocol interaction what SMTP server software is running on that host. An SMTP server will often reply to the connection with a status 220 line, referred to as the SMTP banner, which tells the connecting client that the server is ready. The SMTP banner also frequently states what software the server is running. Even if the software in use isn't explicitly named, the banner often contains a string which is unique to a given SMTP implementation. The technique simply connects on the SMTP port and logs any lines starting with 220. The connection is then closed, with no attempt made to transfer an email.
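
    A minimal sketch of the banner-grabbing idea described above might look like the following. It is not the survey's actual probing code. The function names, the timeout, and in particular the SIGNATURES rules are hypothetical and only there to show the shape of the approach; real identification rules would need to be far more careful.

```python
import re
import socket

# Hypothetical signature rules, for illustration only. The rules the survey
# actually uses to recognise implementations are not given in this post.
SIGNATURES = [
    ("Sendmail", re.compile(r"\bsendmail\b", re.IGNORECASE)),
    ("Exim", re.compile(r"\bexim\b", re.IGNORECASE)),
    ("Postfix", re.compile(r"\bpostfix\b", re.IGNORECASE)),
]


def banner_lines(raw):
    """Return the lines of a server greeting that start with status 220."""
    text = raw.decode("latin-1")  # latin-1 never fails on arbitrary bytes
    return [line for line in text.splitlines() if line.startswith("220")]


def classify(lines):
    """Guess the implementation from the banner lines, or return 'unknown'."""
    banner = " ".join(lines)
    for name, pattern in SIGNATURES:
        if pattern.search(banner):
            return name
    return "unknown"


def grab_banner(ip, port=25, timeout=10.0):
    """Connect, read the greeting, and close without sending any commands.

    Simplification for this sketch: a single recv() call is made, so a long
    multi-line banner that arrives in several packets may be cut short.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout) as conn:
            raw = conn.recv(4096)
    except OSError:
        return []  # refused, timed out, unreachable: treat as non-response
    return banner_lines(raw)
```

    Calling classify(grab_banner("192.0.2.1")) would probe a single address (192.0.2.1 is a reserved documentation address, so substitute a host you are permitted to probe). Keeping the network code separate from the parsing and classification makes it possible to log raw banners once and re-run improved rules over them later.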

    So what results have I found so far? I'm trying to keep these blog posts under 1,000 words each, so that's too big a question to answer here. I've found some quite unexpected things along the way, such as an accurate technique for measuring the occurrence of domain parking on the Internet, and I'll discuss those in future posts. Instead, let me leave you with this short graphical summary of the results so far:



    This is the history of the five currently most popular implementations over time. You can see that Sendmail has fallen from a position of market dominance, and Exim is currently the most popular SMTP server implementation.

    I have a lot more to say about all this work, but as I mentioned earlier I want to keep the length of these posts down. I'll say more in future posts.

