Measuring the popularity of SMTP server implementations on the Internet
I'm interested in measuring the performance of SMTP servers connected to the Internet. Before I can poke around inside an SMTP implementation, I want to ensure that I am using one which lots of people use. To that end I have been running a series of SMTP server surveys for the last several years. This work has been alluded to in the past, but I haven't published any results. This has mainly been because, while I have written a number of papers on the topic, I have yet to have one accepted by an academic conference. I've been hesitant to comment on my results because academic venues require that submissions not be previously published work.
I've decided to change that policy. I'm going to reserve a lot of the deeper analysis for academic publication (if I can make such a thing happen), but I am going to start talking more publicly about the work I am doing. To start that off, I should mention what I've been doing...
There have been a number of previous surveys of SMTP servers connected to the Internet, with each survey using a different methodology. Although these results are therefore not directly comparable, a comparison still provides some insight into how the server landscape has changed over the last 12 years. A comparison of published surveys is presented in the table below. For each survey the table shows: the sample size, which is the number of IP addresses surveyed; the sample method, which is the approach used to choose which IP addresses to sample, and which can introduce bias into the sample; and the number of responses, which is the number of SMTP servers that responded. The majority of these surveys have relied on random sampling of the IP address space, perhaps with a selection algorithm to limit the results selected. Few of the more recent surveys provide complete information on their probing implementation or the rules they used to identify specific implementations from their observations. It should be noted that non-response from a surveyed IP generally indicates that it is not in fact running an SMTP server accessible from the Internet.
|Date||Surveyor||Sample size||Sample method||Responses|
|7 Nov 1996||Bernstein||500,000||Selective random||25,121|
|14 Aug 1997||Bernstein||200,000||Selective random||8,056|
|11 May 1998||Bernstein||20,310||MX walk||17,592|
|2 Apr 2000||Bernstein||12,595||Selective random||10,087|
|5 Oct 2000||Bernstein||25,777||Random||859|
|27 Sep 2001||Bernstein||39,206||Random||937|
|1 Dec 2002||Credentia||4,096||Random||1,837|
|1 Jan 2003||Credentia||30,000||Random||17,540|
|1 Apr 2003||Credentia||37,563||Random||20,410|
|1 May 2007||MailChannels||400,000||Corporate domain names||254,400|
By contrast, the surveys that I have been running with the assistance of my ever-patient PhD supervisor, Dr Eric McCreath, have been quite a bit larger. Note that larger isn't necessarily better with these sorts of surveys, but my methodology aims for completeness, and the relative power of PlanetLab makes these computations surprisingly cheap. Details of my surveys so far:
|Date||Surveyor||Sample size||Sample method||Responses|
|January 2008||Still / McCreath||46,136,113||Exhaustive||1,973,748|
|April 2008||Still / McCreath||92,286,998||Exhaustive||1,609,111|
|July 2008||Still / McCreath||97,545,668||Exhaustive||1,579,507|
|October 2008||Still / McCreath||109,661,889||Exhaustive||1,801,081|
|January 2009||Still / McCreath||110,397,428||Exhaustive||1,916,719|
|April 2009||Still / McCreath||110,706,130||Exhaustive||1,925,760|
|October 2009||Still / McCreath||111,209,212||Exhaustive||1,800,573|
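One way to read the table above is as a response rate: the fraction of surveyed IP addresses that answered as an SMTP server. The short Python sketch below just does that arithmetic on the figures from the table. The names `surveys` and `response_rate` are mine, chosen for illustration, and are not part of the survey tooling.

```python
# Figures copied from the table above: (survey date, sample size, responses).
surveys = [
    ("January 2008", 46_136_113, 1_973_748),
    ("April 2008", 92_286_998, 1_609_111),
    ("July 2008", 97_545_668, 1_579_507),
    ("October 2008", 109_661_889, 1_801_081),
    ("January 2009", 110_397_428, 1_916_719),
    ("April 2009", 110_706_130, 1_925_760),
    ("October 2009", 111_209_212, 1_800_573),
]


def response_rate(sample_size, responses):
    """Fraction of surveyed IP addresses that responded."""
    return responses / sample_size


# Map each survey date to its response rate.
rates = {date: response_rate(size, hits) for date, size, hits in surveys}

for date, rate in rates.items():
    print(f"{date}: {rate:.2%}")
```

Because each survey samples a different set of addresses, these rates are only a rough guide and shouldn't be compared too closely across surveys.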
Our survey attempts to identify the MTA software running on an SMTP server using the SMTP connection banner. In other words, we connect to each of a collection of IP addresses on the SMTP port (TCP 25), and try to determine from the early stages of the SMTP protocol interaction what SMTP server software is running on that host. An SMTP server will usually reply to the connection with a status 220 line, referred to as the SMTP banner, which tells the connecting client that the server is ready. The SMTP banner also frequently states what software the server is running. Even if the software in use isn't explicitly named, the banner is often a string which is unique to a given SMTP implementation. The technique simply connects on the SMTP port and logs any lines starting with 220. The connection is then closed, with no attempt made to transfer an email.
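To make the banner technique concrete, here is a minimal Python sketch of a single probe. It is not the code the survey actually runs. The function names `read_banner` and `guess_software`, the timeout, and the tiny list of software names are all my assumptions for illustration, and the matching only covers banners that name their software explicitly.

```python
import socket

# Hypothetical, deliberately tiny list of names to look for in a banner.
# The real identification rules are not described in this post.
KNOWN_NAMES = ["Sendmail", "Exim", "Postfix"]


def read_banner(host, port=25, timeout=10.0):
    """Connect, collect the 220 greeting lines, then close.

    No SMTP commands are sent, so no attempt is made to transfer mail.
    """
    lines = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rb") as reader:
            for raw in reader:
                line = raw.decode("latin-1").rstrip("\r\n")
                if not line.startswith("220"):
                    break  # not a greeting line, stop reading
                lines.append(line)
                # "220-" marks a continuation line; anything else ends the banner.
                if line[3:4] != "-":
                    break
    return lines


def guess_software(banner_lines):
    """Return the first known name found in the banner, or None."""
    text = " ".join(banner_lines).lower()
    for name in KNOWN_NAMES:
        if name.lower() in text:
            return name
    return None
```

Calling `guess_software(read_banner("192.0.2.1"))` (a documentation address, used here only as a placeholder) would return a name from the list or `None`. A host that refuses the connection or times out raises `OSError`, which a survey would record as a non-response.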
So what results have I found so far? I'm trying to keep these blog posts to less than 1,000 words each, so that's too big a question to answer here. I've found some quite unexpected things along the way, such as an accurate technique for measuring the occurrence of domain parking on the Internet, and I'll discuss those in future posts. Instead, let me leave you with this short graphical summary of the results so far:
This is the history of the five currently most popular implementations over time. You can see that Sendmail has fallen from a position of market dominance, and Exim is currently the most popular SMTP server implementation.
I have a lot more to say about all this work, but as I mentioned earlier I want to keep the length of these posts down. I'll say more in future posts.