I've been reading academic papers again (it tends to happen in batches). This time I've been focusing on the papers from last year's USENIX Internet Measurement Conference (IMC). One of the more immediately interesting papers presented was "The Web is Smaller than it Seems".
The paper measures the size of "the web" by scanning the domain names listed in either the DMOZ open directory or the .com and .net TLD zone files. That is very similar to one of the techniques I use to acquire domain names for my survey of Internet mail servers, which is what originally interested me in this paper. The researchers looked up the www hostname for each domain, and then used the number of domain names per IP address to estimate the total number of web servers present on the Internet. It is of course a little more complicated than that, but you can read the paper if you want the details.
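To make the counting step concrete, here is a rough Python sketch of how I imagine it working. This is my own illustration, not the paper's code: the function names (resolve_www, domains_per_ip, cohosted_share), the sample domains and the threshold definition are all assumptions, and the paper's actual methodology is more involved.

```python
import socket
from collections import Counter


def resolve_www(domain):
    """Return the IPv4 address of www.<domain>, or None if the lookup fails."""
    try:
        return socket.gethostbyname("www." + domain)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def domains_per_ip(domain_to_ip):
    """Count how many domains resolve to each IP address (failed lookups skipped)."""
    return Counter(ip for ip in domain_to_ip.values() if ip is not None)


def cohosted_share(domain_to_ip, threshold):
    """Fraction of resolved domains whose IP serves at least `threshold` domains.

    Assumption: this is a simplified stand-in for the paper's co-hosting metric.
    """
    counts = domains_per_ip(domain_to_ip)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    heavy = sum(n for n in counts.values() if n >= threshold)
    return heavy / total


# Made-up sample data using documentation addresses, standing in for real lookups.
sample = {
    "a.example": "192.0.2.1",
    "b.example": "192.0.2.1",
    "c.example": "192.0.2.1",
    "d.example": "198.51.100.7",
    "e.example": None,  # lookup failed
}
```

For a real run you would build the mapping with something like {d: resolve_www(d) for d in domains} over a zone file, and then call cohosted_share(mapping, 10000) to get a figure comparable in spirit to the ones the paper reports.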
The paper's findings are interesting:
We find that as much as 60% of the Web servers are co-hosted with 10,000 or more other Web servers, indicating that the Internet contains many small co-hosted Web servers. Likewise, more than 95% of Web servers share their AS with 1000 or more other Web servers. We additionally find that heavily co-hosted Web servers contribute much less traffic than Web servers that are not co-hosted, confirming that popular servers are not co-located, while less popular servers co-locate more frequently. When considering block lists, we find the vast majority of blocked Web servers are hosted on IPs hosting 100 Web servers or more. This indicates there may be a great deal of collateral damage with IP blocking. Finally, when looking at authoritative DNS servers, we see a high degree of co-location on a very small number of DNS servers, which may result in the Web being fragile from a DNS perspective.
That's a pretty striking result. Unfortunately, I think the researchers missed an opportunity here. While they determined that a small number of IP addresses host a large number of web sites, they didn't attempt to determine how many of those domains are just parked content. Now that would have been something interesting to know. I've poked around a little at the parking behaviour of domains using the results of MX record lookups, and that leads me to suspect that a large number of those heavily co-located domain names are simply parked, and not adding any interesting content to the Internet.
