Dealing with remote HTTP servers with buggy chunking implementations

    HTTP 1.1 supports chunked transfer encoding, where the server sends a response as a series of chunks, each prefixed with its size in hexadecimal. That lets the server send content without knowing the total length up front, and lets more than one piece of content travel over a given HTTP connection. Unfortunately for me, the site I was trying to access has a buggy chunking implementation, and that causes the somewhat fragile Python urllib2 code to throw an exception:

      Traceback (most recent call last):
        File "./", line 55, in ?
          xml = remote.readlines()
        File "/usr/lib/python2.4/", line 382, in readlines
          line = self.readline()
        File "/usr/lib/python2.4/", line 332, in readline
          data = self._sock.recv(self._rbufsize)
        File "/usr/lib/python2.4/", line 460, in read
          return self._read_chunked(amt)
        File "/usr/lib/python2.4/", line 499, in _read_chunked
          chunk_left = int(line, 16)
      ValueError: invalid literal for int(): 
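
    The last frame of that traceback is parsing a chunk-size line as a hexadecimal number. As a rough illustration of why a malformed size line is fatal, here is a minimal sketch of chunked decoding. It is not the standard library's code: decode_chunked is a hypothetical helper, and the stray blank line in the second example is only an assumption about the kind of bad input that produces an empty literal.

```python
def decode_chunked(raw):
    """Decode a chunked HTTP body (bytes) and return the payload.

    Hypothetical helper for illustration only; it ignores trailers
    and does no validation beyond parsing the size line.
    """
    payload = b''
    pos = 0
    while True:
        # Each chunk starts with "<size in hex>[;extensions]\r\n".
        line_end = raw.index(b'\r\n', pos)
        size_field = raw[pos:line_end].split(b';')[0].decode('ascii')
        chunk_size = int(size_field, 16)  # ValueError if the line is empty
        pos = line_end + 2
        if chunk_size == 0:
            # A zero-sized chunk marks the end of the body.
            return payload
        payload += raw[pos:pos + chunk_size]
        pos += chunk_size + 2  # skip the chunk data and its trailing CRLF


# A well-formed body: two chunks, then the terminating zero-sized chunk.
good = b'5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n'
print(decode_chunked(good))

# An assumed malformed body: a blank line where a chunk size should be.
bad = b'5\r\nhello\r\n\r\n0\r\n\r\n'
try:
    decode_chunked(bad)
except ValueError as err:
    print('bad chunk size: %s' % err)
```

    The decoder has no way to recover once a size line fails to parse, because it no longer knows where the next chunk starts.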

    I muttered about this earlier today, and found the bug tracking the problem in pythonistan. However, finding a bug marked "will not fix" wasn't satisfying enough...

    It turns out you can just have urllib2 lie to the server about which HTTP version it speaks. Chunked encoding only exists in HTTP 1.1, so a server answering an HTTP 1.0 request shouldn't use it. Here's my sample code for how to do that:

      import httplib
      import urllib2

      class HTTP10Connection(httplib.HTTPConnection):
        """HTTP10Connection -- a HTTP connection which is forced to ask for HTTP 1.0"""
        # httplib puts this string in the request line it sends
        _http_vsn_str = 'HTTP/1.0'

      class HTTP10Handler(urllib2.HTTPHandler):
        """HTTP10Handler -- don't use HTTP 1.1"""
        def http_open(self, req):
          return self.do_open(HTTP10Connection, req)

      # ... later, with feed holding the URL to fetch:
      request = urllib2.Request(feed)
      request.add_header('User-Agent', 'mythingie')
      opener = urllib2.build_opener(HTTP10Handler())
      remote = opener.open(request)
      content = remote.readlines()
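
    To see what the override changes on the wire without touching a real server, here is a small offline check. It is a sketch that assumes Python 3, where httplib was renamed http.client; RecordingSocket and request_line are hypothetical names made up for this example, and _http_vsn_str is a private attribute, so it could change between Python versions.

```python
import http.client


class HTTP10Connection(http.client.HTTPConnection):
    """Same trick as above, against Python 3's http.client."""
    _http_vsn_str = 'HTTP/1.0'


class RecordingSocket(object):
    """Hypothetical stand-in for a socket: records bytes instead of sending."""
    def __init__(self):
        self.sent = b''

    def sendall(self, data):
        self.sent += data


def request_line(connection_class):
    """Return the first line of the request the connection would send."""
    conn = connection_class('example.invalid')
    conn.sock = RecordingSocket()  # pre-set the socket so no connect happens
    conn.putrequest('GET', '/feed.xml')
    conn.endheaders()
    return conn.sock.sent.split(b'\r\n')[0].decode('ascii')


print(request_line(http.client.HTTPConnection))
print(request_line(HTTP10Connection))
```

    The first line printed should end in HTTP/1.1 and the second in HTTP/1.0, which is the whole of the lie being told to the server.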

    I hereby declare myself Michael Still, bringer of the gross python hacks.

    Tags for this post: python urllib2 buggy chunking

posted at: 22:27 | path: /python