Dealing with remote HTTP servers with buggy chunking implementations

    HTTP 1.1 implements chunking as a way for servers to send a response in pieces, each prefixed with its size in hex, so the client always knows how much of the current chunk is left without the server having to declare a total length up front. That in turn lets you send more than one piece of content over a single HTTP connection. Unfortunately for me, the site I was trying to access has a buggy chunking implementation, and that causes the somewhat fragile Python urllib2 code to throw an exception:

      
      Traceback (most recent call last):
        File "./mythingie.py", line 55, in ?
          xml = remote.readlines()
        File "/usr/lib/python2.4/socket.py", line 382, in readlines
          line = self.readline()
        File "/usr/lib/python2.4/socket.py", line 332, in readline
          data = self._sock.recv(self._rbufsize)
        File "/usr/lib/python2.4/httplib.py", line 460, in read
          return self._read_chunked(amt)
        File "/usr/lib/python2.4/httplib.py", line 499, in _read_chunked
          chunk_left = int(line, 16)
      ValueError: invalid literal for int():
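
    To see why that last frame blows up, here is a minimal sketch of chunked decoding, written for Python 3 so it runs on a current interpreter. It is not the httplib code, and read_chunked is a hypothetical helper. The assumption is that the buggy server sent an empty line where a hex chunk size belonged, which would match the empty literal in the ValueError above.

```python
import io


def read_chunked(stream):
    """Decode a chunked body from a binary file-like object (sketch only)."""
    body = []
    while True:
        line = stream.readline().decode("ascii")
        # Drop any chunk extension after ';', then parse the hex size.
        # An empty size line makes int() raise ValueError right here.
        size = int(line.split(";", 1)[0].strip(), 16)
        if size == 0:
            break  # a zero-sized chunk marks the end of the body
        body.append(stream.read(size))
        stream.read(2)  # discard the CRLF that terminates each chunk
    return b"".join(body)


# A well-formed body: two chunks, then the terminating zero-sized chunk.
good = io.BytesIO(b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n")
print(read_chunked(good))  # b'hello world'

# Assumed failure mode: a blank line where a chunk size should be.
bad = io.BytesIO(b"5\r\nhello\r\n\r\n")
try:
    read_chunked(bad)
except ValueError as exc:
    print("ValueError:", exc)
```

    A more forgiving parser could treat the blank line as end of body, but that means guessing at what the server meant.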
      
      


    I muttered about this earlier today, and even found the bug tracking the problem in pythonistan. However, finding a bug marked "will not fix" wasn't satisfying enough...

    It turns out you can just have urllib2 lie to the server about which HTTP version it speaks. HTTP 1.0 has no chunked encoding, so a server that believes it is talking to a 1.0 client has to turn chunking off. Here's my sample code for how to do that:

      
      import httplib
      import urllib2


      class HTTP10Connection(httplib.HTTPConnection):
        """HTTP10Connection -- an HTTP connection which is forced to ask for
           HTTP 1.0
        """

        # Private httplib attribute: the version string put in the request line
        _http_vsn_str = 'HTTP/1.0'


      class HTTP10Handler(urllib2.HTTPHandler):
        """HTTP10Handler -- don't use HTTP 1.1"""

        def http_open(self, req):
          # Same as the stock handler, but with our connection class
          return self.do_open(HTTP10Connection, req)


      # ...

        # feed holds the URL being fetched
        request = urllib2.Request(feed)
        request.add_header('User-Agent', 'mythingie')

        # Passing our handler replaces the default HTTPHandler
        opener = urllib2.build_opener(HTTP10Handler())

        remote = opener.open(request)
        content = remote.readlines()
        remote.close()
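
    For anyone on Python 3, where httplib became http.client and urllib2 became urllib.request, the same hack might look roughly like the sketch below. It assumes http.client still reads the private _http_vsn_str attribute when building the request line, which is an implementation detail and not a documented API. fetch_lines is a hypothetical helper name, not something from the original script.

```python
import http.client
import urllib.request


class HTTP10Connection(http.client.HTTPConnection):
    """An HTTP connection that announces HTTP/1.0 in its request line."""

    # Private attribute of http.client; overriding it is a hack, not an API.
    _http_vsn_str = "HTTP/1.0"


class HTTP10Handler(urllib.request.HTTPHandler):
    """Open http:// URLs through HTTP10Connection instead of the default."""

    def http_open(self, req):
        return self.do_open(HTTP10Connection, req)


def fetch_lines(url, user_agent="mythingie"):
    """Fetch url while claiming HTTP/1.0; return the body as byte lines."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # Passing a handler instance replaces the default HTTPHandler.
    opener = urllib.request.build_opener(HTTP10Handler())
    with opener.open(request) as remote:
        return remote.readlines()
```

    Calling fetch_lines(feed) would then stand in for the open, readlines and close sequence above. This only covers http:// URLs; an https:// feed would need the same treatment applied to HTTPSConnection and HTTPSHandler.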
      
      


    I hereby declare myself Michael Still, bringer of the gross python hacks.

posted at: 22:27 | path: /python
