Universal Feedparser and XML namespaces

    I've always found Python's Universal Feedparser a bit hard to work with when feeds use XML namespaces. If you don't care about the namespaced data you're fine, but if you want that data it gets a lot harder.

    In the past I've had to do some gross hacks. For example, this gem is from the MythNetTV code:

        # Modify the XML to work around namespace handling bugs in FeedParser
        lines = []
        re_mediacontent = re.compile('(.*)<media:content([^>]*)/ *>(.*)')
      
        for line in xmllines:
          m = re_mediacontent.match(line)
          count = 1
          while m:
            line = '%s<media:wannabe%d>%s</media:wannabe%d>%s' %(m.group(1), count,
                                                               m.group(2),
                                                               count, m.group(3))
            m = re_mediacontent.match(line)
            count = count + 1
      
          lines.append(line)
      
        # Parse the modified XML
        xml = ''.join(lines)
        parser = feedparser.parse(xml)
      


    Which is horrible, but works. This time around the problem is that I'm having trouble getting to the gr:annotation tags in my Google Reader shared items feed. How annoying.
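
    For the record, here is a small sketch of what that MythNetTV rewrite does to a single line. The function name and the sample lines are made up for illustration; only the regex and the loop come from the snippet above. Each self-closing media:content tag becomes a simple element whose text is the old attribute string, and because the leading (.*) is greedy, the tags on a line get numbered from right to left.

```python
import re

# Same pattern as the MythNetTV snippet: a self-closing <media:content .../> tag.
RE_MEDIACONTENT = re.compile(r'(.*)<media:content([^>]*)/ *>(.*)')


def rewrite_media_content(line):
    """Turn each <media:content .../> on a line into <media:wannabeN>...</media:wannabeN>.

    Hypothetical helper name; the loop mirrors the original hack.
    """
    count = 1
    m = RE_MEDIACONTENT.match(line)
    while m:
        # The attribute string (group 2) becomes the text of a simple element.
        line = '%s<media:wannabe%d>%s</media:wannabe%d>%s' % (
            m.group(1), count, m.group(2), count, m.group(3))
        m = RE_MEDIACONTENT.match(line)
        count += 1
    return line


# Made-up sample line, not taken from a real feed.
sample = '<item><media:content url="http://example.com/a.mp4"/></item>'
print(rewrite_media_content(sample))
# -> <item><media:wannabe1> url="http://example.com/a.mp4"</media:wannabe1></item>
```

    Lines without a media:content tag pass through unchanged.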

    In the case of the Google Reader feed, the problem seems to be that the annotation is presented like this:

      <gr:annotation><content type="html">Awesome. Canberra has needed
      something better than buses between the towncenters for a while, and light rail 
      seems like a great way to do it. I much prefer trains to buses, and catch a 
      light rail service to work every day when I am in Mountain View.
      </content><author gr:user-id="09387883873401903052" 
      gr:profile-id="114835605728492647856"><name>mikal</name>
      </author></gr:annotation>
      


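    As far as I can tell, Feedparser can only handle simple namespaced elements (ones that contain text, not ones that contain other elements), so the nested content and author elements inside gr:annotation get in the way. Therefore, this gross hack is required to get this to parse correctly: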

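        # Collapse <gr:annotation> down to the text of its <content> child and
        # drop the <author> block. Assumes the whole annotation is on one line.
        simplify_re = re.compile('(.*)<gr:annotation>'
                                 '<content type="html">(.*)</content>'
                                 '<author .*><name>.*</name></author>'
                                 '</gr:annotation>(.*)')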
      
        new_lines = []
        for line in lines:
          m = simplify_re.match(line)
          if m:
            new_lines.append('%s<gr:annotation>%s</gr:annotation>%s'
                             %(m.group(1), m.group(2), m.group(3)))
          else:
            new_lines.append(line)
      
        d = feedparser.parse(''.join(new_lines))
      


    Gross and fragile, but working. This is cool, because it means I can apply more logic to the shared links that end up in my blather feed. I'm thinking of something along the lines of: only shared links with an annotation end up in that feed, and the blather entry includes the annotation. Or something like that.
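
    A less fragile way to get at the same data might be to skip Feedparser for this one field and pull the annotations out with the standard library's xml.etree.ElementTree. This is only a sketch, not what the code above does: the function name and the sample feed are made up, and the Google Reader namespace URI is an assumption that should be checked against the xmlns:gr declaration in the real feed. It also shows the "only entries with an annotation" filter.

```python
import xml.etree.ElementTree as ET

# Namespace URIs. The Atom one is standard; the Google Reader one is an
# assumption and should be checked against xmlns:gr in the real feed.
NS = {
    'atom': 'http://www.w3.org/2005/Atom',
    'gr': 'http://www.google.com/schemas/reader/atom/',
}


def annotated_entries(feed_xml):
    """Return (title, annotation) pairs for entries that have a gr:annotation."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.findall('atom:entry', NS):
        content = entry.find('gr:annotation/atom:content', NS)
        if content is None:
            continue  # No annotation, so this entry is skipped.
        title = entry.findtext('atom:title', default='', namespaces=NS)
        results.append((title, (content.text or '').strip()))
    return results


# Made-up two-entry feed in the same shape as the annotation shown earlier.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gr="http://www.google.com/schemas/reader/atom/">
  <entry>
    <title>Light rail</title>
    <gr:annotation><content type="html">Awesome.</content>
      <author gr:user-id="1" gr:profile-id="2"><name>mikal</name></author>
    </gr:annotation>
  </entry>
  <entry><title>No comment</title></entry>
</feed>"""

print(annotated_entries(SAMPLE))
# -> [('Light rail', 'Awesome.')]
```

    Unlike the regex, this doesn't care whether the annotation is split across lines, but it does mean parsing the feed a second time alongside Feedparser.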

    Tags for this post: python feedparser namespace hack

posted at: 05:22 | path: /python/feedparser