Universal Feedparser and XML namespaces

    I've always found Python's Universal Feedparser a bit hard to work with on feeds that use XML namespaces. Specifically, if you don't care about the data in the namespaced elements then you're fine, but if you want that data it gets a lot harder.

    In the past I've had to do some gross hacks. For example, this gem is from the MythNetTV code:

      
        # Modify the XML to work around namespace handling bugs in FeedParser
        lines = []
        re_mediacontent = re.compile('(.*)<media:content([^>]*)/ *>(.*)')

        for line in xmllines:
          m = re_mediacontent.match(line)
          count = 1
          while m:
            line = '%s<media:wannabe%d>%s</media:wannabe%d>%s' %(m.group(1), count,
                                                               m.group(2),
                                                               count, m.group(3))
            m = re_mediacontent.match(line)
            count = count + 1

          lines.append(line)

        # Parse the modified XML
        xml = ''.join(lines)
        parser = feedparser.parse(xml)
      
      


    Which is horrible, but works. This time around the problem is that I am having trouble getting to the gr:annotation tags in my Google Reader shared items feed. How annoying.
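    As an aside, the MythNetTV loop above can be written with a single re.sub per line. This is only a sketch of the same rewrite, not the MythNetTV code: rewrite_line is a name made up for illustration, and it numbers the tags left to right, whereas the greedy (.*) in the original ends up numbering them from the end of the line.

```python
import re

# Matches one self-closing <media:content .../> tag and captures its
# attribute string (everything between the tag name and the final "/").
MEDIA_CONTENT_RE = re.compile(r'<media:content([^>]*?)/ *>')


def rewrite_line(line):
    """Turn each self-closing media:content tag on a line into a numbered
    media:wannabeN element whose text is the original attribute string.

    rewrite_line is a hypothetical helper, not part of MythNetTV or feedparser.
    """
    count = 0

    def replace(match):
        # Called once per tag, in left-to-right order.
        nonlocal count
        count += 1
        return '<media:wannabe%d>%s</media:wannabe%d>' % (
            count, match.group(1), count)

    return MEDIA_CONTENT_RE.sub(replace, line)
```

    Joining rewrite_line(line) for each line in xmllines and handing the result to feedparser.parse() would then stand in for the loop.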

    In the case of the Google Reader feed, the problem seems to be that the annotation is presented like this:

      
      <gr:annotation><content type="html">Awesome. Canberra has needed
      something better than buses between the towncenters for a while, and light rail
      seems like a great way to do it. I much prefer trains to buses, and catch a
      light rail service to work every day when I am in Mountain View.
      </content><author gr:user-id="09387883873401903052"
      gr:profile-id="114835605728492647856"><name>mikal</name>
      </author></gr:annotation>
      
      


    Feedparser can only handle simple elements here (not elements that contain other elements). Therefore, this gross hack is required to get the annotation to parse correctly:

      
        simplify_re = re.compile('(.*)<gr:annotation>'
                                 '<content type="html">(.*)</content>'
                                 '<author .*><name>.*</name></author>'
                                 '</gr:annotation>(.*)')

        new_lines = []
        for line in lines:
          m = simplify_re.match(line)
          if m:
            new_lines.append('%s<gr:annotation>%s</gr:annotation>%s'
                             %(m.group(1), m.group(2), m.group(3)))
          else:
            new_lines.append(line)

        d = feedparser.parse(''.join(new_lines))
      
      


    Gross and fragile, but working. This is cool, because it means that I can now apply more logic to the shared links that end up in my blather feed. I'm thinking of something along these lines: only shared links with an annotation will end up in that feed, and the blather entry will include the annotation. Or something like that.
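    A less fragile variant might match non-greedily across the whole document instead of line by line. This is a sketch under assumptions: simplify_annotations is a hypothetical name, and the pattern assumes every annotation has the content-then-author shape shown above. It should cope with several annotations in one feed and with annotations that wrap across lines, but it is still a regex rewrite of XML.

```python
import re

# Non-greedy groups stop at the first closing tag, and re.DOTALL lets "."
# cross line breaks, so each annotation is matched on its own even when
# several share a line or one wraps over several lines.
ANNOTATION_RE = re.compile(
    r'<gr:annotation>\s*'
    r'<content type="html">(.*?)</content>\s*'
    r'<author\b[^>]*>\s*<name>.*?</name>\s*</author>\s*'
    r'</gr:annotation>',
    re.DOTALL)


def simplify_annotations(xml):
    """Collapse every nested gr:annotation down to just its content text.

    Hypothetical helper: takes the whole feed as one string and returns
    the rewritten string; text outside annotations is left untouched.
    """
    return ANNOTATION_RE.sub(r'<gr:annotation>\1</gr:annotation>', xml)
```

    The result of simplify_annotations(''.join(lines)) could then be passed to feedparser.parse() in place of the joined new_lines.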

posted at: 05:22 | path: /python/feedparser

    #17 kioopi

    That was exactly what I was searching for. This is quite hacky but does the job. Thank you.

    I ended up changing:
    new_lines.append('%s%s%s' %(...)
    to:
    new_lines.append('%s%s%s' %(...)

    The whole thing might need revisiting for the case of multiple annotations on an entry.

    Thanks again, your entry has been a great help!

    cheers
