Universal Feedparser and XML namespaces
I've always found Python's Universal Feedparser a bit hard to work with on feeds that use XML namespaces. If you don't care about the data in the namespaced elements then you're fine, but if you want that data it gets a lot harder.
In the past I've had to do some gross hacks. For example, this gem is from the MythNetTV code:
    # Modify the XML to work around namespace handling bugs in FeedParser
    lines = []
    re_mediacontent = re.compile('(.*)<media:content([^>]*)/ *>(.*)')

    for line in xmllines:
        m = re_mediacontent.match(line)
        count = 1
        while m:
            line = '%s<media:wannabe%d>%s</media:wannabe%d>%s' %(m.group(1), count, m.group(2), count, m.group(3))
            m = re_mediacontent.match(line)
            count = count + 1

        lines.append(line)

    # Parse the modified XML
    xml = ''.join(lines)
    parser = feedparser.parse(xml)
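To make it clearer what that rewrite actually does, here is a small sketch that runs the same regular expression over a single made-up line. The helper name rewrite_media_content and the sample XML are mine for illustration, not part of MythNetTV, and the feedparser call is left out so that this runs with only the standard library.

```python
import re

# Same pattern as the MythNetTV workaround: a self-closing <media:content .../>
RE_MEDIACONTENT = re.compile('(.*)<media:content([^>]*)/ *>(.*)')


def rewrite_media_content(line):
    """Turn each self-closing media:content tag into a numbered, simple element.

    Hypothetical helper wrapping the loop body from the workaround above.
    The tag's attributes end up as the text of <media:wannabeN>.
    """
    count = 1
    m = RE_MEDIACONTENT.match(line)
    while m:
        line = '%s<media:wannabe%d>%s</media:wannabe%d>%s' % (
            m.group(1), count, m.group(2), count, m.group(3))
        # The leading (.*) is greedy, so tags are rewritten right to left.
        m = RE_MEDIACONTENT.match(line)
        count += 1
    return line


# Made-up sample line, not taken from a real feed.
sample = ('<item><media:content url="http://example.com/a.mp4" '
          'type="video/mp4"/></item>')
rewritten = rewrite_media_content(sample)
print(rewritten)
```

Note that the attribute string becomes the element's text, so whatever reads it afterwards still has to pick the url and type values apart.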
Which is horrible, but works. This time around the problem is that I am having trouble getting to the gr:annotation tags in my Google Reader shared items feed. How annoying.
In the case of the Google Reader feed, the problem seems to be that the annotation is presented like this:
<gr:annotation><content type="html">Awesome. Canberra has needed something better than buses between the towncenters for a while, and light rail seems like a great way to do it. I much prefer trains to buses, and catch a light rail service to work every day when I am in Mountain View. </content><author gr:user-id="09387883873401903052" gr:profile-id="114835605728492647856"><name>mikal</name> </author></gr:annotation>
As far as I can tell, Feedparser can only handle simple elements in a namespace like this one (elements that contain text, not elements that contain other elements). Therefore, this gross hack is required to get the annotation to parse correctly:
    simplify_re = re.compile('(.*)<gr:annotation>'
                             '<content type="html">(.*)</content>'
                             '<author .*><name>.*</name></author>'
                             '</gr:annotation>(.*)')

    new_lines = []
    for line in lines:
        m = simplify_re.match(line)
        if m:
            new_lines.append('%s<gr:annotation>%s</gr:annotation>%s'
                             %(m.group(1), m.group(2), m.group(3)))
        else:
            new_lines.append(line)

    d = feedparser.parse(''.join(new_lines))
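Here is a minimal sketch of that flattening step on its own, so the effect of the regular expression is visible without fetching a feed. The simplify_annotation helper and the shortened sample entry (with made-up ids) are assumptions for illustration, and the feedparser call is omitted to keep it standard-library only.

```python
import re

# Same pattern as above: the whole annotation must sit on one line, with no
# whitespace between the closing tags.
SIMPLIFY_RE = re.compile('(.*)<gr:annotation>'
                         '<content type="html">(.*)</content>'
                         '<author .*><name>.*</name></author>'
                         '</gr:annotation>(.*)')


def simplify_annotation(line):
    """Collapse a nested gr:annotation into a simple text-only element.

    Hypothetical helper; lines that do not match are returned unchanged.
    """
    m = SIMPLIFY_RE.match(line)
    if m:
        return '%s<gr:annotation>%s</gr:annotation>%s' % (
            m.group(1), m.group(2), m.group(3))
    return line


# Shortened, made-up sample in the same shape as the annotation shown earlier.
sample = ('<entry><gr:annotation><content type="html">Awesome.</content>'
          '<author gr:user-id="1" gr:profile-id="2"><name>mikal</name></author>'
          '</gr:annotation></entry>')
simplified = simplify_annotation(sample)
print(simplified)

# A single stray space between the closing tags is enough to defeat the
# pattern, in which case the line passes through untouched.
spaced = sample.replace('</name></author>', '</name> </author>')
print(simplify_annotation(spaced) == spaced)
```

The second call shows how sensitive the pattern is to the exact layout of the feed: any change in whitespace or attribute order on Google's side would silently stop the rewrite from happening.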
Gross, and fragile, but working. This is cool, because it means that I can now apply more logic to the shared links that end up in my blather feed. I'm thinking of something along the lines of only shared links with an annotation ending up in that feed, with the blather entry including the annotation. Or something like that.
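A rough sketch of that filtering idea might look like the following. It uses plain dictionaries as stand-ins for parsed entries, and it assumes the flattened annotation shows up under a key named gr_annotation; that key name, the annotated_entries helper and the sample data are all assumptions for illustration, not something I have checked against a real parsed feed.

```python
def annotated_entries(entries, key='gr_annotation'):
    """Return (title, link, annotation) for entries that carry an annotation.

    The key name is an assumption about where the annotation text ends up;
    entries are treated as dict-like objects with a .get() method.
    """
    picked = []
    for entry in entries:
        annotation = (entry.get(key) or '').strip()
        if annotation:
            picked.append((entry.get('title', ''),
                           entry.get('link', ''),
                           annotation))
    return picked


# Made-up entries standing in for parsed shared items.
entries = [
    {'title': 'Light rail for Canberra',
     'link': 'http://example.com/rail',
     'gr_annotation': 'Awesome.'},
    {'title': 'No comment here',
     'link': 'http://example.com/other'},
]

for title, link, annotation in annotated_entries(entries):
    print('%s (%s): %s' % (title, link, annotation))
```

Only the first entry survives the filter, since the second has no annotation to include in a blather entry.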
Tags for this post: python feedparser namespace hack