I am parsing through an XML Wikipedia data dump and I'd like to pull out each page and turn it into a new XML document containing a stripped-down version of that page. From each page, I'm only interested in the title, id, timestamp, username, and text. Here is a full Wikipedia page:
<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility" />
  <revision>
    <id>381202555</id>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
      <id>7181920</id>
    </contributor>
    <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
  </revision>
</page>
What I'd like to end up with after the stripping is done would be something like this:
<page>
  <title>AccessibleComputing</title>
  <id>10</id>
  <timestamp>2010-08-26T22:38:36Z</timestamp>
  <username>OlEnglish</username>
  <text>#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
</page>
Because of the sheer size of these documents, I know I can't use DOM to handle this. I know how to set up a SAX parser, but what would be the best way to build a new XML file while parsing the document?
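For context, a minimal sketch of one way this could work, using Python's xml.sax to stream the dump and write each stripped page as it is parsed. The element names, and the rule that only the page-level <id> (not the revision or contributor ids) is kept, are assumptions based on the MediaWiki export format shown above:

```python
import xml.sax
from xml.sax.saxutils import escape
from io import StringIO

class StripHandler(xml.sax.ContentHandler):
    # Streams a dump and writes one stripped <page> per input <page>.
    # Element names assume the MediaWiki export format shown above.
    KEEP = {"title", "id", "timestamp", "username", "text"}

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.path = []    # stack of currently open element names
        self.buf = None   # collects character data inside a kept element
        self.fields = {}

    def startElement(self, name, attrs):
        self.path.append(name)
        if name == "page":
            self.fields = {}
        elif name in self.KEEP:
            self.buf = []

    def characters(self, content):
        # May be called several times per element, so accumulate.
        if self.buf is not None:
            self.buf.append(content)

    def endElement(self, name):
        self.path.pop()
        if name in self.KEEP and self.buf is not None:
            # Keep only the page-level <id>; skip revision/contributor ids.
            if name != "id" or (self.path and self.path[-1] == "page"):
                self.fields.setdefault(name, "".join(self.buf))
            self.buf = None
        elif name == "page":
            # Page complete: emit the stripped version immediately.
            self.out.write("<page>\n")
            for tag in ("title", "id", "timestamp", "username", "text"):
                if tag in self.fields:
                    self.out.write(f"  <{tag}>{escape(self.fields[tag])}</{tag}>\n")
            self.out.write("</page>\n")

# Demo on a small in-memory dump shaped like the example above.
sample = b"""<mediawiki>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <revision>
      <id>381202555</id>
      <timestamp>2010-08-26T22:38:36Z</timestamp>
      <contributor><username>OlEnglish</username><id>7181920</id></contributor>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
    </revision>
  </page>
</mediawiki>"""

out = StringIO()
xml.sax.parseString(sample, StripHandler(out))
stripped = out.getvalue()
print(stripped)
```

For a real dump you would pass an open file to xml.sax.parse instead of parseString, write to an output file, and wrap the emitted pages in a single root element so the result stays well-formed. The point is that nothing is held in memory beyond the fields of the page currently being parsed.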
Thanks