Monday 25 December 2017

java - Building XML file with SAX parser


I am parsing an XML Wikipedia data dump, and I'd like to pull out each page and write it
to a new XML document as a stripped-down version of that page. For each page, I'm only
interested in the title, id, timestamp, username, and text.



Here is a full Wikipedia page:




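<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility" />
  <revision>
    <id>381202555</id>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
      <id>7181920</id>
    </contributor>
    <minor />
    <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
  </revision>
</page>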






What I'd like to end up with after the stripping is done would be something like this:






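<page>
  <title>AccessibleComputing</title>
  <id>10</id>
  <revision>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
    </contributor>
    <text>#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
  </revision>
</page>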







Because of the sheer size of these documents, I know I can't use DOM to handle this. I
know how to set up a SAX parser, but what would be the best way to build a new XML file
while parsing the document?



Thanks



Answer




You can use XMLFilterImpl and forward only the content you need. Here is the idea: both
input and output are streams, so it can process XML of any size.



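import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

public class FilterDemo {
    public static void main(String[] args) throws Exception {
        // The filter sits between the parser and the serializer and
        // forwards only the SAX events we want to keep.
        XMLReader xr = new XMLFilterImpl(XMLReaderFactory.createXMLReader()) {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts)
                    throws SAXException {
                // Forward only <page> start tags; all other elements are dropped.
                if (qName.equals("page")) {
                    super.startElement(uri, localName, qName, atts);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName)
                    throws SAXException {
                if (qName.equals("page")) {
                    super.endElement(uri, localName, qName);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                // Text is dropped here; call super.characters(ch, start, length) to keep it.
            }
        };

        // An identity transform serializes the filtered SAX events straight to the output stream.
        Source src = new SAXSource(xr, new InputSource("1.xml"));
        Result res = new StreamResult(System.out);
        TransformerFactory.newInstance().newTransformer().transform(src, res);
    }
}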
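
The snippet above only shows the mechanism, since it forwards nothing but the page tags. To get the output asked for in the question, the filter also has to tell the page id apart from the revision and contributor ids, which it can do by tracking the path of open elements. The following is a minimal sketch of that idea, not a definitive implementation. The class name PageStripper, the strip helper, and the assumption that the dump's root element is called mediawiki are illustrative choices, and Set.of needs Java 9 or later.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: keeps only whitelisted element paths of a Wikipedia dump.
public class PageStripper extends XMLFilterImpl {
    // Structural elements to keep (assumes the root element is <mediawiki>).
    private static final Set<String> CONTAINERS = Set.of(
        "mediawiki", "mediawiki/page", "mediawiki/page/revision",
        "mediawiki/page/revision/contributor");
    // Elements whose text content is kept.
    private static final Set<String> LEAVES = Set.of(
        "mediawiki/page/title", "mediawiki/page/id",
        "mediawiki/page/revision/timestamp",
        "mediawiki/page/revision/contributor/username",
        "mediawiki/page/revision/text");

    // Names of the currently open elements, root first.
    private final Deque<String> open = new ArrayDeque<>();

    public PageStripper(XMLReader parent) { super(parent); }

    private String path() { return String.join("/", open); }

    private boolean kept() {
        String p = path();
        return CONTAINERS.contains(p) || LEAVES.contains(p);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        open.addLast(qName);
        // Attributes such as xml:space are dropped by passing an empty list.
        if (kept()) super.startElement(uri, localName, qName, new AttributesImpl());
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (kept()) super.endElement(uri, localName, qName);
        open.removeLast();
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Forward text only inside kept leaf elements; this also drops
        // the whitespace left behind between removed elements.
        if (LEAVES.contains(path())) super.characters(ch, start, length);
    }

    // Runs the filter and returns the result as a String (fine for a demo;
    // for a real dump, pass a StreamResult over a FileOutputStream instead).
    public static String strip(InputSource in) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader filter = new PageStripper(spf.newSAXParser().getXMLReader());
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new SAXSource(filter, in), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>AccessibleComputing</title><ns>0</ns>"
            + "<id>10</id><redirect title=\"Computer accessibility\" />"
            + "<revision><id>381202555</id><timestamp>2010-08-26T22:38:36Z</timestamp>"
            + "<contributor><username>OlEnglish</username><id>7181920</id></contributor>"
            + "<minor /><comment>Reverted edits</comment>"
            + "<text xml:space=\"preserve\">#REDIRECT [[Computer accessibility]]</text>"
            + "</revision></page></mediawiki>";
        System.out.println(strip(new InputSource(new StringReader(xml))));
    }
}
```

Keeping the root element means a dump with many pages still yields a single well-formed document. Matching on full paths rather than bare names is what lets the page id through while the revision and contributor ids are dropped. Rebuilding the path string on every event is simple but not the fastest option for a multi-gigabyte dump.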


