I am parsing through an XML Wikipedia data dump and I'd like to pull out each page and turn it into a new XML document containing a stripped-down version of that page. From each page, I'm only interested in the title, id, timestamp, username, and text. Here is a full Wikipedia page:
<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility" />
  <revision>
    <id>381202555</id>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
      <id>7181920</id>
    </contributor>
    <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
  </revision>
</page>
What I'd like to end up with after the stripping is done would be something like this:
<page>
  <title>AccessibleComputing</title>
  <id>10</id>
  <timestamp>2010-08-26T22:38:36Z</timestamp>
  <username>OlEnglish</username>
  <text>#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
</page>
Because of the sheer size of these documents, I know I can't use DOM to handle this. I know how to set up a SAX parser, but what would be the best way to build a new XML file while parsing the document?
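For context, a minimal sketch of one way this could work, using Python's xml.sax to stream the dump and write each stripped page as it is parsed. The element names, and the rule that only the page-level <id> (not the revision or contributor ids) is kept, are assumptions based on the MediaWiki export format shown above:

```python
import xml.sax
from xml.sax.saxutils import escape
from io import StringIO

class StripHandler(xml.sax.ContentHandler):
    # Streams a dump and writes one stripped <page> per input <page>.
    # Element names assume the MediaWiki export format shown above.
    KEEP = {"title", "id", "timestamp", "username", "text"}

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.path = []    # stack of currently open element names
        self.buf = None   # collects character data inside a kept element
        self.fields = {}

    def startElement(self, name, attrs):
        self.path.append(name)
        if name == "page":
            self.fields = {}
        elif name in self.KEEP:
            self.buf = []

    def characters(self, content):
        # May be called several times per element, so accumulate.
        if self.buf is not None:
            self.buf.append(content)

    def endElement(self, name):
        self.path.pop()
        if name in self.KEEP and self.buf is not None:
            # Keep only the page-level <id>; skip revision/contributor ids.
            if name != "id" or (self.path and self.path[-1] == "page"):
                self.fields.setdefault(name, "".join(self.buf))
            self.buf = None
        elif name == "page":
            # Page complete: emit the stripped version immediately.
            self.out.write("<page>\n")
            for tag in ("title", "id", "timestamp", "username", "text"):
                if tag in self.fields:
                    self.out.write(f"  <{tag}>{escape(self.fields[tag])}</{tag}>\n")
            self.out.write("</page>\n")

# Demo on a small in-memory dump shaped like the example above.
sample = b"""<mediawiki>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <revision>
      <id>381202555</id>
      <timestamp>2010-08-26T22:38:36Z</timestamp>
      <contributor><username>OlEnglish</username><id>7181920</id></contributor>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
    </revision>
  </page>
</mediawiki>"""

out = StringIO()
xml.sax.parseString(sample, StripHandler(out))
stripped = out.getvalue()
print(stripped)
```

For a real dump you would pass an open file to xml.sax.parse instead of parseString, write to an output file, and wrap the emitted pages in a single root element so the result stays well-formed. The point is that nothing is held in memory beyond the fields of the page currently being parsed.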
Thanks