How can one parse HTML/XML and extract information from it?
Monday, 13 November 2017
How do you parse and process HTML/XML in PHP?
Native XML Extensions
I prefer using one of the native XML extensions (http://php.net/manual/en/refs.xml.php) since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.
DOM (http://php.net/manual/en/book.dom.php)
The DOM extension allows you to operate on XML documents through the DOM API
with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a
platform- and language-neutral interface that allows programs and scripts to dynamically
access and update the content, structure and style of
documents.
DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries (http://schlitt.info/opensource/blog/0704_xpath.html). It is based on libxml (http://xmlsoft.org/html/libxml-HTMLparser.html).
It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API.
A basic usage example can be found in "Grabbing the href attribute of an A element" (https://stackoverflow.com/questions/3820666/regular-expression-for-grabbing-the-href-attribute-of-an-a-element/3820783#3820783) and a general conceptual overview can be found at "DOMDocument in php" (https://stackoverflow.com/questions/4979836/noob-question-about-domdocument-in-php/4983721#4983721).
How to use the DOM extension has been covered extensively on StackOverflow (https://stackoverflow.com/search?q=DOM+HTML+[PHP]&submit=search), so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.
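As a rough illustration of the DOM approach described above, here is a minimal sketch that loads deliberately sloppy HTML and collects link targets with an XPath query. The sample markup and the variable names are made up for this example; only DOMDocument, DOMXPath and the libxml_* functions come from PHP itself.

```php
<?php
// Hypothetical real-world markup: unclosed <p> and <li> tags, one unquoted attribute.
$html = <<<'HTML'
<html><body>
<p>Some links:
<ul>
  <li><a href="/first.html">First</a>
  <li><a href=/second.html class=ext>Second</a>
</ul>
</body></html>
HTML;

// Collect libxml's parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);   // loadHTML() uses libxml's HTML parser, which tolerates broken markup
libxml_clear_errors();

// Query the repaired tree with XPath: every <a> element that has an href attribute.
$xpath = new DOMXPath($dom);

$hrefs = [];
foreach ($xpath->query('//a[@href]') as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

print_r($hrefs);
```

The same XPath object can be reused for further queries on the document, which is usually cheaper than re-parsing the markup.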
XMLReader (http://php.net/manual/en/book.xmlreader.php)
The XMLReader extension is an XML pull parser. The reader acts as a cursor
going forward on the document stream and stopping at each node on the way.
XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM, where you can explicitly tell it to use libxml's HTML Parser Module.
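To make the difference between libxml's strict XML parser and its HTML parser concrete, the following sketch feeds the same broken fragment to DOMDocument::loadXML() and DOMDocument::loadHTML(). The fragment and variable names are invented for illustration.

```php
<?php
// Hypothetical broken markup: the <p> and <br> elements are never closed.
$broken = '<div><p>First paragraph<br><p>Second paragraph</div>';

libxml_use_internal_errors(true);     // keep parse errors out of the output

$asXml = new DOMDocument();
$xmlOk = $asXml->loadXML($broken);    // strict XML parser: rejects the input

$asHtml = new DOMDocument();
$htmlOk = $asHtml->loadHTML($broken); // HTML parser: repairs the input

libxml_clear_errors();

// The HTML parser closes the first <p> when it meets the second one.
$paragraphs = $asHtml->getElementsByTagName('p')->length;

var_dump($xmlOk, $htmlOk, $paragraphs);
```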
A basic usage example can be found at "getting all values from h1 tags using php" (https://stackoverflow.com/questions/3299033/getting-all-values-from-h1-tags-using-php/3299140#3299140).
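The pull-parser idea can be sketched roughly as follows: the reader advances node by node and the calling code decides what to do at each stop. The document below is a made-up, well-formed sample; it is not taken from the linked answer.

```php
<?php
// Hypothetical well-formed input.
$xml = <<<'XML'
<page>
  <h1>First heading</h1>
  <p>Some text</p>
  <h1>Second heading</h1>
</page>
XML;

$reader = new XMLReader();
$reader->XML($xml);   // read from a string; open() would read from a file or URL

$headings = [];
// read() moves the cursor to the next node and returns false at the end of the stream.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'h1') {
        // readString() returns the text content of the current element.
        $headings[] = $reader->readString();
    }
}
$reader->close();

print_r($headings);
```

Because only the current node is held in memory, this style suits large documents better than building a full DOM tree.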
XML Parser (http://php.net/manual/en/book.xml.php)
This extension lets you create XML parsers and then define handlers for
different XML events. Each XML parser also has a few parameters you can
adjust.
The XML Parser library is also based on libxml, and implements a SAX style XML push parser (http://en.wikipedia.org/wiki/Simple_API_for_XML). It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
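For comparison, a push parser calls back into your code for each event, so the state has to be tracked by hand. The sketch below collects the text of title elements; the sample document, the closures and the state variables are assumptions made for this example.

```php
<?php
// Hypothetical input document.
$xml = '<catalog><book id="1"><title>PHP Basics</title></book>'
     . '<book id="2"><title>XML in Depth</title></book></catalog>';

$titles  = [];     // collected <title> texts
$inTitle = false;  // true while the parser is inside a <title> element
$buffer  = '';     // text gathered for the current <title>

$parser = xml_parser_create('UTF-8');
// Element names are upper-cased by default; switch case folding off.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every start tag.
    function ($parser, string $name, array $attributes) use (&$inTitle, &$buffer) {
        if ($name === 'title') {
            $inTitle = true;
            $buffer = '';
        }
    },
    // Called for every end tag.
    function ($parser, string $name) use (&$inTitle, &$buffer, &$titles) {
        if ($name === 'title') {
            $titles[] = $buffer;
            $inTitle = false;
        }
    }
);

// Character data may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($parser, string $data) use (&$inTitle, &$buffer) {
        if ($inTitle) {
            $buffer .= $data;
        }
    }
);

$ok = xml_parse($parser, $xml, true);   // true marks this as the final chunk

print_r($titles);
```

The manual bookkeeping in $inTitle and $buffer is the extra effort the paragraph above refers to when it calls push parsing harder to work with than XMLReader.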
SimpleXml (http://php.net/manual/en/book.simplexml.php)
The SimpleXML extension provides a very simple and easily usable toolset to
convert XML to an object that can be processed with normal property selectors and array
iterators.
SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXml because it will choke.
A basic usage example can be found at "A simple program to CRUD node and node values of xml file" (https://stackoverflow.com/questions/4906073/a-simple-program-to-crud-node-and-node-values-of-xml-file) and there are lots of additional examples in the PHP Manual (http://php.net/manual/en/simplexml.examples-basic.php).
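A short sketch of what "normal property selectors and array iterators" looks like in practice, assuming well-formed input. The sample XML and variable names are invented for this example.

```php
<?php
// Hypothetical well-formed input; SimpleXML will not accept broken markup.
$xml = <<<'XML'
<library>
  <book id="1"><title>PHP Basics</title><author>Ann</author></book>
  <book id="2"><title>XML in Depth</title><author>Bob</author></book>
</library>
XML;

$library = simplexml_load_string($xml);

$books = [];
foreach ($library->book as $book) {
    // Child elements are read as properties, attributes with array syntax.
    // Cast to int/string to get plain values instead of SimpleXMLElement objects.
    $books[(int) $book['id']] = (string) $book->title;
}

// SimpleXML also supports XPath; xpath() returns an array of SimpleXMLElement.
$authors = array_map('strval', $library->xpath('//book/author'));

print_r($books);
print_r($authors);
```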
3rd Party Libraries (libxml based)
If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM (http://php.net/manual/en/book.dom.php)/libxml (http://xmlsoft.org/) underneath instead of string parsing.
FluentDom (https://thomas.weinert.info/FluentDOM/)
FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in
PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current
versions extend the DOM implementing standard interfaces and add features from the DOM
Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and
others. Can be installed via
Composer.
HtmlPageDom (https://github.com/wasinger/htmlpagedom/blob/master/README.md)
Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML
documents using DOM. It requires DomCrawler from Symfony2 components for
traversing the DOM tree and extends it by adding methods for manipulating
the DOM tree of HTML documents.
phpQuery (http://code.google.com/p/phpquery/) (not updated for years)
phpQuery is a server-side, chainable, CSS3 selector driven Document Object
Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides
additional Command Line Interface (CLI).
Also see: https://github.com/electrolinux/phpquery
Zend_Dom (http://framework.zend.com/manual/current/en/modules/zend.dom.intro.html)
Zend_Dom provides tools for working with DOM documents and structures.
Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM
documents utilizing both XPath and CSS selectors.
QueryPath (http://querypath.org/)
QueryPath is a PHP library for manipulating XML and HTML. It is designed to
work not only with local files, but also with web services and database resources. It
implements much of the jQuery interface (including CSS-style selectors), but it is
heavily tuned for server-side use. Can be installed via
Composer.
fDOMDocument (http://github.com/theseer/fDOMDocument)
fDOMDocument extends the standard DOM to use exceptions at all occasions of
errors instead of PHP warnings or notices. They also add various custom methods and
shortcuts for convenience and to simplify the usage of
DOM.
sabre/xml (http://sabre.io/xml/)
sabre/xml is a library that wraps and extends the XMLReader and XMLWriter
classes to create a simple "xml to object/array" mapping system and design pattern.
Writing and reading XML is single-pass and can therefore be fast and require low memory
on large xml
files.
FluidXML (https://github.com/servo-php/fluidxml)
FluidXML is a PHP library for manipulating XML with a concise and fluent
API.
It leverages XPath and the fluent programming pattern to be fun and
effective.
3rd-Party (not libxml-based)
The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.
PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/manual.htm#section_traverse)
- An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
- Require PHP 5+.
- Supports invalid HTML.
- Find tags on an HTML page with selectors just like jQuery.
- Extract contents from HTML in a single line.
I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors, https://api.jquery.com/child-selector/) are possible. Any of the libxml based libraries should outperform this easily.
PHP Html Parser (https://github.com/paquettg/php-html-parser)
PHPHtmlParser is a simple, flexible, html parser which allows you to select
tags using any css selector, like jQuery. The goal is to assiste in the development of
tools which require a quick, easy way to scrap html, whether it's valid or not! This
project was original supported by sunra/php-simple-html-dom-parser but the support seems
to have stopped so this project is my adaptation of his previous
work.
Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.
Ganon (https://code.google.com/p/ganon/)
- A universal tokenizer and HTML/XML/RSS DOM Parser
- Ability to manipulate elements and their attributes
- Supports invalid HTML and UTF8
- Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)
- A HTML beautifier (like HTML Tidy)
- Minify CSS and Javascript
- Sort attributes, change character case, correct indentation, etc.
- Extensible
- Parsing documents using callbacks based on current character/token
- Operations separated in smaller functions for easy overriding
- Fast and Easy
Never used it. Can't tell if it's any good.
HTML 5
You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows (https://stackoverflow.com/questions/4029341/dom-parser-that-allows-html5-style-in-script-tag/4029412). So for HTML5 you want to consider using a dedicated parser, like
html5lib (https://github.com/html5lib/html5lib-php)
A Python and PHP implementations of a HTML parser based on the WHATWG HTML5
specification for maximum compatibility with major desktop web browsers.
We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled "How-To for html 5 parsing" that is worth checking out.
WebServices
If you don't feel like programming PHP, you can also use Web services. In general, I found very little utility for these, but that's just me and my use cases.
ScraperWiki (http://scraperwiki.com/api/1.0)
ScraperWiki's external interface allows you to extract data in the form you
want for use on the web or in your own applications. You can also extract information
about the state of any
scraper.
Regular Expressions
Last and least recommended, you can extract data from HTML with regular expressions (https://stackoverflow.com/search?q=regular%20expression%20tutorials). In general, using Regular Expressions on HTML is discouraged.
Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.
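The brittleness is easy to reproduce. In the sketch below, a typical copy-and-paste pattern extracts the link from one piece of markup but misses an equivalent link written slightly differently, while the DOM extension handles both. The pattern, the sample markup and the firstHref() helper are all made up for this illustration.

```php
<?php
// A typical snippet-style pattern: assumes href is the first attribute,
// double-quoted, with no extra whitespace.
$pattern = '/<a href="([^"]*)">/';

$original = '<a href="/page.html">Page</a>';
// Same link, different markup: extra attribute, single quotes, trailing space.
$changed  = '<a class="nav" href=\'/page.html\' >Page</a>';

$regexOriginal = preg_match($pattern, $original, $m1) ? $m1[1] : null;
$regexChanged  = preg_match($pattern, $changed, $m2) ? $m2[1] : null;

// Hypothetical helper: the DOM extension parses the markup, so attribute
// order, quoting style and whitespace do not matter.
function firstHref(string $html): ?string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $anchor = $dom->getElementsByTagName('a')->item(0);
    return $anchor ? $anchor->getAttribute('href') : null;
}

var_dump($regexOriginal, $regexChanged);
var_dump(firstHref($original), firstHref($changed));
```

A more careful pattern could cover these two variations, but each new variation needs another adjustment, which is the maintenance cost a parser avoids.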
HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught for each new RegEx you write. RegEx are fine in some cases, but it really depends on your use-case.
You can write more reliable parsers (https://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491), but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job at this.
Also see Parsing Html The Cthulhu Way (http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html)
Books
If you want to spend some money, have a look at
I am not affiliated with PHP Architect or the authors.