How can one parse HTML/XML and extract information from it?
Monday 13 November 2017
How do you parse and process HTML/XML in PHP?
Native XML Extensions
I prefer using one of the native XML extensions (http://php.net/manual/en/refs.xml.php)
since they come bundled with PHP, are usually faster than all the
3rd party libs and give me all the control I need over the
markup.
DOM (http://php.net/manual/en/book.dom.php)
The DOM extension allows you to operate on XML documents through the DOM API
with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a
platform- and language-neutral interface that allows programs and scripts to dynamically
access and update the content, structure and style of
documents.
DOM is capable of parsing and modifying real world (broken) HTML and it can do
XPath queries (http://schlitt.info/opensource/blog/0704_xpath.html). It is based on
libxml (http://xmlsoft.org/html/libxml-HTMLparser.html).
It takes
some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a
language-agnostic interface, you'll find implementations in many languages, so if you
need to change your programming language, chances are you will already know how to use
that language's DOM API then.
A basic usage example can be found in "Grabbing the href attribute of an A element"
(https://stackoverflow.com/questions/3820666/regular-expression-for-grabbing-the-href-attribute-of-an-a-element/3820783#3820783)
and a general conceptual overview can be found at "DOMDocument in php"
(https://stackoverflow.com/questions/4979836/noob-question-about-domdocument-in-php/4983721#4983721).
How to use the DOM extension has been covered extensively on StackOverflow
(https://stackoverflow.com/search?q=DOM+HTML+[PHP]&submit=search), so if you
choose to use it, you can be sure most of the issues you run into can be solved by
searching/browsing StackOverflow.
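As a rough illustration of the DOM plus XPath combination described above, here is a minimal sketch. The markup in $html is made up for the example (it deliberately leaves its p and body tags unclosed), and libxml_use_internal_errors() is used only to keep the parser warnings about the broken markup quiet.

```php
<?php
// Hypothetical broken markup: <p>, <body> and <html> are never closed.
$html = '<html><body><p>Unclosed paragraph '
      . '<a href="/one">first</a> <a href="/two">second</a>';

// Collect libxml warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);      // loadHTML() uses libxml's HTML parser
libxml_clear_errors();

// Query every href attribute of every <a> element with XPath.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attribute) {
    $hrefs[] = $attribute->nodeValue;
}

echo implode("\n", $hrefs), "\n";
```

Calling loadHTML() rather than loadXML() is what makes the parser tolerant of the missing closing tags.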
XMLReader (http://php.net/manual/en/book.xmlreader.php)
The XMLReader extension is an XML pull parser. The reader acts as a cursor
going forward on the document stream and stopping at each node on the way.
XMLReader, like
DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so
chances are using XMLReader for parsing broken HTML might be less robust than using DOM
where you can explicitly tell it to use libxml's HTML Parser
Module.
A basic usage example can be found at "getting all values from h1 tags using php"
(https://stackoverflow.com/questions/3299033/getting-all-values-from-h1-tags-using-php/3299140#3299140).
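The cursor behaviour of the pull parser can be sketched as follows. The catalog document and its element names are invented for the example, and the input is assumed to be well-formed XML.

```php
<?php
// Hypothetical well-formed input.
$xml = '<catalog>'
     . '<book id="1"><title>First</title></book>'
     . '<book id="2"><title>Second</title></book>'
     . '</catalog>';

$reader = new XMLReader();
$reader->XML($xml);

$titles = [];
// read() moves the cursor forward one node at a time.
while ($reader->read()) {
    // Only react when the cursor sits on an opening <title> element.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        $titles[] = $reader->readString();
    }
}
$reader->close();

echo implode("\n", $titles), "\n";
```

Because only the current node is held in memory, the same loop should also cope with documents that are too large to load into a DOM tree.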
XML Parser (http://php.net/manual/en/book.xml.php)
This extension lets you create XML parsers and then define handlers for
different XML events. Each XML parser also has a few parameters you can
adjust.
The XML Parser library is also based on libxml, and implements a SAX style
(http://en.wikipedia.org/wiki/Simple_API_for_XML) XML push parser. It may be a
better choice for memory management than DOM or SimpleXML, but will be more
difficult to work with than the pull parser implemented by XMLReader.
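To show what "push" means in practice, here is a minimal sketch with the XML Parser functions. The sample document and the $inTitle state flag are assumptions made for the example. Note how the handlers have to track state themselves, which is the extra work compared with XMLReader.

```php
<?php
// Hypothetical well-formed input.
$xml = '<catalog>'
     . '<book><title>First</title></book>'
     . '<book><title>Second</title></book>'
     . '</catalog>';

$titles  = [];
$inTitle = false;   // state the handlers share

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, $attributes) use (&$inTitle, &$titles) {
        if ($name === 'title') {
            $inTitle  = true;
            $titles[] = '';
        }
    },
    // Called for every closing tag.
    function ($parser, $name) use (&$inTitle) {
        if ($name === 'title') {
            $inTitle = false;
        }
    }
);

// Called for text content; may fire more than once per text node.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$inTitle, &$titles) {
        if ($inTitle) {
            $titles[count($titles) - 1] .= $data;
        }
    }
);

// The parser pushes events into the handlers while it runs.
xml_parse($parser, $xml, true);

echo implode("\n", $titles), "\n";
```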
SimpleXML (http://php.net/manual/en/book.simplexml.php)
The SimpleXML extension provides a very simple and easily usable toolset to
convert XML to an object that can be processed with normal property selectors and array
iterators.
SimpleXML is an
option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't
even consider SimpleXml because it will choke.
A basic usage example can be found at "A simple program to CRUD node and node values of xml file"
(https://stackoverflow.com/questions/4906073/a-simple-program-to-crud-node-and-node-values-of-xml-file)
and there are lots of additional examples in the PHP Manual
(http://php.net/manual/en/simplexml.examples-basic.php).
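For a quick impression of the property and array access style, here is a minimal sketch. It assumes the input is well-formed XHTML; the markup itself is made up for the example.

```php
<?php
// Hypothetical well-formed XHTML fragment; the root element is <html>.
$xhtml = '<html><body>'
       . '<h1 class="title">Hello</h1>'
       . '<ul><li>one</li><li>two</li></ul>'
       . '</body></html>';

$doc = simplexml_load_string($xhtml);

// Child elements are reached as properties, attributes as array keys.
$heading = (string) $doc->body->h1;
$class   = (string) $doc->body->h1['class'];

// Repeated child elements can be iterated directly.
$items = [];
foreach ($doc->body->ul->li as $li) {
    $items[] = (string) $li;
}

echo $heading, ' [', $class, ']: ', implode(', ', $items), "\n";
```

The (string) casts matter: without them the values stay SimpleXMLElement objects.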
3rd Party Libraries (libxml based)
If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses
DOM (http://php.net/manual/en/book.dom.php) / libxml (http://xmlsoft.org/)
underneath instead of string parsing.
FluentDom (https://thomas.weinert.info/FluentDOM/)
FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in
PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current
versions extend the DOM implementing standard interfaces and add features from the DOM
Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and
others. Can be installed via
Composer.
HtmlPageDom (https://github.com/wasinger/htmlpagedom/blob/master/README.md)
Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML
documents using DOM. It requires DomCrawler from Symfony2 components for
traversing the DOM tree and extends it by adding methods for manipulating the
DOM tree of HTML documents.
phpQuery (http://code.google.com/p/phpquery/) (not updated for years)
phpQuery is a server-side, chainable, CSS3 selector driven Document Object
Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides
additional Command Line Interface (CLI).
Also see: https://github.com/electrolinux/phpquery
Zend_Dom (http://framework.zend.com/manual/current/en/modules/zend.dom.intro.html)
Zend_Dom provides tools for working with DOM documents and structures.
Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM
documents utilizing both XPath and CSS selectors.
QueryPath (http://querypath.org/)
QueryPath is a PHP library for manipulating XML and HTML. It is designed to
work not only with local files, but also with web services and database resources. It
implements much of the jQuery interface (including CSS-style selectors), but it is
heavily tuned for server-side use. Can be installed via
Composer.
fDOMDocument (http://github.com/theseer/fDOMDocument)
fDOMDocument extends the standard DOM to use exceptions at all occasions of
errors instead of PHP warnings or notices. They also add various custom methods and
shortcuts for convenience and to simplify the usage of
DOM.
sabre/xml (http://sabre.io/xml/)
sabre/xml is a library that wraps and extends the XMLReader and XMLWriter
classes to create a simple "xml to object/array" mapping system and design pattern.
Writing and reading XML is single-pass and can therefore be fast and require low memory
on large xml
files.
FluidXML (https://github.com/servo-php/fluidxml)
FluidXML is a PHP library for manipulating XML with a concise and fluent
API.
It leverages XPath and the fluent programming pattern to be fun and
effective.
3rd-Party (not libxml-based)
The benefit of building upon DOM/libxml is that you get good performance out of
the box because you are based on a native extension. However, not all 3rd-party
libs go down this route. Some of them are listed below.
PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/manual.htm#section_traverse)
- An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
- Require PHP 5+.
- Supports invalid HTML.
- Find tags on an HTML page with selectors just like jQuery.
- Extract contents from HTML in a single line.
I generally do not recommend this parser. The codebase is horrible and the parser
itself is rather slow and memory hungry. Not all jQuery Selectors (such as child
selectors, https://api.jquery.com/child-selector/) are possible. Any of the libxml
based libraries should outperform this easily.
PHP Html Parser (https://github.com/paquettg/php-html-parser)
PHPHtmlParser is a simple, flexible, html parser which allows you to select
tags using any css selector, like jQuery. The goal is to assiste in the development of
tools which require a quick, easy way to scrap html, whether it's valid or not! This
project was original supported by sunra/php-simple-html-dom-parser but the support seems
to have stopped so this project is my adaptation of his previous
work.
Again, I
would not recommend this parser. It is rather slow with high CPU usage. There is also no
function to clear memory of created DOM objects. These problems scale particularly with
nested loops. The documentation itself is inaccurate and misspelled, with no responses
to fixes since 14 Apr 16.
Ganon (https://code.google.com/p/ganon/)
- A universal tokenizer and HTML/XML/RSS DOM Parser
- Ability to manipulate elements and their attributes
- Supports invalid HTML and UTF8
- Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)
- A HTML beautifier (like HTML Tidy)
- Minify CSS and Javascript
- Sort attributes, change character case, correct indentation, etc.
- Extensible
- Parsing documents using callbacks based on current character/token
- Operations separated in smaller functions for easy overriding
- Fast and Easy
Never used it. Can't tell if it's any good.
HTML 5
You can use the above for parsing HTML5, but there can be quirks due to the
markup HTML5 allows
(https://stackoverflow.com/questions/4029341/dom-parser-that-allows-html5-style-in-script-tag/4029412).
So for HTML5 you want to consider using a dedicated parser, like
html5lib (https://github.com/html5lib/html5lib-php)
Python and PHP implementations of an HTML parser based on the WHATWG HTML5
specification for maximum compatibility with major desktop web browsers.
We might see more dedicated parsers once HTML5 is finalized. There is also a blog
post by the W3C titled "How-To for html 5 parsing" that is worth checking out.
WebServices
If
you don't feel like programming PHP, you can also use Web services. In general, I found
very little utility for these, but that's just me and my use
cases.
ScraperWiki (http://scraperwiki.com/api/1.0)
ScraperWiki's external interface allows you to extract data in the form you
want for use on the web or in your own applications. You can also extract information
about the state of any
scraper.
Regular Expressions
Last and least recommended, you can extract data from HTML with regular
expressions (https://stackoverflow.com/search?q=regular%20expression%20tutorials).
In general using Regular Expressions on HTML is discouraged.
Most of the snippets you will find on the web to match markup are brittle. In
most cases they only work for a very particular piece of HTML. Tiny markup
changes, like adding whitespace somewhere, or adding or changing attributes in a
tag, can make the RegEx fail when it's not properly written. You should know what
you are doing before using RegEx on HTML.
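The brittleness is easy to demonstrate. The sketch below runs one naive pattern and one DOM lookup against four made-up variants of the same link. Both the pattern and the hrefViaDom() helper are hypothetical and written only for this comparison.

```php
<?php
// A naive pattern of the kind often found on the web (hypothetical).
$pattern = '/<a href="([^"]*)">/';

// Four renderings of what is, to a browser, the same link.
$variants = [
    '<a href="/page">link</a>',
    '<a  href="/page">link</a>',           // extra whitespace
    '<a class="x" href="/page">link</a>',  // added attribute
    "<a href='/page'>link</a>",            // single quotes
];

// Hypothetical helper: read the first link's href through the DOM extension.
function hrefViaDom($html)
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $link = $dom->getElementsByTagName('a')->item(0);
    return $link ? $link->getAttribute('href') : null;
}

libxml_use_internal_errors(true);

$regexHits = 0;
$domHits   = 0;
foreach ($variants as $html) {
    if (preg_match($pattern, $html, $match) && $match[1] === '/page') {
        $regexHits++;
    }
    if (hrefViaDom($html) === '/page') {
        $domHits++;
    }
}

echo "regex: $regexHits of 4, DOM: $domHits of 4\n";
```

The pattern could of course be extended to cover each variant, but every such fix is one more rule the parser already knows.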
HTML parsers already know the
syntactical rules of HTML. Regular expressions have to be taught for each new RegEx you
write. RegEx are fine in some cases, but it really depends on your
use-case.
You can write more reliable parsers
(https://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491),
but writing a complete and reliable custom parser with regular expressions is a
waste of time when the aforementioned libraries already exist and do a much
better job on this.
Also see "Parsing Html The Cthulhu Way"
(http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html).
Books
If
you want to spend some money, have a look
at
I am not affiliated
with PHP Architect or the authors.