Wednesday 5 December 2018

php - DOM parser that allows HTML5-style </ in <script> tag

Update: html5lib (bottom of question) seems to get close; I just need to improve my understanding of how it's used.



I am attempting to find an HTML5-compatible DOM parser for PHP 5.3. In particular, I need to access the following HTML-like CDATA within a script tag:







Most parsers end parsing prematurely because HTML 4.01 ends script tag parsing when it finds ETAGO (</) inside a <script>. All of the parsers I have tried so far have either failed or are so poorly documented that I haven't figured out whether they work.
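
To make the difference concrete, here is a minimal sketch of the two end-of-script rules. It is an illustration only, not a parser: the function names and the sample string are made up for this example, and the HTML5 rule is simplified (it ignores the comment-like escape states of the real tokenizer).

```php
<?php
// Illustration only: simplified end-of-script rules, not a real parser.
// Function names and the sample markup are hypothetical.

// HTML 4.01 rule: script content ends at the first "</" followed by a letter.
function scriptEndHtml4($s)
{
    $len = strlen($s);
    for ($i = 0; $i + 2 < $len; $i++) {
        if ($s[$i] === '<' && $s[$i + 1] === '/' && ctype_alpha($s[$i + 2])) {
            return $i;
        }
    }
    return false;
}

// Simplified HTML5 rule: content ends only at "</script" (any case)
// followed by whitespace, "/" or ">".
function scriptEndHtml5($s)
{
    $offset = 0;
    while (($pos = stripos($s, '</script', $offset)) !== false) {
        $i = $pos + 8; // 8 = strlen('</script')
        if (isset($s[$i]) && strpos(" \t\n\f\r/>", $s[$i]) !== false) {
            return $pos;
        }
        $offset = $pos + 1;
    }
    return false;
}

// Text that follows a script start tag (assumed sample input).
$body = '<td>bar</td></script>';

echo substr($body, 0, scriptEndHtml4($body)), "\n"; // <td>bar
echo substr($body, 0, scriptEndHtml5($body)), "\n"; // <td>bar</td>
```

Under the HTML 4.01 rule the content is cut at the first end tag it meets, which matches the kind of truncated output described in this question.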



My requirements:




  1. Real parser, not regex hacks.

  2. Ability to load full pages or HTML fragments.

  3. Ability to pull script contents back out, selecting by the tag's id attribute.
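
As a rough sketch of requirement 3, assuming whichever parser is chosen hands back a standard DOMDocument: the helper name scriptContentsById and the sample content are hypothetical, and the document is built by hand here so the example does not depend on any particular parser's handling of </.

```php
<?php
// Hypothetical helper: return the raw text of the <script> element with
// the given id, or null if there is none.
function scriptContentsById(DOMDocument $doc, $id)
{
    foreach ($doc->getElementsByTagName('script') as $node) {
        if ($node->getAttribute('id') === $id) {
            return $node->textContent;
        }
    }
    return null;
}

// Build a tiny document manually (assumed sample content), standing in
// for the tree a working HTML5 parser would produce.
$doc = new DOMDocument();
$script = $doc->createElement('script');
$script->setAttribute('id', 'foo');
$script->appendChild($doc->createTextNode('<td>bar</td>'));
$doc->appendChild($script);

echo scriptContentsById($doc, 'foo'), "\n"; // <td>bar</td>
```

The helper only reads the tree, so it works the same way on a full page or on a fragment once the parsing step has succeeded.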




Input:






Example of failing output (no closing </td>):







Some parsers and their results:







Source (DOMDocument):

<?php
header('Content-type: text/plain');
$d = new DOMDocument;
$d->loadHTML('');
// Serialize the parsed tree back to HTML to see what the parser kept.
echo $d->saveHTML();


Output:



Warning: DOMDocument::loadHTML(): Unexpected end tag : td in Entity, line: 1 in /home/adam/public_html/2010/10/26/dom.php on line 5









Source (FluentDOM):

<?php
header('Content-type: text/plain');
require_once 'FluentDOM/src/FluentDOM.php';
$html = "";
// Parse as HTML and print the resulting document.
echo FluentDOM($html, 'text/html');


Output:












Source (phpQuery):

<?php
header('Content-type: text/plain');

require_once 'phpQuery.php';

phpQuery::newDocumentHTML(<<<EOF
EOF
);

// Print the element with id "foo".
echo (string)pq('#foo');



Output:











html5lib: possibly promising. Can I get at the contents of the script#foo tag?



Source:

<?php
header('Content-type: text/plain');

include 'HTML5/Parser.php';

$html = "";
$d = HTML5_Parser::parse($html);

// Serialize the parsed tree back to HTML.
echo $d->saveHTML();



Output:



