Sunday, 5 January 2020

php - Grabbing the href attribute of an A element




Trying to find the links on a page.



my regex is:



/<a\s[^>]*href=(\"\'??)([^\"\' >]*?)[^>]*>(.*)<\/a>/


but it seems to fail at

<a title="this" href="that">what?</a>



How would I change my regex to deal with href not placed first in the a tag?


Answer



Reliable regexes for HTML are difficult to write. Here is how to do it with DOM instead:



$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
    echo $dom->saveHTML($node), PHP_EOL;
}


The above would find and output the "outerHTML" of all A elements in the $html string.



To get the text content of the node (including the text of any child elements), you do



echo $node->nodeValue; 



To check if the href attribute exists you can do



echo $node->hasAttribute( 'href' );


To get the href attribute you'd do



echo $node->getAttribute( 'href' );
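Putting the calls above together, here is a minimal sketch of collecting every link on a page regardless of where href sits in the tag. The sample markup and the $links array are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical sample markup: the second link has href after another
// attribute, and the third anchor has no href at all.
$html = '<p><a href="/first">one</a> '
      . '<a title="this" href="that">what?</a> '
      . '<a name="top">no link</a></p>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$links = [];
foreach ($dom->getElementsByTagName('a') as $node) {
    // Skip anchors that carry no href attribute.
    if (!$node->hasAttribute('href')) {
        continue;
    }
    $links[] = [
        'href' => $node->getAttribute('href'),
        'text' => $node->nodeValue,
    ];
}

print_r($links);
```

Because the parser exposes attributes by name, the order in which they appear in the tag does not matter here.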



To change the href attribute you'd do



$node->setAttribute('href', 'something else');


To remove the href attribute you'd do



$node->removeAttribute('href'); 
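As a rough illustration of changing and removing attributes in one pass, the sketch below rewrites http:// links to https:// and strips href from placeholder "#" links. The sample markup and the rewrite rule are assumptions chosen for the example only.

```php
<?php
// Hypothetical sample markup with one real link and one placeholder link.
$html = '<p><a href="http://example.com/a">keep</a> <a href="#">drop</a></p>';

$dom = new DOMDocument;
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('a') as $node) {
    $href = $node->getAttribute('href');
    if ($href === '#') {
        // Placeholder link: remove the attribute entirely.
        $node->removeAttribute('href');
    } else {
        // Real link: switch the scheme to https.
        $node->setAttribute('href', str_replace('http://', 'https://', $href));
    }
}

// Serialize each A element again to see the effect of the changes.
$result = '';
foreach ($dom->getElementsByTagName('a') as $node) {
    $result .= $dom->saveHTML($node) . PHP_EOL;
}
echo $result;
```

Only attributes are modified inside the loop, so the node list being iterated keeps the same elements throughout.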



You can also query for the href attribute directly with XPath



$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach ($nodes as $href) {
    echo $href->nodeValue;                       // echo current attribute value
    $href->nodeValue = 'new value';              // set new attribute value
    $href->parentNode->removeAttribute('href');  // remove attribute
}
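If you only need to read the values, a smaller sketch of the same XPath query could look like this. The sample markup and the $hrefs array are hypothetical.

```php
<?php
// Hypothetical sample markup: href is not the first attribute of the first
// link, and the last anchor has no href.
$html = '<ul><li><a class="x" href="/a">A</a></li>'
      . '<li><a href="/b">B</a></li>'
      . '<li><a>none</a></li></ul>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$hrefs = [];
// '//a/@href' selects the href attribute nodes themselves, so anchors
// without an href never show up in the result.
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```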



On a side note: I am sure this is a duplicate, and you can find the answer somewhere in here.

