I need to improve some open source code. It contains a function that extracts all <a> and <img> tags with a specific class from a string that represents HTML. This function uses regular expressions:
    preg_match_all('#<img(.*)class="(.*)foo(.*)"(.*)>#Umsi', $text, $matches_img, PREG_PATTERN_ORDER);
    preg_match_all('#<a(.*)class="(.*)foo(.*)">(.*)</a>#Umsi', $text, $matches_a, PREG_PATTERN_ORDER);
    // Build the union set from $matches_img and $matches_a
This works mostly, but not always. Specifically, the regular expressions can match multiple tags in a single match:
    $text = '<a href="target1">link text 1</a><a class="foo" href="target2">link text 2</a>';
    // matches whole string in a single match
    preg_match_all('#<a(.*)class="(.*)foo(.*)">(.*)</a>#Umsi', $text, $matches_a, PREG_PATTERN_ORDER);
My first approach
I tried to make the regular expression more specific:
    // old
    <a(.*)class="(.*)foo(.*)">(.*)</a>
    // new
    <a([^<>]*)class="(.*)foo(.*)">([^<>]*)</a>
But this, too, can match substrings that contain multiple tags:
    $text = '<img class="bar"><img class="foo" src="baz">';
    // matches whole string in a single match
    preg_match_all('#<img([^<>]*)class="(.*)foo(.*)"([^<>]*)>#Umsi', $text, $matches_img, PREG_PATTERN_ORDER);
Is there a robust way to improve on this? I could replace the second and third groups, (.*), with [^<>]* as well, but then I'd run into trouble as soon as an image's alt or title attribute contains an angle bracket.
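To make the angle-bracket concern concrete, here is a minimal sketch of one possible direction, not a tested solution: first match one opening tag at a time, treating quoted attribute values as opaque so that a < or > inside them does not end the tag, and only then inspect the class attribute of that single tag. The function name findTagsWithClass is hypothetical, and the sketch assumes the class value is double-quoted.

```php
<?php
// Hypothetical helper for illustration; not part of the original code base.
function findTagsWithClass(string $html, string $tag, string $class): array
{
    // One opening tag: the tag name, then any mix of plain characters and
    // complete quoted strings (which may contain < or >), then the closing ">".
    $tagPattern = '#<' . preg_quote($tag, '#') . '\b(?:[^<>"\']|"[^"]*"|\'[^\']*\')*>#i';
    preg_match_all($tagPattern, $html, $m);

    $found = [];
    foreach ($m[0] as $candidate) {
        // Inspect the class attribute of this one tag only.
        // Assumption: the class value is written with double quotes.
        if (preg_match('#\sclass\s*=\s*"([^"]*)"#i', $candidate, $c)
            && str_contains($c[1], $class)) {
            $found[] = $candidate;
        }
    }
    return $found;
}

$text = '<img class="bar"><img class="foo" alt="a > b" src="baz">';
print_r(findTagsWithClass($text, 'img', 'foo'));
```

Because each candidate is cut out of the input unchanged, the returned strings are exact substrings of $text. The sketch only returns opening tags, so for <a> the link text and the closing </a> would still have to be captured separately. It can also be fooled by a literal class="..." inside another attribute's value, and like the original patterns it treats "foo" as a substring of the class value rather than as a whole class name.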
My second approach: DOMDocument
I tried to parse the HTML with this function:
    function getElementsByClassName($html, $className, $tagName=null) {
        $dom = new DOMDocument('1.0');
        $dom->loadHTML($html);
        if ($tagName) {
            $elements = $dom->getElementsByTagName($tagName);
        } else {
            $elements = $dom->getElementsByTagName("*");
        }
        $matched = [];
        for ($i = 0; $i < $elements->length; $i++) {
            if ($elements->item($i)->attributes->getNamedItem('class')) {
                $classes = $elements->item($i)->attributes->getNamedItem('class')->nodeValue;
                if (str_contains($classes, $className)) {
                    $matched[] = $dom->saveHTML($elements->item($i));
                }
            }
        }
        return $matched;
    }
The problem here is that the matches I get do not correspond exactly to the input. There seems to be some encoding problem, but more importantly, DOMDocument was written to parse HTML 4. The function does give me all the tags I need to extract, but special characters come out altered, and the syntax differences between HTML 4 and HTML 5 cause further mismatches. I need to get the tags exactly as they are contained in the input string.
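For reference, the encoding part on its own might be addressed roughly as follows. This is a sketch under two assumptions that the description above does not confirm: that the input is UTF-8, and that the garbling comes from loadHTML() falling back to ISO-8859-1 when the markup declares no charset. The helper name loadHtmlUtf8 is hypothetical. It prepends a charset declaration and collects libxml's warnings (for example about HTML 5 tags it does not know) instead of emitting them.

```php
<?php
// Hypothetical helper; the name and the UTF-8 assumption are illustrative only.
function loadHtmlUtf8(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    // Declare the charset up front so the parser reads the bytes as UTF-8.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

$dom = loadHtmlUtf8('<section><img class="foo" alt="Grüße"></section>');
$img = $dom->getElementsByTagName('img')->item(0);
echo $img->getAttribute('alt'), "\n";
```

Even if this removes the garbled characters, it does not solve the main problem: saveHTML() re-serializes the parsed tree, so its output is a normalized rendering of each tag rather than the exact substring from the input.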
Is there a robust solution to achieve this?