Extract all <a> and <img> tags (exactly as represented in the input) from HTML 5 that contain a specific class

I need to improve some open source code. It contains a function to extract all <a> and <img> tags with a specific class from a string that represents HTML. This function uses regular expressions:

    preg_match_all('#<img(.*)class="(.*)foo(.*)"(.*)>#Umsi', $text, $matches_img, PREG_PATTERN_ORDER);
    preg_match_all('#<a(.*)class="(.*)foo(.*)">(.*)</a>#Umsi', $text, $matches_a, PREG_PATTERN_ORDER);
    // Build the union set from $matches_img and $matches_a

This works most of the time, but not always. Specifically, a single match can span multiple tags, because the ungreedy (.*) after <a keeps expanding across tag boundaries until it finds class=":

    $text = '<a href="target1">link text 1</a><a class="foo" src="target2">link text 2</a>';
    // matches whole string in a single match
    preg_match_all('#<a(.*)class="(.*)foo(.*)">(.*)</a>#Umsi', $text, $matches_a, PREG_PATTERN_ORDER);

My first approach

I tried to make the regular expression more specific:

    // old
    <a(.*)class="(.*)foo(.*)">(.*)</a>
    // new
    <a([^<>]*)class="(.*)foo(.*)">([^<>]*)</a>

But this, too, can match substrings that contain multiple tags:

    $text = '<img class="bar"><img class="foo" src="baz">';
    // matches whole string in a single match
    preg_match_all('#<img([^<>]*)class="(.*)foo(.*)"(.*)>#Umsi', $text, $matches_img, PREG_PATTERN_ORDER);

Is there a robust way to improve on this? I could replace the second and third groups ((.*)) with [^<>]* as well, but then I'd run into trouble as soon as an image's alt or title attribute contains an angle bracket.
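One way to sidestep both problems is to first match each tag on its own, treating quoted attribute values as opaque (so a > inside alt or title is harmless), and only then check the class attribute of that single tag. The following is a minimal sketch under stated assumptions, not a full HTML parser: extractTagsWithClass() is a hypothetical helper name, <a> elements are assumed not to be nested, and tags that appear inside comments or <script> blocks would still be matched.

```php
<?php
// Sketch only: extractTagsWithClass() is a hypothetical helper name.
function extractTagsWithClass(string $html, string $className): array
{
    // Attribute area of one tag: plain characters, or a complete quoted
    // value (which may itself contain < or >).
    $attrs = '(?:[^>"\']|"[^"]*"|\'[^\']*\')*';

    // Group 1: an <img> tag. Group 2: the opening tag of an <a> element,
    // followed lazily by everything up to the first </a>.
    $pattern = '#(<img(?=[\s/>])' . $attrs . '>)'
             . '|(<a(?=[\s>])' . $attrs . '>).*?</a\s*>#is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    // One attribute: name, then optionally a double-quoted (group 2),
    // single-quoted (group 3) or unquoted (group 4) value.
    $attrPattern = '#([^\s"\'<>/=]+)'
                 . '(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+)))?#';

    $found = [];
    foreach ($matches as $m) {
        $openTag = $m[1] !== '' ? $m[1] : $m[2];
        preg_match_all($attrPattern, $openTag, $attributes, PREG_SET_ORDER);
        foreach ($attributes as $a) {
            if (strtolower($a[1]) !== 'class') {
                continue; // tag name or some other attribute
            }
            // Only one of the value groups is non-empty.
            $value  = implode('', array_slice($a, 2));
            $tokens = preg_split('/\s+/', trim($value));
            if (in_array($className, $tokens, true)) {
                $found[] = $m[0]; // exact substring of the input
            }
            break; // only the first class attribute counts
        }
    }
    return $found;
}

$text = '<img class="bar"><img class="foo" src="baz">'
      . '<a href="target1">link text 1</a>'
      . '<img alt="a > b" class="foo">';
print_r(extractTagsWithClass($text, 'foo'));
```

Because the class check runs on one already-isolated tag, the class name is compared as a whole token, so a class such as foobar does not count as foo.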

My second approach: DOMDocument

I tried to parse the HTML with this function:

    function getElementsByClassName($html, $className, $tagName = null) {
        $dom = new DOMDocument('1.0');
        $dom->loadHTML($html);
        // Restrict the search to one tag name, or look at every element
        if ($tagName) {
            $elements = $dom->getElementsByTagName($tagName);
        } else {
            $elements = $dom->getElementsByTagName("*");
        }
        $matched = [];
        for ($i = 0; $i < $elements->length; $i++) {
            if ($elements->item($i)->attributes->getNamedItem('class')) {
                $classes = $elements->item($i)->attributes->getNamedItem('class')->nodeValue;
                if (str_contains($classes, $className)) {
                    // Re-serialize the matching node
                    $matched[] = $dom->saveHTML($elements->item($i));
                }
            }
        }
        return $matched;
    }

The problem here is that the matches I get do not correspond exactly to the input. There seems to be some encoding problem, but more importantly, DOMDocument was written to parse HTML 4. The function does give me all the tags I need to extract, but there are problems with special characters and with syntax differences between HTML 4 and HTML 5. I need to get the tags exactly as they appear in the input string.
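For reference, here is a sketch of how the DOMDocument variant could be tightened. It does not solve the main problem (saveHTML() still re-serializes each node, so the output is not guaranteed to equal the input), but it addresses two side issues: the encoding, and the fact that str_contains() also accepts class names such as foobar. findByClassToken() is a hypothetical name, the XML declaration prefix is a commonly used workaround rather than a documented option, and the class name is assumed to contain no quotes or whitespace.

```php
<?php
// Sketch only: findByClassToken() is a hypothetical helper name.
function findByClassToken(string $html, string $className): array
{
    // Collect parser warnings (e.g. about HTML 5 elements) instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Common workaround: the prefix hints that the input is UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match the class as a whole whitespace-separated token.
    // Assumes $className contains no quotes or whitespace.
    $xpath = new DOMXPath($dom);
    $query = '(//a | //img)[contains('
           . 'concat(" ", normalize-space(@class), " "), " ' . $className . ' ")]';

    $matched = [];
    foreach ($xpath->query($query) as $node) {
        $matched[] = $dom->saveHTML($node); // re-serialized, not the original text
    }
    return $matched;
}

$html = '<section><img class="foobar"><img class="bar foo" alt="x">'
      . '<a class="foo" href="#">link</a></section>';
print_r(findByClassToken($html, 'foo'));
```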

Is there a robust solution to achieve this?
