Given an HTML file, you want to extract all the href URLs.
You could use a Regex
I've done this before, but I haven't found it entirely reliable. I also use regexes so infrequently that relearning the syntax every time is painful.
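For comparison, here is roughly what the regex route looks like. This is only a sketch: the pattern, the class name HrefRegexSketch and the method name ExtractHrefs are made up for illustration, and the pattern deliberately handles only quoted href values inside an <a> tag.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegexSketch
{
    // Illustrative pattern (an assumption, not a complete HTML parser):
    // group 1 captures the opening quote, "url" captures the value,
    // and \1 requires the same quote character to close the value.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*([""'])(?<url>.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the href values of every <a> tag the pattern can match.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            urls.Add(match.Groups["url"].Value);
        }
        return urls;
    }
}
```

Calling HrefRegexSketch.ExtractHrefs(html) picks up double-quoted and single-quoted values, but a tag such as <a href=/c> with an unquoted value is silently skipped. Gaps like that are part of why I don't find this approach reliable.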
What I recommend
Use the HTML Agility Pack. If you are familiar with the XML DOM, using the HTML Agility Pack will come naturally.
HTML Agility Pack URL
Two things that make HTML Agility Pack interesting
- It doesn't depend on Internet Explorer
- It works on malformed HTML. See this post for a little context: .NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML
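// This isn't a full sample, but it is enough to show the value of the HTML Agility Pack.
HtmlDocument input_doc = new HtmlDocument();
input_doc.Load("input.html"); // assumed file name; the original sample omits the load step
HtmlNodeCollection links = input_doc.DocumentNode.SelectNodes("//a"); // null when nothing matches
if (links != null)
    foreach (HtmlNode node in links) {
        string href_url = node.GetAttributeValue("href", ""); // "" when the tag has no href
    }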