Posted on Friday July 30, 2010

Extracting all links from an HTML page

The code below is a small class that extracts all links from an HTML page using a regular expression. The method returns a list of URLs as they appear in the markup, so the list can include values such as “#” and “javascript:;” that are not real addresses.
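The original class is not reproduced here, so the following is only a minimal sketch of the approach described, not the author's code. The class name LinkExtractor, the method names and the regular expression itself are all assumptions made for illustration. It only handles quoted href attributes on anchor tags.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical class; names and pattern are illustrative assumptions.
public static class LinkExtractor
{
    // Matches href="..." or href='...' inside an <a> tag, ignoring case.
    // Both alternatives capture into the same named group, "url".
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every href value found in the markup, in document order.
    // Values are returned unfiltered, so "#" and "javascript:;" are included.
    public static List<string> ExtractFromHtml(string html)
    {
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            links.Add(match.Groups["url"].Value);
        }
        return links;
    }

    // Downloads the page at the given address and extracts its links.
    public static List<string> ExtractFromUrl(string url)
    {
        using (var client = new WebClient())
        {
            return ExtractFromHtml(client.DownloadString(url));
        }
    }
}
```

Calling LinkExtractor.ExtractFromHtml on a string containing anchors with href="#" and href='javascript:;' returns both values unchanged. If only real addresses are wanted, the caller would need to filter the list afterwards.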

If you get proxy authentication problems behind a corporate firewall or proxy, add an app.config with the following lines:
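The original configuration lines are not reproduced here. As an assumption, a fragment along these lines is commonly used with the .NET Framework to make requests pick up the system proxy settings and send the current user's credentials to the proxy:

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <system.net>
    <!-- Send the logged-in user's credentials to the proxy -->
    <defaultProxy enabled="true" useDefaultCredentials="true">
      <!-- Use the system (Internet Explorer) proxy settings -->
      <proxy usesystemdefault="True" />
    </defaultProxy>
  </system.net>
</configuration>
```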

This will take the proxy details from Internet Explorer's settings. I still get issues even with the above configuration, so it is fairly hit and miss depending on the hardware or software your proxy is using.