Jean Paul's Blog

There are 2 types of People in the World, One who Likes SharePoint and..


Website Recursive Url Parser

Posted by Paul on March 26, 2011

In this article I am trying to share a piece of code that might be useful to some developers.

We can find a lot of C# code that parses the http urls in a given string. But it is difficult to find code that will:

  • Accept a url as an argument and parse the site content
  • Fetch all urls in the site content, then parse the site content of each url
  • Repeat the above process until all urls are fetched.


Taking a website (a stock market site) as an example, I would like to get all the urls inside the website recursively.


The main class is SpiderLogic which contains all necessary methods and properties.
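The full class ships with the article's source code; as a rough sketch of its shape (member names other than GetUrls and OnException are my assumptions based on how the article describes them), it looks something like this:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the class's shape only; the full implementation is in the
// article's source code. The private field and the one-argument overload
// are assumptions inferred from the description.
public class SpiderLogic
{
    // Accumulates the urls found during a recursive parse
    private readonly IList<string> _urls = new List<string>();

    // Callback for exceptions raised while fetching/parsing
    public Action<Exception> OnException { get; set; }

    // Overload 1: parse a url, optionally recursing into each url found
    public IList<string> GetUrls(string url, bool recursive)
    {
        return GetUrls(url, url, recursive);
    }

    // Overload 2: also takes the base url, used to build absolute urls
    // from relative paths; the body is listed later in the article
    public IList<string> GetUrls(string url, string baseUrl, bool recursive)
    {
        return new List<string>(); // placeholder for the real body
    }
}
```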


The GetUrls() method parses the website and returns the urls found. There are two overloads for this method.

The first one takes 2 arguments: the url, and a Boolean indicating whether recursive parsing is needed.

Eg: GetUrls("<site url>", true);

The second one takes 3 arguments: the url, the base url, and the recursive Boolean.

This overload is intended for cases where the url is a sub-level of the base url and the web page contains relative paths. In order to construct valid absolute urls, the base url argument is necessary.

Eg: GetUrls("<sub level url>", "<base url>", true);
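The reason the base url matters can be seen with System.Uri itself, which is presumably what drives the absolute-url construction (my illustration, not the article's code; the urls are placeholders):

```csharp
using System;

class Program
{
    static void Main()
    {
        // A page deeper in the site...
        var baseUri = new Uri("http://www.example.com/news/index.html");

        // ...containing a relative link; without a base url there is no
        // way to know which host and path it belongs to.
        var absolute = new Uri(baseUri, "../market/quotes.html");

        Console.WriteLine(absolute.AbsoluteUri);
        // http://www.example.com/market/quotes.html
    }
}
```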

Method Body of GetUrls()

public IList<string> GetUrls(string url, string baseUrl, bool recursive)
{
    if (recursive)
    {
        RecursivelyGenerateUrls(url, baseUrl);
        return _urls;
    }

    return InternalGetUrls(url, baseUrl);
}
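RecursivelyGenerateUrls() itself is not listed in the article. Based on how GetUrls() uses it, a sketch might look like the following self-contained demo, where a dictionary stands in for InternalGetUrls() and the visited check is my assumption (without it, two pages linking to each other would recurse forever):

```csharp
using System;
using System.Collections.Generic;

class CrawlSketch
{
    // Fake "site": maps a url to the urls found on that page,
    // standing in for InternalGetUrls() from the article.
    static readonly Dictionary<string, string[]> Pages = new Dictionary<string, string[]>
    {
        ["http://site/a"] = new[] { "http://site/b", "http://site/c" },
        ["http://site/b"] = new[] { "http://site/a" },   // cycle back to a
        ["http://site/c"] = new string[0],
    };

    public static readonly IList<string> _urls = new List<string>();

    static IList<string> InternalGetUrls(string url, string baseUrl)
    {
        return Pages.ContainsKey(url) ? Pages[url] : new string[0];
    }

    public static void RecursivelyGenerateUrls(string url, string baseUrl)
    {
        foreach (string childUrl in InternalGetUrls(url, baseUrl))
        {
            // Skip urls already visited, so cycles terminate
            if (_urls.Contains(childUrl))
                continue;

            _urls.Add(childUrl);
            RecursivelyGenerateUrls(childUrl, baseUrl);
        }
    }

    static void Main()
    {
        RecursivelyGenerateUrls("http://site/a", "http://site");
        Console.WriteLine(string.Join(", ", _urls));
        // each linked page is collected exactly once, despite the a<->b cycle
    }
}
```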



Another method of interest is InternalGetUrls(), which fetches the content of the url, parses the urls inside it, and constructs the absolute urls.

private IList<string> InternalGetUrls(string baseUrl, string absoluteBaseUrl)
{
    IList<string> list = new List<string>();

    Uri uri = null;
    if (!Uri.TryCreate(baseUrl, UriKind.RelativeOrAbsolute, out uri))
        return list;

    // Get the http content
    string siteContent = GetHttpResponse(baseUrl);

    var allUrls = GetAllUrls(siteContent);

    foreach (string uriString in allUrls)
    {
        uri = null;
        if (Uri.TryCreate(uriString, UriKind.RelativeOrAbsolute, out uri))
        {
            if (uri.IsAbsoluteUri)
            {
                // If urls from a different domain / javascript: urls are
                // needed, exclude this check
                if (uri.OriginalString.StartsWith(absoluteBaseUrl))
                    list.Add(uriString);
            }
            else
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);

                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
        else
        {
            if (!uriString.StartsWith(absoluteBaseUrl))
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);

                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
    }

    return list;
}
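GetAbsoluteUri() is not listed in the article either. A sketch consistent with how InternalGetUrls() calls it, built on System.Uri (the anchor/javascript filtering is my assumption), might be:

```csharp
using System;

class UriHelperSketch
{
    // Sketch: the article's GetAbsoluteUri() is not shown. This version
    // ignores the parsed uri argument and rebuilds the absolute url from
    // the base url and the raw link string using System.Uri.
    public static string GetAbsoluteUri(Uri uri, string absoluteBaseUrl, string uriString)
    {
        // Skip in-page anchors and javascript: pseudo-urls
        if (uriString.StartsWith("#") || uriString.StartsWith("javascript:"))
            return string.Empty;

        Uri result;
        if (Uri.TryCreate(new Uri(absoluteBaseUrl), uriString, out result))
            return result.AbsoluteUri;

        return string.Empty;
    }

    static void Main()
    {
        Console.WriteLine(GetAbsoluteUri(null, "http://www.example.com", "/news/today.aspx"));
        // http://www.example.com/news/today.aspx
    }
}
```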


Handling Exceptions

There is an OnException delegate that can be used to get the exceptions occurring while parsing.
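The article does not show the delegate's exact signature; assuming it receives the exception while the spider keeps crawling, wiring it up would look something like this (SpiderLogicStub is a minimal stand-in of mine, just enough to show the pattern):

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-in for SpiderLogic, just enough to show OnException usage.
class SpiderLogicStub
{
    public Action<Exception> OnException { get; set; }

    public IList<string> GetUrls(string url, bool recursive)
    {
        var urls = new List<string>();
        try
        {
            // Real code would fetch and parse here; we simulate a bad fetch.
            throw new InvalidOperationException("404 while fetching " + url);
        }
        catch (Exception ex)
        {
            // Report the failure to the caller but keep going.
            OnException?.Invoke(ex);
        }
        return urls;
    }
}

class Program
{
    static void Main()
    {
        var spider = new SpiderLogicStub();

        // Log parse/fetch failures instead of letting them abort the crawl
        spider.OnException = ex => Console.WriteLine("Error: " + ex.Message);

        spider.GetUrls("http://www.example.com/missing", true);
    }
}
```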

Tester Application

A tester Windows application is included with the source code of the article.

You can try executing it.

The form accepts a base url as input; clicking the Go button parses the content of the url and extracts all the urls in it. If you need recursive parsing, check the Is Recursive check box.


Next Part

In the next part of the article, I would like to create a url verifier website that verifies all the urls in a website. I agree that after a quick search we can find free services like that. My aim is to learn and develop custom code that could be extensible and reusable across multiple projects by the community.

Source Code

The associated source code can be found on c-sharpcorner. Please search using the same title there.


2 Responses to “Website Recursive Url Parser”

  1. David Wier said

    really good code, but I don’t get all the pages/urls in my site, recursively. if you need my url for testing, let me know.
