Jean Paul's Blog

There are 2 types of People in the World, One who Likes SharePoint and..

Archive for March, 2011

Website Recursive Url Parser

Posted by JP on March 26, 2011

In this article I share a piece of code that some developers may find useful.

There is plenty of C# code that parses the http urls in a given string. It is harder to find code that will:

  • Accept a url as argument, parse the site content
  • Fetch all urls in the site content, parse the site content of each url
  • Repeat the above process until all urls are fetched.
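
The three steps above can be sketched as a small recursive routine. This is only an illustration under assumptions, not the SpiderLogic class from the article: the names CrawlSketch, Crawl and getLinks are hypothetical, and getLinks stands in for "download the page and extract its urls" so the sketch runs without network access.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not the article's SpiderLogic class.
public static class CrawlSketch
{
    // getLinks is an assumed callback that returns the urls found on a page.
    public static ISet<string> Crawl(string startUrl,
        Func<string, IEnumerable<string>> getLinks)
    {
        var visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        Visit(startUrl, getLinks, visited);
        return visited;
    }

    private static void Visit(string url,
        Func<string, IEnumerable<string>> getLinks, ISet<string> visited)
    {
        // Add returns false when the url was already seen, which stops cycles.
        if (!visited.Add(url))
            return;

        // Repeat the same process for every url found on this page.
        foreach (string link in getLinks(url))
            Visit(link, getLinks, visited);
    }
}
```

The visited set is what makes the process terminate when pages link back to each other. On very large sites an explicit queue instead of recursion would avoid deep call stacks.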

Scenario

Taking the website http://valuestocks.in (a stock market site) as an example, I would like to get all the urls inside the website recursively.

Design

The main class is SpiderLogic which contains all necessary methods and properties.

The GetUrls() method is used to parse the website and return the urls. There are two overloads for this method.

The first one takes 2 arguments: the url and a Boolean indicating whether recursive parsing is needed.

Eg: GetUrls("http://www.google.com", true);

The second one takes 3 arguments: url, base url and recursive Boolean.

This overload is intended for cases where the url is a sub level of the base url and the web page contains relative paths. The second argument is needed to construct valid absolute urls from those relative paths.

Eg: GetUrls("http://www.whereincity.com/india-kids/baby-names/", "http://www.whereincity.com/", true);
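
To illustrate why the base url matters, here is a minimal sketch of turning a relative path into an absolute url with the framework's Uri class. The class and method names (UrlResolveSketch, ToAbsolute) are hypothetical and this is not the article's GetAbsoluteUri() method, whose body is not shown here.

```csharp
using System;

// Hypothetical helper, not the article's GetAbsoluteUri().
public static class UrlResolveSketch
{
    // Returns the absolute form of href, or null when it cannot be resolved.
    public static string ToAbsolute(string baseUrl, string href)
    {
        Uri baseUri;
        if (!Uri.TryCreate(baseUrl, UriKind.Absolute, out baseUri))
            return null;

        Uri result;
        // Uri.TryCreate(Uri, string, out Uri) resolves href against baseUri.
        // An href that is already absolute is kept as it is.
        if (!Uri.TryCreate(baseUri, href, out result))
            return null;

        return result.AbsoluteUri;
    }
}
```

For example, ToAbsolute("http://www.whereincity.com/", "india-kids/baby-names/") yields "http://www.whereincity.com/india-kids/baby-names/".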

Method Body of GetUrls()

    public IList<string> GetUrls(string url, string baseUrl, bool recursive)
    {
        if (recursive)
        {
            _urls.Clear();
            RecursivelyGenerateUrls(url, baseUrl);
            return _urls;
        }
        else
            return InternalGetUrls(url, baseUrl);
    }

InternalGetUrls()

Another method of interest is InternalGetUrls(), which fetches the content of the url, parses the urls inside it and constructs the absolute urls.

    private IList<string> InternalGetUrls(string baseUrl, string absoluteBaseUrl)
    {
        IList<string> list = new List<string>();

        // Give up early if the page url itself is not a valid uri
        Uri uri = null;
        if (!Uri.TryCreate(baseUrl, UriKind.RelativeOrAbsolute, out uri))
            return list;

        // Get the http content
        string siteContent = GetHttpResponse(baseUrl);
        var allUrls = GetAllUrls(siteContent);

        foreach (string uriString in allUrls)
        {
            uri = null;
            if (Uri.TryCreate(uriString, UriKind.RelativeOrAbsolute, out uri))
            {
                if (uri.IsAbsoluteUri)
                {
                    // Keep only urls under the base url. Remove this check
                    // if urls on other domains or javascript: urls are needed.
                    if (uri.OriginalString.StartsWith(absoluteBaseUrl))
                    {
                        list.Add(uriString);
                    }
                }
                else
                {
                    // Relative url: build the absolute form from the base url
                    string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                    if (!string.IsNullOrEmpty(newUri))
                        list.Add(newUri);
                }
            }
            else
            {
                // Could not be parsed as a uri (uri is null here);
                // still try to build an absolute url from the raw string
                if (!uriString.StartsWith(absoluteBaseUrl))
                {
                    string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                    if (!string.IsNullOrEmpty(newUri))
                        list.Add(newUri);
                }
            }
        }

        return list;
    }
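
The body of GetAllUrls() is not listed above. As a rough idea of what such a step could look like, here is a sketch that pulls href values out of html with a regular expression. The names LinkExtractorSketch and ExtractHrefs are hypothetical, and the real GetAllUrls() in the source code may work differently.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch, not the article's GetAllUrls().
public static class LinkExtractorSketch
{
    // Matches href="..." or href='...' and captures the quoted value.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*(?:\"(?<url>[^\"]*)\"|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase);

    public static IList<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            string value = match.Groups["url"].Value.Trim();
            if (value.Length > 0)
                urls.Add(value);
        }
        return urls;
    }
}
```

A regular expression is enough for a quick sketch, but it will miss unquoted attributes and links built by scripts. An html parser would be a sturdier choice for real pages.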

Handling Exceptions

There is an OnException delegate that can be used to receive the exceptions that occur while parsing.
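
The exact signature of OnException is not shown in this post. The sketch below assumes it is an Action<Exception> callback and shows one way a parser could report errors through it without stopping. SpiderSketch, SafeGetUrls and the fetch parameter are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; the real OnException signature may differ.
public class SpiderSketch
{
    // Assumed shape of the callback: receives the exception that occurred.
    public Action<Exception> OnException { get; set; }

    // fetch stands in for the code that downloads and parses one url.
    public IList<string> SafeGetUrls(string url, Func<string, IList<string>> fetch)
    {
        try
        {
            return fetch(url);
        }
        catch (Exception ex)
        {
            // Report the failure to the subscriber, if any, and keep going.
            if (OnException != null)
                OnException(ex);
            return new List<string>();
        }
    }
}
```

A caller could subscribe with something like spider.OnException = ex => Console.WriteLine(ex.Message); so that one broken link is reported without ending the whole run.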

Tester Application

A tester Windows application is included with the source code of the article. You can try executing it.

The form accepts a base url as input. Clicking the Go button parses the content of the url and extracts all urls in it. If you need recursive parsing, check the Is Recursive check box.

Next Part

In the next part of the article, I would like to create a url verifier website that verifies all the urls in a website. I agree that a search will turn up free providers that already do this. My aim is to learn and to develop custom code that is extensible and reusable across multiple projects by the community.

Source Code

The associated source code can be found on c-sharpcorner. Please search there using the same title.

Posted in ASP.NET

C#–Invoke CPP method returning String

Posted by JP on March 16, 2011

[DllImport("Cpp.dll")]
public static extern void GetString(ref StringBuilder sb, int length);

StringBuilder sb = new StringBuilder(128);
GetString(ref sb, 128);

Posted in C#