How to read the Website content in c#?

Published on Author Code Father
How to read the Website content in c#?

Here is how you would do it using the HtmlAgilityPack. First your sample HTML:

var html = "<html>\r\n<body>\r\nbla bla </td><td>\r\nbla bla \r\n<body>\r\n<html>";

Load it up (as a string in this case):

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

If getting it from the web, similar:

var web = new HtmlWeb();
var doc = web.Load(url);

Now select only text nodes with non-whitespace and trim them.

var text = doc.DocumentNode.Descendants()
              .Where(x => x.NodeType == HtmlNodeType.Text && x.InnerText.Trim().Length > 0)
              .Select(x => x.InnerText.Trim());

You can get this as a single joined string if you like:

String.Join(" ", text)

Of course this will only work for simple web pages. Anything complex will also return nodes with data you clearly don’t want, such as javascript functions etc.

Comments

comments

Categories C#