Convert Html To Plain Text In C#

I have snippets of Html stored in a table. Not entire pages, no tags or the like, just basic formatting.


I would like to be able to display that HTML as text only, with no formatting, on a given page (actually just the first 30 to 50 characters, but that's the easy bit).

How do I place the "text" within that Html into a string as straight text?

So this piece of code:

Hello World.
Is there anyone out there?

Becomes:

Hello World. Is there anyone out there?
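The "easy bit" mentioned in the question, keeping only the first 30 to 50 characters, could look roughly like this once the markup has been removed. This is a minimal sketch: the class name `PreviewText` and the method `Truncate` are hypothetical, not something from the question.

```csharp
using System;

public static class PreviewText
{
    // Hypothetical helper: returns at most maxLength characters of text
    // that has already been converted to plain text.
    public static string Truncate(string plainText, int maxLength)
    {
        if (string.IsNullOrEmpty(plainText) || maxLength <= 0)
        {
            return string.Empty;
        }

        // Substring counts UTF-16 code units, so a surrogate pair that
        // straddles the cut-off point would be split.
        return plainText.Length <= maxLength
            ? plainText
            : plainText.Substring(0, maxLength);
    }
}
```

Truncating after the conversion, rather than before it, avoids cutting a tag or an entity in half.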



The MIT-licensed HtmlAgilityPack has, in one of its samples, a method that converts from HTML to plain text.

    var plainText = HtmlUtilities.ConvertToPlainText(html);

Feed it an HTML string like

hello, world!

And you'll get a plain text result like:

hello world!


I could not use HtmlAgilityPack, so I wrote a second-best solution for myself:

private static string HtmlToPlainText(string html) const string tagWhiteSpace =
"(>
answered May 6, 2013 at 21:06 by Ben Anderson
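The listing in this answer breaks off after the start of its first pattern, so what follows is only a minimal sketch of a regex-based conversion in the same spirit, not the author's code. It assumes only the .NET base class library; the class name `HtmlTextSketch`, the method `ToPlainText`, and all three patterns are illustrative choices.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Whitespace (including line breaks) sitting between two tags.
    private static readonly Regex InterTagWhitespace = new Regex(@">\s+<");

    // <br>, <br/> and <br /> in any letter case.
    private static readonly Regex LineBreak =
        new Regex(@"<br\s*/?>", RegexOptions.IgnoreCase);

    // Any tag, including an unterminated one at the end of the input.
    private static readonly Regex AnyTag = new Regex(@"<[^>]*(>|$)");

    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Drop layout whitespace between tags so it does not leak into the text.
        string text = InterTagWhitespace.Replace(html, "><");

        // Turn explicit line breaks into newlines before the tags disappear.
        text = LineBreak.Replace(text, "\n");

        // Remove every remaining tag.
        text = AnyTag.Replace(text, string.Empty);

        // Decode entities last, so an encoded "&lt;" is not mistaken for a tag.
        text = WebUtility.HtmlDecode(text);

        return text.Trim();
    }
}
```

A sketch like this is not an HTML parser: the contents of script and style elements are kept as text, and a literal "<" in the input is treated as the start of a tag. For short snippets with basic formatting, as described in the question, that may be acceptable.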
If you are talking about tag stripping, it is relatively straightforward if you don't have to worry about things like <script> tags.

    using System.Text.RegularExpressions;

    private string StripHTML(string source)
    {
        // The beginning of the listing is not available; starting from the
        // raw input (and the parameter name) is an assumption.
        string result = source;

        // remove all scripts (prepare first by clearing attributes)
        result = Regex.Replace(result,
            @"<( )*script([^>])*>", "<script>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result,
            @"(<( )*(/)( )*script( )*>)", "</script>", RegexOptions.IgnoreCase);
        //result = Regex.Replace(result,
        //    @"(<script>)([^(<script>\.</script>)])*(</script>)",
        //    string.Empty, RegexOptions.IgnoreCase);
        result = Regex.Replace(result,
            @"(<script>).*(</script>)", string.Empty, RegexOptions.IgnoreCase);

        // remove all styles (prepare first by clearing attributes)
        result = Regex.Replace(result,
            @"<( )*style([^>])*>", "<style>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result,
            @"(<( )*(/)( )*style( )*>)", "</style>", RegexOptions.IgnoreCase);
        result = Regex.Replace(result,
            "(<style>).*(</style>)", string.Empty, RegexOptions.IgnoreCase);

        // insert tabs in spaces of <td> tags
        result = Regex.Replace(result,
            @"<( )*td([^>])*>", "\t", RegexOptions.IgnoreCase);

        // insert line breaks in places of <BR> and <LI> tags
        result = Regex.Replace(result,
            @"<( )*br( )*>", "\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result,
            @"<( )*li( )*>", "\r", RegexOptions.IgnoreCase);

        // insert line paragraphs (double line breaks) in place
        // of <P>, <DIV> and <TR> tags
        result = Regex.Replace(result,
            @"<( )*div([^>])*>", "\r\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result,
            @"<( )*tr([^>])*>", "\r\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result,
            @"<( )*p([^>])*>", "\r\r", RegexOptions.IgnoreCase);

        // Remove remaining tags like <a>, links, images,
        // comments etc - anything that's enclosed inside < >
        result = Regex.Replace(result,
            @"<[^>]*>", string.Empty, RegexOptions.IgnoreCase);

        // replace special characters:
        result = Regex.Replace(result, @"&nbsp;", " ", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&bull;", " * ", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&lsaquo;", "<", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&rsaquo;", ">", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&trade;", "(tm)", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&frasl;", "/", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&lt;", "<", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&gt;", ">", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&copy;", "(c)", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, @"&reg;", "(r)", RegexOptions.IgnoreCase);
        // Remove all others. More can be added, see
        // http://hotwired.lycos.com/webmonkey/reference/special_characters/
        result = Regex.Replace(result,
            @"&(.{2,6});", string.Empty, RegexOptions.IgnoreCase);

        // for testing
        //Regex.Replace(result,
        //    this.txtRegex.Text, string.Empty, RegexOptions.IgnoreCase);

        // make line breaking consistent
        result = result.Replace("\n", "\r");

        // Remove extra line breaks and tabs:
        // replace over 2 breaks with 2 and over 4 tabs with 4.
        // Prepare first to remove any whitespaces in between
        // the escaped characters and remove redundant tabs in between line breaks
        result = Regex.Replace(result, "(\r)( )+(\r)", "\r\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(\t)( )+(\t)", "\t\t", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(\t)( )+(\r)", "\t\r", RegexOptions.IgnoreCase);
        result = Regex.Replace(result, "(\r)( )+(\t)", "\r\t", RegexOptions.IgnoreCase);
        // Remove redundant tabs
        result = Regex.Replace(result, "(\r)(\t)+(\r)", "\r\r", RegexOptions.IgnoreCase);
        // Remove multiple tabs following a line break with just one tab
        result = Regex.Replace(result, "(\r)(\t)+", "\r\t", RegexOptions.IgnoreCase);

        // Initial replacement target string for line breaks
        string breaks = "\r\r\r";
        // Initial replacement target string for tabs
        string tabs = "\t\t\t\t\t";
        for (int index = 0; index < result.Length; index++)
        {
            result = result.Replace(breaks, "\r\r");
            result = result.Replace(tabs, "\t\t\t\t");
            breaks = breaks + "\r";
            tabs = tabs + "\t";
        }

        return result;
    }

Escape characters such as \n and \r had to be removed first because they cause regexes to cease working as expected.

Moreover, to make the result string display correctly in the textbox, one might need to split it up and set the textbox's Lines property instead of assigning to its Text property.

    this.txtResult.Lines = StripHTML(this.txtSource.Text).Split("\r".ToCharArray());
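If the converted text might contain a mix of line-ending styles, the split can be made a little more tolerant. This is a minimal sketch and does not depend on Windows Forms: `LineSplitSketch` and `ToLines` are hypothetical names, and the resulting array is the kind of value one would assign to a Lines property.

```csharp
using System;

public static class LineSplitSketch
{
    // Splits text on "\r\n", "\r" or "\n". "\r\n" is listed first so that a
    // Windows line ending is treated as one break rather than two.
    public static string[] ToLines(string text)
    {
        if (text == null)
        {
            return new string[0];
        }

        return text.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
    }
}
```

`StringSplitOptions.None` keeps empty entries, so the double line breaks used for paragraphs still show up as blank lines.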

Source : https://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text-2