HTML regex C#

HTML stands for HyperText Markup Language and is used to display information in the browser. Regular expressions can be used to find tags in HTML text, extract them, or remove them. Parsing HTML with regex is generally not a good idea, but a limited, known set of HTML can sometimes be handled this way.

Match all HTML tags

Below is a simple regex that matches a single HTML tag, including tags whose quoted attribute values contain a > character. It can be used to check whether a string contains tags, or to remove all tags and leave only the text.

new Regex("<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>")

Example code in C#:

using System.Text.RegularExpressions;
using System;
                    
public class Program
{
    public static void Main()
    {
        // Remove all HTML tags from a string
        Regex removeHTMLtagsRegex = new Regex("<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>");
        string newText = removeHTMLtagsRegex.Replace("<html><body>Hello, <b>world</b>!<br /></body></html>", "");
        Console.WriteLine(newText); // prints Hello, world!
    }
}
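
The quoted-string alternatives in the pattern are what let it cope with a > inside an attribute value. The sketch below illustrates this by comparing it with a simpler pattern, <[^>]+>, which is not part of the original article and is shown only as a contrast. The HtmlTags class and its method names are hypothetical helpers introduced for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlTags
{
    // Pattern from the article: a quoted attribute value is consumed as one unit.
    private static readonly Regex TagRegex =
        new Regex("<(?:\"[^\"]*\"['\"]*|'[^']*'['\"]*|[^'\">])+>");

    // Simpler pattern, for comparison only: it stops at the first '>' it sees.
    private static readonly Regex NaiveTagRegex = new Regex("<[^>]+>");

    // True if the text contains at least one tag-like substring.
    public static bool ContainsTag(string text) => TagRegex.IsMatch(text);

    // Removes every tag matched by the article's pattern.
    public static string StripTags(string html) => TagRegex.Replace(html, "");

    // Removes every tag matched by the simpler pattern.
    public static string StripTagsNaive(string html) => NaiveTagRegex.Replace(html, "");
}

public class Program
{
    public static void Main()
    {
        string html = "<a title=\"a > b\" href='x'>link</a>";

        Console.WriteLine(HtmlTags.ContainsTag("2 < 3"));  // prints False
        Console.WriteLine(HtmlTags.ContainsTag(html));     // prints True
        Console.WriteLine(HtmlTags.StripTags(html));       // prints link
        Console.WriteLine(HtmlTags.StripTagsNaive(html));  // prints  b" href='x'>link
    }
}
```

The simpler pattern ends the first tag at the > inside the title attribute, so part of the markup leaks into the output.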

Extract text between certain tags

One of the most common operations with HTML and regex is extracting the text between certain tags (a.k.a. scraping). The following regular expressions can be used for this: the first matches a plain <div> tag, the second matches a <div> with a specific class.

Regex regex1 = new Regex("<div>(.*?)<\\/div>"); // Tag only
Regex regex2 = new Regex("(?:<div.*?class=\"some-class\".*?>)(.*?)(?:<\\/div>)"); // Tag and class

Example code in C#:

using System.Text.RegularExpressions;
using System;
                    
public class Program
{
    public static void Main()
    {
        // Extract text between specific HTML tag
        Regex extractHTMLRegex = new Regex("(?:<div.*?class=\"some-class\".*?>)(.*?)(?:<\\/div>)");
        Match match = extractHTMLRegex.Match("<html><body>Probably.<div class=\"some-class\">Hello, world!</div><br />Today</body></html>");
        if (match.Success)
        {
            Console.WriteLine(match.Groups[1].Value); // prints Hello, world!
        }
    }
}
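
Regex.Match returns only the first occurrence. When a page contains several matching tags, Regex.Matches can collect all of them, as in the minimal sketch below. It uses the "tag only" pattern and adds RegexOptions.Singleline, an assumption not present in the original patterns, because in .NET the dot does not match a newline by default and tag content often spans lines. The DivExtractor class is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DivExtractor
{
    // "Tag only" pattern; Singleline lets '.' match newlines as well.
    private static readonly Regex DivRegex =
        new Regex("<div>(.*?)</div>", RegexOptions.Singleline);

    // Returns the inner text of every non-nested <div>...</div> pair, in document order.
    public static List<string> ExtractAll(string html)
    {
        var results = new List<string>();
        foreach (Match m in DivRegex.Matches(html))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}

public class Program
{
    public static void Main()
    {
        string html = "<div>first</div><p>skip</p><div>second\nline</div>";
        List<string> texts = DivExtractor.ExtractAll(html);

        Console.WriteLine(texts.Count); // prints 2
        Console.WriteLine(texts[0]);    // prints first
    }
}
```

Because the lazy quantifier stops at the first closing tag, this approach only gives sensible results when the matched tags are not nested inside each other.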

Notes on HTML regex

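You should never use regular expressions to fully parse HTML documents, as regular expressions are not designed for such tasks. For example, the extraction patterns above stop at the first closing </div>, so they return truncated content when <div> elements are nested. Instead, use an HTML or XML document parser, which can validate the document while parsing it.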
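
As a rough illustration of the parser-based approach, the sketch below reads the same sample markup with XDocument from System.Xml.Linq. This works only because the sample happens to be well-formed XML; real-world HTML is often not, and would need a dedicated HTML parser instead. The FirstDivText method is a hypothetical helper, and the exact class comparison is a simplification that ignores elements carrying several classes.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public class Program
{
    // Returns the text of the first <div> whose class attribute equals cssClass,
    // or null if there is none. Throws XmlException if the input is not well-formed XML.
    public static string FirstDivText(string xhtml, string cssClass)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants("div")
                  .Where(d => (string)d.Attribute("class") == cssClass)
                  .Select(d => d.Value)
                  .FirstOrDefault();
    }

    public static void Main()
    {
        string xhtml = "<html><body>Probably.<div class=\"some-class\">Hello, world!</div><br />Today</body></html>";
        Console.WriteLine(FirstDivText(xhtml, "some-class")); // prints Hello, world!
    }
}
```

Unlike the regex version, the parser tracks element nesting, so a <div> inside the matched <div> does not cut the result short.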