XML regex Java

XML

XML stands for Extensible Markup Language, a text format for storing and exchanging structured data. Parsing XML with regular expressions is generally discouraged, because a regex cannot reliably handle nesting, comments, or CDATA sections. Still, in certain situations a regex is a handy way to retrieve (scrape) one specific piece of information from a document whose shape you already know.

Extract value between XML tags

One of the most common operations with XML and regex is extracting the text between a given pair of tags. The following regular expression can be used for this; replace TAG with the name of the element you need. The text between the tags is captured in group 1.

Pattern.compile("(?:<TAG.*?>)(.*?)(?:<\\/TAG>)")

Example code in Java:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Main {

    public static void main(String[] args) {
        // Extract text between specific XML tag
        Matcher matcher = Pattern.compile("(?:<from.*?>)(.*?)(?:<\\/from>)")
                          .matcher("<note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>");
        if (matcher.find()) {
            System.out.println(matcher.group(1)); // prints Jani
        }
    }
}
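
The example above returns only the first match, and its pattern would also accept a longer tag name that merely starts with the same letters. The following is a minimal sketch of one way to collect every occurrence, not a definitive implementation. The class name TagExtractor and the helper extractAll are hypothetical names introduced for illustration. The sketch assumes the tag is not nested inside itself and that attribute values contain no ">" character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagExtractor {

    // Hypothetical helper: returns the text of every <tag>...</tag> pair found in xml.
    static List<String> extractAll(String xml, String tag) {
        // Quote the tag name so characters such as '.' are matched literally.
        String name = Pattern.quote(tag);
        // After the name, allow either '>' or whitespace followed by attributes,
        // so that "item" does not also match "<items>".
        // DOTALL lets '.' match line breaks inside the element text.
        Pattern pattern = Pattern.compile(
            "<" + name + "(?:\\s[^>]*)?>(.*?)</" + name + "\\s*>",
            Pattern.DOTALL);
        Matcher matcher = pattern.matcher(xml);
        List<String> values = new ArrayList<>();
        // find() resumes after the previous match, so the loop visits each occurrence once.
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<list><item>one</item><items>skip</items><item id=\"2\">two</item></list>";
        System.out.println(extractAll(xml, "item")); // prints [one, two]
    }
}
```

Building the pattern from a parameter keeps the tag name in one place, at the cost of recompiling the pattern on every call. For repeated use with the same tag, the compiled Pattern could be cached instead.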

Notes on regex XML extraction

While regex extraction can be a good option in some cases, a dedicated XML parser is usually the better tool for such tasks. Once the XML has been parsed (and, if needed, validated), the required information can be retrieved by querying the resulting Document. For instance, in Java the following code can be used:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class Main {

    public static void main(String[] args) throws Exception {
        // Create a DOM parser
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Wrap the XML string in a stream the parser can read
        ByteArrayInputStream input = new ByteArrayInputStream(
            "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>".getBytes(StandardCharsets.UTF_8)
        );
        Document doc = builder.parse(input);
        // Take the first <from> element and print its text
        System.out.println(doc.getElementsByTagName("from").item(0).getTextContent()); // prints Jani
    }
}
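
A parsed Document can also be queried with XPath, which ships with the JDK in the javax.xml.xpath package. The following is a minimal sketch of that approach using the same note document. The class name XPathExample and the helper query are hypothetical names introduced for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathExample {

    // Hypothetical helper: parses xml and returns the string value of the
    // first node matching the XPath expression (an empty string if none matches).
    static String query(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading>"
                + "<body>Don't forget me this weekend!</body></note>";
        System.out.println(query(xml, "/note/from")); // prints Jani
    }
}
```

Unlike the regex, the path /note/from selects only a from element that is a direct child of the root note element, and the parser rejects input that is not well-formed XML instead of silently returning a partial match.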