How to read UTF-8 XML file in Java – (SAX Parser)

This article shows how to use the SAX parser to read or parse a UTF-8 XML file.

1. SAX parser to parse a UTF-8 XML file.

1.1 An XML file that contains UTF-8 encoded Chinese characters.

staff.xml

<?xml version="1.0" encoding="utf-8"?>
<Company>
    <staff id="1001">
        <name>揚木金</name>
        <role>support &amp; code</role>
        <salary currency="USD">5000</salary>
        <bio><![CDATA[HTML tag <code>testing</code>]]></bio>
    </staff>
    <staff id="1002">
        <name>yflow</name>
        <role>admin</role>
        <salary currency="EUR">8000</salary>
        <bio><![CDATA[a & b]]></bio>
    </staff>
</Company>

1.2 The example below sets the UTF-8 encoding explicitly.

Note
For the SAX handler PrintAllHandlerSax, refer to this article.

ReadXmlSaxParser.java

package com.mkyong.xml.sax;

import com.mkyong.xml.sax.handler.PrintAllHandlerSax;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ReadXmlSaxParser {

  private static final String FILENAME = "src/main/resources/staff-unicode.xml";

  public static void main(String[] args) {

      SAXParserFactory factory = SAXParserFactory.newInstance();

      try {

          SAXParser saxParser = factory.newSAXParser();

          PrintAllHandlerSax handler = new PrintAllHandlerSax();

          XMLReader xmlReader = saxParser.getXMLReader();
          xmlReader.setContentHandler(handler);

          InputSource source = new InputSource(FILENAME);

          // explicitly set the encoding; name() returns the canonical charset name ("UTF-8")
          source.setEncoding(StandardCharsets.UTF_8.name());

          xmlReader.parse(source);

      } catch (ParserConfigurationException | SAXException | IOException e) {
          e.printStackTrace();
      }

  }

}

Output

Terminal

Start Document
Start Element : Company
Start Element : staff
Staff id : 1001
Start Element : name
End Element : name
Name : 揚木金
Start Element : role
End Element : role
Role : support & code
Start Element : salary
Currency :USD
End Element : salary
Salary : 5000
Start Element : bio
End Element : bio
Bio : HTML tag <code>testing</code>
End Element : staff
Start Element : staff
Staff id : 1002
Start Element : name
End Element : name
Name : yflow
Start Element : role
End Element : role
Role : admin
Start Element : salary
Currency :EUR
End Element : salary
Salary : 8000
Start Element : bio
End Element : bio
Bio : a & b
End Element : staff
End Element : Company
End Document  
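
The PrintAllHandlerSax class is not listed in this article. As a rough illustration of how such a handler can collect text, here is a minimal sketch. The class NameHandlerSketch and its readNames method are hypothetical names, not the article's handler, and the XML is assumed to be supplied as an in-memory string instead of a file.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for PrintAllHandlerSax; all names here are illustrative.
class NameHandlerSketch extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // reset the text buffer at the start of every element
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // the parser may deliver one text node in several chunks, so append
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // keep only the content of <name> elements
        if ("name".equals(qName)) {
            names.add(text.toString());
        }
    }

    // Encodes the XML as UTF-8 bytes and parses them with an explicit UTF-8 encoding.
    static List<String> readNames(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        NameHandlerSketch handler = new NameHandlerSketch();
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        InputSource source = new InputSource(new ByteArrayInputStream(bytes));
        source.setEncoding(StandardCharsets.UTF_8.name());
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        // \u63da\u6728\u91d1 is the Chinese name used in the sample XML file
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<Company><staff id=\"1001\"><name>\u63da\u6728\u91d1</name></staff>"
                + "<staff id=\"1002\"><name>yflow</name></staff></Company>";
        readNames(xml).forEach(name -> System.out.println("Name : " + name));
    }
}
```

Buffering in characters() and reading the buffer in endElement() matters because SAX does not guarantee that an element's text arrives in a single callback.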

2. Character Encoding in XML and code

Ensure we are using the correct encoding to parse the XML file.

2.1 For XML files, it’s best practice to declare the encoding in the XML declaration.


<?xml version="1.0" encoding="character-encoding-here"?>
<Company>

</Company>

For example, the below is a UTF-8 encoded XML file.


<?xml version="1.0" encoding="utf-8"?>
<Company>

</Company>
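
To illustrate why the declaration matters, the sketch below encodes a small document in a given charset, declares that charset in the XML declaration, and parses the bytes without setting any encoding in code. The class DeclaredEncodingSketch and its roundTrip method are hypothetical, and ISO-8859-1 is used only as an example of a non-UTF-8 encoding.

```java
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: class and method names are not from the article.
class DeclaredEncodingSketch {

    // Builds a document whose declaration matches the bytes, then parses it
    // and returns the text content that the parser decoded.
    static String roundTrip(String text, Charset charset) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
                + "<name>" + text + "</name>";
        byte[] bytes = xml.getBytes(charset);

        StringBuilder result = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                result.append(ch, start, length);
            }
        };

        // no encoding is set in code; the declaration is the only hint
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(bytes), handler);
        return result.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("Jos\u00e9", StandardCharsets.ISO_8859_1));
        System.out.println(roundTrip("\u63da\u6728\u91d1", StandardCharsets.UTF_8));
    }
}
```

The important part is that the declared encoding and the actual bytes agree; a file declared as utf-8 but saved in another encoding is a common source of the errors listed in section 3.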

2.2 For the SAX parser, we can set the encoding on the InputSource that is passed to the XMLReader.


  SAXParserFactory factory = SAXParserFactory.newInstance();

  try {

      SAXParser saxParser = factory.newSAXParser();

      PrintAllHandlerSax handler = new PrintAllHandlerSax();

      XMLReader xmlReader = saxParser.getXMLReader();
      xmlReader.setContentHandler(handler);

      InputSource source = new InputSource(FILENAME);

      // utf-8; name() returns the canonical charset name
      source.setEncoding(StandardCharsets.UTF_8.name());

      // utf-16
      // source.setEncoding(StandardCharsets.UTF_16.name());

      // ascii
      // source.setEncoding(StandardCharsets.US_ASCII.name());

      xmlReader.parse(source);

  } catch (ParserConfigurationException | SAXException | IOException e) {
      e.printStackTrace();
  }
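
The same setEncoding call also works when the XML comes from an InputStream instead of a file name, for example a stream obtained from a network connection. The sketch below assumes the bytes are ISO-8859-1 and carry no XML declaration, so the encoding set on the InputSource is the only hint the parser gets. ExplicitEncodingSketch and readText are hypothetical names.

```java
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: class and method names are not from the article.
class ExplicitEncodingSketch {

    // Parses the stream with the given encoding and returns all text content.
    static String readText(InputStream in, Charset charset)
            throws ParserConfigurationException, SAXException, IOException {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        // wrap the byte stream and tell the parser how to decode it
        InputSource source = new InputSource(in);
        source.setEncoding(charset.name());

        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // ISO-8859-1 bytes without an XML declaration (assumed input for this sketch)
        byte[] latin1 = "<role>caf\u00e9 admin</role>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readText(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1));
    }
}
```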

3. SAX common errors

Below are some common errors in SAX XML parsing.

3.1 Invalid byte 1 of 1-byte UTF-8 sequence

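The XML file contains bytes that are not valid UTF-8, typically because it was saved in a different encoding than the one the parser expects; read this article for details.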
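
One way to confirm this diagnosis before parsing is to decode the raw bytes with a strict UTF-8 decoder. The following is a minimal sketch using the standard CharsetDecoder API; the class Utf8CheckSketch and the method isValidUtf8 are hypothetical names, and the input is built in memory where real code would read the file bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only: class and method names are not from the article.
class Utf8CheckSketch {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently replacing bad input
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String xml = "<name>Jos\u00e9</name>";
        // same text, two different byte encodings
        System.out.println(isValidUtf8(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(xml.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

If the check fails, either re-save the file as UTF-8 or tell the parser the real encoding, as described in section 2.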

3.2 Content is not allowed in prolog

The XML file contains stray text or a byte order mark (BOM) before the XML declaration; read this article for details.
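
When the XML is already available as a string, one possible workaround is to discard whatever precedes the first `<` before handing the text to the parser. This is only a sketch under that assumption: BomStripSketch, dropLeadingJunk and parseText are hypothetical names, and silently dropping leading text may not be appropriate if that text signals a real problem upstream.

```java
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;

// Illustrative only: class and method names are not from the article.
class BomStripSketch {

    // Removes a leading BOM (U+FEFF), whitespace, or any other text before the first '<'.
    static String dropLeadingJunk(String xml) {
        int start = xml.indexOf('<');
        return start > 0 ? xml.substring(start) : xml;
    }

    // Cleans the prolog, parses the document, and returns all text content.
    static String parseText(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        InputSource source = new InputSource(new StringReader(dropLeadingJunk(xml)));
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // the document starts with a BOM character, as some editors write it
        String withBom = "\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?><name>yflow</name>";
        System.out.println(parseText(withBom));
    }
}
```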

3.3 The entity name must immediately follow the ‘&’ in the entity reference

A bare & is not allowed in XML content. Replace it with &amp; or wrap the text in a CDATA section, for example <![CDATA[a & b]]>.
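
If the XML is generated by our own code, the text can be escaped before it is written. The sketch below shows a small hand-rolled helper for element text and parses the result back to confirm the round trip. XmlEscapeSketch, escapeText and parseRole are hypothetical names; a real project may prefer an XML writer API that escapes automatically.

```java
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;

// Illustrative only: class and method names are not from the article.
class XmlEscapeSketch {

    // Escapes the characters that are special in element text.
    // The ampersand must be replaced first, or the other entities get double-escaped.
    static String escapeText(String raw) {
        return raw.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Wraps the escaped text in a <role> element, parses it, and returns the decoded text.
    static String parseRole(String rawText) throws Exception {
        String xml = "<role>" + escapeText(rawText) + "</role>";
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(escapeText("support & code"));
        System.out.println(parseRole("support & code"));
    }
}
```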

4. Download Source Code

$ git clone https://github.com/mkyong/core-java

$ cd java-xml

$ cd src/main/java/com/mkyong/xml/sax/

About Author

Founder of Mkyong.com, love Java and open source stuff.

Comments

Minu
9 years ago

Another consequence of not using this approach to specify the encoding is that the parser fails with a “content not allowed in prolog” exception. I tried everything that many posts had suggested (remove extra characters, check the encoding within the XML), but that did not help. This approach did. If possible, please include that as a tag so that this post is picked up in searches.

Jaikrat
10 years ago

This doesn’t work for & operator. Any thought?

Thanks
Jai

Java Expert
2 years ago

When parsing a document with the DOM parser, I got the error "An invalid XML character (Unicode: 0x10) or (Unicode: 0x1a) was found in the value of attribute". It took a long time to find out that there were illegal characters in the document.
These errors are caused by invisible control characters that are illegal in XML files, so the XML parser fails during parsing.
Sometimes it also throws the same type of error while parsing a large XML file, even though there are no special characters.

Nguyễn Hòa
9 years ago

How can I do this with a URL? I use httpConnection.getInputStream(), which does not accept any encoding parameter.

sandeep
10 years ago

Thanks MKYong, you fixed my issue 🙂

Aravind
11 years ago

do you know how to do this please explain me also

BionicMessiah
4 years ago
Reply to  mkyong

7 years later the question is still here.

sp
10 years ago
Reply to  mkyong

Hi mkyong,

I failed to understand how to handle elements like
or
with sax parsers, can you please explain about that.

Thanks in advance
sp

sp
10 years ago
Reply to  sp

Hi mkyong,

I failed to understand how to handle empty tags like or
with sax parsers, can you please explain about that.

Thanks in advance
sp

rman
13 years ago

sorry the example is supposed to be [salary][/salary]