Problem

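In the previous SAX example, parsing a plain-text (ANSI) XML file works fine. However, if the XML file contains special non-ASCII characters, the parser may throw a “MalformedByteSequenceException”. This typically happens when the file has no encoding declaration, so the parser assumes UTF-8, but the bytes in the file are not valid UTF-8.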

Solution

1. Create an XML file

This XML file contains a special character, “§” (press Alt+789).

<?xml version="1.0"?>
<company>
	<staff>
		<firstname>yong</firstname>
		<lastname>mook kim</lastname>
		<nickname>§</nickname>
		<salary>100000</salary>
	</staff>
</company>

If you parse it the normal SAX way and the file is not saved as UTF-8, you will hit the “Invalid byte 1 of 1-byte UTF-8 sequence” error.
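
To see where the error comes from, here is a minimal sketch that parses the same content from memory, once encoded as ISO-8859-1 (standing in for an ANSI file) and once as UTF-8. The class name MalformedUtf8Demo and the helper tryParse are made up for this illustration, and the exact exception type may differ between parser implementations.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class MalformedUtf8Demo {

    // Hypothetical helper: returns the exception thrown while parsing,
    // or null if the document was parsed without errors.
    static Exception tryParse(byte[] xmlBytes) {
        try {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            // Raw byte stream, no encoding hint: with no encoding declared in
            // the prolog, the parser treats the bytes as UTF-8.
            saxParser.parse(new ByteArrayInputStream(xmlBytes), new DefaultHandler());
            return null;
        } catch (Exception e) {
            return e;
        }
    }

    public static void main(String[] args) {
        // \u00A7 is the section sign used in the XML file above.
        String xml = "<?xml version=\"1.0\"?><company><staff>"
                + "<nickname>\u00A7</nickname></staff></company>";

        // In ISO-8859-1 the section sign is the single byte 0xA7,
        // which is not a valid UTF-8 sequence on its own.
        Exception ansi = tryParse(xml.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println("ISO-8859-1 bytes: " + ansi);

        // The same text encoded as UTF-8 should parse cleanly.
        Exception utf8 = tryParse(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println("UTF-8 bytes: " + utf8);
    }
}
```

The first call should report an exception and the second should print null, which suggests the problem lies in how the file was saved rather than in the “§” character itself.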

2. Create a Java File

This is the normal SAX way. It leaves the encoding detection to the parser, which fails when the file’s bytes are not valid UTF-8.

saxParser.parse("c:\\file.xml", handler);

First, make sure the file is really saved as UTF-8. Then override the SAX input source: read the file through a Reader that decodes UTF-8, and pass it to the parser as an InputSource.

File file = new File("c:\\file-utf.xml");
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
 
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
 
saxParser.parse(is, handler);
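
One way to check the "file is UTF-8 encoded" precondition from code is to run the bytes through a strict UTF-8 decoder before parsing. This is only a sketch: the class Utf8Check and the helper isValidUtf8 are names invented for this example and are not part of SAX.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {

    // Hypothetical helper: true if the bytes decode cleanly as UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail instead of silently substituting a replacement character.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "<nickname>\u00A7</nickname>";
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));      // true
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1))); // false
    }
}
```

For a real file, the bytes could come from Files.readAllBytes on the file's path. If the check fails, re-save the file as UTF-8 in your editor before parsing it.
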

Full example…

package com.mkyong.test;
 
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
 
public class ReadXMLUTF8FileSAX 
{
    public static void main( String[] args )
    {
    	try {
 
    	      SAXParserFactory factory = SAXParserFactory.newInstance();
    	      SAXParser saxParser = factory.newSAXParser();
 
    	      DefaultHandler handler = new DefaultHandler() {
 
    	        boolean bfname = false;
    	        boolean blname = false;
    	        boolean bnname = false;
    	        boolean bsalary = false;
 
    	        public void startElement(String uri, String localName,
    	            String qName, Attributes attributes)
    	            throws SAXException {
 
    	          System.out.println("Start Element :" + qName);
 
    	          if (qName.equalsIgnoreCase("FIRSTNAME")) {
    	        	  bfname = true;
    	          }
 
    	          if (qName.equalsIgnoreCase("LASTNAME")) {
    	        	  blname = true;
    	          }
 
    	          if (qName.equalsIgnoreCase("NICKNAME")) {
    	        	  bnname = true;
    	          }
 
    	          if (qName.equalsIgnoreCase("SALARY")) {
    	        	  bsalary = true;
    	          }
 
    	        }
 
    	        public void endElement(String uri, String localName,
    	                String qName)
    	                throws SAXException {
 
    	              System.out.println("End Element :" + qName);
 
    	        }
 
    	        public void characters(char ch[], int start, int length)
    	            throws SAXException {
 
    	          System.out.println(new String(ch, start, length));
 
 
    	          if (bfname) {
    	            System.out.println("First Name : "
    	                + new String(ch, start, length));
    	            bfname = false;
    	          }
 
    	          if (blname) {
    	              System.out.println("Last Name : "
    	                  + new String(ch, start, length));
    	              blname = false;
    	           }
 
    	          if (bnname) {
    	              System.out.println("Nick Name : "
    	                  + new String(ch, start, length));
    	              bnname = false;
    	           }
 
    	          if (bsalary) {
    	              System.out.println("Salary : "
    	                  + new String(ch, start, length));
    	              bsalary = false;
    	           }
 
    	        }
 
    	      };
 
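    	      File file = new File("c:\\file.xml");
 
    	      // Decode the file explicitly as UTF-8; try-with-resources closes the stream.
    	      try (InputStream inputStream = new FileInputStream(file);
    	           Reader reader = new InputStreamReader(inputStream, "UTF-8")) {
 
    	          InputSource is = new InputSource(reader);
    	          // Has no effect when a character stream (Reader) is supplied; kept for clarity.
    	          is.setEncoding("UTF-8");
 
    	          saxParser.parse(is, handler);
    	      }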
 
 
    	    } catch (Exception e) {
    	      e.printStackTrace();
    	    }
 
    }
}

To parse an XML file that contains Unicode characters, override the SAX input source with a Reader that decodes the file as UTF-8.
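
As a compact, self-contained variation on the approach above, the sketch below writes a small UTF-8 file to a temporary location and reads the nickname back through a UTF-8 Reader. The class NicknameReader and the helper readNickname are made up for this illustration. It also collects the text in a StringBuilder, since a parser is allowed to deliver one element's text in several characters() calls.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NicknameReader {

    // Hypothetical helper: returns the text of the first <nickname> element.
    static String readNickname(Path xmlFile)
            throws IOException, SAXException, ParserConfigurationException {
        StringBuilder nickname = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inNickname = false;
            private boolean done = false;

            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attributes) {
                if (!done && qName.equalsIgnoreCase("nickname")) {
                    inNickname = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called more than once per element, so append.
                if (inNickname) {
                    nickname.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (inNickname && qName.equalsIgnoreCase("nickname")) {
                    inNickname = false;
                    done = true;
                }
            }
        };

        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        // The Reader decodes the bytes as UTF-8, so the parser receives characters.
        try (Reader reader = Files.newBufferedReader(xmlFile, StandardCharsets.UTF_8)) {
            saxParser.parse(new InputSource(reader), handler);
        }
        return nickname.toString();
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("staff", ".xml");
        String xml = "<?xml version=\"1.0\"?><company><staff>"
                + "<nickname>\u00A7</nickname></staff></company>";
        Files.write(tmp, xml.getBytes(StandardCharsets.UTF_8));
        try {
            System.out.println("Nick Name : " + readNickname(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Whether “§” shows up correctly on the console also depends on the console's own encoding, which is separate from how the XML file is parsed.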
