How to read UTF-8 XML file in Java – (SAX Parser)

In the previous Java SAX XML example, SAX parses a plain-text (ANSI) XML file without problems. However, if you parse an XML file that contains special UTF-8 characters, the parser throws an “Invalid byte 1 of 1-byte UTF-8 sequence” exception.


com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: 
Invalid byte 1 of 1-byte UTF-8 sequence.

See the following XML file, which contains the special UTF-8 character “§” (press Alt + 789).


<?xml version="1.0"?>
<company>
	<staff>
		<firstname>yong</firstname>
		<lastname>mook kim</lastname>
		<nickname>§</nickname>
		<salary>100000</salary>
	</staff>
</company>
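
One way to see where this exception can come from is the sketch below. It assumes, and the article does not state this, that the file was saved in a single-byte encoding such as ISO-8859-1, where “§” is the single byte 0xA7. Because the XML declaration names no encoding, the parser decodes the bytes as UTF-8, and 0xA7 on its own is not a valid UTF-8 sequence. The class name MalformedUtf8Demo and the inline XML are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class MalformedUtf8Demo {

    // Same shape as the article's file, reduced to one element (illustrative).
    static final String XML =
        "<?xml version=\"1.0\"?><company><nickname>\u00A7</nickname></company>";

    // Returns true if the parser accepts the bytes, false if it rejects them.
    static boolean parses(byte[] xmlBytes) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xmlBytes), new DefaultHandler());
            return true;
        } catch (Exception e) {
            // The concrete exception type varies between JDK versions.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the failing file was written in a single-byte encoding.
        byte[] latin1 = XML.getBytes(StandardCharsets.ISO_8859_1); // § is 0xA7
        byte[] utf8 = XML.getBytes(StandardCharsets.UTF_8);        // § is 0xC2 0xA7
        System.out.println("ISO-8859-1 bytes parse: " + parses(latin1));
        System.out.println("UTF-8 bytes parse: " + parses(utf8));
    }
}
```

The sketch only reports whether parsing succeeded, since the exception class that surfaces depends on the JDK in use.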

To fix it, override the SAX input source like this:


File file = new File("c:\\file-utf.xml");
InputStream inputStream = new FileInputStream(file);
// Decode the bytes as UTF-8 before the parser sees them
Reader reader = new InputStreamReader(inputStream, "UTF-8");

InputSource is = new InputSource(reader);
// Has no effect when a character stream (Reader) is supplied
is.setEncoding("UTF-8");

saxParser.parse(is, handler);

Here is a full example that uses the SAX parser to parse a Unicode XML file.


package com.mkyong.test;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReadXMLUTF8FileSAX {

    public static void main(String[] args) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();

            DefaultHandler handler = new DefaultHandler() {

                // Flags that mark which element's text comes next
                boolean bfname = false;
                boolean blname = false;
                boolean bnname = false;
                boolean bsalary = false;

                public void startElement(String uri, String localName,
                        String qName, Attributes attributes)
                        throws SAXException {

                    System.out.println("Start Element :" + qName);

                    if (qName.equalsIgnoreCase("FIRSTNAME")) {
                        bfname = true;
                    }

                    if (qName.equalsIgnoreCase("LASTNAME")) {
                        blname = true;
                    }

                    if (qName.equalsIgnoreCase("NICKNAME")) {
                        bnname = true;
                    }

                    if (qName.equalsIgnoreCase("SALARY")) {
                        bsalary = true;
                    }
                }

                public void endElement(String uri, String localName,
                        String qName)
                        throws SAXException {

                    System.out.println("End Element :" + qName);
                }

                public void characters(char ch[], int start, int length)
                        throws SAXException {

                    System.out.println(new String(ch, start, length));

                    if (bfname) {
                        System.out.println("First Name : "
                                + new String(ch, start, length));
                        bfname = false;
                    }

                    if (blname) {
                        System.out.println("Last Name : "
                                + new String(ch, start, length));
                        blname = false;
                    }

                    if (bnname) {
                        System.out.println("Nick Name : "
                                + new String(ch, start, length));
                        bnname = false;
                    }

                    if (bsalary) {
                        System.out.println("Salary : "
                                + new String(ch, start, length));
                        bsalary = false;
                    }
                }
            };

            // Read the file through a Reader that decodes the bytes as UTF-8
            File file = new File("c:\\file.xml");
            InputStream inputStream = new FileInputStream(file);
            Reader reader = new InputStreamReader(inputStream, "UTF-8");

            InputSource is = new InputSource(reader);
            is.setEncoding("UTF-8");

            saxParser.parse(is, handler);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

About the Author

mkyong
Founder of Mkyong.com, loves Java and open source stuff.

Comments

sax

<![CDATA[http://www.we130_logo.jpg]]>

public void characters(char ch[], int start, int length) throws SAXException {
    String str_tmp = null;
    str_tmp = new String(ch, start, length).trim();
    System.out.println(str_tmp + " " + length);
}

..result= “1” -> str_tmp=null, length=1..
I’d like to print “http://www.we130_logo.jpg”

Manuel G

Thanks a lot. But this doesn't work if an attribute name uses a special char like “ñ”. However, to fix it, just add encoding="utf-8" to your XML file header.
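
As a rough sketch of what the suggestion above looks like in practice: when the XML declaration names the encoding and the parser is handed a raw byte stream (not a Reader), the parser can pick the encoding from the declaration itself. This does not reproduce the failure described in the comment; the class name EncodingDeclarationDemo, the helper firstAttributeName and the attribute name año are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EncodingDeclarationDemo {

    // Parses raw bytes, so the encoding comes from the XML declaration,
    // and returns the name of the first attribute encountered.
    static String firstAttributeName(byte[] xmlBytes) throws Exception {
        final String[] found = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attributes) {
                if (found[0] == null && attributes.getLength() > 0) {
                    found[0] = attributes.getQName(0);
                }
            }
        };
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return found[0];
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document with a non-ASCII attribute name.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<company><staff a\u00F1o=\"2010\"/></company>";
        System.out.println(firstAttributeName(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The declared encoding has to match the bytes actually written, which is why the sketch encodes the string with the same charset it declares.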

Amit Thakur

Hi MKyong,

Issue: I am facing a problem with XML parsing (SAX parser) on a Unix machine. The same JAR/Java code behaves differently on Windows and Unix. Why? :(

Windows machine: works fine. Using the SAX parser to load a huge XML file, it reads all values correctly and populates the same values. Charset.defaultCharset() is windows-1252.

Unix machine: after creating the JAR, deploying it to Tomcat on Unix and executing it, I tried to load the same huge XML file, but noticed that some values or characters are populated empty or incomplete, like a country name populated as “ysia” instead of “Malaysia” or…
S

void is an invalid type for the variable startElement and
http://pastebin.com/iHT9R5dL

Nguyễn Hòa

How can I do this with a URL? I use httpConnection.getInputStream(), which does not take any parameters.

Fernando

Hi, I'm using the same example but I parse a URL:

String strUrl = "http://192.4.4.4:54400/wsrest/xxx/A/CA/1203/1281413";
saxParser.parse(strUrl, handler);

How can I read special characters without errors?

Thanks
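
For the URL case asked about in the two comments above, a minimal sketch: the article's Reader technique works on any InputStream, so a stream obtained from httpConnection.getInputStream() can be wrapped the same way. This assumes the server really sends UTF-8. The class name StreamUtf8Parser and the helper parseUtf8 are illustrative, and an in-memory stream stands in for the network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StreamUtf8Parser {

    // Decodes any InputStream as UTF-8 and feeds it to the SAX parser.
    static void parseUtf8(InputStream in, DefaultHandler handler) throws Exception {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(reader), handler);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a network stream. With a real connection, pass
        // httpConnection.getInputStream() instead of this byte array.
        byte[] body = "<company><nickname>\u00A7</nickname></company>"
            .getBytes(StandardCharsets.UTF_8);

        final StringBuilder text = new StringBuilder();
        parseUtf8(new ByteArrayInputStream(body), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        System.out.println(text);
    }
}
```

The try-with-resources block closes the Reader, which also closes the stream passed in.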

Minu

Another consequence of not using this approach to specify the encoding is that the parser fails with a “content not allowed in prolog” exception. I tried everything that many posts had suggested (remove extra characters, check the encoding within the XML), but that did not help. This approach did. If possible, please include that as a tag so that this post is picked up in searches.

Jaikrat

This doesn’t work for the & operator. Any thoughts?

Thanks
Jai

sandeep

Thanks MKYong, you fixed my issue :)

Sergey

To make the code more efficient, I would advise using else if to avoid unnecessary checks.
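
The suggestion above could look roughly like this. It is a sketch, not the article's code: the flag names are borrowed from the full example, while the class names ElseIfHandlerDemo and FlagHandler and the parse helper are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ElseIfHandlerDemo {

    static class FlagHandler extends DefaultHandler {
        boolean bfname, blname, bnname, bsalary;

        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes attributes) {
            // An element name can match at most one branch, so the chain
            // stops comparing as soon as one condition is true.
            if (qName.equalsIgnoreCase("FIRSTNAME")) {
                bfname = true;
            } else if (qName.equalsIgnoreCase("LASTNAME")) {
                blname = true;
            } else if (qName.equalsIgnoreCase("NICKNAME")) {
                bnname = true;
            } else if (qName.equalsIgnoreCase("SALARY")) {
                bsalary = true;
            }
        }
    }

    // Parses XML held in a String and returns the handler for inspection.
    static FlagHandler parse(String xml) throws Exception {
        FlagHandler handler = new FlagHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        FlagHandler h = parse("<staff><lastname>mook kim</lastname></staff>");
        System.out.println("blname=" + h.blname + ", bsalary=" + h.bsalary);
    }
}
```

With four short string comparisons per element the saving is small, so this is mostly a readability choice.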

Raghavendra

Hi,

How do I read a UTF-8 XML file in Java using DOM?

I’m getting the same error while reading the XML file using the DocumentBuilder class of DOM. Please provide a solution for this!

Thanks
Raghavendra
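
For the DOM question above, one possible approach is to reuse the article's idea: DocumentBuilder also accepts an InputSource, so the stream can be wrapped in a UTF-8 Reader in the same way. This is a sketch under that assumption; the class name ReadXmlUtf8Dom and the helper parseUtf8 are illustrative, and an in-memory stream stands in for the file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ReadXmlUtf8Dom {

    // Decodes the stream as UTF-8 and builds a DOM tree from it.
    static Document parseUtf8(InputStream in) throws Exception {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        // For a file on disk, pass new FileInputStream("c:\\file.xml") instead.
        byte[] xml = "<company><staff><nickname>\u00A7</nickname></staff></company>"
            .getBytes(StandardCharsets.UTF_8);
        Document doc = parseUtf8(new ByteArrayInputStream(xml));
        System.out.println(
            doc.getElementsByTagName("nickname").item(0).getTextContent());
    }
}
```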

Wei

Thank you so much

ELI

Hi, thank you very much. I have a question: when we extend DefaultHandler, is it necessary to implement these functions? Isn't extending it enough?
Thank you.

JavaXML

I have the text
XML_DocName_DateTime.xml at the top and bottom of an XML document I am reading in, and the reader is blowing up on the text. How would I tell the reader to skip over the text?

JAVA SE 6
JAVA EE 6
I believe SAX is the library

Thank You Very Much in advance

s

Hey, can we use the parsed elements and set these values on an object? What I want is for it to separate out every individual staff entry, if there are hundreds of them, and set each in an object which can be used further.

Thanks
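
One way to collect each staff entry into an object, sketched under assumptions: the Staff class, the StaffHandler class and the parse helper below are hypothetical, with field names taken from the element names in the article's XML. Text is gathered in a StringBuilder and assigned when the element closes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StaffListDemo {

    // Hypothetical value object mirroring the article's element names.
    static class Staff {
        String firstname, lastname, nickname, salary;
    }

    static class StaffHandler extends DefaultHandler {
        final List<Staff> staffList = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private Staff current;

        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes attributes) {
            text.setLength(0); // start collecting text for this element
            if (qName.equalsIgnoreCase("staff")) {
                current = new Staff();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called more than once per element, so append.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return; // outside any <staff> element
            }
            String value = text.toString().trim();
            if (qName.equalsIgnoreCase("firstname")) {
                current.firstname = value;
            } else if (qName.equalsIgnoreCase("lastname")) {
                current.lastname = value;
            } else if (qName.equalsIgnoreCase("nickname")) {
                current.nickname = value;
            } else if (qName.equalsIgnoreCase("salary")) {
                current.salary = value;
            } else if (qName.equalsIgnoreCase("staff")) {
                staffList.add(current); // one object per <staff> element
                current = null;
            }
            text.setLength(0);
        }
    }

    static List<Staff> parse(String xml) throws Exception {
        StaffHandler handler = new StaffHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.staffList;
    }

    public static void main(String[] args) throws Exception {
        List<Staff> all = parse("<company>"
            + "<staff><firstname>yong</firstname><salary>100000</salary></staff>"
            + "<staff><firstname>asfd</firstname><salary>200000</salary></staff>"
            + "</company>");
        for (Staff s : all) {
            System.out.println(s.firstname + " earns " + s.salary);
        }
    }
}
```

The second staff entry is simply the second element of the returned list, so no extra flags are needed to tell entries apart.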

deftsite

Nice post, but incomplete, I think. It seems something is missing.

+Erisoft

Maybe you have an idea: these lines of code do not work on J2ME:

File file = new File("c:\\file.xml");
InputStream inputStream = new FileInputStream(file);

sunny

Hi, thanks for the example. My XML is something like this:

yong
mook kim
§
100000

asfd
moasdfa
§
200000

How do I access the second company's information in this code?

Aravind

Do you know how to do this? Please explain it to me also.

Santhosh

Hi Yong Sir,

All your tutorials are really helpful. I am a new Java web developer.

I want to assign this XML data to a bean and then add the bean to a list.

How can I do that?

Please help me.

Thanks in advance.

Santhosh

yunta_gohan

This really works! Thanks for the sample!

Sid

Hi,

Thanks for the code.

One observation: I was not getting the special character correctly while running from Eclipse, but when I made the following modification it started working fine.

Any reason?

From:
Reader reader = new InputStreamReader(inputStream, "UTF-8");

To:
Reader reader = new InputStreamReader(inputStream);

Thanks & Regards – Sid

Mike Stevens

Many thanks for this tip.

The unicode XML in question was in a StringBuffer so I had to modify your example.
It failed until I realised that I have to specify the character set also when creating the InputStream from the StringBuffer:

InputStream inputStream = new ByteArrayInputStream( _xml.toString().getBytes( "UTF-8" ) );

Once I made this minor adaptation, it worked well.
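
When the XML is already held in memory as characters, as in the comment above, an alternative worth considering is to skip the byte conversion altogether and hand the parser a StringReader. This is a sketch of that alternative, not the commenter's code; the class name ParseXmlFromString and the helper textContent are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ParseXmlFromString {

    // Parses XML that is already in memory as characters and returns all
    // of its text content. No byte encoding is involved at any point.
    static String textContent(CharSequence xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        InputSource source = new InputSource(new StringReader(xml.toString()));
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // StringBuffer used here only to mirror the situation in the comment.
        StringBuffer xml = new StringBuffer(
            "<company><nickname>\u00A7</nickname></company>");
        System.out.println(textContent(xml));
    }
}
```

Because the parser receives characters directly, there is no charset to get wrong between the StringBuffer and the parser.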

rman

If I am thinking correctly, you should change your endElement(…) method to set the variables

bfname = false;
blname = false;
bnname = false;
bsalary = false;

otherwise, if you have an empty element (for example <salary></salary>) in your XML, the handler will not process it correctly. In this case endElement is called first, so if you have not set the bsalary variable to false, the characters(…) method will set the value of salary to the characters it sees after the tag, which in this example is white space.

rman

Sorry, the example is supposed to be [salary][/salary].
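
The change proposed in the two comments above might be sketched as follows. It keeps the article's flag approach but resets the flag in endElement, so that white space following an empty element is not taken as its value. The class names ResetFlagInEndElementDemo and SalaryHandler and the helper salaryOf are made up for illustration, and only the salary flag is shown.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ResetFlagInEndElementDemo {

    static class SalaryHandler extends DefaultHandler {
        boolean bsalary = false;
        final StringBuilder salary = new StringBuilder();

        @Override
        public void startElement(String uri, String localName,
                String qName, Attributes attributes) {
            if (qName.equalsIgnoreCase("SALARY")) {
                bsalary = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only text seen while <salary> is open is collected.
            if (bsalary) {
                salary.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Reset here rather than in characters(), which may not be
            // called at all for an empty element.
            if (qName.equalsIgnoreCase("SALARY")) {
                bsalary = false;
            }
        }
    }

    // Returns the collected salary text for the given XML.
    static String salaryOf(String xml) throws Exception {
        SalaryHandler handler = new SalaryHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.salary.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("[" + salaryOf("<staff><salary></salary>\n</staff>") + "]");
        System.out.println("[" + salaryOf("<staff><salary>100000</salary></staff>") + "]");
    }
}
```

The sketch resets only the flag of the element being closed, which has the same effect as resetting all flags when elements are not nested inside each other.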
