How to read a UTF-8 XML file in Java (SAX Parser)

In the previous Java SAX XML example, using SAX to parse a plain-text (ANSI) XML file works without problems. However, if you parse an XML file that contains special UTF-8 characters, the parser throws an “Invalid byte 1 of 1-byte UTF-8 sequence” exception:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: 
Invalid byte 1 of 1-byte UTF-8 sequence.

See the following XML file, which contains the special UTF-8 character “§” (press Alt + 789):

<?xml version="1.0"?>
<company>
	<staff>
		<firstname>yong</firstname>
		<lastname>mook kim</lastname>
		<nickname>§</nickname>
		<salary>100000</salary>
	</staff>
</company>
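
The error can be reproduced without a file on disk. The sketch below is only an illustration, not the original test case: it assumes the file was saved in a single-byte encoding such as ISO-8859-1, where “§” is the single byte 0xA7. Because the XML declaration names no encoding, the parser treats the bytes as UTF-8 and rejects that byte. The class name EncodingMismatchDemo and the helper parses are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the name and helper are illustrative only.
class EncodingMismatchDemo {

    // Same shape as the sample file: no encoding in the XML declaration.
    // \u00A7 is the section sign used as the nickname.
    static final String XML =
        "<?xml version=\"1.0\"?><company><staff>"
        + "<nickname>\u00A7</nickname></staff></company>";

    // Encodes XML with the given charset, hands the raw bytes to the parser,
    // and reports whether the document parsed without an error.
    static boolean parses(Charset savedAs) {
        byte[] bytes = XML.getBytes(savedAs);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return true;
        } catch (SAXException | IOException e) {
            // A byte sequence that is not valid UTF-8 ends up here.
            System.out.println(savedAs + " failed: " + e.getMessage());
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No SAX parser available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 bytes parse: " + parses(StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 bytes parse: " + parses(StandardCharsets.ISO_8859_1));
    }
}
```

If the file on disk really is UTF-8, the first call shows that the parser accepts it as-is, so the exception usually points to a mismatch between how the file was saved and how it is being read.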

To fix it, read the file through a Reader that decodes it as UTF-8, and pass that Reader to the parser in an InputSource, like this:

File file = new File("c:\\file-utf.xml");
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
 
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
 
saxParser.parse(is, handler);
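
The same idea can be tried in memory. This is a minimal sketch under a few assumptions: the class ReaderInputSourceSketch and the method readNickname are hypothetical names, and the XML comes from a byte array instead of a file. When an InputSource carries a character stream, the parser reads it directly and setEncoding has no effect, so the sketch leaves that call out. The charset given to InputStreamReader has to match the encoding the bytes were actually written in.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class ReaderInputSourceSketch {

    // Decodes xmlBytes with the given charset and returns the text of <nickname>.
    static String readNickname(byte[] xmlBytes, Charset charset) {
        final StringBuilder nickname = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inNickname;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                inNickname = "nickname".equals(qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                inNickname = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so append instead of assigning.
                if (inNickname) {
                    nickname.append(ch, start, length);
                }
            }
        };
        // The Reader does the byte-to-character decoding, not the parser.
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xmlBytes), charset)) {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(reader), handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return nickname.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><company><staff>"
            + "<nickname>\u00A7</nickname></staff></company>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        String nickname = readNickname(utf8, StandardCharsets.UTF_8);
        System.out.println("Nickname read correctly: " + nickname.equals("\u00A7"));
    }
}
```

One thing to keep in mind with this approach: if the charset does not match the file, InputStreamReader does not throw. It silently substitutes the replacement character U+FFFD for bytes it cannot decode, so the exception goes away but the character is still wrong.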

Here is a full example that uses the SAX parser to parse a Unicode XML file:

package com.mkyong.test;
 
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
 
public class ReadXMLUTF8FileSAX 
{
    public static void main( String[] args )
    {
    	try {
 
    	      SAXParserFactory factory = SAXParserFactory.newInstance();
    	      SAXParser saxParser = factory.newSAXParser();
 
    	      DefaultHandler handler = new DefaultHandler() {
 
    	        boolean bfname = false;
    	        boolean blname = false;
    	        boolean bnname = false;
    	        boolean bsalary = false;
 
    	        public void startElement(String uri, String localName,
    	            String qName, Attributes attributes)
    	            throws SAXException {
 
    	          System.out.println("Start Element :" + qName);
 
    	          if (qName.equalsIgnoreCase("FIRSTNAME")) {
    	        	  bfname = true;
    	          }
 
    	          if (qName.equalsIgnoreCase("LASTNAME")) {
    	        	  blname = true;
    	          }
 
    	          if (qName.equalsIgnoreCase("NICKNAME")) {
    	        	  bnname = true;
    	          }
 
    	          if (qName.equalsIgnoreCase("SALARY")) {
    	        	  bsalary = true;
    	          }
 
    	        }
 
    	        public void endElement(String uri, String localName,
    	                String qName)
    	                throws SAXException {
 
    	              System.out.println("End Element :" + qName);
 
    	        }
 
    	        public void characters(char ch[], int start, int length)
    	            throws SAXException {
 
    	          System.out.println(new String(ch, start, length));
 
 
    	          if (bfname) {
    	            System.out.println("First Name : "
    	                + new String(ch, start, length));
    	            bfname = false;
    	          }
 
    	          if (blname) {
    	              System.out.println("Last Name : "
    	                  + new String(ch, start, length));
    	              blname = false;
    	           }
 
    	          if (bnname) {
    	              System.out.println("Nick Name : "
    	                  + new String(ch, start, length));
    	              bnname = false;
    	           }
 
    	          if (bsalary) {
    	              System.out.println("Salary : "
    	                  + new String(ch, start, length));
    	              bsalary = false;
    	           }
 
    	        }
 
    	      };
 
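    	      File file = new File("c:\\file.xml");
 
    	      // try-with-resources closes the stream and the reader after parsing
    	      try (InputStream inputStream = new FileInputStream(file);
    	           Reader reader = new InputStreamReader(inputStream, "UTF-8")) {
 
    	          // the parser reads characters from the Reader,
    	          // so the file is always decoded as UTF-8
    	          InputSource is = new InputSource(reader);
    	          is.setEncoding("UTF-8");
 
    	          saxParser.parse(is, handler);
    	      }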
 
 
    	    } catch (Exception e) {
    	      e.printStackTrace();
    	    }
 
    }
}

About the Author

mkyong
Founder of Mkyong.com and HostingCompass.com; loves Java and open source stuff.

Comments

  • Jaikrat

    This doesn’t work for the & character. Any thoughts?

    Thanks
    Jai

  • sandeep

    Thanks MKYong, you fixed my issue :)

  • Sergey

    To make the code more efficient, I would advise using else if to avoid unnecessary checks.

  • Raghavendra

    Hi,

    How do I read a UTF-8 XML file in Java using DOM?

    I’m getting the same error while reading the XML file with the DocumentBuilder class of DOM. Please provide a solution for this!

    Thanks
    Raghavendra

  • Wei

    Thank you so much

  • ELI

    Hi, thank you very much. I have a question: when we extend DefaultHandler, is it necessary to implement these methods? Is extending it not enough?
    Thank you.

  • JavaXML

    I have this text:
    XML_DocName_DateTime.xml at the top and bottom of an XML document I am reading in, and the reader is blowing up on the text. How would I tell the reader to skip over the text?

    JAVA SE 6
    JAVA EE 6
    I believe SAX is the library

    Thank You Very Much in advance

  • s

    Hey, can we use the parsed elements and set these values on an object? What I want is to separate out every individual staff entry, even if there are hundreds of them, and store each one in an object that can be used further.

    Thanks

  • http://www.deftsite.com deftsite

    Nice post, but I think it is incomplete. Something seems to be missing.

  • http://www.teamredaf.com +Erisoft

    Maybe you have an idea why these lines of code do not work on J2ME:

    File file = new File("c:\\file.xml");
    InputStream inputStream = new FileInputStream(file);

  • sunny

    Hi, thanks for the example… My XML is something like this…

    yong
    mook kim
    §
    100000

    asfd
    moasdfa
    §
    200000

    How do I access the second company’s information in this code?

    • Aravind

      Do you know how to do this? Please explain it to me as well.

  • Santhosh

    Hi Yong Sir,

    All your tutorials are really helpful. I am a new Java web developer.

    I want to assign this XML data to a bean and then add the bean to a list.

    How can I do that?

    Please help me.

    Thanks in advance.


    Santhosh

  • yunta_gohan

    This really works! Thanks for the sample!

    • http://www.mkyong.com mkyong

      :), of course it’s working

  • Sid

    Hi,

    Thanks for the code.

    One observation: I was not getting the special character correctly while running from Eclipse, but when I made the following modification it started working fine.

    Any reason?

    From:
    Reader reader = new InputStreamReader(inputStream,"UTF-8");

    To:
    Reader reader = new InputStreamReader(inputStream);

    Thanks & Regards – Sid

  • sax
    <![CDATA[http://www.we130_logo.jpg]]>

    public void characters(char ch[], int start, int length) throws SAXException {
        String str_tmp = null;
        str_tmp = new String(ch, start, length).trim();
        System.out.println(str_tmp + " " + length);
    }

    ..result= “1” -> str_tmp=null, length=1..
    I’d like to print “http://www.we130_logo.jpg”

    • http://www.mkyong.com mkyong

      This question is unrelated; this comment will be deleted after one day. Please post your question on javanullpointer.com.

  • Mike Stevens

    Many thanks for this tip.

    The unicode XML in question was in a StringBuffer so I had to modify your example.
    It failed until I realised that I have to specify the character set also when creating the InputStream from the StringBuffer:

    InputStream inputStream = new ByteArrayInputStream( _xml.toString().getBytes( "UTF-8" ) );

    Once I made this minor adaptation, it worked well.
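
    The adaptation described in this comment might look like the sketch below. The class InMemoryXmlSketch and its methods are hypothetical names, and the StringReader variant is an alternative added here, not something the comment mentions. Naming UTF-8 on both the encoding and the decoding side keeps the platform default charset out of the picture.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class InMemoryXmlSketch {

    // Encode to bytes and decode again, naming UTF-8 on both sides.
    static InputSource viaBytes(StringBuffer xml) {
        InputStream in = new ByteArrayInputStream(
            xml.toString().getBytes(StandardCharsets.UTF_8));
        return new InputSource(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Alternative: skip the byte round trip, since the text is already characters.
    static InputSource viaStringReader(StringBuffer xml) {
        return new InputSource(new StringReader(xml.toString()));
    }

    // Parses the source and returns all character data it contains.
    static String textOf(InputSource source) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        StringBuffer xml = new StringBuffer("<nickname>\u00A7</nickname>");
        System.out.println(textOf(viaBytes(xml)).equals("\u00A7"));
        System.out.println(textOf(viaStringReader(xml)).equals("\u00A7"));
    }
}
```

    When the XML is already held as text, the StringReader route avoids encoding questions altogether, because no bytes are involved.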

    • http://www.mkyong.com mkyong

      Thanks for your additional inputs.

      • sp

        Hi mkyong,

        I failed to understand how to handle elements like
        or
        with SAX parsers. Can you please explain that?

        Thanks in advance
        sp

        • sp

          Hi mkyong,

          I failed to understand how to handle empty tags like or
          with SAX parsers. Can you please explain that?

          Thanks in advance
          sp

  • rman

    If I am thinking correctly, you should change your endElement(…) method to reset the variables:

    bfname = false;
    blname = false;
    bnname = false;
    bsalary = false;

    Otherwise, if you have an empty element (for example ) in your XML, the parser will not handle it correctly. In that case endElement(…) is called before any characters(…) call for that element, so if the bsalary variable has not been set to false, the next characters(…) call will set the salary to whatever comes after the tag, which in this example is white space.

    • rman

      Sorry, the example is supposed to be [salary][/salary].
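
      The suggestion in this thread can be sketched roughly as follows. It is a simplified, hypothetical variant of the article's handler (the class SalaryHandlerSketch and the method salaries are made-up names) that tracks only the salary element. On top of resetting the flag in endElement, it collects text in a StringBuilder, which is an addition of this sketch, because characters(…) may deliver one element's text in several chunks.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class SalaryHandlerSketch {

    // Returns the text of every <salary> element, in document order.
    static List<String> salaries(String xml) {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean bsalary = false;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (qName.equalsIgnoreCase("SALARY")) {
                    bsalary = true;
                    text.setLength(0); // start collecting from scratch
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (bsalary) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equalsIgnoreCase("SALARY")) {
                    found.add(text.toString());
                    bsalary = false; // reset here, so later text is not mistaken for a salary
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<company><staff><salary></salary>\n</staff>"
            + "<staff><salary>100000</salary></staff></company>";
        System.out.println(salaries(xml));
    }
}
```

      With the flag reset in endElement, the empty salary element is recorded as an empty string, and the line break that follows it is ignored.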