How to read UTF-8 XML file in Java – (SAX Parser)
In the previous Java SAX XML example, SAX parses a plain-text (ANSI) XML file without any problem. However, if you parse an XML file that contains special UTF-8 characters, the parser throws an “Invalid byte 1 of 1-byte UTF-8 sequence” exception:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
See the following XML file, which contains the special UTF-8 character “§” (press Alt + 789):

    <?xml version="1.0"?>
    <company>
        <staff>
            <firstname>yong</firstname>
            <lastname>mook kim</lastname>
            <nickname>§</nickname>
            <salary>100000</salary>
        </staff>
    </company>
To fix it, just override the SAX input source like this:

    File file = new File("c:\\file-utf.xml");
    InputStream inputStream = new FileInputStream(file);
    Reader reader = new InputStreamReader(inputStream, "UTF-8");

    InputSource is = new InputSource(reader);
    is.setEncoding("UTF-8");

    saxParser.parse(is, handler);
Here is a full example that uses the SAX parser to parse a Unicode XML file.
    package com.mkyong.test;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;

    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class ReadXMLUTF8FileSAX {

        public static void main(String[] args) {

            try {

                SAXParserFactory factory = SAXParserFactory.newInstance();
                SAXParser saxParser = factory.newSAXParser();

                DefaultHandler handler = new DefaultHandler() {

                    boolean bfname = false;
                    boolean blname = false;
                    boolean bnname = false;
                    boolean bsalary = false;

                    public void startElement(String uri, String localName,
                            String qName, Attributes attributes) throws SAXException {

                        System.out.println("Start Element :" + qName);

                        if (qName.equalsIgnoreCase("FIRSTNAME")) {
                            bfname = true;
                        }

                        if (qName.equalsIgnoreCase("LASTNAME")) {
                            blname = true;
                        }

                        if (qName.equalsIgnoreCase("NICKNAME")) {
                            bnname = true;
                        }

                        if (qName.equalsIgnoreCase("SALARY")) {
                            bsalary = true;
                        }
                    }

                    public void endElement(String uri, String localName,
                            String qName) throws SAXException {

                        System.out.println("End Element :" + qName);
                    }

                    public void characters(char ch[], int start, int length)
                            throws SAXException {

                        System.out.println(new String(ch, start, length));

                        if (bfname) {
                            System.out.println("First Name : " + new String(ch, start, length));
                            bfname = false;
                        }

                        if (blname) {
                            System.out.println("Last Name : " + new String(ch, start, length));
                            blname = false;
                        }

                        if (bnname) {
                            System.out.println("Nick Name : " + new String(ch, start, length));
                            bnname = false;
                        }

                        if (bsalary) {
                            System.out.println("Salary : " + new String(ch, start, length));
                            bsalary = false;
                        }
                    }
                };

                File file = new File("c:\\file.xml");
                InputStream inputStream = new FileInputStream(file);
                Reader reader = new InputStreamReader(inputStream, "UTF-8");

                InputSource is = new InputSource(reader);
                is.setEncoding("UTF-8");

                saxParser.parse(is, handler);

            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
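The same fix can also be written more compactly. The following is only a sketch, assuming Java 7 or later: it uses `StandardCharsets.UTF_8` instead of the `"UTF-8"` string and a try-with-resources block to close the reader. The class name `Utf8ReaderSketch`, the method `readNickname` and the in-memory byte array (standing in for the file on disk) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original example.
class Utf8ReaderSketch {

    // Decodes the stream as UTF-8 and returns the text of the first <nickname> element.
    static String readNickname(InputStream in) throws Exception {
        final StringBuilder nickname = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inNickname = false;
            private boolean done = false;

            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attributes) {
                if (!done && qName.equals("nickname")) {
                    inNickname = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inNickname) {
                    nickname.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (inNickname) {
                    inNickname = false;
                    done = true;
                }
            }
        };

        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();

        // The reader, not the parser, decides how bytes become characters.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            saxParser.parse(new InputSource(reader), handler);
        }
        return nickname.toString();
    }

    public static void main(String[] args) throws Exception {
        // \u00A7 is the section sign; the bytes stand in for a UTF-8 file.
        String xml = "<?xml version=\"1.0\"?><company><staff>"
                + "<nickname>\u00A7</nickname></staff></company>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(readNickname(new ByteArrayInputStream(utf8)));
    }
}
```

To read a real file, pass a `FileInputStream` to `readNickname` instead of the byte array.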
Maybe you have an idea: these lines of code do not work on J2ME:
File file = new File("c:\\file.xml");
InputStream inputStream = new FileInputStream(file);
Hi, thanks for the example. My XML is something like this:
yong
mook kim
§
100000
asfd
moasdfa
§
200000
How do I access the second company's information in this code?
Hi Yong Sir,
All your tutorials are really helpful. I am a new Java web developer.
I want to assign this XML data to a bean and then add the bean to a list.
How can I do that?
Please help me.
Thanks in advance.
–
Santhosh
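One possible way to handle several records and collect them as beans in a list is sketched below. It assumes the XML repeats the `<staff>` element inside `<company>`, which is a guess, since the tags in the comment above were lost. The names `Staff`, `StaffHandler` and `StaffListDemo` are hypothetical. Instead of boolean flags, the handler buffers the text and assigns it when the element closes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical bean; the fields mirror the element names used in the article.
class Staff {
    String firstName;
    String lastName;
    String nickName;
    String salary;
}

// Builds one Staff bean per <staff> element (assumed layout).
class StaffHandler extends DefaultHandler {

    final List<Staff> staffList = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private Staff current; // the <staff> being read, or null outside one

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) {
        text.setLength(0); // start collecting text for this element
        if (qName.equals("staff")) {
            current = new Staff();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        String value = text.toString().trim();
        switch (qName) {
            case "firstname": current.firstName = value; break;
            case "lastname":  current.lastName = value;  break;
            case "nickname":  current.nickName = value;  break;
            case "salary":    current.salary = value;    break;
            case "staff":     staffList.add(current); current = null; break;
            default: break;
        }
    }
}

class StaffListDemo {

    static List<Staff> parse(String xml) throws Exception {
        StaffHandler handler = new StaffHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.staffList;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<company>"
                + "<staff><firstname>yong</firstname><lastname>mook kim</lastname>"
                + "<nickname>\u00A7</nickname><salary>100000</salary></staff>"
                + "<staff><firstname>asfd</firstname><lastname>moasdfa</lastname>"
                + "<nickname>\u00A7</nickname><salary>200000</salary></staff>"
                + "</company>";
        for (Staff s : parse(xml)) {
            System.out.println(s.firstName + " " + s.lastName + " : " + s.salary);
        }
    }
}
```

The second record is then simply the element at index 1 of the returned list.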
This really works! Thanks for the sample!
Hi,
Thanks for the code.
One observation: I was not getting the special character correctly when running from Eclipse, but after the following modification it started working fine.
Any reason?
From:
Reader reader = new InputStreamReader(inputStream, "UTF-8");
To:
Reader reader = new InputStreamReader(inputStream);
Thanks & Regards – Sid
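A likely explanation, though only a guess without seeing the file, is that the XML file was saved in a single-byte encoding such as the platform default rather than in UTF-8. `new InputStreamReader(inputStream)` without a charset decodes with the JVM default charset (the platform encoding on older JDKs, UTF-8 by default since JDK 18), so it happens to match such a file. The sketch below shows how the same “§” is stored differently in UTF-8 and ISO-8859-1, and what happens when the single-byte form is decoded as UTF-8. The class name `SectionSignBytes` is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing how the section sign is encoded.
class SectionSignBytes {

    // Formats bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String sign = "\u00A7"; // the section sign

        // Two bytes in UTF-8, one byte in ISO-8859-1.
        System.out.println(hex(sign.getBytes(StandardCharsets.UTF_8)));      // C2 A7
        System.out.println(hex(sign.getBytes(StandardCharsets.ISO_8859_1))); // A7

        // A lone A7 byte is not valid UTF-8; String decoding replaces it
        // with U+FFFD, while a strict decoder would reject it.
        byte[] singleByte = sign.getBytes(StandardCharsets.ISO_8859_1);
        String asUtf8 = new String(singleByte, StandardCharsets.UTF_8);
        System.out.println(asUtf8.equals("\uFFFD")); // true

        // Decoding with the charset the bytes were written in round-trips.
        String asLatin1 = new String(singleByte, StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1.equals(sign)); // true
    }
}
```

In short, the charset given to `InputStreamReader` should match the encoding the file was actually saved in.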
..result= “1″ ->str_tmp=null, lenght=1..
I’d like to print “http://www.we130_logo.jpg”
Unrelated question; this comment will be deleted after 1 day. Please post your question on javanullpointer.com.
Many thanks for this tip.
The unicode XML in question was in a StringBuffer so I had to modify your example.
It failed until I realised that I have to specify the character set also when creating the InputStream from the StringBuffer:
Once I made this minor adaptation, it worked well.
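The adaptation described in this comment is not shown, so the following is only a sketch of what it might look like, with hypothetical names (`StringBufferSource`, `fromBytes`, `fromChars`, `textOf`). `fromBytes` names the charset when turning the `StringBuffer` into bytes, because `getBytes()` without an argument uses the JVM default charset. `fromChars` skips the byte conversion entirely by wrapping the text in a `StringReader`.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not taken from the comment above.
class StringBufferSource {

    // Encodes the text as UTF-8 explicitly and tells the parser so.
    static InputSource fromBytes(StringBuffer xml) {
        byte[] utf8 = xml.toString().getBytes(StandardCharsets.UTF_8);
        InputSource is = new InputSource(new ByteArrayInputStream(utf8));
        is.setEncoding("UTF-8");
        return is;
    }

    // Hands the characters to the parser directly; no byte encoding involved.
    static InputSource fromChars(StringBuffer xml) {
        return new InputSource(new StringReader(xml.toString()));
    }

    // Parses the source and returns all character data, to check the result.
    static String textOf(InputSource source) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        StringBuffer xml = new StringBuffer("<nickname>\u00A7</nickname>");
        System.out.println(textOf(fromBytes(xml)));
        System.out.println(textOf(fromChars(xml)));
    }
}
```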
Thanks for your additional inputs.
If I am thinking correctly, you should change your endElement(…) method to set the variables
bfname = false;
blname = false;
bnname = false;
bsalary = false;
otherwise, if you have an empty element (for example ) in your XML, the parser will not parse the XML correctly. In this case endElement(…) will be called first, so if you have not set the bsalary variable to false, the characters(…) method will set the value of salary to the characters it sees after the tag, which in this example is white space.
Sorry, the example is supposed to be [salary][/salary]
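The suggestion can be sketched as follows, reduced to the salary field only. This is an illustration with hypothetical names (`SalaryHandler`, `SalaryDemo`), not the article's code. The flag is set in `startElement` and reset in `endElement`, and the text is buffered in between, so an empty `<salary></salary>` yields an empty string instead of the white space that follows it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that tracks only <salary>, for brevity.
class SalaryHandler extends DefaultHandler {

    final List<String> salaries = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean bsalary = false;

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) {
        if (qName.equalsIgnoreCase("SALARY")) {
            bsalary = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only text between <salary> and </salary> is collected.
        if (bsalary) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bsalary && qName.equalsIgnoreCase("SALARY")) {
            salaries.add(text.toString());
            bsalary = false; // reset here, not in characters()
        }
    }
}

class SalaryDemo {

    static List<String> parse(String xml) throws Exception {
        SalaryHandler handler = new SalaryHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.salaries;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<staff><salary></salary>\n<salary>100000</salary></staff>";
        System.out.println(parse(xml)); // [, 100000]
    }
}
```

Buffering also helps when the parser delivers one element's text in several `characters(…)` calls, which SAX allows.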