PDFBox – How to read PDF file in Java

By mkyong | Updated: July 24, 2017

This article shows you how to use Apache PDFBox to read a PDF file in Java.

1. Get PDFBox

pom.xml


<dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.6</version>
</dependency>

2. Print PDF file

Example to extract all text from a PDF file.

ReadPdf.java


package com.mkyong;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class ReadPdf {

    public static void main(String[] args) throws IOException {

        // try-with-resources closes the document automatically
        try (PDDocument document = PDDocument.load(new File("/path-to/abc.pdf"))) {

            // encrypted documents are skipped
            if (!document.isEncrypted()) {

                PDFTextStripper tStripper = new PDFTextStripper();

                // extract the text of all pages as a single string
                String pdfFileInText = tStripper.getText(document);

                // split by line break (\n or \r\n) and print line by line
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }

            }

        }

    }
}
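
The line-splitting step in the example does not depend on PDFBox, so it can be tried on its own with a plain string. Below is a minimal sketch, assuming you also want to drop blank lines, which the example above does not do. The class name SplitLinesDemo and the helper splitLines are hypothetical names used for illustration; they are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

class SplitLinesDemo {

    // Hypothetical helper (not a PDFBox API): splits text on \n or \r\n
    // and skips lines that are empty or contain only whitespace.
    public static List<String> splitLines(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            if (!line.trim().isEmpty()) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText(document)
        String sample = "First line\r\nSecond line\n\nThird line\n";
        for (String line : splitLines(sample)) {
            System.out.println(line);
        }
    }
}
```

Run as is, this prints the three non-blank lines of the sample string. In the PDF example, the same helper could be applied to pdfFileInText instead of the sample.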

Note
Refer to the PDFBox SVN repository for more examples.

Comments

13 Comments

Ankita Tapadia

6 years ago

Hi Mkyong. I am getting the following error with the same code. Could you please guide me to a resolution?

org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script1.groovy: 10: unable to resolve class org.apache.pdfbox.text.PDFTextStripper
@ line 10, column 3.
import org.apache.pdfbox.text.PDFTextStripper;
^

1 error

at org.webharvest.runtime.scripting.GroovyScriptEngine.eval(GroovyScriptEngine.java:138) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.processors.ScriptProcessor.execute(ScriptProcessor.java:74) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:127) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.Scraper.execute(Scraper.java:169) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.Scraper.execute(Scraper.java:182) ~[workfusion-webharvest-core.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.StudioWebHarvestTaskExecutor.execute(StudioWebHarvestTaskExecutor.java:108) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.SingleThreadWebHarvestProcess.processTaskInputs(SingleThreadWebHarvestProcess.java:75) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.SingleThreadWebHarvestProcess.start(SingleThreadWebHarvestProcess.java:44) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.WebHarvestMainLauncher.launch(WebHarvestMainLauncher.java:83) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.WebHarvestMainLauncher.main(WebHarvestMainLauncher.java:141) ~[com.workfusion.studio.wf_8.4.0.jar:na]
Caused by: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:

sankar

2 years ago

I am unable to import the load method. Please help me with this.

Manmaya

4 years ago

Hi,
Thanks for posting this. Is there any way to determine the font name and size for a particular line of text?

goyal

2 years ago

Getting these exceptions:

at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:89)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:933)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:515)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:489)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:144)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:394)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:322)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:269)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:233)
at PdfToConsole.main(PdfToConsole.java:35)
Caused by: java.lang.ClassNotFoundException: org.apache.fontbox.FontBoxFont
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)

Rituja

4 years ago

I can’t read a column-wise PDF document. It reads the data row-wise. What do we have to do to read the data column-wise?

SHANKAR

4 years ago

Hi, how can I read a non-editable image PDF in Java without installing software?

Mohamad Basuki

4 years ago

Hi Bro, how do I read a multipage PDF? Thanks 🙂

Sid

4 years ago

Hi Mkyong, I have to convert a PDF file to HTML, and for this I need Java code that fetches the formatting of the PDF along with the text, for example tables, images, and forms. Please guide me.
Thanks.

Anonymous

4 years ago

Reply to Sid

https://www.baeldung.com/pdf-conversions-java

Refer to this one; it might be useful for you.

kumar

4 years ago

Thanks for the help

srinivas

6 years ago

How can I get the font style for each line in a PDF using PDFBox?

Smarty

6 years ago

Yes, it's really worth it.

Shafqat Shafi

5 years ago

Thanks a lot for such a neat and informative article.
