PDFBox – How to read PDF file in Java

This article shows you how to use Apache PDFBox to read a PDF file in Java.

1. Get PDFBox

pom.xml

<dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.6</version>
</dependency>

2. Print PDF file

This example extracts all text from a PDF file and prints it line by line.

ReadPdf.java

package com.mkyong;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class ReadPdf {

    public static void main(String[] args) throws IOException {

        // try-with-resources closes the document automatically
        try (PDDocument document = PDDocument.load(new File("/path-to/abc.pdf"))) {

            // skip documents that are encrypted
            if (!document.isEncrypted()) {

                PDFTextStripper tStripper = new PDFTextStripper();
                // order the text by its position on the page
                tStripper.setSortByPosition(true);

                String pdfFileInText = tStripper.getText(document);

                // split by line breaks (Unix or Windows style)
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }

            }

        }

    }
}
Note
See the PDFBox svn repository for more examples.
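
The split("\\r?\\n") call in ReadPdf.java can be tried on its own, without a PDF file. The sketch below is only an illustration: the class name LineSplitDemo, the helper splitLines, and the sample string are hypothetical stand-ins for the text that PDFTextStripper returns. It shows that the pattern handles both Windows (\r\n) and Unix (\n) line endings.

```java
class LineSplitDemo {

    // Splits text on Unix (\n) and Windows (\r\n) line endings,
    // using the same pattern that ReadPdf.java applies to the extracted text.
    static String[] splitLines(String text) {
        return text.split("\\r?\\n");
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the String returned by PDFTextStripper.getText()
        String sample = "Page title\r\nfirst line\nsecond line\n";

        for (String line : splitLines(sample)) {
            System.out.println(line);
        }
    }
}
```

Note that String.split drops trailing empty strings, so a line break at the very end of the text does not produce an extra empty line.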

About the Author

mkyong
Founder of Mkyong.com; loves Java and open source stuff.

Comments

Ankita Tapadia

Hi Mkyong. I am getting the following error with the same example. Could you please guide me toward a resolution?

org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script1.groovy: 10: unable to resolve class org.apache.pdfbox.text.PDFTextStripper
@ line 10, column 3.
import org.apache.pdfbox.text.PDFTextStripper;
^

1 error

at org.webharvest.runtime.scripting.GroovyScriptEngine.eval(GroovyScriptEngine.java:138) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.processors.ScriptProcessor.execute(ScriptProcessor.java:74) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:127) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.Scraper.execute(Scraper.java:169) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.Scraper.execute(Scraper.java:182) ~[workfusion-webharvest-core.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.StudioWebHarvestTaskExecutor.execute(StudioWebHarvestTaskExecutor.java:108) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.SingleThreadWebHarvestProcess.processTaskInputs(SingleThreadWebHarvestProcess.java:75) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.SingleThreadWebHarvestProcess.start(SingleThreadWebHarvestProcess.java:44) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.WebHarvestMainLauncher.launch(WebHarvestMainLauncher.java:83) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.WebHarvestMainLauncher.main(WebHarvestMainLauncher.java:141) ~[com.workfusion.studio.wf_8.4.0.jar:na]
Caused by: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:

SHANKAR

Hi, how can I read a non-editable, image-based PDF in Java without installing additional software?

Mohamad Basuki

Hi Bro, how do I read a multipage PDF? Thanks :)

Sid

Hi Mkyong, I have to convert a PDF file to HTML, and for this I need Java code that fetches the formatting of the PDF along with the text, for example tables, images, and forms. Please guide me.
Thanks.

Anonymous

https://www.baeldung.com/pdf-conversions-java

Refer to this one; it might be useful for you.

Manmaya

Hi,
Thanks for posting this. Is there any way to determine the font name and size used in a particular line of text?

kumar

Thanks for the help

Smarty

Yes, it's really worthy.

Shafqat Shafi

Thanks a lot for such a neat and informative article.

srinivas

How can I get the font style for each line in a PDF using PDFBox?