PDFBox – How to read PDF file in Java
This article shows you how to use Apache PDFBox to read a PDF file in Java.
1. Get PDFBox
pom.xml
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.6</version>
</dependency>
2. Print PDF file
The following example extracts all text from a PDF file and prints it line by line.
ReadPdf.java
package com.mkyong;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class ReadPdf {

    public static void main(String[] args) throws IOException {

        // try-with-resources closes the document when we are done
        try (PDDocument document = PDDocument.load(new File("/path-to/abc.pdf"))) {

            // this example skips encrypted documents
            if (!document.isEncrypted()) {

                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);

                // split by line break
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }
            }
        }
    }
}
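The text returned by getText often contains blank lines and stray leading or trailing spaces. The sketch below is one possible way to clean it up before printing. The class name ExtractedTextLines, the method nonBlankLines, and the sample string are made up for illustration and are not part of PDFBox. The sample string merely stands in for the value of pdfFileInText.

```java
import java.util.ArrayList;
import java.util.List;

public class ExtractedTextLines {

    // Splits text on any line terminator (\r\n, \n or \r), trims each line,
    // and drops lines that are empty after trimming.
    public static List<String> nonBlankLines(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the string returned by PDFTextStripper.getText(document)
        String pdfFileInText = "Invoice 42\r\n\r\n  Total: 10.00  \nThank you\n";
        for (String line : nonBlankLines(pdfFileInText)) {
            System.out.println(line);
        }
    }
}
```

The \R pattern (Java 8 and later) matches any line break sequence, so the same helper should work whichever line separator the extracted text uses.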
Note
Please refer to the PDFBox SVN repository for more examples.
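As one further illustration, the text can also be read page by page rather than all at once. This is a minimal sketch, assuming the PDFBox 2.0.x API declared in the pom.xml above (in PDFBox 3.x, loading moved to Loader.loadPDF). The class name ReadPdfByPage and the helper textOfPage are hypothetical names chosen for this example, and the file path is the same placeholder used earlier.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class ReadPdfByPage {

    // Returns the text of a single page. Page numbers are 1-based,
    // which is what setStartPage and setEndPage expect.
    public static String textOfPage(PDDocument document, int pageNumber) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(pageNumber);
        stripper.setEndPage(pageNumber);
        return stripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; pass a real file as the first argument
        String path = args.length > 0 ? args[0] : "/path-to/abc.pdf";

        try (PDDocument document = PDDocument.load(new File(path))) {
            for (int page = 1; page <= document.getNumberOfPages(); page++) {
                System.out.println("--- page " + page + " ---");
                System.out.println(textOfPage(document, page));
            }
        }
    }
}
```

Restricting the stripper to one page at a time keeps each page's text separate, which can help when only part of a long document is needed.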
Hi Mkyong, I am getting the following error when running the same code. Could you please guide me to a resolution?
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script1.groovy: 10: unable to resolve class org.apache.pdfbox.text.PDFTextStripper
@ line 10, column 3.
import org.apache.pdfbox.text.PDFTextStripper;
^
1 error
at org.webharvest.runtime.scripting.GroovyScriptEngine.eval(GroovyScriptEngine.java:138) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.processors.ScriptProcessor.execute(ScriptProcessor.java:74) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.processors.BaseProcessor.run(BaseProcessor.java:127) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.Scraper.execute(Scraper.java:169) ~[workfusion-webharvest-core.jar:na]
at org.webharvest.runtime.Scraper.execute(Scraper.java:182) ~[workfusion-webharvest-core.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.StudioWebHarvestTaskExecutor.execute(StudioWebHarvestTaskExecutor.java:108) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.SingleThreadWebHarvestProcess.processTaskInputs(SingleThreadWebHarvestProcess.java:75) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.SingleThreadWebHarvestProcess.start(SingleThreadWebHarvestProcess.java:44) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.WebHarvestMainLauncher.launch(WebHarvestMainLauncher.java:83) ~[com.workfusion.studio.wf_8.4.0.jar:na]
at com.workfusion.studio.rpa.wf.webharvest.launch.WebHarvestMainLauncher.main(WebHarvestMainLauncher.java:141) ~[com.workfusion.studio.wf_8.4.0.jar:na]
Caused by: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
I am unable to import the load method. Please help me with this.
Hi,
Thanks for posting this. Is there any way to determine the font name and size for a particular line of text?
I am getting these exceptions:
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:89)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:933)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:515)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:489)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:144)
at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:394)
at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:322)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:269)
at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:233)
at PdfToConsole.main(PdfToConsole.java:35)
Caused by: java.lang.ClassNotFoundException: org.apache.fontbox.FontBoxFont
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
I can't read a PDF document column-wise; it reads the data row-wise. What do we have to do to read the data column-wise?
Hi, how can I read a non-editable, image-only PDF in Java without installing additional software?
Hi Bro, how do I read a multi-page PDF? Thanks 🙂
Hi Mkyong, I have to convert a PDF file to HTML, and for this I need Java code that fetches the formatting of the PDF along with the text, for example tables, images, and forms. Please guide me.
Thanks.
https://www.baeldung.com/pdf-conversions-java
Refer to this one; it might be useful for you.
Thanks for the help
How can I get the font style for each line in a PDF using PDFBox?
Yes, it's really worth it.
Thanks a lot for such a neat and informative article.