Selenium WebDriver Read PDF Content
In test automation, we may encounter scenarios where we have to verify the content of a PDF file. In such scenarios, we have to read the PDF file with Java code. In this post, we will see how to use Selenium with Java to verify PDF content.
We will use the PDFBox API to read the PDF file from Java code. In our example, we will read the content of the PDF file at http://www.axmag.com/download/pdfurl-guide.pdf and verify that it contains certain text.
Steps:
- Download PDFBox API from here.
- Reference PDFBox JAR file in your Selenium project.
- Open your class file and define the URL of the PDF file using this code.
    URL TestURL = new URL("http://www.axmag.com/download/pdfurl-guide.pdf");
- Now, use the code below to convert the PDF content to text. The PDFBox API is used along with a Java input stream for this purpose. (The PDFParser constructor used here is from the PDFBox 1.8.x API.)
    BufferedInputStream TestFile = new BufferedInputStream(TestURL.openStream());
    PDFParser TestPDF = new PDFParser(TestFile);
    TestPDF.parse();
    String TestText = new PDFTextStripper().getText(TestPDF.getPDDocument());
- Use the TestNG assert command to verify that the PDF contains the text ‘Open the setting.xml, you can see it is like this’.
    Assert.assertTrue(TestText.contains("Open the setting.xml, you can see it is like this"));
- After performing all the steps above, your Selenium WebDriver read PDF method should look like this.
    // Imports needed at the top of the class file (PDFBox 1.8.x package names):
    // import java.io.BufferedInputStream;
    // import java.net.URL;
    // import org.apache.pdfbox.pdfparser.PDFParser;
    // import org.apache.pdfbox.util.PDFTextStripper;
    // import org.testng.Assert;
    public void ReadPDF() throws Exception {
        // Open a stream to the PDF and parse it with PDFBox.
        URL TestURL = new URL("http://www.axmag.com/download/pdfurl-guide.pdf");
        BufferedInputStream TestFile = new BufferedInputStream(TestURL.openStream());
        PDFParser TestPDF = new PDFParser(TestFile);
        TestPDF.parse();
        // Convert the parsed document to plain text and check for the expected phrase.
        String TestText = new PDFTextStripper().getText(TestPDF.getPDDocument());
        Assert.assertTrue(TestText.contains("Open the setting.xml, you can see it is like this"));
    }
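One practical caveat with a contains check on extracted text: text extractors generally emit a line break wherever the PDF layout wraps a line, so a phrase that spans two lines may not match verbatim. The following is a minimal sketch of a whitespace-tolerant check, assuming the extracted text is already held in a String. The class and method names (PdfTextAssertions, normalizeWhitespace, containsPhrase) are hypothetical helpers and not part of PDFBox or TestNG.

```java
final class PdfTextAssertions {

    // Collapse every run of whitespace (spaces, tabs, line breaks) into a single space.
    static String normalizeWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // True if the phrase occurs in the text once whitespace differences are ignored.
    static boolean containsPhrase(String extractedText, String phrase) {
        return normalizeWhitespace(extractedText).contains(normalizeWhitespace(phrase));
    }

    public static void main(String[] args) {
        // Sample text standing in for extracted PDF text; the line break is illustrative.
        String extracted = "Open the setting.xml,\r\nyou can see it is like this";
        System.out.println(containsPhrase(extracted, "Open the setting.xml, you can see it is like this"));
    }
}
```

In a test, the result of containsPhrase could be passed to Assert.assertTrue in place of the plain contains call.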
As you can see from the tutorial above, it is possible to verify PDF content in a Selenium WebDriver test. Let us know how it goes for you.
Comments
Hi!
You might be interested: we created an open-source library for testing PDFs (using your idea):
https://github.com/codeborne/pdf-test
I would be glad to get your feedback on it.
Hello Andrei,
I am a heavy Selenide user. I use it daily for all my automated testing, so you can imagine why I am excited to see your post. How can I integrate this new set of APIs with my current Selenide setup? I was looking for something like this because we need to validate some PDF output from our application and assert that its content is as expected. Any guidance on this will be appreciated. Thanks.
thx for sharing it. I will try it.
Hi,
If the word "hello" appears multiple times in a PDF file, how can I find all of its occurrences?
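One way to approach this, once the PDF text has been extracted into a String as in the post, is to scan it repeatedly with String.indexOf. The sketch below is JDK-only. The class and method names (TextOccurrences, findAll) are hypothetical, and the sample string merely stands in for real extracted text.

```java
import java.util.ArrayList;
import java.util.List;

final class TextOccurrences {

    // Return the start index of every non-overlapping occurrence of word in text.
    static List<Integer> findAll(String text, String word) {
        List<Integer> positions = new ArrayList<>();
        if (word.isEmpty()) {
            return positions; // an empty search term has no meaningful occurrences
        }
        int from = 0;
        while (true) {
            int index = text.indexOf(word, from);
            if (index < 0) {
                break;
            }
            positions.add(index);
            from = index + word.length(); // continue after the match just found
        }
        return positions;
    }

    public static void main(String[] args) {
        // Sample string standing in for the text extracted from a PDF.
        String extracted = "hello world, hello again, and hello once more";
        List<Integer> hits = findAll(extracted, "hello");
        System.out.println("Occurrences: " + hits.size() + " at " + hits);
    }
}
```

The search is case-sensitive; lowercasing both the text and the word first would make it case-insensitive.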
Hi,
I want to get some text from a PDF. How can I do that using this? Please help.
My scenario: I download a file from the application and need to get some text from that PDF, which will be used in a later step.
I am getting a java.net.UnknownHostException with the above code. Please help.
java.lang.ClassCastException: java.io.BufferedInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead
above code does not work.
I received the same error…grrr…
Jan 19, 2018 6:31:26 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
WARNING: java.lang.NoSuchMethodError: org.apache.fontbox.cmap.CMap.getSpaceMapping()I
java.lang.NoSuchMethodError: org.apache.fontbox.cmap.CMap.getSpaceMapping()I
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:525)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
at PDFVerification.main(PDFVerification.java:16)
Getting a NoSuchMethodError while executing. Can you please tell us which version of PDFBox should be added to Eclipse to run this?
My code is like this,
public void verifyPDFContent(String strURL, String reqTextInPDF)
{
    URL url = new URL(driver.getCurrentUrl());
    BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
    PDFParser parser = new PDFParser(fileToParse);
    parser.parse();
    String output = new PDFTextStripper().getText(parser.getPDDocument());
    System.out.println(output);
}
I need to pass values to the verifyPDFContent method. Please let me know what values should be passed for String strURL and String reqTextInPDF.
Note: the PDF URL is dynamic, not static.
PDFParser does not accept a BufferedInputStream in PDFBox 2.0.13.
Use the code below instead:
String getURL = driver.getCurrentUrl();
URL urlOfPdf = new URL(getURL);
BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());
// PDDocument.load parses the stream directly; no separate PDFParser is needed.
PDDocument document = PDDocument.load(fileToParse);
String output = new PDFTextStripper().getText(document);
document.close();
System.out.println("output--" + output);
Note: download and link the fontbox-2.0.13 JAR if you get a font error.
Thanks, it worked
I have to test a PDF that opens in a new browser tab after clicking a link. After adding the above code, it throws a 400 Bad Request error when I pass the PDF URL obtained after clicking the link. The code tries to access the URL in a new session, and that throws the 400 error.
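A likely explanation is that URL.openStream() makes a fresh HTTP request that carries none of the browser's session cookies, so the server treats it as a new session. One possible workaround, sketched here under assumptions, is to copy the cookies from the WebDriver session into a Cookie request header. The helper below uses only the JDK. The class and method names (SessionDownload, toCookieHeader, openWithCookies) are hypothetical, and the name/value map is assumed to be filled from driver.manage().getCookies() in a real test.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class SessionDownload {

    // Build a Cookie header value such as "JSESSIONID=abc123; theme=dark".
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner header = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            header.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return header.toString();
    }

    // Open the PDF URL (assumed to be http or https) with the browser's cookies attached.
    static InputStream openWithCookies(String pdfUrl, Map<String, String> cookies) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(pdfUrl).openConnection();
        connection.setRequestProperty("Cookie", toCookieHeader(cookies));
        return connection.getInputStream();
    }

    public static void main(String[] args) {
        // Placeholder values; in a Selenium test these would come from
        // driver.manage().getCookies() (each Cookie has getName() and getValue()).
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123");
        cookies.put("theme", "dark");
        System.out.println(toCookieHeader(cookies));
    }
}
```

The stream returned by openWithCookies could then be handed to PDDocument.load in place of urlOfPdf.openStream(). This only helps when the application authenticates through cookies; other schemes, such as token headers, would need a different approach.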
For the fontbox error, you need to download the fontbox 2.0.13 JAR that matches PDFBox 2.0.13.
The newer version, PDFBox 2.0.13, does not support PDFParser(BufferedInputStream _file).
Use the code below instead:
URL urlOfPdf = new URL(getURL);
BufferedInputStream fileToParse = new BufferedInputStream(urlOfPdf.openStream());
PDDocument document = PDDocument.load(fileToParse);
output = new PDFTextStripper().getText(document);
System.out.println("PDF Content " + output);
if (output.contains(requiredText)) {
    System.out.println("PDF contains the text ");
}
nice solution thank u
I tried the solution and got the error below:
java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1119)
at org.apache.pdfbox.pdfparser.COSParser.parseHeader(COSParser.java:2454)
at org.apache.pdfbox.pdfparser.COSParser.parsePDFHeader(COSParser.java:2425)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:233)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1145)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1042)
at com.experian.automation.oce.cucumber.originations.steps.CommonSteps.pdf(CommonSteps.java:256)
I have the same error.
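An "End-of-File, expected line" error raised while parsing the PDF header suggests that the stream was empty or did not contain PDF data at all, for example an empty response or an HTML login page. A PDF file normally starts with the marker %PDF-, so one quick check is to peek at the first bytes before handing the stream to PDFBox. This is a JDK-only sketch, and the class and method names (PdfHeaderCheck, startsWithPdfMarker, looksLikePdf) are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PdfHeaderCheck {

    private static final byte[] PDF_MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if the given bytes begin with the "%PDF-" marker.
    static boolean startsWithPdfMarker(byte[] data) {
        if (data.length < PDF_MARKER.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, PDF_MARKER.length), PDF_MARKER);
    }

    // Peek at the first bytes without consuming them; the stream must support mark/reset.
    static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("Wrap the stream in a BufferedInputStream first");
        }
        in.mark(PDF_MARKER.length);
        byte[] head = new byte[PDF_MARKER.length];
        int total = 0;
        while (total < head.length) {
            int read = in.read(head, total, head.length - total);
            if (read < 0) {
                break; // stream ended early, e.g. an empty response
            }
            total += read;
        }
        in.reset(); // rewind so the PDF parser still sees the whole stream
        return startsWithPdfMarker(Arrays.copyOf(head, total));
    }

    public static void main(String[] args) throws IOException {
        // In-memory samples standing in for the stream returned by url.openStream().
        byte[] pdfBytes = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlBytes = "<html>Login</html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(new BufferedInputStream(new ByteArrayInputStream(pdfBytes))));
        System.out.println(looksLikePdf(new BufferedInputStream(new ByteArrayInputStream(htmlBytes))));
    }
}
```

If the check fails, printing the first part of the response as text usually shows what the server actually returned.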
I am facing the error message below:
java.io.IOException: Authentication failure
How can I verify links from a PDF file?
Same here for me. Did you find a solution?
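On the question of verifying links: if the URLs are printed visibly in the PDF, they end up in the extracted text and can be pulled out with a regular expression. The sketch below assumes exactly that and uses only the JDK. The class and method names (PdfTextLinks, extractUrls) are hypothetical and the pattern is deliberately simple. Links that exist only as clickable annotations, with no URL shown in the text, will not be found this way and would need the PDF library's annotation API instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PdfTextLinks {

    // Simple pattern: an http or https scheme followed by non-whitespace characters.
    private static final Pattern URL_PATTERN = Pattern.compile("https?://\\S+");

    // Return every URL-looking token found in the text, in order of appearance.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            // Drop trailing punctuation that usually belongs to the sentence, not the URL.
            urls.add(matcher.group().replaceAll("[.,;:)]+$", ""));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample string standing in for the text extracted from a PDF.
        String extracted = "See https://github.com/codeborne/pdf-test, or download "
                + "http://www.axmag.com/download/pdfurl-guide.pdf.";
        for (String url : extractUrls(extracted)) {
            System.out.println(url);
        }
    }
}
```

Each extracted URL can then be compared against the expected value with an assert.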