Selenium WebDriver Read PDF Content
In test automation, we sometimes need to verify the content of a PDF file. In such scenarios, we have to use Java code to read the PDF. In this post, we will see how to use Selenium with Java to verify PDF content.
We will use the PDFBox API to read the PDF file from Java code. For our example, we will read the content of the PDF file at http://www.axmag.com/download/pdfurl-guide.pdf and verify that it contains certain text.
Steps:
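- Download the PDFBox API from here. The code in this post uses the PDFBox 1.8.x API, in which PDFParser is constructed from an InputStream, so pick a 1.8.x release.
- Reference the PDFBox JAR file in your Selenium project, together with the FontBox JAR from the same release. Mixing PDFBox and FontBox versions can cause errors such as NoSuchMethodError at run time.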
- Open your class file and define the URL of the PDF file using this code.
    URL TestURL = new URL("http://www.axmag.com/download/pdfurl-guide.pdf");
- Now, use the code below to convert the PDF content to text. The PDFBox API is used along with a Java input stream for this purpose.
    BufferedInputStream TestFile = new BufferedInputStream(TestURL.openStream());
    PDFParser TestPDF = new PDFParser(TestFile);
    TestPDF.parse();
    String TestText = new PDFTextStripper().getText(TestPDF.getPDDocument());
- Use the TestNG assert command to verify that the PDF contains the text ‘Open the setting.xml, you can see it is like this’.
    Assert.assertTrue(TestText.contains("Open the setting.xml, you can see it is like this"));
- After performing all the above steps, your Selenium WebDriver read PDF method should look like this. The imports are the PDFBox 1.8.x and TestNG packages the method needs.
    import java.io.BufferedInputStream;
    import java.net.URL;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.testng.Assert;

    public void ReadPDF() throws Exception {
        URL TestURL = new URL("http://www.axmag.com/download/pdfurl-guide.pdf");
        BufferedInputStream TestFile = new BufferedInputStream(TestURL.openStream());
        PDFParser TestPDF = new PDFParser(TestFile);
        TestPDF.parse();
        String TestText = new PDFTextStripper().getText(TestPDF.getPDDocument());
        Assert.assertTrue(TestText.contains("Open the setting.xml, you can see it is like this"));
    }
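One thing to watch with the contains check: text extracted from a PDF often carries line breaks or extra spaces wherever the page layout wraps a sentence, so an exact phrase match can fail even when the words are present. The sketch below is one possible way to make the comparison tolerant of that. It uses only the Java standard library; the class name PdfTextCheck and both helper methods are hypothetical names introduced for illustration and are not part of PDFBox, Selenium, or TestNG. The sample string in main is a stand-in for TestText.

```java
final class PdfTextCheck {

    // Collapse every run of whitespace (spaces, tabs, line breaks) into a
    // single space and trim the ends.
    static String normalizeWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // True if the phrase occurs in the extracted text once both sides have
    // had their whitespace normalized.
    static boolean containsPhrase(String extractedText, String phrase) {
        return normalizeWhitespace(extractedText)
                .contains(normalizeWhitespace(phrase));
    }

    public static void main(String[] args) {
        // Stand-in for TestText: the same sentence, broken across two lines
        // and padded with extra spaces, as extracted PDF text may be.
        String testText = "Open the setting.xml,\r\nyou can see   it is like this";
        String expected = "Open the setting.xml, you can see it is like this";

        System.out.println(testText.contains(expected));        // plain check
        System.out.println(containsPhrase(testText, expected)); // tolerant check
    }
}
```

In the test method, the assertion could then read Assert.assertTrue(PdfTextCheck.containsPhrase(TestText, "Open the setting.xml, you can see it is like this")), assuming the helper class is on the test classpath.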
As you can see from the tutorial above, reading PDF content is possible in a Selenium WebDriver test. Let us know how it goes for you.
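If the download step itself is the part that fails, it can help to fetch the file separately and confirm that the response really is a PDF before handing it to the parser. The following is a minimal sketch under stated assumptions, using only the Java standard library. The class name PdfDownload, its two methods, and the timeout values are hypothetical choices for illustration. The check relies on the fact that a well-formed PDF file begins with the ASCII marker %PDF-.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class PdfDownload {

    // True if the data starts with the ASCII marker "%PDF-".
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Download the whole response body, failing on a timeout or on any
    // HTTP status other than 200.
    static byte[] fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(address).openConnection();
        conn.setConnectTimeout(10_000); // assumed value, in milliseconds
        conn.setReadTimeout(30_000);    // assumed value, in milliseconds
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException(
                        "Unexpected HTTP status " + status + " for " + address);
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return out.toByteArray();
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfBytes = fetch("http://www.axmag.com/download/pdfurl-guide.pdf");
        if (!looksLikePdf(pdfBytes)) {
            throw new IllegalStateException("Response does not start with %PDF-");
        }
        System.out.println("Downloaded " + pdfBytes.length + " bytes");
    }
}
```

The downloaded bytes can then be wrapped in a java.io.ByteArrayInputStream and passed to the parser in place of TestURL.openStream(). An error page returned by the server is then reported as a failed download rather than as a confusing parse failure.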
Comments
Hi!
You might be interested: we created an open-source library for testing PDFs (using your idea):
https://github.com/codeborne/pdf-test
I would be glad to get your feedback about it.
Hello Andrei,
I am a heavy Selenide user. I use it daily for all my automated testing, so you can imagine why I am excited to see your post about this new set of APIs. How can I integrate it with my current Selenide setup? I was looking for something like this because we need to validate some PDF output in our application and assert that its content is as expected. Any guidance on this matter will be appreciated. Thanks.
Getting a java.net.UnknownHostException error with the above code. Please help.
java.lang.ClassCastException: java.io.BufferedInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead
The above code does not work.
Jan 19, 2018 6:31:26 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
WARNING: java.lang.NoSuchMethodError: org.apache.fontbox.cmap.CMap.getSpaceMapping()I
java.lang.NoSuchMethodError: org.apache.fontbox.cmap.CMap.getSpaceMapping()I
at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:525)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:254)
at PDFVerification.main(PDFVerification.java:16)
Getting NoSuchMethodError while executing. Can you please tell us what version of PDFBox should be added to Eclipse to run this?