Apache PDFBox - Filter Pages

🔧 Operation Name

Apache PDFBox - Filter Pages (filterPages)


🧾 Description

Filters pages from a PDF document based on two optional criteria:

  • Remove blank pages

  • Retain only selected page ranges

This is useful for preprocessing documents: dropping blank pages or extracting only the sections you need keeps downstream processing time and associated costs to a minimum.


✅ Inputs

| Parameter | Type | Required | Description |
|---|---|---|---|
| PDF File [Binary] | InputStream (Binary) | | The input PDF document to be filtered. |
| Remove Blank Pages | Boolean | ❌ | If true, pages without visible text, images, or annotations will be removed. |
| Page Range | String | ❌ | Comma-separated list of page numbers or ranges to retain (e.g., 1,3,5-7). If not provided, all pages are considered. |
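The Page Range syntax can be illustrated with a small parser. This is a minimal sketch, not the connector's actual implementation: the class name PageRangeParser and its parse method are hypothetical, and clamping ranges to the document's page count is an assumption (the connector may reject out-of-range pages instead).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
final class PageRangeParser {
    private PageRangeParser() {}

    /**
     * Turns a spec such as "1,3,5-7" into sorted, de-duplicated 1-based page numbers.
     * A null or empty spec selects every page. Ranges that extend past pageCount
     * are clamped (an assumption; the real operation may behave differently).
     */
    static List<Integer> parse(String spec, int pageCount) {
        TreeSet<Integer> pages = new TreeSet<>();
        if (spec == null || spec.trim().isEmpty()) {
            for (int p = 1; p <= pageCount; p++) {
                pages.add(p);
            }
            return new ArrayList<>(pages);
        }
        for (String token : spec.split(",")) {
            String t = token.trim();
            if (t.isEmpty()) {
                continue; // tolerate stray commas
            }
            int dash = t.indexOf('-');
            int start;
            int end;
            if (dash < 0) {
                start = Integer.parseInt(t);
                end = start;
            } else {
                start = Integer.parseInt(t.substring(0, dash).trim());
                end = Integer.parseInt(t.substring(dash + 1).trim());
            }
            if (start < 1 || end < start) {
                throw new IllegalArgumentException("Invalid page range: " + t);
            }
            for (int p = start; p <= Math.min(end, pageCount); p++) {
                pages.add(p);
            }
        }
        return new ArrayList<>(pages);
    }
}
```

For a 10-page document, PageRangeParser.parse("1,3,5-7", 10) yields the pages 1, 3, 5, 6, and 7.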


📤 Output

  • Payload (InputStream, Binary): A new filtered PDF stream containing only the selected (and non-blank) pages.

  • Attributes (PdfBoxFileAttributes): Metadata from the original document (e.g., page count, author, title, etc.).


🧪 MuleSoft Flow Example

Here’s how to call this operation in a MuleSoft flow:


🔍 Notes

  • Page Indexing: 1-based (e.g., 1 = first page).

  • If both options are omitted, the PDF is returned unmodified.

  • You can combine both removeBlankPages and pageRange for tighter filtering.

    • For example: remove blank pages after retaining only pages 2–6.

  • Output is a binary PDF, not text.
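How the two options interact can be sketched as a single pass over the original page numbers. This is an illustrative sketch under stated assumptions: PageSelection, select, and the isBlank predicate are hypothetical names, and the sketch assumes that page-range numbers always refer to the original document's numbering, with the blank-page check applied only to pages that survive the range filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

// Hypothetical helper, for illustration only.
final class PageSelection {
    private PageSelection() {}

    /**
     * Returns the 1-based page numbers to keep.
     *
     * @param pageCount        number of pages in the original document
     * @param retained         pages named by the page range, or null if no range was given
     * @param removeBlankPages whether blank pages should be dropped
     * @param isBlank          stand-in for the blank-page check (1-based page number)
     */
    static List<Integer> select(int pageCount, Set<Integer> retained,
                                boolean removeBlankPages, IntPredicate isBlank) {
        List<Integer> kept = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            if (retained != null && !retained.contains(page)) {
                continue; // outside the requested range
            }
            if (removeBlankPages && isBlank.test(page)) {
                continue; // inside the range, but blank
            }
            kept.add(page);
        }
        return kept;
    }
}
```

With an 8-page document in which pages 3 and 5 are blank, retaining pages 2 to 6 and removing blank pages keeps pages 2, 4, and 6. With no range and removeBlankPages set to false, every page is kept, which matches the "returned unmodified" note above.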


Underlying Application Interface:

Pseudo Code

Methods used from the Apache PDFBox library
  1. org.apache.pdfbox.Loader.loadPDF(byte[] input): Used to load the original PDF document from a byte array.

  2. org.apache.pdfbox.pdmodel.PDDocument(): Constructor used to create a new, empty PDDocument for the filtered pages.

  3. org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(): Used to get the total number of pages from the original PDF document.

  4. org.apache.pdfbox.pdmodel.PDDocument.getPage(int pageIndex): Used to retrieve a specific page from the original document by its zero-based index.

  5. org.apache.pdfbox.pdmodel.PDDocument.addPage(PDPage page): Used to add a page from the original document to the new, filtered document.

  6. org.apache.pdfbox.pdmodel.PDDocument.save(OutputStream output): Used to save the new, filtered PDDocument to an output stream (in this case, a ByteArrayOutputStream).

  7. org.apache.pdfbox.pdmodel.PDDocument.close(): Used to close both the original and the new filtered PDDocuments to release resources.

  8. org.apache.pdfbox.pdmodel.PDDocument.getDocumentInformation(): Used within the extractPdfMetadata helper method (which is called by filterPages) to get the document's metadata.

  9. org.apache.pdfbox.pdmodel.PDPage.getResources(): Used within the isPageBlank helper method to check for resources like images.

  10. org.apache.pdfbox.pdmodel.PDDocument.getDocumentCatalog(): Used within the isPageBlank helper method to access the document catalog.

  11. org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(): Used within the isPageBlank helper method to access interactive form fields.

  12. org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.getFields(): Used within the isPageBlank helper method to get the list of form fields.

  13. org.apache.pdfbox.pdmodel.interactive.form.PDField.getWidgets(): Used within the isPageBlank helper method to get the widgets associated with a form field.

  14. org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget.getPage(): Used within the isPageBlank helper method to check which page a widget is on.

  15. org.apache.pdfbox.pdmodel.PDPage.getAnnotations(): Used within the isPageBlank helper method to check for annotations on the page.

  16. org.apache.pdfbox.text.PDFTextStripper.getText(PDDocument doc): Used within the isPageBlank helper method to extract text from a single page to check if it's blank.

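  17. org.apache.pdfbox.text.PDFTextStripper.setStartPage(int startPage): Used within the isPageBlank helper method to set the first page (1-based) from which text is extracted.

  18. org.apache.pdfbox.text.PDFTextStripper.setEndPage(int endPage): Used within the isPageBlank helper method to set the last page (1-based) from which text is extracted; setting it to the same value as the start page restricts extraction to a single page.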
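The calls listed above might fit together roughly as follows. This is a simplified sketch, not the connector's source: it assumes PDFBox 3.x on the classpath, the class name FilterPagesSketch and the keep parameter are hypothetical, and the blank check omits the AcroForm widget lookup (items 10 to 14), treating any XObject resource, annotation, or extractable text as page content.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Set;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical sketch, for illustration only.
final class FilterPagesSketch {
    private FilterPagesSketch() {}

    /** Simplified blank check: no XObjects, no annotations, no extractable text. */
    static boolean isPageBlank(PDDocument doc, int pageIndex) throws IOException {
        PDPage page = doc.getPage(pageIndex); // zero-based index
        PDResources resources = page.getResources();
        if (resources != null && resources.getXObjectNames().iterator().hasNext()) {
            return false; // images and form XObjects count as content here
        }
        if (!page.getAnnotations().isEmpty()) {
            return false;
        }
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(pageIndex + 1); // the stripper uses 1-based pages
        stripper.setEndPage(pageIndex + 1);
        return stripper.getText(doc).trim().isEmpty();
    }

    /**
     * @param keep 1-based pages to retain, or null to consider all pages
     */
    static byte[] filterPages(byte[] input, Set<Integer> keep, boolean removeBlankPages)
            throws IOException {
        try (PDDocument source = Loader.loadPDF(input);
             PDDocument target = new PDDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            for (int i = 0; i < source.getNumberOfPages(); i++) {
                if (keep != null && !keep.contains(i + 1)) {
                    continue;
                }
                if (removeBlankPages && isPageBlank(source, i)) {
                    continue;
                }
                target.addPage(source.getPage(i));
            }
            // Save while the source is still open: the added pages still
            // reference objects owned by the source document.
            target.save(out);
            return out.toByteArray();
        }
    }
}
```

The try-with-resources block covers the close() calls in item 7 and guarantees the filtered document is saved before the original is closed.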
