Apache PDFBox - Split Pages

🔧 Operation Name

Apache PDFBox - Split Pages (splitPages)


🧾 Description

Splits a PDF document into multiple smaller PDFs. You can:

  • Split into individual pages, or

  • Split into larger chunks using the pageIncrement parameter.


✅ Inputs

Parameter | Type | Required | Description
PDF File [Binary] | InputStream (Binary) | ✅ (Required) | The PDF document to be split.
Page Increment | Integer | ❌ (Optional) | Number of pages per chunk. Defaults to 1 (split into single-page PDFs). For example, set it to 3 to put every 3 pages into one part.


📤 Output

  • Payload: List<InputStream> (list of binary streams). A list of split PDFs, each with pageIncrement pages, except the last chunk, which may have fewer.

  • Attributes: PdfBoxFileAttributes. Metadata from the original file: number of pages, size, title, etc.


🧪 MuleSoft Flow Example

Here’s how to call this operation in a MuleSoft flow:


🔍 Notes

  • If pageIncrement is not specified, it defaults to 1, i.e., one page per output PDF.

  • Set pageIncrement = 2 to split into documents containing two pages each, and so on.

  • The last part may contain fewer pages if the total isn’t divisible by the increment.
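To make the chunking rule in the notes above concrete, here is a minimal sketch of the arithmetic in plain Java. It does not call the connector or PDFBox; the class name SplitPlan and the method partSizes are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: computes how many pages each output part would contain.
class SplitPlan {

    /** Returns the page count of each output part, in order. */
    static List<Integer> partSizes(int totalPages, int pageIncrement) {
        if (totalPages < 0 || pageIncrement < 1) {
            throw new IllegalArgumentException("totalPages must be >= 0 and pageIncrement >= 1");
        }
        List<Integer> sizes = new ArrayList<>();
        for (int start = 0; start < totalPages; start += pageIncrement) {
            // The last part takes whatever pages remain.
            sizes.add(Math.min(pageIncrement, totalPages - start));
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(partSizes(7, 3)); // prints [3, 3, 1]
        System.out.println(partSizes(4, 1)); // prints [1, 1, 1, 1]
    }
}
```

For a 7-page document with a page increment of 3, this gives three parts, and the last one holds the single remaining page.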


Underlying Application Interface:

Pseudo Code
Methods used from the Apache PDFBox library
  • org.apache.pdfbox.Loader.loadPDF(byte[] input): Used in Step 6 to load the original PDF document from the input byte array.

  • org.apache.pdfbox.pdmodel.PDDocument.getDocumentInformation(): Used within the extractPdfMetadata helper (called in Step 8) to retrieve the metadata from the original document. Methods from the returned PDDocumentInformation object (like getTitle(), getAuthor(), etc.) are then used to populate the attributes.

  • org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(): Used in Step 9 to get the total number of pages in the original document.

  • org.apache.pdfbox.multipdf.Splitter(): Used in Step 11 to create a new instance of the PDF splitter.

  • org.apache.pdfbox.multipdf.Splitter.setSplitAtPage(int pageIncrement): Used in Step 12 to configure the splitter to split the document every pageIncrement pages.

  • org.apache.pdfbox.multipdf.Splitter.split(PDDocument document): Used in Step 13 to perform the actual splitting of the original PDDocument into a list of new PDDocument objects.

  • org.apache.pdfbox.pdmodel.PDDocument.save(OutputStream output): Used in Step 15d within the loop to save each individual split PDDocument to a ByteArrayOutputStream.

  • org.apache.pdfbox.pdmodel.PDDocument.close(): Used in Step 16 to close each of the split PDDocuments after they have been saved, and also in Step 20 to ensure the original loaded PDDocument is closed.
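The calls listed above can be strung together roughly as follows. This is a sketch, not the connector's actual source: it assumes Apache PDFBox 3.x on the classpath (the Loader class belongs to the 3.x line), the class name SplitPagesSketch and the method signature are hypothetical, and the metadata step that fills PdfBoxFileAttributes is omitted because that type is specific to the connector.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

// Hypothetical helper; only the PDFBox calls come from the list above.
class SplitPagesSketch {

    static List<InputStream> splitPages(byte[] pdfBytes, int pageIncrement) throws IOException {
        List<InputStream> parts = new ArrayList<>();
        // try-with-resources closes the original document when done.
        try (PDDocument original = Loader.loadPDF(pdfBytes)) {
            Splitter splitter = new Splitter();
            splitter.setSplitAtPage(pageIncrement); // pages per output part
            List<PDDocument> chunks = splitter.split(original);
            try {
                for (PDDocument chunk : chunks) {
                    // Save each part to memory while the original is still open.
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    chunk.save(out);
                    parts.add(new ByteArrayInputStream(out.toByteArray()));
                }
            } finally {
                // Close every split document, even if a save failed.
                for (PDDocument chunk : chunks) {
                    chunk.close();
                }
            }
        }
        return parts;
    }
}
```

Each part is saved before the original document is closed, since the split documents may still refer to resources held by the source. Holding every part in memory keeps the sketch short; for very large files, writing the parts to temporary files may be a better fit.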
