Apache PDFBox - Merge PDFs

🔧 Operation Name

Apache PDFBox - Merge PDFs (mergePdfs)


🧾 Description

Combines two or more PDF documents into a single unified PDF. Each input file is processed in-memory using PDFBox's random-access buffering to ensure full compatibility with PDFBox 3.0.x.

Ideal for combining related documents before delivery, archiving, or downstream transformation.


✅ Inputs

  • PDF Files [List of Binary] (List<InputStream>): A list of PDF streams to merge. Must contain at least two streams. Provided via a DataWeave expression or flow variable (e.g., #[payload], #[vars.myList]).


📤 Output

  • Payload: InputStream (binary stream). A single merged PDF containing all input documents, in the order provided.

  • Attributes: PdfBoxFileAttributes. Metadata for the merged output (total page count, file size, title, author, etc.). Values come from the first PDF in the list, except the total page count, which is the combined page total of the merged PDF.


🧪 MuleSoft Flow Example

Here's how to call this operation in a MuleSoft flow:

Example DataWeave for Input Array of Binaries:

Advised: Add the MuleSoftForge Apache PDFBox - Merge PDFs component and leave its default settings unchanged.

🔍 Notes

  • Input must contain at least two PDF files, or the operation will throw an error.

  • The merge order follows the order of the List<InputStream> provided, so take care in how the list is constructed.

  • All documents are merged in memory using RandomAccessReadBuffer, compatible with PDFBox 3.0.4.

  • If input streams are empty (0 bytes), they will still be processed unless you add a pre-filter.

  • Ideal for combining invoices or attachments, or for generating consolidated output PDFs.
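
The notes on minimum input count, ordering, and empty streams can be illustrated with a small pre-filter. This is only a sketch, not part of the connector: the class MergeInputFilter and the method readNonEmpty are hypothetical names, and it assumes Java 9 or later for InputStream.readAllBytes(). It reads each stream fully, drops 0-byte inputs, keeps the original order, and rejects the list if fewer than two PDFs remain.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of the connector) that pre-filters merge inputs. */
public final class MergeInputFilter {

    private MergeInputFilter() {
    }

    /**
     * Reads every stream fully, skips empty (0-byte) inputs, and preserves list order.
     * Throws IllegalArgumentException if fewer than two non-empty inputs remain.
     */
    public static List<byte[]> readNonEmpty(List<InputStream> inputs) throws IOException {
        List<byte[]> kept = new ArrayList<>();
        for (InputStream in : inputs) {
            byte[] bytes = in.readAllBytes(); // Java 9+; consumes the stream
            if (bytes.length > 0) {
                kept.add(bytes); // insertion order equals merge order
            }
        }
        if (kept.size() < 2) {
            throw new IllegalArgumentException(
                    "At least two non-empty PDF inputs are required, found " + kept.size());
        }
        return kept;
    }
}
```

The returned byte arrays could then be wrapped again (for example in ByteArrayInputStream instances) before being handed to the merge operation. Note that the filter only checks length; it does not verify that the bytes are a valid PDF.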


Underlying Application Interface:

Pseudo Code
Methods used from the Apache PDFBox library
  • org.apache.pdfbox.multipdf.PDFMergerUtility(): The constructor is used in Step 4 to create an instance of the utility class responsible for merging.

  • org.apache.pdfbox.multipdf.PDFMergerUtility.setDestinationStream(OutputStream outputStream): Used in Step 5 to tell the merger where to write the resulting merged PDF.

  • org.apache.pdfbox.io.RandomAccessReadBuffer(byte[] bytes): The constructor is used in Step 7a ii to create a buffer from the byte array of each input PDF. This buffer is required by the merger utility.

  • org.apache.pdfbox.multipdf.PDFMergerUtility.addSource(RandomAccessRead source): Used in Step 7a iv within the loop to add each input PDF (represented by a RandomAccessReadBuffer) to the list of documents to be merged.

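  • org.apache.pdfbox.multipdf.PDFMergerUtility.mergeDocuments(StreamCacheCreateFunction streamCacheCreateFunction): Used in Step 7b to perform the actual merging process. In PDFBox 3.0.x this parameter is a StreamCacheCreateFunction (PDFBox 2.x took a MemoryUsageSetting here). The null argument indicates default memory usage settings.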

  • org.apache.pdfbox.Loader.loadPDF(byte[] input): Used in Step 7d i within a try-with-resources block to load the newly created merged PDF byte array into a PDDocument object, specifically for the purpose of extracting its metadata.

  • org.apache.pdfbox.pdmodel.PDDocument.getDocumentInformation(): Used within the extractPdfMetadata helper method (called in Step 7d iii) to get the metadata dictionary of the merged document.

  • org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(): Used within the extractPdfMetadata helper method (called in Step 7d iii) to get the page count of the merged document.

  • org.apache.pdfbox.io.RandomAccessRead.close(): Used in Step 9b within the finally block to close the RandomAccessReadBuffer resources that were created for each input PDF.

  • org.apache.pdfbox.pdmodel.PDDocument.close(): Used implicitly by the try-with-resources block in Step 7d to close the PDDocument created from the merged bytes after metadata extraction.
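
The calls listed above can be put together roughly as follows. This is a minimal sketch and not the connector's actual source: the class MergeSketch and the methods merge and pageCount are hypothetical names, the inputs are assumed to be already read into byte arrays, and PDFBox 3.0.x is assumed to be on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.io.RandomAccessRead;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;

/** Hypothetical illustration of the merge sequence; assumes PDFBox 3.0.x. */
public final class MergeSketch {

    private MergeSketch() {
    }

    /** Merges the given PDFs in list order, fully in memory, and returns the merged bytes. */
    public static byte[] merge(List<byte[]> pdfs) throws IOException {
        if (pdfs == null || pdfs.size() < 2) {
            throw new IllegalArgumentException("At least two PDF files are required");
        }
        PDFMergerUtility merger = new PDFMergerUtility();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        merger.setDestinationStream(out); // merged PDF is written here

        List<RandomAccessRead> sources = new ArrayList<>();
        try {
            for (byte[] pdf : pdfs) {
                RandomAccessRead source = new RandomAccessReadBuffer(pdf);
                sources.add(source);      // remembered so it can be closed below
                merger.addSource(source); // source order determines page order
            }
            merger.mergeDocuments(null);  // null: default memory settings, per the list above
            return out.toByteArray();
        } finally {
            for (RandomAccessRead source : sources) {
                source.close();
            }
        }
    }

    /** Reloads the merged bytes only to read metadata, here the combined page count. */
    public static int pageCount(byte[] mergedPdf) throws IOException {
        try (PDDocument document = Loader.loadPDF(mergedPdf)) {
            return document.getNumberOfPages();
        }
    }
}
```

Other metadata, such as title and author, would be read from document.getDocumentInformation() inside the same try-with-resources block, so the PDDocument is closed as soon as extraction finishes.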
