Apache PDFBox - Extract Text

🔧 Operation Name

Apache PDFBox - Extract Text (extractTextWithPageRange)


🧾 Description

Extracts text content from one or more selected pages in a PDF. You can optionally define specific pages or ranges using a string like "1,3,5-7".

Use Apache PDFBox® to extract the text of a PDF document in order to:

  • Classify a PDF before choosing which MuleSoft IDP Document Action to use

  • Feed an LLM prompt


✅ Inputs

  • PDF File [Binary]

    • Type: InputStream (Binary)

    • Description: The PDF file whose text content you want to extract.

  • Page Range

    • Type: String

    • Required: ❌ (Optional)

    • Description: Comma-separated list of individual pages and ranges (e.g., 2,4,9-12). If omitted, all pages are used.


📤 Output

  • Payload: String. Contains the extracted text from the specified pages.

  • Attributes: PdfBoxFileAttributes. Includes metadata such as:

    • numberOfPages

    • pdfSize

    • title, author, subject, keywords

    • creationDate, modificationDate


🧪 MuleSoft Flow Example

Here’s how to call this operation in a MuleSoft flow:


🔍 Notes

  • Page Indexing: Page numbers are 1-based (i.e., 1 = first page).

  • If pageRange is omitted, the connector will extract text from all pages.

  • Text is returned as plain text (text/plain), suitable for logging, displaying in UIs, or further transformation.
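
To make the page-range rules above concrete, here is a minimal Java sketch of how a string such as "1,3,5-7" could be expanded into 1-based page numbers. The class name PageRangeParser and its parse method are hypothetical and are not taken from the connector's source. Rejecting out-of-range pages and returning pages sorted and de-duplicated are assumptions; the connector may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not the connector's actual code.
class PageRangeParser {

    // Expands a page range string such as "1,3,5-7" into 1-based page numbers.
    // A null or blank string selects every page, as described in the notes.
    static List<Integer> parse(String pageRange, int totalPages) {
        // Assumption: pages are returned sorted and without duplicates.
        TreeSet<Integer> pages = new TreeSet<>();

        if (pageRange == null || pageRange.trim().isEmpty()) {
            for (int page = 1; page <= totalPages; page++) {
                pages.add(page);
            }
            return new ArrayList<>(pages);
        }

        for (String token : pageRange.split(",")) {
            String part = token.trim();
            if (part.isEmpty()) {
                continue; // tolerate stray commas
            }
            int dash = part.indexOf('-');
            int start;
            int end;
            if (dash < 0) {
                // Single page, e.g. "3"
                start = Integer.parseInt(part);
                end = start;
            } else {
                // Inclusive range, e.g. "5-7"
                start = Integer.parseInt(part.substring(0, dash).trim());
                end = Integer.parseInt(part.substring(dash + 1).trim());
            }
            // Assumption: invalid or out-of-range entries are rejected.
            if (start < 1 || end < start || end > totalPages) {
                throw new IllegalArgumentException("Invalid page range: " + part);
            }
            for (int page = start; page <= end; page++) {
                pages.add(page);
            }
        }
        return new ArrayList<>(pages);
    }
}
```

For example, PageRangeParser.parse("1,3,5-7", 10) yields pages 1, 3, 5, 6 and 7, while a null or blank string with a 3-page document yields pages 1, 2 and 3.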


Underlying Application Interface:

Pseudo Code

Methods used from the Apache PDFBox library
  • org.apache.pdfbox.Loader.loadPDF(byte[] input): Used to load the PDF document from a byte array.

  • org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(): Used to get the total number of pages in the loaded PDF document.

  • org.apache.pdfbox.text.PDFTextStripper(): Constructor for creating a new text stripper object.

  • org.apache.pdfbox.text.PDFTextStripper.setStartPage(int startPage): Used to set the starting page for text extraction.

  • org.apache.pdfbox.text.PDFTextStripper.setEndPage(int endPage): Used to set the ending page for text extraction.

  • org.apache.pdfbox.text.PDFTextStripper.getText(PDDocument doc): Used to extract text from the specified pages of the document.

  • org.apache.pdfbox.pdmodel.PDDocument.close(): Used to close the loaded PDF document and release resources.
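
The methods listed above could be combined roughly as follows. This is a sketch under stated assumptions, not the connector's actual source: the class name ExtractTextSketch, the extractText method, and the representation of ranges as already-parsed inclusive {start, end} pairs are all hypothetical, and clamping each range to the document's page count is an assumption. It requires Apache PDFBox 3.x on the classpath, since Loader.loadPDF is the 3.x entry point.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical sketch for illustration only; not the connector's actual code.
class ExtractTextSketch {

    // ranges holds inclusive, 1-based {start, end} pairs, e.g. {1,1}, {3,3}, {5,7}
    // for the string "1,3,5-7". A null or empty list means "all pages".
    static String extractText(InputStream pdf, List<int[]> ranges) throws IOException {
        byte[] bytes = pdf.readAllBytes();

        // try-with-resources calls PDDocument.close() and releases resources.
        try (PDDocument document = Loader.loadPDF(bytes)) {
            int pageCount = document.getNumberOfPages();
            PDFTextStripper stripper = new PDFTextStripper();

            if (ranges == null || ranges.isEmpty()) {
                stripper.setStartPage(1);
                stripper.setEndPage(pageCount);
                return stripper.getText(document);
            }

            StringBuilder text = new StringBuilder();
            for (int[] range : ranges) {
                // Assumption: ranges are clamped to the document's page count.
                int start = Math.max(1, range[0]);
                int end = Math.min(pageCount, range[1]);
                if (start > end) {
                    continue; // nothing left of this range after clamping
                }
                stripper.setStartPage(start);
                stripper.setEndPage(end);
                text.append(stripper.getText(document));
            }
            return text.toString();
        }
    }
}
```

Reusing a single PDFTextStripper and resetting its start and end page for each range keeps the text in the order the ranges were given.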
