Apache PDFBox - Extract Text
🔧 Operation Name
Apache PDFBox - Extract Text
extractTextWithPageRange
🧾 Description
Extracts text content from one or more selected pages in a PDF. You can optionally define specific pages or ranges using a string like "1,3,5-7".
Use Apache PDFBox® to extract the text of a PDF document, for example to:
Classify a PDF before choosing which MuleSoft IDP Document Action to use
Feed an LLM prompt
✅ Inputs
PDF File [Binary]
Type: InputStream (Binary)
Required: ✅
Description: The PDF file whose text content you want to extract.

Page Range
Type: String
Required: ❌ (Optional)
Description: Comma-separated list of individual pages and ranges (e.g., 2,4,9-12). If omitted, all pages are used.
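
To make the Page Range format concrete, here is a minimal Java sketch of how a string such as "1,3,5-7" could be expanded into 1-based page numbers. It is not the connector's actual implementation: the class name PageRangeParser, the method parse, and the totalPages parameter are hypothetical. Rejecting out-of-range pages and de-duplicating overlapping entries are also assumptions, since this page does not say how the connector handles those cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of the connector or of PDFBox. */
final class PageRangeParser {

    private PageRangeParser() {}

    /**
     * Expands a range string such as "1,3,5-7" into sorted, de-duplicated,
     * 1-based page numbers. A null or blank string selects every page.
     */
    static List<Integer> parse(String pageRange, int totalPages) {
        SortedSet<Integer> pages = new TreeSet<>();

        // Omitted range: use all pages, mirroring the documented default.
        if (pageRange == null || pageRange.trim().isEmpty()) {
            for (int p = 1; p <= totalPages; p++) {
                pages.add(p);
            }
            return new ArrayList<>(pages);
        }

        for (String token : pageRange.split(",")) {
            String part = token.trim();
            if (part.isEmpty()) {
                continue; // assumption: stray commas are ignored
            }
            int dash = part.indexOf('-');
            int start;
            int end;
            if (dash < 0) {
                // Single page, e.g. "3"
                start = Integer.parseInt(part);
                end = start;
            } else {
                // Inclusive range, e.g. "5-7"
                start = Integer.parseInt(part.substring(0, dash).trim());
                end = Integer.parseInt(part.substring(dash + 1).trim());
            }
            // Assumption: pages outside 1..totalPages are an error.
            if (start < 1 || end < start || end > totalPages) {
                throw new IllegalArgumentException("Invalid page range: " + part);
            }
            for (int p = start; p <= end; p++) {
                pages.add(p);
            }
        }
        return new ArrayList<>(pages);
    }
}
```

With a 10-page document, PageRangeParser.parse("1,3,5-7", 10) yields pages 1, 3, 5, 6 and 7, and passing null yields pages 1 through 10.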
📤 Output
Payload: String
Contains the extracted text from the specified pages.
Attributes: PdfBoxFileAttributes
Includes metadata such as:
numberOfPages
pdfSize
title, author, subject, keywords
creationDate, modificationDate
🧪 MuleSoft Flow Example
Here’s how to call this operation in a MuleSoft flow:

🔍 Notes
Page Indexing: Page numbers are 1-based (i.e., 1 = first page).
If pageRange is omitted, the connector will extract text from all pages.
Text is returned as plain text (text/plain), suitable for logging, displaying in UIs, or further transformation.
Underlying Application Interface: