Apache PDFBox - Extract Text
extractTextWithPageRange
Extracts text content from one or more selected pages in a PDF. You can optionally define specific pages or ranges using a string like "1,3,5-7".
Use this operation to extract the text of a PDF document in order to:
Classify a PDF before choosing which MuleSoft IDP Document Action to use
Feed an LLM prompt
PDF File [Binary]: InputStream (Binary), required. The PDF file whose text content you want to extract.
Page Range: String, optional. Comma-separated list of individual pages and ranges (e.g., 2,4,9-12). If omitted, all pages are used.
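The page-range syntax can be illustrated with a minimal Python sketch. It is not the connector's implementation: the function name `parse_page_range` and its `number_of_pages` argument are hypothetical, and the handling of duplicates, stray commas, and out-of-bounds pages is an assumption, since the documentation does not specify it. The sketch only shows how a string such as "1,3,5-7" could expand into 1-based page numbers, with an omitted value selecting every page.

```python
def parse_page_range(page_range, number_of_pages):
    """Expand a string like "1,3,5-7" into a sorted list of 1-based page numbers."""
    # An omitted or blank range selects every page in the document.
    if page_range is None or not page_range.strip():
        return list(range(1, number_of_pages + 1))

    pages = set()
    for token in page_range.split(","):
        token = token.strip()
        if not token:
            continue  # assumption: tolerate stray commas such as "1,,3"
        if "-" in token:
            # A range such as "5-7" includes both endpoints.
            first_text, last_text = token.split("-", 1)
            first, last = int(first_text), int(last_text)
        else:
            first = last = int(token)
        # Assumption: reject pages outside 1..number_of_pages and reversed ranges.
        if first < 1 or last > number_of_pages or first > last:
            raise ValueError(f"invalid page range: {token!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)


print(parse_page_range("1,3,5-7", 10))  # [1, 3, 5, 6, 7]
print(parse_page_range(None, 3))        # [1, 2, 3]
```

Collecting pages in a set means overlapping entries such as "2,1-3" yield each page once; whether the connector itself deduplicates or preserves the order given is not stated in this page.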
Payload: String
Contains the extracted text from the specified pages.
Attributes: PdfBoxFileAttributes
Includes metadata such as:
numberOfPages
pdfSize
title, author, subject, keywords
creationDate, modificationDate
Here's how to call this operation in a MuleSoft flow:
Page Indexing: Page numbers are 1-based (i.e., 1 = first page).
If pageRange is omitted, the connector extracts text from all pages.
Text is returned as plain text (text/plain), suitable for logging, displaying in UIs, or further transformation.