Apache PDFBox - Extract Text
🔧 Operation Name
Apache PDFBox - Extract Text
extractTextWithPageRange
🧾 Description
Extracts text content from one or more selected pages in a PDF. You can optionally define specific pages or ranges using a string like "1,3,5-7".
Use Apache PDFBox® to extract the text of a PDF document in order to:
- Classify a PDF before choosing which MuleSoft IDP Document Action to use
- Feed an LLM prompt
✅ Inputs
PDF File [Binary]
  Type: InputStream (Binary)
  Required: ✅
  Description: The PDF file whose text content you want to extract.
Page Range
  Type: String
  Required: ❌ (Optional)
  Description: Comma-separated list of individual pages and ranges (e.g., 2,4,9-12). If omitted, all pages are used.
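To make the Page Range format concrete, here is a minimal Java sketch of how a string such as "1,3,5-7" could be expanded into individual 1-based page numbers. It is not the connector's actual implementation: the class name PageRanges, the method parse, and the totalPages parameter are hypothetical, and the choices to sort and de-duplicate pages, tolerate stray commas, and reject out-of-range values are assumptions rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: PageRanges and parse are hypothetical names, not part of the connector.
final class PageRanges {

    // Expands a spec such as "1,3,5-7" into sorted, de-duplicated 1-based page numbers.
    // A null or blank spec means "all pages" (1..totalPages), mirroring the documented default.
    static List<Integer> parse(String spec, int totalPages) {
        SortedSet<Integer> pages = new TreeSet<>();
        if (spec == null || spec.trim().isEmpty()) {
            for (int p = 1; p <= totalPages; p++) {
                pages.add(p);
            }
            return new ArrayList<>(pages);
        }
        for (String token : spec.split(",")) {
            String part = token.trim();
            if (part.isEmpty()) {
                continue; // tolerate stray commas (an assumption, not documented behavior)
            }
            int dash = part.indexOf('-');
            // A token is either a single page ("3") or an inclusive range ("5-7").
            int start = Integer.parseInt((dash < 0 ? part : part.substring(0, dash)).trim());
            int end = dash < 0 ? start : Integer.parseInt(part.substring(dash + 1).trim());
            if (start < 1 || end < start || end > totalPages) {
                throw new IllegalArgumentException("Invalid page range: " + part);
            }
            for (int p = start; p <= end; p++) {
                pages.add(p);
            }
        }
        return new ArrayList<>(pages);
    }

    public static void main(String[] args) {
        System.out.println(parse("1,3,5-7", 10)); // prints [1, 3, 5, 6, 7]
        System.out.println(parse("", 3));         // prints [1, 2, 3]
    }
}
```

In this sketch a malformed token such as "a-b" surfaces as a NumberFormatException from Integer.parseInt; how the connector itself reports invalid ranges is not stated here.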
📤 Output
Payload: String
  Contains the extracted text from the specified pages.
Attributes: PdfBoxFileAttributes
  Includes metadata such as:
  - numberOfPages
  - pdfSize
  - title, author, subject, keywords
  - creationDate, modificationDate
🧪 MuleSoft Flow Example
Here’s how to call this operation in a MuleSoft flow:

<mule
    xmlns="http://www.mulesoft.org/schema/mule/core"
    xmlns:doc="http://www.mulesoft.org/schema/mule/documentation"
    xmlns:pdfbox="http://www.mulesoft.org/schema/mule/pdfbox"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.mulesoft.org/schema/mule/core
        http://www.mulesoft.org/schema/mule/core/current/mule.xsd
        http://www.mulesoft.org/schema/mule/pdfbox
        http://www.mulesoft.org/schema/mule/pdfbox/current/mule-pdfbox.xsd
        http://www.mulesoft.org/schema/mule/ee/core
        http://www.mulesoft.org/schema/mule/ee/core/current/mule-ee.xsd">
    <flow name="main">
        <scheduler doc:name="Scheduler" doc:id="dsgkfy" >
            <scheduling-strategy>
                <fixed-frequency timeUnit="HOURS"/>
            </scheduling-strategy>
        </scheduler>
        <flow-ref name="Apache PDFBox - Extract Text" />
    </flow>
    <sub-flow name="Apache PDFBox - Extract Text">
        <!-- Load a sample PDF from a URL as binary content -->
        <set-payload doc:id="vxsfk1" doc:name="Set payload" mimeType="application/octet-stream" value='#[%dw 2.0
            output application/java
            ---
            readUrl("https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf", "application/octet-stream") as Binary]'></set-payload>
        <!-- Extract text from pages 1, 3, and 4 -->
        <pdfbox:extract-text-with-page-range doc:id="vicbr1" doc:name="Apache PDFBox - Extract Text" pageRange="1,3-4"></pdfbox:extract-text-with-page-range>
        <!-- Log the returned attributes and the extracted text -->
        <logger doc:name="Logger" doc:id="ecdqss" message='#[%dw 2.0
            output text
            ---
            "\n\n Apache PDFBox - Extract Text "
            ++ "\n\n⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄"
            ++ "\n\nExtracted Text Attributes: " ++ (write(attributes, "application/json")) as String
            ++ "\n\nExtracted Text Response: " ++ payload as String
            ++ "\n\n^^^^^^^^^^^^^^^^^^^^"
            ++ "\n\n Apache PDFBox - Extract Text"
            ++ "\n\n"]'/>
    </sub-flow>
</mule>
🔍 Notes
- Page Indexing: Page numbers are 1-based (i.e., 1 = first page).
- If pageRange is omitted, the connector will extract text from all pages.
- Text is returned as plain text (text/plain), suitable for logging, displaying in UIs, or further transformation.
Underlying Application Interface: