
Deprecated 1.0.1 - Utils IDP - PDF - ExtractText

Extract Text from PDF


Last updated 12 days ago

DEPRECATED 1.0.1 - Moved to mule-pdfbox-module

Usage:

Use this operation to extract text from a PDF document.

  • Extract text from a PDF so that, for example, you can classify the document before choosing which Document Action to process it with.
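
The classification step mentioned above can be as simple as a keyword match on the extracted text. The following is a minimal sketch, assuming the text has already been extracted from the PDF. The class name `DocumentClassifier`, the keywords, and the action names (`invoice-action` and so on) are hypothetical and are not part of the connector.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: picks a Document Action name from extracted PDF text.
final class DocumentClassifier {

    // Keyword -> action name. Insertion order decides which match wins.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("invoice", "invoice-action");
        RULES.put("purchase order", "purchase-order-action");
        RULES.put("receipt", "receipt-action");
    }

    private DocumentClassifier() { }

    // Returns the first matching action name, or "generic-action" if nothing matches.
    static String chooseAction(String extractedText) {
        if (extractedText == null) {
            return "generic-action";
        }
        String text = extractedText.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            if (text.contains(rule.getKey())) {
                return rule.getValue();
            }
        }
        return "generic-action";
    }
}
```

A real flow would likely use a more robust classifier; the point here is only that the extracted text is the input to the routing decision.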


Configuration:


Underlying Application Interface:

Pseudo Code
FUNCTION utilsPdfRemovePages(Options, InputPdfStream)
  OUTPUT: ModifiedPdfStream, Attributes {originalPageCount, pagesRemoved}

  // 1. Load PDF from InputPdfStream into an editable pdfDocument
  Read InputPdfStream into memory
  Load PDF data into pdfDocument object
  originalCount = pdfDocument.getPageCount()

  // 2. Decide removal strategy based on Options
  IF Options say "Remove Blank Pages":
    // Find and remove pages with no text, images, or annotations
    Identify blank pages within pdfDocument
    Remove all identified blank pages
  ELSE:
    // Keep only specified pages, remove others
    Get list of PageNumbersToKeep from Options
    Identify all pages NOT in PageNumbersToKeep
    Remove identified pages from pdfDocument
  END IF

  // 3. Finalize and return
  removedCount = originalCount - pdfDocument.getPageCount()
  Save modified pdfDocument into outputBytes
  Create ModifiedPdfStream from outputBytes
  Create Attributes map {originalPageCount=originalCount, pagesRemoved=removedCount}

  RETURN ModifiedPdfStream, Attributes
  // Note: Includes error handling for invalid input and processing errors.

END FUNCTION
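
The "keep only specified pages" branch of the pseudocode above comes down to index arithmetic. Here is a minimal Java sketch of that step, assuming the pages to keep are given as 1-based page numbers. `PageSelection` and `indicesToRemove` are hypothetical names and not the connector's actual code. Indices are returned highest first so that removing pages one by one does not shift the positions of pages still waiting to be removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper mirroring the ELSE branch of step 2 in the pseudocode.
final class PageSelection {

    private PageSelection() { }

    // pageCount: total number of pages; pagesToKeep: 1-based page numbers.
    // Returns the 0-based indices of pages to remove, highest index first.
    static List<Integer> indicesToRemove(int pageCount, Set<Integer> pagesToKeep) {
        List<Integer> toRemove = new ArrayList<>();
        for (int index = pageCount - 1; index >= 0; index--) {
            // index + 1 converts the 0-based index to a 1-based page number
            if (!pagesToKeep.contains(index + 1)) {
                toRemove.add(index);
            }
        }
        return toRemove;
    }
}
```

For a five-page document where pages 1 and 3 are kept, this yields the indices 4, 3, 1, so the `pagesRemoved` attribute would be 3.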

Methods used from the Apache PDFBox library

From org.apache.pdfbox.Loader:

  • loadPDF(byte[] pdfBytes): Loads a PDF document from a byte array.

From org.apache.pdfbox.pdmodel.PDDocument:

  • getNumberOfPages(): Gets the total number of pages in the document.

  • getPages(): Gets the page tree (PDPageTree) containing all pages.

  • getPage(int pageIndex): Gets a specific page by its 0-based index.

  • removePage(PDPage page): Removes the specified page object from the document.

  • save(OutputStream outputStream): Saves the document to an output stream.

  • close(): Closes the document and releases resources (implicitly called by the try-with-resources statement).

From org.apache.pdfbox.pdmodel.PDPage:

  • getResources(): Gets the resources dictionary (PDResources) associated with the page.

  • getAnnotations(): Gets a list (List<PDAnnotation>) of annotations on the page.

From org.apache.pdfbox.pdmodel.PDResources:

  • getXObjectNames(): Gets the names of external objects (like images) referenced in the resources.

From org.apache.pdfbox.pdmodel.PDPageTree:

  • indexOf(PDPage page): Finds the 0-based index of a given page within the page tree.

From org.apache.pdfbox.text.PDFTextStripper:

  • PDFTextStripper(): Constructor to create a text stripper object.

  • setStartPage(int pageNum): Sets the 1-based page number where text extraction should start.

  • setEndPage(int pageNum): Sets the 1-based page number where text extraction should end.

  • getText(PDDocument document): Extracts text from the specified page range within the document.
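
Note the mixed conventions in the lists above: getPage takes a 0-based index, while setStartPage and setEndPage take 1-based page numbers. The sketch below shows one way to convert between them, clamping the range to the document's page count. `StripperRange` and `forIndices` are hypothetical names introduced for illustration. The resulting values are what would be passed to setStartPage and setEndPage.

```java
// Hypothetical helper: converts 0-based page indices (as used by getPage)
// into the 1-based start/end values expected by setStartPage/setEndPage.
final class StripperRange {
    final int startPage; // 1-based, inclusive
    final int endPage;   // 1-based, inclusive

    private StripperRange(int startPage, int endPage) {
        this.startPage = startPage;
        this.endPage = endPage;
    }

    // firstIndex/lastIndex are 0-based and inclusive; pageCount is the total page count.
    static StripperRange forIndices(int firstIndex, int lastIndex, int pageCount) {
        if (pageCount < 1 || firstIndex > lastIndex) {
            throw new IllegalArgumentException("empty document or inverted range");
        }
        // Clamp to the valid index range, then shift by one to get page numbers.
        int start = Math.max(firstIndex, 0) + 1;
        int end = Math.min(lastIndex, pageCount - 1) + 1;
        if (start > end) {
            throw new IllegalArgumentException("range lies outside the document");
        }
        return new StripperRange(start, end);
    }
}
```

For example, indices 0 through 2 of a ten-page document map to start page 1 and end page 3.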

mule-pdfbox-module
Apache PDFBox®
pdfbox 3.0.4 javadoc (org.apache.pdfbox)