Deprecated 1.0.1 - Utils IDP - PDF - ExtractText

Extract Text from PDF

DEPRECATED 1.0.1 - Moved to mule-pdfbox-module

Usage:

Utilize Apache PDFBox® to manipulate a PDF document

Extract text from a PDF so that, for example, you can classify a document before choosing which Document Action to use to process it.
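As an illustration of the classification step, here is a minimal sketch that picks a Document Action name from already-extracted text by keyword match. The class name, the keywords, and the action names are all hypothetical and not part of this module; a real classifier would likely be more involved.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical router: none of these names come from the module itself.
final class DocumentRouterSketch {

    // Assumed keyword -> Document Action name mapping, checked in insertion order.
    private static final Map<String, String> KEYWORD_TO_ACTION = new LinkedHashMap<>();
    static {
        KEYWORD_TO_ACTION.put("invoice", "invoice-action");
        KEYWORD_TO_ACTION.put("purchase order", "purchase-order-action");
    }

    /** Returns the first action whose keyword appears in the text, else a fallback. */
    static String chooseAction(String extractedText) {
        String normalized = extractedText.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : KEYWORD_TO_ACTION.entrySet()) {
            if (normalized.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "generic-action";
    }
}
```

Calling `DocumentRouterSketch.chooseAction(text)` on the extracted text returns the name of the action to route to; the fallback value covers documents that match no keyword.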

Configuration:

Underlying Application Interface:

pdfbox 3.0.4 javadoc (org.apache.pdfbox)

Pseudo Code

FUNCTION utilsPdfRemovePages(Options, InputPdfStream)
  OUTPUT: ModifiedPdfStream, Attributes {originalPageCount, pagesRemoved}

  // 1. Load PDF from InputPdfStream into an editable pdfDocument
  Read InputPdfStream into memory
  Load PDF data into pdfDocument object
  originalCount = pdfDocument.getNumberOfPages()

  // 2. Decide removal strategy based on Options
  IF Options say "Remove Blank Pages":
    // Find and remove pages with no text, images, or annotations
    Identify blank pages within pdfDocument
    Remove all identified blank pages
  ELSE:
    // Keep only specified pages, remove others
    Get list of PageNumbersToKeep from Options
    Identify all pages NOT in PageNumbersToKeep
    Remove identified pages from pdfDocument
  END IF

  // 3. Finalize and return
  removedCount = originalCount - pdfDocument.getNumberOfPages()
  Save modified pdfDocument into outputBytes
  Create ModifiedPdfStream from outputBytes
  Create Attributes map {originalPageCount=originalCount, pagesRemoved=removedCount}

  RETURN ModifiedPdfStream, Attributes
  // Note: Includes error handling for invalid input and processing errors.

END FUNCTION
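
The "keep only specified pages" branch of the pseudocode above could look roughly like this with PDFBox 3.x. This is a sketch, not the module's actual source: the class, method, and field names are hypothetical, and it assumes PageNumbersToKeep holds 1-based page numbers.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Set;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;

// Hypothetical helper mirroring the pseudocode's ELSE branch.
final class PdfKeepPagesSketch {

    /** Holds the pseudocode's outputs: the modified PDF plus the two attributes. */
    static final class Result {
        final byte[] pdfBytes;
        final int originalPageCount;
        final int pagesRemoved;

        Result(byte[] pdfBytes, int originalPageCount, int pagesRemoved) {
            this.pdfBytes = pdfBytes;
            this.originalPageCount = originalPageCount;
            this.pagesRemoved = pagesRemoved;
        }
    }

    /** Assumption: pageNumbersToKeep contains 1-based page numbers. */
    static Result keepPages(byte[] inputPdf, Set<Integer> pageNumbersToKeep) throws IOException {
        // try-with-resources closes the document when the block exits.
        try (PDDocument document = Loader.loadPDF(inputPdf)) {
            int originalCount = document.getNumberOfPages();
            // Walk backwards so a removal never shifts an index still to be visited.
            for (int index = originalCount - 1; index >= 0; index--) {
                if (!pageNumbersToKeep.contains(index + 1)) {
                    document.removePage(index);
                }
            }
            int removedCount = originalCount - document.getNumberOfPages();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            document.save(out);
            return new Result(out.toByteArray(), originalCount, removedCount);
        }
    }
}
```

Iterating from the last page to the first is a deliberate choice: removing by index while iterating forward would skip pages as later indices shift down.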

Methods used from the Apache PDFBox library

From org.apache.pdfbox.Loader:

loadPDF(byte[] pdfBytes): Loads a PDF document from a byte array.

From org.apache.pdfbox.pdmodel.PDDocument:

getNumberOfPages(): Gets the total number of pages in the document.
getPages(): Gets the page tree (PDPageTree) containing all pages.
getPage(int pageIndex): Gets a specific page by its 0-based index.
removePage(PDPage page): Removes the specified page object from the document.
save(OutputStream outputStream): Saves the document to an output stream.
close(): Closes the document and releases resources (implicitly called by the try-with-resources statement).

From org.apache.pdfbox.pdmodel.PDPage:

getResources(): Gets the resources dictionary (PDResources) associated with the page.
getAnnotations(): Gets a list (List<PDAnnotation>) of annotations on the page.

From org.apache.pdfbox.pdmodel.PDResources:

getXObjectNames(): Gets the names (Iterable<COSName>) of the XObject (external object) resources, such as images and form XObjects, referenced in the resources.

From org.apache.pdfbox.pdmodel.PDPageTree (the page tree returned by PDDocument.getPages()):

indexOf(PDPage page): Finds the 0-based index of a given page within the page tree.
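
The PDPage, PDResources, and PDPageTree methods listed above can be combined into a blank-page check along these lines. This is only a sketch of one plausible heuristic (no annotations, no XObjects, no extractable text); the class and method names are hypothetical, and the module's actual criteria may differ.

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper; the heuristic is an assumption, not the module's source.
final class BlankPageSketch {

    static boolean isBlank(PDDocument document, PDPage page) throws IOException {
        // Any annotation counts as content.
        if (!page.getAnnotations().isEmpty()) {
            return false;
        }
        // Any XObject (for example an image) in the page resources counts as content.
        PDResources resources = page.getResources();
        if (resources != null && resources.getXObjectNames().iterator().hasNext()) {
            return false;
        }
        // indexOf is 0-based, while PDFTextStripper page numbers are 1-based.
        int pageNumber = document.getPages().indexOf(page) + 1;
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage(pageNumber);
        stripper.setEndPage(pageNumber);
        // Whitespace-only output is treated as "no text".
        return stripper.getText(document).trim().isEmpty();
    }
}
```

Note that a page drawn purely with vector graphics would also pass this check, so the heuristic may need tightening for such documents.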

From org.apache.pdfbox.text.PDFTextStripper:

PDFTextStripper(): Constructor to create a text stripper object.
setStartPage(int pageNum): Sets the 1-based page number where text extraction should start.
setEndPage(int pageNum): Sets the 1-based page number where text extraction should end.
getText(PDDocument document): Extracts text from the specified page range within the document.
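
Put together, the Loader and PDFTextStripper calls listed above give a text extraction routine roughly like the following. This is a minimal sketch assuming PDFBox 3.x on the classpath; the class name, method name, and parameters are hypothetical and do not reflect the module's actual interface.

```java
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class; names and signature are illustrative only.
final class PdfTextExtractSketch {

    /** Extracts text from the inclusive, 1-based page range startPage..endPage. */
    static String extractText(byte[] pdfBytes, int startPage, int endPage) throws IOException {
        // try-with-resources calls close() on the document implicitly.
        try (PDDocument document = Loader.loadPDF(pdfBytes)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(startPage);
            stripper.setEndPage(endPage);
            return stripper.getText(document);
        }
    }
}
```

For classification, extracting only the first page or two (for example `extractText(bytes, 1, 2)`) is often enough and avoids processing long documents in full.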

