Deprecated 1.0.1 - Utils IDP - PDF - ExtractText

Extract Text from PDF

DEPRECATED 1.0.1 - Moved to mule-pdfbox-module

Usage:

Utilize Apache PDFBox® to manipulate a PDF document

  • Extract text from a PDF, for example to classify a document before choosing which Document Action should process it
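
The classification step mentioned above could look roughly like the sketch below. It is not part of this module: the class name DocumentClassifier, the keywords, and the returned labels are all hypothetical and only illustrate how extracted text might be used to route a document.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical keyword-based classifier; keywords and labels are illustrative assumptions. */
final class DocumentClassifier {

    // Insertion-ordered so that the first matching keyword wins.
    private static final Map<String, String> KEYWORD_TO_TYPE = new LinkedHashMap<>();
    static {
        KEYWORD_TO_TYPE.put("invoice", "INVOICE");
        KEYWORD_TO_TYPE.put("purchase order", "PURCHASE_ORDER");
        KEYWORD_TO_TYPE.put("receipt", "RECEIPT");
    }

    /** Returns the type of the first keyword found in the text, or "UNKNOWN" if none matches. */
    static String classify(String extractedText) {
        if (extractedText == null) {
            return "UNKNOWN";
        }
        // Case-insensitive match against the extracted text.
        String normalized = extractedText.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> rule : KEYWORD_TO_TYPE.entrySet()) {
            if (normalized.contains(rule.getKey())) {
                return rule.getValue();
            }
        }
        return "UNKNOWN";
    }
}
```

The returned label would then decide which Document Action receives the PDF. A real flow would likely need more robust rules than simple substring matching.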


Configuration:


Underlying Application Interface:

Pseudo Code

Methods used from the Apache PDFBox library

From org.apache.pdfbox.Loader:

  • loadPDF(byte[] pdfBytes): Loads a PDF document from a byte array.

From org.apache.pdfbox.pdmodel.PDDocument:

  • getNumberOfPages(): Gets the total number of pages in the document.

  • getPages(): Gets the page tree (PDPageTree) containing all pages.

  • getPage(int pageIndex): Gets a specific page by its 0-based index.

  • removePage(PDPage page): Removes the specified page object from the document.

  • save(OutputStream outputStream): Saves the document to an output stream.

  • close(): Closes the document and releases its resources (called implicitly when the document is opened in a try-with-resources statement).

From org.apache.pdfbox.pdmodel.PDPage:

  • getResources(): Gets the resources dictionary (PDResources) associated with the page.

  • getAnnotations(): Gets a list (List<PDAnnotation>) of annotations on the page.

From org.apache.pdfbox.pdmodel.PDResources:

  • getXObjectNames(): Gets the names of the XObjects (external objects, such as images) referenced in the resources.

From org.apache.pdfbox.pdmodel.PDPageTree (returned by PDDocument.getPages()):

  • indexOf(PDPage page): Finds the 0-based index of a given page within the page tree.

From org.apache.pdfbox.text.PDFTextStripper:

  • PDFTextStripper(): Constructor to create a text stripper object.

  • setStartPage(int pageNum): Sets the 1-based page number where text extraction should start.

  • setEndPage(int pageNum): Sets the 1-based page number where text extraction should end.

  • getText(PDDocument document): Extracts text from the specified page range within the document.
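
The PDFTextStripper calls listed above can be combined as in the following minimal sketch. It assumes PDFBox 3.x on the classpath (where org.apache.pdfbox.Loader is available). The class name PdfTextExtractor and the method extractText are hypothetical and do not reflect the module's actual implementation.

```java
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

/** Hypothetical helper; assumes PDFBox 3.x. */
final class PdfTextExtractor {

    /**
     * Extracts text from pages startPage..endPage (1-based, inclusive).
     * The end page is clamped to the number of pages in the document.
     */
    static String extractText(byte[] pdfBytes, int startPage, int endPage) throws IOException {
        // try-with-resources calls close() on the document automatically.
        try (PDDocument document = Loader.loadPDF(pdfBytes)) {
            int lastPage = Math.min(endPage, document.getNumberOfPages());
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(startPage); // 1-based
            stripper.setEndPage(lastPage);    // 1-based, inclusive
            return stripper.getText(document);
        }
    }
}
```

Note that setStartPage and setEndPage take 1-based page numbers, whereas PDDocument.getPage and PDPageTree.indexOf use 0-based indexes.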
