Utils IDP - PDF - ExtractText
Extract Text from PDF
Use this utility to work with a PDF document.
Extract the text from a PDF so that you can, for example, classify a document before choosing which Document Action should process it.
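As an illustration of the classification use case, the sketch below routes a document by looking for keywords in its extracted text. It is a minimal example under stated assumptions: the class name DocumentClassifier, the category names, and the keyword lists are all hypothetical and are not part of this utility.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical keyword-based classifier; categories and keywords are illustrative only. */
final class DocumentClassifier {

    // Insertion order matters: the first category with a matching keyword wins.
    private static final Map<String, List<String>> KEYWORDS = new LinkedHashMap<>();
    static {
        KEYWORDS.put("INVOICE", List.of("invoice number", "amount due"));
        KEYWORDS.put("PURCHASE_ORDER", List.of("purchase order", "po number"));
    }

    /** Returns the first category whose keyword occurs in the text, or "UNKNOWN". */
    static String classify(String extractedText) {
        if (extractedText == null || extractedText.isBlank()) {
            return "UNKNOWN";
        }
        // Case-insensitive matching via a locale-independent lowercase copy.
        String normalized = extractedText.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, List<String>> entry : KEYWORDS.entrySet()) {
            for (String keyword : entry.getValue()) {
                if (normalized.contains(keyword)) {
                    return entry.getKey();
                }
            }
        }
        return "UNKNOWN";
    }
}
```

The returned category could then be mapped to whichever Document Action suits that kind of document. Real classification rules will depend on the documents being processed.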
FUNCTION utilsPdfRemovePages(Options, InputPdfStream)
OUTPUT: ModifiedPdfStream, Attributes {originalPageCount, pagesRemoved}

    // 1. Load PDF from InputPdfStream into an editable pdfDocument
    Read InputPdfStream into memory
    Load PDF data into pdfDocument object
    originalCount = pdfDocument.getPageCount()

    // 2. Decide removal strategy based on Options
    IF Options say "Remove Blank Pages":
        // Find and remove pages with no text, images, or annotations
        Identify blank pages within pdfDocument
        Remove all identified blank pages
    ELSE:
        // Keep only specified pages, remove others
        Get list of PageNumbersToKeep from Options
        Identify all pages NOT in PageNumbersToKeep
        Remove identified pages from pdfDocument
    END IF

    // 3. Finalize and return
    removedCount = originalCount - pdfDocument.getPageCount()
    Save modified pdfDocument into outputBytes
    Create ModifiedPdfStream from outputBytes
    Create Attributes map {originalPageCount=originalCount, pagesRemoved=removedCount}
    RETURN ModifiedPdfStream, Attributes

    // Note: Includes error handling for invalid input and processing errors.
END FUNCTION
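The ELSE branch of the pseudocode above (keep only the specified pages) can be sketched in plain Java as follows. This is a minimal illustration, not the utility's actual code: the class and method names are hypothetical, and it assumes that PageNumbersToKeep holds 1-based page numbers. The indices are returned highest first so that removing one page never shifts the index of a page still waiting to be removed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper that decides which pages to remove. */
final class PageSelection {

    /**
     * Returns the 0-based indices of the pages to remove, highest index first.
     * Assumes pageNumbersToKeep contains 1-based page numbers; numbers outside
     * the document are simply ignored.
     */
    static List<Integer> indicesToRemove(int originalPageCount, List<Integer> pageNumbersToKeep) {
        if (originalPageCount < 0) {
            throw new IllegalArgumentException("originalPageCount must not be negative");
        }
        Set<Integer> keep = new HashSet<>(pageNumbersToKeep);
        List<Integer> toRemove = new ArrayList<>();
        // Walk backwards so later removals do not invalidate earlier indices.
        for (int index = originalPageCount - 1; index >= 0; index--) {
            if (!keep.contains(index + 1)) { // convert 0-based index to 1-based page number
                toRemove.add(index);
            }
        }
        return toRemove;
    }
}
```

The size of the returned list corresponds to the pagesRemoved attribute, and originalPageCount is passed through unchanged.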
From org.apache.pdfbox.Loader:
loadPDF(byte[] pdfBytes): Loads a PDF document from a byte array.
From org.apache.pdfbox.pdmodel.PDDocument:
getNumberOfPages(): Gets the total number of pages in the document.
getPages(): Gets the page tree (PDPageTree) containing all pages.
getPage(int pageIndex): Gets a specific page by its 0-based index.
removePage(PDPage page): Removes the specified page object from the document.
save(OutputStream outputStream): Saves the document to an output stream.
close(): Closes the document and releases its resources (called implicitly when the document is opened in a try-with-resources statement).
From org.apache.pdfbox.pdmodel.PDPage:
getResources(): Gets the resources dictionary (PDResources) associated with the page.
getAnnotations(): Gets a list (List<PDAnnotation>) of annotations on the page.
From org.apache.pdfbox.pdmodel.PDResources:
getXObjectNames(): Gets the names of external objects (such as images) referenced in the resources.
From org.apache.pdfbox.pdmodel.PDPageTree (the page tree returned by getPages()):
indexOf(PDPage page): Finds the 0-based index of a given page within the page tree.
From org.apache.pdfbox.text.PDFTextStripper:
PDFTextStripper(): Constructor to create a text stripper object.
setStartPage(int pageNum): Sets the 1-based page number where text extraction should start.
setEndPage(int pageNum): Sets the 1-based page number where text extraction should end.
getText(PDDocument document): Extracts text from the specified page range within the document.
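The Loader and PDFTextStripper calls listed above can be combined as in the following sketch. It assumes Apache PDFBox 3.x on the classpath (the Loader class is not available in 2.x); the class name PdfTextExtraction and the method extractText are hypothetical and only illustrate the call sequence, not the utility's actual implementation.

```java
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

/** Hypothetical wrapper around the PDFBox text-extraction calls. */
final class PdfTextExtraction {

    /** Extracts text from the inclusive, 1-based page range [startPage, endPage]. */
    static String extractText(byte[] pdfBytes, int startPage, int endPage) throws IOException {
        // try-with-resources closes the document even if extraction fails.
        try (PDDocument document = Loader.loadPDF(pdfBytes)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(startPage);
            // Clamp the end page so a too-large value means "through the last page".
            stripper.setEndPage(Math.min(endPage, document.getNumberOfPages()));
            return stripper.getText(document);
        }
    }
}
```

Extracting only the first page or two is often enough for classification and avoids processing the full document.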