Apache PDFBox - Get Info

🔧 Operation Name

Apache PDFBox - Get Info (extractInfo)


🧾 Description

Extracts metadata and structural details from a PDF document. This includes properties like author, title, number of pages, creation/modification dates, and file size.


✅ Inputs

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| PDF File [Binary] | InputStream (Binary) | ✅ | The PDF document from which to extract information. |


📀 Output

  • Attributes (PdfBoxFileAttributes): a custom object containing metadata and structural details:

    | Field | Type | Description |
    | --- | --- | --- |
    | numberOfPages | int | Total number of pages in the PDF |
    | pdfSize | long | Size in bytes |
    | title | String | Document title |
    | author | String | Author metadata |
    | subject | String | Subject metadata |
    | keywords | String | Keywords metadata |
    | creator | String | Tool or system used to create the PDF |
    | producer | String | PDF producer metadata |
    | creationDate | String | Date created (ISO-8601) |
    | modificationDate | String | Date modified (ISO-8601) |
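
The date fields above are strings, while PDFBox's PDDocumentInformation.getCreationDate() and getModificationDate() return java.util.Calendar values. The sketch below shows one plausible conversion using only the JDK. The class name IsoDateSketch, the helper toIso8601, and the choice of the ISO offset date-time form are assumptions for illustration; the connector's actual formatting may differ.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Calendar;

public class IsoDateSketch {

    /**
     * Hypothetical helper: formats a Calendar as an ISO-8601 offset date-time,
     * keeping the calendar's own time zone. Returns null when the PDF has no
     * date set, since PDFBox returns null in that case.
     */
    static String toIso8601(Calendar calendar) {
        if (calendar == null) {
            return null;
        }
        return OffsetDateTime
                .ofInstant(calendar.toInstant(), calendar.getTimeZone().toZoneId())
                .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }
}
```

Returning null for a missing date, rather than an empty string, lets a caller distinguish "not set" from a malformed value; that is a design choice of this sketch, not documented connector behavior.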


🧪 MuleSoft Flow Example

Here’s how to call this operation in a MuleSoft flow:


🔍 Notes

  • The operation does not modify the PDF; it only reads metadata.

  • Ideal for auditing, indexing, or validating PDFs before further processing.


Underlying Application Interface:

Pseudo Code
Methods used from the Apache PDFBox library
  • org.apache.pdfbox.Loader.loadPDF(byte[] input): Used to load the PDF document.

  • org.apache.pdfbox.pdmodel.PDDocument.getNumberOfPages(): Used to get the total number of pages.

  • org.apache.pdfbox.pdmodel.PDDocument.getDocumentInformation(): Used to get the document's metadata.

  • org.apache.pdfbox.pdmodel.PDDocumentInformation (methods like getTitle(), getAuthor(), getSubject(), getKeywords(), getCreator(), getProducer(), getCreationDate(), getModificationDate()): Used to retrieve specific metadata fields.

  • org.apache.pdfbox.pdmodel.PDDocument.close(): Used to close the loaded document.
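
The calls listed above can be combined roughly as follows. This is a minimal sketch, assuming PDFBox 3.x on the classpath (where Loader.loadPDF(byte[]) is available) and Java 9 or later; it is not the connector's actual implementation. The class name PdfInfoSketch, the use of a Map in place of PdfBoxFileAttributes, and taking pdfSize from the length of the byte array are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

public class PdfInfoSketch {

    /** Hypothetical stand-in for the operation: reads metadata without modifying the PDF. */
    static Map<String, Object> extractInfo(InputStream pdfStream) throws IOException {
        // Buffer the binary input so it can be passed to Loader.loadPDF(byte[]).
        byte[] pdfBytes = pdfStream.readAllBytes();
        Map<String, Object> attributes = new LinkedHashMap<>();

        // try-with-resources calls PDDocument.close() even if a getter throws.
        try (PDDocument document = Loader.loadPDF(pdfBytes)) {
            PDDocumentInformation info = document.getDocumentInformation();

            attributes.put("numberOfPages", document.getNumberOfPages());
            attributes.put("pdfSize", (long) pdfBytes.length); // assumption: size of the input in bytes
            attributes.put("title", info.getTitle());
            attributes.put("author", info.getAuthor());
            attributes.put("subject", info.getSubject());
            attributes.put("keywords", info.getKeywords());
            attributes.put("creator", info.getCreator());
            attributes.put("producer", info.getProducer());
            // Dates are java.util.Calendar values (or null); formatting is left to the caller.
            attributes.put("creationDate", info.getCreationDate());
            attributes.put("modificationDate", info.getModificationDate());
        }
        return attributes;
    }
}
```

Any metadata field that is not set in the PDF comes back as null from PDFBox, so callers of a method like this should expect null values in the result.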
