-->
This post in the example tutorial series describes how to read metadata from a PDF document using the Java iText library. For readers who are new to the concept of metadata, a short definition follows. PDF metadata: when creating a PDF document, you might want to make sure that people can find information about the document. You can accomplish this by adding metadata to the PDF document. The code shown below adds the title, the subject, the author, and its.
Some image files contain metadata that you can read to determine features of the image. For example, a digital photograph might contain metadata that you can read to determine the make and model of the camera used to capture the image. With GDI+, you can read existing metadata, and you can also write new metadata to image files.
GDI+ stores an individual piece of metadata in a PropertyItem object. You can read the PropertyItems property of an Image object to retrieve all the metadata from a file. The PropertyItems property returns an array of PropertyItem objects.
A PropertyItem object has the following four properties: Id, Value, Len, and Type.
Id
A tag that identifies the metadata item. Some values that can be assigned to Id are shown in the following table.
Hexadecimal value | Description |
---|---|
0x0320 | Image title |
0x010F | Equipment manufacturer |
0x0110 | Equipment model |
0x9003 | ExifDTOriginal |
0x829A | Exif exposure time |
0x5090 | Luminance table |
0x5091 | Chrominance table |
Value
An array of values. The format of the values is determined by the Type property.
Len
The length (in bytes) of the array of values pointed to by the Value property.
Type
The data type of the values in the array pointed to by the Value property. The formats indicated by the Type property values are shown in the following table.
Numeric value | Description |
---|---|
1 | A Byte |
2 | An array of Byte objects encoded as ASCII |
3 | A 16-bit integer |
4 | A 32-bit integer |
5 | An array of two Byte objects that represent a rational number |
6 | Not used |
7 | Undefined |
8 | Not used |
9 | SLong |
10 | SRational |
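
As a rough illustration of how the Type value governs the interpretation of the Value bytes, the following Python sketch decodes a few of the formats listed in the table. It is not part of GDI+: the function name `decode_property_value` is hypothetical, little-endian byte order is an assumption, and the interpretation of Type 9 as a signed 32-bit integer is also an assumption. Types the sketch does not handle are returned as raw bytes.

```python
import struct


def decode_property_value(type_code: int, value: bytes):
    """Decode raw metadata bytes according to a Type code (hypothetical helper)."""
    if type_code == 1:
        # Type 1: plain bytes, returned as a list of integers.
        return list(value)
    if type_code == 2:
        # Type 2: ASCII text; drop everything from the first NUL terminator on.
        return value.split(b"\x00", 1)[0].decode("ascii", errors="replace")
    if type_code == 3:
        # Type 3: unsigned 16-bit integers (little-endian assumed).
        count = len(value) // 2
        return list(struct.unpack(f"<{count}H", value[: count * 2]))
    if type_code == 4:
        # Type 4: unsigned 32-bit integers (little-endian assumed).
        count = len(value) // 4
        return list(struct.unpack(f"<{count}I", value[: count * 4]))
    if type_code == 9:
        # Type 9 (SLong): assumed here to be signed 32-bit integers.
        count = len(value) // 4
        return list(struct.unpack(f"<{count}i", value[: count * 4]))
    # Anything else is left undecoded.
    return value
```

For a Type 2 item such as the equipment manufacturer, the helper returns the text up to the terminating NUL byte.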
Example
Description
The following code example reads and displays the seven pieces of metadata in the file FakePhoto.jpg. The second (index 1) property item in the list has Id 0x010F (equipment manufacturer) and Type 2 (ASCII-encoded byte array). The code example displays the value of that property item.
The code produces output similar to the following:
Code
Compiling the Code
The preceding example is designed for use with Windows Forms, and it requires PaintEventArgs e, which is a parameter of the Paint event handler. Handle the form's Paint event and paste this code into the Paint event handler. You must replace FakePhoto.jpg with an image name and path valid on your system and import the System.Drawing.Imaging namespace.
I'm trying to read metadata attached to arbitrary PDFs: title, author, subject, and keywords.
Is there a PHP library, preferably open-source, that can read PDF metadata? If so, or if there isn't, how would one use the library (or lack thereof) to extract the metadata?
To be clear, I'm not interested in creating or modifying PDFs or their metadata, and I don't care about the PDF bodies. I've looked at a number of libraries, including FPDF (which everyone seems to recommend), but it appears only to be for PDF creation, not metadata extraction.
6 Answers
The Zend framework includes Zend_Pdf, which makes this really easy:
Limitations: Works only on unencrypted files smaller than 16 MB.
Don't know about libraries, but a simple way to achieve the same result might be fopening the file and parsing everything that comes after the last 'endstream'.
Try opening a PDF in a text editor; a parser shouldn't take more than five lines.
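To make that idea concrete, here is a minimal sketch in Python rather than PHP. It scans the raw bytes for `/Title (...)`-style entries instead of locating the last 'endstream' first. The function name `read_pdf_info` is made up for illustration. It assumes the Info dictionary is stored uncompressed and unencrypted and that values are literal strings in parentheses; hex strings, UTF-16 values, and unescaped nested parentheses are not handled.

```python
import re

# The four fields the question asks about.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords")


def read_pdf_info(data: bytes) -> dict:
    """Pull literal-string Info entries out of raw PDF bytes (rough sketch)."""
    info = {}
    for key in INFO_KEYS:
        # Match "/Key (value)", allowing backslash-escaped characters in the value.
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(((?:\\.|[^\\()])*)\)"
        matches = re.findall(pattern, data, flags=re.DOTALL)
        if matches:
            # Take the last occurrence, since updated files append newer objects.
            raw = matches[-1]
            # Undo the escaping of backslashes and parentheses.
            raw = re.sub(rb"\\([\\()])", rb"\1", raw)
            info[key] = raw.decode("latin-1")
    return info
```

To use it on a file, read the file in binary mode and pass the bytes to `read_pdf_info`. Keys that are missing from the file are simply absent from the returned dictionary.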
PDF Parser does exactly what you want and it's pretty straightforward to use:
You can try it in the demo page.
I was looking for the same thing today. And I came across a small PHP class over at http://de77.com/ that offers a quick and dirty solution. You can download the class directly. Output is UTF-8 encoded.
The creator says:
Here’s a PHP class I wrote which can be used to get title & author and a number of pages of any PDF file. It does not use any external application - just pure PHP.
For me, it works! All thanks go solely to the creator of the class ... well, maybe just a little bit of thanks to me too for finding the class ;)
You may use PDFtk to extract the page count:
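As one possible way to do this from a script, the Python sketch below runs `pdftk <file> dump_data` and parses its text output. It assumes the `pdftk` executable is installed and on the PATH, and that the output contains `InfoKey:` / `InfoValue:` pairs and a `NumberOfPages:` line. The function names `parse_pdftk_dump` and `dump_pdf_data` are hypothetical.

```python
import subprocess


def parse_pdftk_dump(text: str) -> dict:
    """Parse dump_data-style text into Info entries and a page count."""
    result = {"info": {}, "pages": None}
    pending_key = None
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if not sep:
            # Lines without "name: value" form (e.g. section markers) are skipped.
            continue
        if name == "InfoKey":
            pending_key = value
        elif name == "InfoValue" and pending_key is not None:
            result["info"][pending_key] = value
            pending_key = None
        elif name == "NumberOfPages":
            result["pages"] = int(value)
    return result


def dump_pdf_data(path: str) -> dict:
    """Run pdftk on a file (assumes pdftk is installed) and parse the result."""
    completed = subprocess.run(
        ["pdftk", path, "dump_data"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_pdftk_dump(completed.stdout)
```

The parsing is kept separate from the subprocess call so it can be checked against sample text without pdftk installed.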
If ImageMagick is available you may also use:
Execute in PHP via shell_exec():