Monday, March 29, 2010

PDF and Java

I discovered a Java library for PDF from Etymon Consulting. Although it does not cover the full specification, it provides a convenient way to read, change, and write PDF files from within Java programs. As with any Java library, the API is organized into packages. The main package is com.etymon.pj.object. Here you'll find an object representation of each PDF core object type: array, boolean, dictionary, name, null, number, reference, stream, and string. Where the Java language provides an equivalent type, it is used, but with a wrapper around it for consistency. So, for example, a PDF string is represented by PjString.
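
To make the wrapper idea concrete, here is a minimal sketch of how Java values might be wrapped behind a common type so that every PDF object can be handled uniformly. This is not the pj API: the names PdfValue, SketchString, SketchName, and SketchArray are hypothetical and exist only for illustration, and the name wrapper assumes simple names that need no escaping.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: hypothetical types, NOT the classes in com.etymon.pj.object.
class PdfObjectModelSketch {

    // Common base type so strings, names, and arrays can be treated alike.
    interface PdfValue {
        String toPdfSyntax();
    }

    // Wraps java.lang.String as a PDF literal string.
    static final class SketchString implements PdfValue {
        private final String value;
        SketchString(String value) { this.value = value; }
        public String toPdfSyntax() {
            // Literal strings are parenthesized; backslashes and parentheses are escaped.
            String escaped = value.replace("\\", "\\\\")
                                  .replace("(", "\\(")
                                  .replace(")", "\\)");
            return "(" + escaped + ")";
        }
    }

    // Wraps a String as a PDF name (assumes a simple name with no special characters).
    static final class SketchName implements PdfValue {
        private final String value;
        SketchName(String value) { this.value = value; }
        public String toPdfSyntax() { return "/" + value; }
    }

    // Wraps a java.util.List as a PDF array.
    static final class SketchArray implements PdfValue {
        private final List<PdfValue> items;
        SketchArray(List<PdfValue> items) { this.items = items; }
        public String toPdfSyntax() {
            StringBuilder sb = new StringBuilder("[");
            for (PdfValue item : items) {
                sb.append(' ').append(item.toPdfSyntax());
            }
            return sb.append(" ]").toString();
        }
    }

    public static void main(String[] args) {
        PdfValue array = new SketchArray(Arrays.<PdfValue>asList(
                new SketchName("Helvetica-Bold"), new SketchString("a(b)")));
        System.out.println(array.toPdfSyntax());
    }
}
```

The benefit of a shared base type is that code walking a document does not need to know in advance which kind of object it will meet.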

When you read a PDF file, the Java equivalents of the PDF objects are created. You can then manipulate the objects using their methods and write the result back to the PDF file. You do need some knowledge of the PDF language to do some of the manipulations effectively. The following lines, for example, create a Font object:
 
PjFontType1 font = new PjFontType1(); 
font.setBaseFont(new PjName("Helvetica-Bold")); 
font.setEncoding(new PjName("PDFDocEncoding")); 
int fontId = pdf.registerObject(font);


where pdf is the Pdf object that represents the PDF file.

One thing I wanted to do was change parts of the text in a PDF file to create a "customized" PDF. I have access to the PjStream object, but the byte array containing the text is compressed, and the current library does not support LZW decompression. It does support Flate decompression.
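
For readers unfamiliar with Flate, the sketch below shows what decompressing such a byte array involves, using only the JDK's java.util.zip classes rather than the pj library. The class FlateSketch and its methods are hypothetical names for this example, and the sample content string merely stands in for the bytes of a PDF content stream.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: JDK classes, not the pj library's own decompression code.
class FlateSketch {

    // Compress bytes in zlib format, as used by Flate-encoded PDF streams.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    // Decompress a Flate-encoded byte array back to the original bytes.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    // No progress is possible: the data is truncated or needs a preset dictionary.
                    throw new IllegalArgumentException("truncated or unsupported Flate data");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException ex) {
            throw new IllegalArgumentException("not valid Flate data", ex);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] content = "BT /F1 12 Tf (Hello) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] roundTrip = inflate(deflate(content));
        System.out.println(new String(roundTrip, StandardCharsets.ISO_8859_1));
    }
}
```

Once the bytes are decompressed, the text operators are readable and can be edited before the stream is compressed again. LZW is a different algorithm and is not handled by these classes.
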
Despite some limitations, you can still do many useful things. If you need to append to a number of PDF documents programmatically, you can create a page and then append it to the existing PDF documents, all from Java. The API also provides information about the document, such as the number of pages, author, keywords, and title. This would allow a Java servlet to dynamically create a page containing the document information with a link to each PDF file. As new PDF files are added and old ones deleted, the servlet would update the page to reflect the latest collection.
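
The index page described above could be rendered roughly as follows. This is a minimal sketch that leaves out both the servlet plumbing and the pj calls: DocInfo is a hypothetical holder for values that would be read from each PDF, and renderIndex simply turns a list of them into an HTML list of links. It assumes the file names are already safe to use as relative URLs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical types, independent of the pj library and the servlet API.
class PdfIndexSketch {

    // Metadata for one PDF file, as it might be collected from the document.
    static final class DocInfo {
        final String fileName;
        final String title;
        final String author;
        final int pageCount;

        DocInfo(String fileName, String title, String author, int pageCount) {
            this.fileName = fileName;
            this.title = title;
            this.author = author;
            this.pageCount = pageCount;
        }
    }

    // Escape the characters that are special in HTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build one list item per document, each linking to the PDF file itself.
    static String renderIndex(List<DocInfo> docs) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (DocInfo d : docs) {
            html.append("  <li><a href=\"").append(escape(d.fileName)).append("\">")
                .append(escape(d.title)).append("</a> by ").append(escape(d.author))
                .append(" (").append(d.pageCount).append(" pages)</li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        List<DocInfo> docs = new ArrayList<DocInfo>();
        docs.add(new DocInfo("w9.pdf", "Forms & Filings", "A. Clerk", 3));
        System.out.print(renderIndex(docs));
    }
}
```

A servlet would rebuild the list on each request (or whenever the directory changes), so that added and deleted files are reflected automatically.
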
Listing 1 shows a simple program that uses the pj library to extract information from a PDF file and print that information to the console.
 
Listing 1.
import com.etymon.pj.*;
import com.etymon.pj.object.*;

public class GetPDFInfo {
  public static void main(String[] args) {
    // Guard against a missing file name instead of failing on args[0].
    if (args.length < 1) {
      System.err.println("Usage: java GetPDFInfo <file.pdf>");
      return;
    }
    try {
      // Parse the PDF file named on the command line.
      Pdf pdf = new Pdf(args[0]);
      System.out.println("# of pages is " + pdf.getPageCount());

      // Scan every object in the file, looking for the document information object.
      int maxObjectNumber = pdf.getMaxObjectNumber();
      for (int x = 1; x <= maxObjectNumber; x++) {
        PjObject obj = pdf.getObject(x);
        if (obj instanceof PjInfo) {
          PjInfo info = (PjInfo) obj;
          System.out.println("Author: " + info.getAuthor());
          System.out.println("Creator: " + info.getCreator());
          System.out.println("Subject: " + info.getSubject());
          System.out.println("Keywords: " + info.getKeywords());
        }
      }
    } catch (java.io.IOException ex) {
      System.out.println(ex);
    } catch (com.etymon.pj.exception.PjException ex) {
      System.out.println(ex);
    }
  }
}

Before you compile the above program, you need to download the pj library, which includes the pj.jar file. Make sure your CLASSPATH includes the pj.jar file.
The program reads the PDF file specified on the command line and parses it using the following line:

Pdf pdf = new Pdf(args[0]);
It then goes through all the objects that were created as a result of parsing the PDF file and searches for a PjInfo object. That object encapsulates information such as the author, subject, and keywords, which are extracted using the appropriate methods. You can also "set" those values; the changes are saved permanently when you write the result back to the PDF file.
A number of sample programs ship with the pj library, along with the standard javadoc-style documentation. The library is distributed under the GNU General Public License.

Conclusion

Despite additions and advancements in HTML, PDF continues to be the most popular means of sharing rich documents. As a programming language, Java needs to be able to interact with data. The pj library shown here is a preview of how PDF objects can be modeled in Java and then manipulated, seemingly complex PDF documents included, using Java's familiar constructs. With this type of interaction, applications that need to serve rich documents can actually "personalize" the content before sending out the document. This scenario can be applied, for example, to the many legal forms where a handwritten signature is still required and the form is too complex to be drawn entirely in HTML. Java and PDF provide a nice solution for these types of applications.
