
Flying Saucer PDF Generator and Unicode

Written by Geoff Mottram (geoff at minaret dot biz).

Placed in the public domain on January 30, 2012 by the author.

Last updated: February 16, 2012.

This document and all associated software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with this document and associated software or the use or other dealings in same.

Contents
Introduction
Repositories
Arial Version of Flying Saucer
Features of this Release
Software Bundles
Quick Start Guide to Generating a PDF File with Flying Saucer
Tips on using Flying Saucer

Introduction
Flying Saucer is a remarkable open-source Java project for converting XHTML files that contain CSS style-sheet information into PDF files. Flying Saucer relies on an equally impressive Java project called iText, which does the actual PDF generation via a programming interface. The beauty of Flying Saucer is how easy it makes generating PDF files from a format that most people are familiar with: HTML.

This document has two purposes: to provide some tips and helpful hints that will get you up and running faster with Flying Saucer, particularly if you are interested in Unicode and non-Latin characters; and to make available both source and binary bundles of a variation of Flying Saucer that contains some new features and fixes.

Repositories
Other than the material provided on this site, the current versions of Flying Saucer and iText can be found on their project sites:

Flying Saucer Home Page
iText Home Page

A note about iText: Since version 5, iText has been licensed under the AGPL, so you must pay for a commercial license to use it in commercial projects unless you also make the source code of your own work available. The license even applies to the PDF files generated by iText: if you use iText to generate PDF files in a commercial setting, you must pay or release your source code.

The last version of iText that does not have this restriction is 2.1.7, which uses the Mozilla Public License Version 1.1. You can still find source and binary releases of 2.1.7 on the Internet. An iText 2.1.7 binary jar file is included in the binary Flying Saucer files on this site. An iText source code bundle and a binary bundle are provided here as well.

Arial Version of Flying Saucer
This variant of Flying Saucer is named after the Arial Unicode MS font, which was the reason for many of the enhancements to Flying Saucer. This font is incredibly useful if you work with multiple character sets, as it reportedly implements the entire Unicode 2 character repertoire. What it does not have is premade bold and italic variations. The Arial version of Flying Saucer generates the appropriate PDF commands to emulate bold and italics when a style calls for them but no matching font can be found.

This release (R8-Arial) is based on the Flying Saucer master branch from January 6, 2012. The changes documented here have been uploaded to the Flying Saucer project on GitHub.

WARNING: Be advised that Arial Unicode MS is the copyrighted intellectual property of Microsoft Corporation. It comes bundled with Microsoft Office and portions of it must be embedded in any PDF files you generate with this font for it to display correctly if the computer viewing the PDF file does not have Arial Unicode MS. If you do embed the font in your PDF files, please seek the appropriate legal advice on your use of and, more importantly, your redistribution of any PDF documents generated with any font that has restrictions on its use. If you don't embed the font, you can avoid this problem altogether but the document will not display properly if the font is not installed on the target computer.

Features of this Release
    Added font emulation for bold and italic variations when there is no direct support in the font files themselves. Fonts like Microsoft's Arial Unicode MS come in only one version: plain text. To get bold, italics and bold+italics, the font must be modified on the fly by the PDF display software.

When an XHTML style calls for a bold or italic variant of a font and the currently selected font does not have built-in support for these variations, Flying Saucer will now output the necessary PDF instructions to emulate these effects.

In the case of the PDF document title property, it will be set automatically from the contents of the <title> element in the <head> section of the XHTML source document. However, a <meta> element with a name of "title" (a name/content pair) will override any <title> element found in the head.
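The precedence rule described above can be sketched with the standard Java XML parser. This is only an illustration of the rule, not the code used inside Flying Saucer; the class and method names (TitleResolver, resolveTitle) are hypothetical, and the sample document deliberately has no DOCTYPE so that the parser does not try to fetch a DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: picks a PDF title using the precedence described in the text. */
final class TitleResolver {

    /** Returns the meta "title" content if present, else the title element text, else null. */
    static String resolveTitle(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XHTML", e);
        }
        // A meta element named "title" takes precedence over the title element.
        // (For brevity this searches the whole document, not just the head.)
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            if ("title".equals(meta.getAttribute("name"))) {
                return meta.getAttribute("content");
            }
        }
        // Otherwise fall back to the text of the first title element, if any.
        NodeList titles = doc.getElementsByTagName("title");
        return titles.getLength() > 0 ? titles.item(0).getTextContent().trim() : null;
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>From the title element</title>"
                + "<meta name=\"title\" content=\"From the meta element\"/>"
                + "</head><body/></html>";
        System.out.println(resolveTitle(xhtml));
    }
}
```

With both elements present, the meta value is the one returned; remove the meta element and the title element text is used instead.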

<title>A title in the head section of the document</title>
public void addMetadata(String name, String value)
public String getMetadataByName(String name)
public ArrayList getMetadataListByName(String name)

void preWrite(ITextRenderer iTextRenderer, int pageCount);

When you used a line height that was smaller than a font's maximum glyph height, Flying Saucer generated phantom text that bled from the first line of a page onto the bottom of the page immediately preceding it. While the phantom text was not visible in the PDF document, your cursor could land on it and you could find it when running a search in Adobe Reader (it displayed as a blue rectangle where the text was hiding).

The solution used here is to disable the code in InlineBoxing.java that centers lines of text vertically within the currently defined line height (search for any references to halfLeading in InlineBoxing.java). There may be a more elegant solution but this works for the time being. If a line of text has extra vertical space, that space will always fall below the text.
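To see why vertical centering can push text above its own line, here is a minimal sketch of the half-leading arithmetic. It is not the code from InlineBoxing.java: the class and method names are hypothetical, the numbers are made up for illustration, and the link to the phantom text is one plausible reading of the behavior described above.

```java
/** Illustrative only: where the top of the text lands relative to the top of its line box. */
final class LeadingSketch {

    /**
     * Returns the offset of the text top from the line-box top, in points.
     * With centering, the spare space (which may be negative) is split above and below.
     * Without centering, the text starts at the top and any spare space follows it.
     */
    static double textTop(double lineHeight, double textHeight, boolean centered) {
        if (!centered) {
            return 0.0;
        }
        double halfLeading = (lineHeight - textHeight) / 2.0;
        return halfLeading;
    }

    public static void main(String[] args) {
        double lineHeight = 8.0;   // hypothetical line height in points
        double textHeight = 11.0;  // hypothetical maximum glyph height in points

        // Negative result: the text starts above the top of its line box.
        System.out.println("centered:     " + textTop(lineHeight, textHeight, true));
        // Zero: the text starts exactly at the top of its line box.
        System.out.println("not centered: " + textTop(lineHeight, textHeight, false));
    }
}
```

When the line height is smaller than the text height, the centered offset is negative, so on the first line of a page the text would begin above the page's content area. With centering disabled the offset is always zero.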

Software Bundles
The Flying Saucer code available here is the Arial branch of Flying Saucer Release 8, compiled with Java 6. If you just want to run the binary version, all you need to download is the first bundle. It contains iText version 2.1.7 for Java 6.

The iText source and binary code available here is version 2.1.7 with no modifications and has been compiled with Java 6 (a.k.a. 1.6). The binary supports PDF file encryption and includes the following jar files from Bouncy Castle:

bcmail-jdk16-146.jar
bcprov-jdk16-146.jar
bctsp-jdk16-146.jar

Note: If you live in a country that does not permit you to possess cryptography tools, don't download the binary version of iText available here. The Flying Saucer binary does not include these encryption libraries.

Quick Start Guide to Generating a PDF File with Flying Saucer
    Create a directory for Flying Saucer (e.g. /usr/local/flyingsaucer).

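#!/bin/sh
# Make the Flying Saucer and iText jars visible to the JVM.
CLASSPATH="/path/to/jar/files/core-renderer.jar:/path/to/jar/files/iText-2.1.7.jar"
export CLASSPATH
# Pass all arguments (url pdf [version]) through unchanged.
java org.xhtmlrenderer.simple.PDFRenderer "$@"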
fs url pdf [version]

where:
url is the file name or URL of an XHTML source document.
pdf is the file name of the PDF to create.
version is an optional PDF version number between 1.2 and 1.7 (default is 1.2).
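The argument contract above can be written down as a small Java sketch. This is not how PDFRenderer parses its command line; FsArgs and its fields are hypothetical names, and the sketch assumes that version values take the form 1.2 through 1.7.

```java
/** Illustrative only: the "fs url pdf [version]" argument contract. */
final class FsArgs {
    final String url;      // file name or URL of the XHTML source document
    final String pdf;      // file name of the PDF to create
    final String version;  // PDF version, "1.2" through "1.7"

    private FsArgs(String url, String pdf, String version) {
        this.url = url;
        this.pdf = pdf;
        this.version = version;
    }

    /** Validates the arguments and applies the default version of 1.2. */
    static FsArgs parse(String... args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException("usage: fs url pdf [version]");
        }
        String version = (args.length == 3) ? args[2] : "1.2";
        if (!version.matches("1\\.[2-7]")) {
            throw new IllegalArgumentException("version must be between 1.2 and 1.7");
        }
        return new FsArgs(args[0], args[1], version);
    }

    public static void main(String[] args) {
        FsArgs parsed = parse("book.xhtml", "book.pdf");
        System.out.println(parsed.url + " -> " + parsed.pdf + " (PDF " + parsed.version + ")");
    }
}
```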

Tips on using Flying Saucer
    When producing your XHTML file, use UTF-8 encoding.

@font-face {
    src: url(file:///Absolute/path/to/font/directory/ARIALUNI.TTF);
    -fs-pdf-font-embed: embed;
    -fs-pdf-font-encoding: Identity-H;
}

Note the two Flying Saucer extensions to the @font-face style directive. The -fs-pdf-font-embed: embed; line directs the PDF generator to embed any portions of the font that your document needs to display properly. Without this line, your document will display properly only if the recipient has this font. By not embedding the font, you will avoid any potential copyright and licensing issues with the owner of this font, but the document might not look as good at the other end. By embedding this font, you are ensuring that the document is self-contained.

The -fs-pdf-font-encoding: Identity-H; line instructs the PDF display software that this is a Unicode font and not a font that is restricted to certain code pages. This line is very important.

body

The line-height property is a value that is multiplied by the font-size to set the vertical line height. In this example, the line height will be 7.956 points (6.8 x 1.17). It may seem counter-intuitive that a line height that is 17% larger than the font size should be smaller than the default line height but that is because, at least for this font, a 6.8 point size has a maximum character size of 11.1 points (63% larger than the nominal font size). FYI, Arial Unicode MS has the following font metrics:

unitsPerEm = 2048
xMax = 4629
xMin = -2071
yMax = 2200
yMin = -572
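The line-height arithmetic in this tip can be checked with a few lines of Java. This is only a sketch of the two calculations quoted in the text (font size times line-height, and how much larger the 11.1 point maximum character size is than the 6.8 point nominal size); the class and method names are hypothetical.

```java
/** Illustrative only: the line-height arithmetic quoted in the text. */
final class LineHeightSketch {

    /** Line height in points: the font size multiplied by the unitless line-height value. */
    static double lineHeight(double fontSizePt, double multiplier) {
        return fontSizePt * multiplier;
    }

    /** How much larger value is than base, as a percentage. */
    static double percentLarger(double value, double base) {
        return (value / base - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double fontSize = 6.8;        // nominal font size in points (from the text)
        double multiplier = 1.17;     // line-height value (from the text)
        double maxCharHeight = 11.1;  // maximum character size in points (from the text)

        System.out.println(String.format("line height: %.3f pt", lineHeight(fontSize, multiplier)));
        System.out.println(String.format("max character size is %.0f%% larger than the font size",
                percentLarger(maxCharHeight, fontSize)));
    }
}
```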
table {
    -fs-border-spacing-horizontal: 0;
    -fs-border-spacing-vertical: 0;
    border-spacing: 0;
    border-style: none;
    border-width: 0;
    border: 0;
    padding: 0;
    margin: 0;
}
td

The Flying Saucer style extensions (-fs-border-spacing-horizontal: 0; and -fs-border-spacing-vertical: 0;) are critical to making this work.

Here is an example of how you can define two different page layouts for printing a book in which the side of the page closest to the binding (the inside of the page) has a larger margin (72 points) than the outside (60 points). There is also a footer centered at the bottom of every page with the letter A, a dash and the current page number.

/* Odd page numbers */
@page:right {
    size: 9.25in 11.25in;
    margin: 32pt 0 40pt 72pt;
    padding: 0;
    @bottom-center {
        content: "A-" counter(page);
        font-family: Arial Unicode MS, Lucida Sans Unicode, Arial, sans-serif;
        font-size: 6.8pt;
    }
}

/* Even page numbers */
@page:left {
    size: 9.25in 11.25in;
    margin: 32pt 0 40pt 60pt;
    padding: 0;
    @bottom-center {
        content: "A-" counter(page);
        font-family: Arial Unicode MS, Lucida Sans Unicode, Arial, sans-serif;
        font-size: 6.8pt;
    }
}

This particular example was used in conjunction with an XHTML generating application that performed its own line wrapping and page breaks. However, since Flying Saucer will wrap lines that are too long and insert page breaks when a page is too long, it was important to know whether the number of pages generated by the source application matched the final PDF page count. The two applications (the XHTML generator and the Min2PDF converter) communicated by adding a page count in an XML comment at the end of the XHTML file. For a 10 page document, the last line would look like this:

The preWrite() method in Min2PDF.java accesses the parsed XHTML document tree to locate the last child of the document. If this node is a comment, the text within the comment is parsed for the page count. Based on whether this page count matches the number of pages in the final PDF document (which is passed as an argument to preWrite()), a text line is generated and saved as a subject metadata item in the XHTML document, which will be used as the subject property of the generated PDF file. The Min2PDF class also sets the exit value for the process to tell the caller whether the conversion succeeded (0), a rendering error occurred (1) or there was a page count error (2). Note the commented-out line at the start of preWrite():

//String s = od.getMetadataByName("subject"); // existing subject line
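The page count check described above can be sketched with the standard Java DOM API. This is not the code from Min2PDF.java: PageCountCheck and its methods are hypothetical, the exact comment format is an assumption (the sketch takes the first run of digits in the comment), a parse failure stands in for a rendering error, and a document with no trailing comment is assumed to pass.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Illustrative only: compares a page count stored in a trailing comment with the PDF page count. */
final class PageCountCheck {
    static final int OK = 0;
    static final int RENDER_ERROR = 1;
    static final int PAGE_COUNT_ERROR = 2;

    /** Returns the page count found in the document's last child if it is a comment, else -1. */
    static int expectedPages(Document doc) {
        Node last = doc.getLastChild();
        if (last == null || last.getNodeType() != Node.COMMENT_NODE) {
            return -1;
        }
        // Assumption: the count is the first run of digits in the comment text.
        Matcher m = Pattern.compile("\\d{1,9}").matcher(last.getNodeValue());
        return m.find() ? Integer.parseInt(m.group()) : -1;
    }

    /** Returns the exit value: 0 for success, 1 for a parse failure, 2 for a page count mismatch. */
    static int exitCode(String xhtml, int pdfPageCount) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            return RENDER_ERROR;
        }
        int expected = expectedPages(doc);
        if (expected < 0) {
            return OK; // assumption: nothing to compare against, so no error
        }
        return expected == pdfPageCount ? OK : PAGE_COUNT_ERROR;
    }

    public static void main(String[] args) {
        String xhtml = "<html><head/><body/></html>\n<!-- 10 -->";
        System.out.println(exitCode(xhtml, 10));
        System.out.println(exitCode(xhtml, 11));
    }
}
```

A comment that follows the root element is a child of the DOM Document node itself, which is why getLastChild() on the document is enough to find it.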