Tuesday, May 11, 2010

Merging PDF files

The Python code at the bottom of this posting can be used to merge PDF files via GhostScript (gs). In theory gs can do that by itself; in practice I found that merging about 160 single-page files into one resulted in strange characters appearing in some of the text. The Python code merges files two at a time, repeatedly, until all are merged. Merging files 1 and 2, then that with 3, then that with 4, etc. may also work, but it becomes very slow for a large set of files, because every step re-processes everything merged so far. The binary approach here is much faster.
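To get a feel for why pairing is faster, here is a rough sketch that counts how many pages gs has to write under each strategy. It assumes every input is a single page and that the cost of a gs run is proportional to the pages it writes; both function names are made up for this illustration and are not part of the script below.

```python
def sequential_cost(n):
    """Pages written when merging n single-page files one after another."""
    total, merged = 0, 1
    for _ in range(n - 1):
        merged += 1        # the running result grows by one page
        total += merged    # and is rewritten in full at every step
    return total


def pairwise_cost(n):
    """Pages written when merging n single-page files two at a time."""
    sizes = [1] * n
    total = 0
    while len(sizes) > 1:
        merged_sizes = []
        while len(sizes) >= 2:
            a = sizes.pop(0)
            b = sizes.pop(0)
            merged_sizes.append(a + b)
            total += a + b
        merged_sizes.extend(sizes)  # an odd file out is carried to the next round
        sizes = merged_sizes
    return total
```

Comparing `sequential_cost(160)` with `pairwise_cost(160)` shows the gap for a set of files the size of mine: the sequential count grows roughly with the square of the number of files, the pairwise count much more slowly.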

This code is just a quick hack. If you have a large pile of PDFs to merge and GhostScript is failing as described above, this could save your day. It's invoked from the command line by:

python pdfmerge.py *.pdf

and merges the files in the order listed, creating a lot of files called __pdfNNNN.pdf (NNNN being a sequence number) in the process. The last __pdfNNNN.pdf file produced is your output; you should rename that one and delete the rest. I did say it was just a quick hack :-)
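The rename-and-delete step could be automated along these lines. This is only a sketch: `collect_output` and its arguments are hypothetical names, and it assumes the temporary files follow the four-digit `__pdfNNNN.pdf` pattern, so that sorting the names as strings puts the newest one last.

```python
import glob
import os


def collect_output(output_name, directory="."):
    """Keep the newest __pdfNNNN.pdf as output_name and delete the others.

    Returns the list of intermediate files that were deleted.
    """
    pattern = os.path.join(directory, "__pdf[0-9][0-9][0-9][0-9].pdf")
    # zero-padded sequence numbers sort correctly as plain strings
    temps = sorted(glob.glob(pattern))
    if not temps:
        raise ValueError("no __pdfNNNN.pdf files found in %r" % directory)
    final = temps.pop()  # highest sequence number = last merge
    os.rename(final, os.path.join(directory, output_name))
    for path in temps:
        os.remove(path)
    return temps
```

Calling `collect_output("merged.pdf")` after a run would leave just `merged.pdf` next to the original inputs. The four-digit pattern stops matching once a run produces more than 9999 temporary files.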

"""Merge pdfs using GhostScript (gs)

Workaround for a bug in gs such that::

    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER
    -sOutputFile=foo.pdf *.pdf

produces odd text corruptions if *.pdf expands to a large number of files.

This program uses a binary merging approach which seems to avoid the bug.
"""

import sys
import os
import subprocess
from collections import defaultdict

pages = defaultdict(lambda:1)  # number of pages in each file

pdfs = sys.argv[1:]            # pdfs to merge, already ordered

idx = 0                        # sequence number for temporary pdfs

newpdfs = []                   # list of new pdfs to process

while pdfs or newpdfs:

    if not pdfs:  # pdfs list ends up empty whenever it starts of even length
        pdfs = newpdfs
        newpdfs = []

    if len(pdfs) == 1:
        # only one left, just add it to the end of the list for next iteration
        newpdfs.append(pdfs.pop(0))
        pdfs = newpdfs
        newpdfs = []
        if len(pdfs) == 1:  # we're done
            break

    pdf0 = pdfs.pop(0)  # pair of pdfs to merge
    pdf1 = pdfs.pop(0)

    assert os.path.isfile(pdf0)  # should both exist
    assert os.path.isfile(pdf1)

    idx += 1
    newpdf = "__pdf%04d.pdf" % idx

    pages[newpdf] = pages[pdf0] + pages[pdf1]

    newpdfs.append(newpdf)  # add new pdf to list for next iteration

    cmd = ("gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER "
    "-sOutputFile=%s %s %s" % (newpdf, pdf0, pdf1))

    # here's a lot of unneeded paranoia that arose when I was feeding in some
    # bad (0 byte) pdf files; it doesn't hurt to leave it in

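    print("%s %s %s %d %d %d %d %d" % (
        pdf0, pdf1, newpdf, pages[pdf0], pages[pdf1],
        pages[newpdf], len(pdfs), len(newpdfs)))
    print(cmd)

    # capture both streams as text; stderr is read but not used
    proc = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)

    out, dummy = proc.communicate()

    print(out)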

    # the term 'Processing pages ' should occur twice, if both files were read
    procs = [i for i in out.split('\n') if i.startswith('Processing pages ')]

    assert len(procs) == 2
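
One weakness of the quick hack is that `cmd.split()` breaks on file names containing spaces. A possible variation, sketched here with hypothetical helper names, builds the argument list directly (using the same gs flags as above) and puts the 'Processing pages ' check into a small function:

```python
def build_gs_command(newpdf, pdf0, pdf1):
    """Return the gs argument list for merging pdf0 and pdf1 into newpdf."""
    # one list element per argument, so spaces in file names are harmless
    return ["gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH", "-dSAFER",
            "-sOutputFile=" + newpdf, pdf0, pdf1]


def count_processed_files(gs_output):
    """Count the 'Processing pages ' lines in text captured from gs."""
    return sum(1 for line in gs_output.splitlines()
               if line.startswith("Processing pages "))
```

The list from `build_gs_command` can be handed straight to `subprocess.Popen` in place of `cmd.split()`, and `count_processed_files(out) == 2` replaces the list comprehension and assert.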
