The Python code at the bottom of this posting can be used to merge PDF files via GhostScript (gs). In theory gs can do that by itself; in practice I found that merging about 160 single-page files into one resulted in strange characters appearing in some of the text. The Python code merges files two at a time, repeatedly, until all are merged. Merging files 1 and 2, then that with 3, then that with 4, etc. may also work, but it becomes very slow for a large set of files, because every step reprocesses everything merged so far. The binary approach here is much faster.
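To get a feel for why the pairwise approach wins, here is a rough back-of-the-envelope sketch. It assumes the time gs takes is roughly proportional to the number of pages it writes, which is a simplification rather than a measurement, and sequential_cost and pairwise_cost are names invented for this illustration (they are not part of pdfmerge.py).

```python
def sequential_cost(page_counts):
    """Pages written when merging file 1+2, then that with 3, then 4, ..."""
    total = 0
    merged = page_counts[0]
    for count in page_counts[1:]:
        merged += count      # the running result grows with every step
        total += merged      # and is rewritten in full each time
    return total


def pairwise_cost(page_counts):
    """Pages written when merging neighbours two at a time, round by round."""
    total = 0
    current = list(page_counts)
    while len(current) > 1:
        nxt = []
        for i in range(0, len(current) - 1, 2):
            merged = current[i] + current[i + 1]
            total += merged
            nxt.append(merged)
        if len(current) % 2:
            nxt.append(current[-1])  # odd one out is carried to the next round
        current = nxt
    return total


if __name__ == "__main__":
    single_pages = [1] * 160  # assumed input: 160 one-page files
    print(sequential_cost(single_pages), pairwise_cost(single_pages))
```

Under that assumption, 160 single-page files cost 12879 pages written the sequential way against 1216 the pairwise way, so roughly a factor of ten.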
This code is just a quick hack. If you have a large pile of PDFs to merge and GhostScript is failing as described above, this could save your day. It's invoked from the command line by:
python pdfmerge.py *.pdf
and merges the files in the order listed, creating a lot of files called __pdfXXXX.pdf in the process. The last __pdfXXXX.pdf file produced (the one with the highest number) is your output; you should rename that one and delete the rest. I did say it was just a quick hack :-)
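If you'd rather not do the renaming and deleting by hand, something along these lines might do it. This is a minimal sketch and not part of the script: collect_result and the output name are made up for illustration, and it assumes the temporaries follow the four-digit __pdfNNNN.pdf pattern. Note that it will also sweep up stale temporaries from an earlier run in the same directory.

```python
import glob
import os


def collect_result(output_name, directory="."):
    """Rename the highest-numbered __pdfNNNN.pdf and delete the others."""
    pattern = os.path.join(directory, "__pdf[0-9][0-9][0-9][0-9].pdf")
    # zero-padded numbers sort correctly as plain strings
    temps = sorted(glob.glob(pattern))
    if not temps:
        raise RuntimeError("no __pdfNNNN.pdf files found in %r" % directory)
    target = os.path.join(directory, output_name)
    os.replace(temps[-1], target)   # last file produced is the full merge
    for leftover in temps[:-1]:
        os.remove(leftover)         # intermediate merges are no longer needed
    return target
```

For example, collect_result("merged.pdf") run in the working directory after pdfmerge.py finishes would leave just merged.pdf alongside the original inputs.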
"""Merge pdfs using GhostScript (gs)

Workaround for a bug in gs such that::

    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER
    -sOutputFile=foo.pdf *.pdf

produces odd text corruptions if *.pdf expands to a large number of files.
This program uses a binary merging approach which seems to avoid the bug.
"""
import sys
import os
import subprocess
from collections import defaultdict
pages = defaultdict(lambda: 1)  # number of pages in each file (inputs default to 1)
pdfs = sys.argv[1:]  # pdfs to merge, already ordered
idx = 0  # sequence number for temporary pdfs
newpdfs = []  # list of new pdfs to process
while pdfs or newpdfs:
    if not pdfs:  # pdfs list ends up empty whenever it starts of even length
        pdfs = newpdfs
        newpdfs = []
    if len(pdfs) == 1:
        # only one left, just add it to the end of the list for next iteration
        newpdfs.append(pdfs.pop(0))
        pdfs = newpdfs
        newpdfs = []
    if len(pdfs) == 1:  # we're done
        break
    pdf0 = pdfs.pop(0)  # pair of pdfs to merge
    pdf1 = pdfs.pop(0)
    assert os.path.isfile(pdf0)  # should both exist
    assert os.path.isfile(pdf1)
    idx += 1
    newpdf = "__pdf%04d.pdf" % idx
    pages[newpdf] = pages[pdf0] + pages[pdf1]
    newpdfs.append(newpdf)  # add new pdf to list for next iteration
    # build the command as a list so file names containing spaces survive
    cmd = ["gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH", "-dSAFER",
           "-sOutputFile=%s" % newpdf, pdf0, pdf1]
    # here's a lot of unneeded paranoia that arose when I was feeding in some
    # bad (0 byte) pdf files, doesn't hurt to leave it in
    print(pdf0, pdf1, newpdf, pages[pdf0], pages[pdf1],
          pages[newpdf], len(pdfs), len(newpdfs))
    print(" ".join(cmd))
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)  # get output as str, not bytes
    out, dummy = proc.communicate()
    print(out)
    print()
    # the term 'Processing pages ' should occur twice, if both files were read
    procs = [i for i in out.split('\n') if i.startswith('Processing pages ')]
    assert len(procs) == 2
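Since the paranoia above came from feeding in 0 byte files, it may be simpler to catch those before gs ever runs. Here's a hedged sketch of such a pre-flight check; find_bad_inputs is a hypothetical helper, not something the script above contains, and it only looks for missing and empty files, not for PDFs that are corrupt in other ways.

```python
import os


def find_bad_inputs(paths):
    """Return (path, reason) pairs for inputs that are missing or empty."""
    problems = []
    for path in paths:
        if not os.path.isfile(path):
            problems.append((path, "not a file"))
        elif os.path.getsize(path) == 0:
            problems.append((path, "0 bytes"))
    return problems
```

One way to use it would be to call find_bad_inputs(sys.argv[1:]) before the merge loop and stop with a message if the returned list is not empty.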