The Python code at the bottom of this post merges PDF files via GhostScript (gs). In theory gs can do that by itself; in practice I found that merging about 160 single-page files in one go resulted in strange characters appearing in some of the text. The Python code merges files two at a time, repeatedly, until all are merged. Merging files 1 and 2, then that result with 3, then that with 4, and so on may also work, but it becomes very slow for a large set of files, because every step re-reads everything merged so far. The binary approach here is much faster.
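To get a rough feel for the speed difference, here is a small sketch that counts how many pages gs would have to read under each strategy. It assumes single-page inputs and treats "pages read" as a stand-in for run time, which is only an approximation; the function names are made up for this illustration and are not part of the merge script.

```python
def pages_read_sequential(n):
    """Pages read when folding n single-page files in one at a time."""
    total = 0
    merged = 1  # pages in the running merged file
    for _ in range(n - 1):
        total += merged + 1  # re-read everything so far, plus one new page
        merged += 1
    return total


def pages_read_pairwise(n):
    """Pages read when merging n single-page files in pairs, round by round."""
    total = 0
    sizes = [1] * n  # page count of each file in the current round
    while len(sizes) > 1:
        next_round = []
        for i in range(0, len(sizes) - 1, 2):
            total += sizes[i] + sizes[i + 1]
            next_round.append(sizes[i] + sizes[i + 1])
        if len(sizes) % 2:
            next_round.append(sizes[-1])  # odd one out carries over unmerged
        sizes = next_round
    return total


if __name__ == "__main__":
    print(pages_read_sequential(160), pages_read_pairwise(160))
```

For 160 single-page files this counts 12879 pages read for the one-at-a-time approach against 1216 for pairwise merging, roughly a tenfold difference under these assumptions.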
The merge script is just a quick hack. If you have a large pile of PDFs to merge and GhostScript is failing as described above, it could save your day. Invoke it from the command line with:
python pdfmerge.py *.pdf
and merges the files in the order listed, creating a lot of temporary files called __pdfNNNN.pdf (__pdf0001.pdf, __pdf0002.pdf, ...) in the process. The last __pdfNNNN.pdf file produced is your output; rename that one and delete the rest. I did say it was just a quick hack :-)
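If the manual tidy-up gets tedious, something like the following separate helper could do it. This is only a sketch, not part of the merge script: the function name finish_merge and the default output name merged.pdf are invented here, and it assumes fewer than 10000 temporaries, so the four-digit zero-padded names sort correctly as strings.

```python
import glob
import os


def finish_merge(workdir=".", final_name="merged.pdf"):
    """Keep the highest-numbered __pdfNNNN.pdf as final_name, delete the rest.

    Returns the path of the renamed file, or None if no temporaries exist.
    """
    pattern = os.path.join(workdir, "__pdf[0-9][0-9][0-9][0-9].pdf")
    temps = sorted(glob.glob(pattern))  # zero-padded, so string sort is numeric
    if not temps:
        return None
    final_path = os.path.join(workdir, final_name)
    os.replace(temps[-1], final_path)  # the last temporary is the full merge
    for leftover in temps[:-1]:
        os.remove(leftover)
    return final_path


if __name__ == "__main__":
    print(finish_merge())
```

Run it in the directory where the merge was done. Note that it overwrites any existing merged.pdf there.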
"""Merge pdfs using GhostScript (gs) Work around for a bug in gs such that:: gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -sOutputFile=foo.pdf *.pdf produces odd text corruptions if *.pdf expands to a large number of files. This program uses a binary merging approach which seems to avoid the bug. """ import sys import os import subprocess from collections import defaultdict pages = defaultdict(lambda:1) # number of pages in each file pdfs = sys.argv[1:] # pdfs to merge, already ordered idx = 0 # sequence number for temporary pdfs newpdfs = [] # list of new pdfs to process while pdfs or newpdfs: if not pdfs: # pdfs list ends up empty whenever it starts of even length pdfs = newpdfs newpdfs = [] if len(pdfs) == 1: # only one left, just add it to the end of the list for next iteration newpdfs.append(pdfs.pop(0)) pdfs = newpdfs newpdfs = [] if len(pdfs) == 1: # we're done break pdf0 = pdfs.pop(0) # pair of pdfs to merge pdf1 = pdfs.pop(0) assert os.path.isfile(pdf0) # should both exist assert os.path.isfile(pdf1) idx += 1 newpdf = "__pdf%04d.pdf" % idx pages[newpdf] = pages[pdf0] + pages[pdf1] newpdfs.append(newpdf) # add new pdf to list for next iteration cmd = ("gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER " "-sOutputFile=%s %s %s" % (newpdf, pdf0, pdf1)) # here's a lot of uneeded paranoia that arose when I was feeding in some # bad (0 byte) pdf files, doesn't hurt to leave it in print pdf0, pdf1, newpdf, pages[pdf0], pages[pdf1], \ pages[newpdf], len(pdfs), len(newpdfs) print cmd proc = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.STDOUT) out,dummy = proc.communicate() print out print # the term 'Processing pages ' should occur twice, if both files were read procs = [i for i in out.split('\n') if i.startswith('Processing pages ')] assert len(procs) == 2