Wednesday, May 26, 2010

Loading SQL tables column by column

Goal: Load data copied from a PDF table into an RDBMS table column by column, using SQL.

Selecting and copy/pasting the whole PDF table at once didn't extract the data in a clean or usable way; things got jumbled. Selecting one column at a time (using xpdf), however, cleanly extracted the data in that column. But how can you insert it into the table without messing up the ordering of each column's content? OMG! The Excel "reordering destroys data integrity" problem has come to SQL! :-) Anyway, given a table like this:

21  Ant  One
31  Bat  Two
76  Cat  Three
89  Dog  Four

The following approach will work (from a postgres / psql session):

create table rescued_data (
  col1 int,
  col2 text,
  col3 text,
  ordering int
);

create temp sequence s;
create temp table col (val text);

\copy col from stdin
21
31
76
89
\.

insert into rescued_data (col1, ordering)
  select val::int, nextval('s') from col;

-- note need to match type with ::int in the above

select setval('s', 1, false);  -- reset the sequence
truncate col;

\copy col from stdin
Ant
Bat
Cat
Dog
\.

update rescued_data set col2 = val
  from (select val, nextval('s') as seq from col) as x
  where seq = ordering;

-- repeating above for next column

select setval('s', 1, false);  -- reset the sequence
truncate col;

\copy col from stdin
One
Two
Three
Four
\.

update rescued_data set col3 = val
  from (select val, nextval('s') as seq from col) as x
  where seq = ordering;

select * from rescued_data;

-- if necessary, you can
alter table rescued_data drop column ordering;
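The same column-wise reassembly can be sketched in plain Python; the lists here are hypothetical stand-ins for the text pasted from the PDF, one column at a time:

```python
# Each list is one column, pasted separately (hypothetical data).
col1 = [21, 31, 76, 89]
col2 = ["Ant", "Bat", "Cat", "Dog"]
col3 = ["One", "Two", "Three", "Four"]

# zip pairs up the nth entry of each column, which is exactly the
# role the 'ordering' sequence plays in the SQL version above
rows = list(zip(col1, col2, col3))

for row in rows:
    print(row)
```

As with the SQL sequence trick, this only works because each column was extracted with its original order intact.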

Sunday, May 16, 2010

Python/PyQt upgrade triggers strange bug

Percy: Look, look, I just can't take the pressure of all these omens anymore!
Edmund: Percy...
Percy: No, no, really, I'm serious! Only this morning in the courtyard I saw a horse with two heads and two bodies!
Edmund: Two horses standing next to each other?
Percy: Yes, I suppose it could have been.
Blackadder, "Witchsmeller Pursuivant"

Today I saw a bug with one head and two bodies. Upgrading Ubuntu from 9.10 to 10.04 broke a toolbar button in Leo, the world's best code editor / project manager / note sorter. The upgrade involved transitions from Python 2.6.4 -> 2.6.5 and PyQt 4.6 -> 4.7.2. The forward and back browsing buttons supplied by Leo's nav_qt plugin stopped working.

After Brain had been debugging, testing, googling, comparing, etc., for over two hours, Intuition wanders past and says, "oh, ha, why not try

def __init__ (self,c):
         self.c = c
+        c._prev_next = self
         self.makeButtons()
Sometimes, Brain doesn't like Intuition very much.

Fortunately Brain was able to save some face, as

-        act_l = QtGui.QAction(icon_l, 'prev', ib_w)           
-        act_r = QtGui.QAction(icon_r, 'next', ib_w)           
+        act_l = QtGui.QAction(icon_l, 'prev', ib_w, triggered=self.clickPrev)   
+        act_r = QtGui.QAction(icon_r, 'next', ib_w, triggered=self.clickNext)  
was also required.

So it seems the upgrade caused two changes which both had the same symptom, making debugging a challenge. The plugin class instance, or the actions it was creating, are now being garbage collected where they weren't before. The c._prev_next = self assignment prevents the instance from being collected, although it's unclear why it should also prevent the actions being collected. You would think the GUI's links to the actions would be enough to protect them, so perhaps that bug body wasn't an old glitch going away but a new one being introduced. OTOH the GUI must have a link to the actions, as it's able to trigger them.

The triggered=self.clickPrev addition presumably covers a change in the emission of 'clicked()' by QToolButton, or a change in default actions, or something. Passing the parameter that way is a PyQt alternative to act_r.connect(act_r, QtCore.SIGNAL("triggered()"), self.clickNext), which would probably also have worked.
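The keep-a-reference half of the fix can be illustrated without Qt at all. A minimal sketch (the Commander and PrevNext stand-in classes are hypothetical, standing in for Leo's c object and the plugin class):

```python
import gc
import weakref

class Commander(object):
    pass  # stand-in for Leo's 'c' commander object

class PrevNext(object):
    def __init__(self, c, keep_ref):
        self.c = c
        if keep_ref:
            c._prev_next = self  # the one-line fix: anchor self on c

c1 = Commander()
probe1 = weakref.ref(PrevNext(c1, keep_ref=False))
gc.collect()
assert probe1() is None  # nothing held a reference, so it was collected

c2 = Commander()
probe2 = weakref.ref(PrevNext(c2, keep_ref=True))
gc.collect()
assert probe2() is not None  # anchored on the commander, it survives
```

The weakref probes report whether the instance is still alive; with no strong reference anywhere, Python reclaims it, taking its signal connections down with it.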

Wednesday, May 12, 2010

zipas - ensure .zip file contains an outer wrapper folder

A trivial bash script to create a .zip file where everything's in
a top-level folder. Saves creating a folder temporarily, and copying
/ linking things into it:

#!/bin/sh

if [ $# -lt 2 ]; then cat <<EOF

usage: $0 <wrapper-folder> <target files>

e.g. zipas jul0907 *.c
EOF
else

  PATHNAME="$1"

  HEADNAME=${PATHNAME%%/*}  # i.e. 'foo' from 'foo/bar/inner'

  if [ -e "$HEADNAME" ]; then

    echo "ERROR: '$HEADNAME' exists"
  else
    shift
    mkdir -p "$PATHNAME"

    for i in "$@"; do
      ln -s "$PWD/$i" "$PATHNAME/$i"
    done
    zip -r "$HEADNAME.zip" "$HEADNAME"

    rm -rf "$HEADNAME"
  fi
fi
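For comparison, the same "wrap everything in a folder" effect can be sketched in Python with the standard zipfile module, skipping the temporary symlink directory entirely (the zipas function name and a hypothetical pdfzipas-style invocation are my own):

```python
import os
import sys
import zipfile

def zipas(wrapper, files):
    """Create <head>.zip with every file stored under the wrapper path."""
    head = wrapper.split(os.sep)[0]  # 'foo' from 'foo/bar/inner'
    zf = zipfile.ZipFile(head + ".zip", "w", zipfile.ZIP_DEFLATED)
    for f in files:
        # write the file content under the wrapper path; the archive
        # name is independent of where the file actually lives
        zf.write(f, os.path.join(wrapper, os.path.basename(f)))
    zf.close()

if __name__ == "__main__":
    zipas(sys.argv[1], sys.argv[2:])
```

Because ZipFile lets you choose the archive name independently of the on-disk name, no folder ever needs to exist on disk.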


Tuesday, May 11, 2010

Merging PDF files

The Python code at the bottom of this posting can be used to merge PDF files (via GhostScript (gs)). In theory gs can do that by itself; in practice I found that merging about 160 single-page files into one resulted in strange characters appearing in some of the text. The Python code merges files two at a time, repeatedly, until all are merged. Merging file 1 and 2, then that with 3, then that with 4, etc. may also work, but it becomes very slow for a large set of files. The binary approach here is much faster.

This code is just a quick hack. If you have a large pile of PDFs to merge and GhostScript is failing as described above, this could save your day. It's invoked from the command line by:

python pdfmerge.py *.pdf

and merges the files in the order listed, creating a lot of files called __pdfXXXX.pdf in the process. The last __pdfXXXX.pdf file produced is your output; you should rename that one and delete the rest. I did say it was just a quick hack :-)
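The pairwise merge order can be illustrated with strings standing in for files. This is a sketch of the same round-based reduction; the actual script below uses a work-queue rather than explicit rounds, but the effect is the same:

```python
def binary_merge(items, merge):
    """Repeatedly merge adjacent pairs until one item remains."""
    while len(items) > 1:
        merged = []
        # merge neighbours two at a time; an odd item carries over
        for i in range(0, len(items) - 1, 2):
            merged.append(merge(items[i], items[i + 1]))
        if len(items) % 2:
            merged.append(items[-1])
        items = merged
    return items[0]

# with string concatenation as a stand-in for the gs merge step,
# the input order is preserved:
result = binary_merge(list("abcdefg"), lambda a, b: a + b)
```

Each round halves the number of files, so 160 inputs need only 8 rounds rather than 159 sequential merges of an ever-growing accumulator.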

"""Merge pdfs using GhostScript (gs)

Work around for a bug in gs such that::
    
    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER 
    -sOutputFile=foo.pdf *.pdf
    
produces odd text corruptions if *.pdf expands to a large number of files.

This program uses a binary merging approach which seems to avoid the bug.
"""
    
import sys
import os
import subprocess
from collections import defaultdict

pages = defaultdict(lambda:1)  # number of pages in each file

pdfs = sys.argv[1:]            # pdfs to merge, already ordered

idx = 0                        # sequence number for temporary pdfs

newpdfs = []                   # list of new pdfs to process

while pdfs or newpdfs:

    if not pdfs:  # pdfs list ends up empty whenever it starts with an even length
        pdfs = newpdfs
        newpdfs = []

    if len(pdfs) == 1:
        # only one left, just add it to the end of the list for next iteration
        newpdfs.append(pdfs.pop(0))
        pdfs = newpdfs
        newpdfs = []
        if len(pdfs) == 1:  # we're done
            break

    pdf0 = pdfs.pop(0)  # pair of pdfs to merge
    pdf1 = pdfs.pop(0)

    assert os.path.isfile(pdf0)  # should both exist
    assert os.path.isfile(pdf1)

    idx += 1
    newpdf = "__pdf%04d.pdf" % idx

    pages[newpdf] = pages[pdf0] + pages[pdf1]

    newpdfs.append(newpdf)  # add new pdf to list for next iteration

    cmd = ("gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER "
    "-sOutputFile=%s %s %s" % (newpdf, pdf0, pdf1))

    # here's a lot of unneeded paranoia that arose when I was feeding in some
    # bad (0 byte) pdf files; it doesn't hurt to leave it in

    print pdf0, pdf1, newpdf, pages[pdf0], pages[pdf1], \
        pages[newpdf], len(pdfs), len(newpdfs)
    print cmd

    proc = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, 
        stderr=subprocess.STDOUT)

    out,dummy = proc.communicate()

    print out
    print

    # the term 'Processing pages ' should occur twice, if both files were read
    procs = [i for i in out.split('\n') if i.startswith('Processing pages ')]

    assert len(procs) == 2