Wednesday, May 8, 2019

A simple Python web app, using Flexx and Docker, behind a proxy.

A really simple problem. An Arduino based data logger is producing a text file that looks like this:

RUN: 201904191310 25 8ef45 200
24 Jan 2018 12:23:34
342 522
542 124
123 452
RUN: 201904191310 25 8ef45 300
24 Jan 2018 12:24:54
423 252
452 241
231 542

It needs to be converted to CSV like this:

calib_run,temp_set,unit_id,targconc,time,adc_cond,adc_temp,count
201904191310,25,8ef45,200,24 Jan 2018 12:23:34,335.67,366.0,3
201904191310,25,8ef45,300,24 Jan 2018 12:24:54,368.67,345.0,3

I.e. group by the ID fields and take the mean and count of the observation fields.  It could be done easily enough in vanilla Python, but it seemed like a nice simple case to experiment with one of those automatic parser generators.  Without much research I picked Lark.  From its docs. I came up with the following definition:

start: block+
block: "RUN:" calib_run temp_set unit_id targconc time obs+

calib_run: NUMBER
temp_set: NUMBER
targconc: NUMBER
unit_id: CHARS
time: DATE
obs: adc_cond adc_temp
adc_cond: NUMBER
adc_temp: NUMBER

%import common.NUMBER
%import common.WS
%ignore WS
CHARS: /\S+/
DATE: NUMBER WS+ MNTH WS+ NUMBER WS+ CHARS
MNTH: ("Jan"|"Feb"|"Mar"|"Apr"|"May"|"Jun"|"Jul"|"Aug"|"Sep"|"Oct"|"Nov"|"Dec")

My "Domain Specific Language" (DSL) starts with the constant sentinel string "RUN:", and then a set of ID fields and then some number of observations that are pairs of numbers.  This is Extended Backus-Naur form (EBNF), a way of formally describing language structure, similar to what you see in the Python docs. and other places.

When you run it, you get a tree of nodes, like this:

start
  block
    calib_run   201904191310
    temp_set    25
    unit_id     8ef45
    targconc    200
    time        24 Jan 2018 12:23:34
    obs
      adc_cond  342
      adc_temp  522
    obs
      adc_cond  542
      adc_temp  124
  block
    calib_run   201904191310
    temp_set    25
    unit_id     8ef45
    ...
The code looks something like this:
from lark import Lark, Token
parser = Lark(grammar)
pt = parser.parse(text)
print(pt.pretty())
where grammar and text are the EBNF and raw Arduino text shown above. Here's some code (parse_text is the entry point) which uses EBNF, Lark, and Pandas to convert the text to a list of lists, essentially the parsed form of the CSV output we want:

  import pandas as pd
  from lark import Lark, Token

  flds = ['calib_run', 'temp_set', 'unit_id', 'targconc', 'time',
          'adc_cond', 'adc_temp']

  def proc_block(node, callback, state=None):
      # walk the parse tree depth first; when a node's child is a Token
      # (a leaf like calib_run or adc_cond) store its text in state under
      # the node's name and let the callback react
      if state is None:
          state = dict(__res=[])
      if isinstance(node.children[0], Token):
          state[node.data] = str(node.children[0])
          callback(state, node)
      else:
          for child in node.children:
              proc_block(child, callback, state)

      return state

  def callback(state, node):
      # adc_temp is the last field in an obs pair, so seeing it means a
      # complete row is available - snapshot the current field values
      if node.data == 'adc_temp':
          state['__res'].append([state[i] for i in flds])

  def parse_text(text, grammar):
      parser = Lark(grammar)
      pt = parser.parse(text)
      res = None
      num = [i for i in flds if 'adc_' in i]  # observation (numeric) fields
      grp = list(set(flds) - set(num))        # ID fields to group by
      for block in pt.children:
          # one row per observation, ID fields repeated on every row
          df = proc_block(block, callback)['__res']
          df = pd.DataFrame(df, columns=flds)
          for fld in num:
              df[fld] = df[fld].astype(float)
          counts = df.groupby(grp).count()
          means = df.groupby(grp).mean().round(2).reset_index()
          means['count'] = counts['adc_cond'].tolist()
          res = means if res is None else res.append(means)

      res = res[flds+['count']]  # reorder to put ID fields first again
      return [res.columns.tolist()] + res.values.tolist()
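
To turn that list of lists into an actual CSV file there's not much left to do; a minimal sketch using the standard csv module (the output file name is just an example):

import csv

def rows_to_csv(rows, path):
    # rows: header row first, then data rows, as returned by parse_text above
    with open(path, 'w', newline='') as out:
        csv.writer(out).writerows(rows)

# rows_to_csv(parse_text(text, grammar), 'calibration.csv')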
So that's pretty much mission accomplished as far as converting the input to CSV goes. To make it easy for the target audience to use, I decided to wrap it in a web app. using Flexx.  Nothing complicated there - just a text area, a label, and a button.  The label tells you to paste your text into the text area, and the button converts the raw form to CSV.  These datasets are small enough to handle by copy / pasting.  To make it work as a Docker container I have this:
  
if __name__ == '__main__':
    a = flx.App(Main)
    a.serve()
    flx.create_server(host="0.0.0.0", port=8000)
    flx.start()
in my Flexx code, so it listens on all interfaces, not just 127.0.0.1, and on a predictable port, 8000. It will be the only thing in the Docker container, so we know 8000 is available. The Dockerfile looks like this:
  
FROM continuumio/miniconda3

RUN conda install -c conda-forge lark-parser pandas flexx

RUN mkdir /.webruntime \
 && chmod a+rwx /.webruntime

COPY log2csv.py log2csv_ui.py /

CMD ["python", "log2csv_ui.py"]
and works as expected. A quick test through an SSH tunnel to the Docker container on the remote host and everything looks good, ready to deploy. I deploy these containers with Apache proxies to an address like http://example.com/log2csv/. I put the proxy in place, but argh, it only proxies the http:// requests, not the ws:// requests. Here's the Apache config. that fixed that:
  
    LoadModule proxy_module modules/mod_proxy.so
    LoadModule proxy_http_module modules/mod_proxy_http.so
    LoadModule proxy_html_module modules/mod_proxy_html.so
    LoadModule proxy_wstunnel_module modules/mod_proxy_wstunnel.so
    LoadModule rewrite_module modules/mod_rewrite.so
 
    ProxyPass /log2csv/ http://log2csv:8000/
    ProxyHTMLURLMap http://log2csv:8000/ /log2csv/

    RewriteEngine on
    RewriteCond %{HTTP:Upgrade} websocket [NC]
    RewriteCond %{HTTP:Connection} upgrade [NC]
    RewriteRule .* "ws://log2csv:8000%{REQUEST_URI}" [P]

    <Location /log2csv/>
        ProxyPassReverse /
        ProxyHTMLURLMap  /      /log2csv/
    </Location>
Note that the Apache server's also running in a (different) docker container linked to the Flexx web app. container with --link log2csv, so the hostname for the Flexx web app. container is just log2csv from within the Apache server's docker container. In a different context you might use 127.0.0.1.

There are probably some hidden rough edges in the above Apache config.  Using proxy_wstunnel_module is supposed to be sufficient by itself, as it allows you to write ProxyPass /log2csv/ ws://log2csv:8000/. But that didn't work, I think because the ws:// request was missing the /log2csv/ subpath.  I think it was missing because the ProxyHTMLURLMap didn't fix the ws:// request in the JavaScript(?) that generated it. So instead it's correctly routed by the RewriteRule.  But that targets all ws:// requests, so something else would be needed if there was another websocket app. on the server.

Tuesday, September 30, 2014

Review entire git commit history of a file, for security issues, passwords, sensitive information, etc.

Say you want to push a git repo. that you've been working on privately to a publicly viewable place. How do you review not just the current working copy / HEAD for sensitive information, but the whole history of files that might contain details you don't want to make public? Something like this seems reasonably efficient:

git log --patch --reverse path/to/file.py \
  | grep '\(^+\|^commit\)' \
  | sed 's/^+//'
Including only lines starting with "+" or "commit" shows you only what's added to the file. As long as you start at the beginning, the deletions (lines starting with "-") and context (everything else) don't matter.

Deleting the '+' at the beginning of each line means you can dump the output from the above into your favorite editor to get syntax highlighting, which perhaps makes it easier to read. You might need to use a temporary file with an appropriate extension, e.g.:

git log --patch --reverse path/to/file.py \
  | grep '\(^+\|^commit\)' \
  | sed 's/^+//; s/^commit/#commit/' \
  >delme.py
(adding the # in front of commit lines is nice with syntax highlighting in Python)
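
If you want something more targeted than eyeballing the whole diff, a rough Python sketch along the same lines (the keyword list is only illustrative, and it needs Python 3.7+ for subprocess.run with capture_output):

import re
import subprocess
import sys

# words that often show up near credentials - extend to taste
SUSPICIOUS = re.compile(r'password|passwd|secret|api[_-]?key|token|BEGIN RSA', re.I)

def scan(path):
    log = subprocess.run(
        ['git', 'log', '--patch', '--reverse', '--', path],
        capture_output=True, text=True, check=True).stdout
    commit = None
    for line in log.splitlines():
        if line.startswith('commit '):
            commit = line
        elif (line.startswith('+') and not line.startswith('+++')
              and SUSPICIOUS.search(line)):
            print(commit)
            print(line[1:])  # drop the leading '+'

if __name__ == '__main__':
    scan(sys.argv[1])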

Thursday, April 19, 2012

Claws-mail - search for messages in date range

Claws-mail message search uses age in days to select messages by time, which is fine for "something more than two weeks ago but less than a month ago" (ag 14 & al 31) but not helpful for "a message in October 2010". This python script prints the Claws-mail extended search string for a given date range:

(replace raw_input with input for Python 3.x)

#!/usr/bin/python
import datetime
FROM = raw_input("From date YYYY-MM-DD: ")
TO = raw_input("To date YYYY-MM-DD: ")
FROM = datetime.datetime.strptime(FROM, "%Y-%m-%d")
TO = datetime.datetime.strptime(TO, "%Y-%m-%d")
NOW = datetime.date.today()
START = (NOW-FROM.date()).days
STOP = (NOW-TO.date()).days
print("al %d & ag %d" % (START, STOP-1))

Wednesday, November 30, 2011

Compiling QR reader for python / Ubuntu 11.10

Steps that apparently worked to get a QR code reader compiled for python (2.7) on Ubuntu 11.10 64 bit as of 2011-11-30.

(an alternative to the steps here might be this: http://pyqrcode.sourceforge.net/ but that didn't work easily for me in recent Ubuntus)

These instructions:
http://hi.baidu.com/paulau/blog/item/915e860ffbf7032c6059f34c.html

cover most of the steps, but lots of details need changing now vs. 2008.

The PyQrcodec_Linux.tar.gz mentioned in various places on the web has disappeared from www.pedemonte.eu - in fact that whole domain is gone. You can get it from http://gentoo.mirrors.pair.com/distfiles/PyQrcodec_Linux.tar.gz

The instructions above say:

sudo apt-get install g++
sudo apt-get install python-dev
sudo apt-get install libcv-dev libcvaux-dev

in addition you now need

sudo apt-get install libhighgui-dev

Untar and cd into the PyQrCodec folder.

then edit setup.py and add

extra_compile_args = ['-fpermissive'],

before each of the "sources = ..." lines.

then in the shell run (once and once only)
for file in $(grep -lre BEGIN ./*|grep -v svn); do sed -e "s/_BEGIN_/_CV_BEGIN_/;s/_END_/_CV_END_/" $file -i; done

now python setup.py build and sudo python setup.py install
should work (with all kinds of warnings).

The how to use example should be:

import PyQrcodec
size, image = PyQrcodec.encode('www.example.com')
image.save('example.png')
status, text = PyQrcodec.decode('example.png')
print(text)

.encode() params are:

Definition:     PyQrcodec.encode(string, image_width=400,
  case_sensitive=True, version=5,
  error_correction_level='QR_ECLEVEL_L',
  encoding_hint='QR_MODE_AN')
Docstring:
  Returns a PIL image of the QR code generated
  from the given string

Note: for http://upload.wikimedia.org/wikipedia/commons/thumb/9/9b/Wikipedia_mobile_en.svg/220px-Wikipedia_mobile_en.svg.png I had to flatten the image (replace the transparent background) for decode() to work.
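
The flattening can be done with PIL; something like this should work (file names are just placeholders):

from PIL import Image

def flatten(path_in, path_out):
    img = Image.open(path_in).convert('RGBA')
    # paste onto an opaque white background, using the alpha channel as the mask
    background = Image.new('RGB', img.size, (255, 255, 255))
    background.paste(img, mask=img.split()[3])
    background.save(path_out)

flatten('wikipedia_qr.png', 'wikipedia_qr_flat.png')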

Tuesday, June 21, 2011

Can I delete that branch? Check bzr branch relationships.

EDIT: Turns out bzr missing performs this function: bzr missing path/to/other/branch | head -2 to avoid having the important info. scrolled away. I'll assume bzr missing was added after I wrote this :-)


If you use bazaar (bzr) and end up with several branches for the same project, you can end up wondering if one branch contains all the commits in another, e.g. you need to check that all the work done in a successful experimental branch has been moved into the trunk.


This small python program, bzrin, does that; here's an example run:


leo.repo> bzrin free_layout trunk
Checking for commits of revs in 'free_layout' in 'trunk'
Status of 'free_layout':
unknown:
  .thumbnails/
  demo.jpg
  nohup.out
Status of 'trunk':
unknown:
  *.g1.dml
Counting revs in free_layout
6026 revs in free_layout
Counting revs in trunk
6683 revs in trunk
All revs in free_layout exist in trunk : OK

`bzrin` checks that all the commits in the "free_layout" branch have been merged into the trunk - in this case they have, and you can safely delete "free_layout".


The code uses `subprocess` rather than the python bzr bindings to do its work, but it gets the job done and has proved very useful for tidying up a directory full of branches for various subprojects.


#!/usr/bin/python
"""Check that the latest commit in bzr branch A exists in bzr branch B
"""

# bzrin2
# Author: Terry Brown
# Created: Mon Sep  8 12:18:21 CDT 2008

import subprocess, sys, os
import tempfile  # ? because subprocess.PIPE hangs in .wait() ?

def emit(s):
    sys.stdout.write(s)

def main():
    branch = tuple(sys.argv[1:3])
    emit("Checking for commits of revs in '%s' in '%s'\n" % branch)

    # show status
    for i in branch:
        emit("Status of '%s':\n" % i)
        cmd = subprocess.Popen(('bzr status '+i).split())
        cmd.wait()

    revs = []

    for i in branch:
        emit("Counting revs in %s\n" % i)
        revs.append(set())
        tmpFile, tmpName = tempfile.mkstemp()
        cmd = subprocess.Popen(('bzr log --show-ids --levels=0 '+i).split(),
            stdout = tmpFile)
        os.close(tmpFile)
        cmd.wait()
        source = file(tmpName)
        for line in source:
            content = line.strip()
            if content.startswith('revision-id:'):
                id_ = content.split(None,1)[1]
                while not line.strip() == 'message:':
                    line = source.next()
                line = source.next()
                msg = []
                while not line.strip().startswith('-'*10):
                    msg.append(line.strip())
                    try:
                        line = source.next()
                    except StopIteration:  # end of file
                        break
                revs[-1].add((id_, tuple(msg)))
        os.remove(tmpName)
        emit("%d revs in %s\n" % (len(revs[-1]), i))

    diff = revs[0].difference(revs[1])

    if not diff:
        emit ("All revs in %s exist in %s : OK\n" % branch)
    else:
        emit ("WARNING: %s contains revs NOT in %s\n" % branch)
        for i in diff:
            emit("%s\n%s\n" % (i[0], ''.join(['  '+m for m in i[1]])))
        emit ("WARNING: %s contains revs NOT in %s\n" % branch)

if __name__ == '__main__':
    main()

Thursday, October 28, 2010

Use multiple cores for shell scripts in Ubuntu

So you want to use all your CPU's cores for some shell-based batch processing task. Seems there should already be an app for that, and there is: parallel, in the moreutils package in Ubuntu. But it's not that easy to use the target file argument in a shell script. (Note: I think there may be more than one version of this utility; I'm referring to the one that ships with Ubuntu.)

For example, I wanted to use all cores for this operation:

for i in svg/*.svg; do f=$(basename $i); \
  inkscape-devel --without-gui \
  --export-background=white \
  --export-ps ps/${f%%svg}ps $i; \
  echo $f; done

So parallel has an -i flag which enables replacement of {} with the target argument ($i in the above), but only if it's surrounded by spaces and not quoted, which is hardly convenient for scripting. This simple wrapper (saved in a file called task, made executable and placed somewhere on your $PATH) gets around that problem:

#!/bin/sh
# helper for parallel
#
# usage: task 'shell-pattern' 'shell commands'

GLOB="$1"
shift
SCRIPT=$(mktemp)
echo "$@" >"$SCRIPT"
chmod +x "$SCRIPT"
parallel "$SCRIPT" -- $GLOB
rm "$SCRIPT"

So now you can use $1 (not $i) in your shell code without any complications. The above example becomes:

task 'svg/*.svg' 'f=$(basename $1); inkscape-devel \
--without-gui --export-background=white --export-ps \
ps/${f%%svg}ps $1; echo $f'

...and running on four cores it's much quicker :-)
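
With a newer Python the same job can be done without parallel at all, using concurrent.futures; a rough sketch (the inkscape-devel invocation is copied from the example above):

import concurrent.futures
import glob
import os
import subprocess

def convert(svg):
    base = os.path.splitext(os.path.basename(svg))[0]
    subprocess.check_call([
        'inkscape-devel', '--without-gui',
        '--export-background=white',
        '--export-ps', 'ps/%s.ps' % base, svg])
    print(base)

if __name__ == '__main__':
    # ProcessPoolExecutor defaults to one worker per core
    with concurrent.futures.ProcessPoolExecutor() as pool:
        list(pool.map(convert, glob.glob('svg/*.svg')))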

Wednesday, May 26, 2010

Loading SQL tables column by column

Goal: Load data copied from a PDF table into an RDBMS table column by column, using SQL.

Selecting and copy/pasting the whole PDF table at once didn't extract the data in a clean or usable way; things got jumbled. But selecting one column at a time (using xpdf) cleanly extracted the data in that column. But how can you insert it into the table without messing up the ordering of each column's content? OMG! The Excel "reordering destroys data integrity" problem has come to SQL! :-) Anyway, given a table like this:

21  Ant  One
31  Bat  Two
76  Cat  Three
89  Dog  Four

The following approach will work (from a postgres / psql session):

create table rescued_data (
  col1 int,
  col2 text,
  col3 text,
  ordering int
);

create temp sequence s;
create temp table col (val text);

\copy col from stdin
21
31
76
89
\.

insert into rescued_data (col1, ordering)
  select val::int, nextval('s') from col;

-- note need to match type with ::int in the above

select setval('s', 1, false);  -- reset the sequence
truncate col;

\copy col from stdin
Ant
Bat
Cat
Dog
\.

update rescued_data set col2 = val
  from (select val, nextval('s') as seq from col) as x
  where seq = ordering;

-- repeating above for next column

select setval('s', 1, false);  -- reset the sequence
truncate col;

\copy col from stdin
One
Two
Three
Four
\.

update rescued_data set col3 = val
  from (select val, nextval('s') as seq from col) as x
  where seq = ordering;

select * from rescued_data;

-- if necessary, you can
alter table rescued_data drop column ordering;
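
If you'd rather do the reassembly outside SQL, the same zipping of columns is easy from Python with psycopg2 (a sketch only; the connection string is a placeholder and the column values are the ones pasted above):

import psycopg2

col1 = "21 31 76 89".split()
col2 = "Ant Bat Cat Dog".split()
col3 = "One Two Three Four".split()

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.executemany(
    "insert into rescued_data (col1, col2, col3, ordering)"
    " values (%s, %s, %s, %s)",
    [(int(a), b, c, i)
     for i, (a, b, c) in enumerate(zip(col1, col2, col3), start=1)])
conn.commit()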