Wednesday, May 8, 2019

A simple Python web app, using Flexx and Docker, behind a proxy.

A really simple problem. An Arduino-based data logger produces a text file that looks like this:

RUN: 201904191310 25 8ef45 200
24 Jan 2018 12:23:34
342 522
542 124
123 452
RUN: 201904191310 25 8ef45 300
24 Jan 2018 12:24:54
423 252
452 241
231 542

It needs to be converted to CSV like this:

calib_run,temp_set,unit_id,targconc,time,adc_cond,adc_temp,count
201904191310,25,8ef45,200,24 Jan 2018 12:23:34,335.67,366.0,3
201904191310,25,8ef45,300,24 Jan 2018 12:24:54,368.67,345.0,3

That is, group by the ID fields and take the mean and count of the observation fields.  It could be done easily enough in vanilla Python, but it seemed like a nice simple case for experimenting with one of those automatic parser generators.  Without much research I picked Lark.  From its docs I came up with the following definition:

start: block+
block: "RUN:" calib_run temp_set unit_id targconc time obs+

calib_run: NUMBER
temp_set: NUMBER
targconc: NUMBER
unit_id: CHARS
time: DATE
obs: adc_cond adc_temp
adc_cond: NUMBER
adc_temp: NUMBER

%import common.NUMBER
%import common.WS
%ignore WS
CHARS: /\S+/
DATE: NUMBER WS+ MNTH WS+ NUMBER WS+ CHARS
MNTH: ("Jan"|"Feb"|"Mar"|"Apr"|"May"|"Jun"|"Jul"|"Aug"|"Sep"|"Oct"|"Nov"|"Dec")

My "Domain Specific Language" (DSL) starts with the constant sentinel string "RUN:", and then a set of ID fields and then some number of observations that are pairs of numbers.  This is Extended Backus-Naur form (EBNF), a way of formally describing language structure, similar to what you see in the Python docs. and other places.

When you run it, you get a tree of nodes, like this:

start
  block
    calib_run   201904191310
    temp_set    25
    unit_id     8ef45
    targconc    200
    time        24 Jan 2018 12:23:34
    obs
      adc_cond  342
      adc_temp  522
    obs
      adc_cond  542
      adc_temp  124
  block
    calib_run   201904191310
    temp_set    25
    unit_id     8ef45
    ...

The code to run the parser looks something like this:

from lark import Lark

parser = Lark(grammar)
pt = parser.parse(text)
print(pt.pretty())

where grammar and text are the EBNF grammar and the raw Arduino text shown above. Here's some code (parse_text is the entry point) which uses the grammar, Lark, and pandas to convert the text to a list of lists, essentially the parsed form of the CSV output we want:

  import pandas as pd
  from lark import Lark, Token

  # ID fields first, then the observation fields.
  flds = ['calib_run', 'temp_set', 'unit_id', 'targconc', 'time',
          'adc_cond', 'adc_temp']

  def proc_block(node, callback, state=None):
      # Walk a block's subtree, recording each leaf value in state
      # and letting the callback act whenever a leaf is seen.
      if state is None:
          state = dict(__res=[])  # '__res' accumulates the output rows
      if isinstance(node.children[0], Token):  # a leaf node, e.g. adc_temp
          state[node.data] = str(node.children[0])
          callback(state, node)
      else:
          for child in node.children:
              proc_block(child, callback, state)

      return state

  def callback(state, node):
      # adc_temp is the last field of an obs, so at this point
      # state holds a complete row.
      if node.data == 'adc_temp':
          state['__res'].append([state[i] for i in flds])

  def parse_text(text, grammar):
      parser = Lark(grammar)
      pt = parser.parse(text)
      res = None
      num = [i for i in flds if 'adc_' in i]   # observation fields
      grp = [i for i in flds if i not in num]  # ID fields, to group by
      for block in pt.children:
          rows = proc_block(block, callback)['__res']
          df = pd.DataFrame(rows, columns=flds)
          for fld in num:
              df[fld] = df[fld].astype(float)
          counts = df.groupby(grp).count()
          means = df.groupby(grp).mean().round(2).reset_index()
          means['count'] = counts['adc_cond'].tolist()
          # pd.concat rather than the (since removed) DataFrame.append
          res = means if res is None else pd.concat([res, means])

      res = res[flds+['count']]  # reorder to put ID fields first again
      return [res.columns.tolist()] + res.values.tolist()
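
For completeness, a driver around parse_text might look like this (just a sketch: the file names are illustrative, and grammar is the EBNF string shown earlier):

  import csv

  with open('arduino.log') as f:   # the raw logger output
      text = f.read()
  rows = parse_text(text, grammar)
  with open('out.csv', 'w', newline='') as f:
      csv.writer(f).writerows(rows)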
So that's pretty much mission accomplished as far as converting the input to CSV goes. To make it easy for the target audience to use, I decided to wrap it in a web app using Flexx.  Nothing complicated there - just a text area, a label, and a button.  The label tells you to paste your text into the text area, and the button converts the raw form to CSV.  These datasets are small enough to handle by copy/pasting.
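
The Main widget might look something like this (a minimal sketch rather than the exact app: the widget layout, the pointer_click reaction, and the import from log2csv.py are assumptions):

from flexx import flx
from log2csv import parse_text, grammar  # hypothetical module layout

class Main(flx.PyWidget):
    def init(self):
        with flx.VBox():
            flx.Label(text='Paste raw logger text below, then press Convert')
            self.src = flx.MultiLineEdit(flex=1)   # input: raw logger text
            self.btn = flx.Button(text='Convert')
            self.dst = flx.MultiLineEdit(flex=1)   # output: CSV

    @flx.reaction('btn.pointer_click')
    def convert(self, *events):
        rows = parse_text(self.src.text, grammar)
        lines = [','.join(str(v) for v in row) for row in rows]
        self.dst.set_text('\n'.join(lines))

To make it work as a Docker container I have this: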
  
if __name__ == '__main__':
    a = flx.App(Main)
    a.serve()
    flx.create_server(host="0.0.0.0", port=8000)
    flx.start()
in my Flexx code, so it listens on all interfaces, not just 127.0.0.1, and on a predictable port, 8000. It will be the only thing in the Docker container, so we know 8000 is available. The Dockerfile looks like this:
  
FROM continuumio/miniconda3

RUN conda install -c conda-forge lark-parser pandas flexx

RUN mkdir /.webruntime \
 && chmod a+rwx /.webruntime

COPY log2csv.py log2csv_ui.py /

CMD ["python", "log2csv_ui.py"]
It works as expected. A quick test with an SSH tunnel to the container on the remote host and everything looks good; ready to deploy. I deploy these containers behind Apache proxies at addresses like http://example.com/log2csv/. I put the proxy in place, but argh, it only proxied the http:// requests, not the ws:// requests. Here's the Apache config that fixed that:
  
    LoadModule proxy_module modules/mod_proxy.so
    LoadModule proxy_http_module modules/mod_proxy_http.so
    LoadModule proxy_html_module modules/mod_proxy_html.so
    LoadModule proxy_wstunnel_module modules/mod_proxy_wstunnel.so
    LoadModule rewrite_module modules/mod_rewrite.so
 
    ProxyPass /log2csv/ http://log2csv:8000/
    ProxyHTMLURLMap http://log2csv:8000/ /log2csv/

    RewriteEngine on
    RewriteCond %{HTTP:Upgrade} websocket [NC]
    RewriteCond %{HTTP:Connection} upgrade [NC]
    RewriteRule .* "ws://log2csv:8000%{REQUEST_URI}" [P]

    <location log2csv="">
        ProxyPassReverse /
        ProxyHTMLURLMap  /      /log2csv/
    </location>
Note that the Apache server is also running in a (different) Docker container, linked to the Flexx web app container with --link log2csv, so the Flexx container is reachable as just log2csv from within the Apache container. In a different context you might use 127.0.0.1.

There are probably some hidden rough edges in the above Apache config.  Using proxy_wstunnel_module is supposed to be sufficient by itself, since it lets you write ProxyPass /log2csv/ ws://log2csv:8000/. But that didn't work, I think because the ws:// request was missing the /log2csv/ subpath.  I suspect it was missing because ProxyHTMLURLMap didn't rewrite the ws:// URL inside the JavaScript that generates the request.  So instead the request is routed by the RewriteRule.  But that rule targets all ws:// requests, so something else would be needed if another websocket app were running on the same server.
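
One untested possibility would be to scope the rewrite to the path the Flexx client actually requests, instead of catching every websocket upgrade. Assuming the Flexx websocket endpoint lives under /flexx/ws/ (an assumption; check the request path in the browser's network tab), something like:

    RewriteCond %{HTTP:Upgrade} websocket [NC]
    RewriteCond %{HTTP:Connection} upgrade [NC]
    RewriteRule ^/flexx/ws/ "ws://log2csv:8000%{REQUEST_URI}" [P]

With the app name added to the end of the pattern, that could also distinguish between two Flexx apps on the same server.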