The birth of a computer language

Introduction

Eventually, during the course of one’s programming career, we encounter a task which requires a high degree of complexity and configurability.  This task is cannot be easily expressed in the available languages.  XML and property configuration files will fall short of being an ideal solution.

Still we try to get off cheap and hammer out solutions using these blunt tools anyway.  The solution invariably trades deferred programming hours for results in increased maintenance and fragility down the road.

These require recompilation, complex configuration which require complex IT handoffs when we might have instead created a DSL or Domain Specific Language which would allow our user base to manage this task for themselves.

Such systems, when based upon properly designed DSL, delight the user with the amount of power it delivers into their hands.

How does this apply to DEX you ask?  Good question.  I have been using DEX extensively for my day job which involves a lot of analysis.  Most of my sources of data were not very well behaved.  I found myself writing lots of trivial yet task specific Groovy code to massage incoming data.  I’m growing sick of it.

I found myself stashing things in temporary SQL tables to perform trivial task such as sorting and selecting slices of the data which are interesting.  SQL falls short for many task, and I grew quickly tired of trying to find the right way to express three-way joins.  I’m a programmer not a DBA damnit, and I find myself despising SQL for these sort of task.  I know what I meant, but Oracle, HSQL nor MySQL seem to track with me.

So, I thought to myself, wouldn’t it be great if I could have a table transformation language that would allow me to easily express what I want?  I’d like to be able to use it within Dex or on the command line to slice up a CSV.

So, I explored packages which I could integrate with which would provide the capability I craved.  Many things were close, however, I was not willing to make any concessions.  This language will become a cornerstone capability within Dex.  I want it exactly the way I want it.

TMI = Table Manipulation Interpreter or Too Much Information

I am calling the language TMI.  I like the irony of the acronym and I think that too many of us take things too seriously.  Humor is a good thing.

In TMI, I would like to capture the things which SQL does well, avoiding the things it does not.  I also want to provide more powerful and intuitive operations with the power you would typically find in a 4GL language.

One of the things SQL does will is selecting data.  Let’s build from that and simplify.

SELECT

The default table will come into the flow as it does with all DEX components within a TMITable structure called “dex”.  Suppose that we have a table which looks like:

FIRST LAST AGE
Bob Marley 55
Pat Martin 44
Jimmy Hoffa 77

I want to be able to isolate columns like:

select FIRST, LAST from dex;

Since dex is the default table, I can also say:

select FIRST, LAST

Both of these operations would remove AGE from the data.

So what?  SQL does that well.  You’re right, but this is simply a starting point, a starting point for simplifying complexity.  I’ll go more into that in a later post.

I also want to handle column names which SQL does not like.  I see stuff like this a lot:

Server Name Server Address Server Port
server1 192.168.1.1 80
server2 192.168.1.2 80
server3 192.168.1.3 80

If you want to name your columns as such, you should be able to.

dex = select “Server Name”, “Server Address” from dex;

ASSIGNMENT

You probably noticed the assignment in the previous example.  That’s critical.  Complex things need to be divided into steps.  These steps often require placeholders to put the intermediary results.

You can create, store, reassign and output variables on the fly.

println “Dex Data: “ + dex;
newdata = select A, B from dex;
println “My New Data Looks Like: “ + newdata;
dex = newdata;
println “Dex Data Now Looks Like: “ + newdata;

Enough Complexity For Now

That’s enough capability for now, though I will quickly evolve it to handle other complex operations such as merging and splitting rows and columns.  For now let’s stay on task and go into the creation of this DSL.  Having experience with JavaCC, I considered using it, however, it’s user base and support have dwindled over the years and all of the exciting developments are happening in the ANTLR world.

Terence Parr is working on a new version of ANTLR that I find absolutely amazing.  Terrence has spent 20 years on the same core task, and it shows.  His ideas and implementations get more and more polished.

The TMI Grammar

Compared to other language frameworks, ANTLR4 stays out of your business and provides a clear separation between Lexer, Parser and Interpreter/Translator.  I no longer have to create so many rules and productions to massage the tree and enforce precedence.

The grammar currently looks like this:

grammar TMI;

tokens
{
  IDENTIFIER,
  SELECT,
  FROM,
  PRINT,
  PRINTLN,
  TRUE,
  FALSE,
  NL
}

@header
{
package com.javainc.tmi;
}

program
	: (expressionOrAssignment (';'|NL)|NL)* EOF;

expressionOrAssignment
	: assignment
	| expression;

assignment
	: expression assignmentOperator expression;

expression
	: '(' expression ')'						# parenExpression
	| expression '+' expression					# add
	| (PRINT|PRINTLN) expression				# print
	| SELECT columnList (FROM tableName)?		# select
	| IDENTIFIER								# identifier
	| STRING									# string
	| BOOLEAN									# boolean;

assignmentOperator
	: '=' | '*='	| '/=' | '%=' | '+=' | '-=' | '<<=' | '>>=' | '&=' | '^=' | '|=';

////////////////////////////////////////

columnList : columnName (',' columnName)*;

columnName : IDENTIFIER | STRING;

tableName : IDENTIFIER;

// Keywords:
SELECT : 'select';
FROM : 'from';
PRINT : 'print';
PRINTLN : 'println';
TRUE : 'true';
FALSE : 'false';

// Last
STRING
	:   '"' ( ESCAPE | ~('\\'|'"') )* '"'
	|   '\'' ( ESCAPE | ~('\\'|'\'') )* '\''
	;

BOOLEAN : TRUE | FALSE;

fragment HEXDIGIT
	: ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment ESCAPE
	:   '\\' ([abtnfrv]|'"'|'\'')
	|   UNICODE_ESCAPE
	|   HEX_ESCAPE
	|   OCTAL_ESCAPE
	;

fragment
UNICODE_ESCAPE
    :   '\\' 'u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT
    |   '\\' 'u' '{' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT '}'
    ;

fragment
OCTAL_ESCAPE
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
HEX_ESCAPE
    :   '\\' HEXDIGIT HEXDIGIT?
    ;

IDENTIFIER : LETTER (LETTER|'0'..'9')*;

fragment LETTER
	:	'$'
	|	'A'..'Z'
	|	'a'..'z'
	|	'_'
	;

NL
	: '\r'? '\n';

WS : (' '|'\t'|'\f'|'\n'|'\r')+ {skip();};

Not too complex, but probably fairly daunting if you haven’t been exposed to grammars before.  Trust me, it’s an order of magnitude easier than writing a recursive descent parser from scratch.

One of the really nice things that simplify life in ANTLR4 is:

expression
  : '(' expression ')'                        # parenExpression
  | expression '+' expression                 # add
  | (PRINT|PRINTLN) expression                # print
  | SELECT columnList (FROM tableName)?       # select
  | IDENTIFIER                                # identifier
  | STRING                                    # string
  | BOOLEAN                                   # boolean;

This section captures precedence and the ‘#’ entries are labels.  They will create nodes in listeners or visitors created by ANTLR.  Previously, I had to create separate rules where I wanted to write handlers.  This will create handlers for parenExpression, add, print, select and so on.

Very nice indeed.

TMI Interpreter Code

The main code for the test interpreter is as follows:

package com.javainc.tmi;

import java.util.ArrayList;
import java.util.List;
import java.util.Stack;

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class TMI
{
  public static void main(String[] args) throws Exception
  {
    // create a CharStream that reads from standard input
    ANTLRInputStream input = new ANTLRFileStream(args[0]);
    // create a lexer that feeds off of input CharStream
    TMILexer lexer = new TMILexer(input);

    // create a buffer of tokens pulled from the lexer
    CommonTokenStream tokens = new CommonTokenStream(lexer);

    // System.out.println("TOKENS: '" + tokens.getText() + "'");
    // create a parser that feeds off the tokens buffer
    TMIParser parser = new TMIParser(tokens);

    ParseTree tree = parser.program();
    // System.out.println("TREE: '" + programCtx.toStringTree(parser) + "'");

    TMISymbolTable sym = new TMISymbolTable();

    TMITable table = new TMITable();
    table.setHeader("A", "B", "C");

    sym.put("name", "Patrick Martin");
    Stack stack = new Stack();

    for (int i = 0; i < 10; i++)
    {
      List row = new ArrayList();
      for (int col = 0; col < table.getHeader().size(); col++)
      {
        row.add("" + (i * table.getHeader().size() + col));
      }
      table.addData(row);
    }

    sym.put("dex", table);

    long startTime, endTime;

    TMIVisitorImpl tmi = new TMIVisitorImpl(sym, stack);
    startTime = System.currentTimeMillis();

    tmi.visit(tree);
    endTime = System.currentTimeMillis();

    System.out.println("Execution completed in: " + (endTime - startTime)
        + " ms.");
  }
}

It shares a symbol table and stack with the parse tree.  I’m not using the stack yet, but I might.  Not sure.

TMI Visitor Implementation

I won’t post all of the code for this here, however, here is the class definition and constructor:

public class TMIVisitorImpl extends ParseTreeVisitor<Object> implements
    TMIVisitor<Object>
{
  private TMISymbolTable sym   = null;
  private Stack          stack = null;

  public TMIVisitorImpl(TMISymbolTable sym, Stack stack)
  {
    this.sym = sym;
    this.stack = stack;
  }
  ...
}

And the implementation of the select command:

  public TMITable visitSelect(SelectContext ctx)
  {
    List<Object> results = eval(ctx);

    List<String> colList = (List<String>) results.get(1);
    TMIIdentifier id = (TMIIdentifier) results.get(3);

    if (sym.defines(id, TMITable.class))
    {
      return ((TMITable) sym.get(id)).select(colList);
    }

    return null;
  }

</pre>
It's still rough, basically it evaluates it's children, gets a list of column names and the table and delegates the selection operation to another class called TMITable.  But this shows the basic handoff between ANTLR and our Java code.
<h2>Running Some Programs</h2>
TMI can't do much yet, however here is a test program and its output.


greeting = "Hello";
name = "Patrick"
println greeting + " " + name;
println dex;
tmp = select A from dex;
println "TMP RESULT=" + tmp;
dex = tmp;
println "NEW DEX: " + dex;</pre>

Output:

Hello Patrick
[A, B, C][[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11], [12, 13, 14], [15, 16, 17], [18, 19, 20], [21, 22, 23], [24, 25, 26], [27, 28, 29]]
TMP RESULT=[A][[0], [3], [6], [9], [12], [15], [18], [21], [24], [27]]
NEW DEX: [A][[0], [3], [6], [9], [12], [15], [18], [21], [24], [27]]
Execution completed in: 3 ms.

Conclusion

It doesn’t look like much yet, but TMI will offer a much more powerful alternative for slicing and dicing your data.  More importantly, it will be intuitive and a separate product from DEX itself so that others may embed TMI within their own stuff.  I’ll keep updating this blog with developments and exciting new capabilities of this facet of DEX development.

About patmartin

I am a coder and Data Visualization/Machine Learning enthusiast.
This entry was posted in General and tagged , , , . Bookmark the permalink.

Leave a comment