Excel Formulas and the DLR – Part 1

Update: Part 2 now available here!

For a recent project I needed an expression parser / compiler / evaluator that could handle Excel-like formulas, but be able to run cross-platform and achieve [near] native code performance.  The Dynamic Language Runtime – which IronPython and IronRuby are built on – looked a pretty neat fit.

First, a little background.  I already had something based on .NET 3.5’s statically-typed expression trees and a modified version of the Dynamic Expression API.  Most of my changes to “Dynamic.cs” were related to handling differences between value types, and some Excel quirks.  To keep things simple, it took a pretty heavy-handed approach to type conversion (largely run-time checks) and imposed some type-related restrictions on the expressions.

The DLR introduces “Expression Trees v2.0”, which are a superset of those from .NET 3.5 and Silverlight 2.  Some interesting things it adds are:

  • DynamicExpression – Describes an operation to resolve at run time, but uses the DLR’s fast dynamic call site caching.
  • Control flow – In addition to simple expressions, many more language concepts can be represented.

IronPython and IronRuby are already a great fit for many use cases.  For an excellent example, check out Pavan Podila’s ScriptConverter – it allows you to use Python expressions and statements in your XAML markup.  The potential uses are huge.

However, for this project I wanted to support as much Excel syntax / oddities as possible and keep a minimal download footprint.

Getting Started – Parsing an expression

A little research revealed I’d probably need a “compiler compiler” or “parser generator”.  I found an awesome utility called TinyPG (by Herre Kuijpers) on CodeProject.  In Herre’s words:

This particular generator is an LL(1) recursive descent parser generator. This means that instead of generating a state machine out of a grammar like most compiler compilers, instead it will generate source code directly; basically generating a method for each non-terminal in the grammar.

So basically you can throw your grammar at it and get source code out.  Nice :)  Here’s some grammar I came up with:

//Tiny Parser Generator v1.2
//Copyright © Herre Kuijpers 2008-2010
<% @TinyPG Namespace="Cjc.ExpressionEngine.Excel.Compiler" %> 

PLUSMINUS    -> @"(\+|-|&)";
MULTDIV    -> @"\*|/";
BROPEN        -> @"\(";
BRCLOSE    -> @"\)";
SBROPEN    -> @"\{";
SBRCLOSE    -> @"\}";
COMMA        -> @",";
PERIOD        -> @"\.";
COLON        -> @"\:";
SHEETBEGIN    -> @"'(?=(''|[^'])*'!)";
SHEETEND    -> @"'!";
QUOTEBEGIN    -> @"@?\""(?=(\""\""|[^\""])*\"")";
QUOTEEND    -> @"\""";
QUOTED        -> @"(\""\""|[^\""])*";
SNGQUOTEBEGIN    -> @"'(?=[^']*'[^!]?)";
SNGQUOTEEND    -> @"'(?=[^!]?)";
SNGQUOTED    -> @"(''|[^'])*";
TRUE        -> @"[Tt][Rr][Uu][Ee]";
FALSE        -> @"[Ff][Aa][Ll][Ss][Ee]";
NULL        -> @"[Nn][Uu][Ll][Ll](\(\))?";
CELLID        -> @"\$?[a-zA-Z]+\$?[0-9]+";
CONSTRUCTOR    -> @"new\s[a-zA-Z_][a-zA-Z0-9_\.]*";
METHODNAME    -> @"[a-zA-Z_][a-zA-Z0-9_\.]*[a-zA-Z0-9_](?=\()";
IDENTIFIER    -> @"[a-zA-Z_][a-zA-Z0-9_\.]*[a-zA-Z0-9_]";
EVAL        -> @"\{Eval\s[a-zA-Z0-9_\-\.\s]*?\}";
DATA        -> @"\{Data\s[a-zA-Z0-9_\-\.\s]*?\}";
ASSIGNID    -> @"[a-zA-Z_][a-zA-Z0-9_\.]*(?=\s?=\s?)";
INTEGER    -> @"(\+|-)?[0-9]+";
NUMBER        -> @"(\+|-)?[0-9]*\.[0-9]+";
COMPARE    -> @"==|<=|>=|>|<|=|!=|<>";
EQUALS        -> @"=";
SIGN        -> @"\+|-";
INVALID    -> @"#VALUE!";
EOF        -> @"^$";

[Skip] WHITESPACE -> @"\s+";

Start        -> (Assignment|CompareExpr)? EOF;
Assignment    -> ASSIGNID EQUALS CompareExpr;
CompareExpr    -> AddExpr (COMPARE AddExpr)?;
AddExpr    -> MultExpr (PLUSMINUS MultExpr)*;
MultExpr    -> Sign (MULTDIV Sign)*;
Params        -> CompareExpr? (COMMA CompareExpr)*;
Constructor    -> CONSTRUCTOR (BROPEN Params? BRCLOSE | SBROPEN Params? SBRCLOSE);
Method        -> METHODNAME BROPEN Params? BRCLOSE;
String        -> QUOTEBEGIN QUOTED QUOTEEND | SNGQUOTEBEGIN SNGQUOTED SNGQUOTEEND;
Sheet        -> SHEETBEGIN SNGQUOTED SHEETEND;
Range        -> Sheet? ( CELLID (COLON CELLID)? );
Identifier    -> IDENTIFIER;
Eval        -> EVAL;
Data        -> DATA;
Sign        -> SIGN? (Method | Constructor | Range | Identifier | Eval | Data | Atom);
Atom        -> TRUE | FALSE | NULL | INTEGER | NUMBER | String | INVALID | BROPEN CompareExpr BRCLOSE;

Building this in TinyPG gives you three source files: Scanner.cs, Parser.cs and ParseTree.cs.  Most of these have partial classes so it’s easy to extend them (TinyPG generates code from a template; you could easily change that if you want more control).

Building an Expression / Abstract Syntax Tree

To build a tree from an expression, simply create a Scanner, wrap it in a Parser and feed it your expression.  We’ll extend the ParseTree [partial] class with code similar to this:

private Expression GetNodeExpression( ParseNode node )
{
    switch ( node.Token.Type )
    {
        case TokenType.Start:
            return GetStartExpression( node );

        case TokenType.Assignment:
            return GetAssignmentExpression( node );

        case TokenType.CompareExpr:
            return GetComparisonExpression( node );

        case TokenType.AddExpr:
            return GetAddExpression( node );

        case TokenType.MultExpr:
            return GetMultiplyExpression( node );

        case TokenType.Atom:
            return GetAtomExpression( node );

        case TokenType.Range:
            return GetRangeExpression( node );

        case TokenType.TRUE:
            return Expression.Constant( true );

        case TokenType.FALSE:
            return Expression.Constant( false );

        case TokenType.NULL:
            return Expression.Constant( null );

        case TokenType.STRING:
            return GetString( node );

        case TokenType.INTEGER:
            {
                int iVal;
                var value = node.Token.Text;

                return int.TryParse( value, out iVal )
                    ? Expression.Constant( iVal )
                    : Expression.Constant( long.Parse( value ) );
            }

        case TokenType.NUMBER:
            return Expression.Constant( decimal.Parse( node.Token.Text ) );

        case TokenType.Identifier:
            return GetVariableExpression( node );

        case TokenType.Method:
            return GetMethodExpression( node );
    }

    return Expression.Constant( node.Token.Text );
}

The helper methods (GetMultiplyExpression etc) are pretty simple; you can see the full source on CodePlex.

Not Using the DLR

So far we haven’t done anything specific to the DLR.  You could go ahead and build a regular .NET 3.5 [statically-typed] expression tree from this and it’d work great.  In fact, if you only needed to handle numeric values the implementation could be pretty trivial – just Expression.Convert all values to System.Decimal, wrap the tree in Expression.Lambda, call Compile and invoke the delegate.  All too easy – but where’s the fun in that? 🙂

A Small Teaser

To get an idea how something like this might be used, you can check out a small teaser right here (or click the image below).

image

Continue to Part 2

9 Comments

  1. Chris — very interesting. I might have a potential customer for this capability. Did you ever finish the development, so that it would actually update and recalc?

    Reply

  2. I would love to have the last version of your excel grammar. It seems like it’s not the own generating the parser found in the source published. Do you still have it?

    Reply

    1. gas – I’ve updated the Excel grammar in the post… It’s the latest version I have (and is being actively used), but you may still need to make some tweaks. Hope this helps! 🙂

      Reply

      1. Uh great! Thanks. I was since then trying to reconstruct it from the Parser without success until yet. You saved my day (and more)

Leave a comment