Delta Engine Blog

The last few days I was working a little bit with the DLR and got to a point today where I was wondering about the performance of a very simple while loop in the DLR, specifically in the ToyScript DLR sample project and of course IronPython. I am totally aware that the following comparison DOES NOT really provide accurate information about the actual performance of the languages and techniques used. Please keep this in mind when reading this article, it only covers a very certain aspect of all used languages. More specially how they perform when adding numbers in a while loop a lot of times! Some of it reflects that compiled languages are faster than dynamic languages, but you should not compare the absolute values, Python or Lua is not really 100 times slower, in many cases you can reach almost similar performance as long as you implement or call performance-critical code in a clever way (e.g. calling C++ code, using frameworks, etc.).

I was only interested about very ruff comparisons to figure out which ways are fast and which languages or techniques are kinda slow (click image on the right for a quick overview). Some code written in Assembler or direct IL code execution was really fast (duh, no wonder), but all execution times of C++, C#, Assembler or IL are way faster than all other dynamic languages or DLR approaches. The important thing for me was to figure out why dynamic languages like IronPython or ToyScript on top of the DLR were like 100 times slower for this specific code. At the end of writing this article I was able to provide pretty good performance with the DLR, which is just 2-3 times slower than the fastest execution time. That's pretty good and could probably be even improved more.

Basically I used the following code sample, which adds 1.0 to counter 10 million times while decreasing the loops variable until it reaches 0. The following code block is not valid syntax in any of the presented languages, but it is very similar to Python. In languages that use compile time types, counter would be a declared as a double (64bit floating point number), most dynamic languages use double as the run-time type for floating point numbers also.

loops = 10000000
counter = 0.0
while loops > 0
  loops = loops - 1
  counter = counter + 1.0

print counter

Languages and techniques used and tested (all with detailed code examples and explanations):

C# 3.0
Native C++
Assembler
IronPython 2.6
ToyScript
Python 2.6
Python 3.1
Lua 5.1.4
Script.NET
IL Emit
Irony + DLR (dummy)
DLR AST with objects
DLR AST with doubles

Before boring you with all the different implementation details, here is the fancy Code Generation Performance comparison. This article became very long; I just wanted to post a few sentences about the DLR and this is what happened. The following graphs were created with Excel with some help from my good old friend ViperDK. The second graph (the first one was already shown above) shows just the execution time of the while-loop code block in each of the used language or technique.

And this graph also shows the other times like compile & build time where it makes sense or script or DLR start-up times to see how much this could affect the overall time from coding to seeing the result on the screen. This is important for my approach to my own language since I want a extremely low time for the coding to see result loop. Start-up time can be ignored in those cases, but it is still important since it is still kinda slow with the DLR (but will hopefully get better in the future).

Here are the values used to generate those graphs. Tested on my almost 3 years old 2.4 Ghz 6600 CPU; this is probably way faster on my i7@3.6 Ghz at work:

Language or Technique	Code lines	Compile&Build	Script Startup	Parsing&AST	Execution
C# 3.0	10	~500ms	-	-	~37ms
Native C++	8	1500ms	-	-	~31ms
Assembler	15+	~1100ms	-	-	~15ms
IronPython 2.6	6	-	~300ms DLR	160ms	~9000ms (~1250ms for files)
ToyScript	8	-	~430ms DLR	103ms	~8400ms
Python 2.6	6	-	~50ms	~100ms	~3500ms
Python 3.1	6	-	~50ms	~100ms	~4500ms
Lua 5.1.4	7	-	~25ms	~150ms	~1400ms
Script.NET	8	-	~100ms	-	~39752ms
IL Emit	30+	-	~100ms	~1ms	~16ms
Irony + DLR (dummy)	30 + 6	-	265ms Irony	87ms	~138ms (dummy)
DLR AST with objects	68+	-	433ms DLR	4ms	~459ms
DLR AST with doubles	60+	-	434ms DLR	5ms	~98ms

Implementing and testing the following code was not too bad for me since I was learning how to generate AST statements in the DLR, generating System.Reflection.Emit IL code, doing stuff with the ToyScript DLR sample, etc. anyway. I was of course interested in how to get the best performance for my code generation, but it is useful to understand why each approach is slower or faster. Sorry if the comparison presents your project or technology you might use in a bad light, but for example Script.NET is slow in comparison (this is explained below in more detail), but still fast enough for probably 99% of its uses. It still took much longer than I expected, writing this article did take some time too, but most of the time was spend writing and testing code in each language and technique and then gathering the execution time results. But performance evaluation is always fun :) Again: This is not a representative performance comparison of the languages and techniques involved, it only applies to my very short and stupid test code and should only give you a general idea.

C# 3.0

First of all the very simple implementation of the while loop:


public static double DoLoop(int loops)
{
    double counter = 0.0;
    while (loops > 0)
    {
        loops = loops - 1;
        counter = counter + 1.0;
    }
    return counter;
} // DoLoop(loops)

And now some code to measure the execution time and to print out the result to see if the loop was executed as many times as we expected:


    long startTime = WindowsHelper.GetPerformanceCounter();
    
    // Do simple performance test in C#
    const int NumberOfLoops = 10000000;
    double result = DoLoop(NumberOfLoops);
    // Note: We did printing in other tests too (does not matter anyway)!
    Console.WriteLine("DoLoop result=" + result);
    
    long endTime = WindowsHelper.GetPerformanceCounter();
    Console.WriteLine("Executing the while loop " + NumberOfLoops +
        " times took: " + WindowsHelper.ConvertToTime(endTime - startTime));

Note that the WindowsHelper class is just providing a little bit more accurate information than just using a StopWatch (which is used in most other samples here), which is fine too. With System.Diagnostics.StopWatch this code looks like the following (actually results in the same results on Windows 7 and Vista):


    Stopwatch timer = new Stopwatch();
    timer.Start();

    // Do simple performance test in C#
    const int NumberOfLoops = 10000000;
    double result = DoLoop(NumberOfLoops);
    //Note: We did printing in other tests too (does not matter much anyway)!
    Console.WriteLine("DoLoop result=" + result);

    timer.Stop();
    Console.WriteLine("Executing the while loop " + NumberOfLoops +
        " times took: " + timer.ElapsedMilliseconds + "ms");

The following output is generated:

DoLoop result=10000000
Executing the while loop 10000000 times took: 37ms

Native C++

Here is the whole program used for testing. This has nothing to do with the DLR or .NET anymore, I just wanted to know how my little while loop would perform in native C++ and Assembler. Sometimes when using for loops the C++ compiler will optimize the code away and just replace it with something static that results in the same value as if the for loop would have been executed. Please also note that it did not make any difference to use a 64bit platform and doubles or just 32bit floats, but since most script languages and especially .NET compile to x64 too, we should not use 64bit numbers when compiling a 32bit application here in C++. Switching to x64 is also not too hard, you just need the correct 64bit libraries and set the linker target platform to X64.


#include <stdio.h>
#include <windows.h>

int main(int argc, char* argv[])
{
    long int before = GetTickCount();

    // C++ version
    int loops = 10000000;
    float counter = 0.0f;
    //64bit code: double counter = 0.0;
    while (loops > 0)
    {
        loops = loops - 1;
        counter = counter + 1.0f;
    }
    
    printf("counter=%f\n", counter);

    long int after = GetTickCount();
    printf("This took %dms\n", (after-before));

    // Wait for user input (not required for Debug.StartWithoutDebugging)
    //char input[128];
    //scanf(input);

    return 0;
}

And after compiling (usually a lot faster than 1s when just incrementally compiling) and running this code, it generates this output:

counter=10000000.000000
This took 31ms

The result is not very accurate, I either get 31ms or 47ms, but not values in between, the accuracy of GetTickCount pretty much sucks and I should use similar code as the GetPerformanceCounter() stuff above from C#, but I do not really care about more accurate values here, C++ is more than fast enough and we do not need to discuss its runtime performance!

Assembler

Based on the C++ solution I was able to write some Assembler instructions. It has been a very long time since I used assembler code at all, probably in 2002 when optimizing some path finding algorithms for ArenaWars, which were called from C++ code, which was called from C# code. Before that in the C++ days I used it now and then, but never for more than some performance critical inner loops. Please also note that sometimes you make non-optimal assembler code (as I probably did in this example) and just having your C++ compiler generate assembler instructions can be better, so always profile your code and only optimize to Assembler if you have no other option left and if you are too lazy to rethink the problem :D


#include <stdio.h>
#include <windows.h>

int main(int argc, char* argv[])
{
    long int before = GetTickCount();
    
    int loops = 10000000;
    float counter = 0;
    float counterInc = 1.0f;
    __asm
    {
startLoop:
        // counter += 1.0f
        fld counter
        fadd counterInc
        fstp counter

        // loops--
        dec loops

        // loops > 0? 
        cmp loops, 0
        // Then goto startLoop (else we are done)
        jg startLoop
    }

    printf("counter=%f\n", counter);

    long int after = GetTickCount();
    printf("This took %dms\n", (after-before));

    return 0;
}

I don't have much to say about this ugly assembler code, as you can see it is a lot more lines of code and this was the first idea not even using registers as you should in assembler. After running this code and getting 31-47ms too (like the C++ version), I added some minor optimization by using registers more explicitly and this is the resulting code, it ran about 2-3 faster than the above code:


#include <stdio.h>
#include <windows.h>

int main(int argc, char* argv[])
{
    long int before = GetTickCount();

    int loops = 10000000;
    float counter = 0;
    float counterInc = 1.0f;
    __asm
    {
        // Copy loop counter to eax register (32bit)
        mov eax, loops
        // Load both float values
        fld counterInc
        fld counter
        
startLoop:
        // counter += 1.0f
        fadd st(0), st(1)
        //st(0) has still the counter value! no need to store: fst st(0)

        // loops--
        dec eax

        // loops > 0? 
        cmp eax, 0
        // Then goto startLoop (else we are done)
        jg startLoop

        // We are done, copy counter value back!
        fstp counter
    }

    printf("counter=%f\n", counter);

    long int after = GetTickCount();
    printf("This took %dms\n", (after-before));

    return 0;
}

This program does produce the following output:

counter=10000000.000000
This took 15ms

This is pretty good and I did not want to optimize it further, especially not when the timer is so inaccurate. Running the code 10 times resulted in 140ms. Again, fast enough, no reason to argue the superior runtime performance. But this comes at a high price since the code is in no way easy to understand, its hard to write and read.

IronPython 2.6

Speaking of code, I would say the implementation in Python is the shortest and very beautiful since it is so easy to read and write:


loops = 10000000
counter = 0
while loops > 0:
    loops = loops - 1
    counter = counter + 1.0
print counter

I just used my ScriptManager from my .NET libraries, which can handle IronPython (and other DLR languages) as well as Lua and wrote this short unit test to see how IronPython performs:


    public void TestPythonDoLoop()
    {
        // Create script manager without any output redirection or special directories
        ScriptManager manager = new ScriptManager(null, null);
        
        Stopwatch timer = new Stopwatch();
        timer.Start();

        // Simply execute line by line as we would in a console window
        manager.ExecuteCode("loops = 10000000", ScriptType.Python);
        manager.ExecuteCode("counter = 0", ScriptType.Python);
        manager.ExecuteCode(@"while loops > 0:
loops = loops - 1
counter = counter + 1", ScriptType.Python);
        manager.ExecuteCode("print counter", ScriptType.Python);

        timer.Stop();
        Console.WriteLine(
            "Executing the while loop 10000000 times in python took: " +
            timer.ElapsedMilliseconds + "ms");
    } // TestPythonDoLoop()

And this results in the following output:

Executing the while loop 10000000 times in python took: 9151ms

Update 2009-04-21: As Simon Davy correctly states in the comments of this post, ipy.exe (IronPython) is actually a lot faster when executing a file with the code above (e.g. use import time). This results in an execution time of just 1250ms, so a lot faster! I have tested it too with ipy, but I entered the code line by line (same way as I called my ScriptManager code in the unit test above) and it still takes 9 seconds. But Simon is right, once I execute everything at once (with ipy test.py) it only takes 1.25s (can also be done with my ScriptManager, but I have not thought of that). So it is probably just an issue with REPL (entering statements one by one .. less optimizations that way). While this certainly does make IronPython a much faster dynamic language than all the other ones, it still is slower than the compiled languages or the DLR code at the end of this article.

Thanks for noticing!This is certainly not good. The same thing happend after I tried using the IronPython 2.6 console or IronPython 2.0, it always took about 9 seconds to complete the big while loop on my PC. I wanted to figure out why this is so slow compared to C#, C++ or Assembler runtime, but then again we have to remember that Python is a dynamic language. The most important reason for the long runtime here is probably just all the boxing and unboxing of values, if you can say that for dynamic languages. What I mean is figuring out which type we got, and then performing the add/sub operation, which can be slow if you do it 10 million times. Please note again that this is probably not affecting any real Python program and performance for more complex programs will be way better. More on the type-performance and this issue in general at the end of this article!

ToyScript

To learn more about the DLR I played around with the DLR ToyScript sample quite a bit the last days. Since a lot of its code is based on the same ideas as IronPython and IronRuby, you can't expect much different performance results. This is the ToyScript code for the while loop:


loops = 10000000
counter = 0
while (loops > 0)
{
    loops = loops - 1
    counter = counter + 1
}
print counter

Similar to the ScriptManager above I wrote some helper classes for the ToyScript language allowing me just to execute code, viewing the AST (abstract syntax tree) and write a lot of unit tests. The following code is such a unit test for the ToyScript class to check the performance for these 10 million while loop iterations:


    public void TestBigLoop()
    {
        const int NumberOfLoops = 10000000;

        Stopwatch timer = new Stopwatch();
        timer.Start();

        // Initialize the language first (takes its own time, around 400ms)
        ToyLanguage language = new ToyLanguage();
        // Execute something else (takes maybe 100ms)
        Assert.Equal(3.0, language.Execute("print 1+2"));

        timer.Stop();
        Console.WriteLine("Startup time: " + timer.ElapsedMilliseconds + "ms");
        timer.Reset();
        timer.Start();

        Assert.Equal((double)NumberOfLoops,
            language.Execute(
@"loops = " + NumberOfLoops + @"
counter = 0
while (loops > 0)
{
loops = loops - 1
counter = counter + 1
}
print counter
"));

        timer.Stop();
        // Show total execution time
        Console.WriteLine("Executing the while loop " + NumberOfLoops +
            " times took: " + timer.ElapsedMilliseconds+"ms");
    } // TestBigLoop()

This code runs for about the same time as IronPython, maybe a little faster since ToyScript has a lot less features than IronPython, but in general it uses the same approach. This is the output from the unit test:

Startup time: 650ms
Executing the while loop 10000000 times took: 8914ms

Investigating the Lambda Expression AST for the code block shows that quite a lot of statements were created in order to make this ToyScript code run on the DLR:


 // Expression: Expression`1
 
 .lambda (<toyblock>$2 Microsoft.Func`1[System.Object] #1)
 
 // LambdaExpression: <toyblock>$2(1)
 .lambda System.Object <toyblock>$2 ()(
 ) {
   .label 0x03d11d12:
   .comma  {
     .block  {
       .extension AssignmentExtensionExpression ( .extension GlobalVariableExpression ( )
       (Object)10000000)
       ,
       .extension AssignmentExtensionExpression ( .extension GlobalVariableExpression ( )
       (Object)0)
       ,
       .loop break:.label 0x03fe2d21  {
         .block  {
           .if (
             .site (Boolean) CallSiteBinder(ConvertTo to System.Boolean) (
               .extension CodeContextExpression ( )
               ,
               .site (Object) CallSiteBinder(DoOperation GreaterThan) (
                 .extension CodeContextExpression ( )
                 ,
                 .extension GlobalVariableExpression ( )
                 ,
                 0
               )
             )
           ) {
             /*empty*/
           } .else {
             .block  {
               /*empty*/,
               .break .label 0x03fe2d21
             }
           }
           ,
           .block  {
             .extension AssignmentExtensionExpression ( .extension GlobalVariableExpression ( )
             .site (Object) CallSiteBinder(DoOperation Subtract) (
               .extension CodeContextExpression ( )
               ,
               .extension GlobalVariableExpression ( )
               ,
               1
             ))
             ,
             .extension AssignmentExtensionExpression ( .extension GlobalVariableExpression ( )
             'ToyHelpers.Add'(
               (Object).extension GlobalVariableExpression ( )
               ,
               (Object)1
             ))
             ,
             /*empty*/
           },
           /*empty*/,
           /*empty*/
         }
       },
       'ToyHelpers.Print'(
         .extension CodeContextExpression ( )
         ,
         .extension GlobalVariableExpression ( )
       ),
       /*empty*/
     },
     .null
   }
 }

The IL code was even more confusing so I only tried to figure out the simple "print 1+2" expression IL code that can be produced by calling ScriptCode.SaveToAssembly(..), more DLR IL code analysis will happen at the end of this article:


  .method public specialname static object <toyblock>$1$1(
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.Scope $scope,
 class [Microsoft.Scripting]Microsoft.Scripting.Runtime.LanguageContext $language) cil managed
  {
    .maxstack 3
    .locals init (
      [0] class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext context)
    L_0000: ldarg.0 
    L_0001: ldarg.1 
    L_0002: call class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext
[Microsoft.Scripting]Microsoft.Scripting.Runtime.ScriptingRuntimeHelpers::CreateTopLevelCodeContext(
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.Scope,
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.LanguageContext)
    L_0007: stloc.0 
    L_0008: ldloc.0 
    L_0009: ldc.r8 1
    L_0012: box float64
    L_0017: ldc.r8 2
    L_0020: box float64
    L_0025: call object ToyScript.ToyHelpers::Add(object, object)
    L_002a: call void ToyScript.ToyHelpers::Print(class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext, object)
    L_002f: ldnull 
    L_0030: ret 
  }

There is a lot of extra stuff in there, but you can clearly see the 1+2 in L_0009 to L_00020 with all its boxing fun. Then those two are added with help of a method Add that is defined in the ToyHelpers class (unboxing the doubles again, adding and boxing them to an object again, which is returned). Then the result is passed along to the ToyHelpers.Print method, which just calls Console.WriteLine in my case (usually it is used by the ConsoleHost class for the normal REPL console interaction).

I am not a DLR expert and there is probably a lot of other things going on, but to me all this boxing, unboxing and extra code just to add some numbers is clearly too much for the big loop. At the end of this article I will try to write some DLR code that does less boxing/unboxing.

Python 2.6

After testing the DLR it was time to check out the native Python implementations to see if this is releated to the dynamic nature of dynamic languages like Python or Lua. Well it obviously is related, but more specifically I wanted to see if certain implementaions are maybe more clever or optimized.

The Python code is the same as above for IronPython:


loops = 10000000
counter = 0.0
while loops > 0:
    loops = loops - 1
    counter = counter + 1.0
print counter

Executing this takes about 3.5 seconds, we could cheat a little and use the following code to execute this loop a little faster:


counter = 0.0
for loops in range(0, 10000000):
    counter = counter + 1.0
print counter

This runs about 2 times faster (less than 2 seconds) because Python is clever and optimizes for loops. The same actually applies to IronPython, a for loop is much faster here too (5.5s instead of 9.1s). But this defeats the whole point of this exercise, I wanted to know what is slow about the DLR and implementations like IronPython or ToyScript and why. It is nice that Python executes for loops quicker, but the whole counter = counter + 1 thing is still very slow compared to C#, C++, etc.

Python 3.1

Executing Python 3.1 is pretty much the same as Python 2.6. It is slighly slower, but that can also be because I use a alpha build (3.1 alpha 2 from 2009-04-04). Again, executing the for-loop version with range was twice as fast (2.1s instead of 4.3s). In general is is quite known that Python is not one of the fastest languages, there are many dynamic languages that can be faster (for example Ruby or Lua), but there are also much slower ones like PHP. If you are more interested about comparing IronPython with Python read the performance comparisons at the IronPython project at codeplex: http://ironpython.codeplex.com/Wiki/View.aspx?title=IronPython%20Performance

Lua 5.1.4

The lua code is pretty similar to Python, but it executes much faster:


loops = 10000000
counter = 0.0
while (loops > 0) do
    loops = loops-1
    counter = counter+1
end
print(counter)

This executes in less than 1.5 seconds and returns the expected 10000000 (not 10000000.0 as Python however since Lua treats everything as a 64bit double anyway). The for-loop trick works here too and reduces the time in half (0.7s instead of 1.4s):


counter = 0.0
for i=1,10000000 do
    counter = counter+1.0
end
print(counter)

The reason why Lua is much faster than all the other dynamic languages I have tested is probably because Lua has much simpler data types and is optimized a lot for math, for loops and stuff like that. I use Lua more than Python these days and the performance is quite good I have to say (using it for AI, some physics and some game logic). As you can see here in many benchmarks Lua is several times faster than Python (but Python is still a lot more powerful with the huge amount of libraries available for it).

Script.NET

Script.NET is a great project that uses the Irony compiler toolkit to parse and evaluate simple C# code. It does not generate any IL or immediate code, it just evaluates the AST directly. This is a very nice approach and easy to understand because all the code is C#, but there is a slight problem with it: This approach is it VERY slow. It was optimized many times already and runs much faster than a year ago, but I would not recommend it for anything other than some scripting in C# (guess what, that's the whole name of the project). If you do not need high performance Script.NET is just fine for executing a couple lines of code, even for games as long as you do not do any physics or rendering code with it ^^


loops = 10000000;
counter = 0;
while (loops > 0)
{
    loops = loops-1;
    counter = counter+1;
}
Console.WriteLine(counter);

The output appears after almost 40 seconds and displays correctly:

10000000

At least I now know why generating some IL code is very important, with or without the DLR. Next I will investigate IL code generation with Reflection.Emit, but I can still remember the pain from the last time I tried to do something with it.

IL Emit

So let's get right to it, the following code will generate the required IL for us:


    // Create a dynamic method called DoLoop where we put all our IL code
    DynamicMethod DoLoop = new DynamicMethod("DoLoop",
        typeof(double), null, typeof(string).Module);

    // Start a timer once again to measure creation time and execution time!
    Stopwatch timer = new Stopwatch();
    timer.Start();

    // Get an ILGenerator and emit a body for our dynamic method
    ILGenerator il = DoLoop.GetILGenerator();
    
    // Create counter (double)
    LocalBuilder counter = il.DeclareLocal(typeof(double));
    il.Emit(OpCodes.Ldc_R8, 0.0);
    il.Emit(OpCodes.Stloc, counter);
    // Assign loops variable to 10000000 (10 million) for the loop!
    LocalBuilder loops = il.DeclareLocal(typeof(int));
    il.Emit(OpCodes.Ldc_I4, 10000000);
    il.Emit(OpCodes.Stloc, loops);
    
    // Start loop label
    Label startLabel = il.DefineLabel();
    il.MarkLabel(startLabel);
    
    // Do this 10000000 times:    
    // Increase counter by 1.0
    il.Emit(OpCodes.Ldloc, counter);
    il.Emit(OpCodes.Ldc_R8, 1.0);
    il.Emit(OpCodes.Add);
    il.Emit(OpCodes.Stloc, counter);

    // Decrease loops by 1
    il.Emit(OpCodes.Ldloc, loops);
    il.Emit(OpCodes.Ldc_I4, 1);
    il.Emit(OpCodes.Sub);
    il.Emit(OpCodes.Stloc, loops);

    // Check if we still have to loop (as long as loops > 0)
    il.Emit(OpCodes.Ldloc, loops);
    il.Emit(OpCodes.Ldc_I4, 0);
    il.Emit(OpCodes.Bgt, startLabel);

    // End loop and return counter!
    il.Emit(OpCodes.Ldloc, counter);
    il.Emit(OpCodes.Ret);

    // Create a delegate that represents the dynamic method. This
    // action completes the method, and any further attempts to
    // change the method will cause an exception.
    DoLoopDelegate loop = (DoLoopDelegate)
        DoLoop.CreateDelegate(typeof(DoLoopDelegate));

    timer.Stop();
    Console.WriteLine("DoLoop IL creation time: " + timer.ElapsedMilliseconds + "ms");
    timer.Reset();

This creates the dynamic DoLoop method, which we can call with loop() from here on, which will also return the result (counter) once we call it. This all takes almost no time at all:

DoLoop IL creation time: 0ms

Next, we need to invoke this dynamic method, which is rather easy:


    timer.Start();
    Console.WriteLine("Invoking DoLoop() returned: " + loop());
    timer.Stop();
    Console.WriteLine("DoLoop IL execution time: " + timer.ElapsedMilliseconds + "ms");

And this only takes about 16ms and also returns the correct value:

Invoking DoLoop() returned: 10000000
DoLoop IL execution time: 16ms

So IL code is rather fast (faster than everything else here except the fine-tuned assembler, which is similar to this), but don't ask in how many problems I ran into just writing these few lines of code. Let's just say a lot of times I used wrong opcodes, did not know how to implement stuff and finally when stuff compiled I got syntax errors when executing, not really telling me what I did wrong, so I had to comment out code until it worked again and then slowly added code until I could figure out what had to be changed. Let's just say not a pleasant thing to write a compiler this way.

Irony + DLR (dummy)

As mentioned before Script.NET actually uses the Irony compiler toolkit for parsing and building the AST (abstract syntax tree). Irony itself is rather fast (can parse 15000 lines of code or more per second), only the execution part of Script.NET wasn't so good because no code is generated, everything is evaluated on the fly.

While I have started using Irony just a week ago after playing around with ANTLR first (see earlier blog post here), but I still like it very much and I hope Roman Ivantsov (the guy behind Irony) will soon release a long promised update with some cool new features I can sink my teeth into. As you will see in my sample below I have added some own functionality to Irony already to make it easier to test code and I will probably extend it even more when I write more of my language grammar.

If you don't know how Irony works and you are interested in building compiler grammars directly in C# you should check out the project and the Lang.NET talks Roman did this and last year about it, plus the great Irony article on CodeProject of course. Basically you define some non-terminals (variables, numbers) and terminals (logical blocks of your code) in the constructor of a Irony.Grammar derived class and then define some rules for it (in BNF = Backus-Naur Form). Then you set the root node and you are ready to parse some code. Please note that this grammar is very simple and will only handle this sample code and very similar code blocks. I'm still just playing around with grammars, there is no concrete implementation of my language yet, just a lot of ideas .. Writing some grammars is a good training, I have written one in ANTLR and 4 other more complex ones with Irony yet.


    public TestDoLoopGrammar()
    {
        // 1. Terminals
        var variable = new IdentifierTerminal("variable");
        var number = new NumberLiteral("number");

        // 2. Non-terminals
        var program = new NonTerminal("program");
        //var commandList = new NonTerminal("commandList");
        var command = new NonTerminal("command");
        var assignmentOperator = new NonTerminal("assignmentOperator");
        var assignment = new NonTerminal("assignment");
        var whileLoop = new NonTerminal("whileLoop");
        var compareOperator = new NonTerminal("compareOperator");
        var condition = new NonTerminal("condition");
        var binaryOperator = new NonTerminal("binaryOperator");
        var operation = new NonTerminal("operation");

        // 3. BNF rules
        // Please note that this program grammar is rather limited and not really
        // useful, it is only used for testing and learning Irony ..
        // We can have many commands in case we want to go crazy
        program.Rule = MakePlusRule(program, null, command);
        assignmentOperator.Rule = Symbol("=");
        assignment.Rule = variable + assignmentOperator + number;
        whileLoop.Rule = Symbol("while") + condition + ":" +
            program + "end";
        // We only allow some simple compare conditions
        compareOperator.Rule = Symbol("<") | "==" | ">";
        condition.Rule = variable + compareOperator + number;
        // We can either use + or -
        binaryOperator.Rule = Symbol("+") | "-";
        operation.Rule =
            variable + assignmentOperator + variable + binaryOperator + number;
        // Commands can be assignments, while loops or operations
        command.Rule = assignment | whileLoop | operation;

        // 4. Operators precedence and punctuation
        RegisterPunctuation(":", "end");

        // 5. Global settings
        this.CaseSensitive = false;
        this.ThrowGrammarExceptionsOnError = true;

        // 6. Set grammar root
        this.Root = program;
    } // TestDoLoopGrammar()

Note that ThrowGrammarExceptionsOnError is not a property of Irony, I added it to make sure all Grammar errors immediatly cause an exception to be thrown, which makes it much easier for unit tests and figuring out what when wrong. Not really sure why Irony does not do that, I can understand not throwing exceptions for syntax errors when parsing some source code (because that might be slower than just reporting errors in a list form and more helpful because its easier to continue after the error). But grammar errors mean usually that you cannot use the grammar for anything anyway. Some things also just produce a warning, but I throw Exceptions then too because I don't want hidden fixes of my grammar, which would make it hard to fix real problems later.

Anyway, with the TestDoLoopGrammar class we can now test and parse some code. Irony usually expects you to use its Grammar Explorer program to do this job, which is a pretty nice application, but it is not unit test and we all know I'm all about unit testing (don't we?). I still use the Irony Grammar Explorer from time to time and I recommend it very much if you want to test out grammars and visualize the AST a little better than just throwing some text into a console window. The following simple unit test will parse our while-loop program source code and produce a nice AST for us, which it kindly prints out:


    public void GenerateDoLoopAST()
    {
        AstNode program = IronyHelpers.ParseWithErrorHandling(
            new TestDoLoopGrammar(),
@"loops = 10000000
counter = 0.0
while loops > 0:
    loops = loops - 1
    counter = counter + 1.0
end");

        IronyHelpers.PrintAst(program);
    }

This will quickly print out the AST in the following way, which might seem a little complex and verbose at first, but you have to remember that we do not have to generate code for every single node. Often several nodes together will just be one single instruction. Other nodes like program are just used to hold code blocks together and allow several child nodes.

 program
   assignment
     loops [variable]
     assignmentOperator
       = [Symbol]
     10000000 [number]
   assignment
     counter [variable]
     assignmentOperator
       = [Symbol]
     0 [number]
   whileLoop
     while [variable]
     condition
       loops [variable]
       compareOperator
         > [Symbol]
       0 [number]
     program
       operation
         loops [variable]
         assignmentOperator
           = [Symbol]
         loops [variable]
         binaryOperator
           - [Symbol]
         1 [number]
       operation
         counter [variable]
         assignmentOperator
           = [Symbol]
         counter [variable]
         binaryOperator
           + [Symbol]
         1 [number]
     end [variable]

Now this tree has to be translated to some code. We could use Reflection.Emit for this, but we just saw how much work it was for this very simple sample. Thats the reason I'm using the DLR instead, which also gives my language some cool dynamic capabilities for free (plus the other benefits like interacting with other DLR languages). But since the DLR is not a piece of cake and I'm not done exploring it yet (just started last two week with it, gimme more time), I will not present a full project with this language here yet (totally out of the scope of this article). I used a modified version of ToyScript and some other sample projects (following next) to generate some DLR code here, but it is still hacked up very much. I will probably write another long article when I'm done with that and have some cool results to show.

DLR AST with objects

Okay, so let's dig deeper into the DLR. While I got some parts of the above Irony AST working in the DLR with help of the ToyScript sample, the whole DLR framework, the ToyScript sample and especially IronPython is still very hard to understand and its also kinda annoying that recent code drops do not work at all with ToyScript. They compile, but running ToyScript will just produce funny errors like "node not reducible", which is not very helpful if you just want to learn about the DLR. A lot of stuff in the DLR still constantly changes (just see the huge amount of code drops at dlr.codeplex.com). I totally like that the DLR team releases all this wonderful code and does it so often, but without any working documentation (almost everything you find on the internet is based on DLR versions from a year ago, when stuff worked in a totally different way) and not even the included samples working all the time is kinda painful. So what does a programmer do in such a case? Yes, he writes tests and figures it out on his own. If there were not this many classes that all do almost nothing in functionality (what the hell, in my projects each class has many hundred lines of code, but in the DLR it seems like 50% of all classes are just 10-30 lines long) it would a much easier job. Debugging is also much harder because stack-traces become 2 miles long after just calling one of two actual functions, that do not just call other functions. Okay, enough ranting, let's just hope these things get better in the final 1.0 release since Microsoft is not allowing any contributions from the community to the DLR and I certainly do not want to write my own. It is pretty amazing what the DLR can do and how well it performs, it just has to become easier to use. Someone had to say it :)

So the bare minimum you need for the DLR to compile some AST for you is a LanguageContext and a DefaultBinder, define one class of each and derive it accordingly. Then define a constructor for your language context, set the binder and while you are at it add the default System (mscorlib) assembly to the script domain manager. Note this code uses DLR change set 21259 (from Apr 5 2009), but I could also use the more recent 22459 version (from today), but I reverted to make the ToyScript sample work again.


    public YourLanguageContext(ScriptDomainManager manager,
        IDictionary<string, object> options)
        : base(manager)
    {
        Binder = new YourBinder(manager);
        manager.LoadAssembly(typeof(string).Assembly);
    } // YourLanguageContext(manager, options)

Next you need to provide an implementation of ParseSourceCode to make the compiler happy and to actually allow your program to parse some source code in your language:


    protected override ScriptCode CompileSourceCode(SourceUnit sourceUnit,
        CompilerOptions options, ErrorSink errorSink)
    {
        // Create lambda expression
        LambdaBuilder program = AstUtils.Lambda(typeof(object), "YourLanguage");
        program.Body = Expression.Block(GenerateStatements(program, sourceUnit.GetCode()));
        LambdaExpression ast = program.MakeLambda();
    
        Expression<DlrMainCallTarget> globalAst =
            new GlobalLookupRewriter().RewriteLambda(ast);
    
        return new LegacyScriptCode(globalAst, sourceUnit);
} // ParseSourceCode(context)

The interesting thing here is how we build an DLR AST from statements (Expression.Block) with help of the LambdaBuilder. The statements do not exist yet, that's what the GenerateStatements method will be used for. We pass the source code to this method and let its do its magic (this is a reduced version, the actual unit test uses a more complex version with some extra code paths and statements to make sure we can initialize the DLR first and to generate code for the next technique too):


    private List<Expression> GenerateStatements(LambdaBuilder program, string code)
    {
        List<Expression> statements = new List<Expression>();

        // Try simple while loop for performance testing!
        // counter = 0.0
        Expression counter = program.Variable(typeof(object), "counter");
        statements.Add(
            Expression.Assign(
                counter,
                Expression.Convert(
                    Expression.Constant(0.0), typeof(object))
            )
        );
        Expression loops = program.Variable(typeof(object), "loops");
        statements.Add(
            Expression.Assign(
                loops,
                Expression.Convert(
                    Expression.Constant((int)10000000), typeof(object))
            )
        );
        
        // while (loops > 0) { counter = counter + 1; loops = loops - 1; }
        statements.Add(
            AstUtils.While(
                Expression.GreaterThan(
                    Expression.Convert(loops, typeof(int)),
                    Expression.Constant(0)
                ),
                Expression.Block(
                    Expression.Assign(counter,
                        Expression.Convert(
                            Expression.Add(
                                Expression.Convert(counter, typeof(double)),
                                Expression.Constant(1.0)
                            ), typeof(object))),
                    Expression.Assign(loops,
                        Expression.Convert(
                            Expression.Subtract(
                                Expression.Convert(loops, typeof(int)),
                                AstUtils.Constant(1)
                            ), typeof(object)))
                ),
                AstUtils.Empty()
            )
        );
        
        // Print out results of counter to the console!
        MethodInfo consoleWriteLine = typeof(Console).GetMethod("WriteLine",
            new Type[] { typeof(object) });
        statements.Add(
            Expression.Call(
                consoleWriteLine,
                counter
            )
        );

        return statements;
    } // GenerateStatements(program, code)

This is obviously not dynamic code, it just adds a bunch of expression statements to a list and returns it. Please also note that I tried to use object for counter and loops to demonstrate this in a similar way ToyScript or IronPython will generate code (thats how I figured out some of these expressions). But since .NET or the DLR does not know how to add two objects for example and does not even allow to assign a number to an object, we have to do a lot of casting, which is in essense just boxing and unboxing.

This code is then executed by a unit test, which does basically create a ScriptRuntime for us, makes sure it uses this language implementation and then calls engine.CreateScriptSourceFromString with the code to be executed. Please note that GenerateStatements completely ignores the source code, it will always generate the same statements right now.


    Stopwatch timer = new Stopwatch();
    timer.Start();

    // Create runtime
    ScriptRuntimeSetup setup = new ScriptRuntimeSetup();
    Type provider = typeof(YourNamespace.YourLanguageContext);
    setup.LanguageSetups.Add(
        new LanguageSetup(provider.AssemblyQualifiedName, provider.Name));
    ScriptRuntime runtime = new ScriptRuntime(setup);

    // Load Engine
    ScriptEngine engine =
        runtime.GetEngineByTypeName(provider.AssemblyQualifiedName);
    timer.Stop();
    Console.WriteLine("Create DLR runtime: " + timer.ElapsedMilliseconds + "ms");
    timer.Reset();
    timer.Start();

    // Execute command
    ScriptSource code = engine.CreateScriptSourceFromString("Objects");
    code.Execute();

    timer.Stop();
    Console.WriteLine("Execute code: " + timer.ElapsedMilliseconds + "ms");

This test will spend about 430ms in the DLR runtime creation (which is kinda slow and will hopefully be faster in the future) and that is the reason it is checked separately. Then it will spend a few ms for the statement creation (4ms as mentioned in the table above) and finally it will execute the code in about 459ms, which is a lot faster than the other DLR languages. But then again, we cheated quite a bit by providing the exact statements we need. The only thing we can optimize is to get rid of all those boxing and unboxing actions. While it is nice to be under half a second for these 10 million while loop iterations with one add to counter and one subtract from loops statement each, it is still a far cry from the performance we have seen in the compiled languages. Let's check out the IL code generated from the DLR AST to see whats going on here:

{
  .maxstack 5
  .locals init (
    [0] class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext context,
    [1] object obj2,
    [2] object obj3)
  L_0000: ldarg.0 
  L_0001: ldarg.1 
  L_0002: call class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext
[Microsoft.Scripting]Microsoft.Scripting.Runtime.ScriptingRuntimeHelpers::CreateTopLevelCodeContext(
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.Scope,
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.LanguageContext)
  L_0007: stloc.0 
  L_0008: ldc.r8 0
  L_0011: box float64
  L_0016: stloc.1 
  L_0017: ldc.i4 0x989680
  L_001c: box int32
  L_0021: stloc.2 
  L_0022: ldloc.2 
  L_0023: unbox.any int32
  L_0028: ldc.i4.0 
  L_0029: cgt 
  L_002b: brfalse L_0035
  L_0030: br L_003a
  L_0035: br L_0063
  L_003a: ldloc.1 
  L_003b: unbox.any float64
  L_0040: ldc.r8 1
  L_0049: add 
  L_004a: box float64
  L_004f: stloc.1 
  L_0050: ldloc.2 
  L_0051: unbox.any int32
  L_0056: ldc.i4.1 
  L_0057: sub 
  L_0058: box int32
  L_005d: stloc.2 
  L_005e: br L_0022
  L_0063: ldloc.1 
  L_0064: call void [mscorlib]System.Console::WriteLine(object)
  L_0069: ldnull 
  L_006a: ret 
}

It might be a little hard to figure out which line does what, but once you found the constants like 0x989680 (which is just 10000000) and the add and sub operations, you can get an idea how the statements were translated. The code between L_0035 and L_0063 is the inner loop and the performance critical code. There are 2 unboxing and 2 boxing operations in that code and probably some other unnecessary assignments.

DLR AST with doubles

While playing around with the code even more and getting to a version like the one that will be shown in this section, I tried to understand why the DLR did certain things in the way that it does. A lot of dynamic behavior is only possible when you use very generic types, but I still think for my language implementation it should be possible to use a much more strict rule set and only allow dynamic code where necessary or wanted, not in thinks like loops or math, where we deal with numbers anyway. Lua is a great example of that, numbers are just numbers. They are always doubles, so there is no need to box or unbox them all the time, which is probably one of the reasons it can handle math better than other dynamic languages.

I also found this useful explanation of what IL the DLR emits for dynamic code, it could explain why stuff is slower than IL code, but still a lot faster than evaluating everything ever time. The DLR does a great job of caching and optimizing dynamic code, but as you can see the IL generated is quite long and verbose too, which is ok for dynamic code I guess. I'm just interested in ways to optimize this further to at least provide proof that the DLR is in fact just as good as a compiler for more static languages.

After the last section it is quite obvious what we have to do do generate statements that can be executed quicker: Get rid of all that boxing and unboxing, use specific data types and make sure all the assignments use the correct format too (else convert constants or other variables). This should not be too hard to do in an actual language once we figure out that we are only dealing with numbers (similar to Lua we could only allow one kind of number, like doubles). This is the updated code for GenerateStatements:


    private List<Expression> GenerateStatements(LambdaBuilder program, string code)
    {
        List<Expression> statements = new List<Expression>();

        // Try simple while loop for performance testing!
        // counter = 0.0
        Expression counter = program.Variable(typeof(double), "counter");
        statements.Add(
            Expression.Assign(
                counter,
                Expression.Constant(0.0)
            )
        );
        Expression loops = program.Variable(typeof(double), "loops");
        statements.Add(
            Expression.Assign(
                loops,
                Expression.Constant(10000000.0)
            )
        );

        // while (loops > 0) { counter = counter + 1; loops = loops - 1; }
        statements.Add(
            AstUtils.While(
                Expression.GreaterThan(
                    loops,
                    Expression.Constant(0.0)
                ),
                Expression.Block(
                    Expression.Assign(counter,
                        Expression.Add(counter,
                            Expression.Constant(1.0))),
                    Expression.Assign(loops,
                        Expression.Subtract(loops,
                            AstUtils.Constant(1.0)))
                ),
                AstUtils.Empty()
            )
        );

        // Print out results of counter to the console!
        MethodInfo consoleWriteLine = typeof(Console).GetMethod("WriteLine",
            new Type[] { typeof(double) });
        statements.Add(
            Expression.Call(
                consoleWriteLine,
                counter
            )
        );

        return statements;
    } // GenerateStatements(program, code)

BTW: To analyze the IL code I use the ScriptCode.SaveToAssembly method and use Reflector to show the IL code. As you can see the code has become shorter and we can't see any signs of boxing and unboxing anymore. The code is not even optimal because we use doubles for everything (remember that the C#, C++ and Assembler samples all used a simple integer for the while loop), but as the execution time with less than 100ms shows us, the performance is quite good with this approach.

{
  .maxstack 5
  .locals init (
    [0] class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext context,
    [1] float64 num,
    [2] float64 num2)
  L_0000: ldarg.0 
  L_0001: ldarg.1 
  L_0002: call class [Microsoft.Scripting]Microsoft.Scripting.Runtime.CodeContext
[Microsoft.Scripting]Microsoft.Scripting.Runtime.ScriptingRuntimeHelpers::CreateTopLevelCodeContext(
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.Scope,
class [Microsoft.Scripting]Microsoft.Scripting.Runtime.LanguageContext)
  L_0007: stloc.0 
  L_0008: ldc.r8 0
  L_0011: stloc.1 
  L_0012: ldc.r8 10000000
  L_001b: stloc.2 
  L_001c: ldloc.2 
  L_001d: ldc.r8 0
  L_0026: cgt 
  L_0028: brfalse L_0032
  L_002d: br L_0037
  L_0032: br L_0054
  L_0037: ldloc.1 
  L_0038: ldc.r8 1
  L_0041: add 
  L_0042: stloc.1 
  L_0043: ldloc.2 
  L_0044: ldc.r8 1
  L_004d: sub 
  L_004e: stloc.2 
  L_004f: br L_001c
  L_0054: ldloc.1 
  L_0055: call void [mscorlib]System.Console::WriteLine(float64)
  L_005a: ldnull 
  L_005b: ret 
}

This is a more complex version of the above unit test from last section, the 10000000 is printed by the DoLoop DLR code:

runtime creation: 336ms
get engine: 16ms
create script: 4ms
Init code and first script startup time: 232ms
10000000
execute again (DoLoop code): 98ms
and shutdown: 0ms

Okay, I would say this blog post is long enough now and I wrote all day on it and the code samples presented here. Could even be the longest blog post on my blog (html is 150kb). Who knows .. it is probably a good idea to go to bed now.

Good fight, good night, till next time :)