Delta Engine Blog

AI, Robotics, multiplatform game development and Strict programming language

For vs Foreach Performance

Today I had a discussion with a few other programmers (Hi ViperDK and Timo ^^) about foreach and for loop performance. Everyone had his own ideas and experiences, but we talked mostly about XNA on the Xbox 360 (using the compact framework, that does not like foreach, just use for all the time in tight render loops for your XNA games, foreach will create too much IEnumerator garbage that will not be collected) or other special cases (huge arrays or lists).

Anyway, just guessing around is obviously not as good as actually writing a performance test application and see whats going on (see end of this article for download link). Again, we need a disclaimer: You will probably never write loops with 1 billion iterations, these numbers are similar for less iterations, but you should always try to write clean code and ONLY optimize when it is required after finding performance issues! Lets go directly to the results for 1 billion for or foreach iterations in release mode (will loop through a 10k array or list for 100k times, see below for code):



As you can see there are some major differences between using foreach with an array or using a generic list (boxing/unboxing is not the issue here, you would not be able to archive this performance with an ArrayList). Another thing you can notice right away is that using simple arrays is always faster than using dynamic lists, even converting them from a list to an array via .ToArray() and then executing the foreach is much faster.

It also makes a HUGE difference executing this in debug mode (I execute my code almost 99% in debug mode), everything is 4 times slower for this special performance test. Except for Foreach+List, which is in comparison not that bad in debug mode. What is really bad is For+List, which almost takes 16 seconds to execute 1 billion times (again using my Core 2 Duo 6600 CPU with 4 GB Ram).



Thanks to some helper methods the code for this performance test is pretty short and easy (check end of this article for the source code download link):
 Stopwatch timer = new Stopwatch();
 // Run through all tests
 for (int test = 0; test < 8; test++)
 {
     // Find out which test we wanna make
     bool rememberLength = (test / 4) % 2 == 1;
     bool useForeach = (test / 2) % 2 == 1;
     bool useList = test % 2 == 1;

     // Start timer for test 
     timer.Start();

     // Execute
     int result = RunForOrForeachLoops(rememberLength, useForeach, useList);

     // Stop timer
     timer.Stop();

     // Skip this test if we got no result!
     if (result == 0)
         continue;

     // Report results!
     resultText = ...
     Console.WriteLine(resultText + ": " + timer.ElapsedMilliseconds + "ms");

     // Reset timer for next run
     timer.Reset();
 } // for
The interesting code obviously happens in RunForOrForeachLoops. It basically executes the 1 billion iterations in several different ways (8 ways, as you can guess by the for main test for loop). All of those produce the same result, half of the tests use int[] intArray, the other half uses List<int> intList. And again half of the tests uses just for loops, the other half uses foreach for the iterations. Lets examine the useForeach=true and rememberLength=false cases (btw: Iterations is 100000 and the intList and intArrays have a size of 10000 elements):

if (useForeach)
{
    if (useList)
    {
        for (int iter = 0; iter < Iterations; iter++)
        {
            result = 0;
            foreach (int val in intList)
                result += val;
        } // for
    } // if
    else
    {
        for (int iter = 0; iter < Iterations; iter++)
        {
            result = 0;
            foreach (int val in intArray)
                result += val;
        } // for
    } // else
} // if
else // useForeach == false
...

To figure out the differences between those few lines of code in RunForOrForeachLoops, that almost look the same, we have go to IL code once again. First lets check out what the fastest code block Foreach+Array is being compiled to in IL:
result = 0;
foreach (int val in intArray)
    result += val;
This becomes in IL:
 L_0007: ldsfld int32[] TestSimpleForeachApp.Program::intArray
 L_000c: stloc.2 
 L_000d: ldc.i4.0 
 L_000e: stloc.3 
 L_000f: br.s L_001d
 L_0011: ldloc.2 
 L_0012: ldloc.3 
 L_0013: ldelem.i4 
 L_0014: stloc.1 
 L_0015: ldloc.0 
 L_0016: ldloc.1 
 L_0017: add 
 L_0018: stloc.0 
 L_0019: ldloc.3 
 L_001a: ldc.i4.1 
 L_001b: add 
 L_001c: stloc.3 
 L_001d: ldloc.3 
 L_001e: ldloc.2 
 L_001f: ldlen 
 L_0020: conv.i4 
 L_0021: blt.s L_0011
In L_0021 we jump up to L_0011 as long as we are not done with the loop yet. Inside all the result += val happens, but we do not see any slow code, there are no calls to any methods, there is no unboxing and this all can probably be executed quickly after being complied to assembler by the JIT (just-in-time) compiler.

On the other hand, if we look at pretty much the same code, just with intList instead of intArray, the whole story changes:
result = 0;
foreach (int val in intList)
    result += val;
becomes much more complex IL code with lots of function calls, the creation of an Enumerator and there is even a dispose at the end (hello GC):
  L_0007: ldsfld class [mscorlib]System.Collections.Generic.List`1<int32> TestSimpleForeachApp.Program::intList
  L_000c: callvirt instance valuetype [mscorlib]System.Collections.Generic.List`1/Enumerator<!0>
[mscorlib]System.Collections.Generic.List`1<int32>::GetEnumerator() L_0011: stloc.2 L_0012: br.s L_0020 L_0014: ldloca.s CS$5$0000 L_0016: call instance !0 [mscorlib]System.Collections.Generic.List`1/Enumerator<int32>::get_Current() L_001b: stloc.1 L_001c: ldloc.0 L_001d: ldloc.1 L_001e: add L_001f: stloc.0 L_0020: ldloca.s CS$5$0000 L_0022: call instance bool [mscorlib]System.Collections.Generic.List`1/Enumerator<int32>::MoveNext() L_0027: brtrue.s L_0014 L_0029: leave.s L_0039 L_002b: ldloca.s CS$5$0000 L_002d: constrained [mscorlib]System.Collections.Generic.List`1/Enumerator<int32> L_0033: callvirt instance void [mscorlib]System.IDisposable::Dispose() L_0038: endfinally
What even looks worst too me is the code that is executed in MoveNext (get_Current just returns this.current, no extra instructions here). Since the IL code is very long and confusing (31 IL instructions), here is the C# code of MoveNext:
public bool MoveNext()
{
  List<T> list = this.list;
  if ((this.version == list._version) && (this.index < list._size))
  {
    this.current = list._items[this.index];
    this.index++;
    return true;
  }
  return this.MoveNextRare();
}

And MoveNextRare could also be called at the end of MoveNext if something changes for the list (not the case in our example). But we have to keep in mind that this is just IL code, not actual code that will be executed on the CPU. The JIT compiler can still optimize things out, which we won't see here. But something must be going on since Foreach+List is definiately slower than using all the other approaches to get through all the values in intList. I recommend Foreach+Array converted from List because the IL for Foreach+Array is really fast and does not use any extra methods or Enumerators, especially if you know that the array won't change much and if you do not have to call .ToArray every time when you do a foreach.

And here is the sample project for all this performance testing fun:


    
Comments are closed