Delta Engine Blog

AI, Robotics, multiplatform game development and Strict programming language

Loading Collada Models in XNA 2.0 and doing Skinning and Animation

This article is very similar to the original "Skeletal Bone Animation and Skinning with Collada Models in XNA", which is one of the most read articles on my blog. The old article and the SkinningWithColladaModelsInXna project only worked on XNA 1.0; this is the updated version for XNA 2.0. Most of the content of this post stays the same, but the source code and downloads have changed.


Video, Screenshot and Downloads

First of all, here is a video for the project I'm talking about here:

Screenshot of the same scene (better quality, since YouTube videos just suck quality-wise):

Before you start reading this article I suggest you download the executable and source code first in case you want to try it out directly while reading. The whole project is based on the RocketCommanderXNA engine and just adds the ColladaModel.cs class, which does all the amazing stuff you see in the video ;-) The video above shows shadowing from the XNA Shooter engine, which is not yet released and therefore not included in the downloads. The rest of the code is the same, just the shadow map rendering was removed. More details about shadow mapping in XNA can also be found in my book (plug plug plug ^^).

  • The source code is about 2.6 MB and contains projects for XNA Game Studio (which work in Visual Studio 2005 too thanks to XNA 2.0). It was tested on Windows XP and Vista, 32 and 64 bit, as well as on the Xbox 360. Please note that the PostScreenShaders are untested and need some tweaking to work correctly, but all the other things like the model loading and rendering work just fine. (2.6 MB)
  • The executable is available too; it will install all necessary files like the XNA 2.0 Framework and DirectX files. You will however need a Shader Model 2.0 capable graphics card. You can read the next blog post for more information on how to create such setups :)
    XnaSkinningColladaModelsSetup.exe (2.2 MB)

Introduction and why Collada?

Note: The rest of this article is from Feb 2007, I have just corrected some errors and updated the source code.

OK, let's get started. I have talked a lot about the XNA Content Pipeline and its problems on my blog, in several recent interviews and also in my upcoming XNA book. For projects like Rocket Commander or even the Racing Game it was sometimes a little bit annoying, but I could do everything I needed by loading X files and adding some features on top to fix the tangents, load the correct shader techniques, etc.
However, XNA neither supports loading animation data nor gives you a way to display it; you have to do all that work yourself. This includes static mesh animations (like the ones used in Rocket Commander; I just left them out in Rocket Commander XNA), but also skeletal animation with bones and skinned meshes.

There is a nice project on CodePlex called XNA Animation Component Library, but it only supports FBX and ASCII X files (and recently the BVH, ASF and AMC formats) and I could not get any of my test models to work with this library. The main thing missing here is good shader support, and it also has a lot of problems with complex animated models. Other than that, exporting 3D models as X files is really a pain in the ass no matter which exporter you use (the Panda Exporter for 3dsMax was good a year ago and I still use it, the kiloWatt X file Exporter for 3dsMax is better now, and the Microsoft X File Exporter for Max always sucked). Simple models might work, but the more complex 3D models get and the more meshes and animation data are stored in a scene, the more problems you will have. Sometimes it is not possible to reconstruct everything correctly on the importer side.

Anyway, at the very beginning of the Racing Game development there was no content pipeline in XNA (it was Beta 1) and I implemented loading 3D models with the help of the Collada file format, which is basically just XML and very easy to read. For that reason it was relatively simple to get some 3D data loaded and displayed in the early XNA versions with the help of vertex buffers. There were some problems with shader settings and I had to try many different exporters; I ended up with the one from Feeling Software. Back then it still had some problems loading shaders and using the correct techniques, but the recent version (3.0.2) is much better and works like a charm.

After Microsoft had implemented the content pipeline and made it possible to load X and FBX files indirectly by going through the content pipeline, I had to remove most of my Collada code and re-implement the model loading with the new framework. Loading and displaying 3D models was much simpler this way and especially some early unit tests were really simple, but as soon as I tried shaders and loading tangents there were a bunch of problems. I reported many bugs back then and it has gotten a lot better, but I still had to fix several issues in the Racing Game and Rocket Commander XNA myself, like finding out the correct shader technique and fixing tangent data with the help of a custom model processor.

Some things like the level loading in the Racing Game just do not work with X or FBX files because the levels use splines, which are not exported at all in these formats. Collada came to the rescue again because splines are really no big problem for this format. Later versions of the Racing Game removed the Collada level loading and introduced a binary format for the levels, but the importer still accepts Collada files.

Recently I wanted to test a couple of animated models and use skinning, since it can often be more useful than static animations and it usually looks much better, especially for organic 3D models. As I said above, I could not get anything working with the XNA Animation Component Library, and I especially do not like the way they still use the content pipeline; the project is too complicated for me anyway (I just do not need 6 different file formats, I want one and it should work perfectly with all the features I need).

After some searching I saw that some guy called remi from the Collada forum was working on importing Collada models too, and I posted some thoughts there as well (maybe this was a mistake, I got many emails asking me for tips ^^). Here is the thread about that in the Collada forum. He has provided a test project with some models and it works nicely for static meshes without shader information, but that was not really what I was searching for.

That was a month ago and I did not have much time to work on any of these issues, but after restarting my blog earlier this week I thought this would be a nice topic to talk about. I'm still pretty busy with Arena Wars Reloaded, but I worked a couple of hours every day on this little test program for the past few days. The rest of this article explains the project and which problems I ran into.

Class Overview

Before we go into the details, here is a little class overview of the project. As I said before, the Rocket Commander XNA engine was used to get up and running without having to re-implement the basics. Most of the classes were already in place and did not have to be changed. ColladaModel and SkinnedTangentVertex are the 2 new classes, and to help us out with the XML loading of the Collada files the XmlHelper class was also brought into the project.

I wrote the ColladaModel file from scratch, but I could reuse some of the static mesh loading code I had written last year. All of the bone and animation loading code was just trial and error; I only used the Collada specification as a source of help, and most stuff had to be tested with the unit tests at the end of the class many, many times.

The main program just calls the unit tests TestGoblinColladaModelScene or TestShowBones; there is no real game here, it's just a test program. The unit test then calls several standard engine classes for doing all the post screen processing, rendering the ground plane, etc. More importantly, the ColladaModel class itself basically just provides a constructor and a Render method; everything else is private and will be handled automatically for you. Most people might ignore this, but this is always the most important thing about my classes: usage should be as simple as possible, and when I look at the projects mentioned above I really ask myself why people sometimes make things so complicated.

The internal Bone class inside ColladaModel is used to store all the bones in a flat list, but each entry has a parent and a list of child bones. This way the list can be used in a simple for loop, but you can also traverse it recursively (which is obviously slower and often more complicated). We will talk about the loading process in a minute.

All the mesh data is stored in vertices, which is just a list of SkinnedTangentVertex structs. The SkinnedTangentVertex struct is very much like the standard TangentVertex struct used in Rocket Commander XNA, but it has 2 new members: blendIndices and blendWeights. Both are in the form of a Vector3 and can therefore hold 3 values, allowing us to interpolate up to 3 bone influences for each vertex in the shader. More is usually not required, and since we have to re-normalize all bone weights anyway, skipping the least important bone weights is not a big deal. My test models mostly use 2-3 influences at most. Please also note that the vertex shader now has a lot more work to do with all that skinning, so you should really optimize it as much as possible. Both the number of vertices we have to process (we will talk about reducing that in the optimizing part of this article) and the number of instructions in the vertex shader matter; both should be as low as possible. The GPU is really fast at processing this data, but if you do not have animated geometry with bones, there is no reason to let it process all that data (which can make the vertex shader 2-3 times longer and slower). As an example, the ground of our test scene does not use a skinning shader.
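The "keep only the 3 strongest influences and re-normalize" step can be sketched in a few lines. This is a language-agnostic Python sketch (the helper name is made up; the project does the equivalent in C# when filling the vertices list):

```python
def reduce_influences(influences, max_bones=3):
    """Keep only the strongest bone influences for one vertex and renormalize.

    influences: list of (bone_index, weight) pairs.
    Returns exactly max_bones pairs, padded with (0, 0.0) if needed,
    which maps nicely to the two Vector3 members blendIndices/blendWeights.
    """
    if not influences:
        return [(0, 0.0)] * max_bones
    # Sort by weight, strongest first, and drop the least important ones.
    kept = sorted(influences, key=lambda iw: iw[1], reverse=True)[:max_bones]
    # Renormalize so the remaining weights still sum to 1.
    total = sum(w for _, w in kept)
    kept = [(i, w / total) for i, w in kept]
    # Pad so the vertex always has max_bones entries.
    kept += [(0, 0.0)] * (max_bones - len(kept))
    return kept
```

Dropping the weakest influence of four typically changes the final vertex position very little, which is why 3 influences are enough for most models.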

And finally there are some additions to the ShaderEffect class. First of all we get a new shader called "SkinnedNormalMapping", which does the same thing as the normal mapping (or parallax mapping) shader, but it has an array of 80 bone matrices we can use for skinning. These matrices are set with the help of the SetBoneMatrices method in the Set Parameter region of ShaderEffect.cs.
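Fitting 80 matrices into the shader is only possible because skinning matrices are affine, so their last column is always (0, 0, 0, 1) and only 12 of the 16 floats need to be uploaded; 80 bones then take 240 float4 constants instead of 320, which fits into the roughly 256 vertex constants a Shader Model 2.0 card typically offers. Here is a language-agnostic Python sketch of that idea (not the project's actual HLSL/C# packing code):

```python
def pack_affine_matrix(m):
    """Pack a row-major 4x4 affine matrix into 12 floats (a 4x3 matrix).

    m is 4 rows of 4 floats; the last column must be (0, 0, 0, 1),
    which is always true for rotation + translation skinning matrices.
    """
    for row, last in zip(m, (0.0, 0.0, 0.0, 1.0)):
        assert row[3] == last, "matrix is not affine"
    # Keep only the first three columns of each row.
    return [v for row in m for v in row[:3]]

def unpack_affine_matrix(packed):
    """Rebuild the full 4x4 matrix from the 12 packed floats."""
    rows = [packed[i * 3:i * 3 + 3] for i in range(4)]
    return [row + [1.0 if i == 3 else 0.0] for i, row in enumerate(rows)]
```

In the shader the unpacking is free; the vertex shader just multiplies with the 4x3 matrix directly and never reconstructs the fourth column.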

Loading Collada files

Before we go into the details of the loading process, let's make sure we read the summary of the class first because it clearly states what we can and can't do with this class. This is just a test project and I wanted to make things as simple as possible, both for you as the reader and for my requirements.

   /// <summary>
   /// Collada model. Supports bones and animation for Collada (.dae) exported
   /// 3D Models from 3D Studio Max (8 or 9).
   /// This class is just for testing and it will only display one single mesh
   /// with bone and skinning support, the mesh also can only have one single
   /// material. Bones can be either in matrix mode or stored with transform
   /// and rotation values. SkinnedTangentVertex is used to store the vertices.
   /// </summary>
   class ColladaModel : IDisposable

OK, with that said, let's go directly into the loading code, which is located in the constructor of this class. All variables used in this class are just for internal use; all you need to know are the vertices and bones lists, which I have already mentioned, and the vertex and index buffers, which are used for rendering. All the rest of the variables are just there to help us load the Collada file (don't worry, there are not many variables anyway and most methods are short too).

    #region Constructor
    /// <summary>
    /// Create a model from a Collada file
    /// </summary>
    /// <param name="setName">Set name</param>
    public ColladaModel(string setName)
    {
        // Set name to identify this model and build the filename
        name = setName;
        string filename = Path.Combine(ColladaDirectory,
            StringHelper.ExtractFilename(name, true) + "." + "dae");

        // Load file
        Stream file = File.OpenRead(filename);
        string colladaXml = new StreamReader(file).ReadToEnd();
        XmlNode colladaFile = XmlHelper.LoadXmlFromText(colladaXml);

        // Load material (we only support one)
        LoadMaterial(colladaFile);

        // Load bones (must happen before the mesh and animation loading)
        LoadBones(colladaFile);

        // Load mesh (vertices data, combine with bone weights)
        LoadMesh(colladaFile);

        // And finally load bone animation data
        LoadAnimation(colladaFile);

        // Close file, we are done.
        file.Close();
    } // ColladaModel(setFilename)
    #endregion

As you can see, first of all the filename is constructed, then we just load the file as a text file and throw it at the XmlHelper.LoadXmlFromText helper method (which just uses the existing XmlDocument functionality to load XML from a string). We now get the main Collada node, which contains all the child nodes we need for loading the materials, bones, meshes, etc.
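Because Collada is plain XML, any standard XML parser gets you to the raw data quickly. Here is a toy Python sketch of the same idea (made-up miniature document; the element names <source> and <float_array> come from the Collada format itself, and the project wraps the equivalent XmlDocument calls in XmlHelper):

```python
import xml.etree.ElementTree as ET

# A stripped-down stand-in for a .dae file (real exported files are much
# larger and declare an XML namespace the search paths would need to include).
collada_xml = """<COLLADA>
  <library_geometries>
    <geometry id="Goblin-mesh">
      <source id="positions">
        <float_array count="6">0.0 1.5 -2.25 3.0 4.5 6.0</float_array>
      </source>
    </geometry>
  </library_geometries>
</COLLADA>"""

root = ET.fromstring(collada_xml)
# Find the raw position data and convert the text blob into floats,
# the same job StringHelper.ConvertStringToFloatArray does in the project.
float_array = root.find(".//source[@id='positions']/float_array")
positions = [float(v) for v in float_array.text.split()]
```

This text-to-float conversion is exactly the step that later shows up as the loading bottleneck in the profiler, since real models have hundreds of thousands of numbers in these arrays.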

Next all the materials are loaded, but we are only going to use the first one we find because we only support one single mesh anyway. The LoadMaterial method goes through all used textures and shader effects from the Collada file and constructs the material at the end of the method with the help of a new constructor in the Material class itself. While this is cool and a lot easier than loading material data from X files, it is not very exciting code, so let's move along.

Even though the bones are located at the end of the Collada file, we have to load them first because all our other loading methods, specifically LoadMesh and LoadAnimation, need the bone tree structure and the overall bone list to work. All bones are loaded in sequential order because we want to make sure that we can use the animation matrices later in an easy way without having to check the parent order all the time. Only this way can we be sure that going through our flat bones list still respects the internal tree structure and always initializes the parents first, because the child bone matrices are always multiplied with the parent bone matrices.
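That ordering guarantee can be captured in a couple of lines. Here is a language-agnostic Python sketch of the invariant the loader maintains (a hypothetical miniature Bone class, not the project's C# one):

```python
class Bone:
    """Minimal stand-in: a parent reference plus a child list."""
    def __init__(self, num, parent=None):
        self.num = num          # bone number, assigned in load order
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def parents_come_first(bones):
    """True if every bone appears after its parent in the flat list.

    This is what lets a single forward loop multiply each bone's matrix
    with its parent's already-finished matrix, no recursion needed.
    """
    index = {b: i for i, b in enumerate(bones)}
    return all(b.parent is None or index[b.parent] < index[b] for b in bones)

# Build a tiny skeleton: root -> spine -> (left arm, right arm)
root = Bone(0)
spine = Bone(1, root)
left_arm = Bone(2, spine)
right_arm = Bone(3, spine)
bones = [root, spine, left_arm, right_arm]
```

If the list were shuffled, a single pass could read a parent's matrix before it was updated, which is exactly the bug the sequential loading order avoids.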

    foreach (XmlNode boneNode in boneNodes)
    {
        if (boneNode.Name == "node" &&
            (XmlHelper.GetXmlAttribute(boneNode, "id").Contains("Bone") ||
            XmlHelper.GetXmlAttribute(boneNode, "type").Contains("JOINT")))
        {
            // [...] get matrix
            matrix = LoadColladaMatrix(...);

            // Create this node, use the current number of bones as number.
            Bone newBone = new Bone(matrix, parentBone, bones.Count,
                XmlHelper.GetXmlAttribute(boneNode, "sid"));

            // Add to our global bones list
            bones.Add(newBone);
            // And to our parent, this way we have a tree and a flat list in
            // the bones list :)
            if (parentBone != null)
                parentBone.children.Add(newBone);

            // Create all children (will do nothing if there are no sub bones)
            FillBoneNodes(newBone, boneNode);
        } // if (boneNode.Name)
    } // foreach

As you can see, the code uses the XmlHelper class extensively because otherwise it would look much uglier and more complex. Next we have to load the mesh itself; this is probably the longest method and not easy to figure out if you work with Collada for the first time. Good thing I had already done that in the past and only had to add the code for getting the blend weights and indices. The following code loads all the weights, which we will use later to fill the blendWeights and blendIndices members of the SkinnedTangentVertex structs in the vertices list. The actual code is a little bit more complicated because we have to find out which weights are the top 3 for each vertex in case more than 3 are given.

    #region Load weights
    float[] weights = null;
    foreach (XmlNode sourceNode in skinNode)
    {
        // Get all inv bone skin matrices
        if (sourceNode.Name == "source" &&
            XmlHelper.GetXmlAttribute(sourceNode, "id").Contains("bind_poses"))
        {
            // Get inner float array
            float[] mat = StringHelper.ConvertStringToFloatArray(
                XmlHelper.GetChildNode(sourceNode, "float_array").InnerText);
            for (int boneNum = 0; boneNum < bones.Count; boneNum++)
            {
                if (mat.Length / 16 > boneNum)
                {
                    bones[boneArrayOrder[boneNum]].invBoneSkinMatrix =
                        LoadColladaMatrix(mat, boneNum * 16);
                } // if
            } // for
        } // if

        // Get all weights
        if (sourceNode.Name == "source" &&
            XmlHelper.GetXmlAttribute(sourceNode, "id").Contains("skin-weights"))
        {
            // Get inner float array
            weights = StringHelper.ConvertStringToFloatArray(
                XmlHelper.GetChildNode(sourceNode, "float_array").InnerText);
        } // if
    } // foreach

    if (weights == null)
        throw new InvalidOperationException(
            "No weights were found in our skin, unable to continue!");
    #endregion

For more information about the mesh loading please check out the last region in the LoadMesh method; it should explain all the important steps in case you want to add something there or just see how it works.

Problems with the animation data

Getting the animation data was not so easy. First of all, I had never done this before because my Collada files for the Racing Game were all just static meshes and I really did not need any animation there. Everything that is actually animated in the Racing Game was done directly in XNA, not in 3D Studio.

The first problem is the many formats animation data can come in. You can have rotations around any axis, translations and even scalings, but most of your bones will only use one or two of these, if they are animated at all. Alternatively, all the animation data can be baked into matrices directly by the exporter plugin in 3D Studio Max, and this way you can make sure that all the animation data is in a consistent format. All these variants certainly make testing a lot harder, and if you don't even know the format the matrices are in, or how to apply them to each other and in which order, you are in a world of trouble.

This is exactly what happened to me: I had most of my test models with rotation animation data only, but the Goblin above from my friend Christoph was done with another technique and the exporter could only export matrices, so I had to support that too. After some trial and error I managed to get the basic animations for my test models working. They are all in the project, feel free to load and test them. To test the bone animations I used the following unit test:

    #region Unit Testing
    // Note: Allow calling all this even in release mode (see Program.cs)
    #region TestShowBones
    /// <summary>
    /// TestShowBones
    /// </summary>
    public static void TestShowBones()
    {
        ColladaModel model = null;
        PlaneRenderer groundPlane = null;
        // Bone colors for displaying bone lines.
        Color[] BoneColors = new Color[]
            { Color.Blue, Color.Red, Color.Yellow, Color.White, Color.Teal,
            Color.RosyBrown, Color.Orange, Color.Olive, Color.Maroon, Color.Lime,
            Color.LightBlue, Color.LightGreen, Color.Lavender, Color.Green,
            Color.Firebrick, Color.DarkKhaki, Color.BlueViolet, Color.Beige };

        TestGame.Start("TestShowBones",
            delegate // Init code
            {
                // Load our goblin here, you can also load one of my test models!
                model = new ColladaModel("Goblin");

                // And load ground plane
                groundPlane = new PlaneRenderer(
                    new Vector3(0, 0, -0.001f),
                    new Plane(new Vector3(0, 0, 1), 0),
                    new Material(
                        "GroundStone", "GroundStoneNormal", "GroundStoneHeight"),
                    500.0f); // plane size
            },
            delegate // Render code
            {
                // Show ground
                groundPlane.Render(ShaderEffect.parallaxMapping, "DiffuseSpecular20");

                // Show bones without rendering the model itself
                if (model.bones.Count == 0)
                    return;

                // Update bone animation.
                model.UpdateAnimation(Matrix.Identity);

                // Show bones (all endpoints)
                foreach (Bone bone in model.bones)
                {
                    foreach (Bone childBone in bone.children)
                        // Draw a line between bone and childBone
                        BaseGame.DrawLine(
                            bone.finalMatrix.Translation,
                            childBone.finalMatrix.Translation,
                            BoneColors[bone.num % BoneColors.Length]);
                } // foreach (bone)
            });
    } // TestShowBones()
    #endregion

The most important call here is the call to UpdateAnimation, which goes through the list of bones and updates the so-called finalMatrix from the Bone class for each of the bones. In earlier versions this code was horribly complicated and still had a lot of problems, but as soon as I removed all the rotation, translation, scaling, etc. animation support and just allowed loading the correctly baked matrices from the Collada files, the code became much simpler (the actual code has some optimizations in it, but it is basically the same as the code posted here):

    #region Update animation
    /// <summary>
    /// Update animation.
    /// </summary>
    private void UpdateAnimation(Matrix renderMatrix)
    {
        int aniMatrixNum = (int)(BaseGame.TotalTime * frameRate) % numOfAnimations;

        foreach (Bone bone in bones)
        {
            // Just assign the final matrix from the animation matrices.
            bone.finalMatrix = bone.animationMatrices[aniMatrixNum];

            // Also use parent matrix if we got one
            // This will always work because all the bones are in order.
            if (bone.parent != null)
                bone.finalMatrix *= bone.parent.finalMatrix;
        } // foreach
    } // UpdateAnimation()
    #endregion

For the loading itself we just have to make sure that the animationMatrices are the correct ones. Collada saves them in an absolute way. Earlier code of mine constructed relative matrices (relative to the initial bone matrix); it was easier to construct relative matrices from the rotation, translation, etc. animation data, but much harder to use these matrices later for the animation. Having absolute matrices makes the UpdateAnimation code so much easier, so make sure you always use them this way.

However, when rendering the vertices later we can't use the absolute matrices directly because the vertices have to be transformed first to get into a space relative to the bones; rotations should not happen around the origin, but around the bone positions. Luckily for us (and you should have seen my face when I finally found out that these matrices already exist in Collada and I did not have to create them myself in my own over-complicated way ^^) Collada stores the so-called invBoneSkin matrices for each bone. By applying these matrices we can easily get the bone matrices we need for rendering; these are directly passed to our shader (as compressed 4x3 matrices BTW to save shader constants; the code for that is a little bit more complex, please check out ShaderEffect.cs and the SkinnedNormalMapping.fx shader itself for details).
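Why the inverse bind-pose matrix is needed can be shown with a tiny numeric example. This is a language-agnostic Python sketch using 2D rotate-plus-translate transforms instead of full 4x4 matrices (all names here are made up for illustration): the inverse bind pose moves a vertex into the bone's local space, and the animated absolute matrix moves it back out, so the vertex rotates around the bone instead of around the world origin.

```python
import math

def make_transform(angle, tx, ty):
    """A 2D rotation plus translation, stored as (cos, sin, tx, ty)."""
    return (math.cos(angle), math.sin(angle), tx, ty)

def apply(t, p):
    """Transform point p = (x, y) by t: rotate, then translate."""
    c, s, tx, ty = t
    x, y = p
    return (c * x - s * y + tx, s * x + c * y + ty)

def inverse(t):
    """Invert a rotate+translate transform: [R|t]^-1 = [R^T | -R^T t]."""
    c, s, tx, ty = t
    return (c, -s, -(c * tx + s * ty), s * tx - c * ty)

# Bind pose: the bone sits at (2, 0), unrotated. The "invBoneSkin"
# matrix is simply the inverse of this bind-pose transform.
bind = make_transform(0.0, 2.0, 0.0)

# Animated absolute matrix: same bone position, but rotated 90 degrees.
final = make_transform(math.pi / 2, 2.0, 0.0)

# A vertex one unit in front of the bone rotates around the bone:
# (3, 0) ends up at (2, 1), not swung around the world origin.
vertex = apply(final, apply(inverse(bind), (3.0, 0.0)))
```

Skipping the inverse bind step would apply the 90 degree rotation around the origin instead, which is exactly the kind of exploded-mesh artifact you get when the invBoneSkin matrices are missing.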

    #region GetBoneMatrices
    /// <summary>
    /// Get bone matrices for the shader. We have to apply the invBoneSkinMatrix
    /// to each final matrix, which is the recursively created matrix from
    /// all the animation data (see UpdateAnimation).
    /// </summary>
    /// <returns></returns>
    private Matrix[] GetBoneMatrices(Matrix renderMatrix)
    {
        // Update the animation data in case it is not up to date anymore.
        UpdateAnimation(renderMatrix);

        // And get all bone matrices, we support max. 80 (see shader).
        Matrix[] matrices = new Matrix[Math.Min(80, bones.Count)];
        for (int num = 0; num < matrices.Length; num++)
        {
            // The matrices are constructed from the invBoneSkinMatrix and
            // the finalMatrix, which holds the recursively added animation
            // matrices, and finally we add the render matrix too here.
            matrices[num] =
                bones[num].invBoneSkinMatrix * bones[num].finalMatrix * renderMatrix;
        } // for

        return matrices;
    } // GetBoneMatrices()
    #endregion

And this is what you finally get after executing the TestShowBones unit test. I had not implemented the mesh loading or the shader itself at this point; I was just loading and testing the bones themselves.


Optimizing the vertices

One major problem with the loaded mesh is the high vertex count. I had two test models with 30k and 60k vertices, and as you can imagine this slows down the vertex shader quite a lot; it is really not necessary to process all these vertices because many of them are exactly the same. The reason we end up with an unoptimized vertices list is that Collada stores separate lists for each vertex component, which we have to put together at the end of the LoadMesh method. By doing so we have to duplicate the data many times, and we just can't know in advance how often each part is reused and how often the overall vertex changes. If just the texture coordinate or the normal differs, we have a completely different vertex, which will produce different results in the vertex shader, so just merging everything together is not that simple.
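How those duplicates arise can be seen in a toy Python sketch of the assembly step (made-up data; in a real .dae file the per-component indices come interleaved from the mesh's <p> element):

```python
# Separate source arrays, the way Collada stores them:
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
uvs = [(0.0, 0.0), (1.0, 1.0)]

# One (position_index, uv_index) pair per triangle corner.
corners = [(0, 0), (1, 0), (2, 1),   # triangle 1
           (0, 0), (2, 1), (1, 1)]   # triangle 2

# Naive assembly: build one full vertex per corner, duplicates included.
vertices = [positions[p] + uvs[t] for p, t in corners]
```

Corner (0, 0) appears twice and produces two identical full vertices that could be merged, while the two corners sharing position 1 differ in their texture coordinate and therefore must stay separate vertices.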

The rendering uses an index buffer anyway, but for the data constructed in LoadMesh it would just be sequential (0, 1, 2 form the first polygon, 3, 4, 5 the next, etc.). Instead of having one unique vertex for each index, we can often reuse the same vertex 3 or 4 times, reducing the number of vertices drastically. This also has the advantage that we can store more vertices and more complex meshes even if we stick with ushort (16 bit) indices (which are half the size of ints and therefore faster). For example, if you have 150,000 vertices, but you can reduce them to 40-50,000 optimized vertices, they can all be indexed with ushorts :-)

The easy solution is just to optimize all the vertices after all of them have been loaded. If you have a binary format and do not use Collada directly, this solution is absolutely fine, but it still takes a lot of time to process Collada models with many vertices because we have to check each vertex against every other one, and that can be a lot of compares if you have 60 or 70 thousand vertices in a mesh. It actually takes up to a whole minute just to compute that, and I have no slow computer ^^ Here is the method that does all that for us:

    #region OptimizeVertexBufferSlow
    /// <summary>
    /// Optimize vertex buffer. Note: The vertices list array will be changed
    /// and shortened quite a lot here. We are also going to create the indices
    /// for the index buffer here (we don't have them yet, they are just
    /// sequential from the loading process above).
    /// Note: Slow version because we have to check each vertex against
    /// each other vertex, which makes this method exponentially slower
    /// the more vertices we have. Takes 10 seconds for 30k vertices,
    /// and over 40 seconds for 60k vertices. It is much easier to understand,
    /// but it produces the same output as the fast OptimizeVertexBuffer
    /// method and you should always use that one (it only requires a couple
    /// of milliseconds instead of the many seconds this method will spend).
    /// </summary>
    /// <returns>ushort array for the optimized indices</returns>
    private ushort[] OptimizeVertexBufferSlow()
    {
        List<SkinnedTangentVertex> newVertices =
            new List<SkinnedTangentVertex>();
        List<ushort> newIndices = new List<ushort>();

        // Go over all vertices (indices are currently 1:1 with the vertices)
        for (int num = 0; num < vertices.Count; num++)
        {
            // Try to find already existing vertex in newVertices list that
            // matches the vertex of the current index.
            SkinnedTangentVertex currentVertex = vertices[FlipIndexOrder(num)];
            bool reusedExistingVertex = false;
            for (int checkNum = 0; checkNum < newVertices.Count; checkNum++)
            {
                if (SkinnedTangentVertex.NearlyEquals(
                    currentVertex, newVertices[checkNum]))
                {
                    // Reuse the existing vertex, don't add it again, just
                    // add another index for it!
                    newIndices.Add((ushort)checkNum);
                    reusedExistingVertex = true;
                    break;
                } // if (SkinnedTangentVertex.NearlyEquals)
            } // for (checkNum)

            if (reusedExistingVertex == false)
            {
                // Add the currentVertex and set it as the current index
                newIndices.Add((ushort)newVertices.Count);
                newVertices.Add(currentVertex);
            } // if (reusedExistingVertex)
        } // for (num)

        // Reassign the vertices, we might have deleted some duplicates!
        vertices = newVertices;

        // And return index list for the caller
        return newIndices.ToArray();
    } // OptimizeVertexBufferSlow()
    #endregion


Optimizing the Optimization

While this is all nice and dandy, and we just optimized the rendering code by 20-30% (I tested with 9 goblins using 3 passes for them, 2 for shadowing and 1 for the rendering), the loading now takes painfully long. The main idea here is not to compare every single vertex against every other possible vertex, because that does not make much sense; most of the vertex pairs (>99.9%) will always be different. We only need to check the ones that share at least the same position.

I started with comparing neighboring vertices, but since the vertices are stored in index order, they are totally mixed up; vertex 1 and vertex 4383 can be equal, but if we just check -10 to +10 we are going to miss it. Instead we have to know which vertices come from the same position data, which we know because Collada saves unique positions. All we have to do is keep a list of all vertices that share the same position, and then we can optimize the comparisons later on. Usually only 4 to 6 vertices share the same position, so the whole comparison process needs about 60,000 * 6 comparisons, not 60,000 * 60,000 anymore.
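The same bucket trick, reduced to a short language-agnostic Python sketch (hypothetical data layout; the C# version keeps these buckets in reverseReuseVertexPositions):

```python
def optimize(vertices, position_ids):
    """Deduplicate vertices, only comparing ones that share a position.

    vertices: full vertex tuples, one per triangle corner.
    position_ids: for each vertex, the index of its source position.
    Returns (new_vertices, indices) for the vertex and index buffers.
    """
    new_vertices = []
    indices = []
    # For each source position, the new-buffer indices already emitted for it.
    buckets = {}
    for vertex, pos in zip(vertices, position_ids):
        bucket = buckets.setdefault(pos, [])
        for new_index in bucket:           # usually only 4-6 candidates
            if new_vertices[new_index] == vertex:
                indices.append(new_index)  # reuse, don't add again
                break
        else:
            # No match in this position's bucket: add a brand new vertex.
            bucket.append(len(new_vertices))
            indices.append(len(new_vertices))
            new_vertices.append(vertex)
    return new_vertices, indices
```

Each vertex is only ever compared against the handful of already-added vertices built from the same source position, which is what turns the quadratic search into an effectively linear one.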

    // Initialize reuseVertexPositions and reverseReuseVertexPositions
    // to make it easier to use them below
    reuseVertexPositions = new int[trianglecount * 3];
    reverseReuseVertexPositions = new List<int>[positions.Count / 3];
    for (int i = 0; i < reverseReuseVertexPositions.Length; i++)
        reverseReuseVertexPositions[i] = new List<int>();

    // We have to use int indices here because we often have models
    // with more than 64k triangles (even if that gets optimized later).
    for (int i = 0; i < trianglecount * 3; i++)
    {
        // [...] vertex construction

        // Remember pos for optimizing the vertices later more easily.
        reuseVertexPositions[i] = pos / 3;
        reverseReuseVertexPositions[pos / 3].Add(i);
    } // for (i)

And then finally the fast version of OptimizeVertexBuffer, which uses that data:

    #region OptimizeVertexBuffer
    /// <summary>
    /// Optimize vertex buffer. Note: The vertices list will be changed
    /// and shortened quite a lot here. We are also going to create the indices
    /// for the index buffer here (we don't have them yet, they are just
    /// sequential from the loading process above).
    /// Note: This method is highly optimized for speed, it performs
    /// hundreds of times faster than OptimizeVertexBufferSlow, see below!
    /// </summary>
    /// <returns>ushort array for the optimized indices</returns>
    private ushort[] OptimizeVertexBuffer()
    {
        List<SkinnedTangentVertex> newVertices =
            new List<SkinnedTangentVertex>();
        List<ushort> newIndices = new List<ushort>();

        // Helper to only search already added newVertices and for checking the
        // old position indices by transforming them into newVertices indices.
        List<int> newVerticesPositions = new List<int>();

        // Go over all vertices (indices are currently 1:1 with the vertices)
        for (int num = 0; num < vertices.Count; num++)
        {
            // Get current vertex
            SkinnedTangentVertex currentVertex = vertices[num];
            bool reusedExistingVertex = false;

            // Find out which position index was used, then we can compare
            // all other vertices that share this position. They will not
            // all be equal, but some of them can be merged.
            int sharedPos = reuseVertexPositions[num];
            foreach (int otherVertexIndex in reverseReuseVertexPositions[sharedPos])
            {
                // Only check the indices that have already been added!
                if (otherVertexIndex != num &&
                    // Make sure we already are that far in our new index list
                    otherVertexIndex < newIndices.Count &&
                    // And make sure this index has already been added to newVertices!
                    newIndices[otherVertexIndex] < newVertices.Count &&
                    // Then finally compare vertices (this call is slow, but thanks to
                    // all the other optimizations we don't have to call it that often)
                    SkinnedTangentVertex.NearlyEquals(
                        currentVertex, newVertices[newIndices[otherVertexIndex]]))
                {
                    // Reuse the existing vertex, don't add it again, just
                    // add another index for it!
                    newIndices.Add(newIndices[otherVertexIndex]);
                    reusedExistingVertex = true;
                    break;
                } // if (SkinnedTangentVertex.NearlyEquals)
            } // foreach (otherVertexIndex)

            if (reusedExistingVertex == false)
            {
                // Add the currentVertex and set it as the current index
                newIndices.Add((ushort)newVertices.Count);
                newVertices.Add(currentVertex);
            } // if (reusedExistingVertex)
        } // for (num)

        // Finally flip order of all triangles to allow us rendering
        // with CullCounterClockwiseFace (default for XNA) because all the data
        // is in CullClockwiseFace format right now!
        for (int num = 0; num < newIndices.Count / 3; num++)
        {
            ushort swap = newIndices[num * 3 + 1];
            newIndices[num * 3 + 1] = newIndices[num * 3 + 2];
            newIndices[num * 3 + 2] = swap;
        } // for (num)

        // Reassign the vertices, we might have deleted some duplicates!
        vertices = newVertices;

        // And return index list for the caller
        return newIndices.ToArray();
    } // OptimizeVertexBuffer()
    #endregion
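Stripped of the C# details, the algorithm boils down to a few lines. Here is a rough Python sketch of the same idea (illustrative only; the function and variable names are mine, and a real vertex comparison would use a NearlyEquals-style epsilon test instead of `==`):

```python
# Hypothetical sketch of the fast vertex optimization above (not project code).
# Vertices sharing a source position are the only candidates for merging,
# so we never have to scan the whole vertex list for duplicates.
def optimize_vertices(vertices, reuse_positions, reverse_reuse_positions,
                      nearly_equals=lambda a, b: a == b):
    new_vertices = []
    new_indices = []
    for num, current in enumerate(vertices):
        reused = False
        for other in reverse_reuse_positions[reuse_positions[num]]:
            # Only check indices that were already processed, then compare
            # the actual vertex data (the only slow part).
            if (other != num and other < len(new_indices) and
                    nearly_equals(current, new_vertices[new_indices[other]])):
                new_indices.append(new_indices[other])
                reused = True
                break
        if not reused:
            new_indices.append(len(new_vertices))
            new_vertices.append(current)
    return new_vertices, new_indices

# Example: six loaded vertices where vertex 3 duplicates 1 and 4 duplicates 2.
verts, inds = optimize_vertices(
    ["a", "b", "c", "b", "c", "d"],
    reuse_positions=[0, 1, 2, 1, 2, 3],
    reverse_reuse_positions=[[0], [1, 3], [2, 4], [5]])
```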

With this optimization loading is now pretty fast and rendering also performs nicely thanks to the quick shaders we will discuss below. When running ANTS Profiler over the new project the slowest code is now the actual text parsing, or more precisely the conversion of the long strings holding all the vertex data into the actual float values, especially for the positions array. We can't do much about that short of storing the data in a binary format instead of parsing text ourselves. It takes maybe half a second to load a 50k vertices model; not great, but ok for our little test app.
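To make the bottleneck concrete, here is a tiny Python sketch (my own illustration, not project code) of what this parsing amounts to: a Collada float_array is one huge whitespace-separated string that has to be split and converted number by number.

```python
# Illustrative sketch: Collada stores vertex data as one big text blob,
# e.g. <float_array>0.0 1.5 2.25 ...</float_array>. Loading means splitting
# the string and converting every token to a float, which dominates load
# time for a model with tens of thousands of vertices.
float_array_text = "0.0 1.5 2.25 -3.0 0.5 1.0 2.0 0.25 4.5"  # 3 positions

values = [float(token) for token in float_array_text.split()]

# Group the flat float list into (x, y, z) positions.
positions = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
```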

Adjusting the shaders for skinned meshes

Now we have a bunch of vertices we can render, but as you know we need a shader to do anything in XNA; there is no fixed function pipeline. There are also no animation helper classes like in DirectX, but they wouldn't help you anyway if you want to render with shaders. Transforming the vertices on the CPU can sometimes be an option if you do not have many vertices or many skinned models overall, or if they all have to be updated the same way. But generally it is a much better idea to transform all the vertices on the GPU, which costs a few more instructions in the vertex shader while the rest of the rendering stays the same. Most graphics apps are pixel shader bound anyway, and I use fairly complex pixel shaders too. Additionally, on Shader Model 4.0 cards like the GeForce 8800 with up to 128 parallel unified shader units you can run very complex vertex shaders alongside simpler pixel shaders, and more units will simply be assigned to the vertex shaders automatically :)

We already defined the bone matrix array for skinning above and limited it to 80 bones per mesh, which is quite a lot. Even if you spend 3 bones per finger and 20-30 bones for the body you still have plenty of bones left for complex animations. Sure, modelers will now say "that's not enough sometimes" ... well, you can always split up the mesh into several meshes with up to 80 bones each if you really need more. My graphic artists are happy with 80 bones ^^

If we want full Shader Model 2.0 support we can only be sure that we get at least 256 shader constants. NVIDIA cards usually have more (1024), but you still want your game to run on ATI cards too, especially older ones. Each constant holds a float4 and we need 4 constants for each 4x4 matrix. This means we can only fit up to 64 matrices within this limit, and we still need some constants for the world, view, and projection matrices, the light and material values and anything else we want to do in our shader. You should reserve 10-20 constants for that, and now we are down to fewer than 60 matrices, which might sometimes be too little.

Instead of splitting the mesh or providing a different code path in case the GPU supports more constants (I couldn't get that to work in XNA for some reason, not sure if there is some limit or if I made a mistake; my GPU should be able to handle at least 1024 constants, and even 2048 for the 8800), there is a trick: only save a 3x4 matrix per bone. The last column is always 0, 0, 0, 1 if we have correctly applied the invBoneMatrix and the animation matrix (see GetBoneMatrices in ColladaModel for details and the order of the matrices). But saving 4 float3 values would still need 4 shader constants per matrix, so we have to store the matrix transposed as 3 float4 values instead, which only takes 3 constants.
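Here is a small Python sketch of this packing trick (my own illustration, not project code): a bone matrix whose last column is (0, 0, 0, 1) survives a round trip through just three float4 constants.

```python
# Hypothetical illustration (not project code): pack a bone matrix whose last
# column is always (0, 0, 0, 1) into three float4 shader constants and
# rebuild it again. Rows are lists of 4 floats, XNA-style (translation in
# the last row).
bone = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, -1.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [2.0, 3.0, 4.0, 1.0]]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Pack: the first three rows of the transpose are the 3 float4 constants
# (like skinnedMatricesVS20[index*3 + 0..2] in the shader).
packed = transpose(bone)[:3]

# Rebuild (what the shader's RebuildSkinMatrix does): append the constant
# (0, 0, 0, 1) row. The result is the transposed bone matrix, which is why
# it has to be used in reversed mul order.
rebuilt_transposed = packed + [[0.0, 0.0, 0.0, 1.0]]

# Budget check: 80 bones * 3 constants = 240 of the 256 guaranteed constants.
constants_needed = 80 * 3
```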

My first idea was to grab the .w values of those three constants and reconstruct the translation part of the matrix from them. This worked, but the resulting shader had 80 instructions (up from about 20 without skinning), which is not good for performance when rendering many skinned 3D models.

float4x4 RebuildSkinMatrix(float index)
{
    return float4x4(
        float4(skinnedMatricesVS20[index*3+0].xyz, 0),
        float4(skinnedMatricesVS20[index*3+1].xyz, 0),
        float4(skinnedMatricesVS20[index*3+2].xyz, 0),
        float4(skinnedMatricesVS20[index*3+0].w,
            skinnedMatricesVS20[index*3+1].w,
            skinnedMatricesVS20[index*3+2].w, 1));
} // RebuildSkinMatrix(.)

A better idea is to use the 4x3 matrix as a 3x4 matrix by just reversing the order in which we call mul. This requires some changes to the vertex shader code and can look a little confusing at times, but if you make sure you transform the world matrix the same way and use it in this reversed mul order too, everything works just fine. Another optimization is to pre-multiply the blendIndices of each vertex by 3 at loading time. These indices never change and the shader doesn't care whether they are 0, 1, 2 or 0, 3, 6, etc.; this saves one instruction per matrix we reconstruct (the rest is optimized away by the compiler). This is the much simpler version of RebuildSkinMatrix; for more details take a look at the SkinnedNormalMapping.fx file:

// Note: This returns a transposed matrix, use it in reversed order.
// First tests used a 3x3 matrix +3 w values for the transpose values, but
// reconstructing this in the shader costs 20+ extra instructions and after
// some testing I found out this is finally the best way to use 4x3 matrices
// for skinning :)
float4x4 RebuildSkinMatrix(float index)
{
    return float4x4(
        skinnedMatricesVS20[index+0],
        skinnedMatricesVS20[index+1],
        skinnedMatricesVS20[index+2],
        float4(0, 0, 0, 1));
} // RebuildSkinMatrix(.)

This results in a vertex shader with almost half as many instructions as before, and it is therefore nearly twice as fast :) Good work. Performance tests showed that I could increase the framerate from 220 fps to 270 fps just by doing that (test scene with 9 goblins). The following code is more or less the only part you have to replace in an existing shader to make it skinnable (plus adding the helper method and the skinned matrices, of course).

    // First transform position with bones that affect this vertex
    // Use the 3 indices and blend weights we have precalculated.
    float4x4 skinMatrix =
        RebuildSkinMatrix(In.blendIndices.x) * In.blendWeights.x +
        RebuildSkinMatrix(In.blendIndices.y) * In.blendWeights.y +
        RebuildSkinMatrix(In.blendIndices.z) * In.blendWeights.z;
    // Calculate local world matrix with help of the skinning matrix
    float4x4 localWorld = mul(world, skinMatrix);
    // Now calculate final screen position with world and viewProj matrices.
    float4 worldPos = mul(localWorld, float4(In.pos, 1));
    Out.pos = mul(worldPos, viewProj);
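To convince yourself that the reversed mul order really is equivalent, here is a little Python sketch (my own illustration; the helper names are not from the shader): blending the transposed matrices and transforming with them in reversed order gives the same result as transforming a row vector with the original matrices.

```python
# Hypothetical sketch of the skinning math above (not shader code): blend
# three bone matrices with one vertex's blend weights, then verify that
# mul(M^T, v) in reversed order equals mul(v, M) in normal order.
def mat_vec(m, v):  # column-vector convention, like HLSL mul(matrix, vector)
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]

def vec_mat(v, m):  # row-vector convention, like HLSL mul(vector, matrix)
    return [sum(v[i] * m[i][j] for i in range(4)) for j in range(4)]

def transpose(m):
    return [list(col) for col in zip(*m)]

bones = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [1, 2, 3, 1]],
    [[0, 1, 0, 0], [-1, 0, 0, 0], [0, 0, 1, 0], [4, 5, 6, 1]],
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, -1, 0, 0], [7, 8, 9, 1]],
]
weights = [0.5, 0.3, 0.2]  # blendWeights, always summing to 1

# skinMatrix = weighted sum of bone matrices, same as in the vertex shader.
skin = [[sum(w * b[i][j] for w, b in zip(weights, bones))
         for j in range(4)] for i in range(4)]

pos = [1.0, 2.0, 3.0, 1.0]
reversed_order = mat_vec(transpose(skin), pos)  # transposed, reversed order
normal_order = vec_mat(pos, skin)               # original, normal order
```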


Post screen shaders for the final result

All the skinning code and bone transformations are nice once you get them working, but without a cool model and some post screen shaders to make the scene look better, it is only half the fun. Good thing I got the Goblin 3D model, thanks again to Christoph (WAII). I also had a couple of other test models and another more complex 3D model (big evil monster ^^) which was good for some stress testing, but its material just did not look that cool.

As you can see in the following image, 6 render targets are used to produce the final image. Most shader passes do several things at once, so the list of operations (see the right side of the image) is longer than the list of passes. Most of this was just quickly trying out values until they looked good together. If you are an experienced artist you can probably do much better than me; I'm just playing around with the shader parameters until I get bored and then I leave it the way it is. The sceneMap (render target number three) shows the unmodified scene without the post screen shaders applied. It does not look half as cool as the final result.

I hope you enjoyed this article and that you are not as tired as I am after writing it all at once (uff). It was a fun project, and I already have another one in mind for next week ;-) Take care. If you have problems, post a comment. Please note that not all Collada models will work; you have to follow the rules of ColladaModel or improve the class a bit for other uses.


References and Links