No, I am not kidding: after a week of optimizing I just reached almost 3 million FPS with our engine. I had to disable Clear/Present and Input, but this is my log output (thanks to PerformanceCounters it is also pretty accurate):
FPS: 2703785
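For readers who want to reproduce this kind of number, here is a minimal sketch of an FPS counter built on a high-resolution timer. It is not our engine code: the function name `measure_fps`, its parameters and the use of Python's `time.perf_counter_ns` are assumptions made for illustration only.

```python
import time

def measure_fps(frame, duration_ns=50_000_000, clock=time.perf_counter_ns):
    """Call frame() in a tight loop for about duration_ns nanoseconds
    and return the average frames per second over that interval.

    frame and clock are hypothetical hooks for this sketch: frame is one
    engine tick, clock is any monotonic nanosecond counter.
    """
    start = clock()
    deadline = start + duration_ns
    frames = 0
    now = start
    while now < deadline:
        frame()          # one frame of work (update, draw, ...)
        frames += 1
        now = clock()    # read the high-resolution counter once per frame
    elapsed_ns = now - start
    return frames * 1_000_000_000 / elapsed_ns
```

Calling `measure_fps(lambda: None)` measures an empty frame, which mostly shows the overhead of the loop and the timer itself. At millions of frames per second that overhead is no longer negligible, so averaging over many frames gives steadier numbers than timing each frame separately.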
With Input enabled I can still easily go above 200,000 FPS; most of the time in each frame is spent in whatever the platform's Input GetState method does:
FPS: 202099
With Present enabled (Clear is not needed in my test, as I draw a fullscreen quad) it goes down to ~13,000 FPS, which is not too shabby, but almost all of the time is lost in Present of the XNA 4.0 CTP, which the XNA team has not optimized much yet. (XNA is the main module I use currently; we support many other graphics frameworks as well, but XNA is just the best for Windows IMO.) It might sound crazy, but for performance checking I disable Present, because then I immediately see what's slow in the render code.
FPS: 12680
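FPS numbers this large are easier to compare as time per frame. The sketch below converts the three log values and estimates what each step adds. It assumes that the costs are simply additive and that Input was still enabled in the Present run, neither of which the logs confirm; the variable names are made up for this example.

```python
def frame_time_ns(fps):
    """Average time per frame in nanoseconds for a given FPS value."""
    return 1_000_000_000 / fps

# FPS values taken from the log output in this post
fps_bare = 2_703_785     # Clear/Present and Input disabled
fps_input = 202_099      # Input enabled
fps_present = 12_680     # Present enabled (assumed: Input enabled too)

t_bare = frame_time_ns(fps_bare)
t_input = frame_time_ns(fps_input)
t_present = frame_time_ns(fps_present)

# Assumption: each step just adds its own cost on top of the previous run
input_cost = t_input - t_bare
present_cost = t_present - t_input

input_share = input_cost / t_input        # fraction of the Input run spent in input
present_share = present_cost / t_present  # fraction of the Present run spent in Present

print(f"bare frame:    {t_bare / 1000:.2f} us")
print(f"with Input:    {t_input / 1000:.2f} us ({input_share:.0%} input)")
print(f"with Present:  {t_present / 1000:.2f} us ({present_share:.0%} Present)")
```

Under those assumptions a bare frame takes roughly 0.37 microseconds, a frame with Input about 4.9 microseconds, and a frame with Present about 78.9 microseconds. Both shares come out above 90%, which fits what the profiler shows.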
Other platforms are much worse, but the time is always spent waiting for the graphics card or some external library, so I would say our engine is quite fast as of now. I obviously did not have much enabled in my test (just a test screen drawing a big quad, with the Time, Input, Graphic, SceneManager, MaterialManager and Profiling modules enabled), but even doing more stuff does not hurt much FPS-wise. A more complex 2D screen could still achieve 10,000 FPS. Next up is 3D optimization, which is obviously much harder (we are still not fast enough there, especially on other platforms).