Friday, October 22, 2010

Pre-pass lighting redux

Introduction
After writing the previous post on pre-pass lighting I started doing some tests to see how it compares to the old deferred renderer. The results I got were pretty interesting, so I thought I might as well share them. Also note that this post might be a bit more technical than the previous one.

The good thing with these renderers is that they both share the basic material data, so I can use the same data for both HPL2 and HPL3. HPL3 comes with a few more features for decals, but for these tests it is easy to just skip them. When setting up the test I went with a very simple scene: just the same box model rendered several times, a floor and lights. Sometimes it is best to test with proper game scenes, but I wanted something that could be easily tweaked and gave simpler output. This means that the tests are not 100% representative of in-game performance, but neither is testing a level in-game, as the framerate varies a lot depending on where in the level one looks. Benchmarking therefore usually uses some kind of fly-through, but that is outside the scope of what I intended to do.

Note that the HPL2 test was built in Visual Studio 2003, while HPL3 uses the 2010 version. I do not think this should matter much, even if the optimization routines differ, simply because pretty much all of the work is done on the GPU. The main graphics card I did my testing on is a Radeon 5850 HD (other cards were tried for some tests). As a final note, all of the data is given as average frame time (in milliseconds!) and not as frames per second. As Emil Persson points out, FPS is not a very good way to compare performance.
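To see why frame time is the better measure, here is a minimal sketch: FPS is the reciprocal of frame time, so the same fixed cost in milliseconds shows up as very different FPS changes depending on the starting frame rate. The function names are made up for illustration.

```python
def ms_from_fps(fps: float) -> float:
    """Average frame time in milliseconds for a given frame rate."""
    return 1000.0 / fps


def fps_from_ms(frame_ms: float) -> float:
    """Frame rate for a given average frame time in milliseconds."""
    return 1000.0 / frame_ms


def fps_lost(base_ms: float, extra_ms: float) -> float:
    """FPS lost when a fixed cost of extra_ms is added to every frame."""
    return fps_from_ms(base_ms) - fps_from_ms(base_ms + extra_ms)


# The same 1 ms of extra work, measured at two different starting points:
print(f"{fps_lost(10.0, 1.0):.2f} FPS lost at 100 FPS")  # 9.09
print(f"{fps_lost(40.0, 1.0):.2f} FPS lost at 25 FPS")   # 0.61
```

An extra millisecond costs about 9 FPS at 100 FPS but well under 1 FPS at 25 FPS, while in milliseconds it is simply "+1 ms" in both cases.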

Test #1
Now with my setup details out of the way, let's get down to the results. I first started out with a scene like this:
1 x box, x-z plane floor, 1 x spot light + shadow
which gave me the following results:
HPL2: 0.78ms
HPL3: 0.84ms
Difference: +7.7%
This means that, given a simple scene like this, the old renderer is actually faster! This is not that strange though, since the scene does not have many lit screen pixels, most of the image being sky. Thus, the extra pass made by the pre-pass renderer matters more than any lighting speed-ups. Also, the reduction in g-buffer render targets (from 3 to 2) does not make up for the extra pass.
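The "Difference" rows appear to be the change in HPL3's frame time relative to HPL2's, which is consistent with the reported numbers. A minimal sketch of that calculation, using the test #1 data (the function name is illustrative):

```python
def relative_difference(hpl2_ms: float, hpl3_ms: float) -> float:
    """Percent change in frame time from HPL2 (deferred) to HPL3 (pre-pass).

    Positive: HPL3 is slower. Negative: HPL3 is faster.
    """
    return (hpl3_ms - hpl2_ms) / hpl2_ms * 100.0


# Test #1: 0.78 ms (HPL2) vs 0.84 ms (HPL3)
print(f"{relative_difference(0.78, 0.84):+.1f}%")  # +7.7%
```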

Test #2
4000 x boxes, 1 x point light, x-z plane floor
HPL2: 14.9ms
HPL3: 18.5ms
Difference: +24%
As expected, when there are a lot of things to render, pre-pass lighting is even slower; the extra pass shows in the performance. Remember, though, that 4000 objects is quite a lot, and an important thing for good GPU performance is to keep the number of draw calls as low as possible.
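Why the object count hurts pre-pass lighting more can be sketched by listing the passes of each technique in their textbook form and counting how often the scene geometry is submitted. This is a simplified model, not HPL's actual code: the pass names are illustrative, and it assumes one draw call per object per geometry pass (no instancing or batching). The `illumination` flag models the later test where the old renderer needs an extra pass for illumination maps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pass:
    name: str
    draws_scene_geometry: bool  # True if every object is submitted again in this pass


def deferred_passes(early_z: bool = True, illumination: bool = False) -> list[Pass]:
    passes = []
    if early_z:
        passes.append(Pass("early-z (depth only)", True))
    passes.append(Pass("g-buffer (3 render targets)", True))
    passes.append(Pass("lights", False))  # draws light volumes, not scene objects
    if illumination:
        passes.append(Pass("illumination", True))  # extra geometry pass
    return passes


def prepass_passes(early_z: bool = True) -> list[Pass]:
    passes = []
    if early_z:
        passes.append(Pass("early-z (depth only)", True))
    passes.append(Pass("g-buffer (2 render targets)", True))
    passes.append(Pass("lights", False))
    # The extra pass: objects are drawn again to combine the light buffer with
    # their materials; illumination can be added here without another pass.
    passes.append(Pass("material", True))
    return passes


def scene_draw_calls(passes: list[Pass], objects: int) -> int:
    """Assumes one draw call per object per geometry pass."""
    return objects * sum(p.draws_scene_geometry for p in passes)


print(scene_draw_calls(deferred_passes(), 4000))  # 8000
print(scene_draw_calls(prepass_passes(), 4000))   # 12000
```

Under these assumptions, 4000 boxes cost 4000 more draw calls with pre-pass lighting, and the two techniques only submit the same amount of geometry once the deferred renderer also needs its illumination pass.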


Test #3
1 x box, 1000 x point light, x-z plane floor
HPL2: 30.0ms
HPL3: 29.2ms
Difference: -2.7%
As can be seen, once the scene is filled with lights, pre-pass lighting is faster, but only by a slight amount, especially considering the large number of lights. (I later realised that the number of actually lit screen pixels was quite small, something I fixed later on in test #5.)
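The trade-off behind tests #1 to #3 can be written as a toy cost model: a fixed cost per geometry pass plus a cost per lit pixel per light. Pre-pass lighting pays one extra geometry pass but (by assumption) has a cheaper per-pixel light cost, so it only wins once enough pixels are lit. This is a hypothetical model for illustration; none of the constants below are measured.

```python
import math


def frame_ms(geometry_ms: float, geometry_passes: int,
             lit_pixel_lights: float, ns_per_lit_pixel: float) -> float:
    """Toy model: fixed cost per geometry pass plus a cost per lit pixel per light."""
    return geometry_ms * geometry_passes + lit_pixel_lights * ns_per_lit_pixel / 1e6


def break_even_lit_pixels(geometry_ms: float,
                          deferred_ns: float, prepass_ns: float) -> float:
    """Lit pixel-light pairs above which pre-pass lighting (one extra geometry
    pass, cheaper lighting) beats deferred shading in the toy model."""
    saving_ns = deferred_ns - prepass_ns
    if saving_ns <= 0:
        return math.inf  # cheaper lighting never pays for the extra pass
    return geometry_ms * 1e6 / saving_ns


# Hypothetical numbers: 1 ms per geometry pass, 2.0 vs 1.5 ns per lit pixel.
print(break_even_lit_pixels(1.0, 2.0, 1.5))  # 2000000.0
```

With few lit pixels (a sky-dominated view, or many small lights), the extra geometry pass dominates, which matches the pattern in the first tests.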


Test #4
4000 x boxes, 1000 x point light, x-z plane floor
HPL2: 47.5ms
HPL3: 52.0ms
Difference: +10%
In a really stressful test (the number of lights and objects is very large), it seems like the old deferred renderer wins out. This was actually a bit unexpected and disappointing to me, as I thought that pre-pass lighting should not be this far behind. But taking the small difference in test #3 into account, it is not that surprising. Still, these tests clearly show that pre-pass lighting is far from a giant speed-up compared to deferred shading, and it actually seems slower in most cases.

I also tried skipping the early-z pass for pre-pass lighting (I use early-z in both renderers in all other tests). This is basically a pass where the z-buffer is filled in, making sure that later passes only shade visible pixels. From reading Crytek papers, it does not seem like the Crysis 2 engine has this (and the same seems true for other engines), so I did a quick and dirty test without it and got this result: 48.7ms (+2.5%)
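What the early-z pass buys can be sketched for a single pixel: given the depths of the fragments that cover it, in submission order, count how many run the full shader. This is a simplified model (it ignores hardware details such as hierarchical z) and the function names are illustrative.

```python
import math


def shaded_without_prepass(depths: list[float]) -> int:
    """Fragments shaded for one pixel when drawing in submission order with a
    LESS depth test: every fragment nearer than everything drawn so far runs
    the full shader, even if something nearer is drawn later."""
    nearest = math.inf
    shaded = 0
    for d in depths:
        if d < nearest:
            nearest = d
            shaded += 1
    return shaded


def shaded_with_prepass(depths: list[float]) -> int:
    """With a depth-only pre-pass filling the z-buffer first, the shading pass
    uses an EQUAL test, so only fragments at the nearest depth are shaded."""
    if not depths:
        return 0
    nearest = min(depths)
    return sum(1 for d in depths if d == nearest)


back_to_front = [0.9, 0.7, 0.5, 0.3]
front_to_back = [0.3, 0.5, 0.7, 0.9]
print(shaded_without_prepass(back_to_front))  # 4
print(shaded_without_prepass(front_to_back))  # 1
print(shaded_with_prepass(back_to_front))     # 1
```

The pre-pass makes shading cost independent of draw order, at the price of submitting all geometry once more; without it, draw order decides how much overdraw gets shaded.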
This means that even without the early-z pass, pre-pass lighting was still slower. However, I did not attempt to reduce overdraw (for example by sorting front to back), so there might be room for optimization here. On the other hand, when rendering front to back there will be a lot more state switching, since you cannot sort by texture etc. as efficiently, so I wonder whether the results might be even worse in a more realistic scenario.
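The tension between the two sort orders can be sketched by counting texture changes for the same set of draws. In this hypothetical example a single texture name stands in for all material state, and each object has a camera distance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrawItem:
    texture: str  # stand-in for all material/render state (hypothetical)
    depth: float  # distance to the camera


def state_switches(items: list[DrawItem]) -> int:
    """Number of texture binds when submitting items in the given order."""
    switches = 0
    current = None
    for item in items:
        if item.texture != current:
            switches += 1
            current = item.texture
    return switches


items = [DrawItem("A", 1.0), DrawItem("B", 2.0), DrawItem("A", 3.0),
         DrawItem("B", 4.0), DrawItem("A", 5.0), DrawItem("B", 6.0)]

front_to_back = sorted(items, key=lambda i: i.depth)           # least overdraw
by_state = sorted(items, key=lambda i: (i.texture, i.depth))   # fewest switches
print(state_switches(front_to_back))  # 6
print(state_switches(by_state))       # 2
```

Sorting by state first and depth second keeps switches low but only orders objects front to back within each material, so it gives up part of the overdraw reduction.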

I also tried this test on a few other cards (again with full early-z testing; times are HPL2, HPL3 in milliseconds):
Geforce 240gt: 125, 137 (+9.6%)
Geforce 320M: 240, 240 (+/- 0%)
This gave an indication that on some cards pre-pass lighting might actually do better, and that the picture is not as clear-cut as the first tests seemed to show.

As a final variation on this test, I added illumination maps to all textures, a feature that requires an extra pass in the old engine. I also removed the height map rendering. This gave me: 50.6, 50.0 (-1.2%)
This is a very tiny speed-up, considering that the methods now have the same number of passes and that pre-pass lighting has faster light rendering and a smaller g-buffer.

Test #5
488 x boxes, 30 x point light, x-z plane floor
Radeon 5850 HD: 7.4, 7.8 (+5.4%)
Geforce 240gt: 18, 19 (+5.5%)
Geforce 320M: 50.0, 45.5 (-9%)
Geforce 9800gtx: 9.5, 9.5 (0%)

In this test I changed to a more realistic number of lights and draw calls. I also aligned the lights so that the lit pixels covered the entire screen, which I did not do above. As can be seen, on my computer (the 5850) deferred shading still wins, but on a less powerful card pre-pass lighting is much faster. This difference might be due to bandwidth: some cards may have trouble pushing the amounts of data required for deferred shading.
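A rough sense of the bandwidth involved can be had by counting the bytes written to the g-buffer each frame. This sketch assumes a hypothetical 1280×720 resolution, 4 bytes per render target (e.g. RGBA8) and a 32-bit depth buffer; the actual HPL formats are not stated in the post. It also only counts the g-buffer itself: the pre-pass light buffer and material pass add traffic of their own, and the light pass reads these targets again for every lit pixel of every light.

```python
def gbuffer_write_bytes(width: int, height: int, color_targets: int,
                        bytes_per_target: int = 4, depth_bytes: int = 4) -> int:
    """Bytes written per frame if every pixel of every g-buffer target is filled once."""
    return width * height * (color_targets * bytes_per_target + depth_bytes)


deferred = gbuffer_write_bytes(1280, 720, color_targets=3)
prepass = gbuffer_write_bytes(1280, 720, color_targets=2)
print(deferred, prepass)  # 14745600 11059200
```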

I also tweaked this test and reduced the number of draw calls a bit:
316 x boxes, 30 x point light, x-z plane floor
This gave: 6.4, 6.6 (+3%)
This further reduced the difference, and with the hackish removal of early z, pre-pass lighting plunged down to 5.2 (-18%).
Even though this removal of early z is not very realistic, the results show that I need to investigate it further, something I will do once I get a proper scene up and running.

Finally, I also tried giving all the boxes illumination (with the early-z pass turned back on):
6.8, 6.6 (-2.9%)
This clearly shows how you get illumination almost for free with pre-pass lighting, while it costs a bit more with the deferred shader. This is not surprising, given that the deferred shader requires an extra pass for it, but it hints that further effects can be implemented more efficiently when using pre-pass lighting.


Conclusions
The tests clearly show that my previous assumption that light rendering in pre-pass lighting would be much faster was incorrect. It is a bit faster, but only noticeably so when really pushing the limits, and even then only by a small fraction. This makes me conclude that one should not use pre-pass lighting just to get faster light rendering. However, as can be seen in the test with the Geforce 320M, the choice of technique matters a lot more on older hardware, and pre-pass lighting might actually be of greater use there.

There are no vast differences between the techniques, though, so the choice should instead be based on other merits. Given that pre-pass lighting allows for so much more variety in materials, I will keep it for HPL3, but I will no longer expect any framerate gains from it.

I hope this post will prove useful for those who are thinking of using either rendering method, and for the rest it might be an interesting insight into how testing is done (at least how I do it). Again, sorry for the lack of pretty pictures, which I promise to make up for!