Tuesday, March 17, 2015

Accumulation to Improve Small-Batch Drawing

I sometimes see "casual" OpenGL ES developers (e.g. users making 2-d games and other less performance intensive GL applications) hit a performance wall on the CPU side. It starts with the app having something like this:
class gl_helper {
  void draw_colored_triangle_2d(color_t color, int x1, int y1,
    int x2, int y2, int x3, int y3);
  void draw_textured_triangle_2d(color_t color,
    int x1, int y1, int x2, int y2,
    int x3, int y3,
    int tex_x, int tex_y, int tex_width, int tex_height);
  void draw_textured_triangle_3d(color_t color,
    int x1, int y1, int z1, int x2, int y2, int z2,
    int x3, int y3, int z3, int tex_x,
    int tex_y, int tex_width, int tex_height);
You get the idea.  OpenGL ES is "tamed" by making simple functions that do what we want - one primitive at a time.

The results are correct drawing - and truly awful performance.

Why This Is Slow

Why is the above code almost guaranteed to produce slow results when implemented naively? The answer is that 3-d graphics hardware has a high CPU cost to set the GPU up to draw and a very low cost per triangle once you do draw.  So creating an API where each triangle comes in differently and thus must be individually set up maximizes the overhead and minimizes throughput.

A profile of this kind of code will show a ton of time in the actual draw call (e.g. glDrawArrays) but don't be fooled.  The time is really being spent at the beginning of glDrawArrays synchronizing the GPU with the type of drawing you want.*

Cheaper By the Dozen

The Mike Acton way of fixing this is "where there's one, there's many" - this API should allow you to draw lots of triangles, assuming they are all approximately the same.  For example,
void draw_lots_of_colored_triangles(color_t color, int count, float xyz[]); 
would not be an insane API.  At least if the number of triangles gets big, the overhead gets small.

One thing is clear: if your application can generate batched geometry, it absolutely should be sending it to OpenGL in bulk!  You never want to run a for-loop over your big pile of triangles and send them one at a time; if you have a wrapper around OpenGL, make sure you can send the data in without chopping it up first!

When You Can't Consolidate

Unfortunately there are times when you can't actually draw a ton of triangles all at once. It's cute of me to go "oh, performance is easy - just go rewrite all of your drawing code", but this is time consuming and in some cases the app structure itself might make this hard. If you can't design for bulk performance, there is a second option: accumulation.

The idea of accumulation is this: instead of actually drawing all of those individual triangles, you stash them in memory.  You do so in a format that makes it reasonably quick to both:

  1. Save the triangles (so you don't waste time saving and)
  2. Send them all to OpenGL at once.
Here's where the performance win comes from: the accumulator can see that the last 200 triangles were all color triangles with no texture, so it can send them to the GPU with one state setup (for non-textured triangles) and then a single 200-triangle draw call.  This is about 200x more efficient than the naive code.

The accumulator also gives you a place to collect statistics about your application's usage of OpenGL.  If your app is alternating colored and textured triangles, you're going to have to change shaders (even in the accumulator) and it will still be slow.  But you can record statistics in debug mode about the size of the draws to detect this kind of "inefficient ordering."

Similarly, the accumulator can eliminate some calls to the driver to setup state because it knows what it was last doing.  The accumulator does all of its drawing in one shot; if you draw two textured triangles with different textures, the accumulator must stop to change textures (not so good), but it can go "hey, another textured triangle, same pixel shader" and avoid changing pixel shaders (a big win).

Dealing With Inefficient Ordering

So now you have an accumulator, it submits the biggest possible batches of the same kinds of triangles, and it makes the minimum state change calls when the drawing type changes.  And it's still slow. When you look at your usage stats, you find the average draw call size is still only two triangles because the client code is alternating between drawing modes all of the time.

(Maybe your level's building block consists of a textured square background with an additively blended square on top, and this means two triangles of background, state change, two triangle of background, state change again.)

I am assuming that you have already combined your images into a few large textures (texture atlasing) and that you don't have a million tiny textures floating around.  If you haven't atlased your textures, go do it now; I'll wait.

Okay welcome back. When your drawing batch size is still too small even after accumulation, you have two tools to get your batch size back up.

Draw Reordering

The first trick you can try (and you should try this one first) is to give your accumulator the freedom to reorder drawing to achieve better performance.

In our example above, every square in the level had two draws, one on top of the other, and they weren't in the same OpenGL mode.  What we can do is define each draw to be in a different layer, and let the accumulator draw all of layer 0 before any of layer 1.

Once we do that, we find that all of layer 0 is in one OpenGL state (big draw) and all of layer 1 is in the other.  We've relaxed our ordering by giving the accumulator an idea of the real draw ordering we need, rather than the implicit one that comes from the order our code runs.

We actually had just this problem in X-Plane 10 Mobile's user interface; virtually every element was a textured draw of a background element (which uses a simple texturing shader) followed by a draw of text (which uses a special font shader that applies coloring from a two-channel texture).

The result was two shader changes per UI element, and the performance was awful.

We simply modified our accumulator to draw all text after all UI elements; there's a simple "barrier" that can be placed to force stored up text to be output before proceeding (to get major layering of the UI right) but most windows can draw all of their UI elements before any text, cutting down the number of shader changes to two changes total - a big win!

Merging OpenGL State

If you absolutely have to have the draw order you have (maybe there's alpha blending going on) the other lever you can pull is to find ways to make disparate OpenGL calls use more similar drawing state. (This is what texture atlasing does.)  A few tricks:

  • Use a very small solid white texture for non-textured geometry - you can now use your texturing shader at all times.
  • You don't need to get rid of color application in a shader - simply set the color to white opaque.
  • If you use pre-multiplied alpha, you can draw both additive and non-additive alpha from the same state by varying how you prepare your art assets. Opaque assets can be run with the blender on.
In most of these cases, performance is potentially being lost, so you need to be sure that the cost of the small batching and specific draw order needs outweighs the cost of not doing the most efficient thing.  The small white texture should be pretty cheap; GPUs usually have very good texture memory caches.  Blending tricks can be very expensive on mobile GPUs, and old mobile GPUs are very sensitive to the length of the pixel shader, so you only want to leave color on if it's in the vertex shader.

The point of the above paragraph is: measure carefully first, then merge state second; merging state can be a win or a loss, and it's very dependent on the particular model you're drawing.

* Most drivers defer the work of changing the GPU's mode of drawing until you actually say draw. This way it can synchronize the net result of all changing, instead of making a single change each time you call an API command.  Since the gl calls you make don't fit the hardware very well, waiting until the driver can see all changes is a big win.

Saturday, March 14, 2015

glNext is Neither OpenGL nor Next, Discuss

The title is a stretch at best, but as I have said before, good punditry has to take precedence over correctness. Khronos posted a second version of the glNextVulkan + SPIR-V talk with good audio and slices. I'll see you in an hour and a half.

That answered all of our questions, right!  Ha ha, I kid. Seriously though, at least a little bit is now known:
  • Vulkan is not an incremental extension to OpenGL - there's no API compatibility. This is a replacement.
  • Vulkan sits a lot lower in the graphics stack than OpenGL did; this is an explicit low level API that exposes a lot of the hard things drivers did that you didn't know existed.
The driver guys in the talk seem pretty upbeat, and they should: they get to do less work in the driver than they used to! And this is a good thing; the surface area of the OpenGL API (particularly when you combine ARB_compatibility with all of the latest extensions) is kafkaesque. If someone showed you the full API and said "go code that" you'd surely offer to cut off a finger as a less painful alternative.

My biases as a developer are in favor of not throwing out things that work, not assuming that things need a from scratch rewrite just because they annoy you, and not getting excited just because it's shiny and new.  So I am surprised with myself that at this point, I'd much rather have a Vulkan-like API than all of the latest OpenGL extensions, even though it's more work for me. (Remember that work the driver guys aren't going to do?  It goes into the engine layer.)

What's Good/Why Do We Need This?

While there's a lot of good things for game engines in Vulkan, there are a few that got my attention because they are not possible with further extension/upgrade to OpenGL:

Threading: A new API is needed because OpenGL is thread-unfriendly, and it's unfriendly at the core of how the API is written; you can't fix this by adding more stuff. Some things OpenGL does:
  • OpenGL sets up a 1:1 correspondence between queues, command buffers, and threads.  If you want something else, you're screwed, because you get one thing ("the context") and it has damned strict threading rules.
  • OpenGL does the thread synchronization for you, even if you don't want that.  There are locks inside the driver, and you can't get rid of them.*
With Vulkan, command buffers and queues are separate, resource management is explicit, and no synchronization is done on your behalf.

This is definitely a win for game engines. For example, with X-Plane we will load a scenery tile "in the background". We know during loading that every object involved in the scenery tile is "thread local" to us, because they have not been shared. There is no common data between the rendering thread and the loader.

Therefore both can run completely lock free.  There is a one-time synchronization when the fully finished tile is inserted into the active world; this insert happens only after the load is complete (via message Q) and is done between frames by the rendering thread.  Again, no locks.  This code can run lock free at pretty much all points.

There's no way for the GL driver to know that. Every time I go to shovel data into a VBO in OpenGL, the driver has to go "I wonder if anyone is using this?  Is this call going to blow up the world?"  Under Vulkan, the answer is "I'm the app, trust me."  That's going to go a lot faster.  We're getting rid of "safety checks" in the driver that are not needed.

Explicit Performance: one of the hardest things about developing realtime graphics with OpenGL is knowing where the "fast path" is.  The GL API lets you do about a gajillion different things, and only a few call paths are going to go fast.  Sometimes I see threads like this on OpenGL mailing lists:
Newb: hey AMD, when I set the refrigerator state to GL_FROZEN_CUSTARD and then issue a glDrawGizmo(GL_ICECREAM, 10); I see a massive performance slow-down. Your driver sucks!
I'm sitting in front of my computer going "Oh noooooes!!!  You can't use ice cream with frozen custard - that's a crazy thing to do."  Maybe I even write a blog post about it.

But how the hell does anyone ever know?  OpenGL becomes a game of "write-once performance tune everywhere" (or "write once, harass the driver guys to run your app through vtune and tell you you're an idiot everywhere") - sometimes it's not possible to tell why something is slow (NVidia, I'm looking at you and your stripped driver :-) and sometimes you just don't have time to look at every driver case (cough cough, Intel, cough).

OpenGL doesn't just have a huge API, it has a combinatorially huge API - you can combine just about anything with anything else; documenting the fast path (even if all driver providers could agree) is mathematically impossible.

Vulkan fixes this by making performance explicit.  These functions are fast, these are slow.  Don't call the slow calls when you want to go fast.  It gives app developers a huge insight into what is expensive for the driver/hardware and what is not.

Shim It: I may do a 180 on this when I have to code it, but I think it may be easier to move legacy OpenGL apps to Vulkan specifically because it is not the OpenGL API.

When we had to port X-Plane 9 for iPhone from GLES 1.1 to GLES 2.0, I wrote a shim that emulated the stuff we needed in GLES 1.1  Some of this is now core to our engine (e.g. the transform stack) and some still exists because it is only used in crufty non-critical path code and it's not worth it to rip it out (e.g. glBegin).

The shimming exercise was not that hard, but it was made more complicated by the fact that half of the GL API is actually implemented in both versions of the spec.  I ended up doing some evil macro trickery: glDrawElements gets #defined over to our internal call, which updates the lazily changed transform stack and then calls the real glDrawElements.  Doing this level of shim with the full desktop GL API would have been quite scary I think.

Because Vulkan isn't gl at all, one option is to simply implement OpenGL using...Vulkan. I will be curious if a portable open source  gl layer emerges; if it does, it would be a useful way for very large legacy code bases to move to Vulkan.  There'd be two wins:

  1. Reliability.  That's a lot less code that comes from the driver; whether your gl layer works right or is buggy as a bed in New York, it's going to be the same bugs everywhere - if you've ever tried to ship a complicated cross-platform OpenGL app, having the same bugs everywhere is like Christmas (or so I'm told).
  2. Incremental move to Vulkan.  Once you are running on a GL shim, poke a hole through it when you need to get to the metal for only performance-critical stuff.  (This is what we did with GLES 1.1/2.0: the entire UI ran in GLES 1.1 emulation and the rendering engine went in and bound its own custom shaders.)

Vulkan is Not For Everyone

When "OpenGL is Borked" went around the blogs last year one thing that struck me was how many different constituencies were grumpy about OpenGL, often wanting fixes that could not co-exist. Vulkan resolves this tension: it's low level, it's explicit, it's not backward compatible, and therefore it's only for developers who want to do more work to get more perf and don't need to run on everything or can shim their old code.

I think this is a good thing: at least Vulkan can do the limited task it tries to do well. But it's clearly not for beginners, not for teaching an introduction to 3-d graphics, and if you were grumpy about how much work it was to use GLES 2.0 for your mobile game, Vulkan's not going to make you very happy.  And if you're sitting on 100,000,000 lines of CAD code that's all written in OpenGL, Vulkan doesn't do you as much good as that one extension you really really really need.

For developers like me (full time, professional, small company, proprietary engine) there's definitely going to be a cost in moving to Vulkan in development time. Whenever the driver guys talk about resource management they often say something like:
The app has to do explicit resource management, which it's probably already doing on console.
for the big game engines this is totally true, so being able to re-use their resource management code is a win. For smaller games, OpenGL is their resouce management code.  It's not necessarily very good resource management (in that the GL driver is basically guessing about you want, and sometimes guessing wrong) but if you have a three-person development team, having Graham Sellers write your resource management code for you for free is sort of an epic win.

Resource management is the one area where what we know now is way too fuzzy. You can look at Apple's Metal API (fully public, shipping, code samples) and see what a world with non-mutable objects, command queues and command buffers looks like. But resource management in Metal is super simple because it only runs on a shared memory device: a buffer object is a ptr to memory, full stop.  (Would if it were that easy on all GPUs.)

It's too soon to tell what the "boiler plate" will look like for handling resource management in Vulkan.  There's a huge difference in the quality of resource management between different driver stacks; writing a resource manager that does as well as AMD or NVidia's is going to be a real challenge for small development teams.

* My understanding is that if you create only one GL context (and thus you are not using threads) the driver will actually run in a lock-free mode to avoid over-head.  The fact that the driver bothers to detect that and special case it gives you some idea how crazy a GL driver is.  If that doesn't, read this.