Indirect Rendering : “A way to a million draw calls”

Hello!
This time I am going to talk about Multi Draw Indirect (MDI) rendering. This feature lets you combine the benefits of multi-draw and indirect drawing.

Where does the overhead come from?

Issuing a lot of commands

Issuing a draw call is a really heavy operation for the CPU in GPU-based rendering, so drawing a lot of models can be really expensive. A naive draw loop issues one draw call per model, and the CPU cost grows with the number of models.

glMultiDraw (glMultiDrawArrays or glMultiDrawElements) solves this problem: the whole list of draws is submitted with a single call.
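As a rough illustration, here is a CPU-side sketch of how the per-model draw parameters could be gathered into the two parallel arrays that glMultiDrawArrays takes. The names MeshRange, MultiDrawArgs and buildMultiDrawArgs are hypothetical helpers made up for this example, and the GL call itself only appears in a comment because it needs a live context.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one mesh stored in a shared vertex buffer.
struct MeshRange {
    std::int32_t firstVertex;  // offset of the mesh's first vertex
    std::int32_t vertexCount;  // number of vertices of the mesh
};

// The two parallel arrays expected by glMultiDrawArrays.
struct MultiDrawArgs {
    std::vector<std::int32_t> first;  // GLint in real GL code
    std::vector<std::int32_t> count;  // GLsizei in real GL code
};

// Gathers what a naive loop would pass to glDrawArrays one call at a time.
MultiDrawArgs buildMultiDrawArgs(const std::vector<MeshRange>& meshes) {
    MultiDrawArgs args;
    args.first.reserve(meshes.size());
    args.count.reserve(meshes.size());
    for (const MeshRange& mesh : meshes) {
        args.first.push_back(mesh.firstVertex);
        args.count.push_back(mesh.vertexCount);
    }
    return args;
}

// With a GL context, the whole list is then submitted in one call:
//   glMultiDrawArrays(GL_TRIANGLES, args.first.data(), args.count.data(),
//                     static_cast<GLsizei>(args.first.size()));
```

The arrays can be rebuilt only when the scene changes, so the per-frame CPU work shrinks to a single call.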

Unknown data

Now, suppose you want to use culling to improve performance. You know that performing it on the GPU is more efficient than performing it on the CPU, but you don’t know how to use the result without passing data from the GPU back to the CPU… This is where indirect drawing shines.

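Without indirect drawing, the CPU has to read the culling result back and issue a draw only for the objects that passed.

Using MDI, the culling pass writes its result directly into the draw commands stored in a GPU buffer, and a single multi-draw indirect call renders them.

The CPU never has to fetch the result.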

ARB (MULTI) DRAW INDIRECT

Data and functions

This extension provides two command structures to describe a draw call: one for glDrawArrays-style draws and one for glDrawElements-style draws. Their fields are:

count specifies the number of elements to be rendered (vertices for the arrays command, indices for the elements command)
primCount specifies the number of instances to be rendered (in our case, it will be 0 or 1)
first specifies the position of the first vertex (arrays command only)
firstIndex specifies the position of the first index (elements command only)
baseVertex specifies a constant added to each index before the vertex is fetched (elements command only)
baseInstance specifies the first instance to be rendered (a bit tricky, but I am going to explain that later).
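For reference, the two structures can be written on the C++ side as below. This is a sketch that follows the field order described by the extension, using the primCount spelling from this article (newer documentation calls the same field instanceCount); the fixed-width types stand in for GLuint and GLint.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Command consumed by glDrawArraysIndirect / glMultiDrawArraysIndirect.
struct DrawArraysIndirectCommand {
    std::uint32_t count;         // number of vertices
    std::uint32_t primCount;     // number of instances (0 or 1 here)
    std::uint32_t first;         // first vertex
    std::uint32_t baseInstance;  // first instance
};

// Command consumed by glDrawElementsIndirect / glMultiDrawElementsIndirect.
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices
    std::uint32_t primCount;     // number of instances (0 or 1 here)
    std::uint32_t firstIndex;    // first index in the index buffer
    std::int32_t  baseVertex;    // value added to each index
    std::uint32_t baseInstance;  // first instance
};

// The GPU reads these as tightly packed 32-bit words.
static_assert(sizeof(DrawArraysIndirectCommand) == 16, "unexpected padding");
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "unexpected padding");
```

The static_assert lines guard against accidental padding, since the buffer is read by the GPU word by word.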

How to Use it

These structures must be stored in an OpenGL buffer object bound to the GL_DRAW_INDIRECT_BUFFER target.
Suppose you have a big scene with 5 000 distinct objects and 100 000 meshes. You need:

  1. 5 000 matrices in an SSBO
  2. 5 000 materials (not really true, but you understand the idea) in an SSBO
  3. 100 000 commands in your indirect buffer
  4. An SSBO which contains the bounding box of each mesh (to perform culling per mesh).

Now, what you want is to RENDER the whole scene. The steps to do that are:

  1. Fill the matrices / materials / bounding boxes / indirect buffer
  2. Dispatch a compute shader to perform culling
  3. Issue a memory barrier
  4. Render

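The first step is straightforward.
The second is easy: you bind the indirect buffer as an SSBO in the compute shader and set primCount to 0 if the mesh is not visible, or to 1 otherwise.
For the third, remember that you are about to issue an indirect command, so the barrier must cover the commands written by the compute shader (GL_COMMAND_BARRIER_BIT with glMemoryBarrier).
Then you render.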
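To make the second step concrete, here is a CPU reference of what such a culling pass could compute. It is only a sketch: the real work happens in a compute shader, and Plane, Aabb, isVisible and cullCommands are names invented for this example. The test is a conservative box-against-planes check, so a box may occasionally be kept even though it lies just outside the frustum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawArraysIndirectCommand {
    std::uint32_t count, primCount, first, baseInstance;
};

// Plane in the form nx*x + ny*y + nz*z + d >= 0 for points on the inside.
struct Plane { float nx, ny, nz, d; };

// Axis-aligned bounding box of one mesh, in the same space as the planes.
struct Aabb { float minX, minY, minZ, maxX, maxY, maxZ; };

bool isVisible(const Aabb& box, const std::vector<Plane>& planes) {
    for (const Plane& p : planes) {
        // Pick the corner that lies farthest along the plane normal.
        const float x = p.nx >= 0.0f ? box.maxX : box.minX;
        const float y = p.ny >= 0.0f ? box.maxY : box.minY;
        const float z = p.nz >= 0.0f ? box.maxZ : box.minZ;
        // If even that corner is outside, the whole box is outside.
        if (p.nx * x + p.ny * y + p.nz * z + p.d < 0.0f) return false;
    }
    return true;
}

// One iteration of this loop corresponds to one compute shader invocation.
void cullCommands(std::vector<DrawArraysIndirectCommand>& commands,
                  const std::vector<Aabb>& boxes,
                  const std::vector<Plane>& planes) {
    for (std::size_t i = 0; i < commands.size(); ++i) {
        commands[i].primCount = isVisible(boxes[i], planes) ? 1u : 0u;
    }
}
```

A CPU version like this is also handy for validating the compute shader output during development.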

Beautiful! But how does the shader know which data it has to use for each draw?

  1. The first way is to use gl_DrawIDARB, which is pretty explicit.
  2. The second way, the one we are going to see and the one I advise, is to use the baseInstance field of the structures seen above.

Why is gl_DrawIDARB not convenient? Simply because it is slower than the second way on most implementations, and because we will not be able to use ARB INDIRECT PARAMETERS with it: once the command list is compacted, the draw ID no longer matches the original mesh index.

So, for the second way, we must add one or several buffers to the previous list (two in our case: one for indexing the matrix buffer and one for indexing the material buffer). These buffers contain integer values (the index of the matrix / material in its SSBO). Because they are read through baseInstance, these buffers are vertex buffers with a divisor set through glVertexBindingDivisor: with a divisor of 1 and a single instance per command, the value fetched for a draw is the one stored at position baseInstance.
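Here is a sketch of how the commands and the two index buffers could be filled on the CPU so that they line up through baseInstance. MeshInstance, IndirectScene and buildIndirectScene are hypothetical names, and the GL setup is only summarized in comments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawArraysIndirectCommand {
    std::uint32_t count, primCount, first, baseInstance;
};

// Hypothetical CPU-side description of one mesh to draw.
struct MeshInstance {
    std::uint32_t vertexCount;
    std::uint32_t firstVertex;
    std::uint32_t matrixIndex;    // index into the matrix SSBO
    std::uint32_t materialIndex;  // index into the material SSBO
};

struct IndirectScene {
    std::vector<DrawArraysIndirectCommand> commands;  // indirect buffer
    std::vector<std::uint32_t> matrixIndices;         // vertex buffer, divisor 1
    std::vector<std::uint32_t> materialIndices;       // vertex buffer, divisor 1
};

IndirectScene buildIndirectScene(const std::vector<MeshInstance>& meshes) {
    IndirectScene scene;
    for (std::size_t i = 0; i < meshes.size(); ++i) {
        const MeshInstance& mesh = meshes[i];
        // baseInstance = i selects slot i of the per-instance buffers.
        scene.commands.push_back({mesh.vertexCount, 1u, mesh.firstVertex,
                                  static_cast<std::uint32_t>(i)});
        scene.matrixIndices.push_back(mesh.matrixIndex);
        scene.materialIndices.push_back(mesh.materialIndex);
    }
    return scene;
}

// GL side (needs a context): expose both index buffers as integer vertex
// attributes and call glVertexBindingDivisor(bindingIndex, 1) on their
// bindings, so the attribute advances per instance instead of per vertex.
```

Several meshes of the same object simply store the same matrix index, which is how 100 000 commands can share 5 000 matrices.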

A Caveat?

As you noticed, when you “remove” a command by setting primCount to 0, the command is not really removed: the GPU still has to process it. This is where the extension ARB INDIRECT PARAMETERS comes in. Instead of setting primCount to 0, you leave it at 1 and, only when the mesh is visible, you append its command to a second command buffer, the one that is really used for rendering. Thanks to an atomic counter, you know exactly how many meshes must be rendered.
You have to bind the buffer holding the atomic counter to GL_PARAMETER_BUFFER_ARB and use the functions glMultiDrawArraysIndirectCountARB or glMultiDrawElementsIndirectCountARB.
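The compaction itself can be sketched on the CPU as follows. In the real pipeline this runs in the compute shader with a GLSL atomic counter; here std::atomic plays that role, and compactVisible is a name invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawArraysIndirectCommand {
    std::uint32_t count, primCount, first, baseInstance;
};

// Copies the commands of visible meshes to the front of `out` and returns
// how many were written. `out` keeps the worst-case size, like the GPU buffer.
std::uint32_t compactVisible(const std::vector<DrawArraysIndirectCommand>& all,
                             const std::vector<bool>& visible,
                             std::vector<DrawArraysIndirectCommand>& out) {
    std::atomic<std::uint32_t> counter{0};
    out.resize(all.size());
    for (std::size_t i = 0; i < all.size(); ++i) {
        if (!visible[i]) continue;
        // fetch_add returns the previous value, which is the free slot.
        const std::uint32_t slot = counter.fetch_add(1u);
        out[slot] = all[i];
    }
    return counter.load();
}

// GL side (needs a context): the counter lives in the buffer bound to
// GL_PARAMETER_BUFFER_ARB, and the draw count is read from it by
//   glMultiDrawArraysIndirectCountARB(GL_TRIANGLES, nullptr, 0,
//                                     maxDrawCount, 0);
// where maxDrawCount is the size of the original command list.
```

Note that each surviving command keeps its original baseInstance, so the per-instance index buffers still point at the right matrix and material after compaction.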

References

Indirect Parameters
Multi Draw Indirect
Surviving without drawID

OpenGL AZDO : Bindless Textures : batching problem solved

Hello!
After playing with Vulkan, I had to admit that it is not as easy to use as I had hoped, so I preferred to come back to OpenGL. However, Vulkan let me learn a lot of things about how OpenGL works internally. I am going to write a series of tutorials about OpenGL AZDO. The first one discusses bindless textures!

What is OpenGL AZDO?

OpenGL Approaching Zero Driver Overhead is an idea which comes from Cass Everitt, Tim Foley, John McDonald and Graham Sellers. The idea behind it is to reduce CPU usage by relying on the latest capabilities offered by modern GPUs.
AZDO presents several techniques to achieve a low overhead:

  1. Make as few bindings as possible
  2. Use persistent mapping
  3. Use batching
  4. Use the GPU for everything (culling, filling structures).

This series of tutorials will cover how to implement these techniques.

Bindless Texture

Bindless textures solve a problem you may have noticed when implementing batching. A naive draw loop binds the textures of each object and then draws it.

The main issue here is that we cannot batch efficiently, since each draw call could use different textures.
Now, imagine you could put a texture inside a uniform buffer and just perform one big draw call! You end up with very little overhead!

How to do it?

We are lucky: in my opinion, bindless textures are the easiest AZDO feature to implement. However, we will only really see them in action in the chapter about batching. To get bindless textures running, you just have to follow these steps:

  1. Create the texture in the normal way
  2. Get the handle (kind of the address of the texture)
  3. Make the handle resident
  4. Put the handle in a uniform buffer

The function I use loads an image file with SDL, puts it into a texture and enables the bindless feature.

The code is easy: first you load a surface with SDL_image, then you create the texture, compute the number of mipmap levels, allocate all of them and upload the pixels to the first level.
After that, you generate the mipmaps, ask the texture for its handle (glGetTextureHandleARB) and make the handle resident (glMakeTextureHandleResidentARB).
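The only computation in those steps is the number of mipmap levels. A minimal sketch, assuming a full mip chain down to 1x1 (mipLevelCount is a name made up for this example):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Number of levels in a full mip chain: each level halves the largest
// dimension (rounding down) until it reaches 1.
std::uint32_t mipLevelCount(std::uint32_t width, std::uint32_t height) {
    std::uint32_t largest = std::max(width, height);
    std::uint32_t levels = 1;
    while (largest > 1) {
        largest /= 2;
        ++levels;
    }
    return levels;
}

// With a GL context, the result is the `levels` argument of the storage
// allocation, for example:
//   glTexStorage2D(GL_TEXTURE_2D, levels, GL_RGBA8, width, height);
```

For a 256x256 image this gives 9 levels, and for a 640x480 image it gives 10.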

To use this “bindless” texture, you just have to put the handle (a GLuint64) inside a uniform buffer.
In the shader, you then declare the sampler as a member of the uniform block and sample from it like any regular sampler.
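One detail to watch when filling the uniform buffer is the layout. The sketch below assumes the handles are stored as an array in a std140 uniform block, where every array element is aligned to 16 bytes, so each 8-byte handle is followed by 8 bytes of padding. Std140Handle and packHandles are hypothetical names, and a different block layout would need a different packing.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One array element as seen by a std140 block: 8 bytes of handle,
// then 8 bytes of padding to reach the 16-byte array stride.
struct Std140Handle {
    std::uint64_t handle;   // value returned by glGetTextureHandleARB
    std::uint64_t padding;  // unused, only there for the std140 stride
};

static_assert(sizeof(Std140Handle) == 16, "std140 array stride is 16 bytes");

// Builds the bytes to upload into the uniform buffer.
std::vector<Std140Handle> packHandles(const std::vector<std::uint64_t>& handles) {
    std::vector<Std140Handle> packed;
    packed.reserve(handles.size());
    for (std::uint64_t handle : handles) {
        packed.push_back({handle, 0u});
    }
    return packed;
}

// With a GL context, packed.data() and packed.size() * sizeof(Std140Handle)
// are what gets uploaded, for example with glBufferData.
```

Uploading a tightly packed GLuint64 array instead would make the shader read every other handle from the wrong offset under this assumption.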

The next article could be about batching (with multi draw indirect) or persistent mapping.
