Category: Rendering

  • Vulkan Memory Management : How to write your own allocator

    Hi! This article deals with memory management in Vulkan. But first, let me tell you what has been happening in my life.

    State of my life

    Again, it has been more than a month since I last wrote anything. So, where am I? I am in my final year at Télécom SudParis, following the High Tech Imaging courses, which is the image specialization in my school. The funny part is that, in parallel, I am a lecturer in a video games specialization. I taught OpenGL (3.3, because I could not teach an OpenGL 4 course: not everyone has hardware that supports it). I got an internship at Dassault Systemes (France). It will begin on the first of February. I will work on the soft shadow engine (OpenGL 4.5).

    Vulkan

    To begin, some of the articles I wrote before this one may contain mistakes, explain some things poorly, or show code that is not very optimized.

    Why come back to Vulkan?

    I came back to Vulkan because I wanted to write one of the first “amateur” renderers using Vulkan. I also wanted a better grasp of memory management, memory barriers and other joys like that. Moreover, I made a repository with “a lot” of Vulkan examples: Vulkan example repository.
    I do not mean to replace the Sascha Willems ones. I simply propose my own way of doing it, in C++, using Vulkan HPP.

    Memory Management with Vulkan

    Different kinds of memory

    Heap

    A graphics card can read memory from different heaps. It can read memory from its own heap, or from the system heap (RAM).

    Type

    There are different memory types. For example, some memory is host cached, some is host coherent, some is device local, and so on.

    Host and device
    Host

    This memory resides in RAM. This heap generally has one or several types with the “HOST_VISIBLE” bit set. It tells Vulkan that the memory can be mapped, and even kept mapped persistently. That way, you get a pointer and can write to the memory from the CPU.

    Device Local

    This memory resides on the graphics card. It is freaking fast and is generally not host visible. That means you have to use a staging resource, or the GPU itself, to write something to it.
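
    Picking a memory type usually comes down to a small loop over the types the device exposes. The sketch below is only an illustration: MemoryType and findMemoryType are hypothetical names, the flag constants mirror the values of the Vulkan VkMemoryPropertyFlagBits so that the code compiles without the Vulkan headers, and typeBits is assumed to come from the memoryTypeBits field of the memory requirements of a buffer or an image.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// These values mirror VkMemoryPropertyFlagBits; they are redefined here only
// so that the sketch compiles without the Vulkan headers.
constexpr std::uint32_t kDeviceLocal  = 0x1;
constexpr std::uint32_t kHostVisible  = 0x2;
constexpr std::uint32_t kHostCoherent = 0x4;
constexpr std::uint32_t kHostCached   = 0x8;

// Hypothetical stand-in for one entry of the device's memory type table.
struct MemoryType {
    std::uint32_t propertyFlags; // combination of the bits above
    std::uint32_t heapIndex;     // heap this type allocates from
};

// Returns the index of the first memory type that is allowed by typeBits
// (bit i set means type i is usable) and owns every wanted flag, or -1.
int findMemoryType(const std::vector<MemoryType> &types,
                   std::uint32_t typeBits,
                   std::uint32_t wanted) {
    for (std::size_t i = 0; i < types.size() && i < 32; ++i) {
        const bool allowed = (typeBits & (1u << i)) != 0;
        const bool hasFlags = (types[i].propertyFlags & wanted) == wanted;
        if (allowed && hasFlags)
            return static_cast<int>(i);
    }
    return -1;
}
```

    The returned index is the kind of value that is passed as memoryTypeIndex in the rest of this article.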

    Allocation in Vulkan

    In Vulkan, the number of allocations that can exist at the same time is limited by the driver (the maxMemoryAllocationCount device limit). That means you cannot make a lot of allocations: you must not use one allocation per buffer or image, but one allocation for several buffers and images.
    In this article, I will not take care of the CPU cache or anything like that. I will only focus on how to get the best from the GPU side.

    [Figure: Memory management, good and bad]
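
    To make the “one allocation for several buffers and images” idea concrete, here is a minimal sketch that computes where each resource would live inside a single shared allocation. Requirement, alignUp and packResources are hypothetical names, and the sizes and alignments are assumed to come from the memory requirements reported by Vulkan (whose alignments are powers of two).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one resource: its size and required alignment.
struct Requirement {
    std::uint64_t size;
    std::uint64_t alignment; // assumed to be a power of two
};

// Rounds offset up to the next multiple of alignment (a power of two).
inline std::uint64_t alignUp(std::uint64_t offset, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Computes the offset of every resource inside one shared allocation and
// writes the total size that this single allocation must have.
std::vector<std::uint64_t> packResources(const std::vector<Requirement> &requirements,
                                         std::uint64_t &totalSize) {
    std::vector<std::uint64_t> offsets;
    std::uint64_t cursor = 0;
    for (const Requirement &requirement : requirements) {
        cursor = alignUp(cursor, requirement.alignment);
        offsets.push_back(cursor);
        cursor += requirement.size;
    }
    totalSize = cursor;
    return offsets;
}
```

    With three resources of sizes 100, 40 and 8 and alignments 16, 256 and 4, the offsets are 0, 256 and 296, and a single allocation of 304 bytes is enough instead of three separate ones.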

    How will we do it?

    [Figure: Memory management, device allocator]

    As you can see, we have blocks, each of which represents the memory for one buffer or one image; chunks, each of which represents one allocation (via vkAllocateMemory); and a DeviceAllocator that manages all the chunks.

    Block

    I defined a block as follows:

    struct Block {
        vk::DeviceMemory memory;
        vk::DeviceSize offset;
        vk::DeviceSize size;
        bool free;
        void *ptr = nullptr; // Useless if it is a GPU allocation
    
        bool operator==(Block const &block) const;
    };

    // Two blocks are equal when every field matches
    bool Block::operator==(Block const &block) const {
        return memory == block.memory &&
               offset == block.offset &&
               size == block.size &&
               free == block.free &&
               ptr == block.ptr;
    }

    A block, as its name suggests, defines a small region within one allocation.
    So it has an offset, a size, and a boolean that tells whether it is in use.
    It may also own a pointer (ptr) if it belongs to a host-visible allocation.

    Chunk

    A chunk is a memory region that contains a list of blocks. It represents a single allocation.
    What should a chunk let us do?

    1. Allocate a block
    2. Deallocate a block
    3. Tell us if the block is inside the chunk

    That gives us:

    #pragma once
    #include "block.hpp"
    
    class Chunk : private NotCopyable {
    public:
        Chunk(Device &device, vk::DeviceSize size, int memoryTypeIndex);
    
        bool allocate(vk::DeviceSize size, vk::DeviceSize alignment, Block &block);
        bool isIn(Block const &block) const;
        void deallocate(Block const &block);
        int memoryTypeIndex() const;
    
        ~Chunk();
    
    protected:
        Device mDevice;
        vk::DeviceMemory mMemory = VK_NULL_HANDLE;
        vk::DeviceSize mSize;
        int mMemoryTypeIndex;
        std::vector<Block> mBlocks;
        void *mPtr = nullptr;
    };

    A chunk allocates its memory in its constructor.

    Chunk::Chunk(Device &device, vk::DeviceSize size, int memoryTypeIndex) :
        mDevice(device),
        mSize(size),
        mMemoryTypeIndex(memoryTypeIndex) {
        vk::MemoryAllocateInfo allocateInfo(size, memoryTypeIndex);
    
        Block block;
        block.free = true;
        block.offset = 0;
        block.size = size;
        mMemory = block.memory = device.allocateMemory(allocateInfo);
    
        if((device.getPhysicalDevice().getMemoryProperties().memoryTypes[memoryTypeIndex].propertyFlags & vk::MemoryPropertyFlagBits::eHostVisible) == vk::MemoryPropertyFlagBits::eHostVisible)
            mPtr = device.mapMemory(mMemory, 0, VK_WHOLE_SIZE);
    
        mBlocks.emplace_back(block);
    }

    While a deallocation is really easy (just mark the block as free), an allocation requires a bit more attention. You need to find a free block, check that it is large enough and, if the allocation is smaller than the available size, create another block from the remainder. You also need to take care of memory alignment!

    void Chunk::deallocate(const Block &block) {
        auto blockIt(std::find(mBlocks.begin(), mBlocks.end(), block));
        assert(blockIt != mBlocks.end());
        // Just put the block to free
        blockIt->free = true;
    }
    
    bool Chunk::allocate(vk::DeviceSize size, vk::DeviceSize alignment, Block &block) {
        // if chunk is too small
        if(size > mSize)
            return false;
    
        for(uint32_t i = 0; i < mBlocks.size(); ++i) {
            if(mBlocks[i].free) {
                // Padding needed to align the offset of this block
                vk::DeviceSize padding = 0;
                if(mBlocks[i].offset % alignment != 0)
                    padding = alignment - mBlocks[i].offset % alignment;

                // If match (checking the padding first avoids an unsigned underflow)
                if(mBlocks[i].size >= padding && mBlocks[i].size - padding >= size) {

                    // Offset and size of this block once the alignment is applied
                    mBlocks[i].size -= padding;
                    mBlocks[i].offset += padding;
    
                    // Compute the ptr address
                    if(mPtr != nullptr)
                        mBlocks[i].ptr = (char*)mPtr + mBlocks[i].offset;
    
                    // if perfect match
                    if(mBlocks[i].size == size) {
                        mBlocks[i].free = false;
                        block = mBlocks[i];
                        return true;
                    }
    
                    Block nextBlock;
                    nextBlock.free = true;
                    nextBlock.offset = mBlocks[i].offset + size;
                    nextBlock.memory = mMemory;
                    nextBlock.size = mBlocks[i].size - size;
                    mBlocks.emplace_back(nextBlock); // We add the newBlock
    
                    mBlocks[i].size = size;
                    mBlocks[i].free = false;
    
                    block = mBlocks[i];
                    return true;
                }
            }
        }
    
        return false;
    }
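
    The same first-fit logic can be tried without a GPU. The sketch below is a simplified stand-in, not the real Chunk: MockChunk and Range are hypothetical names, offsets are plain integers instead of Vulkan handles, and the alignment is assumed to be non-zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Block: plain integers, no Vulkan handle.
struct Range {
    std::uint64_t offset;
    std::uint64_t size;
    bool free;
};

class MockChunk {
public:
    explicit MockChunk(std::uint64_t size) {
        mRanges.push_back(Range{0, size, true}); // one big free range
    }

    // First fit: returns true and writes the aligned offset on success.
    bool allocate(std::uint64_t size, std::uint64_t alignment, std::uint64_t &offset) {
        for (std::size_t i = 0; i < mRanges.size(); ++i) {
            if (!mRanges[i].free)
                continue;
            const std::uint64_t misalignment = mRanges[i].offset % alignment;
            const std::uint64_t padding = misalignment == 0 ? 0 : alignment - misalignment;
            if (mRanges[i].size < padding || mRanges[i].size - padding < size)
                continue;
            // The padding is absorbed by moving the start of the range.
            mRanges[i].offset += padding;
            mRanges[i].size -= padding;
            const std::uint64_t remaining = mRanges[i].size - size;
            mRanges[i].size = size;
            mRanges[i].free = false;
            offset = mRanges[i].offset;
            if (remaining > 0) // split: the tail becomes a new free range
                mRanges.push_back(Range{offset + size, remaining, true});
            return true;
        }
        return false;
    }

    // Marks the used range that starts at offset as free again.
    void deallocate(std::uint64_t offset) {
        for (Range &range : mRanges)
            if (!range.free && range.offset == offset)
                range.free = true;
    }

private:
    std::vector<Range> mRanges;
};
```

    For a chunk of 1024 bytes, allocating 100 bytes with an alignment of 16 gives the offset 0, then allocating 200 bytes with an alignment of 64 gives the offset 128, because the free range starting at 100 needs 28 bytes of padding.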

    Chunk Allocator

    Maybe it is badly named, but the chunk allocator lets us separate the creation of a chunk from the chunk itself. We give it a size and it performs all the checks we need.

    class ChunkAllocator : private NotCopyable
    {
    public:
        ChunkAllocator(Device &device, vk::DeviceSize size);
    
        // if size > mSize, allocate to the next power of 2
        std::unique_ptr<Chunk> allocate(vk::DeviceSize size, int memoryTypeIndex);
    
    private:
        Device mDevice;
        vk::DeviceSize mSize;
    };
    
    // Smallest power of two strictly greater than size.
    // Integer only, so there is no floating point rounding (assumes size < 2^63).
    vk::DeviceSize nextPowerOfTwo(vk::DeviceSize size) {
        vk::DeviceSize power = 1;
        while(power <= size)
            power <<= 1;
        return power;
    }

    // A power of two has exactly one bit set
    bool isPowerOfTwo(vk::DeviceSize size) {
        return size != 0 && (size & (size - 1)) == 0;
    }
    
    ChunkAllocator::ChunkAllocator(Device &device, vk::DeviceSize size) :
        mDevice(device),
        mSize(size) {
        assert(isPowerOfTwo(size));
    }
    
    std::unique_ptr<Chunk> ChunkAllocator::allocate(vk::DeviceSize size,
                                                    int memoryTypeIndex) {
        size = (size > mSize) ? nextPowerOfTwo(size) : mSize;
    
        return std::make_unique<Chunk>(mDevice, size, memoryTypeIndex);
    }

    Device Allocator

    I began by writing an abstract class for Vulkan allocation:

    /**
     * @brief The AbstractAllocator lets the user allocate or deallocate blocks
     */
    class AbstractAllocator : private NotCopyable
    {
    public:
        AbstractAllocator(Device const &device) :
            mDevice(std::make_shared<Device>(device)) {
    
        }
    
        virtual Block allocate(vk::DeviceSize size, vk::DeviceSize alignment, int memoryTypeIndex) = 0;
        virtual void deallocate(Block &block) = 0;
    
        Device getDevice() const {
            return *mDevice;
        }
    
        virtual ~AbstractAllocator() = 0;
    
    protected:
        std::shared_ptr<Device> mDevice;
    };
    
    inline AbstractAllocator::~AbstractAllocator() {
    
    }
    

    As you can see, it is really simple: you can allocate or deallocate from this allocator. Next, I created a DeviceAllocator that inherits from AbstractAllocator.

    class DeviceAllocator : public AbstractAllocator
    {
    public:
        DeviceAllocator(Device device, vk::DeviceSize size);
    
        Block allocate(vk::DeviceSize size, vk::DeviceSize alignment, int memoryTypeIndex);
        void deallocate(Block &block);
    
    
    private:
        ChunkAllocator mChunkAllocator;
        std::vector<std::shared_ptr<Chunk>> mChunks;
    };
    

    This allocator contains a list of chunks, and one ChunkAllocator to allocate new chunks.
    The allocation is really easy. We check whether a “good” chunk exists and whether we can allocate from it. Otherwise, we create another chunk, and that is it!

    DeviceAllocator::DeviceAllocator(Device device, vk::DeviceSize size) :
        AbstractAllocator(device),
        mChunkAllocator(device, size) {
    
    }
    
    Block DeviceAllocator::allocate(vk::DeviceSize size, vk::DeviceSize alignment, int memoryTypeIndex) {
        Block block;
        // We search a "good" chunk
        for(auto &chunk : mChunks)
            if(chunk->memoryTypeIndex() == memoryTypeIndex)
                if(chunk->allocate(size, alignment, block))
                    return block;
    
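        mChunks.emplace_back(mChunkAllocator.allocate(size, memoryTypeIndex));
        // Keep the call outside assert(): it must still run when NDEBUG is defined
        bool allocated = mChunks.back()->allocate(size, alignment, block);
        assert(allocated);
        (void)allocated;
        return block;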
    }
    
    void DeviceAllocator::deallocate(Block &block) {
        for(auto &chunk : mChunks) {
            if(chunk->isIn(block)) {
                chunk->deallocate(block);
                return ;
            }
        }
        assert(!"unable to deallocate the block");
    }
    

    Conclusion

    Since I came back to Vulkan, I have gained a much better understanding of this new API. I can now write better articles than I did in March.
    I hope you enjoyed this remake of memory management.
    My next article will be about buffers and staging resources. It will be a short one. I will also write an article that explains how to load textures and their mipmaps.

    References

    Vulkan Memory Management

    Kisses and see you soon (probably this week !)

  • Indirect Rendering : “A way to a million draw calls”

    Hello !
    This time I am going to talk about Multi Draw Indirect (MDI) rendering. This feature lets you combine the benefits of multi-draw and indirect drawing.

    Where does the overhead come from?

    Issuing a lot of commands

    Issuing a draw call is a really heavy operation for the CPU. Knowing this, drawing a lot of models can be really expensive. A naive draw loop could look like this:

    foreach(object) {
        writeUniformData(object, uniformData);
        glDraw...();
    }

    The problem is solved by using glMultiDraw: the per-object data is written first, then a single call issues all the draws.
    The new code is:

    i = 0;
    foreach(object)
        writeUniformData(object, uniformData[i++]);
    glMultiDraw...();

    Unknown data

    Now, suppose you want to use culling to improve performance. You know that performing it on the GPU would be more efficient than on the CPU, but you don’t know how to use the result without reading data back from the GPU to the CPU… This is where indirect drawing comes in.

    Your old code is:

    cullingOnCPU(allObject); // quite slow

    i = 0;
    foreach(object)
        if(object->isVisible())
            writeUniformData(object, uniformData[i++]);
    glMultiDraw...();

    Using MDI, you could have something like this:

    cullingOnGPU(allObject);

    i = 0;
    foreach(object)
        writeUniformData(object, uniformData[i++]);
    glMultiDrawIndirect...();

    And you don’t have to read the culling result back on the CPU.

    ARB (MULTI) DRAW INDIRECT

    Data and functions

    This extension provides two structures to describe a draw call: one for glDrawArrays and one for glDrawElements.

    typedef  struct {
        GLuint  count;
        GLuint  primCount;
        GLuint  first;
        GLuint  baseInstance;
    } DrawArraysIndirectCommand;
    
    typedef  struct {
        GLuint  count;
        GLuint  primCount;
        GLuint  firstIndex;
        GLint   baseVertex;
        GLuint  baseInstance;
    } DrawElementsIndirectCommand;
    
    
    void glMultiDrawArraysIndirect(GLenum mode,
                               const void *indirect,
                               GLsizei drawcount,
                               GLsizei stride);
    
    
    void glMultiDrawElementsIndirect(GLenum mode,
                                 GLenum type,
                                 const void *indirect,
                                 GLsizei drawcount,
                                 GLsizei stride);
    

    count specifies the number of elements (vertices or indices) to be rendered
    primCount specifies the number of instances to be rendered (in our case, it will be 0 or 1)
    first specifies the position of the first vertex
    firstIndex specifies the position of the first index
    baseVertex specifies a constant that is added to each index before the vertex is fetched
    baseInstance specifies the first instance to be rendered (a bit tricky, but I am going to explain that later).

    How to Use it

    These structures should be put into an OpenGL buffer object using the target GL_DRAW_INDIRECT_BUFFER.
    Suppose you have a big scene with, say, 5 000 distinct objects and 100 000 meshes. You must have:

    1. 5 000 matrices in an SSBO
    2. 5 000 materials (not really true, but you understand the idea) in an SSBO
    3. 100 000 commands in your indirect buffer
    4. An SSBO which contains the bounding box of each mesh (to perform culling per mesh).

    Now, what you want is to RENDER the whole scene. The steps to do that are:

    1. Fill the matrices / materials / bounding boxes / indirect buffers
    2. Dispatch a compute shader to perform culling
    3. Issue a memory barrier
    4. Render

    The first step is straightforward.
    The second is easy: you use the indirect buffer as an SSBO in the compute shader and set the primCount value to 0 if the mesh is not visible, or to 1 otherwise.
    The third is needed because you are about to issue an indirect command that reads what the compute shader has just written.
    The fourth is the render itself.

    fillBuffers();
    glDispatchCompute...(); // culling pass
    glMemoryBarrier(GL_COMMAND_BARRIER_BIT /*|GL_SHADER_STORAGE_BARRIER_BIT*/);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glBindVertexArray(vao);
    glMultiDraw*Indirect();
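
    To see what the culling pass writes, here is a CPU-side reference of the same logic. It is only a sketch under simplifying assumptions: on the GPU this loop would run in the compute shader, one invocation per command, and a real test would use the view frustum, whereas here the visible volume is a plain axis-aligned box. Aabb, overlaps and cullOnCpu are hypothetical names, and the command structure mirrors the layout shown earlier with fixed-width integer types.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same layout as DrawElementsIndirectCommand, with fixed-width integer types.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t primCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};

// Hypothetical axis-aligned bounding box.
struct Aabb {
    float min[3];
    float max[3];
};

// Two boxes overlap when their intervals overlap on every axis.
bool overlaps(const Aabb &a, const Aabb &b) {
    for (int axis = 0; axis < 3; ++axis)
        if (a.max[axis] < b.min[axis] || a.min[axis] > b.max[axis])
            return false;
    return true;
}

// Sets primCount to 1 for visible meshes and to 0 for culled ones.
// bounds[i] is assumed to be the bounding box of the mesh drawn by commands[i].
void cullOnCpu(const std::vector<Aabb> &bounds,
               const Aabb &visibleRegion,
               std::vector<DrawElementsIndirectCommand> &commands) {
    for (std::size_t i = 0; i < commands.size(); ++i)
        commands[i].primCount = overlaps(bounds[i], visibleRegion) ? 1u : 0u;
}
```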

    Beautiful ! But how do I know which data I have to use?

    1. The first way is to use gl_DrawIDARB, which is pretty explicit.
    2. The way we are going to see, and the one I advise, is to use the baseInstance field from the structures seen above.

    Why is gl_DrawIDARB not convenient? Simply because it tends to be slower than the second way on many implementations, and because we will not be able to use ARB INDIRECT PARAMETERS with it.

    So, for the second way, we must add one or several buffers to the previous list (two in our case: one for indexing the matrix buffer, and one for indexing the material buffer). These buffers contain integer values (the index of the matrix / material in its SSBO). Because they are fetched through baseInstance, these buffers are vertex buffers whose binding uses a divisor set through glVertexBindingDivisor.
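
    The CPU-side data for this second way could be built as in the sketch below. Mesh, IndirectScene and buildIndirectScene are hypothetical names. The assumption is that each command draws a single instance and that both index buffers are bound with a divisor of 1, so the instanced attribute fetched for draw i is the element at position baseInstance, which is set to i.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same layout as DrawElementsIndirectCommand, with fixed-width integer types.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t primCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};

// Hypothetical description of a mesh stored in shared vertex / index buffers.
struct Mesh {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t matrixIndex;   // index of its matrix in the matrix SSBO
    std::uint32_t materialIndex; // index of its material in the material SSBO
};

struct IndirectScene {
    std::vector<DrawElementsIndirectCommand> commands; // GL_DRAW_INDIRECT_BUFFER
    std::vector<std::uint32_t> matrixIndices;           // vertex buffer, divisor 1
    std::vector<std::uint32_t> materialIndices;         // vertex buffer, divisor 1
};

IndirectScene buildIndirectScene(const std::vector<Mesh> &meshes) {
    IndirectScene scene;
    for (std::size_t i = 0; i < meshes.size(); ++i) {
        const Mesh &mesh = meshes[i];
        DrawElementsIndirectCommand command;
        command.count = mesh.indexCount;
        command.primCount = 1; // one instance per mesh
        command.firstIndex = mesh.firstIndex;
        command.baseVertex = mesh.baseVertex;
        // baseInstance selects element i of the two per-instance buffers
        command.baseInstance = static_cast<std::uint32_t>(i);
        scene.commands.push_back(command);
        scene.matrixIndices.push_back(mesh.matrixIndex);
        scene.materialIndices.push_back(mesh.materialIndex);
    }
    return scene;
}
```

    In the vertex shader, the two integers then arrive as ordinary per-instance attributes and can be used to index the matrix and material SSBOs.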

    A Caveat?

    As you noticed, when you “remove” a command by setting primCount to 0, the command is not really removed… This is where the ARB INDIRECT PARAMETERS extension comes in. Instead of setting primCount to 0, you leave it at 1, but if the mesh is not visible, you do not add its command to the command buffer that is really used. Using an atomic counter, you know exactly how many meshes should be rendered.
    You have to bind the buffer that holds the atomic counter to GL_PARAMETER_BUFFER_ARB and use the following functions:

    void MultiDrawArraysIndirectCountARB(enum mode,
                                         const void *indirect,
                                         intptr drawcount,
                                         sizei maxdrawcount,
                                         sizei stride);
    
    void MultiDrawElementsIndirectCountARB(enum mode,
                                           enum type,
                                           const void *indirect,
                                           intptr drawcount,
                                           sizei maxdrawcount,
                                           sizei stride);
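
    The compaction itself can be illustrated on the CPU. This is only a sketch: compactVisible is a hypothetical name, the local counter plays the role of the atomic counter, and the visibility flags are assumed to come from the culling test. On the GPU, each invocation would increment the counter atomically, so the order of the compacted commands would not be guaranteed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same layout as DrawElementsIndirectCommand, with fixed-width integer types.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t primCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};

// Copies the visible commands to the front of compacted and returns how many
// there are. compacted keeps room for every command (the maxdrawcount case).
std::uint32_t compactVisible(const std::vector<DrawElementsIndirectCommand> &all,
                             const std::vector<bool> &visible,
                             std::vector<DrawElementsIndirectCommand> &compacted) {
    compacted.resize(all.size());
    std::uint32_t drawCount = 0; // plays the role of the atomic counter
    for (std::size_t i = 0; i < all.size(); ++i)
        if (visible[i])
            compacted[drawCount++] = all[i]; // primCount stays at 1
    return drawCount;
}
```

    The returned value is what the counter in the parameter buffer would hold, and the size of the compacted buffer corresponds to maxdrawcount.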

    References

    Indirect Parameters
    Multi Draw Indirect
    Surviving without drawID

  • OpenGL AZDO : Bindless Textures : batching problem solved

    Hello!
    After playing with Vulkan, I had to admit that it is not as easy to use as I wanted. That done, I preferred to come back to OpenGL. However, Vulkan let me learn a lot of things about how OpenGL works internally. I am going to make a series of tutorials about OpenGL AZDO. The first one will discuss bindless textures!

    What is OpenGL AZDO ?

    OpenGL Approaching Zero Driver Overhead is an idea that comes from Cass Everitt, Tim Foley, John McDonald and Graham Sellers. The idea behind it is to reduce CPU usage by using the latest features offered by recent GPUs.
    AZDO presents many techniques to keep the overhead low:

    1. Bind as little as possible
    2. Use persistent mapping
    3. Use batching
    4. Use the GPU for everything (culling, filling structures).

    This series of tutorials will cover how to implement such things.

    Bindless Texture

    Bindless textures solve a problem you may run into when implementing batching. A naive draw loop could look like this:

    foreach(render target) { // frame buffer
        foreach(pass) { // Depth, geometry, light
            foreach(material) { // textures
                draw();
            }
        }
    }

    The main issue here is that we cannot batch efficiently, since each draw call may use different textures.
    Now, imagine you could put a texture inside a uniform buffer and just perform one big draw call! You end up with very little overhead!

    How to do it ?

    We are lucky: in my opinion, bindless textures are the easiest of the AZDO features to implement. However, we will really see them in action in the chapter about batching. To get started with bindless textures, you just have to follow these steps:

    1. Create the texture in the normal way
    2. Get the handle (kind of the address of the texture)
    3. Make the handle resident
    4. Put the handle in a uniform buffer

    So here is a function that loads an image file using SDL, puts it into a texture and enables the bindless feature:

    std::unique_ptr<Texture> Texture::loadImage2D(const std::string &path) {
        std::unique_ptr<Texture> texture = std::make_unique<Texture>();
    
        SDL_Surface *surface = IMG_Load(path.c_str());
    
        if(surface == nullptr)
            throw std::runtime_error(path + " could not be opened");
    
        GLenum format, internalFormat;
    
        getFormats(surface, internalFormat, format);
    
        glCreateTextures(GL_TEXTURE_2D, 1, &texture->mId);
    
        glTextureParameteri(*texture, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
        glTextureParameteri(*texture, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    
        GLsizei numMipmaps = ((GLsizei)log2(std::max(surface->w, surface->h)) + 1);
        glTextureStorage2D(*texture, numMipmaps, internalFormat, surface->w, surface->h);
        glTextureSubImage2D(*texture, 0, 0, 0, surface->w, surface->h,
                            format, GL_UNSIGNED_BYTE, surface->pixels);
    
        glGenerateTextureMipmap(*texture);
    
        texture->mHandle = glGetTextureHandleARB(*texture);
        glMakeTextureHandleResidentARB(texture->mHandle);
    
        SDL_FreeSurface(surface);
    
        return texture;
    }

    This code is easy: first you load a surface with SDL_image, then you create the texture, compute the number of mipmap levels, allocate all of them, and upload the pixels to the first mipmap level.
    Afterwards, you generate the mipmaps, ask the texture for its handle, and make it resident.
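
    The number of mipmap levels used above is floor(log2(max(width, height))) + 1. As a small illustration, the same value can be computed with integers only by halving the largest dimension until it reaches 1. mipLevelCount is a hypothetical helper name, and both dimensions are assumed to be at least 1.

```cpp
#include <algorithm>
#include <cstdint>

// Number of mipmap levels of a full chain: the largest dimension is halved
// (rounding down) until it reaches 1, and every step adds one level.
std::uint32_t mipLevelCount(std::uint32_t width, std::uint32_t height) {
    std::uint32_t largest = std::max(width, height);
    std::uint32_t levels = 1; // level 0 always exists
    while (largest > 1) {
        largest /= 2;
        ++levels;
    }
    return levels;
}
```

    For instance, a 1024 x 512 texture has 11 levels: 1024, 512, 256, 128, 64, 32, 16, 8, 4, 2 and 1 texels wide.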

    To use this “bindless” texture, you just have to put the “handle” (GLuint64) inside a uniform buffer.
    Afterwards, you can use it like this:

    #version 450 core
    
    #extension GL_ARB_bindless_texture : require
    
    layout(std140, binding = 0) uniform frameBuffer {
        // Here are all frameBuffer's renderTarget
        sampler2D gBufferNormal;
        sampler2D gBufferDiffuse;
        sampler2D gBufferDepth;
    };
    
    layout(location = 0) in vec2 uv;
    layout(location = 0) out vec4 outColor;
    
    void main(void)
    {
        outColor = texture(gBufferDiffuse, uv);
    }

    The next article could be about batching (with multi draw indirect) or persistent mapping.

    Reference