Tag: AZDO

  • Indirect Rendering : “A way to a million draw calls”

    Hello!
    This time I am going to talk about Multi Draw Indirect (MDI) rendering. This feature combines the benefits of multi-draw and of indirect drawing.

    Where does the overhead come from?

    Issuing a lot of commands

    Issuing a draw call is a heavy operation for the CPU, so drawing a lot of models one by one can be really expensive. A naive draw loop could look like this:

    foreach(object) {
        writeUniformData(object, uniformData);
        glDraw...();
    }

    glMultiDraw solves this problem: you write the data for every object first, then issue a single call for all of them.
    The new code is:

    foreach(object)
        writeUniformData(object, uniformData[i]);
    glMultiDraw...();

    Unknown data

    Now, suppose you want to use culling to improve performance. You know that performing it on the GPU is more efficient than on the CPU, but you don’t know how to use the result without passing data from the GPU back to the CPU… This is where indirect drawing helps: the draw parameters are read from a buffer that the GPU itself can write.

    Your old code is:

    cullingOnCPU(allObject); // quite slow
    
    foreach(object)
        if(object->isVisible())
            writeUniformData(object, uniformData[i]);
    glMultiDraw...();

    Using MDI, you could have something like this:

    cullingOnGPU(allObject);
    
    foreach(object)
        writeUniformData(object, uniformData[i]);
    glMultiDrawIndirect...();

    And you don’t have to read the culling result back on the CPU.

    ARB (MULTI) DRAW INDIRECT

    Data and functions

    This extension provides two structures to perform a drawCall. One for glDrawArrays and one for glDrawElements.

    typedef  struct {
        GLuint  count;
        GLuint  primCount;
        GLuint  first;
        GLuint  baseInstance;
    } DrawArraysIndirectCommand;
    
    typedef  struct {
        GLuint  count;
        GLuint  primCount;
        GLuint  firstIndex;
        GLint   baseVertex;
        GLuint  baseInstance;
    } DrawElementsIndirectCommand;
    
    
    void glMultiDrawArraysIndirect(GLenum mode,
                               const void *indirect,
                               GLsizei drawcount,
                               GLsizei stride);
    
    
    void glMultiDrawElementsIndirect(GLenum mode,
                                 GLenum type,
                                 const void *indirect,
                                 GLsizei drawcount,
                                 GLsizei stride);
    

    count specifies the number of elements (vertices or indices) to be rendered
    primCount specifies the number of instances to be rendered (in our case, it will be 0 or 1)
    first specifies the position of the first vertex
    firstIndex specifies the position of the first index
    baseVertex specifies a constant added to each index before the vertex is fetched
    baseInstance specifies the first instance to be rendered (a bit tricky, but I am going to explain that later).
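
    To make these fields concrete, here is a minimal CPU-side sketch. It assumes that all meshes are packed one after another in a single vertex buffer and a single index buffer. MeshRange and buildCommands are names I made up for the example, and fixed-width integers stand in for GLuint / GLint so that it compiles without the OpenGL headers.

```cpp
#include <cstdint>
#include <vector>

// Same layout as DrawElementsIndirectCommand, with fixed-width types
// standing in for GLuint / GLint.
struct DrawElementsIndirectCommand {
    std::uint32_t count;        // number of indices
    std::uint32_t primCount;    // number of instances
    std::uint32_t firstIndex;   // offset (in indices) into the index buffer
    std::int32_t  baseVertex;   // value added to every index
    std::uint32_t baseInstance; // first instance, reused here as a per-draw ID
};

// Hypothetical description of one mesh (not part of the extension).
struct MeshRange {
    std::uint32_t vertexCount;
    std::uint32_t indexCount;
};

// Builds one command per mesh, assuming meshes are stored back to back.
std::vector<DrawElementsIndirectCommand>
buildCommands(std::vector<MeshRange> const &meshes) {
    std::vector<DrawElementsIndirectCommand> commands;
    commands.reserve(meshes.size());

    std::uint32_t firstIndex = 0;
    std::int32_t baseVertex = 0;
    for(std::uint32_t i = 0; i < meshes.size(); ++i) {
        commands.push_back({meshes[i].indexCount, 1u, firstIndex, baseVertex, i});
        firstIndex += meshes[i].indexCount;
        baseVertex += static_cast<std::int32_t>(meshes[i].vertexCount);
    }
    return commands;
}
```

    The resulting vector is what would be uploaded to the indirect buffer; the struct is tightly packed, so the stride can be sizeof(DrawElementsIndirectCommand).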

    How to Use it

    These structures should be put into an OpenGL buffer object bound to the target GL_DRAW_INDIRECT_BUFFER.
    Suppose you have a big scene with 5 000 distinct objects and 100 000 meshes. You must have:

    1. 5 000 matrices in an SSBO
    2. 5 000 materials (not really true, but you get the idea) in an SSBO
    3. 100 000 commands in your indirect buffer
    4. An SSBO which contains the bounding box of each mesh (to perform culling per mesh).

    Now, what you want is to render the whole scene. The steps are:

    1. Fill the matrices / materials / bounding boxes / indirect buffer
    2. Dispatch a compute shader to perform culling
    3. Issue a memory barrier
    4. Render

    The first step is straightforward.
    The second is easy: you bind the indirect buffer as an SSBO in the compute shader and set primCount to 0 if the mesh is not visible, or to 1 otherwise.
    The third is needed because you are about to issue indirect commands that a shader has just written, hence GL_COMMAND_BARRIER_BIT.
    The fourth: render.

    fillBuffers();
    glDispatchIndirect();
    glMemoryBarrier(GL_COMMAND_BARRIER_BIT /*|GL_SHADER_STORAGE_BARRIER_BIT*/);
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glBindVertexArray(vao);
    glMultiDraw*Indirect();
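
    What the culling dispatch writes can be modelled on the CPU. The sketch below is not the compute shader itself: it is a plain C++ stand-in that assumes bounding spheres rather than boxes, and a frustum given as six inward-facing planes. Plane, Sphere, isVisible and cullOnCpu are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Plane  { float nx, ny, nz, d; };   // points inside satisfy n.p + d >= 0
struct Sphere { float x, y, z, radius; };

struct DrawCommand { // same fields as DrawArraysIndirectCommand
    std::uint32_t count, primCount, first, baseInstance;
};

bool isVisible(Sphere const &s, std::array<Plane, 6> const &frustum) {
    for(Plane const &p : frustum) {
        float const distance = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if(distance < -s.radius)
            return false; // entirely behind one plane
    }
    return true;
}

// One loop iteration plays the role of one compute shader invocation:
// it only rewrites primCount, the command itself stays in the buffer.
void cullOnCpu(std::vector<DrawCommand> &commands,
               std::vector<Sphere> const &bounds,
               std::array<Plane, 6> const &frustum) {
    for(std::size_t i = 0; i < commands.size(); ++i)
        commands[i].primCount = isVisible(bounds[i], frustum) ? 1u : 0u;
}
```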

    Beautiful! But how do I know which data I have to use for each draw?

    1. The first way is to use gl_DrawIDARB, which is pretty explicit.
    2. The second way, the one we are going to see and the one I advise, is to use the baseInstance field from the structures seen above.

    Why is gl_DrawIDARB not convenient? Simply because it is slower than the second way on most implementations, and because we will not be able to use ARB INDIRECT PARAMETERS with it.

    So, for the second way, we must add one or several buffers to the previous list (two in our case: one for indexing the matrix buffer, and one for indexing the material buffer). These buffers contain integer values (the index of the matrix / material in its SSBO). Because they are fetched through baseInstance, these buffers are vertex buffers whose attributes advance per instance, with a divisor set through glVertexBindingDivisor.

    A Caveat?

    As you noticed, when you “remove” a command by setting primCount to 0, the command is not really removed: it is still processed, it just draws nothing. This is where the ARB INDIRECT PARAMETERS extension comes in. Instead of setting primCount to 0, you leave it at 1, and if the mesh is not visible, you simply do not add its command to the buffer that is really used. Using an atomic counter, you know exactly how many meshes should be rendered.
    You have to bind the buffer holding the atomic counter to GL_PARAMETER_BUFFER_ARB and use the functions

    void MultiDrawArraysIndirectCountARB(enum mode,
                                         const void *indirect,
                                         intptr drawcount,
                                         sizei maxdrawcount,
                                         sizei stride);
    
    void MultiDrawElementsIndirectCountARB(enum mode,
                                           enum type,
                                           const void *indirect,
                                           intptr drawcount,
                                           sizei maxdrawcount,
                                           sizei stride);
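
    The compaction can be modelled on the CPU too. Again, this is a sketch under assumptions and not the shader: std::atomic plays the role of the atomic counter bound to GL_PARAMETER_BUFFER_ARB, and compactCommands and the visible flags are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawCommand { // same fields as DrawArraysIndirectCommand
    std::uint32_t count, primCount, first, baseInstance;
};

// Copies the visible commands to the front of `output` and returns
// how many were written (the value the draw call reads as drawcount).
std::uint32_t compactCommands(std::vector<DrawCommand> const &input,
                              std::vector<bool> const &visible,
                              std::vector<DrawCommand> &output) {
    std::atomic<std::uint32_t> drawCount{0};
    output.resize(input.size()); // sized for maxdrawcount

    for(std::size_t i = 0; i < input.size(); ++i) {
        if(!visible[i])
            continue;
        // Reserve the next free slot, like an atomic counter increment.
        std::uint32_t const slot = drawCount.fetch_add(1);
        output[slot] = input[i]; // baseInstance still identifies the mesh
    }
    return drawCount.load();
}
```

    Note that after compaction the position of a command in the buffer no longer matches the original mesh index, whereas baseInstance travels with the command. On the GPU the order of the compacted commands is also not guaranteed, since invocations run in parallel.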

    References

    Indirect Parameters
    Multi Draw Indirect
    Surviving without drawID

  • OpenGL AZDO : Bindless Textures : batching problem solved

    Hello!
    After playing with Vulkan, I had to admit that it is not as easy to use as I wanted. With that settled, I preferred to come back to OpenGL. However, Vulkan let me learn a lot of things about how OpenGL works internally. I am going to write a series of tutorials about OpenGL AZDO. The first one will discuss bindless textures!

    What is OpenGL AZDO ?

    OpenGL Approaching Zero Driver Overhead is an idea which comes from Cass Everitt, Tim Foley, John McDonald, and Graham Sellers. The idea behind it is to reduce CPU usage by using the latest features offered by recent GPUs.
    AZDO presents several techniques to keep the overhead low:

    1. Make as few bindings as possible
    2. Use persistent mapping
    3. Use batching
    4. Use the GPU for everything (culling, filling structures).

    This series of tutorials will cover how to implement such things.

    Bindless Texture

    Bindless textures solve a problem you may run into when implementing batching. A naive draw loop could look like this:

    foreach(render target) { // frame buffer
        foreach(pass) { // Depth, geometry, light
            foreach(material) { // textures
                draw();
            }
        }
    }

    The main issue here is that we cannot batch efficiently, since each draw call could use different textures.
    Now, imagine you could put a texture inside a uniform buffer and just perform one big draw call! You end up with very little overhead!

    How to do it ?

    We are lucky: in my opinion, bindless textures are the easiest of the AZDO features to implement. However, we will really see them in action in the chapter about batching. To get started with bindless textures, you just have to follow these steps:

    1. Create the texture in the normal way
    2. Get the handle (kind of the address of the texture)
    3. Make the handle resident
    4. Put the handle in a uniform buffer

    Here is a function you can use to load an image file using SDL, put it into a texture, and enable the bindless feature:

    std::unique_ptr<Texture> Texture::loadImage2D(const std::string &path) {
        std::unique_ptr<Texture> texture = std::make_unique<Texture>();
    
        SDL_Surface *surface = IMG_Load(path.c_str());
    
        if(surface == nullptr)
            throw std::runtime_error(path + " could not be opened");
    
        GLenum format, internalFormat;
    
        getFormats(surface, internalFormat, format);
    
        glCreateTextures(GL_TEXTURE_2D, 1, &texture->mId);
    
        glTextureParameteri(*texture, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
        glTextureParameteri(*texture, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    
        GLsizei numMipmaps = ((GLsizei)log2(std::max(surface->w, surface->h)) + 1);
        glTextureStorage2D(*texture, numMipmaps, internalFormat, surface->w, surface->h);
        glTextureSubImage2D(*texture, 0, 0, 0, surface->w, surface->h,
                            format, GL_UNSIGNED_BYTE, surface->pixels);
    
        glGenerateTextureMipmap(*texture);
    
        texture->mHandle = glGetTextureHandleARB(*texture);
        glMakeTextureHandleResidentARB(texture->mHandle);
    
        SDL_FreeSurface(surface);
    
        return texture;
    }

    This code is easy: first you load a surface with SDL_image, you create the texture, you compute the number of mipmap levels, you allocate storage for all of them, and you upload the pixels to the first level.
    Then you generate the mipmaps, ask the texture for its handle, and make the handle resident.
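
    The number of levels used above is floor(log2(max(width, height))) + 1. As a side note, the same value can be computed with integers only, which avoids the cast from a floating point result. This is a small sketch; mipLevelCount is a name I chose, it is not an OpenGL or SDL function.

```cpp
#include <algorithm>
#include <cstdint>

// floor(log2(max(width, height))) + 1, computed without floating point.
// Returns 0 for an empty image (both dimensions equal to 0).
std::uint32_t mipLevelCount(std::uint32_t width, std::uint32_t height) {
    std::uint32_t largest = std::max(width, height);
    std::uint32_t levels = 0;
    while(largest > 0) { // each halving is one more mipmap level
        ++levels;
        largest >>= 1;
    }
    return levels;
}
```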

    To use this “bindless” texture, you just have to put the handle (a GLuint64) inside a uniform buffer.
    Then you can use it like this:

    #version 450 core
    
    #extension GL_ARB_bindless_texture : require
    
    layout(std140, binding = 0) uniform frameBuffer {
        // Here are all frameBuffer's renderTarget
        sampler2D gBufferNormal;
        sampler2D gBufferDiffuse;
        sampler2D gBufferDepth;
    };
    
    layout(location = 0) in vec2 uv;
    layout(location = 0) out vec4 outColor;
    
    void main(void)
    {
        outColor = texture(gBufferDiffuse, uv);
    }

    The next article could be about batching (with multi draw indirect) or persistent mapping.

  • Vulkan Pipelines, Barrier, memory management

    Hi!
    Once again, I am going to present some Vulkan features: pipelines, barriers, memory management, and everything the previous ones need. This article will be long, but it is split into several chapters.

    Memory Management

    In a Vulkan application, it is up to the developer to manage memory. The number of allocations is limited, so making one allocation per buffer or per image is really a bad design in Vulkan. A good design is to make one big allocation (let’s call it a chunk), manage it yourself, and sub-allocate buffers and images within the chunk.

    A Chunk Allocator

    We need a simple object responsible for allocating chunks. It just has to select a suitable heap and call the allocate and free functions of the Vulkan API.

    #pragma once
    
    #include "System/Vulkan/Hardware/device.hpp"
    #include <tuple>
    #include <vector>
    
    class ChunkAllocator
    {
    public:
        ChunkAllocator(Device &device);
    
        // Memory, flags, size, ptr
        std::tuple<VkDeviceMemory, VkMemoryPropertyFlags, VkDeviceSize, char *>
        allocate(VkMemoryPropertyFlags flags, VkDeviceSize size);
    
        ~ChunkAllocator();
    
    private:
        Device &mDevice;
    
        std::vector<VkDeviceMemory> mDeviceMemories; //!< Each chunk
    };
    
    #include "chunkallocator.hpp"
    #include "System/exception.hpp"
    
    ChunkAllocator::ChunkAllocator(Device &device) : mDevice(device)
    {
    
    }
    
    std::tuple<VkDeviceMemory, VkMemoryPropertyFlags, VkDeviceSize, char*>
    ChunkAllocator::allocate(VkMemoryPropertyFlags flags, VkDeviceSize size) {
        VkPhysicalDeviceMemoryProperties const &property = mDevice.memoryProperties();
        int index = -1;
    
        // Looking for a heap with good flags and good size
        for(auto i(0u); i < property.memoryTypeCount; ++i)
            if((property.memoryTypes[i].propertyFlags & flags) == flags)
                if(size < property.memoryHeaps[property.memoryTypes[i].heapIndex].size)
                    index = i;
    
        if(index == -1)
            throw std::runtime_error("No good heap found");
    
        VkMemoryAllocateInfo info = {};
        info.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
        info.pNext = nullptr;
        info.allocationSize = size;
        info.memoryTypeIndex = index;
    
        // Perform the allocation
        VkDeviceMemory mem;
        vulkanCheckError(vkAllocateMemory(mDevice, &info, nullptr, &mem));
        mDeviceMemories.push_back(mem);
    
        char *ptr = nullptr; // stays null when the memory is not host visible
         // We map the memory if it is host visible
        if(flags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT)
            vulkanCheckError(vkMapMemory(mDevice, mem, 0, VK_WHOLE_SIZE, 0, (void**)&ptr));
    
    
        return std::tuple<VkDeviceMemory, VkMemoryPropertyFlags, VkDeviceSize, char*>
                (mem, flags, size, ptr);
    }
    
    ChunkAllocator::~ChunkAllocator() {
        // We free all memory objects
        for(auto &mem : mDeviceMemories)
            vkFreeMemory(mDevice, mem, nullptr);
    }
    

    This piece of code is quite simple and easy to read.

    Memory Pool

    Memory pools are structures used to optimize dynamic allocation performance. In video games, using a memory pool is not optional. The idea is the same as in the first part: allocate a chunk, then sub-allocate within the chunk yourself. I made a simple generic memory pool.
    Here is a small diagram which explains what I wanted to do.

    [Figure: Memory Pool]

    As you can see, video memory is separated into several parts (4 here) and each “Block” in the linked list describes one sub-allocation.
    One block is described by:

    1. The size of the block
    2. The offset of the block relative to the DeviceMemory
    3. A pointer to write data from the host (map)
    4. A boolean telling whether the block is free

    A sub-allocation within a chunk is performed as follows:

    1. Traverse the linked list until we find a free block that is large enough
    2. Set its size to the requested size and set the boolean to false
    3. Create a new block for the remaining space, set its size and offset, set its boolean to true, and insert it after the current one.

    A free is quite simple: you just have to set the boolean back to true.
    Another useful method would be a “shrink to fit”: if several consecutive blocks have the boolean set to true, we merge them all into one.
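
    The merge is not part of the pool shown in this article. As a rough sketch of what it could look like, assuming the blocks of a chunk are kept sorted by offset and contiguous (which is how allocate inserts them); the Block here is a simplified stand-in without the mapped pointer, and mergeFreeBlocks is a hypothetical name:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for MemoryPool::Block (no mapped pointer).
struct Block {
    std::uint64_t offset;
    std::uint64_t size;
    bool isFree;
};

// Merges every run of adjacent free blocks into a single free block.
// Assumes `blocks` is sorted by offset and has no gaps.
void mergeFreeBlocks(std::vector<Block> &blocks) {
    std::vector<Block> merged;
    merged.reserve(blocks.size());

    for(Block const &block : blocks) {
        if(!merged.empty() && merged.back().isFree && block.isFree)
            merged.back().size += block.size; // extend the previous free block
        else
            merged.push_back(block);
    }
    blocks.swap(merged);
}
```

    It could be called at the end of free, or only when an allocation fails, before falling back to a new chunk.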

    #pragma once
    
    #include "chunkallocator.hpp"
    
    // Memory, Offset, Size, ptr
    using Allocation = std::tuple<VkDeviceMemory, VkDeviceSize, VkDeviceSize, char*>;
    
    class MemoryPool {
        // Describes one user allocation
        struct Block {
            VkDeviceSize offset;
            VkDeviceSize size;
            char *ptr;
            bool isFree;
        };
    
        struct Chunk {
            VkDeviceMemory memory;
            VkMemoryPropertyFlags flags;
            VkDeviceSize size;
            char *ptr;
            std::vector<Block> blocks;
        };
    
    public:
        MemoryPool(Device &device);
    
        Allocation allocate(VkDeviceSize size, VkMemoryPropertyFlags flags);
    
        void free(Allocation const &alloc);
    
    private:
        Device &mDevice;
        ChunkAllocator mChunkAllocator;
        std::vector<Chunk> mChunks;
    
        void addChunk(std::tuple<VkDeviceMemory, VkMemoryPropertyFlags,
                      VkDeviceSize, char*> const &ptr);
    };
    
    #include "memorypool.hpp"
    #include <cassert>
    
    MemoryPool::MemoryPool(Device &device) :
        mDevice(device), mChunkAllocator(device) {}
    
    Allocation MemoryPool::allocate(VkDeviceSize size, VkMemoryPropertyFlags flags) {
        if(size % 128 != 0)
            size = size + (128 - (size % 128)); // 128 bytes alignment
        assert(size % 128 == 0);
    
        for(auto &chunk: mChunks) {
            // if flags are okay
            if((chunk.flags & flags) == flags) {
                int indexBlock = -1;
                // We are looking for a good block
                for(auto i(0u); i < chunk.blocks.size(); ++i) {
                    if(chunk.blocks[i].isFree) {
                        if(chunk.blocks[i].size >= size) { // an exact fit is fine too
                            indexBlock = i;
                            break;
                        }
                    }
                }
    
                // If a block is found
                if(indexBlock != -1) {
                    Block newBlock;
                    // Set the new block
                    newBlock.isFree = true;
                    newBlock.offset = chunk.blocks[indexBlock].offset + size;
                    newBlock.size = chunk.blocks[indexBlock].size - size;
                    newBlock.ptr = chunk.blocks[indexBlock].ptr + size;
    
                    // Modify the current block
                    chunk.blocks[indexBlock].isFree = false;
                    chunk.blocks[indexBlock].size = size;
    
                    // If allocation does not fit perfectly the block
                    if(newBlock.size != 0)
                        chunk.blocks.emplace(chunk.blocks.begin() + indexBlock + 1, newBlock);
    
                    return Allocation(chunk.memory, chunk.blocks[indexBlock].offset, size, chunk.blocks[indexBlock].ptr);
                }
            }
        }
    
        // if we reach there, we have to allocate a new chunk
        addChunk(mChunkAllocator.allocate(flags, 1 << 25));
    
        return allocate(size, flags);
    }
    
    void MemoryPool::free(Allocation const &alloc) {
        for(auto &chunk: mChunks)
            if(chunk.memory == std::get<0>(alloc)) // Search the good memory device
                for(auto &block : chunk.blocks)
                    if(block.offset == std::get<1>(alloc)) // Search the good offset
                        block.isFree = true; // put it to free
    }
    
    void MemoryPool::addChunk(const std::tuple<VkDeviceMemory, VkMemoryPropertyFlags, VkDeviceSize, char *> &ptr) {
        Chunk chunk;
        Block block;
    
        // Add a block mapped along the whole chunk
        block.isFree = true;
        block.offset = 0;
        block.size = std::get<2>(ptr);
        block.ptr = std::get<3>(ptr);
    
        chunk.flags = std::get<1>(ptr);
        chunk.memory = std::get<0>(ptr);
        chunk.size = std::get<2>(ptr);
        chunk.ptr = std::get<3>(ptr);
        chunk.blocks.emplace_back(block);
        mChunks.emplace_back(chunk);
    }
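
    The 128-byte rounding in allocate is hard-coded. In a real application the alignment would normally come from the alignment field of VkMemoryRequirements (filled by vkGetBufferMemoryRequirements or vkGetImageMemoryRequirements), and it applies to the offset of the sub-allocation, not only to its size. A small helper for that, assuming the alignment is a power of two; alignUp is a name I chose:

```cpp
#include <cstdint>

// Rounds `value` up to the next multiple of `alignment`.
// Only valid when `alignment` is a power of two.
constexpr std::uint64_t alignUp(std::uint64_t value, std::uint64_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

static_assert(alignUp(0, 128) == 0, "zero stays zero");
static_assert(alignUp(1, 128) == 128, "rounded up to the next multiple");
static_assert(alignUp(128, 128) == 128, "already aligned values are unchanged");
```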

    Buffers

    Buffers are a well-known part of OpenGL. In Vulkan, they are approximately the same, but you have to manage the memory yourself through a memory pool.

    When you create a buffer, you have to give it a size and a usage (uniform buffer, index buffer, vertex buffer, …). You can also ask for a sparse buffer (sparse resources will be the subject of an article one day ^_^). You can also make it concurrent; thanks to that, you can access the same buffer from queues of different families.

    #pragma once
    
    #include "memorypool.hpp"
    
    class Buffer
    {
    public:
        Buffer(Device &device, MemoryPool &memoryPool,
               VkBufferUsageFlags usage, VkDeviceSize size,
               VkSharingMode sharing = VK_SHARING_MODE_EXCLUSIVE,
               uint32_t nFamilyIndex = 0, uint32_t *pQueueFamilyIndices = nullptr);
    
        Buffer(Buffer &&buf);
    
        template<typename T>
        T *map() {
            return (T*)std::get<3>(mAllocation);
        }
    
        VkDeviceSize size();
    
        operator VkBuffer();
    
        ~Buffer();
    
    private:
        Device &mDevice;
        MemoryPool &mMemoryPool;
        Allocation mAllocation;
        VkBuffer mBuffer;
    };
    
    Buffer::Buffer(Device &device, MemoryPool &memoryPool,
                   VkBufferUsageFlags usage, VkDeviceSize size, VkSharingMode sharing,
                   uint32_t nFamilyIndex, uint32_t *pQueueFamilyIndices) :
        mDevice(device), mMemoryPool(memoryPool) {
        VkBufferCreateInfo info = {};
    
        info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
        info.pNext = nullptr;
        info.flags = 0;
        info.size = size;
        info.usage = usage;
        info.sharingMode = sharing;
        info.queueFamilyIndexCount = nFamilyIndex;
        info.pQueueFamilyIndices = pQueueFamilyIndices;
    
        vulkanCheckError(vkCreateBuffer(mDevice, &info, nullptr, &mBuffer));
    
        mAllocation = memoryPool.allocate(size, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
        vulkanCheckError(vkBindBufferMemory(mDevice, mBuffer, std::get<0>(mAllocation), std::get<1>(mAllocation)));
    }
    
    Buffer::~Buffer() {
        if(mBuffer != VK_NULL_HANDLE)
            mMemoryPool.free(mAllocation);
        vkDestroyBuffer(mDevice, mBuffer, nullptr);
    }

    I chose host visible and host coherent memory, but neither is strictly necessary. To achieve better performance, you may want to use non-coherent memory (but then you have to flush / invalidate your memory yourself!!).
    Host visible memory is not always needed either: for indirect rendering, for instance, it can be smart to perform culling on the GPU and let it fill all the structures!
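
    If you go for non-coherent memory, the ranges passed to vkFlushMappedMemoryRanges / vkInvalidateMappedMemoryRanges have constraints: the offset must be a multiple of the device limit nonCoherentAtomSize, and the size must be a multiple of it too unless the range reaches the end of the allocation. Here is a sketch of how such a range could be computed for a sub-allocation; Range and alignedFlushRange are hypothetical names, and plain integers stand in for VkDeviceSize.

```cpp
#include <algorithm>
#include <cstdint>

struct Range {
    std::uint64_t offset;
    std::uint64_t size;
};

// Expands [offset, offset + size) so that both ends fall on multiples of
// atomSize, then clamps the end to the size of the whole allocation.
Range alignedFlushRange(std::uint64_t offset, std::uint64_t size,
                        std::uint64_t atomSize, std::uint64_t allocationSize) {
    std::uint64_t const begin = offset / atomSize * atomSize; // round down
    std::uint64_t const end =
        (offset + size + atomSize - 1) / atomSize * atomSize; // round up
    return {begin, std::min(end, allocationSize) - begin};
}
```

    The expanded range may cover bytes of neighbouring sub-allocations, which is harmless for a flush but worth keeping in mind for an invalidate.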

    Shaders

    Shaders are the different programmable parts of your pipelines. This is an approximation, obviously, but for each stage (vertex processing, geometry processing, fragment processing…), the associated shader is invoked. In Vulkan, shaders are provided as SPIR-V.
    SPIR-V is to GLSL what “.class” files are to Java. You can compile your GLSL sources to SPIR-V using glslangValidator.

    Why is SPIR-V so powerful ?

    SPIR-V allows developers to ship their application without the shaders’ source.
    SPIR-V is an intermediate representation. Thanks to that, vendor implementations do not have to include a compiler for a high-level language. This lowers the complexity of the driver, which can optimize more and compile faster.

    Shaders in Vulkan

    Contrary to OpenGL shaders, shaders are really easy to load in Vulkan.
    My implementation keeps all shader modules in a hash table. This prevents a shader from being loaded twice.

    #pragma once
    
    #include "System/Vulkan/Hardware/device.hpp"
    #include <unordered_map>
    #include <string>
    
    class Shaders
    {
    public:
        Shaders(Device &device);
    
        VkShaderModule get(std::string const &path);
    
        ~Shaders();
    private:
        Device &mDevice;
        std::unordered_map<std::string, VkShaderModule> mShaders;
    };
    
    #include "shaders.hpp"
    #include "System/exception.hpp"
    #include <fstream>
    #include <vector>
    
    auto readBinaryFile(std::string const &path) {
        std::ifstream is(path, std::ios::binary);
    
        if(!is.is_open())
            throw std::runtime_error("Shader : " + path + " not found");
    
        is.seekg(0, std::ios::end);
        auto l = is.tellg();
        is.seekg(0, std::ios::beg);
    
        std::vector<char> values(l);
        is.read(&values[0], l);
    
        return values;
    }
    
    Shaders::Shaders(Device &device) :
        mDevice(device)
    {
    
    }
    
    VkShaderModule Shaders::get(const std::string &path) {
        if(mShaders.find(path) == mShaders.end()) {
            auto file = readBinaryFile(path);
            VkShaderModuleCreateInfo info;
            VkShaderModule module;
    
            info.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
            info.pNext = nullptr;
            info.flags = 0;
            info.codeSize = file.size();
            info.pCode = (uint32_t*)&file[0];
    
            vulkanCheckError(vkCreateShaderModule(mDevice, &info, nullptr, &module));
            mShaders[path] = module;
        }
    
        return mShaders[path];
    }
    
    Shaders::~Shaders() {
        for(auto &shader: mShaders)
            vkDestroyShaderModule(mDevice, shader.second, nullptr);
    }
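
    Shaders::get casts the bytes of a std::vector<char> to uint32_t* and trusts the file. A slightly more defensive variant, as a sketch: copy the bytes into 32-bit words and check the SPIR-V magic number (0x07230203, the first word of every module) before creating the shader module. toSpirvWords is a name I made up; with it, codeSize would be words.size() * 4 and pCode would be words.data().

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

constexpr std::uint32_t kSpirvMagic = 0x07230203u;

// Converts raw file bytes to SPIR-V words, rejecting obviously invalid input.
std::vector<std::uint32_t> toSpirvWords(std::vector<char> const &bytes) {
    if(bytes.empty() || bytes.size() % sizeof(std::uint32_t) != 0)
        throw std::runtime_error("SPIR-V size is not a multiple of 4 bytes");

    std::vector<std::uint32_t> words(bytes.size() / sizeof(std::uint32_t));
    std::memcpy(words.data(), bytes.data(), bytes.size());

    // A byte-swapped magic number would mean the file has the other endianness.
    if(words[0] != kSpirvMagic)
        throw std::runtime_error("Not a SPIR-V module for this endianness");

    return words;
}
```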

    Pipelines

    Pipelines are objects used to dispatch (compute pipelines) or to render something (graphics pipelines).

    The beginning of this part is going to be a summary of the Vulkan specification.

    Descriptors

    Shaders access buffer and image resources through special variables. These variables are organized into sets of bindings. Each set is represented by one descriptor set object.

    Descriptor Set Layout

    They describe one set. One set is composed of an array of bindings. Each binding is described by:

    1. A binding number
    2. A type: image, uniform buffer, SSBO, …
    3. The number of values (it could be an array of textures)
    4. The stages where shaders can access the binding.

    Allocation of Descriptor Sets

    They are allocated from descriptor pool objects.
    A descriptor pool is described by the maximum number of sets it can allocate, and an array of descriptor type / count pairs it can allocate.

    Once you have the descriptor pool, you can allocate sets from it (using both the descriptor pool and a descriptor set layout).
    When you destroy the pool, its sets are destroyed as well.
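
    To illustrate how the type / count pairs of a pool relate to the bindings of a layout, here is a sketch with plain C++ types instead of the Vulkan structures. DescriptorType, Binding and poolSizes are hypothetical stand-ins (for VkDescriptorType, VkDescriptorSetLayoutBinding and the computation of the VkDescriptorPoolSize array), not Vulkan API.

```cpp
#include <cstdint>
#include <map>
#include <vector>

enum class DescriptorType { UniformBuffer, StorageBuffer, CombinedImageSampler };

struct Binding {
    std::uint32_t binding;  // binding number in the set
    DescriptorType type;    // kind of resource
    std::uint32_t count;    // more than 1 for arrays
};

// Sums how many descriptors of each type a pool must be able to hold
// in order to allocate `setCount` sets that all use `layout`.
std::map<DescriptorType, std::uint32_t>
poolSizes(std::vector<Binding> const &layout, std::uint32_t setCount) {
    std::map<DescriptorType, std::uint32_t> sizes;
    for(Binding const &b : layout)
        sizes[b.type] += b.count * setCount;
    return sizes;
}
```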

    Give buffer / image to sets

    Now we have descriptor sets, but we still have to tell Vulkan where the shaders get their data from.

    Pipeline Layouts

    Pipeline layouts are a kind of bridge between the pipeline and the descriptor sets. They let you manage push constants as well (we’ll see them in a future article).

    Implementation

    Descriptor sets are not coupled with pipeline layouts, so we could separate the pipeline layout from the descriptor pool / sets. Currently, I prefer to keep them together. It is a choice, and it may change in the future.

    #pragma once
    #include "System/Vulkan/Hardware/device.hpp"
    
    class PipelineLayout : Loggable, NonCopyable
    {
    public:
        PipelineLayout(Device &device);
    
        void setDescriptorSetLayouts(std::vector<VkDescriptorSetLayoutCreateInfo> &&infos);
        void setDescriptorPoolCreateInfo(VkDescriptorPoolCreateInfo const &info);
        void create();
    
        std::vector<VkDescriptorSet> const &descriptorSets() const;
    
        operator VkPipelineLayout();
    
        ~PipelineLayout();
    
    private:
        Device &mDevice;
    
        std::vector<VkDescriptorSetLayoutCreateInfo> mSetLayoutCreateInfos;
        std::vector<VkDescriptorSetLayout> mDescriptorSetLayouts;
        std::vector<VkDescriptorSet> mDescriptorSets;
        VkDescriptorPoolCreateInfo mDescriptorPoolCreateInfo;
        VkDescriptorPool mDescriptorPool = VK_NULL_HANDLE;
    
        VkPipelineLayout mLayout = VK_NULL_HANDLE;
    };
    
    void PipelineLayout::create() {
        VkPipelineLayoutCreateInfo info = {};
    
        // Create all set layouts
        for(auto &setInfo : mSetLayoutCreateInfos) { // renamed to avoid shadowing `info`
            VkDescriptorSetLayout layout;
            vulkanCheckError(vkCreateDescriptorSetLayout(mDevice, &setInfo, nullptr, &layout));
            mDescriptorSetLayouts.emplace_back(layout);
        }
    
        // Create the descriptor pool
        if(mSetLayoutCreateInfos.size() > 0)
            vulkanCheckError(vkCreateDescriptorPool(mDevice, &mDescriptorPoolCreateInfo, nullptr, &mDescriptorPool));
    
        info.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
        info.pNext = nullptr;
        info.flags = 0;
        info.setLayoutCount = mDescriptorSetLayouts.size();
        info.pushConstantRangeCount = 0;
    
        if(mDescriptorSetLayouts.size() > 0)
            info.pSetLayouts = &mDescriptorSetLayouts[0];
    
        // Create the pipeline layout
        vulkanCheckError(vkCreatePipelineLayout(mDevice, &info, nullptr, &mLayout));
    
        if(mDescriptorSetLayouts.size()) {
            VkDescriptorSetAllocateInfo alloc = {};
    
            alloc.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
            alloc.pNext = nullptr;
            alloc.descriptorPool = mDescriptorPool;
            alloc.descriptorSetCount = mDescriptorSetLayouts.size();
            alloc.pSetLayouts = &mDescriptorSetLayouts[0];
            mDescriptorSets.resize(mDescriptorSetLayouts.size());
            vulkanCheckError(vkAllocateDescriptorSets(mDevice, &alloc, &mDescriptorSets[0]));
        }
    }
    
    std::vector<VkDescriptorSet> const &PipelineLayout::descriptorSets() const {
        return mDescriptorSets;
    }
    
    PipelineLayout::~PipelineLayout() {
        for(auto &layout : mDescriptorSetLayouts)
            vkDestroyDescriptorSetLayout(mDevice, layout, nullptr);
    
        vkDestroyDescriptorPool(mDevice, mDescriptorPool, nullptr);
        vkDestroyPipelineLayout(mDevice, mLayout, nullptr);
    }

    The idea is quite simple: you create all your descriptor set layouts, then you allocate the sets through a pool.

    Graphics Pipelines in a nutshell

    Graphics pipelines describe exactly what will happen during rendering.
    They describe:

    1. Shader stages
    2. Which kind of data you want to deal with (position, normal, …)
    3. Which kind of primitive you want to draw (triangles, lines, points)
    4. Which operations you want to use for stencil and depth
    5. Multisampling, color blending, …

    Creating a graphics pipeline is really easy; the main difficulty is the configuration.

    void Pipeline::create() {
        VkGraphicsPipelineCreateInfo info = {};
    
        info.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
        info.pNext = nullptr;
        info.flags = mFlags;
    
        info.stageCount = mStages.size();
        info.pStages = &mStages[0];
        info.pVertexInputState = &mVertexInputState;
        info.pInputAssemblyState = &mInputAssemblyState;
        info.pTessellationState = mTesselationState.get();
        info.pViewportState = mViewportState.get();
        info.pRasterizationState = &mRasterizationState;
        info.pMultisampleState = mMultisampleState.get();
        info.pDepthStencilState = mDepthStencilState.get();
        info.pColorBlendState = mColorBlendState.get();
        info.pDynamicState = mDynamicState.get();
    
        if(mLayout != nullptr)
            info.layout = *mLayout;
    
        info.renderPass = mRenderPass;
        info.subpass = mSubpass;
    
        vulkanCheckError(vkCreateGraphicsPipelines(mDevice, VK_NULL_HANDLE, 1, &info, nullptr, &mPipeline));
    }

    I used a kind of builder design pattern to configure pipelines.

    For the example, I configure my pipeline as follows :

    1. 2 stages : vertex shader and fragment shader
    2. Position 4D (x, y, z, w)
    3. No depth / stencil test
    4. A uniform buffer for one color

    This code is a bit long, but it gives all the steps you have to follow to create simple pipelines.

    std::unique_ptr<Pipeline> GBufferPipelineBuilder::build(Context &context,
                                                            RenderPass &renderpass, uint32_t subpass) {
        VkRect2D scissor;
        scissor.offset.x = scissor.offset.y = 0;
        scissor.extent.height = context.surfaceWindow().height();
        scissor.extent.width = context.surfaceWindow().width();
    
        VkViewport vp;
        vp.height = context.surfaceWindow().height();
        vp.width = context.surfaceWindow().width();
        vp.minDepth = 0.0f;
        vp.maxDepth = 1.0f;
        vp.x = vp.y = 0;
    
        VkPipelineViewportStateCreateInfo viewPort;
        viewPort.sType = VK_STRUCTURE_TYPE_PIPELINE_VIEWPORT_STATE_CREATE_INFO;
        viewPort.flags = 0;
        viewPort.pNext = nullptr;
        viewPort.scissorCount = viewPort.viewportCount = 1;
        viewPort.pViewports = &vp;
        viewPort.pScissors = &scissor;
    
        // 2 stages, vertex and fragment
        std::vector<VkPipelineShaderStageCreateInfo> stages(2);
        for(auto &stage : stages) {
            stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
            stage.pNext = nullptr;
            stage.flags = 0;
            stage.pSpecializationInfo = nullptr;
        }
    
        stages[0].stage = VK_SHADER_STAGE_VERTEX_BIT;
        stages[0].module = context.shader("../Shader/vert.spv");
        stages[0].pName = "main";
    
        stages[1].stage = VK_SHADER_STAGE_FRAGMENT_BIT;
        stages[1].module = context.shader("../Shader/frag.spv");
        stages[1].pName = "main";
    
        // Values are float4
        VkVertexInputAttributeDescription attribute[1];
        attribute[0].location = 0;
        attribute[0].binding = 0;
        attribute[0].offset = 0;
        attribute[0].format = VK_FORMAT_R32G32B32A32_SFLOAT;
    
        VkVertexInputBindingDescription binding[1];
        binding[0].binding = 0;
        binding[0].stride = 4 * sizeof(float);
        binding[0].inputRate = VK_VERTEX_INPUT_RATE_VERTEX;
    
        VkPipelineVertexInputStateCreateInfo vertexInput = {};
        vertexInput.sType = VK_STRUCTURE_TYPE_PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO;
        vertexInput.pNext = nullptr;
        vertexInput.flags = 0;
        vertexInput.vertexAttributeDescriptionCount = 1;
        vertexInput.vertexBindingDescriptionCount = 1;
        vertexInput.pVertexAttributeDescriptions = attribute;
        vertexInput.pVertexBindingDescriptions = binding;
    
        // No MSAA: one sample per pixel
        VkPipelineMultisampleStateCreateInfo multisample = {};
        multisample.sType = VK_STRUCTURE_TYPE_PIPELINE_MULTISAMPLE_STATE_CREATE_INFO;
        multisample.pNext = nullptr;
        multisample.flags = 0;
        multisample.pSampleMask = nullptr;
        multisample.rasterizationSamples = VK_SAMPLE_COUNT_1_BIT;
        multisample.sampleShadingEnable = VK_FALSE;
        multisample.alphaToCoverageEnable = VK_FALSE;
        multisample.alphaToOneEnable = VK_FALSE;
    
        // DepthStencil tests disabled
        VkPipelineDepthStencilStateCreateInfo depthStencil = {};
        depthStencil.sType = VK_STRUCTURE_TYPE_PIPELINE_DEPTH_STENCIL_STATE_CREATE_INFO;
        depthStencil.pNext = nullptr;
        depthStencil.flags = 0;
        depthStencil.depthTestEnable = VK_FALSE;
        depthStencil.depthWriteEnable = VK_FALSE;
        depthStencil.depthBoundsTestEnable = VK_FALSE;
        depthStencil.depthCompareOp = VK_COMPARE_OP_ALWAYS;
        depthStencil.stencilTestEnable = VK_FALSE;
    
        // We write all r, g, b, a values
        VkPipelineColorBlendStateCreateInfo colorBlend = {};
        VkPipelineColorBlendAttachmentState cbstate[1] = {};
        cbstate[0].colorWriteMask = VK_COLOR_COMPONENT_A_BIT | VK_COLOR_COMPONENT_B_BIT | VK_COLOR_COMPONENT_G_BIT | VK_COLOR_COMPONENT_R_BIT;
        colorBlend.sType = VK_STRUCTURE_TYPE_PIPELINE_COLOR_BLEND_STATE_CREATE_INFO;
        colorBlend.pNext = nullptr;
        colorBlend.flags = 0;
        colorBlend.logicOpEnable = VK_FALSE;
        colorBlend.attachmentCount = 1;
        colorBlend.pAttachments = cbstate;
    
        std::unique_ptr<PipelineLayout> layout = std::make_unique<PipelineLayout>(context.device());
    
        // 1 set
        VkDescriptorSetLayoutCreateInfo setLayout = {};
        setLayout.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
        setLayout.pNext = nullptr;
        setLayout.flags = 0;
        setLayout.bindingCount = 1;
    
        // 1 binding for uniform buffer
        VkDescriptorSetLayoutBinding descriptorBinding = {};
        descriptorBinding.binding = 0;
        descriptorBinding.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
        descriptorBinding.descriptorCount = 1;
        descriptorBinding.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT;
        descriptorBinding.pImmutableSamplers = nullptr;
        setLayout.pBindings = &descriptorBinding;
    
        // Pool for one and only one set
        VkDescriptorPoolCreateInfo descriptorPoolInfo;
        descriptorPoolInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
        descriptorPoolInfo.flags = 0;
        descriptorPoolInfo.pNext = nullptr;
        descriptorPoolInfo.maxSets = 1;
        descriptorPoolInfo.poolSizeCount = 1;
        VkDescriptorPoolSize poolSize;
        poolSize.descriptorCount = 1;
        poolSize.type = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
        descriptorPoolInfo.pPoolSizes = &poolSize;
    
        layout->setDescriptorSetLayouts({setLayout});
        layout->setDescriptorPoolCreateInfo(descriptorPoolInfo);
    
        layout->create();
    
        std::unique_ptr<Pipeline> pipeline;
    
        pipeline = std::make_unique<Pipeline>(context.device(), renderpass, subpass);
        pipeline->setVertexInputState(vertexInput);
        pipeline->setStages(std::move(stages));
        pipeline->setViewportState(std::make_unique<VkPipelineViewportStateCreateInfo>(viewPort));
        pipeline->setMultiSampleState(std::make_unique<VkPipelineMultisampleStateCreateInfo>(multisample));
        pipeline->setDepthStencilState(std::make_unique<VkPipelineDepthStencilStateCreateInfo>(depthStencil));
        pipeline->setColorBlendState(std::make_unique<VkPipelineColorBlendStateCreateInfo>(colorBlend));
        pipeline->setLayout(std::move(layout));
    
        pipeline->create();
    
        // Create a uniform buffer
        std::unique_ptr<Buffer> bufUniform = std::make_unique<Buffer>
                (context.device(), context.memoryPool(),VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT,
                 4 * sizeof(float));
    
        VkDescriptorBufferInfo infoBuffer;
        infoBuffer.buffer = *bufUniform;
        infoBuffer.offset = 0;
        infoBuffer.range = VK_WHOLE_SIZE;
    
        pipeline->addBuffer(std::move(bufUniform));
    
        // "Give" the buffer to the set.
        pipeline->updateBufferDescriptorSets(0, 0,
                                             VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
                                             infoBuffer);
    
        return pipeline;
    }
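
    The viewport and the scissor at the top of build() both cover the whole window. As a minimal sketch, this could be factored into one helper. `Viewport`, `Rect`, `ViewportAndScissor` and `fullWindow` are hypothetical stand-ins that only mirror the fields used above; they are not Vulkan types and not part of my engine.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins mirroring the fields of VkViewport and VkRect2D (hypothetical).
struct Viewport {
    float x, y, width, height, minDepth, maxDepth;
};

struct Rect {
    std::int32_t x, y;
    std::uint32_t width, height;
};

struct ViewportAndScissor {
    Viewport viewport;
    Rect scissor;
};

// Builds a viewport and a scissor that both cover a width x height window,
// with the usual [0, 1] depth range.
inline ViewportAndScissor fullWindow(std::uint32_t width, std::uint32_t height) {
    assert(width > 0 && height > 0); // this sketch does not handle empty windows
    ViewportAndScissor result{};
    result.viewport = Viewport{0.0f, 0.0f,
                               static_cast<float>(width),
                               static_cast<float>(height),
                               0.0f, 1.0f};
    result.scissor = Rect{0, 0, width, height};
    return result;
}
```

    Note that the viewport stores floats while the scissor stores integers, which is why the two structures are filled separately even when they describe the same area.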

    Pipelines and descriptor sets give you a lot of flexibility: the whole pipeline state and the resources seen by the shaders are described explicitly.

    Here is main.cpp:

    #include "Engine/context.hpp"
    #include "System/exception.hpp"
    #include "System/Vulkan/Pipeline/commandpool.hpp"
    #include "System/Vulkan/Pipeline/builder/gbufferpipelinebuilder.hpp"
    #include "System/Vulkan/Synchronisation/fence.hpp"
    #include "System/Vulkan/Memory/buffer.hpp"
    #include <cstring>  // memcpy
    #include <iostream> // std::cout
    #include <memory>   // std::unique_ptr
    #include <vector>
    
    void init(Context &context, CommandPool &commandPool, std::unique_ptr<Pipeline> &pipeline, Buffer &buf) {
        GBufferPipelineBuilder builder;
        // Build the pipeline
        pipeline = builder.build(context, context.surfaceWindow().renderPass(), 0);
        commandPool.reset();
    
        VkClearValue value;
        value.color.float32[0] = 0.;
        value.color.float32[1] = 0.;
        value.color.float32[2] = 0.;
        value.color.float32[3] = 1.;
    
        std::vector<VkMemoryBarrier> memoryBarrier;
        std::vector<VkBufferMemoryBarrier> bufferBarrier;
        std::vector<VkImageMemoryBarrier> imageBarrier(1);
    
        VkImageSubresourceRange range;
    
        range.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
        range.baseArrayLayer = 0;
        range.baseMipLevel = 0;
        range.layerCount = 1;
        range.levelCount = 1;
    
        // Fields shared by both image barriers
        imageBarrier[0].sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
        imageBarrier[0].pNext = nullptr;
        imageBarrier[0].srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        imageBarrier[0].dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        imageBarrier[0].subresourceRange = range;
    
        float color[] = {1.0, 0.0, 1.0, 1.0};
        memcpy(pipeline->buffer(0).map<float>(), color, sizeof color);
    
        for(int i = 0; i < 4; ++i) {
            commandPool.allocateCommandBuffer();
            commandPool.beginCommandBuffer(i);
    
            imageBarrier[0].srcAccessMask = VK_ACCESS_MEMORY_READ_BIT;
            imageBarrier[0].dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
            imageBarrier[0].oldLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
            imageBarrier[0].newLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
            imageBarrier[0].image = context.surfaceWindow().image(i);
            commandPool.commandBarrier(i,
                                       VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                                       VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                                       VK_FALSE, memoryBarrier, bufferBarrier, imageBarrier);
    
            commandPool.beginRenderPass(i, context.surfaceWindow().frameBuffer(i), context.surfaceWindow().renderPass(), {value});
    
            VkBuffer bufs[] = {buf};
            VkDeviceSize sizes[] = {0};
    
            pipeline->bind(*commandPool.commandBuffer(i));
            vkCmdBindVertexBuffers(*commandPool.commandBuffer(i), 0, 1, bufs, sizes);
            pipeline->bindDescriptorSets(*commandPool.commandBuffer(i));
            vkCmdDraw(*commandPool.commandBuffer(i), 3, 1, 0, 0);
    
            commandPool.endRenderPass(i);
    
            imageBarrier[0].srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
            imageBarrier[0].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;
            imageBarrier[0].oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
            imageBarrier[0].newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
            imageBarrier[0].image = context.surfaceWindow().image(i);
    
            commandPool.commandBarrier(i,
                                       VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                                       VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                                       VK_FALSE, memoryBarrier, bufferBarrier, imageBarrier);
    
            commandPool.endCommandBuffer(i);
        }
    }
    
    void mainLoop(Context &context) {
        Fence fence(context.device(), 1);
        CommandPool commandPool(context.device(), 0);
        std::unique_ptr<Pipeline> pipeline;
    
        // A triangle
        float vertices[] = {-0.5, -0.5, 1, 1,
                            0.5, -0.5, 1, 1,
                            0.0, 0.5, 1, 1};
    
        // Triangle to buffer
        Buffer buf(context.device(), context.memoryPool(), VK_BUFFER_USAGE_VERTEX_BUFFER_BIT, sizeof(vertices));
        memcpy(buf.map<float>(), vertices, sizeof(vertices));
    
        while(context.surfaceWindow().isRunning()) {
            context.surfaceWindow().updateEvent();
            if(context.surfaceWindow().neetToInit()) {
                init(context, commandPool, pipeline, buf);
                std::cout << "Initialisation" << std::endl;
                context.surfaceWindow().initDone();
            }
            context.surfaceWindow().begin();
            fence.reset(0);
    
            context.queue().submit(commandPool.commandBuffer(context.surfaceWindow().currentSwapImage()), 1, *fence.fence(0));
            fence.wait();
            context.surfaceWindow().end(context.queue());
        }
    }
    
    int main()
    {
        Context c(true);
    
        mainLoop(c);
    
        glfwTerminate();
    
        return 0;
    }

    And now, we have our perfect triangle !!!!

    Figure: triangle rendered using pipelines and shaders
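
    The vertex buffer in mainLoop holds three 4D positions. That is where the stride of 4 * sizeof(float) in the builder and the vertex count of 3 given to vkCmdDraw come from. A small sketch of that arithmetic, where `Vertex`, `VertexLayout`, `positionOnlyLayout` and `vertexCount` are names I made up for the illustration (no Vulkan header is needed):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One vertex as read by the vertex shader: a 4D position.
struct Vertex {
    float x, y, z, w;
};

static_assert(sizeof(Vertex) == 4 * sizeof(float), "Vertex must be tightly packed");

// Stand-in for the stride / offset fields of the binding and attribute
// descriptions (hypothetical type).
struct VertexLayout {
    std::uint32_t stride;         // bytes between two consecutive vertices
    std::uint32_t positionOffset; // byte offset of the position in a vertex
};

constexpr VertexLayout positionOnlyLayout() {
    return VertexLayout{static_cast<std::uint32_t>(sizeof(Vertex)),
                        static_cast<std::uint32_t>(offsetof(Vertex, x))};
}

// Number of vertices stored in a buffer of byteSize bytes.
inline std::size_t vertexCount(std::size_t byteSize) {
    assert(byteSize % sizeof(Vertex) == 0); // no partial vertex allowed
    return byteSize / sizeof(Vertex);
}
```

    With the 12 floats of the triangle, vertexCount(sizeof(vertices)) gives the 3 passed to vkCmdDraw.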

    Barriers and explanation of main.cpp

    I am going to explain quickly what memory barriers are.
    The idea behind a memory barrier is to ensure that writes have been performed and are visible.
    After a compute or a render operation, it is your duty to ensure that the data is visible before you re-use it.

    In our main.cpp example, I draw a triangle into a frame buffer and present it.

    The first barrier is:

            imageBarrier[0].srcAccessMask = VK_ACCESS_MEMORY_READ_BIT;
            imageBarrier[0].dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
            imageBarrier[0].oldLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
            imageBarrier[0].newLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
            imageBarrier[0].image = context.surfaceWindow().image(i);
            commandPool.commandBarrier(i,
                                       VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                                       VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                                       VK_FALSE, memoryBarrier, bufferBarrier, imageBarrier);

    An image barrier is made of access masks and layouts, and the pipeline barrier that submits it adds the stage masks.
    Since presentation reads the image, srcAccessMask is VK_ACCESS_MEMORY_READ_BIT.
    Now we want to render into this image via a framebuffer, so dstAccessMask is VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT.

    We were presenting the image, and now we want to render into it, so the layouts follow directly: from VK_IMAGE_LAYOUT_PRESENT_SRC_KHR to VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL.
    When we submit an image memory barrier to the command buffer, we have to tell it which stages are affected. Here, we wait for all commands, and the destination is the first stage of the pipeline.

    The second image memory barrier is:

            imageBarrier[0].srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
            imageBarrier[0].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;
            imageBarrier[0].oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
            imageBarrier[0].newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
            imageBarrier[0].image = context.surfaceWindow().image(i);
    
            commandPool.commandBarrier(i,
                                       VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                                       VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                                       VK_FALSE, memoryBarrier, bufferBarrier, imageBarrier);

    The only differences are the order (access masks and layouts are swapped) and the stage masks. Here we wait for the color attachment output stage (and not the fragment shader stage!) and the destination is the bottom of the pipe, i.e. the very end of the stages (it is not really easy to explain… but it is not illogical).
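
    To make the symmetry between the two barriers explicit, here is a minimal sketch with stand-in types. `Access`, `Layout`, `Stage`, `Transition`, `presentToRender` and `backTo` are hypothetical names for this illustration only; the enumerators just name the Vulkan values used in main.cpp and nothing here calls Vulkan.

```cpp
#include <cassert>

// Stand-ins naming the Vulkan values used in this article (hypothetical).
enum class Access { MemoryRead, ColorAttachmentWrite };
enum class Layout { PresentSrc, ColorAttachmentOptimal };
enum class Stage { AllCommands, TopOfPipe, ColorAttachmentOutput, BottomOfPipe };

struct Transition {
    Access srcAccess, dstAccess;
    Layout oldLayout, newLayout;
    Stage srcStage, dstStage;
};

// First barrier of main.cpp: the image was presented, we now render into it.
inline Transition presentToRender() {
    return Transition{Access::MemoryRead, Access::ColorAttachmentWrite,
                      Layout::PresentSrc, Layout::ColorAttachmentOptimal,
                      Stage::AllCommands, Stage::TopOfPipe};
}

// Second barrier: access masks and layouts are swapped, while the stage
// masks are chosen explicitly because they are not a simple swap.
inline Transition backTo(const Transition &t, Stage srcStage, Stage dstStage) {
    assert(t.oldLayout != t.newLayout); // this sketch only models layout changes
    return Transition{t.dstAccess, t.srcAccess,
                      t.newLayout, t.oldLayout,
                      srcStage, dstStage};
}
```

    Calling backTo(presentToRender(), Stage::ColorAttachmentOutput, Stage::BottomOfPipe) gives the values of the second barrier.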

    Steps to render something using pipelines are:

    1. Create pipelines
    2. Create command pools and command buffers, and begin the command buffers
    3. Create vertex / index buffers
    4. Bind pipelines to their subpass, bind buffers and descriptor sets
    5. vkCmdDraw
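
    The recording order behind these steps can be sketched with a tiny mock. `Recorder` and `recordTriangle` are hypothetical: they log command names instead of calling Vulkan, and the assertions only mirror the order used in this article (they are stricter than what Vulkan itself demands in some places).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical recorder: it logs command names instead of calling Vulkan.
class Recorder {
public:
    void begin() {
        assert(!mRecording);
        mRecording = true;
        mLog.push_back("begin");
    }
    void beginRenderPass() {
        assert(mRecording && !mInRenderPass);
        mInRenderPass = true;
        mLog.push_back("beginRenderPass");
    }
    void bindPipeline() {
        assert(mRecording);
        mPipelineBound = true;
        mLog.push_back("bindPipeline");
    }
    void bindVertexBuffer() {
        assert(mRecording);
        mVertexBufferBound = true;
        mLog.push_back("bindVertexBuffer");
    }
    void bindDescriptorSets() {
        assert(mRecording);
        mLog.push_back("bindDescriptorSets");
    }
    void draw() {
        // In this example a draw needs a render pass, a pipeline and vertices.
        assert(mInRenderPass && mPipelineBound && mVertexBufferBound);
        mLog.push_back("draw");
    }
    void endRenderPass() {
        assert(mInRenderPass);
        mInRenderPass = false;
        mLog.push_back("endRenderPass");
    }
    void end() {
        assert(mRecording && !mInRenderPass);
        mRecording = false;
        mLog.push_back("end");
    }
    const std::vector<std::string> &log() const { return mLog; }

private:
    bool mRecording = false;
    bool mInRenderPass = false;
    bool mPipelineBound = false;
    bool mVertexBufferBound = false;
    std::vector<std::string> mLog;
};

// Same order as the loop body of init(), without the two image barriers.
inline void recordTriangle(Recorder &r) {
    r.begin();
    r.beginRenderPass();
    r.bindPipeline();
    r.bindVertexBuffer();
    r.bindDescriptorSets();
    r.draw();
    r.endRenderPass();
    r.end();
}
```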

    References

    Specification

    It was a long article; I hope it was not unclear and that I didn’t make too many mistakes ^^.

    Kiss !!!!