Author: Antoine MORRIER

Barriers in Vulkan : They are not that difficult
Hi!
Yes, I know, I lied. I said that my next article would be about buffers or images but, in the end, I prefer to talk about barriers first. However, barriers are, IMHO, really difficult to understand well, so this article might contain some mistakes.
If you find one, please let me know by mail or in a comment.
By the way, some parts of this article may remind you of the articles on GPU Open: “Performance Tweets series: Barriers, fences, synchronization” and “Vulkan barriers explained”.

What are memory barriers for?

Memory barriers are a source of bugs.
More seriously, barriers are used for three (actually four) things.
1. Execution barrier (synchronization): to ensure that prior commands have finished
2. Memory barrier (memory visibility / availability): to ensure that prior writes are visible
3. Layout transition (useful for images): to optimize the usage of the resource
4. Reformatting
I am not going to talk about reformatting because (it is a shame) I am not very confident with it.

What exactly is an execution barrier?

An execution barrier may remind you of a mutex between CPU threads. You write something to a resource. When you want to read what you wrote, you must wait until the write has finished.

What exactly is a memory barrier?

When you write something from one thread, the write may land in some caches, and you must flush them to ensure the data is visible where you want to read it. That is what memory barriers are for.
They also perform layout transitions for images, to get the best performance out of your graphics card.

How it is done in Vulkan

Now that we understand why barriers are so important, let us see how to use them in Vulkan.

Vulkan’s Pipeline

To keep it simple, a command enters the pipeline at the TOP_OF_PIPE stage and ends at the BOTTOM_OF_PIPE stage.
There is also an extra stage that refers to the host.

Barriers between stages

We are going to see two examples (inspired by GPU Open).
We will begin with the worst case: your first command writes at each stage, everywhere it is possible, and your second command reads at each stage, everywhere it is possible.
It simply means that you want to wait for the first command to finish completely before the second one begins.

To keep it simple, here is how to read the scheme:
- In gray: all the stages that need to be executed before or after the barrier (or the ones that are never reached)
- In red: above the barrier, where the data are produced. Below the barrier, where the data are consumed.
- In green: the unblocked stages. You should try to have as many green stages as possible.
As you can see, there are no green stages here, so it is not good at all for performance.

In Vulkan C++, you should have something like that:
```
cmd.pipelineBarrier(
vk::PipelineStageFlagBits::eAllCommands, 
vk::PipelineStageFlagBits::eAllCommands, ...);
```
Some people use BOTTOM_OF_PIPE as the source and TOP_OF_PIPE as the destination. It is not wrong, but it is useful only for an execution barrier. These stages do not access memory, so they cannot make memory accesses visible or even available! You should not (must not?) issue a memory barrier on these stages, but we are going to see that later.

Now, we are going to see a better case.
Imagine your first command fills an image or a buffer (SSBO or imageStore) in the VERTEX_SHADER stage. Now imagine you want to use these data in the TESSELLATION_EVALUATION_SHADER stage.
The prior scheme, after modification, is:

As you can see, there are a lot of green stages and that is very good!
The Vulkan C++ code should be:
```
cmd.pipelineBarrier(
vk::PipelineStageFlagBits::eVertexShader,
vk::PipelineStageFlagBits::eTessellationEvaluationShader,...);
```
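To make the color coding of the two schemes concrete, here is a small toy model in plain C++ (no Vulkan involved). It treats the graphics pipeline as a simple ordered list of stages, which is a simplification, and counts the stages that stay green for a given source / destination pair. The `Stage` enum and the `unblockedStages` function are names invented for this sketch, not part of the Vulkan API.

```cpp
#include <cassert>
#include <cstddef>

// Simplified, linear view of the graphics pipeline (hypothetical enum, not the Vulkan one).
enum class Stage : std::size_t {
    TopOfPipe, DrawIndirect, VertexInput, VertexShader,
    TessControlShader, TessEvaluationShader, GeometryShader,
    EarlyFragmentTests, FragmentShader, LateFragmentTests,
    ColorAttachmentOutput, BottomOfPipe, Count
};

std::size_t stageIndex(Stage s) { return static_cast<std::size_t>(s); }

// Number of "green" stages for a barrier (src -> dst) in this toy model:
//  - stages of the first command located after src do not have to finish,
//  - stages of the second command located before dst may already start.
std::size_t unblockedStages(Stage src, Stage dst) {
    assert(src != Stage::Count && dst != Stage::Count);
    const std::size_t last = stageIndex(Stage::Count) - 1;
    return (last - stageIndex(src)) + stageIndex(dst);
}
```

For counting purposes, the worst case (ALL_COMMANDS on both sides) behaves like `unblockedStages(Stage::BottomOfPipe, Stage::TopOfPipe)`, which leaves no green stage at all, whereas `unblockedStages(Stage::VertexShader, Stage::TessEvaluationShader)` leaves most of the two pipelines free to overlap.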
By Region or not?

This part may contain errors, so please let me know if you disagree with me.
To begin, what does “by region” mean?
A region is a small part of your framebuffer. If you specify a by-region dependency, it means that (in framebuffer space) operations need to be finished only in the region (which is implementation specific) and not in the whole image.
Well, it is not clear what framebuffer space is. In my opinion, and after reading the documentation, it could be from EARLY_FRAGMENT_TESTS (or at least FRAGMENT_SHADER if early depth testing is not enabled) to COLOR_ATTACHMENT_OUTPUT.

Actually, to me, this flag lets the driver optimize a bit. However, it should be used only between subpasses, for subpass input attachments (and IMHO it should not be useful elsewhere).
But I may be wrong!

Everything above in this part is wrong. If you want a plain explanation, see the comment from devsh. To make it simple, it means that the barrier will operate only on “one pixel” of the image. It can be used for input attachments or for a depth pre-pass, for example.

Memory Barriers

Okay, we have now seen how to make a pure execution barrier (that is, one without memory barriers).
Memory barriers ensure availability for the first half of the memory dependency and visibility for the second half. We can see them as a “flush” and an “invalidation”. Making information available does not mean that it is visible.
In each kind of memory barrier you will have a srcAccessMask and a dstAccessMask.
How do they work?

Access and stage are somewhat coupled. For each stage in srcStage, all memory accesses using the set of access types defined in srcAccessMask will be made available. It can be seen as a flush of the caches defined by srcAccessMask in all those stages.

For dstStage / dstAccessMask, it is the same thing but, instead of making information available, the information is made visible to these stages and these accesses.

That is why using BOTTOM_OF_PIPE or TOP_OF_PIPE is meaningless for a memory barrier.
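The difference between “available” and “visible” can be illustrated with a toy model in plain C++. This is only an analogy, assuming one memory cell, one writer-side cache and one reader-side cache; real hardware is not specified this way, and `ToyMemory` and its member names are invented for the sketch.

```cpp
#include <cassert>

// Toy model: one memory cell, a writer-side cache and a reader-side cache.
struct ToyMemory {
    int memory = 0;            // value in main memory
    bool writerDirty = false;  // the writer holds a value not yet written back
    int writerValue = 0;
    bool readerValid = false;  // the reader holds a (possibly stale) copy
    int readerValue = 0;

    // A write lands in the writer's cache only.
    void write(int v) { writerValue = v; writerDirty = true; }

    // The reader uses its cached copy when it has one.
    int read() {
        if (!readerValid) { readerValue = memory; readerValid = true; }
        return readerValue;
    }

    // First half of the dependency: "make available" (flush the writes).
    void makeAvailable() {
        if (writerDirty) { memory = writerValue; writerDirty = false; }
    }

    // Second half: "make visible" (invalidate the reader's cache).
    void makeVisible() { readerValid = false; }
};

// Walks through the sequence described in the text.
void walkthrough() {
    ToyMemory m;
    assert(m.read() == 0);  // the reader now caches the old value
    m.write(7);
    m.makeAvailable();      // flushed to memory...
    assert(m.read() == 0);  // ...but the reader still sees its stale copy
    m.makeVisible();        // invalidate the reader's cache
    assert(m.read() == 7);  // now the write is visible
}
```

In this analogy, srcAccessMask selects what `makeAvailable` flushes and dstAccessMask selects what `makeVisible` invalidates; skipping either half leaves the reader with the old value.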

For buffer and image barriers, you can also release ownership of the resource you are using from one queue family to another.
For example, you upload the image on a queue that is only used for transfers. At the end, you must release it from the transfer queue to the compute (or graphics) queue.

Global Memory Barriers

This kind of memory barrier applies to all memory objects that exist at the time of its execution.
I do not have any example of when to use this kind of memory barrier. Maybe, if you have a lot of barriers to issue, it is better to use a global memory barrier.
An example:
```
vk::MemoryBarrier(
vk::AccessFlagBits::eMemoryWrite,
vk::AccessFlagBits::eMemoryRead);
```
Buffer Memory Barriers

Here, the access masks apply only to the buffer named in the barrier.
Here is the example:
```
vk::BufferMemoryBarrier(
vk::AccessFlagBits::eTransferWrite,
vk::AccessFlagBits::eShaderRead,
transferFamillyIndex,  // srcQueueFamilyIndex
queueFamillyIndex,     // dstQueueFamilyIndex
buffer,                // the vk::Buffer the barrier applies to
0, VK_WHOLE_SIZE);     // offset, size
```
Image Memory Barriers

Image memory barriers have another use: they can perform layout transitions.

Example:
I want to create the mipmaps of an image (we will see the complete function in another article) through vkCmdBlitImage.
After a vkCmdBlitImage, I want to use the mipmap level I just wrote as the source for the next mipmap level.

oldLayout must be TRANSFER_DST_OPTIMAL and newLayout must be TRANSFER_SRC_OPTIMAL.
Which kind of access did I make, and which kind of access will I make?
That is easy: I performed a TRANSFER_WRITE and I want to perform a TRANSFER_READ.
At which stage does my last command “finish”, and at which stage does my new command “begin”? Both in the TRANSFER stage.

In C++ it is done with something like this:
```
cmd.blitImage();
vk::ImageMemoryBarrier imageBarrier(
vk::AccessFlagBits::eTransferWrite,
vk::AccessFlagBits::eTransferRead,
vk::ImageLayout::eTransferDstOptimal,
vk::ImageLayout::eTransferSrcOptimal,
0, 0, image, subResourceRange);

cmd.pipelineBarrier(
vk::PipelineStageFlagBits::eTransfer,
vk::PipelineStageFlagBits::eTransfer,
vk::DependencyFlags(),
nullptr, nullptr, imageBarrier);
```
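As a rough sketch of the loop this barrier would sit in, the helper below only computes the extent of every mip level on the CPU, assuming the usual rule that each level halves both dimensions (rounded down) and never goes below 1. `MipExtent` and `mipChain` are names made up for this illustration; no Vulkan call is made.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct MipExtent {
    std::uint32_t width;
    std::uint32_t height;
};

// Extents of every mip level, level 0 being the full image.
std::vector<MipExtent> mipChain(std::uint32_t width, std::uint32_t height) {
    assert(width > 0 && height > 0);
    std::vector<MipExtent> levels;
    levels.push_back({width, height});
    while (width > 1 || height > 1) {
        width = std::max<std::uint32_t>(1, width / 2);
        height = std::max<std::uint32_t>(1, height / 2);
        levels.push_back({width, height});
    }
    return levels;
}
```

With such a list, the generation loop would, for each level i starting at 1, blit level i - 1 into level i and then issue the image barrier above on level i, so that it can be read as the source of level i + 1.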
I hope that you enjoyed this article and that you have learned something. Synchronization in Vulkan is not easy to handle, and all I wrote may (surely?) contain some errors.

Reference:

Memory barriers on TOP_OF_PIPE #128
Specs
November 10, 2016

Vulkan Memory Management : How to write your own allocator

Hi! This article deals with memory management in Vulkan. But first, I am going to tell you what has happened in my life.

State of my life

Again, it has been more than one month since I wrote anything. So, where am I? I am in my last year at Télécom SudParis. I am following the High Tech Imaging courses, the image specialization of my school. The funny part is that, in parallel, I am a lecturer in a video games specialization. I taught OpenGL (3.3, because I cannot give an OpenGL 4 course: not everyone has good enough hardware for that). I got an internship at Dassault Systemes (France). It will begin on the first of February. I will work on the soft shadow engine (OpenGL 4.5).

Vulkan

To begin, some articles that I wrote before this one may contain mistakes, or things that are not well explained or not very optimized.

Why come back to Vulkan?

I came back to Vulkan because I wanted to make one of the first “amateur” renderers using Vulkan. I also wanted to get better at memory management, memory barriers and other joys like that. Moreover, I made a repository with “a lot” of Vulkan examples: Vulkan example repository.
I do not mean to replace the Sascha Willems ones, but I propose my way of doing it, in C++, using Vulkan HPP.

Memory Management with Vulkan

Different kinds of memory

Heap

A graphics card can read memory from different heaps. It can read memory from its own heap or from the system heap (RAM).

Type

There are different memory types. For example, some memory types are host cached, host coherent, device local, and so on.

Host and device

Host

This memory resides in RAM. This heap generally has one (or several) memory types with the HOST_VISIBLE bit. It tells Vulkan that the memory can be mapped, even persistently. That way, you get a pointer and you can write to the memory from the CPU.

Device Local

This memory resides on the graphics card. It is freaking fast and is generally not HOST_VISIBLE. That means you have to use a staging resource, or the GPU itself, to write something to it.

Allocation in Vulkan

In Vulkan, the number of allocations is limited by the driver (see maxMemoryAllocationCount in VkPhysicalDeviceLimits). That means you cannot make a lot of allocations: you must not use one allocation per buffer or image, but one allocation for several buffers and images.
In this article, I will not deal with the CPU cache or anything like that; I will focus only on how to get the best from the GPU side.

Memory management: good and bad

How will we do it?

Memory management: device allocator
As you can see, we have a block, which represents the memory for one buffer or one image; a chunk, which represents one allocation (via vkAllocateMemory); and a DeviceAllocator, which manages all the chunks.

Block

I defined a block as follows:

struct Block {
    vk::DeviceMemory memory;
    vk::DeviceSize offset;
    vk::DeviceSize size;
    bool free;
    void *ptr = nullptr; // Useless if it is a GPU allocation

    bool operator==(Block const &block) const;
};
bool Block::operator==(Block const &block) const {
    // Two blocks are equal when every field matches
    return memory == block.memory &&
           offset == block.offset &&
           size == block.size &&
           free == block.free &&
           ptr == block.ptr;
}

A block, as its name suggests, defines a small region within one allocation.
So it has an offset, a size, and a boolean to know whether it is free or not.
It may also hold a ptr when the allocation is host visible (mapped).

Chunk

A chunk is a memory region that contains a list of blocks. It represents a single allocation.
What does a chunk let us do?

- Allocate a block
- Deallocate a block
- Tell us whether a block is inside the chunk

That gives us:

#pragma once
#include "block.hpp"

class Chunk : private NotCopyable {
public:
    Chunk(Device &device, vk::DeviceSize size, int memoryTypeIndex);

    bool allocate(vk::DeviceSize size, vk::DeviceSize alignment, Block &block);
    bool isIn(Block const &block) const;
    void deallocate(Block const &block);
    int memoryTypeIndex() const;

    ~Chunk();

protected:
    Device mDevice;
    vk::DeviceMemory mMemory = VK_NULL_HANDLE;
    vk::DeviceSize mSize;
    int mMemoryTypeIndex;
    std::vector<Block> mBlocks;
    void *mPtr = nullptr;
};

One chunk allocates its memory inside the constructor.

Chunk::Chunk(Device &device, vk::DeviceSize size, int memoryTypeIndex) :
    mDevice(device),
    mSize(size),
    mMemoryTypeIndex(memoryTypeIndex) {
    vk::MemoryAllocateInfo allocateInfo(size, memoryTypeIndex);

    Block block;
    block.free = true;
    block.offset = 0;
    block.size = size;
    mMemory = block.memory = device.allocateMemory(allocateInfo);

    if((device.getPhysicalDevice().getMemoryProperties().memoryTypes[memoryTypeIndex].propertyFlags & vk::MemoryPropertyFlagBits::eHostVisible) == vk::MemoryPropertyFlagBits::eHostVisible)
        mPtr = device.mapMemory(mMemory, 0, VK_WHOLE_SIZE);

    mBlocks.emplace_back(block);
}

While a deallocation is really easy (just mark the block as free), an allocation requires a bit more attention. You need to check whether the block is free and, if it is, check its size and, when the allocation is smaller than the available size, create another block for the remainder. You also need to take care of memory alignment!

void Chunk::deallocate(const Block &block) {
    auto blockIt(std::find(mBlocks.begin(), mBlocks.end(), block));
    assert(blockIt != mBlocks.end());
    // Just put the block to free
    blockIt->free = true;
}

bool Chunk::allocate(vk::DeviceSize size, vk::DeviceSize alignment, Block &block) {
    // if chunk is too small
    if(size > mSize)
        return false;

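    for(std::size_t i = 0; i < mBlocks.size(); ++i) {
        if(mBlocks[i].free) {
            // Padding needed to bring the block offset up to the requested alignment
            vk::DeviceSize padding = 0;
            if(mBlocks[i].offset % alignment != 0)
                padding = alignment - mBlocks[i].offset % alignment;

            // The padding alone may not fit: skip the block to avoid an unsigned underflow
            if(padding > mBlocks[i].size)
                continue;

            // Compute virtual size after taking care about offsetAlignment
            // (kept as a 64-bit vk::DeviceSize, uint32_t would truncate large blocks)
            vk::DeviceSize newSize = mBlocks[i].size - padding;

            // If match
            if(newSize >= size) {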

                // We compute offset and size that care about alignment (for this Block)
                mBlocks[i].size = newSize;
                if(mBlocks[i].offset % alignment != 0)
                    mBlocks[i].offset += alignment - mBlocks[i].offset % alignment;

                // Compute the ptr address
                if(mPtr != nullptr)
                    mBlocks[i].ptr = (char*)mPtr + mBlocks[i].offset;

                // if perfect match
                if(mBlocks[i].size == size) {
                    mBlocks[i].free = false;
                    block = mBlocks[i];
                    return true;
                }

                Block nextBlock;
                nextBlock.free = true;
                nextBlock.offset = mBlocks[i].offset + size;
                nextBlock.memory = mMemory;
                nextBlock.size = mBlocks[i].size - size;
                mBlocks.emplace_back(nextBlock); // We add the newBlock

                mBlocks[i].size = size;
                mBlocks[i].free = false;

                block = mBlocks[i];
                return true;
            }
        }
    }

    return false;
}
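The alignment arithmetic used in allocate can be isolated in two small helpers. This is a minimal sketch with plain 64-bit integers instead of vk::DeviceSize; `alignUp` and `paddingFor` are names introduced here for illustration and are not part of the allocator above.

```cpp
#include <cassert>
#include <cstdint>

// Smallest multiple of `alignment` that is >= `offset`.
std::uint64_t alignUp(std::uint64_t offset, std::uint64_t alignment) {
    assert(alignment != 0);
    const std::uint64_t remainder = offset % alignment;
    return remainder == 0 ? offset : offset + (alignment - remainder);
}

// Bytes lost in front of a block in order to honour the alignment.
std::uint64_t paddingFor(std::uint64_t offset, std::uint64_t alignment) {
    return alignUp(offset, alignment) - offset;
}
```

For example, a free block starting at offset 100 with a required alignment of 256 really starts at 256, so 156 bytes of it are lost to padding and only the rest can serve the request.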

Chunk Allocator

Maybe it is badly named, but the chunk allocator lets us separate the creation of a chunk from the chunk itself. We give it a size and it performs all the verifications we need.

class ChunkAllocator : private NotCopyable
{
public:
    ChunkAllocator(Device &device, vk::DeviceSize size);

    // if size > mSize, allocate to the next power of 2
    std::unique_ptr<Chunk> allocate(vk::DeviceSize size, int memoryTypeIndex);

private:
    Device mDevice;
    vk::DeviceSize mSize;
};

vk::DeviceSize nextPowerOfTwo(vk::DeviceSize size) {
    // Smallest power of two >= size, computed with integers only
    // (log2l on a 64-bit size can round and give a wrong exponent).
    // Assumes size <= 2^63 so that the shift cannot overflow.
    vk::DeviceSize power = 1;
    while(power < size)
        power <<= 1;
    return power;
}

bool isPowerOfTwo(vk::DeviceSize size) {
    // A power of two has exactly one bit set
    return size != 0 && (size & (size - 1)) == 0;
}

ChunkAllocator::ChunkAllocator(Device &device, vk::DeviceSize size) :
    mDevice(device),
    mSize(size) {
    assert(isPowerOfTwo(size));
}

std::unique_ptr<Chunk> ChunkAllocator::allocate(vk::DeviceSize size,
                                                int memoryTypeIndex) {
    size = (size > mSize) ? nextPowerOfTwo(size) : mSize;

    return std::make_unique<Chunk>(mDevice, size, memoryTypeIndex);
}

Device Allocator

I began by making an abstract class for Vulkan allocation:

/**
 * @brief The AbstractAllocator lets the user allocate or deallocate some blocks
 */
class AbstractAllocator : private NotCopyable
{
public:
    AbstractAllocator(Device const &device) :
        mDevice(std::make_shared<Device>(device)) {

    }

    virtual Block allocate(vk::DeviceSize size, vk::DeviceSize alignment, int memoryTypeIndex) = 0;
    virtual void deallocate(Block &block) = 0;

    Device getDevice() const {
        return *mDevice;
    }

    virtual ~AbstractAllocator() = 0;

protected:
    std::shared_ptr<Device> mDevice;
};

inline AbstractAllocator::~AbstractAllocator() {

}

As you noticed, it is really simple: you can allocate or deallocate from this allocator. Next, I created a DeviceAllocator that inherits from AbstractAllocator.

class DeviceAllocator : public AbstractAllocator
{
public:
    DeviceAllocator(Device device, vk::DeviceSize size);

    Block allocate(vk::DeviceSize size, vk::DeviceSize alignment, int memoryTypeIndex);
    void deallocate(Block &block);


private:
    ChunkAllocator mChunkAllocator;
    std::vector<std::shared_ptr<Chunk>> mChunks;
};

This allocator contains a list of chunks and one ChunkAllocator to allocate chunks.
The allocation is really easy. We check whether a “good chunk” exists and whether we can allocate from it. Otherwise, we create another chunk and it is over!

DeviceAllocator::DeviceAllocator(Device device, vk::DeviceSize size) :
    AbstractAllocator(device),
    mChunkAllocator(device, size) {

}

Block DeviceAllocator::allocate(vk::DeviceSize size, vk::DeviceSize alignment, int memoryTypeIndex) {
    Block block;
    // We search a "good" chunk
    for(auto &chunk : mChunks)
        if(chunk->memoryTypeIndex() == memoryTypeIndex)
            if(chunk->allocate(size, alignment, block))
                return block;

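    mChunks.emplace_back(mChunkAllocator.allocate(size, memoryTypeIndex));
    // Keep the call outside assert(): with NDEBUG the assert expression is not evaluated
    bool allocated = mChunks.back()->allocate(size, alignment, block);
    assert(allocated);
    (void)allocated;
    return block;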
}

void DeviceAllocator::deallocate(Block &block) {
    for(auto &chunk : mChunks) {
        if(chunk->isIn(block)) {
            chunk->deallocate(block);
            return ;
        }
    }
    assert(!"unable to deallocate the block");
}
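To experiment with the first-fit logic without a GPU, the chunk can be simulated with plain integers. The sketch below is a simplified stand-in, not the classes above: `SimBlock` and `SimChunk` are hypothetical names, no Vulkan handle is involved, and freed blocks are not merged back together.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for Block: only the bookkeeping fields are kept.
struct SimBlock {
    std::uint64_t offset;
    std::uint64_t size;
    bool free;
};

class SimChunk {
public:
    explicit SimChunk(std::uint64_t size) : mSize(size) {
        mBlocks.push_back({0, size, true}); // one big free block, as in the Chunk constructor
    }

    // First fit: take the first free block that is large enough once aligned.
    bool allocate(std::uint64_t size, std::uint64_t alignment, SimBlock &out) {
        assert(alignment != 0);
        if (size > mSize)
            return false;
        for (std::size_t i = 0; i < mBlocks.size(); ++i) {
            if (!mBlocks[i].free)
                continue;
            const std::uint64_t misalign = mBlocks[i].offset % alignment;
            const std::uint64_t padding = misalign == 0 ? 0 : alignment - misalign;
            if (padding > mBlocks[i].size || mBlocks[i].size - padding < size)
                continue;

            mBlocks[i].offset += padding; // the padding bytes are simply lost
            mBlocks[i].size -= padding;
            if (mBlocks[i].size > size) { // split: the tail stays free
                const SimBlock tail = {mBlocks[i].offset + size, mBlocks[i].size - size, true};
                mBlocks[i].size = size;
                mBlocks.push_back(tail);
            }
            mBlocks[i].free = false;
            out = mBlocks[i];
            return true;
        }
        return false;
    }

private:
    std::uint64_t mSize;
    std::vector<SimBlock> mBlocks;
};
```

A chunk of 1024 bytes that first serves 100 bytes (alignment 1) and then 200 bytes (alignment 256) returns the offsets 0 and 256: the second request skips the 156 bytes needed to reach the alignment, and the remainder of the chunk stays available as a free block.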

Conclusion

Since I came back to Vulkan, I have gained a much better understanding of this new API. I can write articles of better quality than in March.
I hope you enjoyed this remake of memory management.
My next article will be about buffers and staging resources. It will be a short article. I will also write an article that explains how to load textures and their mipmaps.

References

Vulkan Memory Management

Kisses and see you soon (probably this week !)

November 6, 2016

Indirect Rendering : “A way to a million draw calls”
Hello!
This time I am going to talk about Multi Draw Indirect (MDI) rendering. This feature lets you enjoy the benefits of both multi-draw and indirect drawing.

Where does the overhead come from?

Issuing a lot of commands

Issuing a draw call in GPU-based rendering is a really heavy operation for the CPU. Knowing this, drawing a lot of models can be really expensive. A naive draw loop could look like this:
```
foreach(object) {
    writeUniformData(object, uniformData);
    glDraw...();
}
```
The problem is solved using glMultiDraw.
The new code is:
```
foreach(object)
    writeUniformData(object, uniformData[i]);
glMultiDraw...();
```
Unknown data

Now, suppose you want to use culling to improve performance. You know that performing it on the GPU is more efficient than on the CPU, but you don’t know how to use the result without passing data from the GPU back to the CPU… This is where indirect drawing is useful.

Your old code is:
```
cullingOnCPU(allObject); // quite slow

foreach(object)
    if(object->isVisible())
        writeUniformData(object, uniformData[i]);
glMultiDraw...();
```
Using MDI, you could have something like that
```
cullingOnGPU(allObject);

foreach(object)
    writeUniformData(object, uniformData[i]);
glMultiDrawIndirect...();
```
And you don’t have to read the result back on the CPU.

ARB (MULTI) DRAW INDIRECT

Data and functions

This extension provides two structures to describe a draw call: one for glDrawArrays and one for glDrawElements.
```
typedef  struct {
    GLuint  count;
    GLuint  primCount;
    GLuint  first;
    GLuint  baseInstance;
} DrawArraysIndirectCommand;

typedef  struct {
    GLuint  count;
    GLuint  primCount;
    GLuint  firstIndex;
    GLint   baseVertex;
    GLuint  baseInstance;
} DrawElementsIndirectCommand;


void glMultiDrawArraysIndirect(GLenum mode,
                           const void *indirect,
                           GLsizei drawcount,
                           GLsizei stride);


void glMultiDrawElementsIndirect(GLenum mode,
                             GLenum type,
                             const void *indirect,
                             GLsizei drawcount,
                             GLsizei stride);
```
- count specifies the number of elements (vertices or indices) to be rendered
- primCount specifies the number of instances to be rendered (in our case, it will be 0 or 1)
- first specifies the position of the first vertex
- firstIndex specifies the position of the first index
- baseVertex specifies a constant that is added to each index before the vertex is fetched
- baseInstance specifies the base instance used when fetching instanced vertex attributes (a bit tricky, but I am going to explain that later).
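As an illustration of how these fields fit together, here is a sketch that fills one DrawElementsIndirectCommand per mesh on the CPU. It assumes that all meshes are packed back to back in one shared vertex buffer and one shared index buffer, and that baseInstance is simply the draw index; `MeshRange` and `buildCommands` are hypothetical names, and fixed-width integer types replace the GL typedefs so that the block compiles without OpenGL headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same field order as the DrawElementsIndirectCommand structure shown above.
struct DrawElementsIndirectCommand {
    std::uint32_t count;
    std::uint32_t primCount;
    std::uint32_t firstIndex;
    std::int32_t baseVertex;
    std::uint32_t baseInstance;
};

// Hypothetical description of one mesh stored in the shared buffers.
struct MeshRange {
    std::uint32_t indexCount;
    std::uint32_t vertexCount;
};

// Builds one command per mesh, assuming meshes are packed back to back.
std::vector<DrawElementsIndirectCommand> buildCommands(const std::vector<MeshRange> &meshes) {
    std::vector<DrawElementsIndirectCommand> commands;
    commands.reserve(meshes.size());
    std::uint32_t firstIndex = 0;
    std::int32_t baseVertex = 0;
    for (std::size_t i = 0; i < meshes.size(); ++i) {
        assert(meshes[i].indexCount > 0);
        commands.push_back({meshes[i].indexCount, 1u, firstIndex, baseVertex,
                            static_cast<std::uint32_t>(i)});
        firstIndex += meshes[i].indexCount;  // next mesh starts after these indices
        baseVertex += static_cast<std::int32_t>(meshes[i].vertexCount);
    }
    return commands;
}
```

The resulting vector is what would be uploaded to the indirect buffer; every command starts with primCount set to 1, i.e. visible.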

How to Use it

These structures should be put into an OpenGL buffer object bound to the GL_DRAW_INDIRECT_BUFFER target.
Suppose you have a big scene with 5 000 distinct objects and 100 000 meshes. You must have:
1. 5 000 matrices in an SSBO
2. “5 000” materials (not really true, but you get the idea) in an SSBO
3. 100 000 commands in your indirect buffer
4. An SSBO which contains the bounding box data per mesh (to perform culling for each mesh).
Now, what you want is to RENDER the whole scene. The steps to do that are:
1. Fill the matrices / materials / bounding boxes / indirect buffer
2. Dispatch a compute shader to perform culling
3. Issue a memory barrier
4. Render
The first step is straightforward.
The second is easy: you use the indirect buffer as an SSBO in the compute shader and set the primCount value to 0 if the mesh is not visible, or 1 otherwise.
For the third, since you intend to issue an indirect command, the barrier must include GL_COMMAND_BARRIER_BIT.
Finally, render.
```
fillBuffers();
glDispatchIndirect();
glMemoryBarrier(GL_COMMAND_BARRIER_BIT /*|GL_SHADER_STORAGE_BARRIER_BIT*/);
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
glBindVertexArray(vao);
glMultiDraw*Indirect();
```
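What the culling compute shader does to the indirect buffer can be emulated on the CPU. The sketch below is a reference version under simplifying assumptions: the visibility test is an axis-aligned bounding box against a list of planes whose normals point inside the visible volume, and `Plane`, `Aabb`, `Command`, `isOutside` and `cull` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Plane { float nx, ny, nz, d; };  // points with n.p + d >= 0 are inside
struct Aabb { float min[3]; float max[3]; };
struct Command {
    std::uint32_t count, primCount, firstIndex;
    std::int32_t baseVertex;
    std::uint32_t baseInstance;
};

// True when the box lies entirely on the negative side of the plane.
bool isOutside(const Aabb &box, const Plane &p) {
    // Pick the corner that lies furthest along the plane normal.
    const float x = p.nx >= 0.f ? box.max[0] : box.min[0];
    const float y = p.ny >= 0.f ? box.max[1] : box.min[1];
    const float z = p.nz >= 0.f ? box.max[2] : box.min[2];
    return p.nx * x + p.ny * y + p.nz * z + p.d < 0.f;
}

// One iteration of the loop corresponds to one compute shader invocation.
void cull(std::vector<Command> &commands, const std::vector<Aabb> &boxes,
          const std::vector<Plane> &planes) {
    assert(commands.size() == boxes.size());
    for (std::size_t i = 0; i < commands.size(); ++i) {
        bool visible = true;
        for (const Plane &p : planes) {
            if (isOutside(boxes[i], p)) { visible = false; break; }
        }
        commands[i].primCount = visible ? 1u : 0u;  // 0 disables the draw
    }
}
```

On the GPU the same write happens through the SSBO view of the indirect buffer, which is why the barrier is needed before the draw consumes the commands.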
Beautiful! But how do I know which data I have to use?
1. The first way is to use gl_DrawIDARB, which is pretty explicit.
2. The way we are going to see, and the one I advise, is to use the baseInstance from the structures seen previously.
Why is gl_DrawIDARB not convenient? Simply because it is slower than the second way on most implementations, and because we will not be able to use ARB INDIRECT PARAMETERS with it.

So, for the second way, we must add one or several buffers to the prior list (two in our case: one for indexing the matrix buffer and one for indexing the material buffer). These buffers will contain integer values (the index of the matrix / material in their SSBO). Because they will be used through baseInstance, these buffers will be vertex buffers using a divisor set through glVertexBindingDivisor.
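The baseInstance trick can be checked on the CPU with a tiny emulation of the attribute fetch. For an instanced attribute with a non-zero divisor, the element that is read is floor(instance / divisor) + baseInstance; the helper names below (`instancedElement`, `matrixIndexForDraw`) are invented for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Element of an instanced vertex attribute fetched for a given instance.
std::uint32_t instancedElement(std::uint32_t instanceId, std::uint32_t divisor,
                               std::uint32_t baseInstance) {
    assert(divisor != 0);  // a divisor of 0 means "per vertex", not handled here
    return instanceId / divisor + baseInstance;
}

// Emulates the matrix index a vertex shader would receive for one draw.
std::uint32_t matrixIndexForDraw(const std::vector<std::uint32_t> &matrixIndices,
                                 std::uint32_t baseInstance) {
    // primCount is 1, so the only instance is instance 0.
    const std::uint32_t element = instancedElement(0, 1, baseInstance);
    assert(element < matrixIndices.size());
    return matrixIndices[element];
}
```

With a divisor of 1 and a single instance per command, the attribute value depends only on baseInstance, so each command can point at its own entry of the index buffer.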

A Caveat?

As you noticed, when you remove a command by setting primCount to 0, the command is not really removed… This is where the ARB INDIRECT PARAMETERS extension comes in. Instead of setting primCount to 0, you leave it at 1 but, if the mesh is not visible, you do not add its command to the buffer that is really used. Using an atomic counter, you know exactly how many meshes should be rendered.
You have to bind the atomic counter buffer to GL_PARAMETER_BUFFER_ARB and use the functions:
```
void MultiDrawArraysIndirectCountARB(enum mode,
                                     const void *indirect,
                                     intptr drawcount,
                                     sizei maxdrawcount,
                                     sizei stride);

void MultiDrawElementsIndirectCountARB(enum mode,
                                       enum type,
                                       const void *indirect,
                                       intptr drawcount,
                                       sizei maxdrawcount,
                                       sizei stride);
```
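The compaction itself can be sketched on the CPU. In this emulation a std::atomic plays the role of the atomic counter, and the loop stands in for the compute shader invocations; `Command` and `compact` are names made up for the example. On a real GPU the order of the surviving commands is not guaranteed, whereas this serial loop keeps them in order.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Command {
    std::uint32_t count, primCount, firstIndex;
    std::int32_t baseVertex;
    std::uint32_t baseInstance;
};

// Copies the visible commands to the front of `out` and returns how many were kept.
std::uint32_t compact(const std::vector<Command> &all, const std::vector<bool> &visible,
                      std::vector<Command> &out) {
    assert(all.size() == visible.size());
    out.resize(all.size());                   // worst case: everything is visible
    std::atomic<std::uint32_t> drawCount{0};  // plays the role of the atomic counter
    for (std::size_t i = 0; i < all.size(); ++i) {
        if (!visible[i])
            continue;
        const std::uint32_t slot = drawCount.fetch_add(1);  // reserve the next free slot
        out[slot] = all[i];
    }
    return drawCount.load();
}
```

The returned value corresponds to what the draw would read from the parameter buffer, while the size of the output buffer corresponds to maxdrawcount.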
References

Approaching zero driver overhead from Cass Everitt

Indirect Parameters
Multi Draw Indirect
Surviving without drawID
September 16, 2016
