Threaded renderer: what is it?
Threaded Rendering
Information for graphics programmers working with the threaded renderer.
Rendering thread
In Unreal Engine 4 (UE4), the entire renderer operates in its own thread that is a frame or two behind the game thread.
When dealing with rendering, you have to carefully consider every memory read and write to ensure not only thread safety, but also determinism in behavior. When functional behavior depends on execution speed differences between two threads, it is called a race condition. Avoiding race conditions is important because they are usually very difficult to reproduce, and may be machine, platform, debugger or configuration dependent because of speed differences. These kinds of bugs can rarely be debugged and take something like 10x the time to fix compared to a normal reproducible bug.
Here is a simple example of a race condition / threading bug:
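The sketch below illustrates this kind of bug in plain C++ rather than engine code. GameActor, RacyProxy and MirroredProxy are hypothetical names introduced only for illustration. The racy proxy caches a pointer to game thread state and reads it while drawing, so its result depends on when the game thread writes. The mirrored proxy copies the value when it is created, which makes the read deterministic.

```cpp
#include <cassert>

// Hypothetical game thread object; the game thread may write to it at any time.
struct GameActor {
    bool bHighlighted;
};

// UNSAFE pattern: caches a pointer to game thread state and reads it while drawing.
struct RacyProxy {
    const GameActor* Owner;  // points into memory owned by the game thread
    explicit RacyProxy(const GameActor& InOwner) : Owner(&InOwner) {}
    // Would run on the rendering thread: the result depends on thread timing,
    // and Owner may already have been destroyed by the time this runs.
    bool ShouldDrawHighlight() const { return Owner->bHighlighted; }
};

// SAFE pattern: mirrors (copies) the value when the proxy is created.
struct MirroredProxy {
    const bool bHighlighted;
    explicit MirroredProxy(const GameActor& InOwner)
        : bHighlighted(InOwner.bHighlighted) {}
    bool ShouldDrawHighlight() const { return bHighlighted; }
};
```

With real threads the racy read would be a data race. The point of the mirrored version is that a later game thread write can no longer change what the rendering thread sees.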
Development approach
There is no way to exhaustively test to find race conditions. It is important to realize that you cannot create reliable threaded code by guess-and-checking or retroactively fixing bugs. The best approach is to completely understand the interactions of the game thread and rendering thread and use mechanisms to ensure determinism. You should be able to explain the order of events that will make every interaction deterministic, or else you are almost certainly creating race conditions.
Thread specific data structures
For this reason, it is a good idea to have data in separate structures that are ‘owned’ by the different threads so that it is obvious who can modify what. This holds true for functions as well: it is best to always call each function from the same thread, or things get really complicated. Most of UE4 is structured this way. For example, UPrimitiveComponent is the base game thread class of anything that can be rendered, cast shadows, has its own visibility state, etc. The rendering thread can never touch the memory of UPrimitiveComponent directly, since the game thread may be writing to its members at any time. The rendering thread has its own class to represent this functionality, which is FPrimitiveSceneProxy. The game thread can never touch the members or memory of an FPrimitiveSceneProxy after it is created and registered. UActorComponent::RegisterComponent adds a component to the scene and makes it visible to the renderer by creating an FPrimitiveSceneProxy. Once the component is registered, it will have FPrimitiveSceneProxy::DrawDynamicElements called on it for every pass that is needed if it is visible.
Performance considerations
The game thread blocks at the end of each Tick() until the rendering thread catches up to either one frame or two frames behind. Since the rendering thread is so far behind, it is never acceptable during gameplay to block the game thread until the rendering thread catches up completely. Blocking during loading or GC of individual objects is also a bad idea, since UE4 supports async streaming levels. There are asynchronous mechanisms for various operations to avoid blocking.
Inter-thread communication
Asynchronous
The primary method of communication between the two threads is through the ENQUEUE_UNIQUE_RENDER_COMMAND_XXXPARAMETER macro. This macro creates a local class with a virtual Execute function that contains the code you enter into the macro. The game thread inserts the command into the rendering command queue, and the rendering thread calls the Execute function when it gets around to it.
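As a rough illustration of the idea, here is a minimal command queue in standard C++. It is not the UE4 implementation: RenderCommandQueue, Enqueue and ExecutePending are hypothetical names, and std::function stands in for the local class with a virtual Execute function that the macro generates. In this sketch ExecutePending is called explicitly to stand in for the rendering thread.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for the rendering command queue (not UE4 code).
class RenderCommandQueue {
public:
    // Called on the game thread: stores the command for later execution.
    void Enqueue(std::function<void()> Command) {
        std::lock_guard<std::mutex> Lock(Mutex);
        Commands.push(std::move(Command));
    }

    // Called on the rendering thread: runs pending commands in FIFO order
    // and returns how many were executed.
    int ExecutePending() {
        int NumExecuted = 0;
        for (;;) {
            std::function<void()> Command;
            {
                std::lock_guard<std::mutex> Lock(Mutex);
                if (Commands.empty()) break;
                Command = std::move(Commands.front());
                Commands.pop();
            }
            Command();  // run outside the lock so a command may enqueue more work
            ++NumExecuted;
        }
        return NumExecuted;
    }

private:
    std::mutex Mutex;
    std::queue<std::function<void()>> Commands;
};
```

Parameters are captured by value when the command is enqueued, so later game thread writes to the original variables do not affect what the command sees.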
FRenderCommandFence provides a convenient way to track the progress of the rendering thread on the game thread. The game thread calls FRenderCommandFence::BeginFence to begin the fence. The game thread can then call FRenderCommandFence::Wait to block until the rendering thread has processed the fence, or it can just poll the progress of the rendering thread by checking GetNumPendingFences. When GetNumPendingFences returns 0, the rendering thread has processed the fence.
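A fence of this kind can be sketched as a marker command plus a counter. The code below is an assumption about how such a mechanism could work, not the engine's implementation: CommandQueue and CommandFence are hypothetical types, and only the method names BeginFence and GetNumPendingFences are borrowed from the text. A blocking Wait is left out because the sketch has no real second thread; ExecuteOne stands in for one step of the rendering thread.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical command queue; ExecuteOne() stands in for the rendering thread.
// Not synchronized, because this sketch runs on a single thread.
struct CommandQueue {
    std::queue<std::function<void()>> Commands;

    void Enqueue(std::function<void()> Command) { Commands.push(std::move(Command)); }

    // Runs the oldest pending command; returns false if the queue was empty.
    bool ExecuteOne() {
        if (Commands.empty()) return false;
        std::function<void()> Command = std::move(Commands.front());
        Commands.pop();
        Command();
        return true;
    }
};

// Sketch of a fence: counts marker commands the consumer has not reached yet.
class CommandFence {
public:
    // Game thread: insert a marker behind everything enqueued so far.
    void BeginFence(CommandQueue& Queue) {
        NumPending.fetch_add(1);
        Queue.Enqueue([this] { NumPending.fetch_sub(1); });
    }

    // Game thread: poll without blocking.
    int GetNumPendingFences() const { return NumPending.load(); }

private:
    std::atomic<int> NumPending{0};
};
```

Because the queue is processed in order, a count of zero implies that every command enqueued before BeginFence has also been executed.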
Blocking
FlushRenderingCommands is the standard method of blocking the game thread until the rendering thread has caught up. This is useful for offline (editor) operations which modify memory being accessed by the rendering thread.
Rendering resources
FRenderResource provides the base rendering resource interface and provides hooks for initialization and releasing. Anything that derives from FRenderResource (FVertexBuffer, FIndexBuffer, etc) needs to be initialized before it is used for rendering and released before being deleted. FRenderResource::InitResource can only be called from the rendering thread, so there is a helper function (BeginInitResource) that can be called on the game thread to enqueue a rendering command to call FRenderResource::InitResource. RHI functions can only be called from the rendering thread (with the exception of a few for creating devices, viewports, etc).
UObjects and Garbage Collection
Garbage Collection (GC) happens on the game thread and operates on UObjects. The game thread may delete a UObject while the rendering thread is processing a command that references it. For this reason, the rendering thread should never dereference a UObject pointer unless a mechanism is in place to make sure the UObject is not deleted until the rendering thread no longer references it. An example is UPrimitiveComponent, which uses a FRenderCommandFence called DetachFence to prevent GC from deleting the UObject before the rendering thread has processed the detach command.
Game thread FRenderResource handling
There are two common scenarios of game thread / rendering thread resource interaction to consider: static resources (only modified on load or in the editor, like an index buffer) and dynamic resources, which need to be updated every frame with the latest results of the game thread simulation.
Static resources
Here is how the static resource interaction is handled in UE4, using USkeletalMesh as an example.
USkeletalMesh::PostLoad gets called on load, which calls InitResources. This calls BeginInitResource on any static FRenderResources that it has like the index buffer. BeginInitResource enqueues a rendering command to call FRenderResource::InitResource. From this point on, the game thread can no longer modify the index buffer memory until it does something to take back ownership.
A component registers, which starts rendering with the USkeletalMesh’s index buffer.
GC determines that the component is no longer referenced at some point (level unload or no longer referenced) and detaches the component. Note that at this point, the game thread cannot delete the index buffer memory, because the rendering thread may not have processed the detach yet and may still be rendering with the index buffer.
GC calls USkeletalMesh::BeginDestroy, which is the game thread object’s chance to enqueue commands to release the rendering resources, so it calls BeginReleaseResource(&IndexBuffer). The game thread still cannot delete the memory of IndexBuffer because the rendering thread has not necessarily processed the release yet. We could block the game thread until the rendering thread catches up, but this would cause hitches and be slow, so we have an asynchronous mechanism instead. In order to track the rendering thread’s progress in processing the release command, we initiate a fence.
GC finally calls UObject::FinishDestroy, which can be used to release memory in a central location. In the case of the index buffer, its memory gets freed when the USkeletalMesh destructor calls FRawStaticIndexBuffer’s destructor, which calls the destructor of the TArray holding the index buffer memory, which frees the memory.
This mechanism works well because it is efficient (never blocks either thread, initializes in a central location instead of checking for whether initialization is needed every frame), and is deterministic.
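The lifecycle described in the steps above can be sketched in standard C++ as follows. This is a simplified model under stated assumptions, not engine code: CommandQueue and MeshAsset are hypothetical types, IsReadyForFinishDestroy is an illustrative polling function, and ExecuteAll is called by hand to stand in for the rendering thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical render command queue; ExecuteAll() stands in for the rendering thread.
struct CommandQueue {
    std::queue<std::function<void()>> Commands;
    void Enqueue(std::function<void()> Command) { Commands.push(std::move(Command)); }
    void ExecuteAll() {
        while (!Commands.empty()) {
            std::function<void()> Command = std::move(Commands.front());
            Commands.pop();
            Command();
        }
    }
};

// Hypothetical asset owning one static index buffer.
class MeshAsset {
public:
    explicit MeshAsset(std::vector<int> InIndices) : Indices(std::move(InIndices)) {}

    // On load (game thread): hand the buffer over to the rendering thread.
    void InitResources(CommandQueue& Queue) {
        Queue.Enqueue([this] { bInitializedOnRenderThread = true; });
    }

    // On destroy (game thread): enqueue the release, then a fence behind it.
    void BeginDestroy(CommandQueue& Queue) {
        Queue.Enqueue([this] { bInitializedOnRenderThread = false; });
        bReleaseFencePending = true;
        Queue.Enqueue([this] { bReleaseFencePending = false; });
    }

    // Polled by the game thread; never blocks.
    bool IsReadyForFinishDestroy() const { return !bReleaseFencePending; }

    // Only after the fence has passed is it safe to free the memory.
    void FinishDestroy() {
        assert(IsReadyForFinishDestroy());
        Indices.clear();
        Indices.shrink_to_fit();
    }

    bool IsInitialized() const { return bInitializedOnRenderThread; }
    std::size_t NumIndices() const { return Indices.size(); }

private:
    std::vector<int> Indices;
    bool bInitializedOnRenderThread = false;  // written only by render commands
    bool bReleaseFencePending = false;        // would need to be atomic with real threads
};
```

Between BeginDestroy and the fence passing, the index memory stays alive even though the game thread no longer needs it, which is what keeps the rendering thread's reads valid.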
Dynamic resources
The skeletal mesh bone transforms which are produced by the game thread animation each frame are a good example of dynamic resource updating. The goal is to get the transforms from the game thread after each animation update into an array on the rendering thread where they can be set as shader constants. The same would be true if you were updating an index or vertex buffer each frame. Here is the order of operations:
USkinnedMeshComponent::CreateRenderState_Concurrent allocates USkinnedMeshComponent::MeshObject. From this point on, the game thread can only write to the MeshObject pointer, but not to the memory of the FSkeletalMeshObject.
USkinnedMeshComponent::UpdateTransform gets called to update the component’s movement at least once per frame. This calls FSkeletalMeshObjectGPUSkin::Update in the case of GPU skinning. At this point, we have up-to-date transforms on the game thread and need to get them over to the rendering thread. This is done by first allocating memory on the heap (FDynamicSkelMeshObjectData), then copying the bone transforms into it, and then passing off this copy to the rendering thread using ENQUEUE_UNIQUE_RENDER_COMMAND_TWOPARAMETER. The rendering thread now owns the copy and is responsible for deleting it. The ENQUEUE_UNIQUE_RENDER_COMMAND_TWOPARAMETER macro contains code to copy the transforms to their final destination so they can be set as shader constants. This is where you would lock and update a vertex buffer if updating vertex positions.
At some point, the component gets detached. The game thread enqueues rendering commands to release all of the dynamic FRenderResources and can now set the MeshObject pointer to NULL. However, the actual memory is still being referenced by the rendering thread and cannot be deleted yet. This is where the deferred deletion mechanism comes into play. Classes that derive from FDeferredCleanupInterface can be deleted asynchronously in a thread-safe way. FSkeletalMeshObject implements this interface. The game thread wants to kick off the deferred deletion of the FSkeletalMeshObject, so it calls BeginCleanup(MeshObject). The memory will eventually be deleted when it is safe to do so and cleanup is complete.
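The hand-off and the deferred deletion can be sketched like this. It is a simplified model, not the engine's code: CommandQueue, RenderMeshObject and SendTransforms are hypothetical, and this BeginCleanup simply enqueues the delete behind all earlier commands, which is an assumption made for illustration rather than a description of how UE4 implements FDeferredCleanupInterface.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical render command queue; ExecuteAll() stands in for the rendering thread.
struct CommandQueue {
    std::queue<std::function<void()>> Commands;
    void Enqueue(std::function<void()> Command) { Commands.push(std::move(Command)); }
    void ExecuteAll() {
        while (!Commands.empty()) {
            std::function<void()> Command = std::move(Commands.front());
            Commands.pop();
            Command();
        }
    }
};

// Render thread object: after creation, only render commands touch its members.
struct RenderMeshObject {
    std::vector<float> ShaderConstants;
};

// Game thread: copy the transforms and pass the copy to the rendering thread.
void SendTransforms(CommandQueue& Queue, RenderMeshObject* Target,
                    const std::vector<float>& GameTransforms) {
    // Heap-allocated snapshot. shared_ptr is used only because std::function
    // requires a copyable callable; the command holds the last reference.
    auto Snapshot = std::make_shared<std::vector<float>>(GameTransforms);
    Queue.Enqueue([Target, Snapshot] {
        Target->ShaderConstants = std::move(*Snapshot);
    });
}

// Game thread: deferred deletion. The delete runs after every earlier command
// that may still reference the object.
void BeginCleanup(CommandQueue& Queue, RenderMeshObject* Object) {
    Queue.Enqueue([Object] { delete Object; });
}
```

After calling BeginCleanup the game thread can null its pointer immediately, since the object's lifetime is now controlled by the queue.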
Updating state vs Traversing the scene for rendering
When developing a system that has distinct update and render operations, it is tempting to combine the two in DrawDynamicElements. However, this is a poor design choice. A better solution is to separate the update out of the rendering traversal, for example by enqueueing the update command from within the game thread Tick.
DrawDynamicElements is called by the high level rendering code to draw the elements of a primitive component. The high level code assumes that no RHI state is being changed, and that it can call DrawDynamicElements as many times as it needs each frame, depending on shading passes, number of views, and scene captures in the scene. DrawDynamicElements may even be called, but then the underlying drawing policy discards the results for various reasons (for example a translucent FMeshElement submitted during the depth pass will be discarded). If the primitive component is actually not visible, the occlusion system may or may not actually call DrawDynamicElements, depending on the heuristic it is using. All of these factors can conflict with state updating which should happen once per frame.
A better solution is to separate the update from the rendering traversal. The game thread Tick can enqueue a rendering command to do the update operation. The rendering command can optionally skip updating based on visibility, if this is acceptable for the use case, by using LastRenderTime of the primitive scene info. If the update operation is enqueued separately in this manner, any RHI functions can be used including setting different render targets.
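A minimal sketch of this separation, using hypothetical names (SceneProxy, TickGameThread, RenderFrame) and a plain vector of std::function as the command list: the update runs once per enqueued command, while Draw may run any number of times per frame.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical render thread proxy with separate update and draw operations.
struct SceneProxy {
    float AnimatedValue = 0.0f;
    int NumUpdates = 0;
    int NumDraws = 0;

    void Update(float NewValue) { AnimatedValue = NewValue; ++NumUpdates; }
    void Draw() { ++NumDraws; }  // must not change state that other passes rely on
};

// Game thread Tick: enqueue exactly one update command for this frame.
void TickGameThread(std::vector<std::function<void()>>& RenderCommands,
                    SceneProxy* Proxy, float NewValue) {
    RenderCommands.push_back([Proxy, NewValue] { Proxy->Update(NewValue); });
}

// Rendering thread: run queued updates first, then traverse once per pass.
void RenderFrame(std::vector<std::function<void()>>& RenderCommands,
                 SceneProxy& Proxy, int NumPasses) {
    for (std::function<void()>& Command : RenderCommands) Command();
    RenderCommands.clear();
    for (int Pass = 0; Pass < NumPasses; ++Pass) Proxy.Draw();
}
```

The number of updates follows the number of ticks, regardless of how many passes, views or scene captures trigger a draw.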
State caching (as opposed to updating) is an exception to this rule. State caching is storing an intermediate result of the rendering traversal as an optimization. It is closely tied with the traversal, and does not change RHI state, so it does not suffer the downsides mentioned before (as long as the determination of when to cache is done correctly).
Molecular Musings
Development blog of the Molecule Engine
Stateless, layered, multi-threaded rendering – Part 1
In this post, I would like to describe what features and performance characteristics I want from a modern rendering system: it should support stateless rendering, rendering in different layers/buckets, and rendering that can run in parallel on as many cores as are available.
Lately, I have been pondering how to implement such a rendering system efficiently, and wanted to document and share my ideas and findings so far before I go and implement the whole thing.
Rendering backend
What do I mean by a rendering backend? In my opinion, the rendering backend should be responsible for only one thing: submitting draw calls using a graphics API such as D3D or OGL. It is the responsibility of higher-level systems to make sure that only the minimum number of draw calls are made, and that draw calls and state changes are ordered and optimized.
Stateless rendering
All graphics APIs we usually deal with are stateful things. This means that whenever you change any state in the API for subsequent draw calls, this state change also affects draw calls submitted at a later point in time. As an example, if you change the culling state from backface to frontface culling for some object, you need to either reset the state after the object finished rendering, or set a default state for all other objects, otherwise some objects end up being rendered with the wrong culling state.
Exposing such a stateful API to the user is error-prone, and a bad abstraction. Ideally, submitting a draw call with whatever state we want should not affect any of the other draw calls. This would allow us to treat each individual draw call like a single “thing” that carries all state it needs with it, not leaking any of its state into other draw calls.
This would also enable us to easily change the order of draw calls (as long as the rendering result stays the same), allowing us to get rid of redundant state changes and to sort draw calls by a certain key (front-to-back, back-to-front, or some other criterion).
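To make the idea concrete, here is a minimal sketch of a draw call that carries its own state. The DrawCall fields and the Backend class are hypothetical and chosen only for illustration. The backend applies the full state of every call and skips changes whose value is already set, so reordering calls changes the number of state changes but never the state a call is drawn with.

```cpp
#include <cassert>
#include <vector>

enum class CullMode { Back, Front, None };

// A draw call carries every piece of state it needs (fields are illustrative).
struct DrawCall {
    CullMode Cull;
    int ShaderId;
    int MeshId;
};

// Hypothetical backend: applies the complete state of each draw call and
// skips API state changes whose value is already set.
class Backend {
public:
    void Submit(const DrawCall& Call) {
        if (!bHasState || Call.Cull != CurrentCull) {
            CurrentCull = Call.Cull;
            ++NumStateChanges;
        }
        if (!bHasState || Call.ShaderId != CurrentShader) {
            CurrentShader = Call.ShaderId;
            ++NumStateChanges;
        }
        bHasState = true;
        DrawnWithCull.push_back(CurrentCull);  // state this draw actually used
    }

    int NumStateChanges = 0;
    std::vector<CullMode> DrawnWithCull;

private:
    bool bHasState = false;
    CullMode CurrentCull = CullMode::Back;
    int CurrentShader = 0;
};
```

Submitting the same three calls grouped by culling mode needs fewer state changes than submitting them interleaved, while each call still sees exactly the state it asked for.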
The presentation about Firaxis’ LORE system used in Civilization V goes a bit more into detail.
Layered rendering
Also known as bucketized rendering, the idea here is to assign a key to a draw call which is then used for sorting. Typically, a key is just a single 32-bit or 64-bit integer, nothing more. Usually, a key encodes certain data like distance, material, shader, etc. of a draw call in individual bits. Depending on where those bits are stored in the integer, you can apply different sorting criteria for the same array of draw calls, as long as you know how the keys were built.
This is a very efficient and straightforward approach, because you can use e.g. a simple radix sort on the integers, and don’t have to worry about how to sort the data (is it sorted by distance? by texture? by material?). The sorting criterion is basically encoded in the bits of the integer: if you want to sort by material rather than by distance, just put the respective bits in a different place.
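As a small illustration, the sketch below packs the same three fields into a 64-bit key in two different ways. The bit layouts are assumptions made for this example, not layouts taken from any particular engine, and std::sort is used instead of a radix sort to keep the code short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical layout: [ 16 bits material | 16 bits shader | 32 bits depth ].
// Sorting these keys groups draws by material first.
std::uint64_t MakeMaterialFirstKey(std::uint16_t Material, std::uint16_t Shader,
                                   std::uint32_t Depth) {
    return (std::uint64_t(Material) << 48) | (std::uint64_t(Shader) << 32) | Depth;
}

// Same fields with depth in the most significant bits: [ depth | material | shader ].
// Sorting these keys orders draws front-to-back first.
std::uint64_t MakeDepthFirstKey(std::uint16_t Material, std::uint16_t Shader,
                                std::uint32_t Depth) {
    return (std::uint64_t(Depth) << 32) | (std::uint64_t(Material) << 16) | Shader;
}

struct KeyedDraw {
    std::uint64_t Key;
    int DrawIndex;
};

// One comparison on the integer key, whatever the key encodes.
void SortDraws(std::vector<KeyedDraw>& Draws) {
    std::sort(Draws.begin(), Draws.end(),
              [](const KeyedDraw& A, const KeyedDraw& B) { return A.Key < B.Key; });
}
```

The sorting code is identical in both cases. Only the key construction decides whether material or depth dominates the order.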
Christer Ericson’s blog explains the concept well, if you are not familiar with it.
There is one thing I would most likely change compared to Christer’s approach, though. I would argue that the renderer itself (e.g. a Deferred Renderer, a Clustered Renderer, a Forward+ Renderer) knows how to render the entities, and hence also knows how many and which layers it needs.
For example, a simple Deferred Renderer will first render objects to the G-Buffer, then render the decals, then render a shadow map for each shadow casting light source, apply the lighting of all light sources one by one, and finally render transparent objects using a forward pass, followed by HUD elements and similar things. Of course there are dozens of different implementations possible, but you get the idea.
My point is that I would not try to cram every draw call of every layer into a key of the same size, storing them all in one big stream (or per-thread local streams), but rather use differently sized keys for different layers. E.g. when rendering a shadow map, objects should be sorted front-to-back, and there is no need to sort by material or shader. Hence, a single 16-bit integer is probably enough for a rough front-to-back sort based on the distance to the camera. Therefore, I would put those 16-bit keys into a different “bucket” than draw calls that belong to other layers.
I might have 16-bit keys for shadow map buckets, 32-bit keys for transparent object buckets, and 64-bit keys for general G-Buffer buckets. The idea here is that smaller data can be sorted faster, and individual buckets can be sorted in parallel on different threads.
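One possible way to express differently sized keys is a bucket type parameterized on the key type. This is only a sketch under assumptions: Bucket and QuantizeDepth are hypothetical names, and std::stable_sort stands in for whatever sort a real implementation would use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// A bucket whose key width is chosen per layer, e.g. Bucket<std::uint16_t>
// for shadow maps and Bucket<std::uint64_t> for the G-Buffer.
template <typename KeyT>
class Bucket {
public:
    void Add(KeyT Key, int DrawId) { Entries.push_back(Entry{Key, DrawId}); }

    // Sorts by key (stable, so equal keys keep submission order) and
    // returns the draw ids in sorted order.
    std::vector<int> SortedDrawIds() {
        std::stable_sort(Entries.begin(), Entries.end(),
                         [](const Entry& A, const Entry& B) { return A.Key < B.Key; });
        std::vector<int> Ids;
        for (const Entry& E : Entries) Ids.push_back(E.DrawId);
        return Ids;
    }

private:
    struct Entry {
        KeyT Key;
        int DrawId;
    };
    std::vector<Entry> Entries;
};

// Maps a depth in [0, FarPlane] to a 16-bit key for a rough front-to-back sort.
std::uint16_t QuantizeDepth(float Depth, float FarPlane) {
    const float Normalized = std::min(std::max(Depth / FarPlane, 0.0f), 1.0f);
    return static_cast<std::uint16_t>(Normalized * 65535.0f);
}
```

Since each bucket owns its own entries, separate buckets could be sorted on different threads without sharing any data.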
Multi-threaded rendering
Using such a layered/bucketized system, one of the big benefits we get is of course the ability to use all available threads for rendering. The common approach is to queue all draw calls into layers/buckets for a whole frame, sort them by key, and then submit them to the graphics API using the aforementioned rendering backend. API calls are only done from the main thread, but queueing the individual draw calls can easily be done in parallel.
If the engine follows a data-oriented approach, and/or uses a task scheduler, each core can work on N given entities at a time, possibly splitting its work into several tasks which are handed to the scheduler. Instead of storing all draw calls and their keys in one single stream of data, I would probably use an approach similar to the one used by the Bitsquid engine: storing draw call data in per-thread local streams, which are then sorted and merged before being submitted on the main thread.
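The merge step could look roughly like the following. KeyedDraw and MergeAndSort are hypothetical names, and the sketch simply concatenates the per-thread streams and sorts the result once, which is an assumption chosen for brevity rather than a description of how any particular engine does it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct KeyedDraw {
    std::uint32_t Key;
    int DrawId;
};

// Each worker thread appends to its own stream, so queueing needs no locks.
// The main thread then merges all streams and sorts them by key.
std::vector<KeyedDraw> MergeAndSort(
    const std::vector<std::vector<KeyedDraw>>& PerThreadStreams) {
    std::size_t Total = 0;
    for (const std::vector<KeyedDraw>& Stream : PerThreadStreams) Total += Stream.size();

    std::vector<KeyedDraw> Merged;
    Merged.reserve(Total);
    for (const std::vector<KeyedDraw>& Stream : PerThreadStreams) {
        Merged.insert(Merged.end(), Stream.begin(), Stream.end());
    }

    std::stable_sort(Merged.begin(), Merged.end(),
                     [](const KeyedDraw& A, const KeyedDraw& B) { return A.Key < B.Key; });
    return Merged;
}
```

The returned vector is what the main thread would walk when submitting draw calls to the graphics API.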
General thoughts
Last but not least, just a thought about rendering in general: rendering of entities should happen by pushing draw calls into individual buckets, not by pulling state from each and every entity for each bucket.
More specifically, instead of doing this:
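The snippet is missing from this copy of the post. As a stand-in, here is a sketch of the pull pattern with hypothetical types (Entity, Bucket, PullIntoBuckets): every bucket walks every entity and asks whether it belongs there.

```cpp
#include <cassert>
#include <vector>

struct Entity {
    int Id;
    bool bOpaque;
    bool bCastsShadow;
    int NumVisits;  // how often the renderer touched this entity
};

enum Bucket { GBufferBucket = 0, ShadowBucket = 1, NumBuckets = 2 };

bool BelongsTo(const Entity& E, Bucket B) {
    return B == GBufferBucket ? E.bOpaque : E.bCastsShadow;
}

// Pull: each bucket iterates over all entities and pulls state from them.
std::vector<std::vector<int>> PullIntoBuckets(std::vector<Entity>& Entities) {
    std::vector<std::vector<int>> Buckets(NumBuckets);
    for (int B = 0; B < NumBuckets; ++B) {
        for (Entity& E : Entities) {
            ++E.NumVisits;
            if (BelongsTo(E, static_cast<Bucket>(B))) Buckets[B].push_back(E.Id);
        }
    }
    return Buckets;
}
```

With this loop order, every entity is visited once per bucket, whether or not it ends up in that bucket.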
You should rather do this:
This might seem trivial to some of you, but I thought it was worth pointing out. After all, there are still many people who render a std::vector of GameObjects by iterating through the vector, calling a virtual Render() method for each object. Nothing wrong per se with that, but you won’t be doing a game with hundreds and hundreds of entities with such an approach.
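The snippet is missing from this copy of the post. As a stand-in, here is a sketch of the push pattern with hypothetical types (Entity, Bucket, PushIntoBuckets): each entity is visited once and pushes itself into every bucket it belongs to.

```cpp
#include <cassert>
#include <vector>

struct Entity {
    int Id;
    bool bOpaque;
    bool bCastsShadow;
    int NumVisits;  // how often the renderer touched this entity
};

enum Bucket { GBufferBucket = 0, ShadowBucket = 1, NumBuckets = 2 };

// Push: a single pass over the entities fills all buckets.
std::vector<std::vector<int>> PushIntoBuckets(std::vector<Entity>& Entities) {
    std::vector<std::vector<int>> Buckets(NumBuckets);
    for (Entity& E : Entities) {
        ++E.NumVisits;
        if (E.bOpaque) Buckets[GBufferBucket].push_back(E.Id);
        if (E.bCastsShadow) Buckets[ShadowBucket].push_back(E.Id);
    }
    return Buckets;
}
```

The number of visits per entity stays at one no matter how many buckets the renderer uses.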
Being able to push draw calls into different buckets ensures that we only touch each entity once per frame instead of several times, which can make a difference in performance if you have lots of entities.
That’s it for today! I’m hoping that there will be more things to share and maybe even some code to look at by Thursday next week!