Table of Contents

<aside> 📌 Final Project for UCSD CSE 168 (Prof. Tzu-Mao Li)

Group Members: Zimu Guan, Ningyu Zhu

GitHub Link: https://github.com/TaKeTube/Torrey-GPU

</aside>

Preview

room.jpg

Scene made using assets from https://polyhaven.com/. The scene’s XML file is in the final_scene folder.

Introduction

Real-time ray tracing is currently a hot topic in the computer graphics industry. In recent years, a series of works in both academia and industry have showcased advancements in sampling techniques (ReSTIR), denoising methods (SVGF), deep-learning denoising (DLSS), and GPU architecture (RT cores), leading to the successful integration of real-time path tracing into shipping products (e.g., Ray Tracing: Overdrive Mode in Cyberpunk 2077). To keep up with this trend, our team aims to learn the technical details of real-time ray tracing by implementing a GPU path tracer using CUDA.

Goal


In our final project, we dive into the world of parallel computing to accelerate our path tracer with GPUs. Path tracing is a demanding algorithm used in computer graphics to create lifelike images by simulating how light rays behave. By adapting our path tracer to run on GPUs using the CUDA toolkit, we tap into the massive parallel processing power of modern graphics cards, which allows us to render scenes much faster while maintaining high visual quality. In this report, we explain the process of porting our code to NVIDIA GPUs using CUDA, discussing the practical implementation, optimizations, challenges, and trade-offs involved in harnessing parallelism for real-time rendering.

Implementation


In this section, we outline the implementation details of porting the path tracer onto the GPU using CUDA. The implementation consists of three key parts: passing the scene to the GPU, re-implementing the BVH intersection, and implementing the kernel function.

Passing the Scene to the GPU

The first step in parallelizing the path tracer involves transferring the scene data from the CPU to the GPU. This includes geometry (shapes and triangle meshes), materials, lights, textures, and BVH acceleration structures. To accomplish this, we use the GPU memory allocation and data transfer APIs, such as the cudaMalloc() and cudaMemcpy() functions, to pass the scene data to the GPU. Also, note that device code does not support some C++ Standard Template Library (STL) containers, including std::vector. Therefore, when working with vectors, the data stored inside each vector must be copied to device memory as a raw array, and the device side accesses it through a pointer together with an element count.
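The pattern above can be sketched as follows. This is a minimal illustration, not the project's exact API; the Triangle layout and the helper name uploadTriangles are our own for this example:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Illustrative host-side struct; a real mesh would also carry normals, UVs, etc.
struct Triangle { float3 v0, v1, v2; };

// Copy the contiguous storage of a std::vector into a raw device array.
// The kernel later receives the returned pointer plus host.size().
Triangle* uploadTriangles(const std::vector<Triangle>& host) {
    Triangle* dev = nullptr;
    size_t bytes = host.size() * sizeof(Triangle);
    cudaMalloc(&dev, bytes);                                      // allocate device memory
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);  // host -> device copy
    return dev;  // must eventually be released with cudaFree(dev)
}
```

The same pattern is repeated for every vector-backed piece of scene data (materials, lights, BVH nodes), replacing each std::vector with a device pointer and a count.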

Re-implementing the BVH Intersection algorithm

The next crucial aspect of our implementation is re-implementing the bounding volume hierarchy (BVH) acceleration structure on the GPU. In the original CPU version, the closest intersection point for the current ray is obtained by recursively traversing the BVH from the root. However, recursion is not recommended in device functions, because each GPU thread has only a small call stack, and recursion of unbounded depth risks stack overflow and hurts performance.

Therefore, we need to convert the recursive algorithm into an iterative, loop-based form. To achieve this, we re-implement the traversal using a simulated stack: by pushing node indices onto the stack and popping them in a loop, we reproduce the traversal of the BVH in depth-first order.
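The stack-based loop can be sketched as below. This is a simplified host-side version of the logic (the node struct and function names are ours for illustration); the real routine lives in a __device__ function and additionally performs ray-AABB and ray-primitive tests, which are omitted here:

```cpp
#include <cassert>
#include <vector>

// Hypothetical flattened BVH node: interior nodes store the array indices
// of their children, leaves store a primitive index.
struct BVHNode {
    int left = -1, right = -1;   // child indices into the node array (-1 = none)
    int prim = -1;               // primitive index if this node is a leaf
    bool isLeaf() const { return prim >= 0; }
};

// Iterative DFS traversal with a fixed-size simulated stack, mirroring the
// loop used in device code. A real traversal would skip any node whose
// bounding box the ray misses.
std::vector<int> traverse(const std::vector<BVHNode>& nodes) {
    std::vector<int> visitedPrims;
    int stack[64];               // fixed depth; ample for typical BVHs
    int top = 0;
    stack[top++] = 0;            // push the root index
    while (top > 0) {
        const BVHNode& n = nodes[stack[--top]];  // pop
        if (n.isLeaf()) {
            visitedPrims.push_back(n.prim);      // ray-primitive test would go here
        } else {
            if (n.right >= 0) stack[top++] = n.right;
            if (n.left  >= 0) stack[top++] = n.left;  // left is popped first -> DFS order
        }
    }
    return visitedPrims;
}
```

A fixed-size array works well on the GPU because it lives in registers or local memory and avoids any dynamic allocation inside the kernel.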

Another important consideration in the re-implementation of the BVH is how the tree itself is stored. Unlike the traditional approach of linking child nodes at arbitrary heap addresses through pointers, we pre-allocate a contiguous array in device memory. Consequently, our method flattens all the nodes into an array, where each node holds the indices of its children within that array. This approach greatly simplifies access to the nodes in CUDA, as well as the traversal and manipulation of the BVH during rendering.
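The host-side flattening step can be sketched as follows (again a simplified illustration with our own names, not the project's exact classes). The pointer-based tree built on the CPU is copied depth-first into a contiguous array whose nodes reference children by index, so the whole array can then be uploaded to the GPU with a single cudaMemcpy:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Pointer-based node as built on the CPU.
struct TreeNode {
    std::unique_ptr<TreeNode> left, right;
    int prim = -1;               // >= 0 for leaves
};

// Contiguous node as stored in device memory: children are referenced
// by array indices instead of pointers.
struct FlatNode {
    int left = -1, right = -1;
    int prim = -1;
};

// Depth-first flattening; returns the index of the node just written.
// Recursion is fine here because this runs once on the host.
int flatten(const TreeNode* n, std::vector<FlatNode>& out) {
    int idx = (int)out.size();
    out.push_back(FlatNode{-1, -1, n->prim});
    if (n->left)  { int l = flatten(n->left.get(), out);  out[idx].left  = l; }
    if (n->right) { int r = flatten(n->right.get(), out); out[idx].right = r; }
    return idx;
}
```

Because the array is contiguous, device-side traversal only needs integer indexing, and no pointer fix-up is required after the copy.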