I expanded my NumPy rasterizer prototype I wrote about earlier with support for multiple triangles and GPU acceleration. It’s dog slow though 🙂
The way it works is that I use PyTorch for the matrix operations instead of NumPy. Adding CUDA support is then very simple: you just sprinkle some tactical .cuda() calls into your code and you're done. Of course, to get real performance benefits, the data flow of the program needs to be refactored and some operations replaced with their in-place alternatives.
I verified with Process Explorer that, yes, the code really does do something on the GPU, since it allocates some VRAM (300 megs!):
About the code: it's a really simple system in principle. The barycentrics are evaluated for all the triangles, producing a tensor (a multi-dimensional array) of size (H*W) * N * 3, where H and W are the image dimensions, N is the number of triangles, and 3 is the number of barycentric coordinates. So it's effectively an array of images, one per triangle.
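To make the shape concrete, here is a sketch of how such a tensor can be built. This is my own NumPy illustration, not the code from the post (which uses PyTorch; the broadcasting works the same way), and the function name and argument layout are hypothetical:

```python
import numpy as np

def barycentrics(tris, H, W):
    """Barycentric coordinates of every pixel w.r.t. every triangle.

    tris: (N, 3, 2) array of 2D vertex positions in pixel space.
    Returns an (H*W, N, 3) tensor: one "image" of barycentrics per triangle.
    """
    ys, xs = np.mgrid[0:H, 0:W]
    p = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(np.float64)  # (H*W, 2)

    a, b, c = tris[:, 0], tris[:, 1], tris[:, 2]   # each (N, 2)
    v0, v1 = b - a, c - a                          # triangle edge vectors
    v2 = p[:, None, :] - a[None, :, :]             # (H*W, N, 2) pixel offsets

    # Solve the 2x2 system with the standard dot-product formulation
    d00 = (v0 * v0).sum(-1); d01 = (v0 * v1).sum(-1); d11 = (v1 * v1).sum(-1)
    d20 = (v2 * v0[None]).sum(-1); d21 = (v2 * v1[None]).sum(-1)
    denom = d00 * d11 - d01 * d01                  # (N,)

    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    u = 1.0 - v - w
    return np.stack([u, v, w], axis=-1)            # (H*W, N, 3)
```

The three coordinates per pixel always sum to one; a pixel lies inside a triangle exactly when all three are non-negative.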
From these barycentrics we can compute a visibility mask (line 86) for each image. Then for each pixel we find the ID of the last triangle that covered it (line 99), and use those indices to pick the pixel from the correct image (line 108) for the final framebuffer. This means the triangles are drawn in the order they were specified in the original array.
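The mask/last-index/pick pipeline can be sketched on a toy example. Again this is my own NumPy illustration of the idea, with hypothetical tensor names and hand-made data, not the post's actual PyTorch code:

```python
import numpy as np

# Toy stand-ins: `bar` plays the (H*W, N, 3) barycentric tensor with
# H*W = 4 pixels and N = 2 triangles; `colors` is a per-triangle image stack.
bar = np.array([
    [[0.2, 0.3, 0.5], [0.1, 0.1, 0.8]],    # pixel 0: inside both triangles
    [[0.2, 0.3, 0.5], [-0.1, 0.5, 0.6]],   # pixel 1: inside triangle 0 only
    [[-0.2, 0.7, 0.5], [-0.1, 0.5, 0.6]],  # pixel 2: inside neither
    [[-0.2, 0.7, 0.5], [0.1, 0.1, 0.8]],   # pixel 3: inside triangle 1 only
])
colors = np.zeros((4, 2, 3))
colors[:, 0] = [1.0, 0.0, 0.0]             # triangle 0 shades red everywhere
colors[:, 1] = [0.0, 1.0, 0.0]             # triangle 1 shades green

# Visibility mask: a pixel is inside a triangle iff all barycentrics >= 0.
mask = (bar >= 0).all(axis=-1)             # (H*W, N) booleans

# ID of the *last* covering triangle per pixel: argmax finds the first True,
# so flip the triangle axis and map the index back.
N = mask.shape[1]
last = (N - 1) - np.argmax(mask[:, ::-1], axis=1)

# Blit: for each pixel row, pick the column chosen by `last`; pixels covered
# by no triangle fall back to a black background.
framebuffer = colors[np.arange(mask.shape[0]), last]
framebuffer[~mask.any(axis=1)] = 0.0
```

Because the flipped argmax picks the highest covering index, triangle 1 wins over triangle 0 wherever both cover a pixel, which is what gives the draw-in-array-order behavior.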
In practice this was pretty annoying to write; for example, I got stuck on the final blitting part (lines 107-108) for a good couple of hours. In the end the fix was simple (found the answer here): just replace arr[:, ind, :] with arr[range(0, N), ind, :]. The reason is that a bare slice combines every row with every index, while an integer index array pairs row i with index ind[i].
Anyway, you can run the code below without CUDA by simply removing all the .cuda() calls.