PyTorch rasterizer on the GPU

These three triangles were drawn on the GPU!

I expanded the NumPy rasterizer prototype I wrote about earlier with support for multiple triangles and GPU acceleration. It’s dog slow though 🙂

The way it works is that I use PyTorch to do the matrix operations instead of NumPy. Adding CUDA support is then very simple: you just add some tactical .cuda() calls to your code and you’re done. Of course, to get real performance benefits, the data flow of your program needs to be refactored and some operations replaced with their in-place alternatives.

I verified with Process Explorer that, yes, the code does indeed do something on the GPU, as it allocates some VRAM (300 megs!):


About the code: it’s a really simple system in principle. All the barycentrics are evaluated for all the triangles, producing a tensor (a multi-dimensional array) of size (H*W) × N × 3, where H and W are the image dimensions, N is the number of triangles, and 3 is the number of barycentrics. So it’s an array of images.
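As a rough sketch of that first step (plain NumPy here, but the PyTorch version is nearly line-for-line identical; the triangle data and image size below are made up for illustration):

```python
import numpy as np

H, W, N = 4, 4, 2  # tiny image, two hypothetical triangles
# tris: (N, 3, 2) array of 2D screen-space vertices
tris = np.array([[[0, 0], [3, 0], [0, 3]],
                 [[3, 3], [0, 3], [3, 0]]], dtype=np.float64)

# (H*W, 2) pixel coordinates
ys, xs = np.mgrid[0:H, 0:W]
pix = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(np.float64)

def edge(a, b, p):
    """Signed area of triangle (a, b, p); broadcasts over pixels and triangles."""
    return ((b[..., 0] - a[..., 0]) * (p[..., 1] - a[..., 1])
          - (b[..., 1] - a[..., 1]) * (p[..., 0] - a[..., 0]))

v0, v1, v2 = tris[:, 0], tris[:, 1], tris[:, 2]   # each (N, 2)
p = pix[:, None, :]                               # (H*W, 1, 2)
area = edge(v0, v1, v2)                           # (N,)

# (H*W, N, 3) barycentrics: one sub-image per triangle
bary = np.stack([edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p)],
                axis=-1) / area[None, :, None]
```

Since the three edge functions always sum to the triangle area, each barycentric triple sums to one, which is a handy sanity check.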

From these barycentrics we can calculate a visibility mask (line 86) for each image. Then for each pixel we find the ID of the last triangle that encloses it (line 99), and use those indices to pick a pixel from the correct image (line 108) to output into the final framebuffer. This means the triangles are drawn in the order they were specified in the original array.
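The mask-and-gather pipeline can be sketched like this, with hypothetical shapes and random stand-in data (in the real code the (H*W, N, 3) tensor comes from the barycentric pass):

```python
import numpy as np

P, N = 6, 3                            # P = H*W pixels, N triangles
rng = np.random.default_rng(0)
colors = rng.random((P, N, 3))         # stand-in per-triangle pixel colors
bary = rng.uniform(-1, 1, (P, N, 3))   # stand-in barycentrics

mask = (bary >= 0).all(axis=-1)        # (P, N): does triangle cover pixel?
covered = mask.any(axis=1)             # pixel hit by any triangle at all?

# index of the LAST covering triangle per pixel:
# flip along the triangle axis, then argmax finds the first True
last_id = N - 1 - np.argmax(mask[:, ::-1], axis=1)

# advanced indexing pairs each pixel row with its own triangle index
frame = np.where(covered[:, None], colors[np.arange(P), last_id], 0.0)
```

Uncovered pixels get an arbitrary `last_id`, but `covered` masks them out before they reach the framebuffer.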

In practice this was pretty annoying to write; for example, I got stuck on the final blitting part (lines 107-108) for a good couple of hours. In the end it was simple (I found the answer here) and was just about replacing arr[:, ind, :] with arr[range(0, N), ind, :], for whatever reason.
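The difference between the two indexing forms is easy to see on a tiny array: slicing with `arr[:, ind, :]` takes every row crossed with every index, while pairing an explicit row range with `ind` selects one entry per row (this is NumPy/PyTorch advanced indexing; sizes here are arbitrary):

```python
import numpy as np

P, N = 4, 5
arr = np.arange(P * N * 3).reshape(P, N, 3)
ind = np.array([0, 2, 4, 1])   # one triangle ID per pixel row

print(arr[:, ind, :].shape)             # (4, 4, 3): every row x every index
print(arr[np.arange(P), ind, :].shape)  # (4, 3): row i paired with ind[i]
```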

Anyway, you can run the code below without CUDA by simply removing all the .cuda() calls.



NumPy Rasterizer


I found myself thinking about a certain really short MATLAB rasterizer, and I wanted to try something similar with NumPy. It draws only a single triangle because I couldn’t be bothered to work out how the tensor operations would go with multiple polygons. I didn’t really code golf it down to a small size either.

The code generates a 2D array with (y, x) pixel coordinates in each cell and then feeds those into the edge functions as arguments on lines 33-35. We can then extract the triangle pixel mask and the barycentrics to plot the pixel colors.
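A minimal single-triangle version of that idea might look like this (vertex positions and image size are made up; the line numbers mentioned above refer to the original code, not this sketch):

```python
import numpy as np

H, W = 64, 64
v0 = np.array([8.0, 8.0])    # (x, y) screen-space vertices
v1 = np.array([56.0, 16.0])
v2 = np.array([24.0, 56.0])

def edge(a, b, p):
    """Signed area of triangle (a, b, p); p broadcasts over the pixel grid."""
    return ((b[0] - a[0]) * (p[..., 1] - a[1])
          - (b[1] - a[1]) * (p[..., 0] - a[0]))

# (H, W, 2) array with a pixel-center coordinate in each cell
ys, xs = np.mgrid[0:H, 0:W]
p = np.stack([xs, ys], axis=-1) + 0.5

area = edge(v0, v1, v2)
b0 = edge(v1, v2, p) / area
b1 = edge(v2, v0, p) / area
b2 = edge(v0, v1, p) / area

mask = (b0 >= 0) & (b1 >= 0) & (b2 >= 0)   # triangle pixel mask
img = np.zeros((H, W, 3))
img[mask] = np.stack([b0, b1, b2], axis=-1)[mask]   # barycentrics as RGB
```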

It would be interesting to extend this to support multiple triangles in a single operation and then use PyTorch to run it on the GPU 🙂

Software Rasterizer: Environment Map Demo


I made a short demo that shows off an environment mapping effect rendered with my rasterizer. It’s pretty slow at 720p, but I made an offline-rendered 60 FPS video you can enjoy:

I calculate the reflection vector and the corresponding cube map texture coordinate per vertex and then interpolate them across the torus object. It’s a classic effect for a reason: shiny things just look so pretty.
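The reflection itself is the standard formula R = I − 2(N·I)N, and the cube map face can be picked from R’s major axis. A simplified sketch (real APIs also define per-face s/t flips, which I skip here):

```python
import numpy as np

def reflect(incident, normal):
    """Reflect a view-space incident vector about a unit surface normal."""
    return incident - 2.0 * np.dot(normal, incident) * normal

def cube_face(r):
    """Face index 0..5 = +X, -X, +Y, -Y, +Z, -Z by the major axis of r."""
    ax = int(np.argmax(np.abs(r)))
    return 2 * ax + (0 if r[ax] >= 0 else 1)

# looking straight at a surface reflects the view ray back at you
r = reflect(np.array([0.0, 0.0, -1.0]), np.array([0.0, 0.0, 1.0]))
print(r, cube_face(r))
```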

Windows executable: (19.5 MB)
You need the MSVC2017 redistributable to run it.

Software Rasterizer: Simpler Mipmaps

Per-polygon mipmaps work fine in blocky BSP levels. Blue shades indicate coarser texture miplevels.

I finally fixed that stupid z-buffering bug. It turns out I wasn’t interpolating the clip-space z in a perspective-correct way. Such a simple thing! I was 100% sure I had already verified it, but well, I hadn’t.

The miplevel changes aren’t very easy to see without the extra coloring.

Anyway, encouraged by the above, I went and fixed the nasty per-polygon mipmaps. Well, I didn’t really fix them, but I changed the miplevel picking to something less broken. The idea is to assume affine texture mapping and then calculate the texel-to-pixel ratio to use for picking the miplevel. The more texels we have per pixel, the coarser the miplevel LOD we need to pick. Coarser miplevels have larger values, and the original texture size is miplevel 0.

There’s a base-four logarithm in the formula since the texture area quadruples each time its dimensions double. I use the elementary identity log_4(x) = log_2(x) / log_2(4) to calculate the final miplevel.

I hope this code clarifies the concept:

// Calculate the triangle's position on the texture map
vec2 texSize = pipeline.currentTexture->size;
vec2 v0tex = v0attr.uv * texSize;
vec2 v1tex = v1attr.uv * texSize;
vec2 v2tex = v2attr.uv * texSize;

// Calculate the triangle area on screen and on the texture map
float screenarea = orient2dfloat(v0f, v1f, v2f);
float texarea = abs(orient2dfloat(v0tex, v1tex, v2tex));

// Bigger texels/pixel ratio implies a coarser miplevel
float ratio = texarea / screenarea;

// Standard miplevel picking and some bias
miplevel = (int)(log2(ratio) / log2(4.0) + 0.8);
miplevel = max(0, miplevel);
miplevel = min(MAX_MIPMAPS - 1, miplevel);
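In Python terms, the same computation looks like this (the bias is applied before the floor here, and the area values and mip chain length are made up for illustration):

```python
import math

MAX_MIPMAPS = 10   # hypothetical mip chain length

def pick_miplevel(texarea, screenarea, bias=0.8):
    ratio = texarea / screenarea                  # texel-to-pixel area ratio
    level = int(math.log2(ratio) / math.log2(4.0) + bias)
    return max(0, min(MAX_MIPMAPS - 1, level))

# 1600 texels squeezed into 100 pixels is 16 texels/pixel of area,
# and log4(16) = 2, so with the 0.8 bias we still land on level 2
print(pick_miplevel(1600.0, 100.0))
```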

Below is an illustration that shows the same triangle in screen and texture space.


It could be a bit more aggressive on sloped walls, and it also fails pretty badly with really large triangles. I didn’t take screenshots of the failures this time 🙂

It all works out if the level is tessellated enough and the triangles are regularly shaped. Note how the perpendicular walls get coarser miplevels, as they should.

Studying MAME Voodoo emulation

Sometimes you find Good Stuff on Twitter:

The Voodoo 2 emulation code in MAME is split into a couple of files:

  • vooddefs.h has the main scanline rasterizer loop and other interesting stuff.
  • voodoo.cpp has the rest of the meat: command buffer parsing & FIFO emulation, triangle setup code, blitting, and frame buffer access logic.
  • voodoo_pci.cpp handles PCI memory map emulation (?)
  • voodoo_rast.hxx instantiates game-specific rasterizer functions, I guess? The rasterizers aren’t called directly since worker threads are used instead.

So here are some highlights I found. Some of the code is somewhat indirect because it uses C macros extensively to support different renderers at compile time.

The exact dithering matrices used in the hardware.

A nice specsheet of the different cards:

Voodoo 2:
2,4MB frame buffer RAM
2,4,8,16MB texture RAM
90MHz clock frequency
clears @ 2 pixels/clock (RGB and depth simultaneously)
renders @ 1 pixel/clock
ultrafast clears @ 16 pixels/clock
128 entry PCI FIFO
memory FIFO up to 65536 entries

It’s ironic how it takes so much work and care to quickly render those sweet perspective-correct-bilinearly-interpolated-texture-mapped-depth-buffered-triangles, but then you advertise how quickly you can clear those off the frame buffer (16 pixels per clock!) 🙂

We learn that different interpolants have different bit depths. I haven’t checked the specs to see if this matches the hardware, but I presume it does.

iterated RGBA = 12.12 [24 bits]
iterated Z = 20.12 [32 bits]
iterated W = 18.32 [48 bits]

The input vertices have only four bits of subpixel precision and a 16-bit range, which maps to [-4096, 4095]. The maximum texture size is 16384 x 16384.

Vertex data is 12.4 formatted fixed point
S,T data is 14.18 formatted fixed point, converted to 16.32 internally [in the emulator]
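For illustration, 12.4 fixed point just means four fractional (subpixel) bits, i.e. coordinates scaled by 16:

```python
def to_fixed_12_4(x):
    """Encode a screen coordinate as 12.4 signed fixed point (scale by 2^4)."""
    return int(round(x * 16.0))

def from_fixed_12_4(v):
    """Decode a 12.4 fixed-point value back to a float coordinate."""
    return v / 16.0

print(to_fixed_12_4(2.5), from_fixed_12_4(to_fixed_12_4(-1.25)))
```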

The command format supports jump instructions! It would be interesting to know what they are good for.

Packet type 0: 1 or 2 words
 Word Bits
  0 31:29 = reserved
  0 28:6 = Address [24:2]
  0 5:3 = Function (0 = NOP, 1 = JSR, 2 = RET, 3 = JMP LOCAL, 4 = JMP AGP)
  0 2:0 = Packet type (0)
  1 31:11 = reserved (JMP AGP only)
  1 10:0 = Address [35:25]

Reciprocal (for perspective correction) and log2 (for MIP level selection) functions are precalculated to an array and linearly interpolated when needed. There’s also a separate integer log2 function.
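The lookup-table-plus-lerp trick is easy to sketch. Here’s a hypothetical 256-entry reciprocal table for the mantissa range [1, 2); the real hardware tables have their own sizes and fixed-point formats:

```python
TABLE_BITS = 8
STEP = 2 ** TABLE_BITS
# reciprocal samples over [1, 2], with one guard entry at the end
RECIP = [1.0 / (1.0 + i / STEP) for i in range(STEP + 1)]

def table_recip(x):
    """Approximate 1/x for x in [1, 2) via table lookup + linear interpolation."""
    t = (x - 1.0) * STEP
    i = int(t)
    frac = t - i
    return RECIP[i] * (1.0 - frac) + RECIP[i + 1] * frac

print(table_recip(1.5), 1.0 / 1.5)
```

With 256 entries the linear interpolation error for 1/x stays well below 1e-4, which is in the right ballpark for a 12-bit mantissa.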

The heart of the rasterizer: perspective division. It’s interesting how they calculate the LOD level with the same division function.

Apparently you can pick between w- or z-buffering. I really like the artisanal fixed-point to 32-bit float conversion they use (depthval is of type float):

int exp = count_leading_zeros(temp);
depthval = ((exp << 12) | ((~temp >> (19 - exp)) & 0xfff)) + 1;
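A quick transcription into Python shows that the encoding is monotone: larger fixed-point inputs always map to smaller encoded depths. I keep the integer bit pattern here rather than assigning to a float, and it’s only valid while the input has at most 19 leading zeros (i.e. temp >= 0x1000):

```python
def clz32(v):
    """Count leading zeros of a 32-bit value."""
    return 32 - v.bit_length()

def encode_depth(temp):
    """4-bit exponent in bits 15:12, inverted 12-bit mantissa below it."""
    exp = clz32(temp)
    mant = ((~temp & 0xFFFFFFFF) >> (19 - exp)) & 0xFFF
    return ((exp << 12) | mant) + 1

vals = [encode_depth(t) for t in (0x1000, 0x1001, 0x1FFF, 0x2000, 0x40000000)]
assert all(a > b for a, b in zip(vals, vals[1:]))   # strictly decreasing
```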

It’s a scanline renderer that iterates over horizontal spans.

MAME doesn’t handle trilinear filtering yet, but looks like they do support dithering between two miplevels:

    lod += dither4[x&3] << 4;

I guess the dithering works pretty well since I’ve never noticed it in action!

Overall this code is pretty nice to read since it’s not a hot mess of #ifdefs or scattered across tons of files/classes/interfaces as some hobbyist code tends to be. I’m still left with some questions, such as where the fill rules are evaluated and how the subpixel-accurate evaluation works out. Maybe it’s not a problem, since with only four bits of subpixel precision you can still use 32-bit integers?

Another good source for this study could be the PCem Voodoo code. It’s already mature enough to run Quake 3.

Software Rasterizer: Depth debugging

Some geometry poking through a wall has visibly wrong Z-buffer values.

I could already do depth buffering using a 1/w buffer (which is an abomination: w should be kept linear since it’s a scaled view-space depth, and that’s the whole point of a w-buffer, but I didn’t have the near and far plane values at hand). But it still bugs me that Z-buffering doesn’t work. I haven’t exactly solved the mystery yet, but at least I could verify that it’s not about buffer precision.

You can see in the screenshot above how the machine guns are visible through the wall, and the offending pixels get lower depth values (closer to the camera) even though they shouldn’t. So it seems my Z-interpolation might be somehow very broken.

Software Rasterizer: Drawing a Half-Life map

My earlier attempt at rendering this map looked terrible because the OBJ loader didn’t know how to split the vertex stream into separate objects based on each polygon’s material. Now it does, and also sorts the submeshes by texture for Ultimate Performance.

Z-buffer and a W-buffer

My Z-buffering was really glitchy earlier, and you could see some faces through walls. I first suspected the 16-bit precision, but expanding it to 24 bits didn’t help. I was originally storing 1/Z_window (in the [0, 1] range), which gets linearly interpolated in screen space. Moving the near plane further away didn’t have much effect either. For some reason the perspective transform gives me clip-space z coordinates that are very close to the clip-space w of the same vector. To get NDC we compute v.z / v.w, which then ends up basically in the [0.95, 1.0] range since the values are so close. It makes sense that the Z-buffer is broken, since it has to work with just 5% of the value range.
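This kind of value-range collapse is easy to see with a standard OpenGL-style projection. The near and far values below are made up, but the shape of the curve is the point: past a few units of depth, NDC z crowds hard toward 1.0:

```python
near, far = 0.1, 1000.0                 # hypothetical clip planes
A = -(far + near) / (far - near)        # z row of the projection matrix
B = -2.0 * far * near / (far - near)

for z_view in (-1.0, -10.0, -100.0, -1000.0):
    z_clip, w_clip = A * z_view + B, -z_view
    print(z_view, z_clip / w_clip)      # NDC z approaches 1.0 very quickly
```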

There’s the idea of W-buffering: use the perspective-correct interpolated clip-space w for depth comparisons. This is a scaled version of the view-space z coordinate, which is linear, and you need w for texture mapping anyway. However, in my case the w calculated inside the triangle-filling loop doesn’t lie in the [0, 1] range (it’s more like [0, 1000]), which probably means my vertex transformations are somehow busted. This is something I must review carefully next. Anyway, storing 1/w works just fine, so that’s what I used for the video.