PyTorch rasterizer on the GPU

torch_triangles
These three triangles were drawn on the GPU!

I expanded the NumPy rasterizer prototype I wrote about earlier with support for multiple triangles and GPU acceleration. It’s dog slow though 🙂

The way it works is that I use PyTorch to do the matrix operations instead of NumPy. Adding CUDA support on top of that is very simple; you just add some tactical .cuda() calls to your code and you’re done. Of course, to get real performance benefits the data flow of the program needs to be refactored and some operations replaced with their in-place alternatives.
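Just as a minimal sketch of what those tactical .cuda() calls amount to (this assumes a CUDA build of PyTorch, and the tensor names are made up for illustration):

import torch

# Hypothetical example: the same arithmetic runs on the CPU or the GPU
# depending on where the input tensors live.
verts = torch.rand(1000, 3, 2)        # lives in normal RAM
verts_gpu = verts.cuda()              # a copy now lives in GPU memory

centroids = verts_gpu.mean(dim=1)     # this reduction runs on the GPU

# Results come back to the CPU (e.g. for NumPy/matplotlib) with .cpu()
print(centroids.cpu().numpy().shape)  # (1000, 2)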

I verified with Process Explorer that, yes, the code does indeed do something with the GPU, as it allocates some 300 megs of VRAM:

torch_gpu_memory

About the code: it’s a really simple system in principle. The barycentrics get evaluated at every pixel for every triangle, producing a tensor (a multi-dimensional array) of size (H*W) x N x 3, where H and W are the image dimensions, N is the number of triangles, and 3 is the number of barycentrics. So it’s essentially an array of images, one per triangle.

From these barycentrics we can calculate a visibility mask (the mask tensor) for each image. Then for each pixel we find the ID of the last triangle that enclosed it (idmax), and use those indices to pick a pixel from the correct image (imgs[pixelrange, idmax, :]) to output in the final framebuffer. This means the triangles are drawn in the order they were specified in the original array.

In practice this was pretty annoying to write; for example, I got stuck in the final blitting part for a good couple of hours. In the end it was simple (found the answer here) and was just about replacing arr[:, ind, :] with arr[range(0, P), ind, :] (the sketch below shows the difference).
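To make the indexing difference concrete, here’s a tiny standalone sketch with made-up sizes: plain slicing with an index vector keeps an extra dimension, while pairing it with an explicit pixel range gathers exactly one image per pixel.

import torch

P, N = 4, 3                               # 4 pixels, 3 triangle "images"
imgs = torch.arange(P * N * 3, dtype=torch.float32).reshape(P, N, 3)
idmax = torch.tensor([0, 2, 1, 2])        # source image for each pixel

# Plain slicing selects the listed images for *every* pixel row.
print(imgs[:, idmax, :].shape)            # torch.Size([4, 4, 3])

# Pairing a pixel range with the indices picks one entry per pixel.
pixelrange = torch.arange(P)
print(imgs[pixelrange, idmax, :].shape)   # torch.Size([4, 3])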

Anyway, you can run the code below without CUDA by simply removing all the .cuda() calls.

import numpy as np
import matplotlib.pyplot as plt
from numpy import array
import torch
from torch import Tensor
"""
A dumb triangle rasterizer with PyTorch.
It evaluates the barycentrics for all image pixels for each triangle
and then picks the "colors" (just barycentrics again) for each pixel
based on which triangle rendered to that pixel last.
In other words it draws the triangles in order (painter's algorithm).
"""
width = 300
height = 200
# The triangle list
d = Tensor([
( (0.9, 0.5), (0.5, 0.8), (0.1, 0.15) ),
( (0.5, 0.1), (0.3, 0.85), (0.8, 0.25) ),
( (0.3, 0.15), (0.1, 0.2), (0.15, 0.05) ),
]).cuda()
N = d.size()[0]
P = height*width # The number of pixels in the output image.
# Calculates the signed distance from the edge v0-v1 to point p
def edgefunc(v0, v1, p):
"""
let S = H*W
v0 and v1 have vertex positions for all N triangles.
Their shapes are [N x 2]
p is a list of sampling points as a [N x S x 2] tensor.
Each of the N triangles has an [S x 2] matrix of sampling points.
returns a [N x S] matrix
"""
S = p.size()[1]
# Take all the x and y coordinates of all the positions as a
# [N x S] tensor
px = p[:, :, 1].cuda()
py = p[:, :, 0].cuda()
# We need to manually broadcast the vector to cover all sample points
y01 = v0[:,0] v1[:,0] # [N]
x10 = v1[:,1] v0[:,1] # [N]
y01 = y01.unsqueeze(0).t().repeat((1, S)).cuda() # [N x S]
x10 = x10.unsqueeze(0).t().repeat((1, S)).cuda() # [N x S]
cross = v0[:,1]*v1[:,0] v0[:,0]*v1[:,1] # [N]
cross = cross.unsqueeze(0).t().repeat((1,S)) # [N x S]
return y01*px + x10*py + cross
# Calculate the area of the parallelogram formed by the triangle
area = edgefunc(d[:, 2, :], d[:, 1, :], d[:, None, 0, :])
# Create a grid of sampling positions
ys = np.linspace(0, 1, height, endpoint=False)
xs = np.linspace(0, 1, width, endpoint=False)
xmesh, ymesh = np.meshgrid(xs, ys)
# Reshape the sampling positions to a H x W x 2 tensor
gridcpu = np.moveaxis(array(list(zip(ymesh, xmesh))), 1, 2)
gridcpu = np.reshape(gridcpu, (height*width, 2))
grid = Tensor(gridcpu)
grid = grid.unsqueeze(0).repeat((N, 1, 1)) # [N x P x 2]
# Evaluate the edge functions at every position.
# We should get a [N x P] vector out of each.
w0 = edgefunc(d[:, 1, :], d[:, 2, :], grid) / area
w1 = edgefunc(d[:, 2, :], d[:, 0, :], grid) / area
w2 = edgefunc(d[:, 0, :], d[:, 1, :], grid) / area
ids = torch.tensor(range(0, N), dtype=torch.long)
# Only pixels inside the triangles will have color
# [N x P]
mask = (w0 > 0) & (w1 > 0) & (w2 > 0)
# Pack the images to a tensor for pixel-wise indexing.
# Each triangle will have its own image.
#
# Size will be [P x N x 3] which is kind of weird but we need it (maybe?)
# for indexing later.
imgs = torch.stack([w0,w1,w2], dim=1).transpose(0,2).transpose(1,2)
# Construct a vector of length P that will tell for each pixel from which
# image we should fetch the pixel.
ids = ids.unsqueeze(0).t().repeat((1,height*width))
idmask = ids * mask.type(torch.LongTensor)
idmax = torch.max(idmask, dim=0)[0]
# Pick a rendered pixel from the correct image (specified by idmax) for each
# output pixel.
# "pixelrange" simply selects all pixels. For some reason just using
# imgs[:, idmax, :] wouldn't get rid of the middle dimension and produced
# a matrix with incorrect dimensions.
pixelrange = torch.arange(P, dtype=torch.long)
img2 = imgs[pixelrange, idmax, :].t()
# [3 x P]
buf = torch.zeros((3, mask.size()[1]), dtype=torch.float32)
buf[:,:] = img2
# Collapse all masks into one and zero out pixels with no coverage.
mask2, _ = torch.max(mask, dim=0)
buf[:, 1 - mask2] = 0
w0cpu = array(w0)
w1cpu = array(w1)
w2cpu = array(w2)
maskcpu = array(mask)
bufcpu = array(buf)
data = np.zeros((height * width, 3), dtype=np.uint8)
data[:,:] = array(buf.t()*255)
plt.imshow(np.reshape(data, (height, width, 3)), interpolation='nearest')
plt.show()

 

 

NumPy Rasterizer

numpy_triangle

I found myself thinking about a certain really short MATLAB rasterizer, and I wanted to try doing something similar with NumPy. It draws only a single triangle because I couldn’t be bothered to work out how the tensor operations would go with multiple polygons. I didn’t really code golf it down to a small size either.

The code generates a 2D array with (y, x) pixel coordinates in each cell and then feeds those into the edge functions (the w0, w1, and w2 evaluations). We can then extract the triangle pixel mask and the barycentrics to plot the pixel colors.

It would be interesting to extend this to support multiple triangles in a single operation and then use PyTorch to run it on the GPU 🙂

import numpy as np
import matplotlib.pyplot as plt
from numpy import array
width = 100
height = 80
# The triangle
d = np.array([ [ [0.9, 0.5], [0.5, 0.8], [0.1, 0.15] ] ])
# Calculates the distance from the edge v0-v1 to point p
def edgefunc(v0, v1, p):
    px = p[:, 1]
    py = p[:, 0]
    return (v0[:,0] - v1[:,0])*px + (v1[:,1] - v0[:,1])*py + (v0[:,1]*v1[:,0] - v0[:,0]*v1[:,1])
# Calculate the area of the parallelogram formed by the triangle
area = edgefunc(d[:, 2, :], d[:, 1, :], d[:, 0, :])
# Create a grid of sampling positions
ys = np.linspace(0, 1, height, endpoint=False)
xs = np.linspace(0, 1, width, endpoint=False)
xpos = np.tile(xs, (height, 1))
ypos = np.tile(ys, (width, 1)).transpose()
# Reshape the sampling positions to a H x W x 2 tensor
pos = np.moveaxis(array(list(zip(ypos, xpos))), 1, 2)
pos = np.reshape(pos, (height*width, 2))
# Evaluate the edge functions at every position
w0 = edgefunc(d[:, 2, :], d[:, 1, :], pos)
w1 = edgefunc(d[:, 0, :], d[:, 2, :], pos)
w2 = edgefunc(d[:, 1, :], d[:, 0, :], pos)
# Only pixels inside the triangle will have color
mask = (w0 > 0) & (w1 > 0) & (w2 > 0)
data = np.zeros((height * width, 3), dtype=np.uint8)
data[:,0] = (w0 / area[0])*255*mask
data[:,1] = (w1 / area[0])*255*mask
data[:,2] = (w2 / area[0])*255*mask
plt.imshow(np.reshape(data, (height, width, 3)), interpolation='nearest')
plt.show()


Software Rasterizer: Environment Map Demo

envmapped_big.jpg

I made a short demo that shows off an environment mapping effect rendered with my rasterizer. It’s pretty slow in 720p, but I made an offline-rendered 60 FPS video you can enjoy:

I calculate the reflection vector and the corresponding cube map texture coordinate per vertex and then interpolate them across the torus object. It’s a classic effect for a reason: shiny things just look so pretty.
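Roughly, the per-vertex math is the standard reflection formula plus a cube map face selection. Here’s a small NumPy sketch of the idea (this isn’t my actual C++ code, and the face/UV conventions are a simplified guess rather than what any particular API uses):

import numpy as np

def reflect(incident, normal):
    # R = I - 2 * dot(N, I) * N, with both vectors normalized
    return incident - 2.0 * np.dot(normal, incident) * normal

def cubemap_face_uv(r):
    # Pick the face from the dominant axis, then project the other two
    # components onto that face. Sign/orientation conventions vary
    # between APIs; this is just one plausible choice.
    ax = np.argmax(np.abs(r))
    m = np.abs(r[ax])
    if ax == 0:   u, v = r[2], r[1]
    elif ax == 1: u, v = r[0], r[2]
    else:         u, v = r[0], r[1]
    face = ax * 2 + (0 if r[ax] >= 0 else 1)
    return face, 0.5 * (u / m + 1.0), 0.5 * (v / m + 1.0)

view_dir = np.array([0.0, 0.0, -1.0])      # from camera towards the vertex
normal = np.array([0.0, 0.7071, -0.7071])  # unit vertex normal
r = reflect(view_dir, normal)
print(cubemap_face_uv(r))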

Windows executable: soiwroteasoftwarerasterizer.zip (19.5 MB)
You need the MSVC2017 redistributable to run it.

Software Rasterizer: Simpler Mipmaps

miplevels_overview
Per-polygon mipmaps work fine in blocky BSP levels. Blue shades imply coarser texture miplevels.

I finally fixed that stupid z-buffering bug. Turns out I wasn’t interpolating the clip-space z in a perspective-correct way. Such a simple thing! I was 100% sure I had already verified it before, but well, I hadn’t.

mipanim
The miplevel changes aren’t very easy to see without the extra coloring

Anyway, encouraged by the above, I went and fixed the nasty per-polygon mipmaps. Well, I didn’t really fix them, but I changed the miplevel picking to something less broken. My idea is to assume affine texture mapping, and then calculate the texel-to-pixel ratio to use for picking the miplevel. The more texels we have per pixel, the coarser the miplevel LOD we need to pick. Coarser miplevels have larger values, and the original texture size is miplevel 0.

There’s a base-four logarithm in the formula since the texture area quadruples each time its dimensions double. I use the elementary identity log_4(x) = log_2(x) / log_2(4) to calculate the final miplevel.

I hope this code clarifies the concept:

// Calculate the triangle's position on the texture map
vec2 texSize = pipeline.currentTexture->size;
vec2 v0tex = v0attr.uv * texSize;
vec2 v1tex = v1attr.uv * texSize;
vec2 v2tex = v2attr.uv * texSize;

// Calculate the triangle area on screen and on the texture map
float screenarea = orient2dfloat(v0f, v1f, v2f);
float texarea = abs(orient2dfloat(v0tex, v1tex, v2tex));

// Bigger texels/pixel ratio implies a coarser miplevel
float ratio = texarea / screenarea;

// Standard miplevel picking and some bias
miplevel = (int)(log2(ratio) / log2(4.0) + 0.8);
miplevel = max(0, miplevel);
miplevel = min(MAX_MIPMAPS - 1, miplevel);

Below is an illustration that shows the same triangle in screen and texture space.

mipmap_formula2

It could be a bit more aggressive on sloped walls, and it also fails pretty badly with really large triangles. I didn’t take screenshots of the failures this time 🙂

miplevels
It all works out if the level is tessellated enough and the triangles are regularly shaped. Note how the perpendicular walls get coarser miplevels, as they should.

Studying MAME Voodoo emulation

Sometimes you find Good Stuff on Twitter.

The Voodoo 2 emulation code in MAME is split into a couple of files:

  • vooddefs.h: Has the main scanline rasterizer loop and other interesting stuff.
  • voodoo.cpp: Rest of the meat: command buffer parsing & FIFO emulation, triangle setup code, blitting, frame buffer access logic.
  • voodoo_pci.cpp: PCI memory map emulation (?)
  • voodoo_rast.hxx: Instantiates game-specific rasterizer functions, I guess? The rasterizers are not called directly as worker threads are used instead.

So here are some highlights I found. Some of the code is somewhat indirect because they use C macros extensively to support different renderers at compile time.

The exact dithering matrices used in the hardware.

A nice specsheet of the different cards:

Voodoo 2:
2,4MB frame buffer RAM
2,4,8,16MB texture RAM
90MHz clock frequency
clears @ 2 pixels/clock (RGB and depth simultaneously)
renders @ 1 pixel/clock
ultrafast clears @ 16 pixels/clock
128 entry PCI FIFO
memory FIFO up to 65536 entries

It’s ironic how it takes so much work and care to quickly render those sweet perspective-correct-bilinearly-interpolated-texture-mapped-depth-buffered-triangles, but then you advertise how quickly you can clear those off the frame buffer (16 pixels per clock!) 🙂

We learn that different interpolants have different bit depths. I haven’t checked whether this matches the hardware specs, but I presume it does.

iterated RGBA = 12.12 [24 bits]
iterated Z = 20.12 [32 bits]
iterated W = 18.32 [48 bits]

The input vertices have only four bits of subpixel precision and a 16-bit range, which maps to [-4096, 4095]. The maximum texture size is 16384 x 16384.

Vertex data is 12.4 formatted fixed point
S,T data is 14.18 formatted fixed point, converted to 16.32 internally [in the emulator]

The command format supports jump instructions! It would be interesting to know what they are good for.

Packet type 0: 1 or 2 words
 Word Bits
  0 31:29 = reserved
  0 28:6 = Address [24:2]
  0 5:3 = Function (0 = NOP, 1 = JSR, 2 = RET, 3 = JMP LOCAL, 4 = JMP AGP)
  0 2:0 = Packet type (0)
  1 31:11 = reserved (JMP AGP only)
  1 10:0 = Address [35:25]

Reciprocal (for perspective correction) and log2 (for MIP level selection) functions are precalculated to an array and linearly interpolated when needed. There’s also a separate integer log2 function.
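To illustrate the table-plus-lerp technique in isolation (this is not MAME’s actual table layout or size, just the general idea): precompute the function on a coarse grid and blend between neighbouring entries at lookup time.

import numpy as np

# Precompute 1/x on a coarse grid over [1, 2), a normalized mantissa range.
ENTRIES = 64
xs = 1.0 + np.arange(ENTRIES + 1) / ENTRIES
recip_table = 1.0 / xs

def fast_recip(x):
    # Assumes x has already been scaled into [1, 2); real hardware would
    # handle the exponent separately and work in fixed point.
    t = (x - 1.0) * ENTRIES
    i = int(t)
    frac = t - i
    return recip_table[i] * (1.0 - frac) + recip_table[i + 1] * frac

print(fast_recip(1.37), 1.0 / 1.37)   # close, but not exact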

The heart of the rasterizer: perspective division. It’s interesting how they calculate the LOD level with the same division function.

Apparently you can pick between w or z buffering. I really like the artisanal fixed point to 32-bit float conversion they use (depthval is of type float):

int exp = count_leading_zeros(temp);
depthval = ((exp << 12) | ((~temp >> (19 - exp)) & 0xfff)) + 1;

It’s a scanline renderer that iterates over horizontal spans.

MAME doesn’t handle trilinear filtering yet, but it looks like they do support dithering between two miplevels:

if (TEXMODE_ENABLE_LOD_DITHER(TEXMODE))
    lod += dither4[x&3] << 4;

I guess the dithering works pretty well since I’ve never noticed it in action!

Overall this code is pretty nice to read since it’s not a hot mess of #ifdefs or scattered across tons of files/classes/interfaces as some hobbyist code tends to be. I’m still left with some questions, such as where the fill rules are evaluated and how the subpixel-accurate evaluation works out. Maybe it’s not a problem, since with only four bits of subpixel precision you can still use 32-bit integers?

Another good source for this study could be the PCem Voodoo code. It’s already mature enough to run Quake 3.

Software Rasterizer: Depth debugging

depthbug
Some geometry poking through a wall has visibly wrong Z-buffer values.

I could already do depth buffering using a 1/w buffer (which is an abomination: w should be kept linear since it’s a scaled view-space depth, and that’s the whole point of a w-buffer, but I didn’t have the near and far plane values at hand). But it’s still bugging me that Z-buffering doesn’t work. I didn’t exactly solve the mystery yet, but at least I could verify that it’s not about buffer precision.

You can see in the screenshot above how the machine guns are visible through the wall and the offending pixels get lower depth values (closer to camera) even though they shouldn’t. So it seems like my Z-interpolation might be somehow very broken.

Software Rasterizer: Drawing a Half-Life map

My earlier attempt at rendering this map looked terrible because the OBJ loader didn’t know how to split the vertex stream into separate objects based on each polygon’s material. Now it does, and also sorts the submeshes by texture for Ultimate Performance.

zbuffer_woes
Z-buffer and a W-buffer

My Z-buffering was really glitchy earlier and you could see some faces through walls. I first suspected it was because of the 16-bit precision, but expanding it to 24 bits didn’t help. I was originally storing 1/Z_window (in the [0, 1] range), which gets linearly interpolated in screen space. Moving the near plane further away didn’t have much effect either. For some reason the perspective transform gives me clip space z coordinates that are very close to the clip space w of the same vector. Then to get NDC we do v.z / v.w, which ends up being basically in the [0.95, 1.0] range since the values were so close. It makes sense that the Z-buffer is broken since it has to work with just 5% of the value range.
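A quick numeric sanity check makes the problem concrete. With a hypothetical OpenGL-style projection and made-up near/far planes of 0.1 and 1000, almost everything that isn’t right next to the camera lands within a few percent of NDC z = 1.0:

near, far = 0.1, 1000.0   # made-up plane distances for illustration

def ndc_z(view_depth):
    # z_clip / w_clip for a point at this positive view-space distance,
    # using a standard OpenGL-style projection matrix.
    return (far + near) / (far - near) - 2.0 * far * near / ((far - near) * view_depth)

for d in (1.0, 5.0, 50.0, 500.0):
    print(d, ndc_z(d))
# 1.0   -> ~0.800
# 5.0   -> ~0.960
# 50.0  -> ~0.996
# 500.0 -> ~1.000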

There’s the idea of W-buffering: use the perspective-correct interpolated clip space w for depth comparisons. This is a scaled version of your view space z coordinate, which is linear, and you need w anyway for texture mapping. However, in my case the w calculated inside the triangle filling loop doesn’t lie in the [0, 1] range (more like [0, 1000]), which probably means my vertex transformations are somehow busted. This is something I must carefully review next. Anyway, storing 1/w works just fine so I used that for the video.

Software Rasterizer: More broken models

hl_broken.png

I’m trying to load Half-Life: Uplink‘s maps but all textures are mixed up! I also need some freeflight camera controls. The map model is converted to Wavefront OBJ with Crafty. It also neatly exports all the textures as separate Truevision TGA images and lists them in an .mtl file. For some reason I had to run the tool in a Windows 7 virtual machine to get it to work.

I decided to always keep all textures as ARGB in memory just to keep things simple. Alpha tested textures will probably be the only “blending” the renderer will support. I will post more screenshots once the map looks at least somewhat correct. Of course the lightmaps will be missing because a) they are not exported, and b) the renderer supports no blending.

Currently the rasterizer loop calculates the barycentrics and then interpolates the texture coordinates and Z correspondingly. I got the idea of simplifying it to just integer additions, as they do in scanline-oriented rasterizers, but I'm running out of precision since I want to prenormalize the interpolants with the triangle area first. Concretely, my current code is the following:

int64_t area = orient2d(v0, v1, v2); // two times triangle area
float oneOverArea = 1.0f/area; // this is very small
float znorm[3] = { v0f.z, v1f.z, v2f.z };
for (int i = 0; i < 3; i++)
    znorm[i] *= oneOverArea;

// and then inside the rasterizer loop
// w0, w1, and w2 are the barycentrics

float z = w0 * znorm[0] + w1 * znorm[1] + w2 * znorm[2];

The problem is that znorm becomes very small after oneOverArea multiplication, and converting it into fixed point loses a lot of precision, even with something crazy like 4.28 integer/fractional split. The normalization could be done inside the loop too but that would neuter the gains we wanted from the optimization in the first place. I must be missing something very elementary here.

Edit: The only real solution seems to be to move the oneOverArea multiplication back to the inner loop to keep the fixed point interpolant precision reasonable.
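A tiny numeric experiment (with made-up numbers) shows how little precision survives the prenormalization: once the vertex depth has been multiplied by oneOverArea, even a 4.28 fixed point encoding has only a handful of meaningful bits left, whereas deferring the division keeps full float precision.

# Toy numbers: a large triangle and a depth value at one vertex.
area = 2_000_000              # two times the triangle area, in subpixel units
one_over_area = 1.0 / area
z_vertex = 0.73               # window-space depth at that vertex

FRAC_BITS = 28                # the "4.28" fixed point split

# Prenormalize first, then quantize: only ~98 distinct steps remain.
znorm = z_vertex * one_over_area
znorm_fixed = round(znorm * (1 << FRAC_BITS))
print(znorm_fixed)            # 98

# Interpolating with a mid-triangle barycentric weight:
w0 = area // 3
approx = (znorm_fixed * w0) / (1 << FRAC_BITS)
exact = z_vertex * w0 * one_over_area
print(approx, exact)          # already differs in the fourth decimal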

Software Rasterizer: Fixing interpolation

wireframe_sponza
A wireframe overlay hides broken UV maps of the arches on the sides.

My perspective-correct attribute interpolation was all wrong. I had made multiple glaring mistakes:

  • The end point values of the interpolants were divided by screen space Z, not clip space w as they should be. It still worked surprisingly well but went crazy near the camera. I got confused about this since, if I got this right, the clip space w is just a sign-flipped eye space z coordinate. Something to verify on paper, I guess. (There’s a small sketch of the correct recipe after this list.)
  • Z got interpolated as 1/Z even though it’s just fine to keep it screen space linear (for proof, see W Pleasure, W Fun, James F. Blinn, 1998).
  • All screen space Z values were by accident in [0.5, 1.5] range instead of [0, 1].
  • The near clip plane was set to Z=0 but projection matrix gave the clipping coordinates in [-1, 1]^3 space.
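For reference, here’s a minimal sketch of the interpolation recipe I should have been following (plain Python, made-up input values): divide the attributes and the constant 1 by each vertex’s clip-space w, interpolate those linearly with the screen-space barycentrics, and divide out the interpolated 1/w per pixel.

# Per-vertex clip-space w and a UV attribute (made-up values).
w    = [2.0, 5.0, 10.0]
uvs  = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
bary = (0.2, 0.3, 0.5)        # screen-space barycentrics of one pixel

# Interpolate attribute/w and 1/w linearly in screen space...
u_over_w   = sum(b * uv[0] / wi for b, uv, wi in zip(bary, uvs, w))
v_over_w   = sum(b * uv[1] / wi for b, uv, wi in zip(bary, uvs, w))
one_over_w = sum(b / wi for b, wi in zip(bary, w))

# ...then divide out the interpolated 1/w to get the perspective-correct UV.
u = u_over_w / one_over_w
v = v_over_w / one_over_w
print(u, v)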

I don’t have any real tests, but liberally adding asserts all over the place seemed to be pretty effective in catching the stupidest bugs. The first two were gross misunderstandings, and it’s hard to find those with just tests.

Software Rasterizer: Rendering real models

sponza2_mips
The Sponza model shows how some UVs are still broken. Renders at roughly 10 Hz at 1024×768.

It’s fun to finally see some actual 3D models! I use tinyobjloader to load them. The earlier near clipping problems were caused by having the window Z-coordinate (the coordinate space you get after the w division and viewport shift) in the [-1, 1] range instead of [0, 1] as it should be.
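The fix is presumably just the usual affine remap in the viewport transform; a one-line sketch, assuming OpenGL-style [-1, 1] NDC depth:

def window_z(ndc_z):
    # Map NDC depth in [-1, 1] to window depth in [0, 1].
    return 0.5 * ndc_z + 0.5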

I’m still doing something wrong when it comes to interpolation: it breaks down as the positions approach the Z=0 near clip plane. This makes sense, as the 1/z division will explode and therefore break the calculated UVs. This is very visible in big polygons such as the floor. I should research some other implementations to see how they deal with this.

I also added some hacky per-polygon mipmapping which seems to work fine, at least for static images. Maybe I’ll write another post about it later. Below is the same image without mipmaps (full size images: no mipmaps, mipmaps):

sponza2_nomips

My fill rules might still be wrong, as I get some dark pixels with this cow mesh:

fillrules

Once I get the interpolation sorted out I can look into multitexturing, or maybe just blending so one could do multipass rendering instead.