---
type: article
identifier: opengl-fractal-explorer
title: GPU-Accelerated Fractal Explorer
description: Using OpenGL's compute shaders to dispatch fractal computation to the GPU and render in realtime.
datestring: 2023-12-07
banner_image: /static/images/mandelbrot.png
links:
Source Code: https://github.com/JoshuaS3/zydeco/tree/fractal
The Mandelbrot Set: https://en.wikipedia.org/wiki/Mandelbrot_set
IEEE 754: https://en.wikipedia.org/wiki/IEEE_754
OpenGL Compute Shaders: https://www.khronos.org/opengl/wiki/Compute_Shader
Adam7 Algorithm: https://en.wikipedia.org/wiki/Adam7_algorithm
OpenGL Memory Model: https://www.khronos.org/opengl/wiki/Memory_Model
---
I've been toying around for a while with an idea for a procedural world
generation + simulation project as an experiment in C++ and graphics
programming to teach myself more about computer science and rendering
techniques. Part of this is, of course, setting up the infrastructure for input
handling, world logic, debug menus, and rendering. When writing the initial
code, I used the Mandelbrot set for testing. This led me down a rabbit hole of
improving my rendering techniques for this application, as well as trying out
different fractals, ultimately culminating in this GPU-accelerated fractal
explorer (transform, zoom, pan) with progressive refine:
Video compression doesn't show the progressive refine element very well; it's explained in more detail later in this writeup.
**Outline**
1. [Fractal Sets](#fractal-sets)
1. [The Mandelbrot Set](#the-mandelbrot-set)
2. [The Tricorn Set](#the-tricorn-set)
3. [The Burning Ship Fractal](#the-burning-ship-fractal)
2. [Notes on Fractal Computation](#notes-on-fractal-computation)
1. [Divergence](#divergence)
2. [Iteration Count](#iteration-count)
    3. [Floating-Point Precision](#floating-point-precision)
3. [Rendering on the GPU](#rendering-on-the-gpu)
1. [Using a Fragment Shader](#using-a-fragment-shader)
2. [Using a Compute Shader](#using-a-compute-shader)
3. [Progressive Refine](#progressive-refine)
# Fractal Sets
## The Mandelbrot Set
The [Mandelbrot set](https://en.wikipedia.org/wiki/Mandelbrot_set) is defined
to be the set of all numbers *c* in the complex plane for which the following
sequence (what I call a "z-transform") *does not* diverge to infinity:
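$$z_{n+1} = z_n^2 + c, \qquad z_0 = 0$$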
Note that the z-*squared* term is squaring a complex number, given by the
following:
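$$(a + bi)^2 = a^2 - b^2 + 2abi$$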
where *a* is the real term and *b* is the imaginary term.
When rendering, we take the real axis to be *x* and the imaginary axis to be
*y*. Points (numbers) in the set are colored black, and points not in the set
are colored with a brightness corresponding to the number of iterations
required until divergence.
The above unassuming sequence and rules of complex algebra result in perhaps
the most popular fractal shape, which exhibits infinite complexity at the
boundary of the set and yields new patterns—including copies and variations of
the set itself!—wherever you zoom in, forever.
Needless to say, I've been pretty fascinated by it. It isn't the only fractal
set, though. You can generate more interesting shapes and patterns by simply
modifying the original sequence, or by coming up with something new entirely.
You can also add an extra parameter to play around with, which lets you
transform between fractals. I don't get very scientific with it; you can see it
used in the video to morph between fractals. Most random variants, however, are
relatively boring in that they:

1. don't produce more than one or two patterns,
2. produce patterns that are just the Mandelbrot set again (which is itself an
   interesting bit of emergence), or
3. devolve into noise when zooming in most places.

There are a couple of exceptions worth noting:
## The Tricorn Set
The Tricorn set is a variant of the Mandelbrot set that uses the *conjugate* of
z, which inverts the sign of the imaginary term.
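In other words, the sequence becomes:

$$z_{n+1} = \bar{z}_n^{\,2} + c, \qquad \overline{a + bi} = a - bi$$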
## The Burning Ship Fractal
A more well-known variant of the Mandelbrot set is the Burning Ship fractal,
which takes the *absolute values* of z's real and imaginary components before squaring:
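$$z_{n+1} = \left(\lvert\operatorname{Re}(z_n)\rvert + i\,\lvert\operatorname{Im}(z_n)\rvert\right)^2 + c$$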
The most interesting part about this one is actually the figure to the left,
which is what the fractal is named after.
# Notes on Fractal Computation
I want to talk about some nuances of computing and rendering fractals. As you
would expect, for a full quality render you will need to compute iterations for
every pixel in the image. **This is very computationally expensive.**
Even more troublesome is having to calculate this for *every frame* when
panning around and zooming in if you're writing a realtime explorer.
## Divergence
Let's first define what is meant by "diverge" when iterating over a
z-transform. Mathematically, this means the point transforms off to infinity.
We can discard a point from the set long before infinity though—in fact, for
the three fractals mentioned above, any complex number with a distance from the
origin *greater than 2* will diverge during a z-transform. By storing the square
of this threshold—so we never have to take a square root when applying the
Pythagorean theorem—in a `discard_threshold_squared` constant or parameter, we
can speed things up by bailing out before running unneeded iterations in our
compute code:
```c
const int discard_threshold_squared = 4;
```
```c
// [inside z-transform loop]
if ((a*a + b*b) > discard_threshold_squared)
{
    // [store current iteration count for purpose of
    //  coloring, indicating point is not in set]
    break;
}
```
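For context, here's a bare-bones version of the whole per-point loop as it
might look on the CPU. The variable names mirror the snippet above, `c_re`/`c_im`
are the coordinates of the point being tested, and `max_iteration_count` is the
subject of the next section:

```c
// illustrative escape-time loop for a single point c = (c_re, c_im)
double a = 0.0, b = 0.0;             // z starts at 0 + 0i
int iterations = -1;                 // -1 == never diverged (point is in the set)
for (int i = 0; i < max_iteration_count; i++)
{
    double new_a = a * a - b * b + c_re;  // real part of z^2 + c
    double new_b = 2.0 * a * b + c_im;    // imaginary part of z^2 + c
    a = new_a;
    b = new_b;
    if ((a * a + b * b) > discard_threshold_squared)
    {
        iterations = i;              // store iteration count for coloring
        break;
    }
}
```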
Points in the set will not exit the sequence early. An implication of this is
that *the more points in the set the frame contains, the longer the frame will
take to render.*
## Iteration Count
We also need to define a maximum iteration count, the number of iterations it
takes to confidently say "this point does not diverge." This makes for another
design consideration, though. Note in the screenshots above how points closer
to the set are brighter; this means it takes more iterations for those points
to diverge. From this, it should follow that **increasing the maximum number
of iterations will lead to greater detail at the bounds of the set**. If we
set the iteration count too low, we get undetailed renders like the following
(compare to previous screenshot).
Not only that, but zooming in only makes the boundary seem coarser. To
compensate for this, I define my maximum iteration count to actually be a
function of zoom level, where `n0` is a "base" iteration count parameter and
`s` is the scale (decreases when zooming in).
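Just to illustrate the shape of such a function (the exact form and constants
here are my own stand-ins, not necessarily what the project uses), something
logarithmic in the zoom works:

```c
#include <math.h>

// illustrative only: max iterations grow logarithmically as the scale s shrinks.
// n0 is the "base" iteration count; the 0.5 multiplier is an arbitrary tuning knob.
int compute_max_iterations(int n0, double s)
{
    double zoom_boost = fmax(0.0, log2(1.0 / s));
    return (int)(n0 * (1.0 + 0.5 * zoom_boost));
}
```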
This largely fixes the coarseness of the shape when zooming in, but poses a new
issue. At a certain point when zooming in, the iteration count will become so
large that framerate begins to drop. In the [Rendering on the
GPU](#rendering-on-the-gpu) section I detail a rendering method called
*interlacing* (or *progressive refine*) that lets us split up the work of a
render across multiple frames.
## Floating-Point Precision
The primary limitation with a realtime fractal renderer like this is computer
hardware architecture. For most applications, computers store decimal numbers
according to standard [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) (or
variants thereof), which, in essence, represent decimal numbers in scientific
notation form, comprising a significand ("mantissa") and an exponent. On
modern CPUs and GPUs, there are floating-point arithmetic units (FPUs) built
into hardware that make computation with floating-point types significantly
faster than it would be with a software-only implementation. Floating-point
formats come in sizes of 16 bits (half precision/FP16), 32 bits (single
precision/FP32), 64 bits (double precision/FP64), and, more rarely, 128 bits
(quad precision/FP128). As the names indicate, more bits means greater
precision.
This pertains to computing fractals because the points needed on the complex
plane are decimals, not integers. **Zooming into one point—effectively
increasing the number of decimal places encoded by each pixel's location—only
increases the precision required in computing.**
Limiting ourselves to hardware floating-point implementations caps our
precision at FP128. In reality, if we're running this on the GPU, we're capped
to FP64, since most GPU architectures don't support FP128. (OpenGL's shader
language doesn't even provide a quad-width type, e.g. `long double`. Even for
CPU architectures, hardware support for FP128 is
[iffy](https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Hardware_support).)
Also, since graphics applications generally don't need more than 32 bits of
floating-point precision, GPUs tend to have only 32-bit wide FPUs, with a slow
processing path for FP64 (about 1/64 the speed of FP32 according to some
benchmarks). Despite this, GPUs have significantly more floating-point
execution units than CPUs, so the performance sacrifice is still acceptable for
the extra precision.
With 64-bit precision on the GPU, we can zoom in by roughly 14 orders of
magnitude before we hit our precision limit (an FP64 value carries about 15-16
significant decimal digits, and a few of those are already spent just spanning
the pixels on screen).
This isn't ideal but it's the best I could come up with (or cared to,
considering this was not a planned project) for a realtime renderer. Fractal
dive renderers use high-level CPU software implementations like
[BigFloat](https://github.com/nicowilliams/bigfloat) for arbitrary-precision
floating point computation, but this would be disastrously slow for a realtime
application (and be incompatible with GPU acceleration).
# Rendering on the GPU
## Using a Fragment Shader
The oversimplified 10,000-foot view of the basic OpenGL rendering pipeline for
an object is as follows:
1. You give the graphics card mesh data and some arbitrary program-defined
render settings
2. A vertex shader interprets this mesh data as primitive shapes, e.g.
triangles, with world position data
3. A fragment shader uses the geometry of the primitive each fragment (pixel,
basically) is attached to and the given arbitrary render settings (including
textures) to color that fragment
4. The graphics card does some linear algebra magic to combine the computed
data for all objects into a rendered scene, the framebuffer
This is the explain-like-I'm-five version. If you actually know
OpenGL you know this is so insanely simplified it could just be called "wrong,"
but it's a good enough overview for the purposes of this writeup.
Rather than doing fractal computation on the CPU and loading the result into a
texture, my initial attempt used an OpenGL fragment shader to do computation
entirely on the GPU. It seemed like a good idea; given two triangles filling
the entire screen, the fragment shader is executed for every pixel of the frame
and outputs a color for that pixel. However, there were some drawbacks:
1. Everything was recomputed and redrawn every frame, even if the user hadn't
moved, unnecessarily consuming GPU resources
2. This was extraordinarily slow at high iteration counts (high zoom)
3. Progressive refine—either splitting iterations across frames or breaking the
frame up into chunks—wasn't feasible
Clearly, a different strategy was needed.
## Using a Compute Shader
The [compute shader](https://www.khronos.org/opengl/wiki/Compute_Shader) is
OpenGL's interface for general purpose GPU (GPGPU) programming—analogous to
NVIDIA's CUDA, which lets you use a GPU for arbitrary floating-point-intensive
computation. Compute shaders provide cross-platform support and integrate with
the rest of the graphics library, for things like accessing textures. Perfect!!
As they're meant to be used for GPGPU programming, compute shaders aren't built
into the core OpenGL rendering pipeline. They must be explicitly invoked
(dispatched) separately from the rendering sequence. This is actually great for our
renderer because being able to control when computes happen means we can have
them run only when they're actually needed (when the user pans, zooms, or
transforms).
```cpp
// inside application C++ "world logic"
// once we determine we need to recompute after some user input, we can do it like this:
// bind texture/image to save computed data in
m_pTexture->BindAsImage();
// upload arbitrary data/settings (x/y position, zoom level, screen size, iteration count)
m_pComputeShaderUniformUploader->UploadUniforms(program);
// do compute across the width and height of the screen split up into 32x32 pixel chunks
glDispatchCompute((m_windowWidth+31)/32, (m_windowHeight+31)/32, 1);
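// (not shown: the compute shader program is bound with glUseProgram before dispatching,
//  and the rendering pass needs a glMemoryBarrier call, e.g. GL_TEXTURE_FETCH_BARRIER_BIT,
//  before it samples the texture written here)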
```
Splitting up the screen into 32x32 chunks is pretty arbitrary. GPUs are very
heavily designed for parallelism, so, generally, splitting work up into chunks
means things will get done more quickly, but there is an upper limit to this.
In my experimentation, 32x32 chunks seemed to work best.
The compute shader looks like this:
```glsl
#version 460 core
// define the size of the local working group to be 32x32x1
// main() runs once per pixel in each 32x32 region
layout (local_size_x = 32, local_size_y = 32, local_size_z = 1) in;
// arbitrary data/settings ("uniforms") uploaded by the application
layout(rgba32f, binding = 0) uniform image2D texture0;
uniform ivec2 screen_size;
uniform dvec2 offset; // indicates x/y position within the fractal
uniform double scale; // decreases when zooming in
uniform float discard_threshold_squared;
uniform int max_iteration_count;
void main()
{
    // 1. convert screen pixel location to world/graph space
    // 2. run z-transform
    // 3. store iteration count into texture
}
```
The only data we're storing here is a single number per pixel: the number of
iterations it takes for the pixel's corresponding point to diverge. The same
texture is then fed into the fragment shader during the rendering stage, which
reads these values and spits out colors to the screen accordingly. `-1` can be
used to indicate a point that doesn't diverge (in the set, colored black).
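To make the three commented steps above concrete, here's a minimal sketch of
what a `main()` body might contain, written against the uniforms declared above;
the exact coordinate mapping and the choice of storing the count in the red
channel are assumptions on my part:

```glsl
void main()
{
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    if (pixel.x >= screen_size.x || pixel.y >= screen_size.y)
        return; // edge work groups can spill past the window

    // 1. convert screen pixel location to world/graph space
    dvec2 c = (dvec2(pixel) - dvec2(screen_size) * 0.5lf) * scale + offset;

    // 2. run z-transform
    dvec2 z = dvec2(0.0lf);
    int iterations = -1; // -1 == assumed not to diverge (in the set)
    for (int i = 0; i < max_iteration_count; i++)
    {
        z = dvec2(z.x * z.x - z.y * z.y, 2.0lf * z.x * z.y) + c;
        if (float(dot(z, z)) > discard_threshold_squared)
        {
            iterations = i;
            break;
        }
    }

    // 3. store iteration count into texture (red channel)
    imageStore(texture0, pixel, vec4(float(iterations), 0.0, 0.0, 0.0));
}
```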
This is great and all, but it doesn't really solve the issue with high
iteration counts slowing things down. When zooming in a lot, the application
becomes so slow that doing anything has a more-than-noticeable input latency
(see this in the video around 0:15). This is where we implement a method for
*progressive refine*.
## Progressive Refine
Progressive refine is the act of taking an intensive piece of work and breaking
it down into multiple chunks *over time*. This is commonly done in dedicated
renderers when you want a preview of a render that will take a while, or in
networking when loading images over a slow connection; it quickly gives you an
image at, for example, 1/64 full quality, then not-so-quickly an image at 1/32,
then 1/16, and so on, with each step taking longer on average than the
previous. Inspired by a friend's [use of the Bayer matrix for this
purpose](https://jbaker.graphics/writings/bayer.html), I used a similar
**interlacing pattern** defined by the **[Adam7
algorithm](https://en.wikipedia.org/wiki/Adam7_algorithm)**, which splits
work in an 8x8 grid across seven steps:
```glsl
const int ADAM7_MATRIX[8][8] = {
    {1, 6, 4, 6, 2, 6, 4, 6},
    {7, 7, 7, 7, 7, 7, 7, 7},
    {5, 6, 5, 6, 5, 6, 5, 6},
    {7, 7, 7, 7, 7, 7, 7, 7},
    {3, 6, 4, 6, 3, 6, 4, 6},
    {7, 7, 7, 7, 7, 7, 7, 7},
    {5, 6, 5, 6, 5, 6, 5, 6},
    {7, 7, 7, 7, 7, 7, 7, 7},
};
```
For this pattern of interlacing, the 1st and 2nd steps actually
always take the same amount of time because they do the same amount of work.
Every step after that, though, takes twice as long as the
previous.
Incorporating this into the fractal renderer, we can pass a number to the
compute shader indicating which step we're on (1-7). The
compute shader then indexes the above matrix at the pixel's relative position,
compares that value to the instructed interlace step, and only computes if
the values are equal.
```glsl
uniform int interlace_layer;
```
```glsl
// [in main()]
// 32x32 work group size means we'll have 16 internal 8x8 grids for interlacing.
// Check where the current pixel is within whichever 8x8 grid it falls in:
int relative_pixel_grid_pos_x = int(gl_LocalInvocationID.x % 8u);
int relative_pixel_grid_pos_y = int(gl_LocalInvocationID.y % 8u);
bool do_compute = (ADAM7_MATRIX[relative_pixel_grid_pos_y][relative_pixel_grid_pos_x] == interlace_layer);
if (do_compute)
{
    // z-transform
}
```
For each pixel, the routine then either stores the newly computed value or, if
nothing was computed, leaves the texture untouched. Correspondingly, the
fragment shader must be updated for this behavior as well. Following step 1,
most elements in the texture will be empty; to prevent the screen from
displaying mostly uncomputed pixels (undefined behavior?), we should check
whether a pixel has been computed before trying to use it, and fall back to the
nearest computed pixel if it hasn't been.
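As a sketch of what that check could look like in the rendering fragment shader
(the `-2.0` "not yet computed" sentinel, the sampler name, and falling back to
the top-left pixel of the 8x8 tile, which step 1 always fills first, are all
assumptions of mine rather than the exact implementation):

```glsl
uniform sampler2D iteration_tex; // the texture written by the compute shader

float fetch_iterations(ivec2 pixel)
{
    float n = texelFetch(iteration_tex, pixel, 0).r;
    if (n > -1.5)
        return n; // already computed: either a real count or -1 ("in the set")

    // not computed yet: fall back to the top-left pixel of this 8x8 tile,
    // which interlace step 1 always computes first
    ivec2 tile_origin = pixel - ivec2(pixel.x % 8, pixel.y % 8);
    return texelFetch(iteration_tex, tile_origin, 0).r;
}
```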
The resulting behavior is this:
Now, when zooming in at high iteration count, the first compute step is only
doing 1/64 (~1.56%) of the computations it normally would, keeping things
within the span of a single frame, i.e. preventing framerate drops. This is
great for user actions, where zooming and panning happen for as long as the
mouse input lasts—potentially hundreds of frames.
Suppose a single step takes longer than a frame to compute. Well, that's no
worry either, since *compute shaders act independently of the rendering
pipeline*: you can use asynchronous OpenGL interfaces like [fences and memory
barriers](https://www.khronos.org/opengl/wiki/Memory_Model) to hold off on
dispatching the compute for the next step until the previous one has finished,
leaving the framerate unaffected.
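As a sketch of how that could be wired up with OpenGL sync objects (the dispatch
arguments are reused from earlier; `DispatchNextInterlaceStep` is a placeholder
for the application's own logic):

```cpp
// after dispatching one interlace step, drop a fence into the GPU command stream
glDispatchCompute((m_windowWidth+31)/32, (m_windowHeight+31)/32, 1);
GLsync compute_fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// then, once per frame, poll the fence before considering the next step
GLint fence_status = 0;
glGetSynciv(compute_fence, GL_SYNC_STATUS, sizeof(fence_status), nullptr, &fence_status);
if (fence_status == GL_SIGNALED)
{
    glDeleteSync(compute_fence);
    // make the image writes visible to the next dispatch's image accesses
    glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
    DispatchNextInterlaceStep(); // placeholder for the application's own logic
}
```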
My implementation can actually take this a step further, allowing the point's
z-transform itself to be distributed across computes, resulting in *multiple*
interlacing passes. It does this by storing another two elements per pixel in
the texture: the real and imaginary components of the z-transform output, to
pick up from at the start of the next compute. This poses the same precision
issues as before, however: OpenGL textures only let you store elements up to
FP32 precision, so it doesn't work very well at high zoom levels. We can work
around this by using a separate texture with four
elements per pixel: two 32-bit elements for the real component, and two more
32-bit elements for the imaginary component. See
[floatBitsToInt](https://registry.khronos.org/OpenGL-Refpages/gl4/html/floatBitsToInt.xhtml)
and related GLSL functions for a way you might accomplish this.
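For example, `unpackDouble2x32` and `packDouble2x32` (relatives of
`floatBitsToInt`) can round-trip a double through two 32-bit unsigned channels
of an integer image; roughly sketched (the image format, binding, and names here
are illustrative):

```glsl
// sketch: a separate integer image holding the packed z state per pixel
layout(rgba32ui, binding = 1) uniform uimage2D z_state;

void store_z(ivec2 pixel, double z_real, double z_imag)
{
    // unpackDouble2x32 splits a double into two 32-bit unsigned ints (low, high)
    uvec2 re_bits = unpackDouble2x32(z_real);
    uvec2 im_bits = unpackDouble2x32(z_imag);
    imageStore(z_state, pixel, uvec4(re_bits, im_bits));
}

void load_z(ivec2 pixel, out double z_real, out double z_imag)
{
    uvec4 bits = imageLoad(z_state, pixel);
    // packDouble2x32 reassembles the original double, bit for bit
    z_real = packDouble2x32(bits.xy);
    z_imag = packDouble2x32(bits.zw);
}
```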