---
type: article
identifier: opengl-fractal-explorer
title: GPU-Accelerated Fractal Explorer
description: Using OpenGL's compute shaders to dispatch fractal computation to the GPU and render in realtime.
datestring: 2023-12-07
banner_image: /static/images/mandelbrot.png
links:
Source Code: https://github.com/JoshuaS3/zydeco/tree/fractal
The Mandelbrot Set: https://en.wikipedia.org/wiki/Mandelbrot_set
IEEE 754: https://en.wikipedia.org/wiki/IEEE_754
OpenGL Compute Shaders: https://www.khronos.org/opengl/wiki/Compute_Shader
Adam7 Algorithm: https://en.wikipedia.org/wiki/Adam7_algorithm
OpenGL Memory Model: https://www.khronos.org/opengl/wiki/Memory_Model
---
I've been toying around for a while with an idea for a procedural world
generation + simulation project as an experiment in C++ and graphics
programming to teach myself more about computer science and rendering
techniques. Part of this is, of course, setting up the infrastructure for input
handling, world logic, debug menus, and rendering. When writing the initial
code, I used the Mandelbrot set for testing. This led me down a rabbit hole of
improving my rendering techniques for this application, as well as trying out
different fractals, ultimately culminating in this GPU-accelerated fractal
explorer (transform, zoom, pan) with progressive refine:
Video compression doesn't show the progressive refine element very well; it's explained in more detail later in this writeup.
**Outline**
1. [Fractal Sets](#fractal-sets)
1. [The Mandelbrot Set](#the-mandelbrot-set)
2. [The Tricorn Set](#the-tricorn-set)
3. [The Burning Ship Fractal](#the-burning-ship-fractal)
2. [Notes on Fractal Computation](#notes-on-fractal-computation)
1. [Divergence](#divergence)
2. [Iteration Count](#iteration-count)
    3. [Floating-Point Precision](#floating-point-precision)
3. [Rendering on the GPU](#rendering-on-the-gpu)
1. [Using a Fragment Shader](#using-a-fragment-shader)
2. [Using a Compute Shader](#using-a-compute-shader)
3. [Progressive Refine](#progressive-refine)
# Fractal Sets
## The Mandelbrot Set
The [Mandelbrot set](https://en.wikipedia.org/wiki/Mandelbrot_set) is defined
to be the set of all numbers *c* in the complex plane for which the following
sequence (what I call a "z-transform") *does not* diverge to infinity:
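$$z_{n+1} = z_n^2 + c, \qquad z_0 = 0$$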
Note that the z-*squared* term is squaring a complex number, given by the
following:
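$$(a + bi)^2 = a^2 - b^2 + 2abi$$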
where *a* is the real term and *b* is the imaginary term.
When rendering, we take the real axis to be *x* and the imaginary axis to be
*y*. Points (numbers) in the set are colored black, and points not in the set
are colored with a brightness corresponding to the number of iterations
required until divergence.
The above unassuming sequence and rules of complex algebra result in perhaps
the most popular fractal shape, which exhibits infinite complexity at the
boundary of the set and yields new patterns—including copies and variations of
the set itself!—wherever you zoom in, forever.
Needless to say, I've been pretty fascinated by it. It isn't the only fractal
set, though. You can generate more interesting shapes and patterns by simply
modifying the original sequence, or by coming up with something new entirely.
You can also add an extra parameter to play around with, which lets you
transform between fractals. I don't get very scientific with it; you can see it
used in the video to morph between fractals. Most random variants, however, are
relatively boring in that they:

1. don't produce more than one or two patterns,
2. produce patterns that are just the Mandelbrot set again (which is itself an
   interesting bit of emergence), or
3. devolve into noise when zooming in most places.

There are a couple of exceptions worth noting:
## The Tricorn Set
The Tricorn set is a variant of the Mandelbrot set that uses the *conjugate* of
z, which inverts the sign of the imaginary term.
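In other words, the sequence becomes:

$$z_{n+1} = \bar{z}_n^{\,2} + c, \qquad \overline{a + bi} = a - bi$$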
## The Burning Ship Fractal
A more well-known variant of the Mandelbrot set is the Burning Ship fractal,
which takes the *absolute values* of z's real and imaginary components before squaring:
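$$z_{n+1} = \left(\lvert\operatorname{Re}(z_n)\rvert + i\,\lvert\operatorname{Im}(z_n)\rvert\right)^2 + c$$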
The most interesting part about this one is actually the figure to the left,
which is what the fractal is named after.
# Notes on Fractal Computation
I want to talk about some nuances of computing and rendering fractals. As you
would expect, for a full quality render you will need to compute iterations for
every pixel in the image. **This is very computationally expensive.**
Even more troublesome is having to calculate this for *every frame* when
panning around and zooming in if you're writing a realtime explorer.
## Divergence
Let's first define what is meant by "diverge" when iterating over a
z-transform. Mathematically, this means the point transforms off to infinity.
We can discard a point from the set long before infinity though—in fact, for
the three fractals mentioned above, any complex number with a distance from the
origin *greater than 2* will diverge during a z-transform. By storing the square
of this threshold—so we never have to take a square root when applying the
Pythagorean theorem—in a `discard_threshold_squared` constant or parameter, we
can speed things up by bailing out before running unneeded iterations in our
compute code:
```c
const int discard_threshold_squared = 4;
```
```c
// [inside z-transform loop]
if ((a*a + b*b) > discard_threshold_squared)
{
    // [store current iteration count for purpose of
    //  coloring, indicating point is not in set]
    break;
}
```
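For context, here's a bare-bones version of the whole per-point loop as it
might look on the CPU. The variable names mirror the snippet above, `c_re`/`c_im`
are the coordinates of the point being tested, and `max_iteration_count` is the
subject of the next section:

```c
// illustrative escape-time loop for a single point c = (c_re, c_im)
double a = 0.0, b = 0.0;             // z starts at 0 + 0i
int iterations = -1;                 // -1 == never diverged (point is in the set)
for (int i = 0; i < max_iteration_count; i++)
{
    double new_a = a * a - b * b + c_re;  // real part of z^2 + c
    double new_b = 2.0 * a * b + c_im;    // imaginary part of z^2 + c
    a = new_a;
    b = new_b;
    if ((a * a + b * b) > discard_threshold_squared)
    {
        iterations = i;              // store iteration count for coloring
        break;
    }
}
```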
Points in the set will not exit the sequence early. An implication of this is
that *the more points in the set the frame contains, the longer the frame will
take to render.*
## Iteration Count
We also need to define a maximum iteration count, the number of iterations it
takes to confidently say "this point does not diverge." This makes for another
design consideration, though. Note in the screenshots above how points closer
to the set are brighter; this means it takes more iterations for those points
to diverge. From this, it should follow that **increasing the maximum number
of iterations will lead to greater detail at the bounds of the set**. If we
set the iteration count too low, we get undetailed renders like the following
(compare to previous screenshot).
Not only that, but zooming in only makes the boundary seem coarser. To
compensate for this, I define my maximum iteration count to actually be a
function of zoom level, where `n0` is a "base" iteration count parameter and
`s` is the scale (decreases when zooming in).
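Just to illustrate the shape of such a function (the exact form and constants
here are my own stand-ins, not necessarily what the project uses), something
logarithmic in the zoom works:

```c
#include <math.h>

// illustrative only: max iterations grow logarithmically as the scale s shrinks.
// n0 is the "base" iteration count; the 0.5 multiplier is an arbitrary tuning knob.
int compute_max_iterations(int n0, double s)
{
    double zoom_boost = fmax(0.0, log2(1.0 / s));
    return (int)(n0 * (1.0 + 0.5 * zoom_boost));
}
```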
This largely fixes the coarseness of the shape when zooming in, but poses a new
issue. At a certain point when zooming in, the iteration count will become so
large that framerate begins to drop. In the [Rendering on the
GPU](#rendering-on-the-gpu) section I detail a rendering method called
*interlacing* (or *progressive refine*) that lets us split up the work of a
render across multiple frames.
## Floating-Point Precision
The primary limitation with a realtime fractal renderer like this is computer
hardware architecture. For most applications, computers store decimal numbers
according to standard [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) (or
variants thereof), which, in essence, represent decimal numbers in scientific
notation form, comprising a significand ("mantissa") and an exponent. On
modern CPUs and GPUs, there are floating-point arithmetic units (FPUs) built
into hardware that make computation with floating-point types significantly
faster than it would be with a software-only implementation. Floating-point
formats come in sizes of 16 bits (half precision/FP16), 32 bits (single
precision/FP32), 64 bits (double precision/FP64), and, more rarely, 128 bits
(quad precision/FP128). As the names indicate, more bits means greater
precision.
This pertains to computing fractals because the points needed on the complex
plane are decimals, not integers. **Zooming into one point—effectively
increasing the number of decimal places encoded by each pixel's location—only
increases the precision required in computing.**
Limiting ourselves to hardware floating-point implementations caps our
precision at FP128. In reality, if we're running this on the GPU, we're capped
to FP64, since most GPU architectures don't support FP128. (OpenGL's shader
language doesn't even provide a quad-width type, e.g. `long double`. Even for
CPU architectures, hardware support for FP128 is
[iffy](https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Hardware_support).)
Also, since graphics applications generally don't need more than 32 bits of
floating-point precision, GPUs tend to have only 32-bit wide FPUs, with a slow
processing path for FP64 (about 1/64 the speed of FP32 according to some
benchmarks). Despite this, GPUs have significantly more floating-point
execution units than CPUs, so the performance sacrifice is still acceptable for
the extra precision.
With 64-bit precision on the GPU, we can zoom in by roughly 14 orders of
magnitude before we hit our precision limit (an FP64 value carries about 15-16
significant decimal digits, and a few of those are already spent just spanning
the pixels on screen).
This isn't ideal but it's the best I could come up with (or cared to,
considering this was not a planned project) for a realtime renderer. Fractal
dive renderers use high-level CPU software implementations like
[BigFloat](https://github.com/nicowilliams/bigfloat) for arbitrary-precision
floating point computation, but this would be disastrously slow for a realtime
application (and be incompatible with GPU acceleration).
# Rendering on the GPU
## Using a Fragment Shader
The oversimplified 10,000-foot view of the basic OpenGL rendering pipeline for
an object is as follows:
1. You give the graphics card mesh data and some arbitrary program-defined
render settings
2. A vertex shader interprets this mesh data as primitive shapes, e.g.
triangles, with world position data
3. A fragment shader uses the geometry of the primitive each fragment (pixel,
basically) is attached to and the given arbitrary render settings (including
textures) to color that fragment
4. The graphics card does some linear algebra magic to combine the computed
data for all objects into a rendered scene, the framebuffer
This is the explain-like-I'm-five version. If you actually know
OpenGL you know this is so insanely simplified it could just be called "wrong,"
but it's a good enough overview for the purposes of this writeup.
Rather than doing fractal computation on the CPU and loading the result into a
texture, my initial attempt used an OpenGL fragment shader to do computation
entirely on the GPU. It seemed like a good idea; given two triangles filling
the entire screen, the fragment shader is executed for every pixel of the frame
and outputs a color for that pixel. However, there were some drawbacks:
1. Everything was recomputed and redrawn every frame, even if the user hadn't
moved, unnecessarily consuming GPU resources
2. This was extraordinarily slow at high iteration counts (high zoom)
3. Progressive refine—either splitting iterations across frames or breaking the
frame up into chunks—wasn't feasible
Clearly, a different strategy was needed.
## Using a Compute Shader
The [compute shader](https://www.khronos.org/opengl/wiki/Compute_Shader) is
OpenGL's interface for general purpose GPU (GPGPU) programming—analogous to
NVIDIA's CUDA, which lets you use a GPU for arbitrary floating-point-intensive
computation. Compute shaders provide cross-platform support and integrate with
the rest of the graphics library, for things like accessing textures. Perfect!!
As they're meant to be used for GPGPU programming, compute shaders aren't built
into the core OpenGL rendering pipeline. They must be explicitly invoked
(dispatched) separately from the rendering sequence. This is actually great for our
renderer because being able to control when computes happen means we can have
them run only when they're actually needed (when the user pans, zooms, or
transforms).
```cpp
// inside application C++ "world logic"
// once we determine we need to recompute after some user input, we can do it like this:
// bind texture/image to save computed data in
m_pTexture->BindAsImage();
// upload arbitrary data/settings (x/y position, zoom level, screen size, iteration count)
m_pComputeShaderUniformUploader->UploadUniforms(program);
// do compute across the width and height of the screen split up into 32x32 pixel chunks
glDispatchCompute((m_windowWidth+31)/32, (m_windowHeight+31)/32, 1);
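// (not shown: the compute shader program is bound with glUseProgram before dispatching,
//  and the rendering pass needs a glMemoryBarrier call, e.g. GL_TEXTURE_FETCH_BARRIER_BIT,
//  before it samples the texture written here)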
```
Splitting up the screen into 32x32 chunks is pretty arbitrary. GPUs are very
heavily designed for parallelism, so, generally, splitting work up into chunks
means things will get done more quickly, but there is an upper limit to this.
In my experimentation, 32x32 chunks seemed to work best.
The compute shader looks like this:
```glsl
#version 460 core
// define the size of the local working group to be 32x32x1
// main() runs once per pixel in each 32x32 region
layout (local_size_x = 32, local_size_y = 32, local_size_z = 1) in;
// arbitrary data/settings ("uniforms") uploaded by the application
layout(rgba32f, binding = 0) uniform image2D texture0;
uniform ivec2 screen_size;
uniform dvec2 offset; // indicates x/y position within the fractal
uniform double scale; // decreases when zooming in
uniform float discard_threshold_squared;
uniform int max_iteration_count;
void main()
{
    // 1. convert screen pixel location to world/graph space
    // 2. run z-transform
    // 3. store iteration count into texture
}
```
The only data we're storing here is a single number per pixel: the number of
iterations it takes for the pixel's corresponding point to diverge. The same
texture is then fed into the fragment shader during the rendering stage, which
reads these values and spits out colors to the screen accordingly. `-1` can be
used to indicate a point that doesn't diverge (in the set, colored black).
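To make the three commented steps above concrete, here's a minimal sketch of
what a `main()` body might contain, written against the uniforms declared above;
the exact coordinate mapping and the choice of storing the count in the red
channel are assumptions on my part:

```glsl
void main()
{
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    if (pixel.x >= screen_size.x || pixel.y >= screen_size.y)
        return; // edge work groups can spill past the window

    // 1. convert screen pixel location to world/graph space
    dvec2 c = (dvec2(pixel) - dvec2(screen_size) * 0.5lf) * scale + offset;

    // 2. run z-transform
    dvec2 z = dvec2(0.0lf);
    int iterations = -1; // -1 == assumed not to diverge (in the set)
    for (int i = 0; i < max_iteration_count; i++)
    {
        z = dvec2(z.x * z.x - z.y * z.y, 2.0lf * z.x * z.y) + c;
        if (float(dot(z, z)) > discard_threshold_squared)
        {
            iterations = i;
            break;
        }
    }

    // 3. store iteration count into texture (red channel)
    imageStore(texture0, pixel, vec4(float(iterations), 0.0, 0.0, 0.0));
}
```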
This is great and all, but it doesn't really solve the issue with high
iteration counts slowing things down. When zooming in a lot, the application
becomes so slow that doing anything has a more-than-noticeable input latency
(see this in the video around 0:15). This is where we implement a method for
*progressive refine*.
## Progressive Refine
Progressive refine is the act of taking an intensive piece of work and breaking
it down into multiple chunks *over time*. This is commonly done in dedicated
renderers when you want a preview of a render that will take a while, or in
networking when loading images over a slow connection; it quickly gives you an
image at, for example, 1/64 full quality, then not-so-quickly an image at 1/32,
then 1/16, and so on, with each step taking longer on average than the
previous. Inspired by a friend's [use of the Bayer matrix for this
purpose](https://jbaker.graphics/writings/bayer.html), I used a similar
**interlacing pattern** defined by the **[Adam7
algorithm](https://en.wikipedia.org/wiki/Adam7_algorithm)**, which splits
work in an 8x8 grid across seven steps:
```glsl
const int ADAM7_MATRIX[8][8] = {
    {1, 6, 4, 6, 2, 6, 4, 6},
    {7, 7, 7, 7, 7, 7, 7, 7},
    {5, 6, 5, 6, 5, 6, 5, 6},
    {7, 7, 7, 7, 7, 7, 7, 7},
    {3, 6, 4, 6, 3, 6, 4, 6},
    {7, 7, 7, 7, 7, 7, 7, 7},
    {5, 6, 5, 6, 5, 6, 5, 6},
    {7, 7, 7, 7, 7, 7, 7, 7},
};
```
For this pattern of interlacing, the 1st and 2nd steps actually
always take the same amount of time because they do the same amount of work.
Every step after that, though, takes twice as long as the
previous.
Incorporating this into the fractal renderer, we can pass a number to the
compute shader indicating which step we're on (1-7). The
compute shader then indexes the above matrix at the pixel's relative position,
compares that value to the instructed interlace step, and only computes if
the values are equal.
```glsl
uniform int interlace_layer;
```
```glsl
// [in main()]
// 32x32 work group size means we'll have 16 internal 8x8 grids for interlacing.
// Check where the current pixel is within whichever 8x8 grid it falls in:
int relative_pixel_grid_pos_x = int(gl_LocalInvocationID.x % 8u);
int relative_pixel_grid_pos_y = int(gl_LocalInvocationID.y % 8u);
bool do_compute = (ADAM7_MATRIX[relative_pixel_grid_pos_y][relative_pixel_grid_pos_x] == interlace_layer);
if (do_compute)
{
    // z-transform
}
```
For each pixel, the routine then either stores the newly computed value or, if
nothing was computed, leaves the texture untouched. Correspondingly, the
fragment shader must be updated for this behavior as well. Following step 1,
most elements in the texture will be empty; to prevent the screen from
displaying mostly uncomputed pixels (undefined behavior?), we should check
whether a pixel has been computed before trying to use it, and fall back to the
nearest computed pixel if it hasn't been.
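As a sketch of what that check could look like in the rendering fragment shader
(the `-2.0` "not yet computed" sentinel, the sampler name, and falling back to
the top-left pixel of the 8x8 tile, which step 1 always fills first, are all
assumptions of mine rather than the exact implementation):

```glsl
uniform sampler2D iteration_tex; // the texture written by the compute shader

float fetch_iterations(ivec2 pixel)
{
    float n = texelFetch(iteration_tex, pixel, 0).r;
    if (n > -1.5)
        return n; // already computed: either a real count or -1 ("in the set")

    // not computed yet: fall back to the top-left pixel of this 8x8 tile,
    // which interlace step 1 always computes first
    ivec2 tile_origin = pixel - ivec2(pixel.x % 8, pixel.y % 8);
    return texelFetch(iteration_tex, tile_origin, 0).r;
}
```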
The resulting behavior is this:
Now, when zooming in at high iteration count, the first compute step is only
doing 1/64 (~1.56%) of the computations it normally would, keeping things
within the span of a single frame, i.e. preventing framerate drops. This is
great for user actions, where zooming and panning happen for as long as the
mouse input lasts—potentially hundreds of frames.
Suppose a single step takes longer than a frame to compute. Well, that's no
worry either, since *compute shaders act independently of the rendering
pipeline*: you can use asynchronous OpenGL interfaces like [fences and memory
barriers](https://www.khronos.org/opengl/wiki/Memory_Model) to hold off on
dispatching the compute for the next step until the previous one has finished,
leaving the framerate unaffected.
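As a sketch of how that could be wired up with OpenGL sync objects (the dispatch
arguments are reused from earlier; `DispatchNextInterlaceStep` is a placeholder
for the application's own logic):

```cpp
// after dispatching one interlace step, drop a fence into the GPU command stream
glDispatchCompute((m_windowWidth+31)/32, (m_windowHeight+31)/32, 1);
GLsync compute_fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// then, once per frame, poll the fence before considering the next step
GLint fence_status = 0;
glGetSynciv(compute_fence, GL_SYNC_STATUS, sizeof(fence_status), nullptr, &fence_status);
if (fence_status == GL_SIGNALED)
{
    glDeleteSync(compute_fence);
    // make the image writes visible to the next dispatch's image accesses
    glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
    DispatchNextInterlaceStep(); // placeholder for the application's own logic
}
```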
My implementation can actually take this a step further, allowing the point's
z-transform itself to be distributed across computes, resulting in *multiple*
interlacing passes. It does this by storing another two elements per pixel in
the texture: the real and imaginary components of the z-transform output, to
pick up from at the start of the next compute. This poses the same precision
issues as before, however: OpenGL textures only let you store elements up to
FP32 precision, so it doesn't work very well at high zoom levels. We can work
around this by using a separate texture with four
elements per pixel: two 32-bit elements for the real component, and two more
32-bit elements for the imaginary component. See
[floatBitsToInt](https://registry.khronos.org/OpenGL-Refpages/gl4/html/floatBitsToInt.xhtml)
and related GLSL functions for a way you might accomplish this.
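For example, `unpackDouble2x32` and `packDouble2x32` (relatives of
`floatBitsToInt`) can round-trip a double through two 32-bit unsigned channels
of an integer image; roughly sketched (the image format, binding, and names here
are illustrative):

```glsl
// sketch: a separate integer image holding the packed z state per pixel
layout(rgba32ui, binding = 1) uniform uimage2D z_state;

void store_z(ivec2 pixel, double z_real, double z_imag)
{
    // unpackDouble2x32 splits a double into two 32-bit unsigned ints (low, high)
    uvec2 re_bits = unpackDouble2x32(z_real);
    uvec2 im_bits = unpackDouble2x32(z_imag);
    imageStore(z_state, pixel, uvec4(re_bits, im_bits));
}

void load_z(ivec2 pixel, out double z_real, out double z_imag)
{
    uvec4 bits = imageLoad(z_state, pixel);
    // packDouble2x32 reassembles the original double, bit for bit
    z_real = packDouble2x32(bits.xy);
    z_imag = packDouble2x32(bits.zw);
}
```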