---
type: article
identifier: opengl-fractal-explorer
title: GPU-Accelerated Fractal Explorer
description: Using OpenGL's compute shaders to dispatch fractal computation to the GPU and render in realtime.
datestring: 2023-12-07
banner_image: /static/images/mandelbrot.png
links:
Source Code: https://github.com/JoshuaS3/zydeco/tree/fractal
The Mandelbrot Set: https://en.wikipedia.org/wiki/Mandelbrot_set
IEEE 754: https://en.wikipedia.org/wiki/IEEE_754
OpenGL Compute Shaders: https://www.khronos.org/opengl/wiki/Compute_Shader
Adam7 Algorithm: https://en.wikipedia.org/wiki/Adam7_algorithm
OpenGL Memory Model: https://www.khronos.org/opengl/wiki/Memory_Model
---
I've been toying around for a while with an idea for a procedural world
generation + simulation project as an experiment in C++ and graphics
programming to teach myself more about computer science and rendering
techniques. Part of this is, of course, setting up the infrastructure for input
handling, world logic, debug menus, and rendering. When writing the initial
code, I used the Mandelbrot set for testing. This led me down a rabbit hole of
improving my rendering techniques for this application, as well as trying out
different fractals, ultimately culminating in this GPU-accelerated fractal
explorer (transform, zoom, pan) with progressive refine:
Video compression doesn't show the progressive refine element very well; it's explained in detail later in this article.
**Outline**
1. [Fractal Sets](#fractal-sets)
1. [The Mandelbrot Set](#the-mandelbrot-set)
2. [The Tricorn Set](#the-tricorn-set)
3. [The Burning Ship Fractal](#the-burning-ship-fractal)
2. [Notes on Fractal Computation](#notes-on-fractal-computation)
1. [Divergence](#divergence)
2. [Iteration Count](#iteration-count)
3. [Floating-Point Precision](#floating-point-precision)
3. [Rendering on the GPU](#rendering-on-the-gpu)
1. [Using a Fragment Shader](#using-a-fragment-shader)
2. [Using a Compute Shader](#using-a-compute-shader)
3. [Progressive Refine](#progressive-refine)
# Fractal Sets
## The Mandelbrot Set
The [Mandelbrot set](https://en.wikipedia.org/wiki/Mandelbrot_set) is defined
to be the set of all numbers *c* in the complex plane for which the following
sequence (what I call a "z-transform") *does not* diverge to infinity:
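$$z_{n+1} = z_n^2 + c, \qquad z_0 = 0$$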
Note that the z-*squared* term is squaring a complex number, given by the
following:
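$$(a + bi)^2 = (a^2 - b^2) + (2ab)i$$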
where *a* is the real term and *b* is the imaginary term.
When rendering, we take the real axis to be *x* and the imaginary axis to be
*y*. Points (numbers) in the set are colored black, and points not in the set
are colored with a brightness corresponding to the number of iterations
required until divergence.
The above unassuming sequence and rules of complex algebra result in perhaps
the most popular fractal shape, which exhibits infinite complexity at the
boundary of the set and yields new patterns—including copies and variations of
the set itself!—wherever you zoom in, forever.
Needless to say, I've been pretty fascinated by it. This isn't the only
fractal set, though. You can generate more interesting shapes and patterns by
modifying the original sequence, or by coming up with something new entirely.
You can also add an extra parameter to play with, transforming one fractal
into another; I don't get very scientific with it, but you can see it used in
the video to morph between fractals. Most random variants, however, are
relatively boring in that they 1. don't produce more than one or two patterns,
2. produce patterns that are just the Mandelbrot set again (an interesting bit
of emergence in its own right), or 3. devolve into noise when zooming in most
places. There are a couple of exceptions of note:
## The Tricorn Set
The Tricorn set is a variant of the Mandelbrot set that uses the *conjugate* of
z, which inverts the sign of the imaginary term.
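$$z_{n+1} = \bar{z}_n^{\,2} + c$$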
## The Burning Ship Fractal
A more well-known variant of the Mandelbrot set is the Burning Ship fractal,
which takes the *absolute value* of the real and imaginary components of z
before squaring.
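$$z_{n+1} = \left(\lvert\operatorname{Re}(z_n)\rvert + i\,\lvert\operatorname{Im}(z_n)\rvert\right)^2 + c$$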
The most interesting part about this one is actually the figure to the left,
which is what the fractal is named after.
# Notes on Fractal Computation
I want to talk about some details of computing and rendering fractals. As you
would expect, for a full quality render you will need to compute iterations for
every pixel in the image. **This is very computationally expensive.**
Even more troublesome is having to calculate this for *every frame* when
panning around and zooming in if you're writing a realtime explorer.
## Divergence
Let's first define what is meant by "diverge" when iterating over a
z-transform. Mathematically, this means the point is unbounded, or transforms
off to infinity. We can discard a point from the set long before infinity,
though—in fact, for the three fractals mentioned above, once z gets a distance
*greater than 2* from the origin, the sequence is guaranteed to diverge.
Storing the square of this threshold—to avoid computing square roots when
applying the Pythagorean theorem—in a `discard_threshold_squared` constant or
parameter, we can skip unneeded iterations by breaking out of the loop early
in our compute code:
```c
const int discard_threshold_squared = 4;
```
```c
// [inside z-transform loop]
if ((a*a + b*b) > discard_threshold_squared)
{
    // [store current iteration count for purpose of
    // coloring, indicating point is not in set]
    break;
}
```
Points in the set will not exit the sequence early. An implication of this is
that *the more points in the set the frame contains, the longer the frame will
take to render.*
## Iteration Count
We also need to define a maximum iteration count: the number of iterations
after which we confidently say "this point does not diverge." This makes for another
design consideration, though. Note in the screenshots above how points closer
to the set are brighter; this means it takes more iterations for those points
to diverge. From this, it should follow that **increasing the maximum number
of iterations will lead to greater detail at the bounds of the set**. If we
set the iteration count too low, we get undetailed renders like the following
(compare to previous screenshot).
Not only that, but zooming in only makes the boundary seem coarser. To
compensate for this, I define my maximum iteration count to be a function of
zoom level, where `n0` is a "base" iteration count parameter and `s` is the
scale (decreases when zooming in).
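One illustrative form (not necessarily the exact function used here) grows the
iteration budget logarithmically as `s` shrinks:

$$n_{\max} = n_0 \left(1 + \log_2 \frac{1}{s}\right)$$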
This largely fixes the coarseness of the shape when zooming in, but poses a new
issue. At a certain point when zooming in, the iteration count will become so
large that framerate begins to drop. In the [Rendering on the
GPU](#rendering-on-the-gpu) section I detail a rendering method called
*interlacing* (or *progressive refine*) that lets us split up the work of a
render across multiple frames.
## Floating-Point Precision
The primary limitation with a realtime fractal renderer like this is computer
hardware architecture. For most applications, computers store decimal numbers
according to standard [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) (or
variants thereof), which, in essence, represent decimal numbers in scientific
notation form, comprising a significand ("mantissa") and an exponent. On
modern CPUs and GPUs, there are floating-point arithmetic units (FPUs) built
into hardware that make computation with floating-point types significantly
faster than it would be with a software-only implementation. FPUs nowadays come
in sizes of 16 bits (half precision/FP16), 32 bits (single precision/FP32),
64 bits (double precision/FP64), and 128 bits (quadruple precision/FP128). As
you might guess, more bits means a wider significand and exponent, which means
greater precision and range.
This pertains to computing fractals because the points needed on the complex
plane are decimals, not integers. **Zooming into one point—effectively
increasing the number of decimal places encoded by each pixel's location—only
increases the precision required in computing.**
Limiting ourselves to hardware floating-point implementations caps our
precision at FP128. In reality, if we're running this on the GPU, we're capped
to FP64, since most GPU architectures don't support FP128. (OpenGL's shader
language doesn't even provide a quad-width type, e.g. `long double`. Even for
CPU architectures, hardware support for FP128 is
[iffy](https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Hardware_support).)
Also, since graphics applications generally don't need more than 32 bits of
floating-point precision, GPUs tend to have only 32-bit wide FPUs, with a slow
processing path for FP64 (about 1/64 the speed of FP32 according to some
benchmarks). Despite this, GPUs have significantly more floating-point
execution units than CPUs, so we're *still* running faster on the GPU.
With 64-bit precision on the GPU, we can zoom in by a factor of about 10¹⁴
before we hit our precision limit.
This isn't ideal but it's the best I could come up with (or cared to,
considering this was not a planned project) for a realtime renderer. Fractal
dive renderers use high-level CPU software implementations for
*arbitrary-precision* floating-point computation, like
[BigFloat](https://github.com/nicowilliams/bigfloat), but this would be
disastrously slow for a realtime application (and be incompatible with GPU
acceleration).
# Rendering on the GPU
## Using a Fragment Shader
The 10,000-foot view of the basic OpenGL rendering pipeline for an object is as
follows:
1. You give the graphics card mesh data and some arbitrary program-defined
render settings
2. A vertex shader interprets this mesh data as primitive shapes, e.g.
triangles, and applies perspective transformations to scale, rotate, and
position them relative to the screen or "camera"
3. A fragment shader uses the geometry of the primitive plus the given
arbitrary render settings (including textures) to fill in the colors of
fragments (pixels, basically) within the primitive
4. The graphics card does some linear algebra magic to combine the computed
data for all objects into a rendered scene, the framebuffer
This is the explain-like-I'm-five version. If you actually know
OpenGL you know this is so insanely simplified it could just be called "wrong,"
but it's a good enough overview for the purposes of this writeup.
Knowing my CPU would be too slow for realtime rendering, rather than doing
fractal computation on the CPU and loading the result into a texture to be
displayed by the GPU, my initial attempt used an OpenGL fragment shader to do
computation directly on the GPU. It seemed like a good idea; given two
triangles filling the entire screen, the fragment shader is executed for every
pixel of the frame and outputs a color for that pixel (a sketch of this
approach follows the list below). However, there were some drawbacks:
1. Everything was redrawn every frame, even if the user hadn't moved,
unnecessarily consuming GPU resources
2. This was extraordinarily slow at high iteration counts (high zoom)
3. Iterative refine—either splitting iterations across frames or breaking the
frame up into chunks—wasn't feasible (or wouldn't be efficient/elegant
enough)
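For illustration, here's a minimal sketch of that per-pixel approach, assuming
two fullscreen triangles have already been drawn; the uniform names are made up
for the sketch rather than taken from the project:

```glsl
#version 460 core
// hypothetical fragment shader: runs the full z-transform for every pixel, every frame
out vec4 frag_color;

uniform ivec2 screen_size;
uniform dvec2 offset;            // x/y position within the fractal
uniform double scale;            // decreases when zooming in
uniform int max_iteration_count;

void main()
{
    // map this pixel to a point c on the complex plane
    dvec2 c = offset + (dvec2(gl_FragCoord.xy) - dvec2(screen_size) / 2.0lf) * scale;

    dvec2 z = dvec2(0.0lf);
    int i = 0;
    for (; i < max_iteration_count; i++)
    {
        if (dot(z, z) > 4.0lf)
            break;                                       // diverged: not in the set
        z = dvec2(z.x*z.x - z.y*z.y, 2.0lf*z.x*z.y) + c; // z = z^2 + c
    }

    // in the set -> black; otherwise brightness scales with iterations to divergence
    float brightness = (i == max_iteration_count) ? 0.0 : float(i) / float(max_iteration_count);
    frag_color = vec4(vec3(brightness), 1.0);
}
```

Every covered pixel pays for the full escape-time loop on every single frame,
which is exactly drawbacks 1 and 2 above.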
Clearly, a different strategy was needed.
## Using a Compute Shader
The [compute shader](https://www.khronos.org/opengl/wiki/Compute_Shader) is
OpenGL's interface for general purpose GPU (GPGPU) programming—analogous to
NVIDIA's CUDA, which lets you use a GPU for arbitrary floating-point-intensive
computation. Unlike CUDA however, compute shaders provide cross-platform
support (including integrated GPUs) and integrate with the rest of the graphics
library, for things like accessing textures. Perfect!!
As they're meant to be used for GPGPU programming, compute shaders aren't built
into the core OpenGL rendering pipeline. They must be explicitly invoked
(dispatched) by the application, separately from the rendering sequence. This is
actually great for our renderer because being able to control when computes
happen means we can have them run only when they're actually needed (when the
user pans, zooms, or transforms).
```cpp
// inside application C++ "world logic"
// once we determine we need to recompute after some user input, we can do it like this:
// bind texture/image to save computed data in
m_pTexture->BindAsImage();
// upload arbitrary data/settings (x/y position, zoom level, screen size, iteration count)
m_pComputeShaderUniformUploader->UploadUniforms(program);
// do compute across the width and height of the screen split up into 32x32 pixel chunks
glDispatchCompute((m_windowWidth+31)/32, (m_windowHeight+31)/32, 1);
```
The compute shader looks like this:
```glsl
#version 460 core
// define the size of the local work group to be 32x32x1
// main() runs once for each pixel in each 32x32 work group
layout (local_size_x = 32, local_size_y = 32, local_size_z = 1) in;
// arbitrary data/settings ("uniforms") uploaded by the application
layout(r32f, binding = 0) uniform image2D texture0;
uniform ivec2 screen_size;
uniform dvec2 offset; // indicates x/y position within the fractal
uniform double scale; // decreases when zooming in
uniform float discard_threshold_squared;
uniform int max_iteration_count;
void main()
{
    // 1. convert screen pixel location to world/graph space
    // 2. run z-transform
    // 3. store iteration count into texture
}
```
Splitting up the screen into 32x32 chunks is pretty arbitrary. GPUs are very
heavily designed for parallelism, so, generally, splitting work up into chunks
means things will get done more quickly, but there is an upper limit to this.
In my experimentation, 32x32 chunks seemed to work best for whatever reason.
The only data we're storing in the texture during computes is a single number
per pixel: the number of iterations it takes for the pixel's corresponding
point to diverge. The same texture is then fed into the fragment shader during the
rendering stage, which reads these values and spits out colors to the screen
accordingly. `-1` can be used to indicate a point that doesn't diverge (in the
set, colored black).
Note the texture is declared in the compute shader as having format `r32f`;
this indicates a single channel (`r`) with type FP32, though `r32i` would work
just as well here. See possible formats in the OpenGL documentation for
[glTexImage2D](https://registry.khronos.org/OpenGL-Refpages/gl4/html/glTexImage2D.xhtml).
Texture creation is managed by the application.
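To make those three steps concrete, here's a rough sketch of what the body of
`main()` might look like, using the uniforms declared above (an illustrative
reconstruction, not the project's exact shader):

```glsl
void main()
{
    // the dispatch is rounded up to whole 32x32 work groups, so skip out-of-bounds pixels
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    if (pixel.x >= screen_size.x || pixel.y >= screen_size.y)
        return;

    // 1. convert screen pixel location to world/graph space
    dvec2 c = offset + (dvec2(pixel) - dvec2(screen_size) / 2.0lf) * scale;

    // 2. run z-transform
    dvec2 z = dvec2(0.0lf);
    int i = 0;
    for (; i < max_iteration_count; i++)
    {
        if (dot(z, z) > discard_threshold_squared)
            break;
        z = dvec2(z.x*z.x - z.y*z.y, 2.0lf*z.x*z.y) + c;
    }

    // 3. store iteration count into texture (-1 == never diverged, i.e. in the set)
    float value = (i == max_iteration_count) ? -1.0 : float(i);
    imageStore(texture0, pixel, vec4(value, 0.0, 0.0, 0.0));
}
```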
This is great and all, but it doesn't really solve the issue with high
iteration counts slowing things down. When zooming in a lot, the application
becomes so slow that doing anything has a more-than-noticeable input latency
(see this in the video around 0:15). This is where we implement a method for
*progressive refine*.
## Progressive Refine
Progressive refine is the act of taking an intensive piece of work and breaking
it down into multiple chunks *over time*. This is commonly done in dedicated
renderers when you want a preview of a render that will take a while, or in
networking when loading images over a slow connection; it quickly gives you an
image at, for example, 1/64 full quality, then not-so-quickly an image at 1/32,
then 1/16, and so on, with each step taking longer on average than the
previous. Inspired by a friend's [use of the Bayer matrix for this
purpose](https://jbaker.graphics/writings/bayer.html), I used a similar
**interlacing pattern** defined by the **[Adam7
algorithm](https://en.wikipedia.org/wiki/Adam7_algorithm)**, which splits
work in an 8x8 grid across seven steps:
```glsl
const int ADAM7_MATRIX[8][8] = {
{1, 6, 4, 6, 2, 6, 4, 6},
{7, 7, 7, 7, 7, 7, 7, 7},
{5, 6, 5, 6, 5, 6, 5, 6},
{7, 7, 7, 7, 7, 7, 7, 7},
{3, 6, 4, 6, 3, 6, 4, 6},
{7, 7, 7, 7, 7, 7, 7, 7},
{5, 6, 5, 6, 5, 6, 5, 6},
{7, 7, 7, 7, 7, 7, 7, 7},
};
```
For this pattern of interlacing, the 1st and 2nd steps actually
always take the same amount of time because they do the same amount of work.
Every step after that, though, takes twice as long as the
previous.
Incorporating this into the fractal renderer, what we can do is pass some
number to the compute shader which indicates which step we're on, [1-7]. The
compute shader then indexes the above matrix at the pixel's position, compares
that value to the instructed interlace step, and only computes if the values
are equal.
```glsl
uniform int interlace_step;
```
```glsl
// [in main()]
// 32x32 work group size means we'll have 16 internal 8x8 grids for interlacing.
// Check where the current pixel is within whichever 8x8 grid it falls in:
int relative_pixel_grid_pos_x = int(gl_LocalInvocationID.x % 8u);
int relative_pixel_grid_pos_y = int(gl_LocalInvocationID.y % 8u);
bool do_compute = (ADAM7_MATRIX[relative_pixel_grid_pos_y][relative_pixel_grid_pos_x] == interlace_step);
if (do_compute)
{
    // z-transform
}
```
Each invocation then either stores its newly computed value or, if it computed
nothing this step, leaves the texture untouched. Correspondingly, the fragment
shader must be updated for this behavior as well. Following step 1, most
elements in the texture will still be empty; to prevent the screen from
displaying mostly uncomputed pixels (undefined contents), the fragment shader
should check whether a pixel has been computed before trying to use it, and
fall back to the nearest computed pixel if it hasn't been.
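A minimal sketch of that check in the rendering fragment shader, assuming the
iteration texture is cleared to a sentinel before a refine begins (the sampler
name and sentinel value are illustrative, not the project's actual scheme):

```glsl
// assumed: the iteration texture is cleared to UNCOMPUTED before a new refine begins
const float UNCOMPUTED = -2.0;
uniform sampler2D iteration_texture;

float fetch_iterations(ivec2 pixel)
{
    float value = texelFetch(iteration_texture, pixel, 0).r;
    if (value == UNCOMPUTED)
    {
        // fall back to the corner pixel of this pixel's 8x8 grid,
        // which interlace step 1 always computes first
        value = texelFetch(iteration_texture, (pixel / 8) * 8, 0).r;
    }
    return value;
}
```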
The resulting behavior is this:
Now, when zooming in at high iteration count, the first compute step is only
doing 1/64 (~1.56%) of the computations it normally would, keeping things
within the span of a single frame, i.e. preventing framerate drops. This is
great for user actions, where zooming and panning happen for as long as the
mouse input lasts—potentially hundreds of frames.
Suppose, though, that a single step takes longer than a frame to compute.
That's no worry either, since *compute shaders act independently of the
rendering pipeline*, so rendering isn't held up by a compute shader that's
still running. To keep compute dispatches from overlapping, you can use
OpenGL's synchronization interfaces like [fences and memory
barriers](https://www.khronos.org/opengl/wiki/Memory_Model) to hold off
dispatching the next step's compute until the previous one has finished,
leaving the framerate unaffected.
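As a sketch of how that gating might look on the application side (the fence
and step counter here are illustrative member variables, not the project's
actual code):

```cpp
// [inside application "world logic", run each frame while a refine is in progress]
if (m_computeFence != nullptr)
{
    // ask whether the previously dispatched step has finished on the GPU
    GLint status = GL_UNSIGNALED;
    glGetSynciv(m_computeFence, GL_SYNC_STATUS, sizeof(status), nullptr, &status);
    if (status != GL_SIGNALED)
        return; // still running; rendering carries on, we check again next frame
    glDeleteSync(m_computeFence);
    m_computeFence = nullptr;
}

// make the previous step's image writes visible before anything reads the texture
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT | GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

// upload uniforms (including the interlace step) and dispatch the next step as before
m_pComputeShaderUniformUploader->UploadUniforms(program);
glDispatchCompute((m_windowWidth + 31) / 32, (m_windowHeight + 31) / 32, 1);

// this fence signals once every command issued so far, including the dispatch, completes
m_computeFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
m_interlaceStep = m_interlaceStep % 7 + 1;
```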
My implementation can actually take the idea of progressive refine a step
further, allowing the point's z-transform itself to be distributed across
computes, resulting in *multiple* interlacing passes. It does this by storing
another two elements per pixel in the texture—the real and imaginary
components of the z-transform output—to pick up from at the start of the next
compute. This poses the same precision issues as before, however, since OpenGL
textures only let you store elements up to FP32 precision, so it doesn't work
very well at high zoom levels. We can work around this by using a separate
texture with four elements per pixel: two 32-bit elements for the real
component, and two more 32-bit elements for the imaginary component. See
[floatBitsToInt](https://registry.khronos.org/OpenGL-Refpages/gl4/html/floatBitsToInt.xhtml)
and related GLSL functions for a way you might accomplish this.
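GLSL's `packDouble2x32`/`unpackDouble2x32` built-ins are one direct way to do
that bit-splitting. A rough sketch of the idea, with an assumed second image
named `zstate` bound as `rgba32ui`:

```glsl
// assumed second image holding per-pixel z state between dispatches
layout(rgba32ui, binding = 1) uniform uimage2D zstate;

void store_z(ivec2 pixel, dvec2 z)
{
    // split each double into its two 32-bit halves and store all four in one texel
    imageStore(zstate, pixel, uvec4(unpackDouble2x32(z.x), unpackDouble2x32(z.y)));
}

dvec2 load_z(ivec2 pixel)
{
    // reassemble the doubles at the start of the next compute dispatch
    uvec4 bits = imageLoad(zstate, pixel);
    return dvec2(packDouble2x32(bits.xy), packDouble2x32(bits.zw));
}
```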