Use the M1 Mac GPU with Go

I bought one of the new M1 Macbok pro machines when they came out. I’m really happy with it, everything works great, and they got rid of the Touchbar (which I hated) and put Magsafe, HDMI, and the SD card slot back.

But, the chip is the special bit. I went for the M1 Max with 64GB RAM, 10 CPU cores and 32 GPU cores.

Since I paid for those extra GPU cores, I thought I should know how to use them…

I wanted to be able to carry out file operations, network operations, and data processing or file handling in Go, and still be able to run specific data processing tasks on the GPU.

Here’s what I ended up with.

Adding numbers

Apple’s example code shows you how to add 2 arrays of numbers together.

In this version, the numbers are generated and printed using Go.

add.metal

kernel void process(device const Params* p,
    device const float* input, 
    device float* output, 
    uint3 gridSize[[threads_per_grid]],
    uint3 gid[[thread_position_in_grid]]) {
  // Only process once per row of data.
  if(gid.x != 0) {
    return;
  }

  // Since we know we're in the first column...
  // we can process the whole row.
  int input_index = idx(gid.x, gid.y, gid.z,
    p->w_in, p->h_in, p->d_in);

  float a = input[input_index];
  float b = input[input_index+1];

  int output_index = idx(0, gid.y, 0,
    p->w_out, p->h_out, p->d_out);

  output[output_index] = a + b;
}

main.go

I’ve made the Go side as simple as I could. Compile the shader, create the input matrix and populate it, create the output matrix, run gpu.Run(input, output) and collect the results.

//go:embed add.metal
var source string

func main() {
	gpu.Compile(source)

	input := gpu.NewMatrix[float32](2, 10, 1)
	for y := 0; y < input.H; y++ {
		for x := 0; x < input.W; x++ {
			input.Set(x, y, 0, float32(y))
		}
	}
	output := gpu.NewMatrix[float32](1, input.H, 1)

	gpu.Run(input, output)

	for y := 0; y < output.H; y++ {
		fmt.Printf("Summed: %v\n", output.Get(0, y, 0))
	}
}

Grayscale an image

Here, the Metal Shader Language is expecting an input image consisting of a 2D matrix of uint8 values with each 4 group of 4 values representing a single pixel’s red, green, blue and alpha values.

The shader code calculates the average of red, green and blue, and uses that as the value for all 3, to create a grayscale image.

grayscale.metal

kernel void process(device const Params* p,
    device uint8_t* input, 
    device uint8_t* output, 
    uint3 gridSize[[threads_per_grid]],
    uint3 gid[[thread_position_in_grid]]) {
  // Only process once per pixel of data (4 uint8_t)
  if(gid.x % 4 != 0) {
    return;
  }

  int input_index = idx(gid.x, gid.y, gid.z,
    p->w_in, p->h_in, p->d_in);

  uint8_t r = input[input_index+0];
  uint8_t g = input[input_index+1];
  uint8_t b = input[input_index+2];
  uint8_t a = input[input_index+3];

  uint8_t avg = uint8_t((int(r) + int(g) + int(b)) / 3);

  output[input_index+0] = avg;
  output[input_index+1] = avg;
  output[input_index+2] = avg;
  output[input_index+3] = 255;
}

The Go code is similar to the first example, except that it has to load from the input JPEG into the matrix instead of generating numbers.

main.go

f, err := os.Open("puppy-g7b38fec9b_1920.jpg")
if err != nil {
	log.Fatalf("failed to read puppy JPEG: %v", err)
}
defer f.Close()
jpg, err := jpeg.Decode(f)
if err != nil {
	log.Fatalf("failed to decode JPEG: %v", err)
}

bounds := jpg.Bounds()
stride := 4
input := gpu.NewMatrix[uint8](bounds.Dx()*stride, bounds.Dy(), 1)
for y := 0; y < bounds.Dy(); y++ {
	for x := 0; x < bounds.Dx(); x++ {
		r, g, b, a := jpg.At(x, y).RGBA()
		input.Set((x*stride)+0, y, 0, uint8(r/257))
		input.Set((x*stride)+1, y, 0, uint8(g/257))
		input.Set((x*stride)+2, y, 0, uint8(b/257))
		input.Set((x*stride)+3, y, 0, uint8(a/257))
	}
}

Then, configure the output to be the same size as the input.

output := gpu.NewMatrix[uint8](bounds.Dx()*stride, bounds.Dy(), 1)

And run!

gpu.Run(input, output)

As before, the output is populated with the image.

fo, err := os.Create("gray-puppy.jpg")
if err != nil {
	log.Fatalf("failed to create grayscale puppy: %v", err)
}
img := image.NewRGBA(jpg.Bounds())
img.Pix = output.Data
err = jpeg.Encode(fo, img, &jpeg.Options{
	Quality: 100,
})

And out comes the gray puppy.

Mandelbrot

Here the Metal shader code is more complex [0]. Fractal generation matches GPU rendering quite well, because each pixel is calculated independently from those around it.

https://github.com/a-h/gpu/blob/main/examples/mandelbrot/mandelbrot.metal [0]

Running this example renders the familiar shape to PNG.

Summary

The Metal APIs are reasonably accessible as a means of adding more parallel processing of data than is possible on the CPU on the M1 Macs, however, gains made by this are offset by the time spent transferring data to / from the GPU.

Writing shader code is fairly straightforward, but it seems to be a very dependency-free environment, which turns using things like cryptographic functions a big job.

The CGO capability of Go made it fairly easy to read and write data to and from the GPU, although I’m sure there’s a cost to this.

The code is available at [1]

https://github.com/a-h/gpu [1]

If you’re interested in the details of how I got to the simple code above, read on.

Picking a library

I was hoping there’d be something I could pick up and use, because I thought that lots of people would like to run bits and pieces of code on the GPU but the closest thing I found was [2] - a wrapper around Apple’s Metal graphics APIs.

https://dmitri.shuralyov.com/gpu/mtl [2]

When the author posted this on Hacker News, they got a suprisingly hostile reaction, with the top comment saying to use Vulcan instead [3].

https://www.reddit.com/r/golang/comments/8sue60/rendering_a_triangle_with_apples_metal_api_using/ [3]

Vulcan [4] is supposed to provide a set of cross-platform APIs so that you can “write code once” and, hopefully, it will run anywhere.

https://www.vulkan.org [4]

Well, I tried, but I was put off pretty quickly. The tutorials I saw were complex, and a “Hello World” project I found [5] was huge. Even the first commit was massive [6].

https://github.com/krOoze/Hello_Triangle [5]

https://github.com/krOoze/Hello_Triangle/commit/d92dfb07e285b9ead99e616c156da9a5739cceb7#diff-e51d79de792a2a35b6348665f952ecdd98ea0e99d6f336a28ee05805881b4952 [6]

I’d read that Vulcan’s APIs were hard to understand, verbose, and complex, and it seemed to match what I was seeing, so I gave up on that and went back to dmitri.shuralyov.com/gpu/mtl

Unfortunately, it became clear that it doesn’t support compute operations [7].

https://developer.apple.com/documentation/metal/mtlcomputecommandencoder?language=objc [7]

I tried to implement the missing functionality in the library, but my knowledge of how it all worked wasn’t at the level where I could contribute effectively, so I headed to Apple’s native APIs to learn from the ground up.

Apple’s documentation is pretty good, and ships with clear sample code [8].

https://developer.apple.com/documentation/metal/metal_sample_code_library?language=objc [8]

I started playing around to get the examples working and was up and running quick enough, since Apple has a fairly straightforward example [9] of adding two arrays together.

https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu?language=objc [9]

Aside from the hassle of learning more about Objective C syntax and its terminology, it seemed OK.

Shaders

Go code can’t be compiled to run on the GPU (although there is an interesting looking project that’s aiming to achieve that for Rust [10]), so the code running on the GPU is written off Metal Shader Language [11], which is based on C++ 14.

https://github.com/EmbarkStudios/rust-gpu [10]

https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf [11]

These compute functions are called shaders because they were originally designed to be ran against individual pixels in a graphics frame, to execute changes in brightness, e.g. a fade. The key benefit over CPUs is the quantity of parallelisation that they provide, with my GPU capable of xecuting 32 operations in parallel, vs only 10 in parallel on the CPU.

Shader code itself can be fairly simple. The add.metal below simply adds two numbers together.

To do this, it receives some parameters:

Params* p to describe the shape of the input and output data.
The input and output array of floats (const float* input). I’ve used a flattened representation of the shape (e.g. a 3*3 matrix is provided as a an 9-element array).
gridSize to provide the size of a region being worked on (not used here).
gid to describe which coordinate within the input data the shader should execute on (the x, y, z coordinate of input data).

The code is then instructed to run on each element within the input array.

This particular shader code is aware that it only needs to add two numbers in column A and column B together. Column A is x == 0, so if x != 0, we must be in Column B, so no processing is carried out.

I could have taken some shortcuts with this code, but I wanted to establish a pattern in the code of:

Getting the input value index based on the gid xyz coordinate.
Retrieving input values.
Executing the calculation.
Determining the output index(es).
Setting the the output value(s).

add.metal

#include <metal_stdlib>
using namespace metal;

typedef struct Params {
  int w_in, h_in, d_in;
  int w_out, h_out, d_out;
} Params;

int idx(int x, int y, int z, int w, int h, int d) {
  int i = z * w * h;
  i += y * w;
  i += x;
  return i;
}

kernel void process(device const Params* p,
    device const float* input, 
    device float* output, 
    uint3 gridSize[[threads_per_grid]],
    uint3 gid[[thread_position_in_grid]]) {
  // Only process once per row of data.
  if(gid.x != 0) {
    return;
  }

  // Since we know we're in the first column...
  // we can process the whole row.
  int input_index = idx(gid.x, gid.y, gid.z,
    p->w_in, p->h_in, p->d_in);

  float a = input[input_index];
  float b = input[input_index+1];

  int output_index = idx(0, gid.y, 0,
    p->w_out, p->h_out, p->d_out);

  output[output_index] = a + b;
}

Connecting to Go

With Metal shader code ready, I needed to do the rest of the work.

Load / compile the shader code into the GPU.
Get some data loaded from Go onto the GPU.
Run the shader code against the data.
Get the results back in Go.

This is where CGO comes in [10].

https://pkg.go.dev/cmd/cgo [10]

Using special CGO comments, I could get go build to compile the mtl.h file and mtl.m Objective C file along with my Go code.

//go:build darwin
// +build darwin

package gpu

/*
#cgo LDFLAGS: -framework Metal -framework CoreGraphics -framework Foundation
#include <stdlib.h>
#include <stdbool.h>
#include "mtl.h"
*/
import "C"
import (
	"sync"
	"unsafe"
)

Calling C functions from Go

In the mtl.h file I’ve got a function:

void compile(char* source);

To call this from Go, all I needed to write is C.compile which looks like this in practice:

// Compile the shader. Only needs to be done once.
func Compile(shaderCode string) {
	src := C.CString(shaderCode)
	defer C.free(unsafe.Pointer(src))
	C.compile(src)
}

Mapping Go types to C types

In the mtl.h file the Params type is defined, and use in the run function.

typedef struct Params {
  int w_in, h_in, d_in;
  int w_out, h_out, d_out;
} Params;

void* run(Params *params);

In the Go code, there’s a similar implementation of the C Params struct. The order and data sizes of the structures must match exactly the C implementation, so the C int becomes a Go int32 rather than a Go int, to ensure that they’re exactly the same size, but the variable names can be different.

type params struct {
	// Size of input matrix.
	WIn, HIn, DIn int32
	// Size of output matrix.
	WOut, HOut, DOut int32
}

That mapping allows Go code to call the C code with custom types:

p := params{
	WIn:  int32(input.W),
	HIn:  int32(input.H),
	DIn:  int32(input.D),
	WOut: int32(output.W),
	HOut: int32(output.H),
	DOut: int32(output.D),
}
cp := (*C.Params)(unsafe.Pointer(&p))
// Run.
ptr := C.run(cp)

Building a matrix that can be flattened

I was able to keep the C implementation simple by using the idea that a matrix can be flattened, much like a bitmap is just a stream of pixel values with a set width.

This allows the mtl.m C code to support all the input shapes of data I thought I’d be working with, e.g. 1D array, 2D array, 3D array of a single type without having to have different versions for different inputs.

The next problem was supporting different types for the input and output matrices. For example, I thought it likely that I would have int and float types.

In the shader C code, I used void* and left it to the code to blow up if the types were incorrect.

In Go, I used generics to set the types which both the GPU and Go can support (I’m sure it could support more, but this is all I’ve tested with so far).

type GPUType interface {
	uint8 | uint32 | int32 | float32
}

Then created a convenience API around the flat underlying array to allow read and write of xyz coordinates within in the matrix.

func NewMatrix[T GPUType](w, h, d int) *Matrix[T] {
	// Store matrix like a display buffer, i.e.
	// get a whole y row, and index it with x.
	// so, d, y, x
	m := &Matrix[T]{
		W:    w,
		H:    h,
		D:    d,
		init: &sync.Once{},
	}
	return m
}

type Matrix[T GPUType] struct {
	W, H, D int
	Data    []T
	init    *sync.Once
}

func (m *Matrix[T]) Populate() {
	m.init.Do(func() {
		if m.Data == nil {
			m.Data = make([]T, m.W*m.H*m.D)
		}
	})
}

func (m Matrix[T]) Index(x, y, z int) (i int) {
	i += z * m.W * m.H
	i += y * m.W
	i += x
	return i
}

func (m *Matrix[T]) Set(x, y, z int, v T) {
	m.Populate()
	m.Data[m.Index(x, y, z)] = v
}

func (m Matrix[T]) Get(x, y, z int) T {
	m.Populate()
	return m.Data[m.Index(x, y, z)]
}

func (m Matrix[T]) Size() int {
	return m.W * m.H * m.D
}

Will I start writing more stuff for the GPU, I don’t know. Maybe. It’s not very useful for work I currently do because the Metal APIs don’t run in AWS or GCP.