Fuzz Testing in Golang

Software testing is usually associated with tests that verify the correctness of an implementation by asserting against an expected behavior or result. For all their importance, tests on the unit and integration level cannot answer every question about the production readiness of an implementation.

Performance, resource efficiency and security become even more important in the context of cloud solutions, where more efficient code is much more likely to scale well and save costs. The same applies to infrastructure components: a slow or insecure logging framework, for example, can cause significant problems.

In this context, the quality of the tests themselves also becomes more important. For security-relevant components in particular, it is helpful to check the tests automatically for completeness and to find gaps before production does.

Both aspects can be roughly evaluated using experience and gut feeling, but an objective and reliable assessment requires measurable and easily comprehensible procedures.

Alongside its seasoned testing framework, the Go standard library already includes tools for measuring efficiency and for checking a test suite for blind spots.

In this article I want to illustrate typical use cases of testing in Go beyond testing.T, using a few examples built around a Go implementation of Conway's Game of Life.

Tests and Test Coverage with testing.T

The standard Go toolchain includes a testing framework that covers simple, nested and table-driven tests.

Let's start implementing the Game of Life in a TDD'ish manner. While implementing the basic rule set, we write a function named livingNeighborsCount, which counts the number of living neighbor cells at a given position on a board. The test below illustrates the basic expectations towards the livingNeighborsCount function.
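The listings in this article assume a board made of cells that can be Dead or Living. The concrete type definitions are not part of the code shown here; a minimal sketch that makes the examples self-contained could look like this:

// Cell represents the state of a single cell on the board.
type Cell bool

const (
	Dead   Cell = false
	Living Cell = true
)

// Board is a two-dimensional grid of cells, indexed as board[y][x].
type Board [][]Cell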

func TestLivingNeighbors(t *testing.T) {
	t.Run("should return 0 Living on board with no living cells", func(t *testing.T) {
		var fakeBoard = &Board{
			{Dead, Dead, Dead, Dead},
			{Dead, Dead, Dead, Dead},
			{Dead, Dead, Dead, Dead},
		}
		assert.Equal(t, 0, fakeBoard.livingNeighborsCount(1, 1))
	})
	t.Run("should return all (8) corners as Living on board with only living cells", //..
}

The assumption for the first test cycle is that the number of living neighbors at any given coordinate is always 0 for a board that is occupied exclusively by dead cells. In the reverse case of a field with only living cells, the result would be the maximum number of possible neighboring cells, i.e. 8.
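The second case is only hinted at in the listing above; completed, it could look roughly like this (a sketch, using the same assert package as the first case):

	t.Run("should return all (8) neighbors as Living on board with only living cells", func(t *testing.T) {
		var fakeBoard = &Board{
			{Living, Living, Living, Living},
			{Living, Living, Living, Living},
			{Living, Living, Living, Living},
		}
		assert.Equal(t, 8, fakeBoard.livingNeighborsCount(1, 1))
	})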

The implementation below satisfies these test expectations.

func (board *Board) livingNeighborsCount(cellX int, cellY int) (result int) {
	neighborOffsets := []int{-1, 0, 1}
	for _, offsetY := range neighborOffsets {
		for _, offsetX := range neighborOffsets {
			if offsetX == 0 && offsetY == 0 {
				continue
			}

			neighborX, neighborY := cellX+offsetX, cellY+offsetY

			if neighborY < len(*board) &&
				neighborX < len((*board)[neighborY]) &&
				(*board)[neighborY][neighborX] == Living {
				result++
			}
		}
	}
	return
}

With the built-in test coverage analysis, one can obtain a cheap (and controversial) metric for the overall test coverage of a suite.

go test --cover
PASS
coverage: 100.0% of statements
ok      codecentric.de/demo/gol        0.207s

A detailed report can be generated using the -coverprofile parameter of go test. Although the output is human-readable, the HTML report produced by go tool cover is much more convenient:

go test -coverprofile=coverage.out
go tool cover -html=coverage.out

Test coverage metrics provide a first impression of a test suite and can be used both to accompany development and as an additional criterion within a CI/CD pipeline. However, coverage alone cannot tell whether the tests actually work as a safety net: despite high coverage, critical gaps may remain untested. One way to find such gaps is fuzzing.

Fuzzing helps reveal blind spots

Since Go 1.18, the testing package contains a fuzzer. Starting from a set of existing tests, the fuzzer derives new test data within the set of possible input arguments of the system under test.

This can be useful in several situations. During the development of new functionality, cases that have not yet been considered can be revealed. Furthermore, fuzzing can support replacing or extending an existing implementation by revealing behavior not yet covered by the existing tests.

Suppose the tests from the previous section are implemented and pass. As part of the optimization phase, a fuzz test is run against the previously implemented livingNeighborsCount function to find possible errors that are not covered by tests. Our first fuzz test validates the assumption, “Querying any coordinate on the board is safe”.

func FuzzLivingNeighbors(f *testing.F) {
	var fakeBoard = &Board{
		{Dead, Dead, Dead, Dead},
		{Dead, Living, Dead, Dead},
		{Dead, Dead, Dead, Dead},
	}

	testcases := []struct {
		x int
		y int
	}{
		{x: 1, y: 1}, {x: 2, y: 2},
	}

	for _, testcase := range testcases {
		f.Add(testcase.x, testcase.y)
	}

	f.Fuzz(func(t *testing.T, x int, y int) {
		result := fakeBoard.livingNeighborsCount(x, y)
		assert.NotNil(t, result)
		assert.GreaterOrEqual(t, result, 0)
		assert.LessOrEqual(t, result, 8)
	})
}

The fuzz test above looks very similar to an ordinary test. Instead of the familiar testing.T argument, the test takes a testing.F value, where F stands for fuzzing. The steps for arranging the test data and the test data table in the top half are reused from the existing tests.

In the next step, the entries from the test data table are added to the so-called seed corpus using the f.Add function. This designates a fixed set of test data which the mutator uses to derive new data during fuzzing. The entries of the seed corpus serve the mutator as starting points for the recursive derivation of new generations of mutations. Without a seed corpus, the mutator starts from random data.

When deriving a new generation, one attribute of the current generation is randomly selected for mutation. For numeric data types, mutations are formed by adding or subtracting a random number; for arrays, among other things, by removing, masking, shifting or inserting elements. The creation of mutants with several altered properties is under consideration for later versions of Go.

A fuzz test can be invoked the same way as a normal test. In that case, however, only one test run is executed per entry in the seed corpus, assuming there are no stored findings from previous fuzzing runs.

$ go test . -run=FuzzLivingNeighbors

The actual fuzzing is started by activating the fuzzer with the -fuzz flag. A fuzz test terminates either when an error is found, when the maximum execution time specified by the -fuzztime flag has elapsed, or when the test process is terminated manually. In the case of the livingNeighborsCount function above, the fuzzer very quickly finds the obvious gap in the validation of the function arguments.

$ go test -fuzz=. -run=FuzzLivingNeighbors
=== FUZZ  FuzzLivingNeighbors
fuzz: elapsed: 0s, gathering baseline coverage: 0/18 completed
fuzz: elapsed: 0s, gathering baseline coverage: 18/18 completed, now fuzzing with 12 workers

Failing input written to testdata/fuzz/FuzzLivingNeighbors/cafebabe
    FAIL

After a failed fuzz test, the input data used is automatically written to the file system in a readable format; the path of the output file appears in the console output. This file allows the failure to be reproduced and should therefore be checked into version control as part of the tests. The test case that revealed the finding is then executed again with every normal test run, without having to run another computationally intensive fuzzing session, so a reoccurrence of the error is detected quickly.

go test fuzz v1
int(-1)
int(-1)

The admittedly obvious error of passing a negative value for x or y can now be corrected; the failing dataset above should no longer cause a test failure. This cycle can be repeated until no more anomalies occur.

if neighborX >= 0 &&
	neighborY >= 0 &&
	neighborY < len(*board) &&
	// ...

Fuzz tests place higher demands on the assert phase of a test run. Because the input values are mutated, comparing against a fixed expected value is only possible in a few cases. Patterns that check expected behavior instead are more useful, for example:

  1. Comparison with the result of an alternative implementation of the function. This can be used when replacing an existing implementation.
  2. Checking for characteristic properties such as behavior across multiple calls, for example with functions such as string.Reverse (see the sketch after this list).
  3. Constraining the result to a subset in the set of plausible return values.
  4. Ensuring that in the case of mission critical attributes, the desired behavior also occurs after mutation of the remaining values. This is applicable, for example, when passing an invalid security key to a function and checking for unexpected crosstalk.
  5. Verification that a mutation of the passed values does not trigger unexpected errors such as overflow, access, or memory violations.
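As an illustration of pattern 2, here is a minimal sketch of such a round-trip check. It assumes a hypothetical Reverse(s string) string function, which is not part of this article's code:

func FuzzReverseRoundTrip(f *testing.F) {
	f.Add("gopher")
	f.Fuzz(func(t *testing.T, s string) {
		// Reversing twice must yield the original input,
		// no matter what the mutator passes in.
		if got := Reverse(Reverse(s)); got != s {
			t.Errorf("round trip failed: got %q, want %q", got, s)
		}
	})
}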

The FuzzLivingNeighbors test further above combines several of these approaches by asserting that the program does not crash and that the result lies in the plausible range from 0 to the highest possible number of neighboring cells on a two-dimensional board. Otherwise, the test run is considered failed.

When designing fuzz tests, note that the fuzzer currently only accepts primitive data types (string, int, float, bool and the like) in the seed corpus. In addition, the function under test must not have side effects that affect further iterations, such as access to global variables or the file system. In other words, the rules for good unit tests also apply to fuzz tests.

Benchmarking

Assume that the Game of Life implementation now runs without errors after extensive testing. When printing the current state of a board on the console, however, it becomes apparent that the default serialization of the underlying slices does not meet the requirements for a user-friendly display of a board:

[[false true false false] [false true false false]]

To render it in human-readable form, we consider adding a pretty-printing String function that converts a Game of Life board into a human-readable string.
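The implementations below call a String method on each cell. Such a method is not shown in this article; a minimal sketch, with markers chosen so that each cell renders to 3 bytes (matching the pre-allocation estimate used later), might look like this:

// String renders a single cell as a fixed-width, 3-byte marker.
// The concrete markers are an assumption for illustration only.
func (c Cell) String() string {
	if c == Living {
		return "[x]"
	}
	return "[ ]"
}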

Among the non-functional requirements, we assume that the function is under constantly high load and has to process large Game of Life boards with reliable response times, which rules out implementations with long runtimes or frequent garbage collector pauses.

At first glance, three approaches appear to be viable solutions:

  1. Naive composition of the string representation by string concatenation
  2. Use of a strings.Builder
  3. Use of a strings.Builder that is pre-grown to the expected output length

In terms of efficiency, solution 1 is expected to be the worst variant, since a large number of memory allocations and their side effects are to be expected. Solutions 2 and 3, on the other hand, cannot be weighed against each other immediately: solution 2 skips the initial memory reservation in favor of a fast start, but has to grow its underlying buffer several times at runtime. Whether this is an advantage or a disadvantage compared to solution 3 cannot be answered offhand.

Benchmarking creates clarity here. A benchmark makes statements about performance and resource consumption by executing a piece of code repeatedly, measuring the average runtime and optionally recording memory statistics.

The possible solutions can be implemented as follows:

func (b *Board) StringNaive() (result string) {
	for _, row := range *b {
		for _, col := range row {
			result += col.String()
		}
		result += "\n"
	}
	return
}

func (b *Board) StringBuilderWoPreallocate() string {
	var builder strings.Builder

	for _, row := range *b {
		for _, col := range row {
			builder.WriteString(col.String())
		}
		builder.WriteString("\n")
	}
	return builder.String()
}

func (b *Board) StringPreallocate() string {
	var builder strings.Builder
	
	if len(*b) > 0 && len((*b)[0]) > 0 {
		// Reserve capacity up front: 3 bytes per cell plus one newline per row.
		builder.Grow(len(*b)*len((*b)[0])*3 + len(*b))
	}

	for _, row := range *b {
		for _, col := range row {
			builder.WriteString(col.String())
		}
		builder.WriteString("\n")
	}
	return builder.String()
}

A benchmark test case is defined as below. Similar to fuzz tests, a benchmark function takes an argument of type *testing.B, which includes a field N. This field specifies the number of iterations to perform; it is the developer's responsibility to take it into account, e.g. by using a loop.

func BenchmarkToStringNaive(b *testing.B) {
	board := createFakeBoard(400, 300)
	for i := 0; i < b.N; i++ {
		board.StringNaive()
	}
}

func BenchmarkToStringWithoutPreallocate(b *testing.B) {
	board := createFakeBoard(400, 300)
	for i := 0; i < b.N; i++ {
		board.StringBuilderWoPreallocate()
	}
}
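The benchmarks reference a createFakeBoard helper that is not shown in this article. A minimal sketch, assuming the Board and Cell types sketched earlier and a (width, height) parameter order, might look like this:

// createFakeBoard builds a board of the given dimensions with a simple,
// deterministic fill so that benchmark runs stay comparable.
func createFakeBoard(width, height int) *Board {
	board := make(Board, height)
	for y := range board {
		board[y] = make([]Cell, width)
		for x := range board[y] {
			// Mark every third cell as living; the exact pattern is arbitrary.
			board[y][x] = Cell((x+y)%3 == 0)
		}
	}
	return &board
}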

A benchmark is executed using the -bench flag, which takes a selector for the benchmarks to run as a parameter. The -benchmem flag additionally records memory usage information.

$ go test --benchmem --bench=ToString

BenchmarkToStringNaive-12                     1  3476785913 ns/op  22130260504 B/op  122008 allocs/op
BenchmarkToStringWithoutPreallocate-12     1876      629009 ns/op      1981262 B/op      29 allocs/op
BenchmarkToStringWithPreallocate-12        3315      343694 ns/op       360489 B/op       1 allocs/op

Benchmarks require a reliable amount of available CPU power to ensure consistent results. It must therefore be ensured during benchmarking that as few other processes as possible cause CPU load and that the machine’s performance management does not change the available CPU performance.

The interpretation of the result supports the original assumptions regarding the cost of memory allocation. Let's look at the details: the suffix -12 in the benchmark name means that the benchmark was run on a maximum of 12 CPU cores. This number can be set with the -cpu parameter, for example -cpu 1,2,4,8 for one run each, doubling the number of CPUs each time. In the example used here, no speedup is to be expected from a higher number of CPUs, but for highly concurrent code, the scaling behavior can be examined this way in a comprehensible manner.

The next columns provide further insights. The value with the unit ns/op indicates the average running time of one iteration in nanoseconds; lower values mean better performance. The same applies to the adjacent columns, which summarize memory usage: the number of bytes allocated per operation (B/op) and the number of allocation operations (allocs/op). They also provide evidence for the poor memory strategy of the naive approach, as one additional string is instantiated for each completed string concatenation, in our case at least once per cell. Copying the ever-growing intermediate result costs a lot of time and memory and, as a result, increases the probability of an early garbage collector run. This does not fit the initial requirement of efficiency and scalability.

Furthermore, the results show that the strings.Builder approach without initial memory reservation allocates new memory 29 times due to its growth strategy and is already significantly more efficient overall. The same approach with an initial memory reservation takes about half the time and allocates only slightly more memory than the calculated length of the actual result (360300 bytes). From the figure of "1 allocs/op" it can be concluded that the reserved memory was not too small (but not necessarily not too large).

As part of this evaluation of the different approaches, it also makes sense to examine their scalability. This can be done with a table-driven benchmark that runs the three approaches with boards sized 8x8, 128x128 and 1024x1024.

func BenchmarkToStringDifferentSizes(b *testing.B) {
	for _, s := range []int{8, 128, 1024} {
		board := createFakeBoard(s, s)

		b.Run(fmt.Sprintf("method=naive board_size=%d", s), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				board.StringNaive()
			}
		})

		b.Run(fmt.Sprintf("method=without_preallocate board_size=%d", s), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				board.StringBuilderWoPreallocate()
			}
		})

		b.Run(fmt.Sprintf("method=with_preallocate board_size=%d", s), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				board.StringPreallocate()
			}
		})
	}
}

$ go test -benchmem --bench=BenchmarkToStringDifferentSizes -run=^# --cpu 1

method=naive_board_size=8                    
     208743             5577 ns/op                7864 B/op         71 allocs/op
method=without_preallocate_board_size=8     
    2209314             543.4 ns/op                504 B/op          6 allocs/op
method=with_preallocate_board_size=8        
    4303492             283.9 ns/op                208 B/op          1 allocs/op
method=naive_board_size=128                      
         14          94907911 ns/op          439336056 B/op      16511 allocs/op
method=without_preallocate_board_size=128     
      13608             86758 ns/op             211704 B/op         21 allocs/op
method=with_preallocate_board_size=128        
      22322             53403 ns/op              57344 B/op          1 allocs/op
method=naive_board_size=1024                      
          1      324161788570 ns/op      1655682724216 B/op    1049599 allocs/op
method=without_preallocate_board_size=1024      
        186           6324295 ns/op           16792320 B/op         38 allocs/op
method=with_preallocate_board_size=1024         
        368           3268643 ns/op            3153920 B/op          1 allocs/op

The result can be interpreted as follows: the naive approach scales much worse than the strings.Builder variants. With an 8x8 board it takes 5577 ns per run; with 1024x1024 the runtime increases to roughly 324 s. Variant 3 needs about 0.003 s for the same task and is therefore around 100,000 times faster. For an 8x8 board, solution 2 performs 5 more memory allocations than solution 3, which accomplishes the same task in almost half the time. Moreover, within the considered range, an initially slower solution never overtakes a faster alternative "down the road".

Benchmarking results, combined with knowledge of the functional and technical constraints, can be used to identify relevant bottlenecks and to select an efficient solution.

Next Steps

In this article we have looked at benchmarks and fuzz tests primarily as development support. Through robust measurements, we have gained data points that support informed decisions on implementation details.

Beyond this snapshot, both benchmarks and fuzz tests can provide substantial value. Benchmarks can serve as an evaluation criterion in a CI/CD pipeline, for example to flag a functionality whose resource consumption exceeds limits or deviates noticeably from historical values. The same applies to fuzz tests, which can examine critical functionality for previously undiscovered gaps.

In this way, conspicuous releases can be recognized more reliably and singled out for closer analysis with other tools, such as Godebug or the Go profiler[4].

Conclusion

Fuzzing and benchmarking complement the existing test suite. They cannot replace “classic” tests.

Benchmarking offers a low-threshold way to select, in a data-driven manner, the solution best suited to a requirement, to identify optimization potential and to measure the effectiveness of optimizations. This can also be done continuously, for example by checking with each release whether critical parts of the artifact still meet the performance requirements.

Fuzzing supports the search for "blind spots" in the existing tests and increases confidence in the test suite. It also simplifies refactoring by finding edge cases that a replacement implementation does not yet handle.

Together with conventional tests, fuzzing and benchmarking thus contribute to a loosely coupled and clean code base built on well-founded, goal-oriented technical decisions.

