Praise for Cloud Native Go

Matthew does an excellent job in bringing together two independent but highly interdependent concepts: cloud native computing and use of the Go language. The merger creates a must-read for anyone working (or wanting to work) in the cloud native application space.

Lee Atchison
Author, Architecting for Scale (O’Reilly)

Technical books can be hard to consume, but Matthew Titmus does an amazing job of telling a story that is easy to follow and includes working examples to help readers really understand the concepts being presented.

Celeste Stinger
Senior Site Reliability Engineer, Limit Break

This is the first book I’ve come across that covers such a breadth and depth of modern cloud native practices in such a practical way. The patterns presented here have clear examples to solve real problems that are faced by engineers on a daily basis.

Alvaro Atienza
Software Engineer, Aktos

Matt’s expertise in the art and science of building reliable systems in a fundamentally unreliable world is clearly (and humorously) captured in the pages within. Join him as he introduces you to the fundamental building blocks and system designs that enable large-scale, reliable systems to be constructed from the ephemeral and unreliable components that comprise the underlying cloud infrastructure of today’s modern computing environment.

David Nicponski
Former Principal Engineer, Robinhood

Over the past few years, two infrastructure trends have been happening: Go has been increasingly used for infrastructure, in addition to backend, and the infrastructure is moving to the cloud. This book summarizes the state of the art of the combination of the two.

Natalie Pistunovich
Lead Developer Advocate, Aerospike

I came in knowing next to nothing about Go and left feeling like an expert. I would go so far as to say that simply reading this book made me a better engineer.

James Quigley
Staff Site Reliability Engineer, Oscar Health

Cloud Native Go

Second Edition

Building Reliable Services in Unreliable Environments

Matthew A. Titmus

Cloud Native Go

by Matthew A. Titmus

Copyright © 2025 Matthew A. Titmus. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Acquisitions Editor: Brian Guerin
  • Development Editor: Melissa Potter
  • Production Editor: Elizabeth Faerm
  • Copyeditor: Stephanie English
  • Proofreader: Piper Editorial Consulting, LLC
  • Indexer: nSight, Inc.
  • Interior Designer: David Futato
  • Cover Designer: Karen Montgomery
  • Illustrator: Kate Dullea
  • October 2024: Second Edition

Revision History for the Second Edition

  • 2024-10-14: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098156428 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Cloud Native Go, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-15642-8

[LSI]

Dedication

For you, Dad.

Your gentleness, wisdom, and humility are dearly missed.

Also, you taught me to code, so any mistakes in this book are technically your fault.

Preface

It’s a magical time to be a technologist.

We have Docker to build containers, and Kubernetes to orchestrate them. Prometheus lets us monitor them. Consul lets us discover them. Jaeger lets us trace the relationships between them. These are just a few examples, but there are many, many more, all representative of a new generation of technologies: all of them are “cloud native,” and all of them are written in Go.

The term cloud native feels ambiguous and buzzwordy, but it actually has a pretty specific definition. According to the Cloud Native Computing Foundation, a sub-foundation of the renowned Linux Foundation, a cloud native application is one that’s designed to be scalable in the face of a wildly changing load, resilient in the face of environmental uncertainty, and manageable in the face of ever-changing requirements. In other words, a cloud native application is built for life in a cruel, uncertain universe.

Incorporating lessons learned from years of building cloud-based software, Go was released by Google a little more than a decade ago as the first major language designed specifically for the development of cloud native software. This was largely because the common server languages in use at the time simply weren’t a great fit for writing the kinds of distributed, process-intensive applications that Google produces a lot of.

Since that time, Go has emerged as the lingua franca of cloud native development, being used in everything from Docker to Harbor, Kubernetes to Consul, InfluxDB to CockroachDB. Ten out of fifteen of the Cloud Native Computing Foundation’s graduated projects, and forty-two of sixty-two1 of its projects overall, are written mostly or entirely in Go. And more arrive every day.

What’s New in the Second Edition

The Go language has advanced quite a bit since the first edition of Cloud Native Go. In the second edition, we build off of those advances not only to update and expand the original 11 chapters but to bring you 2 brand new chapters as well. The first of these is focused on secure coding practices, providing comprehensive guidelines to protect your applications from malicious (or even just careless) actors. The second explores the theory and practice of handling state in distributed systems.

Finally, the second edition of Cloud Native Go incorporates reader feedback and addresses errata from the first edition, refining the content for clarity and accuracy. These improvements make this edition an invaluable resource for both new and experienced Go developers working in cloud native environments.

Who Should Read This Book

This book is directed at intermediate-to-advanced developers, particularly web application engineers and DevOps specialists/site reliability engineers. Many will have been using Go to build web services but may be unfamiliar with the subtleties of cloud native development—or may not even have a clear idea of what “cloud native” is—and have subsequently found their services to be difficult to manage, deploy, or observe. For these readers, this work will not only provide a solid foundation in how to build a cloud native service, but it will also show why these techniques matter at all, as well as offer concrete examples to help make sense of this sometimes abstract topic.

It’s expected that many readers may be more familiar with other languages but have been lured by Go’s reputation as the language of cloud native development. For these readers, this book will present best practices for adopting Go as their cloud native development language and help them solve their own cloud native management and deployment issues.

Why I Wrote This Book

The way that applications are designed, built, and deployed is changing. Demands of scale are forcing developers to spread their services’ efforts across legions of servers: the industry is going “cloud native.” But this introduces a host of new problems: how do you develop or deploy or manage a service running on ten servers? A hundred? A thousand? Unfortunately, the existing books in the “cloud native” space focus on abstract design principles and contain only rudimentary examples of how to do any of this, or none at all. This book seeks to fill a need in the marketplace for a practical demonstration of complex cloud native design principles.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold

Shows commands or other text that should be typed literally by the user.

Constant width italic

Shows text that should be replaced with user-supplied values or by values determined by context.

Tip

This element signifies a tip or suggestion.

Note

This element signifies a general note.

Warning

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/cloud-native-go/examples.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Cloud Native Go by Matthew A. Titmus (O’Reilly). Copyright 2025 Matthew A. Titmus, 978-1-098-15642-8.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

Note

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-889-8969 (in the United States or Canada)
  • 707-827-7019 (international or local)
  • 707-829-0104 (fax)
  • support@oreilly.com
  • https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/cloud-native-go-2e.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media.

Watch us on YouTube: https://youtube.com/oreillymedia.

Acknowledgments

First and foremost, I’d like to thank my wife and son. You’re the motivation for every good thing I’ve done since you’ve entered my life, and the guiding stars that let me keep my direction true and my eyes on the sky.

To my dad, whom we lost recently. You were the closest thing to a true Renaissance man I’ve ever known while still managing to be the kindest, most humble person I’ve ever known. I still want to be just like you when I grow up.

To Mary. Who feels his absence more profoundly than anyone. You’re family, and you’ll always be family, even if I don’t call you as often as I should. Dad would be so proud of your strength and grace.

To Sarah. I’m always amazed by your strength and grit. Your sharp mind has made you both my staunchest ally and fiercest adversary since you could first speak. Don’t tell Nathan, but you’re my favorite sibling.

To Nathan. If we all inherited one third of dad’s genius, you got his heart. I don’t say it often enough, but I’m so proud of you and your accomplishments. Don’t tell Sarah, but you’re my favorite sibling.

To Mom. You’re strong and smart, colorful and unconventional. Thank you for teaching me to always do what actually needs doing, regardless of what people think. Stay weird, and remember to feed the chickens.

To Albert. You have a huge heart and a bottomless well of patience. Thank you for joining our family; we’re better for having you.

To the rest of my family. I don’t get to see you nearly as often as I’d like, and I miss you all dearly, but you’re always there when I need you. Thank you for celebrating the wins with me, and for supporting me through the losses.

To Walt and Alvaro, whom I can’t seem to get away from, even by changing jobs. Thank you for your enthusiastic support when I need it, and your stark realism when I need that instead. You both make me a better engineer.

To all of my friends at CoffeeOps in New York and around the world. You’ve graciously allowed me to bounce thoughts off of you and to challenge you, and you’ve challenged me in return. This book is better because of your input.

To Liz Fong-Jones, the renowned observability expert and oracle. Your guidance, direction, and code samples were invaluable, and without your generosity this book would have been a lot harder to write, and the result would have been a lot poorer.

To my technical reviewers Lee Atchison, Alvaro Atienza, David Nicponski, Natalie Pistunovich, and James Quigley. Thank you for having the patience to read every single word I wrote (even the footnotes). This is a much better book because of your sharp eyes and hard work.

And finally, to the entire team of hardworking editors and artists at O’Reilly Media whom I was fortunate enough to work with, especially Amelia Blevins, Danny Elfanbaum, and Zan McQuade. 2020 turned out to be a very interesting year, but your kindness, patience, and support carried me through it.

Acknowledgments for the Second Edition

To Jennifer and Owen. You’re my favorite people in the world and the reason I do everything I do.

To my family. It’s only been a couple of years, but we’ve all grown a little as humans, and grown a little older. I appreciate you all.

To the concept of time. Our relationship is complicated at best, but I couldn’t have done this without you.

To my technical reviewers, Lee Atchison, Tom Elliott, Jess Males, Dinesh Reddy, Celeste Stinger, and Ali Tavakoli. Your care and insight have made this a far, far better book.

To my other friends who helped to review my work, Jennifer Davis and Sean Mack. Thank you for generously donating your time for my little project.

And finally, to everybody at O’Reilly Media with whom I was fortunate enough to work, especially Melissa Potter and Brian Guerin. Thank you for your kindness, patience, and guidance.

I still miss you, Dad.

1 Including CNCF Sandbox, Incubating, and Graduated code-based (nonspecification) projects, as of February 2024.

Part I. Going Cloud Native

Chapter 1. What Is a “Cloud Native” Application?

The most dangerous phrase in the language is, “We’ve always done it this way.”1

Grace Hopper, Computerworld (January 1976)

If you’re reading this book, then you’ve no doubt at least heard the term cloud native before. More likely, you’ve probably seen some of the many, many articles written by vendors bubbling over with breathless adoration and dollar signs in their eyes. If this is the bulk of your experience with the term so far, then you can be forgiven for thinking the term to be ambiguous and buzzwordy, just another of a series of markety expressions that might have started as something useful but have since been taken over by people trying to sell you something. See also: Agile, DevOps.

For similar reasons, a web search for “cloud native definition” might lead you to think that all an application needs to be cloud native is to be written in the “right” language2 or framework, or to use the “right” technology. Certainly, your choice of language can make your life significantly easier or harder, but it’s neither necessary nor sufficient for making an application cloud native.

Is cloud native, then, just a matter of where an application runs? The term cloud native certainly suggests that. All you’d need to do is pour your kludgy3 old application into a container and run it in Kubernetes, and you’re cloud native now, right? Nope. All you’ve done is make your application harder to deploy and harder to manage.4 A kludgy application in Kubernetes is still kludgy.

So, what is a cloud native application? In this chapter, we’ll answer exactly that. First, we’ll examine the history of computing service paradigms up to (and especially) the present and discuss how the relentless pressure to scale drove (and continues to drive) the development and adoption of technologies that provide high levels of dependability at often vast scales. Finally, we’ll identify the specific attributes associated with such an application.

The Story So Far

The story of networked applications is the story of the pressure to scale.

The late 1950s saw the introduction of the mainframe computer. At the time, every program and piece of data was stored in a single giant machine that users could access by means of dumb terminals with no computational ability of their own. All the logic and all the data lived together as one big happy monolith. It was a simpler time.

Everything changed in the 1980s with the arrival of inexpensive network-connected PCs. Unlike dumb terminals, PCs were able to do some computation of their own, making it possible to offload some of an application’s logic onto them. This new multitiered architecture—which separated presentation logic, business logic, and data (Figure 1-1)—made it possible, for the first time, for the components of a networked application to be modified or replaced independently of the others.

Figure 1-1. A traditional three-tiered architecture, with clearly defined presentation, business logic, and data components.

In the 1990s, the popularization of the World Wide Web and the subsequent dot-com gold rush introduced the world to software as a service (SaaS). Entire industries were built on the SaaS model, driving the development of more complex and resource-hungry applications, which were in turn harder to develop, maintain, and deploy. Suddenly the classic multitiered architecture wasn’t enough anymore. In response, business logic started to get decomposed into subcomponents that could be developed, maintained, and deployed independently, ushering in the age of microservices.

In 2006, Amazon launched Amazon Web Services (AWS), which included the Elastic Compute Cloud (EC2) service. Although AWS wasn’t the first infrastructure as a service (IaaS) offering, it revolutionized the on-demand availability of data storage and computing resources, bringing cloud computing—and the ability to quickly scale—to the masses, catalyzing a massive migration of resources into “the cloud.”

Unfortunately, organizations soon learned that life at scale isn’t easy. Bad things happen, and when you’re working with hundreds or thousands of resources (or more!), bad things happen a lot. Traffic will wildly spike up or down, essential hardware will fail, downstream dependencies will become suddenly and inexplicably inaccessible. Even if nothing goes wrong for a while, you still have to deploy and manage all of these resources. At this scale, it’s impossible (or at least wildly impractical) for humans to keep up with all of these issues manually.

Upstream and Downstream Dependencies

In this book we’ll sometimes use the terms upstream dependency and downstream dependency to describe the relative positions of two resources in a dependency relationship. There’s no real consensus in the industry around the directionality of these terms, so this book will use them as follows:

Imagine that we have three services: A, B, and C, as shown in the following figure:

In this scenario, Service A makes requests to (and therefore depends on) Service B, which in turn depends on Service C.

Because Service B depends on Service C, we can say that Service C is a downstream dependency of Service B. By extension, because Service A depends on Service B, which depends on Service C, Service C is also a transitive downstream dependency of Service A.

Inversely, because Service C is depended upon by Service B, we can say that Service B is an upstream dependency of Service C, and that Service A is a transitive upstream dependency of Service C.

What Is Cloud Native?

Fundamentally, a truly cloud native application incorporates everything we’ve learned about running networked applications at scale over the past 60 years. It is scalable in the face of wildly changing load, resilient in the face of environmental uncertainty, and manageable in the face of ever-changing requirements. In other words, a cloud native application is built for life in a cruel, uncertain universe.

But how do we define the term cloud native? Fortunately for all of us,5 we don’t have to. The Cloud Native Computing Foundation—a subfoundation of the renowned Linux Foundation, and something of an acknowledged authority on the subject—has already done it for us:

Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds…​.

These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.6

Cloud Native Computing Foundation, CNCF Cloud Native Definition v1.0

By this definition, cloud native applications are more than just applications that happen to live in a cloud. They’re also scalable, loosely coupled, resilient, manageable, and observable. Taken together, these “cloud native attributes” can be said to constitute the foundation of what it means for a system to be cloud native.

As it turns out, each of those words has a pretty specific meaning of its own, so let’s take a look.

Scalability

In the context of cloud computing, scalability can be defined as the ability of a system to continue to behave as expected in the face of significant upward or downward changes in demand. A system can be considered to be scalable if it doesn’t need to be refactored to perform its intended function during or after a steep increase in demand.

Because unscalable services can seem to function perfectly well under initial conditions, scalability isn’t always a primary consideration during service design. While this might be fine in the short term, services that aren’t capable of growing much beyond their original expectations may have a limited lifetime value. What’s more, it’s often fiendishly difficult to refactor a service for scalability, so building with it in mind can save both time and money in the long run.

There are two different ways that a service can be scaled, each with its own associated pros and cons:

Vertical scaling

A system can be vertically scaled (or scaled up) by upsizing (or downsizing) the hardware resources that are already allocated to it—for example, by adding memory or CPU to a database that’s running on a dedicated computing instance. Vertical scaling has the benefit of being technically relatively straightforward, but any given instance can be upsized only so much.

Horizontal scaling

A system can be horizontally scaled (or scaled out) by adding (or removing) service instances. For example, this can be done by increasing the number of service nodes behind a load balancer, or the number of containers in Kubernetes or another container orchestration system. This strategy has a number of advantages, including redundancy and freedom from the limits of available instance sizes. However, more replicas mean greater design and management complexity, and not all services can be horizontally scaled.

Given that there are two ways of scaling a service—up or out—does that mean that any service whose hardware can be upscaled (and is capable of taking advantage of increased hardware resources) is “scalable”? If you want to split hairs, then sure, to a point. But how scalable is it? Vertical scaling is inherently limited by the size of available computing resources, so a service that can only be scaled up isn’t very scalable at all. If you want to be able to scale by ten times, or a hundred, or a thousand, your service really has to be horizontally scalable.

So what’s the difference between a service that’s horizontally scalable and one that’s not? It all boils down to one thing: state. A service that doesn’t maintain any application state—or that has been very carefully designed to distribute its state between service replicas—will be relatively straightforward to scale out. For any other application, it will be hard. It’s that simple.

The concepts of scalability, state, and redundancy will be discussed in much more depth in Chapter 7.

Loose Coupling

Loose coupling is a system property and design strategy in which a system’s components have minimal knowledge of any other components. Two systems can be said to be loosely coupled when changes to one generally don’t require changes to the other.

For example, web servers and web browsers can be considered to be loosely coupled: servers can be updated or even completely replaced without affecting our browsers at all. This is possible because web servers and browsers have agreed to communicate using a set of standard protocols.7 In other words, they provide a service contract. Imagine the chaos if all the world’s web browsers had to be updated each time NGINX or httpd had a new version!8

It could be said that “loose coupling” is just a restatement of the whole point of microservice architectures: to partition components so that changes in one don’t necessarily affect another. This might even be true. However, this principle is often neglected, and it bears repeating. The benefits of loose coupling—and the consequences if it’s neglected—cannot be overstated. It’s very easy to create a “worst of all worlds” system that pairs the management and complexity overhead of having multiple services with the dependencies and entanglements of a monolithic system: the dreaded distributed monolith.

Unfortunately, there’s no magic technology or protocol that can keep your services from being tightly coupled. Any data exchange format can be misused. There are, however, several that help and that—when combined with declarative APIs and good versioning practices—can be used to create services that are both loosely coupled and modifiable.

These technologies and practices will be discussed and demonstrated in detail in Chapter 8.

Resilience

Resilience (roughly synonymous with fault tolerance) is a measure of how well a system withstands and recovers from errors and faults. A system can be considered resilient if it can continue operating correctly—possibly at a reduced level—rather than failing completely when some part of the system fails.

When we discuss resilience (and the other cloud native attributes as well, but especially when we discuss resilience), we use the word system quite a lot. A system, depending on how it’s used, can refer to anything from a complex web of interconnected services (such as an entire distributed application) to a collection of closely related components (such as the replicas of a single function or service instance), or a single process running on a single machine. Every system is composed of several subsystems, which in turn are composed of sub-subsystems, which are themselves composed of sub-sub-subsystems. It’s turtles all the way down.

In the language of systems engineering, any system can contain defects, or faults, which we lovingly refer to as bugs in the software world. As we all know too well, under certain conditions, any fault can give rise to an error, which is the name we give to any discrepancy between a system’s intended behavior and its actual behavior. Errors have the potential to cause a system to fail to perform its required function: a failure. It doesn’t stop there though: a failure in a subsystem or component becomes a fault in the larger system; any fault that isn’t properly contained has the potential to cascade upward until it causes a total system failure.

In an ideal world, every system would be carefully designed to prevent faults from ever occurring, but this is an unrealistic goal. You can’t prevent every possible fault, and it’s wasteful and unproductive to try. However, by assuming that all of a system’s components are certain to fail—which they are—and designing them to respond to potential faults and limit the effects of failures, you can produce a system that’s functionally healthy even when some of its components are not.

There are many ways of designing a system for resiliency. Deploying redundant components is perhaps the most common approach, but that also assumes that a fault won’t affect all components of the same type. Circuit breakers and retry logic can be included to prevent failures from propagating between components. Faulty components can even be reaped—or can intentionally fail—to benefit the larger system.

We’ll discuss all of these approaches (and more) in much more depth in Chapter 9.

Resilience Is Not Reliability

The terms resilience and reliability describe closely related concepts and are often confused. But, as we’ll discuss in Chapter 9, they aren’t quite the same thing:9

  • The resilience of a system is the degree to which it can continue to operate correctly in the face of errors and faults. Resilience, along with the other four cloud native properties, is just one factor that contributes to reliability.

  • The reliability of a system is its ability to behave as expected for a given time interval. Reliability, in conjunction with attributes like availability and maintainability, contributes to a system’s overall dependability.

Manageability

A system’s manageability is the ease (or lack thereof) with which its behavior can be modified to keep it secure, running smoothly, and compliant with changing requirements. A system can be considered manageable if it’s possible to sufficiently alter its behavior without having to alter its code.

As a system property, manageability gets a lot less attention than some of the more attention-grabbing attributes like scalability or observability. It’s every bit as critical, though, particularly in complex, distributed systems.

For example, imagine a hypothetical system that includes a service and a database, and that the service refers to the database by a URL. What if you needed to update that service to refer to another database? If the URL was hardcoded, you might have to update the code and redeploy, which, depending on the system, might be awkward for its own reasons. Of course, you could update the DNS record to point to the new location, but what if you needed to redeploy a development version of the service, with its own development database?

A more manageable system might, for example, represent this value as an easily modified environment variable; if the service that uses it is deployed in Kubernetes, adjustments to its behavior might be a matter of updating a value in a ConfigMap. A more complex system might even provide a declarative API that a developer can use to tell the system what behavior she expects. There’s no single right answer.10
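To make that concrete, here is a minimal sketch (not an excerpt from a later chapter) of a service reading its database location from the environment. The DB_URL variable name and the fallback value are invented for illustration:

package main

import (
    "fmt"
    "os"
)

// databaseURL reads the (hypothetical) DB_URL environment variable,
// falling back to a local development default when it isn't set.
func databaseURL() string {
    if url, ok := os.LookupEnv("DB_URL"); ok {
        return url
    }
    return "postgres://localhost:5432/dev"
}

func main() {
    fmt.Println("connecting to", databaseURL())
}

Because the value lives outside the binary, an operator (or a Kubernetes ConfigMap) can change it without touching the code, which is exactly the property we want.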

Manageability isn’t limited to configuration changes. It encompasses all possible dimensions of a system’s behavior, be it the ability to activate feature flags, rotate credentials or TLS certificates, or even (and perhaps especially) deploy or upgrade (or downgrade) system components.

Manageable systems are designed for adaptability and can be readily adjusted to accommodate changing functional, environmental, or security requirements. Unmanageable systems, on the other hand, tend to be far more brittle, frequently requiring ad hoc—often manual—changes. The overhead involved in managing such systems places fundamental limits on their scalability, availability, and reliability.

The concept of manageability—and some preferred practices for implementing it in Go—will be discussed in much more depth in Chapter 10.

Manageability Is Not Maintainability

It can be said that manageability and maintainability have some “mission overlap” in that they’re both concerned with the ease with which a system can be modified,11 but they’re actually quite different:

  • Manageability describes the ease with which changes can be made to the behavior of a running system, up to and including deploying (and redeploying) components of that system. It’s how easy it is to make changes from the outside.

  • Maintainability describes the ease with which changes can be made to a system’s underlying functionality, most often its code. It’s how easy it is to make changes from the inside.

Observability

The observability of a system is a measure of how well its internal states can be inferred from knowledge of its external outputs. A system can be considered observable when it’s possible to quickly and consistently ask novel questions about it with minimal prior knowledge and without having to reinstrument or build new code.

On its face, this might sound simple enough: just sprinkle in some logging and slap up a couple of dashboards, and your system is observable, right? Almost certainly not. Not with modern, complex systems in which almost any problem is the manifestation of a web of multiple things going wrong simultaneously. The age of the LAMP stack is over; things are harder now.

This isn’t to say that metrics, logging, and tracing aren’t important. On the contrary: they represent the building blocks of observability. But their mere existence is not enough: data is not information. They need to be used the right way. They need to be rich. Together, they need to be able to answer questions that you’ve never even thought to ask before.

The ability to detect and debug problems is a fundamental requirement for the maintenance and evolution of a robust system. But in a distributed system, it’s often hard enough just figuring out where a problem is. Complex systems are just too…​complex. The number of possible failure states for any given system is proportional to the product of the number of possible partial and complete failure states of each of its components, and it’s impossible to predict all of them. The traditional approach of focusing attention on the things we expect to fail simply isn’t enough.

Emerging practices in observability can be seen as the evolution of monitoring. Years of experience with designing, building, and maintaining complex systems have taught us that traditional methods of instrumentation—including but not limited to dashboards, unstructured logs, or alerting on various “known unknowns”—just aren’t up to the challenges presented by modern distributed systems.

Observability is a complex and subtle subject, but, fundamentally, it comes down to this: instrument your systems richly enough and under real enough scenarios so that, in the future, you can answer questions that you haven’t thought to ask yet.

The concept of observability—and some suggestions for implementing it—will be discussed in much more depth in Chapter 11.

Why Is Cloud Native a Thing?

The move toward “cloud native” is an example of architectural and technical adaptation, driven by environmental pressure and selection. It’s evolution—survival of the fittest. Bear with me here; I’m a biologist by training.

Eons ago, in the Dawn of Time,12 applications would be built and deployed (generally by hand) to one or a small number of servers, where they were carefully maintained and nurtured. If they got sick, they were lovingly nursed back to health. If a service went down, you could often fix it with a restart. Observability was shelling into a server to run top and review logs. It was a simpler time.

In 1997, only 11% of people in industrialized countries and 2% worldwide were regular internet users. The subsequent years saw exponential growth in internet access and adoption, however, and by 2017 that number had exploded to 81% in industrialized countries and 48% worldwide13—and continues to grow.

All of those users—and their money—applied stress to services, generating significant incentive to scale. What’s more, as user sophistication and dependency on web services grew, so did expectations that their favorite web applications would be both feature-rich and always available.

The result was, and is, a significant evolutionary pressure toward scale, complexity, and dependability. These three attributes don’t play well together, though, and the traditional approaches simply couldn’t, and can’t, keep up. New techniques and practices had to be invented.

Fortunately, the introduction of public clouds and IaaS made it relatively straightforward to scale infrastructure out. Shortcomings with dependability could often be compensated for with sheer numbers. But that introduced new problems. How do you maintain a hundred servers? A thousand? Ten thousand? How do you install your application onto them, or upgrade it? How do you debug it when it misbehaves? How do you even know it’s healthy? Problems that are merely annoying at small scale tend to become very hard at large scale.

Cloud native is a thing because scale is the cause of (and solution to) all our problems. It’s not magic. It’s not special. All fancy language aside, cloud native techniques and technologies exist for no other reasons than to make it possible to leverage the benefits of a “cloud” (quantity) while compensating for its downsides (lack of dependability).

Summary

In this chapter, we talked a fair amount about the history of computing and how what we now call cloud native isn’t a new phenomenon so much as the inevitable outcome of a virtuous cycle of technological demand driving innovation driving more demand.

Ultimately, though, all of those fancy words distill down to a single point: today’s applications have to dependably serve a lot of people. The techniques and technologies that we call cloud native represent the best current practices for building a service that’s scalable, adaptable, and resilient enough to do that.

What does all of this have to do with Go? As it turns out, cloud native infrastructure requires cloud native tools. In Chapter 2, we’ll start to talk about what that means, exactly.

1 Esther Surden, “Privacy Laws May Usher in Defensive DP: Hopper,” Computerworld, January 26, 1976, 9.

2 Which is Go. Don’t get me wrong—this is still a Go book after all.

3 A kludge is “an awkward or inelegant solution.” It’s a fascinating word with a fascinating history.

4 Have you ever wondered why so many Kubernetes migrations fail?

5 Especially for me. I get to write this cool book.

6 Cloud Native Computing Foundation, “CNCF Cloud Native Definition v1.0”, GitHub, December 7, 2020.

7 Those of us who remember the Browser Wars of the 1990s will recall that this wasn’t always strictly true.

8 Or if every website required a different browser. That would stink, wouldn’t it?

9 If you’re interested in a complete academic treatment, I highly recommend Reliability and Availability Engineering by Kishor S. Trivedi and Andrea Bobbio (Cambridge University Press, 2017).

10 There are some wrong ones though.

11 Plus, they both start with M. Super confusing.

12 That time was the 1990s.

13 International Telecommunication Union (ITU), “Internet users per 100 inhabitants 1997 to 2007” and “Internet users per 100 inhabitants 2005 to 2017,” ICT Data and Statistics (IDS).

Chapter 2. Why Go Rules the Cloud Native World

Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius—and a lot of courage—to move in the opposite direction.1

E. F. Schumacher, “Small Is Beautiful” (August 1973)

The Motivation Behind Go

The idea of Go emerged in September of 2007 at Google, the inevitable outcome of putting a bunch of smart people in a room and frustrating the heck out of them.

The people in question were Robert Griesemer, Rob Pike, and Ken Thompson, all already highly regarded for their individual work in designing other languages. The source of their collective ire was nothing less than the entire set of programming languages that were available at the time, which they were finding just weren’t well-suited to the task of describing the kinds of distributed, scalable, resilient services that Google was building.2

Essentially, the common languages of the day had been developed in a different era, one before multiple processors were commonplace and networks were quite so ubiquitous. Their support for multicore processing and networking—essential building blocks of modern cloud native services3—was often limited or required extraordinary efforts to utilize. Simply put, programming languages weren’t keeping up with the needs of modern software development.

Features for a Cloud Native World

Their frustrations were many, but all of them amounted to one thing: the undue complexity of the languages they were working with was making it harder to build server software. These included but weren’t limited to the following:4

Low program comprehensibility

Code had become too hard to read. Unnecessary bookkeeping and repetition was compounded by functionally overlapping features that often encouraged cleverness over clarity.

Slow builds

Language construction and years of feature creep resulted in build times that ran for minutes or hours, even on large build clusters.

Inefficiency

Many programmers responded to the aforementioned problems by adopting more fluid, dynamic languages, effectively trading efficiency and type safety for expressiveness.

High cost of updates

Incompatibilities between even minor versions of a language, as well as any dependencies it may have (and its transitive dependencies!), often made updating an exercise in frustration.

Over the years, multiple—often quite clever—solutions have been presented to address some of these issues in various ways, usually introducing additional complexity in the process. Clearly, they couldn’t be fixed with a new API or language feature. So, Go’s designers envisioned a modern language, the first language built for the cloud native era, supporting modern networked and multicore computing, expressive yet comprehensible, and allowing its users to focus on solving their problems instead of struggling with their language.

The result, the Go language, is notable as much for the features it explicitly doesn’t have as it is for the ones it does. Some of those features (and nonfeatures) and the motivation behind them are discussed in the following sections.

Composition and Structural Typing

Object-oriented programming (OOP), which is based on the concept of “objects” of various “types” possessing various attributes, has existed since the 1960s, but it truly came into vogue in the early to mid-1990s with the release of Java and the addition of object-oriented features to C++. Since then, it has emerged as the dominant programming paradigm and remains so even today.

The promise of OOP is seductive, and the theory behind it even makes a certain kind of intuitive sense. Data and behaviors can be associated with types of things, which can be inherited by subtypes of those things. Instances of those types can be conceptualized as tangible objects with properties and behaviors—components of a larger system modeling concrete, real-world concepts.

In practice, however, OOP using inheritance often requires that relationships between types be carefully considered and painstakingly designed, and particular design patterns and practices be faithfully observed. As such, as illustrated in Figure 2-1, the tendency in OOP is for the focus to shift away from developing algorithms and toward developing and maintaining taxonomies and ontologies.

Figure 2-1. Over time, object-oriented programming trends toward taxonomy.

That’s not to say that Go doesn’t have object-oriented features that allow polymorphic behavior and code reuse. It, too, has a type-like concept in the form of structs, which can have properties and behaviors. What it rejects is inheritance and the elaborate relationships that come with it, opting instead to assemble more complex types by embedding simpler ones within them: an approach known as composition.

Specifically, where inheritance revolves around extending “is a” relationships between classes (i.e., a car “is a” motored vehicle), composition allows types to be constructed using “has a” relationships to define what they can do (i.e., a car “has a” motor). In practice, this permits greater design flexibility while allowing the creation of business domains that are less susceptible to disruption by the quirks of “family members.”

By extension, while Go uses interfaces to describe behavioral contracts, it has no “is a” concept, so equivalency is determined by inspecting a type’s definition, not its lineage. For example, given a Shape interface that defines an Area method, any type with an Area method will implicitly satisfy the Shape interface, without having to explicitly declare itself as a Shape:

type Shape interface {                // Any Shape must have an Area
    Area() float64
}

type Rectangle struct {               // Rectangle doesn't explicitly
    width, height float64             // declare itself to be a Shape
}

func (r Rectangle) Area() float64 {   // Rectangle has an Area method; it
    return r.width * r.height         // satisfies the Shape interface
}

This structural typing mechanism, which has been described as “duck typing5 at compile time,” largely sheds the burdensome maintenance of tedious taxonomies that saddle more traditional object-oriented languages like Java and C++, freeing programmers to focus on data structures and algorithms.
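To see the implicit satisfaction in action, here is a minimal sketch (assuming the Shape and Rectangle definitions above live in the same package) in which a function written against Shape accepts a Rectangle, even though neither type mentions the other:

package main

import "fmt"

// Describe accepts any value that satisfies Shape.
func Describe(s Shape) string {
    return fmt.Sprintf("area: %.2f", s.Area())
}

func main() {
    r := Rectangle{width: 3, height: 4}
    fmt.Println(Describe(r))          // Prints "area: 12.00"
}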

Comprehensibility

Languages like C++ and Java are often criticized for being clumsy, awkward to use, and unnecessarily verbose. They require lots of repetition and careful bookkeeping, saddling projects with superfluous boilerplate that gets in the way of programmers who have to divert their attention to things other than the problem they’re trying to solve, and limiting projects’ scalability under the weight of all the resulting complexity.

Go was designed with large projects with lots of contributors in mind. Its minimalist design (just 25 keywords and 1 loop type), and the strong opinions of its compiler, strongly favor clarity over cleverness.6 This in turn encourages simplicity and productivity over clutter and complexity. The resulting code is relatively easy to ingest, review, and maintain, and harbors far fewer “gotchas.”

CSP-Style Concurrency

Most mainstream languages provide some means of running multiple processes concurrently, allowing a program to be composed of independently executed tasks. Used correctly, concurrency can be incredibly useful, but it also introduces a number of challenges, particularly around ordering events, communication between processes, and coordination of access to shared resources.

Traditionally, a programmer will confront these challenges by allowing processes to share some piece of memory, which is then wrapped in locks or mutexes to restrict access to one process at a time. But even when well-implemented, this strategy can generate a fair amount of bookkeeping overhead. It’s also easy to forget to lock or unlock shared memory, potentially introducing race conditions, deadlocks, or concurrent modifications. This class of errors can be fiendishly difficult to debug.

Go, on the other hand, favors another strategy, based on a formal language called communicating sequential processes (CSP), first described in Tony Hoare’s influential paper of the same name7 that describes patterns of interaction in concurrent systems in terms of message passing via channels.

The resulting concurrency model, implemented in Go with language primitives like goroutines and channels, makes Go uniquely8 capable of elegantly structuring concurrent software without depending entirely on locking. It encourages developers to limit sharing memory and to instead allow processes to interact with one another entirely by passing messages. This idea is often summarized by the following Go proverb:

Do not communicate by sharing memory. Instead, share memory by communicating.
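As a minimal illustration of this proverb (a sketch, not an excerpt from a later chapter), the following program runs a summation in its own goroutine and receives the result over a channel, so no memory is shared and no locks are needed:

package main

import "fmt"

// sum totals the values and sends the result over the channel instead
// of writing to memory shared with the caller.
func sum(values []int, result chan<- int) {
    total := 0
    for _, v := range values {
        total += v
    }
    result <- total                       // communicate the result
}

func main() {
    result := make(chan int)
    go sum([]int{1, 2, 3, 4, 5}, result)  // run concurrently

    fmt.Println(<-result)                 // blocks until sum sends; prints 15
}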

Concurrency Is Not Parallelism

Computational concurrency and parallelism are often confused, which is understandable given that both concepts describe the state of having multiple processes executing during the same period of time. However, they are most definitely not the same thing:9

Parallelism

Describes the simultaneous execution of multiple independent processes

Concurrency

Describes the composition of independently executing processes; it says nothing about when processes will execute

Fast Builds

One of the primary motivations for the Go language was the maddeningly long build times for certain languages of the time, which even on Google’s large compilation clusters often required minutes, or even hours, to complete. This eats away at development time and grinds down developer productivity. Given Go’s primary purpose of enhancing rather than hindering developer productivity, long build times had to go.

The specifics of the Go compiler are beyond the scope of this book (and beyond my own expertise). Briefly, however, the Go language was designed to provide a model of software construction free of complex relationships, greatly simplifying dependency analysis and eliminating the need for C-style include files and libraries and the overhead that comes with them. As a result, most Go builds complete in seconds, or occasionally minutes, even on relatively humble hardware. For example, building all 58.8 million lines11 of Go in Kubernetes v1.27.3 on a MacBook Pro with a 2.3 GHz 8-Core Intel i9 processor and 16 GB of RAM required about 40 seconds of total time:

mtitmus:~/workspace/kubernetes[MASTER]$ time make

make  67.02s user 32.51s system 247% cpu 40.231 total

Not that this comes without compromises. Any proposed change to the Go language is weighed in part against its likely effect on build times; some otherwise promising proposals have been rejected on the grounds that they would increase them.

Linguistic Stability

Go v1 was released in March of 2012, defining both the specification of the language and the specification of a set of core APIs. The natural consequence of this is an explicit promise, from the Go design team to the Go users, that programs written in Go 1 will continue to compile and run correctly, unchanged, for the lifetime of the Go 1 specification. That is, Go programs that work today can be expected to continue to work even under future “point” releases of Go 1 (Go 1.1, Go 1.2, etc.).12

This stands in stark contrast to many other languages, which often add new features enthusiastically, gradually increasing the complexity of the language—and anything written in it—until a once elegant language becomes a sprawling featurescape that can be exceedingly difficult to master.13

The Go Team considers this exceptional level of linguistic stability to be a vital feature of Go; it allows users to trust Go and to build on it. It allows libraries to be consumed and built upon with minimal hassle and dramatically lowers the cost of updates, particularly for large projects and organizations. Importantly, it also allows the Go community to use Go and to learn from it—to spend time writing with the language rather than writing the language.

This is not to say that Go won’t grow; both the APIs and the core language certainly can acquire new packages and features,14 and there are many proposals for exactly that,15 but not in a way that breaks existing Go 1 code.

That being said, it’s quite possible16 that there will actually never be a Go 2. More likely, Go 1 will continue to be compatible indefinitely, and in the unlikely event that a breaking change is introduced, Go will provide a conversion utility, like the go fix command that was used during the move to Go 1.

Memory Safety

The designers of Go have taken great pains to ensure that the language is free of the various bugs and security vulnerabilities—not to mention tedious bookkeeping—associated with direct memory access. Pointers are strictly typed and are always initialized to some value (even if that value is nil), and pointer arithmetic is explicitly disallowed. Built-in reference types like maps and channels, which are represented internally as pointers to mutable structures, are initialized by the make function. Simply put, Go neither needs nor allows the kind of manual memory management and manipulation that lower-level languages like C and C++ allow and require, and the subsequent gains with respect to complexity and memory safety can’t be overstated.
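A few lines of Go demonstrate those rules (a minimal sketch):

package main

import "fmt"

func main() {
    var p *int                       // pointers are typed; the zero value is nil
    fmt.Println(p == nil)            // true: never garbage, never arithmetic

    counts := make(map[string]int)   // reference types are initialized by make
    counts["requests"]++             // and are immediately safe to use
    fmt.Println(counts["requests"])  // 1
}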

The fact that Go is a garbage-collected language obviates the need to carefully track and free memory for every allocated byte, lifting a considerable bookkeeping burden from the programmer’s shoulders. Life without malloc is liberating.

What’s more, by eliminating manual memory management and manipulation—even pointer arithmetic—Go’s designers have made it effectively immune to an entire class of memory errors and the security holes they can introduce. No memory leaks, no buffer overruns, no dangling pointers. Nothing.

Of course, this simplicity and ease of development comes with some trade-offs, and while Go’s garbage collector is incredibly sophisticated, it does introduce some overhead. As such, Go can’t compete with languages like C++ and Rust in raw execution speed. That said, as we’ll see in the next section, Go still does pretty well for itself in that arena.

Performance

Confronted with the slow builds and tedious bookkeeping of the statically typed, compiled languages like C++ and Java, many programmers moved toward more dynamic, fluid languages like Python. While these languages are excellent for many things, they’re also very inefficient relative to compiled languages like Go, C++, and Java.

Some of this is made quite clear in the benchmarks of Table 2-1. Of course, benchmarks in general should be taken with a grain of salt, but some results are particularly striking.

Table 2-1. Relative benchmarks for common service languages (seconds)a
Benchmark        C++      Go       Java     NodeJS   Python3   Ruby      Rust
Fannkuch-Redux   7.53     8.25     10.71    11.08    285.20    169.71    7.21
FASTA            1.03     1.26     1.16     33.14    33.40     26.18     0.90
K-Nucleotide     1.96     7.48     5.03     15.72    44.13     83.82     2.88
Mandelbrot       2.34     3.73     4.11     4.03     155.28    155.55    1.01
N-Body           4.88     6.36     6.77     8.41     383.12    190.96    3.92
Spectral norm    1.52     1.42     1.54     1.66     78.36     57.56     0.72

a Isaac Gouy, The Computer Language Benchmarks Game website, June 20, 2023.

On inspection, it seems that the results can be clustered into three categories corresponding with the types of languages used to generate them:

  • Compiled, strictly typed languages with manual memory management (C++, Rust)

  • Compiled, strictly typed languages with garbage collection (Go, Java)

  • Interpreted, dynamically typed languages (NodeJS, Python, Ruby)

These results suggest that, while the garbage-collected languages are generally slightly less performant than the ones with manual memory management, the differences don’t appear to be great enough to matter except under the most demanding requirements.

The difference between the interpreted and compiled languages, however, is striking. At least in these examples, Python, the archetypical dynamic language, benchmarks about ten to one hundred times slower than most compiled languages. Of course, it can be argued that this is still perfectly adequate for many—if not most—purposes, but this is less true for cloud native applications, which often have to endure significant spikes in demand, ideally without having to rely on potentially costly upscaling.

Static Linking

By default, Go programs are compiled directly into native, statically linked executable binaries into which all necessary Go libraries and the Go runtime are copied. This produces slightly larger files (on the order of about 2 MB for a “hello, world!”), but the resulting binary has no external language runtime to install17 and no external library dependencies to upgrade or conflict with,18 and can be easily distributed to users or deployed to a host without fear of dependency or environmental conflicts.

This ability is particularly useful when you’re working with containers. Because Go binaries don’t require an external language runtime or even a distribution, they can be built into “scratch” images that don’t have parent images. The result is a relatively small image with minimal deployment latency and data transfer overhead. These are very useful traits in an orchestration system like Kubernetes that may need to pull the image with some regularity.

Static Typing

Back in the early days of Go’s design, its authors had to make a choice: would it be statically typed, like C++ or Java, requiring variables to be explicitly defined before use, or dynamically typed, like Python, allowing programmers to assign values to variables without defining them and therefore generally faster to code? It wasn’t a particularly hard decision; it didn’t take very long. Static typing was the obvious choice, but it wasn’t arbitrary or based on personal preference.19

First, type correctness for statically typed languages can be evaluated at compile time, making them far more performant (see Table 2-1).

Second, the designers of Go understood that the time spent in development is only a fraction of a project’s total lifecycle, and any gains in coding velocity with dynamically typed languages are more than made up for by the increased difficulty in debugging and maintaining such code. After all, what Python programmer hasn’t had their code crash because they tried to use a string as an integer?

Take the following Python code snippet, for example:

my_variable = 0

while my_variable < 10:
    my_varaible = my_variable + 1   # Typo! Infinite loop!

See it yet? Keep trying if you don’t. It can take a second.

Any programmer can make this kind of subtle misspelling error, which just so happens to also produce perfectly valid, executable Python. These are just two trivial examples of an entire class of errors that Go will catch at compile time rather than (heaven forbid) in production, and generally closer in the code to the location where they are introduced. After all, it’s well-understood that the earlier in the development cycle you catch a bug, the easier (read cheaper) it is to fix it.

Finally, I’ll even assert something somewhat controversial: typed languages are more readable. Python is often lauded as especially readable with its forgiving nature and somewhat English-like syntax,20 but what would you do if presented with the following Python function signature?

def send(message, recipient):

Is message a string? Is recipient an instance of some class described elsewhere? Yes, this could be improved with some documentation and a couple of reasonable defaults, but many of us have had to maintain enough code to know that that’s a pretty distant star to wish on. Explicitly defined types guide development and ease the mental burden of writing code by tracking information the programmer would otherwise have to keep in their head, and they serve as documentation for both the programmer and everybody who has to maintain the code.
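For comparison, here’s a minimal sketch of what an explicitly typed equivalent might look like in Go. The Send function and the User type are hypothetical, invented purely for illustration:

type User struct {
    Name    string
    Address string
}

// Send delivers a message to a recipient. The signature alone documents the
// contract: the message is a string, the recipient is a User, and the call
// can fail with an error.
func Send(message string, recipient User) error {
    // ...
    return nil
}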

Summary

If Chapter 1 focused on what makes a system cloud native, then this chapter can be said to have focused on what makes a language, specifically Go, a good fit for building cloud native services.

However, while a cloud native system needs to be scalable, loosely coupled, resilient, manageable, and observable, a language for the cloud native era has to be able to do more than just build systems with those attributes. After all, with a bit of effort, pretty much any language can, technically, be used to build such systems. So what makes Go so special?

It can be argued that all of the features presented in this chapter directly or indirectly contribute to the cloud native attributes from Chapter 1. Concurrency and memory safety can be said to contribute to service scalability, and structural typing to allow loose coupling, for example. But while Go is the only mainstream language I know that puts all of these features in one place, are they really so novel?

Perhaps most conspicuous of Go’s features are its baked-in—not bolted-on—concurrency features, which allow a programmer to fully and more safely utilize modern networking and multicore hardware. Goroutines and channels are wondrous, of course, and make it far easier to build resilient, highly concurrent networked services, but they’re technically not unique if you consider some less common languages like Clojure or Erlang.

I would assert that where Go really shines is in its faithful adherence to the principle of clarity over cleverness, which extends from an understanding that source code is written by humans for other humans.21 That it compiles into machine code is almost immaterial.

Go is designed to support the way people actually work together: in teams, which sometimes change membership, whose members also work on other things. In this environment, code clarity, the minimization of “tribal knowledge,” and the ability to rapidly iterate are critical. Go’s simplicity is often misunderstood and unappreciated, but it lets programmers focus on solving problems instead of struggling with the language.

In Chapter 3, we’ll review many of the specific features of the Go language, where we’ll get to see that simplicity up close.

1 E. F. Schumacher, “Small Is Beautiful,” The Radical Humanist (August 1973): 22.

2 These were “cloud native” services before the term cloud native was coined.

3 Of course, they weren’t called “cloud native” at the time; to Google they were just “services.”

4 Rob Pike, “Go at Google: Language Design in the Service of Software Engineering”, Google, 2012.

5 In languages that use duck typing, the type of an object is less important than the methods it defines. In other words, “If it walks like a duck and it quacks like a duck, then it must be a duck.”

6 Dave Cheney, “Clear Is Better than Clever”, The Acme of Foolishness (blog), July 19, 2019.

7 C. A. R. Hoare, “Communicating Sequential Processes”, Communications of the ACM 21, no. 8 (August 1978): 666–77.

8 At least among the “mainstream” languages, whatever that means.

9 Andrew Gerrand, “Concurrency Is Not Parallelism”, The Go Blog, January 16, 2013.

10 C++. We’re talking about C++.

11 Not counting comments; Openhub.net. “Kubernetes”, Open Hub, Black Duck Software, Inc., accessed June 20, 2023.

12 The Go Team, “Go 1 and the Future of Go Programs”, The Go Documentation.

13 Anybody remember Java 1.1? I remember Java 1.1. Sure, we didn’t have generics or autoboxing or enhanced for loops back then, but we were happy. Happy, I tell you.

14 We finally got generics! Go, fightin’ Parametric Polymorphics!

15 The Go Team. “Proposing Changes to Go”, GitHub, accessed August 7, 2019.

16 Rob Pike, “Sydney Golang Meetup—Rob Pike—Go 2 Draft Specifications”, YouTube, November 13, 2018.

17 Take that, Java.

18 Take that, Python.

19 Few arguments in programming generate as many snarky comments as static versus dynamic typing, except perhaps the Great Tabs versus Spaces Debate, on which Go’s unofficial position is “Shut up, who cares?”

20 I, too, have been lauded for my forgiving nature and somewhat English-like syntax.

21 Or for the same human after a few months of thinking about other things.

Part II. Cloud Native Go Constructs

Chapter 3. Go Language Foundations

A language that doesn’t affect the way you think about programming is not worth knowing.1

Alan Perlis, ACM SIGPLAN Notices (September 1982)

No programming book would be complete without at least a brief refresher of its language of choice, so here we are!

This chapter will differ slightly from the ones in more introductory-level books, however, in that we’re assuming that you’re at least familiar with common coding paradigms but may or may not be a little rusty with the finer points of Go syntax. As such, this chapter will focus as much on Go’s nuances and subtleties as its fundamentals. For a deeper dive into the latter, I recommend either Learning Go by Jon Bodner (O’Reilly) or The Go Programming Language by Alan A. A. Donovan and Brian W. Kernighan (Addison-Wesley Professional).

If you’re relatively new to the language, you’ll definitely want to read on. Even if you’re somewhat comfortable with Go, you might want to skim this chapter: there will be a gem or two in here for you. If you’re a seasoned veteran of the language, you can go ahead and move on to Chapter 4 (or read it ironically and judge me).

Basic Data Types

Go’s basic data types, the fundamental building blocks from which more complex types are constructed, can be divided into three subcategories:

  • Booleans that contain only one bit of information—true or false—representing some logical conclusion or state.

  • Numeric types that represent simple—variously sized floating point and signed and unsigned integers—or complex numbers.

  • Strings that represent an immutable sequence of Unicode code points.

Booleans

The Boolean data type, representing the two logical truth values, exists in some form2 in every programming language ever devised. It’s represented by the bool type, a dedicated boolean type with exactly two possible values:

  • true

  • false

Go supports all of the typical logical operations:

and := true && false
fmt.Println(and)        // "false"

or := true || false
fmt.Println(or)         // "true"

not := !true
fmt.Println(not)        // "false"
Note

Curiously, Go doesn’t include a logical XOR operator. There is a ^ operator, but it’s reserved for bitwise XOR operations.
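For instance, here’s a brief sketch of the bitwise ^ operator in action (the operands are arbitrary):

x := 0b1100 ^ 0b1010       // Bitwise XOR of two integers
fmt.Printf("%04b\n", x)    // "0110"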

Simple Numbers

Go has a small menagerie of systematically named floating point and signed and unsigned integer number types:

Signed integer

int8, int16, int32, int64

Unsigned integer

uint8, uint16, uint32, uint64

Floating point

float32, float64

Systematic naming is nice, but code is written by humans with squishy human brains, so the Go designers provided two lovely conveniences.

First, there are two “machine-dependent” types, simply called int and uint, whose size is determined based on available hardware. These are convenient if the specific size of your numbers isn’t critical. Sadly, there’s no machine-dependent floating-point number type.

Second, two integer types have mnemonic aliases: byte, which is an alias for uint8, and rune, which is an alias for int32.
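To make these types concrete, here’s a small sketch showing a few of them in use (the variable names and values are arbitrary):

var small int8 = 127          // The largest value an int8 can hold
var count uint64 = 1 << 40    // An unsigned 64-bit integer
var ratio float64 = 0.5       // A double-precision floating-point number

var b byte = 'G'              // byte is an alias for uint8
var r rune = 'ö'              // rune is an alias for int32

fmt.Println(small, count, ratio, b, r)   // "127 1099511627776 0.5 71 246"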

Tip

For most uses, it generally makes sense to use just int and float64.

Complex Numbers

Go offers two sizes of complex numbers, if you’re feeling a little imaginative:3 complex64 and complex128. These can be expressed as an imaginary literal: a floating-point number immediately followed by an i:

var x complex64 = 3.1415i
fmt.Println(x)                  // "(0+3.1415i)"

Complex numbers are very neat but don’t come into play all that often, so I won’t drill down into them here. If you’re as fascinated by them as I hope you are, Learning Go by Jon Bodner gives them the full treatment they deserve.

Strings

A string represents a sequence of textual code points. Strings in Go are immutable: once created, it’s not possible to change a string’s contents.

Go supports two styles of string literals, the double-quote style (or interpreted literals) and the back-quote style (or raw string literals). For example, the following two string literals are equivalent:

// The interpreted form
"\"Hello\nworld!\""

// The raw form
`"Hello
world!"`

In the interpreted string literal, each \n escape sequence will be interpreted as a newline character, and each \" escape sequence as a double-quote character.

Behind the scenes, a string is actually just a read-only slice of byte values, so pretty much any operation that can be applied to slices and arrays can also be applied to strings. If you aren’t clear on slices yet, you can take this moment to read ahead to “Slices”.

Warning

It’s important to remember that a string value holds arbitrary byte values. There’s nothing that forces it to use Unicode, UTF-8, or any other format. A string really is just a fancy byte slice.

For a deep dive into the subject, see Rob Pike’s post “Strings, Bytes, Runes and Characters in Go”.

Variables

Variables can be declared by using the var keyword to pair an identifier with some typed value and may be updated at any time, with the general form:

var name type = expression

However, there is considerable flexibility in variable declaration:

  • With initialization: var foo int = 42

  • Of multiple variables: var foo, bar int = 42, 1302

  • With type inference: var foo = 42

  • Of mixed multiple types: var b, f, s = true, 2.3, "hello"

  • Without initialization (see “Zero Values”): var s string

Note

Go is very opinionated about clutter: it hates it. If you declare a variable in a function but don’t use it, your program will refuse to compile.

Short Variable Declarations

Go provides a bit of syntactic sugar that allows variables within functions to be simultaneously declared and assigned by using the := operator in place of a var declaration with an implicit type.

Short variable declarations have the general form:

name := expression

These can be used to declare both single and multiple assignments:

  • With initialization: percent := rand.Float64() * 100.0

  • Multiple variables at once: x, y := 0, 2

In practice, short variable declarations are the most common way that variables are declared and initialized in Go; var is usually used only for local variables that need an explicit type or to declare a variable that will be assigned a value later.

Warning

Remember that := is a declaration and = is an assignment. A := operator that only attempts to redeclare existing variables will fail at compile time.

Interestingly (and sometimes confusingly), if a short variable declaration has a mix of new and existing variables on its left-hand side, the short variable declaration acts like an assignment to the existing variables.
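A minimal sketch of this mixed behavior follows; the variable names are arbitrary:

x, y := 1, 2           // Declares and initializes both x and y
y, z := 3, 4           // Declares z, but only assigns to the existing y

fmt.Println(x, y, z)   // "1 3 4"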

Zero Values

When a variable is declared without an explicit value, it’s assigned to the zero value for its type:

  • Integers: 0

  • Floats: 0.0

  • Booleans: false

  • Strings: "" (the empty string)

To illustrate, let’s define four variables of various types, without explicit initialization:

var i int
var f float64
var b bool
var s string

Now, if we were to use these variables we’d find that they were, in fact, already initialized to their zero values:

fmt.Printf("integer: %d\n", i)   // integer: 0
fmt.Printf("float: %f\n", f)     // float: 0.000000
fmt.Printf("boolean: %t\n", b)   // boolean: false
fmt.Printf("string: %q\n", s)    // string: ""

You’ll notice the use of the fmt.Printf function, which allows greater control over output format. If you’re not familiar with this function, or with Go’s format strings, see the following sidebar.

Formatting I/O in Go

Go’s fmt package implements several functions for formatting input and output. The most commonly used of these are (probably) fmt.Printf and fmt.Scanf, which can be used to write to standard output and read from standard input, respectively:

func Printf(format string, a ...any) (n int, err error) {}
func Scanf(format string, a ...any) (n int, err error) {}

You’ll notice that each requires a format parameter. This is its format string: a string embedded with one or more verbs that direct how its parameters should be interpreted. For output functions like fmt.Printf, these verbs specify the format with which the corresponding arguments will be printed.

Each function also has a parameter a. The ... (variadic) operator indicates that the function accepts zero or more parameters in this place; any indicates that the parameter’s type is unspecified. Variadic functions will be covered in “Variadic Functions”; the any type in “Interfaces”.

Some of the common verb flags used in format strings include:

%v   The value in a default format
%T   A representation of the type of the value
%%   A literal percent sign; consumes no value
%t   Boolean: the word true or false
%b   Integer: base 2
%d   Integer: base 10
%f   Floating point: decimal point but no exponent, e.g., 123.456
%s   String: the uninterpreted bytes of the string or slice
%q   String: a double-quoted string (safely escaped with Go syntax)

If you’re familiar with C, you may recognize these as somewhat simplified derivations of the flags used in the printf and scanf functions. A far more complete listing can be found in Go’s documentation for the fmt package.

The Blank Identifier

The blank identifier, represented by the _ (underscore) operator, acts as an anonymous placeholder. It may be used like any other identifier in a declaration, except it doesn’t introduce a binding.

It’s most commonly used as a way to selectively ignore unneeded values in an assignment, which can be useful in a language that both supports multiple returns and demands there be no unused variables. For example, if you wanted to handle any potential errors returned by fmt.Printf, but don’t care about the number of bytes it writes,4 you could do the following:

str := "world"

_, err := fmt.Printf("Hello %s\n", str)
if err != nil {
    // Do something
}

The blank identifier can also be used to import a package solely for its side effects:

import _ "github.com/lib/pq"

Packages imported in this way are loaded and initialized as normal, including triggering any of their init functions, but are otherwise ignored and need not be referenced or directly used.

Constants

Constants are very similar to variables, using the const keyword to associate an identifier with some typed value. However, constants differ from variables in some important ways. First, and most obviously, attempting to modify a constant will generate an error at compile time. Second, constants must be assigned a value at declaration: they have no zero value.

Both var and const may be used at both the package and function levels, as follows:

const language string = "Go"

var favorite bool = true

func main() {
    const text = "Does %s rule? %t!"
    var output = fmt.Sprintf(text, language, favorite)

    fmt.Println(output)   // "Does Go rule? true!"
}

To demonstrate their behavioral similarity, the previous snippet arbitrarily mixes explicit type definitions with type inference for both the constants and variables.

Finally, the choice of fmt.Sprintf is inconsequential to this example, but if you’re unclear about Go’s format strings you can look back to “Formatting I/O in Go”.

Container Types: Arrays, Slices, and Maps

Go has three first-class container types that can be used to store collections of element values:

Array

A fixed-length sequence of zero or more elements of a particular type

Slice

An abstraction around an array that can be resized at runtime

Map

An associative data structure that allows distinct keys to be arbitrarily paired with, or “mapped to,” values

As container types, all of these have a length property that reflects how many elements are stored in that container. The len built-in function can be used to find the length of any array, slice (including strings), or map.

Arrays

In Go, as in most other mainstream languages, an array is a fixed-length sequence of zero or more elements of a particular type.

Arrays can be declared by including a length declaration. The zero value of an array is an array of the specified length containing zero-valued elements. Individual array elements are indexed from 0 to N-1 and can be accessed using the familiar bracket notation:

var a [3]int        // Zero-value array of type [3]int
fmt.Println(a)      // "[0 0 0]"
fmt.Println(a[1])   // "0"

a[1] = 42           // Update second index
fmt.Println(a)      // "[0 42 0]"
fmt.Println(a[1])   // "42"

i := a[1]
fmt.Println(i)      // "42"

Arrays can be initialized using array literals, as follows:

b := [3]int{2, 4, 6}

You can also have the compiler count the array elements for you:

b := [...]int{2, 4, 6}

In both cases, the type of b is [3]int.

As with all container types, the len built-in function can be used to discover the length of an array:

fmt.Println(len(b))        // "3"
fmt.Println(b[len(b)-1])   // "6"

In practice, arrays aren’t actually used directly very often. Instead, it’s much more common to use slices, an array abstraction type that behaves (for all practical purposes) like a resizable array.

Slices

Slices are a data type in Go that provide a powerful abstraction around a traditional array, such that working with slices looks and feels to the programmer much like working with arrays. Like arrays, slices provide access to a sequence of elements of a particular type via the familiar bracket notation, indexed from 0 to N-1. However, where arrays are fixed-length, slices can be resized at runtime.

As shown in Figure 3-1, under the hood a slice is actually a lightweight data structure with three components:

  • A pointer to some element of a backing array that represents the first element of the slice (not necessarily the first element of the array)

  • A length, representing the number of elements in the slice

  • A capacity, which represents the maximum length the slice can reach without reallocating its backing array

If not otherwise specified, the capacity value equals the number of elements between the start of the slice and the end of the backing array. The built-in len and cap functions will provide the length and capacity of a slice, respectively.

Figure 3-1. Two slices backed by the same array.
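To make the distinction between length and capacity concrete, here’s a small sketch in the spirit of Figure 3-1 (the values are arbitrary):

a := [6]int{0, 1, 2, 3, 4, 5}   // A backing array of length 6
s := a[1:3]                     // A slice covering elements 1 and 2

fmt.Println(len(s))             // "2": the slice holds two elements
fmt.Println(cap(s))             // "5": five elements from a[1] to the end of a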

Working with slices

Creating a slice is somewhat different from creating an array: slices are typed only according to the type of their elements, not their length. The make built-in function can be used to create a slice with a nonzero length as follows:

n := make([]int, 3)   // Create an int slice with 3 elements

fmt.Println(n)        // "[0 0 0]"
fmt.Println(len(n))   // "3"; len works for slices and arrays

n[0] = 8
n[1] = 16
n[2] = 32

fmt.Println(n)        // "[8 16 32]"

As you can see, working with slices feels a lot like working with arrays. A slice created with make has the specified length and contains zero-valued elements, and elements in a slice are indexed and accessed exactly as they are in an array.

A slice literal is declared just like an array literal, except that you omit the element count:

m := []int{1}    // A literal []int declaration
fmt.Println(m)   // "[1]"

Slices can be extended using the append built-in, which returns an extended slice containing one or more new values appended to the original one:

m = append(m, 2)   // Append 2 to m
fmt.Println(m)     // "[1 2]"

The append built-in function also happens to be variadic, which means it can accept a variable number of arguments in addition to the slice to be appended. Variadic functions will be covered in more detail in “Variadic Functions”:

m = append(m, 2)      // Append to m from the previous snippet
fmt.Println(m)        // "[1 2]"

m = append(m, 3, 4)
fmt.Println(m)        // "[1 2 3 4]"

m = append(m, m...)   // Append m to itself
fmt.Println(m)        // "[1 2 3 4 1 2 3 4]"

Note that the append built-in function returns the appended slice rather than modifying the slice in place. The reason is that, behind the scenes, if the destination has sufficient capacity to accommodate the new elements, they’re placed into the existing underlying array and a new slice header describing it is returned. If not, a new, larger underlying array is automatically allocated and the existing elements are copied over.
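Here’s a small sketch that illustrates this behavior; the exact capacity values after reallocation depend on the runtime’s growth strategy, so treat them as illustrative:

s := make([]int, 0, 2)        // Length 0, capacity 2
fmt.Println(len(s), cap(s))   // "0 2"

s = append(s, 1, 2)           // Fits within the existing capacity
fmt.Println(len(s), cap(s))   // "2 2"

s = append(s, 3)              // Exceeds the capacity: a larger array is allocated
fmt.Println(len(s), cap(s))   // "3 4" (the new capacity may vary)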

Warning

Note that append returns the appended slice. Failing to store it is a common error.

The slice operator

Arrays and slices (including strings) support the slice operator, which has the syntax s[i:j], where i and j are in the range 0 ≤ i ≤ j ≤ cap(s).

For example:

s0 := []int{0, 1, 2, 3, 4, 5, 6}   // A slice literal
fmt.Println(s0)                    // "[0 1 2 3 4 5 6]"

In the previous snippet, we define a slice literal. Recall that it closely resembles an array literal, except that it doesn’t indicate a size.

If the values of i or j are omitted from a slice operator, they’ll default to 0 and len(s), respectively:

s1 := s0[:4]
fmt.Println(s1)   // "[0 1 2 3]"

s2 := s0[3:]
fmt.Println(s2)   // "[3 4 5 6]"

A slice operator will produce a new slice backed by the same array with a length of j - i. Changes made to this slice will be reflected in the underlying array, and subsequently in all slices derived from that same array:

s0[3] = 42        // Change reflected in all 3 slices
fmt.Println(s0)   // "[0 1 2 42 4 5 6]"
fmt.Println(s1)   // "[0 1 2 42]"
fmt.Println(s2)   // "[42 4 5 6]"

This effect is illustrated in more detail in Figure 3-1.

Strings as slices

The subject of how Go implements strings under the hood is actually quite a bit more complex than you might think, involving lots of details like the differences between bytes and runes, Unicode and UTF-8 encoding, and strings and string literals.

For now it’s enough to know that Go strings are essentially read-only slices of bytes that typically (but aren’t required to) contain a series of UTF-8 sequences representing Unicode code points, called runes. Go even allows you to cast your strings into byte or rune arrays:

s := "foö"       // UTF-8: f=0x66 o=0x6F ö=0xC3B6
r := []rune(s)
b := []byte(s)

By casting the string s in this way we’re able to uncover its identity as either a slice of bytes or a slice of runes. We can illustrate this by using fmt.Printf with the %T (type) and %v (value) flags (which we presented in “Formatting I/O in Go”) to output the results:

fmt.Printf("%T %v\n", s, s)   // "string foö"
fmt.Printf("%T %v\n", r, r)   // "[]int32 [102 111 246]"
fmt.Printf("%T %v\n", b, b)   // "[]uint8 [102 111 195 182]"

Note that the value of the string literal, foö, contains a mix of characters whose encoding can be contained in a single byte (f and o, encoded as 102 and 111, respectively) and one character that cannot (ö, encoded as 195 182).

Note

Remember that the byte and rune types are mnemonic aliases for uint8 and int32, respectively.

Each of these lines prints the type and value of the variable passed to it. As expected, the string value, foö, is printed literally. The next two lines are more interesting, however. The int32 (rune) slice contains three values that represent the code points of the individual characters, while the uint8 (byte) slice contains four bytes, representing the string’s UTF-8 encoding (two 1-byte code points and one 2-byte code point).

There’s far, far more to string encoding in Go, but we only have so much space. If you’re interested in learning more, take a look at Rob Pike’s “Strings, Bytes, Runes and Characters in Go” on The Go Blog for a deep dive into the subject.

Maps

Go’s map data type references a hash table: an incredibly useful associative data structure that allows distinct keys to be arbitrarily “mapped” to values as key-value pairs. This data structure is common among today’s mainstream languages: if you’re coming to Go from one of these, then you probably already use them, perhaps in the form of Python’s dict, Ruby’s Hash, or Java’s HashMap.

Map types in Go are written map[K]V, where K and V are the types of its keys and values, respectively. Any type that is comparable using the == operator may be used as a key, and K and V need not be of the same type. For example, string keys may be mapped to float32 values.

A map can be initialized using the built-in make function, and its values can be referenced using the usual name[key] syntax. Our old friend len will return the number of key-value pairs in a map; the delete built-in can remove key-value pairs:

freezing := make(map[string]float32)   // Empty map of string to float32

freezing["celsius"] = 0.0
freezing["fahrenheit"] = 32.0
freezing["kelvin"] = 273.2

fmt.Println(freezing["kelvin"])        // "273.2"
fmt.Println(len(freezing))             // "3"

delete(freezing, "kelvin")             // Delete "kelvin"
fmt.Println(len(freezing))             // "2"

Maps may also be initialized and populated as map literals:

freezing := map[string]float32{
    "celsius":    0.0,
    "fahrenheit": 32.0,
    "kelvin":     273.2,          // The trailing comma is required!
}

Note the trailing comma on the last line. This is not optional: the code will refuse to compile if it’s missing.

Warning

While any comparable type can be a map key, certain types can have unexpected behaviors and should be avoided:

  • Floating point numbers don’t always obey the usual rules of equality due to rounding errors. This can lead to subtle bugs.

  • Struct key values can change, which means you might create a new key when you think you’re updating an old one.

  • Pointers are compared by memory address. Pointers to identical values are different if they point to different addresses.

Map membership testing

Requesting the value of a key that’s not present in a map won’t cause an exception to be thrown (those don’t exist in Go anyway) or return some kind of null value. Rather, it returns the zero value for the map’s value type:

foo := freezing["no-such-key"]   // Get non-existent key
fmt.Println(foo)                 // "0" (float32 zero value)

This can be a very useful feature because it reduces a lot of boilerplate membership testing when working with maps, but it can be a little tricky when your map happens to actually contain zero-valued values. Fortunately, accessing a map can also return a second optional bool that indicates whether the key is present in the map:

newton, ok := freezing["newton"]   // What about the Newton scale?
fmt.Println(newton)                // "0"
fmt.Println(ok)                    // "false"

In this snippet, the value of newton is 0.0. But is that really the correct value,5 or was there just no matching key? Fortunately, since ok is also false, we know the latter to be the case.

Pointers

Okay. Pointers. The bane and undoing of undergraduates the world over. If you’re coming from a dynamically typed language, the idea of the pointer may seem alien to you. While we’re not going to drill down too deeply into the subject, we’ll do our best to cover it well enough to provide some clarity on the subject.

Going back to first principles, a “variable” is a piece of storage in memory that contains some value. Typically, when you refer to a variable by its name (foo = 10) or by an expression (s[i] = "foo"), you’re directly reading or updating the value of the variable.

A pointer stores the address of a variable: the location in memory where its value is stored. Every variable has an address, and pointers allow us to indirectly read or update the values of the variables they point to (illustrated in Figure 3-2):

Retrieving the address of a variable

The address of a named variable can be retrieved by using the & operator. For example, the expression p := &a will obtain the address of a and assign it to p.

Pointer types

The variable p, which you can say “points to” a, has a type of *int, where the * indicates that it’s a pointer type that points to an int.

Dereferencing a pointer

To retrieve the value of the variable a through p, you can dereference p using a * before the pointer variable name, allowing you to indirectly read or update a.

Figure 3-2. The expression p := &a gets the address of a and assigns it to p.

Now, to put everything in one place, take a look at the following:

var a int = 10

var p *int = &a   // p of type *int points to a
fmt.Println(p)    // "0x0001"
fmt.Println(*p)   // "10"

*p = 20           // indirectly update a
fmt.Println(a)    // "20"

Pointers can be declared like any other variable, with a zero value of nil if not explicitly initialized. They’re also comparable, being equal only if they contain the same address (that is, they point to the same variable) or if they are both nil:

var n *int
var x, y int

fmt.Println(n)          // "<nil>"
fmt.Println(n == nil)   // "true" (n is nil)

fmt.Println(x == y)     // "true" (x and y are both zero)
fmt.Println(&x == &x)   // "true" (&x is equal to itself)
fmt.Println(&x == &y)   // "false" (different vars)
fmt.Println(&x == nil)  // "false" (&x is not nil)

Because n is never initialized, its value is nil, so comparing it to nil returns true. The integers x and y both have a value of 0, so comparing their values yields true, but they are still distinct variables, so comparing pointers to each of them still evaluates to false.

Control Structures

Any programmer coming to Go from another language will find its suite of control structures to be generally familiar, even comfortable (at first) for those coming from a language heavily influenced by C. However, there are some pretty important deviations in their implementation and usages that might seem odd at first.

For example, control structure statements don’t require lots of parentheses. Okay. Less clutter. That’s fine.

There’s also only one loop type. There is no while; only for. Seriously! It’s actually pretty cool, though. Read on, and you’ll see what I mean.

Fun with for

The for statement is Go’s one and only loop construct, and while there’s no explicit while loop, Go’s for can provide all of its functionality, effectively unifying all of the entry control loop types to which you’ve become accustomed.

Go has no do-while equivalent.

The general for statement

The general form of for loops in Go is nearly identical to that of other C-family languages, in which three statements—the init statement, the continuation condition, and the post statement—are separated by semicolons in the traditional style. Any variables declared in the init statement will be scoped only to the for statement:

sum := 0

for i := 0; i < 10; i++ {
    sum += 1
}

fmt.Println(sum)  // "10"

In this example, i is initialized to 0. At the end of each iteration, i is incremented by 1, and if it’s still less than 10, the process repeats.

Note

Unlike most C-family languages, for statements don’t require parentheses around their clauses, and braces are required.

In a break from traditional C-style languages, Go’s for statement’s init and post statements are entirely optional. As shown in the code that follows, this makes it considerably more flexible:

sum, i := 0, 0

for i < 10 {         // Equivalent to: for ; i < 10;
    sum += i
    i++
}

fmt.Println(i, sum)  // "10 45"

The for statement in the previous example has no init or post statements, only a bare condition. This is actually a big deal, because it means that for is able to fill the role traditionally occupied by the while loop.

Finally, omitting all three clauses from a for statement creates a block that loops infinitely, just like a traditional while (true):

fmt.Println("For ever...")

for {
    fmt.Println("...and ever")
}

Because it lacks any terminating condition, the loop in the previous snippet will iterate forever. On purpose.

Looping over arrays and slices

Go provides a useful keyword, range, that simplifies looping over a variety of data types.

In the case of arrays and slices, range can be used with a for statement to retrieve the index and the value of each element as it iterates:

s := []int{2, 4, 8, 16, 32}  // A slice of ints

for i, v := range s {        // range gets each index/value
    fmt.Println(i, "->", v)  // Output index and its value
}

In the previous example, the values of i and v will update each iteration to contain the index and value, respectively, of each element in the slice s. So the output will look something like the following:

0 -> 2
1 -> 4
2 -> 8
3 -> 16
4 -> 32

What if you don’t need both of these values? After all, the Go compiler will demand that you use them if you declare them. Fortunately, as elsewhere in Go, the unneeded values can be discarded by using the “blank identifier,” signified by the underscore operator:

a := []int{0, 2, 4, 6, 8}
sum := 0

for _, v := range a {
    sum += v
}

fmt.Println(sum)    // "20"

As in the last example, the value v will update each iteration to contain the value of each element in the slice a. This time, however, the index value is conveniently ignored and discarded, and the Go compiler stays content.

Looping over maps

The range keyword may also be used with a for statement to loop over maps, with each iteration returning the current key and value:

m := map[int]string{
    1: "January",
    2: "February",
    3: "March",
    4: "April",
}

for k, v := range m {
    fmt.Println(k, "->", v)
}

Note that Go maps aren’t ordered, so the output won’t be either:

3 -> March
4 -> April
1 -> January
2 -> February

The if Statement

The typical application of the if statement in Go is consistent with other C-style languages, except for the lack of parentheses around the clause and the fact that braces are required:

if 7 % 2 == 0 {
    fmt.Println("7 is even")
} else {
    fmt.Println("7 is odd")
}
Note

Unlike most C-family languages, if statements don’t require parentheses around their clauses, and braces are required.

Interestingly, Go allows an initialization statement to precede the condition clause in an if statement, allowing for a particularly useful idiom. For example:

if _, err := os.Open("foo.ext"); err != nil {
    fmt.Println(err)
} else {
    fmt.Println("All is fine.")
}

Note how the err variable is initialized in the statement preceding the condition check, making it somewhat similar to the following:

_, err := os.Open("foo.go")
if err != nil {
    fmt.Println(err)
} else {
    fmt.Println("All is fine.")
}

The two constructs aren’t exactly equivalent however: in the first example err is scoped to only the if statement; in the second example err is visible to the entire containing function.

The switch Statement

As in other languages, Go provides a switch statement that provides a way to more concisely express a series of if-then-else conditionals. However, it differs from the traditional C-style implementation in a number of ways that make it considerably more flexible.

Perhaps the most obvious difference to folks coming from C-family languages is that there’s no fallthrough between the cases by default; this behavior can be explicitly added by using the fallthrough keyword:

i := 0

switch i % 3 {
case 0:
    fmt.Println("Zero")
    fallthrough
case 1:
    fmt.Println("One")
case 2:
    fmt.Println("Two")
default:
    fmt.Println("Huh?")
}

In this example, the value of i % 3 is 0, which matches the first case, causing it to output the word Zero. In Go, switch cases don’t fall through by default, but the existence of an explicit fallthrough statement means that the subsequent case is also executed and One is printed. Finally, the absence of a fallthrough on that case causes the resolution of the switch to complete. All told, the following is printed:

Zero
One

Switches in Go have two interesting properties. First, case expressions don’t need to be integers, or even constants: the cases will be evaluated from top to bottom, running the first case whose value is equal to the condition expression. Second, if the switch expression is left empty, it’ll be interpreted as true and will match the first case whose guarding condition evaluates to true. Both of these properties are demonstrated in the following example:

hour := time.Now().Hour()

switch {
case hour >= 5 && hour < 9:
    fmt.Println("I'm writing")
case hour >= 9 && hour < 18:
    fmt.Println("I'm working")
default:
    fmt.Println("I'm sleeping")
}

The switch has no condition, so it’s exactly equivalent to using switch true. As such, it matches the first statement whose condition also evaluates to true. In my case, hour is 23, so the output is “I’m sleeping.”6

Finally, just as with if, a statement can precede the condition expression of a switch, in which case any defined values are scoped to the switch. For example, the previous example can be rewritten as follows:

switch hour := time.Now().Hour(); {  // Empty expression means "true"
case hour >= 5 && hour < 9:
    fmt.Println("I'm writing")
case hour >= 9 && hour < 18:
    fmt.Println("I'm working")
default:
    fmt.Println("I'm sleeping")
}

Note the trailing semicolon: this empty expression implies true so that this expression is equivalent to switch hour := time.Now().Hour(); true and matches the first true case condition.

Error Handling

Errors in Go are treated as just another value, represented by the built-in error type. This makes error handling straightforward: idiomatic Go functions may include an error-typed value in their list of returns, which, if not nil, indicates an error state that may be handled via the primary execution path. For example, the os.Open function returns a non-nil error value when it fails to open a file:

file, err := os.Open("somefile.ext")
if err != nil {
    log.Fatal(err)
    return err
}

The error type itself is actually incredibly simple: it’s just a universally visible interface that declares a single method:

type error interface {
    Error() string
}

This is very different from the exceptions that are used in many languages, which necessitate a dedicated system for exception catching and handling that can lead to confusing and unintuitive flow control.

Creating an Error

There are two simple ways to create error values, and a more complicated way. The simple ways are to use either the errors.New or fmt.Errorf functions; the latter is handy because it provides string formatting too:

e1 := errors.New("error 42")
e2 := fmt.Errorf("error %d", 42)

The fact that error is an interface allows you to implement your own error types, if you need to. For example, a common pattern is to allow errors to be nested within other errors:

type NestedError struct {
    Message string
    Err     error
}

func (e *NestedError) Error() string {
    return fmt.Sprintf("%s\n  contains: %s", e.Message, e.Err.Error())
}
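As a hypothetical usage sketch (the parsePort function and its error message are invented for illustration; strconv.Atoi is from the standard strconv package), a function might wrap a lower-level error in a NestedError before passing it up the call stack:

func parsePort(s string) (int, error) {
    port, err := strconv.Atoi(s)   // Convert the string to an integer
    if err != nil {
        return 0, &NestedError{
            Message: "invalid port value",
            Err:     err,
        }
    }
    return port, nil
}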

For more information about errors, and some good advice on error handling in Go, take a look at Andrew Gerrand’s “Error Handling and Go” on The Go Blog.

Putting the Fun in Functions: Variadics and Closures

Functions in Go work a lot like they do in other languages: they receive parameters, do some work, and (optionally) return something.

But Go functions are built for a level of flexibility not found in many mainstream languages and can also do a lot of things that many other languages can’t, such as returning or accepting multiple values or being used as first-class types or anonymous functions.

Functions

Declaring a function in Go is similar to most other languages: they have a name, a list of typed parameters, an optional list of return types, and a body. However, Go function declaration differs somewhat from other C-family languages in that it uses a dedicated func keyword, the type for each parameter follows its name, and return types are placed at the end of the function definition header and may be omitted entirely (there’s no void type).

A function with a return type list must end with a return statement, except when execution can’t reach the end of the function due to the presence of an infinite loop or a terminal panic before the function exits:

func add(x int, y int) int {
    return x + y
}

func main() {
    sum := add(10, 5)
    fmt.Println(sum)        // "15"
}

Additionally, a bit of syntactic sugar allows the type for a sequence of parameters or returns of the same type to be written only once. For example, the following definitions of func foo are equivalent:

func foo(i int, j int, a string, b string) { /* ... */ }
func foo(i, j int, a, b string)            { /* ... */ }

Multiple return values

Functions can return any number of values. For example, the following swap function accepts two strings and returns two strings. The list of return types for multiple returns must be enclosed in parentheses:

func swap(x, y string) (string, string) {
    return y, x
}

To accept multiple values from a function with multiple returns, you can use multiple assignment:

a, b := swap("foo", "bar")

When run, the value of a will be “bar” and b will be “foo.”

Recursion

Go allows recursive function calls, in which functions call themselves. Used properly, recursion can be a powerful tool that can be applied to many types of problems. The canonical example is the calculation of the factorial of a positive integer n, which is the product of all positive integers less than or equal to n:

func factorial(n int) int {
    if n < 1 {
        return 1
    }
    return n * factorial(n-1)
}

func main() {
    fmt.Println(factorial(11))      // "39916800"
}

For any integer n greater than zero, factorial will call itself with a parameter of n - 1. This can add up quickly!

Defer

Go’s defer keyword can be used to schedule the execution of a function call for immediately before the surrounding function returns and is commonly used to guarantee that resources are released or otherwise cleaned up.

For example, to defer printing the text “cruel world” to the end of a function call, we insert the defer keyword immediately before it:

func main() {
    defer fmt.Println("cruel world")

    fmt.Println("goodbye")
}

When the previous snippet is run, it produces the following output, with the deferred output printed last:

goodbye
cruel world

For a less trivial example, we’ll create an empty file and attempt to write to it. A closeFile function is provided to close the file when we’re done with it. However, if we simply call it at the end of main, an error could result in closeFile never being called and the file being left in an open state. Therefore, we use a defer to ensure that the closeFile function is called before the function returns, however it happens to return:

func main() {
    file, err := os.Create("/tmp/foo.txt")  // Create an empty file
    if err != nil {
        return
    }
    defer closeFile(file)                   // Ensure closeFile(file) is called

    _, err = fmt.Fprintln(file, "Your mother was a hamster")
    if err != nil {
        return
    }

    fmt.Println("File written to successfully")
}

func closeFile(f *os.File) {
    if err := f.Close(); err != nil {
        fmt.Println("Error closing file:", err.Error())
    } else {
        fmt.Println("File closed successfully")
    }
}

When you run this code, you should get the following output:

File written to successfully
File closed successfully

If multiple defer calls are used in a function, each is pushed onto a stack. When the surrounding function returns, the deferred calls are executed in last-in-first-out order. For example:

func main() {
    defer fmt.Println("world")
    defer fmt.Println("cruel")
    defer fmt.Println("goodbye")
}

This function, when run, will output the following:

goodbye
cruel
world

Defers are a very useful feature for ensuring that resources are cleaned up. If you’re working with external resources, you’ll want to make liberal use of them.

Pointers as parameters

Much of the power of pointers becomes evident when they’re combined with functions. Typically, function parameters are passed by value: when a function is called, it receives a copy of each parameter, and changes made to the copy by the function don’t affect the caller. However, pointers contain a reference to a value, rather than the value itself, and can be used by a receiving function to indirectly modify the value passed to the function in a way that can affect the function caller.

The following function demonstrates both scenarios:

func main() {
    x := 5

    zeroByValue(x)
    fmt.Println(x)              // "5"

    zeroByReference(&x)
    fmt.Println(x)              // "0"
}

func zeroByValue(x int) {
    x = 0
}

func zeroByReference(x *int) {
    *x = 0                      // Dereference x and set it to 0
}

This behavior isn’t unique to pointers. In fact, under the hood, several data types are actually references to memory locations, including slices, maps, functions, and channels. Changes made to such reference types in a function can affect the caller, without needing to explicitly dereference them:

func update(m map[string]int) {
    m["c"] = 2
}

func main() {
    m := map[string]int{"a": 0, "b": 1}

    fmt.Println(m)                  // "map[a:0 b:1]"

    update(m)

    fmt.Println(m)                  // "map[a:0 b:1 c:2]"
}

In this example, the map m has a length of two when it’s passed to the update function, which adds the pair {"c" : 2}. Because m is a reference type, it’s passed to update as a reference to an underlying data structure instead of a copy of one, so the insertion is reflected in m in main after the update function returns.

Variadic Functions

A variadic function is one that may be called with zero or more trailing arguments. The most familiar examples are the members of the fmt.Printf family of functions, which accept a single format specifier string and an arbitrary number of additional arguments.

This is the signature for the standard fmt.Printf function:

func Printf(format string, a ...any) (n int, err error) {}

Note that it accepts a string, and zero or more any values. If you’re rusty on the any syntax, we’ll review it in “Interfaces”, but you can interpret any to mean “some arbitrarily typed thing.” What’s most interesting here, however, is that the final argument contains an ellipsis (...). This is the variadic operator, which indicates that the function may be called with any number of arguments of this type. For example, you can call fmt.Printf with a format and two differently typed parameters:

const name, age = "Kim", 22
fmt.Printf("%s is %d years old.\n", name, age)

Within the variadic function, the variadic argument is a slice of the argument type. In the following example, the variadic factors parameter of the product method is of type []int and may be ranged over accordingly:

func product(factors ...int) int {
    p := 1

    for _, n := range factors {
        p *= n
    }

    return p
}

func main() {
    fmt.Println(product())          // "1"
    fmt.Println(product(2, 2, 2))   // "8"
}

In this example, the call to product from main uses three arguments (though it could use any number it likes). In the product function, these are gathered into a []int slice with the value {2, 2, 2}, whose elements are iteratively multiplied to construct the final return value of 8.

Passing slices as variadic values

What if your value is already in slice form, and you still want to pass it to a variadic function? Do you need to split it into multiple individual parameters? Goodness no.

In this case, you can apply the variadic operator after the variable name when calling the variadic function:

m := []int{3, 3, 3}
fmt.Println(product(m...))   // "27"

Here, you have a variable m with the type []int, which you want to pass to the variadic function product. Using the variadic operator when calling product(m...) makes this possible.

Anonymous Functions and Closures

In Go, functions are first-class values that can be operated upon in the same way as any other entity in the language: they have types, may be assigned to variables, and may even be passed to and returned by other functions:

func sum(x, y int) int     { return x + y }
func product(x, y int) int { return x * y }

func main() {
    var f func(int, int) int    // Function variables have types

    f = sum
    fmt.Println(f(3, 5))        // "8"

    f = product                 // Legal: product has same type as sum
    fmt.Println(f(3, 5))        // "15"
}

The zero value of a function type is nil; calling a nil function value will cause a panic.

Functions may be created within other functions as anonymous functions, which may be called, passed, or otherwise treated like any other functions. A particularly powerful feature of Go is that anonymous functions have access to the state of their parent and retain that access even after the parent function has executed. This is, in fact, the definition of a closure.

Tip

A closure is a nested function that has access to the variables of its parent function, even after the parent has executed.

Take, for example, the following incrementor function. This function has state, in the form of the variable i, and returns an anonymous function that increments that value before returning it. The returned function can be said to close over the variable i, making it a true (if trivial) closure:

func incrementor() func() int {
    i := 0

    return func() int {    // Return an anonymous function
        i++                // "Closes over" parent function's i
        return i
    }
}

When we call incrementor, it creates its own new, local value of i and returns a new anonymous function of type func() int that will increment that value. Subsequent calls to incrementor will each receive their own copy of i. We can demonstrate that in the following:

func main() {
    increment := incrementor()
    fmt.Println(increment())       // "1"
    fmt.Println(increment())       // "2"
    fmt.Println(increment())       // "3"

    increment2 := incrementor()
    fmt.Println(increment2())      // "1"
}

As you can see, calling incrementor returns a new function increment as a variable; each call to increment increments its internal counter by one. When incrementor is called again, though, it creates and returns an entirely new function, with its own brand new counter. Neither of these functions can influence the other.

Structs, Methods, and Interfaces

One of the biggest mental switches that people sometimes have to make when first coming to the Go language is that Go isn’t a traditional object-oriented language. Not really. Sure, Go has types with methods, which kind of look like objects, but they don’t have a prescribed inheritance hierarchy. Instead Go allows components to be assembled into a whole using composition.

For example, where a strictly object-oriented language might have a Car class that extends an abstract Vehicle class, perhaps it would implement Wheels and Engine. This sounds fine in theory, but these relationships can easily become convoluted and hard to manage.

Go’s composition approach, on the other hand, allows components to be “put together” without having to define their ontological relationships. Extending the previous example, Go could have a Car struct, which could have its various parts, such as Wheels and Engine, embedded within it. Furthermore, methods in Go can be defined for any sort of data; they’re not just for structs anymore.

Structs

In Go, a struct is nothing more than an aggregation of zero or more fields as a single entity, where each field is a named value of an arbitrary type. A struct can be defined using the following type Name struct syntax. A struct value is never nil: rather, the zero value of a struct is the zero value of all of its fields:

type Vertex struct {
    X, Y float64
}

func main() {
    var v Vertex            // Structs are never nil
    fmt.Println(v)          // "{0 0}"

    v = Vertex{}            // Explicitly define an empty struct
    fmt.Println(v)          // "{0 0}"

    v = Vertex{1.0, 2.0}    // Defining fields, in order
    fmt.Println(v)          // "{1 2}"

    v = Vertex{Y:2.5}       // Defining specific fields, by label
    fmt.Println(v)          // "{0 2.5}"
}

Struct fields can be accessed using the standard dot notation:

func main() {
    v := Vertex{X: 1.0, Y: 3.0}
    fmt.Println(v)                  // "{1 3}"

    v.X *= 1.5
    v.Y *= 2.5

    fmt.Println(v)                  // "{1.5 7.5}"
}

Structs are commonly created and manipulated by reference, so Go provides a little bit of syntactic sugar: members of structs can be accessed from a pointer to the struct using dot notation; the pointers are automatically dereferenced:

func main() {
    var v *Vertex = &Vertex{1, 3}
    fmt.Println(v)                  // &{1 3}

    v.X, v.Y = v.Y, v.X
    fmt.Println(v)                  // &{3 1}
}

In this example, v is a pointer to a Vertex whose X and Y member values you want to swap. If you had to dereference the pointer to do this, you’d have to do something like (*v).X, (*v).Y = (*v).Y, (*v).X, which is clearly terrible. Instead, automatic pointer dereferencing lets you do v.X, v.Y = v.Y, v.X, which is far less terrible.

Methods

In Go, methods are functions that are attached to types, including, but not limited to, structs. The declaration syntax for a method is very similar to that of a function, except that it includes an extra receiver argument before the function name that specifies the type that the method is attached to. When the method is called, the instance is accessible by the name specified in the receiver.

For example, our earlier Vertex type can be extended by attaching a Square method with a receiver named v of type *Vertex:

func (v *Vertex) Square() {    // Attach method to the *Vertex type
    v.X *= v.X
    v.Y *= v.Y
}

func main() {
    vert := &Vertex{3, 4}
    fmt.Println(vert)          // "&{3 4}"

    vert.Square()
    fmt.Println(vert)          // "&{9 16}"
}

In addition to structs, you can also claim other types—slices, maps, or even basic types like int—as your own and attach methods to them. For example, we can declare a new type, MyMap, which is just a standard map[string]int, and attach a Length method to it:

type MyMap map[string]int

func (m MyMap) Length() int {
    return len(m)
}

func main() {
    mm := MyMap{"A":1, "B": 2}

    fmt.Println(mm)             // "map[A:1 B:2]"
    fmt.Println(mm["A"])        // "1"
    fmt.Println(mm.Length())    // "2"
}

The result is a new type, MyMap, which is (and can be used as) a map of strings to integers, map[string]int, but which also has a Length method that returns the map’s length.

Interfaces

In Go, an interface is just a set of method signatures. As in other languages with a concept of an interface, they are used to describe the general behaviors of other types without being coupled to implementation details. An interface can thus be viewed as a contract that a type may satisfy, opening the door to powerful abstraction techniques.

For example, a Shape interface can be defined that includes an Area method signature. Any type that wants to be a Shape must have an Area method that returns a float64:

type Shape interface {
    Area() float64
}

Now we'll define two shapes, Circle and Rectangle, that satisfy the Shape interface by attaching an Area method to each one. Note that we don't have to explicitly declare that they satisfy the interface: if a type possesses all of an interface's methods, it implicitly satisfies that interface. This is particularly useful when you want to design interfaces that are satisfied by types that you don't own or control:

type Circle struct {
    Radius float64
}

func (c Circle) Area() float64 {
    return math.Pi * c.Radius * c.Radius
}

type Rectangle struct {
    Width, Height float64
}

func (r Rectangle) Area() float64 {
    return r.Width * r.Height
}

Because both Circle and Rectangle implicitly satisfy the Shape interface, we can pass them to any function that expects a Shape:

func PrintArea(s Shape) {
    fmt.Printf("%T's area is %0.2f\n", s, s.Area())
}

func main() {
    r := Rectangle{Width:5, Height:10}
    PrintArea(r)                         // "main.Rectangle's area is 50.00"

    c := Circle{Radius:5}
    PrintArea(c)                         // "main.Circle's area is 78.54"
}

Type assertions

A type assertion can be applied to an interface value to “assert” its identity as a concrete type. The syntax takes the general form of x.(T), where x is an expression of an interface and T is the asserted type.

Referring to the Shape interface and Circle struct we used previously:

var s Shape
s = Circle{}                // s is an expression of Shape
c := s.(Circle)             // Assert that s is a Circle
fmt.Printf("%T\n", c)       // "main.Circle"
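
If the asserted type turns out to be wrong, a single-result assertion like the one above panics at runtime. As a brief sketch, the two-result ("comma ok") form reports success or failure instead of panicking:

var s Shape = Circle{}

c, ok := s.(Circle)        // ok is true; c holds the underlying Circle
fmt.Println(c, ok)         // "{0} true"

r, ok := s.(Rectangle)     // ok is false; r is a zero-value Rectangle
fmt.Println(r, ok)         // "{0 0} false"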

The any type

One curious construct is the any type, any, an alias for the empty interface: interface{}. The empty interface specifies no methods. It carries no information; it says nothing.7

As its name implies, a variable of type any or interface{} can hold a value of any type, which can be very useful when your code needs to support arbitrary types. The fmt.Println function is a good example of a function that uses this strategy.

There are downsides to using any, however. Working with the empty interface requires certain assumptions to be made, which have to be checked at runtime, resulting in code that's more fragile and less efficient.
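
As a rough sketch of what that runtime checking looks like in practice, a function that accepts an any value typically has to use a type switch (or a series of type assertions) to recover the concrete type before it can do anything useful with it:

func Describe(v any) string {
    switch x := v.(type) {
    case int:
        return fmt.Sprintf("an int: %d", x)
    case string:
        return fmt.Sprintf("a string: %q", x)
    default:
        return fmt.Sprintf("something else: %v", x)
    }
}

func main() {
    fmt.Println(Describe(42))      // "an int: 42"
    fmt.Println(Describe("hi"))    // "a string: \"hi\""
    fmt.Println(Describe(3.14))    // "something else: 3.14"
}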

Composition with Type Embedding

Go doesn’t allow subclassing or inheritance in the traditional object-oriented sense. Instead it allows types to be embedded within one another, extending the functionalities of the embedded types into the embedding type.

This is a particularly useful feature of Go that allows functionalities to be reused via composition—combining the features of existing types to create new types—instead of inheritance, removing the need for the kinds of elaborate type hierarchies that can saddle traditional OOP projects.

Interface embedding

A popular example of embedding interfaces comes to us by way of the io package. Specifically, the widely used io.Reader and io.Writer interfaces, which are defined as follows:

type Reader interface {
    Read(p []byte) (n int, err error)
}

type Writer interface {
    Write(p []byte) (n int, err error)
}

But what if you want an interface with the methods of both an io.Reader and io.Writer? Well, you could implement a third interface that copies the methods of both, but then you have to keep all of them in agreement. That doesn’t just add unnecessary maintenance overhead: it’s also a good way to accidentally introduce errors.

Rather than go the copy–paste route, Go allows you to embed the two existing interfaces into a third one that takes on the features of both. Syntactically, this is done by adding the embedded interfaces as anonymous fields, as demonstrated by the standard io.ReadWriter interface, shown here:

type ReadWriter interface {
    Reader
    Writer
}

The result of this composition is a new interface that has all of the methods of the interfaces embedded within it.

Note

Only interfaces can be embedded within interfaces.

Struct embedding

Embedding isn’t limited to interfaces: structs can also be embedded into other structs.

The struct equivalent to the io.Reader and io.Writer example in the previous section comes from the bufio package, specifically, bufio.Reader (which implements io.Reader) and bufio.Writer (which implements io.Writer). Similarly, bufio also provides an implementation of io.ReadWriter, which is just a composition of the existing bufio.Reader and bufio.Writer types:

type ReadWriter struct {
    *Reader
    *Writer
}

As you can see, the syntax for embedding structs is identical to that of interfaces: adding the embedded types as unnamed fields. In the preceding case, the bufio.ReadWriter embeds bufio.Reader and bufio.Writer as pointer types.

Warning

Just like any pointers, embedded pointers to structs have a zero value of nil and must be initialized to point to valid structs before they can be used.

Promotion

So, why would you use composition instead of just adding a struct field? The answer is that when a type is embedded, its exported properties and methods are promoted to the embedding type, allowing them to be directly invoked. For example, given the ReadWriter definition from “Interface embedding”, the Read method of a bufio.Reader is accessible directly from an instance of bufio.ReadWriter:

var rw *bufio.ReadWriter = GetReadWriter()
var bytes []byte = make([]byte, 1024)

n, err := rw.Read(bytes)
if err != nil {
    // Handle the error
}

fmt.Printf("Read %d bytes\n", n)

You don’t have to know or care that the Read method is actually attached to the embedded *bufio.Reader. It’s important to know, though, that when a promoted method is invoked, the method’s receiver is still the embedded type, so the receiver of rw.Read is the ReadWriter’s Reader field, not the ReadWriter.

Directly accessing embedded fields

Occasionally, you’ll need to refer to an embedded field directly. To do this, you use the type name of the field as a field name. In the following (somewhat contrived) example, the UseReader function requires a *bufio.Reader, but what you have is a *bufio.ReadWriter instance:

func UseReader(r *bufio.Reader) {
    fmt.Printf("We got a %T\n", r)      // "We got a *bufio.Reader"
}

func main() {
    var rw *bufio.ReadWriter = GetReadWriter()
    UseReader(rw.Reader)
}

As you can see, this snippet uses the type name of the field you want to access (Reader) as the field name (rw.Reader) to retrieve the *bufio.Reader from rw. This can be handy for initialization as well:

rw := &bufio.ReadWriter{Reader: &bufio.Reader{}, Writer: &bufio.Writer{}}

If we’d just created rw as &bufio.ReadWriter{}, its embedded fields would be nil, but the snippet produces a *bufio.ReadWriter with fully defined *bufio.Reader and *bufio.Writer fields. While you wouldn’t typically use a &bufio.ReadWriter in this way, this approach could be used to provide a useful mock in a pinch.

Generics

Generics, also known as type parameters, have been a highly anticipated feature since Go’s earliest days. With good reason, too: generics allow the creation of code that’s independent of the specific types being used. This is a really useful feature in a language that’s as strictly typed as Go, resulting in code that’s more reusable and reducing tedious code duplication.

This section will provide an introduction to generics and Go. For a more detailed and much longer description, including many examples, see the original Type Parameters Proposal.

The Before Times

Before Go 1.18, if you wanted a function that could handle multiple types, you had a few options, but none of them were especially good.

First, you might be able to use explicit casting for some types, like numeric types. This approach works, but it leads to more verbose and ugly code, and it doesn’t work for many types.

Second, your function could accept an interface{} (or any) value, but doing so means that you lose the typing information and type safety benefits of compile-time type checking. Plus, the need for a type switch means that the code is less performant and doesn’t support derived types.

Finally, you could have specifically typed variants of your function for every type you want to support, like MaxInt(x, y int) and MaxFloat(x, y float64). This makes for code that’s less ugly to use than the alternatives but requires a lot of duplication and is tedious to create. Plus, this approach still doesn’t support derived types.
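
To make that duplication concrete, here's a rough sketch of what those per-type variants might look like; note that neither one will accept a derived type (a hypothetical type MyInt int, say) without an explicit conversion:

// MaxInt and MaxFloat do exactly the same thing, differing only in type.
func MaxInt(x, y int) int {
    if x > y {
        return x
    }
    return y
}

func MaxFloat(x, y float64) float64 {
    if x > y {
        return x
    }
    return y
}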

I don’t know which of these options is the worst. I think they all are. They’re all the worst.

Generic Functions

Enter generics. Generics allow you to write code that’s independent of the specific types being used. This means that functions and types may now be written generically, to use any of a set of types.

Say you had a nongeneric function that counts elements in a slice and returns a map of the counts:

func Count(ss []string) map[string]int {
    m := make(map[string]int)

    for _, s := range ss {
        m[s]++
    }

    return m
}

Surely this could be useful somewhere, but if you wanted to use this for another type, you’re kind of out of luck. However, generic functions allow you to specify a type parameter. A type parameter list looks like a typical parameter list except that it uses square brackets instead of parentheses.

For example, a generic version of our Count function might look something like the following code:

func Count[T comparable](ss []T) map[T]int {
    m := make(map[T]int)

    for _, s := range ss {
        m[s]++
    }

    return m
}

The bracketed [T comparable] is the function's type parameter list: T is the type parameter, and comparable is its type constraint, a definition of exactly which types are allowed. We'll talk more about constraints in "Type Constraints".

You’ll notice the use of the comparable constraint. This is a special predefined type constraint that allows any comparable type. It’s commonly used for generic map keys because Go requires that map keys be comparable, so it ensures that the calling code is using an allowable type for map keys.

To use a generic function, you must provide its type argument in square brackets (this isn't strictly true: see "Type Inference" for a spoiler). For example, to use the generic Count function with strings, you would do something like the following:

ss := []string{"a", "b", "b", "c", "c", "c"}
m := Count[string](ss)

Providing the type argument to a generic is called instantiation. For more about how instantiation works, see “An Introduction to Generics” on The Go Blog.

Generic Types

Types can be generic too, which is incredibly useful for creating things like generic data structures—for example,8 a generic Tree type that stores values of the type parameter T:

type Tree[T any] struct {
    left, right *Tree[T]
    value       T
}

func (t *Tree[T]) Lookup(x T) *Tree[T] { ... }

Generic types can have methods (just like any other type), like Lookup in the preceding example. Note that the Lookup definition doesn’t require an explicit type constraint, but any references to Tree, like in the method’s receiver and return type declarations, do have to include type parameters.

Generic types have to be instantiated to be used, as shown here:

var stringTree Tree[string]

This creates a Tree value, stringTree, that works with string values.

Type Constraints

As mentioned, a type constraint is a definition of exactly what types are allowed by a generic, but there’s quite a bit more to constraints than what we’ve seen so far.

So far we’ve seen two specific constraints used: any and comparable. You might recognize any as an alias for the empty interface (review “The any type” if you don’t recognize it). In fact, type constraints are always interfaces.

But to make this work, Go 1.18 extended the interface syntax to allow sets of types. Take the following interface as an example:

type Number interface {
    int64 | float64
}

This snippet declares the Number interface, which contains the union of the int64 and float64 types. We can then use the Number interface as a type constraint:

func Square[T Number](v T) T {
    return v * v
}

It seems like a lot of work to have to create an interface for every generic. Fortunately, you don’t have to.9 You can also define type constraint sets anonymously by specifying the set of allowed types:

func Square[T int64 | float64](v T) T {
    return v * v
}

Both of these Square functions are functionally identical.

Tip

There’s an experimental constraints package that defines some useful type constraint interfaces, like Integer and Ordered.
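
At the time of writing, that package lives at golang.org/x/exp/constraints. As a minimal sketch, its Ordered constraint makes it possible to replace the MaxInt/MaxFloat duplication shown earlier with a single generic function:

import "golang.org/x/exp/constraints"

// Max works for any type that supports the ordering operators.
func Max[T constraints.Ordered](x, y T) T {
    if x > y {
        return x
    }
    return y
}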

Type Inference

Remember when we said that you have to instantiate generics to use them? Well, that’s not always true.

As it turns out, the compiler can often infer type arguments for a generic function from the ordinary arguments being passed to it. This is called function argument type inference, and it can lead to code that’s shorter while still remaining clear.

For example, referring back to the Square function from “Type Constraints”, we can do something like the following:

var n int64 = 9
s := Square(n)   // 81

As you can see, this call omits the explicit instantiation. This is possible because the type of the argument we're passing (an int64) makes the type parameter unambiguous, so the compiler is able to infer it.

The Good Stuff: Concurrency

The subtleties of concurrent programming are many and are well beyond the scope of this work. However, you can say that reasoning about concurrency is hard and that the way concurrency is generally done makes it harder. In most languages, the usual approach to process orchestration is to create some shared bit of memory, which is then wrapped in locks to restrict access to one process at a time, often introducing maddeningly difficult-to-debug errors such as race conditions or deadlocks.

Go, on the other hand, favors another strategy: it provides two concurrency primitives—goroutines and channels—that can be used together to elegantly structure concurrent software that doesn't depend quite so much on locking. It encourages developers to limit sharing memory between processes and to instead allow processes to interact with one another entirely by passing messages.

Goroutines

One of Go’s most powerful features is the go keyword. Any function call prepended with the go keyword will run as usual, but the caller can proceed uninterrupted rather than wait for the function to return. Under the hood, the function is executed as a lightweight, concurrently executing process called a goroutine.

The syntax is strikingly simple: a function foo, which may be executed sequentially as foo(), may be executed as a concurrent goroutine simply by adding the go keyword: go foo():

foo()       // Call foo() and wait for it to return
go foo()    // Spawn a new goroutine that calls foo() concurrently

Goroutines can also be used to invoke a function literal:

func Log(w io.Writer, message string) {
    go func() {
        fmt.Fprintln(w, message)
    }() // Don't forget the trailing parentheses!
}
Warning

Goroutines that can’t terminate, be they stuck in an infinite loop or blocked forever trying to send or receive on a channel, will continue to consume resources for the lifetime of the application. When these accumulate over time, this is called a goroutine leak and can be a significant (even fatal) drain on resources.

Channels

In Go, channels are typed primitives that allow communication between two goroutines. They act as pipes into which a value can be sent and then received by a goroutine on the other end.

Channels may be created using the make function. Each channel can transmit values of a specific type, called its element type. Channel types are written using the chan keyword followed by their element type.

The following example declares and allocates an int channel:

var ch chan int = make(chan int)

The two primary operations supported by channels are send and receive, both of which use the <- (arrow) operator, where the arrow indicates the direction of the data flow as demonstrated in the following:

ch <- val     // Sending on a channel
val = <-ch    // Receiving on a channel and assigning it to val
<-ch          // Receiving on a channel and discarding the result

Channel blocking

By default, a channel is unbuffered. Unbuffered channels have a very useful property: sends on them block until another goroutine receives on the channel, and receives block until another goroutine sends on the channel. This behavior can be exploited to synchronize two goroutines, as demonstrated in the following:

func main() {
    ch := make(chan string)    // Allocate a string channel

    go func() {
       message := <-ch         // Blocking receive; assigns to message
       fmt.Println(message)    // "ping"
       ch <- "pong"            // Blocking send
    }()

    ch <- "ping"               // Send "ping"
    fmt.Println(<-ch)          // "pong"
}

Recall that in "Anonymous Functions and Closures" we noted that anonymous functions have access to their parent's variables; the way the goroutine uses ch here demonstrates exactly that.

Although main and the anonymous goroutine run concurrently and could in theory run in any order, the blocking behavior of unbuffered channels guarantees that the output will always be “ping” followed by “pong.”

Channel buffering

Go channels may be buffered, in which case they contain an internal value queue with a fixed capacity that’s specified when the buffer is initialized. Sends to a buffered channel block only when the buffer is full; receives from a channel block only when the buffer is empty. Any other time, send and receive operations write to or read from the buffer, respectively, and exit immediately.

A buffered channel can be created by providing a second argument to the make function to indicate its capacity:

ch := make(chan string, 2)    // Buffered channel with capacity 2

ch <- "foo"                   // Two non-blocking sends
ch <- "bar"

fmt.Println(<-ch)             // Two non-blocking receives
fmt.Println(<-ch)             // The buffer is now empty

fmt.Println(<-ch)             // The third receive will block

Closing channels

The third available channel operation is close, which sets a flag to indicate that no more values will be sent on it. The built-in close function can be used to close a channel: close(ch).

Tip

The channel close operation is just a flag to tell the receiver not to expect any more values. You don’t have to explicitly close channels.

Trying to send on a closed channel will cause a panic. Receiving from a closed channel will retrieve any values sent on the channel prior to its closure; any subsequent receive operations will immediately yield the zero value of the channel's element type. Receivers may also test whether a channel has been closed (and its buffer is empty) by capturing a second bool value from the receive expression:

ch := make(chan string, 10)

ch <- "foo"

close(ch)                          // One value left in the buffer

msg, ok := <-ch
fmt.Printf("%q, %v\n", msg, ok)    // "foo", true

msg, ok = <-ch
fmt.Printf("%q, %v\n", msg, ok)    // "", false
Warning

While either party may close a channel, in practice only the sender should do so. Inadvertently sending on a closed channel will cause a panic.

Looping over channels

The range keyword may be used to loop over channels that are open or contain buffered values. The loop will block until a value is available to be read or until the channel is closed. You can see how this works in the following:

ch := make(chan string, 3)

ch <- "foo"                 // Send three (buffered) values to the channel
ch <- "bar"
ch <- "baz"

close(ch)                   // Close the channel

for s := range ch {         // Range continues until the channel is closed
    fmt.Println(s)
}

In this example, we create a new buffered string channel ch and send it three values before closing it. Because the three values were sent to the channel before it was closed, looping over this channel will output all three strings before terminating. Had the channel not been closed, the loop would block, waiting for the next value to be sent on the channel, potentially indefinitely.

Select

Go’s select statement provides a convenient mechanism for multiplexing communications with multiple channels. The syntax for select is similar to switch, with some number of case statements that specify code to be executed upon a successful send or receive operation:

select {
case <-ch1:                         // Discard received value
    fmt.Println("Got something")

case x := <-ch2:                    // Assign received value to x
    fmt.Println(x)

case ch3 <- y:                      // Send y to channel
    fmt.Println(y)

default:
    fmt.Println("None of the above")
}

In the preceding snippet, there are three primary cases specified with three different conditions. If the channel ch1 is ready to be read, then its value will be read (and discarded) and the text “Got something” will be printed. If ch2 is ready to be read, then its value will be read and assigned to the variable x before printing the value of x. Finally, if ch3 is ready to be sent to, then the value y is sent to it before printing the value of y.

If no cases are ready, the default statements will be executed. If there's no default, then the select will block until one of its cases is ready, at which point it performs the associated communication and executes the associated statements. If multiple cases are ready, select will execute one at random.

Gotcha!

When using select, keep in mind that a closed channel never blocks and is always readable.
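
To make that pitfall concrete, here's a minimal sketch: even though nothing is ever sent on ch, the receive case is chosen over default because the closed channel is immediately ready, yielding the element type's zero value:

ch := make(chan int)
close(ch)

select {
case v := <-ch:
    fmt.Println("received", v)    // "received 0": the closed channel is always ready
default:
    fmt.Println("nothing ready")  // Never reached
}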

Implementing channel timeouts

The ability to use select to multiplex on channels can be very powerful, and can make otherwise very difficult or tedious tasks trivial. Take, for example, the implementation of a timeout on an arbitrary channel. In some languages this might require some awkward thread work, but a select with a call to time.After, which returns a channel that sends a message after a specified duration, makes short work of it:

var ch = make(chan int)

select {
case m := <-ch:                        // Read from ch; blocks forever
    fmt.Println(m)
case <-time.After(10 * time.Second):   // time.After returns a channel
    fmt.Println("Timed out")
}

Since there's no default statement, this select will block until one of its case conditions becomes true. If ch doesn't become available to read before the channel returned by time.After emits a message, then the second case will activate and the select will time out.

Summary

What I covered in this chapter could easily have consumed an entire book, if I’d been able to drill down into the level of detail the subject really deserves. But space and time are limited (and that book’s already been written),10 so I have to remain content to have only this one chapter as a broad and shallow survey of the Go language (at least until the third edition comes out).

But learning Go’s syntax and grammar will only get you so far. In Chapter 4, I’ll be presenting a variety of Go programming patterns that I see come up pretty regularly in the cloud native context. So, if you thought this chapter was interesting, you’re going to love the next one.

1 Alan Perlis, ACM SIGPLAN Notices 17, no. 9, (September 1982): 7–13.

2 Earlier versions of C, C++, and Python lacked a native Boolean type, instead representing them using the integers 0 (for false) or 1 (for true). Some languages like Perl, Lua, and Tcl still use a similar strategy.

3 See what I did there?

4 Why would you?

5 In fact, the freezing point of water on the Newton scale actually is 0.0, but that’s not important.

6 Clearly this code needs to be recalibrated.

7 Rob Pike, “Go Proverbs”, YouTube, December 1, 2015.

8 This example was borrowed from the article “An Introduction to Generics”, The Go Blog, March 22, 2022. It was just too perfect not to use.

9 This isn’t Java, after all.

10 One last time, if you haven’t read it yet, go read Learning Go by Jon Bodner (O’Reilly).

Chapter 4. Cloud Native Patterns

Progress is possible only if we train ourselves to think about programs without thinking of them as pieces of executable code.1

Edsger W. Dijkstra, August 1979

In 1991, while still at Sun Microsystems, L Peter Deutsch2 formulated the "fallacies of distributed computing," a list of some of the false assumptions that programmers new (and not so new) to distributed applications often make:

  • The network is reliable: switches fail, routers get misconfigured

  • Latency is zero: it takes time to move data across a network

  • Bandwidth is infinite: a network can handle only so much data at a time

  • The network is secure: don’t share secrets in plain text; encrypt everything

  • Topology doesn’t change: servers and services come and go

  • There is one administrator: multiple admins lead to heterogeneous solutions

  • Transport cost is zero: moving data around costs time and money

  • The network is homogeneous: every network is (sometimes very) different

If I might be so audacious, I’d like to add a ninth one:

  • Services are reliable: services that you depend on can fail at any time

In this chapter, I’ll present a selection of idiomatic patterns—tested, proven development paradigms—designed to address one or more of the conditions described in Deutsch’s fallacies and demonstrate how to implement them in Go. None of the patterns discussed in this book are original to this book—some have been around for as long as distributed applications have existed—but most haven’t been previously published together in a single work. Many of them are unique to Go or have novel implementations in Go relative to other languages.

Unfortunately, this book won’t cover infrastructure-level patterns like the Bulkhead or Gatekeeper patterns. Largely, this is because our focus is on application-layer development in Go, and those patterns, while indispensable, function at an entirely different abstraction level. If you’re interested in learning more, I recommend Cloud Native Infrastructure by Justin Garrison and Kris Nova (O’Reilly) and Designing Distributed Systems by Brendan Burns (O’Reilly).

The Context Package

Most of the code examples in this chapter make use of the context package, which was introduced in Go 1.7 to provide an idiomatic means of carrying deadlines, cancellation signals, and request-scoped values between processes. It contains a single interface, context.Context, whose methods are listed in the following:

type Context interface {
    // Deadline returns the time when this Context should be canceled; it
    // returns ok==false if no deadline is set.
    Deadline() (deadline time.Time, ok bool)

    // Done returns a channel that's closed when this Context is canceled.
    Done() <-chan struct{}

    // Err indicates why this context was canceled after the Done channel is
    // closed. If Done is not yet closed, Err returns nil.
    Err() error

    // Value returns the value associated with this context for key, or nil
    // if no value is associated with key. Use with care.
    Value(key any) any
}

Three of these methods can be used to learn something about a Context value’s cancellation status or behavior. The fourth, Value, can be used to retrieve a value associated with an arbitrary key. Context’s Value method is the focus of some controversy in the Go world and will be discussed more in “Defining Request-Scoped Values”.

What Context Can Do for You

A context.Context value is used by passing it directly to a service request, which may in turn pass it to one or more subrequests. What makes this useful is that when a Context is canceled, all functions holding it (or a derived Context; more on this in “Defining Context Deadlines and Timeouts”) will receive the signal, allowing them to coordinate their cancellation and reduce wasted effort.

Take, for example, a request from a user to a service, which in turn makes a request to a database. In an ideal scenario, the user, application, and database requests can be diagrammed as in Figure 4-1.

Figure 4-1. A successful request from a user, to a service, to a database.

But what if the user terminates their request before it’s fully completed? In most cases, oblivious to the overall context of the request, the processes will continue to live on anyway (Figure 4-2), consuming resources in order to provide a result that’ll never be used.

Figure 4-2. Subprocesses, unaware of a canceled user request, will continue anyway.

However, by passing a shared Context along to each subsequent request, all long-running processes can be sent a simultaneous "done" signal, allowing the cancellation to be coordinated among all of the processes (Figure 4-3).

Figure 4-3. By sharing context, cancellation signals can be coordinated among processes.

Importantly, Context values are also thread safe: they can be safely used by multiple concurrently executing goroutines without fear of unexpected behaviors.

Creating Context

A brand-new context.Context can be obtained using one of two functions:

Background() Context

Returns an empty Context that’s never canceled, has no values, and has no deadline. It’s typically used by the main function, initialization, and tests and as the top-level Context for incoming requests.

TODO() Context

Also provides an empty Context, but it’s intended to be used as a placeholder when it’s unclear which Context to use or when a parent Context is not yet available.
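
In practice, obtaining a root Context is a one-liner. As a minimal sketch:

ctx := context.Background()    // Root Context: never canceled; no deadline, no values
todo := context.TODO()         // Placeholder when it's not yet clear which Context to use

fmt.Println(ctx, todo)         // Both are empty Contexts; they differ only in intent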

Defining Context Deadlines and Timeouts

The context package also includes a number of methods for creating derived Context values that allow you to direct cancellation behavior, either by applying a timeout or by a function hook that can explicitly trigger a cancellation:

WithDeadline(Context, time.Time) (Context, CancelFunc)

Accepts a specific time at which the Context will be canceled and the Done channel will be closed.

WithTimeout(Context, time.Duration) (Context, CancelFunc)

Accepts a duration after which the Context will be canceled and the Done channel will be closed.

WithCancel(Context) (Context, CancelFunc)

Unlike the previous functions, WithCancel accepts nothing additional; it simply provides a function that can be called to explicitly cancel the derived Context.

All three of these functions return a derived Context that includes any requested decoration and a context.CancelFunc, a zero-parameter function that can be called to explicitly cancel the Context and all of its derived values.

Tip

When a Context is canceled, all Contexts that are derived from it are also canceled. Contexts that it was derived from are not.

For the sake of completeness, it's worth mentioning that there are also three recently introduced functions that parallel the three just mentioned but that also allow you to specify a specific error value as the cancellation cause:

WithDeadlineCause(Context, time.Time, error) (Context, CancelFunc)

Introduced in Go 1.21. Behaves like WithDeadline but also sets the cause of the returned Context when the deadline is exceeded. The returned CancelFunc does not set the cause.

WithTimeoutCause(Context, time.Duration, error) (Context, CancelFunc)

Introduced in Go 1.21. Behaves like WithTimeout but also sets the cause of the returned Context when the timeout expires. The returned CancelFunc does not set the cause.

WithCancelCause(Context) (Context, CancelCauseFunc)

Introduced in Go 1.20. Behaves like WithCancel but returns a CancelCauseFunc instead of a CancelFunc. Calling cancel with a non-nil error (the “cause”) records that error in ctx; it can then be retrieved using Cause(ctx).

The ability to explicitly define a cause can provide useful context3 for logging or otherwise deciding on an appropriate response.
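
As a brief sketch of how a cause might be recorded with WithCancelCause and later recovered with context.Cause:

ctx, cancel := context.WithCancelCause(context.Background())

cancel(errors.New("upstream dependency failed"))   // Cancel with an explicit cause

fmt.Println(ctx.Err())             // "context canceled"
fmt.Println(context.Cause(ctx))    // "upstream dependency failed"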

Defining Request-Scoped Values

Finally, the context package includes a function that can be used to define an arbitrary request-scoped key-value pair that can be accessed from the returned Context—and all Context values derived from it—via the Value method:

WithValue(parent Context, key, val any) Context

WithValue returns a derivation of parent in which key is associated with the value val.
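
Purely for illustration, a minimal sketch of setting and then retrieving a request-scoped value might look like the following; the unexported key type is a common way to avoid collisions between packages, and requestIDKey is just an illustrative name:

type requestIDKey struct{}    // Unexported key type avoids collisions with other packages

func main() {
    ctx := context.WithValue(context.Background(), requestIDKey{}, "id-12345")

    if id, ok := ctx.Value(requestIDKey{}).(string); ok {
        fmt.Println("request ID:", id)    // "request ID: id-12345"
    }
}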

On Context Values

The context.WithValue and context.Value functions provide convenient mechanisms for setting and getting arbitrary key-value pairs that can be used by consuming processes and APIs. However, it has been argued that this functionality is orthogonal to Context’s function of orchestrating the cancellation of long-lived requests, obscures your program’s flow, and can easily break compile-time coupling. See Dave Cheney’s blog post “Context Is for Cancelation” for a more in-depth discussion.

This functionality isn’t used in any of the examples in this chapter (or this book). If you choose to make use of it, please take care to ensure that all of your values are scoped only to the request, don’t alter the functioning of any processes, and don’t break your processes if they happen to be absent.

Using a Context

When a service request is initiated, either by an incoming request or triggered by the main function, the top-level process will use the Background function to create a new Context value, possibly decorating it with one or more of the context.With* functions, before passing it along to any subrequests. Those subrequests then need only watch the Done channel for cancellation signals.

For example, take a look at the following Stream function:

func Stream(ctx context.Context, out chan<- Value) error {
    // Create a derived Context with a 10s timeout; dctx
    // will be canceled upon timeout, but ctx will not.
    // cancel is a function that will explicitly cancel dctx.
    dctx, cancel := context.WithTimeout(ctx, 10 * time.Second)

    // Release resources if SlowOperation completes before timeout
    defer cancel()

    res, err := SlowOperation(dctx)   // res is a Value channel
    if err != nil {                   // True if dctx times out
        return err
    }

    for {
        select {
        case out <- <-res:            // Read from res; send to out

        case <-ctx.Done():            // Triggered if ctx is canceled
            return ctx.Err()          // but not if dctx is canceled
        }
    }
}

Stream receives a ctx Context as an input parameter, which it sends to WithTimeout to create dctx, a derived Context with a 10-second timeout. Because of this decoration, the SlowOperation(dctx) call could possibly time out after 10 seconds and return an error. Functions using the original ctx, however, will not have this timeout decoration and will not time out.

Further down, the original ctx value is used in a for loop around a select statement to retrieve values from the res channel provided by the SlowOperation function. Note the case <-ctx.Done() statement, which is executed when the ctx.Done channel closes to return an appropriate error value.

Layout of This Chapter

The general presentation of each pattern in this chapter is loosely based on the one used in the famous "Gang of Four" Design Patterns book4 but is simpler and less formal. Each pattern opens with a very brief description of its purpose and the reasons for using it, followed by these sections:

Applicability

Context and descriptions of where this pattern may be applied.

Participants

A listing of the components of the pattern and their roles.

Implementation

A discussion of the solution and its implementation.

Sample code

A demonstration of how the code may be implemented in Go.

Stability Patterns

The stability patterns presented here address one or more of the assumptions called out by the fallacies of distributed computing. They’re generally intended to be applied by distributed applications to improve their own stability and the stability of the larger system they’re a part of.

Circuit Breaker

Circuit Breaker automatically degrades service functions in response to a likely fault, preventing larger or cascading failures by eliminating recurring errors and providing reasonable error responses.

Applicability

If the fallacies of distributed computing were to be distilled to one point, it would be that errors and failures are an undeniable fact of life for distributed, cloud native systems. Services become misconfigured, databases crash, networks partition. We can’t prevent it; we can only accept and account for it.

Failing to do so can have some rather unpleasant consequences. We've all seen them, and they aren't pretty. Some services might keep futilely trying to do their job and returning nonsense to their client; others might fail catastrophically and maybe even fall into a crash/restart death spiral. It doesn't matter, because in the end they're all wasting resources, obscuring the source of the original failure, and making cascading failures even more likely.

On the other hand, a service that’s designed with the assumption that its dependencies can fail at any time can respond reasonably when they do. The Circuit Breaker allows a service to detect such failures and to “open the circuit” by temporarily ceasing to execute requests, instead providing clients with an error message consistent with the service’s communication contract.

For example, imagine a service that (ideally) receives a request from a client, executes a database query, and returns a response. What if the database fails? The service might continue futilely trying to query it anyway, flooding the logs with error messages and eventually timing out or returning useless errors. Such a service can use a Circuit Breaker to “open the circuit” when the database fails, preventing the service from making any more doomed database requests (at least for a while) and allowing it to respond to the client immediately with a meaningful notification.

Participants

This pattern includes the following participants:

Circuit

The function that interacts with the service.

Breaker

A closure with the same function signature as Circuit.

Implementation

Essentially, the Circuit Breaker is just a specialized Adapter pattern, with Breaker wrapping Circuit to add some additional error handling logic.

Like the electrical switch from which this pattern derives its name, Breaker has two possible states: closed and open. In the closed state, everything is functioning normally. All requests received from the client by Breaker are forwarded unchanged to Circuit, and all responses from Circuit are forwarded back to the client. In the open state, Breaker doesn’t forward requests to Circuit. Instead it “fails fast” by responding with an informative error message.

Breaker internally tracks the errors returned by Circuit; if the number of consecutive errors returned by Circuit exceeds a defined threshold, Breaker trips and its state switches to open.

Most implementations of Circuit Breaker include some logic to automatically close the circuit after some period of time. Keep in mind, though, that hammering an already malfunctioning service with lots of retries can cause its own problems, so it’s standard to include some kind of backoff logic that reduces the rate of retries over time. The subject of backoff is actually fairly nuanced, but it will be covered in detail in “Play It Again: Retrying Requests”.

In a multinode service, this implementation can be extended to include some shared storage mechanism, such as a Memcached or Redis network cache, to track the circuit state.

Sample code

We begin by creating a Circuit type that specifies the signature of the function that’s interacting with your database or other upstream service. In practice, this can take whatever form is appropriate for your functionality. It should include an error in its return list, however:

type Circuit func(context.Context) (string, error)

In this example, Circuit is a function that accepts a Context value, which was described in depth in “The Context Package”. Your implementation may vary.

The Breaker function accepts any function that conforms to the Circuit type definition, and an integer representing the number of consecutive failures allowed before the circuit automatically opens. In return it provides another function, which also conforms to the Circuit type definition:

func Breaker(circuit Circuit, threshold int) Circuit {
    var failures int
    var last = time.Now()
    var m sync.RWMutex

    return func(ctx context.Context) (string, error) {
        m.RLock()                       // Establish a "read lock"

        d := failures - threshold

        if d >= 0 {
            shouldRetryAt := last.Add((2 << d) * time.Second)

            if !time.Now().After(shouldRetryAt) {
                m.RUnlock()
                return "", errors.New("service unavailable")
            }
        }

        m.RUnlock()                     // Release the read lock

        response, err := circuit(ctx)   // Issue the request proper

        m.Lock()                        // Lock around shared resources
        defer m.Unlock()

        last = time.Now()               // Record time of attempt

        if err != nil {                 // Circuit returned an error,
            failures++                  // so we count the failure
            return response, err        // and return
        }

        failures = 0                    // Reset failures counter

        return response, nil
    }
}

The Breaker function constructs another function, also of type Circuit, which wraps circuit to provide the desired functionality. You may recognize this from “Anonymous Functions and Closures” as a closure: a nested function with access to the variables of its parent function. As you will see, all of the “stability” functions implemented for this chapter work this way.

The closure works by counting the number of consecutive errors returned by circuit. If that value meets the failure threshold, then it returns the error "service unavailable" without actually calling circuit. Any successful calls to circuit cause failures to reset to 0, and the cycle begins again.

The closure even includes an automatic reset mechanism that allows requests to call circuit again after several seconds, with an exponential backoff in which the delay between retries roughly doubles with each attempt. Though simple and quite common, this actually isn't the ideal backoff algorithm. We'll review exactly why in "Backoff Algorithms".

This function also includes our first use of a mutex (also known as a lock).5 Mutexes are a common idiom in concurrent programming, and we’re going to use them quite a bit in this chapter, so if you’re fuzzy on mutexes in Go, see “Mutexes”.

Mutexes

A mutex (“mutual exclusion”) is a concurrency construct that prevents multiple processes from simultaneously accessing the same shared resources. In Go, the sync.Mutex type allows a process to establish a “lock” such that subsequent attempts to establish locks will block until the first lock is released.

A popular extension of this is the read-write mutex, implemented in Go by sync.RWMutex, which provides methods to establish both read and write locks. Any number of processes can establish simultaneous read locks as long as there are no open write locks; a process can establish a write lock only when there are no existing read or write locks. Attempts to establish additional locks will block until any locks ahead of it are released.

Here we use a sync.RWMutex to allow thread-safe reads and writes on a map:

var items = struct{                             // Struct with a map and a
    sync.RWMutex                                // composed sync.RWMutex
    m map[string]int
}{m: make(map[string]int)}

func ThreadSafeRead(key string) int {
    items.RLock()                               // Establish read lock
    defer items.RUnlock()                       // Release read lock
    return items.m[key]
}

func ThreadSafeWrite(key string, value int) {
    items.Lock()                                // Establish write lock
    defer items.Unlock()                        // Release write lock
    items.m[key] = value
}

Here we see the two different lock methods available to RWMutex: RLock and RUnlock, to make and clear read locks, and Lock and Unlock, to make and clear write locks.

Debounce

Debounce limits the frequency of a function invocation so that only the first or last in a cluster of calls is actually performed.

Applicability

Debounce is the second of our patterns to be labeled with an electrical circuit theme. Specifically, it’s named after a phenomenon in which a switch’s contacts “bounce” when they’re opened or closed, causing the circuit to fluctuate a bit before settling down. It’s usually no big deal, but this “contact bounce” can be a real problem in logic circuits where a series of on/off pulses can be interpreted as a data stream. The practice of eliminating contact bounce so that only one signal is transmitted by an opening or closing contact is called debouncing.

In the world of services, we sometimes find ourselves performing a cluster of potentially slow or costly operations where only one would do. Using the Debounce pattern, a series of similar calls that are tightly clustered in time are restricted to only one call, typically the first or last in a batch.

This technique has been used in the JavaScript world for years to limit the number of operations that could slow the browser by taking only the first in a series of user events, or to delay a call until a user is ready. You’ve probably seen an application of this technique in practice before. We’re all familiar with the experience of using a search bar whose autocomplete pop-up doesn’t display until after you pause typing, or spam-clicking a button only to see the clicks after the first ignored.

Those of us who specialize in backend services can learn a lot from our frontend colleagues, who have been working for years to account for the reliability, latency, and bandwidth issues inherent to distributed systems. For example, this approach could be used to retrieve a slowly updating remote resource without bogging it down with redundant requests that waste both client and server time.

This pattern is similar to “Throttle”, in that it limits how often a function can be called. But where Debounce restricts clusters of invocations, Throttle simply limits according to time period. For more on the difference between the Debounce and Throttle patterns, see “What’s the Difference Between Throttle and Debounce?”.

Participants

This pattern includes the following participants:

Circuit

The function to regulate

Debounce

A closure with the same function signature as Circuit

Implementation

The Debounce implementation is actually very similar to the one for Circuit Breaker in that it wraps Circuit to provide the rate-limiting logic. That logic is actually quite straightforward: on each call of the outer function—regardless of its outcome—a time interval is set. Any subsequent call made before that time interval expires is ignored; any call made afterward is passed along to the inner function. This implementation, in which the inner function is called once and subsequent calls are ignored, is called function-first and is useful because it allows the initial response from the inner function to be cached and returned.

A function-last implementation will wait for a pause after a series of calls before calling the inner function. This variant is common in the JavaScript world when a programmer wants a certain amount of input before making a function call, such as when a search bar waits for a pause in typing before autocompleting. Function-last tends to be less common in backend services because it doesn’t provide an immediate response, but it can be useful if your function doesn’t need results right away.

Sample code

Just like in the Circuit Breaker implementation, we start by defining a derived function type with the signature of the function we want to limit. Also like Circuit Breaker, we call it Circuit; it’s identical to the one declared in that example. Again, Circuit can take whatever form is appropriate for your functionality, but it should include an error in its returns:

type Circuit func(context.Context) (string, error)

The similarity with the Circuit Breaker implementation is quite intentional: their compatibility makes them “chainable,” as demonstrated in the following:

func myFunction(ctx context.Context) (string, error) { /* ... */ }

wrapped := Breaker(DebounceFirst(myFunction, time.Second), 5)
response, err := wrapped(ctx)

The function-first implementation of Debounce—DebounceFirst—is very straightforward compared to function-last because it needs to track only the last time it was called and return a cached result if it’s called again less than d duration after:

func DebounceFirst(circuit Circuit, d time.Duration) Circuit {
    var threshold time.Time
    var result string
    var err error
    var m sync.Mutex

    return func(ctx context.Context) (string, error) {
        m.Lock()
        defer m.Unlock()

        if time.Now().Before(threshold) {
            return result, err
        }

        result, err = circuit(ctx)
        threshold = time.Now().Add(d)

        return result, err
    }
}

This implementation of DebounceFirst takes pains to ensure thread safety by wrapping the entire function in a mutex. While this will force overlapping calls at the start of a cluster to wait until the result is cached, it also guarantees that circuit is called exactly once, at the very beginning of a cluster. The threshold value, which marks the time before which subsequent calls simply return the cached result, is pushed forward each time circuit is actually invoked.

There's a potential problem with this approach: it effectively just caches the result of the function and returns it if it's called again. But what if the circuit function has important side effects? The following variation, DebounceFirstContext, is a little more sophisticated in that every call to it produces a call to circuit, but each call within a cluster cancels the Context of the call before it:

func DebounceFirstContext(circuit Circuit, d time.Duration) Circuit {
    var threshold time.Time
    var m sync.Mutex
    var lastCtx context.Context
    var lastCancel context.CancelFunc

    return func(ctx context.Context) (string, error) {
        m.Lock()

        if time.Now().Before(threshold) {
            lastCancel()
        }

        lastCtx, lastCancel = context.WithCancel(ctx)
        threshold = time.Now().Add(d)

        cctx := lastCtx    // Capture under the lock: a later call may reassign lastCtx

        m.Unlock()

        result, err := circuit(cctx)

        return result, err
    }
}

In DebounceFirstContext we use the same general structure and locking scheme as in DebounceFirst, but this time we set aside a Context and CancelFunc for circuit, which allows us to explicitly cancel circuit with each subsequent invocation before calling it again. In this way, circuit is still called (and any expected side effects triggered) each time while explicitly canceling any prior calls.

What if we want to call our Circuit function at the end of a cluster of calls? For this we'll use a function-last implementation. Unfortunately, it's a bit more awkward because it involves using a timer to determine whether enough time has passed since the function was last called, and calling circuit only when it has:

type Circuit func(context.Context) (string, error)

func DebounceLast(circuit Circuit, d time.Duration) Circuit {
    var m sync.Mutex
    var timer *time.Timer
    var cancel context.CancelFunc

    return func(ctx context.Context) (string, error) {
        m.Lock()

        if timer != nil {
            timer.Stop()
            cancel()
        }

        // cctx is local to this call so that later calls can't reassign it
        // while the timer function or the select below is still using it.
        var cctx context.Context
        cctx, cancel = context.WithCancel(ctx)
        ch := make(chan struct {
            result string
            err    error
        }, 1)

        timer = time.AfterFunc(d, func() {
            r, e := circuit(cctx)
            ch <- struct {
                result string
                err    error
            }{r, e}
        })

        m.Unlock()

        select {
        case res := <-ch:
            return res.result, res.err
        case <-cctx.Done():
            return "", cctx.Err()
        }
    }
}

In this implementation, a call to DebounceLast uses time.AfterFunc to execute the circuit function after the specified duration. This useful function lets us call an arbitrary function after a specific duration. It also provides a time.Timer value that can be used to cancel it. This is exactly what we need: you’ll notice that any already-existing timer is stopped (via its Stop method) before starting a new one, ensuring that circuit is called only once.

You’ve probably noticed that we also use a channel to send an anonymous struct (yes, you can do that!). This not only allows us to transmit both of the return values of circuit beyond the asynchronous function it’s called in, but it also conveniently allows us to use a select statement to react appropriately to a context cancellation event.

Retry

Retry accounts for a possible transient fault in a distributed system by transparently retrying a failed operation.

Applicability

Transient errors are a fact of life when working with complex distributed systems. These can be caused by any number of (hopefully) temporary conditions, especially if the downstream service or network resource has protective strategies in place, such as throttling that temporarily rejects requests under high workload, or adaptive strategies like autoscaling that can add capacity when needed.

These faults often resolve themselves after a bit of time, so repeating the request after a reasonable delay is likely (but not guaranteed) to be successful. Failing to account for transient faults can lead to a system that’s unnecessarily brittle. On the other hand, implementing an automatic retry strategy can considerably improve the stability of the service that can benefit both it and its upstream consumers.

Warning

Retry should be used only with idempotent operations. If you are not familiar with the concept of idempotence, we will cover it in detail in "What Is Idempotence and Why Does It Matter?".

Participants

This pattern includes the following participants:

Effector

The function that interacts with the service.

Retry

A function that accepts Effector and returns a closure with the same function signature as Effector.

Implementation

This pattern works similarly to Circuit Breaker or Debounce in that there is a derived function type, Effector, that defines a function signature. This signature can take whatever form is appropriate for your implementation, but when the function executing the potentially failing operation is implemented, it must match the signature defined by Effector.

The Retry function accepts the user-defined Effector function and returns an Effector function that wraps the user-defined function to provide the retry logic. Along with the user-defined function, Retry also accepts an integer describing the maximum number of retry attempts that it will make and a time.Duration that describes how long it'll wait between each retry attempt. If maxRetries is 0, then the retry logic effectively becomes a no-op.

Note

Although not included here, most retry implementations will include some kind of backoff logic.

Sample code

The signature for the function argument of the Retry function is Effector. It looks exactly like the function types for the previous patterns:

type Effector func(context.Context) (string, error)

The Retry function itself is relatively straightforward, at least compared to the functions we’ve seen so far in this chapter:

func Retry(effector Effector, maxRetries int, delay time.Duration) Effector {
    return func(ctx context.Context) (string, error) {
        for r := 0; ; r++ {
            response, err := effector(ctx)
            if err == nil || r >= maxRetries {
                return response, err
            }

            log.Printf("Attempt %d failed; retrying in %v", r + 1, delay)

            select {
            case <-time.After(delay):
            case <-ctx.Done():
                return "", ctx.Err()
            }
        }
    }
}

You may have already noticed what it is that keeps the Retry function so slender: although it returns a function, that function doesn’t have any external state. This means we don’t need any elaborate mechanisms to manage concurrency.

Note the contents of the select block, which demonstrates a common idiom for implementing a channel read timeout in Go based on the time.After function, similar to the example in “Implementing channel timeouts”. This very useful function returns a channel that emits a message after the specified time has elapsed, which activates its case and ends the current iteration of the retry loop.

To use Retry, we can implement the function that executes the potentially failing operation and whose signature matches the Effector type; this role is played by EmulateTransientError in the following example:

var count int

func EmulateTransientError(ctx context.Context) (string, error) {
    count++

    if count <= 3 {
        return "intentional fail", errors.New("error")
    } else {
        return "success", nil
    }
}

func main() {
    r := Retry(EmulateTransientError, 5, 2 * time.Second)

    res, err := r(context.Background())

    fmt.Println(res, err)
}

In the main function, the EmulateTransientError function is passed to Retry, providing the function variable r. When r is called, EmulateTransientError is called, and called again after a delay if it returns an error, according to the retry logic shown previously. Finally, on the fourth attempt, EmulateTransientError returns a nil error, and the Retry function exits.
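As the earlier note mentioned, most production retry implementations also add some form of backoff so that repeated failures don't hammer an already struggling service. The following is a sketch of an exponential-backoff variant (an illustration, not the book's implementation) that doubles the delay after each failed attempt, up to some maximum:

func RetryWithBackoff(effector Effector, maxRetries int,
    delay, maxDelay time.Duration) Effector {

    return func(ctx context.Context) (string, error) {
        d := delay                        // Start from the base delay

        for r := 0; ; r++ {
            response, err := effector(ctx)
            if err == nil || r >= maxRetries {
                return response, err
            }

            log.Printf("Attempt %d failed; retrying in %v", r+1, d)

            select {
            case <-time.After(d):
            case <-ctx.Done():
                return "", ctx.Err()
            }

            if d *= 2; d > maxDelay {     // Double the delay, capped at maxDelay
                d = maxDelay
            }
        }
    }
}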

Throttle

Throttle limits the frequency of a function call to some maximum number of invocations per unit of time.

Applicability

The Throttle pattern is named after a device used to manage the flow of a fluid, such as the amount of fuel going into a car engine. Like its namesake mechanism, Throttle restricts the number of times that a function can be called over a period of time. For example:

  • A user may be allowed only 10 service requests per second.

  • A client may restrict itself to call a particular function once every 500 milliseconds.

  • An account may be allowed only three failed login attempts in a 24-hour period.

Perhaps the most common reason to apply a Throttle is to account for sharp activity spikes that could saturate the system with a possibly unreasonable number of requests that may be expensive to satisfy, or lead to service degradation and eventually failure. While it may be possible for a system to scale up to add sufficient capacity to meet user demand, this takes time, and the system may not be able to react quickly enough.

Participants

This pattern includes the following participants:

Effector

The function to regulate

Throttle

A function that accepts Effector and returns a closure with the same function signature as Effector

What’s the Difference Between Throttle and Debounce?

Conceptually, Debounce and Throttle seem fairly similar. After all, they’re both about reducing the number of calls per unit of time. However, as illustrated in Figure 4-4, the precise timing of each differs quite a bit:

  • Throttle works like the throttle in a car, limiting the amount of fuel going into the engine by capping the flow of fuel to some maximum rate. This is illustrated in Figure 4-4: no matter how many times the input function is called, Throttle allows only a fixed number of calls to proceed per unit of time.

  • Debounce focuses on clusters of activity, making sure that a function is called only once during a cluster of requests, either at the start or the end of the cluster. A function-first debounce implementation is illustrated in Figure 4-4: for each of the two clusters of calls to the input function, Debounce allows only one call to proceed at the beginning (or end) of each cluster.

Figure 4-4. Throttle limits the event rate; debounce allows only one event in a cluster.

Implementation

The Throttle pattern is similar to many of the other patterns described in this chapter: it’s implemented as a function that accepts an effector function and returns a Throttle closure with the same signature that provides the rate-limiting logic.

The most common algorithm for implementing rate-limiting behavior is the token bucket, which uses the analogy of a bucket that can hold some maximum number of tokens. When a function is called, a token is taken from the bucket, which then refills at some fixed rate.

The way that a Throttle treats a request when there are insufficient tokens in the bucket to pay for it can vary depending on the needs of the developer. Some common strategies are as follows:

Return an error

This is the most basic strategy and is common when you’re only trying to restrict unreasonable or potentially abusive numbers of client requests. A RESTful service adopting this strategy might respond with a status 429 (Too Many Requests).

Replay the response of the last successful function call

This strategy can be useful when a service or expensive function call is likely to provide an identical result if called too soon. It’s commonly used in the JavaScript world.

Enqueue the request for execution when sufficient tokens are available

This approach can be useful when you want to eventually handle all requests, but it’s also more complex and may require taking care to ensure that memory isn’t exhausted.

Sample code

The following example implements a basic “token bucket” algorithm that uses the “error” strategy:

type Effector func(context.Context) (string, error)

func Throttle(e Effector, max uint, refill uint, d time.Duration) Effector {
    var tokens = max
    var once sync.Once
    var m sync.Mutex

    return func(ctx context.Context) (string, error) {
        if ctx.Err() != nil {
            return "", ctx.Err()
        }

        once.Do(func() {
            ticker := time.NewTicker(d)

            go func() {
                defer ticker.Stop()

                for {
                    select {
                    case <-ctx.Done():
                        return

                    case <-ticker.C:
                        m.Lock()
                        t := tokens + refill
                        if t > max {
                            t = max
                        }
                        tokens = t
                        m.Unlock()
                    }
                }
            }()
        })

        m.Lock()
        defer m.Unlock()

        if tokens <= 0 {
            return "", fmt.Errorf("too many calls")
        }

        tokens--

        return e(ctx)
    }
}

This Throttle implementation is similar to our other examples in that it wraps an effector function e with a closure that contains the rate-limiting logic. The bucket is initially allocated max tokens; each time the closure is triggered, it checks whether it has any remaining tokens. If tokens are available, it decrements the token count by one and triggers the effector function. If not, an error is returned. Tokens are added at a rate of refill tokens every duration d.
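Using Throttle looks much like using the other patterns in this chapter: wrap the effector once, then call the returned closure as often as you like. The following usage sketch (the getHostname effector is hypothetical and isn't part of the book's example) allows a burst of three calls and refills one token per second, so an immediate burst of five calls sees the last two rejected:

func main() {
    // A trivial effector for demonstration purposes.
    getHostname := func(_ context.Context) (string, error) {
        return os.Hostname()
    }

    // Allow a burst of 3 calls; refill 1 token every second.
    throttled := Throttle(getHostname, 3, 1, time.Second)

    ctx := context.Background()

    for i := 0; i < 5; i++ {
        hostname, err := throttled(ctx)
        fmt.Println(hostname, err)       // The last two calls exceed the limit
    }
}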

Timeout

Timeout allows a process to stop waiting for an answer once it’s clear that an answer may not be coming.

Applicability

The first of the fallacies of distributed computing is that “the network is reliable,” and it’s first for a reason. Switches fail, routers and firewalls get misconfigured, packets get blackholed. Even if your network is working perfectly, not every service is thoughtful enough to guarantee a meaningful and timely response—or any response at all—if and when it malfunctions.

Timeout represents a common solution to this dilemma and is so beautifully simple that it barely even qualifies as a pattern at all: given a service request or function call that’s running for a longer-than-expected time, the caller simply…​stops waiting.

However, don’t mistake “simple” or “common” for “useless.” On the contrary, the ubiquity of the timeout strategy is a testament to its usefulness. The judicious use of timeouts can provide a degree of fault isolation, preventing cascading failures and reducing the chance that a problem in a downstream resource becomes your problem.

Participants

This pattern includes the following participants:

Client

The client who wants to execute SlowFunction

SlowFunction

The long-running function that implements the functionality desired by Client

Timeout

A wrapper function around SlowFunction that implements the timeout logic

Implementation

There are several ways to implement a timeout in Go, but the most idiomatic way is to use the functionality provided by the context package. See “The Context Package” for more information.

In an ideal world, any possibly long-running function will accept a context.Context parameter directly. If so, your work is fairly straightforward: you need only pass it a Context value decorated with the context.WithTimeout function:

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

result, err := SomeFunction(ctx)

However, this isn’t always the case, and with third-party libraries you don’t always have the option of refactoring to accept a Context value. In these cases, the best course of action may be to wrap the function call in such a way that it does respect your Context.

For example, imagine you have a potentially long-running function that not only doesn’t accept a Context value but comes from a package you don’t control. If Client were to call SlowFunction directly, it would be forced to wait until the function completes, if indeed it ever does. Now what?

Instead of calling SlowFunction directly, you can call it in a goroutine. This way you can capture the results if it returns in an acceptable period of time, but you can also move on if it doesn’t.

Warning

Timing out doesn’t actually cancel SlowFunction. If it doesn’t end somehow, the result will be a goroutine leak. See “Goroutines” for more on this phenomenon.

To do this, we can leverage a few tools that we’ve seen before: context.Context for timeouts, channels for communicating results, and select to catch whichever one acts first.

Sample code

The following example imagines the existence of the fictional function, Slow, whose execution may or may not complete in some reasonable amount of time and whose signature conforms with the following type definition:

type SlowFunction func(string) (string, error)
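For illustration only, a stand-in Slow might simply sleep for longer than we're willing to wait (this function isn't part of the book's example; any function with a matching signature will do):

func Slow(s string) (string, error) {
    time.Sleep(5 * time.Second)          // Simulate a long-running operation
    return "processed: " + s, nil
}

With the one-second deadline used in the main function shown later, this version of Slow will reliably trigger the timeout path.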

Rather than calling Slow directly, we instead provide a Timeout function, which wraps a provided SlowFunction in a closure and returns a WithContext function, which adds a context.Context to the SlowFunction’s parameter list:

type WithContext func(context.Context, string) (string, error)

func Timeout(f SlowFunction) WithContext {
    return func(ctx context.Context, arg string) (string, error) {
        ch := make(chan struct {
            result string
            err    error
        }, 1)

        go func() {
            res, err := f(arg)
            ch <- struct {
                result string
                err    error
            }{res, err}
        }()

        select {
        case res := <-ch:
            return res.result, res.err
        case <-ctx.Done():
            return "", ctx.Err()
        }
    }
}

Within the function that Timeout constructs, the SlowFunction is run in a goroutine, with its return values being wrapped in a struct and sent into a channel constructed for that purpose, assuming it completes in time.

The next statement selects on two channels: the SlowFunction function response channel and the Context value's Done channel. If the former completes first, the closure will return the SlowFunction's return values; otherwise, it returns the error provided by the Context.

Tip

If the SlowFunction is slow because it’s expensive, a possible improvement would be to check whether ctx.Err() returns a non-nil value before calling the goroutine.
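That check might look something like the following at the top of the returned closure (a small sketch, not shown in the example above):

if err := ctx.Err(); err != nil {
    return "", err                       // Context already canceled or expired; skip the work
}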

Using the Timeout function isn’t much more complicated than consuming Slow directly, except that instead of one function call, we have two: the call to Timeout to retrieve the closure and the call to the closure itself:

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()

    timeout := Timeout(Slow)
    res, err := timeout(ctx, "some input")

    fmt.Println(res, err)
}

Finally, although it’s usually preferred to implement service timeouts using context.Context, channel timeouts can also be implemented using the channel provided by the time.After function. See “Implementing channel timeouts” for an example of how this is done.

Concurrency Patterns

A cloud native service will often be called upon to efficiently juggle multiple processes and handle high (and highly variable) levels of load, ideally without having to suffer the trouble and expense of scaling up. As such, it needs to be highly concurrent and able to manage multiple simultaneous requests from multiple clients. While Go is known for its concurrency support, bottlenecks can and do happen. Some of the patterns that have been developed to prevent them are presented here.

For the sake of simplicity, the code examples in this section don’t implement context cancellation. Generally, you shouldn’t have too much difficulty adding it on your own.

Fan-In

Fan-in multiplexes multiple input channels onto one output channel.

Applicability

Services that have some number of workers that all generate output may find it useful to combine all of the workers’ outputs to be processed as a single unified stream. For these scenarios we use the fan-in pattern, which can read from multiple input channels by multiplexing them onto a single destination channel.

Participants

This pattern includes the following participants:

Sources

A set of one or more input channels with the same type. Accepted by Funnel.

Destination

An output channel of the same type as Sources. Created and provided by Funnel.

Funnel

Accepts Sources and immediately returns Destination. Any input from any Sources will be output by Destination.

Implementation

Funnel is implemented as a function that receives zero to N input channels (Sources). For each input channel in Sources, the Funnel function starts a separate goroutine to read values from its assigned channel and forward them to a single output channel shared by all of the goroutines (Destination).

Sample code

The Funnel function is a variadic function that receives sources: zero to N channels of some type (int in the following example):

func Funnel(sources ...<-chan int) <-chan int {
    dest := make(chan int)         // The shared output channel

    wg := sync.WaitGroup{}         // Used to automatically close dest
                                   // when all sources are closed

    wg.Add(len(sources))           // Set size of the WaitGroup

    for _, ch := range sources {   // Start a goroutine for each source
        go func(ch <-chan int) {
            defer wg.Done()        // Notify WaitGroup when ch closes

            for n := range ch {
                dest <- n
            }
        }(ch)
    }

    go func() {                    // Start a goroutine to close dest
        wg.Wait()                  // after all sources close
        close(dest)
    }()

    return dest
}

For each channel in the list of sources, Funnel starts a dedicated goroutine that reads values from its assigned channel and forwards them to dest, a single output channel shared by all of the goroutines.

Note the use of a sync.WaitGroup to ensure that the destination channel is closed appropriately. Initially, a WaitGroup is created and set to the total number of source channels. If a channel is closed, its associated goroutine exits, calling wg.Done. When all of the channels are closed, the WaitGroup’s counter reaches zero, the lock imposed by wg.Wait is released, and the dest channel is closed.

Using Funnel is reasonably straightforward: given N source channels (or a slice of N channels), pass the channels to Funnel. The returned destination channel may be read in the usual way and will close when all source channels close:

func main() {
    var sources []<-chan int            // Declare an empty channel slice

    for i := 0; i < 3; i++ {
        ch := make(chan int)
        sources = append(sources, ch)   // Create a channel; add to sources

        go func() {                     // Run a toy goroutine for each
            defer close(ch)             // Close ch when the routine ends

            for i := 1; i <= 5; i++ {
                ch <- i
                time.Sleep(time.Second)
            }
        }()
    }

    dest := Funnel(sources...)
    for d := range dest {
        fmt.Println(d)
    }
}

This example creates a slice of three int channels; for each one, a goroutine sends the values 1 to 5 before closing the channel. The main function then prints everything it receives from the single dest channel. Running this will result in the appropriate 15 lines being printed before dest closes and the function ends.
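As noted at the start of this section, these examples omit context cancellation for simplicity. If you wanted to add it to Funnel, one possible approach (a sketch, not the book's implementation) is to select on ctx.Done() in each forwarding goroutine:

func FunnelWithContext(ctx context.Context, sources ...<-chan int) <-chan int {
    dest := make(chan int)

    wg := sync.WaitGroup{}
    wg.Add(len(sources))

    for _, ch := range sources {
        go func(ch <-chan int) {
            defer wg.Done()

            for {
                select {
                case n, ok := <-ch:
                    if !ok {
                        return           // The source channel closed
                    }

                    select {
                    case dest <- n:      // Forward the value
                    case <-ctx.Done():
                        return           // Canceled while waiting to send
                    }

                case <-ctx.Done():
                    return               // Canceled while waiting to receive
                }
            }
        }(ch)
    }

    go func() {
        wg.Wait()
        close(dest)
    }()

    return dest
}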

Fan-Out

Fan-out evenly distributes messages from an input channel to multiple output channels.

Applicability

Fan-out receives messages from an input channel, distributing them evenly among output channels, and is a useful pattern for parallelizing CPU and I/O utilization.

For example, imagine that you have an input source, such as a Reader on an input stream or a listener on a message broker that provides the inputs for some resource-intensive unit of work. Rather than coupling the input and computation processes, which would confine the effort to a single serial process, you might prefer to parallelize the workload by distributing it among some number of concurrent worker processes.

Participants

This pattern includes the following participants:

Source

An input channel. Accepted by Split.

Destinations

A set of output channels of the same type as Source. Created and provided by Split.

Split

A function that accepts Source and immediately returns Destinations. Any input from Source will be output to a Destination.

Implementation

Fan-out may be relatively conceptually straightforward, but the devil is in the details.

Typically, fan-out is implemented as a Split function, which accepts a single Source channel and an integer representing the desired number of Destination channels. The Split function creates the Destination channels and executes some background process that retrieves values from the Source channel and forwards them to one of the Destinations.

The implementation of the forwarding logic can be done in one of two ways:

  • Using a single goroutine that reads values from Source and forwards them to the Destinations in a round-robin fashion. This has the virtue of requiring only one master goroutine, but if the next channel isn’t ready to read yet, it’ll slow the entire process.

  • Using separate goroutines for each Destination that compete to read the next value from Source and forward it to their respective Destination. This requires slightly more resources but is less likely to get bogged down by a single slow-running worker.

The next example uses the latter approach.

Sample code

In this example, the Split function accepts a single receive-only channel, source, and an integer describing the number of channels to split the input into, n. It returns a slice of n send-only channels with the same type as source.

Internally, Split creates the destination channels. For each channel created, it executes a goroutine that retrieves values from source in a for loop and forwards them to their assigned output channel. Effectively, each goroutine competes for reads from source; if several are trying to read, the “winner” will be randomly determined. If source is closed, all goroutines terminate and all of the destination channels are closed:

func Split(source <-chan int, n int) []<-chan int {
    var dests []<-chan int              // Declare the dests slice

    for i := 0; i < n; i++ {            // Create n destination channels
        ch := make(chan int)
        dests = append(dests, ch)

        go func() {                     // Each channel gets a dedicated
            defer close(ch)             // goroutine that competes for reads

            for val := range source {
                ch <- val
            }
        }()
    }

    return dests
}

Given a channel of some specific type, the Split function will return a number of destination channels. Typically, each will be passed to a separate goroutine, as demonstrated in the following example:

func main() {
    source := make(chan int)         // The input channel
    dests := Split(source, 5)        // Retrieve 5 output channels

    go func() {                      // Send the numbers 1..10 to source
        for i := 1; i <= 10; i++ {   // and close it when we're done
            source <- i
        }

        close(source)
    }()

    var wg sync.WaitGroup            // Use WaitGroup to wait until
    wg.Add(len(dests))               // the output channels all close

    for i, d := range dests {
        go func(i int, d <-chan int) {
            defer wg.Done()

            for val := range d {
                fmt.Printf("#%d got %d\n", i, val)
            }
        }(i, d)
    }

    wg.Wait()
}

This example creates an input channel, source, which it passes to Split to receive its output channels. Concurrently, it passes the values 1 to 10 into source in a goroutine while receiving values from dests in 5 others. When the inputs are complete, the source channel is closed, which causes the output channels to close, which ends the read loops, which causes wg.Done to be called by each of the read goroutines, which releases the lock on wg.Wait and allows the function to end.
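For comparison, the first (round-robin) approach described earlier can be sketched with a single forwarding goroutine. This is an illustration rather than the book's implementation; note how a send to a slow destination stalls every destination behind it:

func SplitRoundRobin(source <-chan int, n int) []<-chan int {
    dests := make([]chan int, n)
    for i := range dests {
        dests[i] = make(chan int)
    }

    go func() {
        defer func() {                   // Close every destination when
            for _, ch := range dests {   // source closes
                close(ch)
            }
        }()

        i := 0
        for val := range source {
            dests[i] <- val              // Blocks until destination i is read
            i = (i + 1) % n
        }
    }()

    out := make([]<-chan int, n)         // Expose the channels as receive-only
    for i, ch := range dests {
        out[i] = ch
    }

    return out
}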

Future

Future provides a placeholder for a value that’s not yet known.

Applicability

Futures (also known as Promises or Delays)6 are a synchronization construct that provides a placeholder for a value that's still being generated by an asynchronous process.

This pattern isn’t used as frequently in Go as in some other languages because channels can often be used in a similar way. For example, the long-running blocking function BlockingInverse (not shown) can be executed in a goroutine that returns the result (when it arrives) along a channel. The ConcurrentInverse function that follows does exactly that, returning a channel that can be read when a result becomes available:

func ConcurrentInverse(m Matrix) <-chan Matrix {
    out := make(chan Matrix)

    go func() {
        out <- BlockingInverse(m)
        close(out)
    }()

    return out
}

Using ConcurrentInverse, one could then build a function to calculate the inverse product of two matrices:

func InverseProduct(a, b Matrix) Matrix {
    inva := ConcurrentInverse(a)
    invb := ConcurrentInverse(b)

    return Product(<-inva, <-invb)
}

This doesn’t seem so bad, but it comes with some baggage that makes it undesirable for something like a public API. First, the caller has to be careful to call ConcurrentInverse with the correct timing. To see what I mean, take a close look at the following:

return Product(<-ConcurrentInverse(a), <-ConcurrentInverse(b))

See the problem? Since the computation isn’t started until ConcurrentInverse is actually called, this construct would be effectively executed serially, requiring twice the runtime.

What’s more, when using channels this way, functions with more than one return value will usually assign a dedicated channel to each member of the return list, which can become awkward as the return list grows or when the values need to be read by more than one goroutine.

The Future pattern contains this complexity by encapsulating it in an API that provides the consumer with a simple interface whose method can be called normally, blocking all calling routines until all of its results are resolved. The interface that the value satisfies doesn’t even have to be constructed specially for that purpose; any interface that’s convenient for the consumer can be used.

Participants

This pattern includes the following participants:

Future

The interface that is received by the consumer to retrieve the eventual result

SlowFunction

A wrapper function around some function to be asynchronously executed; provides Future

InnerFuture

Satisfies the Future interface; includes an attached method that contains the result access logic

Implementation

The API presented to the consumer is fairly straightforward: the programmer calls SlowFunction, which returns a value that satisfies the Future interface. Future may be a bespoke interface, as in the following example, or it may be something more like an io.Reader that can be passed to its own functions.

In actuality, when SlowFunction is called, it executes the core function of interest as a goroutine. In doing so, it defines channels to capture the core function’s output, which it wraps in InnerFuture.

InnerFuture has one or more methods that satisfy the Future interface, which retrieve the values returned by the core function from the channels, cache them, and return them. If the values aren’t available on the channel, the request blocks. If they have already been retrieved, the cached values are returned.

Sample code

In this example, we use a Future interface that the InnerFuture will satisfy:

type Future interface {
    Result() (string, error)
}

The InnerFuture struct is used internally to provide the concurrent functionality. In this example, it satisfies the Future interface but could just as easily choose to satisfy something like io.Reader by attaching a Read method, if you prefer:

type InnerFuture struct {
    once sync.Once
    wg   sync.WaitGroup

    res   string
    err   error
    resCh <-chan string
    errCh <-chan error
}

func (f *InnerFuture) Result() (string, error) {
    f.once.Do(func() {
        f.wg.Add(1)
        defer f.wg.Done()
        f.res = <-f.resCh
        f.err = <-f.errCh
    })

    f.wg.Wait()

    return f.res, f.err
}

In this implementation, the struct itself contains a channel and a variable for each value returned by the Result method. When Result is first called, it reads the results from the channels and caches them in the InnerFuture struct so that subsequent calls to Result can immediately return the cached values.

Note the use of sync.Once and sync.WaitGroup. The former does what it says on the tin: it ensures that the function that’s passed to it is called exactly once. The WaitGroup is used to make this function call thread safe: any calls after the first will be blocked at wg.Wait until the channel reads are complete.

SlowFunction is a wrapper around the core functionality that you want to run concurrently. It has the job of creating the results channels, running the core function in a goroutine, and creating and returning the Future implementation (InnerFuture, in this example):

func SlowFunction(ctx context.Context) Future {
    resCh := make(chan string)
    errCh := make(chan error)

    go func() {
        select {
        case <-time.After(2 * time.Second):
            resCh <- "I slept for 2 seconds"
            errCh <- nil
        case <-ctx.Done():
            resCh <- ""
            errCh <- ctx.Err()
        }
    }()

    return &InnerFuture{resCh: resCh, errCh: errCh}
}

To make use of this pattern, you need only call the SlowFunction and use the returned Future as you would any other value:

func main() {
    ctx := context.Background()
    future := SlowFunction(ctx)

    // Do stuff while SlowFunction chugs along in the background.

    res, err := future.Result()
    if err != nil {
        fmt.Println("error:", err)
        return
    }

    fmt.Println(res)
}

This approach provides a reasonably good user experience. The programmer can create a Future and access it as they wish, and can even apply timeouts or deadlines with a Context.

Sharding

Sharding splits a large data structure into multiple partitions to localize the effects of read/write locks.

Applicability

The term sharding is typically used in the context of distributed state to describe data that is partitioned between server instances. This kind of horizontal sharding is commonly used by databases and other data stores to distribute load and provide redundancy.

A slightly different issue can sometimes affect highly concurrent services that have a shared data structure with a locking mechanism to protect it from conflicting writes. In this scenario, the locks that serve to ensure the fidelity of the data can also create a bottleneck when processes start to spend more time waiting for locks than they do doing their jobs. This unfortunate phenomenon is called lock contention.

While this might be resolved in some cases by scaling the number of instances, this also increases complexity and latency, because distributed locks need to be established, and writes need to establish consistency. An alternative strategy for reducing lock contention around shared data structures within an instance of a service is vertical sharding, in which a large data structure is partitioned into two or more structures, each representing a part of the whole. Using this strategy, only a portion of the overall structure needs to be locked at a time, decreasing overall lock contention.

Horizontal Versus Vertical Sharding

Large data structures can be sharded, or partitioned, in two different ways:

Horizontal sharding

The partitioning of data across service instances. This can provide data redundancy and allow load to be balanced between instances but also adds the latency and complexity that comes with distributed data.

Vertical sharding

The partitioning of data within a single instance. This can reduce read/write contention between concurrent processes but also doesn’t scale or provide any redundancy.

Participants

This pattern includes the following participants:

ShardedMap

An abstraction around one or more Shards providing read and write access as if the Shards were a single map

Shard

An individually lockable collection representing a single data partition

Implementation

While idiomatic Go generally prefers sharing memory by communicating (that is, using channels) over using locks to protect shared resources,7 this isn't always possible. Maps in particular aren't safe for concurrent use, making the use of locks as a synchronization mechanism a necessary evil. Fortunately, Go provides sync.RWMutex for precisely this purpose. You may recall sync.RWMutex from “Mutexes”.

This strategy generally works well enough. However, because locks allow access to only one process at a time, the average amount of time spent waiting for locks to clear in a read/write intensive application can increase dramatically with the number of concurrent processes acting on the resource. The resulting lock contention can potentially bottleneck key functionality.

Vertical sharding reduces lock contention by splitting the underlying data structure—usually a map—into several individually lockable maps. An abstraction layer provides access to the underlying shards as if they were a single structure (see Figure 4-5).

Figure 4-5. Vertically sharding a map by key hash.

Internally, this is accomplished by creating an abstraction layer around what is essentially a map of maps. Whenever a value is read or written to the map abstraction, a hash value is calculated for the key, which is then modded by the number of shards to generate a shard index. This allows the map abstraction to isolate the necessary locking to only the shard at that index.

Sample code

In the following example, we use the standard sync and hash/fnv packages to implement a basic sharded map: ShardedMap.

Internally, ShardedMap is just a slice of pointers to some number of Shard values, but we define it as a type so we can attach methods to it. Each Shard includes an items map of type map[K]V that contains that shard's data and a composed sync.RWMutex so that it can be individually locked:

type Shard[K comparable, V any] struct {
    sync.RWMutex    // Compose from sync.RWMutex
    items map[K]V   // items contains the shard's data
}

type ShardedMap[K comparable, V any] []*Shard[K, V]

Go doesn’t have Java-style constructors, so we provide a NewShardedMap function to retrieve a new ShardedMap:

func NewShardedMap[K comparable, V any](nshards int) ShardedMap[K, V] {
    shards := make([]*Shard[K, V], nshards)      // Initialize a *Shards slice

    for i := 0; i < nshards; i++ {
        shard := make(map[K]V)
        shards[i] = &Shard[K, V]{items: shard}   // A ShardedMap IS a slice!
    }

    return shards
}

ShardedMap has two unexported methods, getShardIndex and getShard, which are used to calculate a key’s shard index and retrieve a key’s correct shard, respectively. These could be easily combined into a single method, but splitting them this way makes them easier to test:8

func (m ShardedMap[K, V]) getShardIndex(key K) int {
    str := fmt.Sprintf("%v", key)         // Get a string representation of the key
    hash := fnv.New32a()                  // Get a hash implementation
    hash.Write([]byte(str))               // Write bytes to the hash
    sum := int(hash.Sum32())              // Get the resulting checksum
    return sum % len(m)                   // Mod by len(m) to get index
}

func (m ShardedMap[K, V]) getShard(key K) *Shard[K, V] {
    index := m.getShardIndex(key)
    return m[index]
}

To get the index, we first compute the hash of the string representation of the key, from which we calculate our final value by modding it against the number of shards. This definitely isn't the most computationally efficient way to go about generating a hash, but it's conceptually simple and works well enough for an example.

The precise hash algorithm doesn’t really matter very much—we use the FNV-1a hash function in this example—as long as it’s deterministic and provides a reasonably uniform value distribution.

Finally, we add methods to ShardedMap to allow a user to read and write values. Obviously, these don’t demonstrate all of the functionality a map might need. The source for this example is in the GitHub repository associated with this book, however, so please feel free to implement them as an exercise. A Delete and a Contains method would be nice:

func (m ShardedMap[K, V]) Get(key K) V {
    shard := m.getShard(key)
    shard.RLock()
    defer shard.RUnlock()

    return shard.items[key]
}

func (m ShardedMap[K, V]) Set(key K, value V) {
    shard := m.getShard(key)
    shard.Lock()
    defer shard.Unlock()

    shard.items[key] = value
}
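If you do take up that exercise, a sketch of Delete and Contains (the versions in the book's repository may differ) follows the same pattern of locking only the relevant shard:

func (m ShardedMap[K, V]) Delete(key K) {
    shard := m.getShard(key)
    shard.Lock()
    defer shard.Unlock()

    delete(shard.items, key)
}

func (m ShardedMap[K, V]) Contains(key K) bool {
    shard := m.getShard(key)
    shard.RLock()
    defer shard.RUnlock()

    _, ok := shard.items[key]
    return ok
}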

When you do need to establish locks on all of the tables, it’s generally best to do so concurrently. In the following, we implement a Keys function using goroutines and our old friend sync.WaitGroup:

func (m ShardedMap[K, V]) Keys() []K {
    var keys []K                           // Declare an empty keys slice
    var mutex sync.Mutex                   // Mutex for write safety to keys

    var wg sync.WaitGroup                  // Create a wait group and add a
    wg.Add(len(m))                         // wait value for each shard

    for _, shard := range m {              // Run a goroutine for each shard in m
        go func(s *Shard[K, V]) {
            s.RLock()                      // Establish a read lock on s

            defer wg.Done()                // Tell the WaitGroup it's done
            defer s.RUnlock()              // Release the read lock

            for key := range s.items {     // Get the shard's keys
                mutex.Lock()
                keys = append(keys, key)
                mutex.Unlock()
            }
        }(shard)
    }

    wg.Wait()                              // Block until all goroutines are done

    return keys                            // Return combined keys slice
}

Using ShardedMap isn’t quite like using a standard map, unfortunately, but while it’s different, it’s no more complicated:

func main() {
    m := NewShardedMap[string, int](5)
    keys := []string{"alpha", "beta", "gamma"}

    for i, k := range keys {
        m.Set(k, i+1)

        fmt.Printf("%5s: shard=%d value=%d\n",
            k, m.getShardIndex(k), m.Get(k))
    }

    fmt.Println(m.Keys())
}

Output:

alpha: shard=3 value=1
 beta: shard=2 value=2
gamma: shard=0 value=3
[gamma beta alpha]

Prior to the introduction of generics in Go 1.18 (and in the previous edition of this book), the ShardedMap had to be constructed with interface{} values, at least if it was going to be reusable. However, this came with the loss of type safety associated with the use of interface{} and the subsequent requirement of type assertions. Fortunately, with the release of Go generics, this is a solved problem.

Worker Pool

A worker pool directs multiple processes to concurrently execute work on a collection of input.

Applicability

Probably one of the most commonly used of the patterns in this chapter, a worker pool (or a “thread pool” in pretty much any other language) efficiently manages the execution of multiple concurrent tasks by using a fixed number of workers.

Worker pools are very good at managing tasks that you want to run concurrently, but only to a point. Common applications include things like handling multiple incoming requests, processing task queues, handling steps in data pipelines, and executing long-duration batch processing jobs.

Participants

This pattern includes the following participants:

Worker

A function that does some work on items from Jobs and sends the results to Results

Jobs

A channel from which Worker receives the raw data to be worked on

Results

A channel into which Worker sends the results of its work

Implementation

Programmers coming from other languages may be familiar with the concept of the “thread pool”: a group of threads that are standing by to do some work. This is a useful concurrency pattern in any language, but thread pools can often be tedious and sometimes complicated to work with.9

Fortunately, Go’s concurrency features—goroutines and channels—make it especially well-suited for building something like this. As such, our implementation is rather straightforward relative to many other languages: what we call a “worker” is really just a goroutine10 (Worker) that receives input data on an input channel (Jobs) and returns that work along an output channel (Results).

Both channels are shared among all workers, so any work sent along the jobs channel will be done by the first available worker. Importantly, each worker lives only as long as the jobs channel is open: they’re designed to automatically terminate once their job is done.

Sample code

If you’re accustomed to working with thread pools in other languages, you’ll hopefully find our example of a worker pool to be refreshingly simple. In my opinion, this pattern highlights the elegance of the design around Go’s concurrency features particularly well.

The meat of the pattern is the worker, which (as its name implies) does the work. Below is the blueprint for a single worker function:

func worker(id int, jobs <-chan int, results chan<- int) {
    for j := range jobs {
        fmt.Println("Worker", id, "started job", j)
        time.Sleep(time.Second)
        results <- j * 2
    }
}

As you can see, this example worker is a fairly standard function that accepts a jobs channel from which it receives work, and a results channel into which it sends the results of that work. This implementation also takes an id integer, which is just used for demonstration purposes.

This worker’s functionality is pretty trivial, reading an int value from the jobs channel and (after a brief pause to simulate effort) writing that value’s double out via the results channel. Note that it’ll automatically exit when the jobs channel closes.

Warning

You should always know how and when any goroutine you create is going to terminate, or you risk creating a goroutine leak.

Now, if we were using just one worker, this wouldn’t be much more useful than just looping over the inputs and doing the computation one by one. But of course, as you’ve probably already inferred, we won’t be using just one worker:

func main() {
    jobs := make(chan int, 10)
    results := make(chan int)
    wg := sync.WaitGroup{}

    for w := 1; w <= 3; w++ {          // Spawn 3 worker processes
        go worker(w, jobs, results)
    }

    for j := 1; j <= 10; j++ {         // Send jobs to workers
        wg.Add(1)
        jobs <- j
    }

    go func() {
        wg.Wait()
        close(jobs)
        close(results)
    }()

    for r := range results {
        fmt.Println("Got result:", r)
        wg.Done()
    }
}

In this example, we start by creating the two channels: jobs to put work into and results to receive the completed work from. We also create a sync.WaitGroup, which lets us pause until all the work is done.

Now the interesting bit: we use the go keyword to start three workers, passing both channels into each. We haven’t given them any work to do yet, so they’re all blocked waiting for jobs.

Now that we have the workers in place, let's see what they can do by sending them some work units via the jobs channel. Each work unit is received and processed by exactly one of the three workers. Finally, we read the results coming from the results channel until it's closed by the goroutine.

This pattern may strike you as trivial, and that’s okay. But it demonstrates how, with a couple of channels and goroutines, we can create a fully functional worker pool in Go.

Chord

The Chord (or Join) pattern performs an atomic consumption of messages from each member of a group of channels.

Applicability

In music, a chord is a group of multiple notes sounded together to produce a harmony. Like its namesake musical concept, the Chord pattern consists of multiple signals being sent together, in this case along channels, to produce a single output.

More specifically, the Chord pattern receives inputs from multiple channels but emits a value only when all its channels have emitted a value.

This could be a useful pattern if, for example, you need to receive inputs from multiple monitored data sources before acting.

Participants

This pattern includes the following participants:

Sources

A set of one or more input channels with the same type. Accepted by Chord.

Destination

An output channel of the same type as Sources. Created and provided by Chord.

Chord

Accepts Sources and immediately returns Destination. Any input from any Sources will be output by Destination.

Implementation

The Chord pattern is very similar to “Fan-In” in that it concurrently reads inputs from multiple Sources channels and emits them back to the Destination channel, but that’s where the similarity ends. Where Fan-In immediately forwards all inputs, Chord waits to act until it’s received at least one value from each of its source channels, then it packages all the values into a slice and sends it on the Destination channel.

In fact, the first half of this pattern, which retrieves values from Sources and sends them into an intermediate channel for processing, looks and acts exactly like a Fan-In, to the extent that Chord can reasonably be considered an extension of Fan-In.

It's the second half, though, that makes this pattern interesting: it keeps track of which channels have sent messages since the last output and packages the most recent input from each channel into a slice to be sent to Destination.

Sample code

Chord is probably the most conceptually complex pattern described in this chapter (which is why I saved it for last). However, as mentioned previously, the first half of the following example is taken directly from “Fan-In”.

Since you’re (hopefully) a Fan-in expert at this point, you’ll probably see that pattern in the first half or so of the following Chord function:

func Chord(sources ...<-chan int) <-chan []int {
    type input struct {                    // Used to send inputs
        idx, input int                     // between goroutines
    }

    dest := make(chan []int)                // The output channel

    inputs := make(chan input)              // An intermediate channel

    wg := sync.WaitGroup{}                  // Used to close channels when
    wg.Add(len(sources))                    // all sources are closed

    for i, ch := range sources {            // Start goroutine for each source
        go func(i int, ch <-chan int) {
            defer wg.Done()                 // Notify WaitGroup when ch closes

            for n := range ch {
                inputs <- input{i, n}       // Transfer input to next goroutine
            }
        }(i, ch)
    }

    go func() {
        wg.Wait()                           // Wait for all sources to close
        close(inputs)                       // then close inputs channel
    }()

    go func() {
        res := make([]int, len(sources))    // Slice for incoming inputs
        sent := make([]bool, len(sources))  // Slice to track sent status
        count := len(sources)               // Counter for channels

        for r := range inputs {
            res[r.idx] = r.input            // Update incoming input

            if !sent[r.idx] {               // First input from channel?
                sent[r.idx] = true
                count--
            }

            if count == 0 {
                c := make([]int, len(res))  // Copy and send inputs slice
                copy(c, res)
                dest <- c

                count = len(sources)        // Reset counter
                clear(sent)                 // Clear status tracker
            }
        }

        close(dest)
    }()

    return dest
}

Looking closely, you’ll notice that Chord is composed of three key sections.

The first is a for loop that spawns a number of goroutines, one for each channel in sources. When any of these receives a value from its respective channel, it sends both its channel’s index and the value received to the intermediate inputs channel to be picked up by another goroutine. When a sources channel closes, the goroutine watching it ends as well, decrementing the WaitGroup counter in the process.

The next section is a goroutine whose only job is to wait for the WaitGroup. When all of the sources channels close, the lock is released, and the intermediate inputs channel is closed.

Finally, the last goroutine’s purpose is to read input values from the intermediate inputs channel and determine whether all channels in sources have sent a value yet. Whenever a value is received from an input channel, the res slice is updated at the appropriate index with the value. If this is the first time receiving a value from a particular channel, the count counter is decremented.

Once count hits zero, a few things happen: the res slice (which contains all of the most recent inputs from all the sources channels) is copied and sent into the destination channel, the count counter is reset, and the sent slice is cleared. With that, Chord is reset for another read cycle.

Note how we copy the res slice before sending it. This is a safety measure made necessary by the fact that a slice value contains a pointer to a shared underlying array, so if we didn't make a copy in this way, subsequent inputs could cause the slice to be modified while it's being used elsewhere.

Now let’s see Chord in action:

func main() {
    ch1 := make(chan int)
    ch2 := make(chan int)
    ch3 := make(chan int)

    go func() {
        for n := 1; n <= 4; n++ {
            ch1 <- n
            ch1 <- n * 2              // Writing twice to ch1!
            ch2 <- n
            ch3 <- n
            time.Sleep(time.Second)
        }

        close(ch1)                    // Closing all input channels
        close(ch2)                    // causes res to be closed
        close(ch3)                    // as well
    }()

    res := Chord(ch1, ch2, ch3)

    for s := range res {              // Read results
        fmt.Println(s)
    }
}

The first thing that this function does is create some input channels, to which we'll send some data to see how Chord behaves.

It then spawns a goroutine that sends some data into the channels, closing them all when it’s done sending. Note the double send to the ch1 channel.

Running this produces a nice, consistent output and terminates normally:

[2 1 1]
[4 2 2]
[6 3 3]
[8 4 4]

This output tells us two things. First, the program terminated normally, so we know that the res channel indeed closed when all of the input channels closed, breaking the final for loop as expected. Second, the value in position 0 of each output slice reflects the second value sent into ch1, highlighting that this Chord implementation responds with the most recently sent values.

Summary

This chapter covered quite a few very interesting—and useful—idioms. There are probably many more,11 but these are the ones I felt were most important, either because they’re somehow practical in a directly applicable way or because they showcase some interesting feature of the Go language. Often both.

In Chapter 5, we’ll move on to the next level, taking some of the things we discussed in Chapters 3 and 4 and putting them into practice by building a simple key-value store from scratch!

1 Spoken August 1979. Attested to by Vicki Almstrum, Tony Hoare, Niklaus Wirth, Wim Feijen, and Rajeev Joshi. In Pursuit of Simplicity: A Symposium Honoring Professor Edsger Wybe Dijkstra, May 12–13, 2000.

2 L (yes, his legal name is L) is a brilliant and fascinating human being. Look him up some time.

3 Pun unavoidable.

4 Erich Gamma et al., Design Patterns: Elements of Reusable Object-Oriented Software, 1st ed. (Addison-Wesley Professional, 1994).

5 If you prefer boring names.

6 While these terms are often used interchangeably, they can also have shades of meaning depending on their context. I know. Please don’t write me any angry letters about this.

7 See the article, “Share Memory by Communicating”, The Go Blog.

8 If you’re into that kind of thing.

9 Java developers, you know what I’m talking about.

10 Which is why we call it a “worker pool” instead of a “thread pool.”

11 Did I leave out your favorite? Let me know, and I’ll try to include it in the next edition!

Chapter 5. Building a Cloud Native Service

Life was simple before World War II. After that, we had systems.1

Grace Hopper, OCLC Newsletter (1987)

In this chapter, our real work finally begins.

We’ll weave together many of the materials discussed throughout Part II to create a service that’ll serve as the jumping-off point for the remainder of the book. As we go forward, we’ll iterate on what we begin here, adding layers of functionality with each chapter until, at the conclusion, we have ourselves a true cloud native application.

Naturally, it won’t be “production ready”—it’ll be missing some important security features, for example—but it’ll still provide a solid foundation to build upon.

But what do we build?

Let’s Build a Service!

Okay. So. We need something to build.

It should be conceptually simple, straightforward enough to implement in its most basic form but nontrivial and amenable to scaling and distributing. Something that we can iteratively refine over the remainder of the book. I put a lot of thought into this, considering different ideas for what our application would be, but in the end the answer was obvious.

We’ll build ourselves a distributed key-value store.

What’s a Key-Value Store?

A key-value store is a kind of nonrelational database that stores data as a collection of key-value pairs. They’re very different from the better-known relational databases, like Microsoft SQL Server or PostgreSQL, that we know and love.2 Where relational databases structure their data among fixed tables with well-defined data types, key-value stores are far simpler, allowing users to associate a unique identifier (the key) with an arbitrary value.

In other words, at its heart, a key-value store is really just a map with a service endpoint, as shown in Figure 5-1. It’s the simplest possible database.

Figure 5-1. A key-value store is essentially a map with a service endpoint.

Requirements

By the end of this chapter, we’re going to have built a simple, nondistributed key-value store that can do all of the things that a (monolithic) key-value store should do:

  • It must be able to store arbitrary key-value pairs.

  • It must provide service endpoints that allow a user to put, get, and delete key-value pairs.

  • It must be able to persistently store its data in some fashion.

Finally, the service must be idempotent. But why?

What Is Idempotence and Why Does It Matter?

The concept of idempotence has its origins in algebra, where it describes particular properties of certain mathematical operations. Fortunately, this isn’t a math book. We’re not going to talk about that (except in the sidebar at the end of this section).

In the programming world, an operation (such as a method or service call) is idempotent if calling it once has the same effect as calling it multiple times. For example, the assignment operation x=1 is idempotent, because x will always be 1 no matter how many times you assign it. Similarly, an HTTP PUT method is idempotent because PUT-ting a resource in a place multiple times won’t change anything: it won’t get any more PUT the second time.3 The operation x+=1, however, is not idempotent, because every time that it’s called, a new state is produced.

Less discussed, but also important, is the related property of nullipotence, in which a function or operation has no side effect at all. For example, the x=1 assignment and an HTTP PUT are idempotent but not nullipotent because they trigger state changes. Assigning a value to itself, such as x=x, is nullipotent because no state has changed as a result of it. Similarly, just reading data, as with an HTTP GET, usually has no side effects, so it’s also nullipotent.

Of course, that’s all very nice in theory, but why should we care in the real world? Well, as it turns out, designing your service methods to be idempotent provides a number of very real benefits:

Idempotent operations are safer

What if you make a request to a service but get no response? You’ll probably try again. But what if it heard you the first time?4 If the service method is idempotent, then no harm done. But if it’s not, you could have a problem. This scenario is more common than you might think. Networks are unreliable. Responses can be delayed; packets can get dropped.

Idempotent operations are generally simpler

Idempotent operations are more self-contained and easier to implement. Compare, for example, an idempotent PUT method that just adds a key-value pair into a backing data store, and a similar but nonidempotent CREATE method that returns an error if the data store already contains the key (see the sketch following this list). The PUT logic is simple: receive request, set value. The CREATE, on the other hand, requires additional layers of error checking and handling, and possibly even distributed locking and coordination among any service replicas, making the service harder to scale.

Idempotent operations are more declarative

Building an idempotent API encourages the designer to focus on end states, encouraging the production of methods that are more declarative: they allow users to tell a service what needs to be done, instead of telling it how to do it. This may seem to be a fine point, but declarative methods—as opposed to imperative methods—free users from having to deal with low-level constructs, allowing them to focus on their goals and minimizing potential side effects.
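To make the “simpler” point concrete, here's a minimal sketch (hypothetical code, not the service we'll build later in this chapter) contrasting an idempotent Put with a nonidempotent Create backed by the same map:

var data = make(map[string]string)

// Put is idempotent: no matter how many times it's called with the same
// arguments, the final state of data is identical.
func Put(key, value string) error {
    data[key] = value
    return nil
}

// Create is not idempotent: a second call with the same key fails, so the
// outcome depends on what has already happened.
func Create(key, value string) error {
    if _, exists := data[key]; exists {
        return fmt.Errorf("key %q already exists", key)
    }

    data[key] = value
    return nil
}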

In fact, idempotence provides such an advantage, particularly in a cloud native context, that some very smart people have gone so far as to assert that it’s a synonym for “cloud native.”5 I don’t think that I’d go quite that far, but I would say that if your service aims to be cloud native, accepting any less than idempotence is asking for trouble.

The Mathematical Definition of Idempotence

The origin of idempotence is in mathematics, where it describes an operation that can be applied multiple times without changing the result beyond the initial application.

In purely mathematical terms: a function f is idempotent if f(f(x)) = f(x) for all x.

For example, taking the absolute value of a number is an idempotent function, because abs(abs(x)) = abs(x) is true for every real number x.

The Eventual Goal

These requirements are quite a lot to chew on, but they represent the absolute minimum for our key-value store to be usable. Later on we’ll add some important basic functionality, like support for multiple users and data encryption in transit. More importantly, though, we’ll introduce techniques and technologies that make the service more scalable, resilient, and generally capable of surviving and thriving in a cruel, uncertain universe.

Generation 0: The Core Functionality

Okay, let's get started. First things first: we'll build the core functions without worrying about things like user requests and persistence; that way they can be called later from whatever web framework we decide to use:

Storing arbitrary key-value pairs

For now we can implement this with a plain map, but what kind? For the sake of simplicity, we’ll limit ourselves to keys and values that are strings, though we can choose to allow arbitrary types later. We’ll just use a map[string]string as our core data structure.

Allow put, get, and delete of key-value pairs

In this initial iteration, we’ll create a bare-bones Go API that we can call to perform the basic modification operations. Partitioning the functionality from the code that uses it will make it easier to test and easier to update in future iterations.

Your Super Simple API

The first thing that we need to do is to create the map that’ll serve as the heart of our key-value store:

var store = make(map[string]string)

Isn’t it a beauty? So simple. Don’t worry, we’ll make it more complicated later.

The first function that we’ll create is, appropriately, Put, which will be used to add records to the store. It does exactly what its name suggests: it accepts key and value strings and puts them into store. Put’s function signature includes an error return, which we’ll need later:

func Put(key, value string) error {
    store[key] = value

    return nil
}

Because we’re making the conscious choice to create an idempotent service, Put doesn’t check to see whether an existing key-value pair is being overwritten, so it’ll happily do so if asked. Multiple executions of Put with the same parameters will have the same result, regardless of any current state.

Now that we’ve established a basic pattern, writing the Get and Delete operations is just a matter of following through:

var ErrorNoSuchKey = errors.New("no such key")

func Get(key string) (string, error) {
    value, ok := store[key]

    if !ok {
        return "", ErrorNoSuchKey
    }

    return value, nil
}

func Delete(key string) error {
    delete(store, key)

    return nil
}

But look carefully: do you see how when Get returns an error, it doesn’t use errors.New? Instead it returns the prebuilt ErrorNoSuchKey error value. But why? This is an example of a sentinel error, which allows the consuming service to determine exactly what type of error it’s receiving and to respond accordingly. For example, it might do something like this:

if errors.Is(err, ErrorNoSuchKey) {
    http.Error(w, err.Error(), http.StatusNotFound)
    return
}

Now that you have your absolute minimal function set (really, really minimal), don’t forget to write tests. We’re not going to do that here, but if you’re feeling anxious to move forward (or lazy—lazy works too), you can grab the code from the GitHub repository created for this book.
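If you'd rather write a quick check of your own before moving on, a minimal sketch of a test for Put and Get might look like the following (the test name and the key and value used here are arbitrary):

package main

import (
    "errors"
    "testing"
)

func TestPutGet(t *testing.T) {
    const key, value = "test-key", "test-value"

    // Put should succeed, and repeating it should be harmless (idempotent).
    if err := Put(key, value); err != nil {
        t.Fatalf("Put failed: %v", err)
    }

    // Get should return exactly what was stored.
    got, err := Get(key)
    if err != nil {
        t.Fatalf("Get failed: %v", err)
    }
    if got != value {
        t.Errorf("expected %q, got %q", value, got)
    }

    // A key that was never stored should produce the sentinel error.
    if _, err := Get("no-such-key"); !errors.Is(err, ErrorNoSuchKey) {
        t.Errorf("expected ErrorNoSuchKey, got %v", err)
    }
}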

Generation 1: The Monolith

Now that you have a minimally functional key-value API, you can begin building a service around it. There are a few different options for how to do this. First, you could use something like GraphQL, and while there are some decent open source packages out there that you could use, you don't have the kind of complex data landscape that would necessitate it.

Second, you could use remote procedure calls (RPCs), which are supported by the standard net/rpc package, or even gRPC, but these require additional overhead for the client, and again your data just isn’t complex enough to warrant it.

That leaves us with representational state transfer, or REST. REST isn’t a lot of people’s favorite, but it is simple, and it’s perfectly adequate for our needs.

Building an HTTP Server with net/http

Go doesn’t have any web frameworks that are quite as sophisticated or historied as something like Django or Flask. What it does have, however, is a strong set of standard libraries that are perfectly adequate for 80% of use cases. Even better: they’re designed to be extensible, so there are a number of Go web frameworks that extend them.

For now, let’s take a look at the standard HTTP handler idiom in Go, in the form of a “hello, world!” as implemented with net/http:

package main

import (
    "fmt"
    "log"
    "net/http"
)

func helloGoHandler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintln(w, "Hello net/http!")
}

func main() {
    http.HandleFunc("/", helloGoHandler)

    log.Fatal(http.ListenAndServe(":8080", nil))
}

In this example, we define a function, helloGoHandler, which satisfies the definition of an http.HandlerFunc:

type HandlerFunc func(http.ResponseWriter, *http.Request)

The http.ResponseWriter and *http.Request parameters can be used to construct the HTTP response and retrieve the request, respectively. We can use the http.HandleFunc function to register helloGoHandler as the handler function for any request that matches a given pattern (the root path, in this example).

Once you've registered your handlers, you can call ListenAndServe, which starts a server listening on the address given by its first parameter (":8080" in this example). It also accepts a handler as its second parameter; although it's set to nil here, you'll be using it a little later in this chapter.

You'll notice that ListenAndServe is also wrapped in a log.Fatal call. This is because ListenAndServe blocks for as long as the server is running, returning only when the server stops; when it does return, its error is always non-nil, so we always want to log it.

This example is a complete program that can be compiled and run using go run:

$ go run .

Congratulations! You’re now running the world’s tiniest web service. Now go ahead and test it with curl or your favorite web browser:

$ curl http://localhost:8080
Hello net/http!

ListenAndServe, Handlers, and HTTP Request Multiplexers

The http.ListenAndServe function starts an HTTP server with a given address and handler. If the handler is nil, which it usually is when you’re using only the standard net/http library, the DefaultServeMux value is used. What’s a handler? What is DefaultServeMux? What’s a “mux”?

A Handler (not to be confused with a HandlerFunc) is any type that satisfies the Handler interface by providing a ServeHTTP method, defined in the following:

type Handler interface {
    ServeHTTP(ResponseWriter, *Request)
}

Most handler implementations, including the default handler, act as a “mux”—short for “multiplexer”—that can direct incoming signals to one of several possible functions. When a request is received by a service that’s been started by ListenAndServe, it’s the job of a mux to compare the requested URL to the registered patterns and call the handler function associated with the one that matches most closely.

DefaultServeMux is a global value of type ServeMux, which implements the default HTTP multiplexer logic. It’s used when the handler parameter to ListenAndServe is nil.
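If you'd rather not rely on the package-level default, you can create a ServeMux of your own and pass it to ListenAndServe explicitly. The following variation on the earlier "hello, world!" example is behaviorally identical; it just makes the mux explicit:

package main

import (
    "fmt"
    "log"
    "net/http"
)

func helloGoHandler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintln(w, "Hello net/http!")
}

func main() {
    mux := http.NewServeMux()             // Create an explicit request multiplexer
    mux.HandleFunc("/", helloGoHandler)   // Register the handler on it

    // Pass the mux as the handler so DefaultServeMux is never consulted.
    log.Fatal(http.ListenAndServe(":8080", mux))
}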

Building an HTTP Server with gorilla/mux

For many web services, the net/http and DefaultServeMux will be perfectly sufficient. However, sometimes you’ll need the additional functionality provided by a third-party web toolkit. A popular choice is Gorilla, which, while being relatively new and less fully developed and resource-rich than something like Django or Flask, does build on Go’s standard net/http package to provide some excellent enhancements.

The gorilla/mux package—one of several packages provided as part of the Gorilla web toolkit—provides an HTTP request router and dispatcher that can fully replace DefaultServeMux, Go’s default service handler, to add several very useful enhancements to request routing and handling. We’re not going to make use of all of these features just yet, but they’ll come in handy going forward. If you’re curious and/or impatient, however, you can take a look at the gorilla/mux documentation for more information.

Creating a minimal service

Making use of the minimal gorilla/mux router is a matter of adding an import and one line of code: the initialization of a new router, which can be passed to the handler parameter of ListenAndServe:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/gorilla/mux"
)

func helloMuxHandler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintln(w, "Hello gorilla/mux!")
}

func main() {
    r := mux.NewRouter()

    r.HandleFunc("/", helloMuxHandler)

    log.Fatal(http.ListenAndServe(":8080", r))
}

So you should be able to just run this now with go run, right? Give it a try:

$ go run .
main.go:8:2: no required module provides package github.com/gorilla/mux; to add:
        go get github.com/gorilla/mux

It turns out that you can’t (yet). Since you’re now using a third-party package—a package that lives outside the standard library—you’re going to have to use Go modules.

Initializing your project with Go modules

Using a package from outside the standard library requires that you make use of Go modules, which were introduced in Go 1.11 to replace an essentially nonexistent dependency management system with one that's explicit and actually quite painless to use. All of the operations that you'll use for managing your dependencies are handled by a small handful of go mod commands.

The first thing you’re going to have to do is initialize your project. Start by creating a new, empty directory, cd into it, and create (or move) the Go file for your service there. Your directory should now contain only a single Go file.

Next, use the go mod init command to initialize the project. Typically, if a project will be imported by other projects, it’ll have to be initialized with its import path. This is less important for a standalone service like ours, though, so you can be a little more lax about the name you choose. I’ll just use example.com/gorilla; you can use whatever name you like:

$ go mod init example.com/gorilla
go: creating new go.mod: module example.com/gorilla

You’ll now have an (almost) empty module file, go.mod, in your directory:6

$ cat go.mod
module example.com/gorilla

go 1.20

Next, we’ll want to add our dependencies, which can be done automatically using go mod tidy:

$ go mod tidy
go: finding module for package github.com/gorilla/mux
go: found github.com/gorilla/mux in github.com/gorilla/mux v1.8.0

If you check your go.mod file, you’ll see that the dependency (and a version number) has been added:

$ cat go.mod
module example.com/gorilla

go 1.20

require github.com/gorilla/mux v1.8.0

Believe it or not, that’s all you need. If your required dependencies change in the future, you need only run go mod tidy again to rebuild the file. Now try again to start your service:

$ go run .

Since the service runs in the foreground, your terminal should pause. Calling the endpoint with curl from another terminal or browsing to it with a browser should provide the expected response:

$ curl http://localhost:8080
Hello gorilla/mux!

Success! But surely you want your service to do more than print a string, right? Of course you do. Read on!

Variables in URI paths

The Gorilla web toolkit provides a wealth of additional functionality over the standard net/http package, but one feature is particularly interesting right now: the ability to create paths with variable segments, which can even optionally contain a regular expression pattern. Using the gorilla/mux package, a programmer can define variables using the format {name} or {name:pattern}, as follows:

r := mux.NewRouter()
r.HandleFunc("/products/{key}", ProductHandler)
r.HandleFunc("/articles/{category}/", ArticlesCategoryHandler)
r.HandleFunc("/articles/{category}/{id:[0-9]+}", ArticleHandler)

The mux.Vars function conveniently allows the handler function to retrieve the variable names and values as a map[string]string:

vars := mux.Vars(request)
category := vars["category"]

In the next section, we’ll use this ability to allow clients to perform operations on arbitrary keys.

So many matchers

Another feature provided by gorilla/mux is support for matchers, which let the programmer attach additional matching criteria to a route. These include (but aren't limited to) specific domains or subdomains, path prefixes, schemes, headers, and even custom matching functions of your own creation.

Matchers can be applied by calling the appropriate function on the *Route value that’s returned by Gorilla’s HandleFunc implementation. Each matcher function returns the affected *Route, so they can be chained. For example:

r := mux.NewRouter()

r.HandleFunc("/products", ProductsHandler).
    Host("www.example.com").                // Only match a specific domain
    Methods("GET", "PUT").                  // Only match GET+PUT methods
    Schemes("http")                         // Only match the http scheme

See the gorilla/mux documentation for an exhaustive list of available matcher functions.

Building a RESTful Service

Now that you know how to use a couple of different HTTP libraries for Go, you can use one of them to create a RESTful service that a client can interact with to execute a call to the API you built in “Your Super Simple API”. Once you’ve done this, you’ll have implemented the absolute minimal viable key-value store.

Your RESTful methods

We’re going to do our best to follow RESTful conventions, so our API will consider every key-value pair to be a distinct resource with a distinct URI that can be operated upon using the various HTTP methods. Each of our three basic operations—Put, Get, and Delete—will be requested using a different HTTP method that we summarize in Table 5-1.

The URI for your key-value pair resources will have the form /v1/key/{key}, where {key} is the unique key string. The v1 segment indicates the API version. This convention is often used to manage API changes, and while this practice is by no means required or universal, it can be helpful for managing the impact of future changes that could break existing client integrations.

Table 5-1. Your RESTful methods

Functionality                           Method   Possible statuses
Put a key-value pair into the store     PUT      201 (Created)
Read a key-value pair from the store    GET      200 (OK), 404 (Not Found)
Delete a key-value pair                 DELETE   200 (OK)

In “Variables in URI paths”, we discussed how to use the gorilla/mux package to register paths that contain variable segments, which will allow you to define a single variable path that handles all keys, mercifully freeing you from having to register every key independently. Then, in “So many matchers”, we discussed how to use route matchers to direct requests to specific handler functions based on various nonpath criteria, which you can use to create a separate handler function for each of the three HTTP methods that you’ll be supporting.

Implementing the create function

Okay, you now have everything you need to get started! So, let’s go ahead and implement the handler function for the creation of key-value pairs. This function has to be sure to satisfy several requirements:

  • It must only match HTTP PUT requests to /v1/key/{key}.

  • It must call the Put function from “Your Super Simple API”.

  • It must respond with a status 201 (Created) when a key-value pair is created.

  • It must respond to unexpected errors with a 500 (Internal Server Error).

All of the previous requirements are implemented in the following putHandler function. Note how the key’s value is retrieved from the request body:

// putHandler expects to be called with a PUT request for
// the "/v1/key/{key}" resource.
func putHandler(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)                     // Retrieve "key" from the request
    key := vars["key"]

    value, err := io.ReadAll(r.Body)        // The request body has our value
    if err != nil {                         // If we have an error, report it
        http.Error(w,
            err.Error(),
            http.StatusInternalServerError)
        return
    }

    defer r.Body.Close()

    err = Put(key, string(value))           // Store the value as a string
    if err != nil {                         // If we have an error, report it
        http.Error(w,
            err.Error(),
            http.StatusInternalServerError)
        return
    }

    w.WriteHeader(http.StatusCreated)       // Success! Return StatusCreated
}

Now that you have your “key-value put” handler function, you can register it with your Gorilla request router for the desired path and method:

func main() {
    r := mux.NewRouter()

    // Register putHandler as the handler function for PUT
    // requests matching "/v1/key/{key}"
    r.HandleFunc("/v1/key/{key}", putHandler).Methods("PUT")

    log.Fatal(http.ListenAndServe(":8080", r))
}

Now that you have your service put together, you can run it using go run . from the project root. Do that now, and send it some requests to see how it responds.

First, use our old friend curl to send a PUT containing a short snippet of text to the /v1/key/key-a endpoint to create a key named key-a with a value of Hello, key-value store!:

$ curl -X PUT -d 'Hello, key-value store!' -v http://localhost:8080/v1/key/key-a

Executing this command provides the following output. The complete output was quite wordy, so I’ve selected the relevant bits for readability:

> PUT /v1/key/key-a HTTP/1.1
< HTTP/1.1 201 Created

The first portion, prefixed with a greater-than sign (>), shows some details about the request. The last portion, prefixed with a less-than sign (<), gives details about the server response.

In this output you can see that you did in fact transmit a PUT to the /v1/key/key-a endpoint and that the server responded with a 201 Created—as expected.

What if you hit the /v1/key/key-a endpoint with an unsupported GET method? Since we included only a PUT matcher, you should receive an error message:

$ curl -X GET -v http://localhost:8080/v1/key/key-a
> GET /v1/key/key-a HTTP/1.1
< HTTP/1.1 405 Method Not Allowed

Indeed, the server responds with a 405 Method Not Allowed error. Everything seems to be working correctly.

Implementing the read function

Now that your service has a fully functioning PUT method, it sure would be nice if you could read your data back! For our next trick, we’re going to implement the GET functionality, which has the following requirements:

  • It must only match HTTP GET requests for /v1/key/{key}.

  • It must call the Get function from “Your Super Simple API”.

  • It must respond with a 404 (Not Found) when a requested key doesn’t exist.

  • It must respond with the requested value and a status 200 if the key exists.

  • It must respond to unexpected errors with a 500 (Internal Server Error).

All of the previous requirements are implemented in the getHandler function. Note how the value is written to w—the handler function’s http.ResponseWriter parameter—after it’s retrieved from the key-value API:

func getHandler(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)                   // Retrieve "key" from the request
    key := vars["key"]

    value, err := Get(key)                // Get value for key
    if errors.Is(err, ErrorNoSuchKey) {
        http.Error(w, err.Error(), http.StatusNotFound)
        return
    }
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    fmt.Fprint(w, value)                 // Write the value to the response
}

And now that you have the GET handler function, you can register it with the request router alongside the PUT handler:

func main() {
    r := mux.NewRouter()

    r.HandleFunc("/v1/key/{key}", putHandler).Methods("PUT")
    r.HandleFunc("/v1/key/{key}", getHandler).Methods("GET")

    log.Fatal(http.ListenAndServe(":8080", r))
}

Let’s fire up your newly improved service and see if it works:

$ curl -X PUT -d 'Hello, key-value store!' -v http://localhost:8080/v1/key/key-a
> PUT /v1/key/key-a HTTP/1.1
< HTTP/1.1 201 Created

$ curl -v http://localhost:8080/v1/key/key-a
> GET /v1/key/key-a HTTP/1.1
< HTTP/1.1 200 OK
Hello, key-value store!

It works! Now that you can get your values back, you’re able to test for idempotence as well. Let’s repeat the requests and make sure that you get the same results:

$ curl -X PUT -d 'Hello, key-value store!' -v http://localhost:8080/v1/key/key-a
> PUT /v1/key/key-a HTTP/1.1
< HTTP/1.1 201 Created

$ curl -v http://localhost:8080/v1/key/key-a
> GET /v1/key/key-a HTTP/1.1
< HTTP/1.1 200 OK
Hello, key-value store!

You do! But what if you want to overwrite the key with a new value? Will the subsequent GET have the new value? You can test that by changing the value sent by your curl slightly to be Hello, again, key-value store!:

$ curl -X PUT -d 'Hello, again, key-value store!' \
    -v http://localhost:8080/v1/key/key-a
> PUT /v1/key/key-a HTTP/1.1
< HTTP/1.1 201 Created

$ curl -v http://localhost:8080/v1/key/key-a
> GET /v1/key/key-a HTTP/1.1
< HTTP/1.1 200 OK
Hello, again, key-value store!

As expected, the GET responds back with a 200 status and your new value.

Finally, to complete your method set, you’ll just need to create a handler for the DELETE method. I’ll leave that as an exercise, though. Enjoy!
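If you'd like to check your work afterward, one minimal sketch of such a handler, following the same pattern as putHandler and getHandler, might look like the following (the name deleteHandler is just a suggestion):

// deleteHandler expects to be called with a DELETE request for
// the "/v1/key/{key}" resource.
func deleteHandler(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)                     // Retrieve "key" from the request
    key := vars["key"]

    err := Delete(key)                      // Delete is idempotent, so deleting
    if err != nil {                         // a missing key isn't an error
        http.Error(w,
            err.Error(),
            http.StatusInternalServerError)
        return
    }

    w.WriteHeader(http.StatusOK)            // 200, as listed in Table 5-1
}

You'd register it alongside the others with r.HandleFunc("/v1/key/{key}", deleteHandler).Methods("DELETE").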

Making Your Data Structure Concurrency-Safe

Maps in Go are not atomic and are not safe for concurrent use. Unfortunately, you now have a service designed to handle concurrent requests that’s wrapped around exactly such a map.

So what do you do? Well, typically when a programmer has a data structure that needs to be read from and written to by concurrently executing goroutines, they’ll use something like a mutex—also known as a lock—to act as a synchronization mechanism. By using a mutex this way, you can ensure that exactly one process has exclusive access to a particular resource.

Fortunately, you don’t need to implement this yourself:7 as you may recall from “Mutexes”, Go’s sync package provides exactly what you need in the form of sync.RWMutex. The following statement defines a LockableMap struct that contains our trusty map alongside an embedded sync.RWMutex:

type LockableMap struct {
    sync.RWMutex
    m map[string]string
}

var myMap = LockableMap {
    m: make(map[string]string),
}

The myMap value has all of the methods from the embedded sync.RWMutex, allowing you to use the Lock method to take the write lock when you want to write to the myMap map:

myMap.Lock()                         // Take a write lock
defer myMap.Unlock()                 // Release the write lock

myMap.m["some_key"] = "some_value"

If another process has either a read or write lock, then Lock will block until that lock is released.

Similarly, to read from the map, you use the RLock method to take the read lock:

myMap.RLock()                        // Take a read lock
defer myMap.RUnlock()                // Release the read lock

value := myMap.m["some_key"]

fmt.Println("some_key:", value)

Read locks are less restrictive than write locks in that any number of processes can simultaneously take read locks. However, RLock will block until any open write locks are released.

Note how in both examples we place the mutex unlocks in defer statements immediately below the locks so that they’re executed whenever and however the function returns. Although not strictly required, this is a common approach to handling unlocks in simple functions.

Integrating a read-write mutex into your application

Now that we’ve reviewed how to use a sync.RWMutex, you can go back and work it into the code you created for “Your Super Simple API”.

Using the LockableMap we defined previously, you can re-create your store value just like you did the myMap value,8 so that it contains both a map and an embedded sync.RWMutex:

var store = LockableMap {
    m: make(map[string]string),
}

Now that you have your store struct, you can update the Get and Put functions to establish the appropriate locks. Because Get needs to only read the store map, it’ll use RLock to take a read lock only. Put, on the other hand, needs to modify the map, so it’ll need to use Lock to take a write lock:

func Get(key string) (string, error) {
    store.RLock()
    defer store.RUnlock()

    value, ok := store.m[key]

    if !ok {
        return "", ErrorNoSuchKey
    }

    return value, nil
}

func Put(key string, value string) error {
    store.Lock()
    defer store.Unlock()

    store.m[key] = value

    return nil
}

The pattern here is clear: if a function needs to modify the map (Put, Delete), it’ll use Lock to take a write lock. If it only needs to read existing data (Get), it’ll use RLock to take a read lock. We leave the creation of the Delete function as an exercise for the reader.9
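For reference, a minimal version of Delete with its write lock in place might look like the following sketch:

func Delete(key string) error {
    store.Lock()              // Deleting mutates the map, so take a write lock
    defer store.Unlock()

    delete(store.m, key)

    return nil
}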

Warning

Don’t forget to release your locks, and make sure you’re releasing the correct lock type!

Generation 2: Persisting Resource State

One of the stickiest challenges with cloud native applications is how to handle state.

There are various techniques for distributing the state of application resources between multiple service instances, but for now we’re just going to concern ourselves with the minimum viable product and consider just two ways of maintaining the state of our application:

  • In “Storing State in a Transaction Log File”, you’ll use a file-based transaction log to maintain a record of every time a resource is modified. If a service crashes, is restarted, or otherwise finds itself in an inconsistent state, a transaction log allows a service to reconstruct the original state by replaying the transactions.

  • In “Storing State in an External Database”, you’ll use an external database instead of a file to store a transaction log. It might seem redundant to use a database given the nature of the application you’re building, but externalizing data into another service designed specifically for that purpose is a common means of sharing state between service replicas and providing resilience.

You may be wondering why you’d bother using a transaction log strategy to record events when you could just use a database to store the values themselves. The transaction log approach actually has some advantages when you intend to store your data in memory most of the time. For example, it reduces the amount of write locking necessary when making mutating changes, which can dramatically increase throughput, and if the server crashes, the log can be used to reconstruct all of its state, including its history.

Using a transaction log also affords us an opportunity: given that we’re creating two different implementations with similar functionality—a transaction log written both to a file and to a database—we can describe our functionality with an interface that both implementations can satisfy. This could come in quite handy, especially if we want to be able to seamlessly choose between either implementation according to our needs.

Application State Versus Resource State

The term stateless is used a lot in the context of cloud native architecture, and state is often regarded as a Very Bad Thing. But what is state, exactly, and why is it so bad? Does an application have to be completely devoid of any kind of state to be "cloud native"? The answer is… well, it's complicated.

First, it’s important to draw a distinction between application state and resource state. These are very different things, but they’re easily confused:

Application state

Server-side data about the application or how it’s being used by a client. A common example is client session tracking, such as to associate them with their access credentials or some other application context.

Resource state

The current state of a resource within a service at any point of time. It’s the same for every client and has nothing to do with the interaction between client and server.

Any state introduces technical challenges, but application state is particularly problematic because it often forces services to depend on server affinity—sending each of a user’s requests to the same server where their session was initiated—resulting in a more complex application and making it hard to destroy or replace service replicas.

State and statelessness will be discussed in quite a bit more detail in “State and Statelessness”.

What’s a Transaction Log?

We’ve been talking about using transaction logs for a bit, but we haven’t yet really defined what a transaction log actually is.

In its most basic form, a transaction log is just a log file that maintains a history of any mutating changes executed by the data store. If a service crashes, is restarted, or otherwise finds itself in an inconsistent state, a transaction log makes it possible to replay the transactions to reconstruct the service’s functional state and history.

Transaction logs are commonly used by database management systems to provide a degree of data resilience against crashes or hardware failures. However, while this technique can get quite sophisticated, we’ll be keeping ours pretty straightforward.

Your transaction log format

Before we get to the code, let’s decide what the transaction log should contain.

We’ll assume that our transaction log will be read only when your service is restarted or otherwise needs to recover its state and that it’ll be read from top to bottom, sequentially replaying each event. It follows that your transaction log will consist of an ordered list of mutating events. For speed and simplicity, a transaction log is also generally append-only, so when a record is deleted from your key-value store, for example, a deletion event is recorded in the log.

Given everything we’ve discussed so far, each recorded transaction event will need to include the following attributes:

Sequence number

A unique record ID, in monotonically increasing order.

Event type

A descriptor of the type of action taken; this can be PUT or DELETE.

Key

A string containing the key affected by this transaction.

Value

If the event is a PUT, the value of the transaction.

Nice and simple. Hopefully we can keep it that way.

Your transaction logger interface

The first thing we’re going to do is define a TransactionLogger interface. For now, we’re going to define only two methods: WritePut and WriteDelete, which will be used to write PUT and DELETE events, respectively, to a transaction log:

type TransactionLogger interface {
    WriteDelete(key string)
    WritePut(key, value string)
}

You’ll no doubt want to add other methods later, but we’ll cross that bridge when we come to it. For now, let’s focus on the first implementation and add additional methods to the interface as we come across them.

Storing State in a Transaction Log File

The first approach we’ll take is to use the most basic (and most common) form of transaction log, which is just an append-only log file that maintains a history of mutating changes executed by the data store. This file-based implementation has some tempting pros but some pretty significant cons as well:

Pros:

No downstream dependency

There’s no dependency on an external service that could fail or that we can lose access to.

Technically straightforward

The logic isn’t especially sophisticated. We can be up and running quickly.

Cons:

Harder to scale

You’ll need some additional way to distribute your state between nodes when you want to scale out.

Unconstrained growth

These logs have to be stored on disk, so you can’t let them grow forever. You’ll need some way of periodically compacting them.

Prototyping your transaction logger

Before we get to the code, let’s make some design decisions. First, for simplicity, the log will be written in plain text; a binary, compressed format might be more time- and space-efficient, but we can always optimize later. Second, each entry will be written on its own line; this will make it much easier to read the data later.

Finally, each transaction will include the four fields listed in “Your transaction log format”, delimited by tabs.
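To make that concrete, here are a few example entries, assuming the numeric event type values we'll define shortly with iota (DELETE serialized as 1 and PUT as 2). The fields are separated by tab characters, and a DELETE entry simply leaves the value field empty:

1   2   key-a   first-value
2   2   key-b   another-value
3   1   key-a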

Now that we’ve established these fundamentals, let’s go ahead and define a type, FileTransactionLogger, which will implicitly implement the TransactionLogger interface described in “Your transaction logger interface” by defining WritePut and WriteDelete methods for writing PUT and DELETE events, respectively, to the transaction log:

type FileTransactionLogger struct {
    // Something, something, fields
}

func (l *FileTransactionLogger) WritePut(key, value string) {
    // Something, something, logic
}

func (l *FileTransactionLogger) WriteDelete(key string) {
    // Something, something, logic
}

Clearly these methods are a little light on detail, but we’ll flesh them out soon!

Defining the event type

Thinking ahead, we probably want the WritePut and WriteDelete methods to operate asynchronously. You could implement that using some kind of events channel that some concurrent goroutine could read from and perform the log writes. That sounds like a nice idea, but if you’re going to do that, you’ll need some kind of internal representation of an “event.”

That shouldn’t give you too much trouble. Incorporating all of the fields that we listed in “Your transaction log format” gives something like the Event struct, in the following:

type Event struct {
    Sequence  uint64                // A unique record ID
    EventType EventType             // The action taken
    Key       string                // The key affected by this transaction
    Value     string                // The value of the transaction
}

Seems straightforward, right? Sequence is the sequence number, and Key and Value are self-explanatory. But… what's an EventType? Well, it's whatever we say it is, and we're going to say that it's a constant that we can use to refer to the different types of events, which we've already established will include one each for PUT and DELETE events.

One way to do this might be to just assign some constant byte values, like this:

const (
    EventDelete byte = 1
    EventPut    byte = 2
)

Sure, this would work, but Go actually provides a better (and more idiomatic) way: iota. iota is a predefined value that can be used in a constant declaration to construct a series of related constant values.

Declaring Constants with Iota

When used in a constant declaration, iota represents successive untyped integer constants that can be used to construct a set of related constants. Its value resets to zero at the start of each const block and increments by one with each line of the declaration (whether or not the iota identifier is actually referenced on that line).

An iota can also be operated upon. We demonstrate this in the following by using it in multiplication, left bit-shift, and division operations:

const (
    a = 42 * iota           // iota == 0; a == 0
    b = 1 << iota           // iota == 1; b == 2
    c = 3                   // iota == 2; c == 3 (iota increments anyway!)
    d = iota / 2            // iota == 3; d == 1
)

Because iota is itself an untyped number, you can use it to make typed assignments without explicit type casts. You can even assign iota to a float64 value:

const (
    u         = iota * 42   // iota == 0; u == 0 (untyped integer constant)
    v float64 = iota * 42   // iota == 1; v == 42.0 (float64 constant)
)

The iota keyword allows implicit repetition, which makes it trivial to create arbitrarily long sets of related constants, like we do in the following with the numbers of bytes in various digital units:

type ByteSize uint64

const (
    _           = iota                  // iota == 0; ignore the zero value
    KB ByteSize = 1 << (10 * iota)      // iota == 1; KB == 2^10
    MB                                  // iota == 2; MB == 2^20
    GB                                  // iota == 3; GB == 2^30
    TB                                  // iota == 4; TB == 2^40
    PB                                  // iota == 5; PB == 2^50
)

Using the iota technique, you don’t have to manually assign values to constants. Instead, you can do something like the following:

type EventType byte

const (
    _                     = iota         // iota == 0; ignore the zero value
    EventDelete EventType = iota         // iota == 1
    EventPut                             // iota == 2; implicitly repeat
)

This might not be a big deal when you have only two constants like we have here, but it can come in handy when you have a number of related constants and don’t want to be bothered manually keeping track of which value is assigned to what.

Warning

If you’re using iota as enumerations in serializations (as we are here), take care to only append to the list, and don’t reorder or insert values in the middle, or you won’t be able to deserialize later.

We now have an idea of what the TransactionLogger will look like, as well as the two primary write methods. We’ve also defined a struct that describes a single event, and created a new EventType type and used iota to define its legal values. Now we’re finally ready to get started.

Implementing your FileTransactionLogger

We’ve made some progress. We know we want a TransactionLogger implementation with methods for writing events, and we’ve created a description of an event in code. But what about the FileTransactionLogger itself?

The service will want to keep track of the physical location of the transaction log, so it makes sense to have an os.File attribute representing that. It’ll also need to remember the last sequence number that was assigned so it can correctly set each event’s sequence number; that can be kept as an unsigned 64-bit integer attribute. That’s great, but how will the FileTransactionLogger actually write the events?

One possible approach would be to keep an io.Writer that the WritePut and WriteDelete methods can operate on directly, but that would be a single-threaded approach, so unless you explicitly execute them in goroutines, you may find yourself spending more time in I/O than you’d like. Alternatively, you could create a buffer from a slice of Event values that are processed by a separate goroutine. Definitely warmer, but too complicated.

After all, why go through all of that work when we can just use standard buffered channels? Taking our own advice, we end up with a FileTransactionLogger and Write methods that look like the following:

type FileTransactionLogger struct {
    events       chan<- Event       // Write-only channel for sending events
    errors       <-chan error       // Read-only channel for receiving errors
    lastSequence uint64             // The last used event sequence number
    file         *os.File           // The location of the transaction log
}

func (l *FileTransactionLogger) WritePut(key, value string) {
    l.events <- Event{EventType: EventPut, Key: key, Value: value}
}

func (l *FileTransactionLogger) WriteDelete(key string) {
    l.events <- Event{EventType: EventDelete, Key: key}
}

func (l *FileTransactionLogger) Err() <-chan error {
    return l.errors
}

You now have your FileTransactionLogger, which has a uint64 value that's used to track the last-used event sequence number, a write-only channel into which Event values are sent, and WritePut and WriteDelete methods that send Event values into that channel.

But it looks like there might be a part left over: there’s an Err method there that returns a receive-only error channel. There’s a good reason for that. We’ve already mentioned that writes to the transaction log will be done concurrently by a goroutine that receives events from the events channel. While that makes for a more efficient write, it also means that WritePut and WriteDelete can’t simply return an error when they encounter a problem, so we provide a dedicated error channel to communicate errors instead.
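For illustration only (we won't wire this into the service just yet), a caller holding a *FileTransactionLogger value named logger could check for an asynchronous write failure without blocking, along these lines:

select {
case err := <-logger.Err():
    log.Printf("transaction log write failed: %v", err)
default:
    // No pending error; carry on.
}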

Creating a new FileTransactionLogger

If you’ve followed along so far, you may have noticed that none of the attributes in the FileTransactionLogger have been initialized. If you don’t fix this issue, it’s going to cause some problems. Go doesn’t have constructors, though, so to solve this you need to define a construction function, which you’ll call, for lack of a better name,10 NewFileTransactionLogger:

func NewFileTransactionLogger(filename string) (TransactionLogger, error) {
    file, err := os.OpenFile(filename, os.O_RDWR|os.O_APPEND|os.O_CREATE, 0755)
    if err != nil {
        return nil, fmt.Errorf("cannot open transaction log file: %w", err)
    }

    return &FileTransactionLogger{file: file}, nil
}
Warning

See how NewFileTransactionLogger returns a pointer type, but its return list specifies the decidedly nonpointy TransactionLogger interface type?

The reason for this is tricksy: a pointer to a concrete type, like *FileTransactionLogger, can satisfy an interface and be returned directly as a value of that interface type. A pointer to an interface type, on the other hand, is almost never what you want, so the interface itself is the appropriate return type.

NewFileTransactionLogger calls the os.OpenFile function to open the file specified by the filename parameter. You’ll notice it accepts several flags that have been binary OR-ed together to set its behavior:

os.O_RDWR

Opens the file in read/write mode.

os.O_APPEND

Any writes to this file will append, not overwrite.

os.O_CREATE

If the file doesn’t exist, creates it.

There are quite a few of these flags besides the three we use here. Take a look at the os package documentation for a full listing.

We now have a construction function that ensures that the transaction log file is correctly created. But what about the channels? We could create the channels and spawn a goroutine with NewFileTransactionLogger, but that feels like we’d be adding too much mysterious functionality. Instead, we’ll create a Run method.

Appending entries to the transaction log

As of yet, there’s nothing reading from the events channel, which is less than ideal. What’s worse, the channels aren’t even initialized. Let’s change this by creating a Run method, shown in the following:

func (l *FileTransactionLogger) Run() {
    events := make(chan Event, 16)              // Make an events channel
    l.events = events

    errors := make(chan error, 1)               // Make an errors channel
    l.errors = errors

    go func() {
        for e := range events {                 // Retrieve the next Event

            l.lastSequence++                    // Increment sequence number

            _, err := fmt.Fprintf(              // Write the event to the log
                l.file,
                "%d\t%d\t%s\t%s\n",
                l.lastSequence, e.EventType, e.Key, e.Value)

            if err != nil {
                errors <- err
                return
            }
        }
    }()
}
Note

This implementation is incredibly basic. It won’t even correctly handle entries with whitespace or multiple lines!

The Run function does several important things.

First, it creates a buffered events channel. Using a buffered channel in our TransactionLogger means that calls to WritePut and WriteDelete won’t block as long as the buffer isn’t full. This lets the consuming service handle short bursts of events without being slowed by disk I/O. If the buffer does fill up, then the write methods will block until the log writing goroutine catches up.

Second, it creates an errors channel, which is also buffered, that we’ll use to signal any errors that arise in the goroutine that’s responsible for concurrently writing events to the transaction log. The buffer value of 1 allows it to send an error in a nonblocking manner.

Finally, it starts a goroutine that retrieves Event values from the events channel and uses the fmt.Fprintf function to write them to the transaction log. If fmt.Fprintf returns an error, the goroutine sends the error to the errors channel and halts.

Using a bufio.Scanner to play back file transaction logs

Even the best transaction log is useless if it’s never read.11 But how do you do that?

You'll need to read the log from the beginning and parse each line; bufio.Scanner and fmt.Sscanf let you do this with minimal fuss.

Channels, our dependable friends, will let your service stream the results to a consumer as it retrieves them. This might be starting to feel routine, but stop for a second to appreciate it. In most other languages, the path of least resistance here would be to read in the entire file, stash it in an array, and finally loop over that array to replay the events. But Go’s convenient concurrency primitives make it almost trivially easy to stream the data to the consumer in a much more space- and memory-efficient way.

The ReadEvents method12 demonstrates this:

func (l *FileTransactionLogger) ReadEvents() (<-chan Event, <-chan error) {
    scanner := bufio.NewScanner(l.file)     // Create a Scanner for l.file
    outEvent := make(chan Event)            // An unbuffered Event channel
    outError := make(chan error, 1)         // A buffered error channel

    go func() {
        var e Event

        defer close(outEvent)               // Close the channels when the
        defer close(outError)               // goroutine ends

        for scanner.Scan() {
            line := scanner.Text()

            if err := fmt.Sscanf(line, "%d\t%d\t%s\t%s",
                &e.Sequence, &e.EventType, &e.Key, &e.Value); err != nil {

                outError <- fmt.Errorf("input parse error: %w", err)
                return
            }

            // Sanity check! Are the sequence numbers in increasing order?
            if l.lastSequence >= e.Sequence {
                outError <- fmt.Errorf("transaction numbers out of sequence")
                return
            }

            l.lastSequence = e.Sequence     // Update last used sequence #

            outEvent <- e                   // Send the event along
        }

        if err := scanner.Err(); err != nil {
            outError <- fmt.Errorf("transaction log read failure: %w", err)
            return
        }
    }()

    return outEvent, outError
}

The ReadEvents method can really be said to be two functions in one: the outer function initializes the file reader and creates and returns the event and error channels. The inner function runs concurrently to ingest the file contents line by line and send the results to the channels.

Interestingly, the file attribute of FileTransactionLogger is of type *os.File, which has a Read method that satisfies the io.Reader interface. Read is fairly low-level, but, if you wanted to, you could actually use it to retrieve the data. The bufio package, however, gives us a better way: the Scanner type, which provides a convenient means for reading newline-delimited lines of text. We can get a new Scanner value by passing an io.Reader—an *os.File in this case—to bufio.NewScanner.

Each call to the scanner.Scan method advances it to the next line, returning false (and breaking the for loop) if there aren’t any lines left. A subsequent call to scanner.Text returns the line.

Note the defer statements in the inner anonymous goroutine. These ensure that the output channels are always closed. Because a defer is scoped to the function in which it's declared, these get called when the goroutine's function ends, not when ReadEvents returns.

You may recall from “Formatting I/O in Go” that the fmt.Sscanf function provides a simple (but sometimes simplistic) means of parsing strings. Like the other methods in the fmt package, the expected format is specified using a format string with various “verbs” embedded: two digits (%d) and two strings (%s), separated by tab characters (\t). Conveniently, fmt.Sscanf lets you pass in pointers to the target values for each verb, which it can update directly.13

Tip

Go’s format strings have a long history dating back to C’s printf and scanf, but they’ve been adopted by many other languages over the years, including C++, Java, Perl, PHP, Ruby, and Scala. You may already be familiar with them, but if you’re not, take a break now to look at the fmt package documentation.

At the end of each loop, the last-used sequence number is updated to the value that was just read, and the event is sent on its merry way. A minor point: note how the same Event value is reused on each iteration rather than creating a new one. This is possible because the outEvent channel is sending struct values, not pointers to struct values, so it already provides copies of whatever value we send into it.

Finally, the function checks for Scanner errors. The Scan method returns only a single Boolean value, which is convenient for looping but leaves no room for an error return. Instead, when Scan encounters an error, it returns false and the scanner exposes the error via its Err method.

Your transaction logger interface (redux)

Now that you've implemented a fully functional FileTransactionLogger, it's time to look back and see which of the new methods we want to incorporate into the TransactionLogger interface. It looks like there are quite a few we'd like any implementation to provide, leaving us with the following final form for the TransactionLogger interface:

type TransactionLogger interface {
    WriteDelete(key string)
    WritePut(key, value string)
    Err() <-chan error

    ReadEvents() (<-chan Event, <-chan error)

    Run()
}

Now that that’s settled, you can finally start integrating the transaction log into your key-value service.

Initializing the FileTransactionLogger in your web service

The FileTransactionLogger is now complete! All that’s left to do now is to integrate it with your web service. The first step of this is to add a new function that can create a new TransactionLogger value, read in and replay any existing events, and call Run.

First, let’s add a TransactionLogger reference to our service.go. You can call it logger because naming is hard:

var logger TransactionLogger

Now that you have that detail out of the way, you can define your initialization function, which can look like the following:

func initializeTransactionLog() error {
    var err error

    logger, err = NewFileTransactionLogger("transaction.log")
    if err != nil {
        return fmt.Errorf("failed to create event logger: %w", err)
    }

    events, errors := logger.ReadEvents()

    e := Event{}
    ok := true

    for ok && err == nil {
        select {
        case err, ok = <-errors:            // Retrieve any errors
        case e, ok = <-events:
            switch e.EventType {
            case EventDelete:               // Got a DELETE event!
                err = Delete(e.Key)
            case EventPut:                  // Got a PUT event!
                err = Put(e.Key, e.Value)
            }
        }
    }

    logger.Run()

    return err
}

This function starts as you’d expect: it calls NewFileTransactionLogger and assigns it to logger.

The next part is more interesting: it calls logger.ReadEvents and replays the results based on the Event values received from it. This is done by looping over a select with cases for both the events and errors channels. Note how the cases in the select use the format case foo, ok = <-ch. The bool returned by a channel read in this way will be false if the channel in question has been closed, setting the value of ok and terminating the for loop.

If we get an Event value from the events channel, we call either Delete or Put as appropriate; if we get an error from the errors channel, err will be set to a non-nil value and the for loop will be terminated.

Integrating FileTransactionLogger with your web service

Now that the initialization logic is put together, all that’s left to do to complete the integration of the TransactionLogger is to add exactly three function calls into the web service. This is fairly straightforward, so we won’t walk through it here. But, briefly, you’ll need to add the following:

  • initializeTransactionLog to the main method

  • logger.WriteDelete to deleteHandler

  • logger.WritePut to putHandler

We’ll leave the actual integration as an exercise for the reader.14
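As a hint, a sketch of putHandler after the integration, in which the only new line is the logger.WritePut call, might look like this:

func putHandler(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)
    key := vars["key"]

    value, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w,
            err.Error(),
            http.StatusInternalServerError)
        return
    }

    defer r.Body.Close()

    err = Put(key, string(value))
    if err != nil {
        http.Error(w,
            err.Error(),
            http.StatusInternalServerError)
        return
    }

    logger.WritePut(key, string(value))     // Record the PUT in the transaction log

    w.WriteHeader(http.StatusCreated)
}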

Future improvements

We may have completed a minimal viable implementation of our transaction logger, but it still has plenty of issues and opportunities for improvement, such as the following:

  • There aren’t any tests.

  • There’s no Close method to gracefully close the transaction log file.

  • The service can close with events still in the write buffer: events can get lost.

  • Keys and values aren’t encoded in the transaction log: multiple lines or whitespace will fail to parse correctly.

  • The sizes of keys and values aren't bounded: huge keys or values can be added, filling the disk.

  • The transaction log is written in plain text: it will take up more disk space than it probably needs to.

  • The log retains records of deleted values forever: it will grow indefinitely.

All of these would be impediments in production. I encourage you to take the time to consider—or even implement—solutions to one or more of these points.

Storing State in an External Database

Databases, and data, are at the core of many, if not most, business and web applications, so it makes perfect sense that Go includes a standard interface for SQL (or SQL-like) databases in its core libraries.

But does it make sense to use a SQL database to back our key-value store? After all, isn’t it redundant for our data store to just depend on another data store? Yes, certainly. But externalizing a service’s data into another service designed specifically for that purpose—a database—is a common pattern that allows state to be shared between service replicas and provides data resilience. Besides, the point is to show how you might interact with a database, not to design the perfect application.

In this section, you’ll be implementing a transaction log backed by an external database and satisfying the TransactionLogger interface, just as you did in “Storing State in a Transaction Log File”. This would certainly work, and even have some benefits as mentioned previously, but it comes with some trade-offs:

Pros:

Externalizes application state

Less need to worry about distributed state and closer to “cloud native.”

Easier to scale

Not having to share data between replicas makes scaling out easier (but not easy).

Cons:

Introduces a bottleneck

What if you had to scale way up? What if all replicas had to read from the database at once?

Introduces an upstream dependency

Creates a dependency on another resource that might fail.

Requires initialization

What if the Transactions table doesn’t exist?

Increases complexity

Yet another thing to manage and configure.

Working with databases in Go

Databases, particularly SQL and SQL-like databases, are everywhere. You can try to avoid them, but if you’re building applications with some kind of data component, you’ll at some point have to interact with one.

Fortunately for us, the creators of the Go standard library provided the database/sql package, which provides an idiomatic and lightweight interface around SQL (and SQL-like) databases. In this section we’ll briefly demonstrate how to use this package and point out some of the gotchas along the way.

Among the most ubiquitous members of the database/sql package is sql.DB: Go’s primary database abstraction and entry point for creating statements and transactions, executing queries, and fetching results. While it doesn’t, as its name might suggest, map to any particular concept of a database or schema, it does do quite a lot of things for you, including, but not limited to, negotiating connections with your database and managing a database connection pool.

We’ll get into how you create a sql.DB value in a bit. But first, we have to talk about database drivers.

Importing a database driver

While the sql.DB type provides a common interface for interacting with a SQL database, it depends on database drivers to implement the specifics for particular database types. At the time of this writing, there are more than 60 drivers listed in the Go repository.

In the following section we’ll be working with a Postgres database, so we’ll use the third-party lib/pq Postgres driver implementation.

To load a database driver, anonymously import the driver package by aliasing its package qualifier to _ (underscore). This triggers any initializers the package might have while also informing the compiler that you have no intention of directly using it:

import (
    "database/sql"
    _ "github.com/lib/pq"   // Anonymously import the driver package
)

Now that you’ve done this, you’re finally ready to create your sql.DB value and access the database.

Implementing your PostgresTransactionLogger

In “Your transaction logger interface (redux)”, we presented the complete TransactionLogger interface, which provides a standard definition for a generic transaction log. You might recall that it defined methods for starting the logger, as well as reading and writing events to the log, as detailed here:

type TransactionLogger interface {
    WriteDelete(key string)
    WritePut(key, value string)
    Err() <-chan error

    ReadEvents() (<-chan Event, <-chan error)

    Run()
}

Our goal now is to create a database-backed implementation of TransactionLogger. Fortunately, much of our work is already done for us. Looking back at “Implementing your FileTransactionLogger” for guidance, it looks like we can create a PostgresTransactionLogger using very similar logic.

Starting with the WritePut, WriteDelete, and Err methods, you can do something like the following:

type PostgresTransactionLogger struct {
    events       chan<- Event   // Write-only channel for sending events
    errors       <-chan error   // Read-only channel for receiving errors
    db           *sql.DB        // The database access interface
}

func (l *PostgresTransactionLogger) WritePut(key, value string) {
    l.events <- Event{EventType: EventPut, Key: key, Value: value}
}

func (l *PostgresTransactionLogger) WriteDelete(key string) {
    l.events <- Event{EventType: EventDelete, Key: key}
}

func (l *PostgresTransactionLogger) Err() <-chan error {
    return l.errors
}

If you compare this to the FileTransactionLogger, it’s clear that the code so far is nearly identical. All we’ve really changed is:

  • Renaming (obviously) the type to PostgresTransactionLogger.

  • Swapping the *os.File for a *sql.DB.

  • Removing lastSequence; you can let the database handle the sequencing.

Creating a new PostgresTransactionLogger

That’s all well and good, but we still haven’t talked about how we create the *sql.DB value. I know how you must feel. The suspense is definitely killing me, too.

Much like we did in the NewFileTransactionLogger function, we’re going to create a construction function for our PostgresTransactionLogger, which we’ll call (quite predictably) NewPostgresTransactionLogger. However, instead of opening a file like NewFileTransactionLogger, it’ll establish a connection with the database, returning an error if it fails.

There’s a little bit of a wrinkle, though. Namely, that the setup for a Postgres connection takes a lot of parameters. At the bare minimum we need to know the host where the database lives, the name of the database, and the username and password. One way to deal with this would be to create a function like the following, which simply accepts a bunch of string parameters:

func NewPostgresTransactionLogger(host, dbName, user, password string)
    (TransactionLogger, error) { ... }

This approach is pretty ugly, though. Plus, what if you wanted to add a new parameter later? Do you chunk it onto the end of the parameter list, breaking any code that’s already using this function? Maybe worse, the parameter order isn’t clear without looking at the documentation.

There has to be a better way. So, instead of this potential horror show, you can create a small helper struct:

type PostgresDBParams struct {
    dbName   string
    host     string
    user     string
    password string
}

Unlike the big-bag-of-strings approach, this struct is small, readable, and easily extended. To use it, you can create a PostgresDBParams variable and pass it to your construction function. Here’s what that looks like:

logger, err = NewPostgresTransactionLogger(PostgresDBParams{
    host:     "localhost",
    dbName:   "kvs",
    user:     "test",
    password: "hunter2",
})

The new construction function looks something like the following:

func NewPostgresTransactionLogger(config PostgresDBParams) (TransactionLogger,
    error) {

    connStr := fmt.Sprintf("host=%s dbname=%s user=%s password=%s",
        config.host, config.dbName, config.user, config.password)

    db, err := sql.Open("postgres", connStr)
    if err != nil {
        return nil, fmt.Errorf("failed to open db: %w", err)
    }

    err = db.Ping()                 // Test the database connection
    if err != nil {
        return nil, fmt.Errorf("failed to open db connection: %w", err)
    }

    logger := &PostgresTransactionLogger{db: db}

    exists, err := logger.verifyTableExists()
    if err != nil {
        return nil, fmt.Errorf("failed to verify table exists: %w", err)
    }
    if !exists {
        if err = logger.createTable(); err != nil {
            return nil, fmt.Errorf("failed to create table: %w", err)
        }
    }

    return logger, nil
}

This does quite a few things, but fundamentally it’s not very different from NewFileTransactionLogger.

The first thing it does is to use sql.Open to retrieve a *sql.DB value. You’ll note that the connection string passed to sql.Open contains several parameters; the lib/pq package supports many more than the ones shown here. See the package documentation for a complete listing.

Many drivers, including lib/pq, don’t actually create a connection to the database immediately, so the function uses db.Ping to force the driver to establish and test a connection.

Finally, it creates the PostgresTransactionLogger and uses that to verify that the transactions table exists, creating it if necessary. Without this step, the PostgresTransactionLogger will essentially assume that the table already exists, and will fail if it doesn’t.

You may have noticed that the verifyTableExists and createTable methods aren’t implemented here. This is entirely intentional. As an exercise, you’re encouraged to dive into the database/sql docs and think about how you might go about doing that. If you’d prefer not to, you can find an implementation in the GitHub repository that comes with this book.
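If you’d like a nudge in the right direction, here’s one possible sketch of those two methods. To be clear, this is not the repository’s implementation: the information_schema query and the column types are assumptions on my part, chosen only to match the INSERT and SELECT statements used later in this section.

func (l *PostgresTransactionLogger) verifyTableExists() (bool, error) {
    var exists bool

    // information_schema.tables lists every table visible to the current
    // user, so asking whether "transactions" appears in it answers our
    // question with a single boolean.
    err := l.db.QueryRow(
        `SELECT EXISTS (
            SELECT 1 FROM information_schema.tables
            WHERE table_name = 'transactions')`).Scan(&exists)

    return exists, err
}

func (l *PostgresTransactionLogger) createTable() error {
    // BIGSERIAL provides the auto-incrementing sequence column that
    // ReadEvents relies on to order events during playback.
    _, err := l.db.Exec(`CREATE TABLE transactions (
        sequence   BIGSERIAL PRIMARY KEY,
        event_type SMALLINT,
        key        TEXT,
        value      TEXT)`)

    return err
}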

You now have a construction function that establishes a connection to the database and returns a newly created TransactionLogger. But, once again, you need to get things started. For that, you need to implement the Run method that will create the events and errors channels and spawn the event ingestion goroutine.

Using db.Exec to execute a SQL INSERT

For the FileTransactionLogger, you implemented a Run method that initialized the channels and started the goroutine responsible for writing to the transaction log.

The PostgresTransactionLogger is similar. However, instead of appending a line to a file, the new logger uses db.Exec to execute a SQL INSERT to accomplish the same result:

func (l *PostgresTransactionLogger) Run() {
    events := make(chan Event, 16)              // Make an events channel
    l.events = events

    errors := make(chan error, 1)               // Make an errors channel
    l.errors = errors

    go func() {                                 // The INSERT query
        query := `INSERT INTO transactions
            (event_type, key, value)
            VALUES ($1, $2, $3)`

        for e := range events {                 // Retrieve the next Event

            _, err := l.db.Exec(                // Execute the INSERT query
                query,
                e.EventType, e.Key, e.Value)

            if err != nil {
                errors <- err
            }
        }
    }()
}

This implementation of the Run method does almost exactly what its FileTransactionLogger equivalent does: it creates the buffered events and errors channels, and it starts a goroutine that retrieves Event values from our events channel and writes them to the transaction log.

Unlike the FileTransactionLogger, which appends to a file, this goroutine uses db.Exec to execute a SQL query that appends a row to the transactions table. The numbered arguments ($1, $2, $3) in the query are placeholder query parameters, which must be satisfied when the db.Exec function is called.

Using db.Query to play back postgres transaction logs

In “Using a bufio.Scanner to play back file transaction logs”, you used a bufio.Scanner to read previously written transaction log entries.

The Postgres implementation won’t be quite as straightforward, but the principle is the same: you point at the top of your data source and read until you hit the bottom:

func (l *PostgresTransactionLogger) ReadEvents() (<-chan Event, <-chan error) {
    outEvent := make(chan Event)                // An unbuffered events channel
    outError := make(chan error, 1)             // A buffered errors channel

    go func() {
        defer close(outEvent)                   // Close the channels when the
        defer close(outError)                   // goroutine ends

        query := `SELECT sequence, event_type, key, value
                  FROM transactions
                  ORDER BY sequence`

        rows, err := l.db.Query(query)          // Run query; get result set
        if err != nil {
            outError <- fmt.Errorf("sql query error: %w", err)
            return
        }

        defer rows.Close()                      // This is important!

        e := Event{}                            // Create an empty Event

        for rows.Next() {                       // Iterate over the rows

            err = rows.Scan(                    // Read the values from the
                &e.Sequence, &e.EventType,      // row into the Event.
                &e.Key, &e.Value)

            if err != nil {
                outError <- fmt.Errorf("error reading row: %w", err)
                return
            }

            outEvent <- e                       // Send e to the channel
        }

        err = rows.Err()
        if err != nil {
            outError <- fmt.Errorf("transaction log read failure: %w", err)
        }
    }()

    return outEvent, outError
}

All of the interesting (or at least new) bits are happening in the goroutine. Let’s break them down:

  • query is a string that contains the SQL query. The query in this code requests four columns: sequence, event_type, key, and value.

  • db.Query sends query to the database and returns values of type *sql.Rows and error.

  • We defer a call to rows.Close. Failing to do so can lead to connection leakage!

  • rows.Next lets us iterate over the rows; it returns false if there are no more rows or if there’s an error.

  • rows.Scan copies the columns in the current row into the values we pointed at in the call.

  • We send event e to the output channel.

  • Err returns the error, if any, that may have caused rows.Next to return false.

Initializing the PostgresTransactionLogger in your web service

The PostgresTransactionLogger is pretty much complete. Now let’s go ahead and integrate it into the web service.

Fortunately, since we already had the FileTransactionLogger in place, we need to change only one line:

logger, err = NewFileTransactionLogger("transaction.log")

which becomes…

logger, err = NewPostgresTransactionLogger(PostgresDBParams{
    host:     "localhost",
    dbName:   "db-name",
    user:     "db-user",
    password: "db-password",
})

Yup. That’s it. Really.

Because this represents a complete implementation of the TransactionLogger interface, everything else stays exactly the same. You can interact with the PostgresTransactionLogger using exactly the same methods as before.

Future improvements

As with the FileTransactionLogger, the PostgresTransactionLogger represents a minimal viable implementation of a transaction logger and has lots of room for improvement. Some of the areas for improvement include, but are certainly not limited to, the following:

  • We assume that the database and table exist, and we’ll get errors if they don’t.

  • The connection string is hardcoded. Even the password.

  • There’s still no Close method to clean up open connections (see the sketch following this list).

  • The service can close with events still in the write buffer: events can get lost.

  • The log retains records of deleted values forever: it will grow indefinitely.

All of these would be (major) impediments in production. I encourage you to take the time to consider—or even implement—solutions to one or more of these points.
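As a starting point for that missing Close method, here’s a minimal sketch of one possible implementation. It’s an assumption on my part rather than the book’s reference implementation, and note that it doesn’t wait for the writing goroutine to drain any buffered events before releasing the database handle:

// Close stops the logger from accepting new events and releases the
// underlying database connection pool.
func (l *PostgresTransactionLogger) Close() error {
    if l.events != nil {
        close(l.events)     // Terminates the range loop in Run's goroutine
    }

    return l.db.Close()
}

You would also need to add Close to the TransactionLogger interface (or expose it some other way) before the web service could call it during shutdown.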

Generation 3: Implementing Transport Layer Security

Security. Love it or hate it, the simple fact is that security is a critical feature of any application, cloud native or otherwise. Sadly, security is often treated as an afterthought, with potentially catastrophic consequences.

There are rich tools and established security best practices for traditional environments, but this is less true of cloud native applications, which tend to take the form of several small, often ephemeral, microservices. While this architecture provides significant flexibility and scalability benefits, it also creates a distinct opportunity for would-be attackers: every communication between services is transmitted across a network, opening it up to eavesdropping and manipulation.

The subject of security can take up an entire book of its own,15 so we’ll focus on one common technique: encryption. Encrypting data “in transit” (or “on the wire”) is commonly used to guard against eavesdropping and message manipulation, and any language worth its salt—including, and especially, Go—will make it relatively low-lift to implement.

Transport Layer Security

Transport Layer Security (TLS) is a cryptographic protocol that’s designed to provide communications security over a computer network. Its use is ubiquitous, and it’s applicable to virtually any kind of internet communication. You’re most likely familiar with it (and perhaps using it right now) in the form of HTTPS—also known as HTTP over TLS—which uses TLS to encrypt exchanges over HTTP.

TLS encrypts messages using public-key cryptography, which we’ll discuss in some more depth in “Asymmetric encryption”. The short version, however, is that both parties possess their own key pair, which includes a public key that’s freely given out and a private key that’s known only to its owner, illustrated in Figure 5-2. Anybody can use a public key to encrypt a message, but it can be decrypted only with the corresponding private key. Using this protocol, two parties that wish to communicate privately can exchange their public keys, which can then be used to secure all subsequent communications in a way that can be read by only the intended recipient, who holds the corresponding private key.16

Figure 5-2. One half of a public key exchange.

Certificates, certificate authorities, and trust

If TLS had a motto, it would be “Trust but verify.” Actually, scratch the trust part. Verify everything.

It’s not enough for a service to simply provide a public key.17 Instead, every public key is associated with a digital certificate, an electronic document used to prove the key’s ownership. A certificate shows that the owner of the public key is, in fact, the named subject (owner) of the certificate, and describes how the key may be used. This allows the recipient to compare the certificate against various “trusts” to decide whether it will accept it as valid.

First, the certificate must be digitally signed and authenticated by a certificate authority (CA), a trusted entity that issues digital certificates.

Second, the subject of the certificate has to match the domain name of the service the client is trying to connect to. Among other things, this helps to ensure that the certificates you’re receiving are valid and haven’t been swapped out by a man-in-the-middle.

Only then will your conversation proceed.

Warning

Web browsers or other tools will usually allow you to choose to proceed if a certificate can’t be validated. If you’re using self-signed certificates for development, for example, that might make sense. But generally speaking, heed the warnings.

Private Key and Certificate Files

TLS (and its predecessor, Secure Sockets Layer [SSL]) has been around long enough18 that you’d think we’d have settled on a single key container format, but you’d be wrong. Web searches for “key file format” will return a virtual zoo of file extensions: .csr, .key, .pkcs12, .der, and .pem just to name a few.

Of these, however, .pem seems to be the most common. It also happens to be the format that’s most easily supported by Go’s net/http package, so that’s what we’ll be using.

Privacy enhanced mail (PEM) file format

Privacy Enhanced Mail (PEM) is a common certificate container format, usually stored in .pem files, but .cer or .crt (for certificates) and .key (for public or private keys) are common too. Conveniently, PEM is also base64-encoded and therefore viewable in a text editor, and even safe to paste into (for example) the body of an email message.19

Often, .pem files will come in a pair, representing a complete key pair:

cert.pem

The server certificate (including the CA-signed public key)

key.pem

A private key, not to be shared

Going forward, we’ll assume that your keys are in this configuration. If you don’t yet have any keys and need to generate some for development purposes, instructions are available in multiple places online. If you already have a key file in some other format, converting it is beyond the scope of this book. However, the internet is a magical place, and there are plenty of tutorials online for converting between common key formats.
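If you just need a throwaway self-signed pair to follow along with the next section, an openssl invocation along the following lines should work on most systems; the subject name and validity period here are arbitrary choices for local development, not recommendations:

$ openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
    -keyout key.pem -out cert.pem -subj "/CN=localhost"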

Securing Your Web Service with HTTPS

Now that we’ve established that security should be taken seriously, and that communication via TLS is a bare-minimum first step toward securing our communications, how do we go about doing that?

One way might be to put a reverse proxy in front of our service that can handle HTTPS requests and forward them to our key-value service as HTTP, but unless the two are colocated on the same server, we’re still sending unencrypted messages over a network. Plus, the additional service adds some architectural complexity that we might prefer to avoid. Perhaps we can have our key-value service serve HTTPS?

Actually, we can. Going all the way back to “Building an HTTP Server with net/http”, you might recall that the net/http package contains a function, ListenAndServe, which, in its most basic form, looks something like the following:

func main() {
    http.HandleFunc("/", helloGoHandler)            // Add a root path handler

    http.ListenAndServe(":8080", nil)               // Start the HTTP server
}

In this example, we call HandleFunc to add a handler function for the root path, followed by ListenAndServe to start the service listening and serving. For the sake of simplicity, we ignore any errors returned by ListenAndServe.

There aren’t a lot of moving parts here, which is kind of nice. In keeping with that philosophy, the designers of net/http kindly provided a TLS-enabled variant of the ListenAndServe function that we’re familiar with:

func ListenAndServeTLS(addr, certFile, keyFile string, handler Handler) error

As you can see, ListenAndServeTLS looks and feels almost exactly like ListenAndServe except that it has two extra parameters: certFile and keyFile. If you happen to have certificate and private key PEM files, then serving HTTPS-encrypted connections is just a matter of passing the names of those files to ListenAndServeTLS:

http.ListenAndServeTLS(":8080", "cert.pem", "key.pem", nil)
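In the context of our service, a minimal sketch might look like the following (assuming the usual net/http and log imports), reusing the helloGoHandler from the earlier example but, unlike before, checking the returned error:

func main() {
    http.HandleFunc("/", helloGoHandler)        // Add a root path handler

    // ListenAndServeTLS blocks until the server stops; if it can't read the
    // certificate or key files (or can't bind the port), it returns an error.
    err := http.ListenAndServeTLS(":8080", "cert.pem", "key.pem", nil)
    if err != nil {
        log.Fatal(err)
    }
}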

This sure looks super convenient, but does it work? Let’s fire up our service (using self-signed certificates) and find out.

Dusting off our old friend curl, let’s try inserting a key-value pair. Note that we use the https scheme in our URL instead of http:

$ curl -X PUT -d 'Hello, key-value store!' -v https://localhost:8080/v1/key/key-a
* SSL certificate problem: self signed certificate
curl: (60) SSL certificate problem: self signed certificate

Well, that didn’t go as planned. As we mentioned in “Certificates, certificate authorities, and trust”, TLS expects any certificates to be signed by a CA. It doesn’t like self-signed certificates.

Fortunately, we can turn this safety check off in curl with the appropriately named --insecure flag:

$ curl -X PUT -d 'Hello, key-value store!' --insecure -v \
    https://localhost:8080/v1/key/key-a
* SSL certificate verify result: self signed certificate (18), continuing anyway.
> PUT /v1/key/key-a HTTP/2
< HTTP/2 201

We got a sternly worded warning, but it worked!

Transport Layer Summary

We’ve covered quite a lot in just a few pages. The topic of security is vast, and there’s no way we’re going to do it justice, but we were able to at least introduce TLS and how it can serve as one relatively low-cost, high-return component of a larger security strategy.

We were also able to demonstrate how to implement TLS in a Go net/http web service, and we saw how, as long as we have valid certificates, we can secure a service’s communications without a great deal of effort.

Containerizing Your Key-Value Store

A container is a lightweight operating-system-level virtualization20 abstraction that provides processes with a degree of isolation, both from their host and from other containers. The concept of the container has been around since at least 2000, but it was the introduction of Docker in 2013 that made containers accessible to the masses and brought containerization into the mainstream.

Importantly, containers are not virtual machines:21 they don’t use hypervisors, and they share the host’s kernel rather than carrying their own guest operating system. Instead, their isolation is provided by a clever application of several Linux kernel features like cgroups and kernel namespaces. In fact, it can be reasonably argued that containers are nothing more than a convenient abstraction, and that there’s actually no such thing as a container.

Even though they’re not virtual machines,22 containers do provide some virtual-machine-like benefits, the most obvious of which is that they allow an application, its dependencies, and much of its environment to be packaged within a single distributable artifact—a container image—that can be executed on any suitable host.

The benefits don’t stop there, however. In case you need them, here’s a few more:

Agility

Unlike virtual machines that are saddled with an entire operating system and a colossal memory footprint, containers boast image sizes in the megabyte range and startup times that measure in milliseconds. This is particularly true of Go applications, whose binaries have few, if any, dependencies.

Isolation

This was hinted at previously but bears repeating. Containers virtualize CPU, memory, storage, and network resources at the operating system level, providing developers with a sandboxed view of the OS that is logically isolated from other applications.

Standardization and productivity

Containers let you package an application alongside its dependencies, such as specific versions of language runtimes and libraries, as a single distributable binary, making your deployments reproducible, predictable, and versionable.

Orchestration

Sophisticated container orchestration systems like Kubernetes provide a huge number of benefits. By containerizing your application(s), you’re taking the first step toward being able to take advantage of them.

These are just four (very) motivating arguments.23 In other words, containerization is super, super useful.

For this book, we’ll be using Docker to build our container images. Alternative build tools exist, but Docker is the most common containerization tool in use today, and the syntax for its build file—termed a Dockerfile—lets you use familiar shell scripting commands and utilities.

That being said, this isn’t a book about Docker or containerization, so our discussion will mostly be limited to the bare basics of using Docker with Go. If you’re interested in learning more, I suggest picking up a copy of Docker: Up & Running: Shipping Reliable Containers in Production by Sean P. Kane and Karl Matthias (O’Reilly).

Docker (Absolute) Basics

Before we continue, it’s important to draw a distinction between container images and the containers themselves. A container image is essentially an executable binary that contains your application runtime and its dependencies. When an image is run, the resulting process is the container. An image can be run many times to create multiple (essentially) identical containers.

Over the next few pages we’ll create a simple Dockerfile and build and execute an image. If you haven’t already, please take a moment and install the Docker Community Edition (CE).

The Dockerfile

Dockerfiles are essentially build files that describe the steps required to build an image. A very minimal—but complete—example is demonstrated in the following:

# The parent image. At build time, this image will be pulled and
# subsequent instructions run against it.
FROM ubuntu:22.04

# Update apt cache and install nginx without an approval prompt.
RUN apt-get update && apt-get install --yes nginx

# Tell Docker this image's containers will use port 80.
EXPOSE 80

# Run Nginx in the foreground. This is important: without a
# foreground process the container will automatically stop.
CMD ["nginx", "-g", "daemon off;"]

As you can see, this Dockerfile includes four different commands:

FROM

Specifies a base image that this build will extend, and will typically be a common Linux distribution, such as ubuntu or alpine. At build time, this image is pulled and run, and the subsequent commands applied to it.

RUN

Will execute any commands on top of the current image. The result will be used for the next step in the Dockerfile.

EXPOSE

Tells Docker which port(s) the container will use. See “What’s the Difference Between Exposing and Publishing Ports?” for more information on exposing ports.

CMD

The command to execute when the container is started. There can be only one CMD in a Dockerfile.

These are four of the most common of the many available Dockerfile instructions. For a complete listing, see the official Dockerfile reference.

As you may have inferred, the previous example starts with an existing Linux distribution image (Ubuntu 22.04) and installs nginx, which is executed when the container is started.

By convention, the filename of a Dockerfile is Dockerfile. Go ahead and create a new file named Dockerfile and paste the previous example into it.

Building your container image

Now that you have a simple Dockerfile, you can build it! Make sure that you’re in the same directory as your Dockerfile and enter the following:

$ docker build --tag my-nginx .

This will instruct Docker to begin the build process. If everything works correctly (and why wouldn’t it?), you’ll see the output as Docker downloads the parent image and runs the apt commands. This will probably take a minute or two the first time you run it.

At the end, you’ll see a line that looks something like the following: Successfully tagged my-nginx:latest.

If you do, you can use the docker images command to verify that your image is now present. You should see something like the following:

$ docker images
REPOSITORY      TAG         IMAGE ID           CREATED               SIZE
my-nginx        latest      64ea3e21a388       29 seconds ago        159MB
ubuntu          22.04       f63181f19b2f       3 weeks ago           72.9MB

If all has gone as planned, you’ll see at least two images listed: our parent image ubuntu:22.04 and your own my-nginx:latest image. Next step: running the service container!

What Does latest Mean?

Note the name of the image. What’s latest? That’s a simple question with a complicated answer. Docker images have two name components: a repository and a tag.

The repository name component can include the domain name of a host where the image is stored (or will be stored). For example, the repository name for an image hosted by FooCorp might be something like docker.foo.com/ubuntu. If no repository URL is evident, then the image is either 100% local (like the image we just built) or lives in Docker Hub.

The tag component is intended as a unique label for a particular version of an image and often takes the form of a version number. The latest tag is a default tag name that’s added by many docker operations if no tag is specified.

Using latest in production is generally considered a bad practice, however, because its contents can change—sometimes significantly—with unfortunate consequences.
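To avoid that, you can be explicit about tags at build time, or attach additional tags to an existing image. The version numbers here are just placeholders:

$ docker build --tag my-nginx:1.0.1 .          # Build with an explicit tag
$ docker tag my-nginx:1.0.1 my-nginx:stable    # Add a second tag to the same image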

Running your container image

Now that you’ve built your image, you can run it. For that, you’ll use the docker run command:

$ docker run --detach --publish 8080:80 --name nginx my-nginx
4cce9201f48436f6261ede5749b421e4f65d43cb67e8e7aa8439dc0f06afe0f3

This instructs Docker to run a container using your my-nginx image. The --detach flag will cause the container to be run in the background. Using --publish 8080:80 instructs Docker to publish port 8080 on the host bridged to port 80 in the container, so any connections to localhost:8080 will be forwarded to the container’s port 80. Finally, the --name nginx flag specifies a name for the container; without this, a randomly generated name will be assigned instead.

You’ll notice that running this command presents us with a very cryptic line containing 64 hexadecimal characters. This is the container ID, which can be used to refer to the container in lieu of its name.

What’s the Difference Between Exposing and Publishing Ports?

The difference between “exposing” and “publishing” container ports can be confusing, but there’s actually an important distinction:

Exposing ports

A way of clearly documenting—both to users and to Docker—which ports a container uses. It does not map or open any ports on the host. Ports can be exposed using the EXPOSE keyword in the Dockerfile or the --expose flag to docker run.

Publishing ports

Tells Docker to make a container port reachable from outside by mapping it to a port on the host. Ports can be published using the --publish or --publish-all flags to docker run, which create firewall rules that map a container port to a port on the host.
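As a concrete illustration (the image name here is a placeholder), the two look like this on the command line:

# Documents that the container uses port 8080, but opens nothing on the host:
$ docker run --detach --expose 8080 my-image

# Maps host port 8080 to container port 8080:
$ docker run --detach --publish 8080:8080 my-image

# Publishes every exposed port to a random high-numbered host port:
$ docker run --detach --publish-all my-image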

Verifying your running container

To verify that your container is running and is doing what you expect, you can use the docker ps command to list all running containers. This should look something like the following:

$ docker ps
CONTAINER ID    IMAGE       STATUS          PORTS                   NAMES
4cce9201f484    my-nginx    Up 4 minutes    0.0.0.0:8080->80/tcp    nginx

The preceding output has been edited for brevity (you may notice that it’s missing the COMMAND and CREATED columns). Your output should include seven columns:

CONTAINER ID

The first 12 characters of the container ID. You’ll notice it matches the output of your docker run.

IMAGE

The name (and tag, if specified) of this container’s source image. No tag implies latest.

COMMAND (not shown)

The command running inside the container. Unless overridden in docker run, this will be the same as the CMD instruction in the Dockerfile. In our case, this will be nginx -g 'daemon off;'.

CREATED (not shown)

How long ago the container was created.

STATUS

The current state of the container (up, exited, restarting, etc.) and how long it’s been in that state. If the state changed, then the time will differ from CREATED.

PORTS

Lists all exposed and published ports (see “What’s the Difference Between Exposing and Publishing Ports?”). In our case, we’ve published 0.0.0.0:8080 on the host and mapped it to 80 on the container so that all requests to host port 8080 are forwarded to container port 80.

NAMES

The name of the container. Docker will randomly set this if it’s not explicitly defined. Two containers with the same name, regardless of state, cannot exist on the same host at the same time. To reuse a name, you’ll first have to delete the unwanted container.

Issuing a request to a published container port

If you’ve gotten this far, then your docker ps output should show a container named nginx that appears to have port 8080 published and forwarding to the container’s port 80. If so, then you’re ready to send a request to your running container. But which port should you query?

Well, the nginx container is listening on port 80. Can you reach that? Actually, no. That port won’t be accessible because it wasn’t published to any network interface during the docker run. Any attempt to connect to an unpublished container port is doomed to failure:

$ curl localhost:80
curl: (7) Failed to connect to localhost port 80: Connection refused

You haven’t published to port 80, but you have published port 8080 and forwarded it to the container’s port 80. You can verify this with our old friend curl or by browsing to localhost:8080. If everything is working correctly, you’ll be greeted with the familiar nginx “Welcome” page illustrated in Figure 5-3.

Figure 5-3. Welcome to nginx!

Running multiple containers

One of the “killer features” of containerization is this: because all of the containers on a host are isolated from one another, it’s possible to run quite a lot of them—even ones that contain different technologies and stacks—on the same host, with each listening on a different published port. For example, if you wanted to run an httpd container alongside your already-running my-nginx container, you could do exactly that.

“But,” you might say, “both of those containers expose port 80! Won’t they collide?”

Great question, to which the answer is, happily, no. In fact, you can actually have as many containers as you want that expose the same port—even multiple instances of the same image—as long as they don’t attempt to publish the same port on the same network interface.

For example, if you want to run the stock httpd image, you can run it by using the docker run command again, as long as you take care to publish to a different port (8081, in this case):

$ docker run --detach --publish 8081:80 --name httpd httpd

If all goes as planned, this will spawn a new container listening on the host at port 8081. Go ahead: use docker ps and curl to test:

$ curl localhost:8081
<html><body><h1>It works!</h1></body></html>

Stopping and deleting your containers

Now that you’ve successfully run your container, you’ll probably need to stop and delete it at some point, particularly if you want to rerun a new container using the same name.

To stop a running container, you can use the docker stop command, passing it either the container name or the first few characters of its container ID (how many characters doesn’t matter, as long as they can be used to uniquely identify the desired container). Using the container ID to stop our nginx container looks like this:

$ docker stop 4cce      # "docker stop nginx" will work too
4cce

The output of a successful docker stop is just the name or ID that we passed into the command. You can verify that your container has actually been stopped using docker ps --all, which will show all containers, not just the running ones:

$ docker ps --all
CONTAINER ID    IMAGE       STATUS                      PORTS    NAMES
4cce9201f484    my-nginx    Exited (0) 3 minutes ago             nginx

If you ran the httpd container earlier, it’ll also be displayed, with a status of Up. You’ll probably want to stop it as well.

As you can see, the status of our nginx container has changed to Exited, followed by its exit code—an exit status of 0 indicates that we were able to execute a graceful shutdown—and how long ago the container entered its current status.

Now that you’ve stopped your container, you can freely delete it.

Tip

You can’t delete a running container or a container image that’s referenced by a container (running or otherwise).

To do this, you use the docker rm (or the newer docker container rm) command to remove your container, again passing it either the container name or the first few characters of the ID of the container you want to delete:

$ docker rm 4cce            # "docker rm nginx" will work too
4cce

As before, the output name or ID indicates success. If you were to go ahead and run docker ps --all again, you shouldn’t see the container listed anymore.

Building Your Key-Value Store Container

Now that you have the basics down, you can start applying them to containerizing our key-value service.

Fortunately, Go’s ability to compile into statically linked binaries makes it especially well suited for containerization. While most other languages have to be built into a parent image that contains the language runtime, like the 505 MB openjdk:22 for Java or the 1.01 GB python:3.11 for Python,24 Go binaries need no runtime at all. They can be placed into a “scratch” image: an image with no parent at all.

Iteration 1: adding your binary to a FROM scratch image

To do this, you’ll need a Dockerfile. The following is a pretty typical example of a Dockerfile for a containerized Go binary:

# We use a "scratch" image, which contains no distribution files. The
# resulting image and containers will have only the service binary.
FROM scratch

# Copy the existing binary from the host.
COPY kvs .

# Copy in your PEM files.
COPY *.pem .

# Tell Docker we'll be using port 8080.
EXPOSE 8080

# Tell Docker to execute this command on a `docker run`.
CMD ["/kvs"]

This Dockerfile is fairly similar to the previous one, except that instead of using apt to install an application from a repository, it uses COPY to retrieve a compiled binary from the filesystem it’s being built on. In this case, it assumes the presence of a binary named kvs. For this to work, we’ll need to build the binary first.

In order for your binary to be usable inside a container, it has to meet a few criteria:

  • It has to be compiled (or cross-compiled) for Linux.

  • It has to be statically linked.

  • It has to be named kvs (because that’s what the Dockerfile is expecting).

We can do all of these things in one command, as follows:

$ CGO_ENABLED=0 GOOS=linux go build -a -o kvs

Let’s walk through what this does:

CGO_ENABLED=0

Tells the compiler to disable cgo and statically link any C bindings. We won’t go into what this is, other than that it enforces static linking, but I encourage you to look at the cgo documentation if you’re curious.

GOOS=linux

Instructs the compiler to generate a Linux binary, cross-compiling if necessary.

-a

Forces the compiler to rebuild any packages that are already up-to-date.

-o kvs

Specifies that the binary will be named kvs.

Executing the command should yield a statically linked Linux binary. This can be verified using the file command:

$ file kvs
kvs: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked,
not stripped

Note

Linux binaries will run in a Linux container, even one running in Docker for macOS or Windows, but won’t run on macOS or Windows otherwise.

Great! Now let’s build the container image and see what comes out:

$ docker build --tag kvs .
...output omitted.

$ docker images
REPOSITORY     TAG        IMAGE ID          CREATED                SIZE
kvs            latest     7b1fb6fa93e3      About a minute ago     6.88MB
openjdk        22         834f87a06187      42 hours ago           505MB
node           20         250e9c100ea2      47 hours ago           1.1GB
python         3.11       c0e63845ae98      5 weeks ago            1.01GB

Less than 7 MB! That’s roughly two orders of magnitude smaller than the relatively massive images for other languages’ runtimes. This can come in quite handy when you’re operating at scale and have to pull your image onto a couple hundred nodes a few times a day.

But does it run? Let’s find out:

$ docker run --detach --publish 8080:8080 kvs
4a05617539125f7f28357d3310759c2ef388f456b07ea0763350a78da661afd3

$ curl -X PUT -d 'Hello, key-value store!' -v http://localhost:8080/v1/key/key-a
> PUT /v1/key/key-a HTTP/1.1
< HTTP/1.1 201 Created

$ curl http://localhost:8080/v1/key/key-a
Hello, key-value store!

Looks like it works!

Now you have a nice, simple Dockerfile that builds an image using a precompiled binary. Unfortunately, that means that you have to make sure that you (or your CI system) rebuild the binary fresh for each Docker build. That’s not too terrible, but it does mean that you need to have Go installed on your build workers. Again, not terrible, but we can certainly do better.

Iteration 2: using a multistage build

In the last section, you created a simple Dockerfile that would take an existing Linux binary and wrap it in a bare-bones “scratch” image. But what if you could perform the entire image build—Go compilation and all—in Docker?

One approach might be to use the golang image as our parent image. If you did that, your Dockerfile could compile your Go code and run the resulting binary at deploy time. This could build on hosts that don’t have the Go compiler installed, but the resulting image would be saddled with an additional 845 MB (the size of the golang:1.20 image) of entirely unnecessary build machinery.
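For comparison, a sketch of that single-stage approach might look something like the following. It builds and works, but the entire Go toolchain ships inside the final image:

# The whole Go toolchain, plus the source, ends up in the final image.
FROM golang:1.20

# Copy in the source and build the binary inside the container.
COPY . /src
WORKDIR /src
RUN CGO_ENABLED=0 GOOS=linux go build -o kvs

EXPOSE 8080

CMD ["/src/kvs"]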

Another approach might be to use two Dockerfiles: one for building the binary and another that containerizes the output of the first build. This is a lot closer to where you want to be, but it requires two distinct Dockerfiles that need to be sequentially built or managed by a separate script.

A better way became available with the introduction of multistage Docker builds, which allow multiple distinct builds—even with entirely different base images—to be chained together so that artifacts from one stage can be selectively copied into another, leaving behind everything you don’t want in the final image. To use this approach, you define a build with two stages: a “build” stage that generates the Go binary and an “image” stage that uses that binary to produce the final image.

To do this, you use multiple FROM statements in your Dockerfile, each defining the start of a new stage. Each stage can be arbitrarily named. For example, you might name your build stage build, as follows:

FROM golang:1.20 as build

Once you have stages with names, you can use the COPY instruction in your Dockerfile to copy any artifact from any previous stage. Your final stage might have an instruction like the following, which copies the file /src/kvs from the build stage to the current working directory:

COPY --from=build /src/kvs .

Putting these things together yields a complete, two-stage Dockerfile:

# Stage 1: Compile the binary in a containerized Golang environment
#
FROM golang:1.20 as build

# Copy the source files from the host
COPY . /src

# Set the working directory to the same place we copied the code
WORKDIR /src

# Build the binary!
RUN CGO_ENABLED=0 GOOS=linux go build -o kvs


# Stage 2: Build the Key-Value Store image proper
#
# Use a "scratch" image, which contains no distribution files
FROM scratch

# Copy the binary from the build container
COPY --from=build /src/kvs .

# If you're using TLS, copy the .pem files too
COPY --from=build /src/*.pem .

# Tell Docker we'll be using port 8080
EXPOSE 8080

# Tell Docker to execute this command on a "docker run"
CMD ["/kvs"]

Now that you have your complete Dockerfile, you can build it in precisely the same way as before. We’ll tag it as multipart this time, though, so that you can compare the two images:

$ docker build --tag kvs:multipart .
...output omitted.

$ docker images
REPOSITORY     TAG           IMAGE ID           CREATED               SIZE
kvs            latest        7b1fb6fa93e3       2 hours ago           6.88MB
kvs            multipart     b83b9e479ae7       4 minutes ago         6.56MB

This is encouraging! You now have a single Dockerfile that can compile your Go code—regardless of whether or not the Go compiler is even installed on the build worker—and that drops the resulting statically linked executable binary into a FROM scratch base to produce an extremely small image containing nothing except your key-value store service.

You don’t need to stop there, though. If you wanted to, you could add other stages as well, such as a test stage that runs any unit tests prior to the build step. We won’t go through that exercise now, however, since it’s more of the same thing, but I encourage you to try it for yourself.
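As a hint of what that might look like, here’s a sketch of a test stage that builds on the build stage; the stage name and the go test invocation are my own choices. Keep in mind that modern Docker builders may skip a stage that nothing copies from, so in CI you would typically build it explicitly with docker build --target test:

# An intermediate stage that runs the unit tests against the same source tree.
FROM build as test

# If any test fails, this RUN fails, and so does the build.
RUN go test ./...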

Externalizing Container Data

Containers are intended to be ephemeral, and any container should be designed and run with the understanding that it can (and will) be destroyed and re-created at any time, taking all of its data with it. To be clear, this is a feature, and is very intentional, but sometimes you might want your data to outlive your containers.

For example, the ability to mount externally managed files directly into the filesystem of an otherwise general-purpose container decouples configuration from images, so you don’t have to rebuild an image just to change its settings. This is a powerful strategy and is probably the most common use case for container data externalization. So common, in fact, that Kubernetes even provides a resource type—ConfigMap—dedicated to it.
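With plain Docker, the usual mechanism for this is a bind mount via the --volume (or --mount) flag. Here’s a sketch with hypothetical paths that maps a host configuration directory read-only into a kvs container:

$ docker run --detach \
    --volume /etc/kvs:/config:ro \
    --publish 8080:8080 kvs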

Similarly, you might want data generated in a container to exist beyond the lifetime of the container. Storing data on the host can be an excellent strategy for warming caches, for example. It’s important to keep in mind, however, one of the realities of cloud native infrastructure: nothing is permanent, not even servers. Don’t store anything on the host that you don’t mind possibly losing forever.

Fortunately, while “pure” Docker limits you to externalizing data directly onto local disk,25 container orchestration systems like Kubernetes provide various abstractions that allow data to survive the loss of a host.

Unfortunately, this is supposed to be a book about Go, so we really can’t cover Kubernetes in detail here. If you haven’t already, I strongly encourage you to take a long look at the excellent Kubernetes documentation and the equally excellent Kubernetes: Up and Running by Brendan Burns et al. (O’Reilly).

Summary

This was a long chapter, and we touched on a lot of different topics. Consider how much we’ve accomplished!

  • Starting from first principles, we designed and implemented a simple monolithic key-value store, using net/http and gorilla/mux to build a RESTful service around functionality provided by a small, independent, and easily testable Go library.

  • We leveraged Go’s powerful interface capabilities to produce two completely different transaction logger implementations, one based on local files and using os.File and the fmt and bufio packages, and the other backed by a Postgres database and using the database/sql and github.com/lib/pq Postgres driver packages.

  • We discussed the importance of security in general, covered some of the basics of TLS as one part of a larger security strategy, and implemented HTTPS in our service.

  • Finally, we covered containerization, one of the core cloud native technologies, including how to build images and how to run and manage containers. We even containerized not only our application but also its build process.

Going forward, we’ll be extending on our key-value service in various ways when we introduce new concepts, so stay tuned. Things are about to get even more interesting.

1 Philip Schieber, “The Wit and Wisdom of Grace Hopper,” OCLC Newsletter, March/April 1987, no. 167.

2 For some definition of “love.”

3 If it does, something is very wrong.

4 Or, like my son, was only pretending not to hear you.

5 “Cloud native is not a synonym for microservices…. If cloud native has to be a synonym for anything, it would be idempotent, which definitely needs a synonym.” —Holly Cummins, Cloud Native London, 2018.

6 Isn’t this exciting?

7 It’s a good thing too. Mutexes can be pretty tedious to implement correctly!

8 Didn’t I tell you that we’d make it more complicated?

9 That’s you.

10 That’s a lie. There are probably lots of better names.

11 What makes a transaction log “good” anyway?

12 Naming is hard.

13 After all this time, I still think that’s pretty neat.

14 You’re welcome.

15 Ideally written by somebody who knows more than I do about security.

16 This is a gross oversimplification, but it’ll do for our purposes. I encourage you to learn more about this and correct me, though.

17 You don’t know where that key has been.

18 SSL 2.0 was released in 1995, and TLS 1.0 was released in 1999. Interestingly, SSL 1.0 had some pretty profound security flaws and was never publicly released.

19 Public keys only, please.

20 Containers are not virtual machines. They virtualize the operating system instead of hardware.

21 Repetition intended. This is an important point.

22 Yup. I said it. Again.

23 The initial draft had several more, but this chapter is already pretty lengthy.

24 To be fair, these images are “only” 234 MB and 363 MB compressed, respectively.

25 I’m intentionally ignoring solutions like Amazon’s Elastic Block Store, which can help but have issues of their own.

Part III. The Cloud Native Attributes

Chapter 6. Cloud Native Design Principles

The most important property of a program is whether it accomplishes the intention of its user.1

C. A. R. Hoare, Communications of the ACM (October 1969)

Professor Sir Charles Antony Richard (Tony) Hoare is a brilliant guy. He invented quicksort, authored Hoare Logic for reasoning about the correctness of computer programs, and created the formal language “communicating sequential processes” (CSP) that inspired Go’s beloved concurrency model. Oh, and he developed the structured programming paradigm2 that forms the foundation of all modern programming languages in common use today. He also invented the null reference. Please don’t hold that against him, though. He publicly apologized3 for it in 2009, calling it his “billion-dollar mistake.”

Tony Hoare literally invented programming as we know it. So when he says that the single most important property of a program is whether it accomplishes the intention of its user, you can take that on some authority. Think about this for a second: Hoare specifically (and quite rightly) points out that it’s the intention of a program’s users—not its creators—that dictates whether a program is performing correctly. How inconvenient that the intentions of a program’s users aren’t always the same as those of its creator!

Given this assertion, it stands to reason that a user’s first expectation about a program is that the program works. But how do you know that a program is “working”? This is actually a pretty big question, one that lies at the heart of cloud native design. The first goal of this chapter is to explore that very idea, and in the process, introduce concepts like “dependability” and “reliability” that we can use to better describe (and meet) user expectations. Finally, we’ll briefly review a number of practices commonly used in cloud native development to ensure that services meet the expectations of their users. We’ll discuss each of these in depth throughout the remainder of this book.

What’s the Point of Cloud Native?

In Chapter 1, we spent a few pages defining “cloud native,” starting with the Cloud Native Computing Foundation’s definition and working forward to the properties of an ideal cloud native service. We spent a few more pages talking about the pressures that have driven cloud native to be a thing in the first place.

What we didn’t spend so much time on, however, was the why of cloud native. Why does the concept of cloud native even exist? Why would we even want our systems to be cloud native? What’s its purpose? What makes it so special? Why should I care?

So, why does cloud native exist? The answer is actually pretty straightforward: it’s all about dependability. In the first part of this chapter, we’ll dig into the concept of dependability, what it is, why it’s important, and how it underlies all of the patterns and techniques that we call cloud native.

It’s All About Dependability

Holly Cummins, the worldwide development community practice lead for the IBM Garage, famously said that “if cloud native has to be a synonym for anything, it would be idempotent.”4 Cummins is absolutely brilliant, and has said a lot of absolutely brilliant things,5 but I think she has only half of the picture on this one. I think that idempotence is very important—perhaps even necessary for cloud native—but not sufficient. I’ll elaborate.

The history of software, particularly the network-based kind, has been one of struggling to meet the expectations of increasingly sophisticated users. Long gone are the days when a service could go down at night “for maintenance.” Users today rely heavily on the services they use, and they expect those services to be available and to respond promptly to their requests. Remember the last time you tried to start a Netflix movie and it took the longest five seconds of your life? Yeah, that.

Users don’t care that your services have to be maintained. They won’t wait patiently while you hunt down that mysterious source of latency. They just want to finish rewatching the second season of Breaking Bad.6

All of the patterns and techniques that we associate with cloud native—every single one—exist to allow services to be deployed, operated, and maintained at scale in unreliable environments, driven by the need to produce dependable services that keep users happy.

In other words, I think that if “cloud native” has to be a synonym for anything, it would be “dependability.”

What Is Dependability and Why Is It So Important?

I didn’t choose the word dependability arbitrarily. It’s actually a core concept in the field of systems engineering, which is full of some very smart people who say some very smart things about the design and management of complex systems. About 40 years ago, the concept of dependability in a computing context was first rigorously defined by Jean-Claude Laprie,7 who framed a system’s dependability in terms of the expectations of its users. Laprie’s original definition has been tweaked and extended over the years by various authors, but here’s my favorite:

The dependability of a computer system is its ability to avoid failures that are more frequent or more severe, and outage durations that are longer, than is acceptable to the user(s).8

Fundamental Concepts of Computer System Dependability (2001)

In other words, a dependable system consistently does what its users expect and can be quickly fixed when it doesn’t.

By this definition, a system is dependable only when it can be justifiably trusted. Obviously, a system can’t be considered dependable if it falls over any time one of its components glitches, or if it requires hours to recover from a failure. Even if it’s been running for months without interruption, an undependable system may still be one bad day away from catastrophe: luck isn’t dependability.

Unfortunately, it’s hard to objectively gauge “user expectations.” For this reason, as illustrated in Figure 6-1, dependability is an umbrella concept encompassing several more specific, quantifiable attributes—availability, reliability, and maintainability—all of which are subject to similar threats that may be overcome by similar means.

Figure 6-1. The system attributes and means that contribute to dependability.

So while the concept of “dependability” alone might be a little squishy and subjective, the attributes that contribute to it are quantitative and measurable enough to be useful:

Availability

The ability of a system to perform its intended function at a random moment in time. This is usually expressed as the probability that a request made of the system will be successful, defined as uptime divided by total time.

Reliability

The ability of a system to perform its intended function for a given time interval. This is often expressed as either the mean time between failures (MTBF: total time divided by the number of failures) or failure rate (number of failures divided by total time).

Maintainability

The ability of a system to undergo modifications and repairs. There are a variety of indirect measures for maintainability, ranging from calculations of cyclomatic complexity to tracking the amount of time required to change a system’s behavior to meet new requirements or to restore it to a functional state.
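To put rough numbers on the first two of these (the figures are invented for illustration): a service that was down for a total of 43 minutes during a 30-day month was up for 43,157 of 43,200 minutes, giving an availability of 43,157 ÷ 43,200 ≈ 0.999, or “three nines.” If those 43 minutes were spread across four separate outages, the same month yields an MTBF of 720 hours ÷ 4 = 180 hours, or equivalently a failure rate of 4 ÷ 720 ≈ 0.0056 failures per hour.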

Note

Later authors extended Laprie’s definition of dependability to include several security-related properties, including safety, confidentiality, and integrity. Security is a huge topic, worthy of an entire book of its own, but in Chapter 12 I’ll do my best to hit the key points.

Dependability Is Not Reliability

If you’ve read any of O’Reilly’s site reliability engineering (SRE) books,9 you’ve already heard quite a lot about reliability. However, as illustrated in Figure 6-1, reliability is just one property that contributes to overall dependability.

If that’s true, though, then why has reliability become the standard metric for service functionality? Why are there “site reliability engineers” but no “site dependability engineers”?

There are probably several answers to these questions, but perhaps the most definitive is that the definition of “dependability” is largely qualitative. There’s no measure for it, and when you can’t measure something it’s very hard to construct a set of rules around it.

Reliability, on the other hand, is quantitative. Given a robust definition10 for what it means for a system to provide “correct” service, it becomes relatively straightforward to calculate that system’s “reliability,” making it a powerful (if indirect) measure of user experience.

Dependability: It’s Not Just for Ops Anymore

Since the introduction of networked services, it’s been the job of developers to build services and of systems administrators (“operations”) to deploy those services onto servers and keep them running. This worked well enough for a time, but it had the unfortunate side effect of incentivizing developers to prioritize feature development at the expense of stability and operations.

Fortunately, over the past decade and a half—coinciding with the DevOps movement—a new wave of technologies has become available with the potential to completely change the way technologists of all kinds do their jobs.

On the operations side, with the availability of infrastructure as a service and platforms as a service (IaaS/PaaS) and tools like Terraform and Ansible, working with infrastructure has never been more like writing software.

On the development side, the popularization of technologies like containers and serverless functions has given developers an entire new set of “operations-like” capabilities, particularly around virtualization and deployment.

As a result, the once-stark line between software and infrastructure is getting increasingly blurry. One could even argue that with the growing advancement and adoption of infrastructure abstractions like virtualization, container orchestration frameworks like Kubernetes, and software-defined behavior like service meshes, we may even be at the point where they could be said to have merged. Everything is software now.

The ever-increasing demand for service dependability has driven the creation of a whole new generation of cloud native technologies. The effects of these new technologies and the capabilities they provide have been considerable, and the traditional developer and operations roles are changing to suit them. At long last, the silos are crumbling, and, increasingly, the rapid production of dependable, high-quality services is a fully collaborative effort among all of its designers, implementors, and maintainers.

Achieving Dependability

This is where the rubber meets the road. If you’ve made it this far, congratulations.

So far we’ve discussed Laprie’s definition of “dependability,” which can be loosely paraphrased as “happy users,” and we’ve discussed the attributes—availability, reliability, and maintainability—that contribute to it. This is all well and good, but without actionable advice for how to achieve dependability, the entire discussion is purely academic.

Laprie thought so too and defined four broad categories of techniques that can be used together to improve a system’s dependability (or which, by their absence, can reduce it):

Fault prevention

Fault prevention techniques are used during system construction to prevent the occurrence or introduction of faults.

Fault tolerance

Fault tolerance techniques are used during system design and implementation to prevent service failures in the presence of faults.

Fault removal

Fault removal techniques are used to reduce the number and severity of faults.

Fault forecasting

Fault forecasting techniques are used to identify the presence, creation, and consequences of faults.

Interestingly, as illustrated in Figure 6-2, these four categories correspond surprisingly well to the five cloud native attributes that we introduced all the way back in Chapter 1.

Figure 6-2. The four means of achieving dependability and their corresponding cloud native attributes.

Fault prevention and fault tolerance make up the bottom two layers of the pyramid, corresponding with scalability, loose coupling, and resilience. Designing a system for scalability prevents a variety of faults common among cloud native applications, and resiliency techniques allow a system to tolerate faults when they do inevitably arise. Techniques for loose coupling can be said to fall into both categories, both preventing faults and enhancing a service’s ability to tolerate them. Together these can be said to contribute to what Laprie terms dependability procurement: the means by which a system is provided with the ability to perform its designated function.

Techniques and designs that contribute to manageability are intended to produce a system that can be easily modified, simplifying the process of removing faults when they’re identified. Similarly, observability naturally contributes to the ability to forecast faults in a system. Together, fault removal and forecasting techniques contribute to what Laprie termed dependability validation: the means by which confidence is gained in a system’s ability to perform its designated function.

Consider the implications of this relationship: what was a purely academic exercise 40 years ago has essentially been rediscovered—apparently independently—as a natural consequence of years of accumulated experience building reliable production systems. Dependability has come full circle.

In the subsequent sections we’ll explore these relationships more fully and preview later chapters, in which we discuss exactly how these two apparently disparate sets of ideas actually correspond quite closely.

Fault Prevention

At the base of our “Means of Dependability” pyramid are techniques that focus on preventing the occurrence or introduction of faults. As veteran programmers can attest, many—if not most—classes of errors and faults can be predicted and prevented during the earliest phases of development. As such, many fault prevention techniques come into play during the design and implementation of a service.

Good programming practices

Fault prevention is one of the primary goals of software engineering in general and is the explicit goal of any development methodology, from pair programming to test-driven development and code review practices. Many such techniques can really be grouped into what might be considered to be “good programming practice,” about which innumerable excellent books and articles have already been written, so we won’t explicitly cover it here.

Language features

Your choice of language can also greatly affect your ability to prevent or fix faults. Many language features that programmers have come to expect, such as dynamic typing, pointer arithmetic, manual memory management, and thrown exceptions (to name a few), can easily introduce unintended behaviors that are difficult to find and fix, and may even be maliciously exploitable.

These kinds of features strongly motivated many of the design decisions for Go, resulting in the strongly typed garbage-collected language we have today. For a refresher about why Go is particularly well suited for the development of cloud native services, take a look back at Chapter 2.

Scalability

We briefly introduced the concept of scalability way back in Chapter 1, where it was defined as the ability of a system to continue to provide correct service in the face of significant changes in demand.

In that section, we introduced two different approaches to scaling—vertical scaling (scaling up) by resizing existing resources and horizontal scaling (scaling out) by adding (or removing) service instances—and some of the pros and cons of each.

We’ll go quite a bit deeper into each of these in Chapter 7, especially into the gotchas and downsides. We’ll also talk a lot about the problems posed by state.11 For now, though, it’ll suffice to say that having to scale your service adds quite a bit of overhead, including, but not limited to, cost, complexity, and debugging.

While scaling resources eventually becomes inevitable for most successful services, it’s often better (and cheaper!) to resist the temptation to throw hardware at the problem and instead postpone scaling events as long as possible by considering runtime efficiency and algorithmic scaling. As such, we’ll cover a number of Go features and tools that allow us to identify and fix common problems like memory leaks and lock contention that tend to plague systems at scale.

Loose coupling

Loose coupling, which we first defined in “Loose Coupling”, is the system property and design strategy of ensuring that a system’s components have as little knowledge of other components as possible. The degree of coupling between services can have an enormous—and too often under-appreciated—impact on a system’s ability to scale and to isolate and tolerate failures.

Since the beginning of microservices there have been dissenters who point to the difficulty of deploying and maintaining microservice-based systems as evidence that such architectures are just too complex to be viable. I don’t agree, but I can see where they’re coming from, given how incredibly easy it is to build a distributed monolith. The hallmark of a distributed monolith is the tight coupling between its components, which results in an application saddled with all of the complexity of microservices plus all of the tangled dependencies of the typical monolith. If you have to deploy most of your services together, or if a failed health check sends cascading failures through your entire system, you probably have a distributed monolith.

Building a loosely coupled system is easier said than done but is possible with a little discipline and reasonable boundaries. In Chapter 8, we’ll cover how to use data exchange contracts to establish those boundaries, as well as the synchronous and asynchronous communication models, architectural patterns, and packages you can use to implement them while avoiding the dreaded distributed monolith.
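As a small preview of what such a contract might look like, here’s a hypothetical, versioned event type that both producer and consumer compile against instead of reaching into each other’s internals. The package and field names are made up for illustration; Chapter 8 uses its own examples:

// Package contract holds the shared shapes of the messages exchanged
// between services. Producers and consumers both import this package (or a
// generated equivalent, such as one produced from a Protocol Buffers
// definition) rather than depending on one another's internals.
package contract

import "time"

// OrderCreatedV1 is a versioned event. New optional fields can be added
// safely; changing or removing existing fields calls for an OrderCreatedV2.
type OrderCreatedV1 struct {
    OrderID    string    `json:"order_id"`
    CustomerID string    `json:"customer_id"`
    TotalCents int64     `json:"total_cents"`
    CreatedAt  time.Time `json:"created_at"`
}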

Fault Tolerance

Fault tolerance has a number of synonyms—self-repair, self-healing, resilience—that all describe a system’s ability to detect errors and prevent them from cascading into a full-blown failure. Typically, this consists of two parts: error detection, in which an error is discovered during normal service, and recovery, in which the system is returned to a state where it can be activated again.

Perhaps the most common strategy for providing resilience is redundancy: the duplication of critical components (having multiple service replicas) or functions (retrying service requests). This is a broad and very interesting field with a number of subtle gotchas that we’ll dig into in Chapter 9.
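We’ll treat retries properly in Chapter 9, but as a taste of the idea, here’s a minimal sketch of a retry helper (not the implementation used later in the book); a production version would add exponential backoff, jitter, and checks for which errors are actually retryable:

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// Retry calls op up to maxRetries times, pausing between attempts, until it
// succeeds or the context is cancelled.
func Retry(ctx context.Context, maxRetries int, delay time.Duration,
    op func(context.Context) error) error {

    var err error
    for i := 0; i < maxRetries; i++ {
        if err = op(ctx); err == nil {
            return nil
        }

        if i == maxRetries-1 {
            break // Don't bother sleeping after the final attempt.
        }

        select {
        case <-time.After(delay):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return fmt.Errorf("all %d attempts failed: %w", maxRetries, err)
}

func main() {
    attempts := 0

    // A flaky (hypothetical) operation that succeeds on its third attempt.
    op := func(ctx context.Context) error {
        attempts++
        if attempts < 3 {
            return errors.New("transient error")
        }
        return nil
    }

    err := Retry(context.Background(), 5, 100*time.Millisecond, op)
    fmt.Println(attempts, err) // 3 <nil>
}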

Fault Removal

Fault removal, the third of the four dependability means, is the process of reducing the number and severity of faults—latent software flaws that can cause errors—before they manifest as errors.

Even under ideal conditions, there are plenty of ways that a system can error or otherwise misbehave. It might fail to perform an expected action or perform the wrong action entirely, perhaps maliciously. Just to make things even more complicated, conditions aren’t always—or often—ideal.

Many faults can be identified by testing, which allows you to verify that the system (or at least its components) behaves as expected under known test conditions.

But what about unknown conditions? Requirements change, and the real world doesn’t care about your test conditions. Fortunately, with effort, a system can be designed to be manageable enough that its behavior can often be adjusted to keep it secure, running smoothly, and compliant with changing requirements.

We’ll briefly discuss these next.

Verification and testing

There are exactly four ways of finding latent software faults in your code: testing, testing, testing, and bad luck.

Yes, I joke, but that’s not so far from the truth: if you don’t find your software faults, your users will. If you’re lucky. If you’re not, then they’ll be found by bad actors seeking to take advantage of them.

Bad jokes aside, there are two common approaches to finding software faults in development:

Static analysis

Automated, rule-based code analysis performed without actually executing programs. Static analysis is useful for providing early feedback, enforcing consistent practices, and finding common errors and security holes without depending on human knowledge or effort.

Dynamic analysis

Verifying the correctness of a system or subsystem by executing it under controlled conditions and evaluating its behavior. More commonly referred to simply as “testing.”

Key to software testing is having software that’s designed for testability by minimizing the degrees of freedom—the range of possible states—of its components. Highly testable functions have a single purpose, with well-defined inputs and outputs and few or no side effects; that is, they don’t modify variables outside of their scope. If you’ll forgive the nerdiness, this approach minimizes the search space—the set of all possible solutions—of each function.

Testing is a critical step in software development that’s all too often neglected. The Go creators understood this and baked unit testing and benchmarking into the language itself in the form of the go test command and the testing package. Unfortunately, a deep dive into testing theory is well beyond the scope of this book, but we’ll do our best to scratch the surface in Chapter 9.
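To illustrate what a testable function and its test look like in practice, here’s a small, hypothetical example: a pure function with well-defined inputs and outputs, plus a table-driven test that go test can run:

// sum.go
package mathutil

// Sum returns the total of its inputs. It has a single purpose, well-defined
// inputs and outputs, and no side effects, which makes it trivially testable.
func Sum(values ...int) int {
    total := 0
    for _, v := range values {
        total += v
    }
    return total
}

// sum_test.go
package mathutil

import "testing"

func TestSum(t *testing.T) {
    tests := []struct {
        name   string
        input  []int
        expect int
    }{
        {"empty", nil, 0},
        {"single", []int{5}, 5},
        {"several", []int{1, 2, 3}, 6},
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            if got := Sum(tt.input...); got != tt.expect {
                t.Errorf("Sum(%v) = %d; want %d", tt.input, got, tt.expect)
            }
        })
    }
}

Running go test in the package directory executes every case and reports any mismatches.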

Manageability

Faults exist when your system doesn’t behave according to requirements. But what happens when those requirements change?

Designing for manageability, first introduced back in “Manageability”, allows a system’s behavior to be adjusted without code changes. A manageable system essentially has “knobs” that allow real-time control to keep your system secure, running smoothly, and compliant with changing requirements.

Manageability can take a variety of forms, including (but not limited to!) adjusting and configuring resource consumption, applying on-the-fly security remediations, adding feature flags that can turn features on or off, or even loading plug-in-defined behaviors.

Clearly, manageability is a broad topic. We’ll review a few of the mechanisms Go provides for it in Chapter 10.

Fault Forecasting

At the peak of our “Means of Dependability” pyramid (Figure 6-2) is fault forecasting, which builds on the knowledge gained and solutions implemented in the levels below it to attempt to estimate the present number, the future incidence, and the likely consequence of faults.

Too often this instead consists of guesswork and gut feelings, generally resulting in unexpected failures when a starting assumption stops being true. More systematic approaches include failure mode and effects analysis (FMEA) and stress testing, both of which are very useful for understanding a system’s possible failure modes.

In a system designed for observability, which we’ll discuss in depth in Chapter 11, failure mode indicators can be tracked so that they can be forecast and corrected before they manifest as errors. Furthermore, when unexpected failures occur—as they inevitably will—observable systems allow the underlying faults to be quickly identified, isolated, and corrected.

The Continuing Relevance of The Twelve-Factor App

In the early 2010s, developers at Heroku, a PaaS company and early cloud pioneer, realized that they were seeing web applications being developed again and again with the same fundamental flaws.

Motivated by what they felt were systemic problems in modern application development, they drafted The Twelve-Factor App. This was a set of 12 rules and guidelines constituting a development methodology for building web applications, and by extension, cloud native applications (although “cloud native” wasn’t a commonly used term at the time). The methodology was for building web applications that have the following characteristics:12

  • Use declarative formats for setup automation, to minimize time and cost for new developers joining the project
  • Have a clean contract with the underlying operating system, offering maximum portability between execution environments
  • Are suitable for deployment on modern cloud platforms, obviating the need for servers and systems administration
  • Minimize divergence between development and production, enabling continuous deployment for maximum agility
  • Can scale up without significant changes to tooling, architecture, or development practices

While not fully appreciated when it was first published in 2011, as the complexities of cloud native development have become more widely understood (and felt), The Twelve-Factor App and the properties it advocates have started to be cited as the bare minimum for any service to be cloud native.

I. Codebase

One codebase tracked in revision control, many deploys.

The Twelve-Factor App

For any given service, there should be exactly one codebase that’s used to produce any number of immutable releases for multiple deployments to multiple environments. These environments typically include a production site and one or more staging and development sites.

Having multiple services sharing the same code tends to lead to a blurring of the lines between modules, trending over time toward something like a monolith, making it harder to make changes in one part of the service without affecting another part (or another service!) in unexpected ways. Instead, shared code should be refactored into libraries that can be individually versioned and included through a dependency manager.

Having a single service spread across multiple repositories, however, makes it nearly impossible to automatically apply the build and deploy phases of your service’s lifecycle.

II. Dependencies

Explicitly declare and isolate (code) dependencies.

The Twelve-Factor App

For any given version of the codebase, go build, go test, and go run should be deterministic: they should have the same result, however they’re run, and the product should always respond the same way to the same inputs.

But what if a dependency—an imported code package or installed system tool beyond the programmer’s control—changes in such a way that it breaks the build, introduces a bug, or becomes incompatible with the service?

Most programming languages offer a packaging system for distributing support libraries, and Go is no different.13 By using Go modules to declare all dependencies, completely and exactly, you can ensure that imported packages won’t change out from under you and break your build in unexpected ways.
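For example, a module’s go.mod file (paired with the checksums recorded in go.sum) pins every dependency to an exact version, so builds remain reproducible wherever and whenever they’re run. The module path and versions below are purely illustrative:

module github.com/example/keyvalue

go 1.22

require (
    github.com/hashicorp/golang-lru/v2 v2.0.7
    github.com/spf13/viper v1.19.0
)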

To extend this somewhat, services should generally try to avoid using the os/exec package’s Command function to shell out to external tools like ImageMagick or curl.

Yes, your target tool might be available on all (or most) systems, but there’s no way to guarantee that it both exists and is fully compatible with the service everywhere the service might run, now or in the future. Ideally, if your service requires an external tool, that tool should be vendored into the service by including it in the service’s repository.

III. Configuration

Store configuration in the environment.

The Twelve-Factor App

Configuration—anything that’s likely to vary between environments (staging, production, developer environments, etc.)—should always be cleanly separated from the code. Under no circumstances should an application’s configuration be baked into the code.

Configuration items may include but certainly aren’t limited to the following:

  • URLs or other resource handles to a database or other upstream service dependencies—even if it’s not likely to change any time soon

  • Secrets of any kind, such as passwords or credentials for external services

  • Per-environment values, such as the canonical hostname for the deploy

A common means of extracting configuration from code is to externalize it into a configuration file—often YAML14—which may or may not be checked into the repository alongside the code. This is certainly an improvement over configuration-in-code, but it’s also less than ideal.

First, if your configuration file lives outside of the repository, it’s all too easy to accidentally check it in. What’s more, such files tend to proliferate, with different versions for different environments living in different places, making it hard to see and manage configurations with any consistency.

Alternatively, you could have different versions of your configurations for each environment in the repository, but this can be unwieldy and tends to lead to some awkward repository acrobatics.

There Are No Partially Compromised Secrets

It’s worth emphasizing that while configuration values should never be in code, passwords or other sensitive secrets should absolutely, never ever be in code. It’s all too easy for those secrets, in a moment of forgetfulness, to get shared with the whole world.

Once a secret is out, it’s out. There are no partially compromised secrets.

Always treat your repository—and the code it contains—as if it can be made public at any time. Which, of course, it can.

Instead of configurations as code or even as external configurations, The Twelve-Factor App recommends that configurations be stored as environment variables. Using environment variables in this way actually has a lot of advantages:

  • They’re standard and largely OS and language agnostic.

  • They’re easy to change between deployments without changing any code.

  • They’re easy to inject into containers.

Go has several tools for doing this.

The first—and most basic—is the os package, which provides the os.Getenv function for this purpose:

name := os.Getenv("NAME")
place := os.Getenv("CITY")

fmt.Printf("%s lives in %s.\n", name, place)

For more sophisticated configuration options, there are several excellent packages available. Of these, spf13/viper seems to be particularly popular. A snippet of Viper in action might look like the following:

viper.BindEnv("id")            // Will be uppercased automatically
viper.SetDefault("id", "13")   // Default value is "13"

id1 := viper.GetInt("id")
fmt.Println(id1)               // 13

os.Setenv("ID", "50")          // Typically done outside of the app!

id2 := viper.GetInt("id")
fmt.Println(id2)               // 50

Additionally, Viper provides a number of features that the standard packages do not, such as default values, typed variables, and reading from command-line flags, variously formatted configuration files, and even remote configuration systems like etcd and Consul.

We’ll dive more deeply into Viper and other configuration topics in Chapter 10.

IV. Backing Services

Treat backing services as attached resources.

The Twelve-Factor App

A backing service is any downstream dependency that a service consumes across the network as part of its normal operation (see “Upstream and Downstream Dependencies”). A service should make no distinction between backing services of the same type. Whether it’s an internal service that’s managed within the same organization or a remote service managed by a third party should make no difference.

To the service, each distinct upstream service should be treated as just another resource, each addressable by a configurable URL or some other resource handle, as illustrated in Figure 6-3. All resources should be treated as equally subject to the fallacies of distributed computing (see Chapter 4 for a refresher, if necessary).

Figure 6-3. Each upstream service should be treated as just another resource, each addressable by a configurable URL or some other resource handle, each equally subject to the fallacies of distributed computing.

In other words, a MySQL database run by your own team’s sysadmins should be treated no differently than an AWS-managed RDS instance. The same goes for any upstream service, whether it’s running in a data center in another hemisphere or in a Docker container on the same server.

A service that’s able to swap out any resource at will with another one of the same kind—internally managed or otherwise—just by changing a configuration value can be more easily deployed to different environments, more easily tested, and more easily maintained.
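In Go, this often amounts to nothing more than building the connection from configuration. A minimal sketch might look like the following; the DATABASE_DSN variable name and the MySQL driver are assumptions made for the example, not a prescription:

package main

import (
    "database/sql"
    "log"
    "os"

    _ "github.com/go-sql-driver/mysql" // Hypothetical driver choice.
)

func main() {
    // DATABASE_DSN is an assumed configuration value; swapping a locally
    // managed MySQL for a managed RDS instance means changing only this
    // value, not the code.
    dsn := os.Getenv("DATABASE_DSN")

    db, err := sql.Open("mysql", dsn)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    if err := db.Ping(); err != nil {
        log.Fatal(err)
    }

    log.Println("connected to backing service")
}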

V. Build, Release, Run

Strictly separate build and run stages.

The Twelve-Factor App

Each (nondevelopment) deployment—the union of a specific version of the built code and a configuration—should be immutable and uniquely labeled. It should be possible to precisely re-create any deployment if (heaven forbid) you ever need to roll back to an earlier version.

Typically, this is accomplished in three distinct stages, illustrated in Figure 6-4 and described in the following:

Build

In the build stage, an automated process retrieves a specific version of the code, fetches dependencies, and compiles an executable artifact we call a build. Every build should always have a unique identifier, typically a timestamp or an incrementing build number.

Release

In the release stage, a specific build is combined with a configuration specific to the target deployment. The resulting release is ready for immediate execution in the execution environment. Like builds, releases should also have a unique identifier. Importantly, producing releases with the same version of a build shouldn’t involve a rebuild of the code: to ensure environment parity, each environment-specific configuration should use the same build artifact.

Run

In the run stage, the release is delivered to the deployment environment and executed by launching the service’s processes.

Ideally, a new versioned build will be automatically produced whenever new code is deployed.

Figure 6-4. The process of deploying a codebase to a (nondevelopment) environment should be performed in distinct build, release, and run stages.
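One common way to give a Go build a unique identifier is to inject it at compile time with the linker’s -X flag; the variable name and version string below are illustrative:

package main

import "fmt"

// version is overwritten at build time, for example:
//
//    go build -ldflags "-X main.version=2024.10.14-build42" .
var version = "development"

func main() {
    fmt.Println("build:", version)
}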

VI. Processes

Execute the app as one or more stateless processes.

The Twelve-Factor App

Service processes should be stateless and share nothing. Any data that has to be persisted should be stored in a stateful backing service, typically a database or external cache.

We’ve already spent some time talking about statelessness—and we’ll spend more in Chapter 7—so we won’t dive into this point any further.

However, if you’re interested in reading ahead, feel free to take a look at “State and Statelessness”.

VII. Data Isolation

Each service manages its own data.

Cloud Native (“Data Isolation”)

Each service should be entirely self-contained. That is, it should manage its own data and make its data accessible only via an API designed for that purpose. If this sounds familiar to you, good! This is actually one of the core principles of microservices, which we’ll discuss more in “The Microservices System Architecture”.

Frequently, this will be implemented as a request-response service like a RESTful API or remote procedure call (RPC) protocol that’s exported by listening to requests coming in on a port, but this can also take the form of an asynchronous, event-based service using a publish-subscribe messaging pattern. Both of these patterns will be described in more detail in Chapter 8.

Historical Note

The actual title of the seventh section of The Twelve-Factor App is “Port Binding” and is summarized as “export services via port binding.”15

At the time, this advice certainly made sense, but this title obscures its main point: that a service should encapsulate and manage its own data and share that data only via an API.

While many (or even most) web applications do, in fact, expose their APIs via ports, the increasing popularity of functions as a service (FaaS) and event-driven architectures means this is no longer necessarily always the case.

So, instead of the original text, I’ve decided to use the more up-to-date (and true-to-intent) summary provided by Boris Scholl et al. in Cloud Native: Using Containers, Functions, and Data to Build Next-Generation Applications (O’Reilly).

And finally, although this is something you don’t see in the Go world, some languages and frameworks allow the runtime injection of an application server into the execution environment to create a web-facing service. This practice limits testability and portability by breaking data isolation and environment agnosticism and is very strongly discouraged.

VIII. Scalability

Scale out via the process model.

The Twelve-Factor App

Services should be able to scale horizontally by adding more instances.

We talk about scalability quite a bit in this book. We even dedicated all of Chapter 7 to it. With good reason: the importance of scalability can’t be overstated.

Sure, it’s certainly convenient to just beef up the one server your service is running on—and that’s fine in the (very) short term—but vertical scaling is a losing strategy in the long run. If you’re lucky, you’ll eventually hit a point where you simply can’t scale up any more. It’s more likely that your single server will either suffer load spikes faster than you can scale up or just die without warning and without a redundant failover.16 Both scenarios end with a lot of unhappy users.

IX. Disposability

Maximize robustness with fast startup and graceful shutdown.

The Twelve-Factor App

Cloud environments are fickle: provisioned servers have a funny way of disappearing at odd times. Services should account for this by being disposable: service instances should be able to be started or stopped—intentionally or not—at any time.

Services should strive to minimize their startup time, which reduces how long it takes for the service to be deployed (or redeployed) and to scale elastically. Go, having no virtual machine or other significant overhead, is especially good at this.

Containers provide fast startup time and are also very useful for this, but care must be taken to keep image sizes small to minimize the data transfer overhead incurred with each initial deployment of a new image. This is another area in which Go excels: its self-sufficient binaries can generally be installed into SCRATCH images, without requiring an external language runtime or other external dependencies. We demonstrated this in Chapter 5, in “Containerizing Your Key-Value Store”.

Services should also be capable of shutting down gracefully when they receive a SIGTERM signal: saving any data that needs to be saved, closing open network connections, and either finishing any in-progress work or returning the current job to the work queue.
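As a minimal sketch of what that might look like for an HTTP service in Go (the address and timeout are arbitrary choices), signal.NotifyContext and http.Server.Shutdown do most of the heavy lifting:

package main

import (
    "context"
    "log"
    "net/http"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    // ctx is cancelled when the process receives SIGTERM or SIGINT.
    ctx, stop := signal.NotifyContext(context.Background(),
        syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    server := &http.Server{Addr: ":8080"}

    go func() {
        if err := server.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    <-ctx.Done() // Block until a shutdown signal arrives.

    // Give in-flight requests up to 10 seconds to complete.
    shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    if err := server.Shutdown(shutdownCtx); err != nil {
        log.Println("forced shutdown:", err)
    }
}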

X. Development/Production Parity

Keep development, staging, and production as similar as possible.

The Twelve-Factor App

Any possible differences between development and production should be kept as small as possible. This includes code differences, of course, but it extends well beyond that:

Code divergence

Development branches should be small and short-lived and should be tested and deployed into production as quickly as possible. This minimizes functional differences between environments and reduces the risk of both deploys and rollbacks.

Stack divergence

Rather than having different components for development and production (say, SQLite on macOS versus MySQL on Linux), environments should remain as similar as possible. Lightweight containers are an excellent tool for this. This minimizes the possibility that inconvenient differences between almost-but-not-quite-the-same implementations will emerge to ruin your day.

Personnel divergence

Once it was common to have programmers who wrote code and operators who deployed code, but that arrangement created conflicting incentives and counterproductive adversarial relationships. Keeping code authors involved in deploying their work and responsible for its behavior in production helps break down development/operations silos and aligns incentives around stability and velocity.

Taken together, these approaches help to keep the gap between development and production small, which in turn encourages rapid, automated, continuous deployment.

XI. Logs

Treat logs as event streams.

The Twelve-Factor App

Logs—a service’s never-ending stream of consciousness—are incredibly useful things, particularly in a distributed environment. By providing visibility into the behavior of a running application, good logging can greatly simplify the task of locating and diagnosing misbehavior.

Traditionally, services wrote log events to a file on the local disk. At cloud scale, however, this just makes valuable information awkward to find, inconvenient to access, and impossible to aggregate. In dynamic, ephemeral environments like Kubernetes, your service instances (and their log files) may not even exist by the time you get around to viewing them.

Instead, a cloud native service should treat log information as nothing more than a stream of events, writing each event, unbuffered, directly to stdout. It shouldn’t concern itself with implementation details like the routing or storage of its log events; instead, it should let the execution environment decide what happens to them.

Though seemingly simple (and perhaps somewhat counterintuitive), this small change provides a great deal of freedom.

During local development, a programmer can watch the event stream in a terminal to observe the service’s behavior. In deployment, the output stream can be captured by the execution environment and forwarded to one or more destinations, such as a log indexing system like Elasticsearch, Logstash, and Kibana (ELK) or Splunk for review and analysis, or a data warehouse for long-term storage.
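In Go, writing structured events straight to stdout takes very little code; here’s a minimal sketch using the standard library’s log/slog package (the field names are invented for the example):

package main

import (
    "log/slog"
    "os"
)

func main() {
    // Write structured (JSON) events, unbuffered, straight to stdout, and
    // let the execution environment decide where they go.
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    logger.Info("request handled",
        "method", "GET",
        "path", "/v1/key",
        "status", 200,
    )
}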

We’ll discuss logs and logging, in the context of observability, in more detail in Chapter 11.

XII. Administrative Processes

Run administrative/management tasks as one-off processes.

The Twelve-Factor App

Of all of the original Twelve Factors, this is the one that most shows its age. For one thing, it explicitly advocates shelling into an environment to manually execute tasks.

To be clear: making manual changes to a server instance creates snowflakes. This is a bad thing. See “Special Snowflakes”.

Assuming you even have an environment that you can shell into, you should assume that it can (and eventually will) be destroyed and re-created at any moment.

Ignoring all of that for a moment, let’s distill the point to its original intent: administrative and management tasks should be run as one-off processes. This could be interpreted in two ways, each requiring its own approach:

  • If your task is an administrative process, like a data repair job or database migration, it should be run as a short-lived process. Containers and functions are excellent vehicles for such purposes.

  • If your change is an update to your service or execution environment, you should instead modify your service or environment construction/configuration scripts, respectively.

Special Snowflakes

Keeping servers healthy can be a challenge. At 3 a.m., when things aren’t working quite right, it’s really tempting to make a quick change and go back to bed.

Congratulations, you’ve just created a snowflake: a special server instance with manual changes that give it unique, usually undocumented, behaviors.

Even minor, seemingly harmless changes can lead to significant problems. Even if the changes are documented—which is rarely the case—snowflake servers are hard to reproduce exactly, particularly if you need to keep an entire cluster in sync.

This can lead to a bad time when you have to redeploy your service onto new hardware and can’t figure out why it’s not working.

Furthermore, because your testing environment no longer matches production, you can no longer trust your development environments to reliably reproduce your production deployment.

Instead, servers and containers should be treated as immutable. If something needs to be updated, fixed, or modified in any way, changes should be made by updating the appropriate build scripts, baking17 a new common image, and provisioning new server or container instances to replace the old ones.

As the expression goes, instances should be treated as “cattle, not pets.”

Summary

In this chapter, we considered the question “What’s the point of cloud native?” The common answer is “a computer system that works in the cloud.” But “work” can mean anything. Surely we can do better.

So we went back to thinkers like Tony Hoare and Jean-Claude Laprie, who provided the first part of the answer: dependability. That is, to paraphrase, computer systems that behave in ways that users find acceptable, despite living in a fundamentally unreliable environment.

Obviously, that’s more easily said than done, so we reviewed three schools of thought regarding how to achieve it:

  • Laprie’s academic “means of dependability,” which include preventing, tolerating, removing, and forecasting faults

  • Adam Wiggins’s The Twelve-Factor App, which took a more prescriptive (and slightly dated, in spots) approach

  • Our own “cloud native attributes,” based on the Cloud Native Computing Foundation’s definition of “cloud native,” which we introduced in Chapter 1 and organized this entire book around

Although this chapter was essentially a short survey of theory, there’s a lot of important, foundational information here that describes the motivations and means used to achieve what we call “cloud native.”

1 C. A. R. Hoare, “An Axiomatic Basis for Computer Programming”, Communications of the ACM, 12, no. 10 (October 20, 1969): 576–583.

2 When Edsger W. Dijkstra coined the expression “GOTO considered harmful,” he was referencing Hoare’s work in structured programming.

3 Tony Hoare, “Null References: The Billion Dollar Mistake”, InfoQ.com, August 25, 2009.

4 Holly Cummins, Cloud Native Is About Culture, Not Containers, Cloud Native London, 2018.

5 If you ever have a chance to see her speak, I strongly recommend you take it.

6 Remember what Walt did to Jane that time? That was so messed up.

7 Jean-Claude Laprie, “Dependable Computing and Fault Tolerance: Concepts and Terminology”, FTCS-15 The 15th Int’l Symposium on Fault-Tolerant Computing, June 1985, 2–11.

8 Algirdas Avižienis et al., “Fundamental Concepts of Computer System Dependability”, Research Report No. 1145, LAAS-CNRS, April 2001.

9 If you haven’t, start with Site Reliability Engineering: How Google Runs Production Systems by Betsey Beyer et al. (O’Reilly, 2016). It really is very good.

10 Many organizations use service-level objectives (SLOs) for precisely this purpose.

11 Application state is hard, and when done wrong, it’s poison to scalability.

12 Adam Wiggins, The Twelve-Factor App, 2017.

13 Although it was for too long!

14 The world’s worst configuration language (except for all the other ones).

15 Adam Wiggins, “Port Binding”, The Twelve-Factor App, 2017.

16 Probably at three in the morning.

17 Baking is a term sometimes used to refer to the process of creating a new container or server image.

Chapter 7. Scalability

Some of the best programming is done on paper, really. Putting it into the computer is just a minor detail.1

Max Kanat-Alexander, Code Simplicity: The Fundamentals of Software

In the summer of 2016, I joined a small company that digitized the kind of forms and miscellaneous paperwork that government bureaucracies are known and loved for. The state of their core application was pretty typical of early-stage startups, so we got to work and, by that fall, had managed to containerize it, describe its infrastructure in code, and fully automate its deployment.

One of our clients was a small coastal city in southeastern Virginia, so when Hurricane Matthew—the first Category 5 Atlantic hurricane in nearly a decade—was forecast to make landfall not far from there, the local officials dutifully declared a state of emergency and used our system to create the necessary paperwork for citizens to fill out. Then they posted it to social media, and a million people all logged in at the same time.

When the pager went off, the on-call engineer checked the metrics and found that the servers’ aggregated CPU was pegged at 100% and that hundreds of thousands of requests were timing out.

So we added a zero to the desired server count, created a “to-do” task to implement autoscaling, and went back to our day. Within 24 hours, the rush had passed, so we scaled the servers in.

What did we learn from this, other than the benefits of autoscaling?2

First, it underscored the fact that without the ability to scale, our system would have certainly suffered extended downtime. But being able to add resources on demand meant that we could serve our users even under load far beyond what we had ever anticipated. As an added benefit, if any one server failed, its work could have been divided among the survivors.

Second, having far more resources than necessary isn’t just wasteful; it’s expensive. The ability to scale our instances back in when demand ebbed meant that we were paying for only the resources that we needed. A major plus for a startup on a budget.

Unfortunately, because unscalable services can seem to function perfectly well under initial conditions, scalability isn’t always a consideration during service design. While this might be perfectly adequate in the short term, services that aren’t capable of growing much beyond their original expectations also have a limited lifetime value. What’s more, it’s often fiendishly difficult to refactor a service for scalability, so building with it in mind can save both time and money in the long run.

At its core, this is meant to be a Go book, or at least more of a Go book than an infrastructure or architecture book. While we will discuss things like scalable architecture and messaging patterns, much of this chapter will focus on demonstrating how Go can be used to produce services that lean on the other (non-infrastructure) part of the scalability equation: efficiency.3

What Is Scalability?

You may recall that the concept of scalability was first introduced way back in Chapter 1, where it was defined as the ability of a system to continue to provide correct service in the face of significant changes in demand. By this definition, a system can be considered to be scalable if it doesn’t need to be redesigned to perform its intended function during steep increases in load.

Note that this definition4 doesn’t actually say anything at all about adding physical resources. Rather, it calls out a system’s ability to handle large swings in demand. The thing being “scaled” here is the magnitude of the demand. While adding resources is one perfectly acceptable means of achieving scalability, it isn’t exactly the same as being scalable. To make things just a little more confusing, the word scaling can also be applied to a system, in which case it does mean a change in the amount of dedicated resources.

So how do we handle high demand without adding resources? As we’ll discuss in “Scaling Postponed: Efficiency”, systems built with efficiency in mind are inherently more scalable by virtue of their ability to gracefully absorb high levels of demand, without immediately having to resort to adding hardware in response to every dramatic swing in demand, and without having to massively overprovision “just in case.”

Different Forms of Scaling

Unfortunately, even the most efficient of efficiency strategies has its limit, and eventually you’ll find yourself needing to scale your service to provide additional resources. There are two different ways that this can be done (see Figure 7-1), each with its own associated pros and cons:

Vertical scaling

A system can be vertically scaled (or scaled up) by increasing its resource allocations. In a public cloud, an existing server can be vertically scaled fairly easily just by changing its instance size, but only until you run out of larger instance types (or money).

Horizontal scaling

A system can be horizontally scaled (or scaled out) by duplicating the system or service to limit the burden on any individual server. Systems using this strategy can typically scale to handle greater amounts of load, but as you’ll see in “State and Statelessness”, the presence of state can make this strategy difficult or impossible for some systems.

Figure 7-1. Vertical scaling can be an effective short-term solution; horizontal scaling is more technically challenging but may be a better long-term strategy.

These two terms are used to describe the most common way of thinking about scaling: taking an entire system and just making more of it. There are a variety of other scaling strategies available, however.

Perhaps the most common of these is functional partitioning, which you’re no doubt already familiar with, if not by name. Functional partitioning involves decomposing complex systems into smaller functional units that can be independently optimized, managed, and scaled. You might recognize this as a generalization of a number of best practices ranging from basic program design to advanced distributed systems design.

Another approach common in systems with large amounts of data—particularly databases—is sharding. Systems that use this strategy distribute load by dividing their data into partitions called shards, each of which holds a specific subset of the larger dataset. A basic example of this is presented in “Minimizing locking with sharding”.
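The same idea applies at a much smaller scale, too. As a hypothetical sketch (distinct from the implementation discussed in “Minimizing locking with sharding”), a map can be split into shards, each guarded by its own lock and selected by hashing the key:

package main

import (
    "fmt"
    "hash/fnv"
    "sync"
)

const shardCount = 16

// shard pairs a slice of the data with its own lock, so writes to one shard
// don't block writes to another.
type shard struct {
    sync.RWMutex
    data map[string]string
}

type shardedMap []*shard

func newShardedMap() shardedMap {
    m := make(shardedMap, shardCount)
    for i := range m {
        m[i] = &shard{data: make(map[string]string)}
    }
    return m
}

// getShard hashes the key to pick which shard holds it.
func (m shardedMap) getShard(key string) *shard {
    h := fnv.New32a()
    h.Write([]byte(key))
    return m[int(h.Sum32())%shardCount]
}

func (m shardedMap) Set(key, value string) {
    s := m.getShard(key)
    s.Lock()
    defer s.Unlock()
    s.data[key] = value
}

func (m shardedMap) Get(key string) (string, bool) {
    s := m.getShard(key)
    s.RLock()
    defer s.RUnlock()
    v, ok := s.data[key]
    return v, ok
}

func main() {
    m := newShardedMap()
    m.Set("hurricane", "matthew")
    fmt.Println(m.Get("hurricane")) // matthew true
}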

The Four Common Bottlenecks

As the demands on a system increase, there will inevitably come a point at which one resource just isn’t able to keep pace, effectively stalling any further efforts to scale. That resource has become a bottleneck.

Returning the system to operable performance levels requires identifying and addressing the bottleneck. This can be done in the short term by vertically scaling the bottlenecked component: adding more memory or up-sizing the CPU, for instance. As you might recall from the discussion in “Different Forms of Scaling”, this approach isn’t always possible (or cost-effective), and it can never be relied on forever.

However, it’s often possible to address a bottleneck by reducing the burden on the affected component, leaning instead on another resource that the system still has in abundance. A database might avoid disk I/O bottlenecking by caching data in RAM; conversely, a memory-hungry service could page data to disk. Horizontal scaling doesn’t make a system immune to bottlenecking either: adding more instances can mean more communication overhead, which puts additional strain on the network. Even highly concurrent systems can become victims of their own inner workings as the demand on them increases and phenomena like lock contention come into play. Using resources effectively often means making trade-offs.

Of course, fixing a bottleneck requires that you first identify the constrained component, and while there are many different resources that can emerge as targets for scaling efforts—whether by actually scaling the resource or by using it more efficiently—such efforts tend to focus on just four resources:

CPU

The number of operations per unit of time that can be performed by a system’s central processor and a common bottleneck for many systems. Scaling strategies for CPU include caching the results of expensive deterministic operations (at the expense of memory) or simply increasing the size or number of processors (at the expense of network I/O if scaling out).

Memory

The amount of data that can be stored in main memory. While today’s systems can store incredible amounts of data on the order of tens or hundreds of gigabytes, even this can fall short, particularly for data-intensive systems that lean on memory to circumvent disk I/O speed limits. Scaling strategies include offloading data from memory to disk (at the expense of disk I/O) or an external dedicated cache (at the expense of network I/O), or simply increasing the amount of available memory.

Disk I/O

The speed at which data can be read from and written to a hard disk or other persistent storage medium. Disk I/O is a common bottleneck on highly parallel systems that read and write heavily to disk, such as databases. Scaling strategies include caching data in RAM (at the expense of memory) or using an external dedicated cache (at the expense of network I/O).

Network I/O

The speed at which data can be sent across a network, either from a particular point or in aggregate. Network I/O translates directly into how much data the network can transmit per unit of time. Scaling strategies for network I/O are often limited,5 but network I/O is particularly amenable to various optimization strategies that we’ll discuss shortly.

As the demand on a system increases, it’ll almost certainly find itself bottlenecked by one of these, and while there are efficiency strategies that can be applied, those tend to come at the expense of one or more other resources, so you’ll eventually find your system being bottlenecked again by another resource.

State and Statelessness

We briefly touched on statelessness in “Application State Versus Resource State”, where we described application state—server-side data about the application or how it’s being used by a client—as something to be avoided if at all possible. This time, though, let’s spend a little longer discussing what state is, why it can be problematic, and what we can do about it.

It turns out that “state” is strangely difficult to define, so I’ll offer my own working definition. For the purposes of this book, I’ll define state as the set of an application’s variables that, if changed, affect the behavior of the application.6

Application State Versus Resource State

Most applications have some form of state, but not all state is created equal. It comes in two kinds, one of which is far less desirable than the other.

First, there’s application state, which exists any time an application needs to remember an event locally. Whenever somebody talks about a stateful application, they’re usually talking about an application that’s designed to use this kind of local state. Local is the operative word here.

Second, there’s resource state: data that’s the same for every client and independent of any client’s actions, such as data kept in an external data store or managed through configuration management. Somewhat confusingly, saying that an application is stateless doesn’t mean that it has no data, just that it’s been designed in such a way that it keeps no local persistent data. Its only state is resource state, often because all of its state is stored in some external data store.

To illustrate the difference between the two, imagine an application that tracks client sessions, associating them with some application context. If users’ session data was maintained locally by the application, that would be considered “application state.” But if the data was stored in an external database, then it could be treated as a remote resource, and it would be “resource state.”
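As a hypothetical sketch of that difference in code, the first approach below keeps session data in a replica-local map (application state), while the second hides it behind an interface to an external store (resource state). The type and method names are made up for illustration:

// Package session illustrates the difference between application state and
// resource state. The names here are hypothetical.
package session

import "sync"

// Session is a per-client session record.
type Session struct {
    UserID string
    Cart   []string
}

// Application state: sessions live in this replica's memory. Another
// replica won't see them, and they vanish if this process dies.
var (
    mu       sync.Mutex
    sessions = map[string]Session{}
)

func SaveLocally(id string, s Session) {
    mu.Lock()
    defer mu.Unlock()
    sessions[id] = s
}

// Resource state: sessions live in an external store (a database, Redis,
// etc.) behind an interface, so any replica can serve any request.
type Store interface {
    Save(id string, s Session) error
    Load(id string) (Session, error)
}

func SaveRemotely(store Store, id string, s Session) error {
    return store.Save(id, s)
}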

Note

The term stateless means that a service has no application state. It says nothing about resource state.

Application state is the antithesis of scalability. Multiple instances of a stateful service will quickly find their individual states diverging as each replica receives different inputs. While server affinity can work around this specific condition by ensuring that all of a client’s requests are routed to the same server, this strategy poses a considerable data risk, since the failure of any single server is likely to result in a loss of data.

Advantages of Statelessness

So far, we’ve discussed the differences between application state and resource state, and we’ve even suggested—without much evidence (yet)—that application state is bad. However, statelessness provides some very noticeable advantages:

Scalability

The most visible and most often cited benefit is that stateless applications can handle each request or interaction independent of previous requests. This means that any service replica can handle any request, allowing applications to grow, shrink, or be restarted without losing data required to handle any in-flight sessions or requests. This is especially important when autoscaling your service, because the instances, nodes, or pods hosting the service can (and usually will) be created and destroyed unexpectedly.

Durability

Data that lives in exactly one place (such as a single service replica) can (and, at some point, will) get lost when that replica goes away for any reason. Remember: everything in “the cloud” evaporates eventually.

Simplicity

Without any application state, stateless services are freed from the need to…well…manage their state.7 Not being burdened with having to maintain service-side state synchronization, consistency, and recovery logic8 makes stateless APIs less complex and therefore easier to design, build, and maintain.

Cacheability

APIs provided by stateless services are relatively easy to design for cacheability. If a service knows that the result of a particular request will always be the same, regardless of who’s making it or when, the result can be safely set aside for easy retrieval later, increasing efficiency and reducing response time.

These might seem like four different things, but there’s overlap with respect to what they provide. Specifically, statelessness makes services both simpler and safer to build, deploy, and maintain.

Scaling Postponed: Efficiency

In the context of cloud computing, we usually think of scalability in terms of the ability of a system to add network and computing resources. Often neglected, however, is the role of efficiency in scalability. Specifically, the ability for a system to handle changes in demand without having to add (or greatly overprovision) dedicated resources.

While it can be argued that most people don’t care about program efficiency most of the time, this starts to become less true as demand on a service increases. If a language has a relatively high per-process concurrency overhead—often the case with dynamically typed languages—it will consume all available memory or compute resources much more quickly than a lighter-weight language, and consequently require more resources and more scaling events to support the same demand.

This was a major consideration in the design of Go’s concurrency model, whose goroutines aren’t threads at all but lightweight routines multiplexed onto multiple OS threads. Each costs little more than the allocation of stack space, allowing potentially millions of concurrently executing routines to be created.
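To give a rough sense of just how cheap goroutines are, the following contrived example launches 100,000 of them (the count is arbitrary) and waits for them all to finish:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    const n = 100_000

    var count atomic.Int64
    var wg sync.WaitGroup

    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            count.Add(1)
        }()
    }

    wg.Wait()
    fmt.Println(count.Load(), "goroutines ran") // 100000 goroutines ran
}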

As such, in this section we’ll cover a selection of Go features and tooling that allow us to avoid common scaling problems, such as memory leaks and lock contention, and to identify and fix them when they do arise.

Efficient Caching Using an LRU Cache

Caching to memory is a very flexible efficiency strategy that can be used to relieve pressure on anything from CPU to disk I/O or network I/O, or even just to reduce latency associated with remote or otherwise slow-running operations.

The concept of caching certainly seems straightforward. You have something you want to remember the value of—like the result of an expensive (but deterministic) calculation—and you put it in a map for later. Right?

Well, you could do that, but you’ll soon start running into problems. What happens as the number of cores and goroutines increases? Since a plain map isn’t safe for concurrent use, you’ll soon find your modifications stepping on one another, leading to some unpleasant results. Also, since nothing is ever removed from the map, it’ll continue growing indefinitely until it consumes all available memory.

What we need is a caching data structure that does the following:

  • Supports concurrent read, write, and delete operations

  • Scales well as the number of cores and goroutines increases

  • Won’t grow without limit to consume all available memory

One common solution to this dilemma is an LRU (Least Recently Used) cache: a particularly lovely data structure that tracks how recently each of its keys has been “used” (read or written). When a value is added to the cache such that it exceeds a predefined capacity, the cache is able to “evict” (delete) its least recently used value.

A detailed discussion of how to implement an LRU cache is beyond the scope of this book, but I will say that it’s quite clever. As illustrated in Figure 7-2, an LRU cache contains a doubly linked list (which actually contains the values) and a map that associates each key to a node in the linked list. Whenever a key is read or written, the appropriate node is moved to the bottom of the list, such that the least recently used node is always at the top.

There are a couple of Go LRU cache implementations available, though none in the core libraries (yet). Perhaps the most common can be found as part of the golang/groupcache library. However, I prefer HashiCorp’s open source extension to groupcache, hashicorp/golang-lru, which is better documented and includes sync.RWMutexes for concurrency safety.

Figure 7-2. An LRU cache contains a map and a doubly linked list, which allows it to discard stale items when it exceeds its capacity.

HashiCorp’s library contains two construction functions, each of which returns a pointer of type *Cache[K, V] and an error:

// New creates an LRU cache with the given capacity.
func New[K comparable, V any](size int) (*Cache[K, V], error) {}

// NewWithEvict creates an LRU cache with the given capacity and also accepts
// an "eviction callback" function that's called when an eviction occurs.
func NewWithEvict[K comparable, V any](size int,
    onEvicted func(key K, value V)) (*Cache[K, V], error) {}

The *Cache struct has a number of attached methods, the most useful of which are as follows:

// Add adds a value to the cache and returns true if an eviction occurred.
func (c *Cache[K, V]) Add(key K, value V) bool {}

// Contains checks if a key is in the cache (without updating recency).
func (c *Cache[K, V]) Contains(key K) bool {}

// Get looks up a key's value and returns (value, true) if it exists.
// If the value doesn't exist, it returns the zero value of V and false.
func (c *Cache[K, V]) Get(key K) (V, bool) {}

// Len returns the number of items in the cache.
func (c *Cache[K, V]) Len() int {}

// Remove removes the provided key from the cache.
func (c *Cache[K, V]) Remove(key K) bool {}

There are several other methods as well. Take a look at the GoDocs for a complete list.

In the following example, we create and use an LRU cache with a capacity of two. To better highlight evictions, we include a callback function that prints some output to stdout whenever an eviction occurs. Note that we’ve decided to initialize the cache variable in an init function, a special function that’s automatically called before the main function and after the variable declarations have evaluated their initializers:

package main

import (
    "fmt"

    lru "github.com/hashicorp/golang-lru/v2"
)

var cache *lru.Cache[int, string]

func init() {
    cache, _ = lru.NewWithEvict(2,
        func(key int, value string) {
            fmt.Printf("Evicted: key=%d value=%s\n", key, value)
        },
    )
}

func main() {
    cache.Add(1, "a") // adds 1
    cache.Add(2, "b") // adds 2; cache is now at capacity

    fmt.Println(cache.Get(1)) // "a true"; 1 now most recently used

    cache.Add(3, "c") // adds 3, evicts key 2

    fmt.Println(cache.Get(2)) // " false" (not found)
}

In the preceding program, we create a cache with a capacity of two, which means that the addition of a third value will force the eviction of the least recently used value.

After adding the values {1:"a"} and {2:"b"} to the cache, we call cache.Get(1), which makes {1:"a"} more recently used than {2:"b"}. When we add {3:"c"} in the next step, {2:"b"} is evicted, so the next cache.Get(2) shouldn’t return a value.

If we run this program, we’ll be able to see this in action. We’ll get the following output:

$ go run lru.go
a true
Evicted: key=2 value=b
 false

The LRU cache is an excellent data structure to use as a global cache for most use cases, but it does have a limitation: at very high levels of concurrency—on the order of several million operations per second—it will start to experience some contention.

Unfortunately, at the time of this writing, Go still doesn’t seem to have a very high-throughput cache implementation.9

Efficient Synchronization

A commonly repeated Go proverb is “Don’t communicate by sharing memory; share memory by communicating.” In other words, channels are generally preferred over shared data structures.

This is a pretty powerful concept. After all, Go’s concurrency primitives—goroutines and channels—provide a powerful and expressive synchronization mechanism, such that a set of goroutines using channels to exchange references to data structures can often allow locks to be dispensed with altogether.

(If you’re a bit fuzzy on the details of channels and goroutines, don’t stress. Take a moment to flip back to “Goroutines”. It’s okay. I’ll wait.)

That being said, Go does provide more traditional locking mechanisms by way of the sync package. But if channels are so great, why would we want to use something like a sync.Mutex, and when would we use it?

Well, as it turns out, channels are spectacularly useful, but they’re not the solution to every problem. Channels shine when you’re working with many discrete values, and are the better choice for passing ownership of data, distributing units of work, or communicating asynchronous results. Mutexes, on the other hand, are ideal for synchronizing access to caches or other large stateful structures.

At the end of the day, no tool solves every problem. Ultimately, the best option is to use whichever is most expressive and/or simplest.

Share memory by communicating

Threading is easy; locking is hard.

In this section, we’re going to use a classic example—originally presented in Andrew Gerrand’s Go Blog article “Share Memory by Communicating”10—to demonstrate this truism and show how Go channels can make concurrency safer and easier to reason about.

Imagine, if you will, a hypothetical program that polls a list of URLs by sending each a GET request and waiting for the response. The catch is that each request can spend quite a bit of time waiting for the service to respond: anywhere from milliseconds to seconds (or more), depending on the service. Exactly the kind of operation that can benefit from a bit of concurrency, isn’t it?

In a traditional threading environment that depends on locking for synchronization, you might structure its data something like the following:

type Resource struct {
    url        string
    polling    bool
    lastPolled int64
}

type Resources struct {
    data []*Resource
    lock *sync.Mutex
}

As you can see, instead of having a simple slice of URL strings, we have two structs—Resource and Resources—which between them are already saddled with synchronization and bookkeeping fields beyond the URL strings we really care about.

To multithread the polling process in the traditional way, you might have a Poller function like the following running in multiple threads:

func Poller(res *Resources) {
    for {
        // Get the least recently polled Resource and mark it as being polled
        res.lock.Lock()

        var r *Resource

        for _, v := range res.data {
            if v.polling {
                continue
            }
            if r == nil || v.lastPolled < r.lastPolled {
                r = v
            }
        }

        if r != nil {
            r.polling = true
        }

        res.lock.Unlock()

        if r == nil {
            continue
        }

        // Poll the URL

        // Update the Resource's polling and lastPolled
        res.lock.Lock()
        r.polling = false
        r.lastPolled = time.Now().UnixNano()
        res.lock.Unlock()
    }
}

This does the job, but it has a lot of room for improvement. It’s about a page long, hard to read, hard to reason about, and doesn’t even include the URL polling logic or gracefully handle exhaustion of the Resources pool.

Now let’s take a look at the same functionality implemented using Go channels. In this example, Resource has been reduced to its essential component (the URL string), and Poller is a function that receives Resource values from an input channel and sends them to an output channel when they’re done:

type Resource string

func Poller(in, out chan *Resource) {
    for r := range in {
        // Poll the URL

        // Send the processed Resource to out
        out <- r
    }
}

It’s so…simple. We’ve completely shed the clunky locking logic in Poller, and our Resource data structure no longer contains bookkeeping data. In fact, all that’s left are the important parts.

But what if we wanted more than one Poller process? Isn’t that what we were trying to do in the first place? The answer is, once again, gloriously simple: goroutines. Take a look at the following:

for i := 0; i < numPollers; i++ {
    go Poller(in, out)
}

By executing numPollers goroutines, we’re creating numPollers concurrent processes, each reading from and writing to the same channels.

A lot has been omitted from the previous examples to highlight the relevant bits. For a walkthrough of a complete, idiomatic Go program that uses these ideas, see the “Share Memory by Communicating” Codewalk.
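
In the meantime, here’s one way the pieces might be wired together end to end. This is only a minimal sketch, with an illustrative URL list, a hard-coded poller count, and none of the retry or rate-limiting logic you’d want in practice:

package main

import (
    "fmt"
    "net/http"
)

type Resource string

func Poller(in, out chan *Resource) {
    for r := range in {
        resp, err := http.Get(string(*r)) // Poll the URL
        if err != nil {
            fmt.Println("Error polling", *r, ":", err)
        } else {
            resp.Body.Close()
            fmt.Println("Polled", *r, "-", resp.Status)
        }

        out <- r // Send the processed Resource to out
    }
}

func main() {
    urls := []string{"https://example.com", "https://example.org"} // Illustrative

    in := make(chan *Resource)
    out := make(chan *Resource)

    const numPollers = 2
    for i := 0; i < numPollers; i++ {
        go Poller(in, out)
    }

    // Feed the work queue...
    go func() {
        for _, u := range urls {
            r := Resource(u)
            in <- &r
        }
    }()

    // ...and collect the results. A real poller would loop forever,
    // re-enqueueing each Resource onto in after a suitable delay.
    for range urls {
        <-out
    }
}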

Reduce blocking with buffered channels

At some point in this chapter, you’ve probably thought to yourself, “Sure, channels are great, but writing to channels still blocks.” After all, every send operation on a channel blocks until there’s a corresponding receive, right? Well, as it turns out, this is only mostly true. At least, it’s true of default, unbuffered channels.

However, as we first described in “Channel buffering”, it’s possible to create channels that have an internal message buffer. Send operations on such buffered channels block only when the buffer is full, and receive operations block only when the buffer is empty.

You may recall that buffered channels can be created by passing an additional capacity parameter to the make function to specify the size of the buffer:

ch := make(chan type, capacity)
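
For instance, the following toy program demonstrates the effect of a buffer: with a capacity of two, both sends succeed immediately even though no receiver is ready yet.

package main

import "fmt"

func main() {
    ch := make(chan string, 2) // A buffered channel with a capacity of 2

    ch <- "first"  // Doesn't block: the buffer has room
    ch <- "second" // Doesn't block: the buffer is now full
    // A third send here would block until a receive frees up a slot.

    fmt.Println(<-ch) // "first"
    fmt.Println(<-ch) // "second"
}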

Buffered channels are especially useful for handling “bursty” loads. In fact, we already used this strategy in Chapter 5 when we initialized our FileTransactionLogger. Distilling some of the logic that’s spread through that chapter produces something like the following:

type FileTransactionLogger struct {
    events       chan<- Event       // Write-only channel for sending events
    lastSequence uint64             // The last used event sequence number
}

func (l *FileTransactionLogger) WritePut(key, value string) {
    l.events <- Event{EventType: EventPut, Key: key, Value: value}
}

func (l *FileTransactionLogger) Run() {
    events := make(chan Event, 16)              // Make a buffered events channel
    l.events = events                           // Expose it via the write-only field

    go func() {
        for range events {                      // Retrieve each Event (write logic elided)
            l.lastSequence++                    // Increment sequence number
        }
    }()
}

In this segment, we have a WritePut function that can be called to send a message to an events channel, which is received in the for loop inside the goroutine created in the Run function. If events was a standard channel, each send would block until the anonymous goroutine completed a receive operation. That might be fine most of the time, but if several writes came in faster than the goroutine could process them, then the upstream client would be blocked.

By using a buffered channel, we made it possible for this code to handle small bursts of up to 16 closely clustered write requests. Importantly, however, the 17th write would block.

It’s also important to consider that using buffered channels like this creates a risk of data loss should the program terminate before any consuming goroutines are able to clear the buffer.
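
One way to reduce that risk is to give the producer an explicit shutdown step that stops new writes and then waits for the consumer to drain the buffer. The following sketch illustrates the idea; the GracefulLogger type and its done channel are hypothetical additions, not part of the Chapter 5 implementation:

// GracefulLogger is a hypothetical logger variant whose Close method flushes
// any buffered events before returning.
type GracefulLogger struct {
    events       chan Event    // Bidirectional here, so the logger both sends and receives
    done         chan struct{} // Closed once the consumer goroutine has finished
    lastSequence uint64
}

func (l *GracefulLogger) Run() {
    l.events = make(chan Event, 16)
    l.done = make(chan struct{})

    go func() {
        defer close(l.done) // Signal that the buffer has been drained

        for range l.events { // Runs until events is closed and empty
            l.lastSequence++
            // Writing the event to the log is elided here
        }
    }()
}

// Close stops accepting writes and blocks until all buffered events have
// been consumed.
func (l *GracefulLogger) Close() {
    close(l.events) // No sends are allowed after this point
    <-l.done        // Wait for the goroutine to drain the buffer
}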

Minimizing locking with sharding

As lovely as channels are, as we mentioned in “Efficient Synchronization”, they don’t solve every problem. A common example of this is a large, central data structure, such as a cache, that can’t be easily decomposed into discrete units of work.11

When shared data structures have to be concurrently accessed, it’s standard to use a locking mechanism, such as the mutexes provided by the sync package, as we do in “Making Your Data Structure Concurrency-Safe”. For example, we might create a struct that contains a map and an embedded sync.RWMutex:

var cache = struct {
    sync.RWMutex
    data map[string]string
}{data: make(map[string]string)}

When a routine wants to write to the cache, it would carefully use cache.Lock to establish the write lock, and cache.Unlock to release the lock when it’s done. We might even want to wrap it in a convenience function as follows:

func ThreadSafeWrite(key, value string) {
    cache.Lock()                                    // Establish write lock
    cache.data[key] = value
    cache.Unlock()                                  // Release write lock
}

By design, this restricts write access to whichever routine happens to have the lock. This pattern generally works just fine. However, as we discussed in Chapter 4, as the number of concurrent processes acting on the data increases, the average amount of time that processes spend waiting for locks to be released also increases. You may remember the name for this unfortunate condition: lock contention.

While this might be resolved in some cases by scaling the number of instances, this also increases complexity and latency, as distributed locks need to be established and writes need to establish consistency. An alternative strategy for reducing lock contention around shared data structures within an instance of a service is vertical sharding, in which a large data structure is partitioned into two or more structures, each representing a part of the whole. Using this strategy, only a portion of the overall structure needs to be locked at a time, decreasing overall lock contention.

You may recall that we discussed vertical sharding in some detail in “Sharding”. If you’re unclear on vertical sharding theory or implementation, feel free to take some time to go back and review that section.
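
As a quick refresher, a condensed sketch in the spirit of that section (the shard count and hash function here are illustrative choices) might look like this:

package main

import (
    "fmt"
    "hash/fnv"
    "sync"
)

const nShards = 16 // Illustrative; tune to your workload

type Shard struct {
    sync.RWMutex               // Each shard gets its own lock...
    data map[string]string     // ...guarding its own slice of the keyspace
}

type ShardedMap []*Shard

func NewShardedMap() ShardedMap {
    shards := make(ShardedMap, nShards)
    for i := range shards {
        shards[i] = &Shard{data: make(map[string]string)}
    }
    return shards
}

// getShard hashes the key to decide which shard owns it.
func (m ShardedMap) getShard(key string) *Shard {
    h := fnv.New32a()
    h.Write([]byte(key))
    return m[h.Sum32()%nShards]
}

func (m ShardedMap) Set(key, value string) {
    shard := m.getShard(key)
    shard.Lock() // Only this shard is locked; the other 15 stay available
    defer shard.Unlock()
    shard.data[key] = value
}

func (m ShardedMap) Get(key string) (string, bool) {
    shard := m.getShard(key)
    shard.RLock()
    defer shard.RUnlock()
    value, ok := shard.data[key]
    return value, ok
}

func main() {
    m := NewShardedMap()
    m.Set("alpha", "a")
    fmt.Println(m.Get("alpha")) // "a true"
}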

Memory Leaks Can…fatal error: runtime: out of memory

Memory leaks are a class of bugs in which memory is not released even after it’s no longer needed. These bugs can be quite subtle and often plague languages like C++ in which memory is manually managed. But while garbage collection certainly helps by attempting to reclaim memory occupied by objects that are no longer in use by the program, garbage-collected languages like Go aren’t immune to memory leaks. Data structures can still grow unbounded, goroutines that never return can still accumulate, and even unstopped time.Ticker values can get away from you.

In this section, we’ll review a few common causes of memory leaks particular to Go and how to resolve them.

Leaking goroutines

I’m not aware of any actual data on the subject,12 but based purely on my own personal experience, I strongly suspect that goroutines are the single largest source of memory leaks in Go. Whenever a goroutine is executed, it’s initially allocated a small memory stack—2,048 bytes—that can be dynamically adjusted up or down as it runs to suit the needs of the process. The precise maximum stack size depends on a lot of things,13 but it’s essentially reflective of the amount of available physical memory.

Normally, when a goroutine returns, its stack is either deallocated or set aside for recycling.14 Whether by design or by accident, however, not every goroutine actually returns. For example:

func leaky() {
    ch := make(chan string)

    go func() {
        s := <-ch
        fmt.Println("Message:", s)
    }()
}

In the previous example, the leaky function creates a channel and executes a goroutine that reads from that channel. The leaky function returns without error, but if you look closely you’ll see that no values are ever sent to ch, so the goroutine will never return and its stack will never be deallocated. There’s even collateral damage: because the goroutine references ch, that value can’t be cleaned up by the garbage collector.

So we now have a bona fide memory leak. If such a function is called regularly, the total amount of memory consumed will slowly increase over time until it’s completely exhausted.

This is a contrived example, but there are good reasons why a programmer might want to create long-running goroutines, so it’s usually quite hard to know whether such a process was created intentionally.

So what do we do about this? Dave Cheney offers some excellent advice here: “You should never start a goroutine without knowing how it will stop…​. Every time you use the go keyword in your program to launch a goroutine, you must know how, and when, that goroutine will exit. If you don’t know the answer, that’s a potential memory leak.”15

This may seem like obvious, even trivial, advice, but it’s incredibly important. It’s all too easy to write functions that leak goroutines, and those leaks can be a pain to identify and find.
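
Applying that advice to the leaky example above, one common fix (just one of several reasonable approaches, and assuming the context package is imported alongside fmt) is to give the goroutine a second way out, such as a cancellable context:

// notLeaky behaves like leaky, but its goroutine is guaranteed to exit:
// it returns either when it receives a value or when ctx is canceled.
func notLeaky(ctx context.Context) {
    ch := make(chan string)

    go func() {
        select {
        case s := <-ch:
            fmt.Println("Message:", s)
        case <-ctx.Done():
            return // The goroutine (and therefore ch) can now be cleaned up
        }
    }()
}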

Forever ticking tickers

Often you’ll want to add some kind of time dimension to your Go code, to execute it at some point in the future or repeatedly at some interval, for example.

The time package provides two useful tools to add such a time dimension to Go code execution: time.Timer, which fires at some point in the future, and time.Ticker, which fires repeatedly at some specified interval.

However, where time.Timer has a finite useful life with a defined start and end, time.Ticker has no such limitation. A time.Ticker can live forever. Maybe you can see where this is going.

Both Timers and Tickers use a similar mechanism: each provides a channel that’s sent a value whenever it fires. The following example uses both:

func timely() {
    timer := time.NewTimer(5 * time.Second)
    ticker := time.NewTicker(1 * time.Second)

    done := make(chan bool)

    go func() {
        for {
            select {
            case <-ticker.C:
                fmt.Println("Tick!")
            case <-done:
                return
            }
        }
    }()

    <-timer.C
    fmt.Println("It's time!")
    close(done)
}

The timely function executes a goroutine that loops at regular intervals by listening for signals from ticker—which occur every second—or from a done channel that tells the goroutine to return. The line <-timer.C blocks until the 5-second timer fires, allowing done to be closed, triggering the case <-done condition and ending the loop.

The timely function completes as expected, and the goroutine has a defined return, so you could be forgiven for thinking that everything’s fine. There’s a particularly sneaky bug here, though: a running time.Ticker holds runtime resources that the garbage collector can’t reclaim until the ticker is stopped. Because we never stopped the ticker, timely contains a memory leak.

The solution: always be sure to stop your timers. A defer works quite nicely for this purpose:

func timelyFixed() {
    timer := time.NewTimer(5 * time.Second)
    ticker := time.NewTicker(1 * time.Second)
    defer ticker.Stop()                         // Be sure to stop the ticker!

    done := make(chan bool)

    go func() {
        for {
            select {
            case <-ticker.C:
                fmt.Println("Tick!")
            case <-done:
                return
            }
        }
    }()

    <-timer.C
    fmt.Println("It's time!")
    close(done)
}

By calling ticker.Stop(), we shut down the underlying Ticker, allowing it to be recovered by the garbage collector and preventing a leak.

On Efficiency

In this section, we covered a number of common methods for improving the efficiency of your programs, ranging from using an LRU cache rather than a map to constrain your cache’s memory footprint, to approaches for effectively synchronizing your processes, to preventing memory leaks. While these sections might not seem particularly closely connected, they’re all important for building programs that scale.

Of course, there are countless other methods that I would have liked to include as well but wasn’t able to given the fundamental limits imposed by time and space.

In the next section, we’ll change themes once again to cover some common service architectures and their effects on scalability. These might be a little less focused on Go specifically, but they’re critical for a study of scalability, especially in a cloud native context.

Service Architectures

The concept of the microservice first appeared in the early 2010s as a refinement and simplification of the earlier service-oriented architecture (SOA) and a response to the monoliths—server-side applications contained within a single large executable—that were then the most common architectural model of choice.16

At the time, the idea of the microservice architecture—a single application composed of multiple small services, each running in its own process and communicating with lightweight mechanisms—was revolutionary. Unlike monoliths, which require the entire application to be rebuilt and deployed for any change to the system, microservices were independently deployable by fully automated deployment mechanisms. This sounds small, even trivial, but its implications were (and are) vast.

If you ask most programmers to compare monoliths to microservices, most of the answers you get will probably be something about how monoliths are slow, sluggish, and bloated, while microservices are small and agile. Sweeping generalizations are always wrong, though, so let’s take a moment to ask ourselves whether this is true, and whether monoliths might sometimes be the right choice.

We will begin by defining what we mean when we talk about monoliths and microservices.

The Monolith System Architecture

In a monolith architecture, all of the functionally distinguishable aspects of a service are coupled together in one place. A common example is a web application whose UI, data layer, and business logic are all intermingled, often on a single server.

Traditionally, enterprise applications have been built in three main parts, as illustrated in Figure 7-3: a client-side interface running on the user’s machine, a relational database where all of the application’s data lives, and a server-side application that handles all user input, executes all business logic, and reads and writes data to the database.

Figure 7-3. In a monolith architecture, all of the functionally distinguishable aspects of a service are coupled together in one place.

At the time, this pattern made sense. All the business logic ran in a single process, making development easier, and you could even scale by running more monoliths behind a load balancer, usually using sticky sessions to maintain server affinity. Things were perfectly fine, and for many years this was by far the most common way of building web applications.

Even today, for relatively small or simple applications (for some definition of “small” and “simple”), this works perfectly well (though I still strongly recommend statelessness over server affinity).

However, as the number of features and general complexity of a monolith increases, difficulties start to arise:

  • Monoliths are usually deployed as a single artifact, so making even a small change generally requires a new version of the entire monolith to be built, tested, and deployed.

  • Despite even the best of intentions and efforts, monolith code tends to decrease in modularity over time, making it harder to make changes in one part of the service without affecting another part in unexpected ways.

  • Scaling the application means creating replicas of the entire application, not just the parts that need it.

The larger and more complex the monolith gets, the more pronounced these effects tend to become. By the early to mid-2000s, these issues were well known, leading frustrated programmers to experiment with breaking their big, complex services into smaller, independently deployable and scalable components. By 2012, this pattern even had a name: microservices architecture.

The Microservices System Architecture

The defining characteristic of a microservices architecture is a service whose functional components have been divided into a set of discrete subservices that can be independently built, tested, deployed, and scaled.

This is illustrated in Figure 7-4, in which a UI service—perhaps an HTML-serving web application or a public API—interacts with clients, but rather than handling the business logic locally, it makes secondary requests of one or more component services to handle some specific functionality. Those services might in turn even make further requests of yet more services.

Figure 7-4. In a microservices architecture, functional components are divided into discrete subservices.

While the microservices architecture has a number of advantages over the monolith, there are real costs to consider. On one hand, microservices provide some significant benefits:

  • A clearly defined separation of concerns supports and reinforces modularity, which can be very useful for larger or multiple teams.

  • Microservices should be independently deployable, making them easier to manage and making it possible to isolate errors and failures.

  • In a microservices system, each service can use the technology—language, development framework, data storage, etc.—that is most appropriate to its function.

These benefits shouldn’t be underestimated: the increased modularity and functional isolation of microservices tends to produce components that are themselves generally far more maintainable than a monolith with the same functionality. The resulting system isn’t just easier to deploy and manage but also easier to understand, reason about, and extend for a larger number of programmers and teams.

Warning

Mixing different technologies may sound appealing in theory, but use restraint. Each adds new requirements for tooling and expertise. The pros and cons of adopting a new technology—any new technology17—should always be carefully considered.

The discrete nature of microservices makes them far easier to maintain, deploy, and scale than monoliths. However, while these are real benefits that can pay real dividends, there are some downsides as well:

  • The distributed nature of microservices makes them subject to the fallacies of distributed computing (see Chapter 4), which makes them significantly harder to program and debug.

  • Sharing any kind of state between your services can often be extremely difficult.

  • Deploying and managing multiple services can be quite complex and tends to demand a high level of operational maturity.

So given these, which do you choose? The relative simplicity of the monolith or the flexibility and scalability of microservices? You might have noticed that most of the benefits of microservices pay off as the application gets larger or the number of teams working on it increases. For this reason, many authors advocate starting with a monolith and decomposing it later.

On a personal note, I’ll mention that I’ve never seen any organization successfully break apart a large monolith, but I’ve seen many try. That’s not to say it’s impossible, just that it’s hard. I can’t tell you whether you should start your system as microservices or with a monolith and break it up later. I’d certainly get a lot of angry emails if I tried. But please, whatever you do, stay stateless.

Serverless Architectures

Serverless computing is a pretty popular topic in web application architecture, and a lot of (digital) ink has been spilled about it. Much of this hype has been driven by the major cloud providers, which have invested heavily in serverlessness, but not all of it.

But what is serverless computing, really?

Well, as is often the case, it depends on who you ask. For the purposes of this book, however, we’re defining it as a form of utility computing in which some server-side logic, written by a programmer, is transparently executed in a managed ephemeral environment in response to some predefined trigger. This is also sometimes referred to as “functions as a service,” or “FaaS.” All of the major cloud providers offer FaaS implementations, such as AWS’s Lambda or GCP’s Cloud Functions.

Such functions are quite flexible and can be usefully incorporated into many architectures. In fact, as we’ll discuss shortly, entire serverless architectures can even be built that don’t use traditional services at all but are instead built entirely from FaaS resources and third-party managed services.

Be Suspicious of Hype

I may sound like a grizzled old dinosaur here, but I’ve learned to be wary of new technologies that nobody really understands but that claim to solve all of our problems.

According to the research and advisory firm Gartner, which specializes in studying IT and technology trends, serverless infrastructure is descending into the “Trough of Disillusionment”18 of its “hype cycle”.

Eventually, but inevitably, a technology in that trough climbs the “Slope of Enlightenment” and reaches the “Plateau of Productivity”: people figure out what it’s really useful for (not everything) and when to use it (not always). I’ve learned the hard way that it’s usually best to wait until a technology has entered these two later phases before investing heavily in its use.

That being said: serverless computing is intriguing, and it does seem appropriate for some use cases.

The pros and cons of serverlessness

As with any other architectural decision, the choice to go with a partially or entirely serverless architecture should be carefully weighed against all available options. While serverlessness provides some clear benefits—some obvious (no servers to manage!), others less so (cost and energy savings)—it’s very different from traditional architectures and carries its own set of downsides.

That being said, let’s start weighing, beginning with the advantages:

Operational management

Perhaps the most obvious benefit of serverless architectures is that there’s considerably less operational overhead.19 There are no servers to provision and maintain, no licenses to buy, and no software to install.

Scalability

When using serverless functions, it’s the provider—not the user—who’s responsible for scaling capacity to meet demand. As such, the implementor can spend less time and effort considering and implementing scaling rules.

Reduced costs

FaaS providers typically use a “pay-as-you-go” model, charging for the time and memory allocated only when the function is run. This can be considerably more cost-effective than deploying traditional services to (likely underutilized) servers.

Productivity

In a FaaS model, the unit of work is an event-driven function. This model tends to encourage a “function first” mindset, resulting in code that’s often simpler, more readable, and easier to test.

It’s not all roses, though. There are some real downsides to serverless architectures that need to be taken into consideration as well:

Startup latency

When a function is first called, it has to be “spun up” by the cloud provider. This typically takes less than a second, but in some cases can add 10 or more seconds to the initial request. This is known as the cold start delay. What’s more, if the function isn’t called for several minutes—the exact time varies between providers—it’s “spun down” by the provider so that it has to endure another cold start when it’s called again. This usually isn’t a problem if your function doesn’t have enough idle time to get spun down but can be a significant issue if your load is particularly “bursty.”

Observability

While most of the cloud vendors provide some basic monitoring for their FaaS offerings, it’s usually quite rudimentary. While third-party providers have been working to fill the void, the quality and quantity of data available from your ephemeral functions is often less than desired.

Testing

While unit testing tends to be pretty straightforward for serverless functions, integration testing is quite hard. It’s often difficult or impossible to simulate the serverless environment, and mocks are approximations at best.

Cost

Although the pay-as-you-go model can be considerably cheaper when demand is lower, there is a point at which this is no longer true. In fact, extremely high levels of load can grow to be quite expensive.

Clearly, there’s quite a lot to consider—on both sides—and while there is a great deal of hype around serverless at the moment, to some degree I think it’s merited. However, while serverlessness promises (and largely delivers) scalability and reduced costs, it does have quite a few gotchas, including, but not limited to, testing and debugging challenges. Not to mention the increased burden on operations around observability!20

Finally, as we’ll see in the next section, serverless architectures also require much more up-front planning than traditional architectures. While some people might call this a positive feature, it can add significant complexity.

Serverless services

As mentioned previously, FaaS functions are flexible enough to serve as the foundation of entire serverless architectures that don’t use traditional services at all but are instead built entirely from FaaS resources and third-party managed services.

Let’s take, as an example, the familiar three-tier system in which a client issues a request to a service, which in turn interacts with a database. A good example is the key-value store we started in Chapter 5, whose (admittedly primitive) monolithic architecture might look something like what’s shown in Figure 7-5.

Figure 7-5. The monolithic architecture of our primitive key-value store.

To convert this monolith into a serverless architecture, we’ll need to use an API gateway: a managed service that’s configured to expose specific HTTP endpoints and to direct requests to each endpoint to a specific resource—typically a FaaS function—that handles requests and issues responses. Using this architecture, our key-value store might look something like what’s shown in Figure 7-6.

Figure 7-6. An API gateway routes HTTP calls to serverless handler functions.

In this example, we’ve replaced the monolith with an API gateway that supports three endpoints: GET /v1/{key}, PUT /v1/{key}, and DELETE /v1/{key} (the {key} component indicates that this path segment will match any string and make the matched value available as key).

The API gateway is configured so that requests to each of its three endpoints are directed to a different handler function—getKey, putKey, and deleteKey, respectively—which performs all of the logic for handling that request and interacting with the backing database. Granted, this is an incredibly simple application and doesn’t account for things like authentication (which can be provided by a number of excellent third-party services like Auth0 or Okta), but some things are immediately evident.

First, there are a greater number of moving parts that you have to get your head around, which necessitates quite a bit more up-front planning and testing. For example, what happens if there’s an error in a handler function? What happens to the request? Does it get forwarded to some other destination, or is it perhaps sent to a dead-letter queue for further processing?

Do not underestimate the significance of this increase in complexity! Replacing in-process interactions with distributed, fully managed components tends to introduce a variety of problems and failure cases that simply don’t exist in the former. You may well have turned a relatively simple problem into an enormously complex one. Complexity kills; simplicity scales.

Second, with all of these different components, there’s a need for more sophisticated distributed monitoring than you’d need with a monolith or small microservices system. Because FaaS relies so heavily on the cloud provider, this may be challenging or, at least, awkward.

Finally, the ephemeral nature of FaaS means that all state, even short-lived optimizations like caches, has to be externalized to a database, an external cache (like Redis), or network file/object store (like S3). Again, this can be argued to be a Good Thing, but it does add to up-front complexity.

Summary

This was a difficult chapter to write, not because there isn’t much to say but because scalability is such a huge topic with so many different things I could have drilled down into. Every one of these battled in my brain for weeks. I even ended up throwing away some perfectly good architecture content that, in retrospect, simply wasn’t appropriate for this book. Fortunately, I was able to salvage a whole other chunk of work about messaging that ended up getting moved into Chapter 8. I think it’s happier there anyway.

In those weeks, I spent a lot of time thinking about what scalability really is and about the role that efficiency plays in it. Ultimately, I think that the decision to spend so much time on programmatic—rather than infrastructural—solutions to scaling problems was the right one.

All told, I think the end result is a good one. We certainly covered a lot of ground:

  • We reviewed the different axes of scaling and how scaling out is often the best long-term strategy.

  • We discussed state and statelessness and why application state is essentially “antiscalability.”

  • We learned a few strategies for efficient in-memory caching and avoiding memory leaks.

  • We compared and contrasted monolithic, microservice, and serverless architectures.

That’s quite a lot, and although I wish I’d been able to drill down in some more detail, I’m pleased to have been able to touch on the things I did.

1 Max Kanat-Alexander, Code Simplicity: The Fundamentals of Software (O’Reilly, 2012).

2 Honestly, if we had autoscaling in place, I probably wouldn’t even remember that this happened.

3 If you want to know more about cloud native infrastructure and architecture, a bunch of excellent books on the subject have already been written. I particularly recommend Cloud Native Infrastructure by Justin Garrison and Kris Nova, and Cloud Native Transformation by Pini Reznik et al. (both O’Reilly).

4 This is my definition. I acknowledge that it diverges from other common definitions.

5 Some cloud providers impose lower network I/O limits on smaller instances. Increasing the size of the instance may increase these limits in some cases.

6 If you have a better definition, let me know. I’m already thinking about the third edition.

7 I know I said the word state a bunch of times there. Writing is hard.

8 See also: idempotence.

9 However, if you’re interested in learning more about high-performance caching in Go, take a look at Manish Rai Jain’s excellent post on the subject, “The State of Caching in Go”, on the Dgraph Blog.

10 Andrew Gerrand, “Share Memory by Communicating”, The Go Blog, July 13, 2010. Portions of this section are modifications based on work created and shared by Google and used according to terms described in the Creative Commons 4.0 Attribution License.

11 You could probably shoehorn channels into a solution for interacting with a cache, but you might find it difficult to make it simpler than locking.

12 If you are, let me know!

13 Dave Cheney wrote an excellent article on this topic called “Why Is a Goroutine’s Stack Infinite?” that I recommend you take a look at if you’re interested in the dynamics of goroutine memory allocation.

14 There’s a good article by Vincent Blanchon on the subject of goroutine recycling entitled “Go: How Does Go Recycle Goroutines?”.

15 Dave Cheney, “Never Start a Goroutine without Knowing How It Will Stop”, dave.cheney.net, December 22, 2016.

16 Not that they’ve gone away.

17 Yes, even Go.

18 Yefim Natis et al., “Hype Cycle for Cloud Platform Services, 2022.” Gartner, Gartner Research, August 2022.

19 It’s right in the name!

20 Sorry, there’s no such thing as NoOps.

Chapter 8. Loose Coupling

We build our computers the way we build our cities—over time, without a plan, on top of ruins.1

Ellen Ullman, “The Dumbing-Down of Programming” (May 1998)

Coupling is one of those fascinating concepts that seems straightforward in theory but is actually quite challenging in practice. As we’ll discuss, there are lots of ways in which coupling can be introduced in a system, which means it’s also a big subject. As you might imagine, this chapter is an ambitious one, and we cover a lot of ground.

First, we’ll introduce the subject, diving more deeply into the concept of “coupling” and discussing the relative merits of “loose” versus “tight” coupling. We’ll present some of the most common coupling mechanisms and discuss how some kinds of tight coupling can lead to the dreaded “distributed monolith.”

Next, we’ll talk about interservice communications and how fragile exchange protocols are a common way of introducing tight coupling to distributed systems. We’ll cover some of the common protocols in use today to minimize the degree of coupling between two services.

In the third part, we’ll change direction for a bit, away from distributed systems and into the implementations of the services themselves. We’ll talk about services as code artifacts, subject to coupling resulting from mingling implementations and violating separation of concerns, and present the use of plug-ins as a way to dynamically add implementations.

Finally, we’ll close with a discussion of hexagonal architecture, an architectural pattern that makes loose coupling the central pillar of its design philosophy.

Throughout the chapter, we’ll do our best to balance theory, architecture, and implementation. Most of the chapter will be spent on the fun stuff: discussing a variety of strategies for managing coupling, particularly (but not exclusively) in the distributed context, and demonstrating by extending our example key-value store.

Coupling

Coupling is a rather romantic-sounding term describing the degree of direct knowledge between components. For example, a client that sends requests to a service is by definition coupled to that service. The degree of that coupling can vary considerably, however, falling anywhere between two extremes.

Tightly coupled components have a great deal of knowledge about one another. Perhaps both require the same version of a shared library to communicate, or maybe the client needs an understanding of the server’s architecture or database schema. It’s easy to build tightly coupled systems when optimizing for the short term, but they have a huge downside: the more tightly coupled two components are, the more likely it is that a change to one component will necessitate corresponding changes to the other. As a result, tightly coupled systems lose many of the benefits of a distributed architecture.

In contrast, loosely coupled components have minimal direct knowledge of one another. They’re relatively independent, typically interacting via a change-robust abstraction. Systems designed for loose coupling require more up-front planning, but they can be more freely upgraded, redeployed, or even entirely rewritten without greatly affecting the systems that depend on them. They also tend to be more easily testable because their components are easier to isolate and test (or mock) independently of one another.

Put simply, if you want to know how tightly coupled your system is, ask how many and what kind of changes can be made to one component without adversely affecting another.

Note

Some amount of coupling isn’t necessarily a bad thing, especially early in a system’s development. It can be tempting to overabstract and overcomplicate, but premature optimization is still the root of all evil.

Coupling in Different Computing Contexts

The term coupling in the computing context predates microservices and service-oriented architecture by quite a bit and has been used for many years to describe the degree of knowledge that one component has about another.

In programming, code can be tightly coupled when a dependent class directly references a concrete implementation instead of an abstraction (such as an interface; see “Interfaces”). In Go, this might be a function that requires an os.File when an io.Reader would do.
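
To make that concrete, consider this hypothetical pair of functions (assuming bufio, io, and os are imported). The first is tightly coupled to a concrete type; the second does the same work but accepts anything that can be read from, so it works equally well with a file, a network connection, or an in-memory buffer in a test:

// Tightly coupled: callers must hand us a real file on disk.
func countLinesInFile(f *os.File) (int, error) {
    return countLines(f) // *os.File satisfies io.Reader, so we can delegate
}

// Loosely coupled: callers can pass anything that satisfies io.Reader.
func countLines(r io.Reader) (int, error) {
    count := 0
    scanner := bufio.NewScanner(r)
    for scanner.Scan() {
        count++
    }
    return count, scanner.Err()
}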

Multiprocessor systems that communicate by sharing memory can be said to be tightly coupled. In a loosely coupled system, components are connected through a Message Transfer System (MTS) (see “Efficient Synchronization” for a refresher on how Go solves this problem with channels).

It’s important to note that there might, on occasion, be good reason to tightly couple certain components. Eliminating abstractions and other intermediate layers can reduce overhead, which can be a useful optimization if speed is a critical system requirement.

Since this book is largely about distributed architectures, we’ll focus on coupling between services that communicate across a network, but keep in mind that there are other ways that software can be tightly coupled to resources in its environment.

Coupling Takes Many Forms

There’s no limit to the ways components in a distributed system can find themselves tightly coupled. However, while these all share one fundamental flaw—they all depend on some property of another component that they wrongly assume won’t change—most can be grouped into a few broad classes according to the resource that they’re coupled to.

Shared dependencies

In 2016, Facebook’s Ben Christensen gave a talk at the Microservices Practitioner Summit where he spoke about an increasingly common mechanism for tightly coupling distributed services, introducing the term distributed monolith in the process.

Ben described an antipattern in which services were required to use specific libraries—and versions of libraries—in order to launch and interact with one another. Such systems find themselves saddled with a fleet-wide dependency, such that upgrading these shared libraries can force all services to have to upgrade in lockstep. This shared dependency has tightly coupled all of the services in the fleet.

Distributed Monoliths

In Chapter 7, we made the case that monoliths, at least for complex systems with multiple distinct functions, are (generally) less desirable, and microservices are (generally) the way to go.2 Of course, that’s easier said than done, in large part because it’s so easy to accidentally create a distributed monolith: a microservice-based system containing tightly coupled services.

In a distributed monolith, even small changes to one service can necessitate changes to others, often triggering unintended consequences. Services often can’t be deployed independently, so deployments have to be carefully orchestrated, and errors in one component can send faults rippling through the entire system. Rollbacks are functionally impossible.

In other words, a distributed monolith is a “worst of all worlds” system that pairs the management and complexity overhead of having multiple services with the dependencies and entanglements of a monolith, losing many of the benefits of microservices in the process. Avoid at all costs.

Fragile messaging protocols

Remember SOAP (Simple Object Access Protocol)? Statistically speaking, probably not.3 SOAP was a messaging protocol developed in the late 1990s that was designed for extensibility and implementation neutrality. SOAP services provided a contract that clients could follow to format their requests.4 The concept of the contract was something of a breakthrough at the time, but SOAP’s implementation was exceedingly fragile: if the contract changed in any way, the clients had to be updated along with it. In other words, SOAP clients were tightly coupled to their services.

It didn’t take long for people to realize that this was a problem, and SOAP quickly lost its shine. It’s since been largely replaced by REST, which, while a considerable improvement, can often introduce its own tight coupling. In 2016, Google released gRPC (gRPC Remote Procedure Calls),5 an open source framework with a number of useful features, including, importantly, allowing loose coupling between components.

We’ll discuss some of these more contemporary options in “Messaging Protocols”, where we’ll see how to use Go’s net/http package to build a REST/HTTP client and extend our key-value store with a gRPC frontend.

Shared point-in-time

Often systems are designed in such a way that clients expect an immediate response from services. Systems using this request-response messaging pattern implicitly assume that a service is present and ready to promptly respond. If it’s not, the request will fail. It can be said that they’re temporally coupled, or coupled in time.

Temporal coupling isn’t necessarily bad practice, though. It might even be preferable, particularly when there’s a human waiting for a timely response. We even detail how to construct such a client in the section “Request-Response Messaging”.

If the response isn’t necessarily time-constrained, then a safer approach may be to send messages to an intermediate queue that recipients can retrieve from when they’re ready, a messaging pattern commonly referred to as publish-subscribe messaging (“pub-sub” for short). We discuss this pattern in more detail in “Publish-Subscribe Messaging”.

Fixed addresses

It’s the nature of microservices that they need to talk to one another. But to do that, they first have to find one another. This process of locating services on a network is called service discovery, which we’ll discuss in a bit more detail in “Service Discovery”.

Traditionally, services lived at relatively fixed, well-known network locations that could be discovered by referencing some centralized registry. At first this took the form of manually maintained hosts files to map host names to IP addresses, but as networks scaled up so did the adoption of domain names and DNS.

Traditional DNS works well for long-lived services whose locations on the network rarely change, but the increased popularity of ephemeral, microservice-based applications has ushered in a world in which the lifespans of service instances are often measurable in seconds or minutes rather than months or years. In such dynamic environments, URLs and traditional DNS become just another form of tight coupling.

This need for dynamic, fluid service discovery has driven the adoption of entirely new strategies like service meshes, which use proxies to provide a dedicated infrastructure layer to facilitate service-to-service communications between distributed resources.

Note

Unfortunately, we won’t be able to cover the fascinating and fast-developing topic of service meshes in this book. But the service mesh field is rich, with a number of mature open source projects with active communities—such as Cilium, Linkerd, and Istio—and even commercial offerings like HashiCorp’s Consul.

Service Discovery

Service discovery makes it possible for services to find and communicate with one another without relying on hardcoded IP addresses or endpoints. It works by allowing services to register with a record of services called a service catalog that acts as a single source of truth that allows services to query and communicate with one another.

This approach provides a number of benefits over earlier approaches:

  • It supports scalability because it allows services to scale up or down without manual intervention. As instances are added or removed, the service registry updates automatically, ensuring that clients always have access to available instances.

  • It provides flexibility by decoupling service location from the service itself, providing an abstraction that allows services to move, be replaced, or upgraded without affecting clients.

  • It enhances resilience because it allows applications to gracefully handle service failures and automatically route traffic to healthy instances.

Service discovery is a cornerstone of loosely coupled, resilient cloud native applications. By abstracting service locations and enabling dynamic discovery, it allows applications to scale and adapt to changing environments seamlessly.

There are two primary kinds of service discovery mechanisms: client-side discovery and server-side discovery:

Client-side discovery

In client-side discovery, the client is responsible for determining the location of service instances. The client queries a service registry, retrieves the list of available instances, and selects one to use, typically using a load-balancing algorithm. Examples include Netflix Eureka and Consul.

Server-side discovery

In server-side discovery, the client makes a request to a service discovery server, which then queries the registry and forwards the request to an appropriate service instance. The client remains unaware of the service instance details. Examples are AWS Elastic Load Balancer (ELB) and Kubernetes.

Service discovery includes three distinct operations:

Service registration

When a service starts, that service registers itself with the service registry, which maintains an up-to-date mapping of service names to their corresponding instances and addresses. This registration can also include metadata such as service health, version, and load metrics, which can be used for intelligent routing and load balancing.

Service deregistration

When a service instance shuts down, it deregisters itself from the registry. Heartbeat mechanisms generally ensure that stale entries are also removed automatically in case of less graceful failures.

Service discovery

Clients or discovery servers query the service registry to get a list of available service instances. Depending on the mechanism (client-side or server-side), the actual service instance is selected and the request is routed accordingly.

By providing dynamic and automated service registration, deregistration, and discovery, service discovery mechanisms ensure that applications remain highly scalable, flexible, and resilient, adapting seamlessly to changes in the environment without manual intervention.

Messaging Protocols

Communication and message passing are critical functions of distributed systems, and all distributed systems depend on some form of messaging to receive instructions and directions, exchange information, and provide results and updates. Of course, a message is useless if the recipient can’t understand it.

In order for services to communicate, they must first establish an implicit or explicit contract that defines how messages will be structured. While such a contract is necessary, it also effectively couples the components that depend on it.

It’s actually very easy to introduce tight coupling this way, the degree of which is reflected in the protocol’s ability to change safely. Does it allow backward- and forward-compatible changes, like protocol buffers and gRPC, or do minor changes to the contract effectively break communications, as is the case with SOAP?

Messaging Patterns

Of course, the data exchange protocol and its contract aren’t the only variables in inter-service communications. There are, in fact, two broad classes of messaging patterns:

Request-response (synchronous)

A two-way message exchange in which a requester (the client) issues a request of a receiver (the service) and waits for a response. A textbook example is HTTP.

Publish-subscribe (asynchronous)

A one-way message exchange in which a requester (the publisher) issues a message, often via some kind of messaging middleware, which can be retrieved asynchronously and acted on by one or more services (consumers).

Each of these patterns has a variety of implementations and particular use cases, each with their own pros and cons. While we won’t be able to cover every possible nuance, we’ll do our best to provide a usable survey and some direction about how they may be implemented in Go.

There are other patterns, but these are the most common. For more details, and more detailed guidance on what kind to use for your particular use case, check out Sam Newman’s excellent book, Building Microservices, 2nd edition (O’Reilly).

Request-Response Messaging

As its name suggests, systems using a request-response, or synchronous, messaging pattern communicate using a series of coordinated requests and responses, in which a requester (or client) submits a request to a receiver (or service) and waits until the receiver responds (hopefully) with the requested data or service (see Figure 8-1).

The most obvious example of this pattern might be HTTP, which is so ubiquitous and well-established that it’s been extended beyond its original purpose and now underlies common messaging protocols like REST and GraphQL.

Figure 8-1. Systems using a request-response messaging pattern communicate using a series of coordinated requests and responses.

The request-response pattern has the advantages of being relatively easy to reason about and straightforward to implement, and has long been considered the default messaging pattern, particularly for public-facing services. However, it is also “point-to-point,” involving exactly one requester and receiver, and requires the requesting process to pause until it receives a response.

Together, these properties make the request-response pattern a good choice for straightforward exchanges between two points where a response can be expected in a reasonably short amount of time, but less than ideal when a message has to be sent to multiple receivers or when a response might take longer than a requester might want to wait.

Common request-response implementations

Over the years, a multitude of bespoke request-response protocols have been developed for any number of purposes. This has largely settled down, giving way to three major implementations:

REST

You’re likely already familiar with REST, which we discussed in some detail in “Building an HTTP Server with net/http”. REST has some things going for it. It’s human-readable and easy to implement, making it a good choice for outward-facing services (which is why we chose it in Chapter 5). We’ll discuss it a little more in “Issuing HTTP requests with net/http”.

Remote procedure calls

Remote procedure call (RPC) frameworks allow programs to execute procedures in a different address space, often on another computer. Go provides a standard Go-specific RPC implementation in the form of net/rpc. There are also two big language-agnostic RPC players: Apache Thrift and gRPC. While both are similar in design and usage goals, gRPC seems to have taken the lead with respect to adoption and community support. We’ll discuss gRPC in much more detail in “Remote procedure calls with gRPC”.

GraphQL

A relative newcomer on the scene, GraphQL is a query and manipulation language generally considered an alternative to REST and is particularly powerful when working with complex datasets. We don’t discuss GraphQL in much detail in this book, but I encourage you to look into it the next time you’re designing an outward-facing API.

Issuing HTTP requests with net/http

HTTP is perhaps the most common request-response protocol, particularly for public-facing services, underlying popular API formats like REST and GraphQL. If you’re interacting with an HTTP service, you’ll need some way to programmatically issue requests to the service and retrieve the response.

Fortunately, the Go standard library comes with excellent HTTP client and server implementations in the form of the net/http package. You may remember net/http from “Building an HTTP Server with net/http”, where we used it to build the first iteration of our key-value store.

The net/http package includes, among other things, convenience functions for issuing GET, HEAD, and POST methods. The following are the signatures for the first of these, http.Get and http.Head:

// Get issues a GET to the specified URL
func Get(url string) (*http.Response, error) {}

// Head issues a HEAD to the specified URL
func Head(url string) (*http.Response, error) {}

These functions are straightforward and are both used similarly: each accepts a string that represents the URL of interest, and each returns an error value and a pointer to an http.Response struct.

The http.Response struct is particularly useful because it contains all kinds of useful information about the service’s response to our request, including the returned status code and the response body.

A small selection of the http.Response struct is in the following:

type Response struct {
    Status     string       // e.g. "200 OK"
    StatusCode int          // e.g. 200

    // Header maps header keys to values.
    Header Header

    // Body represents the response body.
    Body io.ReadCloser

    // ContentLength records the length of the associated content. The
    // value -1 indicates that the length is unknown.
    ContentLength int64

    // Request is the request that was sent to obtain this Response.
    Request *Request
}

There are some useful things in there! Of particular interest is the Body field, which provides access to the HTTP response body. It’s an io.ReadCloser interface, which tells us two things: that the response body is streamed on demand as it’s read and that it has a Close method that we’re expected to call.

The following code block demonstrates several things: how to use the http.Get convenience function, how to close the response body, and how to use io.ReadAll to read the entire response body as a string (if you’re into that kind of thing):

package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    resp, err := http.Get("http://example.com")  // Send an HTTP GET
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()                      // Close your response!

    body, err := io.ReadAll(resp.Body)           // Read body as []byte
    if err != nil {
        panic(err)
    }

    fmt.Println(string(body))
}

In this example, we use the http.Get function to issue a GET request to the URL http://example.com, which returns a pointer to an http.Response struct and an error value.

As we mentioned previously, access to the HTTP response body is provided via the resp.Body variable, which implements io.ReadCloser. Note how we defer the call resp.Body.Close(). This is vital: failing to close your response body can sometimes lead to some unfortunate memory leaks.

Because Body implements io.Reader, we have various means to retrieve its data. In this case, we use the very reliable io.ReadAll, which conveniently returns the entire response body as a []byte slice, which we print as a string.

Warning

Always remember to use Close() to close your response body!

Not doing so can lead to some unfortunate memory leaks.

We’ve already seen the http package’s Get and Head functions, but how do we issue POSTs? Fortunately, similar convenience functions exist for that too. Two, in fact: http.Post and http.PostForm. The signatures for each of these are as follows:

// Post issues a POST to the specified URL
func Post(url, contentType string, body io.Reader) (*Response, error) {}

// PostForm issues a POST to the specified URL, with data's keys
// and values URL-encoded as the request body
func PostForm(url string, data url.Values) (*Response, error) {}

The first of these, Post, expects an io.Reader that provides the body—such as a file or a JSON object—of the post. We demonstrate how to upload JSON text in a POST in the following code:

package main

import (
    "fmt"
    "io"
    "net/http"
    "strings"
)

const json = `{ "name":"Matt", "age":44 }`   // This is our JSON

func main() {
    in := strings.NewReader(json)            // Wrap JSON with an io.Reader

    // Issue HTTP POST, declaring our content-type as "text/json"
    resp, err := http.Post("http://example.com/upload", "text/json", in)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()                  // Close your response!

    message, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }

    fmt.Println(string(message))
}
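
The second convenience function, http.PostForm, takes care of the URL encoding for you. Here’s a minimal sketch of how it might be used; the URL and form field names are hypothetical:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    // PostForm URL-encodes the values and sets the Content-Type header
    // to "application/x-www-form-urlencoded" on our behalf.
    resp, err := http.PostForm("http://example.com/form",
        url.Values{"name": {"Matt"}, "age": {"44"}})
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()                  // Close your response!

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }

    fmt.Println(string(body))
}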

A Possible Pitfall of Convenience Functions

We’ve been referring to the http package’s Get, Head, Post, and PostForm functions as “convenience functions,” but what does that mean?

It turns out that, under the hood, each of them is actually calling a method on a default *http.Client value, a concurrency-safe type that Go uses to manage the internals of communicating over HTTP.

The code for the Get convenience function, for example, is actually a call to the default client’s http.Client.Get method:

func Get(url string) (resp *Response, err error) {
    return DefaultClient.Get(url)
}

As you can see, when you use http.Get, you’re actually using http.DefaultClient. Because http.Client is concurrency safe, only one is needed: it’s predefined as a package variable and used as a singleton.

The source code for the creation of DefaultClient itself is somewhat plain, creating a zero-value http.Client:

var DefaultClient = &Client{}

Generally, this is perfectly fine. However, there’s a potential issue here, and it involves timeouts. The http.Client methods are capable of asserting timeouts that terminate long-running requests. This is super useful. Unfortunately, the default timeout value is 0, which Go interprets as “no timeout.”

Okay, so Go’s default HTTP client will never time out. Is that a problem? Usually not, but what if it connects to a server that doesn’t respond and doesn’t close the connection? The result would be an especially nasty and nondeterministic memory leak.

How do we fix this? Well, as it turns out, http.Client does support timeouts; we just have to enable that functionality by creating a custom Client and setting a timeout:

var client = &http.Client{
    Timeout: 10 * time.Second,
}
response, err := client.Get(url)
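
If you want to bound an individual request rather than the whole client, you can attach a context to the request instead. The following is a minimal sketch (note that this works with the default client, too):

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

// The request is abandoned if the context expires before a response arrives.
req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://example.com", nil)
if err != nil {
    panic(err)
}

resp, err := http.DefaultClient.Do(req)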

Take a look at the net/http package documentation for more information about http.Client and its available settings.

Remote procedure calls with gRPC

gRPC is an efficient, polyglot data exchange framework that was originally developed by Google as the successor to Stubby, a general-purpose RPC framework that had been in use internally at Google for more than a decade. It was open sourced in 2015 under the name gRPC and taken over by the Cloud Native Computing Foundation in 2017.

Unlike REST, which is essentially just a set of unenforced best practices, gRPC is a fully featured data exchange framework, which, like other RPC frameworks such as SOAP, Apache Thrift, Java RMI, and CORBA (to name a few), allows a client to execute specific methods implemented on different systems as if they were local functions.

This approach has a number of advantages over REST, including, but not limited to the following:

Conciseness

Its messages are more compact, consuming less network I/O.

Speed

Its binary exchange format is much faster to marshal and unmarshal.

Strong-typing

It’s natively strongly typed, eliminating a lot of boilerplate and removing a common source of errors.

Feature-rich

It has a number of built-in features such as authentication, encryption, timeout, and compression (to name a few) that you would otherwise have to implement yourself.

That’s not to say that gRPC is always the best choice. Compared to REST, gRPC has these downsides:

Contract-driven

gRPC’s contracts, which must be shared with clients ahead of time, make it less suitable for external-facing services.

Binary format

gRPC data isn’t human-readable, making it harder to inspect and debug.

Tip

gRPC is an immense and rich subject that this modest section can’t fully do justice to. If you’re interested in learning more, I recommend the official “Introduction to gRPC” and the excellent gRPC: Up and Running by Kasun Indrasiri and Danesh Kuruppu (O’Reilly).

Interface definition with protocol buffers

As is the case with most RPC frameworks, gRPC requires you to define a service interface. By default, gRPC uses protocol buffers for this purpose, though it’s possible to use an alternative, such as JSON, if you want.

To define a service interface, you describe, in a .proto file, the service methods that can be called remotely by a client, using the protocol buffer schema language. This is then compiled into language-specific interface code (Go code, in our case).

As illustrated in Figure 8-2, gRPC servers implement the resulting source code to handle client calls, while the client has a stub that provides the same methods as the server.

Figure 8-2. By default, gRPC uses protocol buffers as both its IDL and its underlying message interchange format; servers and clients can be written in any supported language.

Yes, this seems very hand-wavey and abstract right now. Keep reading for some more details!

Installing the protocol compiler

Before we proceed, we’ll first need to install the protocol buffer compiler, protoc, and the Go protocol buffers plug-in. We’ll use these to compile .proto files into Go service interface code:

  1. If you’re using Linux or macOS, the simplest and easiest way to install protoc is to use a package manager. To install it on a Debian-flavored Linux, you can use apt or apt-get:

    $ apt install -y protobuf-compiler
    $ protoc --version

    The easiest way to install protoc on macOS is to use Homebrew:

    $ brew install protobuf
    $ protoc --version
  2. Run the following command to install the Go protocol buffers plug-in:

    $ go install google.golang.org/protobuf/cmd/protoc-gen-go@latest

    The compiler plug-in protoc-gen-go will be installed in $GOBIN, defaulting to $GOPATH/bin. It must be in your $PATH for protoc to find it. (The --go-grpc_out option used later in this chapter additionally requires the protoc-gen-go-grpc plug-in, which is installed the same way from google.golang.org/grpc/cmd/protoc-gen-go-grpc.)

Warning

This book uses the proto3 version of the protocol buffers language. Be sure to check your version of protoc after installation to make sure that it supports proto3.

If you’re using another OS, your chosen package manager has an old version, or if you just want to make sure you have the latest and greatest, you can find the instructions for installing the precompiled binaries on gRPC’s Protocol Buffer Compiler Installation page.

The message definition structure

Protocol buffers are a language-neutral mechanism for serializing structured data. You can think of it as a binary version of XML. Protocol buffer data is structured as messages, where each message is a small record of information containing a series of name-value pairs called fields.

The first step when working with protocol buffers is to define the message structure by defining it in a .proto file. A basic example is presented in Example 8-1.

Example 8-1. An example .proto file; the message blocks define remote procedure payloads
syntax = "proto3";

option go_package = "github.com/cloud-native-go/ch08/point";

// Point represents a labeled position on a 2-dimensional surface
message Point {
  int32 x = 1;
  int32 y = 2;
  string label = 3;
}

// Line contains start and end Points
message Line {
  Point start = 1;
  Point end = 2;
  string label = 3;
}

// Polyline contains any number (including zero) of Points
message Polyline {
  repeated Point point = 1;
  string label = 2;
}

As you may have noticed, the protocol buffer syntax is reminiscent of C/C++, complete with its semicolon and commenting syntax.

The first line of the file specifies that you’re using proto3 syntax: if you don’t do this, the protocol buffer compiler will assume you’re using proto2. This must be the first nonempty, noncomment line of the file.

The second line uses the option keyword to specify the full import path of the Go package that’ll contain the generated code.

Finally, we have three message definitions, which describe the structure of the payload messages. In this example, we have three messages of increasing complexity:

Point

Contains x and y integer values, and a label string

Line

Contains exactly two Point values

Polyline

Uses the repeated keyword to indicate that it can contain any number of Point values

Each message contains zero or more fields that have a name and a type. Note that each field in a message definition has a field number that’s unique for that message type. These are used to identify fields in the message binary format and should not be changed once your message type is in use.

If this raises a “tight coupling” red flag in your mind, you get a gold star for paying attention. For this reason, protocol buffers provide explicit support for updating message types, including marking a field as reserved so that it can’t be accidentally reused.

This example is incredibly simple, but don’t let that fool you: protocol buffers are capable of some sophisticated encodings. See the Protocol Buffers Language Guide for more information.

The key-value message structure

So how do we make use of protocol buffers and gRPC to extend the example key-value store that we started in Chapter 5?

Let’s say that we want to implement gRPC equivalents to the Get, Put, and Delete functions we are already exposing via RESTful methods. The message formats for that might look something like the .proto file in Example 8-2.

Example 8-2. keyvalue.proto—the messages that will be passed to and from our key-value service procedures
syntax = "proto3";

option go_package = "github.com/cloud-native-go/ch08/keyvalue";

// GetRequest represents a request to the key-value store for the
// value associated with a particular key
message GetRequest {
  string key = 1;
}

// GetResponse represents a response from the key-value store for a
// particular value
message GetResponse {
  string value = 1;
}

// PutRequest represents a request to the key-value store for the
// value associated with a particular key
message PutRequest {
  string key = 1;
  string value = 2;
}

// PutResponse represents a response from the key-value store for a
// Put action.
message PutResponse {}

// DeleteRequest represents a request to the key-value store to delete
// the record associated with a key
message DeleteRequest {
  string key = 1;
}

// DeleteResponse represents a response from the key-value store for a
// Delete action.
message DeleteResponse {}
Tip

Don’t let the names of the message definitions confuse you: they represent messages (nouns) that will be passed to and from functions (verbs) that we’ll define in the next section.

In the .proto file, which we’ll call keyvalue.proto, we have three Request message definitions describing messages that will be sent from the client to the server, and three Response message definitions describing the server’s response messages.

You may have noticed that we don’t include error or status values in the message response definitions. As you’ll see in “Implementing the gRPC client”, these are unnecessary because they’re included in the return values of the gRPC client functions.

Defining our service methods

Now that we’ve completed our message definitions, we’ll need to describe the methods that’ll use them.

To do that, we extend our keyvalue.proto file, using the rpc keyword to define our service interfaces. Compiling the modified .proto file will generate Go code that includes the service interface code and client stubs (see Example 8-3).

Example 8-3. keyvalue.proto—the procedures for our key-value service
service KeyValue {
  rpc Get(GetRequest) returns (GetResponse);

  rpc Put(PutRequest) returns (PutResponse);

  rpc Delete(DeleteRequest) returns (DeleteResponse);
}
Tip

In contrast to the messages defined in Example 8-2, the rpc definitions represent functions (verbs) that will send and receive messages (nouns).

In this example, we add three methods to our service:

Get

Accepts a GetRequest and returns a GetResponse

Put

Accepts a PutRequest and returns a PutResponse

Delete

Accepts a DeleteRequest and returns a DeleteResponse

Note that we don’t actually implement the functionality here. We’ll do that later.

The previous methods are all examples of unary RPC definitions, in which a client sends a single request to the server and gets a single response back. This is the simplest of the four service method types. Various streaming modes are also supported, but these are beyond the scope of this simple primer. The gRPC documentation discusses these in more detail.

Compiling your protocol buffers

Now that you have a .proto file complete with message and service definitions, the next thing you need to do is generate the classes you’ll need to read and write messages. To do this, you need to run the protocol buffer compiler protoc on our keyvalue.proto.

If you haven’t installed the protoc compiler and Go protocol buffers plug-in, follow the directions in “Installing the protocol compiler” to do so.

Now you can run the compiler, specifying the source directory ($SOURCE_DIR) where your application’s source code lives (which defaults to the current directory), the destination directory ($DEST_DIR; often the same as $SOURCE_DIR), and the path to your keyvalue.proto. Because we want Go code, we use the --go_out option. protoc provides equivalent options to generate code for other supported languages as well.

In this case, we would invoke the following:

$ protoc --proto_path=$SOURCE_DIR \
    --go_out=$DEST_DIR --go_opt=paths=source_relative \
    --go-grpc_out=$DEST_DIR --go-grpc_opt=paths=source_relative \
    $SOURCE_DIR/keyvalue.proto

The go_opt and go-grpc_opt flags tell protoc to place the output files in the same relative directory as the input file. Our keyvalue.proto file results in two files, named keyvalue.pb.go and keyvalue_grpc.pb.go.

Without these flags, the output files would be placed in a directory named after the Go package’s import path. Our keyvalue.proto file, for example, would result in a file named github.com/cloud-native-go/ch08/keyvalue/keyvalue.pb.go.

Implementing the gRPC service

To implement our gRPC server, we’ll need to implement the generated service interface, which defines the server API for our key-value service. It can be found in keyvalue_grpc.pb.go as KeyValueServer:

type KeyValueServer interface {
    Get(context.Context, *GetRequest) (*GetResponse, error)
    Put(context.Context, *PutRequest) (*PutResponse, error)
    Delete(context.Context, *DeleteRequest) (*DeleteResponse, error)
}

As you can see, the KeyValueServer interface specifies our Get, Put, and Delete methods: each accepts a context.Context and a request pointer and returns a response pointer and an error.

Tip

As a side effect of its simplicity, it’s dead easy to mock requests to, and responses from, a gRPC server implementation.

To implement our server, we’ll make use of a generated struct that provides a default implementation for the KeyValueServer interface, which, in our case, is named UnimplementedKeyValueServer. It’s so named because it includes default “unimplemented” versions of all of our service methods, which look something like the following:

type UnimplementedKeyValueServer struct {}

func (*UnimplementedKeyValueServer) Get(context.Context, *GetRequest) (*GetResponse, error) {
    return nil, status.Errorf(codes.Unimplemented, "method not implemented")
}

By embedding the UnimplementedKeyValueServer, we’re able to implement our key-value gRPC server. This is demonstrated with the following code, in which we implement the Get method. The Put and Delete methods are omitted for brevity:

package main

import (
    "context"
    "log"
    "net"

    pb "github.com/cloud-native-go/ch08/keyvalue"
    "google.golang.org/grpc"
)

// server is used to implement KeyValueServer. It MUST embed the generated
// struct pb.UnimplementedKeyValueServer
type server struct {
    pb.UnimplementedKeyValueServer
}

func (s *server) Get(ctx context.Context, r *pb.GetRequest) (*pb.GetResponse, error) {

    log.Printf("Received GET key=%v", r.Key)

    // The local Get function is implemented back in Chapter 5
    value, err := Get(r.Key)

    // Return expects a GetResponse pointer and an err
    return &pb.GetResponse{Value: value}, err
}

func main() {
    // Create a gRPC server and register our KeyValueServer with it
    s := grpc.NewServer()
    pb.RegisterKeyValueServer(s, &server{})

    // Open a listening port on 50051
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("failed to listen: %v", err)
    }

    // Start accepting connections on the listening port
    if err := s.Serve(lis); err != nil {
        log.Fatalf("failed to serve: %v", err)
    }
}

In this snippet, we implement and start our service in four steps:

  1. Create the server struct. Our server struct embeds pb.UnimplementedKeyValueServer. This is not optional: gRPC requires your server struct to similarly embed its generated UnimplementedXXXServer.

  2. Implement the service methods. We implement the service methods defined in the generated pb.KeyValueServer interface. Conveniently, because the pb.UnimplementedKeyValueServer includes stubs for all of these service methods, we don’t have to implement them all right away.

  3. Register our gRPC server. In the main function, we create a new instance of the server struct and register it with the gRPC framework. This is similar to how we registered handler functions in “Building an HTTP Server with net/http”, except we register an entire instance rather than individual functions.

  4. Start accepting connections. Finally, we open a listening port using net.Listen, which we pass to the gRPC framework via s.Serve to begin listening.

It could be argued that gRPC provides the best of both worlds by providing the freedom to implement any desired functionality without having to be concerned with building many of the tests and checks usually associated with a RESTful service.
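
As the earlier tip suggested, this design also makes the handler easy to test: because Get accepts nothing but a context and a request struct, we can call it directly, without starting a server or touching the network. The following is a minimal sketch, assuming the package-level Get and Put functions from Chapter 5 live in the same package:

package main

import (
    "context"
    "testing"

    pb "github.com/cloud-native-go/ch08/keyvalue"
)

func TestGet(t *testing.T) {
    // Seed the underlying store using the local Put from Chapter 5.
    if err := Put("test-key", "test-value"); err != nil {
        t.Fatal(err)
    }

    // Call the handler directly; no gRPC machinery required.
    s := &server{}
    resp, err := s.Get(context.Background(), &pb.GetRequest{Key: "test-key"})
    if err != nil {
        t.Fatal(err)
    }

    if resp.Value != "test-value" {
        t.Errorf("expected %q, got %q", "test-value", resp.Value)
    }
}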

Implementing the gRPC client

Because all of the client code is generated, making use of a gRPC client is fairly straightforward.

The generated client interface will be named XXXClient, which in our case will be KeyValueClient, shown in the following:

type KeyValueClient interface {
    Get(ctx context.Context, in *GetRequest, opts ...grpc.CallOption) (*GetResponse, error)

    Put(ctx context.Context, in *PutRequest, opts ...grpc.CallOption) (*PutResponse, error)

    Delete(ctx context.Context, in *DeleteRequest, opts ...grpc.CallOption) (*DeleteResponse, error)
}

All of the methods described in our source .proto file are specified here, each accepting a request type pointer and returning a response type pointer and an error.

Additionally, each of the methods accepts a context.Context (if you’re rusty on what this is or how it’s used, take a look at “The Context Package”) and zero or more instances of grpc.CallOption. CallOption is used to modify the behavior of the client when it executes its calls. More detail on this can be found in the gRPC API documentation.
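
For example, an individual call can be tuned by passing options as trailing arguments. The following one-liner is a minimal sketch that reuses the client, ctx, and pb names from the example below; grpc.WaitForReady is one such option:

// WaitForReady makes the call block until the connection is ready,
// rather than failing immediately if it isn't.
r, err := client.Get(ctx, &pb.GetRequest{Key: "foo"}, grpc.WaitForReady(true))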

I demonstrate how to create and use a gRPC client in the following:

package main

import (
    "context"
    "log"
    "os"
    "strings"
    "time"

    pb "github.com/cloud-native-go/ch08/keyvalue"
    "google.golang.org/grpc"
)

func main() {
    // Set up a connection to the gRPC server
    conn, err := grpc.Dial("localhost:50051",
        grpc.WithInsecure(), grpc.WithBlock(), grpc.WithTimeout(time.Second))
    if err != nil {
        log.Fatalf("did not connect: %v", err)
    }
    defer conn.Close()

    // Get a new instance of our client
    client := pb.NewKeyValueClient(conn)

    var action, key, value string

    // Expect something like "set foo bar"
    if len(os.Args) > 2 {
        action, key = os.Args[1], os.Args[2]
        value = strings.Join(os.Args[3:], " ")
    }

    // Use context to establish a 1-second timeout.
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()

    // Call client.Get() or client.Put() as appropriate.
    switch action {
    case "get":
        r, err := client.Get(ctx, &pb.GetRequest{Key: key})
        if err != nil {
            log.Fatalf("could not get value for key %s: %v\n", key, err)
        }
        log.Printf("Get %s returns: %s", key, r.Value)

    case "put":
        _, err := client.Put(ctx, &pb.PutRequest{Key: key, Value: value})
        if err != nil {
            log.Fatalf("could not put key %s: %v\n", key, err)
        }
        log.Printf("Put %s", key)

    default:
        log.Fatalf("Syntax: go run [get|put] KEY VALUE...")
    }
}

The preceding example parses command-line values to determine whether it should do a Get or a Put operation.

First, it establishes a connection with the gRPC server using the grpc.Dial function, which takes a target address string, and one or more grpc.DialOption arguments that configure how the connection gets set up. In our case we use the following:

WithInsecure

Disables transport security for this ClientConn. Don’t use insecure connections in production.

WithBlock

Makes Dial block until a connection is established, otherwise the connection will occur in the background.

WithTimeout

Makes a blocking Dial return an error if it takes longer than the specified amount of time.

Next, it uses NewKeyValueClient to get a new KeyValueClient, and gets the various command-line arguments.

Finally, based on the action value, we call either client.Get or client.Put, both of which return an appropriate return type and an error.

Once again, these functions look and feel exactly like local function calls. No checking status codes, hand-building our own clients, or any other funny business.

Publish-Subscribe Messaging

In the request-response configuration, the client makes an implicit assumption that a service is present, listening, and ready to promptly respond. However, if any of those things aren’t true, the request will fail. Sometimes this might be perfectly acceptable, but if the response isn’t necessarily tightly time-bound, then publish-subscribe messaging might be an appropriate choice.

In publish-subscribe (“pub-sub” for short) messaging, a sender, called a publisher, sends a message to a central message broker (sometimes called an event bus) that forwards the message to one or more recipients, called subscribers. Clients may choose to publish to the broker to send data, subscribe to the broker to receive data, or both.

This arrangement is illustrated in Figure 8-3.

Figure 8-3. Message brokers mediate communication between message publishers and message subscribers.

In other words, producers and consumers are loosely coupled. They’re effectively unaware of each other’s existence, which allows for independent evolution and deployment of services.

Messages versus events

While the terms message and event are sometimes used interchangeably, they have distinct meanings. A message typically conveys a piece of data from one system to another, often as part of a larger workflow. An event, on the other hand, signifies that something has happened, usually triggering further actions or processing.

In this context, an event stream is a sequence of events that are ordered and time-stamped, often used to track changes over time in a system. For a deeper dive into event-driven architectures, consider Adam Bellemare’s Building Event-Driven Microservices (O’Reilly), which provides comprehensive insights into this topic.

Asynchronous communication

One of the key advantages of publish-subscribe messaging is its asynchronous nature. Unlike request-response models, where communication is synchronous and time-bound, publish-subscribe allows messages to be sent and received independently of the producer’s and consumer’s state. This decoupling of time makes the system more resilient and flexible.

Middleware and message brokers

Publish-subscribe messaging generally relies on middleware, commonly referred to as message brokers or event brokers. These brokers mediate communication between producers and consumers, handling tasks such as message routing, delivery, and persistence. Examples of popular message brokers include Apache Kafka, RabbitMQ, and Google Cloud Pub/Sub.

Consumer processing

Consumers in a publish-subscribe system can be either short-running or long-running processes. Short-running consumers, often implemented as FaaS, handle individual messages quickly and scale horizontally based on demand. Long-running consumers, on the other hand, maintain persistent connections to the broker and handle streams of messages continuously.
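
To make the shape of the pattern concrete, the following is a purely illustrative, in-process sketch of a broker built on Go channels. It’s not a substitute for a real broker like Kafka or RabbitMQ (there’s no persistence, delivery guarantees, or consumer groups), but it shows the essential decoupling: publishers and subscribers only ever talk to the broker, never to each other:

package main

import (
    "fmt"
    "sync"
)

// Broker fans each published message out to every subscriber of a topic.
type Broker struct {
    mu   sync.RWMutex
    subs map[string][]chan string
}

func NewBroker() *Broker {
    return &Broker{subs: make(map[string][]chan string)}
}

// Subscribe returns a channel on which messages for topic will be delivered.
func (b *Broker) Subscribe(topic string) <-chan string {
    ch := make(chan string, 16)

    b.mu.Lock()
    defer b.mu.Unlock()
    b.subs[topic] = append(b.subs[topic], ch)

    return ch
}

// Publish sends msg to every current subscriber of topic.
func (b *Broker) Publish(topic, msg string) {
    b.mu.RLock()
    defer b.mu.RUnlock()

    for _, ch := range b.subs[topic] {
        ch <- msg
    }
}

func main() {
    b := NewBroker()
    events := b.Subscribe("greetings")

    go b.Publish("greetings", "hello, world")

    fmt.Println(<-events) // "hello, world"
}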

Loose Coupling Local Resources with Plug-ins

At first glance, the topic of loose coupling of local—as opposed to remote or distributed—resources might seem mostly irrelevant to a discussion of “cloud native” technologies. But you might be surprised how often such patterns come in handy.

For example, it’s often useful to build services or tools that can accept data from different kinds of input sources (such as a REST interface, a gRPC interface, and a chatbot interface) or generate different kinds of outputs (such as generating different kinds of logging or metric formats). As an added bonus, designs that support such modularity can also make mocking resources for testing dead simple.

As we’ll see in “Hexagonal Architecture”, entire software architectures have even been built around this concept.

No discussion of loose coupling would be complete without a review of plug-in technologies.

In-Process Plug-ins with the plugin Package

Go provides a native plug-in system in the form of the standard plugin package. This package is used to open and access Go plug-ins, but it isn’t needed to build the plug-ins themselves.

As we’ll demonstrate in the following, the requirements for building and using a Go plug-in are pretty minimal. It doesn’t even have to know it’s a plug-in or import the plugin package. A Go plug-in has three real requirements: it must be in the main package, it must export one or more functions or variables, and it must be compiled using the -buildmode=plugin build flag. That’s it, really.

Go Plug-in Caveats

Before we get too deep into the subject of Go plug-ins, it’s important to mention some caveats up front:

  • As of Go version 1.22.5, Go plug-ins are supported on only Linux, FreeBSD, and macOS.

  • The version of Go used to build a plug-in must match the program that’s using it exactly. Plug-ins built with Go 1.22.4 won’t work with Go 1.22.5.

  • Similarly, the versions of any packages used by both the plug-in and program must also match exactly.

  • Finally, building plug-ins forces CGO_ENABLED, making cross-compiling more complicated.

These conditions make Go plug-ins most appropriate when you’re building plug-ins for use within the same codebase, but they add considerable obstacles to creating distributable plug-ins.

Plug-in vocabulary

Before we continue, we need to define a few terms that are particular to plug-ins. Each of the following describes a specific plug-in concept, and each has a corresponding type or function implementation in the plugin package. We’ll go into all of these in more detail in our example:

Plug-in

A plug-in is a Go main package with one or more exported functions and variables that has been built with the -buildmode=plugin build flag. It’s represented in the plugin package by the Plugin type.

Open

Opening a plug-in is the process of loading it into memory, validating it, and discovering its exposed symbols. A plug-in at a known location in the file system can be opened using the Open function, which returns a *Plugin value:

func Open(path string) (*Plugin, error) {}
Symbol

A plug-in symbol is any variable or function that’s exported by the plug-in’s package. Symbols can be retrieved by “looking them up” and are represented in the plugin package by the Symbol type:

type Symbol any
Look up

Looking up describes the process of searching for and retrieving a symbol exposed by a plug-in. The plugin package’s Lookup method provides that functionality and returns a Symbol value:

func (p *Plugin) Lookup(symName string) (Symbol, error) {}

In the next section, we present a toy example that demonstrates how these resources are used, and dig into a little detail in the process.

A toy plug-in example

You can learn only so much from a review of the API, even one as minimal as the plugin package. So, let’s build ourselves a toy example: a program that tells you about various animals, as implemented by plug-ins.

For this example, we’ll be creating three independent packages with the following package structure:

~/cloud-native-go/ch08/go-plugin
├── duck
│   └── duck.go
├── frog
│   └── frog.go
└── main
    └── main.go

The duck/duck.go and frog/frog.go files each contain the source code for one plug-in. The main/main.go file contains our example’s main function, which will load and use the plug-ins we’ll generate by building frog.go and duck.go.

The complete source code for this example is available in this book’s companion GitHub repository.

The Sayer interface

For a plug-in to be useful, the functions that access it need to know what symbols to look up and what contract those symbols conform to.

One convenient—but by no means required—way to do this is to use an interface that a symbol can be expected to satisfy. In our particular implementation, our plug-ins will expose just one symbol—Animal—which we’ll expect to conform to the following Sayer interface:

type Sayer interface {
    Says() string
}

This interface describes only one method, Says, which returns a string that says what an animal says.

The Go plugin code

We have source for two separate plug-ins in duck/duck.go and frog/frog.go. In the following snippet, the first of these, duck/duck.go, is shown in its entirety and displays all of the requirements of a plug-in implementation:

package main

type duck struct{}

func (d duck) Says() string {
    return "quack!"
}

// Animal is exported as a symbol.
var Animal duck

As described in the introduction to this section, the requirements for a Go plug-in are really, really minimal: it just has to be a main package that exports one or more variables or functions.

The previous plug-in code describes and exports just one feature—Animal—that satisfies the preceding Sayer interface. Recall that exported package variables and functions are exposed on the plug-in as shared library symbols that can be looked up later. In this case, our code will have to look specifically for the exported Animal symbol.

In this example we have only one symbol, but there’s no explicit limit to the number of symbols we can have. We could have exported many more features, if we wanted to. We won’t show the frog/frog.go file here because it’s essentially the same. But it’s important to know that the internals of a plug-in don’t matter as long as it satisfies the expectations of its consumer. These are the expectations:

  • The plug-in exposes a symbol named Animal.

  • The Animal symbol adheres to the contract defined by the Sayer interface.

Building the plug-ins

Building a Go plug-in is similar to building any other Go main package, except that you have to include the -buildmode=plugin build parameter.

To build our duck/duck.go plug-in code, we do the following:

$ go build -buildmode=plugin -o duck/duck.so duck/duck.go

The result is a shared object (.so) file: an ELF (Executable and Linkable Format) file on Linux, or a Mach-O dynamically linked library on macOS, where this particular example was built:

$ file duck/duck.so
duck/duck.so: Mach-O 64-bit dynamically linked shared library x86_64

Shared library formats like ELF and Mach-O are commonly used for plug-ins because once they’re loaded into memory, they expose symbols in a way that allows for easy discovery and access.

Using our Go plug-ins

Now that we’ve built our plug-ins, which are patiently sitting there with their .so extensions, we need to write some code that’ll load and use them. Note that even though we have our plug-ins fully built and in place, we haven’t had to reach for the plugin package yet. However, now that we want to actually use our plug-ins, we get to change that.

The process of finding, opening, and consuming a plug-in requires several steps, which I demonstrate next.

Import the plugin package

First things first: we have to import the plugin package, which will provide us the tools we need to open and access our plug-ins.

In this example, we import four packages: fmt, log, os, and, most relevant to this example, plugin:

import (
    "fmt"
    "log"
    "os"
    "plugin"
)

Find our plug-in

To load a plug-in, we have to find its relative or absolute file path. For this reason, plug-in binaries are usually named according to some pattern and placed somewhere where they can be easily discovered, like the user’s command path or other standard fixed location.

For simplicity, our implementation assumes that our plug-in has the same name as the user’s chosen animal and lives in a path relative to the execution location:

if len(os.Args) != 2 {
    log.Fatal("usage: run main/main.go animal")
}

// Get the animal name, and build the path where we expect to
// find the corresponding shared object (.so) file.
name := os.Args[1]
module := fmt.Sprintf("./%s/%s.so", name, name)

Importantly, this approach means that our plug-in doesn’t need to be known—or even exist—at compile time. In this manner, we’re able to implement whatever plug-ins we want at any time, and load and access them dynamically as we see fit.

Open our plug-in

Now that we think we know our plug-in’s path, we can use the Open function to “open” it, loading it into memory and discovering its available symbols. The Open function returns a *Plugin value that can then be used to look up any symbols exposed by the plug-in:

// Open our plugin and get a *plugin.Plugin.
p, err := plugin.Open(module)
if err != nil {
    log.Fatal(err)
}

When a plug-in is first opened by the Open function, the init functions of all packages that aren’t already part of the program are called. The package’s main function is not run.

When a plug-in is opened, a single canonical *Plugin value representation of it is loaded into memory. If a particular path has already been opened, subsequent calls to Open will return the same *Plugin value.

A plug-in can’t be loaded more than once and can’t be closed.

Look up your symbol

To retrieve a variable or function exported by our package—and therefore exposed as a symbol by the plug-in—we have to use the Lookup method to find it. Unfortunately, the plugin package doesn’t provide any way to list all of the symbols exposed by a plug-in, so we have to know the name of our symbol ahead of time:

// Lookup searches for a symbol named "Animal" in plug-in p.
symbol, err := p.Lookup("Animal")
if err != nil {
    log.Fatal(err)
}

If the symbol exists in the plug-in p, then Lookup returns a Symbol value. If the symbol doesn’t exist in p, then a non-nil error is returned instead.

Assert and use your symbol

Now that we have our Symbol, we can convert it into the form we need and use it however we want. To make things nice and easy for us, the Symbol type is essentially a rebranded any value. From the plugin source code:

type Symbol any

This means that as long as we know what our symbol’s type is, we can use type assertion to coerce it into a concrete type value that can be used however we see fit:

// Asserts that the symbol interface holds a Sayer.
animal, ok := symbol.(Sayer)
if !ok {
    log.Fatal("that's not a Sayer")
}

// Now we can use our loaded plug-in!
fmt.Printf("A %s says: %q\n", name, animal.Says())

In the previous code, we assert that the symbol value satisfies the Sayer interface. If it does, we print what our animal says. If it doesn’t, we’re able to exit gracefully.

Executing our example

Now that we’ve written our main code that attempts to open and access the plug-in, we can run it like any other Go main package, passing the animal name in the arguments:

$ go run main/main.go duck
A duck says: "quack!"

$ go run main/main.go frog
A frog says: "ribbit!"

We can even implement arbitrary plug-ins later without changing our main source code:

$ go run main/main.go fox
A fox says: "ring-ding-ding-ding-dingeringeding!"

HashiCorp’s Go Plug-in System Over RPC

HashiCorp’s Go plugin system has been in wide use—both internally by HashiCorp and elsewhere—since at least 2016, predating the release of Go’s standard plugin package by about a year.

Unlike Go plug-ins, which use shared libraries, HashiCorp’s plug-ins are standalone processes that are executed by using exec.Command, which has some obvious benefits over shared libraries:

They can’t crash your host process

Because they’re separate processes, a panic in a plug-in doesn’t automatically crash the plug-in consumer.

They’re more version-flexible

Go plug-ins are famously version-specific. HashiCorp plug-ins are far less so, expecting only that plug-ins adhere to a contract. HashiCorp plug-ins also support explicit protocol versioning.

They’re relatively secure

HashiCorp plug-ins have access to only the interfaces and parameters passed to them, as opposed to the entire memory space of the consuming process.

They do have a couple of downsides, though:

More verbose

HashiCorp plug-ins require more boilerplate than Go plug-ins.

Lower performance

Because all data exchange with HashiCorp plug-ins occurs over RPC, they’re generally less performant than Go plug-ins.

That being said, let’s take a look at what it takes to assemble a simple plug-in.

Another toy plug-in example

So we can compare apples to apples, we’re going to work through a toy example that’s functionally identical to the one for the standard plugin package in “A toy plug-in example”: a program that tells you what various animals say. As before, we’ll be creating several independent packages with the following structure:

~/cloud-native-go/ch08/hashicorp-plugin
├── commons
│   └── commons.go
├── duck
│   └── duck.go
└── main
    └── main.go

As before, the duck/duck.go file contains the source code for a plug-in, and the main/main.go file contains our example’s main function that loads and uses the plug-in. Because both of these are independently compiled to produce executable binaries, both files are in the main package.

The commons package is new. It contains some resources that are shared by the plug-in and the consumer, including the service interface and some RPC boilerplate.

As before, the complete source code for this example is available in this book’s companion GitHub repository.

Common code

The commons package contains some resources that are shared by both the plug-in and the consumer, so in our example it’s imported by both the plug-in and client code.

The package contains the RPC stubs that are used by the underlying net/rpc machinery to define the service abstraction for the host and allow the plug-ins to construct their service implementations.

The Sayer interface

The first of these stubs is the Sayer interface. This is our service interface, which provides the service contract that the plug-in service implementations must conform to and that the host can expect.

It’s identical to the interface that we used in “The Sayer interface”:

type Sayer interface {
    Says() string
}

The Sayer interface describes only one method: Says. Although this code is shared, as long as this interface doesn’t change, the shared contract will be satisfied and the degree of coupling is kept fairly low.

The SayerPlugin struct

The more complex of the common resources is the Sa⁠ye⁠r​Pl⁠ug⁠in struct, shown in the following. It’s an implementation of plugin.Plugin, the primary plug-in interface from the github.com/hashicorp/go-plugin package.

Warning

The package declaration inside the github.com/hashicorp/go-plugin repository is plugin, not go-plugin, as its path might suggest. Adjust your imports accordingly!

The Client and Server methods are used to describe our service according to the expectations of Go’s standard net/rpc package. We won’t cover that package in this book, but if you’re interested, you can find a wealth of information in the Go documentation:

type SayerPlugin struct {
    Impl Sayer
}

func (SayerPlugin) Client(b *plugin.MuxBroker, c *rpc.Client) (any, error) {

    return &SayerRPC{client: c}, nil
}

func (p *SayerPlugin) Server(*plugin.MuxBroker) (any, error) {
    return &SayerRPCServer{Impl: p.Impl}, nil
}

Both methods accept a plugin.MuxBroker, which is used to create multiplexed streams on a plug-in connection. While useful, this is a more advanced use case that we won’t have time to cover in this book.

The SayerRPC client implementation

SayerPlugin’s Client method provides an implementation of our Sayer interface that communicates over an RPC client—the appropriately named SayerRPC struct—shown in the following:

type SayerRPC struct{ client *rpc.Client }

func (g *SayerRPC) Says() string {
    var resp string

    err := g.client.Call("Plugin.Says", new(any), &resp)
    if err != nil {
        panic(err)
    }

    return resp
}

SayerRPC uses Go’s RPC framework to remotely call the Says method implemented in the plug-in. It invokes the Call method attached to the *rpc.Client, passing in any parameters (Says doesn’t have any parameters, so we pass an empty any), and retrieves the response, which it puts into the resp string.

The handshake configuration

HandshakeConfig is used by both the plug-in and host to do a basic handshake between the host and the plug-in. If the handshake fails—if the plug-in was compiled with a different protocol version, for example—a user-friendly error is shown. This prevents users from executing bad plug-ins or executing a plug-in directly. Importantly, this is a UX feature, not a security feature:

var HandshakeConfig = plugin.HandshakeConfig{
    ProtocolVersion:  1,
    MagicCookieKey:   "BASIC_PLUGIN",
    MagicCookieValue: "hello",
}

The SayerRPCServer server implementation

SayerPlugin’s Server method provides a definition of an RPC server—the SayerRPCServer struct—to serve the actual methods in a way that’s consistent with net/rpc:

type SayerRPCServer struct {
    Impl Sayer    // Impl contains our actual implementation
}

func (s *SayerRPCServer) Says(args any, resp *string) error {
    *resp = s.Impl.Says()
    return nil
}

SayerRPCServer doesn’t implement the Sayer service. Instead, its Says method calls into a Sayer implementation—Impl—that we’ll provide when we use this to build our plug-in.

Our plug-in implementation

Now that we’ve assembled the code that’s common between the host and plug-ins—the Sayer interface and the RPC stubs—we can build our plug-in code. The code in this section represents the entirety of our duck/duck.go file.

Just like standard Go plug-ins, HashiCorp plug-ins are compiled into standalone executable binaries, so they must be in the main package. Effectively, every HashiCorp plug-in is a small, self-contained RPC server:

package main

We have to import our commons package, as well as the hashicorp/go-plugin package, whose contents we’ll reference as plugin:

import (
    "github.com/cloud-native-go/ch08/hashicorp-plugin/commons"
    "github.com/hashicorp/go-plugin"
)

In our plug-ins we get to build our real implementations. We can build an implementation however we want, as long as it conforms to the Sayer interface that we define in the commons package:

type Duck struct{}

func (g *Duck) Says() string {
    return "Quack!"
}

Finally, we get to our main function. It’s somewhat “boilerplate-y,” but it’s essential:

func main() {
    // Create and initialize our service implementation.
    sayer := &Duck{}

    // pluginMap is the map of plug-ins we can dispense.
    var pluginMap = map[string]plugin.Plugin{
        "sayer": &commons.SayerPlugin{Impl: sayer},
    }

    plugin.Serve(&plugin.ServeConfig{
        HandshakeConfig: commons.HandshakeConfig,
        Plugins:         pluginMap,
    })
}

The main function does three things. First, it creates and initializes our service implementation, a *Duck value, in this case.

Next, it maps the service implementation to the name “sayer” in the pluginMap. If we wanted to, we could actually implement several plug-ins, listing them all here with different names.

Finally, we call plugin.Serve, which starts the RPC server that will handle any connections from the host process, allowing the handshake with the host to proceed and the service’s methods to be executed as the host sees fit.

Our host process

Now we come to our host process: the main command that acts as a client, finding, loading, and executing the plug-in processes.

As you’ll see, using HashiCorp plug-ins isn’t all that different from the steps described for Go plug-ins in “Using our Go plug-ins”.

Import the hashicorp/go-plugin and commons packages

As usual, we start with our package declaration and imports. The imports mostly aren’t interesting, and their necessity should be clear from examination of the code.

The two that are interesting (but not surprising) are github.com/hashicorp/go-plugin, which we, once again, have to reference as plugin, and our commons package, which contains the interface and handshake configuration, both of which must be agreed upon by the host and the plug-ins:

package main

import (
    "fmt"
    "log"
    "os"
    "os/exec"

    "github.com/cloud-native-go/ch08/hashicorp-plugin/commons"
    "github.com/hashicorp/go-plugin"
)

Find our plug-in

Since our plug-in is an external file, we have to find it. Again, for simplicity, our implementation assumes that our plug-in has the same name as the user’s chosen animal and lives in a path relative to the execution location:

func main() {
    if len(os.Args) != 2 {
        log.Fatal("usage: run main/main.go animal")
    }

    // Get the animal name, and build the path where we expect to
    // find the corresponding executable file.
    name := os.Args[1]
    module := fmt.Sprintf("./%s/%s", name, name)

    // Does the file exist?
    _, err := os.Stat(module)
    if os.IsNotExist(err) {
        log.Fatal("can't find an animal named", name)
    }
}

It bears repeating that the value of this approach is that our plug-in—and its implementation—doesn’t need to be known, or even exist, at compile time. We’re able to implement whatever plug-ins we want at any time and use them dynamically as we see fit.

Create our plug-in client

The first way that a HashiCorp RPC plug-in differs from a Go plug-in is the way that it retrieves the implementation. Where Go plug-ins have to be “opened” and their symbol “looked up,” HashiCorp plug-ins are built on RPC and therefore require an RPC client.

This actually requires two steps and two clients: a *plugin.Client that manages the lifecycle of the plug-in subprocess and a protocol client—a plugin.ClientProtocol implementation—that can communicate with the plug-in subprocess.

This awkward API is mostly historical but is used to split the client that deals with subprocess management from the client that does RPC management:

// pluginMap is the map of plug-ins we can dispense.
var pluginMap = map[string]plugin.Plugin{
    "sayer": &commons.SayerPlugin{},
}

// Launch the plugin process!
client := plugin.NewClient(&plugin.ClientConfig{
    HandshakeConfig: commons.HandshakeConfig,
    Plugins:         pluginMap,
    Cmd:             exec.Command(module),
})
defer client.Kill()

// Connect to the plugin via RPC
rpcClient, err := client.Client()
if err != nil {
    log.Fatal(err)
}

Most of this snippet consists of defining the parameters of the plug-in that we want in the form of a plugin.ClientConfig. The complete list of available client configurations is lengthy. This example uses only three:

HandshakeConfig

The handshake configuration. This has to match the plug-in’s own handshake configuration or we’ll get an error in the next step.

Plugins

A map that specifies the name and type of plug-in we want.

Cmd

An *exec.Cmd value that represents the command for starting the plug-in subprocess.

With all of the configuration stuff out of the way, we can first use plugin.NewClient to retrieve a *plugin.Client value, which we call client.

Once we have that, we can use client.Client to request a protocol client. We call this rpcClient because it knows how to use RPC to communicate with the plug-in subprocess.

Connect to our plug-in and dispense our Sayer

Now that we have our protocol client, we can use it to dispense our Sayer implementation:

    // Request the plug-in from the client
    raw, err := rpcClient.Dispense("sayer")
    if err != nil {
        log.Fatal(err)
    }

    // We should have a Sayer now! This feels like a normal interface
    // implementation, but is actually over an RPC connection.
    sayer := raw.(commons.Sayer)

    // Now we can use our loaded plug-in!
    fmt.Printf("A %s says: %q\n", name, sayer.Says())
}

Using the protocol client’s Dispense function, we’re able to finally retrieve our Sayer implementation as an any, which we can assert as a commons.Sayer value and immediately use exactly as if we were using a local value.

Under the covers, our sayer is in fact a SayerRPC value, and calls to its functions trigger RPC calls that are executed in our plug-in’s address space.

In the next section, we’ll introduce the hexagonal architecture, an architectural pattern built around the entire concept of loose coupling by using easily exchangeable “ports and adapters” to connect to its environment.

Hexagonal Architecture

Hexagonal architecture—also known as the “ports and adapters” pattern—is an architectural pattern that uses loose coupling and inversion of control as its central design philosophy to establish clear boundaries between business and peripheral logic.

In a hexagonal application, the core application doesn’t know any details at all about the outside world, operating entirely through loosely coupled ports and technology-specific adapters.

This approach allows the application to, for example, expose different APIs (REST, gRPC, a test harness, etc.) or use different data sources (database, message queues, local files, etc.) without impacting its core logic or requiring major code changes.

Note

It took me an embarrassingly long time to realize that the name “hexagonal architecture” doesn’t actually mean anything. Alistair Cockburn, the author of hexagonal architecture, chose the shape because it gave him enough room to illustrate the design.

The Architecture

As illustrated in Figure 8-4, hexagonal architecture is composed of three components conceptually arranged in and around a central hexagon:

The core application

The application proper, represented by the hexagon. This contains all of the business logic but has no direct reference to any technology, framework, or real-world device. The business logic shouldn’t depend on whether it exposes a REST or a gRPC API, or whether it gets data from a database or a .csv file. Its only view of the world should be through ports.

Ports and adapters

The ports and adapters are represented on the edge of the hexagon. Ports allow different kinds of actors to “plug in” and interact with the core service. Adapters can “plug into” a port and translate signals between the core application and an actor.

For example, your application might have a “data port” into which a “data adapter” might plug. One data adapter might write to a database, while another might use an in-memory datastore or automated test harness.

Actors

The actors can be anything in the environment that interacts with the core application (users, upstream services, etc.) or that the core application interacts with (storage devices, downstream services, etc.). They exist outside the hexagon.

Figure 8-4. All dependencies in the hexagonal architecture point inward; the hexagons represent the core application’s domain and API layers, and the ports and adapters are represented as arrows on the edge of the hexagon, each of which interfaces with a particular actor.

In a traditional layered architecture, all of the dependencies point in the same direction, with each layer depending on the one below it.

In a hexagonal architecture, however, all dependencies point inward: the core business logic doesn’t know any details about the outer world, the adapters know how to ferry information to and from the core, and the adapters in the outer world know how to interact with the actors.

Implementing a Hexagonal Service

To illustrate this, we’re going to refactor our old friend, the key-value store.

If you’ll recall from Chapter 5, the core application of our key-value store reads and writes to an in-memory map, which can be accessed via a RESTful (or gRPC) frontend. Later in the same chapter, we implemented a transaction logger, which knows how to write all transactions somewhere and read them back when the system restarts.

We’ll reproduce the important snippets of the service here, but if you want a refresher on what we did, now is a good time to go back and review.

By this point in the book, we’ve accumulated a couple of different implementations for a couple of different components of our service that seem like good candidates for ports and adapters in a hexagonal architecture:

The frontend

Back in “Generation 1: The Monolith”, we implemented a REST frontend, and then in “Remote procedure calls with gRPC”, we implemented a separate gRPC frontend. We can describe these with a single “driver” port into which we’ll be able to plug either (or both!) as adapters.

The transaction logger

In “What’s a Transaction Log?”, we created two implementations of a transaction log. These seem like a natural choice for a “driven” port and adapters.

While all the logic for all of these already exists, we’ll need to do some refactoring to make this architecture “hexagonal”:

  • Our original core application—originally described in “Generation 0: The Core Functionality”—uses exclusively public functions. We’ll refactor those into struct methods to make it easier to use in a “ports and adapters” format.

  • Both the RESTful and gRPC frontends are already consistent with hexagonal architecture, since the core application doesn’t know or care about them, but they’re constructed in a main function. We’ll convert these into FrontEnd adapters into which we can pass our core application. This pattern is typical of a “driver” port.

  • The transaction loggers themselves won’t need much refactoring, but they’re currently embedded in the frontend logic. When we refactor the core application, we’ll add a transaction logger port so that the adapter can be passed into the core logic. This pattern is typical of a “driven” port.

In the next section, we’ll begin taking the existing components and refactoring them in accordance with hexagonal principles.

Our refactored components

For the sake of this example, all of our components live under the github.com/cloud-native-go/examples/ch08/hexarch package:

~/cloud-native-go/ch08/hexarch/
├── core
│   └── core.go
├── frontend
│   ├── grpc.go
│   └── rest.go
├── main.go
└── transact
    ├── filelogger.go
    └── pglogger.go
core

The core key-value application logic. Importantly, it has no dependencies outside of the Go standard libraries.

frontend

Contains the REST and gRPC frontend driver adapters. These have a dependency on core.

transact

Contains the file and PostgreSQL transaction logger driven adapters. These also have a dependency on core.

main.go

Makes the core application instance, into which it passes the driven components, and which it passes to the driver adapters.

The complete source code is also available in the companion GitHub repository.

Now that we have our very high-level structure, let’s go ahead and implement our first plug.

Our first plug

You may remember that we also implemented a transaction log to maintain a record of every time a resource is modified so that if our service crashes, is restarted, or otherwise finds itself in an inconsistent state, it can reconstruct its complete state by replaying the transactions.

In “Your transaction logger interface”, we represented a generic transaction logger with the TransactionLogger:

type TransactionLogger interface {
    WriteDelete(key string)
    WritePut(key, value string)
}

For brevity, we define only the WriteDelete and WritePut methods.

A common aspect of “driven” adapters is that the core logic acts on them, so the core application has to know about the port. As such, this code lives in the core package.

Our core application

In our original implementation in “Your Super Simple API”, the transaction logger was used by the frontend. In a hexagonal architecture, we move the port—in the form of the TransactionLogger interface—into the core application:

package core

import (
    "errors"
    "log"
    "sync"
)

type KeyValueStore struct {
    m        map[string]string
    transact TransactionLogger
}

func NewKeyValueStore(tl TransactionLogger) *KeyValueStore {
    return &KeyValueStore{
        m:        make(map[string]string),
        transact: tl,
    }
}

func (store *KeyValueStore) Delete(key string) error {
    delete(store.m, key)
    store.transact.WriteDelete(key)
    return nil
}

func (store *KeyValueStore) Put(key string, value string) error {
    store.m[key] = value
    store.transact.WritePut(key, value)
    return nil
}

Comparing the previous code with the original form in “Generation 0: The Core Functionality”, you’ll see some significant changes.

First, Put and Delete aren’t pure functions anymore: they’re now methods on a new KeyValueStore struct, which also has the map data structure. We’ve also added a NewKeyValueStore function that initializes and returns a new KeyValueStore pointer value.

Finally, KeyValueStore now has a TransactionLogger, which Put and Delete act on appropriately. This is our port.

Our TransactionLogger adapters

In Chapter 5, we created two TransactionLogger implementations:

  • In “Implementing your FileTransactionLogger”, we describe a file-based implementation.

  • In “Implementing your PostgresTransactionLogger”, we describe a PostgreSQL-backed implementation.

Both of these have been moved to the transact package. They hardly have to change at all, except to account for the fact that the TransactionLogger interface and Event struct now live in the core package.

But how do we determine which one to load? Well, Go doesn’t have annotations or any fancy dependency injection features,10 but there are still a couple of ways you can do this.

The first option is to use plug-ins of some kind (this is actually a primary use case for Go plug-ins). This might make sense if you want changing adapters to require zero code changes.

More commonly, you’ll see some kind of “factory” function11 that’s used by the initializing function. While this still requires code changes to add adapters, they’re isolated to a single, easily modified location. A more sophisticated approach might accept a parameter or configuration value to choose which adapter to use.

An example of a TransactionLogger factory function might look like the following:

func NewTransactionLogger(logger string) (core.TransactionLogger, error) {
    switch logger {
    case "file":
        return NewFileTransactionLogger(os.Getenv("TLOG_FILENAME"))

    case "postgres":
        return NewPostgresTransactionLogger(
            PostgresDbParams{
                host:     os.Getenv("TLOG_DB_HOST"),
                dbName:   os.Getenv("TLOG_DB_DATABASE"),
                user:     os.Getenv("TLOG_DB_USERNAME"),
                password: os.Getenv("TLOG_DB_PASSWORD"),
            },
        )

    case "":
        return nil, fmt.Errorf("transaction logger type not defined")

    default:
        return nil, fmt.Errorf("no such transaction logger %s", logger)
    }
}

In this example, the NewTransactionLogger function accepts a string that specifies the desired implementation, returning either one of our implementations or an error. We use the os.Getenv function to retrieve the appropriate parameters from environment variables.

Our frontend port

What about our frontends? If you will recall, we now have two frontend implementations:

  • In “Generation 1: The Monolith” in Chapter 5, we built a RESTful interface using net/http and gorilla/mux.

  • In “Remote procedure calls with gRPC”, earlier in this chapter, we built an RPC interface with gRPC.

Both of these implementations include a main function where we configure and start the service to listen for connections.

Since they’re “driver” ports, we need to pass the core application to them, so let’s refactor both frontends into structs according to the following interface:

package frontend

type FrontEnd interface {
    Start(kv *core.KeyValueStore) error
}

The FrontEnd interface serves as our “frontend port,” which all frontend implementations are expected to satisfy. The Start method accepts the core application API in the form of a *core.KeyValueStore and will also include the setup logic that formerly lived in a main function.

Now that we have this, we can refactor both frontends so that they comply with the FrontEnd interface, starting with the RESTful frontend. As usual, the complete source code for this and the gRPC service refactor are available in this book’s companion GitHub repository:

package frontend

import (
    "net/http"

    "github.com/cloud-native-go/examples/ch08/hexarch/core"
    "github.com/gorilla/mux"
)

// restFrontEnd contains a reference to the core application logic,
// and complies with the contract defined by the FrontEnd interface.
type restFrontEnd struct {
    store *core.KeyValueStore
}

// deleteHandler handles the logic for the DELETE HTTP method.
func (f *restFrontEnd) deleteHandler(w http.ResponseWriter,
        r *http.Request) {

    vars := mux.Vars(r)
    key := vars["key"]

    err := f.store.Delete(key)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
}

// ...other handler functions omitted for brevity.

// Start includes the setup and start logic that previously
// lived in a main function.
func (f *restFrontEnd) Start(store *core.KeyValueStore) error {
    // Remember our core application reference.
    f.store = store

    r := mux.NewRouter()

    r.HandleFunc("/v1/{key}", f.getHandler).Methods("GET")
    r.HandleFunc("/v1/{key}", f.putHandler).Methods("PUT")
    r.HandleFunc("/v1/{key}", f.deleteHandler).Methods("DELETE")

    return http.ListenAndServe(":8080", r)
}

Comparing the previous code to the code we produced in “Generation 1: The Monolith”, some differences stand out:

  • All functions are now methods attached to a restFrontEnd struct.

  • All calls to the core application go through the store value that lives in the restFrontEnd struct.

  • Creating the router, defining the handlers, and starting the server now live in the Start method.

Similar changes have been made to our gRPC frontend implementation to make it consistent with the FrontEnd port.

This new arrangement makes it easier for a consumer to choose and plug in a “frontend adapter,” as demonstrated in the following section.

Putting it all together

Here, we have our main function, in which we plug all of the components into our application:

package main

import (
    "log"

    "github.com/cloud-native-go/examples/ch08/hexarch/core"
    "github.com/cloud-native-go/examples/ch08/hexarch/frontend"
    "github.com/cloud-native-go/examples/ch08/hexarch/transact"
)

func main() {
    // Create our TransactionLogger. This is an adapter that will plug
    // into the core application's TransactionLogger port.
    tl, err := transact.NewTransactionLogger(os.Getenv("TLOG_TYPE"))
    if err != nil {
        log.Fatal(err)
    }

    // Create Core and tell it which TransactionLogger to use.
    // This is an example of a "driven adapter."
    store := core.NewKeyValueStore(tl)
    store.Restore()

    // Create the frontend.
    // This is an example of a "driver adapter."
    fe, err := frontend.NewFrontEnd(os.Getenv("FRONTEND_TYPE"))
    if err != nil {
        log.Fatal(err)
    }

    log.Fatal(fe.Start(store))
}

First, we create a transaction logger according to the TLOG_TYPE environment variable. We do this first because the “transaction logger port” is “driven,” so we’ll need to provide it to the application to plug it in.

We then create our KeyValueStore value, which represents our core application functions and provides an API for ports to interact with, and provide it with any driven adapters.

Next, we create any “driver” adapters. Since these act on the core application API, we provide the API to the adapter instead of the other way around as we would with a “driven” adapter. This means we could also create multiple frontends here, if we wanted, by creating a new adapter and passing it the KeyValueStore that exposes the core application API.
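For example, a minimal sketch of running both adapters at once might look something like the following. The NewRestFrontEnd and NewGrpcFrontEnd constructors are hypothetical stand-ins used here for illustration only; the example code exposes its adapters through a NewFrontEnd factory function instead, and each adapter would need to listen on its own port:

// Plug both frontend adapters into the same core application and run
// them side by side. NewRestFrontEnd and NewGrpcFrontEnd are hypothetical
// constructors; each adapter is assumed to listen on its own port.
restFE := frontend.NewRestFrontEnd()
grpcFE := frontend.NewGrpcFrontEnd()

errs := make(chan error, 2)

go func() { errs <- restFE.Start(store) }()
go func() { errs <- grpcFE.Start(store) }()

// Exit as soon as either frontend stops serving.
log.Fatal(<-errs)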

Finally, we call Start on our frontend, which instructs it to start listening for connections. At last, we have a complete hexagonal service!

Summary

We covered a lot of ground in this chapter but really only scratched the surface of all the different ways that components can find themselves tightly coupled and all the different ways of managing each of those tightly coupled components.

In the first half of the chapter, we focused on the coupling that can result from how services communicate. We talked about the problems caused by fragile exchange protocols like SOAP and demonstrated REST and gRPC, which are less fragile because they can be changed to some degree without necessarily forcing client upgrades. We also touched on coupling “in time,” in which one service implicitly expects a timely response from another, and how publish-subscribe messaging might be used to relieve this.

In the second half, we addressed some of the ways that systems can minimize coupling to local resources. After all, even distributed services are just programs, subject to the same limitations of the architectures and implementations as any program. Plug-in implementations and hexagonal architectures are two ways of doing this by enforcing separation of concerns and inversion of control.

Unfortunately, we didn’t get to drill down into some other fascinating topics like service discovery, but, sadly, I had to draw a line somewhere before this subject got away from me!

1 Ellen Ullman, “The Dumbing-Down of Programming”, Salon, May 12, 1998.

2 This is actually a pretty nuanced discussion. See “Service Architectures”.

3 Get off my lawn.

4 In XML, no less. We didn’t know any better at the time.

5 At Google, even the acronyms are recursive.

6 If you’re into that kind of thing.

7 If you wanted to be creative, this could be a FileListener or even a stdio stream.

8 Yes, I know the animal thing has been done before. Sue me.

9 So, naturally, we’re building a duck. Obviously.

10 Good riddance.

11 I’m sorry.

Chapter 9. Resilience

Safety work is today recognized as an economic necessity. It is the study of the right way to do things.

Robert W. Campbell, addressing the Third National Safety Council Congress & Expo (1914)

Late one September night, at just after two in the morning, a portion of Amazon’s internal network quietly stopped working.1 This event was brief, and not particularly interesting, except that it happened to affect a sizable number of the servers that supported the Amazon DynamoDB service.

Most days, this wouldn’t be such a big deal. Any affected servers would just try to reconnect to the cluster by retrieving their membership data from a dedicated metadata service. If that failed, they would temporarily take themselves offline and try again.

But this time, when the network was restored, a small army of storage servers simultaneously requested their membership data from the metadata service, overwhelming it so that requests—even ones from previously unaffected servers—started to time out. Storage servers dutifully responded to the timeouts by taking themselves offline and retrying (again), further stressing the metadata service, causing even more servers to go offline, and so on. Within minutes, the outage had spread to the entire cluster. The service was effectively down, taking a number of dependent services down with it.

To make matters worse, the sheer volume of retry attempts—a “retry storm”—put such a burden on the metadata service that it even became entirely unresponsive to requests to add capacity. The on-call engineers were forced to explicitly block requests to the metadata service just to relieve enough pressure to allow them to manually scale up.

Finally, nearly five hours after the initial network hiccup that triggered the incident, normal operations resumed, putting an end to what must have been a long night for all involved.

Keep on Ticking: Why Resilience Matters

So, what was the root cause of Amazon’s outage? Was it the network disruption? Was it the storage servers’ enthusiastic retry behavior? Was it the metadata service’s response time, or maybe its limited capacity?

Clearly, what happened that early morning didn’t have a single root cause. Failures in complex systems never do.2 Rather, the system failed as complex systems do: with a failure in a subsystem, which triggered a latent fault in another subsystem, causing it to fail, followed by another, and another, until eventually the entire system went down. What’s interesting, though, is that if any of the components in our story—the network, the storage servers, the metadata service—had been able to isolate and recover from failures elsewhere in the system, the overall system likely would have recovered without human intervention.

Unfortunately, this is just one example of a common pattern. Complex systems fail in complex (and often surprising) ways, but they don’t fail all at once: they fail one subsystem at a time. For this reason, resilience patterns in complex systems take the form of bulwarks and safety valves that work to isolate failures at component boundaries. Frequently, a failure contained is a failure avoided.

This property, the measure of a system’s ability to withstand and recover from errors and failures, is its resilience. A system can be considered resilient if it can continue operating correctly—possibly at a reduced level—rather than failing completely when one of its subsystems fails.

Resilience Is Not Reliability

The terms resilience and reliability describe closely related concepts and are often confused. But they aren’t quite the same thing:3

Resilience

The degree to which a system can continue to operate correctly in the face of errors and faults. Resilience, along with the other four cloud native properties, is just one factor that contributes to reliability.

Reliability

A system’s ability to behave as expected for a given time interval. Reliability, in conjunction with attributes like availability and maintainability, contributes to a system’s overall dependability.

What Does It Mean for a System to Fail?

For want of a nail, the shoe was lost,

for want of a shoe, the horse was lost;

for want of a horse, the rider was lost;

all for want of care about a horse-shoe nail.

Benjamin Franklin, The Way to Wealth (1758)

If we want to know what it means for a system to fail, we first have to ask what a “system” is.

This is important. Bear with me.

By definition, a system is a set of components that work together to accomplish an overall goal. So far, so good. But here’s the important part: each component of a system—a subsystem—is also a complete system unto itself, that in turn is composed of still smaller subsystems, and so on, and so on.

Take a car, for example. Its engine is one of dozens of subsystems, but it—like all the others—is also a complex system with a number of subsystems of its own, including a cooling subsystem, which includes a thermostat, which includes a temperature switch, and so on. Those are just some of thousands of components and subcomponents and sub-subcomponents. It’s enough to make the mind spin: so many things that can fail. What happens when they do?

As we mentioned earlier—and discussed in some depth in Chapter 6—failures of complex systems don’t happen all at once. They unravel in predictable steps:

  1. All systems contain faults, which we lovingly refer to as “bugs” in the software world. A tendency for a temperature switch in a car engine to stick would be a fault. So would the metadata service’s limited capacity and the storage server’s retry behavior in the DynamoDB case study.4 Under the right conditions, a fault can be exercised to produce an error.

  2. An error is any discrepancy between the system’s intended and actual behavior. Many errors can be caught and handled appropriately, but if they’re not, they can—singly or in accumulation—give rise to a failure. A stuck temperature switch in a car engine’s thermostat is an error.

  3. Finally, a system can be said to be experiencing a failure when it’s no longer able to provide correct service.5 A temperature switch that no longer responds to high temperatures can be said to have failed. A failure at the subsystem level becomes a fault at the system level.

This last bit bears repeating: a failure at the subsystem level becomes a fault at the system level. A stuck temperature switch causes a thermostat to fail, preventing coolant from flowing through the radiator, raising the temperature of the engine, causing it to stall and the car to stop.6

That’s how systems fail. It starts with the failure of one component—one subsystem—which causes an error in one or more components that interact with it, and ones that interact with that, and so on, propagating upward until the entire system fails.

This isn’t just academic. Knowing how complex systems fail—one component at a time—makes the means of resisting failures clearer: if a fault can be contained before it propagates all the way to the system level, the system may be able to recover (or at least fail on its own terms).

Building for Resilience

In a perfect world, it would be possible to rid a system of every possible fault, but this isn’t realistic, and it’s wasteful and unproductive to try. By instead assuming that all components are destined to fail eventually—which they absolutely are—and designing them to respond gracefully to errors when they do occur, you can produce a system that’s functionally healthy even when some of its components are not.

There are a lot of ways to increase the resiliency of a system. Redundancy, such as deploying multiple components of the same type, is probably the most common approach. Specialized logic like circuit breakers and request throttles can be used to isolate specific kinds of errors, preventing them from propagating. Faulty components can even be reaped—or intentionally allowed to fail—to benefit the health of the larger system.

Resilience is a particularly rich subject. We’ll explore several of these approaches—and more—over the remainder of the chapter.

Cascading Failures

The reason the DynamoDB case study is so appropriate is that it demonstrates so many different ways that things can go wrong at scale.

Take, for example, how the failure of a group of storage servers caused requests to the metadata service to time out, which in turn caused more storage servers to fail, which increased the pressure on the metadata service, and so on. This is an excellent example of a particular—and particularly common—failure mode known as a cascading failure. Once a cascading failure has begun, it tends to spread quickly, often on the order of a few minutes.

The mechanisms of cascading failures can vary a bit, but one thing they share is some kind of positive feedback mechanism. One part of a system experiences a local failure—a reduction in capacity, an increase in latency, etc.—that causes other components to attempt to compensate for the failed component in a way that exacerbates the problem, eventually leading to the failure of the entire system.

The classic cause of cascading failures is overload, illustrated in Figure 9-1. This occurs when one or more nodes in a set fails, causing the load to be catastrophically redistributed to the survivors. The increase in load overloads the remaining nodes, causing them to fail from resource exhaustion, taking the entire system down.

Figure 9-1. Server overload is a common cause of cascade failures; each server handles 600 requests per second, so when server B fails, server A is overloaded and also fails.

The nature of positive feedback often makes it difficult to scale your way out of a cascading failure by adding more capacity. New nodes can be overwhelmed as quickly as they come online, often contributing to the very feedback loop that took the system down in the first place. Sometimes, the only fix is to take your entire service down—perhaps by explicitly blocking the problematic traffic—in order to recover, and then slowly reintroduce load.

But how do you prevent cascading failures in the first place? This will be the subject of the next section (and, to some extent, most of this chapter).

Preventing Overload

Every service, however well-designed and implemented, has its functional limitations. This is particularly evident in services intended to handle and respond to client requests.7 For any such service, there exists some request frequency, a threshold beyond which bad things will start to happen. So, how do we keep a large number of requests from accidentally (or intentionally!) bringing our service down?

Ultimately, a service that finds itself in such a situation has no choice but to reject—partially or entirely—some number of requests. There are two main strategies for doing this:

Throttling

Throttling is a relatively straightforward strategy that kicks in when requests come in faster than some predetermined frequency, typically by just refusing to handle them. This is often used as a preventative measure by ensuring that no particular user consumes more resources than they would reasonably require.

Load shedding

Load shedding is a little more adaptive. Services using this strategy intentionally drop (“shed”) some proportion of load as they approach overload conditions by either refusing requests or falling back into a degraded mode.

These strategies aren’t mutually exclusive; a service may choose to employ either or both of them, according to its needs.

Throttling

As we discussed in Chapter 4, a throttle pattern works a lot like the throttle in a car, except that instead of limiting the amount of fuel entering an engine, it limits the number of requests that a user (human or otherwise) can make to a service in a set period of time.

The general-purpose throttle example that we provided in “Throttle” was relatively simple, and effectively global, at least as written. However, throttles are also frequently applied on a per-user basis to provide something like a usage quota so that no one caller can consume too much of a service’s resources.

In the following code, we demonstrate a throttle implementation that, while still using a token bucket,8 is otherwise quite different in several ways.

First, instead of having a single bucket that’s used to gate all incoming requests, the following implementation throttles on a per-user basis, returning a function that accepts a “key” parameter that’s meant to represent a username or some other unique identifier.

Second, rather than attempting to “replay” a cached value when imposing a throttle limit, the returned function returns a Boolean that indicates when a throttle has been imposed. Note that the throttle doesn’t return an error when it’s activated: throttling isn’t an error condition, so we don’t treat it as one.

Finally, and perhaps most interestingly, it doesn’t actually use a timer (a time.Ticker) to explicitly add tokens to buckets on some regular cadence. Rather, it refills buckets on demand, based on the time elapsed between requests. This strategy means that we don’t have to dedicate background processes to filling buckets until they’re actually used, which will scale much more effectively:

// Effector is the function that you want to subject to throttling.
type Effector func(context.Context) (string, error)

// Throttled wraps an Effector. It accepts the same parameters, plus a
// "UID" string that represents a caller identity. It returns the same,
// plus a bool that's true if the call is not throttled.
type Throttled func(context.Context, string) (bool, string, error)

// A bucket tracks the requests associated with a UID.
type bucket struct {
    tokens uint
    time   time.Time
}

// Throttle accepts an Effector function, and returns a Throttled
// function with a per-UID token bucket with a capacity of max
// that refills at a rate of refill tokens every d.
func Throttle(e Effector, max uint, refill uint, d time.Duration) Throttled {
    // buckets maps UIDs to specific buckets
    buckets := map[string]*bucket{}

    return func(ctx context.Context, uid string) (bool, string, error) {
        b := buckets[uid]

        // This is a new entry! It passes. Assumes that capacity >= 1.
        if b == nil {
            buckets[uid] = &bucket{tokens: max - 1, time: time.Now()}

            str, err := e(ctx)
            return true, str, err
        }

        // Calculate how many tokens we now have based on the time
        // passed since the previous request.
        refillInterval := uint(time.Since(b.time) / d)
        tokensAdded := refill * refillInterval
        currentTokens := b.tokens + tokensAdded

        // We don't have enough tokens. Return false.
        if currentTokens < 1 {
            return false, "", nil
        }

        // If we've refilled our bucket, we can restart the clock.
        // Otherwise, we figure out when the most recent tokens were added.
        if currentTokens > max {
            b.time = time.Now()
            b.tokens = max - 1
        } else {
            deltaTokens := currentTokens - b.tokens
            deltaRefills := deltaTokens / refill
            deltaTime := time.Duration(deltaRefills) * d

            b.time = b.time.Add(deltaTime)
            b.tokens = currentTokens - 1
        }

        str, err := e(ctx)

        return true, str, err
    }
}

Like the example in “Throttle”, this Throttle function accepts a function literal that conforms to the Effector contract, plus some values that define the size and refill rate of the underlying token bucket.

Instead of returning another Effector, however, it returns a Throttled function, which in addition to wrapping the effector with the throttling logic adds a “key” input parameter, which represents a unique user identifier, and a Boolean return value, which indicates whether the function has been throttled (and therefore not executed).

As interesting as you may (or may not) find the Throttle code, it’s still not production ready. First of all, it’s not entirely safe for concurrent use. A production implementation will probably want to lock on the record values, and possibly the bucket map. Second, there’s no way to purge old records. In production, we’d probably want to use something like an LRU cache, like the one we described in “Efficient Caching Using an LRU Cache”, instead.
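As a rough illustration of the first point, one way to make the throttle safe for concurrent use is to guard the bucket map (and the buckets it holds) with a sync.Mutex. The following sketch assumes the same Effector, Throttled, and bucket types defined previously, and it still doesn’t purge old records:

// Throttle works exactly like the version shown previously, but guards the
// bucket map (and the buckets it contains) with a mutex so that the returned
// Throttled function is safe for concurrent use.
func Throttle(e Effector, max uint, refill uint, d time.Duration) Throttled {
    var mu sync.Mutex

    // buckets maps UIDs to specific buckets
    buckets := map[string]*bucket{}

    return func(ctx context.Context, uid string) (bool, string, error) {
        mu.Lock()

        b := buckets[uid]

        // This is a new entry! It passes. Assumes that capacity >= 1.
        if b == nil {
            buckets[uid] = &bucket{tokens: max - 1, time: time.Now()}
            mu.Unlock()

            str, err := e(ctx)
            return true, str, err
        }

        // Calculate how many tokens we now have based on the time
        // passed since the previous request.
        refillInterval := uint(time.Since(b.time) / d)
        tokensAdded := refill * refillInterval
        currentTokens := b.tokens + tokensAdded

        // We don't have enough tokens. Release the lock and return false.
        if currentTokens < 1 {
            mu.Unlock()
            return false, "", nil
        }

        // If we've refilled our bucket, we can restart the clock.
        // Otherwise, we figure out when the most recent tokens were added.
        if currentTokens > max {
            b.time = time.Now()
            b.tokens = max - 1
        } else {
            deltaTokens := currentTokens - b.tokens
            deltaRefills := deltaTokens / refill
            deltaTime := time.Duration(deltaRefills) * d

            b.time = b.time.Add(deltaTime)
            b.tokens = currentTokens - 1
        }

        // Don't hold the lock while calling the effector.
        mu.Unlock()

        str, err := e(ctx)

        return true, str, err
    }
}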

In the following code, we show a toy example of how Throttle might be used in a RESTful web service:

var throttled = Throttle(getHostname, 1, 1, time.Second)

func getHostname(ctx context.Context) (string, error) {
    if ctx.Err() != nil {
        return "", ctx.Err()
    }

    return os.Hostname()
}

func throttledHandler(w http.ResponseWriter, r *http.Request) {
    ok, hostname, err := throttled(r.Context(), r.RemoteAddr)

    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    if !ok {
        http.Error(w, "Too many requests", http.StatusTooManyRequests)
        return
    }

    w.WriteHeader(http.StatusOK)
    w.Write([]byte(hostname))
}

func main() {
    r := mux.NewRouter()
    r.HandleFunc("/hostname", throttledHandler)
    log.Fatal(http.ListenAndServe(":8080", r))
}

The previous code creates a small web service with a single (somewhat contrived) endpoint at /hostname that returns the service’s hostname. When the program is run, the throttled var is created by wrapping the getHostname function—which provides the actual service logic—by passing it to Throttle, which we defined previously.

When the router receives a request for the /hostname endpoint, the request is forwarded to the throttledHandler function, which calls throttled and receives a bool indicating throttling status, the hostname string, and an error value. A non-nil error causes us to return a 500 Internal Server Error, and a throttled request gets a 429 Too Many Requests. If all goes well, we return the hostname and a status 200 OK.

Note that the bucket values are stored locally, so this implementation can’t really be considered production-ready either. If you want this to scale out, you might want to store the record values in an external cache of some kind so that multiple service replicas can share them.
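One way to prepare for that is to put the bucket storage behind a small interface so that the in-memory map can later be swapped for a shared cache without touching the throttling logic itself. The following sketch is purely illustrative; the BucketStore name and its adapters are ours, not part of the example code, and a genuinely shared store would also have to address atomicity across replicas:

// BucketStore abstracts where the per-UID buckets live, so the throttle
// logic doesn't care whether they're held in local memory or in a shared
// cache. The interface is hypothetical and reuses the bucket type defined
// previously.
type BucketStore interface {
    Get(uid string) (*bucket, bool)
    Put(uid string, b *bucket)
}

// memoryBucketStore is the simplest possible adapter: a mutex-guarded map.
// A shared-cache adapter could satisfy the same interface.
type memoryBucketStore struct {
    mu      sync.Mutex
    buckets map[string]*bucket
}

func newMemoryBucketStore() *memoryBucketStore {
    return &memoryBucketStore{buckets: map[string]*bucket{}}
}

func (s *memoryBucketStore) Get(uid string) (*bucket, bool) {
    s.mu.Lock()
    defer s.mu.Unlock()

    b, ok := s.buckets[uid]
    return b, ok
}

func (s *memoryBucketStore) Put(uid string, b *bucket) {
    s.mu.Lock()
    defer s.mu.Unlock()

    s.buckets[uid] = b
}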

Load shedding

It’s an unavoidable fact of life that, as load on a server increases beyond what it can handle, something eventually has to give.

Load shedding is a technique used to predict when a server is approaching that saturation point and then mitigating the saturation by dropping some proportion of traffic in a controlled fashion. Ideally, this will prevent the server from overloading and failing health checks, serving with high latency, or just collapsing in a graceless, uncontrolled failure.

Unlike quota-based throttling, load shedding is reactive, typically engaging in response to depletion of a resource like CPU, memory, or request-queue depth.

Perhaps the most straightforward form of load shedding is a per-task throttling that drops requests when one or more resources exceed a particular threshold. For example, if your service provides a RESTful endpoint, you might choose to return an HTTP 503 (service unavailable). The gorilla/mux web toolkit, which we found very effective in Chapter 5 in the section “Building an HTTP Server with gorilla/mux”, makes this fairly straightforward by supporting “middleware” handler functions that are called on every request:

const MaxQueueDepth = 1000

// Middleware function, which will be called for each request.
// If queue depth is exceeded, it returns HTTP 503 (service unavailable).
func loadSheddingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // CurrentQueueDepth is fictional and for example purposes only.
        if CurrentQueueDepth() > MaxQueueDepth {
            log.Println("load shedding engaged")

            http.Error(w,
                "service overloaded; please retry later",
                http.StatusServiceUnavailable)
            return
        }

        next.ServeHTTP(w, r)
    })
}

func main() {
    r := mux.NewRouter()

    // Register middleware
    r.Use(loadSheddingMiddleware)

    log.Fatal(http.ListenAndServe(":8080", r))
}

Gorilla mux middlewares are called on every request, each taking a request, doing something with it, and passing it down to another middleware or the final handler. This makes them perfect for implementing general request logging, header manipulation, ResponseWriter hijacking, or in our case, resource-reactive load shedding.

Our middleware uses the fictional CurrentQueueDepth() (your actual function will depend on your implementation) to check the current queue depth, and rejects requests with an HTTP 503 (service unavailable) if the value is too high. More sophisticated implementations might even be smarter about choosing which work is dropped by prioritizing particularly important requests.

Graceful service degradation

Resource-sensitive load shedding works well, but in some applications it’s possible to act a little more gracefully by significantly decreasing the quality of responses when the service is approaching overload. Such graceful degradation takes the concept of load shedding one step further by strategically reducing the amount of work needed to satisfy each request instead of just rejecting requests.

There are as many ways of doing this as there are services, and not every service can be degraded in a reasonable manner, but common approaches include falling back on cached data or less expensive—if less precise—algorithms.
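As a sketch of what this can look like in practice, the following handler falls back to a (possibly stale) cached value when the service reports that it’s under pressure, and sheds the request only if no cached value exists. The IsOverloaded, cache, and queryDatabase names are hypothetical placeholders for your own saturation signal and data sources:

// productHandler degrades gracefully under load: if the service is
// approaching saturation, it serves a cached (possibly stale) value
// instead of doing the expensive database work, and sheds the request
// only as a last resort. IsOverloaded, cache, and queryDatabase are
// hypothetical placeholders.
func productHandler(w http.ResponseWriter, r *http.Request) {
    id := mux.Vars(r)["id"]

    if IsOverloaded() {
        if cached, ok := cache.Get(id); ok {
            w.Header().Set("Warning", `110 - "Response is Stale"`)
            w.Write([]byte(cached))
            return
        }

        // Nothing cached to fall back on; shed the request instead.
        http.Error(w, "service overloaded", http.StatusServiceUnavailable)
        return
    }

    product, err := queryDatabase(r.Context(), id)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    cache.Put(id, product)
    w.Write([]byte(product))
}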

Play It Again: Retrying Requests

When a request receives an error response, or doesn’t receive a response at all, it should just try again, right? Well, kinda. Retrying makes sense, but it’s a lot more nuanced than that.

Take this snippet for example, a version of which I’ve found in a production system:

res, err := SendRequest()
for err != nil {
    res, err = SendRequest()
}

It seems seductively straightforward, doesn’t it? It will repeat failed requests until one succeeds, but that’s also exactly all it will do: retry, relentlessly, with no delay between attempts. So when this logic was deployed to a few hundred servers and the service to which it was issuing requests went down, the entire system went down with it. A review of the service metrics, shown in Figure 9-2, revealed as much.

Figure 9-2. The anatomy of a “retry storm.”

It seems that when the downstream service failed, our service—every single instance of it—entered its retry loop, making thousands of requests per second and bringing the network to its knees so severely that we were forced to essentially restart the entire system.

This is actually a very common kind of cascading failure known as a retry storm (which we also saw in the introduction to this chapter). In a retry storm, well-meaning logic intended to add resilience to a component instead acts against the larger system. Frequently, even when the conditions that caused the downstream service to go down are resolved, it can’t come back up because it’s instantly brought under too much load.

But, retries are a good thing, right?

Yes, but whenever you implement retry logic, you should always include a backoff algorithm, which we’ll conveniently discuss in the next section.

Backoff Algorithms

When a request to a downstream service fails for any reason, “best” practice is to retry the request. But how long should you wait? If you wait too long, important work may be delayed. Too little and you risk overwhelming the target, the network, or both.

The common solution is to implement a backoff algorithm that introduces a delay between retries to reduce the frequency of attempts to a safe and acceptable rate.

There are a variety of backoff algorithms available, the simplest of which is to include a short, fixed-duration pause between retries, as follows:

res, err := SendRequest()
for err != nil {
    time.Sleep(2 * time.Second)
    res, err = SendRequest()
}

In the previous snippet, SendRequest is used to issue a request, returning response and error values. If err isn’t nil, the code enters a loop, sleeping for two seconds before each retry and repeating indefinitely until it receives a nonerror response.

In Figure 9-3, we illustrate the number of requests generated by 1,000 simulated instances using this method.9 As you can see, while the fixed-delay approach might reduce the request count compared to having no backoff at all, the overall number of requests is still quite consistently high.

Figure 9-3. Requests/second of 1,000 simulated instances using a two-second retry delay.

A fixed-duration backoff delay might work fine if you have a very small number of retrying instances, but it doesn’t scale very well, since a sufficient number of requestors can still overwhelm the network.

However, we can’t always assume that any given service will have a small enough number of instances not to overwhelm the network with retries, or that our service will even be the only one retrying. For this reason, many backoff algorithms implement an exponential backoff, in which the duration of the delay between retries roughly doubles with each attempt, up to some fixed maximum.

A common (but flawed, as you’ll soon see) exponential backoff implementation might look something like the following:

res, err := SendRequest()
base, cap := time.Second, time.Minute

for backoff := base; err != nil; backoff <<= 1 {
    if backoff > cap {
        backoff = cap
    }
    time.Sleep(backoff)
    res, err = SendRequest()
}

In this snippet, we specify a starting duration, base, and a fixed maximum duration, cap. In the loop, the value of backoff starts at base and doubles each iteration to a maximum value of cap.

You would think that this logic would help mitigate the network load and retry request burden on downstream services. Simulating this implementation for 1,000 nodes, however, tells another story, illustrated in Figure 9-4.

It would seem that having 1,000 nodes with exactly the same retry schedule still isn’t optimal, since the retries are now clustering, possibly generating enough load in the process to cause problems. So, in practice, pure exponential backoff doesn’t necessarily help as much as we’d like.

Figure 9-4. Requests/second of 1,000 simulated instances using an exponential backoff.

It would seem that we need some way to spread the spikes out so that the retries occur at a roughly constant rate. The solution is to include an element of randomness, called jitter. Adding jitter to our previous backoff function results in something like the snippet here:

res, err := SendRequest()
base, cap := time.Second, time.Minute

for backoff := base; err != nil; backoff <<= 1 {
    if backoff > cap {
        backoff = cap
    }

    jitter := rand.Int63n(int64(backoff * 3))
    sleep := base + time.Duration(jitter)
    time.Sleep(sleep)
    res, err = SendRequest()
}

Simulating running this code on 1,000 nodes produces the pattern presented in Figure 9-5.

Figure 9-5. Requests/second of 1,000 simulated instances using an exponential backoff with jitter.
Warning

Prior to Go 1.20, the rand package’s top-level functions produced a deterministic sequence of values each time the program was run: unless you provided a new seed value with rand.Seed, they behaved as if seeded by rand.Seed(1) and always produced the same “random” sequence of numbers, putting you right back into the pattern shown in Figure 9-4. Since Go 1.20 the global generator is seeded automatically (and rand.Seed is deprecated), but keep this in mind if you’re targeting an older toolchain or seeding explicitly for reproducibility.

When we use exponential backoff with jitter, the number of retries decreases over a short interval—so as not to overstress services that are trying to come up—and spreads them out over time so that they occur at an approximately constant rate.

Who would have thought there was more to retrying requests than retrying requests?

Circuit Breaking

We first introduced the Circuit Breaker pattern in Chapter 4 as a function that degrades potentially failing method calls as a way to prevent larger or cascading failures. That definition still holds, and because we’re not going to extend or change it much, we won’t dig into it in too much detail here.

To review, the Circuit Breaker pattern tracks the number of consecutive failed requests made to a downstream component. If the failure count passes a certain threshold, the circuit is “opened,” and all attempts to issue additional requests fail immediately (or return some defined fallback). After a waiting period, the circuit automatically “closes,” resuming its normal state and allowing requests to be made normally.
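To jog your memory, a bare-bones sketch of the idea might look something like the following. It omits the exponential backoff and other refinements of the Chapter 4 implementation, and the Circuit type and two-second cool-down are assumptions made here for brevity:

// Circuit is any function that issues a request to a downstream component.
type Circuit func(context.Context) (string, error)

// Breaker wraps a Circuit. After failureThreshold consecutive failures the
// circuit "opens" and calls fail fast; after a short cool-down it allows
// another attempt, and a success "closes" it again. This is a bare-bones
// sketch, not the fuller implementation from Chapter 4.
func Breaker(circuit Circuit, failureThreshold uint) Circuit {
    var mu sync.Mutex
    var consecutiveFailures uint
    lastAttempt := time.Now()

    return func(ctx context.Context) (string, error) {
        mu.Lock()

        if consecutiveFailures >= failureThreshold &&
            time.Since(lastAttempt) < 2*time.Second {

            mu.Unlock()
            return "", errors.New("service unreachable (circuit open)")
        }

        lastAttempt = time.Now()
        mu.Unlock()

        response, err := circuit(ctx)

        mu.Lock()
        defer mu.Unlock()

        if err != nil {
            consecutiveFailures++
            return response, err
        }

        consecutiveFailures = 0 // A success closes the circuit.
        return response, nil
    }
}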

Tip

Not all resilience patterns are defensive.

Sometimes it pays to be a good neighbor.

A properly applied Circuit Breaker pattern can mean the difference between system recovery and cascading failure. In addition to the obvious benefits of not wasting resources or clogging the network with doomed requests, a circuit breaker (particularly one with a backoff function) can give a malfunctioning service enough room to recover, allowing it to come back up and restore correct service.

What’s the Difference Between Circuit Breaker and Throttle?

At a quick glance, the Circuit Breaker pattern might seem to resemble a Throttle—after all they’re both resilience patterns that rate requests—but they really are two quite different things:

  • Circuit Breaker is generally applied to only outgoing requests. It usually doesn’t care one bit about the request rate: it’s concerned with only the number of failed requests, and only if they’re consecutive.

  • Throttle works like the throttle in a car by limiting a number of requests—regardless of success or failure—to some maximum rate. It’s typically applied to incoming traffic, but there’s no rule that says it has to be.

The Circuit Breaker pattern was covered in some detail in Chapter 4, so that’s all we’re going to say about it here. Take a look at “Circuit Breaker” for more background and code examples. The addition of jitter to the example’s backoff function is left as an exercise for the reader.10

Timeouts

The importance of timeouts isn’t always appreciated. However, the ability for a client to recognize when a request is unlikely to be satisfied allows the client to release resources that it—and any upstream requestors it might be acting on behalf of—might otherwise hold on to. This holds just as true for a service, which may find itself holding onto requests until long after a client has given up.

For example, imagine a basic service that queries a database. If that database should suddenly slow so that queries take a few seconds to complete, requests to the service—each holding onto a database connection—could accumulate, eventually depleting the connection pool. If the database is shared, it could even cause other services to fail, resulting in a cascading failure.

If the service had timed out instead of holding on to the database, it could have degraded service instead of failing outright.

In other words, if you think you’re going to fail, fail fast.

Using Context for service-side timeouts

We first introduced context.Context back in Chapter 4 as Go’s idiomatic means of carrying deadlines and cancellation signals between processes.11 If you’d like a refresher, or just want to put yourself in the right frame of mind before continuing, go ahead and take a look at “The Context Package”.

You might also recall that later in the same chapter, in “Timeout”, we covered the Timeout pattern, which uses Context to not only allow a process to stop waiting for an answer once it’s clear that a result may not be coming, but to also notify other functions with derived Contexts to stop working and release any resources that they might also be holding on to.

This ability to cancel not just local functions, but subfunctions, is so powerful that it’s generally considered good form for functions to accept a Context value if they have the potential to run longer than a caller might want to wait, which is almost always true if the call traverses a network.

For this reason, there are many excellent samples of Context-accepting functions scattered throughout Go’s standard library. Many of these can be found in the sql package, which includes Context-accepting versions of many of its functions. For example, the DB struct’s QueryRow method has an equivalent QueryRowContext that accepts a Context value.

A function that uses this technique to provide the username of a user based on an ID value might look something like the following:

func UserName(ctx context.Context, id int) (string, error) {
    const query = "SELECT username FROM users WHERE id=?"

    dctx, cancel := context.WithTimeout(ctx, 15*time.Second)
    defer cancel()

    var username string
    err := db.QueryRowContext(dctx, query, id).Scan(&username)

    return username, err
}

The UserName function accepts a context.Context and an id integer, but it also creates its own derived Context with a rather long timeout. This approach provides a default timeout that automatically releases any open connections after 15 seconds—longer than many clients are likely to be willing to wait—while also being responsive to cancellation signals from the caller.

The responsiveness to outside cancellation signals can be quite useful. The http framework provides yet another excellent example of this, as demonstrated in the following UserGetHandler HTTP handler function:

func UserGetHandler(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)

    // The route variable is a string, so convert it before passing it along.
    id, err := strconv.Atoi(vars["id"])
    if err != nil {
        http.Error(w, "invalid user ID", http.StatusBadRequest)
        return
    }

    // Get the request's context. This context is canceled when
    // the client's connection closes, the request is canceled
    // (with HTTP/2), or when the ServeHTTP method returns.
    rctx := r.Context()

    ctx, cancel := context.WithTimeout(rctx, 10*time.Second)
    defer cancel()

    username, err := UserName(ctx, id)

    switch {
    case errors.Is(err, sql.ErrNoRows):
        http.Error(w, "no such user", http.StatusNotFound)
    case errors.Is(err, context.DeadlineExceeded):
        http.Error(w, "database timeout", http.StatusGatewayTimeout)
    case err != nil:
        http.Error(w, err.Error(), http.StatusInternalServerError)
    default:
        w.Write([]byte(username))
    }
}

In UserGetHandler, the first thing we do is retrieve the request’s Context via its Context method. Conveniently, this Context is canceled when the client’s connection closes, when the request is canceled (with HTTP/2), or when the ServeHTTP method returns.

From this we create a derived context, applying our own explicit timeout, which will cancel the Context after 10 seconds, no matter what.

Because the derived context is passed to the UserName function, we are able to draw a direct causative line between closing the HTTP request and closing the database connection: if the request’s Context closes, all derived Contexts close as well, ultimately ensuring that all open resources are released as well in a loosely coupled manner.

Timing out HTTP/REST client calls

Back in “A Possible Pitfall of Convenience Functions”, we presented one of the pitfalls of the http “convenience functions” like http.Get and http.Post: that they use the default timeout. Unfortunately, the default timeout value is 0, which Go interprets as “no timeout.”

The mechanism we presented at the time for setting timeouts for client methods was to create a custom Client value with a nonzero Timeout value, as follows:

var client = &http.Client{
    Timeout: 10 * time.Second,
}

response, err := client.Get(url)

This works perfectly fine, and, in fact, will cancel a request in exactly the same way as if its Context is canceled. However, what if you want to use an existing or derived Context value? For that you’ll need access to the underlying Context, which you can get by using http.NewRequestWithContext, the Context-accepting equivalent of http.NewRequest, which allows a programmer to specify a Context that controls the entire lifetime of the request and its response.

This isn’t as much of a divergence as it might seem. In fact, looking at the source code for the Get method on the http.Client shows that under the covers, it’s just using NewRequest:

func (c *Client) Get(url string) (resp *Response, err error) {
    req, err := NewRequest("GET", url, nil)
    if err != nil {
        return nil, err
    }

    return c.Do(req)
}

As you can see, the standard Get method calls NewRequest to create a *Request value, passing it the method name and URL (the last parameter accepts an optional io.Reader for the request body, which we don’t need here). A call to the Do function executes the request proper.

Not counting an error check and the return, the entire method consists of just one call. It would seem that if we wanted to implement similar functionality that also accepts a Context value, we could do so without much hassle.

One way to do this might be to implement a GetContext function that accepts a Context value:

type ClientContext struct {
    http.Client
}

func (c *ClientContext) GetContext(ctx context.Context, url string)
        (resp *http.Response, err error) {

    req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
    if err != nil {
        return nil, err
    }

    return c.Do(req)
}

Our new GetContext function is functionally identical to the canonical Get, except that it also accepts a Context value, which it uses to call http.NewRequestWithContext instead of http.NewRequest.

Using our new ClientContext would be very similar to using a standard http.Client value, except instead of calling client.Get we’d call client.GetContext (and pass along a Context value, of course):

func main() {
    client := &ClientContext{}
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    response, err := client.GetContext(ctx, "http://www.example.com")
    if err != nil {
        log.Fatal(err)
    }

    bytes, _ := io.ReadAll(response.Body)
    fmt.Println(string(bytes))
}

But does it work? It’s not a proper test with a testing library, but we can manually kick the tires by setting the deadline to 0 and running it:

$ go run .
2020/08/25 14:03:16 Get "http://www.example.com": context deadline exceeded
exit status 1

And it would seem that it does! Excellent.

Timing out gRPC client calls

Just like http.Client, gRPC clients default to “no timeout” but also allow timeouts to be explicitly set.

As we saw in “Implementing the gRPC client”, gRPC clients typically use the grpc.Dial function to establish a connection to a server, and a list of grpc.DialOption values—constructed via functions like grpc.WithInsecure and grpc.WithBlock—can be passed to it to configure how that connection is set up.

Among these options is grpc.WithTimeout, which can be used to configure a client dialing timeout:

opts := []grpc.DialOption{
    grpc.WithInsecure(),
    grpc.WithBlock(),
    grpc.WithTimeout(5 * time.Second),
}
conn, err := grpc.Dial(serverAddr, opts...)

However, while grpc.WithTimeout might seem convenient on the face of it, it’s actually been deprecated for some time, largely because its mechanism is inconsistent (and redundant) with the preferred Context timeout method. We show it here for the sake of completeness.

Warning

The grpc.WithTimeout option is deprecated and will eventually be removed. Use grpc.DialContext and context.WithTimeout instead.

Instead, the preferred method of setting a gRPC dialing timeout is the very convenient (for us) grpc.DialContext function, which allows us to use (or reuse) a context.Context value. This is actually doubly useful, because gRPC service methods accept a Context value anyway, so there really isn’t even any additional work to be done:

func TimeoutKeyValueGet(key string) *pb.Response {
    // Use context to set a 5-second timeout.
    ctx, cancel := context.WithTimeout(context.Background(), 5 * time.Second)
    defer cancel()

    // We can still set other options as desired.
    opts := []grpc.DialOption{grpc.WithInsecure(), grpc.WithBlock()}

    conn, err := grpc.DialContext(ctx, serverAddr, opts...)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    client := pb.NewKeyValueClient(conn)

    // We can reuse the same Context in the client calls.
    response, err := client.Get(ctx, &pb.GetRequest{Key: key})
    if err != nil {
        log.Fatal(err)
    }

    return response
}

As advertised, TimeoutKeyValueGet uses grpc.DialContext—to which we pass a context.Context value with a 5-second timeout—instead of grpc.Dial. The opts list is otherwise identical except, obviously, that it no longer includes grpc.WithTimeout.

Note the client.Get method call. As we mentioned previously, gRPC service methods accept a Context parameter, so we simply reuse the existing one. Importantly, reusing the same Context value will constrain both operations under the same timeout calculation—a Context will time out regardless of how it’s used—so be sure to take that into consideration when planning your timeout values.
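If you’d rather give the dial and the call separate time budgets, you could derive a distinct Context for each operation instead. The following is a minimal sketch of that approach (the three- and two-second values are arbitrary):

// Give the dial its own three-second budget.
dialCtx, dialCancel := context.WithTimeout(context.Background(), 3*time.Second)
defer dialCancel()

conn, err := grpc.DialContext(dialCtx, serverAddr, opts...)
if err != nil {
    log.Fatal(err)
}
defer conn.Close()

client := pb.NewKeyValueClient(conn)

// The Get call gets a fresh two-second budget, independent of the dial.
callCtx, callCancel := context.WithTimeout(context.Background(), 2*time.Second)
defer callCancel()

response, err := client.Get(callCtx, &pb.GetRequest{Key: key})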

Idempotence

As we discussed at the top of Chapter 4, cloud native applications by definition exist in and are subject to all of the idiosyncrasies of a networked world. It’s a plain fact of life that networks—all networks—are unreliable, and messages sent across them don’t always arrive at their destination on time (or at all).

What’s more, if you send a message but don’t get a response, you have no way to know what happened. Did the message get lost on its way to the recipient? Did the recipient get the message, and the response got lost? Maybe everything is working fine, but the round trip is just taking a little longer than usual?

In such a situation, the only option is to send the message again. But it’s not enough to cross your fingers and hope for the best. It’s important to plan for this inevitability by designing your functions for idempotence, which makes it safe to resend messages.

You might recall that we briefly introduced the concept of idempotence in “What Is Idempotence and Why Does It Matter?”, in which we defined an idempotent operation as one that has the same effect after multiple applications as a single application. As the designers of HTTP understood, it also happens to be an important property of any cloud native API, because it guarantees that any communication can be safely repeated (see “The Origins of Idempotence on the Web” for a bit on that history).

The actual means of achieving idempotence will vary from service to service, but there are some consistent patterns that we’ll review in the remainder of this section.

The Origins of Idempotence on the Web

The concepts of idempotence and safety, at least in the context of networked services, were first defined way back in 1997 in the HTTP/1.1 standard.12

An interesting aside: that ground-breaking proposal, as well as the HTTP/1.0 “informational draft” that preceded it the year before,13 were authored by two greats.

The primary author of the original HTTP/1.0 draft (and the last author of the proposed HTTP/1.1 standard) was Sir Timothy John Berners-Lee, who is credited with inventing the World Wide Web, the first web browser, and the fundamental protocols and algorithms allowing the web to scale—for which he was awarded an ACM Turing Award, a knighthood, and various honorary degrees.

The primary author of the proposed HTTP/1.1 standard (and the second author of the original HTTP/1.0 draft) was Roy Fielding, then a graduate student at the University of California Irvine. Despite being one of the original authors of the World Wide Web, Fielding is perhaps best known for his doctoral dissertation, in which he invented REST.14

How do I make my service idempotent?

Idempotence isn’t baked into the logic of any particular framework. Even in HTTP—and by extension, REST—idempotence is a matter of convention and isn’t explicitly enforced. There’s nothing stopping you from—by oversight or on purpose—implementing a nonidempotent GET if you really want to.15

One of the reasons that idempotence is sometimes so tricky is because it relies on logic built into the core application, rather than at the REST or gRPC API layer. For example, if back in Chapter 5 we had wanted to make our key-value store consistent with traditional CRUD (create, read, update, and delete) operations (and therefore not idempotent), we might have done something like this:

var store = make(map[string]string)

func Create(key, value string) error {
    if _, ok := store[key]; ok {
        return errors.New("duplicate key")
    }

    store[key] = value
    return nil
}

func Update(key, value string) error {
    if _, ok := store[key]; !ok {
        return errors.New("no such key")
    }

    store[key] = value
    return nil
}

func Delete(key string) error {
    if _, ok := store[key]; !ok {
        return errors.New("no such key")
    }

    delete(store, key)
    return nil
}

This CRUD-like service implementation may be entirely well-meaning, but if any of these methods has to be repeated, the result would be an error. What’s more, there’s also a fair amount of logic involved in checking against the current state that wouldn’t be necessary in an equivalent idempotent implementation like the following:

var store = make(map[string]string)

func Set(key, value string) {
    store[key] = value
}

func Delete(key string) {
    delete(store, key)
}

This version is a lot simpler, in more than one way. First, we no longer need separate “create” and “update” operations, so we can combine these into a single Set function. Also, not having to check the current state with each operation reduces the logic in each method, a benefit that continues to pay dividends as the service increases in complexity.

Finally, if an operation has to be repeated, it’s no big deal. For both the Set and Delete functions, multiple identical calls will have the same result. They are idempotent.

What about scalar operations?

“So,” you might say, “that’s all well and good for operations that are either done or not done, but what about more complex operations? Operations on scalar values, for example?”

That’s a fair question. After all, it’s one thing to PUT a thing in a place: it’s either been PUT or it hasn’t. All you have to do is not return an error for re-PUTs. Fine.

But what about an operation like “add $500 to account 12345”? Such a request might carry a JSON payload that looks something like the following:

{
    "credit":{
        "accountID": 12345,
        "amount": 500
    }
}

Repeated applications of this operation would lead to an extra $500 going to account 12345, and while the owner of the account might not mind so much, the bank probably would.

But consider what happens when we add a transactionID value to our JSON payload:

{
    "credit":{
        "accountID": 12345,
        "amount": 500,
        "transactionID": 789
    }
}

It may require some more bookkeeping, but this approach provides a workable solution to our dilemma. By tracking transactionID values, the recipient can safely identify and reject duplicate transactions. Idempotence achieved!
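What might that bookkeeping look like? The following is a deliberately simplified (and not concurrency-safe) sketch of the idea; the type and field names are illustrative rather than taken from any particular framework:

type CreditRequest struct {
    AccountID     int `json:"accountID"`
    Amount        int `json:"amount"`
    TransactionID int `json:"transactionID"`
}

// seen records the transaction IDs that have already been applied.
var seen = make(map[int]bool)

// Credit applies a credit exactly once per transaction ID; replays of
// the same request are silently ignored, making the operation idempotent.
func Credit(req CreditRequest, balances map[int]int) {
    if seen[req.TransactionID] {
        return // Duplicate request: already applied, nothing to do.
    }

    balances[req.AccountID] += req.Amount
    seen[req.TransactionID] = true
}

In a real service, of course, the set of seen transaction IDs would live in a durable store shared by all replicas, not in process memory.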

Service Redundancy

Redundancy—the duplication of critical components or functions of a system with the intention of increasing reliability of the system—is often the first line of defense when it comes to increasing resilience in the face of failure.

We’ve already discussed one particular kind of redundancy—messaging redundancy, also known as “retries”—in “Play It Again: Retrying Requests”. In this section, however, we’ll consider the value of replicating critical system components so that if any one fails, one or more others are there to pick up the slack.

In a public cloud, this would mean deploying your component to multiple server instances, ideally across multiple zones or even across multiple regions. In a container orchestration platform like Kubernetes, this may even just be a matter of setting your replica count to a value greater than one.

As interesting as this subject is, however, we won’t actually spend too much time on it. Service replication is an architectural subject that’s been thoroughly covered in many other sources.16 This is supposed to be a Go book, after all. Still, we’d be remiss to have an entire chapter about resilience and not even mention it.

A Word of Caution: Fault Masking

Fault masking occurs when a system fault is invisibly compensated for without being explicitly detected.

For example, imagine a system with three service nodes, all performing a share of the tasks. If one node goes bad and the other nodes can compensate, you might never notice anything wrong. The fault has been masked.

Fault masking can conceal possibly progressive faults and may eventually—and quietly—result in a loss of protective redundancy, often with a sudden and catastrophic outcome.

To prevent fault masking, it’s important to include service health checks—which we’ll discuss in “Healthy Health Checks”—that accurately report the health of a service instance.

Designing for Redundancy

The effort involved in designing a system so that its functions can be replicated across multiple instances can yield significant dividends. But exactly how much? Well…​a lot. You can feel free to take a look at the following box if you’re interested in the math, but if you don’t, you can just trust me on this one.

Reliability by the Numbers

Imagine, if you will, a service with “two-nines”—or 99%—of availability. Any given request to this system has a theoretical probability of success of 0.99, which is denoted $A = 0.99$. This actually isn’t very good, but that’s the point.

What kind of availability can you get if two of these identical instances are arranged in parallel, so that both have to be down to interrupt service?17 This kind of arrangement can be diagrammed as shown here:

[Diagram: two identical service instances arranged in parallel, either of which can handle a request]

What’s the resulting system availability? What we really want to know is: what’s the probability that both instances will be unavailable? To answer this, we take the product of each component’s probability of failure:

$$1 - A = (1 - A_1)(1 - A_2)$$

This method generalizes to any number of components arranged in parallel, so that the availability of any set of parallel components is equal to one minus the product of their unavailabilities:

$$A = 1 - \prod_{i=1}^{n}(1 - A_i)$$

When all are equal, then this can be simplified to the following:

$$A = 1 - (1 - A_i)^n$$

So what about our example? Well, with two components, each with a 99% availability, we get the following:18

$$A = 1 - (1 - 0.99)^2 = 1 - 0.0001 = 0.9999$$

99.99%. Four nines. That’s an improvement of two orders of magnitude, which isn’t half bad. What if we added a third replica? Extending this out a little, we get some interesting results, summarized in the following table:

Components                  Availability           Downtime per year   Downtime per month
One component               99% (“2-nines”)        3.65 days           7.31 hours
Two parallel components     99.99% (“4-nines”)     52.60 minutes       4.38 minutes
Three parallel components   99.9999% (“6-nines”)   31.56 seconds       2.63 seconds

Incredibly, three parallel instances, each of which isn’t exactly awesome on its own, can provide a very impressive 6-nines of availability! This is why cloud providers advise customers to deploy their applications with three replicas.

But what if the components are arranged serially, like a load balancer in front of our components? This might look something like the following:

[Diagram: a load balancer arranged in series with the parallel service instances]

In this kind of arrangement, if either component is unavailable, the entire system is unavailable. Its availability is the product of the availabilities of its components:

$$A = \prod_{i=1}^{n} A_i$$

When all are equal, then this can be simplified to the following:

$$A = A_i^n$$

So what if we slapped a dodgy load balancer instance in front of our fancy 99.9999% available set of service replicas? As it turns out, the result isn’t so good:

$$A = 0.99 \times 0.999999 = 0.98999901 \approx 99\%$$

That’s even lower than the load balancer by itself! This is important, because as it turns out:

Warning

The total reliability of a sequential system cannot be higher than the reliability of any one of its subsystems.
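If you’d like to check these numbers yourself, the formulas above translate into just a few lines of Go:

package main

import (
    "fmt"
    "math"
)

// parallel returns the availability of n identical components arranged
// in parallel: 1 - (1 - a)^n.
func parallel(a float64, n int) float64 {
    return 1 - math.Pow(1-a, float64(n))
}

// serial returns the availability of components arranged in series:
// the product of their individual availabilities.
func serial(as ...float64) float64 {
    total := 1.0
    for _, a := range as {
        total *= a
    }
    return total
}

func main() {
    fmt.Println(parallel(0.99, 2))               // ≈ 0.9999
    fmt.Println(parallel(0.99, 3))               // ≈ 0.999999
    fmt.Println(serial(0.99, parallel(0.99, 3))) // ≈ 0.98999901
}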

Autoscaling

Very often, the amount of load that a service is subjected to varies over time. The textbook example is the user-facing web service where load increases during the day and decreases at night. If such a service is built to handle the peak load, it’s wasting time and money at night. If it’s built only to handle the nighttime load, it will be overburdened in the daytime.

Autoscaling is a technique that builds on the idea of load balancing by automatically adding or removing resources—be they cloud server instances or Kubernetes pods—to dynamically adjust capacity to meet current demand. This ensures that your service can meet a variety of traffic patterns, anticipated or otherwise.

As an added bonus, applying autoscaling to your cluster can save money by right-sizing resources according to service requirements.

All major cloud providers provide a mechanism for scaling server instances, and most of their managed services implicitly or explicitly support autoscaling. Container orchestration platforms like Kubernetes also include support for autoscaling, both for the number of pods (horizontal autoscaling) and their CPU and memory limits (vertical autoscaling).

Autoscaling mechanics vary considerably between cloud providers and orchestration platforms, so a detailed discussion of how to gather metrics and configure things like predictive autoscaling is beyond the scope of this book. However, here are some key points to remember:

  • Set reasonable maximums, so that unusually large spikes in demand (or, heaven forbid, cascade failures) don’t completely blow your budget. The throttling and load shedding techniques that we discussed in “Preventing Overload” are also useful here.

  • Minimize startup times. If you’re using server instances, bake machine images beforehand to minimize configuration time at startup. This is less of an issue on Kubernetes, but container images should still be kept small and startup times reasonably short.

  • No matter how fast your startup, scaling takes a nonzero amount of time. Your service should have some wiggle room without having to scale.

  • As we discussed in “Scaling Postponed: Efficiency”, the best kind of scaling is the kind that never needs to happen.

Healthy Health Checks

In “Service Redundancy”, we briefly discussed the value of redundancy—the duplication of critical components or functions of a system with the intention of increasing overall system reliability—and its value for improving the resilience of a system.

Multiple service instances means having a load-balancing mechanism—a service mesh or dedicated load balancer—but what happens when a service instance goes bad? Certainly, we don’t want the load balancer to continue sending traffic its way. So what do we do?

Enter the health check. In its simplest and most common form, a health check is implemented as an API endpoint that clients—load balancers, as well as monitoring services, service registries, etc.—can use to ask a service instance if it’s alive and healthy. For example, a service might provide an HTTP endpoint (/health is a common naming choice) that returns a 200 OK if the replica is healthy and a 503 Service Unavailable when it’s not. More sophisticated implementations can even return different status codes for different states: HashiCorp’s Consul service registry interprets any 2XX status as a success, a 429 Too Many Requests as a warning, and anything else as a failure.

Having an endpoint that can tell a client when a service instance is healthy (or not) sounds great and all, but it invites the question: what, exactly, does it mean for an instance to be “healthy”?

Tip

Health checks are like bloom filters. A failing health check means a service isn’t up, but a health check that passes means the service is probably “healthy.” (Credit: Cindy Sridharan)19

What Does It Mean for a Service to Be “Healthy”?

We use the word healthy in the context of services and service instances, but what exactly do we mean when we say that? Well, as is so often the case, there’s a simple answer and a complex answer. Probably a lot of answers in between, too.

We’ll start with the simple answer. Reusing an existing definition, an instance is considered “healthy” when it’s “available.” That is, when it’s able to provide correct service.

Unfortunately, it isn’t always so clear-cut. What if the instance itself is functioning as intended, but a downstream dependency is malfunctioning? Should a health check even make that distinction? If so, should the load balancer behave differently in each case? Should an instance be reaped and replaced if it’s not the one at fault, particularly if all service replicas are affected?

Unfortunately, there aren’t any easy answers to these questions, so instead of answers, I’ll offer the next best thing: a discussion of the three most common approaches to health checking and their associated advantages and disadvantages. Your own implementations will depend on the needs of your service and your load-balancing behavior.

Liveness and Readiness Probes

While health checks tend to be designed to probe whether a system is generally healthy, Kubernetes and similar orchestration systems often use distinct liveness probes and readiness probes to ascertain that a service is both functional and ready to serve traffic:

Liveness probe

A liveness probe determines whether an application inside a container is alive and functioning correctly. Failure indicates that the application is in a broken state and must be restarted. Liveness probes are used to catch deadlocks or other issues where an application is running but is unable to make progress.

Readiness probe

A readiness probe checks whether an application is fully initialized and ready to serve traffic. If a readiness probe fails, the orchestration system won’t send traffic to that container until it passes the probe again, though the container isn’t restarted. Readiness probes are particularly useful for applications with long initialization times.

So, what if you want to run a service in Kubernetes, but all it has is a general health check? How should you configure your liveness and readiness probes?

Well, it depends on what the health check is checking. If the check accurately reflects both that the application is running and is ready to serve traffic, then you can use it for both probes. Any health check serving both roles should be reasonably lightweight though, since it’ll be called frequently by both probes.
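On the Go side, that can be as simple as serving the same check under both of the paths your probes point at. Here’s a minimal sketch, assuming a hypothetical healthHandler function; the path names are just a common convention:

func main() {
    // One lightweight check, exposed under both probe paths.
    http.HandleFunc("/livez", healthHandler)
    http.HandleFunc("/readyz", healthHandler)

    log.Fatal(http.ListenAndServe(":8080", nil))
}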

The Three Types of Health Checks

When a service instance fails, it’s usually because of one of the following:

  • A local failure, like an application error or resource—CPU, memory, database connections, etc.—depletion

  • A remote failure in some dependency—a database or other downstream service—that affects the functioning of the service

These two broad categories of failures give rise to three (yes, three) health checking strategies, each with its own fun little pros and cons:

Reachability checks

Do little more than return a “success” signal. They make no additional attempt to determine the status of the service, and say nothing about the service except that it’s listening and reachable. Then again, sometimes this is enough. We’ll talk more about reachability checks in “Reachability checks”.

Shallow health checks

Go further than reachability checks by verifying that the service instance is likely to be able to function. These health checks test only local resources, so they’re unlikely to fail on many instances simultaneously, but they can’t say for certain whether a request to a particular service instance will be successful. We’ll wade into shallow health checks in “Shallow health checks”.

Deep health checks

Provide a much better understanding of instance health, since they actually inspect the ability of a service instance to perform its function, which also exercises downstream resources like databases. While thorough, they can be expensive and are susceptible to false positives. We’ll dig into deep health checks in “Deep health checks”.

Reachability checks

A reachability endpoint always returns a “success” value, no matter what. While this might seem trivial to the point of uselessness—after all, what is the value of a health check that doesn’t say anything about health—reachability checks actually can provide some useful information by confirming that:

  • The service instance is listening and accepting new connections on the expected port

  • The instance is reachable over the network

  • Any firewall, security group, or other configurations are correctly defined

This simplicity comes with a predictable cost, of course. The absence of any active health checking logic makes reachability checks of limited use when it comes to evaluating whether a service instance can actually perform its function.

Reachability probes are also dead easy to implement. Using the net/http package, we can do the following:

func healthReachabilityHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("OK"))
}

func main() {
    r := mux.NewRouter()
    r.HandleFunc("/health", healthReachabilityHandler)
    log.Fatal(http.ListenAndServe(":8080", r))
}

The previous snippet shows how little work can go into a reachability check. In it, we create and register a /health endpoint that does nothing but return a 200 OK (and the text OK, just to be thorough).

Warning

If you’re using the gorilla/mux package, any registered middleware (like the load shedding function from “Load shedding”) can affect your health checks!
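One way to keep middleware away from your health endpoint—a sketch, assuming a hypothetical loadSheddingMiddleware and keysHandler—is to register the health check directly on the root router and attach the middleware only to a subrouter that serves everything else:

func main() {
    r := mux.NewRouter()

    // The health check is registered directly on the root router,
    // so the middleware below never sees it.
    r.HandleFunc("/health", healthReachabilityHandler)

    // Everything else lives on a subrouter, which gets the middleware.
    api := r.PathPrefix("/").Subrouter()
    api.Use(loadSheddingMiddleware)
    api.HandleFunc("/v1/{key}", keysHandler)

    log.Fatal(http.ListenAndServe(":8080", r))
}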

Shallow health checks

Shallow health checks go further than reachability checks by verifying that the service instance is likely to be able to function, but stop short of investigating in any way that might exercise a database or other downstream dependency.

Shallow health checks can evaluate any number of conditions that could adversely affect the service, including (but certainly not limited to):

  • The availability of key local resources (memory, CPU, database connections)

  • The ability to read or write local data, which checks for disk space, permissions, and hardware malfunctions, such as disk failure

  • The presence of support processes, such as monitoring or updater processes

Shallow health checks are more definitive than reachability checks, and their specificity means that any failures are unlikely to affect the entire fleet at once.20 However, shallow checks are prone to false positives: if your service is down because of some issue involving an external resource, a shallow check will miss it. What you gain in specificity, you sacrifice in sensitivity.

A shallow health check might look something like the following example, which tests the service’s ability to read and write to and from local disk:

func healthShallowHandler(w http.ResponseWriter, r *http.Request) {
    // Create our test file.
    // This will create a filename like /tmp/shallow-123456
    tmpFile, err := os.CreateTemp(os.TempDir(), "shallow-")
    if err != nil {
        http.Error(w, err.Error(), http.StatusServiceUnavailable)
        return
    }
    defer os.Remove(tmpFile.Name())

    // Make sure that we can write to the file.
    text := []byte("Check.")
    if _, err = tmpFile.Write(text); err != nil {
        http.Error(w, err.Error(), http.StatusServiceUnavailable)
        return
    }

    // Make sure that we can close the file.
    if err := tmpFile.Close(); err != nil {
        http.Error(w, err.Error(), http.StatusServiceUnavailable)
        return
    }

    // We got this far -- we're healthy.
    w.WriteHeader(http.StatusOK)
}

func main() {
    r := mux.NewRouter()
    http.HandleFunc("/health", healthShallowHandler)
    log.Fatal(http.ListenAndServe(":8080", r))
}

This simultaneously checks for available disk space, write permissions, and malfunctioning hardware, which can be useful to test, particularly if the service needs to write to an on-disk cache or other transient files.

An observant reader might notice that it writes to the default directory for temporary files. On Linux, this is /tmp, which is often actually a RAM drive. This might be useful to test as well, but if you want to test for the ability to write to disk on Linux, you’ll need to specify a different directory, or this becomes a very different test.
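If your service writes its real data to a specific location, it’s usually better to point the check there instead. For example (the path is purely illustrative):

// Exercise the filesystem the service actually depends on, rather
// than the OS default temporary directory.
tmpFile, err := os.CreateTemp("/var/lib/myservice", "shallow-")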

Deep health checks

Deep health checks directly inspect the ability of a service to interact with its adjacent systems. This provides much better understanding of instance health by potentially identifying issues with dependencies, like invalid credentials, the loss of connectivity to data stores, or other unexpected networking issues.

However thorough they may be, deep health checks can be quite expensive. They can take a long time and place a burden on dependencies, particularly if you’re running too many of them or running them too often.

Tip

You don’t need to test every dependency in your health checks—focus on the ones that are required for the service to operate.

Also, when testing multiple downstream dependencies, evaluate them concurrently if possible.
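Here’s one way that second point might look in practice, using the errgroup package from golang.org/x/sync; the checkDatabase and checkCache functions are hypothetical stand-ins for your own dependency checks:

func healthConcurrentHandler(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()

    // Each check runs in its own goroutine; g.Wait returns the first
    // non-nil error (if any) once all of the checks have finished.
    g, ctx := errgroup.WithContext(ctx)
    g.Go(func() error { return checkDatabase(ctx) })
    g.Go(func() error { return checkCache(ctx) })

    if err := g.Wait(); err != nil {
        http.Error(w, err.Error(), http.StatusServiceUnavailable)
        return
    }

    w.WriteHeader(http.StatusOK)
}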

What’s more, because the failure of a dependency will be reported as a failure of the instance, deep checks are especially susceptible to false positives. Combine that with their lower specificity compared to a shallow check—issues with dependencies will be felt by the entire fleet—and you have the potential for a cascading failure.

If you’re using deep health checks, you should take advantage of strategies like circuit breaking (which we covered in “Circuit Breaking”) where you can, and your load balancer should “fail open” (which we’ll discuss in “Failing Open”) whenever possible.

Here we have a trivial example of a possible deep health check that evaluates a database by calling a hypothetical service’s GetUser function:

func healthDeepHandler(w http.ResponseWriter, r *http.Request) {
    // Retrieve the context from the request and add a 5-second timeout
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()

    // service.GetUser is a hypothetical method on a service interface
    // that executes a database query
    if err := service.GetUser(ctx, 0); err != nil {
        http.Error(w, err.Error(), http.StatusServiceUnavailable)
        return
    }

    // All good -- return OK
    w.WriteHeader(http.StatusOK)
}

func main() {
    r := mux.NewRouter()
    http.HandleFunc("/health", healthDeepHandler)
    log.Fatal(http.ListenAndServe(":8080", r))
}

Ideally, a dependency test should execute an actual system function but also be lightweight to the greatest reasonable degree. In this example, the GetUser function triggers a database query that satisfies both of these criteria.21

“Real” queries are generally preferable to just pinging the database for two reasons. First, they’re a more representative test of what the service is doing. Second, they allow you to leverage end-to-end query time as a measure of database health. The previous example actually does this—albeit in a very binary fashion—by using Context to set a hard timeout value, but you could choose to include more sophisticated logic instead.

Failing Open

What if all of your instances simultaneously decide that they’re unhealthy? If you’re using deep health checks, this can actually happen quite easily (and, perhaps, regularly). Depending on how your load balancer is configured, you might find yourself with zero instances serving traffic, possibly causing failures rippling across your system.

Fortunately, some load balancers handle this quite cleverly by “failing open.” If a load balancer that fails open has no healthy targets—that is, if all of its targets’ health checks are failing—it will route traffic to all of its targets.

This is slightly counterintuitive behavior, but it makes deep health checks a little safer to use by allowing traffic to continue to flow even when a downstream dependency may be having a bad day.

Graceful Shutdowns

The life of a cloud native service can be unforgiving, and often short.22 An orchestration system like Kubernetes won’t hesitate to terminate a running container to clear room for something higher priority. Entire cluster nodes can even be scaled down, requiring everything on them to shut down as well.

Shutdowns are risky, however. There’s potential for data corruption, lost transactions, or other kinds of ugliness that can degrade user experience and trust. This is especially true of abrupt, or “hard,” shutdowns. But that doesn’t always have to be the case; when a shutdown is explicit and intentionally executed, affected systems can choose to execute a graceful shutdown.

Graceful shutdowns involve more than simply turning off a service. They provide an opportunity to make sure that it doesn’t leave a mess by thoughtfully cleaning up any active connections, managing in-flight data, and making sure that resources are cleanly released.

Signals and Traps

A signal is an asynchronous OS-level notification sent to a process to notify it of an event and trigger specific behavior. Common uses of signals are to interrupt, suspend, terminate, or kill a process.

When a signal is sent, the OS interrupts the target process’s normal flow of execution to deliver the signal. If the process has registered a signal handler, a programmer-defined function that’s invoked when the signal is delivered, the process can “catch” (or “trap”) the signal and execute the routine. Otherwise, the default signal handler is executed.

Common POSIX signals

Linux supports about 30 specific signals, many of which are defined in the POSIX standard. Most of these are too low-level to be useful to us here, but there are five signals in particular that you’re quite likely to encounter:

SIGTERM (Signal terminate)

Sent to a process to request its termination, allowing for a graceful shutdown. Can be caught and acted upon or simply ignored.

SIGINT (Signal interrupt)

Sent to a process by its controlling terminal when a user wishes to interrupt the process, such as by pressing Ctrl+C.

SIGQUIT (Signal quit)

Sent to a process by its controlling terminal when the user requests that the process quit and perform a core dump.

SIGKILL (Signal kill)

Sent to cause a process to terminate immediately. SIGKILL cannot be caught or ignored, and the receiving process cannot perform any clean-up upon receiving this signal.

SIGHUP (Signal hangup)

Originally designed to notify a process of a serial line drop (a hang-up), but now many services interpret this signal as a request to reload their configuration files and flush their log files.

If you’re interested in seeing the entire list, you can find them in the man page on the subject.

Note

This section is largely specific to Linux and other POSIX-compliant operating systems. The only signal values guaranteed to be available on all operating systems (including Windows) are SIGINT and SIGKILL.

Catching signals

Nestled within the Go standard libraries is the os/signal package, which offers a number of functions that provide access to incoming signals. The standard documentation goes into great detail about the standard behavior of signals and how to interact with them. More detail than we can get into here. I recommend you take a look; it’s good information.23

Perhaps the most commonly used application of the os/signal package is to respond to incoming signals. Go does this in a very Go way: by emitting signal events along a channel:

// Create the signal channel
c := make(chan os.Signal, 1)

// Relay incoming SIGTERM, SIGINT, and SIGQUIT signals to the channel
signal.Notify(c, syscall.SIGTERM, syscall.SIGINT, syscall.SIGQUIT)

// Wait for a signal
<-c

In this example, we create a channel of type chan os.Signal and pass it to the signal.Notify function, along with a selection of syscall.Signal constants representing the signals we’re interested in. The signal.Notify function causes the specified signal types to be relayed to the channel, so reading on channel c will block until one of those signals is caught.

Warning

The signals SIGKILL and SIGSTOP cannot be caught by any program and cannot be affected by the os/signal package.

If you prefer to work with Context values, the signal package also includes a gem in the form of the NotifyContext function:

func NotifyContext(parent context.Context, signals ...os.Signal)
    (context.Context, context.CancelFunc)

This variation on the Notify function works a lot like the functions we discussed in “The Context Package” in that it accepts a parent context.Context value and returns a child Context and a cancellation function. As usual, this child Context gets marked as done when its parent is marked as done or the cancellation function is called, but it’s also marked as done when any of the indicated signals arrive. This can be quite useful for cleaning up resources across process boundaries.

Note

If no signals are specified to signal.Notify or signal.NotifyContext, then all incoming signals will be relayed. What fun!

Stop Incoming Requests

When a service receives a shutdown signal, the first thing it should do is stop accepting requests. Fortunately, readiness probes provide a convenient means of telling an orchestration system to stop sending it traffic.

As we discussed in “Liveness and Readiness Probes”, a readiness probe is a specific kind of health check used by Kubernetes and other orchestration systems to determine whether a service is ready to serve traffic. If a readiness probe suddenly starts to fail, the orchestration system responds by stopping traffic to that container until it passes the probe again.

This can be done by returning an error status on your readiness endpoint. Any HTTP status of 400 or higher will do, but a 503 (service unavailable) is generally used for this purpose.

Warning

Don’t try this with your liveness checks. The orchestration system may forcibly kill your service (via an uncatchable SIGKILL) before it’s fully cleaned up!

Clean Up Your Resources

Once you’ve stopped handling incoming requests, the last step in a graceful shutdown process is to clean up any open resources.

The exact form this takes will depend on the nature of your application, but generally it involves waiting for any outstanding requests to complete and cleanly closing any open resources. You’ll want to close any open file handlers and database connections, flush any logs, etc.

Putting It Into Action

Now that we have the building blocks of a graceful shutdown, let’s put them into action!

Next I show how to use the signal.NotifyContext function and the http.Server’s RegisterOnShutdown method to implement a graceful shutdown in a service constructed using the http package:

func main() {
    // Get a context that closes on SIGTERM, SIGINT, or SIGQUIT
    ctx, cancel := signal.NotifyContext(
        context.Background(),
        syscall.SIGTERM, syscall.SIGINT, syscall.SIGQUIT)
    defer cancel()

    server := &http.Server{Addr: ":8080"}

    // Register a cleanup function to be automatically
    // called when the server is shut down
    server.RegisterOnShutdown(doCleanup)

    // Register the readiness and liveness probes.
    http.Handle("/ready", handleReadiness(ctx))
    http.Handle("/health", handleLiveness())

    // This goroutine will respond to context closure
    // by shutting down the server
    go func() {
        // Read from the context's Done channel
        // This operation will block until the context closes
        <-ctx.Done()

        log.Println("Got shutdown signal.")

        // Wait for the readiness probe to detect the failure
        <-time.After(5 * time.Second)

        // Issue the shutdown proper. Don't pass the
        // already-closed Context value to it!
        if err := server.Shutdown(context.Background()); err != nil {
            log.Printf("Error while stopping HTTP listener: %s", err)
        }
    }()

    // Begin listening on :8080
    log.Println(server.ListenAndServe())
}

func handleReadiness(ctx context.Context) http.Handler {
    f := func(w http.ResponseWriter, r *http.Request) {
        select {
        case <-ctx.Done():
            w.WriteHeader(http.StatusServiceUnavailable)
        default:
            w.WriteHeader(http.StatusOK)
        }
    }
    return http.HandlerFunc(f)
}

The main function has a few moving parts. Most of them should be familiar to you from Chapter 5, but if you need a refresher, go ahead and take a look at “Building an HTTP Server with net/http”.

Starting out, we use the signal.NotifyContext function to retrieve a new context.Context value. We’ve used a lot of Context values, but this one is interesting because it’ll be canceled if the service receives a SIGTERM, SIGINT, or SIGQUIT signal.

We then create an http.Server value as usual. What’s new, though, is that we then call the RegisterOnShutdown method on it to specify a function, doCleanup, that will be automatically called when the server is shut down.

Next, we register our function handlers, a lot like we did before. There would usually be several of these to respond to a variety of patterns, but we add just two: a readiness probe and a liveness probe. Of these, the handleReadiness function is particularly interesting, in that the handler that it returns has exactly one job: to respond to requests with http.StatusServiceUnavailable if the Context value is canceled, and http.StatusOK if not.

Now that the basic setup is complete, we next define a very special goroutine that also waits for ctx to be canceled, and calls server.Shutdown when it does. As the name of this function implies, this will trigger a shutdown of the service, which in turn triggers doCleanup, the shutdown function that we registered previously.
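Two pieces of this example—doCleanup and handleLiveness—aren’t shown above, because their contents depend entirely on your service. As a minimal sketch, they might look something like the following:

// doCleanup is where you'd release whatever the service holds open:
// flush logs, close database connections, and so on.
func doCleanup() {
    log.Println("Cleaning up resources...")
    // e.g., db.Close(), logger.Sync(), etc.
}

// handleLiveness reports that the process is alive. Unlike the
// readiness handler, it keeps returning 200 during shutdown so the
// orchestrator doesn't kill us before cleanup finishes.
func handleLiveness() http.Handler {
    f := func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    }
    return http.HandlerFunc(f)
}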

Summary

This was an interesting chapter to write. There’s quite a lot to say about resilience, and so much crucial supporting operational background. As with other chapters, I had to make some tough calls about what would make it in and what wouldn’t. At more than 40 pages, it still turned out a fair bit longer than I intended, but I’m quite satisfied with the outcome. It’s a reasonable compromise between too little information and too much, and between operational background and actual Go implementations.

We reviewed what it means for a system to fail and how complex systems fail (that is, one component at a time). This led naturally to discussing a particularly nefarious, yet common, failure mode: cascading failures. In a cascade failure, a system’s own attempts to recover hasten its collapse. We covered common server-side measures for preventing cascading failures: throttling and load shedding.

Retries in the face of errors can contribute a lot to a service’s resilience, but as we saw in the DynamoDB case study, they can also contribute to cascade failures when applied naively. We dug deep into measures that can be taken on the client side as well, including circuit breakers, timeouts, and especially exponential backoff algorithms. There were several pretty graphs involved. I spent a lot of time on the graphs.

All of this led to conversations about service redundancy, how it affects reliability (with a little math thrown in, just for fun), and when and how to best leverage autoscaling.

Of course, you can’t talk about autoscaling without talking about resource “health.” We asked (and did our best to answer) what it means for an instance to be “healthy,” and how that translates into health checks. We covered the three kinds of health checks and weighed their pros and cons, paying particular attention to their relative sensitivity/specificity trade-offs.

In Chapter 10, we’ll take a break from the operational topics for a bit and wade into the subject of manageability: the art and science of changing the tires on a moving car.

1 “Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region”, Amazon AWS, September 2015.

2 Richard I. Cook, “How Complex Systems Fail”, 1998.

3 If you’re interested in a complete academic treatment, I highly recommend Reliability and Availability Engineering by Kishor S. Trivedi and Andrea Bobbio (Cambridge University Press, 2017).

4 Importantly, many faults are evident only in retrospect.

5 See? We eventually got there.

6 Go on, ask me how I know this.

7 Especially if the service is available on the open sewer that is the public internet.

8 Wikipedia contributors, “Token bucket”, Wikipedia, June 5, 2019.

9 Available in the associated GitHub repository is the code used to simulate all data in this section.

10 Doing that here felt redundant, but I’ll admit that I may have gotten a bit lazy.

11 And, technically, request-scoped values, but the correctness of this functionality is debatable.

12 Roy Fielding et al., “Hypertext Transfer Protocol—HTTP/1.1”, Proposed Standard, RFC 2068, June 1997.

13 Tim Berners-Lee et al., “Hypertext Transfer Protocol—HTTP/1.0”, Informational, RFC 1945, May 1996.

14 Roy Fielding, “Architectural Styles and the Design of Network-Based Software Architectures”, PhD dis., University of California, Irvine, 2000, pp. 76–106.

15 You monster.

16 Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems by Heather Adkins—and a host of other authors (O’Reilly)—is one excellent example.

17 Brace yourself. We’re going in.

18 This assumes that the failure rates of the components are absolutely independent, which is very unlikely in the real world. Treat as you would spherical cows in a vacuum.

19 Cindy Sridharan (@copyconstruct), “Health checks are like bloom filters…​”, Twitter (now X), August 5, 2018.

20 Though I’ve seen it happen.

21 It’s an imaginary function, so let’s just agree that that’s true.

22 There’s a good reason why this book is subtitled “Building Reliable Services in Unreliable Environments.”

23 Go Standard Library: os/signal.

Chapter 10. Manageability

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?1

Brian Kernighan, The Elements of Programming Style (1978)

In a perfect world, you’d never have to deploy a new version of your service or (heaven forbid!) shut down your entire system to fix or modify it to meet new requirements.

Then again, in a perfect world, unicorns would exist and four out of five dentists would recommend we eat pie for breakfast.2

Clearly, we don’t live in a perfect world. But while unicorns might never exist,3 you don’t have to resign yourself to a world where you have to update your code whenever you need to alter your system’s behavior.

While you’ll probably always have to make code changes to update core logic, it is possible to build your systems so that you—or, critically, somebody else—can change a surprising variety of behaviors without having to recode and redeploy.

You may recall that we introduced this important attribute of cloud native systems back in “Manageability”, where we defined it as the ease with which a system’s behavior can be modified to keep it secure, running smoothly, and compliant with changing requirements.

While this sounds straightforward, there’s actually quite a bit more to manageability than you might think. It goes far beyond configuration files (though that’s certainly part of it). In this chapter, we’ll discuss what it means to have a manageable system, and we’ll cover some of the techniques and implementations that can allow you to build a system that can change almost as quickly as its requirements.

Manageability Is Not Maintainability

It can be said that manageability and maintainability have some “mission overlap” in that they’re both concerned with the ease with which a system can be modified. Where they differ somewhat is exactly how the system is modified:

Manageability

Describes the ease with which changes can be made to the behavior of a system, typically without having to resort to changing its code. In other words, it’s how easy it is to change a system from the outside.

Maintainability

Describes the ease with which a software system or component can be modified to change or add capabilities, correct faults or defects, or improve performance,4 usually by making changes to the code. In other words, it’s how easy it is to change a system from the inside.

What Is Manageability and Why Should I Care?

When considering manageability, it’s common to think in terms of a single service. Can my service be configured easily? Does it have all the knobs and dials that it might need?

However, this misses the larger point by focusing on the component at the expense of the system. Manageability doesn’t end at the service boundary. For a system to be manageable, the entire system has to be considered.

Take a moment to reconsider manageability with a complex system in mind. Can its behavior be easily modified? Can its components be modified independently of one another? Can they be easily replaced, if necessary? How do we know when that is?

Manageability encompasses all possible dimensions of a system’s behavior. Its functions can be said to fall into four broad categories:5

Configuration and control

It’s important that a system—and each of its components—be easy to set up and configure for optimal availability and performance. Some systems need regular or real-time control, so having the right “knobs and levers” is absolutely fundamental. This is where we’ll focus most of our attention in this chapter.

Monitoring, logging, and alerting

These functions keep track of the system’s ability to do its job and are critical to effective system management. After all, without them, how would we know when our system requires management? As vital as these features are to manageability, we won’t discuss them in this chapter. Instead, they get an entire chapter of their own in Chapter 11.

Deployment and updates

Even in the absence of code changes, the ability to easily deploy, update, roll back, and scale system components is valuable, especially when there are many systems to manage. Obviously, this is useful during the initial deployment, but it comes into effect throughout a system’s lifetime any time it has to be updated. Fortunately, Go’s lack of external runtime dependencies and its single-file executable artifacts make this an area in which it excels.

Service discovery and inventory

A key feature of cloud native systems is their distributed nature. It’s critical that components be able to quickly and accurately detect one another, a function called service discovery. Since service discovery is an architectural feature rather than a programmatic one, we won’t go too deeply into it in this book.

Because this is more of a Go book than it is an architecture book,6 it focuses largely on service implementations. For that reason only—not because it’s more important—most of this chapter will similarly focus on service-level configuration. Unfortunately, an in-depth discussion of the other functions is beyond the scope of this book.7

Managing complex computing systems is generally difficult and time-consuming, and the costs of managing them can far exceed the costs of the underlying hardware and software. By definition, a system designed to be manageable can be managed more efficiently, and therefore more cheaply. Even if you don’t consider management costs, complexity reduction can have a huge impact on the likelihood of human error, making it easier and faster to undo when it inevitably creeps in. In that way, manageability directly impacts reliability, availability, and security, making it a key ingredient of system dependability.

Configuring Your Application

The most basic function of manageability is the ability to configure an application. In an ideally configurable application, anything that’s likely to vary between environments—staging, production, developer environments, etc.—will be cleanly separated from the code and be externally definable in some way.

You may recall that The Twelve-Factor App—a set of 12 rules and guidelines for building web applications that we introduced way back in Chapter 6—had quite a bit to say on this subject. In fact, the third of its 12 rules—“III. Configuration”—was concerned entirely with application configuration, about which it says:

Store configuration in the environment.

As written, The Twelve-Factor App insists that all configurations should be stored in environment variables. There are plenty of opinions on this, but in the years since its publication the industry seems to have reached a general consensus on what really matters:

Configuration should be strictly separated from the code

Configuration—anything that’s likely to vary between environments—should always be cleanly separated from the code. While configuration can vary substantially across deploys, code does not. Configuration shouldn’t be baked into the code. Ever.

Configurations should be stored in version control

Storing configurations in version control—separately from the code—allows you to quickly roll back a configuration change if necessary and aids system re-creation and restoration. Some deployment frameworks, like Kubernetes, make this distinction naturally and relatively seamlessly by providing configuration primitives like the ConfigMap.

These days, it’s still quite common to see applications configured mainly by environment variables, but it’s just as common to see command-line flags and configuration files with various formats. Sometimes an application will even support more than one of these options. In the subsequent sections, we’ll review some of these methods, their various pros and cons, and how they can be implemented in Go.

Configuration Good Practice

When you’re building an application, you have a lot of options in how you define, implement, and deploy your application configurations. However, in my experience, I’ve found that certain general practices produce better long- and short-term outcomes:

Version control your configurations

Yes, I’m repeating myself, but this bears repeating. Configuration files should be stored in version control before being deployed to the system. This makes it possible to review them before deployment, to quickly reference them afterward, and to quickly roll back a change if necessary. It’s also helpful if (and when) you need to re-create and restore your system.

Don’t roll your own format

Write your configuration files using a standard format like JSON, YAML, or TOML. We’ll cover some of these later in the chapter. If you must roll your own format, be sure that you’re comfortable with the idea of maintaining it—and forcing any future maintainers to deal with it—forever.

Make the zero value useful

Don’t use nonzero default values unnecessarily. This is actually a good rule in general; there’s even a “Go proverb” about it.8 Whenever possible, the behavior that results from an undefined configuration should be acceptable, reasonable, and unsurprising. A simple, minimal configuration makes errors less likely.
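For example, you might design your configuration struct so that its zero value describes sensible, working behavior. The following sketch is illustrative; the field names are made up:

type Config struct {
    // Verbose enables debug logging. The zero value, false, gives the
    // quiet behavior most users expect.
    Verbose bool

    // Timeout bounds outbound requests. The zero value, 0, is passed
    // straight through to http.Client, where it means "no timeout."
    Timeout time.Duration
}

func newClient(cfg Config) *http.Client {
    // A zero-valued Config produces a perfectly usable client.
    return &http.Client{Timeout: cfg.Timeout}
}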

Configuring with Environment Variables

As we discussed in Chapter 6, and reviewed previously, using environment variables to define configuration values is the method advocated for in The Twelve-Factor App. There’s some merit to this preference: environment variables are universally supported, they ensure that configurations don’t get accidentally checked into the code, and using them generally requires less code than using a configuration file. They’re also perfectly adequate for small applications.

On the other hand, the process of setting and passing environment variables can be ugly, tedious, and verbose. While some applications support defining environment variables in a file, this largely defeats the purpose of using environment variables in the first place.

The implicit nature of environment variables can introduce some challenges as well. Since you can’t easily learn about the existence and behavior of environment variables by looking at an existing configuration file or checking the help output, applications that rely on them can sometimes be harder to use, and errors in them harder to debug.

As with most high-level languages, Go makes environment variables easily accessible. It does this through the standard os package, which provides the os.Getenv function for this purpose:

name := os.Getenv("NAME")
place := os.Getenv("CITY")

fmt.Printf("%s lives in %s.\n", name, place)

The os.Getenv function retrieves the value of the environment variable named by the key, but if the variable isn’t present, it’ll return an empty string. If you need to distinguish between an empty value and an unset value, Go also provides the os.LookupEnv function, which returns both the value and a bool that’s false if the variable isn’t set:

if val, ok := os.LookupEnv(key); ok {
    fmt.Printf("%s=%s\n", key, val)
} else {
    fmt.Printf("%s not set\n", key)
}

This functionality is pretty minimal but perfectly adequate for many (if not most) purposes. If you’re in need of more sophisticated options, like default values or typed variables, there are several excellent third-party packages that provide this functionality. Viper (spf13/viper)—which we’ll discuss in “Viper: The Swiss Army Knife of Configuration Packages”—is particularly popular.
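That said, if all you need is a simple fallback value, a tiny helper built on os.LookupEnv may be all it takes (a sketch):

// getEnvDefault returns the value of the named environment variable,
// or fallback if the variable isn't set.
func getEnvDefault(key, fallback string) string {
    if val, ok := os.LookupEnv(key); ok {
        return val
    }
    return fallback
}

With that in place, a call like getEnvDefault("PORT", "8080") reads the variable when it’s set and falls back to a reasonable default when it isn’t.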

Configuring with Command-Line Arguments

As configuration methods go, command-line arguments are definitely worth considering, at least for smaller, less-complex applications. After all, they’re explicit, and details of their existence and usage are usually available via a --help option.

The standard flag package

Go includes the flag package, which is a basic command-line parsing package, in its standard library. While flag isn’t particularly feature-rich, it’s fairly straightforward to use, and—unlike os.Getenv—supports typing out of the box.

Take, for example, the following program, which uses flag to implement a basic command that reads and outputs the values of command-line flags:

package main

import (
    "flag"
    "fmt"
)

func main() {
    // Declare a string flag with a default value "foo"
    // and a short description. It returns a string pointer.
    strp := flag.String("string", "foo", "a string")

    // Declare number and Boolean flags, similar to the string flag.
    intp := flag.Int("number", 42, "an integer")
    boolp := flag.Bool("boolean", false, "a boolean")

    // Call flag.Parse() to execute command-line parsing.
    flag.Parse()

    // Print the parsed options and trailing positional arguments.
    fmt.Println("string:", *strp)
    fmt.Println("integer:", *intp)
    fmt.Println("boolean:", *boolp)
    fmt.Println("args:", flag.Args())
}

As you can see from the previous code, the flag package allows you to register command-line flags with types, default values, and short descriptions, and to map those flags to variables. We can see a summary of these flags by running the program and passing it the -help flag:

$ go run . -help
Usage of /var/folders/go-build618108403/exe/main:
  -boolean
        a boolean
  -number int
        an integer (default 42)
  -string string
        a string (default "foo")

The help output presents us with a list of all of the available flags. Exercising all of these flags gives us something like the following:

$ go run . -boolean -number 27 -string "A string." Other things.
string: A string.
integer: 27
boolean: true
args: [Other things.]

It works! However, the flag package seems to have a couple of issues that limit its usefulness.

First, as you may have noticed, the resulting flag syntax seems a little…​nonstandard. Many of us have come to expect CLIs to follow the GNU argument standard, with long-named options prefixed by two dashes (--version) and short, single-letter equivalents (-v).

Second, all flag does is parse flags (though to be fair, it doesn’t claim to do any more than that), and while that’s nice, it’s not as powerful as it could be. It sure would be nice if we could map commands to functions, wouldn’t it?

The Cobra command-line parser

The flag package is perfectly fine if all you need to do is parse flags, but if you’re in the market for something a little more powerful to build your CLIs, you might want to consider the Cobra package. Cobra has a number of features that make it a popular choice for building fully featured CLIs. It’s used in a number of high-profile projects, including Kubernetes, CockroachDB, Docker, Istio, and Helm.

In addition to providing fully POSIX-compliant flags (short and long versions), Cobra also supports nested subcommands and automatically generates help (--help) output and autocomplete for various shells. It also integrates with Viper, which we’ll cover in “Viper: The Swiss Army Knife of Configuration Packages”.

Cobra’s primary downside, as you might imagine, is that it’s quite complex relative to the flag package. Using Cobra to implement the program from “The standard flag package” looks like the following:

package main

import (
    "fmt"
    "os"
    "github.com/spf13/cobra"
)

var strp string
var intp int
var boolp bool

var rootCmd = &cobra.Command{
    Use:  "flags",
    Long: "A simple flags experimentation command, built with Cobra.",
    Run:  flagsFunc,
}

func init() {
    rootCmd.Flags().StringVarP(&strp, "string", "s", "foo", "a string")
    rootCmd.Flags().IntVarP(&intp, "number", "n", 42, "an integer")
    rootCmd.Flags().BoolVarP(&boolp, "boolean", "b", false, "a boolean")
}

func flagsFunc(cmd *cobra.Command, args []string) {
    fmt.Println("string:", strp)
    fmt.Println("integer:", intp)
    fmt.Println("boolean:", boolp)
    fmt.Println("args:", args)
}

func main() {
    if err := rootCmd.Execute(); err != nil {
        fmt.Println(err)
        os.Exit(1)
    }
}

In contrast to the flag-package version, which basically just reads some flags and prints the results, the Cobra program has a bit more complexity, with several distinct parts.

First, we declare the target variables with package scope, rather than locally within a function. This is necessary because they have to be accessible to both the init function and the function that implements the command logic proper.

Next, we create a cobra.Command struct, rootCmd, that represents the root command. A separate cobra.Command instance is used to represent every command and subcommand that the CLI makes available. The Use field spells out the command’s one-line usage message, and Long is the long message displayed in the help output. Run is a function of type func(cmd *Command, args []string) that implements the actual work to be done when the command is executed.

Typically, commands are constructed in an init function. In our case, we add three flags—string, number, and boolean—to our root command along with their short flags, default values, and descriptions.

Every command gets an automatically generated help output, which we can retrieve using the --help flag:

$ go run . --help
A simple flags experimentation command, built with Cobra.

Usage:
  flags [flags]

Flags:
  -b, --boolean         a boolean
  -h, --help            help for flags
  -n, --number int      an integer (default 42)
  -s, --string string   a string (default "foo")

This makes sense, and it’s also pretty! But does it run as we expect? Executing the command (using the standard flags style) gives us the following output:

$ go run . --boolean --number 27 --string "A string." Other things.
string: A string.
integer: 27
boolean: true
args: [Other things.]

The outputs are identical; we have achieved parity. But this is just a single command. One of the benefits of Cobra is that it also allows subcommands.

What does this mean? Take, for example, the git command. In this example, git would be the root command. By itself, it doesn’t do much, but it has a series of subcommands—git clone, git init, git blame, etc.—that are related but are each distinct operations of their own.

Cobra provides this capability by treating commands as a tree structure. Each command and subcommand (including the root command) is represented by a distinct cobra.Command value. These are attached to one another using the (c *Command) AddCommand(cmds ...*Command) method. We demonstrate this in the following example by turning the flags command into a subcommand of a new root, which we call cng (for Cloud Native Go).

To do this, we first have to rename the original rootCmd to flagsCmd. We add a Short attribute to define its short description in help output, but it’s otherwise identical. But now we need a new root command, so we create that as well:

var flagsCmd = &cobra.Command{
    Use:   "flags",
    Short: "Experiment with flags",
    Long:  "A simple flags experimentation command, built with Cobra.",
    Run:   flagsFunc,
}

var rootCmd = &cobra.Command{
    Use:  "cng",
    Long: "A super simple command.",
}

Now we have two commands: the root command, cng, and a single subcommand, flags. The next step is to add the flags subcommand to the root command so that it’s immediately beneath the root in the command tree. This is typically done in an init function, which we demonstrate here:

func init() {
    flagsCmd.Flags().StringVarP(&strp, "string", "s", "foo", "a string")
    flagsCmd.Flags().IntVarP(&intp, "number", "n", 42, "an integer")
    flagsCmd.Flags().BoolVarP(&boolp, "boolean", "b", false, "a boolean")

    rootCmd.AddCommand(flagsCmd)
}

In the preceding init function, we keep the three flag definitions, except that we now call the Flags methods on flagsCmd.

What’s new, however, is the AddCommand method, which allows us to add flagsCmd to rootCmd as a subcommand. We can repeat AddCommand as many times as we like with multiple Command values, adding as many subcommands (or sub-subcommands, or sub-sub-subcommands) as we want.
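To make that concrete, here’s a minimal sketch of what a hypothetical sub-subcommand might look like; the versionCmd value and its output are purely illustrative and not part of the running example:

// A hypothetical "version" subcommand of the flags subcommand.
var versionCmd = &cobra.Command{
    Use:   "version",
    Short: "Print the version of the flags experiment",
    Run: func(cmd *cobra.Command, args []string) {
        fmt.Println("flags v0.1.0") // Illustrative output only
    },
}

func init() {
    // Attach it beneath flagsCmd: "cng flags version" now resolves
    // root -> flags -> version in the command tree.
    flagsCmd.AddCommand(versionCmd)
}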

Now that we’ve told Cobra about the new flags subcommand, its information is reflected in the generated help output:

$ go run . --help
A super simple command.

Usage:
  cng [command]

Available Commands:
  flags       Experiment with flags
  help        Help about any command

Flags:
  -h, --help   help for cng

Use "cng [command] --help" for more information about a command.

Now, according to this help output, we have a top-level root command named cng that has two available subcommands: our flags command and an automatically generated help subcommand that lets a user view any subcommand’s help. For example, help flags provides us with information and instructions for the flags subcommand:

$ go run . help flags
A simple flags experimentation command, built with Cobra.

Usage:
  cng flags [flags]

Flags:
  -b, --boolean         a boolean
  -h, --help            help for flags
  -n, --number int      an integer (default 42)
  -s, --string string   a string (default "foo")

Kind of neat, huh?

This is a tiny, tiny sample of what the Cobra library is capable of, but it’s more than sufficient to let us build a robust set of configuration options. If you’re interested in learning more about Cobra and how you can use it to build powerful CLIs, take a look at its GitHub repository and its listing on GoDoc.

Configuring with Files

Last but not least, we have what is probably the most commonly used configuration option: the configuration file.

Configuration files have a lot of advantages over environment variables, particularly for more complex applications. They tend to be more explicit and comprehensible by allowing behaviors to be logically grouped and annotated. Often, understanding how to use a configuration file is just a matter of looking at its structure or an example of its use.

Configuration files are particularly useful when managing a large number of options, which is an advantage they have over both environment variables and command-line flags. Command-line flags in particular can sometimes result in some pretty long statements that can be tedious and difficult to construct.

Files aren’t the perfect solution though. Depending on your environment, distributing them at scale in a way that maintains parity across a cluster can be a challenge. This situation can be improved by having a single “source of truth,” such as a distributed key-value store like etcd or HashiCorp Consul, or a central source code repository from which the deployment automatically draws its configuration, but this adds complexity and a dependency on another resource.

Fortunately, most orchestration platforms provide specialized configuration resources—such as Kubernetes’s ConfigMap object—that largely alleviate the distribution problem.

There are probably dozens of file formats that have been used for configuration over the years, but in recent years, two in particular have stood out: JSON and YAML. In the next few sections, we’ll go into each of these—and how to use them in Go—in a little more detail.

Our configuration data structure

Before we proceed with a discussion of file formats and how to decode them, we should discuss the two general ways in which configurations can be unmarshalled:

Configuration keys and values

Can be mapped to corresponding fields in a specific struct type. For example, a configuration that contains the attribute host: localhost could be unmarshalled into a struct type that has a Host string field.

Configuration data

Can be decoded and unmarshalled into one or more, possibly nested, maps of type map[string]any. This can be convenient when you’re working with arbitrary configurations, but it’s awkward to work with.

If you know what your configuration is likely to look like in advance (which you generally do), then the first approach to decoding configurations, mapping them to a data structure created for that purpose, is by far the easiest. Although it’s possible to decode and do useful work with arbitrary configuration schemas, doing so can be tedious and isn’t advisable for most configuration purposes.

So, for the remainder of this section, our example configurations will correspond to the following Config struct:

type Config struct {
    Host string
    Port uint16
    Tags map[string]string
}
Warning

For a struct field to be marshallable or unmarshallable by any encoding package, it must begin with a capital letter to indicate that it’s exported by its package.

For each of our examples we’ll start with the Config struct, occasionally enhancing it with format-specific tags or other decorations.
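To illustrate the warning above, here’s a quick sketch (using the encoding/json package, which we’ll meet in a moment): marshalling a struct with an unexported field silently drops that field:

type pair struct {
    Exported   string // Begins with a capital letter: encoded
    unexported string // Lowercase: silently skipped by encoding packages
}

bytes, _ := json.Marshal(pair{Exported: "yes", unexported: "no"})
fmt.Println(string(bytes)) // {"Exported":"yes"}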

Working with JSON

JSON (JavaScript Object Notation) was invented in the early 2000s, growing out of the need for a modern data interchange format to replace XML and other formats in use at the time. It’s based on a subset of the JavaScript scripting language, making it both relatively human-readable and efficient for machines to generate and parse while also offering the semantics for lists and mappings that were absent from XML.

As common and successful as JSON is, it does have some drawbacks. It’s generally considered less user-friendly than YAML. Its syntax is especially unforgiving and can be easily broken by a misplaced (or missing) comma, and it doesn’t even support comments.

However, of the formats presented in this chapter, it’s the only one that’s supported in Go’s standard library.

What follows is a brief introduction into encoding and decoding data to and from JSON. For a somewhat more thorough review, take a look at Andrew Gerrand’s “JSON and Go” on The Go Blog.

Encoding JSON

The first step to understanding how to decode JSON (or any configuration format) is understanding how to encode it. This may seem strange, particularly in a section about reading configuration files, but encoding is an essential part of the bigger picture, and it provides a handy means of generating, testing, and debugging your configuration files.9

JSON encoding and decoding is supported by Go’s standard encoding/json package, which provides a variety of helper functions useful for encoding, decoding, formatting, validating, and otherwise working with JSON.

Among these is the json.Marshal function, which accepts an any value v and returns a []byte slice containing a JSON-encoded representation of v:

func Marshal(v any) ([]byte, error) {}

In other words, a value goes in and JSON comes out.

This function really is as straightforward to use as it looks. For example, if we have an instance of Config, we can pass it to json.Marshal to get its JSON encoding:

c := Config{
    Host: "localhost",
    Port: 1313,
    Tags: map[string]string{"env": "dev"},
}

bytes, err := json.Marshal(c)

fmt.Println(string(bytes))

If everything works as expected, err will be nil, and bytes will be a []byte value containing the JSON. The fmt.Println output will look something like the following:

{"Host":"localhost","Port":1313,"Tags":{"env":"dev"}}
Tip

The json.Marshal function traverses the value of v recursively, so any internal structs will be encoded as nested JSON as well.
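For instance, a hypothetical wrapper struct that includes our Config as a field is encoded with the Config nested inside it:

type Service struct {
    Name   string
    Config Config
}

s := Service{Name: "kvs", Config: c}    // c is the Config value from above

bytes, _ := json.Marshal(s)
fmt.Println(string(bytes))
// {"Name":"kvs","Config":{"Host":"localhost","Port":1313,"Tags":{"env":"dev"}}}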

That was pretty painless, but if we’re generating a configuration file, it sure would be nice if the text was formatted for human consumption. Fortunately, encoding/json also provides the following json.MarshalIndent function, which returns “pretty-printed” JSON:

func MarshalIndent(v any, prefix, indent string) ([]byte, error) {}

As you can see, json.MarshalIndent works a lot like json.Marshal, except that it also takes prefix and indent strings, as demonstrated here:

bytes, err := json.MarshalIndent(c, "", "   ")
fmt.Println(string(bytes))

The preceding snippet prints exactly what we’d hope to see:

{
   "Host": "localhost",
   "Port": 1313,
   "Tags": {
      "env": "dev"
   }
}

The result is prettily printed JSON, formatted for humans like you and me10 to read. This is a very useful method for bootstrapping configuration files!

Decoding JSON

Now that we know how to encode a data structure into JSON, let’s take a look at how to decode JSON into an existing data structure.

To do that, we use the conveniently named json.Unmarshal function:

func Unmarshal(data []byte, v any) error {}

The json.Unmarshal function parses the JSON-encoded text contained in the data array and stores the result in the value pointed to by v. Importantly, if v is nil or isn’t a pointer, json.Unmarshal returns an error.

But what type should v be, exactly? Ideally, it would be a pointer to a data structure whose fields exactly correspond to the JSON structure. While it’s possible to unmarshal arbitrary JSON into an unstructured map, as we’ll discuss in “Decoding Arbitrary JSON”, this should be done only if you really have no other choice.

As we’ll see, though, if you have a data type that reflects your JSON’s structure, then json.Unmarshal is able to update it directly. To do this, we first have to create an instance where our decoded data will be stored:

c := Config{}

Now that we have our storage value, we can call json.Unmarshal, to which we pass a []byte that contains our JSON data and a pointer to c:

bytes := []byte(`{"Host":"127.0.0.1","Port":1234,"Tags":{"foo":"bar"}}`)
err := json.Unmarshal(bytes, &c)

If bytes contains valid JSON, then err will be nil and the data from bytes will be stored in the struct c. Printing the value of c should now provide output like the following:

{127.0.0.1 1234 map[foo:bar]}

Neat! However, what happens when the structure of the JSON doesn’t exactly match the Go type? Let’s find out:

c := Config{}
bytes := []byte(`{"Host":"127.0.0.1", "Food":"Pizza"}`)
err := json.Unmarshal(bytes, &c)

Interestingly, this snippet doesn’t produce an error as you might expect. Instead, c now contains the following values:

{127.0.0.1 0 map[]}

It would seem that the value of Host was set, but Food, which has no corresponding value in the Config struct, was ignored. As it turns out, json.Unmarshal will decode only the fields that it can find in the target type. This behavior can actually be quite useful if you want to cherry-pick a few specific fields out of a big JSON blob.
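For instance, if all we care about is the host, a minimal sketch (with a hypothetical hostOnly type) might decode into a struct containing only that field and let everything else fall away:

type hostOnly struct {
    Host string
}

blob := []byte(`{"Host":"10.0.0.5","Port":1234,"Tags":{"env":"dev"}}`)

var h hostOnly
err := json.Unmarshal(blob, &h)     // Only Host is decoded; the rest is dropped

fmt.Println(h.Host, err)            // 10.0.0.5 <nil>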

Decoding Arbitrary JSON

As we briefly mentioned in “Our configuration data structure”, it’s possible to decode and do useful work with arbitrary JSON, but doing so can be pretty tedious and should be done only if you don’t know the structure of your JSON beforehand.

Take, for example, this entirely arbitrary JSON data:

bytes := []byte(`{"Foo":"Bar", "Number":1313, "Tags":{"A":"B"}}`)

Without knowing this data’s structure ahead of time, we can actually use json.Unmarshal to decode it into an any value:

var f any
err := json.Unmarshal(bytes, &f)

Outputting the new value of f with fmt.Println yields an interesting result:

map[Foo:Bar Number:1313 Tags:map[A:B]]

It would seem that the underlying value of f is now a map whose keys are strings and values are stored as empty interfaces. It’s functionally identical to a value defined as:

f := map[string]any{
    "Foo":    "Bar",
    "Number": 1313,
    "Tags":   map[string]any{"A": "B"},
}

Even though its underlying value is a map[string]any, f still has a type of any. We’ll need to use a type assertion to access its values:

m := f.(map[string]any)

fmt.Printf("<%T> %v\n", m, m)
fmt.Printf("<%T> %v\n", m["Foo"], m["Foo"])
fmt.Printf("<%T> %v\n", m["Number"], m["Number"])
fmt.Printf("<%T> %v\n", m["Tags"], m["Tags"])

Executing the previous snippet produces the following output:

<map[string]interface {}> map[Foo:Bar Number:1313 Tags:map[A:B]]
<string> Bar
<float64> 1313
<map[string]interface {}> map[A:B]
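If you need to walk arbitrary JSON more generally, a type switch saves you from chaining assertions by hand. The following sketch (a minimal illustration, not code from the service) prints each top-level value along with a description of its dynamic type:

for k, v := range m {
    switch value := v.(type) {
    case string:
        fmt.Println(k, "is a string:", value)
    case float64:
        fmt.Println(k, "is a number:", value)   // JSON numbers decode as float64
    case map[string]any:
        fmt.Println(k, "is an object with", len(value), "keys")
    default:
        fmt.Println(k, "is something else entirely")
    }
}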

Field formatting with struct field tags

Under the covers, marshalling works by using reflection to examine a value and generate appropriate JSON for its type. For structs, the struct’s field names are directly used as the default JSON keys, and the struct’s field values become the JSON values. Unmarshalling works essentially the same way, except in reverse.

What happens when you marshal a zero-value struct? Well, as it turns out, when you marshal a Config{} value, for example, this is the JSON you get:

{"Host":"","Port":0,"Tags":null}

This isn’t all that pretty. Or efficient. Is it really necessary to even output the empty values at all?

Similarly, struct fields have to be exported—and therefore capitalized—to be written or read. Does that mean that we’re stuck with uppercase field names?

Fortunately, the answer to both questions is “no.”

Go supports the use of struct field tags—short strings that appear in a struct after the type declaration of a field—that allow metadata to be added to specific struct fields. Field tags are most commonly used by encoding packages to modify encoding and decoding behavior at the field level.

Go struct field tags are special strings containing one or more key-value pairs enclosed in backticks, after a field’s type declaration:

type User struct {
    Name string `example:"name"`
}

In this example, the struct’s Name field is tagged with example:"name". These tags can be accessed using runtime reflection via the reflect package, but their most common use case is to provide encoding and decoding directives.

The encoding/json package supports several such tags. The general format uses the json key in the struct field’s tag and a value that specifies the name of the field, possibly followed by a comma-separated list of options. The name may be empty in order to specify options without overriding the default field name.

The available options supported by encoding/json are shown here:

Customizing JSON keys

By default, a struct field maps to a JSON key with exactly the same name (when decoding, json.Unmarshal prefers an exact match but will fall back to a case-insensitive one). A tag overrides this default name by setting the first (or only) value in the tag’s options list.

Example: CustomKey string `json:"custom_key"`

Omitting empty values

By default, a field will always appear in the JSON, even if it’s empty. Using the omitempty option will cause fields to be skipped if they contain a zero-value. Note the leading comma in front of omitempty!

Example: OmitEmpty string `json:",omitempty"`

Ignoring a field

Fields using the - (dash) option will always be completely ignored during encoding and decoding.

Example: IgnoredName string `json:"-"`

A struct that uses all of the previous tags might look like the following:

type Tagged struct {
    // CustomKey will appear in JSON as the key "custom_key".
    CustomKey   string `json:"custom_key"`

    // OmitEmpty will appear in JSON as "OmitEmpty" (the default),
    // but will only be written if it contains a nonzero value.
    OmitEmpty   string `json:",omitempty"`

    // IgnoredName will always be ignored.
    IgnoredName string `json:"-"`

    // TwoThings will appear in JSON as the key "two_things",
    // but only if it isn't empty.
    TwoThings   string `json:"two_things,omitempty"`
}
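As a quick sanity check, marshalling a mostly empty Tagged value should produce just the renamed custom_key field: the empty omitempty fields are skipped, and the ignored field never appears. A minimal sketch:

t := Tagged{IgnoredName: "secret"}      // Every other field is left at its zero value

bytes, _ := json.Marshal(t)
fmt.Println(string(bytes))              // {"custom_key":""}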

For more information on how json.Marshal encodes data, take a look at the function’s documentation.

Working with YAML

YAML (YAML Ain’t Markup Language)11 is an extensible file format that’s popular with projects like Kubernetes that depend on complex, hierarchical configurations. It’s highly expressive, though its syntax can also be a bit brittle, and configurations that use it can start to suffer from readability issues as they scale up.

Unlike JSON, which was originally created as a data interchange format, YAML is largely a configuration language at heart. Interestingly, however, YAML 1.2 is a superset of JSON, and the two formats are largely interconvertible. YAML does have some advantages over JSON though: it can self-reference, it allows embedded block literals, and it supports comments and complex data types.

Unlike JSON, YAML isn’t supported in Go’s core libraries. While there are a few YAML packages to choose from, the standard choice is Go-YAML. Version 1 of Go-YAML started in 2014 as an internal project within Canonical to port the well-known libyaml C library to Go. As a project, it’s exceptionally mature and well maintained. Its syntax is also conveniently very similar to encoding/json.

Encoding YAML

Using Go-YAML to encode data is a lot like encoding JSON. Exactly like it. In fact, the signatures for both packages’ Marshal functions are identical. Like its encoding/json equivalent, Go-YAML’s yaml.Marshal function also accepts an any value and returns its YAML encoding as a []byte value:

func Marshal(v any) ([]byte, error) {}

Just as we did in “Encoding JSON”, we demonstrate its use by creating an instance of Config, which we pass to yaml.Marshal to get its YAML encoding:

c := Config{
    Host: "localhost",
    Port: 1313,
    Tags: map[string]string{"env": "dev"},
}

bytes, err := yaml.Marshal(c)

Once again, if everything works as expected, err will be nil and bytes will be a []byte value containing the YAML. Printing the string value of bytes will provide something like the following:

host: localhost
port: 1313
tags:
  env: dev

Also, just like the version provided by encoding/json, Go-YAML’s Marshal function traverses the value v recursively. Any composite types that it finds—arrays, slices, maps, and structs—will be encoded appropriately and will be present in the output as nested YAML elements.

Decoding YAML

In keeping with the theme we’ve established with the similarity of the Marshal functions from encoding/json and Go-YAML, the same consistency is evident between the two packages’ Unmarshal functions:

func Unmarshal(data []byte, v any) error {}

Again, the yaml.Unmarshal function parses the YAML-encoded data in the data array and stores the result in the value pointed to by v. If v is nil or not a pointer, yaml.Unmarshal returns an error. As shown here, the similarities are clear:

// Caution: Indent this YAML with spaces, not tabs.
bytes := []byte(`
host: 127.0.0.1
port: 1234
tags:
    foo: bar
`)

c := Config{}
err := yaml.Unmarshal(bytes, &c)

Just as we did in “Decoding JSON”, we pass yaml.Unmarshal a pointer to a Config instance, whose fields correspond to the fields found in the YAML. Printing the value of c should (once again) provide output like the following:

{127.0.0.1 1234 map[foo:bar]}

There are other behavioral similarities between encoding/json and Go-YAML:

  • Both will ignore attributes in a source document that can’t be mapped to a field in the value passed to Unmarshal. Again, this can be useful if you care about only a subset of the document, but it can be a “gotcha,” too: if you forget to export the struct field, Unmarshal will always silently ignore it, and it’ll never get set.

  • Both are capable of unmarshalling arbitrary data by passing a pointer to an any value to Unmarshal. However, while json.Unmarshal will provide a map[string]any, yaml.Unmarshal will return a map[any]any. A minor difference but another potential gotcha!

Struct field tags for YAML

In addition to the “standard” struct field tags—custom keys, omitempty, and - (dash)—detailed in “Field formatting with struct field tags”, Go-YAML supports two additional tags particular to YAML marshal formatting:

Flow style

Fields using the flow option will be marshalled using the flow style, which can be useful for structs, sequences, and maps. As with omitempty, note the leading comma before the option.

Example: Flow map[string]string `yaml:",flow"`

Inlining structs and maps

The inline option causes all of a struct’s fields or a map’s keys to be processed as if they were part of the outer struct. For maps, keys must not conflict with the keys of other struct fields.

Example: Inline map[string]string `yaml:",inline"`

A struct that uses both of these options might look like the following:

type TaggedMore struct {
    // Flow will be marshalled using a "flow" style
    // (useful for structs, sequences and maps).
    Flow map[string]string `yaml:",flow"`

    // Inlines a struct or a map, causing all of its fields
    // or keys to be processed as if they were part of the outer
    // struct. For maps, keys must not conflict with the yaml
    // keys of other struct fields.
    Inline map[string]string `yaml:",inline"`
}

As you can see, the tagging syntax is also consistent, except that instead of using the json prefix, Go-YAML tags use the yaml prefix.
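As a rough sketch of what those options produce (the exact output may vary a little between Go-YAML versions), marshalling a small TaggedMore value should render the flow field in flow style and hoist the inline map’s keys up to the top level:

tm := TaggedMore{
    Flow:   map[string]string{"a": "b"},
    Inline: map[string]string{"env": "dev"},
}

bytes, _ := yaml.Marshal(tm)
fmt.Println(string(bytes))
// flow: {a: b}
// env: dev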

Watching for configuration file changes

When working with configuration files, you’ll inevitably be confronted with a situation in which changes have to be made to the configuration of a running program. If it doesn’t explicitly watch for and reload changes, then it’ll generally have to be restarted to reread its configuration, which can be annoying at best and introduce downtime at worst.

At some point, you’re going to have to decide how you want your program to respond to such changes.

The first (and least complex) option is to do nothing and just expect the program to have to restart when its configuration changes. This is actually a fairly common choice, since it ensures that no trace of the former configuration exists. It also allows a program to “fail fast” when an error is introduced into the configuration file: the program just has to spit out an angry error message and refuse to start.

However, you might prefer to add logic to your program that detects changes in your configuration file (or files) and reloads them appropriately.

Making your configuration reloadable

If you’d like your internal configuration representations to reload whenever the underlying file changes, you’ll have to plan a little ahead.

First, you’ll want to have a single global instance of your configuration struct. For now, we’ll use a Config instance of the kind we introduced in “Our configuration data structure”. In a slightly larger project, you might even put this in a config package:

var config Config

Often you’ll see code in which an explicit config parameter is passed to just about every method and function. I’ve seen this quite a lot, often enough to know that this particular antipattern just makes life harder. And because the configuration now lives in N places instead of one, it also tends to make configuration reloading more complicated.

Once we have our config value, we’ll want to add the logic that reads the configuration file and loads it into the struct. Something like the following loadConfiguration function will do just fine:

func loadConfiguration(filepath string) (Config, error) {
    dat, err := os.ReadFile(filepath)   // Ingest file as []byte
    if err != nil {
        return Config{}, err
    }

    config := Config{}

    err = yaml.Unmarshal(dat, &config)      // Do the unmarshal
    if err != nil {
        return Config{}, err
    }

    return config, nil
}

Our loadConfiguration function works almost the same way that we discussed in “Working with YAML”, except that it uses the os.ReadFile function from the os standard library to retrieve the bytes that it passes to yaml.Unmarshal. The choice to use YAML here was entirely arbitrary.12 The syntax for a JSON configuration would be practically identical.

Now that we have logic to load our configuration file into a canonical struct, we need something to call it whenever it gets a notification that the file has changed. For that we have startListening, which monitors an updates channel:

func startListening(updates <-chan string, errors <-chan error) {
    for {
        select {
        case filepath := <-updates:
            c, err := loadConfiguration(filepath)
            if err != nil {
                log.Println("error loading config:", err)
                continue
            }
            config = c

        case err := <-errors:
            log.Println("error watching config:", err)
        }
    }
}

As you can see, startListening accepts two channels: updates, which emits the name of a file (presumably the configuration file) when that file changes, and an errors channel.

It watches both channels in a select inside of an infinite loop so that if a configuration file changes, the updates channel sends its name, which is then passed to loadConfiguration. If loadConfiguration returns without an error, the Config value it returns replaces the current one.

Stepping back another level, we have an init function that retrieves the channels from a watchConfig function and passes them to startListening, which it runs as a goroutine:

func init() {
    updates, errors, err := watchConfig("config.yaml")
    if err != nil {
        panic(err)
    }

    go startListening(updates, errors)
}

But what’s this watchConfig function? Well, we don’t quite know the details yet. We’ll figure that out in the next couple of sections. We do know that it implements some configuration watching logic and that it has a function signature that looks like the following:

func watchConfig(filepath string) (<-chan string, <-chan error, error) {}

The watchConfig function, whatever its implementation, returns two channels—a string channel that sends the path of the updated configuration file and an error channel that notifies about invalid configurations—and an error value that reports if there’s a fatal error on startup.

The exact implementation of watchConfig can go a couple of different ways, each with its pros and cons. Let’s take a look at the two most common.

Polling for configuration changes

Polling, where you check for changes in your configuration file on some regular cadence, is a common way of watching a configuration file. A standard implementation uses a time.Ticker to recalculate a hash of your configuration file every few seconds and reload if the hash changes.

Go makes a number of common hash algorithms available in its crypto package, each of which lives in its own subpackage of crypto and provides a constructor that returns a value satisfying the hash.Hash interface, which includes io.Writer.

For example, Go’s standard implementation of SHA256 can be found in crypto/sha256. To use it, you call its sha256.New function to get a new hash.Hash value, into which you then write the data you want to calculate the hash of, just as you would with any io.Writer. When that’s complete, you use its Sum method to retrieve the resulting hash sum:

func calculateFileHash(filepath string) (string, error) {
    file, err := os.Open(filepath)  // Open the file for reading
    if err != nil {
        return "", err
    }
    defer file.Close()              // Be sure to close your file!

    hash := sha256.New()            // Use the Hash in crypto/sha256

    if _, err := io.Copy(hash, file); err != nil {
        return "", err
    }

    sum := fmt.Sprintf("%x", hash.Sum(nil))  // Get encoded hash sum

    return sum, nil
}

Generating a hash for a configuration has three distinct parts. First, we get a byte source in the form of an io.Reader; in this example, that’s an *os.File. Next, we copy those bytes from the io.Reader to our hash.Hash instance, which we do with a call to io.Copy. Finally, we use the Sum method to retrieve the hash sum from hash.

Now that we have our calculateFileHash function, creating our watchConfig implementation is just a matter of using a time.Ticker to concurrently check it on some cadence and emit any positive results (or errors) to the appropriate channel:

func watchConfig(filepath string) (<-chan string, <-chan error, error) {
    errs := make(chan error)
    changes := make(chan string)
    hash := ""

    go func() {
        ticker := time.NewTicker(time.Second)

        for range ticker.C {
            newhash, err := calculateFileHash(filepath)
            if err != nil {
                errs <- err
                continue
            }

            if hash != newhash {
                hash = newhash
                changes <- filepath
            }
        }
    }()

    return changes, errs, nil
}

The polling approach has some benefits. It’s not especially complex, which is always a big plus, and it works for any operating system. Perhaps most interestingly, because hashing cares only about the configuration’s contents, it can even be generalized to detect changes in places like remote key-value stores that aren’t technically files.
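Because the hash cares only about bytes, generalizing the polling approach is mostly a matter of swapping the source. A minimal sketch follows; the fetchRemoteValue function is hypothetical, standing in for whatever key-value store client you happen to use:

// calculateHash hashes an in-memory value rather than a file.
func calculateHash(data []byte) string {
    sum := sha256.Sum256(data)          // One-shot hash; no io.Copy required
    return fmt.Sprintf("%x", sum)
}

// Hypothetical usage against a remote key-value store:
//
//   data, err := fetchRemoteValue("service/config")
//   if err == nil && calculateHash(data) != lastHash {
//       // reload the configuration
//   }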

Unfortunately, the polling approach can be a little computationally wasteful, especially for very large or many files. By its nature, it also incurs a brief delay between the time the file is changed and the detection of that change. If you’re definitely working with local files, it would probably be more efficient to watch OS-level filesystem notifications, which we discuss in the next section.

Watching OS filesystem notifications

Polling for changes works well enough, but this method has some drawbacks. Depending on your use case, you may find it more efficient to instead monitor OS-level filesystem notifications.

Actually doing so, however, is complicated by the fact that each operating system has a different notification mechanism. Fortunately, the fsnotify package provides a workable abstraction that supports most operating systems.

To use this package to watch one or more files, you use the fsnotify.NewWatcher function to get a new fsnotify.Watcher instance, and use the Add method to register more files to watch. The Watcher provides two channels, Events and Errors, which send notifications of file events and errors, respectively.

For example, if we wanted to watch our config file, we could do something like the following:

func watchConfigNotify(filepath string) (<-chan string, <-chan error, error) {
    changes := make(chan string)

    watcher, err := fsnotify.NewWatcher()         // Get an fsnotify.Watcher
    if err != nil {
        return nil, nil, err
    }

    err = watcher.Add(filepath)                    // Tell watcher to watch
    if err != nil {                                // our config file
        return nil, nil, err
    }

    go func() {
        changes <- filepath                        // First is ALWAYS a change

        for event := range watcher.Events {        // Range over watcher events
            if event.Op&fsnotify.Write == fsnotify.Write {
                changes <- event.Name
            }
        }
    }()

    return changes, watcher.Errors, nil
}

Note the expression event.Op&fsnotify.Write == fsnotify.Write, which uses a bitwise AND (&) to filter for “write” events. We do this because the fsnotify.Event can potentially include multiple operations, each of which is represented as one bit in an unsigned integer. For example, a simultaneous fsnotify.Write (2, binary 0b00010) and fsnotify.Chmod (16, binary 0b10000) would result in an event.Op value of 18 (binary 0b10010). Because 0b10010 & 0b00010 = 0b00010, the bitwise AND allows us to guarantee that an operation includes an fsnotify.Write.
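If bitwise flag checks are new to you, this tiny standalone snippet (using made-up constants with the same values as above) demonstrates the arithmetic:

const (
    write = 0b00010     // 2, the same value as fsnotify.Write
    chmod = 0b10000     // 16, the same value as fsnotify.Chmod
)

op := write | chmod                     // 0b10010 == 18: both bits are set

fmt.Println(op&write == write)          // true: the write bit is present
fmt.Println(op&0b00100 == 0b00100)      // false: that bit isn't set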

Viper: The Swiss Army Knife of Configuration Packages

Viper (spf13/viper) bills itself as a complete configuration solution for Go applications, and justifiably so. Among other things, it allows application configuration by a variety of mechanisms and formats, including, in order of precedence, the following:

Explicitly set values

This takes precedence over all other methods and can be useful during testing.

Command-line flags

Viper is designed to be a companion to Cobra, which we introduced in “The Cobra command-line parser”.

Environment variables

Viper has full support for environment variables. Importantly, Viper treats environment variables as case-sensitive!

Configuration files, in multiple file formats

Out of the box, Viper supports JSON and YAML with the packages we introduced previously, as well as TOML, HCL, INI, envfile, and Java Properties files. It can also write configuration files to help bootstrap your configurations and even optionally supports live watching and rereading of configuration files.

Remote key-value stores

Viper can access key-value stores like etcd or Consul and can watch them for changes.

Viper also supports features like default values and typed variables, which the standard packages typically don’t provide.

Keep in mind though, that while Viper does a lot, it’s also a pretty big hammer that brings in a lot of dependencies. If you’re trying to build a slim, streamlined application, Viper may be more than you need.

Explicitly setting values in Viper

Viper allows you to use the viper.Set function to explicitly set values from, for example, command-line flags or the application logic. This can be pretty handy during testing:

viper.Set("Verbose", true)
viper.Set("LogFile", LogFile)

Explicitly set values have the highest priority and override values that would be set by other mechanisms.

Working with command-line flags in Viper

Viper was designed to be a companion to the Cobra library, which we briefly discussed in the context of constructing CLIs in “The Cobra command-line parser”. This close integration with Cobra makes it straightforward to bind command-line flags to configuration keys.

Viper provides the viper.BindPFlag function, which allows individual command-line flags to be bound to a named key, and viper.BindPFlags, which binds a full flag set using each flag’s long name as the key.

Because the actual value of the configuration value is set when the binding is accessed, rather than when it’s called, you can call viper.BindPFlag in an init function as we do here:

var rootCmd = &cobra.Command{ /* omitted for brevity */ }

func init() {
    rootCmd.Flags().IntP("number", "n", 42, "an integer")
    viper.BindPFlag("number", rootCmd.Flags().Lookup("number"))
}

In the preceding snippet, we declare a &cobra.Command and define an integer flag called “number.” Note that we use the IntP method instead of IntVarP, since there’s no need to store the value of the flag in an external value when Cobra is used in this way. Then, using the viper.BindPFlag function, we bind the “number” flag to a configuration key of the same name.

After it’s been bound (and the command-line flags parsed), the value of the bound key can be retrieved from Viper by using the viper.GetInt function:

n := viper.GetInt("number")

Working with environment variables in Viper

Viper provides several functions for working with environment variables as a configuration source. The first of these is viper.BindEnv, which is used to bind a configuration key to an environment variable:

viper.BindEnv("id")                     // Bind "id" to var "ID"
viper.BindEnv("port", "SERVICE_PORT")   // Bind "port" to var "SERVICE_PORT"

id := viper.GetInt("id")
port := viper.GetInt("port")

If only a key is provided, viper.BindEnv will bind to the environment variable matching the key, which Viper automatically assumes to be in all caps. More arguments can be provided to specify one or more specific environment variables to bind to; those names are used exactly as given.

Viper provides several additional helper functions for working with environment variables. See the Viper GoDoc for more details on these.
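For example, two of the most commonly used helpers are viper.SetEnvPrefix, which namespaces the variables Viper looks for, and viper.AutomaticEnv, which makes Viper check the environment whenever a key is requested. A minimal sketch of the two working together:

viper.SetEnvPrefix("kvs")       // Look for environment variables prefixed with KVS_
viper.AutomaticEnv()            // Check the environment on every Get

os.Setenv("KVS_PORT", "1313")   // Explicitly set the envvar

port := viper.GetInt("port")    // Resolves to KVS_PORT
fmt.Println(port)               // 1313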

Working with configuration files in Viper

A discussion of local configuration files may seem unexpected in a book on cloud native, but files are still a commonly used structure in any context. After all, shared filesystems—be they Kubernetes ConfigMaps or NFS mounts—are quite common, and even cloud native services can be deployed by a configuration management system that installs a read-only local copy of a file for all service replicas to read. A configuration file could even be baked or mounted into a container image in a way that looks—as far as a containerized service is concerned—exactly like any other local file.

Reading configuration files

To read configurations from files, Viper just needs to know the name of the file and where to look for it. It also needs to know the file’s type, if that can’t be inferred from the file’s extension. The viper.ReadInConfig function instructs Viper to find and read the configuration file, potentially returning an error value if something goes wrong. All of those steps are demonstrated here:

viper.SetConfigName("config")

// Optional if the config has a file extension
viper.SetConfigType("yaml")

viper.AddConfigPath("/etc/service/")
viper.AddConfigPath("$HOME/.service")
viper.AddConfigPath(".")

if err := viper.ReadInConfig(); err != nil {
    panic(fmt.Errorf("fatal error reading config: %w", err))
}

As you can see, Viper can search multiple paths for a configuration file. Unfortunately, at this time, a single Viper instance supports reading only a single configuration file.
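It’s also worth knowing that Viper distinguishes a missing configuration file from a malformed one. A common pattern, taken from Viper’s own documentation, treats viper.ConfigFileNotFoundError as nonfatal and falls back to defaults, flags, and environment variables:

if err := viper.ReadInConfig(); err != nil {
    if _, ok := err.(viper.ConfigFileNotFoundError); ok {
        // No config file found; carry on with defaults, flags, and env vars.
    } else {
        panic(fmt.Errorf("fatal error reading config: %w", err))
    }
}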

Watching and rereading configuration files in Viper

Viper natively allows your application to watch a configuration file for modifications and reload when changes are detected, which means that configurations can change without having to restart the server for them to take effect.

By default, this functionality is turned off. The viper.WatchConfig function can be used to enable it. Additionally, the viper.OnConfigChange function allows you to specify a function that’s called whenever the configuration file is updated:

viper.WatchConfig()
viper.OnConfigChange(func(e fsnotify.Event) {
    fmt.Println("Config file changed:", e.Name)
})
Warning

Make sure that any calls to viper.AddConfigPath are made before calling viper.WatchConfig.

Interestingly, Viper actually uses the fsnotify/fsnotify package behind the scenes, the same mechanism that we detailed in “Watching OS filesystem notifications”.

Using remote key-value stores with Viper

Perhaps the most interesting feature of Viper is its ability to read a configuration string written in any supported format from a path in a remote key-value store, like etcd or HashiCorp Consul. These values take precedence over default values but are overridden by configuration values retrieved from disk, command-line flags, or environment variables.

To enable remote support in Viper, you first have to do a blank import of the viper/remote package:

import _ "github.com/spf13/viper/remote"

A remote key-value configuration source can then be registered using the viper.AddRemoteProvider method, whose signature is as follows:

func AddRemoteProvider(provider, endpoint, path string) error {}

  • The provider parameter can be one of etcd, consul, or firestore.

  • The endpoint is the URL of the remote resource. An odd quirk of Viper is that the etcd provider requires the URL to include a scheme (http://ip:port), while Consul requires no scheme (ip:port).

  • The path is the path in the key-value store to retrieve the configuration from.

To read a JSON-formatted configuration file from an etcd service, for example, you’ll do something like the following:

viper.AddRemoteProvider("etcd", "http://127.0.0.1:4001", "/config/service.json")
viper.SetConfigType("json")
err := viper.ReadRemoteConfig()

Note that even though the configuration path includes a file extension, we also use viper.SetConfigType to explicitly define the configuration type. This is because from Viper’s perspective, the resource is just a stream of bytes, so it can’t automatically infer the format.13 As of the time of writing, the supported formats are json, toml, yaml, yml, properties, props, prop, env, and dotenv.

Multiple providers may be added, in which case they’re searched in the order in which they were added.

This is a very basic introduction to what Viper can do with remote key-value stores. For more details about how to use Viper to read from Consul, watch for configuration changes, or read encrypted configurations, take a look at Viper’s README.

Setting defaults in Viper

Unlike all of the other packages we reviewed in this chapter, Viper optionally allows default values to be defined for a key, by way of the SetDefault function.

Default values can sometimes be useful, but care should be taken with this functionality. As mentioned in “Configuration Good Practice”, useful zero values are generally preferable to implicit defaults, which can lead to surprising behaviors when thoughtlessly applied.

A snippet of Viper showing default values in action might look like the following:

viper.BindEnv("id")             // Will be upper-cased automatically
viper.SetDefault("id", "13")    // Default value is "13"

id1 := viper.GetInt("id")
fmt.Println(id1)                // 13

os.Setenv("ID", "50")           // Explicitly set the envvar

id2 := viper.GetInt("id")
fmt.Println(id2)                // 50

Default values have the lowest priority and will take effect only if a key isn’t explicitly set by another mechanism.

Feature Management with Feature Flags

Feature flagging (or feature toggling)14 is a software development pattern designed to increase the speed and safety with which new features can be developed and delivered by allowing specific functionality to be turned on or off during runtime, without having to deploy new code.

A feature flag is essentially a conditional in your code that enables or disables a feature based on some external criteria, often (but not always) a configuration setting. By setting the configuration to different values, a developer can, for example, choose to enable an incomplete feature for testing and disable it for other users.

Having the ability to release a product with unfinished features provides a number of powerful benefits.

First, feature flags allow many small incremental versions of software to be delivered without the overhead of branching and merging that comes with using feature branches. In other words, feature flags decouple the release of a feature from its deployment. What’s more, feature flags by their very nature require code changes to be integrated as early as possible, which both encourages and facilitates continuous deployment and delivery. As a result, developers get more rapid feedback about their code, which in turn allows smaller, faster, and safer iterations.

Second, not only can feature flags allow features to be more easily tested before they’re deemed ready for release, but they can also do so dynamically. For example, logic can be used to build feedback loops that can be combined with a circuit breaker–like pattern to enable or disable flags automatically under specific conditions.

Finally, flags whose state is computed by runtime logic can even be used to target feature rollouts to specific subsets of users. This technique, called feature gating, can be used as an alternative to proxy rules for canary deployments and staged or geographically based rollouts. When combined with observability techniques, feature gating can even allow you to more easily execute experiments like A/B testing or targeted tracing that instrument particular slices of the user base, or even single customers.

The Evolution of a Feature Flag

In this section, we’ll step through the iterative implementation of a feature flag with a function taken directly from the key-value REST service that we built in Chapter 5. Starting with the baseline function, we’ll progress through several evolutionary stages, from flaglessness all the way to a dynamic feature flag that toggles on for a particular subset of users.

In our scenario, we’ve decided that we want to be able to scale our key-value store, so we want to update the logic so that it’s backed by a fancy distributed data structure instead of a local map.

Generation 0: The Initial Implementation

For our first iteration, we’ll start with the getHandler function from “Implementing the read function”. You may recall that getHandler is an HTTP handler function that satisfies the HandlerFunc interface defined in the net/http package. If you’re a little rusty on what that means, you may want to take a look back at “Building an HTTP Server with net/http”.

The initial handler function, copied almost directly from Chapter 5 (minus some of its error handling, for brevity) is shown here:

func getHandler(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)                     // Retrieve "key" from the request
    key := vars["key"]

    value, err := Get(key)                  // Get value for key
    if err != nil {                         // Unexpected error!
        http.Error(w,
            err.Error(),
            http.StatusInternalServerError)
        return
    }

    w.Write([]byte(value))                  // Write the value to the response
}

As you can see, this function has no feature toggle logic (or indeed anything to toggle to). All it does is retrieve the key from the request variables, use the Get function to retrieve the value associated with that key, and write that value to the response.

In our next implementation, we’ll start testing a new feature: a fancy distributed data structure to replace the local map[string]string that’ll allow the service to scale beyond a single instance.

Generation 1: The Hardcoded Feature Flag

In this implementation, we’ll imagine that we’ve built our new and experimental distributed backend and made it accessible via the NewGet function.

Our first attempt at creating a feature flag introduces a condition that allows us to use a simple Boolean value, useNewStorage, to switch between the two implementations:

// Set to true if you're working on the new storage backend
const useNewStorage bool = false

func getHandler(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)
    key := vars["key"]

    var value string
    var err error

    if useNewStorage {
        value, err = NewGet(key)
    } else {
        value, err = Get(key)
    }

    if err != nil {
        http.Error(w,
            err.Error(),
            http.StatusInternalServerError)
        return
    }

    w.Write([]byte(value))
}

This first iteration shows some progress, but it’s far from where we want to be. Having the flag condition fixed in the code as a hardcoded value makes it possible to toggle between implementations well enough for local testing, but it won’t be easy to exercise both code paths in an automated and continuous manner.

Plus, you’ll have to rebuild and redeploy the service whenever you want to change the algorithm you’re using in a deployed instance, which largely negates the benefits of having a feature flag in the first place.

Tip

Practice good feature flag hygiene! If you haven’t updated a feature flag in a while, consider removing it.

Generation 2: The Configurable Flag

A little time has gone by, and the shortcomings of hardcoded feature flags have become evident. For one thing, it would be really nice if we could use an external mechanism to change the value of the flag so we can test both algorithms in our tests.

In this example, we use Viper to bind and read an environment variable, which we can now use to enable or disable the feature at runtime. The choice of configuration mechanism isn’t really important here. All that matters is that we’re able to externally update the flag without having to rebuild the code:

func getHandler(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)
    key := vars["key"]

    var value string
    var err error

    if FeatureEnabled("use-new-storage", r) {
        value, err = NewGet(key)
    } else {
        value, err = Get(key)
    }

    if err != nil {
        http.Error(w,
            err.Error(),
            http.StatusInternalServerError)
        return
    }

    w.Write([]byte(value))
}

func FeatureEnabled(flag string, r *http.Request) bool {
    return viper.GetBool(flag)
}

In addition to using Viper to read the environment variable that sets the use-new-storage flag, we’ve also introduced a new function: FeatureEnabled. At the moment, all this does is perform viper.GetBool(flag), but more importantly it also concentrates the flag-reading logic in a single place. We’ll see exactly what the benefit of this is in the next iteration.

You might be wondering why FeatureEnabled accepts an *http.Request. Well, it doesn’t use it yet, but it’ll make sense in the next iteration.

Generation 3: Dynamic Feature Flags

The feature is now deployed but turned off behind a feature flag. Now we’d like to be able to test it in production on a specific subset of our user base. It’s clear that we’re not going to be able to implement this kind of flag with a configuration setting. Instead, we’ll have to build dynamic flags that can figure out for themselves whether they should be set. That means associating flags with functions.

Dynamic flags as functions

The first step in building dynamic flag functions is deciding what the signature of the functions will be. While it’s not strictly required, it’s helpful to define this explicitly with a function type like the one shown here:

type Enabled func(flag string, r *http.Request) (bool, error)

The Enabled function type is the prototype for all of our dynamic feature flag functions. Its contract defines a function that accepts the flag name as a string and the *http.Request, and it returns a bool that’s true if the requested flag is enabled.

Implementing a dynamic flag function

Using the contract provided by the Enabled type, we can now implement a function that we can use to determine whether a request is coming from a private network by comparing the request’s remote address against a standard list of IP ranges allocated for private networks:

// The list of CIDR ranges associated with internal networks.
var privateCIDRs []*net.IPNet

// We use an init function to load the privateCIDRs slice.
func init() {
    for _, cidr := range []string{
        "10.0.0.0/8",
        "172.16.0.0/12",
        "192.168.0.0/16",
    } {
        _, block, _ := net.ParseCIDR(cidr)
        privateCIDRs = append(privateCIDRs, block)
    }
}

// fromPrivateIP receives the flag name (which it ignores) and the
// request. If the request's remote IP is in a private range per
// RFC1918, it returns true.
func fromPrivateIP(flag string, r *http.Request) (bool, error) {
    // Grab the host portion of the request's remote address
    remoteIP, _, err := net.SplitHostPort(r.RemoteAddr)
    if err != nil {
        return false, err
    }

    // Turn the remote address string into a net.IP
    ip := net.ParseIP(remoteIP)
    if ip == nil {
        return false, errors.New("couldn't parse ip")
    }

    // Loopbacks are considered "private."
    if ip.IsLoopback() {
        return true, nil
    }

    // Search the CIDRs list for the IP; return true if found.
    for _, block := range privateCIDRs {
        if block.Contains(ip) {
            return true, nil
        }
    }

    return false, nil
}

As you can see, the fromPrivateIP function conforms to Enabled by receiving a string value (the flag name) and an *http.Request (specifically, the instance associated with the initiating request). It returns true if the request originates from a private IP range (as defined by RFC 1918).

To make this determination, the fromPrivateIP function first retrieves the remote address, which contains the network address that sent the request, from the *http.Request. After parsing off the host IP with net.SplitHostPort and using net.ParseIP to parse it into a net.IP value, it compares the originating IP against each of the private CIDR ranges contained in privateCIDRs, returning true if a match is found.

Warning

This function also returns true if the request is traversing a load balancer or reverse proxy. A production-grade implementation will need to be aware of this and would ideally be PROXY protocol-aware.

Of course, this function is just an example. I used it because it’s relatively simple, but a similar technique can be used to enable or disable a flag for a geographic region, a fixed percentage of users, or even a specific customer.

The flag function lookup

Now that we have a dynamic flag function in the form of fromPrivateIP, we have to implement some mechanism of associating flags with it, by name. Perhaps the most straightforward way of doing this is to use a map of flag name strings to Enabled functions:

var enabledFunctions map[string]Enabled

func init() {
    enabledFunctions = map[string]Enabled{}
    enabledFunctions["use-new-storage"] = fromPrivateIP
}

Using a map in this manner to indirectly reference functions provides us with a good deal of flexibility. We can even associate a function with multiple flags, if we like. This could be useful if we want a set of related features to always be active under the same conditions.

You may have noticed that we’re using an init function to fill the enabledFunctions map. But wait, didn’t we already have an init function?

Yes, we did, and that’s okay. The init function is special: you’re allowed to have multiple init functions if you like.

The router function

Finally, we get to tie everything together.

We do this by refactoring the FeatureEnabled function to look up the appropriate dynamic flag function, call it if it finds it, and return the result:

func FeatureEnabled(flag string, r *http.Request) bool {
    // Explicit flags take precedence
    if viper.IsSet(flag) {
        return viper.GetBool(flag)
    }

    // Retrieve the flag function, if any. If none exists,
    // return false
    enabledFunc, exists := enabledFunctions[flag]
    if !exists {
        return false
    }

    // We now have the flag function: call it and return
    // the result
    result, err := enabledFunc(flag, r)
    if err != nil {
        log.Println(err)
        return false
    }

    return result
}

At this point, FeatureEnabled has become a full-fledged router function that can dynamically control which code path is live according to explicit feature-flag settings and the output of flag functions. In this implementation, flags that have been explicitly set take precedence over everything else. This allows automated tests to verify both sides of a flagged feature.
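
For example, a request handler might consult FeatureEnabled to choose between two implementations. The handler and the two serve functions in the following sketch are hypothetical stand-ins, not part of the example service:

func getHandler(w http.ResponseWriter, r *http.Request) {
    if FeatureEnabled("use-new-storage", r) {
        serveFromNewStorage(w, r) // the flagged (new) code path
    } else {
        serveFromOldStorage(w, r) // the existing (default) code path
    }
}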

Our implementation uses a simple in-memory lookup to determine the behavior of particular flags, but this could just as easily be implemented with a database or other data source, or even a sophisticated managed service like LaunchDarkly. Keep in mind, though, that these solutions do introduce a new dependency.

Feature Flags as a Service?

If you’re interested in implementing some of the more sophisticated dynamic feature flags but (probably wisely) would prefer not to roll your own, LaunchDarkly provides an excellent “feature flags as a service” service.

Summary

Manageability isn’t the most glamorous subject in the cloud native world—or any world, really—but I still really enjoyed how much we got our hands dirty with details in this chapter.

We dug into some of the nuts and bolts of various configuration styles, including environment variables, command-line flags, and variously formatted files. We even went over a couple of strategies for detecting configuration changes to trigger a reload. That’s not to mention Viper, which pretty much does all of that and more.

I do feel like there may be some potential to go a lot deeper on some things, and I might have, had it not been for the constraints of time and space. Feature flags and feature management are a pretty big subject, for example, and I definitely would have liked to have been able to explore them a bit more. Some subjects, like deployments and service discovery, we couldn’t even cover at all. I guess we have some things to look forward to in the next edition, right?

As much as I enjoyed this chapter, I’m especially excited about Chapter 11, in which we’ll get to dive into observability in general, and OpenTelemetry in particular.

Finally, I’ll leave you with some advice: always be yourself, and remember that luck comes from hard work.

1 Brian W. Kernighan and P.J. Plauger, The Elements of Programming Style (McGraw-Hill, 1978).

2 Staff, America’s Test Kitchen, Perfect Pie: Your Ultimate Guide to Classic and Modern Pies, Tarts, Galettes, and More, America’s Test Kitchen, 2019.

3 They’re doing some pretty amazing things with genetic engineering. Don’t stop believing.

4 “Systems and Software Engineering: Vocabulary”, ISO/IEC/IEEE 24765:2010(E), December 15, 2010.

5 Radle Byron et al., “What Is Manageability?”, NI, National Instruments, July 1, 2024.

6 Or so I told my editors. Hi, Melissa!

7 This makes me sad. These are important topics, but we have to focus.

8 Rob Pike, “Go Proverbs”, Gopherfest, YouTube, November 18, 2015.

9 Neat trick, huh?

10 Well, like you.

11 Seriously, that really is what it stands for.

12 Also, I just love JSON so much.

13 Or that feature just hasn’t been implemented. I don’t know.

14 I’ve also seen “feature switch,” “feature flipper,” “feature toggle,” “conditional feature,” and more. The industry seems to have settled on “feature flag,” probably because the other names are just a little silly.

Chapter 11. Observability

Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom.1

Clifford Stoll, High-Tech Heretic: Reflections of a Computer Contrarian

“Cloud native” is still a pretty new concept, even for computing. As far as I can tell, the term cloud native only started entering our vocabulary just after the founding of the Cloud Native Computing Foundation in the middle of 2015.2

As an industry, we’re largely still trying to figure out exactly what “cloud native” means, and with each of the major public cloud providers regularly launching new services—each seeming to offer more abstraction than the last—even what little agreement we have is shifting over time.

One thing is clear, though: the functions (and failures) of the network and hardware layers are being increasingly abstracted and replaced with API calls and events. Every day we move closer to a world of software-defined everything. All of our problems are becoming software problems.

While we certainly sacrifice a fair share of control over the platforms our software runs on, we win big in overall manageability and reliability,3 allowing us to focus our limited time and attention on our software. However, this also means that proportionally more of our failures now originate from within our own services and the interactions between them. No amount of fancy frameworks or protocols can solve the problem of bad software. As I said way back in Chapter 1, a kludgy application in Kubernetes is still kludgy.

Things are complicated in this brave new software-defined, highly distributed world. The software is complicated, the platforms are complicated, together they’re really complicated, and more often than not, we have no idea what’s going on. Gaining visibility into our services has become more important than ever, and about the only thing that we do know is that traditional monitoring tools and techniques simply aren’t up to the task. Clearly, we need something new. Not just a new technology, or even a new set of techniques, but an entirely new way of thinking.

What Is Observability?

Observability is the subject of an awful lot of buzz right now. It’s kind of a big deal. But what is observability, actually? How is it different from (and how is it like) traditional monitoring and alerting with logs and metrics and tracing? Most importantly, how do we “do observability”?

Observability isn’t just marketing hype, although it’s easy to think that based on all the attention it’s getting.

It’s actually pretty simple. Observability is a system property, no different than resilience or manageability, that reflects how well a system’s internal states can be inferred from knowledge of its external outputs. A system can be considered observable when it’s possible to quickly and consistently ask novel questions about it with minimal prior knowledge and without having to reinstrument or build new code. An observable system lets you ask it questions that you haven’t thought of yet.

Ultimately, observability is more than tooling, despite what some vendors may try to tell you (and sell you). You can’t “buy observability” any more than you can “buy reliability.” No tooling will make your system observable just because you’re using it any more than a hammer will by itself make a bridge structurally sound. The tools can get you partway there, but it’s up to you to apply them correctly.

This is much easier said than done, of course. Building observability into a complex system demands embracing the fact that we often can’t fully understand its state at any given moment, and that anticipating every possible failure (or nonfailure) state is pretty much impossible. The first step to achieving observability is to stop searching only for specific, expected failure modes, the “known unknowns,” as if they were all that could go wrong.

Why Do We Need Observability?

Observability is the natural evolution of traditional monitoring, driven by the new challenges introduced by complex distributed architectures.

The first of these is simply the pure scale of many modern cloud native systems, which increasingly have too much stuff for our limited human brains with their limited human attention spans to handle. All of the data generated by multiple concurrently operating interconnected systems provides more things than we can reasonably watch, more data than we can reasonably process, and more correlations than we can reasonably make.

More important, however, is that the nature of cloud native systems is fundamentally different from that of the more traditional architectures of not so long ago. Their environmental and functional requirements are different, the way they function (and the way they fail) is different, and the guarantees they need to provide are different.

How do you monitor distributed systems given the ephemerality of modern applications and the environments in which they reside? How can you pinpoint a defect in a single component within the complex web of a highly distributed system? These are the problems that “observability” seeks to address.

How Is Observability Different from “Traditional” Monitoring?

On its face, the line between monitoring and observability may seem fuzzy. After all, both are about being able to ask questions of a system. The difference is in the types of questions that are and can be asked.

Traditionally, monitoring focuses on asking questions in the hope of identifying or predicting some expected or previously observed failure modes. In other words, it centers on “known unknowns.” The implicit assumption is that the system is expected to behave—and therefore fail—in a specific, predictable way. When a new failure mode is discovered—usually the hard way—its symptoms are added to the monitoring suite, and the process begins again.

This approach works well enough when a system is fairly simple, but it has some problems. First, asking a new question of a system often means writing and shipping new code. This isn’t flexible, it definitely isn’t scalable, and it’s super annoying.

Second, at a certain level of complexity, the number of “unknown unknowns” in a system starts to overwhelm the number of “known unknowns.” Failures are more often unpredicted, less often predictable, and are nearly always the outcome of many things going wrong. Monitoring for every possible failure mode becomes effectively impossible.

Monitoring is something you do to a system to find out it isn’t working. Observability techniques, on the other hand, emphasize understanding a system by allowing you to correlate events and behaviors. Observability is a property a system has that lets you ask why it isn’t working.

The “Three Pillars of Observability”

The three pillars of observability is the collective name sometimes given to the three most common (and foundational) tools in the observability kit: logging, metrics, and distributed tracing. These are described as follows, in the order in which we’ll be discussing them:

Distributed tracing

Distributed tracing (or simply tracing) follows a request as it propagates through a (typically distributed) system, allowing the entire end-to-end request flow to be reconstructed as a directed acyclic graph (DAG) called a trace. Analysis of these traces can provide insight into how a system’s components interact, making it possible to pinpoint failures and performance issues.

Distributed tracing will be discussed in more detail in “Distributed Tracing”.

Metrics

Metrics involves the collection of numerical data points representing the state of various aspects of a system at specific points in time. Collections of data points, representing observations of the same subject at various times, are particularly useful for visualization and mathematical analysis and can be used to highlight trends, identify anomalies, and predict future behavior.

We’ll discuss more about metrics in “Metrics”.

Logging

Logging is the process of appending records of noteworthy events to an immutable record—the log—for later review or analysis. A log can take a variety of forms, from a continuously appended file on disk to a full-text search engine like Elasticsearch. Logs provide valuable, context-rich insight into application-specific events emitted by processes. However, it’s important that log entries are properly structured; not doing so can sharply limit their utility.

We’ll dive into logging in more detail in “Logging”.
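
As a small taste of what “properly structured” means, compare a free-form message with an entry built from discrete key-value pairs, sketched here with the standard library’s log/slog package; the field names and values are purely illustrative:

// Hard to query: everything is baked into one string.
log.Printf("order 12345 created for customer c-789 ($24.99)")

// Easier to query: each field is a discrete, typed key-value pair.
slog.Info("order created",
    "order_id", 12345,
    "customer_id", "c-789",
    "total_cents", 2499,
)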

While each of these methods is useful on its own, a truly observable system will interweave them so that each can reference the others. For example, metrics might be used to track down a subset of misbehaving traces, and those traces might highlight logs that could help to find the underlying cause of the behavior.

If you take nothing else away from this chapter, remember that observability is just a system property, like resilience or manageability, and that no tooling, framework, or vendor can just “give you” observability. The so-called “three pillars” are just techniques that can be used to build in that property.

The (So-Called) “Three Pillars”

The name “three pillars of observability” is often criticized (with good reason, I think) because it’s easy to interpret it as suggesting that these three tools are how you “do observability.” But while just having logging, metrics, and tracing won’t necessarily make a system more observable, each of the “three pillars” is a powerful tool that, if well understood and used well together, can provide deep insight into the internal state of your system.

Another common criticism is that the term implies that observability is the combination of three very different things, when each of the three tools simply provides a different view of the same thing, which ultimately enhances a singular ability to understand the state of your system. It’s through the integration of these three approaches that it becomes possible to take the first steps toward observability.

OpenTelemetry

As of the time of writing, OpenTelemetry (or “OTel,” as the cool kids are calling it)4 is one of about three dozen “incubating” member projects of the Cloud Native Computing Foundation, and arguably one of the most interesting projects in the entire CNCF project catalog.

Unlike most CNCF projects, OpenTelemetry isn’t a service, per se. Rather, it’s an effort to standardize how telemetry data—traces, metrics, and (eventually) logs—are expressed, collected, and transferred. Its multiple repositories include a collection of specifications, along with APIs and reference implementations in various languages, including Go.5

The instrumentation space is a crowded one, with perhaps dozens of vendors and tools that have come and gone over the years, each with its own unique implementations. OpenTelemetry seeks to unify this space—and all of the vendors and tools within it—around a single vendor-neutral specification that standardizes how telemetry data is collected and sent to backend platforms. There have been other attempts to standardize before. In fact, OpenTelemetry is the merger of two such earlier projects: OpenTracing and OpenCensus, which it unifies and extends into a single set of vendor-neutral standards.

In this chapter, we’ll review each of the “three pillars,” their core concepts, and how to use OpenTelemetry to instrument your code and forward the resulting telemetry to a backend of your choice. OpenTelemetry is a big subject that deserves a book of its own to truly do it justice, but I’ll do my best to provide enough coverage here to serve as a practical introduction.

If and when you’re ready to do a deep dive into the subject, the book Observability Engineering by Charity Majors et al. (O’Reilly) is an excellent resource.6

The OpenTelemetry Components

OpenTelemetry extends and unifies earlier attempts at creating telemetry standards, in part by including abstractions and extension points in the SDK where you can insert your own implementations. This makes it possible to, for example, implement custom exporters that can interface with a vendor of your choice.

To accomplish this level of modularity, OpenTelemetry was designed with the following five core components:

Specifications

The OpenTelemetry specifications describe the requirements and expectations for all OpenTelemetry APIs, SDKs, and data protocols.

API

Language-specific interfaces and implementations based on the specifications that can be used to add OpenTelemetry instrumentation to an application.

SDK

The concrete OpenTelemetry implementations that sit between the APIs and the Exporters, providing functionality such as state tracking and batching data for transmission. An SDK also offers a number of configuration options for behaviors such as request filtering and transaction sampling.

Exporters

In-process SDK plug-ins that are capable of sending data to a specific destination, which may be local (such as a log file or stdout), or remote (such as Jaeger, or a commercial solution like Honeycomb or ServiceNow).7 Exporters decouple the instrumentation from the backend, making it possible to change destinations without having to reinstrument your code.

Collector

An optional, but very useful, vendor-agnostic service that can receive and process telemetry data before forwarding it to one or more destinations. It can be run either as a sidecar process alongside your application or as a standalone proxy elsewhere, providing greater flexibility for sending the application telemetry. This can be particularly useful in the kind of tightly controlled environments that are common in the enterprise.

You may have noticed the absence of an official OpenTelemetry backend. Well, there isn’t one. OpenTelemetry is concerned only with the collection, processing, and sending of telemetry data, and relies on you to provide a telemetry backend to receive and store the data.

There are other components as well, but the five listed can be considered to be OpenTelemetry’s core components. The relationships between them are illustrated in Figure 11-1.

Figure 11-1. A high-level view of OpenTelemetry’s core components for data instrumentation (API), processing (SDK), and exporting (exporter and collectors); you have to bring your own backend.

Finally, broad language support is a central aim of the project. As of the time of this writing, OpenTelemetry provides APIs and SDKs for Go, Python, Java, Ruby, Erlang/Elixir, PHP, JavaScript, C#/.NET, Rust, C++, and Swift.

Distributed Tracing

Throughout this book, we’ve spent a fair amount of time talking about the benefits of microservices architectures and distributed systems. The unfortunate reality—as I’m sure has already become clear—is that such architectures also introduce a variety of new and “interesting” problems.

It’s been said that fixing an outage in a distributed system can feel like solving a murder mystery, which is a glib way of saying that when something isn’t working, somewhere in the system, it’s often a challenge just knowing where to start looking for the source of the problem before you can find and fix it.

This is exactly the kind of problem that distributed tracing was invented to solve. By tracking requests as they propagate through the system—even across process, network, and security boundaries—tracing can help you to (for example) pinpoint component failures, identify performance bottlenecks, and analyze service dependencies.

Tip

Tracing is usually discussed in the context of distributed systems, but a complex monolithic application can also benefit from tracing, especially if it contends for resources like network, disk, or mutexes.

In this section, we’ll go into more depth on distributed tracing, its core concepts, and how to use OpenTelemetry to instrument your code and forward the resulting telemetry to a backend of your choice.

Unfortunately, the constraints of time and space permit us to dig only so far into this topic. If you’d like to learn more about tracing, you might be interested in Distributed Tracing in Practice by Austin Parker et al. (O’Reilly).

Distributed Tracing Concepts

When discussing tracing, there are two fundamental concepts you need to know about, spans and traces:

Spans

A span describes a unit of work performed by a request, such as a fork in the execution flow or hop across the network, as it propagates through a system. Each span has an associated name, a start time, and a duration. They can be (and typically are) nested and ordered to model causal relationships.

Traces

A trace represents all of the events—individually represented as spans—that make up a request as it flows through a system. A trace may be thought of as a directed acyclic graph (DAG) of spans, or more concretely as a “stack trace” in which each span represents the work done by one component.

This relationship between a request trace and spans is illustrated in Figure 11-2, in which we see two different representations of the same request as it flows through five different services to generate five spans.

Figure 11-2. Two representations of a trace of a request as it traverses five services, resulting in five spans; the full traces are visualized as a DAG (left) and as a bar diagram (right), with a time axis illustrating start times and durations.

When a request begins in the first (edge) service, it creates the first span—the root span—which will form the first node in the span trace. The root span is automatically assigned a globally unique trace ID, which is passed along with each subsequent hop in the request lifecycle. The next point of instrumentation creates a new span with the provided trace ID, perhaps choosing to insert or otherwise enrich the metadata associated with the request, before sending the trace ID along again with the next request.

Each hop along the flow is represented as one span. When the execution flow reaches the instrumented point at one of these services, a record is emitted with any metadata. These records are usually asynchronously logged to disk before being submitted out of band to a collector, which can then reconstruct the flow of execution based on different records emitted by different parts of the system.

Figure 11-2 demonstrates the two most common ways of illustrating a trace containing five spans, lettered A through E, in the order that they were created. On the left side, the trace is represented in DAG form; the root span A starts at time 0 and lasts for 350 ms, until the response is returned for the last service E. On the right, the same data is illustrated as a bar diagram with a time axis, in which the position and length of the bars reflect the start times and durations, respectively.

Distributed Tracing with OpenTelemetry

Using OpenTelemetry to instrument your code includes two phases: configuration and instrumentation. This is true whether you’re instrumenting for tracing or metrics (or both), although the specifics change slightly between the two. For both tracing and metric instrumentation, the configuration phase is executed exactly once in a program, usually in the main function, and includes the following steps:

  1. The first step is to retrieve and configure the appropriate exporters for your target backends. Tracing exporters implement the SpanExporter interface (which in OpenTelemetry v1.28.0 is located in the go.opentelemetry.io/otel/sdk/trace package, often aliased to sdktrace). As we’ll discuss in “Creating the tracing exporters”, several stock exporters are included with OpenTelemetry, but custom implementations exist for many telemetry backends.

  2. Before instrumenting your code for tracing, the exporters—and any other appropriate configuration options—are passed to the SDK to create the “tracer provider,” which, as we’ll show in “Creating a tracer provider”, will serve as the main entry point for the OpenTelemetry tracing API for the lifetime of your program.

  3. Once you’ve created your tracer provider, it’s a good practice to set it as your “global” tracer provider. As we’ll see in “Setting the global tracer provider”, this makes it discoverable via the otel.GetTracerProvider function, which allows libraries and other dependencies that also use the OpenTelemetry API to more easily discover the SDK and emit telemetry data.

Once the configuration is complete, instrumenting your code requires only a few small steps:

  1. Before you can instrument an operation, you first have to obtain a Tracer, which has the central role of keeping track of trace and span information, from the (usually global) tracer provider. We’ll discuss this in more detail in “Obtaining a tracer”.

  2. Once you have a handle to your Tracer, you can use it to create and start the actual Span value that you’ll use to instrument your code. We’ll cover this in some detail in “Starting and ending spans”.

  3. Finally, you can also choose to add metadata to your spans, including human-readable, timestamped messages called events, and key-value pairs called attributes. We’ll cover span metadata in “Setting span metadata”.

OpenTelemetry Tracing Imports

There are many, many packages in the OpenTelemetry framework. Fortunately, for the purposes of this section, we’ll be able to focus on just a subset of these.

The examples in this section were created using OpenTelemetry v1.28.0, which was the latest release at the time of writing. If you choose to follow along with the code presented in this section, you’ll need to import the following packages from that release:

import (
    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.28.0"
    "go.opentelemetry.io/otel/trace"
)

As usual, the complete code examples are available in the GitHub repository associated with this book.

Creating the tracing exporters

The first thing you have to do when using OpenTelemetry is create and configure your exporters. Tracing exporters implement the SpanExporter interface, which in OpenTelemetry v1.28.0 lives in the go.opentelemetry.io/otel/sdk/trace package, which is often aliased to sdktrace to reduce package naming collisions.

You may recall from “The OpenTelemetry Components” that OpenTelemetry exporters are in-process plug-ins that know how to convert metric or trace data and send it to a particular destination. This destination may be local (stdout or a log file) or remote (such as Jaeger, or a commercial solution like Honeycomb or ServiceNow).

If you want to do anything worthwhile with the instrumentation data you collect, you’ll need at least one exporter. One is usually enough, but you can define as many as you like, should you have the need. Exporters are instantiated and configured once at program startup before being passed to the OpenTelemetry SDK. This will be covered in more detail in “Creating a tracer provider”.

OpenTelemetry comes with a number of included exporters for both tracing and metrics. Two of these are demonstrated in the following.

The Console Exporter

OpenTelemetry’s Console Exporter allows you to write telemetry data as JSON to standard output. This is very handy for debugging or writing to log files.

Creating an instance of the Console Exporter is just a matter of calling stdouttrace.New, which in OpenTelemetry v1.28.0 lives in the go.opentelemetry.io/otel/exporters/stdout/stdouttrace package.

Like most exporters’ creation functions, stdouttrace.New is a variadic function that can accept zero or more configuration options. We demonstrate with one of these—the option to “pretty-print” its JSON output—here:

stdExporter, err := stdouttrace.New(
    stdouttrace.WithPrettyPrint(),
)

Here we use the stdouttrace.New function, which returns both our exporter and an error value. We’ll see what its output looks like when we run our example in “Putting It All Together: Distributed Tracing”.

Note

For more information about the Console Exporter, please refer to its page in the relevant OpenTelemetry documentation.

The OTLP Exporter

The Console Exporter may be useful for local debugging, but OpenTelemetry also includes a number of trace exporters designed to forward data to specialized backends like Jaeger and Zipkin.

More recently, however, there’s been a trend to consolidate around a single protocol called the OpenTelemetry protocol (OTLP). Over the last several years, OTLP has reached something of a threshold level of maturity, and support for it among tracing backends has become quite broad, not only among open source backends but commercial providers as well.

The OTLP Exporter (as its name suggests) knows how to encode tracing telemetry data in the OTLP tracing format. We’ll use it to send data via gRPC to the Jaeger distributed tracing system. You can retrieve an exporter value using the otlptracegrpc.New function, as shown here:

const jaegerEndpoint = "localhost:4317"

otlpExporter, err := otlptracegrpc.New(
    context.Background(),
    otlptracegrpc.WithEndpoint(jaegerEndpoint),
    otlptracegrpc.WithInsecure(),
)

In OpenTelemetry v1.28.0, the gRPC OTLP Exporter can be found in the go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc package. There’s an HTTP implementation as well, which lives in otlptracehttp.

You may have noticed that otlptracegrpc.New works a lot like stdouttrace.New in that it’s a variadic function that accepts zero or more configuration options, returning a SpanExporter implementation (the OTLP Exporter) and an error value.

The options passed to otlptracegrpc.New are:

otlptracegrpc.WithEndpoint

Used to define the target endpoint (host and port) of the Jaeger collector.

otlptracegrpc.WithInsecure

Disables client transport security for the gRPC connection, just like grpc.WithInsecure back in “Implementing the gRPC client”. Don’t use insecure connections in production.

There are quite a few other configuration options available, but only these two are used for the sake of brevity. If you’re interested in more detail, please refer to its page in the relevant OpenTelemetry documentation.

What Is Jaeger?

Jaeger is an open source distributed tracing system inspired by Google’s seminal 2010 paper8 describing its Dapper distributed systems tracing infrastructure, and the earlier OpenZipkin project.

Originally developed (in Go) as an internal project by Uber Technologies, it was released as open source under the Apache license in November 2016. In September of 2017, it became the Cloud Native Computing Foundation’s 12th hosted project, and advanced to graduated status in October 2019.

Among its features are support for multiple storage backends (including in-memory storage for testing setups) and a modern web UI.

Creating a tracer provider

To generate traces, you first have to create and initialize a tracer provider, represented in OpenTelemetry by the TracerProvider type. In OpenTelemetry v1.28.0, it lives in the go.opentelemetry.io/otel/sdk/trace package, which is often aliased to sdktrace to avoid naming collisions.

A TracerProvider is a stateful value that serves as the main entry point for the OpenTelemetry tracing API, including, as we’ll see in the next section, providing access to a Tracer that in turn provides new Span values.

To create a tracer provider, we use the sdktrace.NewTracerProvider function:

tp := sdktrace.NewTracerProvider(
    sdktrace.WithResource(res), // res is a *resource.Resource; see "Resource attributes"
    sdktrace.WithSyncer(stdExporter),
    sdktrace.WithSyncer(otlpExporter),
)

In this example, the two exporters that we created in “Creating the tracing exporters” (stdExporter and otlpExporter) are provided to sdktrace.NewTracerProvider, instructing the SDK to use them for exporting telemetry data.

There are several other options that can be provided to sdktrace.NewTracerProvider, including defining a Syncer or a SpanProcessor. These concepts are (unfortunately) beyond the scope of this book, but more information on these can be found in the OpenTelemetry SDK Specification.

Setting the global tracer provider

Once you’ve created your tracer provider, it’s generally a good practice to set it as your global tracer provider via the SetTracerProvider function. In OpenTelemetry v1.28.0, this and all of OpenTelemetry’s global options live in the go.opentelemetry.io/otel package.

Here we set the global tracer provider to be the value of tp, which we created in the previous section:

otel.SetTracerProvider(tp)

Setting the global tracer provider makes it discoverable via the otel.GetTracerProvider function. This allows libraries and other dependencies that use the OpenTelemetry API to more easily discover the SDK and emit telemetry data:

gtp := otel.GetTracerProvider()
Warning

If you don’t explicitly set a global tracer provider, otel.GetTracerProvider will return a no-op TracerProvider implementation that returns a no-op Tracer that provides no-op Span values.

Obtaining a tracer

In OpenTelemetry, a Tracer is a specialized type that keeps track of trace and span information, including what span is currently active. Before you can instrument an operation, you first have to use a (usually global) tracer provider’s Tracer method to obtain a trace.Tracer value:

tr := otel.GetTracerProvider().Tracer("fibonacci")

TracerProvider’s Tracer method accepts a string parameter to set its name. By convention, tracers are named after the component they are instrumenting, usually a library or a package.

Now that you have your tracer, your next step will be to use it to create and start a new Span instance.

Starting and ending spans

Once you have a handle to a Tracer, you can use it to create and start new Span values representing named and timed operations within a traced workflow. In other words, a Span value represents the equivalent of one step in a stack trace.

In OpenTelemetry v1.28.0, both the Span and Tracer interfaces can be found in the go.opentelemetry.io/otel/trace package. Their relationship can be deduced by a quick review of Tracer’s definition code:

type Tracer interface {
    Start(ctx context.Context, spanName string, opts ...trace.SpanStartOption)
        (context.Context, trace.Span)
}

Yes, that’s really all there is. Tracer’s only method, Start, accepts three parameters: a context.Context value, which is the mechanism that Tracer uses to keep track of spans; the name of the new span, which by convention is usually the name of the function or component being evaluated; and zero or more span configuration options.

Note

Unfortunately, a thorough discussion of the available span configurations is beyond the scope of this book, but if you’re interested, more detail is available in the relevant Go Documentation.

Importantly, Start returns not just the new Span but also a context.Context. This is a new Context instance derived from the one that was passed in. As we’ll see shortly, this is important when we want to create child Span values.

Now that you have all of the pieces in place, you can begin instrumenting your code. To do this, you request a Span value from your Tracer via its Start method, as shown in the following:

const serviceName = "foo"

func main() {
    // EXPORTER SETUP OMITTED FOR BREVITY

    // Retrieve the Tracer from the otel TracerProvider.
    tr := otel.GetTracerProvider().Tracer(serviceName)

    // Start the root span; receive a child context (which now
    // contains the trace ID), and a trace.Span.
    ctx, sp := tr.Start(context.Background(), "main")
    defer sp.End()     // End completes the span.

    SomeFunction(ctx)
}

In this snippet we use Tracer’s Start method to create and start a new Span, which returns a derived context and our Span value. It’s important to note that we ensure the Span is ended by deferring a call to its End method, so that the call to SomeFunction is entirely captured within the root Span.

Of course, we’ll also want to instrument SomeFunction. Since it receives the derived context we got from the original Start, it can now use that Context to create its own subspan:

func SomeFunction(ctx context.Context) {
    tr := otel.GetTracerProvider().Tracer(serviceName)
    _, sp := tr.Start(ctx, "SomeFunction")
    defer sp.End()

    // Do something MAGICAL here!
}

The only differences between main and SomeFunction are the names of the spans and the Context values. It’s significant that SomeFunction uses the Context value derived from the original Start call in main.

Setting span metadata

Now that you have a Span, what do you do with it?

If you do nothing at all, that’s okay. As long as you’ve remembered to End your Span (preferably in a defer statement), a minimal timeline for your function will be collected.

However, the value of your span can be enhanced with the addition of two types of metadata: attributes and events.

Attributes

Attributes are key-value pairs that are associated with spans. They can be used later for aggregating, filtering, and grouping traces.

If known ahead of time, attributes can be added when a span is created by passing them as option parameters to the tr.Start method using the WithAttributes function:

ctx, sp := tr.Start(ctx, "attributesAtCreation",
    trace.WithAttributes(
        attribute.String("greeting", "hello"),
        attribute.String("foo", "bar"),
    ))
defer sp.End()

Here we call tr.Start to start a new span, passing it our active context.Context value and a name. But Start is also a variadic function that can accept zero or more options, so we opt to use the WithAttributes function to pass two string attributes: greeting=hello and foo=bar.

The WithAttributes function accepts an attribute.KeyValue type, from OpenTelemetry’s go.opentelemetry.io/otel/attribute package. Values of this type can be created using the various typed constructor functions, such as attribute.String, seen previously. Constructors exist for many Go types. See the attribute package’s documentation for more information.
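
For instance, a handful of the other constructors look like this (the keys and values here are purely illustrative):

ctx, sp := tr.Start(ctx, "moreAttributes",
    trace.WithAttributes(
        attribute.Bool("cache.hit", true),
        attribute.Int64("retry.count", 3),
        attribute.Float64("elapsed.ms", 12.5),
        attribute.StringSlice("tags", []string{"alpha", "beta"}),
    ))
defer sp.End()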

Attributes don’t have to be added at span creation time. They can be added later in a span’s lifecycle as well, as long as the span hasn’t yet been completed:

answer := LifeTheUniverseAndEverything()
span.SetAttributes(attribute.Int("answer", answer))

Resource attributes

A resource is an optional—but useful—construct that may be used to represent the entity producing telemetry as resource attributes. For example, a service that’s running in a container on Kubernetes has a service name, a pod name, a namespace, and possibly a deployment name. All of these attributes can be included in the resource.

In your observability backend, you can use resource information to better investigate interesting behavior. For example, if your trace or metrics data indicate latency in your system, you can narrow it down to a specific container, pod, or Kubernetes deployment.

New resources are created a lot like attributes:

const (
    serviceName    = "Fibonacci"
    serviceVersion = "0.0.2"
)

resources := resource.NewWithAttributes(
    semconv.SchemaURL,
    semconv.ServiceName(serviceName),
    semconv.ServiceVersion(serviceVersion),
)

Note the use of the semconv package here. This package provides a set of constants that align with OpenTelemetry’s semantic conventions, which provide a common and consistent naming scheme across a codebase, libraries, and platforms.

There’s also a useful default resource that contains a selection of useful metadata. If you want to include it alongside your custom data,9 the resource package provides a helpful function to merge two resources:

resources, err := resource.Merge(resource.Default(), resources)
if err != nil {
    return nil, fmt.Errorf("failed to merge resources: %w", err)
}

Once your resource is constructed, it must be passed to sdktrace.NewTracerProvider during the creation of the tracer provider, which you hopefully recall from “Creating a tracer provider”:

provider := sdktrace.NewTracerProvider(
    ...
    sdktrace.WithResource(resources),
)

Events

An event is a timestamped, human-readable message on a span that represents something happening during the span’s lifetime.

For example, if your function requires exclusive access to a resource that’s under a mutex, you might find it useful to add events when you acquire and release the lock:

span.AddEvent("Acquiring mutex lock")
mutex.Lock()

// Do something amazing.

span.AddEvent("Releasing mutex lock")
mutex.Unlock()

If you like, you can even add attributes to your events:

span.AddEvent("Canceled by external signal",
    attribute.Int("pid", 1234),
    attribute.String("signal", "SIGHUP"))

Autoinstrumentation

Autoinstrumentation, broadly, refers to instrumentation code that you didn’t write. This is a useful feature that can spare you from a fair amount of unnecessary bookkeeping.

OpenTelemetry supports autoinstrumentation through various wrappers and helper functions around many popular frameworks and libraries, including ones that we cover in this book, like net/http, gorilla/mux, and grpc.

While using these functionalities doesn’t free you from having to configure OpenTelemetry at startup, they do remove some of the effort associated with having to manage your traces.

Autoinstrumenting net/http and gorilla/mux

In OpenTelemetry v1.28.0, autoinstrumentation support for both the standard net/http library and gorilla/mux, both of which we first covered in Chapter 5 in the context of building a RESTful web service, is provided by the go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp package.

Its use is refreshingly minimalist. Take, for example, this standard idiom in net/http for registering a handler function to the default mux10 and starting the HTTP server:

func main() {
    http.HandleFunc("/", helloGoHandler)
    log.Fatal(http.ListenAndServe(":3000", nil))
}

In OpenTelemetry, a handler function can be autoinstrumented by passing it to the otelhttp.NewHandler function, the signature for which is shown here:

func NewHandler(handler http.Handler, operation string, opts ...Option)
    http.Handler

The otelhttp.NewHandler function accepts and returns a handler function. It works by wrapping the passed handler function in a second handler function that creates a span with the provided name and options so that the original handler acts like middleware within the returned span-handling function.

A typical application of the otelhttp.NewHandler function is shown in the following:

func main() {
    http.Handle("/",
        otelhttp.NewHandler(http.HandlerFunc(helloGoHandler), "root"))
    log.Fatal(http.ListenAndServe(":3000", nil))
}

You’ll notice that we have to convert the handler function to an http.HandlerFunc before passing it to otelhttp.NewHandler. This wasn’t necessary before because http.HandleFunc performs this conversion automatically before calling http.Handle itself.

If you’re using gorilla/mux, the change is almost the same, except that you’re using the gorilla mux instead of the default mux:

func main() {
    r := mux.NewRouter()
    r.Handle("/",
        otelhttp.NewHandler(http.HandlerFunc(helloGoHandler), "root"))
    log.Fatal(http.ListenAndServe(":3000", r))
}

You’ll need to repeat this for each handler function you want to instrument, but either way the total amount of code necessary to instrument your entire service is pretty minimal.
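
If you have more than a handful of routes, a tiny helper can cut down on the repetition. This is purely a convenience sketch, not part of otelhttp, and healthHandler is a hypothetical second handler:

// wrap autoinstruments a handler function under the given span name.
func wrap(name string, h http.HandlerFunc) http.Handler {
    return otelhttp.NewHandler(h, name)
}

// Registration then becomes a one-liner per route.
r.Handle("/", wrap("root", helloGoHandler))
r.Handle("/health", wrap("health", healthHandler))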

Autoinstrumenting gRPC

In OpenTelemetry v1.28.0, autoinstrumentation support for gRPC, which we introduced in Chapter 8 in the context of loosely coupled data interchange, is provided by the go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc package.11

Just like autoinstrumentation for net/http, autoinstrumentation for gRPC is very minimalist, leveraging gRPC interceptors. We haven’t discussed gRPC interceptors yet, and unfortunately a full treatment of them is beyond the scope of this book. Briefly, they’re the gRPC equivalent of the gorilla/mux middleware that we leveraged in “Load shedding” to implement automatic load shedding.

As their name implies, gRPC interceptors can intercept gRPC requests and responses to, for example, inject information into the request, update the response before it’s returned to the client, or implement crosscutting functionality like authorization, logging, or caching.
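
To make the idea concrete, here’s a minimal sketch of a hand-rolled unary server interceptor that logs each call’s method and duration. It isn’t part of OpenTelemetry; it just illustrates the interceptor shape, using only google.golang.org/grpc and the standard library:

// timingInterceptor runs around every unary RPC handled by the server.
func timingInterceptor(
    ctx context.Context,
    req any,
    info *grpc.UnaryServerInfo,
    handler grpc.UnaryHandler,
) (any, error) {
    start := time.Now()
    resp, err := handler(ctx, req) // invoke the actual service method
    log.Printf("rpc=%s duration=%s err=%v",
        info.FullMethod, time.Since(start), err)
    return resp, err
}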

Note

If you’d like to learn a little more about gRPC interceptors, the article “Interceptors in gRPC-Web” on the gRPC blog offers a good introduction to the subject. For more in-depth coverage, you might want to invest in a copy of gRPC: Up and Running by Kasun Indrasiri and Danesh Kuruppu (O’Reilly).

Taking a look at a slice of the original service code from “Implementing the gRPC service”, you can see two of the operative functions:

s := grpc.NewServer()
pb.RegisterKeyValueServer(s, &server{})

In this snippet, we create a new gRPC server and pass that along to our autogenerated code package to register it. Interceptors can be added to a gRPC server using the grpc.UnaryInterceptor and/or grpc.StreamInterceptor server options, the former of which is used to intercept unary (standard request–response) service methods, and the latter of which is used for intercepting streaming methods.

To autoinstrument your gRPC server, you use one or both of these functions to add one or more off-the-shelf OpenTelemetry interceptors, depending on the types of requests your service handles:

s := grpc.NewServer(
    grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
    grpc.StreamInterceptor(otelgrpc.StreamServerInterceptor()),
)

pb.RegisterKeyValueServer(s, &server{})

While the service we built in Chapter 8 uses exclusively unary methods, the preceding snippet adds interceptors for both unary and stream methods for the sake of demonstration.

Getting the current span from context

If you’re taking advantage of autoinstrumentation, a trace will automatically be created for each request. While convenient, this also means that you don’t have your current Span immediately on hand for you to enhance with application-specific attribute and event metadata. So, what do you do?

Fear not! Since your application framework has conveniently placed the span data inside the current context, the data is easily retrievable:

func printSpanHandler(w http.ResponseWriter, req *http.Request) {
    ctx := req.Context()                    // Get the request Context

    span := trace.SpanFromContext(ctx)      // Get the current span

    fmt.Printf("current span: %v\n", span)  // Why not print the span?
}

What About the Clients?

For a complete distributed tracing experience, tracing metadata has to be forwarded across service boundaries. That means you have to instrument on the client side as well.

The good news is that OpenTelemetry supports autoinstrumentation of http and gRPC clients.

Unfortunately, there isn’t enough space available in this chapter to cover client autoinstrumentation in any detail. Examples of both http and gRPC client autoinstrumentation are available in the GitHub repository associated with this book, however.

Putting It All Together: Distributed Tracing

Using all the parts that we’ve discussed in this section, let’s now build a small web service. Because we’re going to instrument this service with tracing, the ideal service would make a whole lot of function calls but would still be pretty small.

We’re going to build a Fibonacci service. Its requirements are minimal: it will be able to accept an HTTP GET request, in which the nth Fibonacci number can be requested using parameter n on the GET query string. For example, to request the sixth Fibonacci number, you should be able to curl the service as http://localhost:3000?n=6.

To do this, we’ll use a total of three functions. Starting from the inside and working our way out, these are:

The service API

This will do the Fibonacci computation proper—at the request of the service handler—by recursively calling itself, with each call generating its own span.

The service handler

This is an HTTP handler function as defined by the net/http package, which will be used just like in “Building an HTTP Server with net/http” to receive the client request, call the service API, and return the result in the response.

The main function

In the main function, the OpenTelemetry exporters are created and registered, the service handler function is provided to the HTTP framework, and the HTTP server is started.

The Fibonacci service API

The service API at the very core of the service is where the actual computation is performed. In this case, it’s an implementation of the Fibonacci method to calculate the nth Fibonacci number.

Just like any good service API, this function doesn’t know (or care) how it’s being used, so it has no knowledge of HTTP requests or responses:

const serviceName = "Fibonacci"

func Fibonacci(ctx context.Context, n int) int {
    ctx, sp := otel.GetTracerProvider().Tracer(serviceName).Start(
        ctx,
        "Fibonacci",
        trace.WithAttributes(attribute.Int("fibonacci.n", n)),
    )
    defer sp.End()

    result := 1
    if n > 1 {
        a := Fibonacci(ctx, n-1)
        b := Fibonacci(ctx, n-2)
        result = a + b
    }

    sp.SetAttributes(attribute.Int("fibonacci.result", result))

    return result
}

In this example, the Fibonacci function doesn’t know how it’s being used, but it does know about the OpenTelemetry package. Autoinstrumentation can trace only what it wraps. Anything within the API will need to instrument itself.

This function’s use of otel.GetTracerProvider ensures that it’ll get the global TracerProvider, assuming that it was configured by the consumer. If no global tracer provider has been set, these calls will be no-ops.

Tip

For extra credit, take a minute to add support for Context cancellation to the Fibonacci function.
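
One possible approach, sketched here rather than prescribed: check the context at the top of each call (right after the deferred sp.End()) and short-circuit if the request has been canceled. A more complete solution would return an error instead, but that would also require updating the handler:

// Short-circuit if the caller has already given up on this request.
if err := ctx.Err(); err != nil {
    sp.SetStatus(codes.Error, err.Error())
    return 0
}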

The Fibonacci service handler

This is an HTTP handler function as defined by the net/http package.

It’ll be used in our service just like in “Building an HTTP Server with net/http”: to receive the client request, call the service API, and return the result in the response:

func fibHandler(w http.ResponseWriter, req *http.Request) {
    ctx := req.Context()

    // Get the Span associated with the current context and
    // attach the parameter and result as attributes.
    sp := trace.SpanFromContext(ctx)

    args := req.URL.Query()["n"]
    if len(args) != 1 {
        msg := "wrong number of arguments"
        sp.SetStatus(codes.Error, msg)
        http.Error(w, msg, http.StatusBadRequest)
        return
    }

    sp.SetAttributes(attribute.String("fibonacci.argument", args[0]))

    n, err := strconv.Atoi(args[0])
    if err != nil {
        msg := fmt.Sprintf("couldn't parse index n: %s", err.Error())
        sp.SetStatus(codes.Error, msg)
        http.Error(w, msg, http.StatusBadRequest)
        return
    }

    sp.SetAttributes(attribute.Int("fibonacci.parameter", n))

    // Call the child function, passing it the request context.
    result := Fibonacci(ctx, n)

    sp.SetAttributes(attribute.Int("fibonacci.result", result))

    // Finally, send the result back in the response.
    fmt.Fprintln(w, result)
}

Note that it doesn’t have to create or end a Span; autoinstrumentation will do that for us.

It does, however, set some attributes on the current span. To do this, it uses trace.SpanFromContext to retrieve the current span from the request context. Once it has the span, it’s free to add whatever metadata it likes.

Note

The trace.SpanFromContext function will return a no-op span if it can’t find a span associated with the context passed to it.

Building the tracer provider

As we discussed in “Creating a tracer provider”, the job of the tracer provider is to collect, process, and export span data. Its creation is a pretty important step in tracing instrumentation.

In the following block, we define a complete function, newTracerProvider:

const jaegerEndpoint = "localhost:4317"

func newTracerProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
    // Create and configure the OTLP exporter for Jaeger
    otlpExporter, err := otlptracegrpc.New(
        ctx,
        otlptracegrpc.WithEndpoint(jaegerEndpoint),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, fmt.Errorf("failed to build OtlpExporter: %w", err)
    }

    // Create and configure the TracerProvider using the newly
    // created exporter.
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(otlpExporter),
    )

    return tp, nil
}

If we wanted to, this is also where we’d create and configure any resources, which we covered briefly in “Resource attributes”, but they’re optional (if very useful), so we omit them here for brevity.

Also for brevity, we create only the one exporter, but you can add as many exporters as your heart desires with repeated calls to WithBatcher.
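
For instance, registering the Console Exporter from “The Console Exporter” alongside the OTLP one would look something like this (assuming a stdExporter value created as shown there):

tp := sdktrace.NewTracerProvider(
    sdktrace.WithBatcher(otlpExporter),
    sdktrace.WithBatcher(stdExporter), // one WithBatcher call per exporter
)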

The service’s main function

At this point, all of the hard work has been done. All we have left to do is configure OpenTelemetry, register the handler function with the default HTTP mux, and start the service:

func main() {
    ctx := context.Background()

    tp, err := newTracerProvider(ctx)
    if err != nil {
        slog.ErrorContext(ctx, err.Error())
        return
    }

    // Handle shutdown properly so nothing leaks
    defer func() { tp.Shutdown(ctx) }()

    // Registers tp as the global trace provider to allow
    // auto-instrumentation to access it
    otel.SetTracerProvider(tp)

    fmt.Println("Browse to localhost:3000?n=6")

    http.Handle("/", otelhttp.NewHandler(http.HandlerFunc(fibHandler), "root"))

    if err := http.ListenAndServe(":3000", nil); err != nil {
        slog.ErrorContext(ctx, err.Error())
        return
    }
}

Most of the setup work is done in newTracerProvider, so there isn’t much left to do in the main function besides set the global tracer provider, which we discussed in “Setting the global tracer provider”, and make sure that the TracerProvider instance is properly cleaned up by deferring a call to its Shutdown method.

The last two statements are spent autoinstrumenting and registering the handler function and starting the HTTP service, just as we did in “Autoinstrumentation”.

Starting your services

Before we continue, we’ll want to start a Jaeger service to receive the telemetry data provided by the Jaeger exporter that we included. For a little more background on Jaeger, see “What Is Jaeger?”.

If you have Docker installed, you can start a Jaeger service with the following command:

$ docker run -d --name jaeger \
  -p 4317:4317                \
  -p 16686:16686              \
  jaegertracing/all-in-one:1.52

Once the service is up and running, you’ll be able to access its web interface by browsing to http://localhost:16686. Obviously, there won’t be any data there yet, though.

Now for the fun part: start your service by running its main function:

$ go run .

Your terminal should pause. As usual, you can stop the service with a Ctrl-C.

Finally, in another terminal, you can now send a request to the service:

$ curl localhost:3000?n=6
13

After a short pause, you should be rewarded with a result. In this case, 13.

Be careful with the value of n. Because this implementation recurses (and creates a span) twice for each call, the work and the number of spans grow exponentially: if you make n too large, it might take the service a long time to respond, or even crash.

Console exporter output

Now that you’ve issued a request to your service, take a look at the terminal you used to start your service. You should see several JSON blocks that resemble the following:

{
    "Name":"root",
    "SpanContext":{
       "TraceID":"2fa1365e3fe3b603ab5268a64725c647",
       "SpanID":"bc7f17daf0dc845c",
       "TraceFlags":"01"
    },
    "Parent":{
       "TraceID":"00000000000000000000000000000000",
       "SpanID":"0000000000000000",
       "TraceFlags":"00"
    },
    "StartTime":"2024-01-01T13:44:24.600967-05:00",
    "EndTime":"2024-01-01T13:44:24.601064604-05:00",
    "Attributes":[
       {
          "Key":"fibonacci.parameter",
          "Value":{
             "Type":"INT64",
             "Value":6
          }
       },
       {
          "Key":"fibonacci.result",
          "Value":{
             "Type":"INT64",
             "Value":13
          }
       }
    ]
}

These JSON objects are the output of the Console Exporter (which, remember, we’ve configured to pretty-print). There should be one per span, which is quite a few.

The preceding example (which has been pruned considerably) is from the root span. As you can see, it includes quite a few interesting bits of data, including its start and end times, and its trace and span IDs. It even includes the two attributes that we explicitly set: the value of the input n parameter and the final result of our query.

Viewing your results in Jaeger

Now that you’ve generated your trace and sent it to Jaeger, it’s time to visualize it. Jaeger just happens to provide a slick web UI for exactly that purpose!

To check it out, browse to http://localhost:16686 with your favorite web browser. Select Fibonacci in the Service dropdown, and click the Find traces button. You should be presented with output similar to that shown in Figure 11-3.

Each bar in the visualization represents a single span. You can even view a specific span’s data by clicking on it, which reveals the same data contained in the (quite verbose) console output that you saw in “Console exporter output”.

Figure 11-3. Screenshot of the Jaeger interface, displaying the results of a Fibonacci call.

Metrics

Metrics are numerical data about a component, process, or activity, collected over time. The number of potential metric sources is vast and includes (but isn’t limited to) things like computing resources (CPU, memory usage, disk and network I/O), infrastructure (instance replica count, autoscaling events), applications (request count, error count), and business metrics (revenue, customer sign-ups, bounce rate, cart abandonment). Of course, these are just a handful of trivial examples. For a complex system, the number of distinct metrics can range into the many thousands.

A metric data point, representing one observation of a particular aspect of the target (such as the number of hits an endpoint has received), is called a sample. Each sample has a name, a value, and a millisecond-precision timestamp, as well as—at least in modern systems like Prometheus—a set of key-value pairs called labels.
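
As a rough sketch (this isn’t any particular library’s type, just an illustration of those parts), a sample might be modeled in Go like this:

// Sample is an illustrative model of a single metric sample; real metric
// libraries define their own, richer representations.
type Sample struct {
    Name      string            // e.g., "http_requests_total"
    Value     float64           // the observed value
    Timestamp int64             // milliseconds since the Unix epoch
    Labels    map[string]string // e.g., {"method": "GET", "status": "200"}
}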

Cardinality

Cardinality is an important concept in observability. It has its origins in set theory, where it’s defined as the number of elements in a set. For example, the set {1, 2, 3, 4, 5} contains five elements and therefore has a cardinality of five.

The term was later adopted by database designers to refer to the number of distinct values in a table column.12 For example, “eye color” would have a low cardinality, while “username” would be quite high.

More recently, however, the term cardinality has been adopted by the monitoring world, where it refers to the number of unique combinations of metric names and dimensions—the number of distinct label values attached to a particular metric—in your monitoring system. High-cardinality data is critical for observability because it can be queried in many different ways, making it all the more likely that you’ll be able to ask a question that you’d never thought to ask before.
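
To make that concrete with a hypothetical example: a request counter labeled with method (4 distinct values), status code (5 values), and host (100 values) could produce as many as 4 × 5 × 100 = 2,000 distinct time series for that one metric.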

By itself, a single sample is of limited use, but a sequence of successive samples taken at different times with the same name and labels—a time series—can be incredibly useful. As illustrated in Figure 11-4, collecting samples as a time series allows metrics to be easily visualized by plotting the data points on a graph, in turn making it easier to see trends or to observe anomalies or outliers.

Figure 11-4. Arranging samples as a time series allows them to be graphically visualized.

In this figure, we show a time series of the metric aws.ec2.network_in for one AWS EC2 instance. Time is on the x-axis (specifically, one month spanning November–December 2020). The y-axis represents the instantaneous rate at which the instance is receiving network data at that moment. Visualizing the time series this way, it becomes obvious that traffic to the instance spikes each weekday. Interestingly, November 25–27—the days spanning the day before to the day after Thanksgiving in the United States that year—are the exceptions.

The true power of metrics, however, isn’t their ability to be visually represented for human eyes; it’s that their numerical nature makes them particularly amenable to mathematical modeling. For example, you might use trend analysis to detect anomalies or predict future states, which in turn can inform decisions or trigger alerts.

Push Versus Pull Metric Collection

There are two primary architectures in the universe of metrics: push-based and pull-based (so called because of the relationship between the components being monitored and the collector backend).

In push-based metrics, monitored components “push” their data to a central collector backend. In pull-based metrics, the inverse is true: the collector actively retrieves metrics by “pulling” them from HTTP endpoints exposed by the monitored components (or by sidecar services deployed for this purpose, also confusingly called “exporters”; see “Prometheus Exporters”). Both approaches are illustrated in Figure 11-5.

Figure 11-5. Push-based metrics (left) send telemetry directly to a central collector backend; pull-based metrics (right) are actively scraped by the collector from exposed metric endpoints.

What follows is a short description of each of these approaches, along with a very limited list of some arguments for and against each approach. Unfortunately, there are bounteous arguments, many quite nuanced—far too nuanced to delve into here—so we’ll have to be content with some of the common ones.

Push-based metric collection

In push-based metric collection, an application, either directly or via a parallel agent process, periodically sends data to a central collector backend. Push implementations, like Ganglia, Graphite, and StatsD, tend to be the more common (even default) approach, perhaps in part because the push model tends to be quite a bit easier to reason about.

Push messages are typically unidirectional, being emitted by the monitored components or monitoring agent and sent to a central collector. This places a bit less burden on the network relative to the (bidirectional) pull model, and can reduce the complexity of the network security model, since components don’t have to make a metrics endpoint accessible to the collector. It’s also easier to use the push model to monitor highly ephemeral components such as short-lived containers or serverless functions.

There are some downsides to the push model, though. First, you need to know where to send your data. There are lots of ways to do this, but each has its drawbacks, ranging from hardcoded addresses (which are hard to change) to DNS lookups or service discovery (which may add unacceptable latency). Scaling can also sometimes be an issue, in that it’s entirely possible for a large number of components to effectively DDoS your collector backend.

Pull-based metric collection

In the pull-based collection model, the collector backend periodically (on some configurable cadence) scrapes a metric endpoint exposed by a component, or by a proxy deployed for this purpose. Perhaps the best-known example of a pull-based system is Prometheus.

The pull approach offers some notable advantages. Exposing a metric endpoint decouples the components being observed from the collector itself, which provides all of the benefits of loose coupling. For example, it becomes easier to monitor a service during development, or even manually inspect a component’s health with a web browser. It’s also much easier for a pull model to tell if a target is down.

However, the pull approach has a discovery issue of its own, in that the collector has to somehow know where to find the services it’s supposed to monitor. This can be a bit of a challenge, particularly if your system isn’t using dynamic service discovery. Load balancers are of little help here, either, since each request will be forwarded to a random instance, greatly reducing the effective collection rate (since each of N instances receives 1/N of the pulls) and severely muddying what data is collected (since all of the instances tend to look like a single target). Finally, pull-based collection can make it somewhat harder to monitor short-lived ephemeral things like serverless functions, necessitating a solution like the Prometheus push gateway.

What Is Prometheus?

Prometheus is an open source monitoring and alerting toolkit. It uses a pull model over HTTP to scrape metric data, storing it as high-dimensionality values in its time series database.

Prometheus consists of a core server, which is responsible for the acquisition and storage of data, as well as a variety of other optional components, including a push gateway for supporting data pushes by short-lived jobs, and an alert manager for handling alerts. While Prometheus isn’t intended as a dashboarding solution, it also provides a basic web UI and query language, PromQL, to make data more easily accessible.

In January 2015, SoundCloud publicly released Prometheus as an open source project under the Apache license, and in May 2016, Prometheus joined the Cloud Native Computing Foundation as its second hosted project (after Kubernetes). It advanced to graduated status in August 2018.

But which is better?

Since the push and pull approaches are, it would seem, polar opposites of one another, it’s common for people to wonder which is better.13 That’s a hard question, and as is often the case when comparing technical methodologies, the answer is a resounding “it depends.”

Of course, that’s never stopped a sufficiently motivated programmer from stridently arguing one side or another, but at the end of the day, the “better” approach is the one that satisfies the requirements of your system. Of course (and quite unsatisfyingly) that could be both equally. We technical types abhor ambiguity, yet it stubbornly insists on existing anyway.

So, I will close this section with the words of Brian Brazil, a core developer of Prometheus:

From an engineering standpoint, in reality, the question of push versus pull largely doesn’t matter. In either case, there’s advantages and disadvantages, and with engineering effort, you can work around both cases.14

Metrics with OpenTelemetry

For the most part, instrumenting for metrics with OpenTelemetry works a lot like instrumenting for tracing, but the two are different enough to possibly cause some confusion. For both tracing and metric instrumentation, the configuration phase is executed exactly once in a program, usually in the main function, and includes the following steps:

  1. Create and configure the appropriate metric exporter for the target backend. As we’ll discuss in “Creating your metric exporters”, several stock exporters are included with OpenTelemetry.

  2. Define the global meter provider, which will serve as your program’s main entry point into the OpenTelemetry metric API throughout its lifetime. As we’ll see in “Setting the global meter provider”, this makes the meter provider discoverable via the otel.GetMeterProvider function, which allows libraries and other dependencies that use the OpenTelemetry API to more easily access the SDK and emit telemetry data.

  3. If your metric backend is pull-based, like Prometheus, you’ll have to expose a metric endpoint. You’ll see how the Prometheus exporter leverages Go’s standard http package in “Exposing the metrics endpoint”.

Once the configuration is complete, instrumenting your code requires only a couple small steps:

  1. Before you can instrument an operation, you first have to obtain a Meter, through which metric collection is configured and reported, from the meter provider. We’ll discuss this in more detail in “Obtaining a meter”.

  2. Finally, once you have a Meter, you can use it to instrument your code. There are two ways this can be done, either by explicitly recording measurements or by creating observers that can autonomously and asynchronously collect data. Both of these approaches are covered in “Metric instruments”.

OpenTelemetry Metrics Imports

There are numerous packages in the OpenTelemetry framework. Fortunately, for the purposes of this section, we’ll be able to focus on just a subset of these.

The examples in this section were created using OpenTelemetry v1.28.0, which was the latest release at the time of writing. If you choose to follow along with this section, you’ll need to import the following packages from that release:

import (
    "github.com/prometheus/client_golang/prometheus/promhttp"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/prometheus"
    "go.opentelemetry.io/otel/metric"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

As usual, the complete code examples are available in the GitHub repository associated with this book.

Creating your metric exporters

Just like with tracing, the first thing you have to do when using OpenTelemetry for metrics is create and configure your exporters. Metric exporters implement the metric.Exporter interface, which in OpenTelemetry v1.28.0 lives in OpenTelemetry’s sdk/metric package.

The way that you create metric exporters varies a little between implementations, but it’s typical for an exporter to have a New builder function, at least in the standard OpenTelemetry packages.

To get an instance of the Prometheus exporter, for example, you would use the New function from OpenTelemetry’s exporters/prometheus package:

prometheusExporter, err := prometheus.New()

While it may not be immediately clear just from the code, New is actually variadic, and it can accept zero or more options that allow you to specify a variety of custom behaviors. We’re not going to do that here, but if you’re interested, the API documentation is fairly straightforward.

Setting the global meter provider

Where OpenTelemetry tracing has the “tracer provider” that provides Tracer values, OpenTelemetry metrics has the meter provider, which provides the Meter values through which all metric collection is configured and reported.

You may recall that when working with tracing exporters, defining the global tracer provider requires two steps: creating and configuring a tracer provider instance and then setting that instance as the global tracer provider. The meter provider works exactly the same way:

// Create a meter provider that reads metrics from the Prometheus exporter.
mp := sdkmetric.NewMeterProvider(sdkmetric.WithReader(prometheusExporter))

// Register it as the global meter provider.
otel.SetMeterProvider(mp)

Exposing the metrics endpoint

Because Prometheus is pull-based, any telemetry data we want to send it must be exposed through an HTTP endpoint that the collector can scrape.

To do this, we can make use of Go’s standard http package, which, as we’ve shown several times in this book, requires minimal configuration and is rather straightforward to use.

To review what we first introduced in “Building an HTTP Server with net/http”, starting a minimal HTTP server in Go requires at least two calls:

http.Handle

To register a handler function that implements the http.Handler interface

http.ListenAndServe

To start the server listening

The OpenTelemetry Prometheus exporter has a pretty nifty trick up its sleeve: it registers its metrics with the Prometheus client library’s default registry, which means the standard promhttp.Handler can be passed directly to http.Handle to serve the metric endpoint! See the following:

// Register the exporter as the handler for the "/metrics" pattern.
http.Handle("/metrics", promhttp.Handler())

// Start the HTTP server listening on port 3000.
log.Fatal(http.ListenAndServe(":3000", nil))

In this example, we pass the handler provided by promhttp directly into http.Handle to register it as the handler for the pattern “/metrics.” It’s hard to get more convenient than that.

Note

Ultimately, the name of your metrics endpoint is up to you, but /metrics is the most common choice. It’s also where Prometheus looks by default.

Prometheus Exporters

It’s relatively straightforward to expose a metrics endpoint if your application is a standard web service written in Go. But what if you want to collect JMX data from a JVM-based application, query metrics from a PostgreSQL database, or gather system metrics from a deployed Linux or Windows instance?

Unfortunately, not all of the things you’ll want to collect data on are things that you control, and few of them natively expose their own metrics endpoints.

In this (very common) scenario, the typical solution is to deploy a Prometheus exporter. A Prometheus exporter (not to be confused with an OpenTelemetry exporter) is a specialized adapter that runs as a service to collect the desired metric data and expose it on a metrics endpoint.

As of the time of writing, there are over 200 different Prometheus exporters listed in the Prometheus documentation, and many more community-built exporters that aren’t listed there. You can check that page for an up-to-date list, but some of the most popular are:

Node exporter

Exposes hardware and OS metrics made available by *NIX kernels.

Windows exporter

Exposes hardware and OS metrics made available by Windows.

JMX exporter

Scrapes and exposes MBeans of a JMX target.

PostgreSQL exporter

Retrieves and exposes PostgreSQL server metrics.

Redis exporter

Retrieves and exposes Redis server metrics.

Blackbox exporter

Allows closed/open probing of endpoints over HTTP, HTTPS, DNS, TCP, or ICMP.

Push gateway

A metrics cache that lets ephemeral and batch jobs expose their metrics by pushing them to an intermediary. Technically a core component of Prometheus rather than a standalone exporter.

Obtaining a meter

Before you can instrument an operation, you first have to obtain a Meter value from a MeterProvider.

As you’ll see in “Metric instruments”, the metric.Meter type is the means by which all metric collection is configured and reported, either as record batches of synchronous measurements or asynchronous observations.

You can retrieve a Meter value as follows:

meter := otel.GetMeterProvider().Meter("fibonacci")

You may have noticed that this snippet looks almost exactly like the expression used to get a Tracer back in “Obtaining a tracer”. In fact, otel.GetMeterProvider is the direct analogue of otel.GetTracerProvider and works pretty much the same way.

The otel.GetMeterProvider function returns the registered global meter provider. If none is registered, then a default meter provider is returned that forwards the Meter interface to the first MeterProvider that is subsequently registered.

The provider’s Meter method returns an instance of the metric.Meter type. It accepts a string parameter representing the instrumentation name, which by convention is named after the library or package it’s instrumenting.

Metric instruments

Once you have a Meter, you can create instruments, which you can use to make measurements and to instrument your code. However, just as there are several different types of metrics, there are several types of instruments. The type of instrument you use will depend on the type of measurement you’re making.

All told, there are 12 kinds of instruments available, each defined by some combination of synchronicity, instrument type, and data type.

The first of these properties, synchronicity, determines how an instrument collects and transmits data:

Synchronous instruments

Explicitly invoked to record a metric, as we’ll see in “Synchronous instruments”.

Asynchronous instruments

Also called observers, can monitor a specific property and are asynchronously called by the SDK during collection. We’ll demonstrate in “Asynchronous instruments”.

Second, each instrument has a type that describes how it tracks the acquisition of new data:

Counters

Track a non-negative value that can only increase,15 like the number of requests served, tasks completed, or errors observed.

Up-down counters

Track a value that can be either incremented or decremented. They’re used for sums that can go up and down, like the current number of concurrent requests or active threads.

Gauges

Track a value that can go arbitrarily up or down. They’re useful for measured values like temperatures or current memory usage.

Histograms

Capture a distribution of observations, like request durations or response sizes, and count them in ranged buckets. They also provide a sum of all observed values.

Each of these four instrument types can be expressed with a particular synchronicity. Counters and up-down counters may be either synchronous or asynchronous, while gauges may be only asynchronous and histograms may be only synchronous.

Finally, each of these six combinations has two versions based on the type of values it’s capable of tracking—float64 or int64—for a total of 12 kinds of instruments. Each of these has a corresponding Go type in the metric package, summarized in Table 11-1.

Table 11-1. The 12 kinds of metric instruments by synchronicity and instrument type
Counter
    Synchronous: Int64Counter, Float64Counter
    Asynchronous: Int64ObservableCounter, Float64ObservableCounter

Up-down counter
    Synchronous: Int64UpDownCounter, Float64UpDownCounter
    Asynchronous: Int64ObservableUpDownCounter, Float64ObservableUpDownCounter

Gauge
    Synchronous: —
    Asynchronous: Int64ObservableGauge, Float64ObservableGauge

Histogram
    Synchronous: Int64Histogram, Float64Histogram
    Asynchronous: —

Each of the 12 types has an associated constructor method on the metric.Meter type, all with a similar signature. For example, the Int64Counter method looks like the following:

func (m Meter) Int64Counter(name string, options ...Int64CounterOption)
    (Int64Counter, error)

All 12 constructor methods accept the name of the metric as a string and zero or more instrument option values, just like the Int64Counter method shown here. Similarly, each returns an instrument value of the appropriate type with the given name and options, and can return an error if the name is empty or otherwise invalid, or if an instrument with the same name has already been registered.

For example, a function that uses the Int64Counter method to get a new metric.Int64Counter from a metric.Meter value looks something like the following:

// The requests counter instrument. As a synchronous instrument,
// we'll need to keep it so we can use it later to record data.
var requests metric.Int64Counter

func buildRequestsCounter(meter metric.Meter) error {
    var err error

    // Get an Int64Counter for a metric called "fibonacci_requests_total".
    requests, err = meter.Int64Counter("fibonacci_requests_total",
        metric.WithDescription("Total number of Fibonacci requests."),
    )

    return err
}

Note how we retain a reference to the instrument in the form of the requests global variable. For reasons I’ll discuss shortly, this is generally specific to synchronous instruments.

Even though the metric.Int64Counter happens to be a synchronous instrument, the takeaway here is that synchronous and asynchronous instruments are both obtained in the same way: via the corresponding constructor method on the Meter. How they’re used, however, differs significantly, as we’ll see in the subsequent sections.

Synchronous instruments

The initial steps to using a synchronous instrument—retrieving a meter from the meter provider and creating an instrument—are largely the same for both synchronous and asynchronous instruments. We saw these in the previous section.

However, using synchronous instruments differs from using asynchronous instruments in that they’re explicitly exercised in your code logic when recording a metric, which means you have to be able to refer to your instrument after it’s been created. That’s why the preceding example uses a global requests variable.

The most common application of synchronous instruments is to record individual events by incrementing a counter when an event occurs. The following example uses the requests value that we created in the previous example by adding a call to requests.Add to the API’s Fibonacci function that was originally defined in “The Fibonacci service API”:

// Define our attributes here so that we can easily reuse them.
var attributes = []attribute.KeyValue{
    attribute.Key("application").String(serviceName),
    attribute.Key("container_id").String(os.Getenv("HOSTNAME")),
}

func Fibonacci(ctx context.Context, n int) chan int {
    // Use the Add method on our metric.Int64Counter instance
    // to increment the counter value.
    requests.Add(ctx, 1, metric.WithAttributes(attributes...))

    // The rest of the function...
}

As you can see, the requests.Add method—which is safe for concurrent use—accepts three parameters:

  • The first parameter is the current context in the form of a context.Context value. This is common for all of the synchronous instrument methods.

  • The second parameter is the number to increment by. In this case, each call to Fibonacci increases the call counter by one.

  • The third parameter is zero or more metric.AddOption values that, in this example, represent the attributes to associate with the data points. This increases the cardinality of the metrics, which, as discussed in “Cardinality”, is incredibly useful.

Tip

Data attributes, or labels, are a powerful tool that allow you to describe data beyond which service or instance emitted it. They can allow you to ask questions of your data that you hadn’t thought of before.
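
Counters aren’t the only synchronous instruments. As a purely hypothetical extension of this example (it isn’t part of the Fibonacci service as presented), a Float64Histogram could record request durations in much the same way; the metric name and the timedHandler function here are assumptions for illustration:

// The request-duration histogram. Like the counter, it's a synchronous
// instrument, so we keep a reference to it for later use.
var duration metric.Float64Histogram

func buildDurationHistogram(meter metric.Meter) error {
    var err error

    // Get a Float64Histogram for a metric called
    // "fibonacci_request_duration_seconds".
    duration, err = meter.Float64Histogram("fibonacci_request_duration_seconds",
        metric.WithDescription("Duration of Fibonacci requests."),
        metric.WithUnit("s"),
    )

    return err
}

func timedHandler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()

    // ...handle the request...

    // Record the elapsed time, reusing the same attributes as before.
    duration.Record(r.Context(), time.Since(start).Seconds(),
        metric.WithAttributes(attributes...))
}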

Asynchronous instruments

Asynchronous instruments, or observers, are created and configured during setup to measure a particular property and are subsequently called by the SDK during collection. This is especially useful when you have a value you want to monitor without managing your own background recording process.

Just like synchronous instruments, asynchronous instruments are created from a constructor method attached to a metric.Meter instance. In total, there are six such functions: a float64 and int64 version for each of the three asynchronous instrument types. All six have a similar signature, of which the following is representative:

func (m Meter) Int64ObservableUpDownCounter(
    name string,
    options ...Int64ObservableUpDownCounterOption,
) (Int64ObservableUpDownCounter, error)

As you can see, the Int64ObservableUpDownCounter method accepts the name of the metric as a string and zero or more instrument options (such as the metric description). Although it returns the observable value, this isn’t actually used all that often, though it can return a non-nil error if the name is empty, already registered, or otherwise invalid.

It isn’t obvious by looking at this function signature, but this is also how you add the very heart of any asynchronous instrument: the callback function. Callback functions are asynchronously called by the SDK on data collection to update the observer that it’s associated with. There are two kinds, one each for int64 and float64, but they look, feel, and work essentially the same:

type Float64Callback func(context.Context, Float64Observer) error
type Int64Callback func(context.Context, Int64Observer) error

When called by the SDK, the callback functions receive the current context.Context and either a metric.Float64Observer (for float64 observers) or me⁠tr⁠ic.I⁠nt64​Obs⁠erv⁠er (for int64 observers). Both result types have an Observe method, which you use to report your results.

That’s a lot of little details, but they come together fairly seamlessly. The following function puts them all to work, defining two observers:

func buildRuntimeObservers(meter metric.Meter) error {
    var err error
    var m runtime.MemStats

    _, err = meter.Int64ObservableUpDownCounter(
        "fibonacci_memory_usage_bytes",
        metric.WithInt64Callback(
            func(_ context.Context, result metric.Int64Observer) error {
                runtime.ReadMemStats(&m)
                result.Observe(int64(m.Sys),
                    metric.WithAttributes(attributes...))
                return nil
            },
        ),
        metric.WithDescription("Amount of memory used."),
    )
    if err != nil {
        return err
    }

    _, err = meter.Int64ObservableGauge(
        "fibonacci_num_goroutines",
        metric.WithInt64Callback(
            func(_ context.Context, o metric.Int64Observer) error {
                o.Observe(int64(runtime.NumGoroutine()),
                    metric.WithAttributes(attributes...))
                return nil
            },
        ),
        metric.WithDescription("Number of running goroutines."),
    )
    if err != nil {
        return err
    }

    return nil
}

When called by main, the buildRuntimeObservers function defines two asynchronous instruments—fibonacci_memory_usage_bytes and fibonacci_num_goroutines—each with a callback function that works much like the data collection in the Fibonacci function that we defined in “Synchronous instruments”.

As you can see, using an asynchronous approach for nonevent data is not only less work to manage, but it also has fewer moving parts to worry about later: once the observers (and their callback functions) are defined, there’s nothing else to do; the SDK takes over.

Putting It All Together: Metrics

Now that we have an idea what metrics we’re going to collect and how, we can use them to extend the Fibonacci web service that we put together in “Putting It All Together: Distributed Tracing”.

The functionality of the service will remain unchanged. As before, it will be able to accept an HTTP GET request, in which the nth Fibonacci number can be requested using parameter n on the GET query string. For example, to request the sixth Fibonacci number, you should be able to curl the service as http://localhost:3000?n=6.

The specific changes we’ll be making, and the metrics that we’ll be collecting, are as follows:

  • Synchronously recording the API request count by adding the buildRequestsCounter function to main and instrumenting the Fibonacci function in the service API as we described in “Synchronous instruments”

  • Asynchronously recording the process’s memory usage and number of active goroutines by adding the buildRuntimeObservers function described in “Asynchronous instruments” to the main function

As usual, all of this code can be found in the companion GitHub repository for this book, but the following sketch gives a sense of how the pieces fit together in main.
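
This is only a rough sketch: it omits the tracing setup from earlier in the chapter for brevity, and the exact wiring in the book’s repository may differ, but it shows the metric-related steps in one place. The fibHandler, buildRequestsCounter, and buildRuntimeObservers functions are the ones defined earlier in this chapter:

func main() {
    // Create the Prometheus exporter and register it as a reader on a new
    // meter provider, which we then set as the global meter provider.
    exporter, err := prometheus.New()
    if err != nil {
        log.Fatal(err)
    }

    mp := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))
    otel.SetMeterProvider(mp)

    // Obtain a Meter and use it to create our instruments.
    meter := otel.GetMeterProvider().Meter("fibonacci")

    if err := buildRequestsCounter(meter); err != nil {
        log.Fatal(err)
    }

    if err := buildRuntimeObservers(meter); err != nil {
        log.Fatal(err)
    }

    // Register the service handler and the Prometheus scrape endpoint.
    http.Handle("/", http.HandlerFunc(fibHandler))
    http.Handle("/metrics", promhttp.Handler())

    log.Fatal(http.ListenAndServe(":3000", nil))
}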

Starting your services

Once again, start your service by running its main function:

$ go run .

As before, your terminal should pause. You can stop the service with a Ctrl-C.

Next, you’ll start the Prometheus server. Before you do, you’ll need to create a minimal configuration file for it. Prometheus has a ton of available configuration options, but the following should be perfectly sufficient. Copy and paste it into a file named prometheus.yml:

scrape_configs:
- job_name: fibonacci
  scrape_interval: 5s
  static_configs:
  - targets: ['host.docker.internal:3000']

This configuration defines a single target named fibonacci that lives at host.docker.internal:3000 and will be scraped every five seconds (down from the default of every minute).

What Is host.docker.internal?

The name host.docker.internal is a special DNS name defined in Docker Desktop for both Mac and Windows that resolves to the internal IP address used by the host, allowing the container to interact with host processes.

Importantly, this address is provided for development convenience and won’t work in a production environment outside of Docker Desktop for Mac and Windows (i.e., it’s not supported by default in Docker on Linux).

Once you’ve created the file prometheus.yml, you can start Prometheus. The easiest way to do this is to run it in a container using Docker:

docker run -d --name prometheus                             \
  -p 9090:9090                                              \
  -v "${PWD}/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus:v2.23.0

Warning

If you’re using Linux for development, you’ll need to add the parameter --add-host=host.docker.internal:host-gateway to this command. But do not use this in production.

Now that your services are both running, you can send a request to the service:

$ curl localhost:3000?n=6
13

Behind the scenes, OpenTelemetry has just recorded a value for the number of requests (recursive and otherwise) made to its Fibonacci function.

Metric endpoint output

Now that your service is running, you can always examine its exposed metrics directly with a standard curl to its /metrics endpoint:

$ curl localhost:3000/metrics
# HELP fibonacci_memory_usage_bytes Amount of memory used.
# TYPE fibonacci_memory_usage_bytes gauge
fibonacci_memory_usage_bytes{application="fibonacci",container_id=""} 5.791e+07
# HELP fibonacci_num_goroutines Number of running goroutines.
# TYPE fibonacci_num_goroutines gauge
fibonacci_num_goroutines{application="fibonacci",container_id=""} 6
# HELP fibonacci_requests_total Total number of Fibonacci requests.
# TYPE fibonacci_requests_total counter
fibonacci_requests_total{application="fibonacci",container_id=""} 21891

As you can see, all three of the metrics we’re recording—as well as their types, descriptions, labels, and values—are listed here. Don’t be confused if the value of container_id is empty—that just means you’re not running in a container.

Viewing your results in Prometheus

Now that you’ve started your service, started Prometheus, and run a query or two to the service to seed some data, it’s time to visualize your work in Prometheus. Again, Prometheus isn’t a full-fledged graphing solution (you’ll want to use something like Grafana for that), but it does provide a simple interface for executing arbitrary queries.

You can access this interface by browsing to localhost:9090. You should be presented with a minimalist interface with a search field. To see the value of your metric over time, enter its name in the search field, hit enter, and click the “graph” tab. You should see something like the screenshot in Figure 11-6.

Figure 11-6. Screenshot of the Prometheus interface, displaying the value of the fibonacci_requests_total metric after three calls to the Fibonacci service.

Now that you’re collecting data, take a moment to run a few more queries and see how the graph changes. Maybe even look at some other metrics. Enjoy!

Logging

A log is an immutable record of events—discrete occurrences that are worth recording—emitted by an application over time. Traditionally, logs were stored as append-only files, but these days, a log is just as likely to take the form of some kind of searchable data store.

So, what’s there to say about logging, other than that it’s a really good idea that’s been around as long as electronic computing has? It’s the OG of observability methods.

There’s actually quite a bit to say, largely because it’s really, really easy to do logging in a way that makes your life harder than it needs to be.

Of the three pillars of observability, logs are by far the easiest to generate. Since there’s no initial processing involved in outputting a log event, in its simplest form it’s as easy as adding a print statement to your code. This makes logs really good at providing lots and lots of context-rich data about what a component is doing or experiencing.

This free-form aspect to logging cuts both ways, however. While it’s possible (and often tempting) to output whatever you think might be useful, the verbose, unstructured logs are difficult to extract usable information from, especially at scale. To get the most out of logging, events should be structured, and that structure doesn’t come for free. It has to be intentionally considered and implemented.

Another, particularly underappreciated, pitfall of logging is that generating a lot of events puts significant pressure on disk and/or network I/O. It’s not unusual for half or more of available bandwidth to be consumed this way. What’s more, this pressure tends to scale linearly with load: N users each doing M things translates to N*M log events being emitted, with potentially disastrous consequences for scalability.

Finally, for logs to be meaningfully useful, they have to be processed and stored in a way that makes them accessible. Anybody who’s ever had to manage logs at scale can tell you that it’s notoriously operationally burdensome to self-manage and self-host, and absurdly expensive to have somebody else manage and host.

In the remainder of this section, we’ll first discuss some high-level practices for logging at scale, followed by how to implement them in Go.

Better Logging Practices

As simple as the act of logging may seem on the face of it, it’s also really easy to log in a way that makes life harder for you and anybody who has to use your logs after you. Logging issues that are merely annoying in small deployments, like having to navigate unstructured logs or higher-than-expected resource consumption, can become major roadblocks at scale.

As you’ll see, for this reason and others, the best practices around logging tend to focus on maximizing the quality and minimizing the quantity of logging data generated and retained.

Warning

It goes without saying that you shouldn’t log sensitive business data or personally identifiable information.

Treat logs as streams of events

How many times have you looked at log output and been confronted with an inscrutable stream of consciousness? How useful was it? Better than nothing, maybe, but probably not by much.

Logs shouldn’t be treated as data sinks to be written to and forgotten until something is literally on fire, and they definitely shouldn’t be a garbage dump where you send random thoughts and observations.

Instead, as we saw back in Chapter 6, logs should be treated as a stream of events and should be written, unbuffered, directly to stdout and stderr. Though seemingly simple (and perhaps somewhat counterintuitive), this small change in perspective provides a great deal of freedom.

By moving the responsibility for log management out of the application code, we free the application from concerns about implementation trivialities like the routing or storage of its log events, and allow the executor to decide what happens to them.

This approach provides quite a lot of freedom for how you manage and consume your logs. In development, you can keep an eye on your service’s behavior by sending them directly to a local terminal. In production, the execution environment can capture and redirect log events to a log indexing system like ELK or Splunk for review and analysis, or perhaps a data warehouse for long-term storage.

Treat logs as streams of events, and write each event, unbuffered, directly to stdout and stderr.
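
As a minimal illustration (using the standard log package, which we’ll cover shortly), pointing log output at stdout is a one-liner; everything downstream of that is the environment’s concern:

package main

import (
    "log"
    "os"
)

func main() {
    // Write log events directly to stdout; let the execution environment
    // decide where they go from there.
    log.SetOutput(os.Stdout)

    log.Println("service started")
}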

Structure events for parsing

Logging, in its simplest and most primitive form, is technically possible using nothing more than fmt.Println statements. The result, however, would be a set of unformatted strings of questionable utility.

Fortunately, it’s more common for programmers to use Go’s standard log library, which is conveniently located and easy to use, and generates helpful timestamps. But how useful would a terabyte or so of log events formatted like the following be?

2024/01/09 02:15:10AM User 12345: GET /help in 23ms
2024/01/09 02:15:11AM Database error: connection reset by peer

Certainly, it’s better than nothing, but you’re still confronted with a mostly unstructured string, albeit an unstructured string with a timestamp. You still have to parse the arbitrary text to extract the meaningful bits.

Compare that to the equivalent messages output by a structured logger:16

{"time":1604888110, "level":"info", "method":"GET", "path":"/help",
        "duration":23, "message":"Access"}
{"time":1604888111, "level":"error", "error":"connection reset by peer",
        "database":"user", "message":"Database error"}

This log structure places all of the key elements into the properties of a JSON object, including:

time

A timestamp, which is a piece of contextual information that’s critical for tracking and correlating issues. Note that the JSON example is also in an easily parsable format that’s far less computationally expensive to extract meaning from than the first, barely structured example. When you’re processing billions of log events, little things add up.

level

A log level, which is a label that indicates the level of importance for the log event. Frequently used levels include INFO, WARN, and ERROR. These are also key for filtering out low-priority messages that might not be relevant in production.

One or more contextual elements

These contain background information that provides insight into the state of the application at the time of the message. The entire point of a log event is to express this context information.

In short, the structured log form is easier, faster, and cheaper to extract meaning from, and the results are far easier to search, filter, and aggregate.

Structure your logs for parsing by computers, not for reading by humans.

Less is (way) more

Logging isn’t free. In fact, it’s quite expensive.

Imagine you have a service deployed to a server running in AWS. Nothing fancy, just a standard server with a standard, general-purpose disk capable of a sustained throughput of 16 MiB/second.

Let’s say that your service likes to be thorough, so it fastidiously logs events acknowledging each request, response, database call, calculation status, and various other bits of information, totaling sixteen 1,024-byte events for each request the service handles. It’s a little verbose, but nothing too unusual so far.

But this adds up. In a scenario in which the service handles 512 requests per second—a perfectly reasonable number for a highly concurrent service—your service would produce 8,192 events per second. At 1 KiB per event, that’s a total of 8 MiB/second of log events, or half of your disk’s I/O capacity. That’s quite a burden.

What if we skip writing to disk and forward events straight to a log-hosting service? Well, the bad news is that we then have to transfer and store our logs, and that gets expensive. If you’re sending the data across the internet to a log provider like Splunk or Datadog, you’ll have to pay your cloud provider a data transfer fee. For AWS, this amounts to US $0.10/GB, which at an average rate of 8 MiB/second—about 725 GB every day—comes to more than $25,000/year for a single instance. Fifty such instances would run more than one million dollars per year in data transfer costs alone.

Obviously, this example doesn’t take into account fluctuations in load due to hour of day or day of week. However, it clearly illustrates that logging can get very expensive very quickly, so log only what’s useful, and be sure to limit log generation in production by using severity thresholds.

Dynamic sampling

Because the kind of events produced by debug logging tend to be both high-volume and low-fidelity, it’s pretty standard practice to eliminate them from production output by setting the log level to WARNING. But debug logs aren’t worthless, are they?17 As it turns out, they become really useful really fast when you’re trying to chase down the root cause of an outage, which means you have to waste precious incident time turning debug logs on just long enough for you to find the problem. Oh, and don’t forget to turn them off afterward.

However, by dynamically sampling your logs—recording some proportion of events and dropping the rest—you can still have your debug logs—but not too many—available in production, which can help drive down the time to recovery during an incident.

Having some debug logs in production can be really useful when things are on fire.
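
To make the idea concrete, the following is a minimal sketch of one way to sample debug events using the log/slog package (covered later in this chapter). The samplingHandler type here is purely hypothetical (it isn’t part of the standard library), and a production implementation would likely sample more deterministically than this:

package main

import (
    "context"
    "log/slog"
    "math/rand"
    "os"
)

// samplingHandler is a hypothetical slog.Handler wrapper: records at or
// above min always pass through; records below it pass with probability p.
type samplingHandler struct {
    inner slog.Handler
    min   slog.Level
    p     float64
}

func (h samplingHandler) Enabled(ctx context.Context, l slog.Level) bool {
    return h.inner.Enabled(ctx, l)
}

func (h samplingHandler) Handle(ctx context.Context, r slog.Record) error {
    if r.Level < h.min && rand.Float64() > h.p {
        return nil // drop this record
    }
    return h.inner.Handle(ctx, r)
}

func (h samplingHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
    return samplingHandler{h.inner.WithAttrs(attrs), h.min, h.p}
}

func (h samplingHandler) WithGroup(name string) slog.Handler {
    return samplingHandler{h.inner.WithGroup(name), h.min, h.p}
}

func main() {
    // Keep roughly 10% of debug events; info and above always pass.
    inner := slog.NewJSONHandler(os.Stdout,
        &slog.HandlerOptions{Level: slog.LevelDebug})
    slog.SetDefault(slog.New(samplingHandler{inner, slog.LevelInfo, 0.1}))

    for i := 0; i < 100; i++ {
        slog.Debug("cache lookup", "iteration", i) // roughly 10 of these are emitted
    }
    slog.Info("done") // always emitted
}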

Logging with Go’s Standard log Package

Go includes a standard logging package, appropriately named log, that provides some basic logging features. While it’s very bare bones, it still has just about everything you need to put together a basic logging strategy.

Its most basic functions can be leveraged with a selection of functions similar to the various fmt print functions you should already be familiar with:

func Print(v ...any)
func Printf(format string, v ...any)
func Println(v ...any)

You may have noticed what is perhaps the most glaring omission from the log package: it doesn’t support logging levels. However, what it lacks in functionality, it makes up for in simplicity and ease of use.

Here’s the most basic logging example:

package main

import "log"

func main() {
    log.Print("Hello, world!")
}

When run, it provides the following output:

$ go run .
2024/01/10 09:15:39 Hello, world!

As you can see, the log.Print function—like all of the log logging functions—adds a timestamp to its messages without any additional configuration.

The special logging functions

Although log sadly doesn’t support log levels, it does offer some other interesting features. Namely, a class of convenience functions that couple outputting log events with another useful action.

The first of these is the family of log.Fatal functions. There are three of them, each corresponding to a different log.PrintX function, and each equivalent to calling its corresponding print function followed by a call to os.Exit(1):

func Fatal(v ...any)
func Fatalf(format string, v ...any)
func Fatalln(v ...any)

Similarly, log offers a series of log.Panic functions, each of which is equivalent to calling its corresponding log.PrintX function followed by a call to panic:

func Panic(v ...any)
func Panicf(format string, v ...any)
func Panicln(v ...any)

Both sets of functions are useful, but they’re used far less often than the log.Print functions, typically in error handling when it makes sense to report the error and halt.
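
For example, log.Fatalf is a natural fit when a startup failure leaves nothing sensible to do but report the error and exit (the filename here is just an illustration):

f, err := os.Open("config.yaml")
if err != nil {
    // Report the error and exit with status 1.
    log.Fatalf("unable to open configuration file: %v", err)
}
defer f.Close()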

Log flags

The log package also allows you to use constants to enrich log messages with additional context information, such as the filename, line number, date, and time.

For example, adding the following line to our previous Hello, world!:

log.SetFlags(log.Ldate | log.Ltime | log.Lshortfile)

will result in a log output like the following:

2024/01/10 10:14:36 main.go:7: Hello, world!

As you can see, it includes the date in the local time zone (log.Ldate), the time in the local time zone (log.Ltime), and the final file name element and line number of the log call (log.Lshortfile).

We don’t get any say over the order in which the log parts appear or the format in which they are presented, but if you want that kind of flexibility, you probably want to use a more advanced logging framework, such as Zap.

Structured Logging with the log/slog Package

The log/slog package was added to Go v1.21 to address a long-standing demand for structured logging support in the standard library. Before this quite significant update, structured logging was (and to a large degree still is) provided by a variety of third-party packages. While this did the job well enough, large programs would often find themselves importing more than one logging package through their various dependencies, making consistent logging behavior a challenge.

In addition to structure and consistency, slog brings with it other long-awaited improvements to standard logging, including support for severity levels and high-performance logging.

Using the slog package

In its simplest, most basic form, using log/slog isn’t much different from using the log package, except that instead of a Print function, you instead use a level-specific output function (Info, in this example):

package main

import "log/slog"

func main() {
    slog.Info("Hello, world!")
}

Running this code produces a line much like the following:

$ go run .
2024/01/10 14:03:19 INFO Hello, world!

This is similar to the one we got from the log package in “Logging with Go’s Standard log Package”, except for that INFO string. That’s new.

As you’ve probably already inferred, INFO indicates the level, or severity, of the log event. We’ll talk about log levels more in the next section.

Logging levels

The level of a log event is really just an integer value that represents its importance or severity. The higher the level, the more severe (and attention-worthy) the event.

For convenience, the slog package defines constants and corresponding top-level functions for the most common levels: Debug, Info, Warn, and Error, all of which are illustrated in Table 11-2.

Table 11-2. The standard logging levels and their associated constants and functions
Log level   Constant          Int value   Output function
Debug       slog.LevelDebug   -4          slog.Debug(string, ...any)
Info        slog.LevelInfo    0           slog.Info(string, ...any)
Warn        slog.LevelWarn    4           slog.Warn(string, ...any)
Error       slog.LevelError   8           slog.Error(string, ...any)

As you can see, the standard levels are conveniently spaced out a bit to allow you to easily create levels that fall between them. For example, the value of LevelInfo is 0 and LevelWarn is 4, so if you want to generate an intermediate-severity event with a log level of 2, you can do that by calling the general slog.Log function:

slog.Log(context.TODO(), 2, "Hello, world!")

This line results in something like the following:

2024/01/10 14:03:19 INFO+2 Hello, world!

Pretty neat, right? Note how the default log format describes the severity relative to the standard level immediately below it.

The slog.Logger type

The slog package defines a Logger type, which also provides convenient output methods for reporting events. These methods share their names with the top-level output functions described previously.

If we want, we can retrieve the default logger explicitly and call its methods:

logger := slog.Default()
logger.Info("Hello, world!")

Since the package-level output functions actually just call through to the corresponding methods on the default Logger, this is functionally equivalent to calling slog.Info.

Attributes

In addition to the message string, the various output functions also accept zero or more arguments that can be used to specify the event’s attributes: the key-value pairs that are the core of structured logging.

There are two approaches that you can use to specify log event attributes in slog.

Alternating keys and values

The first approach is to include a list of alternating keys and values after the log message (key, value, key, value, …​). For example:

slog.Info("Hello", "number", 3)

This will produce an event containing a timestamp, a level of Info, the message “Hello,” and one attribute pair with the key “number” and value 3:

2024/01/10 23:00:00 INFO Hello number=3

There are a few trade-offs to this approach relative to the Attr alternative we discuss in the next section.

On the plus side, it requires less unsightly boilerplate than the Attr approach, which is nice. On the other hand, you have to keep in mind that keys must always be string values (though values can have any type), and this approach is slightly less efficient because it requires more memory allocation and reflection.

slog.Attr values

In the second approach, event attributes are represented by one or more slog.Attr values.

An Attr is a type that contains a single key-value pair. Its key is a standard string, but its value is implemented using a special Value type that can hold any Go value without additional memory allocation.

The slog package includes a number of convenient functions for creating Attr values of different types, such as Int, String, and Bool. There’s also a slog.Any function, which allows you to construct Attr values of any type.

The following is functionally identical to the alternating keys and values example:

slog.Info("Hello", slog.Int("number", 3))

This approach is meant to keep reflection and memory allocation to a minimum, making it the more performant of the two approaches. However, it also requires a bit of extra effort and boilerplate, so it should be used mostly where performance is a concern.

Tip

You can actually mix and match these approaches, if you want, with both alternating keys and values and Attr values in the same call.

Go ahead. Nobody can stop you.
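
For example, the following call (with arbitrary values) mixes an alternating key/value pair with an explicit Attr:

// One attribute as a key/value pair, and one as a slog.Attr.
slog.Info("Hello", "number", 3, slog.String("source", "example"))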

Common attributes

At some point you’ll likely find yourself with attributes that are common to many log calls, such as a request URL or current trace ID. One option, of course, is to include the attribute in every log call.

However, there’s a better way: by using the Logger.With method to construct a new Logger that will automatically include the attribute in every record:

logger := slog.Default()
logger2 := logger.With("url", r.URL)

The arguments to Logger.With work in the same way as the key-value pairs in Logger.Info: it accepts zero or more attributes in the form of alternating keys and values and/or Attr values.

Output formatting with handlers

Every Logger is associated with a Handler whose job it is to process log records—struct values containing a timestamp, level, message, and any attributes—generated by the Logger.

It’s the Handler that decides whether a record is handled, where it’s sent if it is, and in what format. A particular Handler might print log records to standard error, or write them to a file or database. It can even choose to augment them with additional attributes and pass them on to another handler.

The default handler implementation is the TextHandler, which we’ve used to generate all of the examples in this section so far:

2024/01/10 23:00:00 INFO Hello number=3

This is nice and human-readable and all, but what if we want to output our log records in something more machine-readable?

In this case, you can create a new logger with a custom handler by using the slog.New(h Handler) function. For example, we could use it to install the built-in JSONHandler:

logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
logger.Info("Hello, world!", "number", 3)

Now our output is a sequence of JSON objects, one per logging call:

{"time":"2023-11-10T23:00:00Z","level":"INFO","msg":"Hello, world!","number":3}

This makes it quite convenient for you to generate your own custom handlers to provide all manner of functionalities. We’re not going to go into the specifics for that here, but if you’re interested in seeing what it takes to satisfy the Handler interface, you need only look at the log/slog documentation.

OpenTelemetry Logging

At the time of this writing, there isn’t all that much to say about the state of OpenTelemetry logging in Go, except that while the OpenTelemetry Logging specification is considered to be mature (and is a pretty good read, at that), no implementation for Go exists yet. Hopefully sometime in 2025, we’ll see an initial release.

The overall design for logging in OpenTelemetry acknowledges that most languages already include robust logging solutions, so rather than create a whole new one from scratch, OpenTelemetry logging is expected to integrate fully with existing implementations as appenders for existing APIs.

Why would you use OpenTelemetry logging at all? The big benefit is correlation between different telemetry types, such as logs that are automatically associated with the current trace. It may seem like a minor perk, but this has the potential to considerably enhance a system’s observability by making it possible to move seamlessly between different data types, which can be hugely advantageous when trying to chase down an issue.

Summary

There’s a lot of hype around observability, and with its promises to dramatically shorten development feedback loops and generally make complexity manageable again, it’s easy to see why.

I wrote a little at the start of this chapter about observability and its promises, and a little more about how observability isn’t done. How to do observability, however, is a really, really big subject, and unfortunately I wasn’t able to say as much about it as I certainly would have liked.18 Fortunately, with some pretty great books available (most notably Observability Engineering by Charity Majors, Liz Fong-Jones, and George Miranda [O’Reilly]), that void won’t go unfilled for long.

By far, however, most of this chapter was spent talking about the three pillars of observability in turn, specifically how to implement them using OpenTelemetry, where possible.

All told, this was a challenging chapter. Observability is a vast subject about which not that much is written yet, and, as a result of its newness, the same is true of OpenTelemetry. Even its own documentation is limited and spotty in parts. On the plus side, I got to spend a lot of time in the source code.

1 Clifford Stoll, High-Tech Heretic: Reflections of a Computer Contrarian (Random House, 2000).

2 Interestingly, this was also just after AWS launched its Lambda functions as a service (FaaS) offering. Coincidence? Maybe.

3 Assuming, of course, that all of our network and platform configurations are correct!

4 I’m not one of the cool kids, but I still call it that anyway.

5 In addition to Go, implementations exist for Python, Java, JavaScript, C#/.NET, C++, Rust, PHP, Erlang/Elixir, Ruby, and Swift.

6 If you’ve never seen Charity Majors’ blog, I recommend that you check it out immediately. It’s one part genius plus one part experience, tied together with rainbows, cartoon unicorns, and a generous helping of rude language.

7 Formerly known as Lightstep.

8 Benjamin H. Sigelman et al., “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure”, Google Technical Report, April 2010.

9 Which you probably do.

10 Recall that the name “mux” is short for “HTTP request multiplexer.”

11 That wins the record for longest package name, at least in this book.

12 It can also refer to the numerical relationship between two database tables (i.e., one-to-one, one-to-many, or many-to-many), but that definition is arguably less relevant here.

13 Whatever “better” means.

14 Oliver Kiran, “Exploring Prometheus Use Cases with Brian Brazil”, The New Stack Makers, October 30, 2016.

15 I’m quite fond of the technical term for such a value: monotonically increasing. Pure poetry.

16 Any wrapping in the example is for the benefit of formatting for presentation only. Don’t use line breaks in your log events if you can help it.

17 If they are, why are you producing them at all?

18 This is a Go book, after all. At least that’s what I keep telling my incredibly patient editors.

Chapter 12. Security

No technology that’s connected to the Internet is unhackable.1

Abhijit Naskar, The Gospel of Technology

It can’t be overstated: security is everyone’s responsibility.

Too often, security is a distant afterthought in the software development process. Or worse, it’s treated as the exclusive job of a dedicated2 security person or team to worry about.

This is a bad idea for a bunch of reasons, not the least of which is that it keeps security personnel isolated from the software development process, virtually guaranteeing that vulnerabilities in the software won’t be addressed until late in the development cycle. It also discourages developers from thinking about security practices, making it far more likely that they’ll introduce security flaws into their code.

If you take one thing away from this chapter, take this: producing a safe, secure product is the job of everybody involved in its construction.

In this chapter, we’ll explore a variety of techniques, ranging from simple to complex, for doing exactly that. This will include topics such as authentication, authorization, access control, data protection, and encryption. It will also discuss security best practices for Go, emphasizing input sanitization, validation, defensive programming, and a variety of techniques for reducing complexity.

Go: Secure by Design

Like any language, Go is only as secure as the way it’s used, but it does boast a number of features designed to reduce or eliminate certain classes of vulnerabilities. These include:

Safe memory management

Go’s garbage collection mechanism eliminates an entire class of vulnerabilities that are common in languages where manual memory management is required. So common, in fact, that the NSA placed Go on its list of languages recommended for their memory safety, alongside C#, Java, Python, Rust, and Swift.

Strong type system

Go’s static type system and bounds-checked slices minimize common vulnerabilities like buffer overflows and the type confusion errors that plague dynamically typed languages.

Simplicity and readability

Go is designed for simplicity and readability, which is further reinforced by a culture that actively discourages magic and cleverness. This clarity reduces the likelihood of security flaws going unnoticed.

Concurrency model

Go’s concurrency model helps developers avoid common concurrency issues like race conditions that can be exploited in security attacks.

Standard library and cryptography

Go’s standard library includes well-implemented cryptographic packages, such as crypto/tls, that provide strong cryptographic primitives. These libraries are vetted by the community and are continually refined to address new security vulnerabilities and to adhere to best practices.

Dependency management and module system

The introduction of Go Modules provided a reliable system for managing dependencies, reducing the risk of introducing security vulnerabilities via outdated or compromised dependencies.

Built-in security tools

Go includes built-in tooling like go vet, and the wider ecosystem provides security-focused analyzers such as gosec and govulncheck, all of which can aid in identifying security vulnerabilities early in the development cycle.

Go’s design and features offer a strong foundation for building secure cloud native applications. By prioritizing type safety, simplicity, robust concurrency handling, secure libraries, effective dependency management, and community support, Go provides developers with the tools and practices necessary for implementing security by design.

Common Vulnerabilities

Among cloud native services, the same common security vulnerabilities tend to crop up again and again. So much so that the OWASP Foundation, a well-known software security community, authored (and occasionally updates) the OWASP Top 10, a popular awareness document for developers and web application security specialists that attempts to classify and quantify the most critical of these security vulnerabilities faced by networked services.

In this section, I’ll do my best to provide a brief overview—and examples where possible—of some of the most critical security vulnerabilities in web applications.

Injection

Injection represents a broad category of vulnerabilities that an attacker can exploit to inject malicious data into a program, causing it to execute unintended commands or access unauthorized data.

Common forms of injection include SQL injections that are designed to manipulate database queries, and cross-site scripting (XSS) attacks that can inject scripts into web pages. In this section, we’ll discuss both of these in more detail.

Vulnerabilities of this sort are extremely common. So common, in fact, that the OWASP Foundation ranks injection (broadly defined) as third in its list of top 10 critical vulnerabilities.

Generally speaking, the best defense against injection attacks is to carefully manage your inputs and outputs, particularly when it includes user-generated data. Techniques like input validation, which we’ll cover in detail in “Input Validation”; input sanitization, which we’ll cover in “Input Sanitization”; and output encoding, which we’ll discuss in “Output Encoding” are key.

SQL injection

The flagship of injection attacks is SQL injection, which is so terrifying that it inspired what is arguably one of the most popular webcomics of all time.3

SQL injection vulnerabilities arise when string concatenation is used to construct a SQL query using improperly treated input that may include hazardous characters with special meaning to the database management system.

To illustrate, imagine you have some code like the following:

func authenticateUser(username, password string, db *sql.DB) (bool, error) {
    // Vulnerable SQL query
    query := "SELECT * FROM users WHERE username = '%s' AND password = '%s'"

    // Executing the SQL query
    results, err := db.Query(fmt.Sprintf(query, username, password))
    if err != nil {
        return false, fmt.Errorf("error executing query: %w", err)
    }
    defer results.Close()

    // Check if any rows returned
    return results.Next(), nil
}

In this example, the authenticateUser function constructs a SQL query string by directly embedding the username and password values into the statement.

Seems straightforward enough, right? Just give it a username and password and get a true if the user exists. But what if username becomes admin' --? In that case, your query will look like this:

SELECT * FROM users WHERE username = 'admin' --' AND password = 'foo'

Since the -- in SQL comments out the rest of the line, this effectively bypasses the password check and allows unauthorized access if admin is a valid username.

Perhaps more frighteningly, this attack can even be used to change data. For example, if the value of username was instead something like '; UPDATE users SET password='' --, then the resulting query would set the passwords of all of the users to an empty string.

The defense against this attack is to properly sanitize your inputs, which we’ll touch on in “Input Sanitization”, and especially to encode your query arguments by using parameterized queries, which we’ll cover in “Parameterized Queries”.
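As a preview, here’s a minimal sketch of the same function rewritten to use a parameterized query. Note that the placeholder syntax varies by driver: ? is typical for MySQL and SQLite, while PostgreSQL drivers use $1, $2, and so on:

func authenticateUser(username, password string, db *sql.DB) (bool, error) {
    // The driver encodes the arguments safely; they can never be
    // interpreted as SQL.
    results, err := db.Query(
        "SELECT * FROM users WHERE username = ? AND password = ?",
        username, password)
    if err != nil {
        return false, fmt.Errorf("error executing query: %w", err)
    }
    defer results.Close()

    // Check if any rows returned
    return results.Next(), nil
}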

Cross-site scripting (XSS)

In a cross-site scripting (XSS) attack, a malicious actor injects scripts into content that’s delivered to a user’s browser. These scripts then execute in the victim’s browser under the domain of the web application, which can lead to unauthorized access to session cookies, manipulation of content, or redirection to malicious websites.

Your service is vulnerable to XSS attacks if it displays untrusted input that isn’t properly validated and sanitized before ingestion or properly encoded prior to output to a user.

Take, for example, the following code snippet that uses the net/http and io packages to build a trivial example:

package main

import "net/http"
import "io"

func echoHandler(w http.ResponseWriter, r *http.Request) {
    io.WriteString(w, r.URL.Query().Get("param"))
}

func main() {
    http.HandleFunc("/", echoHandler)
    http.ListenAndServe(":8000", nil)
}

This code starts a basic HTTP server that listens on port 8000. The echoHandler function echoes back the contents of the param parameter. While this code is probably far simpler than anything you’d put into production, it demonstrates the principle of XSS quite effectively.

Try starting the server and browsing to the following URL: http://localhost:8000/?param=%3Cbody%20onload=alert(%27Gotcha!%27)%3E.

As you may be able to tell, the contents of the param value are a URL-encoded string: <body onload=alert('Gotcha!')>. This example injects only a harmless popup, but the same principle can be used for much more harmful scripts as well.

As is the case with other injection-style attacks, the appropriate defense against XSS is some combination of input validation and sanitization and output encoding. We’ll discuss all of this in some detail later in this chapter.

Tip

If you’re generating a significant amount of HTML, particularly if it’s constructed from user-generated content, consider using the html/template package, which is explicitly hardened against code injection.
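
To illustrate, here’s a minimal sketch of the echo handler from above rewritten with html/template. The template automatically HTML-escapes the untrusted parameter, so any injected markup is rendered as harmless text:

package main

import (
    "html/template"
    "net/http"
)

// The template itself is illustrative; html/template escapes {{.}}
// automatically before it’s written to the response.
var tmpl = template.Must(template.New("echo").Parse("<p>You sent: {{.}}</p>"))

func echoHandler(w http.ResponseWriter, r *http.Request) {
    tmpl.Execute(w, r.URL.Query().Get("param"))
}

func main() {
    http.HandleFunc("/", echoHandler)
    http.ListenAndServe(":8000", nil)
}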

Broken Access Control

Access control is a broadly defined term for any scheme that enforces policy such that users cannot act outside of their intended permissions. In a modern system it might refer to something like role-based access control (RBAC), in which users are assigned roles that are consistently checked and enforced, but it could just as easily be applied on a simple file server using rudimentary file and directory permissions.

Access control is famously easy to misconfigure or to just do poorly, which helps to explain why it’s the number one vulnerability in the most recent edition of the OWASP Top Ten.

For example, a control may be inadequately enforced in such a way that users can do things they shouldn’t be able to, access to a particular URL may not be properly restricted, or incorrectly configured file permissions might allow unauthorized users to access sensitive files.

Mitigation of this kind of vulnerability requires the careful and consistent application of robust authentication and authorization mechanisms that enforce strict access controls, based on user roles and permissions.

A common attack vector made possible by broken access control is the humble URL, using what’s called a path traversal (or directory traversal) attack. An attacker applying this technique will attempt to manipulate references to files using ../ (“dot-dot-slash”) or similar sequences with the goal of accessing files and directories that are stored outside the web root folder, including application source code or configuration and critical system files. This can lead to information disclosure, data corruption, or other negative outcomes.

For example, consider the following URL that can be used to retrieve a file:

http://example.com/get-file?file=report.pdf

If the service backing this request doesn’t properly validate and sanitize its input, a bad actor could decide to do something like the following:

http://example.com/get-file?file=../../../../etc/passwd

Such an attack is best stopped during input validation, by explicitly rejecting any paths that could escape the intended directory. This often means rejecting input that contains .. or ~, or that starts with /.
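
For example, here’s a minimal sketch of such a check (the validatePath helper is illustrative, not from the standard library); it leans on filepath.IsLocal, available since Go 1.20, to detect .., absolute paths, and similar escapes:

package main

import (
    "fmt"
    "path/filepath"
)

// validatePath rejects any requested file name that could escape the base
// directory, then joins the name onto the base.
func validatePath(base, requested string) (string, error) {
    if !filepath.IsLocal(requested) {
        return "", fmt.Errorf("rejected suspicious path: %q", requested)
    }

    return filepath.Join(base, requested), nil
}

func main() {
    for _, p := range []string{"report.pdf", "../../../../etc/passwd"} {
        fmt.Println(validatePath("/var/www/myapp/files", p))
    }
}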

Barring that, it’s best to lean on trusted functions that inherently manage path traversal issues. For example, the http.Dir type automatically prevents path traversal, making it a good choice for serving static files. Coupled with the http.FileServer function, it’s reasonably straightforward to construct a simple file server:

package main

import "net/http"

func main() {
    dir := http.Dir("/var/www/myapp/files/")    // Define the allowed path

    fs := http.FileServer(dir)                  // Get the file server handler

    http.Handle("/", fs)                        // Register the handler

    http.ListenAndServe(":8080", nil)
}

As you may recall from “Building an HTTP Server with net/http”, the http.Handle function accepts an implementation of the http.Handler interface to respond to requests to a particular path, / in this case. In the example, we use the http.FileServer function, which returns a specialized http.Handler that serves HTTP requests with the contents of the filesystem rooted at the specified path.

Cryptographic Failures

Cryptographic failures are vulnerabilities that arise whenever sensitive data—passwords, credit card numbers, business secrets, etc.—isn’t correctly protected through cryptography.

When designing a system that works with any kind of sensitive data,4 you’ll want to determine very early how your data should be protected, both in transit and at rest. This is particularly true if that data is subject to privacy laws or regulations, such as the EU’s General Data Protection Regulation (GDPR) or PCI Data Security Standard (PCI DSS).

This includes, but certainly isn’t limited to:

  • Ensuring that any transmitted information—especially if it’s traversing the internet—is protected with Transport Layer Security (TLS), which we’ll cover in “HTTP Over TLS”.

  • Hashing stored passwords with a strong (and nondeprecated) cryptographic hash function, as we discuss in “Hashing”.

  • Using strong crypto keys and employing proper key management and rotation practices.

There are many, many more encryption best practices, but unfortunately we have only so much space. These principles and more will be covered in “Cryptographic Practices”.

Handling Untrusted Input

By their nature, cloud native systems depend on the free flow of data across networks and service boundaries from all manner of sources. It’s at the very core of what it means to be cloud native. But what happens when that data isn’t structured in the expected way? Or worse, what if it’s been designed to cause harm to your system or its users?

Left unchecked, any input source can be a vector for a variety of attacks, including but not limited to, injections that can manipulate data or output, buffer overflows that attempt to overwrite memory areas, and denials of service that aim to overwhelm a system with bogus requests.

To mitigate such vulnerabilities, all data provided to your application by an untrusted source should be examined for correctness (input validation) and any suspect content filtered out or neutralized (input sanitization).

In this section, we’ll review a few common ways of doing this in Go, as well as a few useful rules of thumb to keep in mind regardless of language.

To Trust, or Not to Trust

Input validation and sanitization can be a somewhat burdensome (and expensive) process, so naturally you’ll want to determine exactly what kind of inputs can be trusted and accepted and what kinds should be untrusted and carefully validated.

In a nutshell, trusted data is that which is assumed to be safe and reliable. Typically it’s under the control of the application or originates from a secure, known source, like internal services or databases, authenticated APIs, or services that maintain a strict compliance with security standards.

Conversely, untrusted data comes from sources that haven’t been verified or are outside the organization’s security perimeter. These might include, but certainly aren’t limited to, user input, data received from the internet, or inputs from external systems and services that don’t have a clear, vetted security policy.

Input Validation

During input validation, data from untrusted sources is checked against a set of conditions to ensure that it’ll behave as expected, without any unintended side effects.

Validation ensures that the incoming data meets the specific criteria or rules defined for acceptable input. This could include checks for the correct format, type, length, range, or structure. By validating data, you ensure that only the appropriate and expected types of data are processed further in your application. For example, if a field expects a date, validation checks that the input matches the expected date format before any additional processing occurs.

It may be tempting to skip validation assuming that sanitization alone will be sufficient, but not every threat can be sanitized away. While sanitization modifies the input to prevent potential threats, validation confirms the legitimacy and appropriateness of the data before it even reaches the point where threats need to be neutralized. You really do need both steps.

Input Validation Rules of Thumb

The OWASP Secure Coding Practices Quick Reference Guide is a handy checklist for input validation. We’ll cover most of the points in detail in the following sections, but there are a few more general ones that more or less always make sense:

Always validate data from untrusted sources

This was already hinted at in “To Trust, or Not to Trust”, but it’s worth spelling out: always identify all of your data sources (databases, file streams, user input,5 etc.) and classify them as either “trusted” or “untrusted.” Data from all untrusted sources should be carefully validated and sanitized.

Conduct all data validation on a trusted system

All input validation should be performed on the server side (typically by the service itself) where the processing can’t be easily manipulated by a bad actor.

Use a centralized input validation routine

Ideally, there should be a single, centralized input validation routine. This ensures that the validation logic is consistent throughout the entire application and is easier to maintain and test.

All validation failures should result in input rejection

If input fails validation, it should be immediately rejected—don’t try to correct it. This is important not only from a security standpoint but from the perspective of data consistency and integrity as well, since data is often used across multiple systems and applications.

In other words, trust nothing and no one.

These principles of input validation are foundational to securing applications and ensuring data integrity across systems. Implementing them diligently will significantly reduce the risk of security breaches and data corruption.
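
To make that last point concrete, here’s a minimal sketch of what a centralized routine might look like for a hypothetical comment-posting service. It combines a simple length limit with the UTF-8 check we’ll see shortly, and every handler that accepts a comment calls it before doing anything else:

package main

import (
    "fmt"
    "unicode/utf8"
)

const maxCommentLength = 1024

// ValidateComment is the single place where the comment input rules live.
func ValidateComment(s string) error {
    if !utf8.ValidString(s) {
        return fmt.Errorf("comment is not valid UTF-8")
    }

    if utf8.RuneCountInString(s) > maxCommentLength {
        return fmt.Errorf("comment exceeds %d characters", maxCommentLength)
    }

    return nil
}

func main() {
    fmt.Println(ValidateComment("Looks good to me!"))              // <nil>
    fmt.Println(ValidateComment(string([]byte{0xfd, 0xfe, 0xff}))) // not valid UTF-8
}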

String encoding

Since Go source code is always UTF-8, and Go string literals (absent any funny business with byte-level escapes) always hold valid UTF-8 sequences, it’s easy to forget that this doesn’t always have to be the case.

As we mentioned in “Strings”, a Go string holds arbitrary bytes; there’s absolutely no guarantee that it’ll be formatted in UTF-8, Unicode, or any other predefined format. A string is just a byte slice with a little magic sprinkled on it, and like any magic should be treated with suspicion.

Fortunately, the standard library’s unicode/utf8 package makes it quite straightforward to validate a string’s character encoding using the utf8.ValidString(string) or utf8.Valid([]byte) functions:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    valid := "äǒů"
    invalid := string([]byte{0xfd, 0xfe, 0xff})

    fmt.Println(valid, utf8.ValidString(valid))
    fmt.Println(invalid, utf8.ValidString(invalid))
}

In this code, we use the utf8.ValidString function to test two strings: the valid (if non-ASCII) string “äǒů”, and another containing some arbitrary bytes. Running it produces something like the following (though the exact output of the invalid string will vary by system):

äǒů true
��� false

In addition to validating UTF-8, the unicode/utf8 package also includes a variety of functions, such as utf8.EncodeRune and utf8.DecodeRune, to translate between runes and UTF-8 byte sequences.

Double encoding and canonicalization

Double encoding is an offensive obfuscation technique where special characters are encoded twice, potentially allowing an attacker to slip malicious data through security checks that decode inputs only once. It’s often used to carry out attacks like XSS and SQL injection.

Consider the character <, which is normally URL-encoded as %3C. If this character is double-encoded, it would first turn into %3C and then into %253C. If a security filter decodes this only once, it turns back into %3C, which it might allow if it’s incorrectly configured to not decode input again.

The counter to this attack is called canonicalization, which is the process of converting data that might have more than one possible representation into a standard, or “canonical,” format. Once data is canonicalized by being decoded to its simplest form, it can be uniformly checked against a standard set of rules.

Applying canonicalization is deceptively straightforward: decode repeatedly, so that data isn’t considered fully decoded until no encoding tokens remain. For example, the string %253C should be decoded not just to %3C but all the way to <. This makes sure that any double-encoded sequences are fully decoded to their true representations.

The following code canonicalizes a string by repeatedly applying the QueryUnescape function from the net/url package until no encoding tokens remain:

package main

import (
    "fmt"
    "net/url"
)

func CanonicalizeInput(in string) (string, error) {
    var err error
    var prev string
    var count int

    for in != prev {
        prev = in

        if count > 10 {
            return "", fmt.Errorf("too many escape layers")
        }

        if in, err = url.QueryUnescape(in); err != nil {
            return "", err
        }
        count++
    }

    return in, nil
}

func main() {
    doubleEncoded := "%253Cfoo%253E" // Double-encoded "<foo>"
    canonicalized, err := CanonicalizeInput(doubleEncoded)
    if err != nil {
        fmt.Println("Error canonicalizing:", err)
        return
    }

    fmt.Println("Canonicalized string:", canonicalized)
}

As you can see, the main function calls CanonicalizeInput, to which it passes our test doubly URL-encoded string. CanonicalizeInput then enters a loop that continues until the string is fully decoded.

Importantly, this implementation also limits itself to 10 layers of escaping: anything more than that is automatically considered invalid. Without this check, the function could itself become a denial-of-service vector, since particularly long or deeply nested inputs could keep the server busy unescaping for a very long time.

Warning

Although we use URL-encoding to illustrate the concepts of double encoding and canonicalization, this section is relevant for all encoding types, such as HTML entities, escaped characters, and even “dot-dot-slash” (../ or ..\) path alteration characters.

Hazardous characters

A hazardous character is any character, or sequence of characters, in a string that could be used to manipulate or harm a system or its data. They’re often used to construct inputs that are specifically designed to exploit vulnerabilities, such as injection attacks, buffer overflows, or other types of security breaches.

Precisely which characters are hazardous depends on the system and its data, but common hazardous characters can include (but aren’t limited to):

  • < > (left and right angle brackets)

  • ( ) (left and right parentheses)

  • " ' (double and single quotes)

  • % (percent symbol)

  • & (ampersand)

  • + (plus symbol)

  • \ (backslash)

  • \0 (the null character)

By far the most effective way to deal with hazardous characters is to construct an allowlist—an exhaustive list of exactly which characters are allowed—and to validate your string against that.

Here we illustrate this concept with a basic allowlist constructed using a map of type map[rune]bool:

package main

import "fmt"

var allowed map[rune]bool

func init() {
    allowed = make(map[rune]bool)

    for r := 'A'; r <= 'Z'; r++ {
        allowed[r] = true
    }
    for r := 'a'; r <= 'z'; r++ {
        allowed[r] = true
    }
    allowed[' '] = true
    allowed[','] = true
}

func ValidateHazardous(str string) error {
    for i, r := range str {
        if !allowed[r] {
            return fmt.Errorf("hazardous character at position %d", i)
        }
    }

    return nil
}

func main() {
    if err := ValidateHazardous("Hello, Ğö"); err != nil {
        fmt.Println(err)
    }
}

In this code, we construct the allowlist allowed by explicitly mapping the allowed runes to true. Obviously, this example specifies only a very small set of allowed characters, but this approach could be expanded to allow an arbitrary number and variety of values.

The ValidateHazardous function accepts a string and loops through its contents, comparing each rune against the allowed map. When it finds a character that isn’t in the map, it returns an error. The main function exercises ValidateHazardous by passing it the string “Hello, Ğö”. It produces the following output, correctly identifying that the Ğ character is invalid:

hazardous character at position 7

Of course, sometimes you’ll still have to accept potentially hazardous characters as input. In these cases, you’ll want to implement additional controls like output encoding, which we’ll discuss in “Output Encoding”, and take care to account for the utilization of that data throughout the application.

Warning

This approach can get tricky fast when internationalization is involved, because a “hazardous character” in one locale may be perfectly valid in another. One option is to use the functions of golang.org/x/text, such as golang.org/x/text/secure/precis, to canonicalize inputs first, and then validate against standard character classes rather than explicit ranges.

Numeric validation

If you expect your input string to be numeric, you may wish to make use of the strconv package, which includes functions to convert from (and to) string representations of basic data types.

The following code uses the strconv.Atoi function6 to convert two strings into integers:

package main

import (
    "fmt"
    "strconv"
)

func main() {
    strs := []string{"42", "puppies"}

    for _, v := range strs {
        i, err := strconv.Atoi(v)
        fmt.Println(i, err)
    }
}

Running this applies strconv.Atoi to each of the two strings: “42” (which is a number) and “puppies” (which are adorable but not a number). The following text is produced:

42 <nil>
0 strconv.Atoi: parsing "puppies": invalid syntax

As expected, “42” is converted correctly, while attempting to convert “puppies” results in an error.

Leveraging regular expressions

Regular expressions (regex) are a powerful tool for input validation, allowing developers to define complex criteria for efficiently checking the format of user input. By describing patterns that strings should match, regex can enforce formats such as email addresses, phone numbers, usernames, or even complex password rules.

They’re also generally regarded as being rather challenging to master, and even a reasonably simple (if poorly made) expression could lead to an inadvertent self-denial-of-service, an aspect of regex that we’ll cover more fully in “Regular Expressions”.
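
For instance, the following sketch validates a username format with the standard regexp package. The pattern here is purely illustrative, not a vetted production rule:

package main

import (
    "fmt"
    "regexp"
)

// usernamePattern allows 3 to 32 characters drawn from letters, digits,
// underscores, and hyphens. Illustrative only.
var usernamePattern = regexp.MustCompile(`^[a-zA-Z0-9_-]{3,32}$`)

func main() {
    for _, s := range []string{"matt_titmus", "admin' --"} {
        fmt.Printf("%q valid: %t\n", s, usernamePattern.MatchString(s))
    }
}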

Tip

The OWASP Validation Regex Repository contains a number of useful and well-tested regex input validation patterns.

Input Sanitization

Input sanitization is the process of cleaning or altering input data to remove or neutralize any potentially harmful characters, scripts, or data formats that could lead to security vulnerabilities or application errors. It’s the next step after validation and canonicalization to ensure that your inputs are as safe as possible.

For example, if your validation logic allows special characters like quotes or angle brackets, they could be sanitized by removing or escaping them, eliminating any possibility of their being used in SQL injection or XSS attacks.

This section describes some common applications of input sanitization.

Converting special characters to HTML entities

In this form of input sanitization, special characters that are meaningful in HTML, such as <, >, or &, are converted into entities (such as < to &lt; or > to &gt;). This is particularly common in web applications where user-generated content is displayed, like forums or comments, where it’s preferable to preserve user input so that it can be displayed properly without risking execution of potentially malicious code.

The standard html library provides functions for escaping and unescaping text. The html.EscapeString function accepts a string and returns the same string with certain special characters escaped into HTML entities, but importantly it escapes only five such characters: <, >, &, ', and ".

Warning

The standard library’s html.EscapeString function escapes only the <, >, &, ', and " characters. Other characters should be encoded manually, or you can use a third-party library that encodes all relevant characters.

The html.UnescapeString function performs the reverse process so that entities like &lt; become <. It unescapes a larger range of entities than EscapeString escapes, such that UnescapeString(EscapeString(s)) == s is always true, but the converse isn’t necessarily true.
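
Here’s a brief demonstration of both functions; the sample input is illustrative:

package main

import (
    "fmt"
    "html"
)

func main() {
    in := `<script>alert("hi")</script>`

    // EscapeString converts the five special characters to entities.
    escaped := html.EscapeString(in)
    fmt.Println(escaped) // &lt;script&gt;alert(&#34;hi&#34;)&lt;/script&gt;

    // UnescapeString reverses the process.
    fmt.Println(html.UnescapeString(escaped) == in) // true
}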

Stripping tags

Depending on your application, it might make more sense to strip all HTML markup from your input to ensure that the text is plain and free from any HTML or script tags. This approach is appropriate in environments where no HTML is supposed to be rendered at all and maintaining the original look of user input isn’t necessary.

Unfortunately, there’s no function in the standard library to strip all HTML tags from text.7 But powerful third-party HTML sanitization libraries exist, such as the microcosm-cc/bluemonday sanitizer, which can accept user-generated content and return sanitized HTML:

package main

import (
    "fmt"
    "github.com/microcosm-cc/bluemonday"
)

func main() {
    // Unsanitized HTML input
    in := `<a onblur="alert(secret)" href="http://oreilly.com">O'Reilly</a>`

    // The strict policy will strip all elements and attributes
    p := bluemonday.StrictPolicy()

    out := p.Sanitize(in)   // Sanitize the input

    fmt.Println(out)        // O'Reilly
}

Bluemonday even supports the definition of an allowlist of approved HTML elements and attributes, including the default UGCPolicy that’s designed to sanitize user-generated content from HTML WYSIWYG tools and markdown conversions. This policy is quite lenient, so I can’t recommend it for most use cases, but it can be useful for web applications where user-generated content is displayed, like those discussed in “Converting special characters to HTML entities”.

Output Encoding

Output encoding refers to the practice of converting data that’s being output into a format that’s safe and appropriate for its recipient to use. This typically involves transforming special characters, control sequences, or any data that could be misinterpreted by the recipient system into formats that prevent unintended execution or rendering.

Output encoding is critical (but often neglected) when outputting data built from user-generated content or other untrusted sources; failing to properly encode output is a common source of injection attacks.

The tricky bit with output encoding, however, is that because you have to encode according to the recipient’s own format, there’s no one output encoding approach. HTML encoding is different from JavaScript encoding, which is different again from SQL encoding.

A few common encoding types are listed here:

HTML encoding

When dynamic content is inserted into HTML, characters like < and > should be encoded to their corresponding entities (&lt; and &gt;). This prevents any tags in the content from being interpreted as actual HTML tags. This is similar to the input sanitization method we covered in “Converting special characters to HTML entities”.

SQL encoding

When incorporating user-provided data into SQL queries, it’s crucial to escape specific characters that could otherwise be misused to alter the SQL command, leading to SQL injection vulnerabilities. We’ll cover how to use parameterized queries to combat this case in “Parameterized Queries”.

URL-encoding

In URLs, characters like spaces, ampersands, and slashes should be URL-encoded so that the URLs are interpreted correctly by web servers and browsers without accidentally delimiting URL parameters or changing the intended path. The standard net/url package includes the PathEscape and QueryEscape functions for this purpose, which escape a string by replacing any special characters with %XX sequences (see the short example after this list).

JavaScript encoding

When inserting content into JavaScript code, characters like quotes and backslashes should be escaped (\", \', \\) to prevent breaking out of strings in scripts and inadvertently executing malicious code.
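
To give a concrete sense of the URL-encoding case mentioned above, here’s a minimal sketch using the net/url functions (the sample value is illustrative):

package main

import (
    "fmt"
    "net/url"
)

func main() {
    value := "100% of docs/reports"

    // QueryEscape encodes a string for safe use as a URL query value.
    fmt.Println(url.QueryEscape(value)) // 100%25+of+docs%2Freports

    // PathEscape encodes a string for safe use as a URL path segment.
    fmt.Println(url.PathEscape(value)) // 100%25%20of%20docs%2Freports
}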

Authentication

Authentication is the process of verifying the identity of a user, device, or other entity that’s trying to access a system. In other words, it aims to determine whether an entity claiming to be X is really X.

If you’ve ever used a password or a PIN, you’ve had to negotiate with an authentication system. But the world of authentication is big and goes far beyond password challenges. In fact, there are numerous approaches to authentication, each with its own pros and cons:

Password-based authentication

Grants access on the basis of a secret, typically a password or passphrase, that’s known only to the user and the system. It’s by far the most common form of authentication due to its familiarity, simplicity, and ease of implementation, but its efficacy depends heavily on users choosing strong passwords and keeping them confidential. We’ll cover password-based authentication in “Password-Based Authentication”.

Token-based authentication

In this system, a user enters their credentials one time and receives a digital token in exchange that then provides temporary access. This provides an additional layer of security, and the stateless nature of this approach makes it ideal for distributed systems where maintaining state is challenging. It does add a fair amount of complexity, though. We’ll cover token-based authentication in “Token-Based Authentication”.

Certificate-based authentication

This uses digital certificates to verify the identity of users, devices, or services, leveraging public key infrastructure (PKI) for enhanced security. This method involves issuing a certificate by a trusted authority after verifying identity credentials, and it uses cryptographic techniques to ensure secure, mutual authentication between parties without transmitting sensitive information.

Multifactor authentication (MFA)

This combines multiple authentication methods (something you know, something you have, something you are) to verify identity. MFA is more complex to implement than any one form of authentication, and users don’t exactly love having to authenticate more than once, but it’s the de facto standard for user authentication because multiple authentication mechanisms have to be broken to circumvent it. We’re not going to cover MFA explicitly, but you should absolutely know about it.

Understanding these various authentication methods is crucial for building secure systems and protecting user data. By carefully selecting and implementing the appropriate authentication techniques, you can significantly enhance the security posture of your applications.

Securing Authentication Against Failures

Any authentication logic should be carefully designed to handle failures by defaulting to a state that doesn’t compromise security or expose sensitive information if an error or failure occurs. More specifically, your system should:

Deny access by default

Any authentication error, including system faults or incorrect credentials, should result in denial of access.

Minimize information leakage

Error messages or logs shouldn’t disclose details that could help an attacker. For example, “invalid password for user Foo” hints that the account Foo exists. A more generic “invalid username or password” shares nothing and reduces the risk of targeted attacks.

Log securely

In the event of an authentication failure, the system should log the incident without exposing any sensitive data. This can help you understand attack patterns or identify repeated failed attempts that could indicate a brute force, credential stuffing, or other attack. Don’t log unrecognized user names, though: often the user has mistakenly entered the password into the user field.

Ensuring that your authentication controls fail gracefully and securely is a key feature that will help your organization protect its systems and data, even when unexpected errors occur.

Password-Based Authentication

Perhaps the most classic (and certainly the most familiar) method of authentication in a computer system is password-based authentication, which attempts to verify a user’s identity using something only they know.8 Certainly password authentication has its advantages, and it remains popular due to its familiarity to users and its relative simplicity, both for the programmer to implement and for the user to use.

This simplicity and familiarity come at a cost, however. As we discuss in “Common password attacks”, there are a variety of password attacks, all of which are made far more effective by the fact that user account databases, often including (generally hashed) passwords, are commonly leaked on the internet.9

Fortunately, there are counters to such attacks, and collateral damage can be avoided if authentication data, especially passwords, is stored properly.

Common password attacks

Password-based authentication is a common target for attackers trying to gain unauthorized access into a system. Here is a list of several frequently applied attacks, in increasing levels of sophistication:

Brute force attacks

These attacks attempt to guess every possible combination of passwords until the correct one is found. This kind of attack is very resource-intensive, and can be mitigated by implementing account lockout policies, strong password policies, and CAPTCHA.

Dictionary attacks

These attacks are similar to brute force attacks, except that they use a list of common passwords instead of all possible combinations. Dictionary attacks can be foiled using the same techniques as brute force attacks, but especially by using complex passwords that are not simple words or common password combinations.

Rainbow table attacks

These attacks use a precomputed table of hashes for a very large set of character combinations, which an attacker uses to quickly discover what passwords correspond to a given hash (hashes will be introduced in the next section). This method is effective against systems that use unsalted hashes for storing passwords.

Credential stuffing attacks

These attacks attempt to use stolen account credentials to gain access to user accounts on another unrelated system. These attacks depend on the fact that users often reuse their passwords across multiple services, but they can be mitigated by enforcing MFA.

These are just the automated attack types. There are also various targeted attacks designed to steal passwords, such as phishing, which involves tricking users into giving away their passwords through deceptive emails or websites, and even keylogging, which involves using malware to record keystrokes on a user’s device.

Each of these methods targets different vulnerabilities inherent to password management and user behavior, and understanding each one is crucial for implementing effective security measures to protect against them.

Hashing passwords

Since users provide their passwords every time they try to log in, you never actually need to store them in a way that makes the original text accessible. You only need to make sure that what they provide matches what’s expected.

So rather than store users’ passwords in plain text, what’s typically stored instead is a hash, a unique representation of the password that’s generated by a cryptographically secure hashing algorithm like the ones we talk about in “Hashing”.

If that’s all we do, then every user who uses the same password will have the same generated hash, making rainbow table attacks easier. What we’d really like is to make sure that two different users who provide the same password still have different stored hashes. But how do we do that?

The solution is to add a random bit of data—unique to each user—to every password before it’s hashed. This process, called salting, ensures that every user’s password hash will be unique, even if their passwords are the same.

One way to do this could be to generate a random value for each user that’s stored alongside the hash in the account database, but you really shouldn’t do this. This would be effectively rolling your own crypto. Never roll your own crypto. There are libraries that do this for you, and you should use them.

One of the most common password hashing libraries is bcrypt, which implements Provos and Mazières’s adaptive hashing algorithm of the same name. It’s available from the golang.org/x/crypto/bcrypt package, which contains two key functions:

GenerateFromPassword(password []byte, cost int) ([]byte, error)

Accepts a password string and a “cost” value (more on this soon) and returns the bcrypt hash of the password.

CompareHashAndPassword(hashedPassword, password []byte) error

Compares a bcrypt-hashed password with its possible plain-text equivalent and returns nil if they match or an error value if they don’t.

You can see both of these in action in the following example:

package main

import (
    "fmt"
    "golang.org/x/crypto/bcrypt"
)

func main() {
    // The password bytes
    password := []byte("password123")

    // Apply the bcrypt algorithm with the default cost
    hash, _ := bcrypt.GenerateFromPassword(password, bcrypt.DefaultCost)

    fmt.Println("Password:", string(password))
    fmt.Println("Hashed:  ", string(hash))

    // Does the hash match the original password?
    if err := bcrypt.CompareHashAndPassword(hash, password); err != nil {
        fmt.Println("Result:   Password mismatch")
    } else {
        fmt.Println("Result:   Password match!")
    }
}

This snippet uses the bcrypt.GenerateFromPassword function (with the default “cost” value) to hash a password string, and then uses bcrypt.CompareHashAndPassword to validate that hash against the original password. Executing this code produces something like the following:

Password: password123
Hashed:   $2a$10$lryHYRUgH4ZP7GGy.HZTY.01HlY5hxqxMS7rT7iL3VGxgOIlZepJa
Result:   Password match!

What’s interesting is that each time you run this, you’ll get a different hash value. Not only is this very cool, but it highlights an important feature of the bcrypt algorithm: it automatically adds a salt value to the password that it’s hashing, making it highly resistant to rainbow table attacks.

What’s more, bcrypt is an adaptive function: its cost can be modified to make it more expensive (read: slower). Believe it or not, this is actually a good thing: more expensive hash functions take longer to run and are therefore more resistant to brute-force search attacks. This is another reason why we generally don’t use the general-purpose hashing functions described in “Hashing Algorithms” for passwords—they’re actually too efficient.

Tip

Slowness in a hashing function is a feature, not a bug. The slower and more expensive a function is, the longer it takes to brute force.

Token-Based Authentication

Token-based authentication begins like other authentication schemes in that it’s initiated with an exchange of credentials, like a username and password. Once those credentials are verified, however, the service issues the user an encrypted token that can be used to authenticate for the remainder of the session. This might seem like an unnecessary complication of password-based authentication, but there are several benefits.

First, since the tokens can contain a variety of specific user permissions and session details, token-based systems can support virtually any security model.

Second, token-based authentication decouples authentication from the user’s session, which improves security because it reduces the number of times that credentials have to be sent over the network and removes the need for the service to store and manage session information.

Finally,10 because the authentication information is maintained by the client, it’s effectively stateless. As we discussed at length in Chapter 7, statelessness is a very big deal when you’re developing distributed systems because it allows sessions to span multiple systems and services without the burden of an additional data store.

The lifecycle of a token

Unfortunately,11 token-based authentication isn’t magic. It doesn’t remove the need for credentials, like a username and password. It’s what happens after the credential exchange that’s different:

  1. User login: The user provides their credentials to the service.

  2. Verification: The service verifies the credentials against its database or authentication provider. If the credentials are correct, the service generates a token containing user identity information, a timestamp, and other relevant metadata.

  3. Token signing: The token is digitally signed by the service, ensuring that it can’t be tampered with without detection.

  4. Token sent to user: The token is then sent back to the user’s client.

  5. Client stores the token: The client stores the token and includes it in subsequent requests to the service.

  6. Token validation: The service verifies the token’s integrity by validating its signature. If the token is valid, the service extracts the user’s identity and other details from the token and processes the request.

  7. Session continuation or termination: Once the token is authenticated, the user can continue to access the service until the token expires or is invalidated.

Again, because all of the user’s session data is stored in the authenticatable token, the service doesn’t need to maintain session state, which makes this method inherently scalable and suitable for distributed systems.

JSON Web Tokens

JSON Web Tokens (JWTs) are a proposed industry standard12 for creating and encoding tokens that can be used to represent some number of claims to be transferred between two parties. JWTs are designed to be both URL-safe and reasonably compact, consisting of three parts:

Header

Specifies the token type and the algorithm used for the signature

Payload

Contains the claims

Signature

Used to verify the token and ensure that it hasn’t been tampered with

Each of these parts is Base64URL encoded and separated by dots, so a typical JWT looks something like the following:

hhhhhh.pppppp.ssssss

Now, let’s take a slightly closer look at these parts.

Header

The JWT header is a small JSON object that specifies the token type (which is always JWT) and the algorithm used to generate the signature part of the token. Typical algorithms include HMAC with SHA-256 (HS256) and RSA signature with SHA-256 (RS256).

For example:

{
  "alg": "HS256",
  "typ": "JWT"
}

This JSON is then Base64URL encoded to form the first part of the token.

Payload

The payload part of the token contains any claims it’s asserting about an entity (typically the user) and any additional data it wishes to include.

There are a variety of predefined claim types, such as iss (issuer), exp (expiration time), and sub (subject). While it’s recommended that you use these where appropriate, you’re also free to define your own.

Note

In JWT parlance, the predefined claim types defined in RFC 7519 are called registered claims. We’ll be using this term going forward to refer to these kinds of claims.

For example, the following payload uses the registered claim iat (Issued At) and a custom name claim:

{
  "iat": 1715524558,
  "name": "Matt Titmus"
}

As with the header, the payload JSON is then Base64URL encoded to form the second part of the token.

Warning

Unencrypted tokens are readable by anyone. Though they’re protected against tampering, you shouldn’t put secrets in the payload or header elements unless the entire token is encrypted.

Signature

To create the signature part, the algorithm specified in the header (which requires a passphrase or private encryption key) is applied to the encoded header and payload.

For example, if you want to use the HMAC SHA-256 algorithm, the signature will be created in the following way:

HMACSHA256(secret, base64UrlEncode(header) + "." + base64UrlEncode(payload))

The signature is used to verify that the message wasn’t changed along the way. In the case of tokens signed with a private key, it can also verify that the sender is who it says it is. More on available key types in “Signing Methods and Key Types”.

All together now

At the end of this process, the result is three Base64URL strings, separated by dots, that can be easily passed around in HTTP while being considerably more compact than older standards like SAML.

So at last, we have the final form for our token (newlines added for formatting purposes):

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.
eyJpYXQiOjE3MTU1MjQ1NTgsIm5hbWUiOiJNYXR0IFRpdG11cyJ9.
2NS8o1huBkPZ-3-7W2LrgmOunzOjrW5LR7foC2ypJO0

As you can see, this string is a complete (yet convenient) encoding of our token.

JSON Web Tokens in Go

Unfortunately, there’s no standard library implementation of JSON Web Tokens, but there’s an excellent external package in the form of github.com/golang-jwt/jwt/v5 that supports both the parsing and verification and the generation and signing of JWTs.

We’ll use this package in this section to demonstrate how JWTs can be used in your own code.

Building and signing a token

The github.com/golang-jwt/jwt/v5 package provides two functions to generate tokens: New, which provides a Token value that contains no claims (though these can be added later), and NewWithClaims, which creates a new Token value with any claims you specify.

The following function uses the latter of these to generate a new Token and return its signed string:

// Secret is used for signing our token.
// Never, ever check secrets into source code.
var secret = []byte(os.Getenv("JWT_KEY"))

func buildToken(username string) (string, error) {
    issuedAt := time.Now()
    expirationTime := issuedAt.Add(time.Hour)

    // Define our claims map
    claims := jwt.MapClaims{
        "iat":  issuedAt.Unix(),
        "exp":  expirationTime.Unix(),
        "name": username,
    }

    // Create a new token, specifying its signing method and claims.
    token := jwt.NewWithClaims(jwt.SigningMethodHS256, claims)

    // Sign and get the complete encoded token as a string using the secret.
    return token.SignedString(secret)
}

The jwt.MapClaims type uses a map[string]any to hold any claims you wish to assert. In the example, we choose to assert three: two registered claims (iat (Issued At) and exp (Expiration Time)) and one custom claim (name, for the username).

We then pass the claims to the jwt.NewWithClaims function, along with a constant that specifies our preferred signing method (jwt.SigningMethodHS256), and we get a Token value in return. Finally, we call SignedString on the token itself, passing it the signing secret, returning the complete, signed JWT string.

Signing Methods and Key Types

While we use HMAC with SHA-256 (HS256) for our examples, there are a variety of other signing algorithms available, each of which accepts a different type of signing secret.

These algorithms and their signing and verification key types are listed in the table. See the package’s signing methods documentation for more information.

Name      alg values             Signing key type      Verification key type
HMAC      HS256, HS384, HS512    []byte                []byte
RSA       RS256, RS384, RS512    *rsa.PrivateKey       *rsa.PublicKey
ECDSA     ES256, ES384, ES512    *ecdsa.PrivateKey     *ecdsa.PublicKey
RSA-PSS   PS256, PS384, PS512    *rsa.PrivateKey       *rsa.PublicKey
EdDSA     EdDSA                  ed25519.PrivateKey    ed25519.PublicKey

Now let’s exercise our buildToken function by building a small HTTP service that accepts a POST-ed username and password and responds with the encoded token:

func main() {
    http.HandleFunc("/authenticate", authenticateHandler)
    http.ListenAndServe(":8000", nil)
}

func authenticateHandler(w http.ResponseWriter, r *http.Request) {
    // This is required to populate r.Form.
    if err := r.ParseForm(); err != nil {
        w.WriteHeader(http.StatusBadRequest)
        return
    }

    // Retrieve and validate the POST-ed credentials.
    username := r.Form.Get("username")
    password := r.Form.Get("password")

    // Authenticate the password, responding to errors appropriately.
    valid, err := authenticatePassword(username, password)
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    } else if !valid {
        w.WriteHeader(http.StatusUnauthorized)
        return
    }

    // Password is valid; build a new token using our buildToken function.
    tokenString, err := buildToken(username)
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }

    // Respond with the new token string.
    fmt.Fprint(w, tokenString)
}

This almost executable snippet (authenticatePassword isn’t implemented here) creates a minimal HTTP service with a single endpoint: /authenticate. When it receives a valid username and password, it builds and returns a token. If we run it, we can manually test it using a curl command like the following:

curl -XPOST -d "username=user" -d "password=SuperSecret" \
    http://localhost:8000/authenticate

If everything works correctly, our service returns a status 200 and a token string. If you wish to verify the token string manually,13 jwt.io provides an excellent token debugger.

Parsing and validating a token

So now that you’ve issued one or more tokens, it’s time to verify them. There are three variations of parse functions provided by github.com/golang-jwt/jwt/v5, but the one we want to use is ParseWithClaims. It parses, validates, and verifies like Parse but also supports the use of custom claims. But to use it we have to provide two more things.

First, we need a struct that describes the claims that we’re interested in. We do this by defining a struct with a JSON-tagged field, as if we were going to unmarshal the JSON object.14 Embedding jwt.RegisteredClaims ensures full support for registered claims as well:

type CustomClaims struct {
    Name string `json:"name"`
    jwt.RegisteredClaims
}

The second thing we need to define is our key function. Key functions must adhere to the jwt.Keyfunc contract and are used by ParseWithClaims as a callback to supply the verification key. Key functions can be quite sophisticated, but ours only needs to provide the secret that we used to encode the token:

// Secret is used for verifying our token.
// Never, ever check secrets into source code.
var secret = []byte(os.Getenv("JWT_KEY"))

func keyFunc(token *jwt.Token) (any, error) {
    return secret, nil
}

With these two pieces in hand, we can write our verification logic:

func verifyToken(tokenString string) (*CustomClaims, error) {
    token, err := jwt.ParseWithClaims(tokenString, &CustomClaims{}, keyFunc)
    if err != nil {
        return nil, err
    }

    claims, ok := token.Claims.(*CustomClaims)
    if !ok {
        return nil, fmt.Errorf("unknown claims type")
    }

    return claims, nil
}

This verifyToken function receives an encoded JWT string, which it then parses and verifies using jwt.ParseWithClaims. Since our token includes an exp (Expiration Time) claim, it even automatically verifies that the token isn’t expired. It then asserts that the returned token’s claims are a *CustomClaims value, and returns the complete, verified claims.

Now that our verification logic is complete, we can add it to a service. The following handler can be used by an HTTP service to authorize a request that includes a JWT as a bearer token on an Authorization header:

func verifyHandler(w http.ResponseWriter, r *http.Request) {
    header := r.Header.Get("Authorization")
    token := strings.TrimPrefix(header, "Bearer ")

    if header == "" || token == header {
        log.Println("Missing authorization header")
        w.WriteHeader(http.StatusUnauthorized)
        return
    }

    claims, err := verifyToken(token)
    if err != nil {
        log.Println("invalid authorization token:", err)
        w.WriteHeader(http.StatusUnauthorized)
        return
    }

    fmt.Fprintf(w, "Welcome back, %s!\n", claims.Name)
}

This function retrieves the Authorization header from the request and attempts to validate the token that it includes, if any. If the verification fails, we reject the request, but otherwise we conclude that the request is coming from who it says it is.

Note that most services wouldn’t have a dedicated endpoint like this one that exists exclusively to verify a JWT. Rather, this logic (or something similar) would be applied to most of a service’s endpoints via some kind of middleware function.
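
For illustration, a minimal middleware sketch might look like the following. It reuses the verifyToken function defined earlier; the requireToken name is mine, not part of any standard API:

func requireToken(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        header := r.Header.Get("Authorization")
        token := strings.TrimPrefix(header, "Bearer ")

        // Reject requests with a missing or malformed header.
        if header == "" || token == header {
            w.WriteHeader(http.StatusUnauthorized)
            return
        }

        // Reject requests whose token doesn't verify.
        if _, err := verifyToken(token); err != nil {
            w.WriteHeader(http.StatusUnauthorized)
            return
        }

        next(w, r)
    }
}

A protected endpoint can then be registered with something like http.HandleFunc("/resource", requireToken(resourceHandler)).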

Bearer Tokens

The use of a token to grant access to a resource or service, with no additional identification other than the possession of the token itself, is called a bearer token.

The concept of the bearer token was originally defined for use in OAuth authorization for HTTP requests in RFC 6750, but it’s since been generalized to include other token types as well (such as JWT).

In this standard, an HTTP client presents its token in the Authorization request header, prefixing the token string with the word Bearer. The result looks something like the following:

Authorization: Bearer <token>

On the receiving end, the service then extracts the header value, removes the word Bearer, and verifies the token.

Communication Security

While it’s important to encrypt your data at rest—on the disk—the true dangers to your data arise when it’s in transit—being transmitted—exposing it to the possibility of being viewed or altered by bad actors.

This is where communication security comes in. Besides protecting sensitive information from prying eyes (confidentiality), it also ensures that the party you think you’re talking to is the party you’re talking to (authentication) and that the data you send is the same as the data that is eventually received (integrity).

Transport Layer Security

TLS is a popular cryptographic protocol designed to provide secure communication over a computer network. It uses asymmetric cryptography for key exchange, symmetric encryption for privacy, and message authentication codes for message integrity. It’s used in all kinds of places, but you’re probably most familiar with its use in securing HTTPS.

You might recall that we first introduced TLS in “Transport Layer Security”, way back in Chapter 5. If you want a more thorough overview, you can check back there. I’ll be here when you get back.

Tip

People sometimes say SSL when they mean TLS, when in fact SSL (Secure Sockets Layer) is an older protocol that’s been largely replaced by TLS due to security vulnerabilities.

HTTP Over TLS

HTTP over TLS, generally just called HTTPS, is an extension of HTTP that uses TLS for encryption. It’s possibly the most common application of TLS. If you’re reading this book, you probably use it every day.

I’d be remiss if I didn’t at least acknowledge the existence of HTTPS in this section, even if only to redirect you back to “Securing Your Web Service with HTTPS” in Chapter 5, where we cover it in some detail.

gRPC Over TLS

Adding support for TLS to your gRPC service15 works a lot like creating an unencrypted gRPC service, except it requires the additional step of supplying your key pair. Frequently this is done by providing a pair of .pem files, which we introduced in “Private Key and Certificate Files” in the context of adding TLS to an HTTP service.

For example, take a look at the following bit of code, which is adapted from the service we created back in Chapter 8:

package main

import (
    "log"
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
    // Also imported, but omitted here: the generated protocol buffer
    // package from Chapter 8, referenced below as pb.
)

func main() {
    // Specify the credentials to use
    creds, err := credentials.NewServerTLSFromFile("cert.pem", "key.pem")
    if err != nil {
        log.Fatalf("failed to setup tls: %v", err)
    }

    // Create a gRPC server and register our server with it
    s := grpc.NewServer(grpc.Creds(creds))
    pb.RegisterKeyValueServer(s, &server{})

    // Open a listening port on 50051
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("failed to listen: %v", err)
    }

    // Start accepting connections on the listening port
    if err := s.Serve(lis); err != nil {
        log.Fatalf("failed to serve: %v", err)
    }
}

This code is nearly identical to the original, with one notable exception. You may notice that before we build the service, we first use the credentials.NewServerTLSFromFile function to define our credentials. Those credentials are then passed as an option to grpc.NewServer.

That’s really it. As is the case with the net/http package, the encryption logic for gRPC is abstracted away so you don’t need to think about it once it’s been started. Less to do means less to think about, means less to mess up.
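
The client side is just as straightforward. The following sketch, which assumes the same cert.pem and a reasonably current google.golang.org/grpc release, uses credentials.NewClientTLSFromFile to load the server’s certificate and passes the result to grpc.WithTransportCredentials:

import (
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
)

func connect() (*grpc.ClientConn, error) {
    // Load the server's certificate so the client can verify it. The
    // empty string means "don't override the expected server name."
    creds, err := credentials.NewClientTLSFromFile("cert.pem", "")
    if err != nil {
        return nil, err
    }

    // Create the connection using TLS transport credentials.
    return grpc.NewClient("localhost:50051",
        grpc.WithTransportCredentials(creds))
}

The returned connection can then be used to create a client stub exactly as it would be without TLS.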

Cryptographic Practices

It’s not an exaggeration to say that cryptography forms the foundation of information security. Cryptographic practices are crucial for safeguarding sensitive information from unauthorized access and ensuring data integrity, and cryptographic techniques such as encryption, hashing, and key management are vital components of a comprehensive security strategy.

In this section, we’re going to cover two fundamental concepts in cryptography, hashing and encryption, as well as some general cryptographic best practices.

Hashing

Hashing is a fundamental cryptographic practice with multiple applications across various fields. The process of hashing is conceptually straightforward: you apply a hash function to some source data to generate a value called a hash. A key feature of hashing is that it can’t be reversed to recover the original input. Just as importantly, a good hash function also produces output that varies considerably even with small changes to the input.

Hashing is commonly used in situations in which you want to be able to verify what something is (or isn’t), without necessarily needing to know its value or contents. Examples include but aren’t limited to the following:

Data integrity

Hashes can be generated from input of any size to make tampering or other changes evident.

Password storage

As mentioned in “Hashing passwords”, systems will typically store hashes of passwords instead of the passwords themselves. When a user logs in, the system hashes the password entered and compares it to the stored hash.

Checksums for data transmission

Hash functions are used to generate checksums for data blocks being transmitted over a network. The checksum helps ensure that the data has not been altered during transmission, allowing for error detection.

Blockchain and cryptocurrency

Each block in a blockchain is connected to the previous one via a cryptographic hash, ensuring the security and immutability of the entire chain.

Unique identifiers

Hashes can be used to generate unique identifiers for large sets of data, which are useful in databases for quick data retrieval without data conflicts.

Commitment schemes

Hashing is used in various cryptographic protocols to commit to a chosen value while keeping it hidden, only revealing the value at a later time. This technique ensures that the commitment is kept unchanged once made.

Generally speaking, you should use hashing whenever you need to verify the contents of something without having to know its original contents.
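
For example, here’s a minimal sketch of the data-integrity case, assuming you already have the expected SHA-256 sum of some payload as a hex string (the verifyChecksum name is mine):

import (
    "crypto/sha256"
    "encoding/hex"
)

// verifyChecksum reports whether data hashes to the expected
// hex-encoded SHA-256 sum.
func verifyChecksum(data []byte, expectedHex string) bool {
    sum := sha256.Sum256(data)

    return hex.EncodeToString(sum[:]) == expectedHex
}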

Hashing Algorithms

There are 10 or so popular16 hashing algorithms in use today, each with its own strengths, weaknesses, and preferred applications. These largely fall into three categories:

Cryptographically insecure algorithms

The MD5 algorithm is probably the most popular hashing algorithm in use today, being widely used for verifying data integrity and checksums. You’ll sometimes see SHA-1 used in some legacy systems, and it can even be seen in some older SSL/TLS certificates and digital signatures. Both of these are generally considered obsolete for cryptographic purposes and are not recommended for new applications.

General hashing algorithms

These algorithms are designed to balance speed and security. SHA-256 is secure, widely available, and suitable for general cryptographic use, with the newer SHA-3 providing a modern alternative. BLAKE2b (or just BLAKE2) and BLAKE2s are less widely supported but are considered versatile, fast, and secure. BLAKE2b is optimized for 64-bit platforms, BLAKE2s for 8-bit to 32-bit platforms.

BLAKE2 is generally considered to be the strongest and most flexible of these, but SHA-256 is perfectly good if that’s unavailable.

Password hashing algorithms

Algorithms like Argon2, bcrypt, and scrypt are specialized for hashing passwords. They’re designed to be computationally expensive to resist brute-force attacks, and they include some kind of salting strategy to foil rainbow table attacks.

Note

Go’s standard crypto package provides several cryptographic implementations, but most modern algorithms have been implemented under the supplemental golang.org/x/crypto packages. You should prefer those instead.

In Go, the general-purpose hashing algorithms are all used in similar ways, which we demonstrate here:

package main

import (
    "crypto/md5"
    "crypto/sha256"
    "fmt"

    "golang.org/x/crypto/blake2b"
    "golang.org/x/crypto/sha3"
)

func main() {
    str := []byte("Welcome to Cloud Native Go, Second Edition!")

    hMd5 := md5.New()
    hSha256 := sha256.New()
    hSha3 := sha3.New256()
    hBlake2b, _ := blake2b.New256(nil)

    hMd5.Write(str)
    hSha256.Write(str)
    hSha3.Write(str)
    hBlake2b.Write(str)

    fmt.Printf("MD5       : %x\n", hMd5.Sum(nil))
    fmt.Printf("SHA256    : %x\n", hSha256.Sum(nil))
    fmt.Printf("SHA3-256  : %x\n", hSha3.Sum(nil))
    fmt.Printf("BLAKE2-256: %x\n", hBlake2b.Sum(nil))
}

For all four of the algorithms used here, you use some variation of a New function to obtain a hash.Hash value, which lives in the standard hash package. You can then pass your data (represented as a []byte slice) to its Write method, and retrieve the resulting hash via the Sum method. Executing this snippet will provide the following output:

MD5       : 390127f420dbc185fce3ed7fe914a08d
SHA256    : a64c3cadf4803bcbcbe5a1ca277af785cb604a492ae578dbd9cd87809b1e6bb8
SHA3-256  : a4abab18bd01ae4e85e07f27e575022b614d5a06d100707cd43229fa447cd20c
BLAKE2-256: 37fad4546db45abfe6aa85324460098d7c9dde02d0c03a9d7d67dcd9dce760f7

The Go implementations of the password hashing algorithms are a little more diverse. You can go back to “Hashing passwords” to see an example of how to use bcrypt to hash, and then validate, a password.
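
If you’d rather not flip back, here’s a minimal sketch of that approach using golang.org/x/crypto/bcrypt; the helper names are mine:

import "golang.org/x/crypto/bcrypt"

// hashPassword produces a salted bcrypt hash suitable for storage.
func hashPassword(password string) (string, error) {
    hash, err := bcrypt.GenerateFromPassword([]byte(password), bcrypt.DefaultCost)

    return string(hash), err
}

// checkPassword compares a stored hash against a candidate password.
func checkPassword(hash, password string) bool {
    return bcrypt.CompareHashAndPassword([]byte(hash), []byte(password)) == nil
}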

Encryption

Encryption is a very different animal from hashing. Where hashing applies a function to produce a fixed-length value that can’t be used to recover the original input, encryption uses a cryptographic key to reversibly convert data into unreadable ciphertext.

In this section, we’re going to dig just a little deeper into the concept of encryption, including symmetric and asymmetric encryption, their uses and differences, and how to implement them in Go.

Tip

Some people use the terms hashing and encryption interchangeably, when in fact, they’re quite different things that serve quite different purposes. If anybody tells you any different, you should ignore them.

Symmetric encryption

Symmetric encryption, also called shared key or private key encryption, is the most straightforward approach to encryption, using only a single cryptographic key for both encryption and decryption of data.

Though symmetric encryption is less complicated and generally faster and more efficient than asymmetric encryption, it does require that all communicating parties have the secret key, making key management a challenge and compromise that much more likely.

There are a handful of popular symmetric ciphers (encryption algorithms) available. DES (Data Encryption Standard) is an older cipher that remains quite prevalent, though it’s considered obsolete and generally shouldn’t be used. These days, AES (Advanced Encryption Standard) is generally preferred due to its security, efficiency, and widespread support.

The encryption process

Every cipher is a little different, but there are some generalities to using them. Here are the steps you’ll need to follow, in broad strokes:

  1. Obtain your encryption key: Encryption keys are implemented as a []byte slice, the specific size of which varies between algorithms.

  2. Retrieve your cipher: In Go, encryption ciphers can be implemented in a variety of ways. Fortunately, most crypto packages include a function named something like NewCipher.

  3. Choose your cipher mode: Some symmetric algorithms (including AES and DES) can be applied using different methods, called cipher modes. Several such modes are available, but unfortunately, we can’t cover them here. In this book, we use Galois/Counter Mode (GCM), which provides a good balance of security and efficiency.

  4. Encrypt your data: Now that you’ve fully initialized your cipher, you can provide it with your plain text and receive encrypted ciphertext in return.

In the following code, we follow these steps to construct an AES-GCM cipher that’s able to encrypt any provided data:

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "io"
)

// encryptAES encrypts plaintext using the given key with AES-GCM.
func encryptAES(key, plaintext []byte) ([]byte, error) {
    // Create a new `cipher.Block`, which implements the AES cipher
    // using the given key
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }

    // Specify the cipher mode to be GCM (Galois/Counter Mode).
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }

    // GCM requires the use of a nonce (number used once), which is
    // a []byte of (pseudo) random values.
    nonce := make([]byte, gcm.NonceSize())
    if _, err = io.ReadFull(rand.Reader, nonce); err != nil {
        return nil, err
    }

    // We can now encrypt our plaintext. Passing nonce as the first
    // argument prepends it to the returned ciphertext.
    ciphertext := gcm.Seal(nonce, nonce, plaintext, nil)

    return ciphertext, nil
}

There’s a lot to unpack here, but it’s actually not that bad once you wrap your head around it.

First, we use aes.NewCipher from the crypto/aes package, passing it our key (represented as a []byte slice) to get our AES cipher. Then we pass that to cipher.NewGCM from crypto/cipher to set the cipher mode to GCM.

Next, we define a []byte of a specific size and populate it with (pseudo)random data. This will be our nonce. A nonce (short for “number used once”) is a (pseudo) random value that’s used (once) in certain cipher modes to ensure uniqueness of encryption processes.

With that, we now have all of the pieces we need to execute our encryption, so we call the gcm.Seal method with our nonce and plain text and get encrypted ciphertext in return!

Note

The nonce value generally isn’t considered sensitive, but it’s essential for decryption. It’s actually common practice to prepend it to the ciphertext, which is exactly what gcm.Seal does here because we pass nonce as its first (destination) argument.

The decryption process

Just as when you encrypt data, you also have to assemble your cipher for decryption. The process for this is pretty straightforward, though it varies slightly. See if you can spot the differences:

import (
    "crypto/aes"
    "crypto/cipher"
    "fmt"
)

// decryptAES decrypts ciphertext using the given key with AES-GCM.
func decryptAES(key, ciphertext []byte) ([]byte, error) {
    // Retrieve our AES cipher
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }

    // Specify the cipher mode to be GCM (Galois/Counter Mode).
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }

    // Retrieve the nonce value, which was prepended to the ciphertext,
    // and the ciphertext proper.
    nonceSize := gcm.NonceSize()

    if len(ciphertext) < nonceSize {
        return nil, fmt.Errorf("invalid input")
    }

    nonce, cipherbytes := ciphertext[:nonceSize], ciphertext[nonceSize:]

    // We can now decrypt our ciphertext!
    plaintext, err := gcm.Open(nil, nonce, cipherbytes, nil)
    if err != nil {
        return nil, err
    }

    return plaintext, nil
}

As you can see, the steps to decrypt are really similar to those for encrypting, at least with respect to constructing the cipher. There are two key details that differ, however.

First, instead of generating a new random nonce, as we would when encrypting, we retrieve the nonce value that we previously prepended onto the ciphertext. Second, instead of the gcm.Seal method, we call gcm.Open, which decrypts and authenticates the ciphertext and returns the decrypted bytes.

Putting these together

Now that we have a method to encrypt data and another to decrypt data, we can put them together. All we need is some data to encrypt and a key.

For AES, the key can be any sequence of bytes, but it must be exactly 16, 24, or 32 bytes long, which selects AES-128, AES-192, or AES-256, respectively. The strength of the encryption depends on the length of the key.
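
In a real system you wouldn’t hard-code a key, of course; you’d generate or retrieve it from a secure source. Here’s a minimal sketch of generating one with crypto/rand (the newAESKey name is mine):

import "crypto/rand"

// newAESKey returns a random key of the given size
// (16, 24, or 32 bytes).
func newAESKey(size int) ([]byte, error) {
    key := make([]byte, size)
    if _, err := rand.Read(key); err != nil {
        return nil, err
    }

    return key, nil
}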

Here we demonstrate everything together:

import (
    "encoding/base64"
    "fmt"
)

func main() {
    // 32 bytes for AES-256. Obviously, don't do this.
    key := []byte("example.key.12345678.example.key")

    plaintext := []byte("Hello, Cloud Native Go!")

    encrypted, err := encryptAES(key, plaintext)
    if err != nil {
        panic(err)
    }

    decrypted, err := decryptAES(key, encrypted)
    if err != nil {
        panic(err)
    }

    // Encode the encrypted bytes to Base64
    encoded := base64.StdEncoding.EncodeToString(encrypted)

    fmt.Println("Encrypted:", encoded)
    fmt.Println("Decrypted:", string(decrypted))
}

As you can see, we use a dummy key to encrypt the phrase “Hello, Cloud Native Go!” and then to decrypt the resulting ciphertext. Running this gives us the following:

Encrypted: L0+xZEX6mdS1kX7hSdQ1m3KXvdj4DyvmoypXNGoxqZxMv0DkUx9sYkBSzkAk5Z/XgZDP
Decrypted: Hello, Cloud Native Go!

That was pretty straightforward, wasn’t it? Next up, we’ll start digging into asymmetric encryption.

Asymmetric encryption

Asymmetric encryption, or public-key encryption, differs from its symmetric cousin in that it uses a pair of keys to do its job: a public key for encryption and a private key for decryption. This key pair is mathematically related, but it’s computationally infeasible to derive the private key from the public key.17 With asymmetric encryption, two parties who wish to communicate privately can exchange their public keys, which can then be used to secure subsequent communications in a way that can be read by only the intended recipient who holds the corresponding private key.

However, while asymmetric algorithms provide significant advantages around key distribution, they’re also much slower and more computationally expensive than symmetric ones. In fact, it’s common to combine the methodologies by using asymmetric encryption to securely exchange a symmetric key that’s then used for the actual data encryption.

Perhaps the most widely used and widely supported asymmetric algorithm is RSA (Rivest-Shamir-Adleman), which is suitable for secure data transmission and digital signatures. ECC (elliptic curve cryptography) offers similar security to RSA but with smaller key sizes, leading to improved performance.

Generating a key pair

Before we encrypt anything, we’re going to need a key pair. One way to do this would be to load existing keys from .pem files. We don’t have those right now, though, so we’ll use the standard crypto/rsa implementation to generate ours:

import "crypto/rsa"

// generateKeyPair generates an RSA key pair.
func generateKeyPair(bits int) (*rsa.PrivateKey, *rsa.PublicKey, error) {
    privateKey, err := rsa.GenerateKey(rand.Reader, bits)
    if err != nil {
        return nil, nil, err
    }

    return privateKey, &privateKey.PublicKey, nil
}

The rsa.GenerateKey function generates a random RSA private key of the given bit size. It accepts an io.Reader that’s expected to provide a source of random bytes, which should pretty much always be Reader from crypto/rand, which is designed to be exactly what we need: a cryptographically secure random number generator.

Warning

The pseudorandom number generator provided by math/rand is not considered cryptographically secure, and it should never be used for security-sensitive work. Use crypto/rand instead. See “Cryptographic Randomness” for more on this.

If you want to export these keys to PEM format, or just visualize them, you can use the standard encoding/pem and crypto/x509 packages to do that:

import (
    "crypto/rsa"
    "crypto/x509"
    "encoding/pem"
    "fmt"
)

// exportKeys exports keys to PEM format for demonstration purposes
func exportKeys(privateKey *rsa.PrivateKey, publicKey *rsa.PublicKey) {
    privBytes := x509.MarshalPKCS1PrivateKey(privateKey)
    privPEM := pem.EncodeToMemory(
        &pem.Block{
            Type:  "RSA PRIVATE KEY",
            Bytes: privBytes,
        })

    pubBytes := x509.MarshalPKCS1PublicKey(publicKey)
    pubPEM := pem.EncodeToMemory(
        &pem.Block{
            Type:  "RSA PUBLIC KEY",
            Bytes: pubBytes,
        })

    fmt.Println(string(privPEM))
    fmt.Println(string(pubPEM))
}

In this function, we first marshal each of the keys into an appropriate byte representation, which we can then PEM-encode in memory. Here we just print the keys out to stdout, but you could write these to files almost as easily.
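
For example, assuming the privPEM and pubPEM values from exportKeys (and the os and log packages), writing them out is just a couple of os.WriteFile calls. Note the restrictive permissions on the private key file:

// Private keys should never be world-readable.
if err := os.WriteFile("private.pem", privPEM, 0600); err != nil {
    log.Fatal(err)
}

if err := os.WriteFile("public.pem", pubPEM, 0644); err != nil {
    log.Fatal(err)
}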

The encryption process

The crypto/rsa package provides implementations of two specifications: RSA-OAEP, which is used for encryption and decryption, and RSA-PSS, used for signing and verifying digital signatures. Since we’re encrypting data, we use the former, implemented by rsa.EncryptOAEP:

import (
    "crypto/rand"
    "crypto/rsa"
    "crypto/sha256"
)

// encryptRSA encrypts the given message with the RSA public key.
func encryptRSA(publicKey *rsa.PublicKey, message []byte) ([]byte, error) {
    ciphertext, err := rsa.EncryptOAEP(
        sha256.New(), rand.Reader, publicKey, message, nil)
    if err != nil {
        return nil, err
    }

    return ciphertext, nil
}

The first parameter of the rsa.EncryptOAEP function is a hash.Hash implementation, which you may recall from “Hashing Algorithms”. Any secure hash implementation will do, but it’s important that the decryption uses the same algorithm. The documentation suggests that SHA-256 is a reasonable choice, so let’s use that.

The second parameter is an io.Reader that’s used as a source of entropy. This is just like rsa.GenerateKey, and we provide it the same value: rand.Reader.

The decryption process

The decryption is quite similar to the encryption process, except instead of rsa.EncryptOAEP, we use rsa.DecryptOAEP, and instead of a public key, we provide a private key:

import (
    "crypto/rand"
    "crypto/rsa"
    "crypto/sha256"
)

// decryptRSA decrypts the given ciphertext with the RSA private key.
func decryptRSA(privateKey *rsa.PrivateKey, ciphertext []byte) ([]byte, error) {
    plaintext, err := rsa.DecryptOAEP(
        sha256.New(), rand.Reader, privateKey, ciphertext, nil)
    if err != nil {
        return nil, err
    }

    return plaintext, nil
}

Just as with rsa.EncryptOAEP, the first parameter of rsa.DecryptOAEP is a hash.Hash implementation. Again, this has to be the same implementation used for encryption.

Putting it all together

Now that we can generate a key pair, and we can use those keys to encrypt and decrypt a message, let’s put them together and see what happens:

import (
    "encoding/base64"
    "fmt"
)

func main() {
    // Generate 2048-bit RSA keys
    privateKey, publicKey, err := generateKeyPair(2048)
    if err != nil {
        panic(err)
    }

    plaintext := []byte("Hello, Cloud Native Go!")

    // Encrypt message
    encrypted, err := encryptRSA(publicKey, plaintext)
    if err != nil {
        panic(err)
    }

    // Decrypt message
    decrypted, err := decryptRSA(privateKey, encrypted)
    if err != nil {
        panic(err)
    }

    data := base64.StdEncoding.EncodeToString(encrypted)
    fmt.Println("Encrypted:", data)
    fmt.Println("Decrypted:", string(decrypted))
}

This main function mostly just calls the functions we built, so nothing in it should be a surprise. We first generate a 2,048-bit key pair, which we then use to encrypt and then decrypt a message. Running this code will output something like the following:

Encrypted: ucX/ZZj0ISevpqL+rn+bKMBAubsUdJ+wA7ah2r+PpIVyOVaSm67zZSivTpZl2...
Decrypted: Hello, Cloud Native Go!

The actual encrypted message is pretty long, even when Base64-encoded, a consequence of using a 2,048-bit key pair. What’s important, though, is that we’ve successfully encrypted and decrypted an arbitrary message!

When to use each

Now that you’ve seen a detailed description of both symmetric and asymmetric cryptographic algorithms, you may wonder what the right time would be to apply one over the other. Let’s review some of the strengths and weaknesses of each approach:

Symmetric algorithms

Relatively fast and efficient, making them best suited for encrypting large volumes of data with minimal computational overhead. However, since they use the same key for both encryption and decryption, key distribution and management is critical but often challenging.

Asymmetric algorithms

Simplify key distribution by using a pair of keys (public and private) for encryption and decryption, enhancing security for key exchanges and digital signatures. Their main drawbacks are their slower performance and higher computational requirements, which make them less suitable for encrypting large amounts of data.

Combining both methodologies

Using an asymmetric algorithm to securely exchange a symmetric cipher key, which is then used to secure the remainder of the communication, provides something of a best-of-both-worlds approach. It combines the security of asymmetric key exchange with the efficiency of symmetric encryption, but it adds a fair amount of complexity.
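
To make that last point concrete, here’s a minimal sketch of the hybrid (sometimes called envelope encryption) approach, reusing the encryptAES, decryptAES, encryptRSA, and decryptRSA functions built earlier in this chapter, plus crypto/rand and crypto/rsa. The encryptHybrid and decryptHybrid names are mine:

// encryptHybrid encrypts plaintext with a fresh AES-256 key, then
// encrypts that key with the recipient's RSA public key. The encrypted
// key and the ciphertext are returned separately.
func encryptHybrid(pub *rsa.PublicKey, plaintext []byte) ([]byte, []byte, error) {
    // Generate a one-time 32-byte key, selecting AES-256.
    key := make([]byte, 32)
    if _, err := rand.Read(key); err != nil {
        return nil, nil, err
    }

    // Encrypt the payload with the symmetric key...
    ciphertext, err := encryptAES(key, plaintext)
    if err != nil {
        return nil, nil, err
    }

    // ...and the symmetric key with the public key.
    encKey, err := encryptRSA(pub, key)
    if err != nil {
        return nil, nil, err
    }

    return encKey, ciphertext, nil
}

// decryptHybrid reverses the process using the RSA private key.
func decryptHybrid(priv *rsa.PrivateKey, encKey, ciphertext []byte) ([]byte, error) {
    key, err := decryptRSA(priv, encKey)
    if err != nil {
        return nil, err
    }

    return decryptAES(key, ciphertext)
}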

Now, armed with an understanding of these techniques, you’re far more prepared to build robust protections for sensitive data, both at rest and in transit.

Cryptographic Randomness

Having a good source of randomness is necessary for all kinds of cryptographic tasks, like generating salt values, nonces, and keys. Fortunately, while computers are really bad at being random, there are a lot of very smart people working very hard to think of ways to make them act as close to randomly as possible by designing a variety of pseudorandom number generators. You can actually find two of these in the Go standard library.

This first is provided by math/rand. This is a source of “statistical” randomness that’s suitable for tasks such as simulation, sampling, numerical analysis, and non-cryptographic randomized algorithms. However, it was neither designed nor intended for cryptographic use because it’s possible to predict sequences after seeing enough values.18

Warning

The pseudorandom number generator provided by math/rand is not considered cryptographically secure and should never be used for security-sensitive work.

The second is a cryptographic randomness generator provided by crypto/rand. It’s designed to provide numbers that are virtually impossible to predict, even by someone who knows exactly how they’re generated, and it’s well-suited for use in security-sensitive work.

The primary interface into crypto/rand is its rand.Read function, demonstrated here:

import (
    "crypto/rand"
    "fmt"
)

func main() {
    b := make([]byte, 10)

    if _, err := rand.Read(b); err != nil {
        fmt.Println("error:", err)
        return
    }

    // The slice will now contain random bytes instead of only zeroes.
    fmt.Println(b)
}

In this snippet, we start by making a byte slice, which we then pass to rand.Read to fill with cryptographically secure random bytes.

Database Security

We first introduced how to use Go to interact with SQL databases way back in “Working with databases in Go”. That section covered a few important things, largely at an introductory level, about the database/sql package and how to use it to connect to a database and perform some basic queries.

In this section, we’re going to dive a little deeper, focusing primarily on database security issues and the actions you can take when using databases in your applications.

Connections

You might recall that Go provides the sql.Open function to open a database connection, which looks something like the following:

import (
    "database/sql"
    _ "github.com/lib/pq"
)

func setupDB() (*sql.DB, error) {
    db, err := sql.Open("postgres", "postgres://user:password@host/dbname")
    if err != nil {
        return nil, err
    }

    return db, nil
}

Only, that’s kind of a lie. In fact, sql.Open doesn’t attempt to open a database connection until it’s used to perform a query (or other database operation). Only then does Go retrieve an available connection from the connection pool, returning it to the pool as soon as the operation completes.

This may seem like a nuanced fact, but it has some consequences. For one thing, sql.Open doesn’t actually test for database connectivity, so invalid credentials won’t trigger an error until the first time you try to use them. You may find it useful to call db.Ping (or db.PingContext) immediately after calling sql.Open to allow a connection attempt to fail fast.
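
For example, a minimal sketch of that fail-fast check might look like the following; the setupAndVerifyDB name and connection string are placeholders:

func setupAndVerifyDB(ctx context.Context) (*sql.DB, error) {
    db, err := sql.Open("postgres", "postgres://user:password@host/dbname")
    if err != nil {
        return nil, err
    }

    // Force a connection attempt now, rather than on first use.
    if err := db.PingContext(ctx); err != nil {
        db.Close()
        return nil, err
    }

    return db, nil
}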

Using Context Methods

The *sql.DB type provides a number of methods for performing basic database operations, like Prepare, Exec, and Ping, but for each of those methods there’s an equivalent that accepts a context.Context value, such as PrepareContext, ExecContext, and PingContext.

If you don’t recall Context and what it can do for you, take a look back at “The Context Package”, where we discuss it in some detail.

As long as you’re using a driver that supports context cancellation, these methods will allow cancellation signals to terminate operations in progress. Uncommitted transactions will be rolled back, and any open resources will be returned.
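
For example, a sketch of bounding a query with a timeout might look like the following; the countUsers name and the users table are placeholders:

func countUsers(db *sql.DB) (int, error) {
    // Give up if the query takes longer than three seconds.
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()

    var count int
    err := db.QueryRowContext(ctx, "SELECT COUNT(*) FROM users").Scan(&count)

    return count, err
}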

Parameterized Queries

Parameterized queries, also known as prepared statements, are a type of SQL query that uses placeholders for parameters instead of directly incorporating user input into the query string. Because they clearly separate parameters from executable code, they provide a significant defense against SQL injection attacks. So much so, in fact, that they’re generally regarded as the way to keep your database safe.

The standard database/sql package provides a straightforward means for defining and executing parameterized queries:

func getUserId(ctx context.Context, db *sql.DB, username string) (int, error) {
    var userId int

    query := "SELECT id FROM users WHERE username = $1"

    row := db.QueryRowContext(ctx, query, username)

    if err := row.Scan(&userId); err != nil {
        return 0, err
    }

    return userId, nil
}

In this example, db.QueryRowContext is used with a parameterized SQL query. The $1 placeholder in the query is automatically replaced by the username variable by the database/sql package, ensuring that username is treated as a literal string rather than part of the SQL command. The row.Scan method then copies the columns from the matched row into the values pointed at by userId, returning an error if no rows matched the query.

The same syntax can be used to insert data as well:

func insertUser(ctx context.Context, db *sql.DB, username, email string) error {
    query := "INSERT INTO users (username, email) VALUES ($1, $2)"

    _, err := db.ExecContext(ctx, query, username, email)

    return err
}

This function inserts a new user into the users table using parameters for both the username and email. The db.ExecContext method executes a query without returning any rows.

Parameterized Query Syntax by Database

If you look at other examples and other documentation, you may notice slightly different syntax being used for parameterized queries. This unfortunate fact is because the syntax varies between databases.

For example, comparing PostgreSQL syntax (the kind used in this book) with the syntax for MySQL and Oracle:

PostgreSQL            MySQL              Oracle
WHERE col = $1        WHERE col = ?      WHERE col = :col
VALUES($1, $2, $3)    VALUES(?, ?, ?)    VALUES(:val1, :val2, :val3)

Regular Expressions

As I mentioned briefly in “Leveraging regular expressions”, regular expressions (regex) are a powerful tool that’s commonly used for input validation. But even moderately complex regex can be fiendishly difficult to understand and modify, and a poorly made expression can even allow an inadvertent (or intentional) denial-of-service.

How Go Regex Is Different

Go’s regex implementation differs from that of other languages in some significant ways that are relevant to security. Rather than trying to implement all of the features offered by other regex engines, the designers opted instead to use the syntax accepted by RE2, which is specifically designed to safely handle regex from untrusted users.

What makes RE2 so great, you ask?

For one thing, Go’s regexp implementation is guaranteed to run in linear time with the size of the input. This is very much not the case for Perl, Python, and most other open source regex implementations, in which matching time can grow exponentially with the size of the input.

This linear time guarantee makes it much more difficult for an attacker to employ what’s termed a regular expression denial of service (ReDoS), in which an attacker provides a valid input that’s specifically designed to force a program to hang for a long time. For more about this property, see Russ Cox’s excellent article on the subject, “Regular Expression Matching Can Be Simple And Fast (but Is Slow in Java, Perl, PHP, Python, Ruby, …​)”.

What’s more, Go’s regex parser, compiler, and execution engines also explicitly limit their memory usage by working within a fixed budget, and fail gracefully when that memory is exhausted. This makes them even more resistant to ReDoS attacks.

But there’s a trade-off to this level of safety:19 Go’s regex implementation doesn’t support recursive calls, so approaches that have only recursive solutions aren’t possible.
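
To see the difference in practice, the following sketch runs a classically “evil” pattern against an input designed to trigger catastrophic backtracking. In most backtracking engines this would hang for a very long time; in Go it completes in milliseconds:

package main

import (
    "fmt"
    "regexp"
    "strings"
    "time"
)

func main() {
    // A pattern that causes catastrophic backtracking in most
    // backtracking engines, but not in Go's RE2-based engine.
    re := regexp.MustCompile(`^(a+)+$`)

    // Many 'a' characters followed by one that forces a failed match.
    input := strings.Repeat("a", 100_000) + "!"

    start := time.Now()
    fmt.Println("Matched:", re.MatchString(input))
    fmt.Println("Elapsed:", time.Since(start))
}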

Syntax

I’m not going to cover regex syntax here, since it’s well outside the scope of this chapter and is well-covered in a bunch of other places,20 but it’s generally the same as that used by Perl, Python, and other languages, with the notable absence of recursive functions. The regexp/syntax package documentation provides a solid overview though, if you’re interested.

Summary

As with so many of the other chapters in this book, the subject of this one is vast. Vaster than most, I think. As a consequence, a lot of things—pretty useful things—had to be left out.

But I’m proud of what I was able to cover: common vulnerabilities and how to defend against them, input validation and sanitization, authentication, and encryption, to name a few. Each of these could easily have been complete chapters—or even complete books—of their own.

And now, in the next (and final) chapter, Chapter 13, we’re going to start looking into another big subject: distributed data. I’ll do my best to do that subject justice.

1 Abhijit Naskar, The Gospel of Technology (Independently published, January 2020).

2 And dedicated they are—but also often chronically overworked and underfunded.

3 I would have printed it here if I could have, but licensing, ya know?

4 An argument can be made that all data is sensitive to a greater or lesser degree.

5 As a rule, user input generally shouldn’t be considered trusted.

6 The name “Atoi” stands for “ASCII to integer.” This name comes by way of the C language, of which it’s been a part since at least 1971.

7 Curiously, the html/template package does have a stripTags function, but it’s unexported.

8 At least in theory. Netflix knows this isn’t necessarily true, though.

9 Wikipedia maintains a very informative list of data breaches that contains about 500 entries spanning 20 years.

10 This one’s my favorite.

11 Or, perhaps, fortunately.

12 Michael B. Jones et al., “RFC 7519: JSON Web Token (JWT)”, IETF Datatracker, May 2015.

13 Which can be good fun, for a very specific definition of “fun”.

14 Which is essentially what the parser is doing.

15 Sadly, gRPC over TLS doesn’t have a fancy acronym like HTTPS does.

16 Probably many more less-popular ones, too.

17 At least it is if you don’t have a supercomputer and a couple thousand years.

18 Although this is less true (but still somewhat true) with Go 1.22. See “Secure Randomness in Go 1.22” by Russ Cox and Filippo Valsorda for more information on that.

19 Of course there is. There always is.

20 I rather like regular-expressions.info.

Chapter 13. Distributed State

A distributed system is one in which the failure of a computer you didn’t even know about can render your own computer unusable.1

Leslie Lamport, DEC SRC Bulletin Board (May 1987)

Distributed state is at the heart of cloud computing. Well-designed distributed systems can provide higher levels of fault tolerance and availability and can more easily scale horizontally to handle larger volumes of data and better share the load. However, distributed state is an advanced subject, so up until now, we’ve pretty much avoided the topic altogether.

You might recall that in Chapters 5 and 7 I discussed the problem of “state” in distributed computing, discussing at some length why state is the enemy of scalability. At the time, I suggested externalizing any shared state to a database, or even simply eliminating it altogether. That isn’t necessarily helpful advice because it isn’t always possible (or desirable).

This chapter seeks to remedy this injustice by directly addressing the difficult problem of “state” in cloud native systems. In this chapter, I’ll introduce the CAP theorem and its implications and offer some practical advice about how best to approach distributed state based on the performance priorities of your application. Finally, I’ll cover a few of the more common algorithms, and I’ll use one of those—Raft—to enhance our key-value store.

Distributed State Is Hard

It’s generally well-understood that implementing and managing distributed state can be a considerable challenge. But why exactly is it so hard? Well, lots of reasons, a few of which are listed here:

Data replication

Keeping data synchronized across multiple nodes introduces the challenge of maintaining consistency across multiple replicas. Strategies include synchronous and asynchronous replication, each with its own trade-offs. Inconsistent replicas can lead to data conflicts, which require robust conflict resolution mechanisms.

Data consistency

Maintaining a consistent view of data across multiple nodes can be challenging. There are different consistency models, which we’ll discuss in “Consistency Models”, that trade off between performance and reliability. Algorithms like Paxos or Raft are often used to ensure that all nodes agree on the data state, but these are complex and can introduce latency.

Network partitions

A network partition occurs when a network is divided into disjoint segments, preventing nodes in one segment from communicating with those in another. This can result in a “split-brain scenario” in which each partition operates independently, potentially leading to inconsistent states and conflicting actions across the system.

Scalability

Distributing data across multiple nodes (sharding) can help with scalability but introduces complexity in ensuring efficient query processing and data rebalancing. Ensuring an even distribution of data and queries across nodes is challenging and crucial for performance.

Operational complexity

Diagnosing and fixing issues in a distributed system is significantly more challenging due to the number and complexity of components and their interactions.

We’ve directly or indirectly touched on a few of these issues in other chapters (scalability, operational complexity), and others are entirely new (network partitions).

As you’ll see in this chapter, each of these issues can be overcome to some extent, but there are always trade-offs. As is so often the case in system design, mitigating one of these often means making sacrifices elsewhere. I’ll discuss each of these issues—and what can be done about them—in a little more detail in the following sections.

Theoretical Foundations

As much as we’d sometimes like to skip the theory, understanding the theoretical foundations is crucial for designing robust and reliable distributed applications. In this section, I’ll delve into the core concepts that underpin the management of distributed state, doing my best to offer some insights into the challenges and trade-offs involved in designing distributed systems.

The CAP Theorem

The CAP theorem, also known as Brewer’s theorem, is a fundamental principle in distributed systems that describes the trade-offs between three key properties: consistency, availability, and partition tolerance.

There’s a fair bit of subtlety and nuance to the CAP theorem, but the key takeaway is that no distributed system can guarantee more than two of the following three properties:

Consistency (C)

In a perfectly consistent system, every read receives the most recent write (or an error). In other words, all nodes see the same data at the same time. This ensures that any read operation returns the latest write value.

This degree of consistency comes at a cost, however, requiring complex synchronization mechanisms that can increase latency and reduce availability during network issues.

Availability (A)

In a perfectly available system, every read or write request receives a response, regardless of the success of that response. The system remains operational and responsive, even in the presence of some node failures.

Guaranteeing that the system is always responsive means sometimes serving stale data or sacrificing consistency, especially during network partitions.

Partition Tolerance (P)

A partition-tolerant system will continue to function even if an arbitrary number of messages are dropped or delayed by the network. Such a system can handle network failures or partitions without completely shutting down.

Handling network partitions means the system can continue to operate despite failures, but it must trade off between returning consistent data or ensuring every request is processed.

These properties are illustrated in Figure 13-1.

Figure 13-1. In a perfectly consistent (C) system, all nodes agree on the value of a piece of data (2, in this example). In a perfectly available (A) system, every request gets a response, but it’s possible to receive stale data (4). A partition-tolerant (P) system will continue to function even if the network drops or delays an arbitrary number of messages.

Essentially, the CAP theorem asserts that in the presence of a network partition a distributed system must choose between consistency and availability. The clear consequence of this is that it’s simply not possible to create a system that can achieve all three of these properties simultaneously, forcing system designers to make trade-offs between them.

But two out of three isn’t bad, and each of the resulting pairs of guarantees (CP, AP, and CA) has its own properties and consequences:

Consistency and partition tolerance (CP)

CP systems ensure consistency even in the presence of network partitions but may sacrifice availability. In practice this often means that the system can refuse to respond to some requests—becoming less available—to ensure data consistency.

Examples of CP systems include distributed databases such as Google Spanner that use consensus algorithms2 to ensure consistency but that delay responses during partitions.

Availability and partition tolerance (AP)

AP systems are designed to remain available and to tolerate network partitions but may return stale or inconsistent data during partitions.

Examples include NoSQL databases like Apache Cassandra and DynamoDB that provide high availability and tolerate partitions but allow for eventual consistency.

Consistency and availability (CA)

In a CA system, if there’s a network partition, the system must become unavailable to maintain consistency. However, since network partitions are inevitable, this is usually impractical for distributed systems.

The CAP theorem provides a robust framework for understanding the limitations and compromises implicit in designing distributed systems. Understanding it and its implications is essential for making informed decisions about the trade-offs between consistency, availability, and partition tolerance when designing distributed systems.

Consistency Models

A distributed system’s consistency model is the set of rules and guarantees that describe the behavior of its reads and writes. It dictates how and when changes made to data by one node become visible to other nodes, and therefore to users or applications interacting with the system.

There are many kinds of consistency models, but most can be classified into one of two broad types based on the visibility of updates that they provide and their subsequent trade-offs in terms of performance, availability, and fault tolerance.

Strong consistency

Strong consistency ensures that all operations on a data item are seen by all nodes in the same order, and that every read in a distributed system receives the most recent write. This type of model provides a straightforward and intuitive guarantee: once a write is acknowledged, any subsequent read will reflect that write, regardless of which node services the read.

Expressed in CAP terms, strongly consistent models prioritize consistency (C) and partition tolerance (P) over availability (A). During network partitions, the system might sacrifice availability to maintain consistency.

Strong consistency is typically implemented using protocols such as Paxos or Raft, which ensure that all nodes agree on a single history of operations. These protocols involve multiple rounds of communication and acknowledgments to achieve consensus, thereby guaranteeing that all nodes have a consistent view of the data.

Benefits

The first benefit to such a data model is obvious: it’s relatively easy to reason about the state of the system, since it effectively behaves like a single-node system. Furthermore, it ensures a high degree of data integrity and correctness, which is crucial for systems that require precise consistency, like financial transactions, inventory systems, and critical configuration management.

Trade-offs

Naturally there are trade-offs for having such a high degree of consistency. First, the higher degree of coordination necessary between nodes leads to generally higher latencies. Second, during network partitions, strong consistency may sacrifice availability, as operations cannot complete until all nodes can agree on the state. Finally, the overhead of maintaining a global order and immediate visibility can limit the system’s scalability.

Weak consistency

Weak consistency offers more flexible guarantees compared to strong consistency, but what it loses in immediate consistency it gains in higher availability and lower latency. Under weak consistency, updates to a data item may not be immediately visible to all nodes, and the order of operations may not be globally agreed upon.

In other words, weakly consistent models prioritize availability (A) and partition tolerance (P) over consistency (C).

Weak consistency models are used in systems like DynamoDB and Cassandra in which operations are acknowledged by a subset of nodes (a quorum), providing a balance between consistency and performance. Over time, background processes ensure that all nodes eventually receive updates, achieving eventual consistency, which guarantees that, given enough time, all replicas will converge to the same value.

Benefits

Weak consistency may not sound like a great choice compared to the robustness provided by strong consistency, but weak consistency does have some pretty significant virtues of its own. First, since replicas don’t have to coordinate quite so extensively, weakly consistent systems are more scalable across large systems, and they often have markedly lower latency and higher throughput, especially at scale. Second, since each node is capable of operating independently, weakly consistent systems can remain highly available during network partitions.

Trade-offs

The most visible trade-off for achieving a high data availability is the potential for data staleness, in which applications may temporarily observe stale or inconsistent data. This makes weak consistency a less suitable choice for use cases that require immediate correctness.

Weak consistency requires an additional degree of complexity, in that application developers need to handle the potential for inconsistent reads and may need to implement additional logic to manage data convergence.

Eventual consistency

Eventual consistency is a specific type of weak consistency that guarantees, given enough time and no new updates, that all replicas in the distributed system will converge to the same value. This guarantee of convergence is the key difference between eventual consistency and other forms of weak consistency, which don’t necessarily guarantee that all replicas will ever converge to the same value.

In an eventually consistent system, updates are propagated in the background, and replicas update their state asynchronously. However, while all updates will (eventually) be seen by all replicas, reads might return outdated data until the system converges, just as with any form of weak consistency.

The most common example of an eventually consistent system is DNS, in which updates propagate through the network, until eventually all servers reflect the new data.

Comparing weak and strong consistency

Distributed designs have to do their best to strike a balance between consistency and availability, where stronger consistency provides more of the former, and weaker consistency the latter. The balance you choose, of course, should be determined by the specific needs of your application and its use case.

Stronger consistency is appropriate where data integrity and correctness are critical, or when the application would especially benefit from being able to provide a relatively simple, straightforward view of the system’s state.

Weaker consistency, on the other hand, is ideal when high availability and low latency are required, particularly if the application can tolerate temporary inconsistencies. It’s also especially useful when a system needs to scale efficiently across a large number of nodes.

Data Replication

Data replication is the practice of creating and storing multiple copies of data in multiple locations. It’s also one of the hallmarks of distributed systems.3 Data replication is a fundamental technique for building reliability and scalability into a distributed system, and a properly designed data replication strategy can provide much-needed availability, fault tolerance, and performance.

There are two broad categories of data replication, synchronous replication and asynchronous replication, which are illustrated in Figure 13-2. Each has its own set of trade-offs that impact the system’s consistency, availability, and partition tolerance.

Figure 13-2. Synchronous versus asynchronous replication.

This image illustrates the difference between the two replication methodologies, and I further elaborate on them next.

Synchronous replication

Synchronous replication strategies ensure that data is written to multiple nodes before the write operation is considered complete. This method guarantees that all replicas are consistent at any given time, providing strong consistency.

One of the consequences of synchronous replication is strong consistency at the possible expense of availability, so its use cases overlap with those of strong consistency. Examples include financial transactions that depend on data consistency, and inventory systems in which real-time updates are necessary to prevent overselling.

One of the clearest advantages of synchronous replication is that since all nodes have the same data (essentially) at the same time, read operations will always return the latest data. What’s more, in the event of a node failure, any data is immediately available on other nodes, which ensures a degree of high availability.

Unfortunately, synchronous replication, by design, requires that any mutating operations must wait for acknowledgment from all nodes, leading to higher latency on average. This effect only increases as the number of nodes increases, impacting scalability.

Asynchronous replication

Asynchronous replication allows the primary node to complete the write operation without waiting for acknowledgments from the secondary nodes. The data is eventually replicated to the other nodes, leading to eventual consistency.

Asynchronous replication trades consistency for availability, and it’s typically used in systems where eventual consistency is acceptable and low latency is preferred, such as social media feeds or content delivery networks (CDNs).

It’s easy to see why: write operations in a system that uses asynchronous replication tend to be faster because they don’t have to wait for acknowledgments from other nodes, and they can generally also handle a higher rate of write operations.

The downside, though, is that there’s a time window where replicas may not have the latest data, leading to potential inconsistencies. There’s also a risk of data loss if a node fails before its data is replicated.

Replication and the CAP theorem

Data replication strategies must balance the trade-offs described by the CAP theorem. Synchronous replication leans toward consistency and partition tolerance, while asynchronous replication favors availability and partition tolerance.

In cloud native applications, the choice between synchronous and asynchronous replication depends on the specific requirements of the application. For critical data where consistency cannot be compromised, synchronous replication is preferred. For applications requiring high availability and performance, asynchronous replication is more suitable.

Hybrid approaches

Many modern distributed systems use a hybrid approach, combining both synchronous and asynchronous replication to balance the trade-offs. For instance, they might use synchronous replication within a data center for strong consistency and asynchronous replication across data centers to ensure global availability.

Common Distributed Algorithms

As difficult as designing and building distributed systems is, a number of algorithms have been developed to solve certain problems that such systems routinely face. These include how to get all replicas to agree on something (consensus), or how to reliably disseminate a value or status to all replicas (status dissemination).

In this section we’ll review a few common algorithms for each of these problems. We’ll cover how they’re used and their relative strengths and weaknesses.

Consensus Algorithms

Consensus algorithms are used to make sure that multiple nodes agree on a given value by ensuring that all nonfaulty nodes in the system agree on a sequence of operations used to arrive at that value, even in the presence of faults. In effect, they allow a collection of machines to work as a coherent group.

Consensus algorithms are typically designed to always return a correct result even in the presence of conditions such as network delays and partitions (safety), and to remain fully functional so long as any majority—a quorum—of nodes are functioning correctly and can communicate with one another and with clients (availability).

Note

Requiring a quorum—a minimum number of “votes”—for a distributed transaction to be allowed provides fault tolerance, helps ensure consistency, and supports availability by keeping a small number of slow servers from impacting overall system performance.

At the heart of most consensus algorithms is a replicated state machine, in which each node maintains identical copies of the inputs that brought it to its current state.

Replicated state machines are typically implemented using a replicated log, as shown in Figure 13-3. Each node maintains an event log that stores a series of operations, which its state machine4 executes in order. Every log on every node contains the same operations in the same order, so each state machine processes the same sequence of operations, and therefore arrives at the same state.

It’s the job of the consensus algorithm to keep the replicated logs consistent across nodes, even if some servers experience failures. This coordination makes the servers work together as a single, reliable state machine, providing consistent outputs to clients.

Figure 13-3. Replicated state machine architecture. The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.
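To see why this works, consider the following toy sketch (purely illustrative, and distinct from the key-value store example later in this chapter): two nodes that apply the same log of operations, in the same order, necessarily end up in the same state.

package main

import "fmt"

// op is a toy state machine input: an operation, a key, and a value.
type op struct {
    kind, key, value string
}

// machine is a toy replicated state machine whose state is determined
// entirely by the sequence of operations applied to it.
type machine struct {
    state map[string]string
}

func (m *machine) apply(o op) {
    switch o.kind {
    case "put":
        m.state[o.key] = o.value
    case "delete":
        delete(m.state, o.key)
    }
}

func main() {
    // The replicated log: the same operations, in the same order, on every node.
    log := []op{
        {"put", "color", "blue"},
        {"put", "color", "green"},
        {"put", "shape", "square"},
        {"delete", "shape", ""},
    }

    nodeA := &machine{state: map[string]string{}}
    nodeB := &machine{state: map[string]string{}}

    for _, o := range log {
        nodeA.apply(o)
        nodeB.apply(o)
    }

    fmt.Println(nodeA.state) // map[color:green]
    fmt.Println(nodeB.state) // map[color:green]
}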

Paxos

The Paxos algorithm was invented by the great Leslie Lamport in 1989 and was effectively the consensus algorithm prior to the development of Raft, which we’ll discuss in “Raft”. Like other consensus algorithms, Paxos was designed to make it possible for a distributed set of computers, such as a cluster of database nodes, to achieve agreement over an asynchronous network.

The algorithm operates through phases involving proposers, acceptors, and learners: proposers suggest values, acceptors agree on a single value, and learners learn the chosen value. Paxos ensures that once a value is chosen, it remains consistent even in the presence of network failures or message delays. This robustness has made Paxos the cornerstone of consensus protocols in distributed systems, influencing most subsequent implementations and becoming the standard in teaching consensus algorithms.

However, while Paxos has the virtue of being provably correct,5 it’s also notoriously difficult to understand. Its intricate phase structure and the need to handle various failure scenarios make it difficult to comprehend and implement. Even with numerous efforts to simplify its presentation, both system builders and students find Paxos incredibly challenging. Moreover, adapting Paxos for practical systems requires significant architectural changes, adding to its complexity.

We’re not going to dive into the algorithm itself here—there are plenty of other places for that,6 and we have only so much space—but you should certainly know that it exists.

Raft

Raft was developed in 2014 by Diego Ongaro and John Ousterhout at Stanford University,7 primarily as an alternative to Paxos, whose utility, they argued, was greatly hindered by its near incomprehensibility. To them, it was important not just for the algorithm to work but for it to be clear why it works. Apparently, they were right, because in the years since its publication, Raft has largely replaced Paxos in new development.

Like Paxos, Raft manages a replicated log in a distributed system, ensuring consistency across nodes, and operations are finalized only when a quorum of nodes confirms them.

Raft divides the consensus process into three main tasks: leader election, log replication, and ensuring safety. The leader handles all client requests, replicates log entries to followers, and commits them once a majority acknowledges receipt. Raft ensures that committed entries are consistent across the cluster, maintaining fault tolerance and data reliability.

Other consensus algorithms

While Paxos and Raft are widely known and used, a variety of other consensus algorithms have emerged over the years to address various needs in distributed systems.

One such approach is Byzantine fault tolerance (BFT), exemplified by the Practical Byzantine Fault Tolerance (PBFT) algorithm.8 PBFT was introduced in the late 1990s by Barbara Liskov and Miguel Castro at MIT to handle scenarios where nodes can act maliciously or arbitrarily to introduce faults, making it a good option for environments like blockchains where security and reliability are paramount.

Zab (ZooKeeper Atomic Broadcast)9 is another important consensus protocol, which was specifically designed for the ZooKeeper coordination service. Zab ensures high availability and reliability by maintaining a consistent and ordered log of state updates across a cluster of servers, making it particularly well suited for managing distributed applications that require strong consistency guarantees. Zab operates through a leader-based protocol in which a single leader processes client requests and broadcasts state changes to follower nodes.

Google offers a distributed, highly available lock service known as Chubby,10 which uses a modified Paxos protocol to manage distributed locks and provide a readily available and consistent locking service. Chubby simplifies the implementation of consensus by focusing on a specific use case, making it an effective tool for ensuring consistency in distributed applications that require strong coordination and synchronization.

Additionally, the Viewstamped Replication (VR) protocol,11 which predates Raft, provides a method for maintaining consistent replicas of a state machine in the face of node failures. VR employs a primary-backup model similar to Raft but with different mechanisms for handling view changes and ensuring consistency.

Finally, the Egalitarian Paxos (EPaxos) algorithm,12 developed by Iulian Moraru, David G. Andersen, and Michael Kaminsky at Carnegie Mellon University, extends Paxos to optimize for wide-area networks by allowing commands to be committed with fewer round trips between nodes. EPaxos achieves lower latency and higher throughput in geographically dispersed systems, addressing some of the performance limitations of traditional Paxos.

This list is far from exhaustive, but it provides a fairly reasonable overview of the variety of algorithms available, each with its own unique features, advantages, and trade-offs tailored to particular applications and environments.

Status Dissemination Techniques

Status dissemination refers to the methods and protocols used to share and update the state of various nodes within a network. Though it gets less attention than the various consensus algorithms, status dissemination is crucial: each node needs up-to-date information about the status of the other nodes. Effective status dissemination enables efficient coordination, failure detection, and load balancing across the system, allowing it to respond dynamically to changes and to maintain consistent operation.

Because different distributed systems have different needs, various techniques have been developed over the years to address the challenges of status dissemination. These techniques, presented in Figure 13-4, vary in their approaches, scalability, and resource requirements, making each suitable for different types of applications and network architectures.

Figure 13-4. The four most common techniques for status dissemination.

In the following sections, we’ll explore the four key status dissemination techniques illustrated in Figure 13-4.

Heartbeating

Heartbeating is a technique in which nodes periodically send “heartbeat” messages to a central coordinator or to other nodes to signal that they’re still alive and functioning. If a heartbeat isn’t received within a specified interval, the node is considered failed. This method is straightforward and provides timely updates about node status, making it easy to detect failures quickly. Heartbeating is often used in systems where timely detection of node failures is critical, although the sheer number of messages involved in larger systems can generate significant communication overhead.
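A minimal sketch of the idea, assuming a single monitored node and an invented 250 ms failure threshold, might look something like this:

package main

import (
    "fmt"
    "time"
)

func main() {
    heartbeats := make(chan struct{})
    const interval = 100 * time.Millisecond // how often the node reports in
    const timeout = 250 * time.Millisecond  // how long before we give up on it

    // The monitored node: sends a few heartbeats, then "crashes."
    go func() {
        for i := 0; i < 5; i++ {
            time.Sleep(interval)
            heartbeats <- struct{}{}
        }
    }()

    // The monitor: every heartbeat resets the timeout; if the timeout
    // fires first, the node is considered failed.
    for {
        select {
        case <-heartbeats:
            fmt.Println("heartbeat received; node is alive")
        case <-time.After(timeout):
            fmt.Printf("no heartbeat within %v; node considered failed\n", timeout)
            return
        }
    }
}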

Polling

Polling involves actively querying nodes for their status at regular intervals. A central coordinator or leader node sends requests to each node, which then (if it’s healthy) responds with its current state. This technique provides explicit and reasonably up-to-date information about each node’s status, making it easy to manage and reason about. However, polling can be resource-intensive and may not scale well in large distributed systems due to the high communication overhead and the potential for delayed responses in busy networks.
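Here’s a small illustrative sketch of the pattern, with node statuses faked in memory rather than queried over the network (a real coordinator would call something like an HTTP health endpoint):

package main

import (
    "fmt"
    "time"
)

// node fakes a pollable member of the cluster.
type node struct {
    name    string
    healthy bool
}

func main() {
    nodes := []node{
        {"node1", true},
        {"node2", true},
        {"node3", false},
    }

    // Poll every node at a fixed interval and record its reported status.
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()

    for round := 1; round <= 3; round++ {
        <-ticker.C
        for _, n := range nodes {
            if n.healthy {
                fmt.Printf("round %d: %s is healthy\n", round, n.name)
            } else {
                fmt.Printf("round %d: %s did not respond\n", round, n.name)
            }
        }
    }
}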

Gossip

Gossip protocols, also known as epidemic protocols, are decentralized methods for disseminating status information across a network. Inspired by the way rumors spread in social networks, gossip protocols involve each node periodically selecting a few random peers and sharing its current state with them. This information exchange continues iteratively, allowing the status to propagate throughout the network.

Gossip protocols are highly scalable and robust, making them suitable for large, dynamic systems where nodes frequently join and leave the network. Their decentralized nature ensures that there’s no single point of failure, and they can tolerate message losses and network partitions gracefully. However, they can lead to increased network traffic and resource consumption due to the frequent and redundant message exchanges between nodes, which can affect performance and scalability, especially in large-scale systems.
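The following toy sketch shows a single piece of status information spreading epidemically: each round, every node forwards everything it knows to a couple of randomly chosen peers. It’s purely illustrative; real gossip protocols also handle versioning, failure detection, and anti-entropy.

package main

import (
    "fmt"
    "math/rand"
)

// node is a toy gossip participant that remembers every status it has heard.
type node struct {
    known map[string]string
}

func main() {
    const fanout = 2 // peers each node gossips to per round

    // Ten nodes; initially only node 0 knows that node-7 is unhealthy.
    nodes := make([]*node, 10)
    for i := range nodes {
        nodes[i] = &node{known: map[string]string{}}
    }
    nodes[0].known["node-7"] = "unhealthy"

    // Gossip until every node has heard the news.
    for round := 1; ; round++ {
        for _, n := range nodes {
            for i := 0; i < fanout; i++ {
                peer := nodes[rand.Intn(len(nodes))]
                for k, v := range n.known {
                    peer.known[k] = v
                }
            }
        }

        informed := 0
        for _, n := range nodes {
            if len(n.known) > 0 {
                informed++
            }
        }
        fmt.Printf("after round %d, %d of %d nodes know the status\n",
            round, informed, len(nodes))

        if informed == len(nodes) {
            return
        }
    }
}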

Hierarchical

Hierarchical status dissemination arranges nodes into a hierarchy so that status information can be efficiently managed and propagated. In this approach, nodes are typically grouped into clusters, with each cluster having a leader or coordinator responsible for aggregating and disseminating status information both within that cluster and to higher levels of the hierarchy.

This technique reduces communication overhead by limiting the number of status messages exchanged, making it more scalable for large systems. It also enables more efficient failure detection and recovery, as the hierarchical structure allows for localized management of status information. The downside, however, is that the process of arranging nodes into a hierarchy can be complex and requires some careful planning.
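As a purely illustrative sketch of why this reduces chatter, consider cluster leaders that aggregate their members’ statuses and report only a summary upward:

package main

import "fmt"

// cluster is a toy group of nodes with a leader that aggregates status.
type cluster struct {
    leader  string
    members map[string]bool // member name -> healthy?
}

// summarize is what a leader reports up the hierarchy: one message per
// cluster instead of one message per node.
func (c cluster) summarize() string {
    healthy := 0
    for _, ok := range c.members {
        if ok {
            healthy++
        }
    }
    return fmt.Sprintf("%s reports %d/%d members healthy",
        c.leader, healthy, len(c.members))
}

func main() {
    clusters := []cluster{
        {"leader-east", map[string]bool{"e1": true, "e2": true, "e3": false}},
        {"leader-west", map[string]bool{"w1": true, "w2": true}},
    }

    // The root coordinator sees only the per-cluster summaries.
    for _, c := range clusters {
        fmt.Println(c.summarize())
    }
}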

Distributing Our Key-Value Store

Remember our key-value store that we first built in Chapter 5 and have extended on and off throughout the book? Well, we set it aside for a couple of chapters, but we’re going to bring it back here to demonstrate exactly how one goes about distributing a simple service.

Recalling “The CAP Theorem”, we know that the first thing we need to do is decide on our CAP priorities. As a key-value store, we probably want to be certain that when a value is written, it’s truly and surely written, so consistency is important to us. We also know that any distributed system has to keep partition tolerance in mind. So, what we’re looking to build is a CP system, which means that a consensus algorithm like Raft, which we briefly mentioned in “Raft”, is an ideal solution.

Adding Some Raft

Conceptually, this will change our service model slightly.

Looking back at Chapter 5, our architecture was pretty straightforward: an outer (user-facing) RESTful interface receives any requests, sends them to an inner API that does the actual work, and returns a response that’s presented back to the user.

Converting to use Raft doesn’t change this architecture all that much. In fact, we can leave the RESTful component largely unchanged. The inner API, however, does change. Specifically, rather than manipulating its internal map directly, its data mutation methods will send commands to a consensus module. Only when the consensus module determines that a quorum of nodes agrees on a command will we apply the actual change to our internal data. In other words, we have to make use of the replicated state machine architecture shown in Figure 13-3.

This sounds like a complicated change, and it certainly is. Fortunately, there’s a package out there that allows us to add Raft functionality to a service: the hashicorp/raft package. It’s high quality, and it’s used in HashiCorp’s Consul and Nomad products, so we know that it’s battle-tested.

Note

Version 1.7 (the latest release as of this writing) is MPL-licensed, though future versions are subject to change. It’s important to note this because HashiCorp recently changed the licensing for its products but not for its libraries.

Let’s take a look at how we can make use of this package.

Defining the State Machine

The first thing we need to define is our finite state machine, which will be the heart of our distributed key-value store. As mentioned in “Consensus Algorithms”, all copies of a replicated state machine maintain an identical list of the inputs that brought them to their current state.

Because its job is to maintain the state of the store, it’s analogous to the store data structure that we created back in Chapter 5:

type Store struct {
    sync.Mutex
    m    map[string]string
    raft *raft.Raft
}

As you can see, the structure of the Store struct is almost identical to the store data structure: it’s still pretty much just a lockable struct wrapped around a map[string]string. In addition, there’s a new *raft.Raft value, which will act as the access point into the Raft consensus module, as described in Figure 13-3.

The Get method is effectively unchanged relative to the version in Chapter 5. It still accepts a string representing the key, and it returns a string value and an error:

// Get returns the value for the given key.
func (s *Store) Get(key string) (string, error) {
    s.Lock()
    defer s.Unlock()

    return s.m[key], nil
}

As you’ll see next, however, the Put and Delete methods are quite different.

Specifically, rather than applying mutating changes directly to the map, operation instructions are passed to the Raft framework—the consensus module—via s.raft.Apply, which in turn writes the operation to the replicated log:

type command struct {
    Op    string
    Key   string
    Value string
}

// Put sets the value for the given key.
func (s *Store) Put(key, value string) error {
    c := command{Op: "put", Key: key, Value: value}
    b, err := json.Marshal(c)
    if err != nil {
        return err
    }

    f := s.raft.Apply(b, time.Second*10)

    return f.Error()
}

// Delete deletes the given key.
func (s *Store) Delete(key string) error {
    c := command{Op: "delete", Key: key}
    b, err := json.Marshal(c)
    if err != nil {
        return err
    }

    f := s.raft.Apply(b, time.Second*10)

    return f.Error()
}

Operations that are sent to the consensus module via s.raft.Apply are represented as []byte slices, so it’s up to us to properly encode them. To do this, we define a command struct that we can use to represent any operation instruction.

To send a command to the log, we need only marshal a command to bytes; conversely, receiving a command from the log will mean that we just have to unmarshal the bytes back into a command value.

So now you may be wondering how mutating changes are applied to the internal state. To do that, we will implement our own version of the Apply method, which we’ll cover in the next section.

The Apply Method

Back in “Your Super Simple API”, we had separate functions to perform Put, Get, and Delete operations. To use the hashicorp/raft package, however, our service now has to comply with the contract of the raft.FSM interface, whose central requirement is that all of our mutating operations funnel through a single method: Apply.

The relevant part of the interface is defined as follows:

type FSM interface {
    // Apply is called once a log entry is committed by
    // a majority of the cluster.
    Apply(*Log) interface{}
}

The Apply method is called when a log entry is committed by a majority of the cluster. When called, it receives an operation from the replicated log (the Data field of its *raft.Log parameter, a []byte slice), which it then applies to its local state. The value it returns is made available to the caller via the Response method of the ApplyFuture returned by raft.Apply.

Warning

Because the Apply method is responsible for applying all mutating operations, it’s critical that it be deterministic and produce the same result on all peers in the cluster.

Fortunately, we know how the contents of this []byte slice are structured, since we were the ones to structure it in the Put and Delete methods we described in the previous section. We just have to unmarshal the bytes back into a command value.

Putting this together gives us our Apply method, which replaces the Put and Delete functions we defined back in Chapter 5:

func (s *Store) Apply(log *raft.Log) any {
    // Unmarshal the log data into a command value
    var cmd command

    if err := json.Unmarshal(log.Data, &cmd); err != nil {
        return err
    }

    s.Lock()
    defer s.Unlock()

    // Make the appropriate change based on the operation
    switch cmd.Op {
    case "put":
        s.m[cmd.Key] = cmd.Value
    case "delete":
        delete(s.m, cmd.Key)
    }

    return nil
}

Our Apply method is fairly straightforward in its construction: it’s called by the Raft framework whenever a log entry—represented as *raft.Log value—is committed by a majority of the cluster. It then unmarshals the operation data from the log entry as a command and applies that operation to the local state.
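One caveat before moving on: in addition to Apply, the raft.FSM interface also declares Snapshot and Restore methods, so our Store will need at least placeholder implementations of both to satisfy the interface. Real snapshot and restore logic is beyond what we’ll cover here (it comes up again in “Future Directions”), but a minimal, do-nothing sketch might look like the following. Note that the fsmSnapshot type is purely a placeholder of my own, and the snippet assumes the io package is imported:

// fsmSnapshot is a placeholder raft.FSMSnapshot; a real implementation
// would write the contents of the store to the sink in Persist.
type fsmSnapshot struct{}

func (f *fsmSnapshot) Persist(sink raft.SnapshotSink) error { return sink.Close() }

func (f *fsmSnapshot) Release() {}

// Snapshot satisfies raft.FSM. A real implementation would capture a
// point-in-time copy of s.m here.
func (s *Store) Snapshot() (raft.FSMSnapshot, error) {
    return &fsmSnapshot{}, nil
}

// Restore satisfies raft.FSM. A real implementation would rebuild s.m
// from the snapshot data read from rc.
func (s *Store) Restore(rc io.ReadCloser) error {
    return rc.Close()
}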

Setting Up the Raft Node

Now that we’ve updated our API by integrating it with the Raft consensus module, we have to initialize the Raft module itself. To do that, we need to define the following:

The Raft configuration

Contains settings that define a variety of behaviors, such as the protocol version and various timeouts.

The log and stable stores

Maintain the state and durability of the system by storing the state machine log and persisting Raft metadata, respectively.

The transport layer

Responsible for handling communication between Raft nodes, including message exchange, network configuration, and network error handling.

The following code sets up all of these and uses them to initialize a new Raft node. It includes several steps, but we’ll discuss them in detail after the example:

package main

import (
    "fmt"
    "net"
    "os"
    "path/filepath"
    "time"

    "github.com/hashicorp/raft"
    raftboltdb "github.com/hashicorp/raft-boltdb/v2"
)

var (
    // The identifier for the Raft node
    localID string = "node1"

    // The network address to which the Raft node will bind; note that it
    // can't share a port with the HTTP server we'll start later in main
    raftBind string = "127.0.0.1:9000"
)

// Initializes the Raft node.
func (s *Store) Open() error {
    // Create a default Raft configuration and set the local node ID
    config := raft.DefaultConfig()
    config.LocalID = raft.ServerID(localID)

    // Create a BoltDB-backed log store at the specified path
    logStore, err := raftboltdb.NewBoltStore(
        filepath.Join("raft", "raft.db"))
    if err != nil {
        return fmt.Errorf("failed to create log store: %w", err)
    }

    // Create a BoltDB-backed stable store
    stableStore, err := raftboltdb.NewBoltStore(
        filepath.Join("raft", "stable.db"))
    if err != nil {
        return fmt.Errorf("failed to create stable store: %w", err)
    }

    // Create a file-based snapshot store in the "raft" directory,
    // with a maximum of 1 snapshot retained
    snapshots, err := raft.NewFileSnapshotStore("raft", 1, os.Stdout)
    if err != nil {
        return fmt.Errorf("failed to create snapshot store: %w", err)
    }

    // Resolve the TCP address specified by raftBind
    addr, err := net.ResolveTCPAddr("tcp", raftBind)
    if err != nil {
        return fmt.Errorf("failed to resolve TCP address: %w", err)
    }

    // Create a TCP transport for Raft communication, with a connection pool
    // of 3 and a timeout of 10 seconds
    transport, err := raft.NewTCPTransport(raftBind, addr, 3,
        10*time.Second, os.Stdout)
    if err != nil {
        return fmt.Errorf("failed to create transport: %w", err)
    }

    // Initialize a new Raft node with the given configuration, state machine,
    // log store, stable store, snapshot store, and transport
    raftNode, err := raft.NewRaft(config, s, logStore, stableStore,
        snapshots, transport)
    if err != nil {
        return fmt.Errorf("failed to create Raft node: %w", err)
    }

    // If the Raft node is not already part of a cluster, bootstrap the
    // cluster with the current node as the initial server
    if raftNode.Leader() == "" {
        config := raft.Configuration{
            Servers: []raft.Server{
                {
                    ID:      raft.ServerID(localID),
                    Address: transport.LocalAddr(),
                },
            },
        }
        raftNode.BootstrapCluster(config)
    }

    // Assign the created Raft node to the Store struct
    s.raft = raftNode

    return nil
}

There’s a lot going on in this method, so we’ll go over it point-by-point:

  1. The first thing we do in this method is create the Raft configuration, which specifies a variety of important values like the protocol version and various timeouts. For the most part, the default configuration provided by raft.DefaultConfig is perfectly adequate for our needs. Importantly, though, we do update it with the local node ID.

  2. The next two operations set up the log and stable stores, which are used to store the state log and Raft metadata, respectively. For these we use the very handy hashicorp/raft-boltdb package, which provides a BoltDB-backed store implementation.

    This example uses hardcoded locations for both stores, but if you’re running multiple nodes at the same location (such as for testing), they’ll have to be different for each node.

  3. Next we create a file-based snapshot store, which manages snapshots of the state machine on the local disk. This allows quicker recovery of nodes after failures and also helps to reduce log size by capturing the state at specific points and allowing old log entries to be discarded.

  4. The next two steps resolve the address specified by raftBind into a *net.TCPAddr value, which we use to create a TCP transport for Raft communication. In this example, we specify a connection pool of 3 and a timeout of 10 seconds. Again, if you’re running multiple nodes, the bind address will need to vary for each node.

  5. Next we initialize our Raft node with the given configuration, state machine, log store, stable store, snapshot store, and transport. For our efforts, we receive a *raft.Raft value, which will serve as the access point into the Raft node for the lifetime of the service.

  6. Now that we have our *raft.Raft value, we can determine whether this is the first node in the cluster. If it’s not already part of a cluster, we bootstrap the cluster with the current node as the initial server.

  7. And finally, we can assign the created Raft node to the Store struct, and rest with the satisfaction of a job well-done.

Note

This example is greatly simplified by hardcoding values that should vary per node, including the node name and binding address.

To summarize, the Open method creates the Raft configuration and updates it with the local node ID; sets up the log and stable stores using the hashicorp/raft-boltdb package; creates a file-based snapshot store for state management; configures a TCP transport for Raft communication; initializes the Raft node with all of these components; bootstraps the cluster if this is its first node; and, finally, assigns the Raft node to the Store struct.
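As the earlier note points out, values like the node ID and bind address really should vary per node. A minimal sketch of one way to do that, using the standard flag package (the flag names and defaults here are purely illustrative), might look like this:

package main

import (
    "flag"
    "fmt"
)

// A purely illustrative sketch of per-node configuration: instead of
// hardcoding localID and raftBind, each node supplies them at startup.
var (
    localID  = flag.String("id", "node1", "unique identifier for this Raft node")
    raftBind = flag.String("raft-bind", "127.0.0.1:9000", "bind address for the Raft transport")
)

func main() {
    flag.Parse()
    fmt.Printf("starting %s with Raft transport on %s\n", *localID, *raftBind)
    // ...pass *localID and *raftBind into Open when initializing the store...
}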

Defining the API

If you’ve gotten this far, you can congratulate yourself with the knowledge that the hardest parts are complete. All that’s left is to create the RESTful endpoints and put everything together.

To do this, we define our service as a simple struct that contains only a reference to a Store value, which you may recall is where we implemented the core service methods:

type Service struct {
    store *Store
}

Okay, that was easy, I guess.

Now that we have this, we can create the equivalents to the handler functions that we first defined in “Building an HTTP Server with gorilla/mux”:

// getHandler is the handler function for HTTP requests to get a value
// from the key-value store.
func (s *Service) getHandler(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)
    key := vars["key"]

    value, err := s.store.Get(key)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    w.Write([]byte(value))
}

// putHandler is the handler function for HTTP requests to set a value
// to the key-value store.
func (s *Service) putHandler(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)
    key := vars["key"]

    value, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    defer r.Body.Close()

    if err = s.store.Put(key, string(value)); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    w.WriteHeader(http.StatusCreated)
}

Both of these are nearly identical to the versions we first put together in Chapter 5. Both retrieve the key from the request path, pass it to the appropriate Store method, and return an appropriate result.

An equivalent deleteHandler would be pretty much the same thing, and is left as an exercise for the reader.

Putting It All Together

Now that we have all the pieces, we can wrap them all up with a nice pretty bow.

All that’s left to do is create the main function, which will initialize all the components and start the HTTP server to interact with the key-value store:

func main() {
    store := &Store{m: map[string]string{}}
    if err := store.Open(); err != nil {
        slog.Error(err.Error())
        os.Exit(1)
    }

    service := &Service{store: store}

    // Create a new mux router
    r := mux.NewRouter()

    r.HandleFunc("/v1/{key}", service.getHandler).Methods("GET")
    r.HandleFunc("/v1/{key}", service.putHandler).Methods("PUT")

    if err := http.ListenAndServe(":8080", r); err != nil {
        slog.Error(err.Error())
        os.Exit(1)
    }
}

At this point, this is probably old news for you, so I’ll spare you the detailed analysis. I expect you’ve probably seen enough at this point anyway.

Future Directions

So, here’s the part where I deliver some bad news: this service isn’t truly distributed.

I’ve given you the foundation that you need to understand how Raft and the hashicorp/raft package work, but your service still needs a little more functionality, such as the ability to join an existing cluster or to persist and restore its state. Fortunately, there’s a more detailed and semi-official distributed example that you can use (which also happened to influence the example in this book).

Summary

Well, you made it. You’ve made it through the final chapter of the book.

Congratulations!

As challenging as it was, I loved writing this book. I loved this chapter in particular. The topics of distributed state and distributed systems are very dear to my heart.

We were able to both cover theory—the CAP theorem, consistency and replication models—and get some hands-on practice by building a Raft-enabled service. I would have really loved to cover all of it in more detail, but unfortunately, the field is absolutely vast, and I didn’t have nearly the amount of space or time I would have needed. Who knows, maybe I’ll write another book! The subject certainly deserves one!

1 Leslie Lamport, DEC SRC Bulletin Board, May 28, 1987.

2 Specifically, Spanner uses a variant of Paxos, though nobody outside of Google knows what the variations are.

3 It’s also a common source of headaches for the people who maintain those systems.

4 That’s academic-speak for your service and its central logic.

5 The worst kind of correct.

6 A great place to start would be Leslie Lamport’s own “Paxos Made Simple”.

7 Diego Ongaro and John Ousterhout, “In Search of an Understandable Consensus Algorithm”, Proceedings of the 2014 USENIX Annual Technical Conference, 2014.

8 Miguel Castro and Barbara Liskov, “Practical Byzantine Fault Tolerance”, Proceedings of the Third Symposium on Operating Systems Design and Implementation, February 1999.

9 “ZooKeeper Internals”, June 5, 2022.

10 Mike Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems”, 7th Symposium on Operating Systems Design and Implementation, November 2006, 335–350.

11 Brian M. Oki and Barbara H. Liskov, “Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems”, Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing, August 15–17, 1988, 8–17.

12 Iulian Moraru et al., “There Is More Consensus in Egalitarian Parliaments”, Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, November 3–6, 2013.

Index

Symbols

  • . . . (variadic) operator, Zero Values, Variadic Functions
  • := operator, Short Variable Declarations
  • <- (arrow) operator, Channels
  • _ (underscore) operator, The Blank Identifier

A

  • access control, Broken Access Control-Broken Access Control
  • actors, The Architecture
  • adapters, The Architecture
  • adaptive functions, Hashing passwords
  • administrative processes (Twelve-Factor App), XII. Administrative Processes-XII. Administrative Processes
  • AES (Advanced Encryption Standard), Symmetric encryption
  • allowlists, Hazardous characters-Hazardous characters
  • Amazon DynamoDB, Resilience
  • Amazon Web Services (AWS), The Story So Far
  • Andersen, David G., Other consensus algorithms
  • anonymous functions, Anonymous Functions and Closures-Anonymous Functions and Closures
  • any type, Zero Values, The any type
  • AP systems, The CAP Theorem
  • Apache Thrift, Common request-response implementations
  • API gateways, Serverless services
  • append function, Working with slices
  • appending transaction log entries, Appending entries to the transaction log-Appending entries to the transaction log, Using db.Exec to execute a SQL INSERT-Using db.Exec to execute a SQL INSERT
  • application configuration (see configuration)
  • application state, Generation 2: Persisting Resource State, Application State Versus Resource State
  • Apply method, The Apply Method-The Apply Method
  • arbitrary JSON, decoding, Decoding JSON-Decoding JSON
  • architectures (see service architectures)
  • Argon2 algorithm, Hashing Algorithms
  • arrays
    • declaring, Arrays-Arrays
    • definition of, Container Types: Arrays, Slices, and Maps
    • looping over, Looping over arrays and slices
    • slices and, Slices
  • arrow (<-) operator, Channels
  • asymmetric encryption, Asymmetric encryption-Putting it all together
  • asynchronous instruments, Asynchronous instruments-Asynchronous instruments
  • asynchronous messaging (see publish-subscribe messaging)
  • asynchronous replication, Asynchronous replication
  • attributes
    • of cloud native, What Is Cloud Native?, Achieving Dependability
      • loose coupling, Loose Coupling-Loose Coupling, Loose Coupling-Putting it all together
      • manageability, Manageability-Manageability, Manageability-The router function
      • observability, Observability-Observability, Observability-OpenTelemetry Logging
      • resilience, Resilience-Resilience, Resilience-Putting It Into Action
      • scalability, Scalability-Scalability, Scalability-Serverless services
    • of dependability, What Is Dependability and Why Is It So Important?-What Is Dependability and Why Is It So Important?
    • slog package, Attributes-Common attributes
    • for spans, Attributes
  • authentication, Authentication-Parsing and validating a token
    • definition of, Authentication
    • password-based, Password-Based Authentication-Hashing passwords
    • securing against failures, Authentication
    • token-based, Token-Based Authentication-Parsing and validating a token
    • types of, Authentication
  • autoinstrumentation, Autoinstrumentation-Getting the current span from context
  • autoscaling, Autoscaling-Autoscaling
  • availability, What Is Dependability and Why Is It So Important?, The CAP Theorem
  • AWS (Amazon Web Services), The Story So Far

B

  • backing services (Twelve-Factor App), IV. Backing Services
  • backoff algorithms, Backoff Algorithms-Backoff Algorithms
  • bcrypt algorithm, Hashing Algorithms
  • bearer tokens, Parsing and validating a token
  • Berners-Lee, Tim, Idempotence
  • best practices
    • application configuration, Configuration Good Practice-Configuration Good Practice
    • logging, Better Logging Practices-Dynamic sampling
  • BFT (Byzantine fault tolerance), Other consensus algorithms
  • binaries, adding to scratch files, Iteration 1: adding your binary to a FROM scratch image-Iteration 1: adding your binary to a FROM scratch image
  • bitwise XOR operator, Booleans
  • BLAKE2b algorithm, Hashing Algorithms
  • BLAKE2s algorithm, Hashing Algorithms
  • blank identifier, The Blank Identifier, Looping over arrays and slices
  • blocking, Channel blocking, Reduce blocking with buffered channels-Reduce blocking with buffered channels
  • bool type, Booleans
  • Boolean data type, Booleans
  • bottlenecks, The Four Common Bottlenecks-The Four Common Bottlenecks
  • Brazil, Brian, But which is better?
  • Brewer's theorem (see CAP theorem)
  • brute force attacks, Common password attacks
  • buffered channels, Channel buffering, Reduce blocking with buffered channels-Reduce blocking with buffered channels
  • bufio package, Struct embedding
  • bugs, Resilience, What Does It Mean for a System to Fail?
  • build stage (Twelve-Factor App), V. Build, Release, Run
  • build times, Fast Builds
  • byte type, Simple Numbers, Strings, Strings as slices
  • Byzantine fault tolerance (BFT), Other consensus algorithms

C

  • CA (certificate authority), Certificates, certificate authorities, and trust
  • CA systems, The CAP Theorem
  • cacheability, Advantages of Statelessness
  • caches (LRU), Efficient Caching Using an LRU Cache-Efficient Caching Using an LRU Cache
  • canonicalization, Double encoding and canonicalization-Double encoding and canonicalization
  • cap function, Slices
  • CAP theorem, The CAP Theorem-The CAP Theorem, Replication and the CAP theorem
  • cardinality, Metrics
  • cascading failures
    • causes of, Cascading Failures-Cascading Failures
    • preventing overload, Preventing Overload-Graceful service degradation
    • retry storms, Play It Again: Retrying Requests
  • case expressions, The switch Statement
  • catching signals, Catching signals-Catching signals
  • certificate authority (CA), Certificates, certificate authorities, and trust
  • certificate-based authentication, Authentication
  • certificates, Certificates, certificate authorities, and trust
  • chan keyword, Channels
  • changes to configuration files, Watching for configuration file changes-Watching OS filesystem notifications
    • OS filesystem notifications, Watching OS filesystem notifications-Watching OS filesystem notifications
    • polling for, Polling for configuration changes-Polling for configuration changes
    • reloadable files, Making your configuration reloadable-Making your configuration reloadable
    • Viper, Watching and rereading configuration files in Viper
  • channels
    • blocking, Channel blocking
    • buffered, Channel buffering, Reduce blocking with buffered channels-Reduce blocking with buffered channels
    • closing, Closing channels
    • concurrency with, Share memory by communicating
    • declaring, Channels
    • definition of, Channels
    • looping over, Looping over channels
    • select statement, Select-Implementing channel timeouts
    • timeouts, Implementing channel timeouts
    • unbuffered, Channel blocking
    • when to use, Efficient Synchronization
  • Cheney, Dave, Leaking goroutines
  • chord pattern, Chord-Sample code
  • Chubby, Other consensus algorithms
  • cipher modes, The encryption process
  • ciphers, Symmetric encryption
  • circuit breaker pattern, Circuit Breaker-Sample code, Circuit Breaking-Circuit Breaking
  • cleaning up resources, Clean Up Your Resources
  • client-side discovery, Service Discovery
  • close function, Closing channels
  • closing channels, Closing channels
  • closures, Anonymous Functions and Closures-Anonymous Functions and Closures
  • cloud native
    • attributes of, What Is Cloud Native?, Achieving Dependability
      • loose coupling, Loose Coupling-Loose Coupling, Loose Coupling-Putting it all together
      • manageability, Manageability-Manageability, Manageability-The router function
      • observability, Observability-Observability, Observability-OpenTelemetry Logging
      • resilience, Resilience-Resilience, Resilience-Putting It Into Action
      • scalability, Scalability-Scalability, Scalability-Serverless services
    • definition of, What Is a “Cloud Native” Application?, What Is Cloud Native?
    • evolution of, Why Is Cloud Native a Thing?-Why Is Cloud Native a Thing?
    • problems with programming languages, Features for a Cloud Native World
    • purpose of, What’s the Point of Cloud Native?
  • Cloud Native Computing Foundation, What Is Cloud Native?
  • Cobra package, The Cobra command-line parser-The Cobra command-line parser, Working with command-line flags in Viper
  • code divergence, X. Development/Production Parity
  • codebase (Twelve-Factor App), I. Codebase
  • cold start delay, The pros and cons of serverlessness
  • command-line arguments
    • application configuration, Configuring with Command-Line Arguments-The Cobra command-line parser
    • in Viper, Working with command-line flags in Viper
  • commons package, Common code, Import the hashicorp/go-plugin and commons packages
  • communicating sequential processes (CSP), CSP-Style Concurrency
  • communication security, Communication Security-gRPC Over TLS
  • comparable constraint, Generic Functions
  • comparing type equality, Maps
  • compatibility of Go versions, Linguistic Stability
  • compiling protocol buffers, Compiling your protocol buffers
  • complex numbers data types, Complex Numbers
  • complexity of distributed state, Distributed State Is Hard
  • composition, Composition and Structural Typing-Composition and Structural Typing
    • definition of, Structs, Methods, and Interfaces
    • interfaces, Interfaces-The any type
    • methods, Methods-Methods
    • structs, Structs-Structs
    • type embedding, Composition with Type Embedding-Directly accessing embedded fields
  • concurrency, CSP-Style Concurrency-CSP-Style Concurrency, The Good Stuff: Concurrency-Implementing channel timeouts
    • caching and, Efficient Caching Using an LRU Cache
    • channels, Channels-Implementing channel timeouts, Share memory by communicating
    • efficient synchronization, Efficient Synchronization-Minimizing locking with sharding
    • goroutines, Goroutines, Share memory by communicating
    • mutex implementation, Making Your Data Structure Concurrency-Safe-Integrating a read-write mutex into your application
    • mutexes, Minimizing locking with sharding
  • concurrency patterns, Concurrency Patterns-Sample code
    • chord, Chord-Sample code
    • fan-in, Fan-In-Sample code
    • fan-out, Fan-Out-Sample code
    • future, Future-Sample code
    • sharding, Sharding-Sample code
    • worker pool, Worker Pool-Sample code
  • configurable feature flags, Generation 2: The Configurable Flag-Generation 2: The Configurable Flag
  • configuration, Configuring Your Application-Setting defaults in Viper
    • best practices, Configuration Good Practice-Configuration Good Practice
    • with command-line arguments, Configuring with Command-Line Arguments-The Cobra command-line parser
    • with configuration files, Configuring with Files-Watching OS filesystem notifications
    • with environment variables, Configuring with Environment Variables-Configuring with Environment Variables
    • Twelve-Factor App, III. Configuration-III. Configuration, Configuring Your Application
    • with Viper, Viper: The Swiss Army Knife of Configuration Packages-Setting defaults in Viper
  • configuration files, Configuring with Files-Watching OS filesystem notifications
    • benefits of, Configuring with Files
    • changes to, Watching for configuration file changes-Watching OS filesystem notifications
      • OS filesystem notifications, Watching OS filesystem notifications-Watching OS filesystem notifications
      • polling for, Polling for configuration changes-Polling for configuration changes
      • reloadable files, Making your configuration reloadable-Making your configuration reloadable
    • data structure, Our configuration data structure-Our configuration data structure
    • JSON, Working with JSON-Field formatting with struct field tags
      • decoding, Decoding JSON-Decoding JSON
      • encoding, Encoding JSON-Encoding JSON
      • field formatting, Field formatting with struct field tags-Field formatting with struct field tags
    • in Viper, Working with configuration files in Viper-Watching and rereading configuration files in Viper
    • YAML, Working with YAML
      • decoding, Decoding YAML
      • encoding, Encoding YAML-Encoding YAML
      • field formatting, Struct field tags for YAML
  • connections, database security, Connections
  • consensus algorithms, Consensus Algorithms-Other consensus algorithms
  • consistency models, Consistency Models-Comparing weak and strong consistency
  • consistent systems, The CAP Theorem
  • Console Exporter, The Console Exporter
  • const keyword, Constants
  • constants, Constants-Constants, Defining the event type-Defining the event type
  • consumer processing, Consumer processing
  • container images
    • adding binaries to scratch files, Iteration 1: adding your binary to a FROM scratch image-Iteration 1: adding your binary to a FROM scratch image
    • building, Building your container image
    • definition of, Containerizing Your Key-Value Store, Docker (Absolute) Basics
    • multi-stage builds, Iteration 2: using a multistage build-Iteration 2: using a multistage build
    • naming, Building your container image
    • running, Running your container image-Running your container image
  • container types, Container Types: Arrays, Slices, and Maps-Map membership testing
    • arrays, Arrays-Arrays
    • maps, Maps-Map membership testing
    • slices, Slices
  • containers
    • benefits of, Containerizing Your Key-Value Store-Containerizing Your Key-Value Store
    • definition of, Containerizing Your Key-Value Store, Docker (Absolute) Basics
    • deleting, Stopping and deleting your containers
    • exposing ports, Running your container image
    • externalizing data, Externalizing Container Data
    • publishing ports, Running your container image, Issuing a request to a published container port-Issuing a request to a published container port
    • requested published ports, Issuing a request to a published container port-Issuing a request to a published container port
    • running multiple, Running multiple containers
    • stopping, Stopping and deleting your containers
  • context deadlines, Defining Context Deadlines and Timeouts-Defining Context Deadlines and Timeouts
  • context interface methods, The Context Package
  • context methods, Connections
  • context package, The Context Package-Using a Context, Using Context for service-side timeouts-Using Context for service-side timeouts
  • context timeouts, Defining Context Deadlines and Timeouts-Defining Context Deadlines and Timeouts
  • context values
    • creating, Creating Context
    • derived, Defining Context Deadlines and Timeouts-Defining Context Deadlines and Timeouts
    • problems with, Defining Request-Scoped Values
    • request-scoped, Defining Request-Scoped Values
    • uses for, What Context Can Do for You, Using a Context
  • contracts, Messaging Protocols
  • control structures, Control Structures-The switch Statement
    • for statement, Fun with for-Looping over maps
    • if statement, The if Statement-The if Statement
    • switch statement, The switch Statement-The switch Statement
  • convenience functions, Issuing HTTP requests with net/http-Issuing HTTP requests with net/http
  • cost of logging, Less is (way) more
  • counters, Metric instruments
  • coupling
    • contexts of, Coupling
    • definition of, Coupling
    • loose coupling (see loose coupling)
    • messaging protocols and, Messaging Protocols
    • tight coupling (see tight coupling)
    • types of
      • fixed addresses, Fixed addresses
      • fragile messaging protocols, Fragile messaging protocols
      • shared dependencies, Shared dependencies-Shared dependencies
      • temporal coupling, Shared point-in-time
  • CP systems, The CAP Theorem
  • CPU bottlenecks, The Four Common Bottlenecks
  • credential stuffing, Common password attacks
  • cross-site scripting (XSS), Cross-site scripting (XSS)-Cross-site scripting (XSS)
  • crypto package, Polling for configuration changes-Polling for configuration changes, Hashing Algorithms
  • crypto/rand package, Generating a key pair, Cryptographic Randomness
  • crypto/rsa package, The encryption process
  • cryptographic keys, Encryption
  • cryptography, Cryptographic Practices-Cryptographic Randomness
    • encryption, Encryption-When to use each
    • failures in, Cryptographic Failures
    • hashing, Hashing-Hashing Algorithms
    • randomness, Cryptographic Randomness-Cryptographic Randomness
  • CSP (communicating sequential processes), CSP-Style Concurrency
  • Cummins, Holly, It’s All About Dependability

D

  • DAGs (directed acyclic graphs), The “Three Pillars of Observability”
  • data consistency, Distributed State Is Hard
  • Data Encryption Standard (DES), Symmetric encryption
  • data isolation (Twelve-Factor App), VII. Data Isolation-VII. Data Isolation
  • data replication, Distributed State Is Hard, Data Replication-Hybrid approaches
  • data types, Basic Data Types-Strings
    • boolean, Booleans
    • complex numbers, Complex Numbers
    • simple numbers, Simple Numbers-Simple Numbers
    • strings, Strings-Strings
  • database drivers, importing, Importing a database driver
  • database/sql package, Working with databases in Go-Working with databases in Go
  • databases
    • database/sql package, Working with databases in Go-Working with databases in Go
    • security, Database Security-Parameterized Queries
    • state management with, Storing State in an External Database-Future improvements
  • debounce pattern, Debounce-Sample code, Participants
  • declaring
    • arrays, Arrays-Arrays
    • channels, Channels
    • constants with iota, Defining the event type-Defining the event type
    • functions, Functions
    • maps, Maps-Maps
    • methods, Methods-Methods
    • pointers, Pointers-Pointers
    • slices, Working with slices-Working with slices
    • variables, Variables-The Blank Identifier
  • decoding
    • JSON, Decoding JSON-Decoding JSON
    • YAML, Decoding YAML
  • deep health checks, The Three Types of Health Checks, Deep health checks-Deep health checks
  • defaults, setting in Viper, Setting defaults in Viper
  • defer keyword, Defer-Defer
  • degrees of freedom, Verification and testing
  • delete function, Maps
  • deleting containers, Stopping and deleting your containers
  • dependability, It’s All About Dependability-Fault Forecasting
    • achieving, Achieving Dependability-Fault Forecasting
      • fault forecasting, Fault Forecasting-Fault Forecasting
      • fault prevention, Fault Prevention-Loose coupling
      • fault removal, Fault Removal-Manageability
      • fault tolerance, Fault Tolerance
    • attributes of, What Is Dependability and Why Is It So Important?-What Is Dependability and Why Is It So Important?
    • cloud native attributes and, Achieving Dependability
    • definition of, What Is Dependability and Why Is It So Important?
    • in DevOps, Dependability: It’s Not Just for Ops Anymore-Dependability: It’s Not Just for Ops Anymore
    • reliability versus, What Is Dependability and Why Is It So Important?
  • dependability procurement, Achieving Dependability
  • dependability validation, Achieving Dependability
  • dependencies
    • deep health checks, Deep health checks-Deep health checks
    • shared, Shared dependencies-Shared dependencies
    • Twelve-Factor App, II. Dependencies-II. Dependencies
    • upstream/downstream, The Story So Far
  • dereferencing pointers, Pointers, Structs
  • derived context values, Defining Context Deadlines and Timeouts-Defining Context Deadlines and Timeouts
  • DES (Data Encryption Standard), Symmetric encryption
  • Deutsch, L Peter, Cloud Native Patterns
  • development/production parity (Twelve-Factor App), X. Development/Production Parity
  • dictionary attacks, Common password attacks
  • digital certificates, Certificates, certificate authorities, and trust
  • directed acyclic graphs (DAGs), The “Three Pillars of Observability”
  • directory traversal attacks, Broken Access Control
  • disk I/O bottlenecks, The Four Common Bottlenecks
  • disposability (Twelve-Factor App), IX. Disposability
  • distributed computing, fallacies of, Cloud Native Patterns
  • distributed monoliths, Loose Coupling, Loose coupling, Shared dependencies
  • distributed state
    • CAP theorem, The CAP Theorem-The CAP Theorem
    • consensus algorithms, Consensus Algorithms-Other consensus algorithms
    • consistency models, Consistency Models-Comparing weak and strong consistency
    • data replication, Data Replication-Hybrid approaches
    • difficulty of, Distributed State Is Hard
    • key-value store example, Distributing Our Key-Value Store-Future Directions
    • Paxos algorithm, Paxos-Paxos
    • Raft algorithm, Raft
    • status dissemination, Status Dissemination Techniques-Hierarchical
  • distributed tracing, Distributed Tracing-Viewing your results in Jaeger
    • definition of, The “Three Pillars of Observability”
    • example usage, Putting It All Together: Distributed Tracing-Viewing your results in Jaeger
    • with OpenTelemetry, Distributed Tracing with OpenTelemetry-Getting the current span from context
      • autoinstrumentation, Autoinstrumentation-Getting the current span from context
      • exporter creation, Creating the tracing exporters-The OTLP Exporter
      • global tracer provider, Setting the global tracer provider
      • import packages, Distributed Tracing with OpenTelemetry
      • span metadata, Setting span metadata-Events
      • spans, starting/ending, Starting and ending spans-Starting and ending spans
      • steps in, Distributed Tracing with OpenTelemetry
      • tracer provider creation, Creating a tracer provider
      • tracers, obtaining, Obtaining a tracer
    • terminology, Distributed Tracing Concepts-Distributed Tracing Concepts
  • DNS (Domain Name Service), Fixed addresses
  • Docker, Docker (Absolute) Basics-Stopping and deleting your containers
  • Dockerfiles, Containerizing Your Key-Value Store, The Dockerfile
    • adding binaries to scratch files, Iteration 1: adding your binary to a FROM scratch image-Iteration 1: adding your binary to a FROM scratch image
    • multi-stage builds, Iteration 2: using a multistage build-Iteration 2: using a multistage build
  • dot notation, Structs
  • double encoding, Double encoding and canonicalization-Double encoding and canonicalization
  • downstream dependencies, The Story So Far
  • duck typing, Composition and Structural Typing
  • durability, Advantages of Statelessness
  • dynamic analysis, Verification and testing
  • dynamic feature flags, Generation 3: Dynamic Feature Flags-The router function
  • dynamic sampling, Dynamic sampling
  • dynamic typing, Static Typing

E

  • EC2 (Elastic Compute Cloud) service, The Story So Far
  • ECC (elliptic curve cryptography), Asymmetric encryption
  • efficiency
    • LRU caches, Efficient Caching Using an LRU Cache-Efficient Caching Using an LRU Cache
    • memory leaks, Memory Leaks Can…​fatal error: runtime: out of memory-Forever ticking tickers
    • scalability and, What Is Scalability?, Scaling Postponed: Efficiency-On Efficiency
    • synchronization, Efficient Synchronization-Minimizing locking with sharding
  • Egalitarian Paxos (EPaxos) algorithm, Other consensus algorithms
  • Elastic Compute Cloud (EC2) service, The Story So Far
  • element types, Channels
  • elliptic curve cryptography (ECC), Asymmetric encryption
  • embedded fields, directly accessing, Directly accessing embedded fields
  • embedding
    • interfaces, Interface embedding
    • structs, Struct embedding
  • empty interface, The any type
  • encoding
    • double, Double encoding and canonicalization-Double encoding and canonicalization
    • JSON, Encoding JSON-Encoding JSON
    • output, Output Encoding
    • strings, String encoding-String encoding
    • YAML, Encoding YAML-Encoding YAML
  • encryption, Encryption-When to use each
    • asymmetric, Asymmetric encryption-Putting it all together
    • symmetric, Symmetric encryption-Putting these together
    • TLS (Transport Layer Security), Generation 3: Implementing Transport Layer Security-Transport Layer Summary
  • endpoints, exposing, Exposing the metrics endpoint
  • environment variables
    • application configuration, Configuring with Environment Variables-Configuring with Environment Variables
    • in Viper, Working with environment variables in Viper
  • EPaxos (Egalitarian Paxos) algorithm, Other consensus algorithms
  • epidemic protocols, Gossip
  • error type, Error Handling
  • errors, Resilience
    • definition of, What Does It Mean for a System to Fail?
    • detecting, Fault Tolerance
    • handling, Error Handling
    • sentinel, Your Super Simple API
  • event brokers, Middleware and message brokers
  • event buses, Publish-Subscribe Messaging
  • event types, defining, Defining the event type-Defining the event type
  • events
    • definition of, Logging
    • logs as streams of, Treat logs as streams of events
    • messages versus, Messages versus events
    • for spans, Events
    • structuring for parsing, Structure events for parsing-Structure events for parsing
  • eventual consistency, Eventual consistency
  • exponential backoff, Backoff Algorithms
  • exporters (OpenTelemetry), creating, Creating the tracing exporters-The OTLP Exporter, Creating your metric exporters
  • exposing
    • container ports, Running your container image
    • endpoints, Exposing the metrics endpoint
  • external databases (see databases)
  • externalizing container data, Externalizing Container Data

F

  • failing open, Failing Open
  • failures, Resilience
    • cascading
      • causes of, Cascading Failures-Cascading Failures
      • preventing overload, Preventing Overload-Graceful service degradation
      • retry storms, Play It Again: Retrying Requests
    • definition of, What Does It Mean for a System to Fail?
    • root causes of, Keep on Ticking: Why Resilience Matters
    • securing authentication against, Authentication
    • system, What Does It Mean for a System to Fail?-What Does It Mean for a System to Fail?
  • fallacies of distributed computing, Cloud Native Patterns
  • fallthrough keyword, The switch Statement
  • fan-in pattern, Fan-In-Sample code
  • fan-out pattern, Fan-Out-Sample code
  • fault forecasting, Achieving Dependability, Fault Forecasting-Fault Forecasting
  • fault masking, Service Redundancy
  • fault prevention, Achieving Dependability, Fault Prevention-Loose coupling
  • fault removal, Achieving Dependability, Fault Removal-Manageability
  • fault tolerance, Resilience, Achieving Dependability, Fault Tolerance
  • faults, Resilience, What Does It Mean for a System to Fail?
  • feature flags, Feature Management with Feature Flags-The router function
    • benefits of, Feature Management with Feature Flags
    • definition of, Feature Management with Feature Flags
    • key-value stores
      • configurable flags, Generation 2: The Configurable Flag-Generation 2: The Configurable Flag
      • dynamic flags, Generation 3: Dynamic Feature Flags-The router function
      • hardcoded flags, Generation 1: The Hardcoded Feature Flag-Generation 1: The Hardcoded Feature Flag
      • initial implementation, Generation 0: The Initial Implementation
  • feature gating, Feature Management with Feature Flags
  • field formatting
    • JSON, Field formatting with struct field tags-Field formatting with struct field tags
    • YAML, Struct field tags for YAML
  • Fielding, Roy, Idempotence
  • finite state machines, Defining the State Machine
  • first-class values, Anonymous Functions and Closures
  • fixed addresses, Fixed addresses
  • flag package, The standard flag package-The standard flag package
  • flags in logs, Log flags
  • flexibility in service discovery, Service Discovery
  • float32 type, Simple Numbers
  • float64 type, Simple Numbers
  • floating point data types, Simple Numbers
  • floating point numbers, comparing, Maps
  • fmt package, Zero Values
  • for loops
    • arrays and slices, Looping over arrays and slices
    • initializing, The general for statement-The general for statement
    • maps, Looping over maps
  • for statement, Fun with for-Looping over maps
  • format strings, Zero Values, Variadic Functions, Using a bufio.Scanner to play back file transaction logs
  • formatting I/O (input/output), Zero Values, Output formatting with handlers-Output formatting with handlers
  • fragile messaging protocols, Fragile messaging protocols
  • fsnotify package, Watching OS filesystem notifications-Watching OS filesystem notifications
  • func keyword, Functions
  • function argument type inference, Type Inference
  • function-first implementation, Implementation
  • function-last implementation, Implementation
  • functional partitioning, Different Forms of Scaling
  • functions, Putting the Fun in Functions: Variadics and Closures-Anonymous Functions and Closures
    • anonymous, Anonymous Functions and Closures-Anonymous Functions and Closures
    • closures, Anonymous Functions and Closures-Anonymous Functions and Closures
    • convenience, Issuing HTTP requests with net/http-Issuing HTTP requests with net/http
    • declaring, Functions
    • defer keyword, Defer-Defer
    • dynamic flags as, Generation 3: Dynamic Feature Flags-The router function
    • generic, Generic Functions-Generic Functions
    • methods, Methods-Methods
    • multiple return values, Multiple return values
    • pointers in, Pointers as parameters-Pointers as parameters
    • prior to generics, The Before Times
    • recursion, Recursion
    • variadic, Working with slices, Variadic Functions-Passing slices as variadic values
  • future pattern, Future-Sample code

G

  • gauges, Metric instruments
  • generics, Generics-Type Inference
    • disadvantages without, The Before Times
    • functions, Generic Functions-Generic Functions
    • type constraints, Type Constraints
    • type inference, Type Inference
    • types, Generic Types
  • GET function, Implementing the read function-Implementing the read function
  • global meter providers (OpenTelemetry), Setting the global meter provider
  • global tracer providers (OpenTelemetry), Setting the global tracer provider
  • Go
    • composition
      • definition of, Structs, Methods, and Interfaces
      • interfaces, Interfaces-The any type
      • methods, Methods-Methods
      • structs, Structs-Structs
      • type embedding, Composition with Type Embedding-Directly accessing embedded fields
    • concurrency, The Good Stuff: Concurrency-Implementing channel timeouts
    • container types, Container Types: Arrays, Slices, and Maps-Map membership testing
    • control structures, Control Structures-The switch Statement
    • data types, Basic Data Types-Strings
    • error handling, Error Handling
    • features of
      • composition and structural typing, Composition and Structural Typing-Composition and Structural Typing
      • comprehensibility, Comprehensibility-Comprehensibility
      • concurrency, CSP-Style Concurrency-CSP-Style Concurrency
      • fast builds, Fast Builds
      • linguistic stability, Linguistic Stability
      • memory safety, Memory Safety-Memory Safety
      • performance, Performance-Performance
      • static linking, Static Linking
      • static typing, Static Typing-Static Typing
    • functions, Putting the Fun in Functions: Variadics and Closures-Anonymous Functions and Closures
    • generics, Generics-Type Inference
    • motivation for, The Motivation Behind Go
    • pointers, Pointers-Pointers
    • security features, Go: Secure by Design
    • variables, Variables-Constants
  • go keyword, Goroutines
  • Go modules, initializing with, Initializing your project with Go modules-Initializing your project with Go modules
  • Go-YAML package, Working with YAML
  • gorilla/mux package, Building an HTTP Server with gorilla/mux-So many matchers, Autoinstrumenting net/http and gorilla/mux-Autoinstrumenting net/http and gorilla/mux
  • goroutine leaks, Goroutines, Leaking goroutines-Leaking goroutines
  • goroutines, Goroutines, Share memory by communicating
  • gossip protocols, Gossip
  • graceful degradation, Graceful service degradation
  • graceful shutdowns, Graceful Shutdowns-Putting It Into Action
    • example usage, Putting It Into Action-Putting It Into Action
    • resource cleanup, Clean Up Your Resources
    • signals, Signals and Traps-Catching signals
    • stopping incoming requests, Stop Incoming Requests
  • GraphQL, Common request-response implementations
  • Griesemer, Robert, The Motivation Behind Go
  • gRPC, Common request-response implementations, Remote procedure calls with gRPC-Implementing the gRPC client
    • autoinstrumentation, Autoinstrumenting gRPC-Autoinstrumenting gRPC
    • benefits of, Remote procedure calls with gRPC
    • client implementation, Implementing the gRPC client-Implementing the gRPC client
    • key-value message structure, The key-value message structure-The key-value message structure
    • message definition structure, The message definition structure-The message definition structure
    • protocol buffer compilation, Compiling your protocol buffers
    • protocol compiler installation, Installing the protocol compiler-Installing the protocol compiler
    • server implementation, Implementing the gRPC service-Implementing the gRPC service
    • service interface definition, Interface definition with protocol buffers
    • service method definition, Defining our service methods
    • timeouts, Timing out gRPC client calls-Timing out gRPC client calls
  • gRPC over TLS, gRPC Over TLS-gRPC Over TLS

H

  • handlers, Building an HTTP Server with net/http-Building an HTTP Server with net/http
    • GET function, Implementing the read function-Implementing the read function
    • output formatting, Output formatting with handlers-Output formatting with handlers
    • PUT function, Implementing the create function-Implementing the create function
  • handshake configuration, The handshake configuration
  • hardcoded feature flags, Generation 1: The Hardcoded Feature Flag-Generation 1: The Hardcoded Feature Flag
  • hash functions, Hashing
  • hash tables, Maps
  • hashes, Hashing
  • HashiCorp Go plug-ins, HashiCorp’s Go Plug-in System Over RPC-Connect to our plug-in and dispense our Sayer
    • benefits of, HashiCorp’s Go Plug-in System Over RPC
    • example usage, Another toy plug-in example-Connect to our plug-in and dispense our Sayer
    • importing, Import the hashicorp/go-plugin and commons packages
    • loading plug-ins, Find our plug-in
    • RPC clients for, Create our plug-in client-Create our plug-in client
  • hashicorp/golang-lru library, Efficient Caching Using an LRU Cache
  • hashing, Hashing-Hashing Algorithms
    • algorithms, Hashing Algorithms-Hashing Algorithms
    • definition of, Hashing
    • example usages, Hashing
    • passwords, Hashing passwords-Hashing passwords, Hashing Algorithms
  • hazardous characters, Hazardous characters-Hazardous characters
  • headers (JWTs), Header
  • health checks, Healthy Health Checks-Failing Open
    • deep, Deep health checks-Deep health checks
    • definition of healthy, What Does It Mean for a Service to Be “Healthy”?
    • failing open, Failing Open
    • liveness probes, Liveness and Readiness Probes
    • reachability checks, Reachability checks
    • readiness probes, Liveness and Readiness Probes
    • shallow, Shallow health checks-Shallow health checks
    • types of, The Three Types of Health Checks
  • heartbeating, Heartbeating
  • hexagonal architecture, Hexagonal Architecture-Putting it all together
    • components of, The Architecture-The Architecture
    • example usage, Implementing a Hexagonal Service-Putting it all together
  • hierarchical status dissemination, Hierarchical
  • histograms, Metric instruments
  • history of networked applications, The Story So Far-The Story So Far
  • Hoare, Tony, Cloud Native Design Principles
  • horizontal scaling, Scalability, Different Forms of Scaling
  • horizontal sharding, Applicability
  • host.docker.internal, Starting your services
  • hosts files, Fixed addresses
  • HTML encoding, Output Encoding
  • HTML entities, converting special characters to, Converting special characters to HTML entities
  • HTML tags, stripping, Stripping tags
  • HTTP
    • issuing requests with net/http package, Issuing HTTP requests with net/http-Issuing HTTP requests with net/http
    • servers, building
      • with gorilla/mux, Building an HTTP Server with gorilla/mux-So many matchers
      • with net/http, Building an HTTP Server with net/http-Building an HTTP Server with net/http
    • timeouts, Timing out HTTP/REST client calls-Timing out HTTP/REST client calls
  • http.Get function, Issuing HTTP requests with net/http
  • http.Head function, Issuing HTTP requests with net/http
  • http.Post function, Issuing HTTP requests with net/http
  • http.PostForm function, Issuing HTTP requests with net/http
  • http.Response struct, Issuing HTTP requests with net/http
  • HTTP/1.1 standard, Idempotence
  • HTTPS (HTTP over TLS), Securing Your Web Service with HTTPS-Securing Your Web Service with HTTPS, HTTP Over TLS
  • hybrid data replication, Hybrid approaches

I

  • I/O (input/output), formatting, Zero Values, Output formatting with handlers-Output formatting with handlers
  • IaaS (infrastructure as a service), The Story So Far
  • idempotence, What Is Idempotence and Why Does It Matter?-What Is Idempotence and Why Does It Matter?, Idempotence-What about scalar operations?
  • if statement, The if Statement-The if Statement
  • imaginary literal data types, Complex Numbers
  • importing
    • commons package, Import the hashicorp/go-plugin and commons packages
    • database drivers, Importing a database driver
    • HashiCorp Go plug-ins, Import the hashicorp/go-plugin and commons packages
    • OpenTelemetry packages, Distributed Tracing with OpenTelemetry, Metrics with OpenTelemetry
    • packages, The Blank Identifier
    • plugin package, Import the plugin package
  • incoming requests, stopping, Stop Incoming Requests
  • infrastructure as a service (IaaS), The Story So Far
  • inheritance, Composition and Structural Typing-Composition and Structural Typing
  • initializing
    • with Go modules, Initializing your project with Go modules-Initializing your project with Go modules
    • Raft nodes, Setting Up the Raft Node-Setting Up the Raft Node
    • transaction log interface, Initializing the FileTransactionLogger in your web service-Initializing the FileTransactionLogger in your web service, Initializing the PostgresTransactionLogger in your web service
  • injection, Injection-Cross-site scripting (XSS)
  • input sanitization, Input Sanitization-Stripping tags
  • input validation, Input Validation-Leveraging regular expressions
    • canonicalization, Double encoding and canonicalization-Double encoding and canonicalization
    • double encoding, Double encoding and canonicalization-Double encoding and canonicalization
    • hazardous characters, Hazardous characters-Hazardous characters
    • numeric validation, Numeric validation
    • principles of, Input Validation Rules of Thumb
    • with regular expressions, Leveraging regular expressions
    • string encoding, String encoding-String encoding
  • inserting with SQL INSERT statement, Using db.Exec to execute a SQL INSERT-Using db.Exec to execute a SQL INSERT
  • installing protocol compiler, Installing the protocol compiler-Installing the protocol compiler
  • instances, Composition and Structural Typing
  • instantiation, Generic Functions
  • instruments, types of, Metric instruments-Asynchronous instruments
  • int type, Simple Numbers
  • int8 type, Simple Numbers
  • int16 type, Simple Numbers
  • int32 type, Simple Numbers, Strings as slices
  • int64 type, Simple Numbers
  • interfaces, Interfaces-The any type
    • embedding, Interface embedding
    • transaction logs
      • adding methods, Your transaction logger interface (redux)
      • creating, Creating a new FileTransactionLogger-Creating a new FileTransactionLogger, Creating a new PostgresTransactionLogger-Creating a new PostgresTransactionLogger
      • implementing, Implementing your FileTransactionLogger-Implementing your FileTransactionLogger, Implementing your PostgresTransactionLogger-Implementing your PostgresTransactionLogger
      • initializing, Initializing the FileTransactionLogger in your web service-Initializing the FileTransactionLogger in your web service, Initializing the PostgresTransactionLogger in your web service
      • integrating with web service, Integrating FileTransactionLogger with your web service
  • interpreted literals, Strings
  • io package, Interface embedding
  • iota keyword, Defining the event type-Defining the event type

J

  • Jaeger, The OTLP Exporter
  • JavaScript encoding, Output Encoding
  • JavaScript Object Notation (see JSON)
  • jitter, Backoff Algorithms
  • JSON (JavaScript Object Notation), Working with JSON-Field formatting with struct field tags
    • decoding, Decoding JSON-Decoding JSON
    • encoding, Encoding JSON-Encoding JSON
    • field formatting, Field formatting with struct field tags-Field formatting with struct field tags
  • JWTs (JSON Web Tokens), JSON Web Tokens-Parsing and validating a token

K

  • Kaminsky, Michael, Other consensus algorithms
  • key pairs, Transport Layer Security, Generating a key pair-Generating a key pair
  • key-value stores
    • building HTTP server (generation 1), Generation 1: The Monolith-Integrating a read-write mutex into your application
    • containerizing, Containerizing Your Key-Value Store-Externalizing Container Data
    • core functionality (generation 0), Generation 0: The Core Functionality-Your Super Simple API
    • definition of, What’s a Key-Value Store?
    • distributing, Distributing Our Key-Value Store-Future Directions
    • feature flags
      • configurable, Generation 2: The Configurable Flag-Generation 2: The Configurable Flag
      • dynamic, Generation 3: Dynamic Feature Flags-The router function
      • hardcoded, Generation 1: The Hardcoded Feature Flag-Generation 1: The Hardcoded Feature Flag
      • initial implementation, Generation 0: The Initial Implementation
    • idempotence, What Is Idempotence and Why Does It Matter?-What Is Idempotence and Why Does It Matter?
    • key-value pairs
      • creating, Implementing the create function-Implementing the create function
      • reading, Implementing the read function-Implementing the read function
    • message structure, The key-value message structure-The key-value message structure
    • persisting resource state (generation 2), Generation 2: Persisting Resource State-Future improvements
    • refactoring with hexagonal architecture, Implementing a Hexagonal Service-Putting it all together
    • requirements, Requirements
    • security (generation 3), Generation 3: Implementing Transport Layer Security-Transport Layer Summary
    • with Viper, Using remote key-value stores with Viper-Using remote key-value stores with Viper
  • keylogging, Common password attacks

L

  • labels, Metrics
  • Lamport, Leslie, Paxos
  • languages (see programming languages)
  • Laprie, Jean-Claude, What Is Dependability and Why Is It So Important?
  • Least Recently Used (LRU) caches, Efficient Caching Using an LRU Cache-Efficient Caching Using an LRU Cache
  • len function, Container Types: Arrays, Slices, and Maps, Arrays, Maps
  • ListenAndServe function, Building an HTTP Server with net/http-Building an HTTP Server with net/http, Securing Your Web Service with HTTPS-Securing Your Web Service with HTTPS
  • liveness probes, Liveness and Readiness Probes
  • load shedding, Preventing Overload, Load shedding-Load shedding
  • loading plug-ins, Find our plug-in, Find our plug-in
  • lock contention, Applicability, Minimizing locking with sharding
  • locking, minimizing with sharding, Minimizing locking with sharding-Minimizing locking with sharding
  • locks, Sample code, Implementation, Making Your Data Structure Concurrency-Safe-Integrating a read-write mutex into your application
  • log package, Logging with Go’s Standard log Package-Log flags
  • log.Fatal functions, The special logging functions
  • log.Panic functions, The special logging functions
  • logging, Logging-OpenTelemetry Logging
    • best practices, Better Logging Practices-Dynamic sampling
    • cost of, Less is (way) more
    • definition of, The “Three Pillars of Observability”
    • log package, Logging with Go’s Standard log Package-Log flags
    • with OpenTelemetry, OpenTelemetry Logging
    • problems with, Logging
    • slog package, Structured Logging with the log/slog Package-Output formatting with handlers
  • logging levels, Logging levels
  • logical operators, Booleans
  • logs
    • definition of, The “Three Pillars of Observability”, Logging
    • flags in, Log flags
    • replicated, Consensus Algorithms
    • as streams of events, Treat logs as streams of events
    • Twelve-Factor App, XI. Logs
  • looking up, Plug-in vocabulary
  • looping over channels, Looping over channels
  • loops (for)
    • arrays and slices, Looping over arrays and slices
    • initializing, The general for statement-The general for statement
    • maps, Looping over maps
  • loose coupling
    • definition of, Loose Coupling-Loose Coupling, Coupling
    • fault prevention and, Loose coupling
    • hexagonal architecture and, Hexagonal Architecture-Putting it all together
    • with plug-ins, Loose Coupling Local Resources with Plug-ins-Connect to our plug-in and dispense our Sayer
      • HashiCorp Go plug-ins, HashiCorp’s Go Plug-in System Over RPC-Connect to our plug-in and dispense our Sayer
      • plugin package, In-Process Plug-ins with the plugin Package-Executing our example
    • publish-subscribe messaging and, Publish-Subscribe Messaging-Consumer processing
    • request-response messaging and, Request-Response Messaging-Implementing the gRPC client
    • service discovery, Service Discovery-Service Discovery
  • LRU (Least Recently Used) caches, Efficient Caching Using an LRU Cache-Efficient Caching Using an LRU Cache

M

  • maintainability, Manageability, What Is Dependability and Why Is It So Important?, Manageability
  • make function, Working with slices, Maps, Channels
  • manageability
    • application configuration, Configuring Your Application-Setting defaults in Viper
      • best practices, Configuration Good Practice-Configuration Good Practice
      • with command-line arguments, Configuring with Command-Line Arguments-The Cobra command-line parser
      • with configuration files, Configuring with Files-Watching OS filesystem notifications
      • with environment variables, Configuring with Environment Variables-Configuring with Environment Variables
      • Twelve-Factor app, Configuring Your Application
      • with Viper, Viper: The Swiss Army Knife of Configuration Packages-Setting defaults in Viper
    • definition of, Manageability-Manageability, Manageability
    • fault removal and, Manageability
    • feature flags, Feature Management with Feature Flags-The router function
    • importance of, What Is Manageability and Why Should I Care?-What Is Manageability and Why Should I Care?
    • maintainability and, Manageability
  • manual server changes, XII. Administrative Processes-XII. Administrative Processes
  • map literals, Maps
  • maps
    • declaring, Maps-Maps
    • definition of, Container Types: Arrays, Slices, and Maps
    • looping over, Looping over maps
    • membership testing, Map membership testing
  • matchers, So many matchers
  • math/rand package, Generating a key pair, Cryptographic Randomness
  • MD5 algorithm, Hashing Algorithms
  • membership testing in maps, Map membership testing
  • memory
    • bottlenecks, The Four Common Bottlenecks
    • leaks, Memory Leaks Can…fatal error: runtime: out of memory-Forever ticking tickers
    • LRU caches, Efficient Caching Using an LRU Cache-Efficient Caching Using an LRU Cache
    • safety, Memory Safety-Memory Safety
    • sharing, Efficient Synchronization-Minimizing locking with sharding
  • message brokers, Publish-Subscribe Messaging, Middleware and message brokers
  • messages
    • definition structure, The message definition structure-The message definition structure
    • events versus, Messages versus events
  • messaging patterns
    • publish-subscribe, Messaging Patterns, Publish-Subscribe Messaging-Consumer processing
    • request-response, Request-Response Messaging-Implementing the gRPC client
      • benefits of, Request-Response Messaging
      • definition of, Messaging Patterns
      • gRPC, Remote procedure calls with gRPC-Implementing the gRPC client
      • implementations of, Common request-response implementations
      • net/http package, Issuing HTTP requests with net/http-Issuing HTTP requests with net/http
  • messaging protocols
    • coupling and, Messaging Protocols
    • fragility of, Fragile messaging protocols
  • meter providers (OpenTelemetry), global, Setting the global meter provider
  • meters, obtaining, Obtaining a meter
  • methods, Methods-Methods
    • adding to transaction log interface, Your transaction logger interface (redux)
    • context interface, The Context Package
    • RESTful, Your RESTful methods-Your RESTful methods
  • metrics, Metrics-Viewing your results in Prometheus
    • definition of, The “Three Pillars of Observability”
    • example usage, Putting It All Together: Metrics-Viewing your results in Prometheus
    • with OpenTelemetry, Metrics with OpenTelemetry-Asynchronous instruments
      • exporter creation, Creating your metric exporters
      • exposing endpoint, Exposing the metrics endpoint
      • global meter provider, Setting the global meter provider
      • import packages, Metrics with OpenTelemetry
      • instrument types, Metric instruments-Asynchronous instruments
      • obtaining meters, Obtaining a meter
      • steps in, Metrics with OpenTelemetry
    • pull-based metrics, Push Versus Pull Metric Collection-But which is better?
    • push-based metrics, Push Versus Pull Metric Collection-But which is better?
  • MFA (multifactor authentication), Authentication
  • microservices, The Microservices System Architecture-The Microservices System Architecture
  • middleware, Middleware and message brokers
  • monitoring, Observability, How Is Observability Different from “Traditional” Monitoring?
    • (see also observability)
  • monoliths, The Monolith System Architecture-The Monolith System Architecture, Shared dependencies
  • Moraru, Iulian, Other consensus algorithms
  • multi-stage builds, Iteration 2: using a multistage build-Iteration 2: using a multistage build
  • multifactor authentication (MFA), Authentication
  • multiple containers, running, Running multiple containers
  • multiplexers, Building an HTTP Server with net/http-Building an HTTP Server with net/http
  • multitiered architecture, The Story So Far
  • mutexes
    • definition of, Sample code
    • for synchronization, Implementation, Making Your Data Structure Concurrency-Safe-Integrating a read-write mutex into your application, Minimizing locking with sharding
    • when to use, Efficient Synchronization

N

  • naming container images, Building your container image
  • net/http package, Building an HTTP Server with net/http-Building an HTTP Server with net/http, Securing Your Web Service with HTTPS-Securing Your Web Service with HTTPS
    • autoinstrumentation, Autoinstrumenting net/http and gorilla/mux-Autoinstrumenting net/http and gorilla/mux
    • issuing requests, Issuing HTTP requests with net/http-Issuing HTTP requests with net/http
    • timeouts, Timing out HTTP/REST client calls-Timing out HTTP/REST client calls
  • network I/O bottlenecks, The Four Common Bottlenecks
  • network partitions, Distributed State Is Hard, The CAP Theorem
  • networked applications, history of, The Story So Far-The Story So Far
  • nil functions, Anonymous Functions and Closures
  • nonces, The encryption process
  • notifications, Watching OS filesystem notifications-Watching OS filesystem notifications
  • nullipotence, What Is Idempotence and Why Does It Matter?
  • numeric validation, Numeric validation

O

  • object-oriented programming (OOP), Composition and Structural Typing-Composition and Structural Typing
  • observability
    • definition of, Observability-Observability, What Is Observability?
    • distributed tracing, Distributed Tracing-Viewing your results in Jaeger
      • autoinstrumentation, Autoinstrumentation-Getting the current span from context
      • example usage, Putting It All Together: Distributed Tracing-Viewing your results in Jaeger
      • exporter creation, Creating the tracing exporters-The OTLP Exporter
      • global tracer provider, Setting the global tracer provider
      • with OpenTelemetry, Distributed Tracing with OpenTelemetry-Getting the current span from context
      • span metadata, Setting span metadata-Events
      • spans, starting/ending, Starting and ending spans-Starting and ending spans
      • terminology, Distributed Tracing Concepts-Distributed Tracing Concepts
      • tracer provider creation, Creating a tracer provider
      • tracers, obtaining, Obtaining a tracer
    • fault forecasting and, Fault Forecasting
    • importance of, Why Do We Need Observability?
    • logging, Logging-OpenTelemetry Logging
      • best practices, Better Logging Practices-Dynamic sampling
      • log package, Logging with Go’s Standard log Package-Log flags
      • with OpenTelemetry, OpenTelemetry Logging
      • problems with, Logging
      • slog package, Structured Logging with the log/slog Package-Output formatting with handlers
    • metrics, Metrics-Viewing your results in Prometheus
      • example usage, Putting It All Together: Metrics-Viewing your results in Prometheus
      • exporter creation, Creating your metric exporters
      • exposing endpoint, Exposing the metrics endpoint
      • global meter provider, Setting the global meter provider
      • instrument types, Metric instruments-Asynchronous instruments
      • obtaining meters, Obtaining a meter
      • with OpenTelemetry, Metrics with OpenTelemetry-Asynchronous instruments
      • pull-based metrics, Push Versus Pull Metric Collection-But which is better?
      • push-based metrics, Push Versus Pull Metric Collection-But which is better?
    • monitoring versus, How Is Observability Different from “Traditional” Monitoring?
    • three pillars of, The “Three Pillars of Observability”-The “Three Pillars of Observability”
  • observers, Asynchronous instruments-Asynchronous instruments
  • OTLP exporter, The OTLP Exporter-The OTLP Exporter
  • Ongaro, Diego, Raft
  • OOP (object-oriented programming), Composition and Structural Typing-Composition and Structural Typing
  • OpenCensus, OpenTelemetry
  • opening plug-ins, Plug-in vocabulary, Open our plug-in
  • OpenTelemetry, OpenTelemetry
    • components of, The OpenTelemetry Components-The OpenTelemetry Components
    • distributed tracing with, Distributed Tracing with OpenTelemetry-Getting the current span from context
      • autoinstrumentation, Autoinstrumentation-Getting the current span from context
      • exporter creation, Creating the tracing exporters-The OTLP Exporter
      • global tracer provider, Setting the global tracer provider
      • import packages, Distributed Tracing with OpenTelemetry
      • span metadata, Setting span metadata-Events
      • spans, starting/ending, Starting and ending spans-Starting and ending spans
      • steps in, Distributed Tracing with OpenTelemetry
      • tracer provider creation, Creating a tracer provider
      • tracers, obtaining, Obtaining a tracer
    • logging with, OpenTelemetry Logging
    • metrics with, Metrics with OpenTelemetry-Asynchronous instruments
      • exporter creation, Creating your metric exporters
      • exposing endpoint, Exposing the metrics endpoint
      • global meter provider, Setting the global meter provider
      • import packages, Metrics with OpenTelemetry
      • instrument types, Metric instruments-Asynchronous instruments
      • obtaining meters, Obtaining a meter
      • steps in, Metrics with OpenTelemetry
  • OpenTracing, OpenTelemetry
  • operational complexity of distributed state, Distributed State Is Hard
  • OS filesystem notifications, Watching OS filesystem notifications-Watching OS filesystem notifications
  • os package, III. Configuration
  • os/signal package, Catching signals
  • Ousterhout, John, Raft
  • output encoding, Output Encoding
  • output formatting in slog package, Output formatting with handlers-Output formatting with handlers
  • overload
    • as failure cause, Cascading Failures
    • preventing, Preventing Overload-Graceful service degradation
  • OWASP Top 10, Common Vulnerabilities

P

  • packages, importing, The Blank Identifier, Distributed Tracing with OpenTelemetry, Metrics with OpenTelemetry
  • parallelism, CSP-Style Concurrency
  • parameterized queries, Parameterized Queries-Parameterized Queries
  • parity (Twelve-Factor App), X. Development/Production Parity
  • parsing tokens, Parsing and validating a token-Parsing and validating a token
  • partition-tolerant systems, The CAP Theorem
  • partitioning (see sharding pattern)
  • password-based authentication, Authentication, Password-Based Authentication-Hashing passwords
  • passwords
    • hashing, Hashing passwords-Hashing passwords, Hashing Algorithms
    • types of attacks, Common password attacks
  • path traversal attacks, Broken Access Control
  • patterns
    • concurrency patterns, Concurrency Patterns-Sample code
      • chord, Chord-Sample code
      • fan-in, Fan-In-Sample code
      • fan-out, Fan-Out-Sample code
      • future, Future-Sample code
      • sharding, Sharding-Sample code
      • worker pool, Worker Pool-Sample code
    • stability patterns, Stability Patterns-Sample code
      • circuit breaker, Circuit Breaker-Sample code
      • debounce, Debounce-Sample code, Participants
      • retry, Retry-Sample code
      • throttle, Throttle-Sample code
      • timeout, Timeout-Sample code
  • Paxos algorithm, Paxos-Paxos
  • payloads (JWTs), Payload
  • PBFT (Practical Byzantine fault tolerance), Other consensus algorithms
  • PEM (Privacy Enhanced Mail), Privacy enhanced mail (PEM) file format
  • performance benchmarks, Performance-Performance
  • personnel divergence, X. Development/Production Parity
  • phishing, Common password attacks
  • Pike, Rob, The Motivation Behind Go
  • plug-ins, Loose Coupling Local Resources with Plug-ins-Connect to our plug-in and dispense our Sayer
    • definition of, Plug-in vocabulary
    • HashiCorp Go plug-ins, HashiCorp’s Go Plug-in System Over RPC-Connect to our plug-in and dispense our Sayer
    • plugin package, In-Process Plug-ins with the plugin Package-Executing our example
  • plugin package, In-Process Plug-ins with the plugin Package-Executing our example
    • caveats, In-Process Plug-ins with the plugin Package
    • example usage, A toy plug-in example-Executing our example
    • importing, Import the plugin package
    • loading plug-ins, Find our plug-in
    • opening plug-ins, Open our plug-in
    • symbols, Look up your symbol-Assert and use your symbol
    • terminology, Plug-in vocabulary-Plug-in vocabulary
  • pointers
    • accessing structs, Structs
    • comparing, Maps
    • declaring, Pointers-Pointers
    • dereferencing, Pointers
    • embedded to structs, Struct embedding
    • as function parameters, Pointers as parameters-Pointers as parameters
  • polling, Polling for configuration changes-Polling for configuration changes, Polling
  • port binding, VII. Data Isolation
  • ports, The Architecture
  • ports and adapters pattern (see hexagonal architecture)
  • POSIX signals, Common POSIX signals-Common POSIX signals
  • Practical Byzantine fault tolerance (PBFT), Other consensus algorithms
  • prepared statements, Parameterized Queries-Parameterized Queries
  • preventing overload, Preventing Overload-Graceful service degradation
  • Privacy Enhanced Mail (PEM), Privacy enhanced mail (PEM) file format
  • private key encryption, Symmetric encryption-Putting these together
  • private keys, Transport Layer Security, Asymmetric encryption
  • processes (Twelve-Factor App), VI. Processes
  • programming languages
    • disadvantages for cloud native, Features for a Cloud Native World
    • fault prevention and, Language features
    • Go (see Go)
    • performance benchmarks, Performance-Performance
  • programs, user expectations, Cloud Native Design Principles
    • (see also dependability)
  • Prometheus, Pull-based metric collection
  • Prometheus exporters, Exposing the metrics endpoint
  • promotion, Promotion
  • protocol buffers, Interface definition with protocol buffers, Compiling your protocol buffers
  • protocol compilers, installing, Installing the protocol compiler-Installing the protocol compiler
  • pseudorandom number generation, Generating a key pair, Cryptographic Randomness-Cryptographic Randomness
  • public keys, Transport Layer Security, Asymmetric encryption
  • public-key cryptography, Transport Layer Security, Asymmetric encryption-Putting it all together
  • publish-subscribe messaging, Messaging Patterns, Publish-Subscribe Messaging-Consumer processing
  • publishers, Publish-Subscribe Messaging
  • publishing container ports, Running your container image, Issuing a request to a published container port-Issuing a request to a published container port
  • pull-based metrics, Push Versus Pull Metric Collection-But which is better?
  • push-based metrics, Push Versus Pull Metric Collection-But which is better?
  • PUT function, Implementing the create function-Implementing the create function

Q

  • quorum, Consensus Algorithms

R

  • Raft algorithm, Raft, Adding Some Raft
  • Raft nodes, initializing, Setting Up the Raft Node-Setting Up the Raft Node
  • rainbow table attacks, Common password attacks
  • randomness, Cryptographic Randomness-Cryptographic Randomness
  • range keyword, Looping over arrays and slices, Looping over channels
  • raw string literals, Strings
  • RE2, How Go Regex Is Different
  • reachability checks, Reachability checks
  • read locks, Making Your Data Structure Concurrency-Safe-Integrating a read-write mutex into your application
  • read-write mutex, Sample code
  • readiness probes, Liveness and Readiness Probes, Stop Incoming Requests
  • reading
    • configuration files, Reading configuration files
    • transaction logs, Using a bufio.Scanner to play back file transaction logs-Using a bufio.Scanner to play back file transaction logs, Using db.Query to play back postgres transaction logs-Using db.Query to play back postgres transaction logs
  • receiver arguments, Methods
  • recovery, Fault Tolerance
  • recursion, Recursion
  • ReDoS (regular expression denial of service), How Go Regex Is Different
  • redundancy
    • of messaging (see retry requests)
    • of services, Service Redundancy-Autoscaling
      • autoscaling, Autoscaling-Autoscaling
      • designing, Designing for Redundancy-Designing for Redundancy
  • refactoring key-value stores with hexagonal architecture, Implementing a Hexagonal Service-Putting it all together
  • reference, passing parameters by, Pointers as parameters
  • registered claims, Payload
  • regular expression denial of service (ReDoS), How Go Regex Is Different
  • regular expressions, Leveraging regular expressions, Regular Expressions-Syntax
  • relational databases, What’s a Key-Value Store?
  • release stage (Twelve-Factor App), V. Build, Release, Run
  • reliability
    • definition of, What Is Dependability and Why Is It So Important?, Keep on Ticking: Why Resilience Matters
    • dependability versus, What Is Dependability and Why Is It So Important?
    • mathematics of, Designing for Redundancy-Designing for Redundancy
    • resilience and, Resilience
  • reloadable configuration files, Making your configuration reloadable-Making your configuration reloadable
  • remote key-value stores with Viper, Using remote key-value stores with Viper-Using remote key-value stores with Viper
  • remote procedure calls (see RPCs)
  • replicated logs, Consensus Algorithms
  • replicated state machines, Consensus Algorithms
  • replication (see data replication)
  • request-response messaging, Shared point-in-time, Request-Response Messaging-Implementing the gRPC client
    • benefits of, Request-Response Messaging
    • definition of, Messaging Patterns
    • gRPC, Remote procedure calls with gRPC-Implementing the gRPC client
    • implementations of, Common request-response implementations
    • net/http package, Issuing HTTP requests with net/http-Issuing HTTP requests with net/http
  • request-scoped context values, Defining Request-Scoped Values
  • requesting published ports, Issuing a request to a published container port-Issuing a request to a published container port
  • requests, stopping incoming, Stop Incoming Requests
  • resilience
    • building for, Building for Resilience
    • definition of, Resilience-Resilience, Keep on Ticking: Why Resilience Matters
    • via graceful degradation, Graceful service degradation
    • via graceful shutdowns, Graceful Shutdowns-Putting It Into Action
    • via health checks, Healthy Health Checks-Failing Open
      • deep health checks, Deep health checks-Deep health checks
      • definition of healthy, What Does It Mean for a Service to Be “Healthy”?
      • failing open, Failing Open
      • liveness probes, Liveness and Readiness Probes
      • reachability checks, Reachability checks
      • readiness probes, Liveness and Readiness Probes
      • shallow health checks, Shallow health checks-Shallow health checks
      • types of checks, The Three Types of Health Checks
    • importance of, Keep on Ticking: Why Resilience Matters
    • via load shedding, Load shedding-Load shedding
    • reliability and, Resilience
    • via retry requests, Play It Again: Retrying Requests-What about scalar operations?
      • backoff algorithms, Backoff Algorithms-Backoff Algorithms
      • circuit breaker pattern, Circuit Breaking-Circuit Breaking
      • idempotence, Idempotence-What about scalar operations?
      • timeouts, Timeouts-Timing out gRPC client calls
    • in service discovery, Service Discovery
    • via service redundancy, Service Redundancy-Autoscaling
    • via throttling, Throttling-Throttling
  • resource cleanup, Clean Up Your Resources
  • resource state
    • application state versus, Generation 2: Persisting Resource State, Application State Versus Resource State
    • definition of, Generation 2: Persisting Resource State
    • persisting, Generation 2: Persisting Resource State-Future improvements
  • resources for spans, Resource attributes-Resource attributes
  • RESTful services
    • building, Building a RESTful Service-Implementing the read function
    • methods, Your RESTful methods-Your RESTful methods
    • request-response messaging implementation, Common request-response implementations
    • timeouts, Timing out HTTP/REST client calls-Timing out HTTP/REST client calls
  • retry pattern, Retry-Sample code
  • retry requests, Play It Again: Retrying Requests-What about scalar operations?
    • backoff algorithms, Backoff Algorithms-Backoff Algorithms
    • circuit breaker pattern, Circuit Breaking-Circuit Breaking
    • idempotence, Idempotence-What about scalar operations?
    • timeouts, Timeouts-Timing out gRPC client calls
  • retry storms, Play It Again: Retrying Requests
  • return statement, Functions
  • root command, The Cobra command-line parser
  • root span, Distributed Tracing Concepts
  • RPCs (remote procedure calls)
    • clients for HashiCorp Go plug-ins, Create our plug-in client-Create our plug-in client
    • gRPC, Remote procedure calls with gRPC-Implementing the gRPC client
    • request-response messaging implementation, Common request-response implementations
  • RSA (Rivest-Shamir-Adleman) algorithm, Asymmetric encryption
  • run stage (Twelve-Factor App), V. Build, Release, Run
  • rune type, Simple Numbers, Strings as slices
  • running
    • container images, Running your container image-Running your container image
    • multiple containers, Running multiple containers

S

  • SaaS (software as a service), The Story So Far
  • salting, Hashing passwords
  • samples
    • definition of, Metrics
    • dynamic sampling, Dynamic sampling
    • time series, Metrics
  • sanitizing input, Input Sanitization-Stripping tags
  • scalability
    • application state versus resource state, Application State Versus Resource State
    • autoscaling, Autoscaling-Autoscaling
    • bottlenecks, The Four Common Bottlenecks-The Four Common Bottlenecks
    • definition of, Scalability-Scalability, What Is Scalability?
    • difficulty of distributed state, Distributed State Is Hard
    • efficiency and, What Is Scalability?, Scaling Postponed: Efficiency-On Efficiency
    • fault prevention and, Scalability-Scalability
    • forms of scaling, Different Forms of Scaling-Different Forms of Scaling
    • service architectures, Service Architectures-Serverless services
    • in service discovery, Service Discovery
    • statelessness advantages, Advantages of Statelessness
    • Twelve-Factor App, VIII. Scalability
  • scalar operations, What about scalar operations?
  • scaling out, Different Forms of Scaling
  • scaling up, Different Forms of Scaling
  • scratch images, Iteration 1: adding your binary to a FROM scratch image-Iteration 1: adding your binary to a FROM scratch image
  • scrypt algorithm, Hashing Algorithms
  • search space, Verification and testing
  • secrets, III. Configuration
  • Secure Sockets Layer (SSL), Transport Layer Security
  • security, Security-Syntax
    • authentication, Authentication-Parsing and validating a token
      • definition of, Authentication
      • password-based, Password-Based Authentication-Hashing passwords
      • securing against failures, Authentication
      • token-based, Token-Based Authentication-Parsing and validating a token
      • types of, Authentication
    • for communications, Communication Security-gRPC Over TLS
    • cryptography, Cryptographic Practices-Cryptographic Randomness
      • encryption, Encryption-When to use each
      • hashing, Hashing-Hashing Algorithms
      • randomness, Cryptographic Randomness-Cryptographic Randomness
    • databases, Database Security-Parameterized Queries
    • Go security features, Go: Secure by Design
    • output encoding, Output Encoding
    • regular expressions, Regular Expressions-Syntax
    • TLS (Transport Layer Security), Generation 3: Implementing Transport Layer Security-Transport Layer Summary
    • untrusted input, Handling Untrusted Input-Stripping tags
      • input sanitization, Input Sanitization-Stripping tags
      • input validation, Input Validation-Leveraging regular expressions
    • vulnerabilities, Common Vulnerabilities-Cryptographic Failures
      • access control, Broken Access Control-Broken Access Control
      • cryptographic failures, Cryptographic Failures
      • injection, Injection-Cross-site scripting (XSS)
  • select statement, Select-Implementing channel timeouts
  • sentinel errors, Your Super Simple API
  • server-side discovery, Service Discovery
  • serverless computing, Serverless Architectures-Serverless services
  • service architectures, Service Architectures-Serverless services
    • microservices, The Microservices System Architecture-The Microservices System Architecture
    • monolith, The Monolith System Architecture-The Monolith System Architecture
    • serverless, Serverless Architectures-Serverless services
  • service catalogs, Service Discovery
  • service contracts, Loose Coupling
  • service deregistration, Service Discovery
  • service discovery, Fixed addresses, Service Discovery-Service Discovery, What Is Manageability and Why Should I Care?
  • service interface definition, Interface definition with protocol buffers
  • service meshes, Fixed addresses
  • service method definition, Defining our service methods
  • service redundancy, Service Redundancy-Autoscaling
    • autoscaling, Autoscaling-Autoscaling
    • designing, Designing for Redundancy-Designing for Redundancy
  • service registration, Service Discovery
  • service-side timeouts, Using Context for service-side timeouts-Using Context for service-side timeouts
  • services, building (see key-value stores)
  • SHA-1 algorithm, Hashing Algorithms
  • SHA-3 algorithm, Hashing Algorithms
  • SHA-256 algorithm, Hashing Algorithms
  • shallow health checks, The Three Types of Health Checks, Shallow health checks-Shallow health checks
  • sharding
    • definition of, Different Forms of Scaling
    • minimizing locking, Minimizing locking with sharding-Minimizing locking with sharding
  • sharding pattern, Sharding-Sample code
  • shards, Different Forms of Scaling
  • shared dependencies, Shared dependencies-Shared dependencies
  • shared key encryption, Symmetric encryption-Putting these together
  • sharing memory, Efficient Synchronization-Minimizing locking with sharding
  • short variable declarations, Short Variable Declarations
  • shutdowns, graceful, Graceful Shutdowns-Putting It Into Action
    • example usage, Putting It Into Action-Putting It Into Action
    • resource cleanup, Clean Up Your Resources
    • signals, Signals and Traps-Catching signals
    • stopping incoming requests, Stop Incoming Requests
  • side effects, Verification and testing
  • SIGHUP, Common POSIX signals
  • SIGINT, Common POSIX signals
  • SIGKILL, Common POSIX signals
  • signal handlers, Signals and Traps
  • signals, Signals and Traps-Catching signals
  • signatures (JWTs), Signature
  • signed integer data types, Simple Numbers
  • signing tokens, Building and signing a token-Building and signing a token
  • SIGQUIT, Common POSIX signals
  • SIGTERM, Common POSIX signals
  • simple numbers data types, Simple Numbers-Simple Numbers
  • Simple Object Access Protocol (SOAP), Fragile messaging protocols
  • simplicity, Advantages of Statelessness
  • slice operator, The slice operator-The slice operator
  • slices
    • arrays and, Slices
    • components of, Slices
    • declaring, Working with slices-Working with slices
    • definition of, Container Types: Arrays, Slices, and Maps
    • looping over, Looping over arrays and slices
    • passing as variadic values, Passing slices as variadic values
    • slice operator, The slice operator-The slice operator
    • strings as, Strings as slices
  • slog package, Structured Logging with the log/slog Package-Output formatting with handlers
  • slog.Logger type, The slog.Logger type
  • snowflakes, XII. Administrative Processes
  • SOAP (Simple Object Access Protocol), Fragile messaging protocols
  • software as a service (SaaS), The Story So Far
  • span metadata, Setting span metadata-Events
  • spans
    • definition of, Distributed Tracing Concepts
    • starting/ending, Starting and ending spans-Starting and ending spans
  • special characters, converting to HTML entities, Converting special characters to HTML entities
  • SQL databases, state management with, Storing State in an External Database-Future improvements
  • SQL encoding, Output Encoding
  • SQL injection, SQL injection-SQL injection
  • SQL INSERT statement, Using db.Exec to execute a SQL INSERT-Using db.Exec to execute a SQL INSERT
  • SSL (Secure Sockets Layer), Transport Layer Security
  • stability patterns, Stability Patterns-Sample code
    • circuit breaker, Circuit Breaker-Sample code
    • debounce, Debounce-Sample code, Participants
    • retry, Retry-Sample code
    • throttle, Throttle-Sample code
    • timeout, Timeout-Sample code
  • stack divergence, X. Development/Production Parity
  • state machines, defining, Defining the State Machine-Defining the State Machine
  • state management
    • definition of state, State and Statelessness
    • distributed state (see distributed state)
    • with external databases, Storing State in an External Database-Future improvements
    • resource state
      • application state versus, Generation 2: Persisting Resource State, Application State Versus Resource State
      • persisting, Generation 2: Persisting Resource State-Future improvements
    • with transaction logs, Storing State in a Transaction Log File-Future improvements
  • stateful applications, Application State Versus Resource State
  • statelessness
    • advantages of, Advantages of Statelessness-Advantages of Statelessness
    • definition of, Application State Versus Resource State
    • Twelve-Factor App, VI. Processes
  • static analysis, Verification and testing
  • static linking, Static Linking
  • static typing, Static Typing-Static Typing
  • status dissemination, Status Dissemination Techniques-Hierarchical
  • stopping
    • containers, Stopping and deleting your containers
    • incoming requests, Stop Incoming Requests
  • strconv package, Numeric validation
  • strings, Strings-Strings
    • encoding, String encoding-String encoding
    • format strings, Zero Values
    • as slices, Strings as slices
  • stripping HTML tags, Stripping tags
  • strong consistency, Strong consistency-Trade-offs
  • struct field tags
    • JSON, Field formatting with struct field tags-Field formatting with struct field tags
    • YAML, Struct field tags for YAML
  • structs
    • comparing, Maps
    • definition of, Structs-Structs
    • embedding, Struct embedding
    • as type-like constructs, Composition and Structural Typing
  • structural typing, Composition and Structural Typing-Composition and Structural Typing
  • structured logging, Structured Logging with the log/slog Package-Output formatting with handlers
  • Stubby, Remote procedure calls with gRPC
  • subcommands, The Cobra command-line parser-The Cobra command-line parser
  • subscribers, Publish-Subscribe Messaging
  • subsystems, What Does It Mean for a System to Fail?
  • subtypes, Composition and Structural Typing
  • switch statement, The switch Statement-The switch Statement
  • symbols
    • definition of, Plug-in vocabulary
    • example usage, Look up your symbol-Assert and use your symbol
  • symmetric encryption, Symmetric encryption-Putting these together, When to use each
  • synchronization, Efficient Synchronization-Minimizing locking with sharding
  • synchronous instruments, Synchronous instruments-Synchronous instruments
  • synchronous messaging (see request-response messaging)
  • synchronous replication, Synchronous replication
  • system architectures (see service architectures)
  • systems
    • building for resilience, Building for Resilience
    • definition of, Resilience, What Does It Mean for a System to Fail?
    • failures, What Does It Mean for a System to Fail?-What Does It Mean for a System to Fail?

T

  • temporal coupling, Shared point-in-time
  • testing, Verification and testing-Verification and testing
  • Thompson, Ken, The Motivation Behind Go
  • thread pools (see worker pool pattern)
  • threads, Share memory by communicating-Share memory by communicating
  • three pillars of observability, The “Three Pillars of Observability”-The “Three Pillars of Observability”
  • throttle pattern, Throttle-Sample code
  • throttling
    • circuit breaker pattern versus, Circuit Breaking
    • definition of, Preventing Overload
    • preventing overload, Throttling-Throttling
  • tight coupling
    • contexts of, Coupling
    • definition of, Coupling
    • fragile messaging protocols, Fragile messaging protocols
    • shared dependencies, Shared dependencies-Shared dependencies
  • time package, Forever ticking tickers-Forever ticking tickers
  • time series, Metrics
  • timeout pattern, Timeout-Sample code
  • timeouts, Timeouts-Timing out gRPC client calls
    • channels, Implementing channel timeouts
    • gRPC, Timing out gRPC client calls-Timing out gRPC client calls
    • HTTP clients, Issuing HTTP requests with net/http
    • HTTP/REST client calls, Timing out HTTP/REST client calls-Timing out HTTP/REST client calls
    • service-side, Using Context for service-side timeouts-Using Context for service-side timeouts
  • TLS (Transport Layer Security), Generation 3: Implementing Transport Layer Security-Transport Layer Summary, Transport Layer Security-gRPC Over TLS
  • toggles (see feature flags)
  • token bucket, Implementation
  • token-based authentication, Authentication, Token-Based Authentication-Parsing and validating a token
  • tokens
    • bearer, Parsing and validating a token
    • benefits of, Token-Based Authentication
    • building/signing, Building and signing a token-Building and signing a token
    • JSON Web Tokens, JSON Web Tokens-Parsing and validating a token
    • lifecycle of, The lifecycle of a token
    • parsing/validating, Parsing and validating a token-Parsing and validating a token
  • tracer providers (OpenTelemetry), Creating a tracer provider
  • tracers (OpenTelemetry), obtaining, Obtaining a tracer
  • traces, The “Three Pillars of Observability”, Distributed Tracing Concepts
    • (see also distributed tracing)
  • transaction logs
    • advantages of, Generation 2: Persisting Resource State
    • appending entries, Appending entries to the transaction log-Appending entries to the transaction log, Using db.Exec to execute a SQL INSERT-Using db.Exec to execute a SQL INSERT
    • definition of, Generation 2: Persisting Resource State, What’s a Transaction Log?
    • event types, defining, Defining the event type-Defining the event type
    • with external databases
      • advantages/disadvantages, Storing State in an External Database
      • creating, Creating a new PostgresTransactionLogger-Creating a new PostgresTransactionLogger
      • executing SQL INSERT, Using db.Exec to execute a SQL INSERT-Using db.Exec to execute a SQL INSERT
      • implementing, Implementing your PostgresTransactionLogger-Implementing your PostgresTransactionLogger
      • reading, Using db.Query to play back postgres transaction logs-Using db.Query to play back postgres transaction logs
    • format of, Your transaction log format
    • interface, Your transaction logger interface
      • adding methods, Your transaction logger interface (redux)
      • creating, Creating a new FileTransactionLogger-Creating a new FileTransactionLogger, Creating a new PostgresTransactionLogger-Creating a new PostgresTransactionLogger
      • implementing, Implementing your FileTransactionLogger-Implementing your FileTransactionLogger, Implementing your PostgresTransactionLogger-Implementing your PostgresTransactionLogger
      • initializing, Initializing the FileTransactionLogger in your web service-Initializing the FileTransactionLogger in your web service, Initializing the PostgresTransactionLogger in your web service
      • integrating with web service, Integrating FileTransactionLogger with your web service
    • prototyping, Prototyping your transaction logger
    • reading, Using a bufio.Scanner to play back file transaction logs-Using a bufio.Scanner to play back file transaction logs, Using db.Query to play back postgres transaction logs-Using db.Query to play back postgres transaction logs
    • state management, Storing State in a Transaction Log File-Future improvements
  • transitive downstream dependencies, The Story So Far
  • transitive upstream dependencies, The Story So Far
  • Transport Layer Security (TLS), Generation 3: Implementing Transport Layer Security-Transport Layer Summary, Transport Layer Security-gRPC Over TLS
  • trapping signals, Catching signals-Catching signals
  • trusted data, Handling Untrusted Input
  • Twelve-Factor App, The Continuing Relevance of The Twelve-Factor App-XII. Administrative Processes, Configuring Your Application
    • administrative processes, XII. Administrative Processes-XII. Administrative Processes
    • backing services, IV. Backing Services
    • build, release, run stages, V. Build, Release, Run
    • codebase, I. Codebase
    • configuration, III. Configuration-III. Configuration
    • data isolation, VII. Data Isolation-VII. Data Isolation
    • dependencies, II. Dependencies-II. Dependencies
    • development/production parity, X. Development/Production Parity
    • disposability, IX. Disposability
    • logs, XI. Logs
    • processes, VI. Processes
    • scalability, VIII. Scalability
  • type assertions, Type assertions
  • type constraints, Generic Functions, Type Constraints
  • type embedding, Composition with Type Embedding-Directly accessing embedded fields
    • direct access, Directly accessing embedded fields
    • interfaces, Interface embedding
    • promotion, Promotion
    • structs, Struct embedding
  • type inference, Type Inference
  • type parameters (see generics)
  • types, Composition and Structural Typing
    • generic, Generic Types
    • static typing, Static Typing-Static Typing

U

  • uint type, Simple Numbers
  • uint8 type, Simple Numbers, Strings as slices
  • uint16 type, Simple Numbers
  • uint32 type, Simple Numbers
  • uint64 type, Simple Numbers
  • unary RPC definitions, Defining our service methods
  • unbuffered channels, Channel blocking
  • underscore (_) operator, The Blank Identifier
  • unsigned integer data types, Simple Numbers
  • untrusted input, Handling Untrusted Input-Stripping tags
    • definition of, Handling Untrusted Input
    • input sanitization, Input Sanitization-Stripping tags
    • input validation, Input Validation-Leveraging regular expressions
  • up-down counters, Metric instruments
  • upstream dependencies, The Story So Far
  • URI paths, variables in, Variables in URI paths
  • URL encoding, Output Encoding
  • user expectations, Cloud Native Design Principles
    • (see also dependability)

V

  • validating
    • input (see input validation)
    • tokens, Parsing and validating a token-Parsing and validating a token
  • values
    • explicitly setting in Viper, Explicitly setting values in Viper
    • passing parameters by, Pointers as parameters
    • setting defaults in Viper, Setting defaults in Viper
  • var keyword, Variables
  • variables, Variables-Constants
    • any type, The any type
    • blank identifier, The Blank Identifier, Looping over arrays and slices
    • constants and, Constants-Constants
    • declaring, Variables-The Blank Identifier
    • pointers and, Pointers
    • short declarations, Short Variable Declarations
    • in URI paths, Variables in URI paths
    • zero values, Zero Values-Zero Values
  • variadic functions, Working with slices, Variadic Functions-Passing slices as variadic values
  • variadic operator (...), Zero Values, Variadic Functions
  • verb flags, Zero Values
  • verbs, Zero Values
  • verification, Verification and testing-Verification and testing
  • vertical scaling, Scalability, Different Forms of Scaling
  • vertical sharding, Applicability, Minimizing locking with sharding
  • Viewstamped Replication (VR) protocol, Other consensus algorithms
  • Viper, III. Configuration, Viper: The Swiss Army Knife of Configuration Packages-Setting defaults in Viper
    • command-line flags, Working with command-line flags in Viper
    • configuration files, Working with configuration files in Viper-Watching and rereading configuration files in Viper
    • defaults in, Setting defaults in Viper
    • environment variables, Working with environment variables in Viper
    • explicitly setting values, Explicitly setting values in Viper
    • key-value stores, Using remote key-value stores with Viper-Using remote key-value stores with Viper
  • VR (Viewstamped Replication) protocol, Other consensus algorithms
  • vulnerabilities, Common Vulnerabilities-Cryptographic Failures
    • access control, Broken Access Control-Broken Access Control
    • cryptographic failures, Cryptographic Failures
    • injection, Injection-Cross-site scripting (XSS)

W

  • weak consistency, Weak consistency
  • while loops, Control Structures, The general for statement
  • worker pool pattern, Worker Pool-Sample code
  • write locks, Making Your Data Structure Concurrency-Safe-Integrating a read-write mutex into your application

X

  • XSS (cross-site scripting), Cross-site scripting (XSS)-Cross-site scripting (XSS)

Y

  • YAML (YAML Ain't Markup Language), Working with YAML
    • decoding, Decoding YAML
    • encoding, Encoding YAML-Encoding YAML
    • field formatting, Struct field tags for YAML

Z

  • ZAB (ZooKeeper Atomic Broadcast), Other consensus algorithms
  • zero values
    • arrays, Arrays
    • functions, Anonymous Functions and Closures
    • maps, Map membership testing
    • pointers, Pointers
    • structs, Structs
    • variables, Zero Values-Zero Values
  • ZooKeeper Atomic Broadcast (ZAB), Other consensus algorithms

About the Author

Matthew A. Titmus is a veteran of the software development industry. Since teaching himself to build virtual worlds in LPC, he’s earned a surprisingly relevant degree in molecular biology, written tools to analyze terabyte-sized datasets at a high-energy physics laboratory, developed an early web development framework from scratch, wielded distributed computing techniques to analyze cancer genomes, and pioneered machine learning techniques for linked data.

He was an early adopter and advocate of both cloud native technologies in general and the Go language in particular. For the past several years he has specialized in helping companies migrate monolithic applications into a containerized, cloud native world, allowing them to transform the way their services are developed, deployed, and managed. He is passionate about what it takes to make a system production quality and has spent a lot of time thinking about and implementing strategies for observing and orchestrating distributed systems.

Matthew lives on Long Island with the world’s most patient woman, to whom he is lucky to be married, and the world’s most adorable boy, by whom he is lucky to be called “Dad.”

Colophon

The animal on the cover of Cloud Native Go is a member of the tuco-tuco family (Ctenomyidae). These neotropical rodents can be found living in excavated burrows across the southern half of South America.

The name “tuco-tuco” refers to a wide range of species. In general, these rodents have heavily built bodies with powerful short legs and well-developed claws. They have large heads but small ears, and though they spend up to 90% of their time underground, their eyes are relatively large compared to those of other burrowing rodents. The color and texture of the tuco-tucos’ fur varies depending on the species, but in general, their fur is fairly thick. Their tails are short and not particularly furry.

Tuco-tucos live in tunnel systems—which are often extensive and complicated—that they dig into sandy and/or loamy soil. These networks often include separate chambers for nesting and food storage. They have undergone a variety of morphological adaptations that help them create and thrive in these underground environments, including an improved sense of smell, which helps them orient themselves in the tunnels. They employ both scratch-digging and skull-tooth excavation when creating their burrows.

The diet of the tuco-tucos consists primarily of roots, stems, and grasses. Today, tuco-tucos are viewed as agricultural pests, but in pre-European South America they were an important source of food for indigenous peoples, particularly in Tierra del Fuego. Their conservation status is contingent upon species and geographic location; many species fall into the Least Concern category, while others are considered Endangered. Many of the animals on O’Reilly covers are endangered; all of them are important to the world.

The cover illustration is by Karen Montgomery, based on a black-and-white engraving from English Cyclopedia Natural History. The series design is by Edie Freedman, Ellie Volckhausen, and Karen Montgomery. The cover fonts are Gilroy Semibold and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.