Designing fault-tolerant systems on Go

Designing fault-tolerant systems on Go

Hello, Habre!

Fault-tolerant systems these are those that are capable of continuing to function even in the face of partial failures or malfunctions. The main feature of such systems is to ensure the continuity of the program and data security even in the event of errors or unforeseen situations. This is achieved due to a number of architectural and software solutions aimed at preventing the complete failure of the system in the event of individual failures.

Go, thanks to its simplicity, performance and, most importantly, support for concurrency at the language level, becomes an ideal choice for creating fault-tolerant systems.

Briefly about Fault Tolerance

The basis of fault tolerance is this redundancy, or redundancy. This means that the components of the system are duplicated, so that in the event of failure of one of them, the other can take over its functions.

The system must be able to detect errors and failures. This can be implemented through various monitoring and health check mechanisms. It is important not only to detect the error, but also to isolate it to prevent the failure from spreading to other parts of the system.

After detecting and isolating the error, the system should restore its functionality. This may include failover, restarting failed components, or dynamically reallocating resources.

Although this goes beyond direct fault tolerance, it is also important to address error prevention through good and adequate system design, testing, and maintenance.

Go for Fault Tolerance

Competitiveness in Go

Horutyn – These are lightweight threads of execution managed by the Go runtime. They are significantly more efficient than standard operating system threads for several reasons:

Goroutines take up less memory (typically a few kilobytes) and are faster to create and destroy than traditional threads. This allows you to run thousands or even millions of goroutines on a single car without a significant load on the system.

The Go runtime efficiently allocates goroutines to available processor cores, ensuring high performance of competitive operations. Goroutines simplify writing asynchronous code because they allow you to use common flow control structures (such as loops and conditional statements) instead of callbacks or promises.

Channels are mechanisms for secure and synchronized data exchange between goroutines. They prevent common problems associated with competitive access to data, such as data races and race conditions.

Channels ensure that access to shared resources is synchronized, which reduces the risk of errors in multi-threaded applications. They can also be used to control the flow of execution in a program, for example, to organize pools of goroutines or regulate the frequency of operations.

Channels allow goroutines to exchange messages, which simplifies the development of distributed systems and event-handling systems.

Summarizing:

Coroutines make it easy to implement distributed processing of tasks, which increases resistance to failure, since the failure of one task does not affect the execution of others. Because goroutines work independently, a failure in one goroutine does not cause the entire program to fail. This helps isolate and manage errors at a finer level.

Channels can be used to communicate error information between goroutines, allowing error management and fast response. Threads and channels facilitate the creation of complex resource management mechanisms, such as connection pools or rate limiters, which increases the system’s resilience to overloads and failures.

Error handling

In Go, error handling is done explicitly. Instead of throwing exceptions like many other languages, Go uses return values ​​to represent errors.

Go functions often return two values: a result and an error. This forces you to explicitly check for an error after the function executes:

result, err := someFunction()
if err != nil {
    // Обработка ошибки
}

Go allows you to create your own error types, which gives you more control over the handling of specific error scenarios. Go has an integrated interface errorwhich can be implemented by any type.

Panic in Go is an analogue of exception in other languages, but is used in more limited cases. A panic usually indicates a programmer error or the impossibility of continuing with the program.

Panic can be triggered explicitly using a function panic(). This will immediately terminate the execution of the current function and begin “unwinding the stack”:

if someCondition {
    panic("something went wrong")
}

A panic can be intercepted and handled by a function recover()which must be called inside the deferred function:

defer func() {
    if r := recover(); r != nil {
        // Обработка паники
    }
}()

Built-in features and packages

Built-in functions

Function defer is used to guarantee that certain code is executed at the end of a function regardless of how the function terminates (normally or on error). This is useful for freeing up resources:

package main

import (
    "fmt"
    "os"
)

func main() {
    f, err := os.Open("filename.txt")
    if err != nil {
        panic(err)
    }
    // Этот код будет выполнен в конце функции main, аже если произойдет ошибка при чтении файла.z
    // даже если произойдет ошибка при чтении файла.
    defer f.Close()

    // Дальнейшие операции с файлом
}

panic and recover are used to handle unexpected errors and exceptions. panic causes the immediate termination of the function, and recover allows you to recover from a panic:

package main

import "fmt"

func mayPanic() {
    panic("a problem occurred")
}

func main() {
    defer func() {
        if r := recover(); r != nil {
            fmt.Println("Recovered. Error:\n", r)
        }
    }()

    mayPanic()

    fmt.Println("After mayPanic()")
}

Function mayPanic causes panic. In the function mainbloc defer with function recover intercepts this panic, allowing the program to continue execution after the function mayPanic. Without recover the program would terminate immediately after the panic without executing the code following the call mayPanic.

Packages

Package net/http provides an HTTP client and server. It includes mechanisms for managing timeouts and configuring client and server behavior:

package main

import (
    "fmt"
    "net/http"
    "time"
)

func handler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Hello, %s!", r.URL.Path[1:])
}

func main() {
    server := &http.Server{
        Addr:         ":8080",
        Handler:      http.HandlerFunc(handler),
        ReadTimeout:  10 * time.Second,
        WriteTimeout: 10 * time.Second,
    }

    fmt.Println("Starting server at port 8080")
    server.ListenAndServe()
}

Package context used to manage the request lifecycle, including canceling operations:

package main

import (
    "context"
    "fmt"
    "time"
)

func operation(ctx context.Context) {
    select {
    case <-time.After(500 * time.Millisecond):
        fmt.Println("Operation done")
    case <-ctx.Done():
        fmt.Println("Operation cancelled")
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
    defer cancel()

    go operation(ctx)

    time.Sleep(200 * time.Millisecond)
}

(sync.Mutex) and atomic operations (sync/atomic) provide primitives for synchronization. They are necessary for safe work with shared resources in multi-threaded applications:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var count int32
    var wg sync.WaitGroup
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            atomic.AddInt32(&count, 1)
        }()
    }
    wg.Wait()
    fmt.Println("Count:", count)
}

io и bufio I/O packages that include buffering and streaming utilities:

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("example.txt")
    if err != nil {
        panic(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }

    if err := scanner.Err(); err != nil {
        panic(err)
    }
}

Package time provides functionality for working with time and timers:

package main

import (
    "fmt"
    "time"
)

func main() {
    timer := time.NewTimer(2 * time.Second)
    <-timer.C
    fmt.Println("Timer expired")
}

Patterns

Retry and backoff

Exponential backoff is a strategy where the time between retries increases exponentially. Helps to avoid excessive load on the system due to repeated attempts:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// Функция для имитации операции, которая может завершиться неудачей
func operation() error {
    if rand.Float32() < 0.5 {
        return fmt.Errorf("temporary error")
    }
    fmt.Println("Operation successful")
    return nil
}

// Retry выполняет операцию с повторными попытками и экспоненциальным отступлением
func Retry(attempts int, sleep time.Duration, function func() error) error {
    for i := 0; ; i++ {
        err := function()
        if err == nil {
            return nil
        }

        if i >= (attempts - 1) {
            return fmt.Errorf("after %d attempts, last error: %s", attempts, err)
        }

        time.Sleep(sleep)
        sleep = sleep * 2 // Экспоненциальное увеличение времени ожидания
    }
}

func main() {
    rand.Seed(time.Now().UnixNano())
    err := Retry(5, 1*time.Second, operation)
    if err != nil {
        fmt.Println("Operation failed:", err)
    }
}

Retry takes the number of attempts, the initial timeout, and the function to be performed. If the operation fails, the wait time is doubled and the operation is repeated. This continues until the operation succeeds or the maximum number of attempts is reached.

Some form of wait time randomization is often used to avoid the situation where many clients are repeating requests at the same time.

Reasonable limits on the number of attempts and the maximum waiting time should be set to avoid endless loops or excessive system load.

There is also a ready-made library, e.g github.com/cenkalti/backoff.

Circuit breaker

A Circuit Breaker is used to prevent repeated calls that are likely to fail, thereby allowing temporary problems in the system to “fix” themselves. In microservices, it is a direct top.

Go has no built-in support for the Circuit Breaker pattern, but it can be implemented manually:

package main

import (
    "errors"
    "fmt"
    "sync"
    "time"
)

// Состояния Circuit Breaker
const (
    StateClosed   = "closed"
    StateOpen     = "open"
    StateHalfOpen = "half-open"
)

// CircuitBreaker структура
type CircuitBreaker struct {
    state          string
    failureCount   int
    maxFailures    int
    resetTimeout   time.Duration
    halfOpenTimer  *time.Timer
    halfOpenMutex  sync.Mutex
    onStateChange  func(string)
}

// NewCircuitBreaker создает новый Circuit Breaker
func NewCircuitBreaker(maxFailures int, resetTimeout time.Duration) *CircuitBreaker {
    return &CircuitBreaker{
        state:        StateClosed,
        maxFailures:  maxFailures,
        resetTimeout: resetTimeout,
    }
}

// Call выполняет операцию через Circuit Breaker
func (cb *CircuitBreaker) Call(operation func() error) error {
    switch cb.state {
    case StateOpen:
        return errors.New("circuit breaker is open")
    case StateHalfOpen, StateClosed:
        err := operation()
        if err != nil {
            cb.recordFailure()
            return err
        }
        cb.reset()
        return nil
    }
    return nil
}

// recordFailure обновляет счетчик неудач и переключает состояние при необходимости
func (cb *CircuitBreaker) recordFailure() {
    cb.failureCount++
    if cb.failureCount >= cb.maxFailures {
        cb.transitionTo(StateOpen)
        cb.halfOpenTimer = time.AfterFunc(cb.resetTimeout, func() {
            cb.transitionTo(StateHalfOpen)
        })
    }
}

// reset сбрасывает счетчик неудач и переключает состояние на закрытое
func (cb *CircuitBreaker) reset() {
    cb.failureCount = 0
    cb.transitionTo(StateClosed)
}

// transitionTo переключает состояние Circuit Breaker
func (cb *CircuitBreaker) transitionTo(state string) {
    cb.state = state
    if cb.onStateChange != nil {
        cb.onStateChange(state)
    }
}

func main() {
    // Пример использования Circuit Breaker
    cb := NewCircuitBreaker(3, 5*time.Second)
    cb.onStateChange = func(state string) {
        fmt.Println("Circuit Breaker State:", state)
    }

    for i := 0; i < 10; i++ {
        err := cb.Call(func() error {
            // Здесь должна быть логика вызова внешнего сервиса
            return errors.New("service error")
        })
        if err != nil {
            fmt.Println("Operation failed:", err)
        }
        time.Sleep(1 * time.Second)
    }
}

The Circuit Breaker structure includes a state, a failure counter, a maximum number of failures, a reset timeout, and a half-open timer.

Call performs the operation through the Circuit Breaker. If the Circuit Breaker is in the open state, no operation is performed. In the half-open and closed states, the operation is performed, and in the event of an error, a failure is considered.

recordFailure and reset control the failure counter and Circuit Breaker status.

When the maximum number of failures is reached, the Circuit Breaker turns into an open state. After the reset timeout, it goes into a semi-open state, where the next operation will determine whether it should return to the closed state or remain open.

Timeout and deadline

Go provides several mechanisms for managing timeouts and deadlines, including the use of contexts (context) and timers (time).

For simple cases, when you just need to limit the time of execution of a certain operation, you can use timers from the package time:

package main

import (
    "fmt"
    "time"
)

func main() {
    timeout := 2 * time.Second
    timer := time.NewTimer(timeout)

    go func() {
        // Длительная операция
        time.Sleep(3 * time.Second)
        if !timer.Stop() {
            fmt.Println("Operation timed out")
        } else {
            fmt.Println("Operation completed")
        }
    }()

    <-timer.C
    fmt.Println("Timeout reached, operation aborted")
}

In the example, we run a long-running operation in a goroutine and set a timer for 2 seconds. If the operation is not completed within this time, the program displays a timeout message.

Package context Go provides more flexibility, especially in the context of network requests and other operations that can be canceled:

package main

import (
    "context"
    "fmt"
    "time"
)

func operation(ctx context.Context) {
    select {
    case <-time.After(3 * time.Second): // Длительная операция
        fmt.Println("Operation completed")
    case <-ctx.Done():
        fmt.Println("Operation aborted:", ctx.Err())
    }
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    go operation(ctx)

    // Дождаться завершения операции или таймаута
    select {
    case <-ctx.Done():
        fmt.Println("Main: ", ctx.Err())
    }
}

We create a context with a timeout and pass it to the function operation. If the operation does not complete before the timeout expires, the context is automatically canceled and the function operation receives a message about it through the channel ctx.Done().

Using contexts (context) Go for managing the life cycle of operations is a powerful mechanism that allows for effective resource management, especially in distributed systems and when working with asynchronous operations. Contexts in Go provide a way to pass information about state, timeouts, and cancellations between different parts of a program.

Using contexts to manage the lifecycle of operations

Context is created using functions context.Background(), context.TODO(), context.WithCancel(), context.WithDeadline(), context.WithTimeout().

The context is passed to functions and methods that must take into account its state (for example, to undo operations). A context can be canceled at any time, which should result in the termination of all operations associated with that context.

Creating and removing a context might look like this:

package main

import (
    "context"
    "fmt"
    "time"
)

func operation(ctx context.Context, duration time.Duration) {
    select {
    case <-time.After(duration):
        fmt.Println("Operation completed")
    case <-ctx.Done():
        fmt.Println("Operation aborted:", ctx.Err())
    }
}

func main() {
    // Создание контекста с таймаутом
    ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
    defer cancel()

    // Запуск операции с контекстом
    go operation(ctx, 100*time.Millisecond)

    // Дождаться завершения операции или таймаута
    select {
    case <-ctx.Done():
        fmt.Println("Main: ", ctx.Err())
    }
}

operation takes a context and performs some operation. If the operation does not complete before the context timeout expires, it is aborted.

Contexts are often used to propagate information to deeply nested functions:

package main

import (
    "context"
    "fmt"
    "time"
)

func operation1(ctx context.Context) {
    // Пропагация контекста в другую функцию
    operation2(ctx)
}

func operation2(ctx context.Context) {
    select {
    case <-time.After(100 * time.Millisecond):
        fmt.Println("Operation 2 completed")
    case <-ctx.Done():
        fmt.Println("Operation 2 aborted:", ctx.Err())
    }
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())

    go operation1(ctx)

    // Отмена контекста через 50 мс
    time.Sleep(50 * time.Millisecond)
    cancel()

    // Дать время для завершения операций
    time.Sleep(100 * time.Millisecond)
}

operation1 causes operation2, giving her the same context. If the context is canceled, both operations are notified and can complete their work correctly.


Let me remind you that my friends from OTUS have a whole series of practical courses on application architecture, where you can learn not only Fault-tolerant systems, but also many other useful systems and tools. Go to the catalog and choose the appropriate direction.

Related posts