Integration of a Go service with systemd: socket activation

Vincent Bernat March 19, 2018

In a previous post, I highlighted some useful features of systemd when writing a service in Go, notably to signal readiness and prove liveness. Another interesting bit is socket activation: systemd listens on behalf of the application and, on incoming traffic, starts the service with a copy of the listening socket. Lennart Poettering details in a blog post:

If a service dies, its listening socket stays around, not losing a single message. After a restart of the crashed service it can continue right where it left off. If a service is upgraded we can restart the service while keeping around its sockets, thus ensuring the service is continuously responsive. Not a single connection is lost during the upgrade.

This is one solution to get zero-downtime deployment for your application. Another upside is you can run your daemon with less privileges—loosing rights is a difficult task in Go.¹

The basics
Handling of existing connections
Addendum: decoy process using Go
Addendum: identifying sockets by name

The basics#

Let’s take back our nifty 404-only web server:

package main

import (
    "log"
    "net"
    "net/http"
)

func main() {
    listener, err := net.Listen("tcp", ":8081")
    if err != nil {
        log.Panicf("cannot listen: %s", err)
    }
    http.Serve(listener, nil)
}

Here is the socket-activated version, using go-systemd:

package main

import (
    "log"
    "net/http"

    "github.com/coreos/go-systemd/activation"
)

func main() {
    listeners, err := activation.Listeners(true) // ❶
    if err != nil {
        log.Panicf("cannot retrieve listeners: %s", err)
    }
    if len(listeners) != 1 {
        log.Panicf("unexpected number of socket activation (%d != 1)",
            len(listeners))
    }
    http.Serve(listeners[0], nil) // ❷
}

In ❶, we retrieve the listening sockets provided by systemd. In ❷, we use the first one to serve HTTP requests. Let’s test the result with systemd-socket-activate:²

$ go build 404.go
$ systemd-socket-activate -l 8000 ./404
Listening on [::]:8000 as 3.

In another terminal, we can make some requests to the service:

$ curl '[::1]':8000
404 page not found
$ curl '[::1]':8000
404 page not found

For a proper integration with systemd, you need two files:

a socket unit for the listening socket; and
a service unit for the associated service.

We can use the following socket unit, 404.socket:

[Socket]
ListenStream = 8000
BindIPv6Only = both

[Install]
WantedBy = sockets.target

The systemd.socket(5) manual page describes the available options. BindIPv6Only = both is explicitly specified because the default value is distribution-dependent. As for the service unit, we can use the following one, 404.service:

[Unit]
Description = 404 micro-service

[Service]
ExecStart = /usr/bin/404

systemd knows the two files work together because they share the same prefix. Once the files are in /etc/systemd/system, execute systemctl daemon-reload and systemctl start 404.socket. Your service is ready to accept connections!

Handling of existing connections#

Our 404 service has a major shortcoming: existing connections are abruptly killed when the daemon is stopped or restarted. Let’s fix that!

Waiting a few seconds for existing connections#

We can include a short grace period for connections to terminate, then kill remaining ones:

// On signal, gracefully shut down the server and wait 5
// seconds for current connections to stop.
done := make(chan struct{})
quit := make(chan os.Signal, 1)
server := &http.Server{}
signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)

go func() {
    <-quit
    log.Println("server is shutting down")
    ctx, cancel := context.WithTimeout(context.Background(),
        5*time.Second)
    defer cancel()
    server.SetKeepAlivesEnabled(false)
    if err := server.Shutdown(ctx); err != nil {
        log.Panicf("cannot gracefully shut down the server: %s", err)
    }
    close(done)
}()

// Start accepting connections.
server.Serve(listeners[0])

// Wait for existing connections before exiting.
<-done

Upon reception of a termination signal, the goroutine would resume and schedule a shutdown of the service:

Shutdown() gracefully shuts down the server without interrupting any active connections. Shutdown() works by first closing all open listeners, then closing all idle connections, and then waiting indefinitely for connections to return to idle and then shut down.

While restarting, new connections are not accepted: they sit in the listen queue associated to the socket. This queue is bounded and its size can be configured with the Backlog directive in the socket unit. Its default value is 128. You may keep this value, even when your service is expecting to receive many connections by second. When this value is exceeded, incoming connections are silently dropped. The client should automatically retry to connect. On Linux, by default, it will retry 5 times (tcp_syn_retries) in about 3 minutes. This is a nice way to avoid the herd effect you would experience on restart if you increased the listen queue to some high value.

Waiting longer for existing connections#

If you want to wait for a very long time for existing connections to stop, you do not want to ignore new connections for several minutes. There is a very simple trick: ask systemd to not kill any process on stop. With KillMode = none, only the stop command is executed and all existing processes are left undisturbed:

[Unit]
Description = slow 404 micro-service

[Service]
ExecStart = /usr/bin/404
ExecStop  = /bin/kill $MAINPID
KillMode  = none

If you restart the service, the current process gracefully shuts down for as long as needed and systemd spawns immediately a new instance ready to serve incoming requests with its own copy of the listening socket. On the other hand, we loose the ability to wait for the service to come to a full stop—either by itself or forcefully after a timeout with SIGKILL.

Update (2021-01)

KillMode=none is now deprecated. In addition to the below alternative, it is possible to pass the currently active connections to systemd with the sd_pid_notify_with_fds() function. Then, the new process needs some logic to handle them. Unfortunately, this is not an easy task as you also need to serialize the associated state. Moreover, the function is not implemented in go-systemd.

Waiting longer for existing connections (alternative)#

An alternative to the previous solution is to make systemd believe your service died during reload.

done := make(chan struct{})
quit := make(chan os.Signal, 1)
server := &http.Server{}
signal.Notify(quit,
    // for reload:
    syscall.SIGHUP,
    // for stop or full restart:
    syscall.SIGINT, syscall.SIGTERM)
go func() {
    sig := <-quit
    switch sig {
    case syscall.SIGINT, syscall.SIGTERM:
        // Shutdown with a time limit.
        log.Println("server is shutting down")
        ctx, cancel := context.WithTimeout(context.Background(),
            15*time.Second)
        defer cancel()
        server.SetKeepAlivesEnabled(false)
        if err := server.Shutdown(ctx); err != nil {
            log.Panicf("cannot gracefully shut down the server: %s", err)
        }
    case syscall.SIGHUP: // ❶
        // Execute a short-lived process and asks systemd to
        // track it instead of us.
        log.Println("server is reloading")
        pid := detachedSleep()
        daemon.SdNotify(false, fmt.Sprintf("MAINPID=%d", pid))
        time.Sleep(time.Second) // Wait a bit for systemd to check the PID

        // Wait without a limit for current connections to stop.
        server.SetKeepAlivesEnabled(false)
        if err := server.Shutdown(context.Background()); err != nil {
            log.Panicf("cannot gracefully shut down the server: %s", err)
        }
    }
    close(done)
}()

// Serve requests with a slow handler.
server.Handler = http.HandlerFunc(
    func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(10 * time.Second)
        http.Error(w, "404 not found", http.StatusNotFound)
    })
server.Serve(listeners[0])

// Wait for all connections to terminate.
<-done
log.Println("server terminated")

The main difference is the handling of the SIGHUP signal in ❶: a short-lived decoy process is spawned and systemd is told to track it. When it dies, systemd will start a new instance. This method is a bit hacky: systemd needs the decoy process to be a child of PID 1 but Go cannot easily detach on its own. Therefore, we leverage a short Python helper, wrapped in a detachedSleep() function:³

// detachedSleep spawns a detached process sleeping
// one second and returns its PID.
func detachedSleep() uint64 {
    py := `
import os
import time

pid = os.fork()
if pid == 0:
    for fd in {0, 1, 2}:
        os.close(fd)
    time.sleep(1)
else:
    print(pid)
`
    cmd := exec.Command("/usr/bin/python3", "-c", py)
    out, err := cmd.Output()
    if err != nil {
        log.Panicf("cannot execute sleep command: %s", err)
    }
    pid, err := strconv.ParseUint(strings.TrimSpace(string(out)), 10, 64)
    if err != nil {
        log.Panicf("cannot parse PID of sleep command: %s", err)
    }
    return pid
}

During reload, there may be a small period during which both the new and the old processes accept incoming requests. If you don’t want that, you can move the creation of the short-lived process outside the goroutine, after server.Serve(), or implement some synchronization mechanism. There is also a possible race-condition when we tell systemd to track another PID—see PR #7816.

The 404.service unit needs an update:

[Unit]
Description = slow 404 micro-service

[Service]
ExecStart    = /usr/bin/404
ExecReload   = /bin/kill -HUP $MAINPID
Restart      = always
NotifyAccess = main
KillMode     = process

Each additional directive is significant:

ExecReload tells how to reload the process—by sending SIGHUP.
Restart tells to restart the process if it stops “unexpectedly,” notably on reload.⁴
NotifyAccess specifies which process can send notifications, like a PID change.
KillMode tells to only kill the main identified process—others are left untouched.

Zero-downtime deployment?#

Zero-downtime deployment is a difficult endeavor on Linux. For example, HAProxy had a long list of hacks until a proper—and complex—solution was implemented in HAproxy 1.8. How do we fare with our simple implementation?

From the kernel point of view, there is a only one socket with a unique listen queue. This socket is associated to several file descriptors: one in systemd and one in the current process. The socket stays alive as long as there is at least one file descriptor. An incoming connection is put by the kernel in the listen queue and can be dequeued from any file descriptor with the accept() syscall. Therefore, this approach actually achieves zero-downtime deployment: no incoming connection is rejected.

By contrast, HAProxy was using several different sockets listening to the same addresses, thanks to the SO_REUSEPORT option.⁵ Each socket gets its own listening queue and the kernel balances incoming connections between each queue. When a socket gets closed, the content of its queue is lost. If an incoming connection was sitting here, it would receive a reset. An elegant patch for Linux to signal a socket should not receive new connections was rejected. HAProxy 1.8 is now recycling existing sockets by sending them to the new processes through a Unix socket.

I hope this post and the previous one show how systemd is a good sidekick for a Go service: readiness, liveness and socket activation are some of the useful features you can get to build a more reliable application.

Addendum: decoy process using Go#

Update (2018-03)

On /r/golang, it was pointed out to me that, in the version where systemd is tracking a decoy, the helper can be replaced by invoking the main executable. By relying on a change of environment, it assumes the role of the decoy. Here is such an implementation replacing the detachedSleep() function:

func init() {
    // As early as possible, check if we should be the decoy.
    state := os.Getenv("__SLEEPY")
    os.Unsetenv("__SLEEPY")
    switch state {
    case "1":
        // First step, fork again.
        execPath := self()
        child, err := os.StartProcess(
            execPath,
            []string{execPath},
            &os.ProcAttr{
                Env: append(os.Environ(), "__SLEEPY=2"),
            })
        if err != nil {
            log.Panicf("cannot execute sleep command: %s", err)
        }

        // Advertise child's PID and exit. Child will be
        // orphaned and adopted by PID 1.
        fmt.Printf("%d", child.Pid)
        os.Exit(0)
    case "2":
        // Sleep and exit.
        time.Sleep(time.Second)
        os.Exit(0)
    }
    // Not the sleepy helper. Business as usual.
}

// self returns the absolute path to ourselves. This relies on
// /proc/self/exe which may be a symlink to a deleted path (for
// example, during an upgrade).
func self() string {
    execPath, err := os.Readlink("/proc/self/exe")
    if err != nil {
        log.Panicf("cannot get self path: %s", err)
    }
    execPath = strings.TrimSuffix(execPath, " (deleted)")
    return execpath
}

// detachedSleep spawns a detached process sleeping one second and
// returns its PID. A full daemonization is not needed as the process
// is short-lived.
func detachedSleep() uint64 {
    cmd := exec.Command(self())
    cmd.Env = append(os.Environ(), "__SLEEPY=1")
    out, err := cmd.Output()
    if err != nil {
        log.Panicf("cannot execute sleep command: %s", err)
    }
    pid, err := strconv.ParseUint(strings.TrimSpace(string(out)), 10, 64)
    if err != nil {
        log.Panicf("cannot parse PID of sleep command: %s", err)
    }
    return pid
}

Addendum: identifying sockets by name#

For a given service, systemd can provide several sockets. To identify them, it is possible to name them. Let’s suppose we also want to return 403 error codes from the same service but on a different port. We add an additional socket unit definition, 403.socket, linked to the same 404.service job:

[Socket]
ListenStream = 8001
BindIPv6Only = both
Service      = 404.service

[Install]
WantedBy=sockets.target

Unless overridden with FileDescriptorName, the name of the socket is the name of the unit: 403.socket. go-systemd provides the ListenersWithNames() function to fetch a map from names to listening sockets:

package main

import (
    "log"
    "net/http"
    "sync"

    "github.com/coreos/go-systemd/activation"
)

func main() {
    var wg sync.WaitGroup

    // Map socket names to handlers.
    handlers := map[string]http.HandlerFunc{
        "404.socket": http.NotFound,
        "403.socket": func(w http.ResponseWriter, r *http.Request) {
            http.Error(w, "403 forbidden",
                http.StatusForbidden)
        },
    }

    // Get listening sockets.
    listeners, err := activation.ListenersWithNames(true)
    if err != nil {
        log.Panicf("cannot retrieve listeners: %s", err)
    }

    // For each listening socket, spawn a goroutine
    // with the appropriate handler.
    for name := range listeners {
        for idx := range listeners[name] {
            wg.Add(1)
            go func(name string, idx int) {
                defer wg.Done()
                http.Serve(
                    listeners[name][idx],
                    handlers[name])
            }(name, idx)
        }
    }

    // Wait for all goroutines to terminate.
    wg.Wait()
}

Let’s build the service and run it with systemd-socket-activate:

$ go build 404.go
$ systemd-socket-activate -l 8000 -l 8001 \
>                         --fdname=404.socket:403.socket \
>                         ./404
Listening on [::]:8000 as 3.
Listening on [::]:8001 as 4.

In another console, we can make a request for each endpoint:

$ curl '[::1]':8000
404 page not found
$ curl '[::1]':8001
403 forbidden

Many process characteristics in Linux are attached to threads. Go runtime transparently manages them without much user control. Until recently, this made some features, like setuid() or setns(), unusable. ↩︎
With an older version of systemd (before 230), look for /lib/systemd/systemd-activate instead. ↩︎
Python is a good candidate: it’s likely to be available on the system, it is low-level enough to easily implement the functionality and, as an interpreted language, it doesn’t require a specific build step.

There is no need to fork twice as we only need to detach the decoy from the current process. This simplify a bit the Python code. ↩︎
This is not an essential directive as the process is also restarted through socket-activation. ↩︎
This approach is more convenient when reloading since you don’t have to figure out which sockets to reuse and which ones to create from scratch. Moreover, when several processes need to accept connections, using multiple sockets is more scalable as the different processes won’t fight over a shared lock to accept connections. ↩︎