A few months ago, I was looking for a piece of software to spawn long-running tasks on behalf of some daemon, so that the tasks are not interrupted when this daemon is restarted. Here are the specifications:
- No special privilege is needed to submit a task.
- Submitted tasks are not known beforehand.
- Task output is redirected to a log file.
- Submitted tasks are identified by a provided name.
- With only the name as a reference, a task can be checked for existence or killed.
- Tasks should not be interrupted by an unrelated event, like a configuration change or a software upgrade.
The last requirement explains why tasks are not spawned directly by the daemon requesting them: because of its complexity or the way it is operated, this daemon may need frequent restarts. Even if it is possible to re-execute a daemon while keeping its children, as with the stateful re-exec support in Upstart, this is quite difficult: internal state has to be serialized and restored, and part of this state may be held inside a third-party library.
Here comes lanĉo[2]. It is a very simple task launcher. It can run any task, stop it and check whether it is still running. It leverages cgroups in recent Linux kernels and avoids the use of any daemon.
Before looking at how lanĉo works, let’s have a look at how it can be used. To avoid usage conflicts, each task is run in the context of a namespace that needs to be initialized:
$ sudo lanco testns init -u $(id -un) -g $(id -gn)
This is the only command that needs to be run as root. Subsequent ones can be run as a normal user. Let’s run a task:
$ lanco testns run first-task openssl speed aes
$ lanco testns check first-task && echo "Still running"
Still running
$ lanco testns ls
testns
├ first-task
│ → 28456 openssl speed aes
╯
The output of the task is logged into a file:
$ head -3 /var/log/lanco-testns/task-first-task.log
Doing aes-128 cbc for 3s on 16 size blocks: 8678442 aes-128 cbc's in 2.85s
Doing aes-128 cbc for 3s on 64 size blocks: 2478283 aes-128 cbc's in 2.99s
Doing aes-128 cbc for 3s on 256 size blocks: 628105 aes-128 cbc's in 3.00s
If the task is too long, we can kill it:
$ lanco testns run first-task openssl speed aes
$ lanco testns stop first-task
You cannot run a task that already exists or kill a task that does not exist:
$ lanco testns run first-task openssl speed aes
2013-06-09T22:50:34 [WARN/run] task first-task is already running
$ lanco testns stop second-task
2013-06-09T22:50:45 [WARN/stop] task second-task is not running
Thanks to the use of cgroups, lanĉo is able to track multiple processes even when they are forking away[3]:
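For a daemon that submits tasks, the two behaviours above suggest a small wrapper: start a task only when no task with that name exists. The sketch below is a hypothetical helper (`ensure_task` is not part of lanĉo). It relies only on the `check` and `run` subcommands shown above, and it assumes that `check` exits with status 0 when the task is running, as the `&&` example suggests.

```shell
# ensure_task NAMESPACE NAME COMMAND...
# Hypothetical helper: start NAME only if it is not already running.
ensure_task() {
    ns=$1
    name=$2
    shift 2
    if lanco "$ns" check "$name"; then
        echo "task $name is already running, leaving it alone" >&2
        return 0
    fi
    # There is a small window between the check and the run; if another
    # submitter wins the race, lanco itself refuses the duplicate.
    lanco "$ns" run "$name" "$@"
}

# Example: ensure_task testns first-task openssl speed aes
```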
$ lanco testns run second-task sh -c \
>   "while true; do (sleep 30 &)& sleep 1; done"
$ lanco testns ls
testns
├ first-task
│ → 28456 openssl speed aes
├ second-task
│ → 29572 sh -c while true; do (sleep 30 &)& sleep 1; done
│ → 29575 sleep 30
│ → 29593 sleep 30
│ → 29596 sleep 30
│ → 29599 sleep 30
│ → 29622 sleep 30
│ → 29644 sleep 1
│ → 29645 sleep 30
╯
$ lanco testns stop second-task
$ lanco testns check second-task || echo "Killed!"
Killed!
Also, there is a top-like command (lanco testns top).
Control groups (cgroups) are a mechanism to partition a set of tasks and their future children into hierarchical groups with a set of parameters.
Let’s start with the hierarchical stuff first. To create a new hierarchy, you have to mount the cgroup filesystem on some empty directory. Usually, cgroup hierarchies are set up in /sys/fs/cgroup, which is a tmpfs mount:
# mount -t tmpfs tmpfs /sys/fs/cgroup -o nosuid,nodev,noexec,relatime,mode=755
Now, we can create our first hierarchy:
# cd /sys/fs/cgroup
# mkdir my-first-hierarchy
# mount -t cgroup cgroup my-first-hierarchy -o name=my-first-hierarchy,none
# ls -1 my-first-hierarchy
cgroup.clone_children
cgroup.event_control
cgroup.procs
notify_on_release
release_agent
tasks
We’ll see later the purpose of none. The most interesting file is tasks. It contains the PIDs of all processes in our group. Since there is currently no sub-group, all processes are part of it. Let’s create a sub-group and attach a process to it:
# cd my-first-hierarchy
# mkdir first-child
# cd first-child
# ls -1
cgroup.clone_children
cgroup.event_control
cgroup.procs
notify_on_release
tasks
# echo $$ > tasks
# cat tasks
23184
23311
# cat /proc/$$/cgroup
9:name=my-first-hierarchy:/first-child
8:perf_event:/
7:blkio:/
6:net_cls:/
5:freezer:/
4:devices:/
3:cpuacct,cpu:/
2:cpuset:/
1:name=systemd:/user/bernat/1
We have added our shell to the new cgroup. Moreover, all its children will also be part of this group. This explains why we have two tasks: the shell and cat. The last command is quite interesting: for each hierarchy, it shows which cgroup the task belongs to. For a given hierarchy, each task is in exactly one group.
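Since each line of /proc/PID/cgroup has the form ID:subsystems:path, the group of a process in a given hierarchy can be looked up from a script. Here is a minimal sketch; the function name `cgroup_of` is made up for this example, and the optional second argument exists only to make it easy to try on a saved copy of the file.

```shell
# cgroup_of HIERARCHY [FILE]
# Print the cgroup path for HIERARCHY, which is either a subsystem name
# (like "cpuset") or a named hierarchy (like "name=my-first-hierarchy").
# FILE defaults to the cgroup file of the current shell.
cgroup_of() {
    awk -F: -v want="$1" '
        {
            # The second field may list several subsystems: "cpuacct,cpu".
            n = split($2, ctrl, ",")
            for (i = 1; i <= n; i++) {
                if (ctrl[i] == want) {
                    # Keep everything after the second colon, since the
                    # path itself could contain colons.
                    print substr($0, length($1) + length($2) + 3)
                    found = 1
                    exit
                }
            }
        }
        END { exit !found }
    ' "${2:-/proc/$$/cgroup}"
}
```

With the shell attached as above, `cgroup_of name=my-first-hierarchy` should print /first-child. The function returns a non-zero status when the hierarchy is not listed.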
The most useful features of lanĉo are built on just this mechanism: a namespace is a named hierarchy and each task is enclosed in a cgroup so it can be tracked properly.
A subsystem makes use of the task grouping provided by cgroups to assign resources to a set of tasks. Each available subsystem can only be in one hierarchy, so the usual way to set things up is to assign a hierarchy to each subsystem[4].
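To make this concrete, here is a rough sketch of how "check" and "stop" operations could be built from the tasks file alone. This is not lanĉo's actual code: the function names are invented, and both take the directory of a task's cgroup as their argument. Since there is no way to signal a whole group at once, stopping means killing the listed processes one by one and looking again, because new processes may have been forked in the meantime.

```shell
# task_running CGROUP_DIR
# Succeed if at least one process is still attached to the cgroup.
# The content is read explicitly: cgroup pseudo-files report a size of 0.
task_running() {
    [ -n "$(cat "$1/tasks" 2>/dev/null)" ]
}

# task_stop CGROUP_DIR [PASSES]
# Kill every listed process, then check again, at most PASSES times
# (10 by default). Succeed only if the group ends up empty.
task_stop() {
    dir=$1
    passes=${2:-10}
    while [ "$passes" -gt 0 ]; do
        task_running "$dir" || return 0
        for pid in $(cat "$dir/tasks"); do
            kill -KILL "$pid" 2>/dev/null || true
        done
        passes=$((passes - 1))
        sleep 1
    done
    ! task_running "$dir"
}

# Example (illustrative path):
#   task_stop /sys/fs/cgroup/my-first-hierarchy/first-child
```

Without freezing the group first, such a loop can lose the race against a process that forks faster than it is killed, which is why the number of passes is bounded.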
Let’s have a look at the cpuset subsystem: it allows one to assign tasks to specific CPUs and memory nodes (for NUMA systems). Let’s suppose we have 4 cores and we want to assign the first core to common system tasks and the three remaining ones to nginx:
# cd /sys/fs/cgroup
# mkdir cpuset
# mount -t cgroup cgroup cpuset -o cpuset
# echo 0-3 > cpuset/cpuset.cpus
# echo 0 > cpuset/cpuset.mems
# mkdir cpuset/system
# echo 0 > cpuset/system/cpuset.cpus
# echo 0 > cpuset/system/cpuset.mems
# for task in $(cat cpuset/tasks); do
>   echo $task > cpuset/system/tasks
> done
# mkdir cpuset/nginx
# echo 1-3 > cpuset/nginx/cpuset.cpus
# echo 0 > cpuset/nginx/cpuset.mems
# for task in $(pidof nginx); do
>   echo $task > cpuset/nginx/tasks
> done
The mount step is not needed if something else (like systemd) has already set up the appropriate subsystem. Now, even if some system process is going crazy, it won’t affect the performance of your webserver.
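One way to check that the assignment took effect is to read the Cpus_allowed_list field that the kernel exposes in /proc/PID/status. A small sketch follows; the helper name `cpus_allowed` is made up for this example, and it takes the status file as an argument.

```shell
# cpus_allowed STATUS_FILE
# Print the list of CPUs a process may run on, as reported by the kernel
# in a /proc/PID/status file (for example "1-3").
cpus_allowed() {
    awk '$1 == "Cpus_allowed_list:" { print $2 }' "$1"
}

# Example: list the CPUs allowed for each nginx process.
#   for pid in $(pidof nginx); do
#       echo "$pid: $(cpus_allowed "/proc/$pid/status")"
#   done
```

With the setup above, each nginx process should report 1-3 and the processes moved to the system group should report 0.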
The kernel comes with more complete documentation if you need additional details on cgroups. There are high-level tools to manipulate them, like the tools provided by libcg, but they are currently quite buggy and invasive.
Use in lanĉo
Here are the features used by lanĉo:
- For each namespace, a specific named hierarchy is created.
- By setting the appropriate permissions on the hierarchy, an unprivileged user can create subgroups.
- Each submitted task has its own sub-group for tracking purposes.
- Actions can be executed when a task terminates by using the release agent mechanism[5].
- The cpuacct subsystem is used to track CPU usage: a group is created in this subsystem for this purpose.
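The release agent mechanism deserves a short illustration. When notify_on_release is set to 1 in a group and its last task exits, the kernel runs the program named in the release_agent file at the root of the hierarchy, with the path of the group relative to that root as its only argument. The sketch below shows what such an agent could do. It is not lanĉo's implementation: the function name, the directory holding one command file per task and all the paths are assumptions made for this example.

```shell
# Setup, as root (paths are illustrative):
#   echo /usr/local/bin/my-release-agent > "$root/release_agent"
#   echo 1 > "$root/first-task/notify_on_release"

# on_release HIERARCHY_ROOT COMMAND_DIR RELATIVE_CGROUP
# HIERARCHY_ROOT:  mount point of the named hierarchy
# COMMAND_DIR:     hypothetical directory with one command file per task
# RELATIVE_CGROUP: argument passed by the kernel, like "/first-task"
on_release() {
    root=$1
    cmddir=$2
    name=${3#/}
    rc=0
    # Remove the now-empty group; ignore the error if it is already gone.
    rmdir "$root/$name" 2>/dev/null || true
    # Run the command stored for this task, if any, then forget it.
    if [ -f "$cmddir/$name" ]; then
        sh "$cmddir/$name" || rc=$?
        rm -f "$cmddir/$name"
    fi
    return "$rc"
}
```

An agent script would then consist of a single call such as `on_release /sys/fs/cgroup/my-first-hierarchy /some/command/dir "$1"`.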
[1] Both systemd and Upstart are difficult to run without being PID 1 and without being root. They both support user sessions, but this requires using them as PID 1 as well. I discovered later that runit would have been a good fit: it does not need to be PID 1, it does not need to run as root and the service directory can be specified by an environment variable. It could have been tweaked to meet the above requirements.
[2] Lanĉo means launcher in Esperanto.
[3] Linux cgroups do not provide any facility to kill a whole group. We need to iterate through the processes and kill them one by one. lanĉo does not try to freeze processes before killing them and therefore may be ineffective against violent fork bombs.
[4] The cpu and cpuacct subsystems are the notable exception. There seems to be little use in keeping them in separate hierarchies, so they are part of the same hierarchy.
[5] Unfortunately, only one global release agent per hierarchy can be used. Since lanĉo provides the ability to execute an arbitrary command when a task terminates, the command needs to be stored on the filesystem for the release agent of the namespace to execute it.