Container setup

LXC

This section documents the processes of configuring LXC on distros supported by Gel.

Installation

Alpine: apk add lxc lxcfs lxc-download lxc-bridge
Debian: apt install lxc lxcfs lxc-templates uidmap libpam-cgfs bridge-utils rsync --no-install-recommends
openSUSE: zypper in lxc
Rocky Linux/AlmaLinux: dnf install lxc lxcfs lxc-templates
Photon: N/A

Config files

Container config files: /var/lib/lxc/<name>/config

Container creation from official templates

When creating containers from official templates, you'll be presented with a list of available distros, along with release names and CPU architectures. Visit for the full list.

To select a source image directly without the selection prompt, use the following command.

lxc-create -t download -n "<name>" -- --dist <distro> --release <release> --arch <arch>
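As an illustration, the command above can be wrapped in a small helper that prints what it would run before anything is actually created. This is only a sketch: the function name create_container, the DRY_RUN variable, and the web01/debian/bookworm/amd64 values are assumptions made for the example, not part of LXC.

```shell
#!/bin/sh
# Hypothetical wrapper around the lxc-create invocation shown above.
# With DRY_RUN=1 (the default in this sketch) it only prints the command.
create_container() {
    name=$1 dist=$2 release=$3 arch=$4
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'lxc-create -t download -n "%s" -- --dist %s --release %s --arch %s\n' \
            "$name" "$dist" "$release" "$arch"
    else
        lxc-create -t download -n "$name" -- \
            --dist "$dist" --release "$release" --arch "$arch"
    fi
}

# Example values only; pick a distro, release and arch from the template list.
create_container web01 debian bookworm amd64
```

Set DRY_RUN=0 once the printed command looks right.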

Assign static IPv4 addresses

From Setup network bridge in lxc-net.

Create /etc/lxc/dhcp.conf. The definitions go in dhcp-host=<containerName>,<ip> format. Example below.

dhcp-host=deerHorny,10.0.3.114
dhcp-host=polakCute,10.0.3.115

If /etc/default/lxc-net exists, add the following line to it so that lxc-net uses the DHCP config, then restart the lxc-net service.

LXC_DHCP_CONFILE=/etc/lxc/dhcp.conf
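For more than a handful of containers, the dhcp-host lines can be generated from a simple "name ip" list. The sketch below assumes such a list on standard input; the function name dhcp_hosts and the web01/db01 entries are made up for the example.

```shell
#!/bin/sh
# Print one dhcp-host line per "name ip" pair read from stdin.
# Lines missing either field are skipped.
dhcp_hosts() {
    while read -r name ip; do
        [ -n "$name" ] && [ -n "$ip" ] || continue
        printf 'dhcp-host=%s,%s\n' "$name" "$ip"
    done
}

# Hypothetical container names and addresses inside the default 10.0.3.0/24 bridge.
printf '%s\n' 'web01 10.0.3.20' 'db01 10.0.3.21' | dhcp_hosts
```

Redirect the output to /etc/lxc/dhcp.conf once the list looks right.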

Enable autostart

In the container config, have the following lines.

# Enable autostart
# Start order: lower is earlier
lxc.start.order = <startOrder>
lxc.start.auto = 1
# Delay in seconds
lxc.start.delay = 4

Enable nested containerization

In the container config, have the following lines.

# Allow nested containerization
lxc.include = /usr/share/lxc/config/nesting.conf

Enable FUSE

In the container config, have the following lines.

# Enable FUSE
lxc.mount.entry = /dev/fuse dev/fuse none bind,create=file,rw 0 0

Enable TUN

In the container config, have the following lines.

# Enable TUN
lxc.mount.entry = /dev/net dev/net none bind,create=dir
lxc.cgroup2.devices.allow = c 10:200 rwm

Limit CPU and RAM usage

From Memory Controller ・cgroup2.

In the container config, follow the example provided below.

# Limit CPU and RAM
lxc.cgroup2.memory.min = 268435456
lxc.cgroup2.memory.max = 536870912
lxc.cgroup2.cpu.max = 500000 1000000

This sets the container to...

Use at most 512 MiB of RAM (hard limit), with 256 MiB guaranteed (hard protection).
Use at most half of a core's worth of CPU time.
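The raw numbers in the example are bytes for the memory keys and a "quota period" pair in microseconds for cpu.max. A minimal sketch of the arithmetic, assuming the same 1000000 microsecond period as the example; the helper names mib_to_bytes and cpu_max are made up here.

```shell
#!/bin/sh
# Convert MiB to bytes for memory.min / memory.max.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Build a cpu.max value ("<quota> <period>", both in microseconds)
# from a percentage of one core. 200 would mean two full cores.
cpu_max() {
    period=1000000
    echo "$(( $1 * period / 100 )) $period"
}

# Reproduces the three lines of the example above.
printf 'lxc.cgroup2.memory.min = %s\n' "$(mib_to_bytes 256)"
printf 'lxc.cgroup2.memory.max = %s\n' "$(mib_to_bytes 512)"
printf 'lxc.cgroup2.cpu.max = %s\n' "$(cpu_max 50)"
```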

Raise limits on opened files

From Proxmox ulimit hell: how to really increase open files ?

In /etc/sysctl.conf, make sure the following lines are present. Feel free to adjust the values to your needs.

fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576
vm.max_map_count = 262144

In /etc/security/limits.conf, have the following lines. Feel free to adjust the values to your needs.

*     soft  nofile  1048576
*     hard  nofile  1048576
root  soft  nofile  1048576
root  hard  nofile  1048576
*     soft  memlock 1048576
*     hard  memlock 1048576

In the container config, have the following lines. Feel free to adjust the values to your needs.

# Raise limits on opened files
lxc.prlimit.nofile = 1048576

Inside the container, have the following lines in /etc/security/limits.conf. Feel free to adjust the values to your needs.

*     soft  nofile  1048576
*     hard  nofile  1048576
root  soft  nofile  1048576
root  hard  nofile  1048576
*     soft  memlock 1048576
*     hard  memlock 1048576

Reboot the host and the container(s) to apply the changes.
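After the reboot, the effective values can be checked from a shell on the host and again inside the container. The snippet below is a rough sketch; the function name show_limits is an assumption, and it only reads values, it does not change anything.

```shell
#!/bin/sh
# Print the open-file limits of the current shell and the sysctls
# set above, skipping keys this kernel does not expose.
show_limits() {
    printf 'nofile soft: %s\n' "$(ulimit -Sn)"
    printf 'nofile hard: %s\n' "$(ulimit -Hn)"
    for key in fs/inotify/max_queued_events fs/inotify/max_user_instances \
               fs/inotify/max_user_watches vm/max_map_count; do
        if [ -r "/proc/sys/$key" ]; then
            printf '%s: %s\n' "$key" "$(cat "/proc/sys/$key")"
        fi
    done
}

show_limits
```

If the reported values are lower than configured, recheck the step for that layer (sysctl, limits.conf, or the container config).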

Manual unprivileged container setup

Extended from this blog post.

Containers configured this way are unprivileged but owned by root. This avoids the problems surrounding unprivileged containers owned by unprivileged users.

You can create the container before or after assigning subordinate IDs manually, but it must be done before modifying the container's configuration file. All commands in this section assume root privileges unless explicitly stated otherwise.

Mount folders from host

To mount a folder from the host with read-only permissions, append this in the container config. Remove ro to allow writing.

lxc.mount.entry = <hostPath> <containerPath> none ro,bind 0 0

Keep in mind that the host path must be prefixed with /, while the container path should not. For example, if mounting /run/horniDeer to /tmp/horniDeer in the container, the following line should be present.

lxc.mount.entry = /run/horniDeer tmp/horniDeer none ro,bind 0 0
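Since the missing leading slash on the container path is easy to get wrong, a small helper can build the line. This is a sketch only; mount_entry is a made-up name, and the default options mirror the read-only example above.

```shell
#!/bin/sh
# mount_entry <hostPath> <containerPath> [options]
# Strips the leading slash from the container path, as LXC expects.
mount_entry() {
    host_path=$1
    container_path=${2#/}
    opts=${3:-ro,bind}
    printf 'lxc.mount.entry = %s %s none %s 0 0\n' \
        "$host_path" "$container_path" "$opts"
}

# Read-only mount, same as the example above.
mount_entry /run/horniDeer /tmp/horniDeer
# Writable mount: pass the options without ro.
mount_entry /run/horniDeer /tmp/horniDeer bind
```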

Select and map subordinate IDs

Subordinate IDs permit mapping a range of IDs to a user, allowing the container to run unprivileged without the typical downsides. To avoid conflicts, it's advised to reserve a relatively large gap between different unprivileged containers in multiples of 65536, the minimum required amount of subordinate IDs for running unprivileged containers of any kind.

You'll be editing /etc/subuid for user IDs, and /etc/subgid for group IDs. Both files follow the same scheme: <username>:<startID>:<idCount>. For example, hornydeer:2097152:65536 maps IDs from 2097152 to 2162687 to user hornydeer, 65536 IDs in total. Comments are not allowed there.
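The range arithmetic can be sketched as follows. The helper names subid_last and ranges_overlap are assumptions for this example, and the second range in the overlap checks is hypothetical.

```shell
#!/bin/sh
# Last ID covered by a <startID>:<idCount> allocation.
subid_last() {
    echo $(( $1 + $2 - 1 ))
}

# ranges_overlap <startA> <countA> <startB> <countB>
# Succeeds (exit 0) when the two allocations share at least one ID.
ranges_overlap() {
    [ "$1" -lt $(( $3 + $4 )) ] && [ "$3" -lt $(( $1 + $2 )) ]
}

subid_last 2097152 65536    # prints 2162687

# An allocation starting right after the first one does not overlap.
if ranges_overlap 2097152 65536 2162688 65536; then
    echo 'overlap'
else
    echo 'no overlap'
fi
```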

As an example, we set the start UID and start GID to 1148576 and allocate 65536 IDs for use by the container. If you intend to have an LXC container act as a container host, you may need to scale up the count of IDs. Write the following line to both /etc/subuid and /etc/subgid.

root:1148576:65536

If you're going to run unprivileged containers inside the target unprivileged LXC, below is an example reserving enough subordinate IDs for use.

root:1148576:262144

Apply mapped IDs in configuration

To apply the mapped IDs, head to /var/lib/lxc/<containerName> and modify the config file. Depending on the distro chosen for the container, there may be separate user namespace profiles, so switch to those if you encounter problems.

# Remapped user and group IDs
lxc.include = /usr/share/lxc/config/userns.conf
lxc.idmap = u 0 1148576 65536
lxc.idmap = g 0 1148576 65536

If you've chosen to use the larger ID space for unprivileged containers above, below is the corresponding example.

# Remapped user and group IDs
lxc.include = /usr/share/lxc/config/userns.conf
lxc.idmap = u 0 1148576 262144
lxc.idmap = g 0 1148576 262144
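The lxc.idmap range has to fit inside the allocation written to /etc/subuid and /etc/subgid. The check below is a sketch of that comparison; idmap_fits is a made-up helper, and it takes the allocation line as an argument rather than reading the real files.

```shell
#!/bin/sh
# idmap_fits <subid-line> <start> <count>
# Succeeds when <start>..<start+count-1> lies inside the allocation
# described by a "<username>:<startID>:<idCount>" line.
idmap_fits() {
    line=$1 start=$2 count=$3
    IFS=: read -r _user sub_start sub_count <<EOF
$line
EOF
    [ "$start" -ge "$sub_start" ] &&
        [ $(( start + count )) -le $(( sub_start + sub_count )) ]
}

if idmap_fits 'root:1148576:65536' 1148576 65536; then
    echo 'idmap fits the subordinate ID allocation'
fi
if ! idmap_fits 'root:1148576:65536' 1148576 262144; then
    echo 'idmap exceeds the allocation; enlarge /etc/subuid and /etc/subgid'
fi
```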

Change owner of the container root

Before the LXC container can be started, the owner of its root folder must be set to the first subordinate ID, 1148576 in this example. Run the following command.

chown -R 1148576:1148576 /var/lib/lxc/<containerName>/rootfs

Also ensure the container itself can access its own filesystem, for good measure.

chmod 755 /var/lib/lxc # Most distros already have this as the default
chmod 755 /var/lib/lxc/<containerName>
chmod 755 /var/lib/lxc/<containerName>/rootfs
chmod 640 /var/lib/lxc/<containerName>/config

`nftables`

The default config for nftables looks like this.

#!/usr/sbin/nft -f

flush ruleset

table inet filter {
	chain input {
		type filter hook input priority filter;
	}
	chain forward {
		type filter hook forward priority filter;
	}
	chain output {
		type filter hook output priority filter;
	}
}

It's possible to match multiple ports at the same time. Instead of specifying a single port number (e.g. 443), use curly braces: {443, 8443}. Ranges can also be specified: 1024-2047.
If a certain rule only applies to traffic originating from certain interfaces, prefix the rule with iif <interface>. Can be a single interface (e.g. iif "eth0") or multiple (e.g. iif {"eth0", "ens15"}). Use iifname instead of iif if you are not sure if the interfaces are going to be present, and swap the first i with o (e.g. oif) if you want to select outbound interfaces instead.
Common selectors: iifname, oifname, ct, ip, ip6, icmp, icmpv6, sctp, tcp, udp, udplite.
- ip: protocol, ttl, saddr, daddr
- ip6: nexthdr, hoplimit, saddr, daddr
- icmp, icmpv6: type
- sctp, tcp, udp, udplite: sport, dport
- ct: direction, mark, state, status
Common actions: accept, drop, reject, dnat, snat, masquerade

Transparent service exposure

From nftables: forwarding without masquerading, Quick reference: nftables in 10 minutes.

Because the LXC host is the network gateway of all LXC containers, service exposure without masquerading is entirely possible, allowing services inside LXC slices to see the actual client IP addresses. Add the block below to begin specifying rules for service exposure.

If you want to expose services on both IPv4 and IPv6, rules will need to be duplicated. It's also important to note that containers must have the respective IP version available, for it to be exposed transparently. LXC 6.0.0 and newer has IPv6 addresses assigned automatically, while 4.0.0 and newer can have IPv6 manually configured. Only LXC 5.0.0 and newer supports IPv6 connectivity behind NAT.

table ip nat {
	chain prerouting {
		type nat hook prerouting priority filter;
		# Insert new rules for IPv4 here
	}
}
table ip6 nat {
	chain prerouting {
		type nat hook prerouting priority filter;
		# Insert new rules for IPv6 here
	}
}

Let's say we want to expose 10.0.3.2:443 for anyone on the Internet to access on port 443.

tcp dport 443 dnat to 10.0.3.2

If the port numbers are not the same, the port will need to be overridden.

tcp dport 443 dnat to 10.0.3.2:8443

Or, to expose multiple ports without overriding the port:

tcp dport {443, 8443} dnat to 10.0.3.2
tcp dport 512-1023 dnat to 10.0.3.2

Or, to expose the service only on certain interfaces:

iif "eth0" tcp dport 443 dnat to 10.0.3.2
iif {"eth0", "vlan0"} tcp dport 443 dnat to 10.0.3.2

An example of a rule with similar use under IPv6.

iif "he-ipv6" tcp dport {80, 443} dnat to [fc11:4514:1919:810::ff:fe00:2]

Feel free to swap tcp for any other protocol you intend to use, such as sctp, udp or udplite. icmp and icmpv6 have no ports, so rules for them match on type rather than dport.
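The rule variants above all follow one pattern, which can be sketched as a small generator. The function name dnat_rule is an assumption, and 10.0.3.3 with the UDP ports is a hypothetical second service; the output is meant to be pasted into the prerouting chain, not applied directly.

```shell
#!/bin/sh
# dnat_rule <proto> <dport> <target> [inbound-interface]
# Prints one nftables DNAT rule in the scheme used above.
dnat_rule() {
    proto=$1 dport=$2 target=$3 iif=${4:-}
    if [ -n "$iif" ]; then
        printf 'iif "%s" ' "$iif"
    fi
    printf '%s dport %s dnat to %s\n' "$proto" "$dport" "$target"
}

dnat_rule tcp 443 10.0.3.2
dnat_rule tcp 443 10.0.3.2:8443 eth0
dnat_rule udp '{53, 5353}' 10.0.3.3
```

Running nft -c -f /etc/nftables.conf after editing checks the file without applying it.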

Reload your ruleset with the command below. Restarting lxc-net afterwards ensures LXC slices still have connectivity via NAT after the flush.

nft -f /etc/nftables.conf; systemctl restart lxc-net

Network access restriction - IP-based

Inspired by How to restrict network access of LXC container.

Notice

nftables-based network access control is still under investigation. Problems are expected to arise.

If fine-grained access control like destination-matching (e.g. domain) is desired, use EEP with transparent proxy on the host instead.

The current nftables approach requires static IPs to be assigned first. A static IPv6 address must be assigned via a static MAC address, so remember to define a static MAC address for the container.

The inet filter forward section is where network access of individual containers is filtered.

If whitelisted network access is desired, add a rule in the scheme shown below to the end of the section for that specific container.

iif "lxcbr0" ip saddr 10.0.3.2 drop;

Then add allowed access ranges before the final drop to grant access to specific addresses. If problems occur with transparent service exposure, they will need to be made exempt.

iif "lxcbr0" ip saddr 10.0.3.2 ip daddr 10.0.3.0-10.0.3.255 accept;

Or, if network access isn't whitelisted and access to certain ranges is to be blocked, add a rule in the scheme shown below.

iif "lxcbr0" ip saddr 10.0.3.2 ip daddr 10.0.3.2-10.0.3.255 drop;
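For the whitelist case, the ordering matters: every accept rule for a container has to come before its final drop. A sketch that prints the rules in that order is shown below; allow_only is a made-up helper name, and 192.0.2.0/24 is a documentation range standing in for a real destination.

```shell
#!/bin/sh
# allow_only <bridge> <container-ip> <allowed-range>...
# Prints accept rules for each allowed destination, then the final drop.
allow_only() {
    bridge=$1 src=$2
    shift 2
    for dst in "$@"; do
        printf 'iif "%s" ip saddr %s ip daddr %s accept;\n' \
            "$bridge" "$src" "$dst"
    done
    printf 'iif "%s" ip saddr %s drop;\n' "$bridge" "$src"
}

allow_only lxcbr0 10.0.3.2 10.0.3.0-10.0.3.255 192.0.2.0/24
```

The printed lines go into the forward chain of the inet filter table.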

Podman

This section documents the processes of setting up Podman on distros supported by Gel. To get Podman functioning, FUSE and TUN support have to be present.

If you're running Podman inside an (unprivileged) LXC container, make sure the steps listed below have all been applied to the host LXC container, all of which can be found above.

Assign a larger ID space
Enable FUSE
Enable nested containerization
Enable TUN
Raise limits on opened files

Installation

Warning

Certain distros (e.g. Debian) may not have a functioning version of crun. Install crun from Nixpkgs when such errors are encountered.

A few distros like Photon do not have podman-compose bundled.

If you encounter warnings regarding / not being shared, fix temporarily with mount --make-rshared /. Read Alpine Wiki for further info.

Alpine: apk add podman podman-compose
Debian: apt install podman podman-compose
openSUSE: zypper in podman podman-compose
Rocky Linux/AlmaLinux: dnf install podman podman-compose
Photon: tdnf install podman

After installation, run a "Hello World" container to ensure everything works correctly.

podman run --rm hello-world

If problems occur, below is an example command for debugging.

podman run --security-opt="seccomp=unconfined" --log-level=debug --rm hello-world

Manual subordinate ID assignment

Note

Distros may already have this section configured automatically. Only follow this section when you encounter problems.

Explanations about subordinate IDs are available in previous sections. If you encounter Podman complaining about IDs, below is an example inside unprivileged LXC containers to apply in both /etc/subuid and /etc/subgid.

<username>:65536:131072

Run podman system migrate whenever the assigned subordinate ID space changes.