Container setup

LXC

This section documents how to configure LXC on the distros supported by Gel.

Installation

  • Alpine: apk add lxc lxcfs lxc-download lxc-bridge
  • Debian: apt install lxc lxcfs lxc-templates uidmap libpam-cgfs bridge-utils --no-install-recommends
  • openSUSE: zypper in lxc
  • Rocky Linux/AlmaLinux: dnf install lxc lxcfs lxc-templates
  • Photon: N/A

Config files

  • Container config files: /var/lib/lxc/<name>/config

Container creation from official templates

When creating containers from official templates, you'll be presented with a list of available distros, along with release names and CPU architectures. Visit for the full list.

To select a source image directly without the selection prompt, use the following command.

lxc-create -t download -n "<name>" -- --dist <distro> --release <release> --arch <arch>

Assign static IPv4 addresses

From Setup network bridge in lxc-net.

Create /etc/lxc/dhcp.conf. The definitions go in dhcp-host=<containerName>,<ip> format. Example below.

dhcp-host=deerHorny,10.0.3.114
dhcp-host=polakCute,10.0.3.115

If /etc/default/lxc-net exists, add the following line to it so lxc-net uses the DHCP config, then restart the lxc-net service.

LXC_DHCP_CONFILE=/etc/lxc/dhcp.conf
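As a minimal sketch of the step above, the helper below appends a dhcp-host entry only when the container name isn't listed yet. The function name add_dhcp_host and the container name webFront are made up for illustration; the file path is passed in as an argument.

```shell
# add_dhcp_host FILE NAME IP
# Append "dhcp-host=NAME,IP" to FILE unless NAME already has an entry.
add_dhcp_host() {
    file=$1 name=$2 ip=$3
    touch "$file" || return 1
    if grep -q "^dhcp-host=${name}," "$file"; then
        echo "entry for ${name} already exists" >&2
        return 1
    fi
    printf 'dhcp-host=%s,%s\n' "$name" "$ip" >> "$file"
}

# Example (as root on the host; webFront is a hypothetical container name):
#   add_dhcp_host /etc/lxc/dhcp.conf webFront 10.0.3.120
```

Refusing duplicates keeps a container from ending up with two conflicting addresses in the file.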

Enable autostart

In the container config, have the following lines.

# Enable autostart
# Lower values start earlier
lxc.start.order = <startOrder>
lxc.start.auto = 1
# Delay in seconds
lxc.start.delay = 4

Enable nested containerization

In the container config, have the following lines.

# Allow nested containerization
lxc.include = /usr/share/lxc/config/nesting.conf

Enable FUSE

In the container config, have the following lines.

# Enable FUSE
lxc.mount.entry = /dev/fuse dev/fuse none bind,create=file,rw 0 0

Enable TUN

In the container config, have the following lines.

# Enable TUN
lxc.mount.entry = /dev/net dev/net none bind,create=dir
lxc.cgroup2.devices.allow = c 10:200 rwm

Limit CPU and RAM usage

From Memory Controller ・cgroup2.

In the container config, follow the example provided below.

# Limit CPU and RAM
lxc.cgroup2.memory.min = 268435456
lxc.cgroup2.memory.max = 536870912
lxc.cgroup2.cpu.max = 500000 1000000

This sets the container to...

  • Use at most 512 MiB of RAM (hard limit), with 256 MiB guaranteed (hard protection).
  • Use at most half of a core's worth of CPU time (500000 µs out of every 1000000 µs period).
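The numbers in the example above are plain arithmetic: MiB converted to bytes, and a CPU quota expressed as microseconds per period. Below is a small sketch that derives them; the function name limits_conf and the fixed 1000000 µs period are choices made for this illustration.

```shell
# limits_conf MIN_MIB MAX_MIB CPU_PERCENT
# Print cgroup2 limit lines for an LXC container config.
# CPU_PERCENT is a percentage of one core (50 = half a core, 200 = two cores).
limits_conf() {
    min_mib=$1 max_mib=$2 cpu_percent=$3
    period=1000000                            # scheduling period in microseconds
    quota=$(( period * cpu_percent / 100 ))   # CPU time allowed per period
    echo "lxc.cgroup2.memory.min = $(( min_mib * 1024 * 1024 ))"
    echo "lxc.cgroup2.memory.max = $(( max_mib * 1024 * 1024 ))"
    echo "lxc.cgroup2.cpu.max = ${quota} ${period}"
}

# Reproduces the three lines of the example above.
limits_conf 256 512 50
```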

Raise limits on opened files

From Proxmox ulimit hell: how to really increase open files ?

In /etc/sysctl.conf, make sure the following lines are present. Feel free to adjust the values to your needs.

fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_instances = 1048576
fs.inotify.max_user_watches = 1048576
vm.max_map_count = 262144

In /etc/security/limits.conf, have the following lines. Feel free to adjust the values to your needs.

*     soft  nofile  1048576
*     hard  nofile  1048576
root  soft  nofile  1048576
root  hard  nofile  1048576
*     soft  memlock 1048576
*     hard  memlock 1048576

In the container config, have the following lines. Feel free to adjust the values to your needs.

# Raise limits on opened files
lxc.prlimit.nofile = 1048576

Inside the container, have the following lines in /etc/security/limits.conf. Feel free to adjust the values to your needs.

*     soft  nofile  1048576
*     hard  nofile  1048576
root  soft  nofile  1048576
root  hard  nofile  1048576
*     soft  memlock 1048576
*     hard  memlock 1048576

Reboot the host and the container(s) to apply the changes.
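After rebooting, it may help to confirm that the new values are in effect. The sketch below can be run on the host and inside the container; the function name show_limits is made up, and the sysctl values are printed only when the kernel exposes them under /proc/sys.

```shell
# show_limits: print the open-file limits of the current shell and, where
# readable, two of the sysctl values set above (shown as /proc/sys paths).
show_limits() {
    echo "soft nofile: $(ulimit -Sn)"
    echo "hard nofile: $(ulimit -Hn)"
    for key in fs/inotify/max_user_watches vm/max_map_count; do
        if [ -r "/proc/sys/${key}" ]; then
            echo "${key}: $(cat "/proc/sys/${key}")"
        fi
    done
}

show_limits
```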

Manual unprivileged container setup

Extended from this blog post.

Containers configured this way are unprivileged but owned by root. This avoids the problems that arise when unprivileged containers are owned by unprivileged users.

You can create the container before or after assigning subordinate IDs manually, but it must exist before you modify its configuration file. All commands in this section assume root privileges unless explicitly stated otherwise.

Select and map subordinate IDs

Subordinate IDs permit mapping a range of IDs to a user, allowing the container to run unprivileged without the typical downsides. To avoid conflicts, it's advised to reserve a relatively large gap between different unprivileged containers in multiples of 65536, the minimum number of subordinate IDs required for running unprivileged containers of any kind.

You'll be editing /etc/subuid for user IDs, and /etc/subgid for group IDs. Both files follow the same scheme: <username>:<startID>:<idCount>. For example, hornydeer:2097152:65536 maps IDs from 2097152 to 2162687 to user hornydeer, 65536 IDs in total. Comments are not allowed there.
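To double-check a planned entry, the range it covers and whether it collides with another range can be computed directly. This is a small sketch; the function names subid_range and ranges_overlap are made up for illustration.

```shell
# subid_range START COUNT
# Print the first and last ID covered by a subordinate ID entry.
subid_range() {
    start=$1 count=$2
    echo "${start}-$(( start + count - 1 ))"
}

# ranges_overlap START1 COUNT1 START2 COUNT2
# Succeed (exit 0) if the two ranges share at least one ID.
ranges_overlap() {
    [ "$1" -lt $(( $3 + $4 )) ] && [ "$3" -lt $(( $1 + $2 )) ]
}

subid_range 2097152 65536   # prints 2097152-2162687

if ranges_overlap 1148576 65536 2097152 65536; then
    echo "ranges overlap"
else
    echo "ranges are disjoint"
fi
```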

As an example, we're setting the start UID and start GID to 1148576 and allocating 65536 IDs for use by the container. If you intend to have an LXC container act as a container host, you may need to scale up the count of IDs. Write the following line to both /etc/subuid and /etc/subgid.

root:1148576:65536

If you're going to run unprivileged containers inside the target unprivileged LXC, below is an example reserving enough subordinate IDs for use.

root:1148576:262144

Apply mapped IDs in configuration

To apply the mapped IDs, head to /var/lib/lxc/<containerName> and modify the config file. Depending on the distro running inside the container, there may be separate user namespace profiles, so switch to those if you encounter problems.

# Remapped user and group IDs
lxc.include = /usr/share/lxc/config/userns.conf
lxc.idmap = u 0 1148576 65536
lxc.idmap = g 0 1148576 65536
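The two lxc.idmap lines follow mechanically from the subordinate ID entry. Below is a small sketch of that mapping, assuming the same range is used in both /etc/subuid and /etc/subgid; the function name idmap_lines is made up for illustration.

```shell
# idmap_lines SUBID_ENTRY
# Turn a "<username>:<startID>:<idCount>" entry into lxc.idmap lines
# that map container IDs starting at 0 onto the host range.
idmap_lines() {
    start=$(echo "$1" | cut -d: -f2)
    count=$(echo "$1" | cut -d: -f3)
    echo "lxc.idmap = u 0 ${start} ${count}"
    echo "lxc.idmap = g 0 ${start} ${count}"
}

idmap_lines root:1148576:65536
```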

If you've chosen to use the larger ID space for unprivileged containers above, below is the corresponding example.

# Remapped user and group IDs
lxc.include = /usr/share/lxc/config/userns.conf
lxc.idmap = u 0 1148576 262144
lxc.idmap = g 0 1148576 262144

Change owner of the container root

Before the LXC container can be started, the owner of its root folder must be set to the beginning subordinate ID, 1148576 in the case of the example. Run the following command.

chown -R 1148576:1148576 /var/lib/lxc/<containerName>/rootfs

Also ensure the container itself can access its own filesystem, for good measure.

chmod 755 /var/lib/lxc # Most distros already have this as default
chmod 755 /var/lib/lxc/<containerName>
chmod 755 /var/lib/lxc/<containerName>/rootfs
chmod 640 /var/lib/lxc/<containerName>/config
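To confirm the ownership change took effect, the owner can be read back with stat. This is a minimal sketch, assuming stat from GNU coreutils or BusyBox; the function name owner_matches is made up for illustration.

```shell
# owner_matches PATH UID GID
# Succeed if PATH is owned by UID:GID.
owner_matches() {
    [ "$(stat -c '%u:%g' "$1")" = "$2:$3" ]
}

# Example (1148576 is the start ID used in this section):
#   owner_matches /var/lib/lxc/<containerName>/rootfs 1148576 1148576 && echo "owner OK"
```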

nftables

The default config for nftables looks like this.

#!/usr/sbin/nft -f

flush ruleset

table inet filter {
	chain input {
		type filter hook input priority filter;
	}
	chain forward {
		type filter hook forward priority filter;
	}
	chain output {
		type filter hook output priority filter;
	}
}

  • It's possible to match multiple ports at the same time. Instead of specifying a single port number (e.g. 443), use curly braces: {443, 8443}. Ranges can also be specified: 1024-2047.
  • If a rule should only apply to traffic arriving on certain interfaces, prefix the rule with iif <interface>. This can be a single interface (e.g. iif "eth0") or multiple (e.g. iif {"eth0", "ens15"}).

Transparent service exposure

From nftables: forwarding without masquerading, Quick reference: nftables in 10 minutes.

Because the LXC host is the network gateway of all LXC containers, services can be exposed without masquerading, which lets services inside LXC slices see the real source IP addresses of incoming connections. Add the block below to begin specifying rules for service exposure.

If you want to expose services on both IPv4 and IPv6, rules will need to be duplicated. Note that a container must have the respective IP version available for its services to be exposed transparently. LXC 6.0.0 and newer have IPv6 addresses assigned automatically, while 4.0.0 and newer can have IPv6 manually configured. Only LXC 5.0.0 and newer support IPv6 connectivity behind NAT.

table ip nat {
	chain prerouting {
		type nat hook prerouting priority filter;
		# Insert new rules for IPv4 here
	}
}
table ip6 nat {
	chain prerouting {
		type nat hook prerouting priority filter;
		# Insert new rules for IPv6 here
	}
}

Let's say we want to expose 10.0.3.2:443 for anyone on the Internet to access on port 443.

tcp dport 443 dnat to 10.0.3.2

If the port numbers are not the same, the port will need to be overridden.

tcp dport 443 dnat to 10.0.3.2:8443

Or, to expose multiple ports without overriding the port:

tcp dport {443, 8443} dnat to 10.0.3.2
tcp dport 512-1023 dnat to 10.0.3.2

Or, to expose the service only on (a) certain interface(s):

iif "eth0" tcp dport 443 dnat to 10.0.3.2
iif {"eth0", "vlan0"} tcp dport 443 dnat to 10.0.3.2

An example of a rule with similar use under IPv6.

iif "he-ipv6" tcp dport {80, 443} dnat to [fc11:4514:1919:810::ff:fe00:2]
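To show where such rules end up, the sketch below wraps IPv4 rules read from stdin in the table ip nat block shown earlier. The function name render_nat_table is made up, and the two sample rules are taken from the examples above.

```shell
# render_nat_table: read IPv4 DNAT rules from stdin, one per line, and print
# them inside the "table ip nat" prerouting chain. Empty lines are skipped.
render_nat_table() {
    printf 'table ip nat {\n\tchain prerouting {\n'
    printf '\t\ttype nat hook prerouting priority filter;\n'
    while IFS= read -r rule; do
        if [ -n "$rule" ]; then
            printf '\t\t%s\n' "$rule"
        fi
    done
    printf '\t}\n}\n'
}

render_nat_table <<'EOF'
iif "eth0" tcp dport 443 dnat to 10.0.3.2
tcp dport 512-1023 dnat to 10.0.3.2
EOF
```

If you save the output to a file, nft -c -f <file> checks it for validity without applying anything.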

Reload the ruleset with the command below. Restarting lxc-net afterwards lets it re-create its own rules, so LXC slices still have connectivity via NAT after the flush.

nft -f /etc/nftables.conf; systemctl restart lxc-net

Network access restriction - IP-based

Inspired by How to restrict network access of LXC container.

Notice

nftables-based network access control is still under investigation. Problems are expected to arise.

If fine-grained access control like destination-matching (e.g. domain) is desired, use EEP with transparent proxy on the host instead.

Since the current nftables approach requires static IPs to be assigned first, and no way has been found to assign IPv6 addresses statically, IPv6 access might need to be disabled for the container.

The inet filter forward section is where network access of individual containers is filtered.

If whitelisted network access is desired, add a rule in the scheme shown below to the end of the section for that specific container.

iif "lxcbr0" ip saddr 10.0.3.2 drop;

Then add allowed access ranges before the final drop to grant access to specific addresses. If problems occur with transparent service exposure, they will need to be made exempt.

iif "lxcbr0" ip saddr 10.0.3.2 ip daddr 10.0.3.0-10.0.3.255 accept;

Or, if network access isn't whitelisted and access to certain ranges is to be blocked, add a rule in the scheme shown below.

iif "lxcbr0" ip saddr 10.0.3.2 ip daddr 10.0.3.2-10.0.3.255 drop;
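For the whitelist case, rule order matters: the accept rules must come before the final drop. The sketch below prints the rules for one container in that order; the function name whitelist_rules is made up, and 192.0.2.0/24 is a documentation-only range used as a placeholder.

```shell
# whitelist_rules CONTAINER_IP [ALLOWED_RANGE...]
# Print an accept rule for each allowed destination, followed by the final drop.
whitelist_rules() {
    src=$1
    shift
    for range in "$@"; do
        echo "iif \"lxcbr0\" ip saddr ${src} ip daddr ${range} accept;"
    done
    echo "iif \"lxcbr0\" ip saddr ${src} drop;"
}

whitelist_rules 10.0.3.2 10.0.3.0-10.0.3.255 192.0.2.0/24
```

The output is meant to be pasted at the end of the inet filter forward section.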

Podman

This section documents how to set up Podman on the distros supported by Gel. To get Podman functioning, FUSE and TUN support must be present.

If you're running Podman inside an (unprivileged) LXC container, make sure the steps listed below have all been applied to the host LXC container. All of them can be found above.

  • Assign a larger ID space
  • Enable FUSE
  • Enable nested containerization
  • Enable TUN
  • Raise limits on opened files
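A quick way to see whether the FUSE and TUN steps took effect is to look for the device nodes from inside the LXC container. The sketch below only checks that the paths exist, not that they are usable; the function name missing_devices and its optional ROOT argument are made up for illustration.

```shell
# missing_devices [ROOT]
# Print each device node Podman needs that is absent under ROOT (default: /).
missing_devices() {
    root=${1:-}
    for dev in /dev/fuse /dev/net/tun; do
        if [ ! -e "${root}${dev}" ]; then
            echo "$dev"
        fi
    done
}

# Example (inside the LXC container); no output means both nodes are present:
#   missing_devices
```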

Installation

Warning

  • Certain distros (e.g. Debian) may not have a functioning version of crun. Install crun from Nixpkgs when such errors are encountered.
  • A few distros like Photon do not have podman-compose bundled.
  • If you encounter warnings regarding / not being shared, fix temporarily with mount --make-rshared /. Read Alpine Wiki for further info.

  • Alpine: apk add podman podman-compose
  • Debian: apt install podman podman-compose
  • openSUSE: zypper in podman podman-compose
  • Rocky Linux/AlmaLinux: dnf install podman podman-compose
  • Photon: tdnf install podman

After installation, run a "Hello World" container to ensure everything works correctly.

podman run --rm hello-world

If problems occur, below is an example command for debugging.

podman run --security-opt="seccomp=unconfined" --log-level=debug --rm hello-world

Manual subordinate ID assignment

Note

Distros may already have this section configured automatically. Only follow this section when you encounter problems.

Explanations about subordinate IDs are available in previous sections. If Podman complains about IDs, below is an example for use inside unprivileged LXC containers, to be applied to both /etc/subuid and /etc/subgid.

<username>:65536:131072

Run podman system migrate whenever the assigned subordinate ID space changes.
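Before editing the files by hand, you can check whether the user already has an entry. This is a minimal sketch that matches entries by username only; the function name has_subid is made up for illustration.

```shell
# has_subid FILE USER
# Succeed if USER has at least one entry in FILE (/etc/subuid or /etc/subgid).
has_subid() {
    grep -q "^$2:" "$1"
}

# Example (run as the user that will start Podman):
#   for f in /etc/subuid /etc/subgid; do
#       has_subid "$f" "$(id -un)" || echo "no entry for $(id -un) in $f"
#   done
```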