Docker in depth: namespaces, cgroups, images from scratch
Author: Carmelo C.
Published: Jun 13, 2023 (updated)docker
arm64
aarch64
containers
Images and containers
Let’s take a closer look at what’s in an image.
NOTE: I’m filtering out lots of info by using Docker’s –format command.
user@laptop $ docker inspect --format="{{json .RootFS.Layers}}" carmelo0x63/goweb:latest
["sha256:5216338b40a7b96416b8b9858974bbe4acc3096ee60acbc4dfb1ee02aecceb10",
"sha256:a0b2ea330c61ec1ec3d25024a8ddaa6121e995e2e3dc2473c48bfdeb7adfab69",
"sha256:4b7b5c980fbe0abe030c29236a05764ea3c32f898d56495b2bc146d6b82a2c3d"]
One can see that, for instance, the image above is made up of three different layers.
Likewise, docker history
can show how the image had been built:
user@laptop $ docker history carmelo0x63/goweb:latest
IMAGE CREATED CREATED BY SIZE COMMENT
ccf5bf7f5979 7 days ago /bin/sh -c #(nop) CMD ["./main.go"] 0B
<missing> 7 days ago /bin/sh -c #(nop) COPY file:fe2451faf4c4dbce… 7.47MB
<missing> 7 days ago /bin/sh -c #(nop) WORKDIR /app 0B
<missing> 7 days ago /bin/sh -c mkdir /app 0B
<missing> 2 months ago /bin/sh -c #(nop) CMD ["/bin/sh"] 0B
<missing> 2 months ago /bin/sh -c #(nop) ADD file:e69d441d729412d24… 5.59MB
A container runs off the image but is an entirely different object.
user@laptop $ docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
66df9337ad51 carmelo0x63/goweb:latest "./main.go" 14 minutes ago Up 14 minutes beautiful_williamson
In reality, the application is obviously running on our host, we can see how it’s identified by its PID:
user@laptop $ ps -ef | grep main.go | grep -v grep
root 3993 3970 0 17:38 ? 00:00:00 ./main.go
user@laptop $ docker inspect 66df9337ad51 | grep -i pid
"Pid": 3993,
"PidMode": "",
"PidsLimit": null,
We can also connect to the running container. From within, it looks as if we’re in a totally separate environment:
user@laptop $ docker exec -it 66df sh
/app # hostname
66df9337ad51
/app # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever
/app # ls
main.go
/app # ls /
app bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
/app # exit
Namespaces
Docker uses a technology called namespaces to provide the container with a layer of isolation. A few examples:
- pid namespace: Process isolation (PID: Process ID).
- net namespace: Managing network interfaces (NET: Networking).
- ipc namespace: Managing access to IPC resources (IPC: InterProcess Communication).
- mnt namespace: Managing filesystem mount points (MNT: Mount).
- uts namespace: Isolating kernel and version identifiers. (UTS: Unix Timesharing System).
Let’s analyze namespaces by, for instance, inheriting the hostname:
user@laptop $ sudo nsenter --target 3993 --uts
root@66df9337ad51:~# hostname
66df9337ad51
root@66df9337ad51:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp14s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
link/ether 30:f9:ed:fe:b1:28 brd ff:ff:ff:ff:ff:ff
3: wlp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DORMANT group default qlen 1000
link/ether 08:ed:b9:ce:e2:cf brd ff:ff:ff:ff:ff:ff
4: docker_gwbridge: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:51:dc:ce:e9 brd ff:ff:ff:ff:ff:ff
5: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:7c:5c:a8:3e brd ff:ff:ff:ff:ff:ff
7: veth30cdacf@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
link/ether 52:6a:51:42:8f:b8 brd ff:ff:ff:ff:ff:ff link-netnsid 0
9: veth52187b1@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
link/ether 82:b7:dc:96:1b:58 brd ff:ff:ff:ff:ff:ff link-netnsid 1
root@66df9337ad51:~# exit
Let’s now try something different and borrow the container’s network settings:
user@laptop $ sudo nsenter --target 3993 --net
root@laptop:~# hostname
laptop
root@laptop:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
root@laptop:~# exit
Cgroups
Docker Engine on Linux also relies on another technology called control groups (or cgroups). A cgroup limits an application to a specific set of resources. Control groups allow Docker Engine to share available hardware resources to containers and optionally enforce limits and constraints.
Let’s see an example of how memory can be limited for a container.
user@laptop $ cat /proc/3993/cgroup
12:net_cls,net_prio:/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86
11:memory:/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86
10:pids:/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86
9:devices:/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86
8:cpuset:/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86
7:blkio:/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86
6:hugetlb:/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86
5:rdma:/
4:perf_event:/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86
3:freezer:/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86
2:cpu,cpuacct:/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86
1:name=systemd:/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86
0::/system.slice/containerd.service
user@laptop $ cat /sys/fs/cgroup/memory/docker/66df9337ad51dd25ed8befe778bfe19698df8636a3fbfb45c4257899d93d9a86/memory.limit_in_bytes
9223372036854771712
user@laptop $ docker run -d --memory 4m --name test4m carmelo0x63/goweb:latest
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
051f309ee14220936dd6a746341cde687c94bca45108a9413ba7fcc1ee323520
user@laptop $ cat /sys/fs/cgroup/memory/docker/051f309ee14220936dd6a746341cde687c94bca45108a9413ba7fcc1ee323520/memory.limit_in_bytes
4194304
Images from scratch
So far we’ve always started from pre-built images whose contents was unknown until we’d run them. There’s an alternative way to building custom images and that’s starting from scratch.
This approach entails downloading the filesystem first, then adding our customizations on top of it. Alpine is a very convenient distribution since it’s mini root filesystem is very limited in size:
user@laptop $ curl -O http://dl-cdn.alpinelinux.org/alpine/v3.11/releases/x86_64/alpine-minirootfs-3.11.5-x86_64.tar.gz
user@laptop $ ls -lh
total 2.7M
-rw-rw-r-- 1 user user 2.6M Mar 25 11:10 alpine-minirootfs-3.11.5-x86_64.tar.gz
Let’s build an image based on it and add our own customization. We’ll need the following files:
Dockerfile
:
FROM scratch
ADD alpine-minirootfs-3.11.5-x86_64.tar.gz /
COPY os-release /etc/
CMD ["/bin/sh"]
os-release
:
NAME="My cool-and-small image"
ID=alpine
VERSION_ID=1.0
PRETTY_NAME="Based on Alpine Linux v3.11"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"
We can start the build process as follows now:
user@laptop docker build -t carmelo0x63/scratch:1.0 .
Sending build context to Docker daemon 2.728MB
Step 1/4 : FROM scratch
--->
Step 2/4 : ADD alpine-minirootfs-3.11.5-x86_64.tar.gz /
---> 5beb49a29512
Step 3/4 : COPY os-release /etc/
---> 034d3918f9ac
Step 4/4 : CMD ["/bin/sh"]
---> Running in a494bc027971
Removing intermediate container a494bc027971
---> 786e178aee95
Successfully built 786e178aee95
Successfully tagged carmelo0x63/scratch:1.0
One thing to notice is how small our image is:
user@laptop docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
carmelo0x63/scratch 1.0 786e178aee95 14 seconds ago 5.6MB
...
Next and last step is to verify that the image is behaving as intended:
user@laptop docker run -it carmelo0x63/scratch:1.0 sh
/ # cat /etc/os-release
NAME="My cool-and-small image"
ID=alpine
VERSION_ID=1.0
PRETTY_NAME="Based on Alpine Linux v3.11"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"