This is included in kernel 3.8 released in 2013, and the linked PDF says that's when CLONE_NEWUSER itself was released - i.e., ever since user namespaces have existed, they've been allowed to unprivileged users.
Some downstream distributors, including Debian, started carrying a patch that added a restriction back in via the kernel.unprivileged_userns_clone sysctl. This is because allowing unprivileged user namespaces exposes a lot of attack surface to users. There's nothing fundamentally unsafe about it, but there are many kernel components for which bugs are only exploitable by unprivileged users if the feature is on. See https://lwn.net/Articles/673597/ for more discussion. (I thought a version of that patch eventually got accepted upstream, but I can't find it.)
The only reference to kernel 4.6 in the document is CLONE_NEWCGROUP, which is entirely unrelated to unprivileged user namespaces.
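To see which way a given machine is configured, the sysctl mentioned above can be read from procfs. A minimal sketch, assuming the Debian-style patch exposes it at `/proc/sys/kernel/unprivileged_userns_clone` (1 = allowed, 0 = restricted); the function name is made up for illustration, and on a kernel without the patch the file simply won't exist:

```python
from pathlib import Path

# Path used by the downstream (e.g. Debian) patch; absent on unpatched kernels.
SYSCTL_PATH = "/proc/sys/kernel/unprivileged_userns_clone"


def unprivileged_userns_setting(path=SYSCTL_PATH):
    """Return the sysctl value as an int, or None if the file does not exist."""
    p = Path(path)
    if not p.exists():
        return None
    return int(p.read_text().strip())


if __name__ == "__main__":
    value = unprivileged_userns_setting()
    if value is None:
        print("sysctl not present (kernel does not carry the patch)")
    else:
        print("unprivileged user namespaces:", "allowed" if value else "restricted")
```

A missing file only tells you the patch isn't there; it doesn't by itself prove that unprivileged user namespaces work on that machine.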
I've been bitten by this one over and over again in the last few months. It's really unfortunate, and I see both sides' arguments here, but the effect is that on Debian machines, running Electron apps has become (even more of) a hassle, because Chromium's sandboxing mechanism relies on these user namespaces.
Will this make it easier to run docker without root / sudo?
That could fix the barrier of entry and make docker tutorials passed around more canonical. I would love it if docker "just worked" across machines when testing locally, commands and all.
Even sharing docker in open source projects, there's a learning curve for me where commands in the README won't work. Is it my docker installation? Version differences with docker? Docker compose? Did the container images I'm pulling in change some way?
Do I need sudo or not? I guess with proper group permissions I'm okay - but will the developer(s) I share instructions with have these permissions ready to go?
On StackOverflow: copy/pasting docker CLI (even given proper context fitting into a larger whole) commands and configs probably has a 50% success rate for commands, and maybe 10% if it's a config of some sort (e.g. compose files)
> That could fix the barrier of entry and make docker tutorials passed around more canonical. I would love it if docker "just worked" across machines when testing locally, commands and all.
For an almost drop-in daemonless, rootless docker replacement for this use case, see podman (https://podman.io/). You can `alias docker=podman` and it will just work.
You do need a bit of configuration: specifically, you need to create `/etc/subuid` and `/etc/subgid` if they don't exist and add subordinate UIDs/GIDs, which will be used to map the container's users and groups.
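As an illustration of what goes in those files: each line has the form `name:start:count`, giving that user a contiguous range of host IDs to map into containers. A rough sketch that parses such lines; the user names and ranges below are made up, not taken from any real system:

```python
def parse_subid_lines(text):
    """Parse /etc/subuid- or /etc/subgid-style text into (name, start, count) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, start, count = line.split(":")
        entries.append((name, int(start), int(count)))
    return entries


# Hypothetical contents: two users, each given 65536 subordinate IDs.
EXAMPLE = "alice:100000:65536\nbob:165536:65536\n"

if __name__ == "__main__":
    for name, start, count in parse_subid_lines(EXAMPLE):
        print(f"{name}: host IDs {start} to {start + count - 1}")
```

The ranges for different users shouldn't overlap, otherwise files created in one user's containers end up owned by IDs that another user's containers can also use.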
Ultimately Docker on macOS is just Docker on Linux in a VM over TCP (and it’s not open source.)
To be less nebulous: regarding the problem of being “rootless,” macOS support is neither here nor there. As far as I can tell it doesn’t really matter since the containers are already in a VM anyways.
Podman supports pods, hence the name. It follows the k8s approach, and it’s nowhere close to a drop-in replacement for docker-compose, but managing multiple related containers is supported as a core feature.
> Will this make it easier to run docker without root / sudo?
No, because 4.6 is pretty old at this point and this feature has been around for ages.
I've been using podman instead of docker more and more recently, it's pretty great. It doesn't require root access to the machine (that's how I use it), although there are some limitations.
Kudos to Red Hat for backporting the kernel 4 and package changes needed to RHEL 7, which is still on a version 3 Linux kernel. Rootless containers with podman were a technology preview in RHEL 7.6 and are now supported in RHEL 7.7 [1].
There are some caveats - specifically, copy-on-write filesystems aren't yet supported, but Red Hat says it's on the RHEL 7 roadmap. And you need a fresh install of 7.7 to get the right behavior out of the box; otherwise there are manual steps required for upgraded systems. Specifically, stackoverflow examples should work if there is a user mapping to unprivileged IDs and the user aliases or links docker to podman.
Yes, see https://docs.docker.com/engine/security/rootless/ . We've been rolling it out at $work on an experimental basis (so people can `docker run` things on their development machines, where they're not allowed to have actual root access or root-equivalent access) and it's working pretty well.
It needs the setuid (or setcap) helpers `newuidmap` and `newgidmap`, plus setup in /etc/subuid and /etc/subgid, to allocate some UIDs on the host system for unprivileged users to use on the guest container. This is required so that you can have multiple users inside your container. There is a way to use user namespaces entirely unprivileged, but you only get one UID and one GID inside your namespace, because UIDs and GIDs need to have a one-to-one map across user namespaces. You can pick whether your UID maps to root inside the namespace, but then you can only use root, you can't drop privileges to anything else.
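The entirely unprivileged, single-ID case can be sketched roughly as follows. This assumes Linux with unprivileged user namespaces enabled; the function names are made up for illustration, and `unshare` is called through `ctypes` rather than any higher-level wrapper. A forked child enters a new user namespace, writes a one-line `uid_map` and `gid_map` for itself, and reports the UID it then sees:

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # value from <linux/sched.h>


def single_id_map(inside_id, outside_id):
    """Format a one-line ID map: exactly one ID inside maps to one ID outside."""
    return f"{inside_id} {outside_id} 1\n"


def uid_seen_in_new_userns(inside_uid=0):
    """Fork, enter a new user namespace in the child, map our own UID/GID to
    inside_uid, and return the UID the child then sees (None on any failure)."""
    outer_uid, outer_gid = os.getuid(), os.getgid()
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        try:
            os.close(read_fd)
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER) != 0:
                os._exit(1)
            # Without privilege, gid_map is only writable once setgroups is denied.
            with open("/proc/self/setgroups", "w") as f:
                f.write("deny")
            with open("/proc/self/uid_map", "w") as f:
                f.write(single_id_map(inside_uid, outer_uid))
            with open("/proc/self/gid_map", "w") as f:
                f.write(single_id_map(inside_uid, outer_gid))
            os.write(write_fd, str(os.getuid()).encode())
            os._exit(0)
        except BaseException:
            os._exit(1)
    # Parent: collect whatever the child reported, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    _, status = os.waitpid(pid, 0)
    if status != 0 or not data:
        return None
    return int(data)


if __name__ == "__main__":
    print("UID inside the namespace:", uid_seen_in_new_userns(0))
```

Passing a different `inside_uid` picks a non-root identity instead, but either way there is only that one UID and GID to work with. On systems where the feature is restricted (or inside a sandbox that blocks `unshare`), the function just returns `None`.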
On most Linux distros /etc/subuid and /etc/subgid get automatically created these days when you create a local user account. If you have LDAP users (which we do), you need to fill them in manually somehow.
I've also been using entirely-unprivileged user namespaces at work to sandbox builds and tests so that running them on the build/test farm works just like on your local machine, and you have fewer "it works on my machine" problems. In this case I only need one user inside the namespace, and it doesn't need to be root (you can set up mounts, networking, etc. before you drop privileges).
[See also my comment about how 4.6 isn't relevant here, we have some machines at work that are on 4.1 and the functionality works fine.]
> On StackOverflow: copy/pasting docker CLI (even given proper context fitting into a larger whole) commands and configs probably has a 50% success rate for commands, and maybe 10% if it's a config of some sort (e.g. compose files)
This is not particularly surprising. Docker does two things. One is a filesystem package manager, the other is providing a common interface around dozens of low-level features. The filesystem packaging stuff works well enough. It's the other part that is inherently complex because it does so many things at once (networking, mounts, resource management, security, process management, ...). When something doesn't work you'll end up with old-school linux sysadmin work except that you now don't deal with a single network interface and a handful of iptables rules but dozens scattered across namespaces and complex mount trees.
There's no magic, just convenience as long as you stay on the happy path.
I’ve recently discovered systemd-nspawn and machined. It’s not a drop-in replacement for Docker, but I much prefer it for my purposes because you don’t need to install anything on a systemd-based Linux distribution and it doesn’t work so hard to conceal the fact that all the hard work is done by the kernel.
It's a tradeoff. It's certainly better to give untrusted users access to unprivileged user namespaces than to give them access to /var/run/docker.sock, which straightforwardly gives them full root.
Another project in this space is bubblewrap https://github.com/containers/bubblewrap , which can run either with unprivileged user namespaces or by being setuid. It's intended to create an environment for container runtimes to use so that the container software itself doesn't need to be privileged, and the idea is that bubblewrap itself uses the privileged interfaces to set up the environment and then drops privileges before running user-provided code, so it shouldn't introduce more risk.
Yes. Namespace support has been a great source of CVEs, and disabling all kinds of unneeded namespace functionalities is one of the first steps when hardening a Linux kernel.
Has anyone tried fuzzing random sequences of unshare(2), clone(2), setuid(2), capabilities, uid/gid map calls (etc) to see if there is a sequence that eventually gains real root or some other privilege escalation? I'm dubious that Linux is theoretically sound, what with the multiple layers of historical baggage.
It fuzzes individual syscalls. Does it string them together into sequences? (Edit: yes it does, but it doesn't look for priv escalations, only crashes, kernel panics and the like)
It can be adapted to look for privilege escalations, but it doesn't do that on its own. You have to give it a stub that runs after the syscalls and checks whether the permissions are still intact.
If Michael Kerrisk is reading this: would be great to have a small supplementary book to the great "The Linux Programming Interface" that handles namespaces, capabilities et al.
I had the absolute pleasure of attending a week of all-day sessions with Michael Kerrisk, and on 2 or 3 of those days he covered only container-related primitives. His materials were great and he was an excellent teacher.
This is great news, because I can't switch over to OpenBSD (docker, bluetooth, etc) or more folkloric distributions like VoidOS and Qubes. Going to make a bunch of Anki cards today to remember these namespaces and how to use them!
Could you provide a little more detail? I’ve recently disabled user namespacing on an app I’ve deployed because I couldn’t figure out how to create an encrypted volume from within the container [1]. I think I understand the security implications and I’m comfortable with them, but I’d be really interested to know how user namespacing has saved people from real vulnerabilities in the real world.
How do you use them? Are we talking about developer desktop PCs here? Or user namespaces as part of a containerization setup?
The main use of user namespaces seems to be running stuff that wants to be root as non-root. It would seem better to simply fix all those tools to not check if they are root, and instead just try to do the thing they were trying to do.