Linux 3.8 introduced unprivileged user namespaces [pdf]

geofft · on Dec 18, 2019

The given title (currently "Linux 4.6 introduced unpriviledged user namespaces") isn't accurate - unprivileged user namespaces have been around since this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

This is included in kernel 3.8 released in 2013, and the linked PDF says that's when CLONE_NEWUSER itself was released - i.e., ever since user namespaces have existed, they've been allowed to unprivileged users.

Some downstream distributors, including Debian, started carrying a patch that added a restriction back in via the kernel.unprivileged_userns_clone sysctl. This is because allowing unprivileged user namespaces exposes a lot of attack surface to users. There's nothing fundamentally unsafe about it, but there are many kernel components for which bugs are only exploitable by unprivileged users if the feature is on. See https://lwn.net/Articles/673597/ for more discussion. (I thought a version of that patch eventually got accepted upstream, but I can't find it.)
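
For anyone wanting to check which of these restrictions applies on a given machine, here is a minimal Python sketch. It assumes the Debian-style patch exposes its knob at /proc/sys/kernel/unprivileged_userns_clone, and it also looks at user.max_user_namespaces, a separate upstream limit (as far as I know present since roughly kernel 4.9) that is a different mechanism from the distro patch. The function name and return strings are made up for illustration.

```python
from pathlib import Path


def userns_status(proc_sys: str = "/proc/sys") -> str:
    """Report whether unprivileged user namespaces look disabled by a sysctl.

    proc_sys is a parameter only so the function can be pointed at a
    fake directory tree when testing.
    """
    root = Path(proc_sys)

    # Distro patch (Debian and derivatives): "0" blocks unprivileged use.
    # The file does not exist on unpatched upstream kernels.
    distro_knob = root / "kernel" / "unprivileged_userns_clone"
    if distro_knob.exists() and distro_knob.read_text().strip() == "0":
        return "disabled by kernel.unprivileged_userns_clone"

    # Upstream limit: a maximum of 0 means no user namespaces can be created.
    upstream_knob = root / "user" / "max_user_namespaces"
    if upstream_knob.exists() and upstream_knob.read_text().strip() == "0":
        return "disabled by user.max_user_namespaces"

    return "not disabled by either sysctl"
```

Calling `userns_status()` with no arguments reads the live /proc/sys tree. A "not disabled" result only means neither sysctl blocks the feature; seccomp filters or LSM policy could still do so.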

The only reference to kernel 4.6 in the document is CLONE_NEWCGROUP, which is entirely unrelated to unprivileged user namespaces.

black_puppydog · on Dec 18, 2019

I've been bitten by this one over and over again in recent months. It's really unfortunate, and I see both sides' arguments here, but the effect is that on Debian machines, running Electron apps has become (even more of) a hassle, because Chromium's sandboxing mechanism relies on these user namespaces.

https://github.com/electron/electron/issues/17972

dang · on Dec 18, 2019

Ok, we've subtracted the pair (1, -0.2) from Linux in the title above.

Edit: corrected from 0.8

Edit edit: corrected from (1, 0.2)

toraobo · on Dec 18, 2019

Linux versions are not decimal, you subtracted 1.-2

dang · on Dec 18, 2019

Oh dear. Fixed. Thanks!

naniwaduni · on Dec 18, 2019

Subtracting (1, 0.2) would've left 3.4!

dang · on Dec 19, 2019

I need to stop doing this so hastily. Argh!

majke · on Dec 18, 2019

You are totally right, as confirmed by Mr Kerrisk here: https://twitter.com/mkerrisk/status/1207018543193063424

tony · on Dec 18, 2019

Will this make it easier to run docker without root / sudo?

That could fix the barrier of entry and make docker tutorials passed around more canonical. I would love it if docker "just worked" across machines when testing locally, commands and all.

Even sharing docker in open source projects, there's a learning curve for me where commands in the README won't work. Is it my docker installation? Version differences with docker? Docker compose? Did the container images I'm pulling in change some way?

Do I need sudo or not? I guess with proper group permissions I'm okay - but will the developer(s) I share instructions with have these permissions ready to go?

On StackOverflow: copy/pasting docker CLI (even given proper context fitting into a larger whole) commands and configs probably has a 50% success rate for commands, and maybe 10% if it's a config of some sort (e.g. compose files)

feanaro · on Dec 18, 2019

> That could fix the barrier of entry and make docker tutorials passed around more canonical. I would love it if docker "just worked" across machines when testing locally, commands and all.

For an almost drop-in daemonless, rootless docker replacement for this use case, see podman (https://podman.io/). You can `alias docker=podman` and it will just work.

You do need a bit of configuration: specifically, you need to create `/etc/subuid` and `/etc/subgid` if they don't exist and add subordinate UIDs/GIDs, which will be used to map the container's users and groups. E.g.

    usermod --add-subuids 100000-165536 $USER
    usermod --add-subgids 100000-165536 $USER
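
To see what those commands leave behind, here is a small Python sketch that parses the `owner:start:count` lines of /etc/subuid or /etc/subgid and counts the IDs in a usermod-style FIRST-LAST range. The names `SubIdRange`, `parse_subid` and `count_from_usermod_range` are hypothetical helpers for illustration, not part of any tool.

```python
from typing import NamedTuple


class SubIdRange(NamedTuple):
    owner: str  # login name (or numeric UID) the range is delegated to
    start: int  # first subordinate ID
    count: int  # number of consecutive IDs in the range

    @property
    def last(self) -> int:
        # Last ID covered by the range (inclusive).
        return self.start + self.count - 1


def parse_subid(text: str) -> list[SubIdRange]:
    """Parse the contents of /etc/subuid or /etc/subgid."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        owner, start, count = line.split(":")
        ranges.append(SubIdRange(owner, int(start), int(count)))
    return ranges


def count_from_usermod_range(spec: str) -> int:
    """Number of IDs in a FIRST-LAST range, with both ends inclusive."""
    first, last = (int(part) for part in spec.split("-"))
    return last - first + 1
```

Since the range is inclusive, `100000-165536` covers 65537 IDs; `100000-165535` would give the more conventional 65536. Either works, but the ranges of different users should not overlap.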

ghevshoo · on Dec 18, 2019

Maybe podman “just works” on Linux but it still has some work to be done before the same could be said for macOS.

E.g. https://github.com/containers/libpod/issues/4039#issuecommen...

jchw · on Dec 18, 2019

Ultimately Docker on macOS is just Docker on Linux in a VM over TCP (and it’s not open source.)

To be less nebulous: regarding the problem of being “rootless,” macOS support is neither here nor there. As far as I can tell it doesn’t really matter since the containers are already in a VM anyways.

aorth · on Dec 18, 2019

> You can `alias docker=podman` and it will just work.

Yes, and it's super awesome. Unless you need docker-compose!

thinkmassive · on Dec 18, 2019

Podman supports pods, hence the name. It follows the k8s approach, and it’s nowhere close to a drop-in replacement for docker-compose, but managing multiple related containers is supported as a core feature.

https://developers.redhat.com/blog/2019/01/15/podman-managin...

thenewnewguy · on Dec 18, 2019

> Unless you need docker-compose!

There's a solution for that :) (Note: still somewhat beta quality)

https://github.com/containers/podman-compose

throwaway8941 · on Dec 18, 2019

> Will this make it easier to run docker without root / sudo?

No, because 4.6 is pretty old at this point and this feature has been around for ages.

I've been using podman instead of docker more and more recently, it's pretty great. It doesn't require root access to the machine (that's how I use it), although there are some limitations.

https://github.com/containers/libpod/blob/master/rootless.md

technofiend · on Dec 18, 2019

Kudos to Red Hat for backporting the kernel 4 and package changes needed to RHEL 7, which is still on a version 3 Linux kernel. Rootless containers with podman were a technology preview in RHEL 7.6 and are now supported in RHEL 7.7 [1].

There are some caveats - specifically, copy-on-write filesystems aren't yet supported, but Red Hat says it's on the RHEL 7 roadmap. And you need a fresh install of 7.7 to get the right behavior out of the box; otherwise there are manual steps required for upgraded systems. Specifically, Stack Overflow examples should work if there is a user mapping to unprivileged IDs and the user aliases or links docker to podman.

[1] https://www.redhat.com/en/blog/three-new-container-capabilit...

geofft · on Dec 18, 2019

Yes, see https://docs.docker.com/engine/security/rootless/ . We've been rolling it out at $work on an experimental basis (so people can `docker run` things on their development machines, where they're not allowed to have actual root access or root-equivalent access) and it's working pretty well.

It needs the setuid (or setcap) helpers `newuidmap` and `newgidmap`, plus setup in /etc/subuid and /etc/subgid, to allocate some UIDs on the host system for unprivileged users to use in the guest container. This is required so that you can have multiple users inside your container. There is a way to use user namespaces entirely unprivileged, but you only get one UID and one GID inside your namespace, because UIDs and GIDs need to have a one-to-one map across user namespaces. You can pick whether your UID maps to root inside the namespace, but then you can only use root; you can't drop privileges to anything else.
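
The one-to-one mapping can be made concrete with a short Python sketch. It assumes the `/proc/<pid>/uid_map` format of three numbers per line (first ID inside the namespace, first ID outside, length of the range); `parse_id_map` and `to_outside` are hypothetical helper names, and the UID 1000 below is just an example value.

```python
from typing import Optional


def parse_id_map(text: str) -> list[tuple[int, int, int]]:
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def to_outside(inside_id: int, id_map: list[tuple[int, int, int]]) -> Optional[int]:
    """Translate an ID inside the namespace to the ID seen by the parent namespace.

    Returns None when the ID has no mapping, which is why a single-line
    map leaves nothing to drop privileges to.
    """
    for inside_start, outside_start, length in id_map:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None


# Entirely unprivileged: one line, one ID (host UID 1000 appears as root inside).
single = parse_id_map("0 1000 1\n")

# With newuidmap and an /etc/subuid range: root plus 65536 further IDs.
multi = parse_id_map("0 1000 1\n1 100000 65536\n")
```

With `single`, `to_outside(0, single)` is 1000 and every other ID is unmapped. With `multi`, UID 1 inside becomes 100000 on the host, so processes inside the container can run as distinct users.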

On most Linux distros /etc/subuid and /etc/subgid get automatically created these days when you create a local user account. If you have LDAP users (which we do), you need to fill them in manually somehow.

I've also been using entirely-unprivileged user namespaces at work to sandbox builds and tests so that running them on the build/test farm works just like on your local machine, and you have fewer "it works on my machine" problems. In this case I only need one user inside the namespace, and it doesn't need to be root (you can set up mounts, networking, etc. before you drop privileges).
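
A rough Python sketch of the entirely-unprivileged case, leaving out the mount and network setup. It assumes Python 3.12 or later on Linux (for `os.unshare` and `os.CLONE_NEWUSER`) and a single-threaded process; `map_writes` and `enter_user_namespace` are hypothetical names, and this is not the code we actually run.

```python
import os


def map_writes(inside_uid: int, inside_gid: int,
               outside_uid: int, outside_gid: int) -> list[tuple[str, str]]:
    """Return the (path, content) writes that set up a single-ID user namespace.

    Order matters: setgroups has to be denied before an unprivileged
    process is allowed to write gid_map (a rule added in Linux 3.19).
    """
    return [
        ("/proc/self/setgroups", "deny"),
        ("/proc/self/gid_map", f"{inside_gid} {outside_gid} 1"),
        ("/proc/self/uid_map", f"{inside_uid} {outside_uid} 1"),
    ]


def enter_user_namespace(inside_uid: int = 0, inside_gid: int = 0) -> None:
    """Move the calling process into a new user namespace with one UID and one GID."""
    # Capture the real IDs first; until the maps are written, getuid() and
    # getgid() inside the new namespace return the overflow ID instead.
    outside_uid, outside_gid = os.getuid(), os.getgid()
    os.unshare(os.CLONE_NEWUSER)
    for path, content in map_writes(inside_uid, inside_gid, outside_uid, outside_gid):
        with open(path, "w") as f:
            f.write(content)
```

Passing a non-zero `inside_uid` gives the "doesn't need to be root" variant: the process keeps its ordinary UID number inside the namespace.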

[See also my comment about how 4.6 isn't relevant here, we have some machines at work that are on 4.1 and the functionality works fine.]

the8472 · on Dec 18, 2019

> On StackOverflow: copy/pasting docker CLI (even given proper context fitting into a larger whole) commands and configs probably has a 50% success rate for commands, and maybe 10% if it's a config of some sort (e.g. compose files)

This is not particularly surprising. Docker does two things. One is a filesystem package manager, the other is providing a common interface around dozens of low-level features. The filesystem packaging stuff works well enough. It's the other part that is inherently complex because it does so many things at once (networking, mounts, resource management, security, process management, ...). When something doesn't work you'll end up with old-school linux sysadmin work except that you now don't deal with a single network interface and a handful of iptables rules but dozens scattered across namespaces and complex mount trees.

There's no magic, just convenience as long as you stay on the happy path.

sjy · on Dec 18, 2019

I’ve recently discovered systemd-nspawn [1] and machined. It’s not a drop-in replacement for Docker, but I much prefer it for my purposes because you don’t need to install anything on a systemd-based Linux distribution and it doesn’t work so hard to conceal the fact that all the hard work is done by the kernel.

[1] https://www.freedesktop.org/software/systemd/man/systemd-nsp...

amelius · on Dec 18, 2019

Perhaps one day we could even run things without Docker™ ...

mfontani · on Dec 18, 2019

If you want to run docker without sudo, you can use podman. Root inside a podman container is your own user id on the host machine.

AdrienLemaire · on Dec 19, 2019

From the document, slide 51:

> User NSs permit novel applications; for example:
> Running Linux containers without root privileges
> Docker, LXC

This seems to answer yes to your question.

globular-toast · on Dec 19, 2019

Yes, but you have to use podman instead of docker. So far I've found it to work very well as a drop-in replacement.

Giornito · on Dec 18, 2019

I wonder if it will make it easier at the cost of security.

geofft · on Dec 18, 2019

It's a tradeoff. It's certainly better to give untrusted users access to unprivileged user namespaces than to give them access to /var/run/docker.sock, which straightforwardly gives them full root.
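
A quick way to audit that exposure is to check whether a given account can write to the socket, since connecting to a Unix socket needs write permission on it. This is a minimal Python sketch; the function name and return strings are made up, and the "root" consequence assumes the traditional setup where the daemon itself runs as root.

```python
import os
import stat


def docker_socket_exposure(path: str = "/var/run/docker.sock") -> str:
    """Describe whether the current user could talk to the Docker daemon socket."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return "no socket at this path"
    if not stat.S_ISSOCK(info.st_mode):
        return "path exists but is not a socket"
    # Write permission on the socket is what allows connecting to it.
    if os.access(path, os.W_OK):
        return "writable: this user can send commands to the daemon"
    return "not writable by this user"
```

Run it as the account you are worried about; membership in the `docker` group is the usual reason the socket is writable.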

Another project in this space is bubblewrap https://github.com/containers/bubblewrap , which can run either with unprivileged user namespaces or by being setuid. It's intended to create an environment for container runtimes to use so that the container software itself doesn't need to be privileged, and the idea is that bubblewrap itself uses the privileged interfaces to set up the environment and then drops privileges before running user-provided code, so it shouldn't introduce more risk.

navaati · on Dec 18, 2019

And notably Bubblewrap is the technology behind Flatpak’s sandboxing!

HorstG · on Dec 19, 2019

Yes. Namespace support has been a great source of CVEs, and disabling all kinds of unneeded namespace functionalities is one of the first steps when hardening a Linux kernel.

hiasen · on Dec 18, 2019

I would recommend watching the same (or similar) talks about namespaces in Linux given by Michael Kerrisk at NDC TechTown (Kongsberg) in September 2019.

I learnt a lot from these talks.

https://www.youtube.com/watch?v=0kJPa-1FuoI

https://www.youtube.com/watch?v=73nB9-HYbAI

rwmj · on Dec 18, 2019

Has anyone tried fuzzing random sequences of unshare(2), clone(2), setuid(2), capabilities, uid/gid map calls (etc) to see if there is a sequence that eventually gains real root or some other privilege escalation? I'm dubious that Linux is theoretically sound, what with the multiple layers of historical baggage.

pcwalton · on Dec 18, 2019

Syzkaller fuzzes that stuff. https://github.com/google/syzkaller

rwmj · on Dec 18, 2019

It fuzzes individual syscalls. Does it string them together into sequences? (Edit: yes it does, but it doesn't look for priv escalations, only crashes, kernel panics and the like)

simcop2387 · on Dec 18, 2019

It can be adapted to privilege escalations, but it doesn't do it on its own. You have to give it a stub to run after the syscalls to check whether the permissions are intact.
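
To illustrate the kind of check such a stub would make, here is a Python sketch that compares credentials before and after a syscall sequence. This is not syzkaller's API or stub format; `parse_cap_eff`, `snapshot` and `escalated` are hypothetical names, and the only assumption about the kernel is the `CapEff:` line in /proc/self/status.

```python
import os


def parse_cap_eff(status_text: str) -> int:
    """Extract the effective capability bitmask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def snapshot() -> dict:
    """Record the credentials a fuzzed syscall sequence should not be able to raise."""
    with open("/proc/self/status") as f:
        cap_eff = parse_cap_eff(f.read())
    return {"resuid": os.getresuid(), "resgid": os.getresgid(), "cap_eff": cap_eff}


def escalated(before: dict, after: dict) -> bool:
    """True if any UID/GID became 0 that was not 0 before, or capability bits were gained."""
    gained_root = any(
        b != 0 and a == 0
        for key in ("resuid", "resgid")
        for b, a in zip(before[key], after[key])
    )
    gained_caps = (after["cap_eff"] & ~before["cap_eff"]) != 0
    return gained_root or gained_caps
```

One caveat: a process that enters a new user namespace legitimately gains capabilities there and may be mapped to UID 0 inside it, so this naive comparison would flag that too. A real check would have to judge credentials relative to the initial namespace.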

sitkack · on Dec 18, 2019

Does it trace existing programs and use those as seeds in how to compose syscalls?

simcop2387 · on Dec 19, 2019

I don't think it's got the ability on its own to do that, but I can imagine you could do it with strace and some other scripting.

londons_explore · on Dec 18, 2019

When you have root, it doesn't take many more random syscalls to cause a panic.

chipb · on Dec 18, 2019

Or just try a not-so-random one:

   kill(1, 9);

chipb · on Dec 19, 2019

Huh. I guess it’s not too surprising, but that doesn’t work at all. init gets special treatment for signals.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

_pmf_ · on Dec 18, 2019

If Michael Kerrisk is reading this: would be great to have a small supplementary book to the great "The Linux Programming Interface" that handles namespaces, capabilities et al.

Huggernaut · on Dec 18, 2019

I had the absolute pleasure of attending a week of all-day sessions with Michael Kerrisk, and on 2 or 3 of those days he covered only container-related primitives. His materials were great and he was an excellent teacher.

nezirus · on Dec 18, 2019

+1 Pricing of the book should be easy, worth its weight in gold.

workOrNah · on Dec 19, 2019

He's actually teaching a class that I'm in now and I just asked him! He says writing a book like that is in his plans when he has some downtime.

tyingq · on Dec 18, 2019

Typo in title, should be "unprivileged". Not nitpicking, just should be searchable later.

loudmax · on Dec 18, 2019

Also the bane of many a DBA trying to update permissions.

AdrienLemaire · on Dec 19, 2019

I have never used OpenBSD, but isn't Linux getting closer to their equivalent pledge and unveil? https://news.ycombinator.com/item?id=17277067

This is great news, because I can't switch over to OpenBSD (docker, bluetooth, etc) or more folkloric distributions like VoidOS and Qubes. Going to make a bunch of Anki cards today to remember these namespaces and how to use them!

zorked · on Dec 18, 2019

The author's book, The Linux Programming Interface, is an amazing book and a new classic.

sargun · on Dec 18, 2019

user namespaces are a super rad feature. They've protected us at $WORK from multiple vulnerabilities that have come out.

sjy · on Dec 18, 2019

Could you provide a little more detail? I’ve recently disabled user namespacing on an app I’ve deployed because I couldn’t figure out how to create an encrypted volume from within the container [1]. I think I understand the security implications and I’m comfortable with them, but I’d be really interested to know how user namespacing has saved people from real vulnerabilities in the real world.

[1] https://unix.stackexchange.com/questions/557293/how-can-i-ma...

londons_explore · on Dec 18, 2019

How do you use them? Are we talking about developer desktop PCs here? Or user namespaces as part of a containerization setup?

The main use of user namespaces seems to be running stuff that wants to be root as non-root. It would seem better to simply fix all those tools to not check if they are root, and instead just try to do the thing they were trying to do.

woadwarrior01 · on Dec 19, 2019

For developer desktops, there's firejail[1].

[1]: https://github.com/netblue30/firejail

AdrienLemaire · on Dec 19, 2019

Lovely! Thanks for sharing this awesome-looking tool :)

edit: oh, it's also mentioned in the document slide 53 along with Flatpak