PyPI: Python packets steal AWS keys from users (sonatype.com)
168 points by modinfo on June 26, 2022 | hide | past | favorite | 100 comments


I've been building tooling to mitigate supply chain attacks like these. Packj [1] analyzes Python/NPM packages for risky code and attributes such as Network/File permissions, expired email domains, etc. Auditing hundreds of direct/transitive dependencies manually is impractical, but Packj can quickly point out access to sensitive files (e.g., SSH keys), spawning shell, data exfiltration, etc. We found a bunch of malicious packages on PyPI using the tool, which have now been taken down; a few are listed here https://packj.dev/malware

1. https://github.com/ossillate-inc/packj


Does it work on obfuscated calls? For example, a base64-encoded string that gets decoded and then passed to a shell.

Although, I guess in that case, the shell call itself is suspicious.

What if the shell call itself is obfuscated?


> a base64-encoded string that gets decoded and then passed to a shell.

This is a very common malicious behavior. Packj detects obfuscation [1] as well as spawning of shell commands (exec system call) [2]. I've updated threats.csv to flag code obfuscation.

1. https://github.com/ossillate-inc/packj/blob/main/main.py#L48... 2. https://github.com/ossillate-inc/packj/blob/main/main.py#L48...
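For illustration, here is a benign stand-in for the pattern being discussed (my sketch, not Packj's code): the command string never appears literally in the source, but the b64decode-then-shell combination is exactly the signal a scanner can key on. The payload here just echoes a greeting.

```python
import base64
import subprocess

# The command only exists at runtime; a naive grep for "echo" would miss it.
payload = base64.b64decode(b"ZWNobyBoZWxsbw==").decode()  # -> "echo hello"
result = subprocess.run(payload, shell=True, capture_output=True, text=True)
print(result.stdout.strip())
```

Real malware substitutes an exfiltration command for the harmless payload, which is why flagging the decode/exec combination (rather than specific strings) is the robust heuristic.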


Does this work at the system call level, where it would detect e.g. any attempt to open ~/.aws/credentials, or does it rely on heuristic analysis of the code itself, which will always be able to be "coded around" by the malware authors?

The correct approach would seem to be to not run untrusted code in an environment where it can read your AWS credentials.


Packj currently uses static code analysis to derive permissions (e.g., file/network accesses). Therefore, it can detect open() calls if used by the malware directly (i.e., not obfuscated in a base64-encoded string). But Packj can also point out such base64 decode calls. Fortunately, malware has to use these APIs (read, open, decode, eval, etc.) for its functionality -- there's no getting around them. Having said that, sophisticated malware can hide itself better, so dynamic analysis, as you suggested, must be performed for completeness. We are incorporating strace-based dynamic analysis (containerized) to collect system calls.
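As a toy illustration of what static permission derivation can look like (this is my own sketch, not Packj's actual implementation), Python's ast module can flag direct calls to sensitive builtins and APIs:

```python
import ast

# Hypothetical watchlist; a real tool maps each API to a permission category.
SENSITIVE = {"open", "eval", "exec", "b64decode", "system"}

def flag_sensitive_calls(source: str) -> list:
    """Walk the AST of a module and report calls to watched names."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            fn = node.func
            # Handles both open(...) and base64.b64decode(...) call shapes.
            name = fn.id if isinstance(fn, ast.Name) else getattr(fn, "attr", None)
            if name in SENSITIVE:
                hits.append(f"line {node.lineno}: {name}")
    return hits

print(flag_sensitive_calls("data = open('/root/.aws/credentials').read()"))
```

As the thread notes, this catches direct calls but not ones assembled via getattr() or string concatenation, which is why the dynamic (strace-based) layer is needed for completeness.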


https://github.com/ossillate-inc/packj#how-it-works

I imagine you'd want some kind of fuzzer with security-oriented tracing on top. I've never heard of such a tool, but I'd bet it exists somewhere.


As I understand, they use static analysis. So the malicious code can be hidden by obfuscation.

For example, instead of writing

    open('/etc/passwd')
One can write

    method = calculateMethodName()
    name = calculateEtcPassword()
    getattr(__builtins__, method)(name)
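A runnable fleshing-out of that sketch, with hypothetical (benign) stand-ins for the helper functions: neither the builtin name nor the path ever appears as a literal for a static scanner to match.

```python
import builtins

def calculate_method_name():
    # Assembles "open" from character codes instead of a string literal.
    return "".join(chr(c) for c in (111, 112, 101, 110))

def calculate_etc_passwd():
    # Assembles the path from fragments.
    return "/etc/" + "pass" + "wd"

fn = getattr(builtins, calculate_method_name())
# fn(calculate_etc_passwd())  # would open /etc/passwd; left commented out here
print(fn is open)  # -> True
```

(Using the builtins module rather than `__builtins__` makes this work reliably, since `__builtins__` can be a dict rather than a module depending on context.)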


Yes, we are incorporating strace-based dynamic analysis (containerized) to collect system calls. However, Packj can already flag use of the getattr() API, as it is equivalent to "runtime code generation" [1]; a developer can then go and take a deeper look.

1. https://github.com/ossillate-inc/packj/blob/main/main.py#L47...


I guess you'd have to run it in a malware sandbox like Cuckoo.


Turns out, stuff like Java's SecurityManager (https://docs.oracle.com/javase/7/docs/api/java/lang/Security...) was there for a reason.

Note, it's been around since JDK 1.0, so 1996, 26 years ago (!), though it's not widely used.

But it allows sandboxing of libraries you use. I once used it to stop a faulty library that called System.exit(0) (yes, a library that shut down the entire process, closing the main app...) and to wrap the exit in a regular exception instead.

Python and especially Javascript are reinventing a lot of "enterprise" ideas that Java invented 20+ years ago. Maven groupIds are another example from the Java ecosystem (they're used to avoid top level package name squatting).


You are of course correct that those features are extremely useful and that the JDK authors foresaw these situations long before they were a big deal. But also you do note that it's not widely used which is also correct and that's the important bit. I think it would be more accurate to say that python and javascript are discovering (perhaps not fully intentionally) ways to implement these ideas that are suitable for mass community adoption.

IMHO it's like the difference between basic research (mostly not very useful on its own, but extremely important) and public health in medicine (the thing that actually makes a massive impact on the world, leveraging the former at a slower and more controlled pace). To stretch the analogy even more, any doctor worth their salt will tell you that a treatment plan is only as good as the likelihood that the patient will stick to it. Other things like Let's Encrypt (HTTPS everywhere) and Signal/WhatsApp/Telegram (strong privacy enabled by default) come to mind: ideas that experts had spent decades handwringing over, insisting people "should" adopt them, are now done so well that it feels almost silly not to do them the good way.


Python had some sandboxing attempts a long time ago, like 20 years if not more, but without much success. Zope - one of the first web frameworks - needed one, but if I remember correctly it broke around Python 2.5, so you had to use Zope with an old version of Python.

https://wiki.python.org/moin/SandboxedPython has some details


The Java Security Manager was not very effective in practice. It was already deprecated in Java 17 and will be removed in the future: https://openjdk.org/jeps/411


I understand their reasoning but it's still kind of sad that there's no replacement planned.

It would be cool if there was a universal access control platform, but I guess the best we're getting is Docker, because such a platform would commoditize OSes, which OS creators for sure don't want :-(


From an organizational standpoint it's better because it lets a sysadmin/security person sandbox anything and everything, instead of having to become proficient with the tooling for every different language (which may not even exist).


Hey, I tried to check whether your project can detect obfuscation, but it doesn't appear to be installable. In requirements.txt:

- esprima==4.0.0 requires Python 3.6 (EOL) or lower because the package is really old and uses the async keyword (reserved since py37) as an attribute name, which is a SyntaxError on py37+.

- GitPython==3.1.27 requires Python 3.7 or later (requires-python:>=3.7).


Thanks for trying it out and sharing your feedback! I will fix this issue and also create a Dockerfile for easy testing.


Looks interesting, thanks. I'd like to see something like this but with a transparent proxy interface that you can point your package managers to, so that you only ever pull inspected and approved packages.


That's exactly what I am working on right now! I made an open source project called ambience (currently in public alpha https://ambience.sourcecode.ai ).

The point of the project is that you can create (or use existing) repository proxies and attach what I call "audit policies" to them: basically lists of packages/versions you want to block or allow. The default ones cover, for example, malicious, vulnerable, and yanked packages (the blacklist repository). You point pip, poetry, etc. at the proxy, and it blocks installation of any package listed in the audit policies attached to the repository. You can also create ad-hoc repositories, or a repository per project to keep things separate, and operate in whitelist mode where you allow only whitelisted and audited packages.

On top of that there is also a "monitor" mode where you can allow installation of any package (or a subset of packages) and it will capture all dependencies for the purpose of tracking the software supply chain across the company or project. Those packages are then automatically scanned and audited via integration with another project of mine called Aura, a static analysis scanner designed for the Python supply chain.

As mentioned, this is currently in open alpha, so access is limited and user registration is not open (I am currently working on users & permissions for making your own repositories and audit policies). But if you are interested in testing, in the project in general, or in early access to features behind the curtain, feel free to shoot me an email at admin @ sourcecode.ai . The license is open source, so it can also be self-hosted.


That's a cool project - I really like that it's open source and self hostable. A key thing for me would be npm package support - the nodejs ecosystem is such a dumpster fire.


Is it easy to find a list of packages that have been pulled from python/npm in the past? Would be interesting to train some models against it


Sure. Please email me (in profile) for the list. You can also look at the following resources for malware samples:

1. https://github.com/IQTLabs/software-supply-chain-compromises 2. https://github.com/rsc-dev/pypi_malware 3. https://github.com/osssanitizer/maloss/blob/master/malware/R...


What would the input to such a model be? The malicious code snippets? Or do you want to classify packages according to other meta data?


Yeah. I was wondering how easy it would be to classify using a language model/models. Although I don't know how it'd work with binary blobs, multiple languages, etc.

I think certain things would be picked up pretty easily e.g. obfuscated code would be a pretty loud feature, but subtle stuff might be undetected and generally I can't see the model being super accurate.


Thanks, will take a look at this - would be lovely to add something of this nature to a CI pipeline.


We've created a Github action [1]. It pulls data from our backend service https://packj.dev that continuously scans packages. You will have to create a free account.

1. https://github.com/marketplace/actions/packj-dev-audit


How do you make sure the dependencies the tool itself uses are sane to use?


By pinning versions, running Packj on deps and manual analysis :)


I wonder if the target of loglib-modules and hkg-sol-utils is a specific company running its own internal PyPi server, or something like a scientific collaboration (which can be thousands of people and billions of dollars, especially in the drugs discovery domain).

pip is designed to treat all indexes as mirrors, rather than letting you specify the source of a package: a higher version of an internal package name will be chosen whether it lives on the public or the private PyPI. Even with equal versions, a certain percentage of the time it will choose public over private.


Wow. So what you are saying is that someone with knowledge of a company’s internal private package dependencies would be able to hijack their build process by publishing a higher-numbered version of the same package to the public PyPi? And even if the build system explicitly references the package version number, part of the time the package would be taken from the public PyPi instead of the company’s own package server?

How do you mitigate this, other than just totally blocking access to the public PyPi? Shouldn’t everyone be blocking access to PyPi, then, and only be relying on their own private package server?


So the correct solution is really to pre-approve dependencies and store them in a local PyPi. Defence, Medical and some other high-sec industries do this as a matter of course. The problem with this approach is the amount of security toil required to vet packages is large, generally too large for most companies to buy into.

So one solution I see in the wild a lot is to prefix your company packages with a company identifier, then to set proxy rules to prevent fetching these from public PyPi (often these are set directly on the private PyPi and all traffic is directed through it).
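A minimal sketch of that routing setup from the client side, in pip.conf (the internal index URL below is a placeholder): all traffic goes only through the private index, which then enforces the proxy rules.

```
[global]
index-url = https://pypi.internal.example.com/simple/
```

Note that simply adding the private index via extra-index-url does not help, since pip still treats both indexes as equivalent mirrors; the public fallback has to be removed entirely or enforced at the proxy.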

Having been an agency pentester, I can tell you that given access to one company package (Open Sourced?), you can correctly guess the package structure, naming conventions and internal build infra of the company 90+% of the time.



Straight from the horse's mouth! Thank you, I didn't know about this.


This is an attack vector which has been exploited in the past to great effect[1].

[1] https://arstechnica.com/information-technology/2021/02/suppl...


Title should be "packages" rather than "packets".


I got really excited thinking this was something super interesting that I didn't know about i.e. "Python packets?? What?? Stealing AWS keys?? How??" Disappointed that it's actually just malicious packages


In Linux, is there a secure, but also convenient and user-friendly, way to prevent processes from having the same default level of access to the filesystem as the human user?

I like Android's system of per-app uid/gid. But AFAIK it's not implemented by any mainstream Linux kernel or distro.

There's AppArmor, but the last time I tried it, I came away with the opinion that it's not very convenient or user-friendly. Perhaps some kind of friendlier CLI or GUI frontend may help.

I'm assuming SELinux can achieve this but I don't have first-hand experience with it and from what I've read online, it seems to be less user-friendly than even AppArmor.

Any other approach you know about? Do secure distros like Qubes OS or Tails implement this systematically?


Containerization is what you want. There are many containerization tools for Linux, Docker being the most popular, and systemd-nspawn being the most Linuxy, but a bit unknown.


I didn't know about systemd-nspawn — thanks for the suggestion.

I already use Dockerized aliases for some CLI apps (e.g.: ffmpeg) but I didn't find the approach as convenient as I'd like.

I've found docker mounts difficult to secure. I'd like to mount $HOME but exclude even read-only access to "$HOME/.ssh/", "$HOME/passwordsafe.pwsafe3" and a dozen other sensitive file patterns. Some kind of predefined "access profiles" to create FS access rules and assign processes to them ("assign all python processes to python profile") is probably what I'd like.

Containerization is probably the best approach but I'd prefer if it's more opaque and less effort than Docker. For example, if I create a pyenv environment and run its 'python' command, I want that python process to not have full access to the filesystem without having to create container images, command aliases, or volume mounts.


I hacked up a bash script for running arbitrary commands in a docker container, mounting only PWD. It traces dynamic libraries through ldd and creates a new image for each unique command. I got it working for ffmpeg:

https://github.com/paskozdilar/dockerify

I might try to optimize it a little bit later, perhaps bind-mount dynamic libraries instead of creating a new image for each command.


In that case, you should look into NixOS/Nix package manager (or, if you're a GNU fan, GuixSD/Guix package manager). I've heard a lot of good things about them related to your problems, including extremely good support for virtual environments of any kind.


Containers don't contain.

You have to assume that any code running inside a container has broken outside of its mount namespace & can interact with anything running on the host. Only Linux's traditional mechanisms (credentials; capabilities; SELinux policy; others are available) are able to defend against this.


Also: don't install curl in your container.


The "user-friendly" part is always tricky. Maybe you could give bubblewrap a go. I think that it strikes the correct balance between inconvenience and security. I use it to wrap different package managers like npm.

https://github.com/containers/bubblewrap


> I like Android's system of per-app uid/gid. But AFAIK it's not implemented by any mainstream Linux kernel or distro.

You can create users manually for each app.

For GUI apps, https://firejail.wordpress.com/


A once-over of the docs seemed to tick a lot of my boxes. I'll check it out, thanks!


firejail is not limited to GUI apps, is it?


It's not, but there is a 2-3s startup and shutdown delay, which I find too annoying for CLIs


I guess you can limit entire shell sessions with firejail, to amortize the startup time. But I suppose the limits would need to be looser in that case to be useful.


bubblewrap, which is used by flatpak's containerisation, is a great tool for modifying the view of the filesystem seen by particular processes. You can even containerise processes installed in your rootfs.


> In Linux, is there a secure, but also convenient and user-friendly, way to prevent processes from having the same default level of access to the filesystem as the human user?

SELinux, if it was easier to use!

Apparmor

Bubblewrap?

If anything the problem is that there are too many mechanisms and most users are familiar with none of them...


As others have said, containerisation is what you want.

If you want something low level look at bubblewrap (bwrap). Firejail is also another tool to do this, a bit higher level and with more features (maybe too many IMHO).


man -s 7 capabilities

I’d like to see something like OpenBSD’s pledge/unveil.

These all work at the process level, though, not individual portions of code in a process.


Glad they were able to automatically detect/catch this. There seems to be so much bloat when dependencies get pulled in.

Wonder if something like pledge and unveil around library code could be helpful; perhaps library code needs to be separated out into a separate process that would not have reason to access AWS keys.

Also, looking at the screenshot, could a simple program searching for URLs in the library code have helped in this case?

Looks like they removed the modules, so one can't examine them any more.


Yeah, I'd love to have a decorator at the top of a file - @env to provide access to env variables or @secrets for some kind of secrets access, nothing else gets this.

Python is tough though, a very dynamic language so probably kind of hard to lock it down.


You would normally be able to solve things like this via file permissions in the OS but currently we are running everything as root inside a container.


Not really, because the dependencies of your Python code often have access to all global variables and/or run at the same permission level as the rest of your code.


Is it really automated? The infographic glosses over the how with a “security research team” in the flowchart. Maybe the automation catches suspicious code/activity (like the URLs or access to sensitive files) and the research team just verifies it.


This is also why I like to put honey credentials everywhere, including in my .aws/credentials... You never know when they might save you XD If you can't be bothered with setting up CloudTrail + a metric alert, the canarytokens folks can generate one for you ;)

https://canarytokens.org/generate#

Like any other tool though, I recommend having a script trigger it every now and then to make sure it works (and alert you about it so you don't go into panic mode)... for personal stuff, I usually have a specific day in the month when I expect to see some canary tokens fire :)


What's HN's go-to solution for egress control? It seems to be a murky mix of expensive vendor products and hand-rolled Squid.

With supply chain attacks much more common, I'm starting to think that egress control is essential for all.


Remember that AWS credentials are easy to lock down. At the very least you could add an ACL to only let them be used on your AWS instances. Then you can set up alerts if someone attempts to use them outside.


But wouldn’t any host running the malicious package be vulnerable to having creds stolen from that host? Maybe I don’t know which “ACL” tech you’re referencing. You can limit where credentials are used from, and not just something like where an S3 bucket is read from? (For example)


You can set a permission that says “these credentials can only be used on an aws instance owned by this account”

Even if the creds are stolen they’d need access to an instance in your account to use them. Also you can be alerted if someone attempts to use them anywhere else.
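As a sketch of the kind of condition being described (hedged: the CIDR below is a placeholder, and AWS documents several variants of this deny-outside-my-network pattern), an IAM policy can deny any use of credentials from outside a known network while excluding AWS-service-initiated calls:

```
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyUseOutsideKnownNetwork",
    "Effect": "Deny",
    "Action": "*",
    "Resource": "*",
    "Condition": {
      "NotIpAddress": { "aws:SourceIp": ["203.0.113.0/24"] },
      "Bool": { "aws:ViaAWSService": "false" }
    }
  }]
}
```

With a policy like this attached, stolen keys used from an attacker's machine fail closed, and the denied API calls show up in CloudTrail as an alertable signal.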


There are very few situations where it even makes sense to have static credentials on an AWS instance any more. "Ambient" short-lived credentials from the instance profile and assumed roles are much safer.


Exactly. My assumed roles last an hour and are protected by MFA.


In the article they claim the package is literally stealing the IAM role credentials from the EC2’s metadata URL. So it’s presumed that the code is already running on your EC2.

Of course, just because it takes the credentials doesn’t mean it does anything else with them, but it could have done anything.


Network access (outgoing) should be firewalled a lot more these days... there's almost no need to open arbitrary outgoing connections.


Right. Have a DNS server which only forwards requests to trusted domains, and a transparent proxy which only allows traffic to those same domains.

This is not a new idea at all, but it's still a very good one.


It’s tricky, though. Data could be exfiltrated by a DNS lookup or OCSP request.


I'm not familiar with OCSP, but DNS happens via UDP, and since parent said "network access", I'm assuming they mean all network protocols, not just TCP.


Any access to resolve global dns lets you exfiltrate even if you’re locked to a local resolver. Just blocking connections to the internet directly is not enough.
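To make the point concrete, here is a hypothetical sketch of DNS-based exfiltration (attacker.example is a placeholder): the secret is packed into a valid DNS label, so a single lookup through any recursive resolver delivers it to the attacker's authoritative nameserver.

```python
import base64

# AWS's documented example access key ID, standing in for a real stolen secret.
secret = b"AKIAIOSFODNN7EXAMPLE"

# Base32 keeps the label within the DNS charset; strip padding and lowercase it.
label = base64.b32encode(secret).decode().rstrip("=").lower()
query = f"{label}.exfil.attacker.example"
print(query)  # merely resolving this name leaks the secret
```

Longer secrets get chunked across multiple labels or queries, which is why egress control has to cover the resolver path as well, not just direct TCP connections.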


Sure, but again, parent said "Network access" which I assume includes internet (global), local network or any other type of sockets/connections, not just "internet" ones.


Not all OSes’ resolver libraries use the network. Solaris/Illumos uses door IPC.


We need the GPG feature in pip to be finished. We can currently upload signed packages, but pip won't check whether you trust the keys. Pip should include a keyring system and prevent people from installing packages without a signature at a certain trust level, absent an explicit whitelist or interactive confirmation.


Unbelievable that a major packaging ecosystem still has no mechanism for end-to-end signature verification!


Apparently they are working on it in PEP-480


I believe this is an area where automated code parsing (a la GitHub Copilot) could really shine. Those tools are able to explain what a block of code does, so it should be possible to catch many types of malware hidden in source code.


If a human can't figure it out, then an AI trained on human behavior (or more specifically, trained on comments left next to code) can't figure it out either. It's not like malware authors annotate their code with nice "here's where I exfiltrate all the user's secret data" comments.


I never claimed that it would be able to catch all malware, but it beats a human team reading through hundreds or thousands of lines of code, for every package, after every update. I believe it can easily point out many types of suspicious additions to existing code.


That's not going to stop people from trying though. If people keep abusing the package managers like this, then publishing open source code is at risk of ending up like distributing binaries on Windows, where you need to fight dozens of virus scanners arbitrarily blocking your code for no apparent reason. Imagine opening a tab in VSCode only to see a popup warning saying it might be harmful, even though you just wrote it yourself.


That's exactly what VSCode does for me.


Are you talking about this? https://code.visualstudio.com/docs/editor/workspace-trust That's different from how virus scanners work and not exactly what I'm talking about. Even Emacs does that.


That is different, vscode finally realized most people don't want stuxnet-class implicit autorun behaviour from random files and folders in your system.


This kinda happens already. I get prompted to trust stuff I wrote


A focused human will currently do better than automated tools, but actually getting a human to focus on this sort of thing is rare. Most projects I'm familiar with use a multitude of dependencies and only do cursory verification of changes at best.

An automated tool that checks all dependency changes for suspicious code and flags them to a human could be valuable. Whether or not that could be done in a way where wading through the false positives is worth it, I'm not sure, but it's a reasonable idea.


This is the kind of stuff that antiviruses did for ages, from signature, code simulation to heuristics.

They have a lot of expertise in this, it feels like they could branch out in finding malware in source code, instead of in binaries.


Source code has issues with obfuscation methodologies that can defeat a lot of techniques. It's why companies are trying to push more analysis down into the kernel, such as via eBPF. For example, concatenating a series of strings and characters that wind up reading from ~/.aws/credentials in the end is surprisingly tough to catch with simple pattern recognition alone, especially if it's done in a subtle way, such as with a spare buffer while doing other legit activities. So until the syscall gets issued and all substitutions are resolved, user-space analysis can be highly resource-intensive or inaccurate.


Right, code analysis to try to detect places where it reads from ~/.aws/credentials is never going to be reliable. The correct approach is not to run untrusted code in an environment where it can read your AWS credentials.


Another approach is to use only credentials that are ephemeral, which is sort of what most SSO systems will do for cloud IAM. Instance profiles using IMDSv2 work as well. However, some malware out there only needs a few seconds of dwell time to wreak serious havoc, so even ephemeral credentials may as well be static ones, especially if they are used for permanent privilege escalation. All ephemerality really buys you then is a time window of usage, which makes filtering through a SIEM much more accurate -- certainly valuable for forensics at the very least, and even more important in terms of law (chain of custody, irrefutability, etc).


And if that's infeasible, to not run untrusted code in an environment where it has unfettered outbound access to the Internet.


How about flagging anything that looks like obfuscated code?


Apologies if this was not a joke, but just imagining the implications of this made me laugh harder than any comedy I've watched in the past few weeks. My... umm... "scientific code" would be flagged in milliseconds :D


Binaries can be seen as "source code" for the CPU.


Isn't this impossible as per the halting problem?

It would be better to run the code in a sandbox and log all the syscalls it makes, similar to how Cuckoo's malware sandbox works.


The halting problem is only relevant if we are discussing a perfect, 100% solution with no false positives and no false negatives.

In real life it is not of much relevance. I haven't seen a practical case where I would conclude that it is undecidable what a function does (happy to see practical examples if someone has some). For a theoretical program that "uploads credentials iff a sub-program halts", you would probably block it and live with a possible false positive.


> undecidable to say what a function does (happy to see some practical examples if someone has some)

Any and all functions that involve templating/plugins/metaprogramming/etc. (common in various web frameworks) can effectively run arbitrary code and thus are undecidable. Also any function that does deserialization and will thus instantiate novel objects along with their code; e.g., Python code that uses pickle will often be undecidable, as its execution depends on what exactly is unpickled.


You can try clearing environment variables before running a command:

    env -i <command>
And only set the ones you absolutely need.
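The same allowlist idea works from inside Python when spawning subprocesses (a minimal sketch): pass an explicit env dict instead of letting the child inherit everything, AWS keys included.

```python
import os
import subprocess
import sys

# Only the variables named here reach the child process.
allowed = {"PATH": os.environ.get("PATH", "/usr/bin:/bin")}

result = subprocess.run(
    [sys.executable, "-c", "import os; print(sorted(os.environ))"],
    env=allowed, capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # e.g. just ['PATH'] on Linux
```

This protects against a child process snooping, though not against malicious code imported into the parent interpreter itself.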


That doesn’t help when your application legitimately needs an environment variable set but, unknown to the developer, pulls in a malicious module.


Now, the question is, who is pygrata and who is using their packages?


Might something like this work as a robust solution:

https://0bin.net/paste/TTOdW9+B#olQ+FmoCmluokC8UdWllx8GXQtos...

TL;DR

At program initialization, clear the `environ` object, and stash the environment variables in some random location in memory. You can then restore the `environ` object when needed via a contextmanager - meaning that code must explicitly be granted permission to access env vs. being able to snoop regardless.

Curious to hear your thoughts!

EDIT: I wonder if this approach could be extended to allow the use of certain "restricted" libraries (I'm thinking stuff to do with network calls/file system) only within specific scopes - as this would defend against publishing the env vars to a public endpoint...




