Hacker News

An open question - why do you prefer doing this in the command line versus via a scripting language like Python? I get the piping philosophy, but why one versus the other?


Well, if I can do something in one line (albeit perhaps a long line), awk is my preferred tool. But as soon as things get complicated, one is almost certainly better off in a richer environment -- bash, Python, node.js, etc. So I don't view it as one versus the other, but rather picking the right tool for the job -- which the Unix philosophy very much liberates us to do.
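A sketch of the kind of one-liner meant here (the log format, field position, and numbers are invented for illustration):

```shell
# Sum a numeric column (here, field 3) across all lines of input.
printf 'GET /a 512\nGET /b 1024\nPOST /c 256\n' \
  | awk '{ sum += $3 } END { print sum }'
```

One expression, no boilerplate; the moment you need more than that, a richer language starts to pay off.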


Most of the well-known "unix-style" command-line tools, such as grep, sort, etc., actually have very high performance. Their relatively constrained use cases allow the authors to implement good algorithms and optimizations (e.g. sort uses merge sort, grep uses all kinds of optimizations: http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...)

In contrast, when you're building a custom pipeline in a high-level language, you're optimizing for simple solutions and are not likely to get better performance unless you hit an edge case where the standard tools do really poorly.


Actually, sort is even better than a normal (purely in-memory) merge sort. It looks at available memory and writes out sorted runs to files, then merges them.

http://vkundeti.blogspot.com/2008/03/tech-algorithmic-detail...
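A sketch of the same idea done by hand (GNU sort's `-S` and `-T` flags control the in-memory buffer and where the temp runs go; the data here is invented):

```shell
# GNU sort spills sorted runs to temp files when input exceeds its buffer:
#   sort -S 16M -T /tmp big_input > sorted_output
# The equivalent, done manually: sort each piece, then merge the runs.
r1=$(mktemp); r2=$(mktemp)
printf 'banana\napple\n'  | sort > "$r1"
printf 'cherry\napricot\n' | sort > "$r2"
sort -m "$r1" "$r2"        # -m merges already-sorted inputs
rm -f "$r1" "$r2"
```

`sort -m` is the merge phase in isolation: it never holds more than one line per input file in memory.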


Interesting! So if I understand that article correctly, it's basically doing a multi-phase merge sort where each individual run is stored in a file?


This is really a matter of personal style. The point of this approach is that if you want to break out Python or whatever, it integrates with the rest of it just like any of the "built-in" programs.

It's usually quicker for me to iterate on building up a complex program using existing command-line tools -- up to a point. After that point, I switch to something like Node or Python.

One reason it's faster is that they're designed to be composable. They're flexible in just the right ways -- record separators, output formats, and key behaviors (like inverting the sense of a filter or whatever) -- to be able to perform a variety of tasks, but not so flexible that you need a lot of boilerplate, as with more general-purpose languages. They defer unrelated tasks (like sorting) to tools designed for that, keeping concerns separate.

Take an awk script that reads whitespace-separated fields as input and transforms that, adding a header at the top and a summary at the end. awk's got a really nice syntax for these common tasks, and at the end you're left with a program where nearly all of the code is part of the specific problem you're trying to solve, not junk around requiring modules, opening files, looping, parsing, and so on.
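A minimal sketch of that shape (input format and numbers invented):

```shell
printf 'alice 3\nbob 5\n' | awk '
BEGIN { print "name count" }         # header at the top
      { total += $2; print $1, $2 }  # per-record work
END   { print "total", total }       # summary at the end
'
```

Nearly every character is the problem itself: no imports, no file handling, no explicit read loop.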


It's worth mentioning that a UNIX pipeline is highly parallel and naturally exploits multiple cores, while Python, in my experience, does not.
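To make that concrete (the data is invented): each stage below is a separate process, so the kernel can schedule them on different cores, with pipe buffers streaming data between them.

```shell
# Four processes running concurrently, one per pipeline stage.
printf 'err a\nok b\nerr a\n' \
  | grep '^err' \
  | awk '{ print $2 }' \
  | sort \
  | uniq -c
```

A single-threaded script doing the same work gets one core unless you reach for multiprocessing explicitly.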


Python is slow -- parsing 10GB of logs works best with awk, grep, etc.


Python isn't going to beat grep, but it beats awk in a lot of cases. (Cases that awk isn't well suited to, to be fair. Python doesn't beat awk for 99% of what people use awk for.)

It's faster than people think it is. Especially when you add in libraries like pandas, it's fantastic for data analysis.

Of course, by the time you get to using pandas, you have to have everything in memory.

This isn't true of python in general, though. For simpler tasks, you can easily write generators to read from stdin and write to stdout.
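A sketch of that streaming style, run from the shell so it slots into a pipeline like any other tool (example data invented):

```shell
# Python reading stdin lazily, one line at a time -- memory use stays
# flat no matter how large the input is.
printf 'a 1\nb 2\nc 3\n' | python3 -c '
import sys
for line in sys.stdin:        # iterates lazily, generator-style
    name, n = line.split()
    print(name, int(n) * 10)
'
```

Nothing is buffered beyond the current line, so this handles inputs far larger than RAM.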

I'm not saying that it's better for things like log parsing, but for more complicated ASCII formats, I'd far rather use python than awk.

That having been said, people who don't learn and use awk are missing out. It's a fantastic tool.

I've just seen one too many unreadable, 1000-line awk programs to do something that's a dozen lines of python.


I would say that parsing logs works best with awk, grep and the like because that's more or less what they were designed for. But not everything is unix logs, and not everything is over 10GB. Having said that, python can absolutely handle 10GB data sets. In fact, with things like PySpark, you can go much bigger.


You're right - in the end my solution for this particular project was using grep and awk to parse the loglines into a CSV-ish format. That was then interpreted by Python and matplotlib to create beautiful graphs.
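A sketch of the first half of that workflow (the log format and field positions are invented; the real loglines will differ):

```shell
# Hypothetical access-log lines reduced to CSV for plotting downstream.
printf '2024-01-02 12:00:01 GET /a 200 35\n2024-01-02 12:00:02 GET /b 500 90\n' \
  | grep ' 200 ' \
  | awk '{ print $2 "," $4 "," $6 }'
```

The CSV then lands in Python/matplotlib, which is much better suited to the plotting half of the job.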


I hear you on this. I'm very interested in the non-performance-based reasons. Some of the Python libraries are optimized for big data too, no?

I guess the reason I ask is much of the "manipulate and check" that I do happens before I get things to where a one liner will work. That could very well be a programmer competency issue on my part though. :-)


Use pypy. At 10 GB the bottleneck will probably be the storage.


For myself: specialized tools are useful, but the command-line utility set lends itself, as Kernighan and Pike noted in The UNIX Programming Environment over a quarter century ago, to very rapid prototyping and development. You can accomplish a great deal in a few well-understood lines of shell (which is to say, UNIX utilities).

Yes, sometimes the full power of a programming or scripting language is what you need, and in some cases it may execute faster (though you may well be surprised -- the shell utilities are often highly optimised), but if a one-liner, or even a few brief lines, can accomplish the task, why bother with the heavier tool?


Command line tools are just faster in a lot of cases and don't disrupt your flow as you work in the terminal.

That being said, I do notice a disturbing trend of command-line warriors trying to do absolutely everything on the command line, resulting in spending 10 minutes constructing a perfect one-liner when they could have just written a python/perl script in 2 minutes.


For anything beyond a chained set of greps and cuts, I'll use Perl for a one-liner. This has the benefit of letting me easily convert it to a script if it becomes unwieldy. It's just a case of pasting in the one-liner, adding newlines at the semicolons, and fixing up a few characters at the top and bottom of the file.



