Reading awk as a human is hard too. And awk's performance is crap: a lot slower than most interpreted languages out there. I replaced all the awk scripts with Python and everything got a lot faster.
> And awk's performance is crap. [...] I replaced all the awk scripts with Python and everything got a lot faster.
My experience points exactly the other way: for data-processing tasks, especially streaming ones, even Gawk is a lot faster than Python (pre-3.11), and apparently I’m not the only one[1]. If you’re not satisfied with Gawk’s performance, though, try Nawk[2] or, even better, Mawk[3]. (And stick to POSIX to ensure your code works in all of them.)
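For reference, the kind of streaming task I mean, as a minimal micro-benchmark sketch (the /tmp path, the input size, and having mawk and python3 on PATH are all assumptions; adjust to taste):

```shell
# Generate some throwaway numeric input (size is arbitrary).
seq 10000000 > /tmp/nums

# Sum a column with mawk...
time mawk '{ s += $1 } END { print s }' /tmp/nums

# ...and the same in CPython, reading from stdin.
time python3 -c 'import sys; print(sum(int(l) for l in sys.stdin))' < /tmp/nums
```

Both should print the same sum; on my kind of workload the awk side wins, but run it on your own data before believing either of us.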
Do you know of any performance comparisons vs. PyPy? I find it works extremely well as a drop-in replacement for CPython when only the built-in modules are needed, which should generally hold for awk-like use cases. Yet some brief searching doesn't seem to yield any numbers.
You've got to share the code showing how you're doing it. And if you're allowing awk alternatives, you should be comparing against pandas or PyPy too.
I will do a comparison as soon as I am free.
Discussing performance only makes sense in the context of a particular awk implementation, like TFA is doing as well. If you're (stuck) on gawk, try setting LANG=C to prevent Unicode/multi-byte regexp execution, or switch to mawk (which according to [1] is much faster than CPython).
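Something like this, as a sketch (file name and input are arbitrary; note that LC_ALL, if set, overrides LANG, so I use LC_ALL=C here to force single-byte semantics regardless of environment):

```shell
# Throwaway ASCII input.
seq 1000000 > /tmp/nums

# Default locale: gawk may do multi-byte-aware regexp matching.
time gawk '/7$/ { n++ } END { print n }' /tmp/nums

# Byte-oriented locale: same program, usually noticeably faster on ASCII data.
time LC_ALL=C gawk '/7$/ { n++ } END { print n }' /tmp/nums
```

Both runs print the same count; only the matching machinery changes.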
Honestly only makes sense in the context of a Python library and implementation as well, since so many libraries use C extensions in order to speed up processing. Also, Python has gotten a lot faster over time.
Awk is blazingly fast for some operations. I remember using it to solve Project Euler problem 67 [0] in a couple of milliseconds, which is more comparable to C/Rust than to Python. Weirdly, the forum posts from between 2013 and 2023 are missing, so I can't see what I wrote there.
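Since my original post is gone, here's a reconstruction of the approach, not the original script: problem 67 is a maximum path sum through a triangle of numbers, and the standard trick is bottom-up dynamic programming, folding each row into the one above it (triangle.txt is the input file from the Project Euler site):

```shell
awk '
# Load the triangle: row[r, i] is the i-th number on row r.
{ for (i = 1; i <= NF; i++) row[NR, i] = $i; width[NR] = NF }
END {
  # Fold bottom-up: add to each cell the larger of its two children.
  for (r = NR - 1; r >= 1; r--)
    for (i = 1; i <= width[r]; i++) {
      a = row[r + 1, i]; b = row[r + 1, i + 1]
      row[r, i] += (a > b ? a : b)
    }
  # The apex now holds the maximum path sum.
  print row[1, 1]
}' triangle.txt
```

This is O(n) in the number of cells, which is why it finishes in milliseconds even in an interpreter.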
Sure. I do not live in the terminal, but I work with Linux enough to navigate around comfortably and read various shell scripts with relative ease. The exception is awk, which signals to me that, at least in my case, awk has a higher barrier to entry than most other things in the same environment.
So with alternatives around I can more easily parse myself, I happily concede that I have a skill issue with awk.
Well, using awk because you are familiar with it could be due to a skill issue with other languages too. Can't use python for parsing? Skill issue I guess, going by your logic.
Once there are more productive alternatives that require less specialized "skill", your condescending "skill issue" becomes a devex issue, and basically a productivity gap that will doom your language or tool.
You just need to have the skill to overcome whatever non-technical, legacy, lack of education, or poor judgement issues that are steamrolling you into choosing to use awk instead of a sane rational decent modern efficient maintainable language.
The rule of thumb back at Netcraft was to prototype in awk/sed for brevity and expressiveness, then port to Perl for production for performance reasons.
Been a couple decades since I was wrangling the survey systems there though, no idea what it looks like now.
Well. I don't know. Those two programs don't really do the same thing. There are an awful lot of comparisons in the second one. After making the awk program more similar to the Perl program, and using mawk instead of gawk (which is quite a bit slower), the numbers look a bit different:
$ seq 100000000 > /tmp/numbers
$ time perl -MData::Dumper -ne '$n{length($_)}++; END {print Dumper(%n)}' /tmp/numbers
$VAR1 = '7';
$VAR2 = 900000;
$VAR3 = '8';
$VAR4 = 9000000;
$VAR5 = '5';
$VAR6 = 9000;
$VAR7 = '4';
$VAR8 = 900;
$VAR9 = '6';
$VAR10 = 90000;
$VAR11 = '10';
$VAR12 = 1;
$VAR13 = '2';
$VAR14 = 9;
$VAR15 = '3';
$VAR16 = 90;
$VAR17 = '9';
$VAR18 = 90000000;
real 0m16.483s
user 0m16.071s
sys 0m0.352s
$ time mawk '{ lengths[length($0)]++ } END { max = 0; for(l in lengths) if (int(l) > max) max = int(l); print max; }' /tmp/numbers
9
real 0m5.980s
user 0m5.493s
sys 0m0.457s
[edit]: Actually had a bug in the initial implementation. Of course.
I used them both to find the longest line in a file. The Perl option just spits out the number of times each line length occurs. It will get messy if you have many different line lengths (which was not my case).
You also have to take into account that awk does not count the line terminator.
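Concretely (a sketch): Perl's $_ keeps the trailing newline, so length($_) is one more than awk's length($0) for the same line unless you chomp first.

```shell
printf 'hello\n' | perl -ne 'print length($_), "\n"'         # includes the "\n": 6
printf 'hello\n' | perl -ne 'chomp; print length($_), "\n"'  # after chomp: 5
printf 'hello\n' | awk '{ print length($0) }'                # awk never counts it: 5
```

That's also where the 286-vs-285 off-by-one in the results below comes from.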
Let's try the opposite: make the Perl script more like the AWK one.
$ time perl -ne 'if(length($_)>$n) {$n=length($_)}; END {print $n}' rockyou.txt
286
real 0m2,569s
user 0m2,506s
sys 0m0,056s
$ time awk 'length($0) > max { max=length($0) } END { print max }' rockyou.txt
285
real 0m3,768s
user 0m3,714s
sys 0m0,048s