
Reading awk as a human is hard too. And awk's performance is crap: a lot slower than most interpreted languages out there. I replaced all the awk scripts with Python and everything got a lot faster.


> And awk's performance is crap. [...] I replaced all the awk scripts with Python and everything got a lot faster.

My experience points exactly the other way: for data-processing tasks, especially streaming ones, even Gawk is a lot faster than Python (pre-3.11), and apparently I’m not the only one[1]. If you’re not satisfied with Gawk’s performance, though, try Nawk[2] or, even better, Mawk[3]. (And stick to POSIX to ensure your code works in all of them.)
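To illustrate the portability point: a typical streaming one-liner like the following (a made-up per-key sum, not anything from the article) sticks to POSIX features only, so the same script runs unchanged under gawk, nawk, and mawk:

```shell
# Sum the second column per key in the first column. POSIX awk only:
# associative arrays, +=, and for-in are all in the standard.
printf 'a 1\nb 2\na 3\n' |
    awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' |
    sort   # for-in iteration order is unspecified, so sort the output
# a 4
# b 2
```

Anything gawk-specific (gensub, ENVIRON tricks, @include) is where the implementations start to diverge.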

[1] https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-...

[2] https://github.com/onetrueawk/awk

[3] https://invisible-island.net/mawk/


Do you know of any performance comparisons vs. PyPy? I find it works extremely well as a drop-in replacement for CPython when only the built-in modules are needed, which should generally hold for awk-like use cases. Yet some brief searching doesn't seem to yield any numbers.


You gotta share the code for how you're doing it. If you're reaching for an awk alternative, you should be comparing against pandas or PyPy. I'll do a comparison as soon as I'm free.


Discussing performance only makes sense in the context of a particular awk implementation, as TFA does. If you're (stuck) on gawk, try setting LANG=C to prevent Unicode/multi-byte regexp execution, or switch to mawk (which, according to [1], is much faster than CPython).
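A minimal sketch of that locale tweak (the log file and pattern here are placeholders; this assumes your data is effectively single-byte, since the C locale trades multi-byte correctness for speed):

```shell
# Force the C locale so gawk matches bytes rather than multi-byte
# characters; on ASCII data this often speeds up regexp-heavy scripts
# considerably. LC_ALL overrides LANG and the LC_* variables.
LC_ALL=C gawk '/GET/ { n++ } END { print n }' access.log
```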

[1]: https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-...


Honestly, it only makes sense in the context of a particular Python library and implementation as well, since so many libraries use C extensions to speed up processing. Also, Python has gotten a lot faster over time.


We gotta compare against PyPy, or CPython plus pandas, then.


Awk is blazingly fast for some operations. I remember using it to solve Project Euler problem 67 [0] in a couple of milliseconds, which is more comparable to C/Rust than Python. Weirdly, the forum posts from between 2013 and 2023 are missing, so I can't see what I wrote there.
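For the curious: problem 67 is a maximum-path-sum-in-a-triangle problem, so an awk solution presumably looked something like this top-down dynamic-programming pass (a reconstruction, not my original code; the four-row triangle is the small example from problem 18):

```shell
# best[j] holds the best path sum ending at column j of the current row.
# Sweeping j right-to-left lets us update best[] in place; unset parents
# read as 0, which is safe because all triangle entries are positive.
printf '3\n7 4\n2 4 6\n8 5 9 3\n' | awk '
    {
        for (j = NF; j >= 1; j--) {
            p = best[j] > best[j-1] ? best[j] : best[j-1]
            best[j] = $j + p
        }
        n = NF
    }
    END {
        max = 0
        for (j = 1; j <= n; j++) if (best[j] > max) max = best[j]
        print max
    }
'
# 23
```

One linear pass over the input, which is why the runtime lands in milliseconds even for problem 67's 100-row triangle.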

[0] https://projecteuler.net/problem=67


skill issue


Sure. I do not live in the terminal, but I work with Linux enough to comfortably navigate around and read various shell scripts with relative ease. The exception is awk. Which to me signals that, at least in my case, awk has a higher barrier to entry than most other things in the same environment.

So, with alternatives around that I can parse more easily myself, I happily concede that I have a skill issue with awk.


Well, using awk because you are familiar with it could be due to a skill issue with other languages too. Can't use python for parsing? Skill issue I guess, going by your logic.


Even the eminent Mr. A., W., and K. had sKiLl isSueS when designing this language, apparently. You can only ask so much from regular programmers.


Once there are more productive alternatives that require less specialized "skill", your condescending "skill issue" becomes a devex issue, and basically a productivity gap that will doom your language or tool.


You just need to have the skill to overcome whatever non-technical, legacy, lack of education, or poor judgement issues that are steamrolling you into choosing to use awk instead of a sane rational decent modern efficient maintainable language.


To be fair, sometimes awk is just faster to call. In all other cases, as my sibling says, use Perl :D


Perl, then?


The rule of thumb back at Netcraft was to prototype in awk/sed for brevity/expressiveness and then port to perl for production use for performance reasons.

Been a couple decades since I was wrangling the survey systems there though, no idea what it looks like now.


I very much appreciate the server surveys; for a time I read the report every month!


As a dare from a friend I compared my Perl solution to an AWK solution:

  $ time perl -MData::Dumper -ne '$n{length($_)}++; END {print Dumper(%n)}' bigfile.txt

  $VAR1 = '1088';
  $VAR2 = 349647;

  real    0m1.326s
  user    0m0.814s
  sys     0m0.371s

  $ time awk 'length($0) > max { max=length($0) } END { print max }' bigfile.txt

  1087

  real    0m21.400s
  user    0m18.596s
  sys     0m0.455s
I prefer Perl, but I have no issue with AWK and I actually use it frequently.


Well, I don't know. Those two programs don't really do the same thing, and there are an awful lot of comparisons in the second one. After making the awk program more similar to the Perl program, and using mawk instead of gawk (which is quite a bit slower), the numbers look rather different:

  $ seq 100000000 > /tmp/numbers 
  $ time perl -MData::Dumper -ne '$n{length($_)}++; END {print Dumper(%n)}'  /tmp/numbers 
  $VAR1 = '7';
  $VAR2 = 900000;
  $VAR3 = '8';
  $VAR4 = 9000000;
  $VAR5 = '5';
  $VAR6 = 9000;
  $VAR7 = '4';
  $VAR8 = 900;
  $VAR9 = '6';
  $VAR10 = 90000;
  $VAR11 = '10';
  $VAR12 = 1;
  $VAR13 = '2';
  $VAR14 = 9;
  $VAR15 = '3';
  $VAR16 = 90;
  $VAR17 = '9';
  $VAR18 = 90000000;
  
  real 0m16.483s
  user 0m16.071s
  sys 0m0.352s
  $ time mawk '{ lengths[length($0)]++ } END { max = 0; for(l in lengths) if (int(l) > max) max = int(l); print max; }' /tmp/numbers 
  9

  real 0m5.980s
  user 0m5.493s
  sys 0m0.457s
[edit]: Actually had a bug in the initial implementation. Of course.


I used them both to find the longest line in a file. The Perl one just spits out the number of times each line length occurs, so it gets messy if you have many different line lengths (which was not my case).

You also have to take into account that awk does not count the line terminator.

Let's try the opposite: make the Perl script more like the AWK one.

  $ time perl -ne 'if(length($_)>$n) {$n=length($_)}; END {print $n}'  rockyou.txt 
  286

  real 0m2,569s
  user 0m2,506s
  sys 0m0,056s

  $ time awk 'length($0) > max { max=length($0) } END { print max }' rockyou.txt 
  285

  real 0m3,768s
  user 0m3,714s
  sys 0m0,048s


`perl -lne ...` to have perl strip the trailing newlines like awk does. Should give the same result with it.


You're right. It even makes the times converge.



