I can’t quite find the blog post, but I saw someone describe processing a dataset with AWS’ MapReduce across multiple servers… and then they redid the pipeline with bash, awk, and maybe grep, and a single 8-core machine ran it about 100 times faster.
Edit: found it https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
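For anyone curious what that kind of pipeline looks like, here’s a minimal sketch in the same spirit, using only commonly installed tools and fanning work out across cores with xargs -P. The directory name, file extension, and field layout are all hypothetical; the point is just the shape of a “map-reduce on one box” pipeline, not the article’s exact commands.

```bash
#!/usr/bin/env bash
set -euo pipefail

# "Map": one awk process per file, up to 8 in parallel (for an 8-core box),
# each emitting partial counts of the values found in field 3.
find ./data -name '*.tsv' -print0 |
  xargs -0 -P 8 -n 1 awk -F '\t' \
    '{ counts[$3]++ } END { for (k in counts) print k, counts[k] }' |
  # "Reduce": merge the per-file partial counts into one total per key.
  awk '{ totals[$1] += $2 } END { for (k in totals) print k, totals[k] }' |
  sort
```

Because the data streams straight from disk through a handful of processes, there’s none of the job-scheduling or serialization overhead a cluster framework adds, which is where most of the speedup in the linked post comes from.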
I think this is more a problem of knowing when a specific tool should be used. Most people familiar with Hadoop are probably aware of all the overhead it creates. At the same time, you hit a point in dataset size (even more so with “real-time” data processing, I’d guess) where a single machine just isn’t feasible anymore. (That said, I’m not too knowledgeable about Hadoop and big data, so anyone else feel free to chime in.)
I think you can file this under the Linux command line, i.e. the bash shell and the commonly installed set of Linux commands. Way powerful for certain things.