"You don't learn from victory. You learn from being defeated by larger and larger things."
If you want to do a Bash thing on a bunch of files, this is wrong and will fail when encountering whitespace:
$ for i in $(ls | grep ".pdf") ; do something "$i" ; done
This is slightly more correct and will probably not fail:
$ for i in */*.pdf ; do something "$i" ; done
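Why only "probably"? One remaining way for the glob version to misbehave is when nothing matches: by default Bash hands the unmatched pattern to the loop as literal text. Here is a minimal sketch of that, assuming a stock non-interactive Bash with neither nullglob nor failglob already set. The names empty_dir, default_runs, and nullglob_runs are invented for the demo.

```shell
#!/usr/bin/env bash
# Empty scratch directory: no PDFs anywhere inside it.
empty_dir=$(mktemp -d)

# Default behaviour: an unmatched glob is passed through as literal text,
# so the loop body runs once with a name that does not exist.
default_runs=0
for i in "$empty_dir"/*/*.pdf; do
    default_runs=$((default_runs + 1))
done

# With nullglob set, an unmatched glob expands to nothing and the loop is skipped.
shopt -s nullglob
nullglob_runs=0
for i in "$empty_dir"/*/*.pdf; do
    nullglob_runs=$((nullglob_runs + 1))
done
shopt -u nullglob

echo "iterations without nullglob: $default_runs"
echo "iterations with nullglob:    $nullglob_runs"
rm -rf "$empty_dir"
```

If you would rather not change shell options, a guard such as [ -e "$i" ] || continue at the top of the loop body should have a similar effect.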
Bash is one of several popular Linux/Unix command shells. The name "Bash" is actually a punny acronym for "Bourne-again shell": it was first released in 1989 as a replacement for the Bourne shell.
I've been using Bash for about ten years. I'm no expert; I fancy myself somewhat of a journeyman Bash user, perhaps even a yeoman Bash user (if I may). It's not every day that I find out I've been doing something horribly wrong in Bash for years. Today was one of those days.
The following blog post will very likely be seen as common knowledge by a lot of people, but hopefully I can save the less-seasoned CLI jockeys a few hours of Googling. Huge thanks to Rich S. and my other cohorts for helping me out so much.
Today I was asked by a coworker to wrangle up a few hundred PDF files and do some (inconsequential) data processing on them and send him the output. "No problem", I said. I requisitioned the files from a source on the Internet and organized them in folders by what year they were published.
$ ls -lah
total 128
drwxr-xr-x   18 amorris  staff   612B Dec  9 02:29 .
drwxr-xr-x   45 amorris  staff   1.5K Dec  4 14:19 ..
drwxr-xr-x    3 amorris  staff   102B Nov 24 14:12 2008
drwxr-xr-x    3 amorris  staff   102B Nov 24 14:12 2009
drwxr-xr-x   11 amorris  staff   374B Nov 24 14:12 2010
drwxr-xr-x   17 amorris  staff   578B Nov 24 14:12 2011
drwxr-xr-x   26 amorris  staff   884B Nov 24 14:12 2012
drwxr-xr-x   58 amorris  staff   1.9K Nov 24 14:12 2013
drwxr-xr-x  111 amorris  staff   3.7K Nov 24 14:12 2014
drwxr-xr-x   72 amorris  staff   2.4K Nov 24 14:12 2015
I had already written some code to extract and process the data. The code takes a filename argument and returns the data to standard output:
$ ./process_pdf -f whatever.pdf
[+] Here is your output
[+] Blah blah blah
...
Now all I have to do is run a quick for-loop in Bash to recursively crunch through all these PDFs and write one text file per PDF. Most of the PDF filenames included underscores and dashes, but a good chunk of them used spaces. WHATEVER. I can deal with that, right?
For a job like this, I typically just do a quick for loop that takes the output of ls*, but since there is some recursion involved I'm gonna use find. This should do it:
$ for i in $(find . -type f -name "*.pdf");do ./process_pdf -f $i > $i.txt ; done
This will simply run every file through the script and save the output to the filename, plus ".txt". Pretty simple, right?
My script is completely shitting the bed. It's totally breaking when it encounters any whitespace in the filenames. What gives?
$ for i in $(find . -type f -name "*.pdf");do ./process_pdf -f "$i" > "$i.txt" ; done
[+] Processing ./2008/heres_a_pdf.pdf
[+] Processing ./2009/cool-Pdf.pdf
[+] Processing ./2009/Another-Pdf.pdf
[+] Processing ./2010/pretty-cool-pdf.pdf
[+] Processing ./2010/something.pdf
[-] FAILED. File not found: ./2010/sweet
[-] FAILED. File not found: pdf
[-] FAILED. File not found: -
[-] FAILED. File not found: draft.pdf
[+] Processing ./2010/RegularOlePDF.pdf
...
The for loop really doesn't like the whitespace in the output of find. Hmmm, I could do it with find -exec or xargs, but now I'm curious as to why this is breaking. Quotes didn't change it either. What is up with this?!?! Let's do some semblance of troubleshooting...
$ ls
Another-Pdf.pdf      cool-Pdf.pdf       pretty-cool-pdf.pdf   sweet pdf - draft.pdf
RegularOlePDF.pdf    heres_a_pdf.pdf    something.pdf
Yes, this is normal. So I should be able to...
$ for i in $(ls) ; do echo "$i" ; done
Another-Pdf.pdf
RegularOlePDF.pdf
cool-Pdf.pdf
heres_a_pdf.pdf
pretty-cool-pdf.pdf
something.pdf
sweet
pdf
-
draft.pdf
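For what it's worth, here is a minimal sketch of why quoting "$i" inside the loop can't help: the unquoted $(ls) is chopped into words on spaces, tabs, and newlines before the loop ever starts, so by the time "$i" is quoted the damage is done. The scratch directory and the names demo_dir, split_count, and glob_count are made up for illustration, and the sketch assumes the default IFS.

```shell
#!/usr/bin/env bash
# Scratch directory so the demo does not touch real files.
demo_dir=$(mktemp -d)
touch "$demo_dir/sweet pdf - draft.pdf" "$demo_dir/something.pdf"

# Unquoted $(ls ...) is split on spaces, tabs and newlines before the loop runs,
# so the two files turn into five separate words.
split_count=0
for i in $(ls "$demo_dir"); do
    split_count=$((split_count + 1))
done

# A glob expands to exactly one word per file, whatever characters the name contains.
glob_count=0
for i in "$demo_dir"/*.pdf; do
    glob_count=$((glob_count + 1))
done

echo "words from \$(ls): $split_count"
echo "words from glob:  $glob_count"
rm -rf "$demo_dir"
```

With those two files, this should report 5 words for $(ls) and 2 for the glob.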
It turns out everything I've ever known about for loops in Bash was wrong. I did some Googling and found this article titled "Bash Pitfalls", but the article may as well have been titled Andrew Sucks at Bash because I'm pretty sure I was guilty of most of what it warns against.
It turns out that using ls is the absolute worst choice for dealing with files in a given directory in Bash. Why? ls is not meant for passing arguments to other commands. ls is simply a command that prints human-readable output of the contents of a particular directory. Whoops**.
A more "by-the-books" solution ended up being an insanely simple Bash syntax that I had never heard of: for i in *.pdf. Full one-liner below:
$ for i in */*.pdf ; do ./process_pdf -f "$i" > "$i.txt" ; done
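One caveat: */*.pdf only reaches one directory deep, which happens to match the year folders here. If the PDFs were nested deeper, a sketch along these lines should hold up, assuming a find that supports -print0 (GNU and BSD find both do). The process_pdf function below is a hypothetical stand-in for the ./process_pdf script, and work_dir and processed are names invented for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the post's ./process_pdf script.
process_pdf() {
    printf '[+] Processing %s\n' "$1"
}

# Scratch tree with a space-laden filename two levels down.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/2010/drafts"
touch "$work_dir/2010/something.pdf" "$work_dir/2010/drafts/sweet pdf - draft.pdf"

# find emits NUL-terminated paths; read -d '' consumes them one at a time,
# so spaces (and even newlines) inside a name never split it.
# Process substitution keeps the loop in the current shell, so the counter survives.
processed=0
while IFS= read -r -d '' pdf; do
    process_pdf "$pdf" > "$pdf.txt"
    processed=$((processed + 1))
done < <(find "$work_dir" -type f -name '*.pdf' -print0)

echo "processed $processed files"
rm -rf "$work_dir"
```

Both files get processed here, including the one with spaces in its name. Swapping the function for the real script and "$work_dir" for . would be the obvious adaptation.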
Bash is weird
At some point you should throw in the towel and bust out the Python
If you're hellbent on using Bash, consider
This was a humbling Bash lesson for me and I sincerely hope to save the reader(s) of this article some time and energy. As always, please feel free to reach out to me via email or Twitter if you have any questions or feedback.
*Now, thanks to the article I read, I have seen the error in my ways of using for i in $(ls) to do anything. Never again.
**brb, editing every single article I've posted on my blog that uses for i in $(ls)