Accessing Files Properly in Bash One-Liners: How to Prevent Whitespace from Ruining Your Life

"You don't learn from victory. You learn from being defeated by larger and larger things."

TL;DR

If you want to do a Bash thing on a bunch of files, this is wrong and will fail when encountering whitespace:

$ for i in $(ls | grep ".pdf") ; do something "$i" ; done

This is slightly more correct and will probably not fail:

$ for i in */*.pdf ; do something "$i" ; done

Background

Bash is one of several popular Linux/Unix command shells. The term "Bash" is actually a punny acronym for "Bourne-again Shell" as it was the successor to the Bourne Shell in 1989.

I've been using Bash for about ten years. I'm no expert; I fancy myself somewhat of a journeyman Bash user, perhaps even a yeoman Bash user (if I may). It's not every day that I find out I've been doing something horribly wrong in Bash for years. Today was one of those days.

This blog post will very likely be common knowledge to a lot of people, but hopefully I can save the less-seasoned CLI jockeys a few hours of Googling. Huge thanks to Rich S. and my other cohorts for helping me out so much.

The Task

Today I was asked by a coworker to wrangle up a few hundred PDF files and do some (inconsequential) data processing on them and send him the output. "No problem", I said. I requisitioned the files from a source on the Internet and organized them in folders by what year they were published.

$ ls -lah
total 128  
drwxr-xr-x   18 amorris  staff   612B Dec  9 02:29 .  
drwxr-xr-x   45 amorris  staff   1.5K Dec  4 14:19 ..  
drwxr-xr-x    3 amorris  staff   102B Nov 24 14:12 2008  
drwxr-xr-x    3 amorris  staff   102B Nov 24 14:12 2009  
drwxr-xr-x   11 amorris  staff   374B Nov 24 14:12 2010  
drwxr-xr-x   17 amorris  staff   578B Nov 24 14:12 2011  
drwxr-xr-x   26 amorris  staff   884B Nov 24 14:12 2012  
drwxr-xr-x   58 amorris  staff   1.9K Nov 24 14:12 2013  
drwxr-xr-x  111 amorris  staff   3.7K Nov 24 14:12 2014  
drwxr-xr-x   72 amorris  staff   2.4K Nov 24 14:12 2015  

I had already written some code to extract and process the data. The code takes a filename argument and returns the data to STDOUT.

$ ./process_pdf -f whatever.pdf
[+] Here is your output
[+] Blah blah blah
...

Now all I have to do is run a quick for loop in Bash to recursively crunch through all these PDFs and output to one text file per PDF. Most of the PDF filenames included underscores and dashes, but a good chunk of them used spaces. WHATEVER. I can deal with that, right??

For a job like this, I typically just do a quick for loop that takes the output of ls*, but since there is some recursion involved I'm gonna use find. This should do it:

$ for i in $(find . -type f -name "*.pdf");do ./process_pdf -f $i > $i.txt ; done

This will simply run every file through the script and save the output to the filename, plus ".txt". Pretty simple, right?

Wrong

My script is completely shitting the bed. It's totally breaking when it encounters any whitespace in the filenames. What gives?

$ for i in $(find . -type f -name "*.pdf");do ./process_pdf -f "$i" > "$i.txt" ; done
[+] Processing ./2008/heres_a_pdf.pdf
[+] Processing ./2009/cool-Pdf.pdf
[+] Processing ./2009/Another-Pdf.pdf
[+] Processing ./2010/pretty-cool-pdf.pdf
[+] Processing ./2010/something.pdf
[-] FAILED. File not found: ./2010/sweet
[-] FAILED. File not found: pdf
[-] FAILED. File not found: -
[-] FAILED. File not found: draft.pdf
[+] Processing ./2010/RegularOlePDF.pdf
...

Looks like for really doesn't like the whitespace provided by find. Hmmm, I could do it with find -exec or xargs, but now I'm curious as to why this is breaking. Quoting "$i" didn't change it either. What is up with this?!?! Let's do some semblance of troubleshooting...

$ ls
Another-Pdf.pdf       cool-Pdf.pdf          pretty-cool-pdf.pdf   sweet pdf - draft.pdf  
RegularOlePDF.pdf     heres_a_pdf.pdf       something.pdf  

Yes, this is normal. So I should be able to...

$ for i in $(ls) ; do echo "$i" ; done
Another-Pdf.pdf  
RegularOlePDF.pdf  
cool-Pdf.pdf  
heres_a_pdf.pdf  
pretty-cool-pdf.pdf  
something.pdf  
sweet  
pdf  
-
draft.pdf  

wat

Solution

It turns out everything I've ever known about for loops in Bash was wrong. I did some Googling and found this article entitled "Bash Pitfalls", but the article may as well have been titled Andrew Sucks at Bash because I'm pretty sure I was guilty of most of what it detailed not to do.

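It turns out that looping over the output of ls is about the worst way to deal with the files in a directory in Bash. Why? ls is not meant for feeding filenames to other commands; it simply prints a human-readable listing of a directory. When Bash expands an unquoted $(ls) or $(find ...), it splits the output on every space, tab, and newline, so a single filename with spaces becomes several words before the loop ever sees it. Quoting "$i" inside the loop comes too late to help. Whoops**.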
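To see the splitting for yourself, here is a minimal sketch that counts how many words each kind of loop receives. The temp directory and the two filenames are made up for the demo; nothing here touches real files.

```shell
# Sketch: why $(ls) breaks on spaces. Uses a throwaway temp directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/something.pdf" "$demo_dir/sweet pdf - draft.pdf"

# Unquoted $(ls ...) is split on spaces, tabs, and newlines,
# so these 2 files turn into 5 separate words.
split_count=0
for i in $(ls "$demo_dir") ; do split_count=$((split_count + 1)) ; done

# A glob expands to exactly one word per matching file.
glob_count=0
for i in "$demo_dir"/*.pdf ; do glob_count=$((glob_count + 1)) ; done

echo "ls words: $split_count, glob words: $glob_count"
# prints: ls words: 5, glob words: 2

rm -rf "$demo_dir"
```

The glob wins because Bash builds the word list itself, one word per file, instead of re-parsing text that another program printed.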

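A more "by-the-book" solution ended up being an insanely simple bit of Bash syntax that I had never heard of: a glob, as in for i in *.pdf. Since my PDFs sit one level down in the year folders, the pattern becomes */*.pdf. Full one-liner below: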

$ for i in */*.pdf ; do ./process_pdf -f "$i" > "$i.txt" ; done
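One caveat with globs is worth a sketch: when a pattern matches nothing, Bash (with default settings) leaves the pattern in place as a literal string, and the loop runs once on a file that doesn't exist. The example below guards against that with an existence check. Note that process_stub is a hypothetical stand-in for ./process_pdf, and the directory layout is made up for the demo.

```shell
# Sketch: the glob loop with a guard for the no-match case.
# process_stub is a hypothetical stand-in for ./process_pdf.
process_stub() { printf '[+] Processing %s\n' "$2" ; }

work_dir=$(mktemp -d)
mkdir "$work_dir/2009" "$work_dir/2010"
touch "$work_dir/2010/sweet pdf - draft.pdf" "$work_dir/2010/something.pdf"

processed=0
for i in "$work_dir"/*/*.pdf ; do
    # If nothing matched, $i is the literal pattern; skip it.
    [ -e "$i" ] || continue
    process_stub -f "$i" > "$i.txt"
    processed=$((processed + 1))
done

# 2009 is empty, so this pattern stays literal and the guard catches it.
skipped=0
for i in "$work_dir"/2009/*.pdf ; do
    [ -e "$i" ] || { skipped=$((skipped + 1)) ; continue ; }
done

echo "processed: $processed, skipped literal patterns: $skipped"
# prints: processed: 2, skipped literal patterns: 1

rm -rf "$work_dir"
```

Running shopt -s nullglob before the loop is another way to handle this, since it makes an unmatched pattern expand to nothing at all.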

Takeaways

  1. Bash is weird

  2. At some point you should throw in the towel and bust out the Python

  3. If you're hellbent on using Bash, consider find -exec or xargs.
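For the third takeaway, here is a rough sketch of what the find -exec and xargs routes can look like when you need truly recursive processing. The echo commands are placeholders for the real work, and the temp directory and filenames are invented for the demo; treat this as one way to do it, not the definitive one.

```shell
# Sketch only: the echo commands stand in for the real processing step.
work_dir=$(mktemp -d)
mkdir "$work_dir/2010"
touch "$work_dir/2010/sweet pdf - draft.pdf" "$work_dir/2010/something.pdf"

# find -exec: each filename is handed over as a single argument ("$1").
# The per-file redirect needs a shell, hence the sh -c wrapper.
find "$work_dir" -type f -name '*.pdf' \
    -exec sh -c 'echo "[+] Processing $1" > "$1.txt"' sh {} \;

# find -print0 | xargs -0: names are separated by NUL bytes, which cannot
# appear in a filename, so spaces and newlines pass through untouched.
find "$work_dir" -type f -name '*.pdf' -print0 \
    | xargs -0 -n 1 sh -c 'echo "[+] Processing $1" > "$1.out"' sh

# Count what each approach produced.
txt_count=0 ; out_count=0
for f in "$work_dir"/2010/*.txt ; do [ -e "$f" ] && txt_count=$((txt_count + 1)) ; done
for f in "$work_dir"/2010/*.out ; do [ -e "$f" ] && out_count=$((out_count + 1)) ; done
echo "find -exec wrote $txt_count files, xargs wrote $out_count files"
# prints: find -exec wrote 2 files, xargs wrote 2 files

rm -rf "$work_dir"
```

In my case the command inside sh -c would be something like ./process_pdf -f "$1" > "$1.txt". Both the GNU and BSD versions of find and xargs support -print0 and -0.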

This was a humbling Bash lesson for me and I sincerely hope to save the reader(s) of this article some time and energy. As always, please feel free to reach out to me via email or Twitter if you have any questions or feedback.

Be well,

--Andrew


*Now, thanks to the article I read, I have seen the error in my ways of using for i in $(ls) to do anything. Never again.

**brb, editing every single article I've posted on my blog that uses for i in $(ls)