Ghetto Compressed Data Gymnastics

Summary

A large part of my day-to-day job involves working with large compressed data sets. Sometimes the data sets are so huge that I don't have enough disk space to decompress them on my computer. Here are some tricks I've learned while wrangling large volumes of compressed data.

Searching The Data

In most cases, the data sets I'm working with are compressed log files. Often I want to search through these files and return anything that includes a particular string or matches a regex. In instances like this, I like to use a not-so-well-known Unix utility called gzcat. gzcat is, quite literally, cat but for gzip-compressed data. gzcat is often installed alongside the gzip CLI utility, and I believe it ships with OS X by default, but I could be mistaken.

note: I believe on Linux the same utility is just called zcat

Here's a quick proof of concept:

$ echo Here is some test data >> file.txt
$ gzip file.txt 
$ file file.txt.gz 
file.txt.gz: gzip compressed data, was "file.txt", from Unix, last modified: Mon Oct 12 22:05:16 2015  
$ gzcat file.txt.gz 
Here is some test data  

For example, if I only want to pull Telnet logs out of my data set, I'll use the following horribly inefficient one-liner:

$ gzcat * | grep telnet | head -n 1

{"project": "research", "version": "1", "type": "simpletelnetsensor", ipaddr": "xx.xx.xx.xx", "event": "2015-10-11 00:04:34,980 [AUTHENTICATION REQUEST RECEIVED] 80.200.xx.xx [root/xc3511]\n"}

If your data set is composed of lots and lots of files, sometimes you'll run into this error:

$ gzcat */*
-bash: /usr/bin/gzcat: Argument list too long

If this happens, here's a stupid shell hack to get around it by simply processing one file at a time.

$ for i in $(find . -type f); do gzcat "$i" | grep whatever; done
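
If you'd rather skip the loop (and not choke on filenames with spaces), find can feed the files to gzcat in batches itself. A sketch, assuming your compressed logs end in .gz and "whatever" is a placeholder search string:

$ find . -type f -name '*.gz' -exec gzcat {} + | grep whatever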

The Ubuntu package details for gzcat and its sister utilities live under the gzip package.

Ghetto Compressed Data Backup

Sometimes I find myself in a situation where I want to compress a massive amount of data on a server, then ship it off to a remote server to be archived or analyzed. No problem, right?

Well, sometimes this happens.

$ du -h -s data/
46G    data/  
$ df -h
Filesystem      Size  Used Avail Use% Mounted on  
/dev/vda1        59G   56G  397M 100% /

Shit, the hard disk is full. The data is still being used by the server, I can't take the server offline because it's production, and I don't have enough free space to write a compressed copy, transfer it, and then delete it. Here's where some handy dandy tar + netcat kung fu comes in.

FROM THE SERVER

$ tar zcvf - data/ | nc -lvp 9999 
Listening on [0.0.0.0] (family 0, port 9999)

data/  
data/logs/  
data/logs/xxxx/syslog  
data/logs/xxxx/syslog.1  
data/logs/xxxx/syslog.2  
data/logs/xxxx/syslog.3  
...

What we've done is compress the data with tar (gzip, via the z flag), but instead of writing to a file we're streaming straight to STDOUT, then snagging it with netcat, which pushes it into a raw TCP socket and waits to shove it out to whoever connects to us. This is cool because the data is compressed on its way into the socket, so the archive never hits the disk.

FROM THE CLIENT

$ nc -v 5.5.5.5 9999 > backup.tar.gz
...
$ file backup.tar.gz 
backup.tar.gz: gzip compressed data, from Unix, last modified: Mon Oct 12 22:35:23 2015  

Swag.
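
If you don't actually need the intermediate tarball on the client, you can also unpack the stream as it comes over the wire. A sketch, reusing the same placeholder IP and port from above:

$ nc -v 5.5.5.5 9999 | tar zxvf -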

It's important to note that this solution is completely unencrypted and pretty terrible in general. If you want to wrap the data in some kind of crypto, you can always substitute the nc command with something like ncat --ssl, or shove the whole thing through an SSH tunnel.
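
For example, here's a rough sketch of the SSH tunnel approach (user and backuphost are placeholders); tar writes straight into the pipe and the remote end just catches it in a file:

$ tar zcvf - data/ | ssh user@backuphost 'cat > backup.tar.gz'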

It's also important to note that this is way less efficient than using something like S3 or Glacier for backups, but hey, that's not always an option. A much better solution is to integrate log rotation, backup, and archival into your DevOps process so you avoid situations like this in the first place.
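
For what it's worth, even a bare-bones logrotate config goes a long way toward keeping the disk from filling up in the first place. A sketch, with made-up paths and retention numbers:

$ cat /etc/logrotate.d/myapp
/var/log/myapp/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
}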

Anyways, hopefully this is useful to someone, somewhere. As always, please feel free to shoot me a message on Twitter if you have any feedback or requests for other blog posts.

Be well,
--Andrew

EDIT 10/12/15: Brian Wallace on Twitter pointed out that gzcat is essentially the exact same as running gzip -dc. Thanks Brian!

EDIT 10/19/15: I've had a few people mention zgrep to me. My mind is completely blown. I had no idea this existed.
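
For anyone else who missed it like I did, zgrep collapses the gzcat | grep one-liner from earlier into a single command:

$ zgrep telnet *.gz | head -n 1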