
Named pipes

Trivial to create:

[email protected][~]$ mkfifo pipe1
[email protected][~]$ ls -l pipe1
prw-r--r-- 1 ahoward ahoward 0 May 19 22:16 pipe1
[email protected][~]$ 
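If you’d rather check for a named pipe in a script than eyeball the leading "p" in the ls output, the standard `-p` test does the job. This is only a sketch: the scratch directory and the `kind` variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical scratch directory so the demo doesn't touch real files.
dir=$(mktemp -d)
mkfifo "$dir/pipe1"

# [ -p PATH ] is true only when PATH exists and is a named pipe (FIFO).
if [ -p "$dir/pipe1" ]; then
    kind="named pipe"
else
    kind="something else"
fi
echo "$kind"

rm -r "$dir"
```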


A bit trickier to use:

[email protected][~]$ echo test > pipe1
(hangs...)
[email protected][~]$ cat pipe1
(also hangs...)


Any attempt to read from or write to the pipe hangs.

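The problem here is that a named pipe isn’t storage. Opening one end blocks until another process opens the other end, so any command that tries to fill the pipe hangs until another command is ready to consume that I/O. The opposite also holds true. (The kernel does keep a small in-memory buffer for data in flight, but nothing ever lands on disk, and that buffer doesn’t help until both ends are open.)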
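If you want to see the blocking without losing your terminal, you can wrap the writer in a timeout. This is a rough sketch that assumes GNU coreutils’ `timeout` is installed; the scratch directory and the `demo_pipe` name are made up for illustration.

```shell
#!/bin/sh
# Hypothetical scratch directory and pipe name.
dir=$(mktemp -d)
mkfifo "$dir/demo_pipe"

# Nobody is reading, so the writer blocks while opening the pipe.
# timeout kills it after 1 second and reports exit status 124.
status=0
timeout 1 sh -c 'echo test > "$1"' sh "$dir/demo_pipe" || status=$?
echo "writer exit status: $status"

rm -r "$dir"
```

An exit status of 124 is how `timeout` says the command was still running when the time limit expired, which here means the write never got going.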

One correct way to use a named pipe is to background a command at one end of the pipe, then run another command at the other end of the pipe. It doesn’t matter which command consumes and which fills the pipe – the important thing is that they execute simultaneously. For example:

[email protected][~]$ echo test > pipe1 &
[1] 5079
[email protected][~]$ cat pipe1
test
[1]+  Done                    echo test > pipe1
[email protected][~]$ 
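The same thing works the other way around, with the reader in the background and the writer in the foreground. Here’s a small script version; the scratch directory, the `received` file, and the variable names are all made up for the demo.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names.
dir=$(mktemp -d)
mkfifo "$dir/demo_pipe"

# Reader goes to the background first; it blocks until a writer shows up.
cat "$dir/demo_pipe" > "$dir/received" &

# Writer runs in the foreground; both ends are now open, so neither hangs.
echo test > "$dir/demo_pipe"

# Wait for the background reader to finish before looking at its output.
wait
received=$(cat "$dir/received")
echo "$received"

rm -r "$dir"
```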


In retrospect, this should seem really obvious. Of course the pipe blocks until another process is on the other end. If it didn’t, the data would have to sit somewhere until a reader showed up, and then it’d be a file – not a pipe.

There’s really very little difference between named pipes and in-line pipes (i.e. “|”). In my basic example above, there’s actually no reason to use a named pipe – an in-line pipe is the better choice. However, there are a couple of situations where named pipes really shine.

These cases are so rare and complex that it’s tough to think of a contrived example, but I’ll do my best. Let’s say you want to read a lot of data off disk and perform multiple operations with that data. Maybe you want to upload a 100G file to a remote server, and also calculate the md5sum of that data. One valid approach would be to simply calculate the md5sum of the file, save that in a variable, then upload the file:

[email protected][~]$ ll -h dataFile 
-rw-r--r-- 1 ahoward ahoward 100G May 19 22:56 dataFile
[email protected][~]$ CHECKSUM=$( md5sum dataFile | awk '{print $1}' )
[email protected][~]$ uploadFile dataFile
[email protected][~]$ 


Yeah, it’d work… but it sucks. Why? Because disk IO is one of the slowest things a system can do, and we just did a lot of it. Twice. The ‘tee’ command lets us duplicate a command’s output, but it writes to a file:

[email protected][~]$ ls -l output
ls: cannot access output: No such file or directory
[email protected][~]$ echo test | tee output
test
[email protected][~]$ cat output
test
[email protected][~]$ 


In this case tee actually makes it worse – we’d still have to read from disk a second time, except now we’d also be writing to disk. Gah!

So tee’s useless… or is it? Any time Linux asks for a file, it really doesn’t care whether you give it a regular file or not, as long as it can open it and get a file *handle*. A named pipe can be opened just like a file. Check this out:

[email protected][~]$ mkfifo pipe1
[email protected][~]$ uploadFile pipe1 &
[1] 5540
[email protected][~]$ CHECKSUM=$( tee pipe1 <dataFile | md5sum | awk '{print $1}' )
[1]+  Done                    uploadFile pipe1
[email protected][~]$ 


tee duplicates an IO stream between STDOUT and an arbitrary file handle, so if we give it a named pipe, we suddenly have two IO streams, and we only ever went to disk once.
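If you want to try this without a 100G file or a real upload tool, here’s a small stand-in. It’s only a sketch: `wc -c` plays the part of uploadFile, the data file is a single line of text, and the scratch directory and variable names are made up. The extra md5sum at the end exists only to check the result, so it does read the file a second time.

```shell
#!/bin/sh
# Hypothetical scratch directory and tiny stand-in data file.
dir=$(mktemp -d)
printf 'hello named pipes\n' > "$dir/dataFile"
mkfifo "$dir/pipe1"

# Stand-in for uploadFile: count the bytes that arrive through the pipe.
wc -c < "$dir/pipe1" > "$dir/bytes" &

# One read of dataFile: tee feeds the pipe and md5sum at the same time.
checksum=$(tee "$dir/pipe1" < "$dir/dataFile" | md5sum | awk '{print $1}')

# Wait for the background consumer to finish.
wait
bytes=$(tr -d ' ' < "$dir/bytes")   # some wc versions pad with spaces

# Verification only: checksum the file directly for comparison.
expected=$(md5sum "$dir/dataFile" | awk '{print $1}')
echo "checksum=$checksum bytes=$bytes"

rm -r "$dir"
```

Both consumers see the full stream: the byte count matches the file size, and the checksum matches what md5sum reports for the file itself.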

Cool, eh?
