
Named pipes

Trivial to create:

ahoward@martlet[~]$ mkfifo pipe1
ahoward@martlet[~]$ ls -l pipe1
prw-r--r-- 1 ahoward ahoward 0 May 19 22:16 pipe1
ahoward@martlet[~]$ 

 

A bit trickier to use:

ahoward@martlet[~]$ echo test > pipe1
(hangs...)
^C
ahoward@martlet[~]$ cat pipe1
(also hangs...)

 

Any attempt to read from or write to the pipe hangs.

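The problem here is that opening a named pipe blocks until some other process opens the other end. The kernel keeps only a small in-memory buffer for a pipe and never spills it to disk, so any command that tries to fill the pipe blocks until another command is ready to consume that I/O. The opposite also holds true.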
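A quick way to watch the blocking without losing your terminal is to put a time limit on the lone writer. This is only a sketch: it assumes GNU coreutils' timeout is installed, and the temporary directory and variable names are made up for the illustration.

```shell
#!/bin/sh
# Sketch: a lone writer on a FIFO blocks; a writer paired with a reader does not.
dir=$(mktemp -d)
mkfifo "$dir/pipe1"

# No reader exists, so the inner shell blocks while opening pipe1 for writing.
# timeout gives up after 1 second and reports exit status 124.
timeout 1 sh -c 'echo test > "$1"' sh "$dir/pipe1"
lone_status=$?
echo "lone writer exit status: $lone_status"

# Start a reader in the background first, and the same write completes.
cat "$dir/pipe1" > "$dir/out" &
timeout 5 sh -c 'echo test > "$1"' sh "$dir/pipe1"
paired_status=$?
wait    # let the background reader see end-of-file and exit
result=$(cat "$dir/out")
echo "paired writer exit status: $paired_status, reader got: $result"

rm -r "$dir"
```

The first status should be 124 (timeout's "timed out" code) and the second 0, which is the whole story of the hang in two numbers.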

One correct way to use a named pipe is to background a command at one end of the pipe, then run another command at the other end of the pipe. It doesn’t matter which command consumes and which fills the pipe – the important thing is that they execute simultaneously. For example:

ahoward@martlet[~]$ echo test > pipe1 &
[1] 5079
ahoward@martlet[~]$ cat pipe1
test
[1]+  Done                    echo test > pipe1
ahoward@martlet[~]$ 

 

In retrospect, this should seem really obvious. Of course the pipe blocks until another process is on the other end. If it didn’t do that, it’d have to hold the data on disk or in RAM indefinitely, and then it’d be a file – not a pipe.

There’s really very little difference between named pipes and in-line pipes (i.e. “|”). In my basic example above, there’s actually no reason to use named pipes – in-line pipes are a better fit. However, there are a couple of situations where named pipes really shine.

These cases are so rare and complex that it’s tough to think of a contrived example, but I’ll do my best. Let’s say you want to read a lot of data off disk and perform multiple operations with that data. Maybe you want to upload a 100G file to a remote server, and also calculate the md5sum of that data. One valid approach would be to simply calculate the md5sum of the file, save that in a variable, then upload the file:

ahoward@martlet[~]$ ll -h dataFile 
-rw-r--r-- 1 ahoward ahoward 100G May 19 22:56 dataFile
ahoward@martlet[~]$ CHECKSUM=$( md5sum dataFile | awk '{print $1}' )
ahoward@martlet[~]$ uploadFile dataFile
ahoward@martlet[~]$ 

 

Yeah, it’d work… but it sucks. Why? Because disk I/O is one of the slowest things a system can do, and we just did a lot of it. Twice. The ‘tee’ command lets us duplicate a command’s output, but the copy goes to a file:

ahoward@martlet[~]$ ls -l output
ls: cannot access output: No such file or directory
ahoward@martlet[~]$ echo test | tee output
test
ahoward@martlet[~]$ cat output
test
ahoward@martlet[~]$ 

 

In this case tee actually makes it worse – we’d still have to read from disk a second time, except now we’d also be writing to disk. Gah!

So tee’s useless… or is it? Any time Linux asks for a file name, it really doesn’t care whether that name points at a regular file, as long as opening it gives back a file *handle*. A named pipe has a name and opens just like a file. Check this out:

ahoward@martlet[~]$ mkfifo pipe1
ahoward@martlet[~]$ uploadFile pipe1 &
[1] 5540
ahoward@martlet[~]$ CHECKSUM=$( tee pipe1 <dataFile | md5sum | awk '{print $1}' )
[1]+  Done                    uploadFile pipe1
ahoward@martlet[~]$ 

 

So we can use tee to duplicate an I/O stream between STDOUT and an arbitrary file handle. Give tee a named pipe, and suddenly we have two I/O streams, and we only ever went to disk once.
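If you want to try this without a 100G file or a real uploader, here is a small self-contained sketch. The uploadFile command above is just a placeholder, so this version stands in a plain cat for it and uses 1 MiB of random data as dataFile; the temporary directory and the UPLOADED_SUM variable are my own additions, there only to check that both branches saw the same bytes.

```shell
#!/bin/sh
# Sketch: one read of the source feeds both a consumer on a FIFO and md5sum.
dir=$(mktemp -d)

# Small stand-in for the 100G dataFile.
head -c 1048576 /dev/urandom > "$dir/dataFile"
mkfifo "$dir/pipe1"

# Stand-in for the hypothetical uploadFile: copy whatever arrives on the pipe.
cat "$dir/pipe1" > "$dir/uploaded" &

# tee writes the stream to the FIFO and to stdout; md5sum hashes stdout.
CHECKSUM=$(tee "$dir/pipe1" < "$dir/dataFile" | md5sum | awk '{print $1}')
wait    # let the background consumer finish

# Hash the consumer's copy to confirm both branches saw identical data.
UPLOADED_SUM=$(md5sum < "$dir/uploaded" | awk '{print $1}')
echo "stream:   $CHECKSUM"
echo "uploaded: $UPLOADED_SUM"

rm -r "$dir"
```

The two sums should match. Note that the consumer is started first and in the background, for the same reason as before: tee would otherwise block opening the pipe.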

Cool, eh?
