Threaded Copy (t_copy)

Motivation

Nodes that make up large-scale clusters often have a large aggregate memory (tens of gigabytes) and multiple compute cores (2, 4, 8, 16...). Leveraging these resources for memory-mapped parallel I/O on a node can provide a noticeable performance improvement over more traditional methods of moving data and files within a multicore node. My goal was to demonstrate, with a simple implementation, that we can use the resources on a node to significantly outperform a traditional Linux “cp”.

Design

This code uses POSIX threads to write a memory-mapped file, with one thread pinned to each core. Each thread writes to a different offset of the output file. This can be quite effective when the copies are performed concurrently by independent tasks, particularly on a machine with a considerable amount of memory. My goal was to compare the effect of these concurrent writes against the Linux “cp” command: if a user invokes t_copy in the same way they use cp, they should see a considerable performance improvement.

Figure 1: threaded memory-mapped I/O
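
For reference, here is a minimal sketch of the approach (illustrative only, not the actual t_copy source): the input file is mapped read-only, the output file is sized with ftruncate() and mapped shared, and each thread is pinned to a core (via a Linux-specific GNU extension) and memcpy()s its own slice of the mapping. The thread count, slice arithmetic, and error handling are simplified assumptions for illustration.

/*
 * Minimal sketch of a threaded memory-mapped copy (illustrative only,
 * not the actual mmio_t source).  The input file is mapped read-only,
 * the output file is sized with ftruncate() and mapped shared, and each
 * thread copies its own slice.  Error handling is abbreviated.
 */
#define _GNU_SOURCE             /* for pthread_setaffinity_np() on Linux */
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 8              /* assumed: one thread per core on the node */

struct slice { char *src; char *dst; size_t len; };

static void *copy_slice(void *arg)
{
    struct slice *s = arg;
    memcpy(s->dst, s->src, s->len);     /* each thread writes its own offset */
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    int in = open(argv[1], O_RDONLY);
    struct stat st;
    if (in < 0 || fstat(in, &st) < 0) { perror(argv[1]); return 1; }
    size_t size = st.st_size;

    int out = open(argv[2], O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (out < 0 || ftruncate(out, size) < 0) { perror(argv[2]); return 1; }

    char *src = mmap(NULL, size, PROT_READ, MAP_PRIVATE, in, 0);
    char *dst = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, out, 0);
    if (src == MAP_FAILED || dst == MAP_FAILED) { perror("mmap"); return 1; }

    pthread_t tid[NTHREADS];
    struct slice sl[NTHREADS];
    size_t chunk = size / NTHREADS;

    for (int i = 0; i < NTHREADS; i++) {
        sl[i].src = src + (size_t)i * chunk;
        sl[i].dst = dst + (size_t)i * chunk;
        sl[i].len = (i == NTHREADS - 1) ? size - (size_t)i * chunk : chunk;
        pthread_create(&tid[i], NULL, copy_slice, &sl[i]);

        /* Best-effort pinning of thread i to core i (GNU extension). */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);
        pthread_setaffinity_np(tid[i], sizeof(cpu_set_t), &set);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    munmap(src, size);
    munmap(dst, size);
    close(in);
    close(out);
    return 0;
}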

Performance

The tests below were run on a 16-core (quad-socket, quad-core) Intel Dunnington (Xeon E7440) system running at 2.4 GHz with 256GB of memory. The system is a non-dedicated resource: other users were running on it during the tests, and the load average was 4 at the time. All files were copied from a GPFS filesystem to /tmp, which is mounted on local disk on the node. Filesystem buffer-to-disk synchronization with sync() was not used in generating these results, so the I/O buffers were exercised just as they are with cp.

For most files t_copy is faster than cp, particularly for files over 1GB. The performance graph below shows the average bandwidth of cp compared with t_copy at varying thread counts, copying from a GPFS distributed filesystem on a single 16-core system. The X axis is the file size and the Y axis is the average bandwidth. The best t_copy result is a 5x (500%) improvement over "serial" cp on a 4GB file copy using 4 threads. For files under 1GB, 8 threads generally perform better, but for 512MB files we actually see t_copy performance degrade when using fewer than 8 threads. For a 1GB file, 16 threads perform best, but only by around 10%.

Figure 2: 16-core Intel Dunnington node with 256GB RAM (non-dedicated)

We can see that t_copy generally performs as well as, and often better than, cp, most significantly for files in the 1 to 4GB range.

Feel free to use and/or test the code. This version has been tested on Linux and IBM Power 6 machines.

To run: ./t_copy /home/myfile /workdir/myfile
Source files: mmio_t, cores, Makefile

An example of running the above code on an 8-core Intel Nehalem node with 32GB of memory, copying a 1GB file.

[user@nehalem]$ ./t_copy 1GB /tmp/1GB
nehalem: Copying 1GB --> /tmp/1GB with 8 threads.

Copy Complete in 1.9 sec. 523.1 (MB/s)
[user@nehalem]$ time cp 1GB /tmp/1GB

real 0m6.625s
user 0m0.045s
sys 0m2.415s



In the test above, t_copy ran in 1.9 seconds and “cp” took 6.6 seconds to copy the same 1GB file to the same place, a 3.5x (350%) improvement!