Some performance measurements
The most surprising thing about ZIO for new users is the idea of a
"block" and a "control" (the 512-byte thing) attached to every data
burst. This is often perceived as a serious overhead.
I've made some performance measurements on a recent PC-class computer.
In short, the overhead of bringing the block over the whole pipeline
(device, trigger, buffer, char device) is a fraction of a microsecond
per block.
Measuring the overhead
With zio-zero.ko and the transparent buffer (called user, and now the
default at ZIO initialization), we can read or write huge amounts of
data. We might compare with /dev/zero, but that would be unfair,
because the /dev/zero implementation uses __clear_user(), not
memset() and copy_to_user(). That optimization is specific to
/dev/zero, so the device isn't a meaningful test.
Channel 1 of cset 0 of zio-zero returns random numbers. It uses
get_random_bytes() like /dev/urandom does, so this is a fair
comparison.
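As a quick sanity check (purely optional, using the same char device
name that appears in the dd test below), we can dump one block and
verify it looks random:

dd bs=16 count=1 if=/dev/zzero-0-1-data 2>/dev/null | od -Ax -tx1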
Acquisition in ZIO is cset-wide, so we should disable the other
channels, to avoid the overhead of three blocks when we are only
interested in one of them:
echo 0 > /sys/zio/devices/zzero/cset0/chan0/enable
echo 0 > /sys/zio/devices/zzero/cset0/chan2/enable
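Just as a convenience check (not required for the test), this shows
which channels remain enabled:

grep . /sys/zio/devices/zzero/cset0/chan*/enable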
The sample size in zio-zero is 1 byte, and the default block size is
16 samples (see /sys/zio/devices/zzero/cset0/trigger/nsamples). So
we can read 1 million times 16 bytes from /dev/urandom:
spusa.root# dd bs=16 count=1000000 if=/dev/urandom > /dev/null
1000000+0 records in
1000000+0 records out
16000000 bytes (16 MB) copied, 2.11017 s, 7.6 MB/s
We can then do the same with the ZIO device:
spusa.root# dd bs=16 count=1000000 if=/dev/zzero-0-1-data > /dev/null
1000000+0 records in
1000000+0 records out
16000000 bytes (16 MB) copied, 2.46607 s, 6.5 MB/s
The difference is .355 seconds over one million blocks, which means
.355 microseconds per block.
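For reference, the arithmetic behind that figure (the time difference
in seconds, turned into microseconds and divided by the one million
blocks transferred), worked out with bc:

echo "scale=3; (2.46607 - 2.11017) * 1000000 / 1000000" | bc   # prints .355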
I repeated the test several times, and picked one around the middle.
The oscillation between runs, on an unloaded machine, is within 1%.
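To see that oscillation yourself, a trivial loop is enough (the run
count of 5 is arbitrary):

for i in 1 2 3 4 5; do
    dd bs=16 count=1000000 if=/dev/zzero-0-1-data of=/dev/null 2>&1 | grep copied
done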
The vmalloc buffer
The new vmalloc buffer does even better: there is no need to read()
data, just build a pointer to it. The difference is not very big with
/dev/urandom, because generating the data takes most of the
processing time within the test.
The suggested test here is the following:
size=16; while [ $size -lt 64000 ]; do
    echo
    echo $size
    echo $size > /sys/zio/devices/zzero/cset0/trigger/nsamples
    n=$(expr 16 \* 1048576 / $size)
    dd bs=$size count=$n if=/dev/urandom of=/dev/null 2>&1 | grep copied
    /tmp/zio-cat-file /dev/zzero-0-1-data $n > /dev/null
    size=$(expr $size \* 2)
done
This is done twice, using the two buffers available:
echo vmalloc > /sys/zio/devices/zzero/cset0/current_buffer
echo kmalloc > /sys/zio/devices/zzero/cset0/current_buffer
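Putting it together, and assuming the loop above has been saved as a
script (the name zio-read-test.sh is just a placeholder), the two
passes can be run as:

for buf in vmalloc kmalloc; do
    echo $buf > /sys/zio/devices/zzero/cset0/current_buffer
    sh ./zio-read-test.sh
done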
And the result is plotted in the following figure.
The first numbers obtained, for 1048576 reads of 16 bytes, are (times
in seconds):
dd: 2.213750
kmalloc: 2.763513
vmalloc: 2.603158
This means that the overhead of reading the full block (both control
and data) is .52 microseconds per block more than a plain dd read,
while reading the control and accessing the data with mmap costs
0.37 microseconds more per block.
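For reference, these per-block figures follow from the timings above
(times in seconds, 1048576 blocks, results in microseconds):

echo "scale=2; (2.763513 - 2.213750) * 1000000 / 1048576" | bc   # kmalloc: .52
echo "scale=2; (2.603158 - 2.213750) * 1000000 / 1048576" | bc   # vmalloc: .37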
For large data sizes, the advantage of accessing data instead of
reading it will be greater than this per-block overhead, because the
payload is not copied at all. However, I have made no measurements of
that so far.