Some performance measurements
The most surprising thing about ZIO for new users is the idea of a
"block" and a "control" (the 512-byte thing) attached to every data
burst. This is often perceived as a serious overhead.
I've made some performance measurements on a recent PC-class computer.
In short, the overhead of bringing the block over the whole pipeline
(device, trigger, buffer, char device) is a fraction of a microsecond
per block.
Measuring the overhead
With zio-zero.ko and the transparent buffer (called user, and now the
default at ZIO initialization), we can read or write huge amounts of
data. We might compare with /dev/zero, but that would be unfair,
because the /dev/zero implementation uses __clear_user(), not
memset() and copy_to_user(). That optimization is specific to
/dev/zero, so the device isn't a meaningful test.
Channel 1 of cset 0 of zio-zero returns random numbers. It uses
get_random_bytes() like /dev/urandom does, so this is a fair
comparison.
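As a quick sanity check (purely optional, using the same char device
name that appears in the dd test below), we can dump one block and
verify it looks random:

dd bs=16 count=1 if=/dev/zzero-0-1-data 2>/dev/null | od -Ax -tx1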
Acquisition in ZIO is cset-wide, so we should disable the other
channels, to avoid the overhead of three blocks when we are only
interested in one of them:
echo 0 > /sys/zio/devices/zzero/cset0/chan0/enable
echo 0 > /sys/zio/devices/zzero/cset0/chan2/enable
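Just as a convenience check (not required for the test), this shows
which channels remain enabled:

grep . /sys/zio/devices/zzero/cset0/chan*/enable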
The sample size in zio-zero is 1 byte, and the default block size is
16 samples (see /sys/zio/devices/zzero/cset0/trigger/nsamples). So
we can read 1 million times 16 bytes from /dev/urandom:
spusa.root# dd bs=16 count=1000000 if=/dev/urandom > /dev/null
1000000+0 records in
1000000+0 records out
16000000 bytes (16 MB) copied, 2.11017 s, 7.6 MB/s
We can then do the same with the ZIO device:
spusa.root# dd bs=16 count=1000000 if=/dev/zzero-0-1-data > /dev/null
1000000+0 records in
1000000+0 records out
16000000 bytes (16 MB) copied, 2.46607 s, 6.5 MB/s
The difference is .355 seconds over one million blocks, which means
.355 microseconds per block.
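For reference, the arithmetic behind that figure (the time difference
in seconds, turned into microseconds and divided by the one million
blocks transferred), worked out with bc:

echo "scale=3; (2.46607 - 2.11017) * 1000000 / 1000000" | bc   # prints .355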
I repeated the test several times, and picked one around the middle.
The oscillation between runs, on an unloaded machine, is within 1%.
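To see that oscillation yourself, a trivial loop is enough (the run
count of 5 is arbitrary):

for i in 1 2 3 4 5; do
    dd bs=16 count=1000000 if=/dev/zzero-0-1-data of=/dev/null 2>&1 | grep copied
done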
The vmalloc buffer
The new vmalloc buffer does even better: there is no need to read()
data, just build a pointer to it. The difference is not very big with
/dev/urandom, because generating the data takes most of the
processing time within the test.
The suggested test here is the following:
size=16; while [ $size -lt 64000 ]; do
    echo
    echo $size
    echo $size > /sys/zio/devices/zzero/cset0/trigger/nsamples
    n=$(expr 16 \* 1048576 / $size)
    dd bs=$size count=$n if=/dev/urandom of=/dev/null 2>&1 | grep copied
    /tmp/zio-cat-file /dev/zzero-0-1-data $n > /dev/null
    size=$(expr $size \* 2)
done
This is done twice, using the two buffers available:
echo vmalloc > /sys/zio/devices/zzero/cset0/current_buffer
echo kmalloc > /sys/zio/devices/zzero/cset0/current_buffer
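Putting it together, and assuming the loop above has been saved as a
script (the name zio-read-test.sh is just a placeholder), the two
passes can be run as:

for buf in vmalloc kmalloc; do
    echo $buf > /sys/zio/devices/zzero/cset0/current_buffer
    sh ./zio-read-test.sh
done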
And the result is plotted in the following figure.
The first numbers obtained, for 1048576 reads of 16 bytes, are (times
in seconds):
dd: 2.213750
kmalloc: 2.763513
vmalloc: 2.603158
This means that the overhead of reading the full block (both control
and data) is .52 microseconds per block more than a plain dd read,
while reading the control and accessing the data with mmap costs
0.37 microseconds more per block.
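For reference, these per-block figures follow from the timings above
(times in seconds, 1048576 blocks, results in microseconds):

echo "scale=2; (2.763513 - 2.213750) * 1000000 / 1048576" | bc   # kmalloc: .52
echo "scale=2; (2.603158 - 2.213750) * 1000000 / 1048576" | bc   # vmalloc: .37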
For large data sizes, the advantage of accessing data instead of
reading it will be greater than this per-block overhead, because the
payload is not copied at all. However, I have made no measurements of
that so far.