Discussion:
[dm-devel] Announcement: STEC EnhanceIO SSD caching software for Linux kernel
Amit Kale
2013-01-17 09:52:00 UTC
Hi Joe, Kent,

[Adding Kent as well since bcache is mentioned below as one of the contenders for being integrated into mainline kernel.]

My understanding is that these three caching solutions all have four principal blocks.
1. A cache block lookup - This refers to finding out whether a block was cached or not and the location on SSD, if it was.
2. Block replacement policy - This refers to the algorithm for replacing a block when a new free block can't be found.
3. IO handling - This is about issuing IO requests to SSD and HDD.
4. Dirty data clean-up algorithm (for write-back only) - The dirty data clean-up algorithm decides when to write a dirty block in an SSD to its original location on HDD and executes the copy.
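Purely as an illustration of how these four blocks fit together (this is a toy model, not code from bcache, dm-cache, or EnhanceIO; the class and names are invented), a minimal write-back cache might look like:

```python
class ToyCache:
    """Toy write-back cache: lookup, LRU replacement, IO dispatch, clean-up."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.map = {}          # 1. lookup: HDD block -> SSD slot
        self.lru = []          # 2. replacement policy state (LRU order)
        self.dirty = set()     # 4. blocks awaiting write-back to HDD

    def _touch(self, block):
        if block in self.lru:
            self.lru.remove(block)
        self.lru.append(block)

    def _allocate(self, block):
        if len(self.map) >= self.capacity:
            # 2. replacement: evict the least-recently-used clean block
            victim = next(b for b in self.lru if b not in self.dirty)
            self.lru.remove(victim)
            slot = self.map.pop(victim)
        else:
            slot = len(self.map)
        self.map[block] = slot
        self._touch(block)
        return slot

    def read(self, block):
        # 1. cache block lookup
        if block in self.map:
            self._touch(block)
            return ("ssd", self.map[block])      # 3. hit: IO goes to SSD
        slot = self._allocate(block)
        return ("hdd->ssd", slot)                # 3. miss: read HDD, fill SSD

    def write(self, block):
        if block not in self.map:
            self._allocate(block)
        self.dirty.add(block)                    # write-back: HDD copy is stale
        self._touch(block)
        return ("ssd", self.map[block])

    def clean_up(self):
        # 4. dirty data clean-up: copy dirty blocks back to HDD
        cleaned = sorted(self.dirty)
        self.dirty.clear()
        return cleaned
```

The real solutions differ mainly in how they implement each of these blocks, which is what the comparison below is about.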

When comparing the three solutions we need to consider these aspects.
1. User interface - This consists of commands used by users for creating, deleting, editing properties and recovering from error conditions.
2. Software interface - Where it interfaces to Linux kernel and applications.
3. Availability - What's the downtime when adding, deleting caches, making changes to cache configuration, conversion between cache modes, recovering after a crash, recovering from an error condition.
4. Security - Security holes, if any.
5. Portability - Which HDDs, SSDs, partitions, other block devices it works with.
6. Persistence of cache configuration - Once created does the cache configuration stay persistent across reboots. How are changes in device sequence or numbering handled.
7. Persistence of cached data - Does cached data remain across reboots/crashes/intermittent failures. Is the "sticky"ness of data configurable.
8. SSD life - Projected SSD life. Does the caching solution cause too much of write amplification leading to an early SSD failure.
9. Performance - Throughput is generally most important. Latency is another performance comparison point. Performance under different load classes can be measured.
10. ACID properties - Atomicity, Consistency, Isolation, Durability. Does the caching solution have these typical transactional database or filesystem properties? This includes avoiding the torn-page problem in crash and failure scenarios.
11. Error conditions - Handling power failures, intermittent and permanent device failures.
12. Configuration parameters for tuning according to applications.

We'll soon document EnhanceIO behavior in the context of these aspects. We'd appreciate it if dm-cache and bcache are also documented.

When comparing performance there are three levels at which it can be measured
1. Architectural elements
1.1. Throughput for 100% cache hit case (in absence of dirty data clean-up)
1.2. Throughput for 0% cache hit case (in absence of dirty data clean-up)
1.3. Dirty data clean-up rate (in absence of IO)
2. Performance of architectural elements combined
2.1. Varying mix of read/write, sustained performance.
3. Application level testing - The more real-life like benchmark we work with, the better it is.

Thanks.
-Amit
-----Original Message-----
Sent: Wednesday, January 16, 2013 4:16 PM
To: device-mapper development
Cc: Mike Snitzer; LKML
Subject: Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching
software for Linux kernel
Hi Amit,
I'll look through EnhanceIO this week.
There are several cache solutions out there; bcache, my dm-cache and
EnhanceIO seeming to be the favourites. I suspect none of them are
without drawbacks, so I'd like to see if we can maybe work together.
I think the first thing we need to do is make it easy to compare the
performance of these impls.
I'll create a branch in my github tree with all three caches in. So
it's easy to build a kernel with them. (Mike's already combined
dm-cache and bcache and done some preliminary testing).
We've got some small test scenarios in our test suite that we run [1].
They certainly flatter dm-cache since it was developed using these.
It would be really nice if you could describe and provide scripts for
your test scenarios. I'll integrate them with the test suite, and then
I can have some confidence that I'm seeing EnhanceIO in its best light.
The 'transparent' cache issue is a valid one, but to be honest a bit
orthogonal to cache. Integrating dm more closely with the block layer
such that a dm stack can replace any device has been discussed for
years and I know Alasdair has done some preliminary design work on
this. Perhaps we can use your requirement to bump up the priority on
this work.
5. We have designed our writeback architecture from scratch.
Coalescing/bunching together of metadata writes and cleanup is much
improved after redesigning of the EnhanceIO-SSD interface. The DM
interface would have been too restrictive for this. EnhanceIO uses
set
level locking, which improves parallelism of IO, particularly for
writeback.
I sympathise with this; dm-cache would also like to see a higher level
view of the io, rather than being given the ios to remap one by one.
Let's start by working out how much of a benefit you've gained from
this and then go from there.
PROPRIETARY-CONFIDENTIAL INFORMATION INCLUDED
This electronic transmission, and any documents attached hereto, may
contain confidential, proprietary and/or legally privileged
information. The information is intended only for use by the
recipient
named above. If you received this electronic message in error, please
notify the sender and delete the electronic message. Any disclosure,
copying, distribution, or use of the contents of information received
in error is strictly prohibited, and violators will be pursued
legally.
Please do not use this signature when sending to dm-devel. If there's
proprietary information in the email you need to tell people up front
so they can choose not to read it.
- Joe
[1] https://github.com/jthornber/thinp-test-suite/tree/master/tests/cache
Kent Overstreet
2013-01-17 11:39:40 UTC
Suppose I could fill out the bcache version...
Post by Amit Kale
Hi Joe, Kent,
[Adding Kent as well since bcache is mentioned below as one of the contenders for being integrated into mainline kernel.]
My understanding is that these three caching solutions all have four principal blocks.
1. A cache block lookup - This refers to finding out whether a block was cached or not and the location on SSD, if it was.
2. Block replacement policy - This refers to the algorithm for replacing a block when a new free block can't be found.
3. IO handling - This is about issuing IO requests to SSD and HDD.
4. Dirty data clean-up algorithm (for write-back only) - The dirty data clean-up algorithm decides when to write a dirty block in an SSD to its original location on HDD and executes the copy.
When comparing the three solutions we need to consider these aspects.
1. User interface - This consists of commands used by users for creating, deleting, editing properties and recovering from error conditions.
2. Software interface - Where it interfaces to Linux kernel and applications.
Both done with sysfs, at least for now.
Post by Amit Kale
3. Availability - What's the downtime when adding, deleting caches, making changes to cache configuration, conversion between cache modes, recovering after a crash, recovering from an error condition.
All of that is done at runtime, without any interruption. bcache doesn't
distinguish between clean and unclean shutdown, which is nice because it
means the recovery code gets tested. Registering a cache device takes on
the order of half a second, for a large (half terabyte) cache.
Post by Amit Kale
4. Security - Security holes, if any.
Hope there aren't any!
Post by Amit Kale
5. Portability - Which HDDs, SSDs, partitions, other block devices it works with.
Any block device.
Post by Amit Kale
6. Persistence of cache configuration - Once created does the cache configuration stay persistent across reboots. How are changes in device sequence or numbering handled.
Persistent. Device nodes are not stable across reboots, same as say scsi
devices if they get probed in a different order. It does persist a label
in the backing device superblock which can be used to implement stable
device nodes.
Post by Amit Kale
7. Persistence of cached data - Does cached data remain across reboots/crashes/intermittent failures. Is the "sticky"ness of data configurable.
Persists across reboots. Can't be switched off, though it could be if
there was any demand.
Post by Amit Kale
8. SSD life - Projected SSD life. Does the caching solution cause too much of write amplification leading to an early SSD failure.
With LRU, there's only so much you can do to work around the SSD's FTL,
though bcache does try; allocation is done in terms of buckets, which
are on the order of a megabyte (configured when you format the cache
device). Buckets are written to sequentially, then rewritten later all
at once (and it'll issue a discard before rewriting a bucket if you flip
it on, it's not on by default because TRIM = slow).

Bcache also implements fifo cache replacement, and with that write
amplification should never be an issue with that.
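The bucket scheme described above can be sketched as a toy simulation (illustrative only; the class, sizes, and behaviour here are invented stand-ins, not bcache's actual allocator): each bucket is filled sequentially, and whole buckets are reclaimed in FIFO order, so the SSD sees large sequential writes rather than scattered overwrites.

```python
from collections import deque

class BucketAllocator:
    """Toy model of bucket-based allocation with FIFO bucket reuse."""

    def __init__(self, n_buckets, bucket_size):
        self.bucket_size = bucket_size
        self.free = deque(range(n_buckets))   # FIFO reuse order
        self.current = None
        self.fill = 0

    def alloc_sector(self):
        # Open a fresh bucket when there is none, or the current one is full.
        if self.current is None or self.fill == self.bucket_size:
            self.current = self.free.popleft()   # reclaim the oldest bucket
            self.free.append(self.current)       # it rejoins the back of the FIFO
            self.fill = 0
        offset = self.fill
        self.fill += 1
        return (self.current, offset)            # (bucket, offset within bucket)
```

Note how consecutive allocations land at consecutive offsets within one bucket; only when a bucket fills does the allocator jump, which is the friendly pattern for an FTL.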
Post by Amit Kale
9. Performance - Throughput is generally most important. Latency is also one more performance comparison point. Performance under different load classes can be measured.
10. ACID properties - Atomicity, Consistency, Isolation, Durability. Does the caching solution have these typical transactional database or filesystem properties? This includes avoiding the torn-page problem in crash and failure scenarios.
Yes.
Post by Amit Kale
11. Error conditions - Handling power failures, intermittent and permanent device failures.
Power failures and device failures yes, intermittent failures are not
explicitly handled.
Post by Amit Kale
12. Configuration parameters for tuning according to applications.
Lots. The most important one is probably sequential bypass - you don't
typically want to cache your big sequential IO, because rotating disks
do fine at that. So bcache detects sequential IO and bypasses it with a
configurable threshold.

There's also stuff for bypassing more data if the SSD is overloaded - if
you're caching many disks with a single SSD, you don't want the SSD to
be the bottleneck. So it tracks latency to the SSD and cranks down the
sequential bypass threshold if it gets too high.
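The two mechanisms just described - sequential detection with a bypass threshold, and cranking that threshold down under SSD congestion - can be sketched as a toy model (illustrative only; the class name, thresholds, and halving rule are invented, not bcache's actual values or code):

```python
class SequentialBypass:
    """Toy model: route sequential streams to the HDD, and shrink the
    bypass threshold when SSD latency climbs."""

    def __init__(self, threshold_sectors=256, latency_limit_us=1000):
        self.threshold = threshold_sectors
        self.latency_limit = latency_limit_us
        self.last_end = None
        self.run_length = 0

    def route(self, lba, sectors, ssd_latency_us):
        # Track contiguity: does this IO start where the last one ended?
        if lba == self.last_end:
            self.run_length += sectors
        else:
            self.run_length = sectors
        self.last_end = lba + sectors

        # Crank the threshold down when the SSD looks congested.
        if ssd_latency_us > self.latency_limit:
            self.threshold = max(8, self.threshold // 2)

        return "hdd" if self.run_length > self.threshold else "ssd"
```

Short random IOs stay on the SSD; once a contiguous run exceeds the (possibly congestion-reduced) threshold, the stream bypasses to the spindle.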
Post by Amit Kale
We'll soon document EnhanceIO behavior in context of these aspects. We'll appreciate if dm-cache and bcache is also documented.
When comparing performance there are three levels at which it can be measured
1. Architectural elements
1.1. Throughput for 100% cache hit case (in absence of dirty data clean-up)
North of a million iops.
Post by Amit Kale
1.2. Throughput for 0% cache hit case (in absence of dirty data clean-up)
Also relevant whether you're adding the data to the cache. I'm sure
bcache is slightly slower than the raw backing device here, but if it's
noticeable it's a bug (I haven't benchmarked that specifically in ages).
Post by Amit Kale
1.3. Dirty data clean-up rate (in absence of IO)
Background writeback is done by scanning the btree in the background for
dirty data, and then writing it out in lba order - so the writes are as
sequential as they're going to get. It's fast.
Post by Amit Kale
2. Performance of architectural elements combined
2.1. Varying mix of read/write, sustained performance.
Random write performance is definitely important, as there you've got to
keep an index up to date on stable storage (if you want to handle
unclean shutdown, anyways). Making that fast is non-trivial. Bcache is
about as efficient as you're going to get w.r.t. metadata writes,
though.
Post by Amit Kale
3. Application level testing - The more real-life like benchmark we work with, the better it is.
Amit Kale
2013-01-17 17:17:17 UTC
Thanks for a prompt reply.
Post by Kent Overstreet
Suppose I could fill out the bcache version...
Post by Amit Kale
Hi Joe, Kent,
[Adding Kent as well since bcache is mentioned below as one of the
contenders for being integrated into mainline kernel.]
My understanding is that these three caching solutions all have four
principal blocks.
Post by Amit Kale
1. A cache block lookup - This refers to finding out whether a block
was cached or not and the location on SSD, if it was.
Post by Amit Kale
2. Block replacement policy - This refers to the algorithm for
replacing a block when a new free block can't be found.
Post by Amit Kale
3. IO handling - This is about issuing IO requests to SSD and HDD.
4. Dirty data clean-up algorithm (for write-back only) - The dirty
data clean-up algorithm decides when to write a dirty block in an SSD
to its original location on HDD and executes the copy.
Post by Amit Kale
When comparing the three solutions we need to consider these aspects.
1. User interface - This consists of commands used by users for
creating, deleting, editing properties and recovering from error
conditions.
Post by Amit Kale
2. Software interface - Where it interfaces to Linux kernel and
applications.
Both done with sysfs, at least for now.
sysfs is the user interface. Bcache creates a new block device, so it interfaces to the Linux kernel at the block device layer. The HDD and SSD interfaces would be via submit_bio (please correct me if this is wrong).
Post by Kent Overstreet
Post by Amit Kale
3. Availability - What's the downtime when adding, deleting caches,
making changes to cache configuration, conversion between cache modes,
recovering after a crash, recovering from an error condition.
All of that is done at runtime, without any interruption. bcache
doesn't distinguish between clean and unclean shutdown, which is nice
because it means the recovery code gets tested. Registering a cache
device takes on the order of half a second, for a large (half terabyte)
cache.
Since a new device is created, you need to bring down applications the first time a cache is created. From then on it would be online. Similarly, applications need to be brought down when deleting a cache. Fstab changes etc. also need to be done. My guess is all this requires some effort and understanding by a system administrator. Does fstab work without any manual editing if it contains labels instead of device paths?
Post by Kent Overstreet
Post by Amit Kale
4. Security - Security holes, if any.
Hope there aren't any!
All the three caches can be operated only as root. So as long as there are no bugs, there is no need to worry about security loopholes.
Post by Kent Overstreet
Post by Amit Kale
5. Portability - Which HDDs, SSDs, partitions, other block devices it
works with.
Any block device.
Post by Amit Kale
6. Persistence of cache configuration - Once created does the cache
configuration stay persistent across reboots. How are changes in device
sequence or numbering handled.
Persistent. Device nodes are not stable across reboots, same as say
scsi devices if they get probed in a different order. It does persist a
label in the backing device superblock which can be used to implement
stable device nodes.
Can this be embedded in a udev script so that the configuration becomes persistent regardless of probing order? What happens if either SSD or HDD are absent when a system comes up? Does it work with iSCSI HDDs? iSCSI HDDs can be tricky during shutdown, specifically if the iSCSI device goes offline before a cache saves metadata.
Post by Kent Overstreet
Post by Amit Kale
7. Persistence of cached data - Does cached data remain across
reboots/crashes/intermittent failures. Is the "sticky"ness of data
configurable.
Persists across reboots. Can't be switched off, though it could be if
there was any demand.
Believe me, enterprise customers do require a cache to be non-persistent. This is because of a paranoia that HDD and SSD may go out of sync after a shutdown and before a reboot. This is primarily in an environment with a large number of HDDs accessed through a complicated iSCSI based setup perhaps with software RAID.
Post by Kent Overstreet
Post by Amit Kale
8. SSD life - Projected SSD life. Does the caching solution cause too
much of write amplification leading to an early SSD failure.
With LRU, there's only so much you can do to work around the SSD's FTL,
though bcache does try; allocation is done in terms of buckets, which
are on the order of a megabyte (configured when you format the cache
device). Buckets are written to sequentially, then rewritten later all
at once (and it'll issue a discard before rewriting a bucket if you
flip it on, it's not on by default because TRIM = slow).
Bcache also implements fifo cache replacement, and with that write
amplification should never be an issue.
Most SSDs contain a fairly sophisticated FTL doing wear-leveling. Wear-leveling only helps by evenly balancing over-writes across an entire SSD. Do you have statistics on how many SSD writes are generated per block read from or written to HDD? Metadata writes should be done only for the affected sectors, or else they contribute to more SSD-internal writes. There is also a common debate on whether writing a single sector is more beneficial compared to writing the whole block containing that sector.
Post by Kent Overstreet
Post by Amit Kale
9. Performance - Throughput is generally most important. Latency is
also one more performance comparison point. Performance under different
load classes can be measured.
Post by Amit Kale
10. ACID properties - Atomicity, Consistency, Isolation, Durability.
Does the caching solution have these typical transactional database or
filesystem properties. This includes avoiding torn-page problem amongst
crash and failure scenarios.
Yes.
Post by Amit Kale
11. Error conditions - Handling power failures, intermittent and
permanent device failures.
Power failures and device failures yes, intermittent failures are not
explicitly handled.
The IO completion guarantee offered on intermittent failures should be as good as HDD.
Post by Kent Overstreet
Post by Amit Kale
12. Configuration parameters for tuning according to applications.
Lots. The most important one is probably sequential bypass - you don't
typically want to cache your big sequential IO, because rotating disks
do fine at that. So bcache detects sequential IO and bypasses it with a
configurable threshold.
There's also stuff for bypassing more data if the SSD is overloaded -
if you're caching many disks with a single SSD, you don't want the SSD
to be the bottleneck. So it tracks latency to the SSD and cranks down
the sequential bypass threshold if it gets too high.
That's interesting. I'll definitely want to read this part of the source code.
Post by Kent Overstreet
Post by Amit Kale
We'll soon document EnhanceIO behavior in context of these aspects.
We'll appreciate if dm-cache and bcache is also documented.
Post by Amit Kale
When comparing performance there are three levels at which it can be
measured 1. Architectural elements 1.1. Throughput for 100% cache hit
case (in absence of dirty data clean-up)
North of a million iops.
Post by Amit Kale
1.2. Throughput for 0% cache hit case (in absence of dirty data clean-up)
Also relevant whether you're adding the data to the cache. I'm sure
bcache is slightly slower than the raw backing device here, but if it's
noticeable it's a bug (I haven't benchmarked that specifically in ages).
Post by Amit Kale
1.3. Dirty data clean-up rate (in absence of IO)
Background writeback is done by scanning the btree in the background
for dirty data, and then writing it out in lba order - so the writes
are as sequential as they're going to get. It's fast.
Great.

Thanks.
-Amit
Post by Kent Overstreet
Post by Amit Kale
2. Performance of architectural elements combined 2.1. Varying mix of
read/write, sustained performance.
Random write performance is definitely important, as there you've got
to keep an index up to date on stable storage (if you want to handle
unclean shutdown, anyways). Making that fast is non trivial. Bcache is
about as efficient as you're going to get w.r.t. metadata writes,
though.
Post by Amit Kale
3. Application level testing - The more real-life like benchmark we
work with, the better it is.
Kent Overstreet
2013-01-24 23:45:24 UTC
Post by Kent Overstreet
Suppose I could fill out the bcache version...
Post by Amit Kale
11. Error conditions - Handling power failures, intermittent and permanent device failures.
Power failures and device failures yes, intermittent failures are not
explicitly handled.
A coworker pointed out that bcache actually does handle some intermittent IO errors. I
just added error handling to the documentation:
http://atlas.evilpiepirate.org/git/linux-bcache.git/tree/Documentation/bcache.txt?h=bcache-dev

To cut and paste,

Bcache tries to transparently handle IO errors to/from the cache device without
affecting normal operation; if it sees too many errors (the threshold is
configurable, and defaults to 0) it shuts down the cache device and switches all
the backing devices to passthrough mode.

- For reads from the cache, if they error we just retry the read from the
backing device.

- For writethrough writes, if the write to the cache errors we just switch to
invalidating the data at that lba in the cache (i.e. the same thing we do for
a write that bypasses the cache)

- For writeback writes, we currently pass that error back up to the
filesystem/userspace. This could be improved - we could retry it as a write
that skips the cache so we don't have to error the write.

- When we detach, we first try to flush any dirty data (if we were running in
writeback mode). It currently doesn't do anything intelligent if it fails to
read some of the dirty data, though.
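The four documented cases above can be summarised as a small dispatch table (a sketch only - the function, action names, and error-count model are invented for illustration, not bcache internals):

```python
def handle_cache_error(op, error_count, threshold=0):
    """Toy dispatch mirroring the documented bcache error cases.

    If the error count exceeds the (configurable, default 0) threshold,
    the cache device shuts down and backing devices go passthrough;
    otherwise recovery depends on the kind of IO that failed.
    """
    if error_count > threshold:
        return "shutdown-cache-and-go-passthrough"
    actions = {
        "read": "retry-from-backing-device",
        "writethrough-write": "invalidate-lba-in-cache",
        "writeback-write": "propagate-error-upward",  # could retry, bypassing cache
    }
    return actions[op]
```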
Joe Thornber
2013-01-17 13:26:21 UTC
Post by Amit Kale
Hi Joe, Kent,
[Adding Kent as well since bcache is mentioned below as one of the contenders for being integrated into mainline kernel.]
My understanding is that these three caching solutions all have four principal blocks.
Let me try and explain how dm-cache works.
Post by Amit Kale
1. A cache block lookup - This refers to finding out whether a block was cached or not and the location on SSD, if it was.
Of course we have this, but it's part of the policy plug-in. I've
done this because the policy nearly always needs to do some book
keeping (eg, update a hit count when accessed).
Post by Amit Kale
2. Block replacement policy - This refers to the algorithm for replacing a block when a new free block can't be found.
I think there's more than just this. These are the tasks that I hand
over to the policy:

a) _Which_ blocks should be promoted to the cache. This seems to be
the key decision in terms of performance. Blindly trying to
promote every io or even just every write will lead to some very
bad performance in certain situations.

The mq policy uses a multiqueue (effectively a partially sorted
lru list) to keep track of candidate block hit counts. When
candidates get enough hits they're promoted. The promotion
threshold is periodically recalculated by looking at the hit
counts for the blocks already in the cache.

The hit counts should degrade over time (for some definition of
time; eg. io volume). I've experimented with this, but not yet
come up with a satisfactory method.

I read through EnhanceIO yesterday, and think this is where
you're lacking.

b) When should a block be promoted. If you're swamped with io, then
adding copy io is probably not a good idea. Current dm-cache
just has a configurable threshold for the promotion/demotion io
volume. If you or Kent have some ideas for how to approximate
the bandwidth of the devices I'd really like to hear about it.

c) Which blocks should be demoted?

This is the bit that people commonly think of when they say
'caching algorithm'. Examples are lru, arc, etc. Such
descriptions are fine when describing a cache where elements
_have_ to be promoted before they can be accessed, for example a
cpu memory cache. But we should be aware that 'lru' for example
really doesn't tell us much in the context of our policies.

The mq policy uses a blend of lru and lfu for eviction, it seems
to work well.
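The promotion half of this, (a), can be sketched as a toy model (illustrative only; none of this is dm-cache's actual mq code, and the threshold recalculation is an invented stand-in): candidate blocks accumulate hit counts, and a block is promoted once it crosses the current threshold.

```python
from collections import defaultdict

class MQPromoter:
    """Toy multiqueue-style promotion: count hits on candidate blocks
    and promote once a block crosses the current threshold."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        self.candidate_hits = defaultdict(int)   # blocks not yet in the cache
        self.cached_hits = {}                    # blocks already in the cache

    def access(self, block):
        if block in self.cached_hits:
            self.cached_hits[block] += 1
            return "hit"
        self.candidate_hits[block] += 1
        if self.candidate_hits[block] >= self.threshold:
            del self.candidate_hits[block]
            self.cached_hits[block] = 0
            return "promote"
        return "miss"

    def recalc_threshold(self):
        # Stand-in for the periodic recalculation: base the promotion
        # threshold on the average hit count of blocks already cached.
        if self.cached_hits:
            avg = sum(self.cached_hits.values()) / len(self.cached_hits)
            self.threshold = max(1, int(avg))
```

A real policy also has to make the counts degrade over time, which (as noted above) is the part without a satisfactory answer yet.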

A couple of other things I should mention; dm-cache uses a large block
size compared to eio. eg, 64k - 1m. This is a mixed blessing;

- our copy io is more efficient (we don't have to worry about
batching migrations together so much. Something eio is careful to
do).

- we have fewer blocks to hold stats about, so can keep more info per
block in the same amount of memory.

- We trigger more copying. For example if an incoming write triggers
a promotion from the origin to the cache, and the io covers a block
we can avoid any copy from the origin to cache. With a bigger
block size this optimisation happens less frequently.

- We waste SSD space. eg, a 4k hotspot could trigger a whole block
to be moved to the cache.


We do not keep the dirty state of cache blocks up to date on the
metadata device. Instead we have a 'mounted' flag that's set in the
metadata when opened. When a clean shutdown occurs (eg, dmsetup
suspend my-cache) the dirty bits are written out and the mounted flag
cleared. On a crash the mounted flag will still be set on reopen and
every cache block degrades to 'dirty'. Correct me if I'm wrong, but I
think eio is holding io completion until the dirty bits have been
committed to disk?
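The mounted-flag scheme can be sketched as follows (a toy model; the field names and dict-based metadata are invented for illustration, not dm-cache's on-disk format):

```python
def open_cache(metadata):
    """Toy model of the 'mounted' flag: if the previous shutdown was
    unclean, the saved dirty bitmap cannot be trusted, so every block
    degrades to dirty and must be written back."""
    if metadata["mounted"]:                      # crash: flag was never cleared
        dirty = set(metadata["all_blocks"])      # assume everything is dirty
    else:
        dirty = set(metadata["dirty_blocks"])    # clean shutdown: trust the bitmap
    metadata["mounted"] = True                   # mark this session as open
    return dirty

def close_cache(metadata, dirty):
    """Clean shutdown: persist the dirty bits, then clear the flag."""
    metadata["dirty_blocks"] = sorted(dirty)
    metadata["mounted"] = False
```

The pay-off is that dirty bits never need to be committed synchronously during normal IO; the cost is that a crash forces a full write-back pass.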

I really view dm-cache as a slow moving hotspot optimiser. Whereas I
think eio and bcache are much more of a hierarchical storage approach,
where writes go through the cache if possible?
Post by Amit Kale
3. IO handling - This is about issuing IO requests to SSD and HDD.
I get most of this for free via dm and kcopyd. I'm really keen to
see how bcache does; it's more invasive of the block layer, so I'm
expecting it to show far better performance than dm-cache.
Post by Amit Kale
4. Dirty data clean-up algorithm (for write-back only) - The dirty
data clean-up algorithm decides when to write a dirty block in an
SSD to its original location on HDD and executes the copy.

Yep.
Post by Amit Kale
When comparing the three solutions we need to consider these aspects.
1. User interface - This consists of commands used by users for
creating, deleting, editing properties and recovering from error
conditions.

I was impressed how easy eio was to use yesterday when I was playing
with it. Well done.

Driving dm-cache through dm-setup isn't much more of a hassle
though. Though we've decided to pass policy specific params on the
target line, and tweak via a dm message (again simple via dmsetup).
I don't think this is as simple as exposing them through something
like sysfs, but it is more in keeping with the device-mapper way.
Post by Amit Kale
2. Software interface - Where it interfaces to Linux kernel and applications.
See above.
Post by Amit Kale
3. Availability - What's the downtime when adding, deleting caches,
making changes to cache configuration, conversion between cache
modes, recovering after a crash, recovering from an error condition.

Normal dm suspend, alter table, resume cycle. The LVM tools do this
all the time.
Post by Amit Kale
4. Security - Security holes, if any.
Well I saw the comment in your code describing the security flaw you
think you've got. I hope we don't have any, I'd like to understand
your case more.
Post by Amit Kale
5. Portability - Which HDDs, SSDs, partitions, other block devices it works with.
I think we all work with any block device. But eio and bcache can
overlay any device node, not just a dm one. As mentioned in earlier
email I really think this is a dm issue, not specific to dm-cache.
Post by Amit Kale
6. Persistence of cache configuration - Once created does the cache
configuration stay persistent across reboots. How are changes in
device sequence or numbering handled.

We've gone for no persistence of policy parameters. Instead
everything is handed into the kernel when the target is setup. This
decision was made by the LVM team who wanted to store this
information themselves (we certainly shouldn't store it in two
places at once). I don't feel strongly either way, and could
persist the policy params v. easily (eg, 1 days work).

One thing I do provide is a 'hint' array for the policy to use and
persist. The policy specifies how much data it would like to store
per cache block, and then writes it on clean shutdown (hence 'hint',
it has to cope without this, possibly with temporarily degraded
performance). The mq policy uses the hints to store hit counts.
Post by Amit Kale
7. Persistence of cached data - Does cached data remain across
reboots/crashes/intermittent failures. Is the "sticky"ness of data
configurable.

Surely this is a given? A cache would be trivial to write if it
didn't need to be crash proof.
Post by Amit Kale
8. SSD life - Projected SSD life. Does the caching solution cause
too much of write amplification leading to an early SSD failure.

No, I decided years ago that life was too short to start optimising
for specific block devices. By the time you get it right the
hardware characteristics will have moved on. Doesn't the firmware
on SSDs try and even out io wear these days?

That said I think we evenly use the SSD. Except for the superblock
on the metadata device.
Post by Amit Kale
9. Performance - Throughput is generally most important. Latency is
also one more performance comparison point. Performance under
different load classes can be measured.

I think latency is more important than throughput. Spindles are
pretty good at throughput. In fact the mq policy tries to spot when
we're doing large linear ios and stops hit counting; best leave this
stuff on the spindle.
Post by Amit Kale
10. ACID properties - Atomicity, Consistency, Isolation,
Durability. Does the caching solution have these typical
transactional database or filesystem properties. This includes
avoiding torn-page problem amongst crash and failure scenarios.

Could you expand on the torn-page issue please?
Post by Amit Kale
11. Error conditions - Handling power failures, intermittent and permanent device failures.
I think the area where dm-cache is currently lacking is intermittent
failures. For example if a cache read fails we just pass that error
up, whereas eio sees if the block is clean and if so tries to read
off the origin. I'm not sure which behaviour is correct; I like to
know about disk failure early.
Post by Amit Kale
12. Configuration parameters for tuning according to applications.
Discussed above.
Post by Amit Kale
We'll soon document EnhanceIO behavior in context of these
aspects. We'll appreciate if dm-cache and bcache is also documented.

I hope the above helps. Please ask away if you're unsure about
something.
Post by Amit Kale
When comparing performance there are three levels at which it can be measured
Developing these caches is tedious. Test runs take time, and really
slow the dev cycle down. So I suspect we've all been using
microbenchmarks that run in a few minutes.

Let's get our pool of microbenchmarks together, then work on some
application level ones (we're happy to put some time into developing
these).

- Joe
Amit Kale
2013-01-17 17:53:11 UTC
Permalink
Post by Amit Kale
Post by Amit Kale
Hi Joe, Kent,
[Adding Kent as well since bcache is mentioned below as one of the
contenders for being integrated into mainline kernel.]
My understanding is that these three caching solutions all have three
principle blocks.
Let me try and explain how dm-cache works.
Post by Amit Kale
1. A cache block lookup - This refers to finding out whether a block
was cached or not and the location on SSD, if it was.
Of course we have this, but it's part of the policy plug-in. I've done
this because the policy nearly always needs to do some bookkeeping
(eg, update a hit count when accessed).
Post by Amit Kale
2. Block replacement policy - This refers to the algorithm for
replacing a block when a new free block can't be found.
I think there's more than just this. These are the tasks that I hand off to the policy:
a) _Which_ blocks should be promoted to the cache. This seems to be
the key decision in terms of performance. Blindly trying to
promote every io or even just every write will lead to some very
bad performance in certain situations.
The mq policy uses a multiqueue (effectively a partially sorted
lru list) to keep track of candidate block hit counts. When
candidates get enough hits they're promoted. The promotion
threshold is periodically recalculated by looking at the hit
counts for the blocks already in the cache.
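The hit-count promotion scheme described above can be sketched in a few lines. This is a toy model with illustrative names, not dm-cache's actual code:

```python
from collections import defaultdict

class PromotionPolicy:
    """Toy hit-count promoter: a block is promoted once its hit count
    crosses a threshold derived from the in-cache blocks' hit counts."""

    def __init__(self, initial_threshold=4):
        self.hits = defaultdict(int)   # candidate (origin) block hit counts
        self.cache_hits = {}           # hit counts for blocks already cached
        self.threshold = initial_threshold

    def record_access(self, block):
        """Return True if the access is (now) serviced from the cache."""
        if block in self.cache_hits:           # cache hit: just count it
            self.cache_hits[block] += 1
            return True
        self.hits[block] += 1
        if self.hits[block] >= self.threshold:
            # Promote: move the candidate's count into the cache table.
            self.cache_hits[block] = self.hits.pop(block)
            return True
        return False                           # serviced from the origin

    def recalc_threshold(self):
        # Periodically re-derive the threshold from the in-cache hit
        # counts, so promotion gets harder as the cache fills with hot
        # blocks.
        if self.cache_hits:
            avg = sum(self.cache_hits.values()) / len(self.cache_hits)
            self.threshold = max(2, int(avg))
```

A candidate needs `threshold` accesses before it is copied in; recalculating the threshold from the resident blocks is what makes a busy cache harder to enter.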
A multi-queue algorithm typically carries significant metadata overhead. What percentage overhead does that imply here?
Post by Amit Kale
The hit counts should degrade over time (for some definition of
time; eg. io volume). I've experimented with this, but not yet
come up with a satisfactory method.
I read through EnhanceIO yesterday, and think this is where
you're lacking.
We have an LRU policy at the cache-set level. The effectiveness of LRU depends on how long a block stays in the working dataset. If that duration is short enough that a block is usually "hit" again before it's evicted, LRU works better than other policies.
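A set-associative LRU of the kind described here might look like the following. This is a hedged sketch; the set geometry and names are invented for illustration, not EnhanceIO's code:

```python
from collections import OrderedDict

class SetAssocLRU:
    """Toy set-associative cache: a block maps to a set by block number,
    and each set evicts its least-recently-used entry when full."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set; insertion/move order tracks recency.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block):
        """Return True on a hit; on a miss, cache the block (evicting
        the set's LRU entry if necessary) and return False."""
        s = self.sets[block % self.num_sets]
        if block in s:
            s.move_to_end(block)      # refresh LRU position
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)     # evict least-recently-used
        s[block] = True
        return False
```

Because eviction only competes within one small set, the metadata per block stays tiny, which is part of the trade-off discussed above.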
Post by Amit Kale
b) When should a block be promoted. If you're swamped with io, then
adding copy io is probably not a good idea. Current dm-cache
just has a configurable threshold for the promotion/demotion io
volume. If you or Kent have some ideas for how to approximate
the bandwidth of the devices I'd really like to hear about it.
c) Which blocks should be demoted?
This is the bit that people commonly think of when they say
'caching algorithm'. Examples are lru, arc, etc. Such
descriptions are fine when describing a cache where elements
_have_ to be promoted before they can be accessed, for example a
cpu memory cache. But we should be aware that 'lru' for example
really doesn't tell us much in the context of our policies.
The mq policy uses a blend of lru and lfu for eviction, it seems
to work well.
A couple of other things I should mention; dm-cache uses a large block
size compared to eio. eg, 64k - 1m. This is a mixed blessing;
Yes. We had a lot of debate internally on the block size. For now we have restricted it to 2k, 4k and 8k. We found that larger block sizes result in too much internal fragmentation, in spite of a significant reduction in metadata size. 8k is adequate for Oracle and MySQL.
Post by Amit Kale
- our copy io is more efficient (we don't have to worry about
batching migrations together so much. Something eio is careful to
do).
- we have fewer blocks to hold stats about, so can keep more info per
block in the same amount of memory.
- We trigger more copying. For example if an incoming write triggers
a promotion from the origin to the cache, and the io covers a block
we can avoid any copy from the origin to cache. With a bigger
block size this optimisation happens less frequently.
- We waste SSD space. eg, a 4k hotspot could trigger a whole block
to be moved to the cache.
We do not keep the dirty state of cache blocks up to date on the
metadata device. Instead we have a 'mounted' flag that's set in the
metadata when opened. When a clean shutdown occurs (eg, dmsetup
suspend my-cache) the dirty bits are written out and the mounted flag
cleared. On a crash the mounted flag will still be set on reopen and
all dirty flags degrade to 'dirty'.
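As a rough illustration of the mounted-flag scheme just described (hypothetical names, not the actual dm-cache metadata format):

```python
class Metadata:
    """Toy model of the mounted-flag scheme: dirty bits are only written
    out on clean shutdown; after a crash, every resident block must be
    assumed dirty."""

    def __init__(self):
        self.mounted = False
        self.on_disk_dirty = {}   # dirty bits as stored on the metadata dev
        self.resident = set()     # blocks currently held in the cache

    def open(self):
        """Return the in-core dirty map to use after opening."""
        if self.mounted:
            # The flag was never cleared: we crashed, the on-disk dirty
            # bits are stale, so degrade every resident block to dirty.
            in_core = {b: True for b in self.resident}
        else:
            in_core = dict(self.on_disk_dirty)
        self.mounted = True
        return in_core

    def clean_shutdown(self, in_core_dirty):
        # Only now are the real dirty bits persisted and the flag cleared.
        self.on_disk_dirty = dict(in_core_dirty)
        self.mounted = False
```

The cost of the scheme is visible in the crash path: after an unclean shutdown everything resident gets written back, in exchange for not committing dirty bits on every IO.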
Not sure I understand this. Is there a guarantee that once an IO is reported as "done" to the upstream layer (filesystem/database/application), it is persistent? Persistence should be guaranteed even if the OS crashes immediately after the status is reported, and it should cover the entire IO range: the next time the application reads that data, it should get the updated data, not stale data.
Post by Amit Kale
Correct me if I'm wrong, but I
think eio is holding io completion until the dirty bits have been
committed to disk?
That's correct. In addition to this, we try to batch metadata updates if multiple IOs occur in the same cache set.
Post by Amit Kale
I really view dm-cache as a slow moving hotspot optimiser. Whereas I
think eio and bcache are much more of a hierarchical storage approach,
where writes go through the cache if possible?
Generally speaking, yes. EIO enforces dirty-data limits to avoid the situation where too much of the SSD is used for storing dirty data, which would reduce the effectiveness of the cache for reads.
Post by Amit Kale
Post by Amit Kale
3. IO handling - This is about issuing IO requests to SSD and HDD.
I get most of this for free via dm and kcopyd. I'm really keen to
see how bcache does; it's more invasive of the block layer, so I'm
expecting it to show far better performance than dm-cache.
Post by Amit Kale
4. Dirty data clean-up algorithm (for write-back only) - The dirty
data clean-up algorithm decides when to write a dirty block in an
SSD to its original location on HDD and executes the copy.
Yep.
Post by Amit Kale
When comparing the three solutions we need to consider these aspects.
1. User interface - This consists of commands used by users for
creating, deleting, editing properties and recovering from error
conditions.
I was impressed how easy eio was to use yesterday when I was playing
with it. Well done.
Driving dm-cache through dmsetup isn't much more of a hassle,
though. We've decided to pass policy-specific params on the
target line, and tweak via a dm message (again simple via dmsetup).
I don't think this is as simple as exposing them through something
like sysfs, but it is more in keeping with the device-mapper way.
You have the benefit of using a well-known dm interface.
Post by Amit Kale
Post by Amit Kale
2. Software interface - Where it interfaces to Linux kernel and
applications.
See above.
Post by Amit Kale
3. Availability - What's the downtime when adding, deleting caches,
making changes to cache configuration, conversion between cache
modes, recovering after a crash, recovering from an error condition.
Normal dm suspend, alter table, resume cycle. The LVM tools do this
all the time.
Cache creation and deletion will require stopping applications, unmounting filesystems, then remounting and restarting the applications. A sysadmin will additionally have to update fstab entries. Do fstab entries work automatically if they use labels instead of full device paths?

Same with changes to cache configuration.
Post by Amit Kale
Post by Amit Kale
4. Security - Security holes, if any.
Well I saw the comment in your code describing the security flaw you
think you've got. I hope we don't have any, I'd like to understand
your case more.
Could you elaborate on which comment you are referring to? Since all three caching solutions allow access only to the root user, my belief is that there are no security holes. I have listed it here because it's an important consideration for enterprise users.
Post by Amit Kale
Post by Amit Kale
5. Portability - Which HDDs, SSDs, partitions, other block devices it
works with.
I think we all work with any block device. But eio and bcache can
overlay any device node, not just a dm one. As mentioned in earlier
email I really think this is a dm issue, not specific to dm-cache.
DM was never meant to be cascaded. So it's ok for DM.

We recommend that our customers use RAID for the SSD when running write-back, because an SSD failure then leads to catastrophic data loss (the dirty data). We support using an md device as an SSD. There are some issues with md devices in the code published on github; I'll get back with a code fix next week.
Post by Amit Kale
Post by Amit Kale
6. Persistence of cache configuration - Once created does the cache
configuration stay persistent across reboots. How are changes in
device sequence or numbering handled.
We've gone for no persistence of policy parameters. Instead
everything is handed into the kernel when the target is setup. This
decision was made by the LVM team who wanted to store this
information themselves (we certainly shouldn't store it in two
places at once). I don't feel strongly either way, and could
persist the policy params v. easily (eg, a day's work).
Storing persistence information in a single place makes sense.
Post by Amit Kale
One thing I do provide is a 'hint' array for the policy to use and
persist. The policy specifies how much data it would like to store
per cache block, and then writes it on clean shutdown (hence 'hint',
it has to cope without this, possibly with temporarily degraded
performance). The mq policy uses the hints to store hit counts.
Post by Amit Kale
7. Persistence of cached data - Does cached data remain across
reboots/crashes/intermittent failures. Is the "sticky"ness of data
configurable.
Surely this is a given? A cache would be trivial to write if it
didn't need to be crash proof.
There has to be a way to make it either persistent or volatile, depending on what users want. Enterprise users are sometimes paranoid about the HDD and SSD going out of sync between a system shutdown and the next boot. This is typical of large, complicated iSCSI-based shared-HDD setups.
Post by Amit Kale
Post by Amit Kale
8. SSD life - Projected SSD life. Does the caching solution cause
too much of write amplification leading to an early SSD failure.
No, I decided years ago that life was too short to start optimising
for specific block devices. By the time you get it right the
hardware characteristics will have moved on. Doesn't the firmware
on SSDs try and even out io wear these days?
That's correct. We don't have to worry about wear leveling. All of the competent SSDs around do that.

What I wanted to bring up was how many SSD writes a cache read/write results in. Write-back mode is particularly taxing on SSDs in this respect.
Post by Amit Kale
That said I think we evenly use the SSD. Except for the superblock
on the metadata device.
Post by Amit Kale
9. Performance - Throughput is generally most important. Latency is
also one more performance comparison point. Performance under
different load classes can be measured.
I think latency is more important than throughput. Spindles are
pretty good at throughput. In fact the mq policy tries to spot when
we're doing large linear ios and stops hit counting; best leave this
stuff on the spindle.
I disagree. Latency is taken care of automatically when the number of application threads rises.
Post by Amit Kale
Post by Amit Kale
10. ACID properties - Atomicity, Concurrency, Idempotent,
Durability. Does the caching solution have these typical
transactional database or filesystem properties. This includes
avoiding torn-page problem amongst crash and failure scenarios.
Could you expand on the torn-page issue please?
Databases run into a torn-page error when an IO is found to be only partially written although it was supposed to be fully written. This is particularly important when the IO was already reported as "done". The original flashcache code we started with over a year ago showed the torn-page problem in extremely rare crashes in write-back mode. Our present code contains specific design elements to avoid it.
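One common way to detect a torn page is to store a per-page checksum and verify it on read. A minimal sketch, purely illustrative (not EnhanceIO's actual mechanism):

```python
import hashlib

PAGE = 4096  # page size in bytes (assumed for the example)

def write_page(buf, data):
    """Store a checksum alongside the page so a partially written
    ('torn') page can be detected on the next read."""
    buf['data'] = data
    buf['csum'] = hashlib.sha1(data).digest()

def read_page(buf):
    """Verify the page against its stored checksum before returning it."""
    if hashlib.sha1(buf['data']).digest() != buf['csum']:
        raise IOError("torn page detected")
    return buf['data']
```

A crash between writing the data and writing the checksum (or vice versa) leaves the two inconsistent, which is exactly what the read-side check catches; real designs order or journal the two updates to make this window recoverable.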
Post by Amit Kale
Post by Amit Kale
11. Error conditions - Handling power failures, intermittent and
permanent device failures.
I think the area where dm-cache is currently lacking is intermittent
failures. For example if a cache read fails we just pass that error
up, whereas eio sees if the block is clean and if so tries to read
off the origin. I'm not sure which behaviour is correct; I like to
know about disk failure early.
Our read-only and write-through modes guarantee that no IO errors are introduced regardless of the state the SSD is in, so not retrying an IO error doesn't cause any future problems. The worst case is a performance hit when an SSD shows an IO error or goes completely bad.

It's a different story for write-back. We advise our customers to use RAID on the SSD when using write-back, as explained above.
Post by Amit Kale
Post by Amit Kale
12. Configuration parameters for tuning according to applications.
Discussed above.
Post by Amit Kale
We'll soon document EnhanceIO behavior in context of these
aspects. We'll appreciate if dm-cache and bcache is also documented.
I hope the above helps. Please ask away if you're unsure about
something.
Post by Amit Kale
When comparing performance there are three levels at which it can be measured
Developing these caches is tedious. Test runs take time, and really
slow the dev cycle down. So I suspect we've all been using
microbenchmarks that run in a few minutes.
Let's get our pool of microbenchmarks together, then work on some
application level ones (we're happy to put some time into developing
these).
We do run micro-benchmarks all the time. There are free database benchmarks, so we can try those. Running a full-fledged Oracle-based benchmark takes hours, so I am not sure whether I can post that kind of comparison; I will try to do the best possible.

Thanks.
-Amit

PROPRIETARY-CONFIDENTIAL INFORMATION INCLUDED



This electronic transmission, and any documents attached hereto, may contain confidential, proprietary and/or legally privileged information. The information is intended only for use by the recipient named above. If you received this electronic message in error, please notify the sender and delete the electronic message. Any disclosure, copying, distribution, or use of the contents of information received in error is strictly prohibited, and violators will be pursued legally.
Jason Warr
2013-01-17 18:36:19 UTC
Permalink
Post by Amit Kale
Post by Amit Kale
9. Performance - Throughput is generally most important. Latency is
also one more performance comparison point. Performance under
different load classes can be measured.
I think latency is more important than throughput. Spindles are
pretty good at throughput. In fact the mq policy tries to spot when
we're doing large linear ios and stops hit counting; best leave this
stuff on the spindle.
I disagree. Latency is taken care of automatically when the number of application threads rises.
Can you explain what you mean by that in a little more detail?

As an enterprise level user I see both as important overall. However,
the biggest driving factor in wanting a cache device in front of any
sort of target in my use cases is to hide latency as the number of
threads reading and writing to the backing device go up. So for me the
cache is basically a tier stage where your ability to keep dirty blocks
on it is determined by the specific use case.
Amit Kale
2013-01-18 09:08:37 UTC
Permalink
Post by Amit Kale
Post by Amit Kale
Post by Amit Kale
9. Performance - Throughput is generally most important. Latency is
also one more performance comparison point. Performance under
different load classes can be measured.
I think latency is more important than throughput. Spindles are
pretty good at throughput. In fact the mq policy tries to spot
when
Post by Amit Kale
Post by Amit Kale
we're doing large linear ios and stops hit counting; best leave
this
Post by Amit Kale
Post by Amit Kale
stuff on the spindle.
I disagree. Latency is taken care of automatically when the number of
application threads rises.
Can you explain what you mean by that in a little more detail?
Let's say the latency of a block device is 10ms for 4kB requests. With single-threaded IO, the throughput will be 4kB/10ms = 400kB/s. If the device is capable of more throughput, multithreaded IO will generate more: with 2 threads the throughput will be roughly 800kB/s. We can keep increasing the number of threads, getting approximately linear throughput growth until it saturates at the device's maximum capacity, perhaps at 8MB/s. Increasing the number of threads beyond this will not increase throughput.

This is a simplistic computation; throughput, latency and the number of threads are related in a more complex way. Latency is still important, but throughput is more important.

The way all this matters for SSD caching is that a cache will typically show higher latency than the bare SSD, even at a 100% hit ratio. It may still be possible to reach the bare SSD's maximum throughput by using more threads: say an SSD shows 450MB/s with 4 threads; a cache may show 440MB/s with 8 threads.

A practical difficulty in measuring latency is that the latency seen by an application is the sum of the device latency and the time spent in the request queue (and the caching layer, when present). Increasing the number of threads increases observed latency, but only because requests sit in the request queue longer. Latency measurement in a multithreaded environment is very challenging; measuring throughput is fairly straightforward.
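The back-of-envelope model above can be written down directly (a sketch; real devices saturate far less cleanly than a hard cap):

```python
def throughput_kbps(latency_ms, block_kb, threads, device_max_kbps):
    """Per-thread throughput is block_size / latency; total throughput
    scales roughly linearly with threads until the device saturates."""
    per_thread = block_kb / (latency_ms / 1000.0)   # kB/s for one thread
    return min(threads * per_thread, device_max_kbps)
```

With 10ms latency and 4kB requests this gives 400kB/s per thread, climbing linearly with thread count and clipping at the device ceiling, matching the numbers in the example.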
Post by Amit Kale
As an enterprise level user I see both as important overall. However,
the biggest driving factor in wanting a cache device in front of any
sort of target in my use cases is to hide latency as the number of
threads reading and writing to the backing device go up. So for me the
cache is basically a tier stage where your ability to keep dirty blocks
on it is determined by the specific use case.
SSD caching will help in this case since SSD's latency remains almost constant regardless of location of data. HDD latency for sequential and random IO could vary by a factor of 5 or even much more.

Throughput with caching could even be 100 times the HDD throughput when using multiple threaded non-sequential IO.
-Amit


Jason Warr
2013-01-18 15:56:19 UTC
Permalink
Post by Amit Kale
Post by Jason Warr
Can you explain what you mean by that in a little more detail?
Let's say latency of a block device is 10ms for 4kB requests. With single threaded IO, the throughput will be 4kB/10ms = 400kB/s. If the device is capable of more throughput, a multithreaded IO will generate more throughput. So with 2 threads the throughput will be roughly 800kB/s. We can keep increasing the number of threads resulting in an approximately linear throughput. It'll saturate at the maximum capacity the device has. So it could saturate at perhaps at 8MB/s. Increasing the number of threads beyond this will not increase throughput.
This is a simplistic computation. Throughput, latency and number of threads are related in a more complex relationship. Latency is still important, but throughput is more important.
The way all this matters for SSD caching is, caching will typically show a higher latency compared to the base SSD, even for a 100% hit ratio. It may be possible to reach the maximum throughput achievable with the base SSD using a high number of threads. Let's say an SSD shows 450MB/s with 4 threads. A cache may show 440MB/s with 8 threads.
A practical difficulty in measuring latency is that the latency seen by an application is a sum of the device latency plus the time spent in request queue (and caching layer, when present). Increasing number of threads shows latency increase, although it's only because the requests stay in request queue for a longer duration. Latency measurement in a multithreaded environment is very challenging. Measurement of throughput is fairly straightforward.
Post by Jason Warr
As an enterprise level user I see both as important overall. However,
the biggest driving factor in wanting a cache device in front of any
sort of target in my use cases is to hide latency as the number of
threads reading and writing to the backing device go up. So for me the
cache is basically a tier stage where your ability to keep dirty blocks
on it is determined by the specific use case.
SSD caching will help in this case since SSD's latency remains almost constant regardless of location of data. HDD latency for sequential and random IO could vary by a factor of 5 or even much more.
Throughput with caching could even be 100 times the HDD throughput when using multiple threaded non-sequential IO.
-Amit
Thank you for the explanation. In context your reasoning makes more
sense to me.

If I am understanding you correctly, when you refer to throughput you're
speaking more in terms of IOPS than what most people would think of as
just bit rate.

I would expect a small increase in minimum and average latency when
adding another layer that the blocks have to traverse. If my minimum
and average increase by 20% on most of my workloads, that is very
acceptable as long as there is a decrease in the 95th- and
99th-percentile maximums. I would hope that the absolute maximum would
decrease as well, but that is going to be much harder to achieve.

If I can help test and benchmark all three of these solutions, please
ask. I have a lot of hardware resources available to me, and perhaps I
can add value from an outsider's perspective.

Jason
thornber-H+wXaHxf7aLQT0dZR+
2013-01-18 16:11:36 UTC
Permalink
Post by Jason Warr
If I can help test and benchmark all three of these solutions please
ask. I have allot of hardware resources available to me and perhaps I
can add value from an outsiders perspective.
We'd love your help. Perhaps you could devise a test that represents
how you'd use it?

- Joe
Jason Warr
2013-01-18 16:45:03 UTC
Permalink
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Jason Warr
If I can help test and benchmark all three of these solutions please
ask. I have allot of hardware resources available to me and perhaps I
can add value from an outsiders perspective.
We'd love your help. Perhaps you could devise a test that represents
how you'd use it?
- Joe
--
To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
As much as I dislike Oracle, that is one of my primary applications. I
am attempting to get one of my customers to set up an Oracle instance
that is modular, in that I can move the storage around to fit a
particular hardware setup, with a consistent benchmark that they use
in the real world to gauge performance. One of them is a debit-card
transaction clearing entity with multi-TB databases, so latency REALLY
matters there. Hopefully I'll have a couple of them set up within a
week. At that point I may need help getting the proper kernel trees
and patch sets munged into a working kernel; that seems to be the spot
where I fall over most of the time.

Unfortunately I probably could not share this specific setup, but it is
likely that I can derive a version from it that can be opened.
thornber-H+wXaHxf7aLQT0dZR+
2013-01-18 17:42:19 UTC
Permalink
Post by Jason Warr
As much as I dislike Oracle that is one of my primary applications. I
am attempting to get one of my customers to setup an Oracle instance
that is modular in that I can move the storage around to fit a
particular hardware setup and have a consistent benchmark that they use
in the real world to gauge performance. One of them is a debit card
transaction clearing entity on multi-TB databases so latency REALLY
matters there. Hopefully I'll have a couple of them setup within a
week. At that point I may need help in getting the proper kernel trees
and patch sets munged into a working kernel. That seems to be the spot
where I fall over most of the time.
Unfortunately I probably could not share this specific setup but it is
likely that I can derive a version from it that can be opened.
That would be perfect. Please ask for any help you need.

- Joe
Amit Kale
2013-01-18 17:44:30 UTC
Permalink
-----Original Message-----
Sent: Friday, January 18, 2013 10:15 PM
Subject: Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching
software for Linux kernel
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Jason Warr
If I can help test and benchmark all three of these solutions please
ask. I have allot of hardware resources available to me and perhaps
I can add value from an outsiders perspective.
We'd love your help. Perhaps you could devise a test that represents
how you'd use it?
- Joe
As much as I dislike Oracle that is one of my primary applications. I
am attempting to get one of my customers to setup an Oracle instance
that is modular in that I can move the storage around to fit a
particular hardware setup and have a consistent benchmark that they use
in the real world to gauge performance. One of them is a debit card
transaction clearing entity on multi-TB databases so latency REALLY
matters there.
I am curious as to how SSD latency matters so much in the overall transaction times.

We do a lot of performance measurements using SQL database benchmarks. Transaction times vary a lot depending on the location of the data, the complexity of the transaction, etc. Typically TPM (transactions per minute) is of primary interest for TPC-C.
Hopefully I'll have a couple of them setup within a
week. At that point I may need help in getting the proper kernel trees
and patch sets munged into a working kernel. That seems to be the spot
where I fall over most of the time.
Unfortunately I probably could not share this specific setup but it is
likely that I can derive a version from it that can be opened.
That'll be good. I'll check with our testing team whether they can run TPC-C comparisons for these three caching solutions.

-Amit

Jason Warr
2013-01-18 18:36:42 UTC
Permalink
Post by Amit Kale
Post by Jason Warr
As much as I dislike Oracle that is one of my primary applications. I
Post by Jason Warr
am attempting to get one of my customers to setup an Oracle instance
that is modular in that I can move the storage around to fit a
particular hardware setup and have a consistent benchmark that they use
in the real world to gauge performance. One of them is a debit card
transaction clearing entity on multi-TB databases so latency REALLY
matters there.
I am curious as to how SSD latency matters so much in the overall transaction times.
We do a lot of performance measurements using SQL database benchmarks. Transaction times vary a lot depending on location of data, complexity of the transaction etc. Typically TPM (transactions per minute) is of primary interest for TPC-C.
It's not specifically SSD latency; it's IO transaction latency that
matters. This particular application is very sensitive to it because
it is literally someone standing at a POS terminal swiping a
debit/credit card. You only have a couple of seconds after the PIN is
entered for the transaction to go through the network and application
server, authorize against the DB, and return to the POS.

The entire I/O stack on the DB is only a small time-slice of that round
trip. Your 99th percentile needs to be under 20ms on the DB storage
side. If your worst-case DB I/O goes beyond 300ms it is considered an
outage, because the POS transaction fails. So it obviously takes a lot
of planning and optimization work on the DB itself, to get a good
tablespace layout, to even get into the realm where you can have that
predictable a latency with multi-million-dollar FC storage frames.

One of my goals is to be able to offer this level of I/O service on
commodity hardware: simplify the hardware, reduce the number of points
of failure, make the systems more portable, reduce or eliminate
dependence on any specific vendor below the application, and save
money. Not to mention reduce the number of fingers that can point away
from themselves, saying it is someone else's problem to find the fault.

A lot of the pieces are already out there. A good block caching target
is one of the missing pieces to help fill the ever-growing canyon
between non-block-device system performance and storage. What they have
done with L2ARC and SLOG in ZFS/Solaris is good, but it has some serious
shortcomings in areas that DM/MD/LVM handle extremely well.

I appreciate all of the brilliant work all of you guys do and hopefully
I can contribute a little bit of usefulness to this effort.

Thank you,

Jason
Darrick J. Wong
2013-01-18 21:25:43 UTC
Permalink
Since Joe is putting together a testing tree to compare the three caching
things, what do you all think of having a(nother) session about ssd caching at
this year's LSFMM Summit?

[Apologies for hijacking the thread.]
[Adding lsf-pc to the cc list.]

--D
Post by Jason Warr
Post by Amit Kale
Post by Jason Warr
As much as I dislike Oracle that is one of my primary applications. I
Post by Jason Warr
am attempting to get one of my customers to setup an Oracle instance
that is modular in that I can move the storage around to fit a
particular hardware setup and have a consistent benchmark that they use
in the real world to gauge performance. One of them is a debit card
transaction clearing entity on multi-TB databases so latency REALLY
matters there.
I am curious as to how SSD latency matters so much in the overall transaction times.
We do a lot of performance measurements using SQL database benchmarks. Transaction times vary a lot depending on location of data, complexity of the transaction etc. Typically TPM (transactions per minute) is of primary interest for TPC-C.
It's not specifically SSD latency. It's I/O transaction latency that
matters. This particular application is very sensitive to that because
it is literally someone standing at a POS terminal swiping a
debit/credit card. You only have a couple of seconds after the PIN is
entered for the transaction to go through your network, application
server to authorize against a DB and back to the POS.
The entire I/O stack on the DB is only a small time-slice of that round
trip. Your 99th percentile needs to be under 20ms on the DB storage
side. If your worst case DB I/O goes beyond 300ms it is considered an
outage because the POS transaction fails. So it obviously takes allot
of planning and optimization work on the DB itself to get good
tablespace layout to even get into the realm where you can have that
predictable of latency with multi-million dollar FC storage frames.
One of my goals is to be able to offer this level of I/O service on
commodity hardware. Simplify the scope of hardware, reduce the number
of points of failure, make the systems more portable, reduce or
eliminate dependence on any specific vendor below the application and
save money. Not to mention reduce the number of fingers that can point
away from themselves saying it is someone else's problem to find fault.
A lot of the pieces are already out there. A good block caching target
is one of the missing pieces to help fill the ever growing canyon
between non-block device system performance and storage. What they have
done with L2ARC and SLOG in ZFS/Solaris is good but it has some serious
shortcomings in other areas that DM/MD/LVM do extremely well.
I appreciate all of the brilliant work all of you guys do and hopefully
I can contribute a little bit of usefulness to this effort.
Thank you,
Jason
--
dm-devel mailing list
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer
2013-01-18 21:37:59 UTC
Permalink
On Fri, Jan 18 2013 at 4:25pm -0500,
Post by Darrick J. Wong
Since Joe is putting together a testing tree to compare the three caching
things, what do you all think of having a(nother) session about ssd caching at
this year's LSFMM Summit?
[Apologies for hijacking the thread.]
[Adding lsf-pc to the cc list.]
Hopefully we'll have some findings on the comparisons well before LSF
(since we currently have some momentum). But yes it may be worthwhile
to discuss things further and/or report findings.

Mike
Amit Kale
2013-01-21 05:26:08 UTC
Permalink
-----Original Message-----
Sent: Saturday, January 19, 2013 3:08 AM
To: Darrick J. Wong
Thornber
Subject: Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO
SSD caching software for Linux kernel
On Fri, Jan 18 2013 at 4:25pm -0500,
Post by Darrick J. Wong
Since Joe is putting together a testing tree to compare the three
caching things, what do you all think of having a(nother) session
about ssd caching at this year's LSFMM Summit?
[Apologies for hijacking the thread.]
[Adding lsf-pc to the cc list.]
Hopefully we'll have some findings on the comparisons well before LSF
(since we currently have some momentum). But yes it may be worthwhile
to discuss things further and/or report findings.
We should have performance comparisons presented well before the summit. It'll be good to have an SSD caching session in any case. The likelihood that one of them will be included in the Linux kernel before April is very low.

-Amit

PROPRIETARY-CONFIDENTIAL INFORMATION INCLUDED



This electronic transmission, and any documents attached hereto, may contain confidential, proprietary and/or legally privileged information. The information is intended only for use by the recipient named above. If you received this electronic message in error, please notify the sender and delete the electronic message. Any disclosure, copying, distribution, or use of the contents of information received in error is strictly prohibited, and violators will be pursued legally.
Mike Snitzer
2013-01-21 13:09:51 UTC
Permalink
On Mon, Jan 21 2013 at 12:26am -0500,
Post by Amit Kale
-----Original Message-----
Sent: Saturday, January 19, 2013 3:08 AM
To: Darrick J. Wong
Thornber
Subject: Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO
SSD caching software for Linux kernel
On Fri, Jan 18 2013 at 4:25pm -0500,
Post by Darrick J. Wong
Since Joe is putting together a testing tree to compare the three
caching things, what do you all think of having a(nother) session
about ssd caching at this year's LSFMM Summit?
[Apologies for hijacking the thread.]
[Adding lsf-pc to the cc list.]
Hopefully we'll have some findings on the comparisons well before LSF
(since we currently have some momentum). But yes it may be worthwhile
to discuss things further and/or report findings.
We should have performance comparisons presented well before the
summit. It'll be good to have ssd caching session in any case. The
likelihood that one of them will be included in Linux kernel before
April is very low.
dm-cache is under active review for upstream inclusion. I wouldn't
categorize the chances of dm-cache going upstream when the v3.9 merge
window opens as "very low". But even if dm-cache does go upstream it
doesn't preclude bcache and/or enhanceio from going upstream too.
thornber-H+wXaHxf7aLQT0dZR+
2013-01-21 13:58:21 UTC
Permalink
Post by Mike Snitzer
dm-cache is under active review for upstream inclusion. I wouldn't
categorize the chances of dm-cache going upstream when the v3.9 merge
window opens as "very low". But even if dm-cache does go upstream it
doesn't preclude bcache and/or enhanceio from going upstream too.
As I understand it bcache is being reviewed too.
Amit Kale
2013-01-22 05:00:35 UTC
Permalink
-----Original Message-----
Sent: Monday, January 21, 2013 6:40 PM
To: Amit Kale
Cc: Darrick J. Wong; device-mapper development; linux-
Subject: Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC EnhanceIO
SSD caching software for Linux kernel
On Mon, Jan 21 2013 at 12:26am -0500,
Post by Amit Kale
-----Original Message-----
Sent: Saturday, January 19, 2013 3:08 AM
To: Darrick J. Wong
Cc: device-mapper development; Amit Kale;
Subject: Re: [LSF/MM TOPIC] Re: [dm-devel] Announcement: STEC
EnhanceIO SSD caching software for Linux kernel
On Fri, Jan 18 2013 at 4:25pm -0500, Darrick J. Wong
Post by Darrick J. Wong
Since Joe is putting together a testing tree to compare the three
caching things, what do you all think of having a(nother) session
about ssd caching at this year's LSFMM Summit?
[Apologies for hijacking the thread.] [Adding lsf-pc to the cc
list.]
Hopefully we'll have some findings on the comparisons well before
LSF (since we currently have some momentum). But yes it may be
worthwhile to discuss things further and/or report findings.
We should have performance comparisons presented well before the
summit. It'll be good to have ssd caching session in any case. The
likelihood that one of them will be included in Linux kernel before
April is very low.
dm-cache is under active review for upstream inclusion. I wouldn't
categorize the chances of dm-cache going upstream when the v3.9 merge
window opens as "very low". But even if dm-cache does go upstream it
doesn't preclude bcache and/or enhanceio from going upstream too.
I agree. We haven't seen a full comparison yet, IMHO. If different solutions offer mutually exclusive benefits, it'll be worthwhile including them all.

We haven't submitted EnhanceIO for inclusion yet. We need more testing from the community before we can mark it Beta.
-Amit

Kent Overstreet
2013-02-04 20:33:26 UTC
Permalink
Post by Mike Snitzer
On Fri, Jan 18 2013 at 4:25pm -0500,
Post by Darrick J. Wong
Since Joe is putting together a testing tree to compare the three caching
things, what do you all think of having a(nother) session about ssd caching at
this year's LSFMM Summit?
[Apologies for hijacking the thread.]
[Adding lsf-pc to the cc list.]
Hopefully we'll have some findings on the comparisons well before LSF
(since we currently have some momentum). But yes it may be worthwhile
to discuss things further and/or report findings.
I'd be willing to go and talk a bit about bcache. Curious to hear more
about the dm caching stuff, too.

Amit Kale
2013-01-18 16:12:31 UTC
Permalink
-----Original Message-----
Sent: Friday, January 18, 2013 9:26 PM
To: Amit Kale
Subject: Re: [dm-devel] Announcement: STEC EnhanceIO SSD caching
software for Linux kernel
Post by Amit Kale
Post by Jason Warr
Can you explain what you mean by that in a little more detail?
Let's say latency of a block device is 10ms for 4kB requests. With
single threaded IO, the throughput will be 4kB/10ms = 400kB/s. If the
device is capable of more throughput, a multithreaded IO will generate
more throughput. So with 2 threads the throughput will be roughly
800kB/s. We can keep increasing the number of threads resulting in an
approximately linear throughput. It'll saturate at the maximum capacity
the device has. So it could saturate at perhaps 8MB/s. Increasing
the number of threads beyond this will not increase throughput.
Post by Amit Kale
This is a simplistic computation. Throughput, latency and number of
threads are related in a more complex relationship. Latency is still
important, but throughput is more important.
Post by Amit Kale
The way all this matters for SSD caching is, caching will typically
show a higher latency compared to the base SSD, even for a 100% hit
ratio. It may be possible to reach the maximum throughput achievable
with the base SSD using a high number of threads. Let's say an SSD
shows 450MB/s with 4 threads. A cache may show 440MB/s with 8 threads.
Post by Amit Kale
A practical difficulty in measuring latency is that the latency seen
by an application is a sum of the device latency plus the time spent in
request queue (and caching layer, when present). Increasing number of
threads shows latency increase, although it's only because the requests
stay in request queue for a longer duration. Latency measurement in a
multithreaded environment is very challenging. Measurement of
throughput is fairly straightforward.
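The thread-scaling arithmetic above can be sketched as a toy model (the 10ms latency, 4kB request size and 8MB/s ceiling are the illustrative numbers from this thread, not measurements):

```python
# Toy model of the relationship described above: per-thread throughput is
# request size divided by latency, summed across threads, saturating at
# the device's maximum capacity. All numbers are illustrative.

def throughput_kb_s(threads, latency_ms=10, request_kb=4, device_max_kb_s=8000):
    """Approximately linear scaling with thread count until saturation."""
    linear = threads * request_kb / (latency_ms / 1000.0)
    return min(linear, device_max_kb_s)

for t in (1, 2, 8, 20, 64):
    # 1 thread -> 400 kB/s, 2 threads -> 800 kB/s, saturating at 8 MB/s
    print(f"{t:3d} threads: {throughput_kb_s(t):8.1f} kB/s")
```

Note that, as the text points out, this says nothing about per-request latency: past saturation, adding threads only lengthens time spent in the request queue.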
Post by Amit Kale
Post by Jason Warr
As an enterprise level user I see both as important overall.
However, the biggest driving factor in wanting a cache device in
front of any sort of target in my use cases is to hide latency as
the number of threads reading and writing to the backing device go
up. So for me the cache is basically a tier stage where your
ability to keep dirty blocks on it is determined by the specific
use case.
Post by Amit Kale
SSD caching will help in this case since SSD's latency remains almost
constant regardless of location of data. HDD latency for sequential and
random IO could vary by a factor of 5 or even much more.
Post by Amit Kale
Throughput with caching could even be 100 times the HDD throughput
when using multiple threaded non-sequential IO.
Post by Amit Kale
-Amit
Thank you for the explanation. In context your reasoning makes more
sense to me.
If I am understanding you correctly when you refer to throughput your
speaking more in terms of IOPS than what most people would think of as
referencing only bit rate.
I would expect a small increase in minimum and average latency when
adding in another layer that the blocks have to traverse. If my
minimum and average increase by 20% on most of my workloads, that is
very acceptable as long as there is a decrease in 95th and 99th
percentile maximums. I would hope that absolute maximum would decrease
as well but that is going to be much harder to achieve.
If I can help test and benchmark all three of these solutions please
ask. I have a lot of hardware resources available to me and perhaps I
can add value from an outsiders perspective.
That'll be great. I have so far marked EIO's status as alpha. It will require a little more functional testing before performance testing. Perhaps in a week or so.

-Amit

Kent Overstreet
2013-01-24 23:55:41 UTC
Permalink
Post by Amit Kale
Post by Amit Kale
Post by Amit Kale
Post by Amit Kale
9. Performance - Throughput is generally most important. Latency is
also one more performance comparison point. Performance under
different load classes can be measured.
I think latency is more important than throughput. Spindles are
pretty good at throughput. In fact the mq policy tries to spot
when
Post by Amit Kale
Post by Amit Kale
we're doing large linear ios and stops hit counting; best leave
this
Post by Amit Kale
Post by Amit Kale
stuff on the spindle.
I disagree. Latency is taken care of automatically when the number of
application threads rises.
Can you explain what you mean by that in a little more detail?
Let's say latency of a block device is 10ms for 4kB requests. With single threaded IO, the throughput will be 4kB/10ms = 400kB/s. If the device is capable of more throughput, a multithreaded IO will generate more throughput. So with 2 threads the throughput will be roughly 800kB/s. We can keep increasing the number of threads resulting in an approximately linear throughput. It'll saturate at the maximum capacity the device has. So it could saturate at perhaps 8MB/s. Increasing the number of threads beyond this will not increase throughput.
This is a simplistic computation. Throughput, latency and number of threads are related in a more complex relationship. Latency is still important, but throughput is more important.
The way all this matters for SSD caching is, caching will typically show a higher latency compared to the base SSD, even for a 100% hit ratio. It may be possible to reach the maximum throughput achievable with the base SSD using a high number of threads. Let's say an SSD shows 450MB/s with 4 threads. A cache may show 440MB/s with 8 threads.
Going through the cache should only (measurably) increase latency for
writes, not reads (assuming they're cache hits, not misses). It sounds
like you're talking about the overhead for keeping the index up to date,
which is only a factor for writes, but I'm not quite sure since you talk
about hit rate.

I don't know of any reason why throughput or latency should be noticeably
worse than raw for reads from cache.

But for writes, yeah - as number of of concurrent IOs goes up, you can
amortize the metadata writes more and more so throughput compared to raw
goes up. I don't think latency would change much vs. raw, you're always
going to have an extra metadata write to wait on... though there are
tricks you can do so the metadata write and data write can go down in
parallel. Bcache doesn't do those yet.

_But_, you only have to pay the metadata write penalty when you see a
cache flush/FUA write. In the absence of cache flushes/FUA, for
metadata purposes you can basically treat a stream of sequential writes
as going down in parallel.
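Kent's amortization argument can be sketched as follows (an illustrative model, not bcache's actual implementation; `MetadataJournal` and its fields are invented for this sketch): many data writes share one metadata commit, paid only at a flush/FUA boundary.

```python
# Sketch of amortizing metadata writes: index updates for completed data
# writes are queued in memory, and a single metadata write covers all of
# them when a cache flush / FUA write forces durability.

class MetadataJournal:
    def __init__(self):
        self.pending = []    # index updates not yet made durable
        self.commits = 0     # metadata writes actually issued

    def record_write(self, block):
        # Data write completes; its index update is merely queued.
        self.pending.append(block)

    def flush(self):
        # Flush/FUA boundary: one metadata commit covers everything pending.
        if self.pending:
            self.commits += 1
            self.pending.clear()

j = MetadataJournal()
for blk in range(100):
    j.record_write(blk)
j.flush()
print(j.commits)    # -> 1: a hundred writes, one metadata commit
```

The more concurrent IO arrives between flushes, the better the metadata cost amortizes, which matches the observation that write throughput versus raw improves with IO depth while per-write latency does not.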
thornber-H+wXaHxf7aLQT0dZR+
2013-01-17 18:50:17 UTC
Permalink
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
The mq policy uses a multiqueue (effectively a partially sorted
lru list) to keep track of candidate block hit counts. When
candidates get enough hits they're promoted. The promotion
threshold is periodically recalculated by looking at the hit
counts for the blocks already in the cache.
Multi-queue algorithm typically results in a significant metadata
overhead. How much percentage overhead does that imply?
It is a drawback, at the moment we have a list head, hit count and
some flags per block. I can compress this, it's on my todo list.
Looking at the code I see you have doubly linked list fields per block
too, albeit 16 bit ones. We use much bigger blocks than you, so I'm
happy to get the benefit of the extra space.
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
I read through EnhanceIO yesterday, and think this is where
you're lacking.
We have an LRU policy at a cache set level. Effectiveness of the LRU
policy depends on the average duration of a block in a working
dataset. If the average duration is small enough so a block is most
of the time "hit" before it's chucked out, LRU works better than
any other policies.
Yes, in some situations lru is best, in others lfu is best. That's
why people try and blend in something like arc. Now my real point was
although you're using lru to choose what to evict, you're not using
anything to choose what to put _in_ the cache, or have I got this
totally wrong?
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
A couple of other things I should mention; dm-cache uses a large block
size compared to eio. eg, 64k - 1m. This is a mixed blessing;
Yes. We had a lot of debate internally on the block size. For now we
have restricted to 2k, 4k and 8k. We found that larger block sizes
result in too much internal fragmentation, in spite of a
significant reduction in metadata size. 8k is adequate for Oracle
and mysql.
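A rough illustration of the fragmentation-versus-metadata trade-off debated above (the 16-byte per-block metadata cost and 200GiB cache size are hypothetical; real per-block overheads differ between eio, dm-cache and bcache):

```python
# Back-of-envelope: smaller cache blocks waste less space to internal
# fragmentation but need proportionally more per-block metadata.
# The 16-byte per-block cost is a hypothetical figure for illustration.

def metadata_bytes(cache_bytes, block_bytes, per_block_md=16):
    """Total metadata needed to track every block of the cache."""
    return (cache_bytes // block_bytes) * per_block_md

GiB = 1 << 30
for blk in (4 * 1024, 64 * 1024, 1024 * 1024):   # 4k, 64k, 1M blocks
    md = metadata_bytes(200 * GiB, blk)
    print(f"block {blk >> 10:4d} kB -> metadata {md / (1 << 20):8.1f} MiB")
```

With these assumed numbers a 200GiB cache needs roughly 800MiB of metadata at 4k blocks but only a few MiB at 1M blocks, which is why dm-cache's 64k-1m blocks and eio's 2k-8k blocks sit on opposite sides of the trade-off.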
Right, you need to describe these scenarios so you can show off eio in
the best light.
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
We do not keep the dirty state of cache blocks up to date on the
metadata device. Instead we have a 'mounted' flag that's set in the
metadata when opened. When a clean shutdown occurs (eg, dmsetup
suspend my-cache) the dirty bits are written out and the mounted flag
cleared. On a crash the mounted flag will still be set on reopen and
all dirty flags degrade to 'dirty'.
Not sure I understand this. Is there a guarantee that once an IO is
reported as "done" to upstream layer
(filesystem/database/application), it is persistent. The persistence
should be guaranteed even if there is an OS crash immediately after
status is reported. Persistence should be guaranteed for the entire
IO range. The next time the application tries to read it, it should
get updated data, not stale data.
Yes, we're careful to persist all changes in the mapping before
completing io. However the dirty bits are just used to ascertain what
blocks need writing back to the origin. In the event of a crash it's
safe to assume they all do. dm-cache is a slow moving cache, change
of dirty status occurs far, far more frequently than change of
mapping. So avoiding these updates is a big win.
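The 'mounted' flag scheme Joe describes can be sketched as follows (an illustrative model, not dm-cache source; `CacheMetadata` and its fields are invented for this sketch):

```python
# Sketch of the clean-shutdown scheme: dirty bits are only persisted on a
# clean shutdown, with the 'mounted' flag cleared. If the flag is still
# set at open time, a crash happened and every block degrades to dirty.

class CacheMetadata:
    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.on_disk = {"mounted": False, "dirty": [False] * nblocks}

    def open(self):
        crashed = self.on_disk["mounted"]
        if crashed:
            # Persisted dirty bits can't be trusted: assume all dirty,
            # so every block gets written back to the origin.
            dirty = [True] * self.nblocks
        else:
            dirty = list(self.on_disk["dirty"])
        self.on_disk["mounted"] = True
        return dirty

    def clean_shutdown(self, dirty):
        self.on_disk["dirty"] = list(dirty)
        self.on_disk["mounted"] = False

md = CacheMetadata(4)
md.open()           # clean open: no blocks dirty
# simulate a crash: no clean_shutdown() before the next open
print(md.open())    # -> [True, True, True, True]
```

The payoff is exactly what the text says: per-IO dirty-bit updates never touch the metadata device, at the cost of a conservative full writeback after a crash.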
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
Correct me if I'm wrong, but I
think eio is holding io completion until the dirty bits have been
committed to disk?
That's correct. In addition to this, we try to batch metadata updates if multiple IOs occur in the same cache set.
y, I batch updates too.
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
3. Availability - What's the downtime when adding, deleting caches,
making changes to cache configuration, conversion between cache
modes, recovering after a crash, recovering from an error condition.
Normal dm suspend, alter table, resume cycle. The LVM tools do this
all the time.
Cache creation and deletion will require stopping applications,
unmounting filesystems and then remounting and starting the
applications. A sysad in addition to this will require updating
fstab entries. Do fstab entries work automatically in case they use
labels instead of full device paths.
The common case will be someone using a volume manager like LVM, so
the device nodes are already dm ones. In this case there's no need
for unmounting or stopping applications. Changing the stack of dm
targets around on a live system is a key feature. For example this is
how we implement the pvmove functionality.
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
Well I saw the comment in your code describing the security flaw you
think you've got. I hope we don't have any, I'd like to understand
your case more.
Could you elaborate on which comment you are referring to?
Top of eio_main.c

* 5) Fix a security hole : A malicious process with 'ro' access to a
* file can potentially corrupt file data. This can be fixed by
* copying the data on a cache read miss.
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
5. Portability - Which HDDs, SSDs, partitions, other block devices it
works with.
I think we all work with any block device. But eio and bcache can
overlay any device node, not just a dm one. As mentioned in earlier
email I really think this is a dm issue, not specific to dm-cache.
DM was never meant to be cascaded. So it's ok for DM.
Not sure what you mean here? I wrote dm specifically with stacking
scenarios in mind.
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
7. Persistence of cached data - Does cached data remain across
reboots/crashes/intermittent failures. Is the "sticky"ness of data
configurable.
Surely this is a given? A cache would be trivial to write if it
didn't need to be crash proof.
There has to be a way to make it either persistent or volatile
depending on how users want it. Enterprise users are sometimes
paranoid about HDD and SSD going out of sync after a system shutdown
and before a bootup. This is typically for large complicated iSCSI
based shared HDD setups.
Well in those cases Enterprise users can just use dm-cache in writethrough
mode and throw it away when they finish. Writing our metadata is not
the bottleneck (copy for migrations is), and it's definitely worth
keeping so there are up to date hit counts for the policy to work off
after reboot.
Post by Amit Kale
That's correct. We don't have to worry about wear leveling. All of the competent SSDs around do that.
What I wanted to bring up was how many SSD writes does a cache
read/write result. Write back cache mode is specifically taxing on
SSDs in this aspect.
No more than read/writes to a plain SSD. Are you getting hit by extra
io because you persist dirty flags?
Post by Amit Kale
Databases run into torn-page error when an IO is found to be only
partially written when it was supposed to be fully written. This is
particularly important when an IO was reported to be "done". The
original flashcache code we started with over a year ago showed a
torn-page problem in extremely rare crashes with writeback mode. Our
present code contains specific design elements to avoid it.
We get this for free in core dm.

- Joe
Amit Kale
2013-01-18 07:03:54 UTC
Permalink
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
The mq policy uses a multiqueue (effectively a partially sorted
lru list) to keep track of candidate block hit counts. When
candidates get enough hits they're promoted. The promotion
threshold is periodically recalculated by looking at the hit
counts for the blocks already in the cache.
Multi-queue algorithm typically results in a significant metadata
overhead. How much percentage overhead does that imply?
It is a drawback, at the moment we have a list head, hit count and some
flags per block. I can compress this, it's on my todo list.
Looking at the code I see you have doubly linked list fields per block
too, albeit 16 bit ones. We use much bigger blocks than you, so I'm
happy to get the benefit of the extra space.
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
I read through EnhanceIO yesterday, and think this is where
you're lacking.
We have an LRU policy at a cache set level. Effectiveness of the LRU
policy depends on the average duration of a block in a working
dataset. If the average duration is small enough so a block is most
of the time "hit" before it's chucked out, LRU works better than any
other policies.
Yes, in some situations lru is best, in others lfu is best. That's why
people try and blend in something like arc. Now my real point was
although you're using lru to choose what to evict, you're not using
anything to choose what to put _in_ the cache, or have I got this
totally wrong?
We simply put any read or written blocks into the cache (subject to availability and controlled limits).
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
A couple of other things I should mention; dm-cache uses a large
block size compared to eio. eg, 64k - 1m. This is a mixed
blessing;
Yes. We had a lot of debate internally on the block size. For now we
have restricted to 2k, 4k and 8k. We found that larger block sizes
result in too much internal fragmentation, in spite of a
significant reduction in metadata size. 8k is adequate for Oracle and
mysql.
Right, you need to describe these scenarios so you can show off eio in
the best light.
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
We do not keep the dirty state of cache blocks up to date on the
metadata device. Instead we have a 'mounted' flag that's set in the
metadata when opened. When a clean shutdown occurs (eg, dmsetup
suspend my-cache) the dirty bits are written out and the mounted
flag cleared. On a crash the mounted flag will still be set on
reopen and all dirty flags degrade to 'dirty'.
Not sure I understand this. Is there a guarantee that once an IO is
reported as "done" to upstream layer
(filesystem/database/application), it is persistent. The persistence
should be guaranteed even if there is an OS crash immediately after
status is reported. Persistence should be guaranteed for the entire IO
range. The next time the application tries to read it, it should get
updated data, not stale data.
Yes, we're careful to persist all changes in the mapping before
completing io. However the dirty bits are just used to ascertain what
blocks need writing back to the origin. In the event of a crash it's
safe to assume they all do. dm-cache is a slow moving cache, change of
dirty status occurs far, far more frequently than change of mapping.
So avoiding these updates is a big win.
That's great.
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
Correct me if I'm wrong, but I
think eio is holding io completion until the dirty bits have been
committed to disk?
That's correct. In addition to this, we try to batch metadata updates
if multiple IOs occur in the same cache set.
y, I batch updates too.
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
3. Availability - What's the downtime when adding, deleting caches,
making changes to cache configuration, conversion between cache
modes, recovering after a crash, recovering from an error
condition.
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
Normal dm suspend, alter table, resume cycle. The LVM tools do this
all the time.
Cache creation and deletion will require stopping applications,
unmounting filesystems and then remounting and starting the
applications. A sysad in addition to this will require updating fstab
entries. Do fstab entries work automatically in case they use labels
instead of full device paths.
The common case will be someone using a volume manager like LVM, so the
device nodes are already dm ones. In this case there's no need for
unmounting or stopping applications. Changing the stack of dm targets
around on a live system is a key feature. For example this is how we
implement the pvmove functionality.
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
Well I saw the comment in your code describing the security flaw you
think you've got. I hope we don't have any, I'd like to understand
your case more.
Could you elaborate on which comment you are referring to?
Top of eio_main.c
* 5) Fix a security hole : A malicious process with 'ro' access to a
* file can potentially corrupt file data. This can be fixed by
* copying the data on a cache read miss.
That's stale. Slipped out of our cleanup. Will remove that.

It's still possible for an ordinary user to "consume" a significant portion of a cache by perpetually reading all permissible data. Caches as of now don't have per-user controls.
-Amit
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
5. Portability - Which HDDs, SSDs, partitions, other block devices
it works with.
I think we all work with any block device. But eio and bcache can
overlay any device node, not just a dm one. As mentioned in earlier
email I really think this is a dm issue, not specific to dm-cache.
Post by Amit Kale
DM was never meant to be cascaded. So it's ok for DM.
Not sure what you mean here? I wrote dm specifically with stacking
scenarios in mind.
DM can't use a device containing partitions, by design. It works on individual partitions, though.
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
7. Persistence of cached data - Does cached data remain across
reboots/crashes/intermittent failures. Is the "sticky"ness of data
configurable.
Surely this is a given? A cache would be trivial to write if it
didn't need to be crash proof.
There has to be a way to make it either persistent or volatile
depending on how users want it. Enterprise users are sometimes
paranoid about HDD and SSD going out of sync after a system shutdown
and before a bootup. This is typically for large complicated iSCSI
based shared HDD setups.
Well in those cases Enterprise users can just use dm-cache in writethrough
mode and throw it away when they finish. Writing our metadata is not
the bottleneck (copy for migrations is), and it's definitely worth
keeping so there are up to date hit counts for the policy to work off
after reboot.
Agreed. However there are arguments both ways. The need to start afresh is valid, although not frequent.
Post by thornber-H+wXaHxf7aLQT0dZR+
Post by Amit Kale
That's correct. We don't have to worry about wear leveling. All of
the competent SSDs around do that.
Post by Amit Kale
What I wanted to bring up was how many SSD writes does a cache
read/write result. Write back cache mode is specifically taxing on
SSDs in this aspect.
No more than read/writes to a plain SSD. Are you getting hit by extra
io because you persist dirty flags?
It's a price users pay for metadata updates. Our three caching modes have different levels of SSD writes. Read-only < write-through < write-back. Users can look at the benefits versus SSD life and choose accordingly.
-Amit

