Discussion:
[PATCH] bcache: fix writeback thread to sleep less intrusively
Darrick J. Wong
2014-04-11 00:26:36 UTC
Hi all,

The attached patch fixes both the "writeback blocked for XXX seconds"
complaints from the kernel and the oddly high load averages on idle
systems for me. Can you give it a try to see if it fixes your problem too?

--D
---
Currently, the writeback thread performs uninterruptible sleep while
it waits for enough dirty data to accumulate to start writeback.
Unfortunately, uninterruptible sleep counts towards load average,
which artificially inflates it. Since the wb thread is a kernel
thread and kthreads don't receive signals, we can use the
interruptible sleep call, which eliminates the high load average
symptom.

A second symptom is that if we mount a non-writeback cache, the
writeback thread will be woken up. If the cache later accumulates
dirty data and writeback_running=1 (this seems to be the default) then
the writeback thread will enter uninterruptible sleep waiting for
dirty data. This is unnecessary and (I think) results in the
"bcache_writebac:155 blocked for more than XXX seconds" complaints
that people have been talking about. The fix for this is simple -- if
we're not in writeback mode, just go to (interruptible) sleep for a
long time. Alternatively, we could use wait_event until the cache mode
changes.

Finally, change bch_cached_dev_attach() to always wake up the
writeback thread, because the newly created wb thread remains in
uninterruptible sleep state until something explicitly wakes it up.
This wakeup allows the thread to call bch_writeback_thread(),
whereupon it will most likely end up in interruptible sleep. In
theory we could just let the first write take care of this, but
there's really no reason not to do the transition quickly.

Signed-off-by: Darrick J. Wong <***@oracle.com>
---
drivers/md/bcache/super.c | 2 +-
drivers/md/bcache/writeback.c | 16 ++++++++++++++--
2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 24a3a15..3ffe970 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1048,8 +1048,8 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c)
bch_sectors_dirty_init(dc);
atomic_set(&dc->has_dirty, 1);
atomic_inc(&dc->count);
- bch_writeback_queue(dc);
}
+ bch_writeback_queue(dc);

bch_cached_dev_run(dc);
bcache_device_link(&dc->disk, c, "bdev");
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f4300e4..f49e6b1 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -239,7 +239,7 @@ static void read_dirty(struct cached_dev *dc)
if (KEY_START(&w->key) != dc->last_read ||
jiffies_to_msecs(delay) > 50)
while (!kthread_should_stop() && delay)
- delay = schedule_timeout_uninterruptible(delay);
+ delay = schedule_timeout_interruptible(delay);

dc->last_read = KEY_OFFSET(&w->key);

@@ -401,6 +401,18 @@ static int bch_writeback_thread(void *arg)

while (!kthread_should_stop()) {
down_write(&dc->writeback_lock);
+ if (BDEV_CACHE_MODE(&dc->sb) != CACHE_MODE_WRITEBACK) {
+ up_write(&dc->writeback_lock);
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ if (kthread_should_stop())
+ return 0;
+
+ try_to_freeze();
+ schedule_timeout_interruptible(10 * HZ);
+ continue;
+ }
+
if (!atomic_read(&dc->has_dirty) ||
(!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
!dc->writeback_running)) {
@@ -436,7 +448,7 @@ static int bch_writeback_thread(void *arg)
while (delay &&
!kthread_should_stop() &&
!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags))
- delay = schedule_timeout_uninterruptible(delay);
+ delay = schedule_timeout_interruptible(delay);
}
}
Kent Overstreet
2014-04-11 00:46:49 UTC
Post by Darrick J. Wong
[...]
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f4300e4..f49e6b1 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -239,7 +239,7 @@ static void read_dirty(struct cached_dev *dc)
if (KEY_START(&w->key) != dc->last_read ||
jiffies_to_msecs(delay) > 50)
while (!kthread_should_stop() && delay)
- delay = schedule_timeout_uninterruptible(delay);
+ delay = schedule_timeout_interruptible(delay);
dc->last_read = KEY_OFFSET(&w->key);
@@ -401,6 +401,18 @@ static int bch_writeback_thread(void *arg)
while (!kthread_should_stop()) {
down_write(&dc->writeback_lock);
+ if (BDEV_CACHE_MODE(&dc->sb) != CACHE_MODE_WRITEBACK) {
+ up_write(&dc->writeback_lock);
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ if (kthread_should_stop())
+ return 0;
+
+ try_to_freeze();
+ schedule_timeout_interruptible(10 * HZ);
+ continue;
+ }
+
So this addition isn't correct - cache mode might've been flipped to
writethrough, but we might still have dirty data we need to flush: that's why
the line below is checking dc->has_dirty.

I don't think your schedule_timeout_interruptible() is correct either, and I'm
not seeing what's wrong with the code right below - where it's doing the
set_current_state() then the schedule() - but from your report of high idle load
average (what about cpu?) it sounds like something is wrong.

Can you try and diagnose further? Also if you want to split the rest of the
changes out into a separate patch I'll definitely take that. Thanks!
Post by Darrick J. Wong
if (!atomic_read(&dc->has_dirty) ||
(!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
!dc->writeback_running)) {
@@ -436,7 +448,7 @@ static int bch_writeback_thread(void *arg)
while (delay &&
!kthread_should_stop() &&
!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags))
- delay = schedule_timeout_uninterruptible(delay);
+ delay = schedule_timeout_interruptible(delay);
}
}
Darrick J. Wong
2014-04-11 01:41:35 UTC
Post by Kent Overstreet
Post by Darrick J. Wong
[...]
So this addition isn't correct - cache mode might've been flipped to
writethrough, but we might still have dirty data we need to flush: that's why
the line below is checking dc->has_dirty.
Good point.
Post by Kent Overstreet
I don't think your schedule_timeout_interruptible() is correct either, and I'm
not seeing what's wrong with the code right below - where it's doing the
set_current_state() then the schedule() - but from your report of high idle load
average (what about cpu?) it sounds like something is wrong.
The load average will settle at around 2.0 or so; powertop reports idleness of
90% or more for all cores, and the disks aren't doing anything either.

On a freshly booted VM, it calls schedule and goes to sleep for quite a while.
As soon as I start running some fs write tests, I see that has_dirty becomes 1.
Once in a while, searched_full_index==1, but RB_EMPTY_ROOT(&dc->writeback_keys.keys)
never becomes true, so has_dirty stays 1. When I kill the tests and go back to idle,
we end up looping in read_dirty for a long time, I guess because ... aha! It's
slowly trickling the dirty data out to the backing device, and cranking up
writeback_rate makes it finish (and go back to that schedule() sleep) faster.

Hmm. I'm wondering about the choice of _interruptible vs.
_uninterruptible -- what are you trying to prevent from happening?

<shrug> will diagnose further (it's dinnertime).
Post by Kent Overstreet
Can you try and diagnose further? Also if you want to split the rest of the
changes out into a separate patch I'll definitely take that. Thanks!
Do you mean the first chunk, which moves the bch_writeback_queue() in super.c?

--D
--
To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Darrick J. Wong
2014-04-12 02:31:01 UTC
Post by Darrick J. Wong
Post by Kent Overstreet
[...]
Post by Kent Overstreet
I don't think your schedule_timeout_interruptible() is correct either, and I'm
not seeing what's wrong with the code right below - where it's doing the
set_current_state() then the schedule() - but from your report of high idle load
average (what about cpu?) it sounds like something is wrong.
The load average will settle at around 2.0 or so; powertop reports idleness of
90% or more for all cores, and the disks aren't doing anything either.
On a freshly booted VM, it calls schedule and goes to sleep for quite a while.
As soon as I start running some fs write tests, I see that has_dirty becomes 1.
Once in a while, searched_full_index==1, but RB_EMPTY_ROOT(&dc->writeback_keys.keys)
never becomes true, so has_dirty stays 1. When I kill the tests and go back to idle,
we end up looping in read_dirty for a long time, I guess because ... aha! It's
slowly trickling the dirty data out to the backing device, and cranking up
writeback_rate makes it finish (and go back to that schedule() sleep) faster.
Ok, I have a little more to share about this -- the PD controller in charge of
WB tries to maintain (by default) 10% of the cache as dirty. On my relatively
idle laptop, it is frequently the case that < 10% of the cache is dirty. When
this happens, the writeback_rate falls to 512b/s. Meanwhile, read_dirty writes
out the dirty data at this relatively slow rate, apparently using
schedule_timeout_uninterruptible to throttle the writeout.

There's enough activity on my laptop to ensure that there's always _some_ dirty
data, hence the writeback thread is constantly in uninterruptible sleep. I
could probably set writeback_percent = 0 to mitigate this since in my case I
probably want as little dirty data as possible, but ... ugh.
Post by Darrick J. Wong
Hmm. I'm wondering about the choice of _interruptible vs.
_uninterruptible -- what are you trying to prevent from happening?
I changed it to _interruptible, and so far I haven't seen any problems running
xfstests.
Post by Darrick J. Wong
Post by Kent Overstreet
Can you try and diagnose further? Also if you want to split the rest of the
changes out into a separate patch I'll definitely take that. Thanks!
Do you mean the first chunk, which moves the bch_writeback_queue() in super.c?
Assuming you do, I'll issue a patch shortly.

--D
Darrick J. Wong
2014-04-12 02:33:25 UTC
Change bch_cached_dev_attach() to always wake up the writeback thread,
because the newly created wb thread remains in uninterruptible sleep
state until something explicitly wakes it up. This wakeup allows the
thread to call bch_writeback_thread(), whereupon it will most likely
end up in interruptible sleep. In theory we could just let the first
write take care of this, but there's really no reason not to do the
transition quickly.

Signed-off-by: Darrick J. Wong <***@oracle.com>
---
drivers/md/bcache/super.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 24a3a15..3ffe970 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1048,8 +1048,8 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c)
bch_sectors_dirty_init(dc);
atomic_set(&dc->has_dirty, 1);
atomic_inc(&dc->count);
- bch_writeback_queue(dc);
}
+ bch_writeback_queue(dc);

bch_cached_dev_run(dc);
bcache_device_link(&dc->disk, c, "bdev");
Daniel Smedegaard Buus
2014-04-30 11:51:14 UTC
Hi :)

What's the state of this?

I have this issue on a number of EC2 instances that are using bcache
to speed up EBS access by caching on the ephemeral SSDs (writethrough
mode).

"echo 0 > /sys/block/bcache0/bcache/writeback_running" as suggested by
others stops the flood of "blocked" messages in dmesg, but it only
partly fixes the 2.0 minimum load average issue (instead, it's now a
1.0 minimum load average).

I just tried the latest kernel I could get my hands on for Ubuntu
Trusty, the mainline v3.15-rc2 (built April 20th), and it exhibits the
same behavior.

Thanks :)

Daniel
Darrick J. Wong
2014-04-30 17:24:55 UTC
Post by Daniel Smedegaard Buus
Hi :)
What's the state of this?
I have this issue on a number of EC2 instances that are using bcache
to speed up EBS access by caching on the ephemeral SSDs (writethrough
mode).
"echo 0 > /sys/block/bcache0/bcache/writeback_running" as suggested by
others stops the flood of "blocked" messages in dmesg, but it only
partly fixes the 2.0 minimum load average issue (instead, it's now a
1.0 minimum load average).
I haven't spent time on figuring out the other source of load average. Kent
didn't seem to like the patch to convert the bcache_writeback thread to
interruptible sleep (I recall he said it was 'wrong', but didn't elaborate).

--D
Post by Daniel Smedegaard Buus
I just tried the latest kernel I could get my hands on for Ubuntu
Trusty, the mainline v3.15-rc2 (built April 20th), and it exhibits the
same behavior.
Thanks :)
Daniel
Daniel Smedegaard Buus
2014-05-01 09:38:48 UTC
On Wed, Apr 30, 2014 at 7:24 PM, Darrick J. Wong
Post by Darrick J. Wong
I haven't spent time on figuring out the other source of load average. Kent
didn't seem to like the patch to convert the bcache_writeback thread to
interruptible sleep (I recall he said it was 'wrong', but didn't elaborate).
Sorry to hear that... Would be really nice to be able to go back to
normal load. And I cannot revert to an older kernel, as I need
3.15-rc2 or greater to fix a different problem concerning Oracle Java
:/
Slava Pestov
2014-05-01 21:54:29 UTC
Hi Daniel and Darrick,

I mailed a patch that attempts to fix the uninterruptible issue while
taking Kent's feedback regarding your earlier patch into account.
Please test it out and let me know what you think.

Daniel J Blueman
2014-05-20 07:07:19 UTC
Post by Slava Pestov
[...]
Hi Daniel and Darrick,
I mailed a patch that attempts to fix the uninterruptible issue while
taking Kent's feedback regarding your earlier patch into account.
Please test it out and let me know what you think.
Hi Slava,

Apologies for the delays. I rebuilt 3.14.4 with your bcache patch [1]
and the 'bcache_writeback blocked for more than 120 seconds' messages
no longer occur, though when the bcache threads are torn down during
reboot, we crash [2] at:

static void cached_dev_free(struct closure *cl)
{
struct cached_dev *dc = container_of(cl, struct cached_dev, disk.cl);

cancel_delayed_work_sync(&dc->writeback_rate_update);
kthread_stop(dc->writeback_thread);

dc->writeback_thread is clearly zero, as likely the struct cached_dev
was freed already.

Many thanks,
Daniel

[1] http://www.spinics.net/lists/linux-bcache/msg02464.html

-- [2]

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffff8108bff6>] kthread_stop+0x16/0xe0
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: microcode(F) nfsd(F) nfs_acl(F) auth_rpcgss(F)
nfs(F) fscache(F) lockd(F) sunrpc(F) joydev(F) ipmi_si(F)
ipmi_msghandler(F) psmouse(F) serio_raw(F) video(F) mac_hid(F)
lpc_ich(F) lp(F) parport(F) btrfs(F) raid6_pq(F) bcache(F) xor(F)
hid_generic(F) usbhid(F) hid(F) e1000e(F) ptp(F) pps_core(F) ahci(F)
CPU: 2 PID: 27 Comm: kworker/2:0 Tainted: GF 3.14.4-bcachefix+ #3
Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.0a 06/08/2012
Workqueue: events cached_dev_free [bcache]
task: ffff8804097363c0 ti: ffff8804097d2000 task.ti: ffff8804097d2000
RIP: 0010:[<ffffffff8108bff6>] [<ffffffff8108bff6>] kthread_stop+0x16/0xe0
RSP: 0018:ffff8804097d3df0 EFLAGS: 00010292
RAX: 0000000fffffff00 RBX: ffff8804034e0010 RCX: 000000007fffffff
RDX: 0000000000000296 RSI: 000000007fffffff RDI: 0000000000000000
RBP: ffff8804097d3e08 R08: 20100d3800400000 R09: 0080000000000000
R10: dfef7acc030e0010 R11: 0000000000000400 R12: 0000000000000000
R13: ffff8804034e0010 R14: 0000000000000000 R15: 0000000000000080
FS: 0000000000000000(0000) GS:ffff88041fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 0000000001c0e000 CR4: 00000000000407e0
Stack:
ffff8804034e0010 ffff88041fd13d80 ffff8804034e0010 ffff8804097d3e20
ffffffffa00b2dd5 ffff880409503580 ffff8804097d3e68 ffffffff81084482
000000001fd13d98 ffff88041fd17f00 ffff88041fd13d98 ffff8804095035b0
Call Trace:
[<ffffffffa00b2dd5>] cached_dev_free+0x25/0x100 [bcache]
[<ffffffff81084482>] process_one_work+0x182/0x450
[<ffffffff81085241>] worker_thread+0x121/0x410
[<ffffffff81085120>] ? rescuer_thread+0x3e0/0x3e0
[<ffffffff8108bdc2>] kthread+0xd2/0xf0
[<ffffffff8108bcf0>] ? kthread_create_on_node+0x190/0x190
[<ffffffff8173c67c>] ret_from_fork+0x7c/0xb0
[<ffffffff8108bcf0>] ? kthread_create_on_node+0x190/0x190
Code: e8 20 ff ff ff 48 89 df be 00 02 00 00 e8 63 10 01 00 5b 5d c3
66 66 66 66 90 55 48 89 e5 41 55 41 54 49 89 fc 53 66 66 66 66 90 <f0>
41 ff 44 24 10 49 8b 9c 24 a0 04 00 00 48 85 db 74 21 f0 80
RIP [<ffffffff8108bff6>] kthread_stop+0x16/0xe0
RSP <ffff8804097d3df0>
CR2: 0000000000000010
--
Daniel J Blueman
Daniel Smedegaard Buus
2014-07-09 10:27:51 UTC
Post by Slava Pestov
Hi Daniel and Darrick,
I mailed a patch that attempts to fix the uninterruptible issue while taking
Kent's feedback regarding your earlier patch into account. Please test it
out and let me know what you think.
Curious about what ever happened to this bug. This patch fixed the
load bug, but with new problems added...

Did anyone ever fix this properly? Or is it being attended to?

Thanks, and thanks for the great work so far :)

Daniel
Peter Kieser
2014-07-09 15:53:46 UTC
Post by Daniel Smedegaard Buus
Post by Slava Pestov
Hi Daniel and Darrick,
I mailed a patch that attempts to fix the uninterruptible issue while taking
Kent's feedback regarding your earlier patch into account. Please test it
out and let me know what you think.
Curious about what ever happened to this bug. This patch fixed the
load bug, but with new problems added...
What other bugs did it introduce?

-Peter
Daniel Smedegaard Buus
2014-07-09 18:02:38 UTC
Post by Peter Kieser
What other bugs did it introduce?
Well, the one mentioned in this thread, and the one where losing (or
disabling? can't remember which) the cache device leaves you with a
system that won't boot (can't find that message ATM).
Slava Pestov
2014-07-09 18:19:49 UTC
I sent two patches to fix the writeback thread -- if you apply both of
them, everything should work fine.

Daniel Smedegaard Buus
2014-07-09 19:18:50 UTC
Post by Slava Pestov
I sent two patches to fix the writeback thread -- if you apply both of
them, everything should work fine.
That's fantastic! :D Where is the right place to look for these patches?

Thanks!

Daniel

Francis Moreau
2014-04-11 07:26:45 UTC
Post by Kent Overstreet
Post by Darrick J. Wong
[...]
So this addition isn't correct - cache mode might've been flipped to
writethrough, but we might still have dirty data we need to flush: that's why
the line below is checking dc->has_dirty.
I don't think your schedule_timeout_interruptible() is correct either, and I'm
not seeing what's wrong with the code right below - where it's doing the
set_current_state() then the schedule() - but from your report of high idle load
average (what about cpu?) it sounds like something is wrong.
Could you then try to explain why people are currently facing this
issue (with 3.13 and 3.14 at least):

[Apr11 09:15] INFO: task bcache_writebac:166 blocked for more than 120
seconds.
[ +0.000009] Not tainted 3.14.0-4-ARCH #1
[ +0.000002] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ +0.000003] bcache_writebac D 0000000000000000 0 166 2
0x00000000
[ +0.000007] ffff880405223eb8 0000000000000046 ffff8800d8939d70
ffff880405223fd8
[ +0.000006] 00000000000142c0 00000000000142c0 ffff8800d8939d70
ffff880405223e38
[ +0.000004] ffffffff81097c1a 0000000000000000 0000000000000000
0000000000000046
[ +0.000005] Call Trace:
[ +0.000015] [<ffffffff81097c1a>] ? try_to_wake_up+0x1fa/0x2c0
[ +0.000005] [<ffffffff81097d32>] ? default_wake_function+0x12/0x20
[ +0.000007] [<ffffffff810a9b48>] ? __wake_up_common+0x58/0x90
[ +0.000035] [<ffffffffa0428ff0>] ? read_dirty_endio+0x60/0x60 [bcache]
[ +0.000007] [<ffffffff814d7d89>] schedule+0x29/0x70
[ +0.000005] [<ffffffff8108715d>] kthread+0xad/0xf0
[ +0.000005] [<ffffffff810870b0>] ? kthread_create_on_node+0x180/0x180
[ +0.000006] [<ffffffff814e2ebc>] ret_from_fork+0x7c/0xb0
[ +0.000004] [<ffffffff810870b0>] ? kthread_create_on_node+0x180/0x180

This has been reported several times but it seems that you ignored it
each time.

Thanks.