View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0001316 | bareos-core | storage daemon | public | 2021-01-30 10:01 | 2021-12-21 13:57 |
Reporter | kardel | Assigned To | franku | ||
Priority | high | Severity | major | Reproducibility | always |
Status | resolved | Resolution | fixed | ||
Product Version | 20.0.0 | ||||
Summary | 0001316: storage daemon loses a configured device instance causing major confusion in device handling | ||||
Description | After start "status storage=<name>" show the device as not open or with it parameters - that is expected. After the first backup with spooling "status storage=<name>" shows the device as "not open or does not exist" - that is a hint => the the configured "device_resource->dev" value is nullptr. The follow up effects are that the reservation code is unable to match the same active device and the same volume in all cases. When the match fails (log shows "<name> (/dev/<tapename>) and "<name> (/dev/<tapename>) " with no differences) it attempts to allocate new volumes possibly with operator intervention even though the expected volume is available in the drive. The root cause is a temporary device created in spool.cc::295 => auto rdev(std::make_unique<SpoolDevice>()); Line 302 sets device resource rdev->device_resource = dcr->dev->device_resource; When rdev leaves scope the Device::~Device() Dtor is called which happily sets this.device_resource->dev = nullptr in dev.cc:1281 if (device_resource) { device_resource->dev = nullptr; } (=> potential memory leak) At this point the configured device_resource is lost (even though it might still be known by active volume reservations). After that the reservation code is completely confused due to new default allocations of devices (see additional info). A fix is provided as patch against 20.0.0. It only clears this.device_resource->dev when this.device_resource->dev references this instance. | ||||
Steps To Reproduce | start bareos system. observe "status storage=..." run a spooling job observer "status storage=..." If you want to see the confusion it involves a more elaborate test setup with multipe jobs where a spooling job finishes before another job for the same volume and device begins to run. | ||||
Additional Information | It might be worthwhile to check the validity of creating a device in dir_cmd.cc:932. During testing a difference of device pointer was seen in vol_mgr.cc:916 although the device parameter where the same. This is most likely cause by Device::this.device_resource->dev being a nullptr and the device creation in dir_cmd.cc:932. The normal expected lifetime of a device is from reading the configuration until the program termination. Autochanger support might change that rule though - I didn't analyze that far. | ||||
Tags | No tags attached. | ||||
dev.cc.patch (568 bytes)
--- core/src/stored/dev.cc.orig 2020-12-16 08:46:16.000000000 +0100 +++ core/src/stored/dev.cc 2021-01-30 09:41:07.065069696 +0100 @@ -1278,7 +1278,11 @@ pthread_mutex_destroy(&spool_mutex); // RwlDestroy(&lock); attached_dcrs.clear(); - if (device_resource) { device_resource->dev = nullptr; } + // drop device_resource link only if it references us + if (device_resource && device_resource->dev == this) { + Dmsg1(900, "term dev: link from device_resource cleared\n"); + device_resource->dev = nullptr; + } } bool Device::CanStealLock() const |
|
Thank you for your deep analysis and the proposed fix which solves the issue. See github PR https://github.com/bareos/bareos/pull/724/commits for more information on the fix and systemtests (which is draft at the time of adding this note). |
|
Experimental binaries with the proposed bugfix can be found here: http://download.bareos.org/bareos/experimental/CD/PR-724/ | |
Fix committed to bareos master branch with changesetid 14543. | |
bareos: master d2f8f6ef 2021-02-24 13:57 Ported: N/A Details Diff |
stored: fix spooling sideffect bug Fixes 0001316: stored loses a configured device instance The Device reference in the DeviceResource could become invalid due to a missing comparison in the Device destructor. This bug only occurs when spooling is turned on and has side effects on the volume management and the reservation. |
Affected Issues 0001316 |
|
mod - CHANGELOG.md | Diff File | ||
mod - core/src/stored/dev.cc | Diff File | ||
mod - core/src/stored/vol_mgr.cc | Diff File |
Date Modified | Username | Field | Change |
---|---|---|---|
2021-01-30 10:01 | kardel | New Issue | |
2021-01-30 10:01 | kardel | File Added: dev.cc.patch | |
2021-02-12 12:10 | franku | Assigned To | => franku |
2021-02-12 12:10 | franku | Status | new => resolved |
2021-02-12 12:10 | franku | Resolution | open => fixed |
2021-02-12 12:15 | franku | Note Added: 0004088 | |
2021-02-15 11:38 | franku | Note Added: 0004089 | |
2021-02-22 11:37 | franku | Summary | stored loses a configured device instance caused major confusion in device handling => stored loses a configured device instance causing major confusion in device handling |
2021-02-24 11:38 | franku | Summary | stored loses a configured device instance causing major confusion in device handling => storage daemon loses a configured device instance causing major confusion in device handling |
2021-02-24 13:22 | franku | Changeset attached | => bareos master d2f8f6ef |
2021-02-24 13:22 | franku | Note Added: 0004091 | |
2021-12-21 13:57 | arogge | Relationship added | related to 0001289 |