View Issue Details

IDProjectCategoryView StatusLast Update
0001316bareos-corestorage daemonpublic2021-12-21 13:57
Reporterkardel Assigned Tofranku  
PriorityhighSeveritymajorReproducibilityalways
Status resolvedResolutionfixed 
Product Version20.0.0 
Summary0001316: storage daemon loses a configured device instance causing major confusion in device handling
DescriptionAfter start "status storage=<name>" show the device as not open or with it parameters - that is expected.

After the first backup with spooling "status storage=<name>" shows the device as "not open or does not exist" - that is a hint
=> the the configured "device_resource->dev" value is nullptr.

The follow up effects are that the reservation code is unable to match the same active device and the same volume in all cases.
When the match fails (log shows "<name> (/dev/<tapename>) and "<name> (/dev/<tapename>) " with no differences) it attempts to allocate new volumes possibly with operator intervention even though the expected volume is available in the drive.

The root cause is a temporary device created in spool.cc::295 => auto rdev(std::make_unique<SpoolDevice>());
Line 302 sets device resource rdev->device_resource = dcr->dev->device_resource;
When rdev leaves scope the Device::~Device() Dtor is called which happily sets this.device_resource->dev = nullptr in
dev.cc:1281 if (device_resource) { device_resource->dev = nullptr; } (=> potential memory leak)

At this point the configured device_resource is lost (even though it might still be known by active volume reservations).
After that the reservation code is completely confused due to new default allocations of devices (see additional info).

A fix is provided as patch against 20.0.0. It only clears this.device_resource->dev when
this.device_resource->dev references this instance.
Steps To Reproducestart bareos system.
observe "status storage=..."
run a spooling job
observer "status storage=..."

If you want to see the confusion it involves a more elaborate test setup with multipe jobs where a spooling job finishes before
another job for the same volume and device begins to run.
Additional InformationIt might be worthwhile to check the validity of creating a device in dir_cmd.cc:932. During testing
a difference of device pointer was seen in vol_mgr.cc:916 although the device parameter where the same.
This is most likely cause by Device::this.device_resource->dev being a nullptr and the device creation
in dir_cmd.cc:932. The normal expected lifetime of a device is from reading the configuration until the
program termination. Autochanger support might change that rule though - I didn't analyze that far.
TagsNo tags attached.

Relationships

related to 0001289 closedarogge Release Bareos 21.0.0 

Activities

kardel

kardel

2021-01-30 10:01

reporter  

dev.cc.patch (568 bytes)   
--- core/src/stored/dev.cc.orig	2020-12-16 08:46:16.000000000 +0100
+++ core/src/stored/dev.cc	2021-01-30 09:41:07.065069696 +0100
@@ -1278,7 +1278,11 @@
   pthread_mutex_destroy(&spool_mutex);
   // RwlDestroy(&lock);
   attached_dcrs.clear();
-  if (device_resource) { device_resource->dev = nullptr; }
+  // drop device_resource link only if it references us
+  if (device_resource && device_resource->dev == this) {
+    Dmsg1(900, "term dev: link from device_resource cleared\n");
+    device_resource->dev = nullptr;
+  }
 }
 
 bool Device::CanStealLock() const
dev.cc.patch (568 bytes)   
franku

franku

2021-02-12 12:15

administrator   ~0004088

Thank you for your deep analysis and the proposed fix which solves the issue.

See github PR https://github.com/bareos/bareos/pull/724/commits for more information on the fix and systemtests (which is draft at the time of adding this note).
franku

franku

2021-02-15 11:38

administrator   ~0004089

Experimental binaries with the proposed bugfix can be found here: http://download.bareos.org/bareos/experimental/CD/PR-724/
franku

franku

2021-02-24 13:22

administrator   ~0004091

Fix committed to bareos master branch with changesetid 14543.

Related Changesets

bareos: master d2f8f6ef

2021-02-24 13:57

franku

Ported: N/A

Details Diff
stored: fix spooling sideffect bug

Fixes 0001316: stored loses a configured device instance

The Device reference in the DeviceResource could become invalid
due to a missing comparison in the Device destructor. This bug
only occurs when spooling is turned on and has side effects on
the volume management and the reservation.
Affected Issues
0001316
mod - CHANGELOG.md Diff File
mod - core/src/stored/dev.cc Diff File
mod - core/src/stored/vol_mgr.cc Diff File

Issue History

Date Modified Username Field Change
2021-01-30 10:01 kardel New Issue
2021-01-30 10:01 kardel File Added: dev.cc.patch
2021-02-12 12:10 franku Assigned To => franku
2021-02-12 12:10 franku Status new => resolved
2021-02-12 12:10 franku Resolution open => fixed
2021-02-12 12:15 franku Note Added: 0004088
2021-02-15 11:38 franku Note Added: 0004089
2021-02-22 11:37 franku Summary stored loses a configured device instance caused major confusion in device handling => stored loses a configured device instance causing major confusion in device handling
2021-02-24 11:38 franku Summary stored loses a configured device instance causing major confusion in device handling => storage daemon loses a configured device instance causing major confusion in device handling
2021-02-24 13:22 franku Changeset attached => bareos master d2f8f6ef
2021-02-24 13:22 franku Note Added: 0004091
2021-12-21 13:57 arogge Relationship added related to 0001289