View Issue Details

IDProjectCategoryView StatusLast Update
0000691bareos-core[All Projects] storage daemonpublic2017-06-08 16:31
ReportermedicmomciloAssigned To 
PrioritynormalSeveritymajorReproducibilityrandom
Status resolvedResolutionfixed 
PlatformLinuxOSUbuntuOS Version14.04
Product Version15.2.2 
Target VersionFixed in Version 
Summary0000691: Backup job fails with "No Volume name given."
DescriptionWe have occasional failed backup job due to 'No volume name given.' error.
Job can be manually re-ran and it is successful.

This job is running perfectly fine most of the time.

Digging through the code I found that this error message should only be seen if using AutoChanger, which we're not. From the src/stored/dev.c:

   /*
    * If this is a virtual autochanger (i.e. changer_res != NULL) we simply use
    * the device name, assuming it has been appropriately setup by the "autochanger".
    */
   if (!device->changer_res || device->changer_command[0] == 0) {
      if (VolCatInfo.VolCatName[0] == 0) {
         Mmsg(errmsg, _("Could not open file device %s. No Volume name given.\n"),
            print_name());
         clear_opened();
         return;
      }

      if (!IsPathSeparator(archive_name.c_str()[strlen(archive_name.c_str())-1])) {
         pm_strcat(archive_name, "/");
      }
      pm_strcat(archive_name, getVolCatName());
}

Do let me know what additional information if any, you require.
Steps To ReproduceUnfortunately this is random occurance.
TagsNo tags attached.
bareos-master: impactyes
bareos-master: actionwill care
bareos-18.2: impact
bareos-18.2: action
bareos-17.2: impact
bareos-17.2: action
bareos-16.2: impactyes
bareos-16.2: actionwill care
bareos-15.2: impactyes
bareos-15.2: action
bareos-14.2: impact
bareos-14.2: actionnone
bareos-13.2: impact
bareos-13.2: actionnone
bareos-12.4: impact
bareos-12.4: actionnone

Relationships

related to 0000647 closed Automatic mount of disk volumes intermittently fails 
child of 0000794 closedstephand Release bareos-16.2.6 
child of 0000798 assignedmaik Release bareos-15.2.5 

Activities

geovariances

geovariances

2016-09-21 10:37

reporter   ~0002356

Hello,

We encounter the same error randomly with the same product version (and OS version).
As for medicmomcilo, we can send additional information and logs.

Thank you,

Mickaël
joergs

joergs

2016-09-22 17:16

administrator   ~0002357

So what backends are you using? A single tape or disk? (or some other backend).
geovariances

geovariances

2016-09-22 18:27

reporter   ~0002358

Hello,

We use disks: several pools containing multiple media stored on disk.
ianbrown78

ianbrown78

2016-09-23 03:48

reporter   ~0002359

Hi,

We are encountering the same issue.
We are running CentOS 7.2.1511.
uname -a reports: Linux 3.10.0-327.22.2.el7.x86_64 0000001 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
We also, are only using disk backends.

Cheers,
Ian
medicmomcilo

medicmomcilo

2016-09-25 12:19

reporter   ~0002369

Hi,

We have multiple disk devices defined and that is all.
However, we don't use autochanger device.

Kind regards,
Momcilo.
ianbrown78

ianbrown78

2016-12-05 05:07

reporter   ~0002462

Hi,

Just upgraded to 16.2.4.
We are still getting the same error on random jobs.

What more information do you require to get this fixed?

Cheers,
Ian
joergs

joergs

2016-12-12 18:41

administrator   ~0002467

Confirmed, as several environments are affected by this problem,
joergs

joergs

2016-12-12 18:50

administrator   ~0002468

> What more information do you require to get this fixed?

Well, a setup to reproduce the problem would be perfect. Not being able to reproduce the problem makes it hard to fix it.

Maybe a debug trace might bring some insight. You can enable the debug log by the bconsole command:
setdebug level=100 trace=1 dir

Also director configuration and some information about the environment would be helpful.
nrobert13

nrobert13

2016-12-15 15:32

reporter   ~0002471

Hi,

I'm also getting these errors, please see my comments on the users group, maybe the info provided helps.
https://groups.google.com/forum/?fromgroups=#!topic/bareos-users/Nbv-_fRWxaY

I have debugging enabled for dir and sd, but it's kind of difficult to identify the related sections, because there's no timestamp in the tracefiles.

I assume the number after filename:xxx- is the jobid, and I can tell you that, one of my jobs hich failed had 0 bytes, and even though failed. I cannot find the jobid at all in the sd trace, but there are some lines in the dir trace.

let me try to upload the traces.

thanks,
Robert
nrobert13

nrobert13

2016-12-15 15:43

reporter  

bareos-dir.trace (40,333 bytes)
nrobert13

nrobert13

2016-12-15 15:44

reporter  

failed-job.log (3,258 bytes)
joergs

joergs

2016-12-15 15:45

administrator   ~0002472

The xxx after the filename is the line number in the source code.
nrobert13

nrobert13

2016-12-15 15:48

reporter   ~0002473

Last edited: 2016-12-15 15:49

View 2 revisions

I see, and after that comes the job-id, right?

would you please remove the trace file from this ticket?

thank you.

nrobert13

nrobert13

2016-12-16 08:47

reporter   ~0002476

hi,

I'm also getting messages like:

16-Dec 00:37 muc1pro-backup-1-sd JobId 8102: Job tw1kc0624.2016-12-16_00.13.09_12 is waiting. Cannot find any appendable volumes.
Please use the "label" command to create a new Volume for:
    Storage: "DiskDeviceIncr2" (/data/bareos-storage/Incr)
    Pool: DiskDaily
    Media type: FileIncr

although the pool limit is 600 Volumes and there are currently only 310 in use, and plenty of diskspace. If I try to run a single job, which requires new volumes to be labeled, it seem to work, but when the nightly schedule runs ( 5 in parallel) randomly throws either "No volume name given" or "Cannot find any appendable volumes"

cheers,
Robert
ianbrown78

ianbrown78

2016-12-19 01:27

reporter   ~0002477

Hi joergs,
Our setup is as follows:
Dedicated physical server is a Dell PowerEdge R710:
 - 2 x Intel Xeon X5650 Quad Core with Hyper Threading enabled,
 - 64GB RAM,
 - 6 x 900GB SAS in Raid5 (50GB /, 4.45TB /backup)
 - additional 6TB SAS via iSCSI (Dell Equallogic SANs)
 - 4 x 1GbE NICs connected to iSCSI VLAN in team config (round robin).
 - 4 x 1GbE NICs connected to Prod VLAN in team config (round robin).

Running CentOS 7.2.1511
Was running Bareos 15.2 up until 1 month ago.
Now running Bareos 16.2.4 - Upgraded and rejigged config files to match the new setup.
Currently using MariaDB, but in the process of migrating to PGSQL.
External storage node running in Google Cloud for Full & Diff migrations.

Problem has persisted across versions. However this morning, I have just noted that all hourly incrementals were failing. I restarted the bareos-dir service. Now the subsequent jobs are successful.

Attaching Bareos configuration as backup.tar.gz
Once the jobs start failing again, I will enable debugging and rerun a job.

Cheers,
Ian
ianbrown78

ianbrown78

2016-12-19 01:27

reporter  

backup.tar.gz (13,401 bytes)
nrobert13

nrobert13

2017-01-04 16:34

reporter   ~0002497

hi all,

I took a look into the trace file of the director, and seems the the problem is in the volume selection. I have defined a storage with 5 different devices pointing to the same folder. the storage with max 5 jobs and each device with max 1 job.

looking at the trace file, here's the SELECT count for each job to find the volume:

JobID : SELECT COUONT
10472 :10
10473 :121
10474 :189
10475 :291
10476 :189

used this command:
for i in `seq 2 6` ; do echo -n 1047$i : ; grep -- "-1047$i" muc1pro-backup-1-dir.trace | egrep "SELECT" | wc -l ; done


for the first job after the last select I see:
Find_next_vol MediaId=3735

MediaId=3735 means volume=Incr-3735

in all the other 4 jobs, after some selects, the same volume is found but then comes a "unwanted_volumes=Incr-3735" and continues to run other selects.
after the last select I see : Find_next_vol MediaId=0 , but there's no volume with mediaid = 0 , and most likely therefore it throws a no volume name given...

here you can see the selects and find_next_vol records in the trace:

grep -- "-10472" muc1pro-backup-1-dir.trace | egrep --color=always "SELECT|Find_next_vol|unwanted" | less -r

replace 10472 in the grep with other jobid's from above.

attaching the trace file. see the attached affected jobid's above.

bareos-15.2.3-trace-centos6.tar.gz

Thanks,
Robert
nrobert13

nrobert13

2017-01-04 16:34

reporter  

bareos-15.2.3-trace-centos6.tar.gz (40,117 bytes)
DocFraggle

DocFraggle

2017-03-01 17:59

reporter   ~0002586

Hi all,

we hit the same issue from time to time. CentOS 6, Bareos 15.2.2-37.1

No trace, only logs:

01-Mär 17:38 bareos-dir JobId 439358: console command: run BeforeJob "prune files client=clientname yes"
01-Mär 17:38 bareos-dir JobId 439358: Start Backup JobId 439358, Job=jobname.2017-03-01_17.38.33_29
01-Mär 17:38 bareos-dir JobId 439358: Using Device "FileStorageName0" to write.
01-Mär 17:39 bareos-dir JobId 439358: Fatal error: Catalog error creating JobMedia record. sql_create.c:153 Update Media record UPDATE Media SET EndFile=0, EndBlock=0 WHERE MediaId=0 failed: ERR=
01-Mär 17:39 bareos-sd JobId 439358: Fatal error: Error creating JobMedia record: 1992 Create JobMedia error
01-Mär 17:39 clientname JobId 439358: Fatal error: dir_cmd.c:2400 Bad response to Append Data command. Wanted 3000 OK data
01-Mär 17:39 bareos-sd JobId 439358: Marking Volume "VOLUME41965" in Error in Catalog.
01-Mär 17:39 bareos-sd JobId 439358: Fatal error: askdir.c:357 NULL Volume name. This shouldn't happen!!!
01-Mär 17:39 bareos-sd JobId 439358: Fatal error: Job 439358 canceled.

Regards, Christian
daftu

daftu

2017-05-21 13:00

reporter   ~0002643

Hi there.
Sam issue, pretty sure it's because of multiple devices defined (needed to forever incremental).
medicmomcilo

medicmomcilo

2017-06-08 15:17

reporter   ~0002648

I think I managed to track down properly this bug.
It seems to be introduced by a large fix for split brain problem[1].

Looks like there is a race condition for volumes on backup start.
This causes a lot of volumes to be 'reserved' and marked as 'unwanted' for other backups.
If a request is made for a volume that is 'unwanted' then a request should be repeated.
However, only case where this is properly handled is if you have ran out of all volumes and in need to purge/recycle oldest.
Only in such case it would retry on failure to get volume.

I am attaching proposed patch which would retry on every failure.
Patch is minimal and easy to review before you decide on implementing or not.

This is tested in small environment and proved to work fine.
Couldn't confirm in our production due to build limitations we have.

[1] https://github.com/bareos/bareos/commit/fa7a54a3b19f2f003a076b36532e277d98aaa812

Kind regards,
Momo.
medicmomcilo

medicmomcilo

2017-06-08 15:17

reporter  

0001-Fixes-bug-0000691-No-volume-name-given.patch (617 bytes)
From eb9e6b9f95c04fcedd04f9e76b5694a277f48dc7 Mon Sep 17 00:00:00 2001
From: Momcilo Medic <medicmomcilo@gmail.com>
Date: Thu, 8 Jun 2017 15:10:17 +0200
Subject: [PATCH] Fixes bug 0000691 - 'No volume name given'

---
 src/cats/sql_find.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/cats/sql_find.c b/src/cats/sql_find.c
index 77e2106f..3687dc1f 100644
--- a/src/cats/sql_find.c
+++ b/src/cats/sql_find.c
@@ -493,7 +493,7 @@ retry_fetch:
       break;
    }
 
-   if (!found_candidate && find_oldest) {
+   if (!found_candidate) {
       item++;
       goto retry_fetch;
    }
-- 
2.13.0

joergs

joergs

2017-06-08 15:21

administrator   ~0002650

Fix committed to bareos bareos-16.2 branch with changesetid 7019.
pstorz

pstorz

2017-06-08 15:23

administrator   ~0002653

Fix committed to bareos bareos-15.2 branch with changesetid 7026.
joergs

joergs

2017-06-08 16:31

administrator   ~0002662

@Momo: thank you for your patch. We analysed this problem in a customer environment, and think the patch Philipp did make, solved the problem. It just did take longer, until we pushed it to github.

Related Changesets

bareos: bareos-16.2 d1ff29b3

2017-04-25 10:50:44

pstorz


Committer: joergs

Ported: N/A

Details Diff
Don't return empty volname if volume is on unwanted vols list

When the volume was on the unwanted_list, the loop was
continued but the number of sql rows was not checked, so that
the code continued and returned an empty volumename.

Fixes 0000691: Backup job fails with "No Volume name given."
Affected Issues
0000691
mod - src/cats/sql_find.c Diff File

bareos: bareos-15.2 627d8d7e

2017-04-25 10:50:44

pstorz

Ported: N/A

Details Diff
Don't return empty volname if volume is on unwanted vols list

When the volume was on the unwanted_list, the loop was
continued but the number of sql rows was not checked, so that
the code continued and returned an empty volumename.

Fixes 0000691: Backup job fails with "No Volume name given."
Affected Issues
0000691
mod - src/cats/sql_find.c Diff File

Issue History

Date Modified Username Field Change
2016-08-24 10:31 medicmomcilo New Issue
2016-09-21 10:37 geovariances Note Added: 0002356
2016-09-22 17:16 joergs Note Added: 0002357
2016-09-22 18:27 geovariances Note Added: 0002358
2016-09-23 03:48 ianbrown78 Note Added: 0002359
2016-09-25 12:19 medicmomcilo Note Added: 0002369
2016-10-13 10:51 joergs Relationship added related to 0000641
2016-10-13 10:52 joergs Relationship deleted related to 0000641
2016-10-13 10:53 joergs Relationship added related to 0000647
2016-12-05 05:07 ianbrown78 Note Added: 0002462
2016-12-12 18:41 joergs Note Added: 0002467
2016-12-12 18:41 joergs Status new => confirmed
2016-12-12 18:44 joergs bareos-master: impact => yes
2016-12-12 18:44 joergs bareos-master: action => will care
2016-12-12 18:44 joergs bareos-16.2: impact => yes
2016-12-12 18:44 joergs bareos-16.2: action => will care
2016-12-12 18:44 joergs bareos-15.2: impact => yes
2016-12-12 18:44 joergs bareos-14.2: action => none
2016-12-12 18:44 joergs bareos-13.2: action => none
2016-12-12 18:44 joergs bareos-12.4: action => none
2016-12-12 18:50 joergs Note Added: 0002468
2016-12-15 15:32 nrobert13 Note Added: 0002471
2016-12-15 15:43 nrobert13 File Added: bareos-dir.trace
2016-12-15 15:44 nrobert13 File Added: failed-job.log
2016-12-15 15:45 joergs Note Added: 0002472
2016-12-15 15:48 nrobert13 Note Added: 0002473
2016-12-15 15:49 nrobert13 Note Edited: 0002473 View Revisions
2016-12-16 08:47 nrobert13 Note Added: 0002476
2016-12-19 01:27 ianbrown78 Note Added: 0002477
2016-12-19 01:27 ianbrown78 File Added: backup.tar.gz
2017-01-04 16:34 nrobert13 Note Added: 0002497
2017-01-04 16:34 nrobert13 File Added: bareos-15.2.3-trace-centos6.tar.gz
2017-03-01 17:59 DocFraggle Note Added: 0002586
2017-05-21 13:00 daftu Note Added: 0002643
2017-06-08 12:43 joergs Relationship added child of 0000794
2017-06-08 12:44 joergs Relationship added child of 0000798
2017-06-08 15:17 medicmomcilo Note Added: 0002648
2017-06-08 15:17 medicmomcilo File Added: 0001-Fixes-bug-0000691-No-volume-name-given.patch
2017-06-08 15:21 joergs Changeset attached => bareos bareos-16.2 d1ff29b3
2017-06-08 15:21 joergs Note Added: 0002650
2017-06-08 15:21 joergs Status confirmed => resolved
2017-06-08 15:21 joergs Resolution open => fixed
2017-06-08 15:23 pstorz Changeset attached => bareos bareos-15.2 627d8d7e
2017-06-08 15:23 pstorz Note Added: 0002653
2017-06-08 16:31 joergs Note Added: 0002662