View Issue Details

ID: 0001405
Project: bareos-core
Category: storage daemon
View Status: public
Last Update: 2022-01-09 23:29
Reporter: gusevmk
Assigned To:
Priority: urgent
Severity: major
Reproducibility: sometimes
Status: new
Resolution: open
Platform: Linux
OS: RHEL
OS Version: 8.2
Product Version: 19.2.11
Summary: 0001405: Storage daemon does not answer when there are no available devices.
Description:
UPD: The version is actually 19.2.7.
I run into trouble when there are no more available devices on the storage daemon.
show storage=pgbkps01.mgmt:
Storage {
  Name = "pgbkps01_loop"
  Address = "10.*.*.192"
  Password = "***"
  Device = "local-storage-0","local-storage-1","local-storage-2","local-storage-3"
  MediaType = "File"
  MaximumConcurrentJobs = 4
  SdPort = 9103
}
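For reference, the matching SD-side Device resources are not included in the report; they would look roughly like this (the archive path and the extra directives are assumptions, with one Device resource per name listed in the Director's Storage resource above):

```
Device {
  Name = "local-storage-0"
  Media Type = "File"
  Archive Device = "/var/lib/bareos/storage"  # assumed path
  Label Media = yes
  Random Access = yes
  Automatic Mount = yes
  Removable Media = no
}
# ...and likewise for local-storage-1, -2, -3
```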

There are 4 jobs running at the moment.
When a new job is started, it does not queue but ends with an error:

02-Dec 13:13 bareos-dir JobId 1903: Fatal error: Authorization key rejected bareos-dir.
02-Dec 13:13 bareos-dir JobId 1903: Fatal error: Director unable to authenticate with Storage daemon at "10.*.*.192:9103". Possible causes:
Passwords or names not the same or
TLS negotiation problem or
Maximum Concurrent Jobs exceeded on the SD or
SD networking messed up (restart daemon).

*status storage=pgbkps01_loop
Connecting to Storage daemon pgbkps01_loop at 10.*.*.192:9103
Failed to connect to Storage daemon pgbkps01_loop

Once one of the four jobs ends, there is no problem.

I checked the bareos-sd threads with strace:
ps -eLo pid,tid,ppid,user:11,comm,state,wchan | grep bareos-sd
3256361 3256361 1 bareos bareos-sd S x64_sys_poll
3256361 3256363 1 bareos bareos-sd S -
3256361 1332308 1 bareos bareos-sd S x64_sys_poll
3256361 2252738 1 bareos bareos-sd S x64_sys_poll
3256361 2407428 1 bareos bareos-sd S x64_sys_poll
========================================================================
WHEN THE PROBLEM OCCURS:
strace -p 3256361:
accept(3, {sa_family=AF_INET, sin_port=htons(48954), sin_addr=inet_addr("10.76.74.192")}, [128->16]) = 6
setsockopt(6, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7ff13d75feb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7ff13d7609d0, tls=0x7ff13d760700, child_tidptr=0x7ff13d7609d0) = 2458575
futex(0x2487490, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442869, tv_nsec=440823368}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x2487418, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x24874c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2487440, FUTEX_WAKE_PRIVATE, 1) = 1

strace -p 3256363
strace: Process 3256363 attached
restart_syscall(<... resuming interrupted futex ...>
========================================================================
WHEN THERE IS NO PROBLEM:
strace -p 3256361:
accept(3, {sa_family=AF_INET, sin_port=htons(48954), sin_addr=inet_addr("10.76.74.192")}, [128->16]) = 6
setsockopt(6, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7ff13d75feb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7ff13d7609d0, tls=0x7ff13d760700, child_tidptr=0x7ff13d7609d0) = 2458575
futex(0x2487490, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442869, tv_nsec=440823368}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x2487418, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x24874c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2487440, FUTEX_WAKE_PRIVATE, 1) = 1

strace -p 3256363
strace: Process 3256363 attached
restart_syscall(<... resuming interrupted futex ...>
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce328, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442749, tv_nsec=490329000}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce32c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=745975000}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce328, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=747064000}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce32c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=792002000}, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce328, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=792408000}, FUTEX_BITSET_MATCH_ANY
=============================================

I think there is some inter-process communication issue here.
Tags: No tags attached.

Activities

gusevmk

2021-12-04 06:52

reporter   ~0004379

This problem happens after the director configuration is reloaded. Steps to reproduce:
1. Set Maximum Concurrent Jobs = 4 on the storage
2. Run 4 simultaneous jobs
3. Run one more job - all OK, this job is queued
4. Reload the director configuration (add a new job/schedule)
5. Start one more job - it fails: the job starts and then fails because there are no more available devices / the Maximum Concurrent Jobs limit is exceeded on the storage daemon
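The reload-then-fail sequence above could be reproduced from bconsole roughly like this (the job name is a placeholder, not from the report):

```
run job=pg-backup storage=pgbkps01_loop yes   # repeat 4 times: all device slots busy
run job=pg-backup storage=pgbkps01_loop yes   # 5th job: queued, as expected
reload                                        # reload the director configuration
run job=pg-backup storage=pgbkps01_loop yes   # 6th job: fails with the authorization error
```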
bruno-at-bareos

2021-12-06 14:17

developer   ~0004380

Could you provide your bareos-sd configuration (normally bareos-sd.d/storage/bareos-sd.conf) and also the bareos-dir.d/director/bareos-dir.conf (usually)?

If you have changed the default values, check that they are high enough (here are the defaults):

Maximum Concurrent Jobs (Sd->Storage) = PINT32 20
Maximum Connections (Sd->Storage) = PINT32 42

As you can see, Maximum Connections is higher than Maximum Concurrent Jobs, so that commands (like status etc.) can still reach the SD.
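A minimal sketch of where those two limits live on the SD side, assuming the stock file layout (the resource name is an assumption; the values shown are the defaults):

```
# bareos-sd.d/storage/bareos-sd.conf
Storage {
  Name = "pgbkps01-sd"              # assumed name
  Maximum Concurrent Jobs = 20      # default; concurrent jobs beyond this are refused
  Maximum Connections = 42          # default; kept higher so status commands still get through
}
```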
bruno-at-bareos

2021-12-22 16:35

developer   ~0004408

Any news here?
embareossed

2022-01-03 22:35

reporter   ~0004446

Last edited: 2022-01-09 23:29

[EDIT]: Sorry, I saw the strace output, and it appeared to be the same as my own situation. Trouble is, I am now seeing something similar to the OP's situation, but not quite the same.

I will start a separate ticket for mine. Please disregard my original post here.

Issue History

Date Modified Username Field Change
2021-12-02 12:01 gusevmk New Issue
2021-12-04 06:52 gusevmk Note Added: 0004379
2021-12-06 14:17 bruno-at-bareos Note Added: 0004380
2021-12-06 14:18 bruno-at-bareos Description Updated
2021-12-22 16:35 bruno-at-bareos Note Added: 0004408
2022-01-03 22:35 embareossed Note Added: 0004446
2022-01-03 22:36 embareossed Note Edited: 0004446
2022-01-04 00:01 embareossed Note Edited: 0004446
2022-01-04 00:06 embareossed Note Edited: 0004446
2022-01-04 00:11 embareossed Note Edited: 0004446
2022-01-09 23:29 embareossed Note Edited: 0004446