View Revisions: Issue #1405

Summary 0001405: Storage daemon not answer, when there are no available devices.
Revision 2021-12-06 14:18 by bruno-at-bareos
Description UPD: Version is 19.2.7
I have troubles when no more available devices on storage daemon.
how storage=pgbkps01.mgmt.
Storage {
  Name = "pgbkps01_loop"
  Address = "10.*.*.192"
  Password = "***"
  Device = "local-storage-0","local-storage-1","local-storage-2","local-storage-3"
  MediaType = "File"
  MaximumConcurrentJobs = 4
  SddPort = 9103
}

There are 4 jobs running at the moment.
When starting new job, it does not queue, but ends with error:

02-Dec 13:13 bareos-dir JobId 1903: Fatal error: Authorization key rejected bareos-dir.
02-Dec 13:13 bareos-dir JobId 1903: Fatal error: Director unable to authenticate with Storage daemon at "10.*.*.192:9103". Possible causes:
Passwords or names not the same or
TLS negotiation problem or
Maximum Concurrent Jobs exceeded on the SD or
SD networking messed up (restart daemon).

*status storage=pgbkps01_loop
Connecting to Storage daemon pgbkps01_loop at 10.*.*.192:9103
Failed to connect to Storage daemon pgbkps01_loop

When one of 4 jobs ends, there are no problem.

I did check strace of bareos-sd threads:
ps -eLo pid,tid,ppid,user:11,comm,state,wchan | grep bareos-sd
3256361 3256361 1 bareos bareos-sd S x64_sys_poll
3256361 3256363 1 bareos bareos-sd S -
3256361 1332308 1 bareos bareos-sd S x64_sys_poll
3256361 2252738 1 bareos bareos-sd S x64_sys_poll
3256361 2407428 1 bareos bareos-sd S x64_sys_poll
========================================================================
WHEN IS PROBLEM:
strace -p 3256361:
accept(3, {sa_family=AF_INET, sin_port=htons(48954), sin_addr=inet_addr("10.76.74.192")}, [128->16]) = 6
setsockopt(6, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7ff13d75feb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7ff13d7609d0, tls=0x7ff13d760700, child_tidptr=0x7ff13d7609d0) = 2458575
futex(0x2487490, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442869, tv_nsec=440823368}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x2487418, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x24874c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2487440, FUTEX_WAKE_PRIVATE, 1) = 1

strace -p 3256363
strace: Process 3256363 attached
restart_syscall(<... resuming interrupted futex ...>
========================================================================
WHEN IS NO PROBLEM:
strace -p 3256361:
accept(3, {sa_family=AF_INET, sin_port=htons(48954), sin_addr=inet_addr("10.76.74.192")}, [128->16]) = 6
setsockopt(6, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7ff13d75feb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7ff13d7609d0, tls=0x7ff13d760700, child_tidptr=0x7ff13d7609d0) = 2458575
futex(0x2487490, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442869, tv_nsec=440823368}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x2487418, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x24874c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2487440, FUTEX_WAKE_PRIVATE, 1) = 1

strace -p 3256363
strace: Process 3256363 attached
restart_syscall(<... resuming interrupted futex ...>
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce328, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442749, tv_nsec=490329000}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce32c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=745975000}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce328, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=747064000}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce32c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=792002000}, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce328, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=792408000}, FUTEX_BITSET_MATCH_ANY
=============================================

I think here is some inter process communication issue
Revision 2021-12-02 12:01 by gusevmk
Description UPD: Version is 19.2.7
I have troubles when no more available devices on storage daemon.
how storage=pgbkps01.mgmt.tensor.ru_loop
Storage {
  Name = "pgbkps01_loop"
  Address = "10.*.*.192"
  Password = "***"
  Device = "local-storage-0","local-storage-1","local-storage-2","local-storage-3"
  MediaType = "File"
  MaximumConcurrentJobs = 4
  SddPort = 9103
}

There are 4 jobs running at the moment.
When starting new job, it does not queue, but ends with error:

02-Dec 13:13 bareos-dir JobId 1903: Fatal error: Authorization key rejected bareos-dir.
02-Dec 13:13 bareos-dir JobId 1903: Fatal error: Director unable to authenticate with Storage daemon at "10.*.*.192:9103". Possible causes:
Passwords or names not the same or
TLS negotiation problem or
Maximum Concurrent Jobs exceeded on the SD or
SD networking messed up (restart daemon).

*status storage=pgbkps01_loop
Connecting to Storage daemon pgbkps01_loop at 10.*.*.192:9103
Failed to connect to Storage daemon pgbkps01_loop

When one of 4 jobs ends, there are no problem.

I did check strace of bareos-sd threads:
ps -eLo pid,tid,ppid,user:11,comm,state,wchan | grep bareos-sd
3256361 3256361 1 bareos bareos-sd S x64_sys_poll
3256361 3256363 1 bareos bareos-sd S -
3256361 1332308 1 bareos bareos-sd S x64_sys_poll
3256361 2252738 1 bareos bareos-sd S x64_sys_poll
3256361 2407428 1 bareos bareos-sd S x64_sys_poll
========================================================================
WHEN IS PROBLEM:
strace -p 3256361:
accept(3, {sa_family=AF_INET, sin_port=htons(48954), sin_addr=inet_addr("10.76.74.192")}, [128->16]) = 6
setsockopt(6, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7ff13d75feb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7ff13d7609d0, tls=0x7ff13d760700, child_tidptr=0x7ff13d7609d0) = 2458575
futex(0x2487490, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442869, tv_nsec=440823368}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x2487418, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x24874c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2487440, FUTEX_WAKE_PRIVATE, 1) = 1

strace -p 3256363
strace: Process 3256363 attached
restart_syscall(<... resuming interrupted futex ...>
========================================================================
WHEN IS NO PROBLEM:
strace -p 3256361:
accept(3, {sa_family=AF_INET, sin_port=htons(48954), sin_addr=inet_addr("10.76.74.192")}, [128->16]) = 6
setsockopt(6, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
clone(child_stack=0x7ff13d75feb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7ff13d7609d0, tls=0x7ff13d760700, child_tidptr=0x7ff13d7609d0) = 2458575
futex(0x2487490, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442869, tv_nsec=440823368}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x2487418, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x24874c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x2487440, FUTEX_WAKE_PRIVATE, 1) = 1

strace -p 3256363
strace: Process 3256363 attached
restart_syscall(<... resuming interrupted futex ...>
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce328, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442749, tv_nsec=490329000}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce32c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=745975000}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce328, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=747064000}, FUTEX_BITSET_MATCH_ANY) = 0
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce32c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=792002000}, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7ff1508ce340, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff1508ce328, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1638442807, tv_nsec=792408000}, FUTEX_BITSET_MATCH_ANY
=============================================

I think here is some inter process communication issue