View Issue Details

IDProjectCategoryView StatusLast Update
0000899bareos-coredirectorpublic2018-06-22 17:16
Reporterzendx Assigned Tojoergs  
PriorityhighSeveritymajorReproducibilitysometimes
Status closedResolutionwon't fix 
PlatformLinuxOSDebianOS Version8
Product Version17.2.4 
Summary0000899: Bareos crashes with some clients
DescriptionWhen I try to start work, for some clients, director crashes.
Any errors in logs, but i have backtrace.

Two problem clients - one with CentOS 7 x64 (FD ver 17.2.4), and other - WS 2008 (FD ver 17.2.4)
Additional InformationBacktrace attached.
TagsCollect Statistics

Relationships

child of 0000903 closed director crashes some time after a reload if Collect Statistic is enabled 

Activities

zendx

zendx

2018-01-27 09:10

reporter  

9gn7SD8M.txt (9,121 bytes)   
Created /var/lib/bareos/bareos-dir.core.25321 for doing postmortem debugging
[New LWP 25322]
[New LWP 25325]
[New LWP 25326]
[New LWP 25327]
[New LWP 25361]
[New LWP 25363]
[New LWP 25321]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/bareos-dir'.
#0  0x00007f6fde63414d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
81        ../sysdeps/unix/syscall-template.S: No such file or directory.
$1 = 0x6b6b80 <my_name> "bareos-dir"
$2 = 0x17ab758 "bareos-dir"
$3 = 0x17ab798 "/usr/sbin/bareos-dir"
$4 = 0x7f6fc4012498 "PostgreSQL"
$5 = 0x7f6fdfd710a6 "17.2.4 (21 Sep 2017)"
$6 = 0x7f6fdfd71092 "x86_64-pc-linux-gnu"
$7 = 0x7f6fdfd7108b "debian"
$8 = 0x7f6fdfd7106d "Debian GNU/Linux 8.0 (jessie)"
$9 = "bareos", '\000' <repeats 249 times>
$10 = 0x7f6fdfd71588 "debian Debian GNU/Linux 8.0 (jessie)"
Environment variable "TestName" not defined.
#0  0x00007f6fde63414d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6fdfd40854 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:171
#2  0x00007f6fdfd5165c in check_deadlock () at lockmgr.c:568
#3  0x00007f6fde62d064 in start_thread (arg=0x7f6fdd436700) at pthread_create.c:309
#4  0x00007f6fdd93c62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 7 (Thread 0x7f6fe0e78740 (LWP 25321)):
#0  0x00007f6fde63414d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6fdfd40854 in bmicrosleep (sec=60, usec=0) at bsys.c:171
#2  0x0000000000447959 in wait_for_next_job (one_shot_job_to_run=0x0) at scheduler.c:126
#3  0x000000000040f9d6 in main (argc=<optimized out>, argv=<optimized out>) at dird.c:434

Thread 6 (Thread 0x7f6fceffd700 (LWP 25363)):
#0  0x00007f6fde633a9d in read () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6fdfd3ea60 in BSOCK_TCP::read_nbytes (this=0x7f6fb8001c78, ptr=<optimized out>, nbytes=4) at bsock_tcp.c:978
#2  0x00007f6fdfd3e2df in BSOCK_TCP::recv (this=0x7f6fb8001c78) at bsock_tcp.c:550
#3  0x0000000000426bcf in bget_dirmsg (bs=0x7f6fb8001c78, allow_any_message=false) at getmsg.c:154
#4  0x0000000000433fcc in msg_thread (arg=0xb, arg@entry=0x18101c8) at msgchan.c:434
#5  0x00007f6fdfd516ef in lmgr_thread_launcher (x=0x7f6fb8002f48) at lockmgr.c:928
#6  0x00007f6fde62d064 in start_thread (arg=0x7f6fceffd700) at pthread_create.c:309
#7  0x00007f6fdd93c62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 5 (Thread 0x7f6fcf7fe700 (LWP 25361)):
#0  0x00007f6fde633a9d in read () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6fdfd3ea60 in BSOCK_TCP::read_nbytes (this=0x7f6fb8004008, ptr=<optimized out>, nbytes=4) at bsock_tcp.c:978
#2  0x00007f6fdfd3e2df in BSOCK_TCP::recv (this=0x7f6fb8004008) at bsock_tcp.c:550
#3  0x0000000000426bcf in bget_dirmsg (bs=0x7f6fb8004008, allow_any_message=false) at getmsg.c:154
#4  0x0000000000412692 in wait_for_job_termination (jcr=0x18101c8, timeout=0) at backup.c:715
#5  0x00000000004147f9 in do_native_backup (jcr=jcr@entry=0x18101c8) at backup.c:650
#6  0x0000000000429684 in job_thread (arg=0x18101c8) at job.c:514
#7  0x000000000042eba1 in jobq_server (arg=0x6b7240 <job_queue>) at jobq.c:485
#8  0x00007f6fdfd516ef in lmgr_thread_launcher (x=0x18192c8) at lockmgr.c:928
#9  0x00007f6fde62d064 in start_thread (arg=0x7f6fcf7fe700) at pthread_create.c:309
#10 0x00007f6fdd93c62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7f6fd501e700 (LWP 25327)):
#0  0x00007f6fde634489 in __libc_waitpid (pid=25862, stat_loc=0x7f6fd501c6cc, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:40
#1  0x00007f6fdfd61154 in signal_handler (sig=7) at signal.c:240
#2  <signal handler called>
#3  0x00007f6fdfd36563 in bnet_host2ipaddrs (host=0xaaaaaaaaaaaaaaaa <error: Cannot access memory at address 0xaaaaaaaaaaaaaaaa>, family=0, errstr=0x7f6fd501d028) at bnet.c:418
#4  0x00007f6fdfd3fc06 in BSOCK_TCP::open (this=0x7f6fc000bfb8, jcr=0x38, name=0x18 <error: Cannot access memory at address 0x18>, host=0x7f6fc0000020 "", service=0x0, port=-1431655766, heart_beat=-6148914691236517206, fatal=0x7f6fd501dbdc) at bsock_tcp.c:182
#5  0x00007f6fdfd3ebc5 in BSOCK_TCP::connect (this=0x7f6fc000bfb8, jcr=0x7f6fc0001078, retry_interval=2, max_retry_time=140117939322912, max_retry_time@entry=1, heart_beat=0, heart_beat@entry=-6148914691236517206, name=0x47c661 "Storage daemon", host=0xaaaaaaaaaaaaaaaa <error: Cannot access memory at address 0xaaaaaaaaaaaaaaaa>, service=0x0, port=-1431655766, verbose=false) at bsock_tcp.c:115
#6  0x00000000004452de in connect_to_storage_daemon (jcr=0x7f6fc0001078, retry_interval=<optimized out>, max_retry_time=<optimized out>, verbose=<optimized out>) at sd_cmds.c:118
#7  0x000000000044802b in statistics_thread_runner (arg=0x7f6fc000c4b8, arg@entry=0x0) at stats.c:233
#8  0x00007f6fdfd516ef in lmgr_thread_launcher (x=0x180d558) at lockmgr.c:928
#9  0x00007f6fde62d064 in start_thread (arg=0x7f6fd501e700) at pthread_create.c:309
#10 0x00007f6fdd93c62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7f6fcd01e700 (LWP 25326)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f6fdfd51b8c in bthread_cond_timedwait_p (cond=cond@entry=0x7f6fdff87dc0 <_ZL5timer>, m=m@entry=0x7f6fdff87e00 <_ZL11timer_mutex>, abstime=abstime@entry=0x7f6fcd01de20, file=file@entry=0x7f6fdfd75952 "watchdog.c", line=line@entry=313) at lockmgr.c:813
#2  0x00007f6fdfd6a3fd in watchdog_thread (arg=arg@entry=0x0) at watchdog.c:313
#3  0x00007f6fdfd516ef in lmgr_thread_launcher (x=0x180d558) at lockmgr.c:928
#4  0x00007f6fde62d064 in start_thread (arg=0x7f6fcd01e700) at pthread_create.c:309
#5  0x00007f6fdd93c62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7f6fd581f700 (LWP 25325)):
#0  0x00007f6fdd933aed in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6fdfd35723 in bnet_thread_server_tcp (addr_list=addr_list@entry=0x17c74d8, max_clients=<optimized out>, sockfds=<optimized out>, client_wq=client_wq@entry=0x6b7540 <socket_workq>, nokeepalive=<optimized out>, handle_client_request=handle_client_request@entry=0x442160 <handle_connection_request(void*)>) at bnet_server_tcp.c:306
#2  0x00000000004423ef in connect_thread (arg=arg@entry=0x17c74d8) at socket_server.c:115
#3  0x00007f6fdfd516ef in lmgr_thread_launcher (x=0x180e848) at lockmgr.c:928
#4  0x00007f6fde62d064 in start_thread (arg=0x7f6fd581f700) at pthread_create.c:309
#5  0x00007f6fdd93c62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f6fdd436700 (LWP 25322)):
#0  0x00007f6fde63414d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f6fdfd40854 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:171
#2  0x00007f6fdfd5165c in check_deadlock () at lockmgr.c:568
#3  0x00007f6fde62d064 in start_thread (arg=0x7f6fdd436700) at pthread_create.c:309
#4  0x00007f6fdd93c62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
#0  0x00007f6fde63414d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
81        in ../sysdeps/unix/syscall-template.S
No locals.
#1  0x00007f6fdfd40854 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:171
171        bsys.c: No such file or directory.
timeout = {tv_sec = 30, tv_nsec = 0}
tv = {tv_sec = 0, tv_usec = 0}
tz = {tz_minuteswest = -582785280, tz_dsttime = 32623}
status = <optimized out>
#2  0x00007f6fdfd5165c in check_deadlock () at lockmgr.c:568
568        lockmgr.c: No such file or directory.
__cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {0, 3039939368786552139, 0, 140118491484256, 0, 140118430279424, -3103065658840095413, -3103060313982671541}, __mask_was_saved = 0}}, __pad = {0x7f6fdd435f30, 0x0, 0x7f6fdd436700, 0x7f6fdd436700}}
__cancel_routine = 0x7f6fdfd51770 <cln_hdl(void*)>
__not_first_call = <optimized out>
old = 0
#3  0x00007f6fde62d064 in start_thread (arg=0x7f6fdd436700) at pthread_create.c:309
309        pthread_create.c: No such file or directory.
__res = <optimized out>
pd = 0x7f6fdd436700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140118430279424, 3039939368786552139, 0, 140118491484256, 0, 140118430279424, -3103065658850581173, -3103063180958145205}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#4  0x00007f6fdd93c62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
111        ../sysdeps/unix/sysv/linux/x86_64/clone.S: No such file or directory.
No locals.
#0  0x0000000000000000 in ?? ()
No symbol table info available.
#0  0x0000000000000000 in ?? ()
No symbol table info available.
#0  0x0000000000000000 in ?? ()
No symbol table info available.
9gn7SD8M.txt (9,121 bytes)   
zendx

zendx

2018-01-27 10:00

reporter   ~0002885

New backtrace (crashed at the middle of the running job (about 20GB\40Gb backed up):
https://pastebin.com/sMigZd3T
joergs

joergs

2018-01-27 19:45

developer   ~0002886

Thank you for this report.

I see, that the problem is related to the statistic thread.

You may try to disable it by
http://doc.bareos.org/master/html/bareos-manual-main-reference.html#directiveDirStorageCollect%20Statistics

A few additional questions:
Have you used the packages from http://download.baeos.org/ ?
The backup works okay with your other clients?
Does the backup always fails on this two clients?
Have you configured storages that are no longer available?
zendx

zendx

2018-01-27 20:17

reporter   ~0002888

Yes, i have enabled statistic collection.
I'll try to disable it...

1. I'm using debian repo from download.bareos.com.
2. Randomly. On example, by restarting failed job with same client i got successfull backup (1 backup on 4 crashes).
3. Other clients have less files, so, maybe, this is answer.
4. No unavailable or failed storages.

This night I will run the backup again (with the statistics collection disabled).
joergs

joergs

2018-06-08 13:52

developer   ~0003035

Any news no this? Has the problem been related to the statistic thread? If yes, there is now a fix for it, already released in the source and soon to be released as package.
joergs

joergs

2018-06-22 17:16

developer   ~0003050

Received no feedback and I think this have been fixed in 17.2.6. Closing ticket.

Issue History

Date Modified Username Field Change
2018-01-27 09:10 zendx New Issue
2018-01-27 09:10 zendx File Added: 9gn7SD8M.txt
2018-01-27 10:00 zendx Note Added: 0002885
2018-01-27 19:34 joergs Priority normal => high
2018-01-27 19:34 joergs Severity minor => major
2018-01-27 19:34 joergs Reproducibility always => sometimes
2018-01-27 19:34 joergs Status new => acknowledged
2018-01-27 19:45 joergs Note Added: 0002886
2018-01-27 19:45 joergs Status acknowledged => feedback
2018-01-27 19:46 joergs Tag Attached: Collect Statistics
2018-01-27 20:17 zendx Note Added: 0002888
2018-01-27 20:17 zendx Status feedback => new
2018-02-02 14:40 joergs Relationship added child of 0000903
2018-06-08 13:52 joergs Note Added: 0003035
2018-06-08 13:52 joergs Assigned To => joergs
2018-06-08 13:52 joergs Status new => feedback
2018-06-22 17:16 joergs Note Added: 0003050
2018-06-22 17:16 joergs Status feedback => closed
2018-06-22 17:16 joergs Resolution open => won't fix