View Issue Details

IDProjectCategoryView StatusLast Update
0000657bareos-corestorage daemonpublic2020-12-16 09:12
Reporterbarbugs Assigned Toarogge  
PriorityhighSeveritycrashReproducibilitysometimes
Status closedResolutionunable to reproduce 
PlatformLinuxOSUbuntuOS Version14.04
Product Version15.2.2 
Summary0000657: SD crashes almost every time when scheduled jobs are starting
DescriptionAlmost every night I get a Bareos GDB traceback. After that I have to restart the SD and start some jobs by hand again. Usually following scheduled jobs are OK then.

I found this bug here:

https://bugs.bareos.org/view.php?id=414

That seems to be my problem but it is fixed already, isn't it? My version is bareos-storage 15.2.2-37.1.

Should I provide your with traceback and core as well?
Steps To ReproduceLet scheduled jobs run a couple of times. After the 3rd or 4th attempt it occurs. Often just right away.
Additional InformationI installed bareos-dbg and gdb. Thus I have tracebacks and core files.

Something strange I spotted in traceback:

0x00007fc7ecb4a215 in bnet_host2ipaddrs (host=host@entry=0xaaaaaaaaaaaaaaaa <error: Cannot access memory at address 0xaaaaaaaaaaaaaaaa>, family=family@entry=0, errstr=errstr@entry=0x7fc7e37fcf78) at bnet.c:414

TagsNo tags attached.

Activities

mvwieringen

mvwieringen

2016-05-09 19:01

developer   ~0002262

At least the traceback as that show the stacktraces. The above seems to
indicate a freed buffer as it contains 0xaaaaaa etc which is the pattern
for freed data.
mvwieringen

mvwieringen

2016-05-09 19:03

developer   ~0002263

I honestly cannot tie the two bugs together e.g. this one and the one you
mention. Do you get a overrun buffer ASSERT ? Next to the traceback it
might also make sense to paste the exact error you get in the log.
barbugs

barbugs

2016-05-10 15:12

reporter   ~0002268

OK, that was just a hunch.

There are different unpleasant things in the logs:

- JobId 474: Error: sql_create.c:450 Volume "client2-2016-5-6-474" already exists.
- Storage daemon didn't accept Device "V3-RAID-20" command.
- JobId 411: Fatal error: filed/dir_cmd.c:2397 Comm error with SD. bad response to Append Data. ERR=Unknown error
- JobId 411: Error: sql_create.c:450 Volume "client7-2016-4-29-411" already exists.
- JobId 411: Job client7-differential.2016-04-29_01.00.00_36 is waiting. Cannot find any appendable volumes.
- Fatal error: bsock_tcp.c:134 Unable to connect to Storage daemon on <IP>:9103. ERR=Connection refused (after it died)


It seems the whole concept of having one Volume File for each client and job accomplished by many storage devices (because you want to write more than one Volume at a time) is crap.

However here is the full traceback:

Created /var/lib/bareos/bareos-dir.core.10802 for doing postmortem debugging
[New LWP 10804]
[New LWP 10806]
[New LWP 10807]
[New LWP 10808]
[New LWP 10994]
[New LWP 10802]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/bareos-dir'.
#0 0x00007fc7ec336b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
$1 = 0x6a3ca0 <my_name> "vampire3-dir"
$2 = 0x2058058 "bareos-dir"
$3 = 0x2058098 "/usr/sbin/bareos-dir"
$4 = 0x7fc7d4011b98 "MySQL"
$5 = 0x7fc7ecb83c3a "15.2.2 (16 November 2015)"
$6 = 0x7fc7ecb83c0e "x86_64-pc-linux-gnu"
$7 = 0x7fc7ecb83c07 "ubuntu"
$8 = 0x7fc7ecb83c29 "Ubuntu 14.04 LTS"
$9 = "vampire3", '\000' <repeats 247 times>
$10 = 0x7fc7ecb83c22 "ubuntu Ubuntu 14.04 LTS"
Environment variable "TestName" not defined.
#0 0x00007fc7ec336b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
0000001 0x00007fc7ecb545a4 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:171
0000002 0x00007fc7ecb6486c in check_deadlock () at lockmgr.c:566
0000003 0x00007fc7ec32f182 in start_thread (arg=0x7fc7e9e25700) at pthread_create.c:312
0000004 0x00007fc7ebd5847d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 6 (Thread 0x7fc7eda8b780 (LWP 10802)):
#0 0x00007fc7ec336b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
0000001 0x00007fc7ecb545a4 in bmicrosleep (sec=sec@entry=60, usec=usec@entry=0) at bsys.c:171
0000002 0x0000000000442b9b in wait_for_next_job (one_shot_job_to_run=<optimized out>) at scheduler.c:124
0000003 0x000000000040f367 in main (argc=<optimized out>, argv=<optimized out>) at dird.c:393

Thread 5 (Thread 0x7fc7e2ffd700 (LWP 10994)):
#0 0x00007fc7ec336b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
0000001 0x00007fc7ecb545a4 in bmicrosleep (sec=sec@entry=2, usec=usec@entry=0) at bsys.c:171
0000002 0x000000000042eba7 in jobq_server (arg=arg@entry=0x6a46c0 <job_queue>) at jobq.c:649
0000003 0x00007fc7ecb64955 in lmgr_thread_launcher (x=0x205a088) at lockmgr.c:926
0000004 0x00007fc7ec32f182 in start_thread (arg=0x7fc7e2ffd700) at pthread_create.c:312
0000005 0x00007fc7ebd5847d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7fc7e37fe700 (LWP 10808)):
#0 0x00007fc7ec336ee9 in __libc_waitpid (pid=10995, stat_loc=0x7fc7e37fc5fc, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:40
0000001 0x00007fc7ecb73fa4 in signal_handler (sig=7) at signal.c:240
0000002 <signal handler called>
0000003 0x00007fc7ecb4a215 in bnet_host2ipaddrs (host=host@entry=0xaaaaaaaaaaaaaaaa <error: Cannot access memory at address 0xaaaaaaaaaaaaaaaa>, family=family@entry=0, errstr=errstr@entry=0x7fc7e37fcf78) at bnet.c:414
0000004 0x00007fc7ecb534b7 in BSOCK_TCP::open (this=0x7fc7d000a208, jcr=0x7fc7d0001148, name=0x47269a "Storage daemon", host=0xaaaaaaaaaaaaaaaa <error: Cannot access memory at address 0xaaaaaaaaaaaaaaaa>, service=<optimized out>, port=-1431655766, heart_beat=-6148914691236517206, fatal=0x7fc7e37fdb4c) at bsock_tcp.c:182
0000005 0x00007fc7ecb52a40 in BSOCK_TCP::connect (this=this@entry=0x7fc7d000a208, jcr=jcr@entry=0x7fc7d0001148, retry_interval=retry_interval@entry=2, max_retry_time=max_retry_time@entry=1, heart_beat=heart_beat@entry=-6148914691236517206, name=0x47269a "Storage daemon", host=host@entry=0xaaaaaaaaaaaaaaaa <error: Cannot access memory at address 0xaaaaaaaaaaaaaaaa>, service=service@entry=0x0, port=port@entry=-1431655766, verbose=verbose@entry=false) at bsock_tcp.c:115
0000006 0x000000000044038a in connect_to_storage_daemon (jcr=0x7fc7d0001148, retry_interval=2, max_retry_time=1, verbose=verbose@entry=false) at sd_cmds.c:113
0000007 0x00000000004404b8 in connect_to_storage_daemon (jcr=jcr@entry=0x7fc7d0001148, retry_interval=retry_interval@entry=2, max_retry_time=max_retry_time@entry=1, verbose=verbose@entry=false) at sd_cmds.c:132
0000008 0x00000000004432db in statistics_thread_runner (arg=arg@entry=0x0) at stats.c:228
0000009 0x00007fc7ecb64955 in lmgr_thread_launcher (x=0x20aa0a8) at lockmgr.c:926
0000010 0x00007fc7ec32f182 in start_thread (arg=0x7fc7e37fe700) at pthread_create.c:312
0000011 0x00007fc7ebd5847d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7fc7e3fff700 (LWP 10807)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
0000001 0x00007fc7ecb64dbc in bthread_cond_timedwait_p (cond=cond@entry=0x7fc7ecd99720 <_ZL5timer>, m=m@entry=0x7fc7ecd99760 <_ZL11timer_mutex>, abstime=abstime@entry=0x7fc7e3ffede0, file=file@entry=0x7fc7ecb88302 "watchdog.c", line=line@entry=313) at lockmgr.c:811
0000002 0x00007fc7ecb7d4a9 in watchdog_thread (arg=arg@entry=0x0) at watchdog.c:313
0000003 0x00007fc7ecb64955 in lmgr_thread_launcher (x=0x20926d8) at lockmgr.c:926
0000004 0x00007fc7ec32f182 in start_thread (arg=0x7fc7e3fff700) at pthread_create.c:312
0000005 0x00007fc7ebd5847d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7fc7e8ee5700 (LWP 10806)):
#0 0x00007fc7ebd4b12d in poll () at ../sysdeps/unix/syscall-template.S:81
0000001 0x00007fc7ecb49283 in bnet_thread_server_tcp (addr_list=addr_list@entry=0x205a8c8, max_clients=<optimized out>, sockfds=<optimized out>, client_wq=client_wq@entry=0x6a4980 <socket_workq>, nokeepalive=<optimized out>, handle_client_request=handle_client_request@entry=0x43d340 <handle_connection_request(void*)>) at bnet_server_tcp.c:298
0000002 0x000000000043d59f in connect_thread (arg=arg@entry=0x205a8c8) at socket_server.c:100
0000003 0x00007fc7ecb64955 in lmgr_thread_launcher (x=0x20a9ff8) at lockmgr.c:926
0000004 0x00007fc7ec32f182 in start_thread (arg=0x7fc7e8ee5700) at pthread_create.c:312
0000005 0x00007fc7ebd5847d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7fc7e9e25700 (LWP 10804)):
#0 0x00007fc7ec336b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
0000001 0x00007fc7ecb545a4 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:171
0000002 0x00007fc7ecb6486c in check_deadlock () at lockmgr.c:566
0000003 0x00007fc7ec32f182 in start_thread (arg=0x7fc7e9e25700) at pthread_create.c:312
0000004 0x00007fc7ebd5847d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
#0 0x00007fc7ec336b9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
81 in ../sysdeps/unix/syscall-template.S
No locals.
0000001 0x00007fc7ecb545a4 in bmicrosleep (sec=sec@entry=30, usec=usec@entry=0) at bsys.c:171
171 bsys.c: No such file or directory.
timeout = {tv_sec = 30, tv_nsec = 0}
tv = {tv_sec = 140496646714409, tv_usec = 0}
tz = {tz_minuteswest = -371042560, tz_dsttime = 32711}
status = <optimized out>
0000002 0x00007fc7ecb6486c in check_deadlock () at lockmgr.c:566
566 lockmgr.c: No such file or directory.
__cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {140496599144192, 8888629645796315974, 0, 0, 140496599144896, 140496599144192, -8875167333342899386, -8875160014926507194}, __mask_was_saved = 0}}, __pad = {0x7fc7e9e24ef0, 0x0, 0x7fc7ec32f124 <start_thread+100>, 0x7fc7e9e25700}}
__cancel_routine = 0x7fc7ecb647f0 <cln_hdl(void*)>
__not_first_call = <optimized out>
old = 0
0000003 0x00007fc7ec32f182 in start_thread (arg=0x7fc7e9e25700) at pthread_create.c:312
312 pthread_create.c: No such file or directory.
__res = <optimized out>
pd = 0x7fc7e9e25700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140496599144192, 8888629645796315974, 0, 0, 140496599144896, 140496599144192, -8875167333368065210, -8875158953774279866}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
0000004 0x00007fc7ebd5847d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
111 ../sysdeps/unix/sysv/linux/x86_64/clone.S: No such file or directory.
No locals.
#0 0x0000000000000000 in ?? ()
No symbol table info available.
#0 0x0000000000000000 in ?? ()
No symbol table info available.
#0 0x0000000000000000 in ?? ()
No symbol table info available.
arogge

arogge

2020-02-12 11:28

manager   ~0003799

Sorry for coming back to you that late. Are you still using Bareos? Can you still reproduce this with the current release?

Issue History

Date Modified Username Field Change
2016-05-09 16:56 barbugs New Issue
2016-05-09 19:01 mvwieringen Note Added: 0002262
2016-05-09 19:01 mvwieringen Status new => feedback
2016-05-09 19:03 mvwieringen Note Added: 0002263
2016-05-10 15:12 barbugs Note Added: 0002268
2016-05-10 15:12 barbugs Status feedback => new
2020-02-12 11:28 arogge Status new => feedback
2020-02-12 11:28 arogge Note Added: 0003799
2020-12-16 09:12 arogge Assigned To => arogge
2020-12-16 09:12 arogge Status feedback => resolved
2020-12-16 09:12 arogge Resolution open => unable to reproduce
2020-12-16 09:12 arogge Status resolved => closed