View Issue Details

ID: 0000374
Project: bareos-core
Category: General
View Status: public
Last Update: 2015-03-24 11:25
Reporter: gglynn
Assigned To: (none)
Priority: normal
Severity: minor
Reproducibility: always
Status: closed
Resolution: open
Platform: Linux
OS: CentOS
OS Version: 6
Product Version: 14.2.2
Fixed in Version: 14.2.3
Summary: 0000374: NDMP SMTAPE restores fail with "SMTAPE: Error: Read Socket received EOF."
Description: I'm attempting to restore an NDMP SMTAPE backup I made of a NetApp FAS2520 flexible volume to a restricted volume of the same size on another identical NetApp (from a Bareos FileStorage pool), but the restore job fails immediately with the error "SMTAPE: Error: Read Socket received EOF" reported by the NetApp FAS unit.
Additional Information: I've confirmed that the correct nlist is provided to the NetApp by the Bareos Director (bareos-dir) via NDMP, and the NetApp connects to the Bareos Storage service (bareos-sd) on its random NDMP session port, but tcpdump shows that bareos-sd closes the connection immediately after it opens. I can't tell from the Bareos logs and debug output or the NetApp's ndmpd debug log whether bareos-sd is closing the storage connection *because* the NetApp has already informed bareos-dir that it encountered the SMTAPE EOF error, or if the disconnection is *causing* the SMTAPE EOF error.

I'll note that I made the backup with Bareos 13.2.2, but I upgraded Bareos to 14.2.2 after I ran into Bareos Bug 0000345 (NDMP SMTAPE restore does not work because Bareos didn't supply an nlist), and couldn't find a binary 13.2.4 RPM with the patch included.

Also note that I was able to successfully extract the SMTAPE backup to a file on the Bareos server using bextract, so the backup is there intact in the FileStorage Pool, I just can't get the Netapp to pull it in from the bareos-sd.
Tags: No tags attached.

Relationships

related to 0000345 closed NDMP restore does not work 
child of 0000393 closed Release bareos-14.2.3 

Activities

mvwieringen

2014-12-16 20:48

developer   ~0001111

Try adding the following to both your bareos-dir.conf and bareos-sd.conf,
in the daemon's own resource section in each:

  NDMP Snooping = yes
  NDMP Loglevel = 7

For the SD, that is the Storage section of the conf;
for the DIR, it is the Director section.

Smells a bit like the SD doesn't know how to connect to the NDMP
data agent on the NetAPP box.
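For illustration, a minimal sketch of where these directives might go (the
resource names here are assumptions, not taken from this report):

```
# bareos-dir.conf -- Director resource (sketch; names are assumed)
Director {
  Name = bareos-dir
  ...
  NDMP Snooping = yes
  NDMP Loglevel = 7
}

# bareos-sd.conf -- Storage resource (sketch; names are assumed)
Storage {
  Name = bareos-sd
  ...
  NDMP Snooping = yes
  NDMP Loglevel = 7
}
```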
gglynn

2014-12-16 21:50

reporter  

ndmp-snooping-output-201412161549.txt (20,917 bytes)   
16-Dec 12:44 sneakernet-dir JobId 54: Start Restore Job Restore_SRCNETAPP:vol1_smtape_to_DSTNETAPP:SRCNETAPP_vol1.2014-12-16_12.44.45_07
16-Dec 12:44 sneakernet-dir JobId 54: Using Device "FileStorage" to read.
16-Dec 12:44 sneakernet-dir JobId 54: >C NDMP0_NOTIFY_CONNECTED 1
16-Dec 12:44 sneakernet-dir JobId 54:    reason=NDMP0_CONNECTED protocol_version=4 text_reason='Connection successful'
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP0_CONNECT_OPEN 1
16-Dec 12:44 sneakernet-dir JobId 54:    version=4
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP0_CONNECT_OPEN 1 (2)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP0_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_CONFIG_GET_AUTH_ATTR 2
16-Dec 12:44 sneakernet-dir JobId 54:    auth_type=NDMP4_AUTH_MD5
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_CONFIG_GET_AUTH_ATTR 2 (3)
16-Dec 12:44 sneakernet-dir JobId 54:    <<unimplemented pp>>
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_CONNECT_CLIENT_AUTH 3
16-Dec 12:44 sneakernet-dir JobId 54:    auth_type=NDMP4_AUTH_MD5 auth_id=ndmpbackup
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_CONNECT_CLIENT_AUTH 3 (4)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: >C NDMP0_NOTIFY_CONNECTED 1
16-Dec 12:44 sneakernet-dir JobId 54:    reason=NDMP0_CONNECTED protocol_version=4 text_reason='Hello'
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP0_CONNECT_OPEN 1
16-Dec 12:44 sneakernet-dir JobId 54:    version=4
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP0_CONNECT_OPEN 1 (2)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP0_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_CONFIG_GET_AUTH_ATTR 2
16-Dec 12:44 sneakernet-dir JobId 54:    auth_type=NDMP4_AUTH_MD5
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_CONFIG_GET_AUTH_ATTR 2 (3)
16-Dec 12:44 sneakernet-dir JobId 54:    <<unimplemented pp>>
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_CONNECT_CLIENT_AUTH 3
16-Dec 12:44 sneakernet-dir JobId 54:    auth_type=NDMP4_AUTH_MD5 auth_id=ndmpbackup
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_CONNECT_CLIENT_AUTH 3 (4)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: ndmca_connect_tape_agent: 0 0x7f09e80f8a70
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_SET_RECORD_SIZE 4
16-Dec 12:44 sneakernet-dir JobId 54:    len=64512
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_SET_RECORD_SIZE 4 (5)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: Opening tape drive NNLE-BCOP-OODO-CLOL-GBDA-OIMP-ELMG-IAAK@/vol/vol1 read-only
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_TAPE_OPEN 5
16-Dec 12:44 sneakernet-dir JobId 54:    device='NNLE-BCOP-OODO-CLOL-GBDA-OIMP-ELMG-IAAK@/vol/vol1' mode=NDMP4_TAPE_READ_MODE
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_TAPE_OPEN 5 (6)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: Commanding tape drive to rewind
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_TAPE_MTIO 6
16-Dec 12:44 sneakernet-dir JobId 54:    op=NDMP4_MTIO_REW count=1
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_TAPE_MTIO 6 (7)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR resid_count=0
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_SET_WINDOW 7
16-Dec 12:44 sneakernet-dir JobId 54:    offset=0 length=-1
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_SET_WINDOW 7 (8)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_LISTEN 8
16-Dec 12:44 sneakernet-dir JobId 54:    mode=NDMP4_MOVER_MODE_WRITE addr_type=NDMP4_ADDR_TCP
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_LISTEN 8 (9)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_DATA_CONNECT 4
16-Dec 12:44 sneakernet-dir JobId 54:    addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-sd JobId 54: Ready to read from volume "FVol0043" on device "FileStorage" (/home/bareos).
16-Dec 12:44 sneakernet-sd JobId 54: Forward spacing Volume "FVol0043" to file:block 7:3140372906.
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_DATA_CONNECT 4 (5)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_DATA_START_RECOVER 5
16-Dec 12:44 sneakernet-dir JobId 54:    butype_name='smtape' n_env=1 n_nlist=1
16-Dec 12:44 sneakernet-dir JobId 54:    env[0] name='PREFIX' value='/vol/SRCNETAPP_vol1'
16-Dec 12:44 sneakernet-dir JobId 54:    nl[0] original_path=''
16-Dec 12:44 sneakernet-dir JobId 54:    ..... destination_path='/vol/SRCNETAPP_vol1'
16-Dec 12:44 sneakernet-dir JobId 54:    ..... name='' other=''
16-Dec 12:44 sneakernet-dir JobId 54:    ..... node=-1 fh_info=-1
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_DATA_START_RECOVER 5 (6)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: Waiting for operation to start
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_DATA_GET_STATE 6
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_DATA_GET_STATE 6 (7)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=3 error=NDMP4_NO_ERR op=NDMP4_DATA_OP_RECOVER
16-Dec 12:44 sneakernet-dir JobId 54:    state=NDMP4_DATA_STATE_ACTIVE
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_DATA_HALT_NA
16-Dec 12:44 sneakernet-dir JobId 54:    bytes_processed=0 est_bytes_remain=0
16-Dec 12:44 sneakernet-dir JobId 54:    est_time_remain=0 data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-dir JobId 54:    read_offset=0 read_length=0
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_GET_STATE 9
16-Dec 12:44 sneakernet-dir JobId 54: >C NDMP4_LOG_MESSAGE 10
16-Dec 12:44 sneakernet-dir JobId 54:    log_type=NDMP4_LOG_DEBUG id=1418762687 message='mover going active'
16-Dec 12:44 sneakernet-dir JobId 54: Async request NDMP4_LOG_MESSAGE
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_GET_STATE 9 (11)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR state=NDMP4_MOVER_STATE_ACTIVE
16-Dec 12:44 sneakernet-dir JobId 54:    pause_reason=NDMP4_MOVER_PAUSE_NA
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_MOVER_HALT_NA
16-Dec 12:44 sneakernet-dir JobId 54:    record_size=64512 record_num=0 bytes_moved=0
16-Dec 12:44 sneakernet-dir JobId 54:    seek=0 to_read=0 win_off=0 win_len=-1
16-Dec 12:44 sneakernet-dir JobId 54:    data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_TAPE_GET_STATE 10
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_TAPE_GET_STATE 10 (12)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=30 error=NDMP4_NO_ERR flags=0x0 file_num=0
16-Dec 12:44 sneakernet-dir JobId 54:    soft_errors=0 block_size=0 blockno=0
16-Dec 12:44 sneakernet-dir JobId 54:    total_space=-1 space_remain=-1
16-Dec 12:44 sneakernet-dir JobId 54: Operation started
16-Dec 12:44 sneakernet-dir JobId 54: Monitoring recover
16-Dec 12:44 sneakernet-dir JobId 54: mon_wait_for_something() entered
16-Dec 12:44 sneakernet-dir JobId 54: >C NDMP4_NOTIFY_DATA_READ 8
16-Dec 12:44 sneakernet-dir JobId 54:    offset=0 length=-1
16-Dec 12:44 sneakernet-dir JobId 54: mon_wait_for_something() happened, resid=30
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_DATA_GET_STATE 7
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_DATA_GET_STATE 7 (9)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=3 error=NDMP4_NO_ERR op=NDMP4_DATA_OP_RECOVER
16-Dec 12:44 sneakernet-dir JobId 54:    state=NDMP4_DATA_STATE_ACTIVE
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_DATA_HALT_NA
16-Dec 12:44 sneakernet-dir JobId 54:    bytes_processed=0 est_bytes_remain=0
16-Dec 12:44 sneakernet-dir JobId 54:    est_time_remain=0 data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-dir JobId 54:    read_offset=0 read_length=-1
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_GET_STATE 11
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_GET_STATE 11 (13)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR state=NDMP4_MOVER_STATE_ACTIVE
16-Dec 12:44 sneakernet-dir JobId 54:    pause_reason=NDMP4_MOVER_PAUSE_NA
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_MOVER_HALT_NA
16-Dec 12:44 sneakernet-dir JobId 54:    record_size=64512 record_num=0 bytes_moved=0
16-Dec 12:44 sneakernet-dir JobId 54:    seek=0 to_read=0 win_off=0 win_len=-1
16-Dec 12:44 sneakernet-dir JobId 54:    data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_TAPE_GET_STATE 12
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_TAPE_GET_STATE 12 (14)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=30 error=NDMP4_NO_ERR flags=0x0 file_num=0
16-Dec 12:44 sneakernet-dir JobId 54:    soft_errors=0 block_size=0 blockno=0
16-Dec 12:44 sneakernet-dir JobId 54:    total_space=-1 space_remain=-1
16-Dec 12:44 sneakernet-dir JobId 54: DATA: bytes 0KB  MOVER: read 0KB record 0
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_READ 13
16-Dec 12:44 sneakernet-dir JobId 54:    offset=0 length=-1
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_READ 13 (15)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: mon_wait_for_something() entered
16-Dec 12:44 sneakernet-dir JobId 54: >C NDMP4_NOTIFY_MOVER_PAUSED 16
16-Dec 12:44 sneakernet-dir JobId 54:    reason=NDMP4_MOVER_PAUSE_EOF seek_position=0
16-Dec 12:44 sneakernet-dir JobId 54: mon_wait_for_something() happened, resid=10
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_DATA_GET_STATE 8
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_DATA_GET_STATE 8 (10)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=3 error=NDMP4_NO_ERR op=NDMP4_DATA_OP_RECOVER
16-Dec 12:44 sneakernet-dir JobId 54:    state=NDMP4_DATA_STATE_ACTIVE
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_DATA_HALT_NA
16-Dec 12:44 sneakernet-dir JobId 54:    bytes_processed=0 est_bytes_remain=0
16-Dec 12:44 sneakernet-dir JobId 54:    est_time_remain=0 data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-dir JobId 54:    read_offset=0 read_length=-1
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_GET_STATE 14
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_GET_STATE 14 (17)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR state=NDMP4_MOVER_STATE_PAUSED
16-Dec 12:44 sneakernet-dir JobId 54:    pause_reason=NDMP4_MOVER_PAUSE_EOF
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_MOVER_HALT_NA
16-Dec 12:44 sneakernet-dir JobId 54:    record_size=64512 record_num=0 bytes_moved=0
16-Dec 12:44 sneakernet-dir JobId 54:    seek=0 to_read=-1 win_off=0 win_len=-1
16-Dec 12:44 sneakernet-dir JobId 54:    data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_TAPE_GET_STATE 15
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_TAPE_GET_STATE 15 (18)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=30 error=NDMP4_NO_ERR flags=0x0 file_num=0
16-Dec 12:44 sneakernet-dir JobId 54:    soft_errors=0 block_size=0 blockno=1
16-Dec 12:44 sneakernet-dir JobId 54:    total_space=-1 space_remain=-1
16-Dec 12:44 sneakernet-dir JobId 54: DATA: bytes 0KB  MOVER: read 0KB record 0
16-Dec 12:44 sneakernet-dir JobId 54: Mover paused, reason=NDMP9_MOVER_PAUSE_EOF
16-Dec 12:44 sneakernet-dir JobId 54: End of tapes
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_CLOSE 16
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_CLOSE 16 (19)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: mon_wait_for_something() entered
16-Dec 12:44 sneakernet-dir JobId 54: >C NDMP4_LOG_MESSAGE 11
16-Dec 12:44 sneakernet-dir JobId 54:    log_type=NDMP4_LOG_ERROR id=1 message='SMTAPE: Error: Read Socket received EOF.
'
16-Dec 12:44 sneakernet-dir JobId 54: Error: LOG_MESSAGE: 'SMTAPE: Error: Read Socket received EOF.'
16-Dec 12:44 sneakernet-dir JobId 54: >C NDMP4_LOG_MESSAGE 12
16-Dec 12:44 sneakernet-dir JobId 54:    log_type=NDMP4_LOG_NORMAL id=2 message='SMTAPE: Cannot Init Input, aborting
'
16-Dec 12:44 sneakernet-dir JobId 54: LOG_MESSAGE: 'SMTAPE: Cannot Init Input, aborting'
16-Dec 12:44 sneakernet-dir JobId 54: >C NDMP4_LOG_MESSAGE 13
16-Dec 12:44 sneakernet-dir JobId 54:    log_type=NDMP4_LOG_ERROR id=3 message='smtape operation failed due to: image restore aborted due to snapmirror failure.
'
16-Dec 12:44 sneakernet-dir JobId 54: Error: LOG_MESSAGE: 'smtape operation failed due to: image restore aborted due to snapmirror failure.'
16-Dec 12:44 sneakernet-dir JobId 54: >C NDMP4_LOG_MESSAGE 14
16-Dec 12:44 sneakernet-dir JobId 54:    log_type=NDMP4_LOG_ERROR id=4 message='DATA: Operation terminated: EVENT: INTERNAL ERROR (for SRCNETAPP_vol1)
'
16-Dec 12:44 sneakernet-dir JobId 54: Error: LOG_MESSAGE: 'DATA: Operation terminated: EVENT: INTERNAL ERROR (for SRCNETAPP_vol1)'
16-Dec 12:44 sneakernet-dir JobId 54: >C NDMP4_NOTIFY_DATA_HALTED 15
16-Dec 12:44 sneakernet-dir JobId 54:    reason=NDMP4_DATA_HALT_INTERNAL_ERROR
16-Dec 12:44 sneakernet-dir JobId 54: mon_wait_for_something() happened, resid=10
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_DATA_GET_STATE 9
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_DATA_GET_STATE 9 (16)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=3 error=NDMP4_NO_ERR op=NDMP4_DATA_OP_RECOVER
16-Dec 12:44 sneakernet-dir JobId 54:    state=NDMP4_DATA_STATE_HALTED
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_DATA_HALT_INTERNAL_ERROR
16-Dec 12:44 sneakernet-dir JobId 54:    bytes_processed=0 est_bytes_remain=0
16-Dec 12:44 sneakernet-dir JobId 54:    est_time_remain=0 data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-dir JobId 54:    read_offset=0 read_length=-1
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_GET_STATE 17
16-Dec 12:44 sneakernet-dir JobId 54: >C NDMP4_NOTIFY_MOVER_HALTED 20
16-Dec 12:44 sneakernet-dir JobId 54:    reason=NDMP4_MOVER_HALT_CONNECT_CLOSED
16-Dec 12:44 sneakernet-dir JobId 54: Async request NDMP4_NOTIFY_MOVER_HALTED
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_GET_STATE 17 (21)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR state=NDMP4_MOVER_STATE_HALTED
16-Dec 12:44 sneakernet-dir JobId 54:    pause_reason=NDMP4_MOVER_PAUSE_NA
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_MOVER_HALT_CONNECT_CLOSED
16-Dec 12:44 sneakernet-dir JobId 54:    record_size=64512 record_num=0 bytes_moved=0
16-Dec 12:44 sneakernet-dir JobId 54:    seek=0 to_read=-1 win_off=0 win_len=-1
16-Dec 12:44 sneakernet-dir JobId 54:    data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_TAPE_GET_STATE 18
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_TAPE_GET_STATE 18 (22)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=30 error=NDMP4_NO_ERR flags=0x0 file_num=0
16-Dec 12:44 sneakernet-dir JobId 54:    soft_errors=0 block_size=0 blockno=1
16-Dec 12:44 sneakernet-dir JobId 54:    total_space=-1 space_remain=-1
16-Dec 12:44 sneakernet-dir JobId 54: DATA: bytes 0KB  MOVER: read 0KB record 0
16-Dec 12:44 sneakernet-dir JobId 54: Operation done, cleaning up
16-Dec 12:44 sneakernet-dir JobId 54: Waiting for operation to halt
16-Dec 12:44 sneakernet-dir JobId 54: mon_wait_for_something() entered
16-Dec 12:44 sneakernet-dir JobId 54: mon_wait_for_something() happened, resid=2
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_DATA_GET_STATE 10
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_DATA_GET_STATE 10 (17)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=3 error=NDMP4_NO_ERR op=NDMP4_DATA_OP_RECOVER
16-Dec 12:44 sneakernet-dir JobId 54:    state=NDMP4_DATA_STATE_HALTED
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_DATA_HALT_INTERNAL_ERROR
16-Dec 12:44 sneakernet-dir JobId 54:    bytes_processed=0 est_bytes_remain=0
16-Dec 12:44 sneakernet-dir JobId 54:    est_time_remain=0 data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-dir JobId 54:    read_offset=0 read_length=-1
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_GET_STATE 19
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_GET_STATE 19 (23)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR state=NDMP4_MOVER_STATE_HALTED
16-Dec 12:44 sneakernet-dir JobId 54:    pause_reason=NDMP4_MOVER_PAUSE_NA
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_MOVER_HALT_CONNECT_CLOSED
16-Dec 12:44 sneakernet-dir JobId 54:    record_size=64512 record_num=0 bytes_moved=0
16-Dec 12:44 sneakernet-dir JobId 54:    seek=0 to_read=-1 win_off=0 win_len=-1
16-Dec 12:44 sneakernet-dir JobId 54:    data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651)
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_TAPE_GET_STATE 20
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_TAPE_GET_STATE 20 (24)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=30 error=NDMP4_NO_ERR flags=0x0 file_num=0
16-Dec 12:44 sneakernet-dir JobId 54:    soft_errors=0 block_size=0 blockno=1
16-Dec 12:44 sneakernet-dir JobId 54:    total_space=-1 space_remain=-1
16-Dec 12:44 sneakernet-dir JobId 54: Commanding tape drive to rewind
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_TAPE_MTIO 21
16-Dec 12:44 sneakernet-dir JobId 54:    op=NDMP4_MTIO_REW count=1
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_TAPE_MTIO 21 (25)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR resid_count=0
16-Dec 12:44 sneakernet-dir JobId 54: Closing tape drive NNLE-BCOP-OODO-CLOL-GBDA-OIMP-ELMG-IAAK@/vol/vol1
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_TAPE_CLOSE 22
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_TAPE_CLOSE 22 (26)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: Operation halted, stopping
16-Dec 12:44 sneakernet-dir JobId 54: Operation ended questionably
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_DATA_STOP 11
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_DATA_STOP 11 (18)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_STOP 23
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_STOP 23 (27)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_DATA_GET_STATE 12
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_DATA_GET_STATE 12 (19)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=3 error=NDMP4_NO_ERR op=NDMP4_DATA_OP_NOACTION
16-Dec 12:44 sneakernet-dir JobId 54:    state=NDMP4_DATA_STATE_IDLE
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_DATA_HALT_NA
16-Dec 12:44 sneakernet-dir JobId 54:    bytes_processed=0 est_bytes_remain=0
16-Dec 12:44 sneakernet-dir JobId 54:    est_time_remain=0 data_conn_addr=NDMP4_ADDR_LOCAL
16-Dec 12:44 sneakernet-dir JobId 54:    read_offset=0 read_length=0
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_MOVER_GET_STATE 24
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_MOVER_GET_STATE 24 (28)
16-Dec 12:44 sneakernet-dir JobId 54:    error=NDMP4_NO_ERR state=NDMP4_MOVER_STATE_IDLE
16-Dec 12:44 sneakernet-dir JobId 54:    pause_reason=NDMP4_MOVER_PAUSE_NA
16-Dec 12:44 sneakernet-dir JobId 54:    halt_reason=NDMP4_MOVER_HALT_NA
16-Dec 12:44 sneakernet-dir JobId 54:    record_size=10240 record_num=0 bytes_moved=0
16-Dec 12:44 sneakernet-dir JobId 54:    seek=0 to_read=0 win_off=0 win_len=-1
16-Dec 12:44 sneakernet-dir JobId 54:    data_conn_addr=NDMP4_ADDR_LOCAL
16-Dec 12:44 sneakernet-dir JobId 54: C> NDMP4_TAPE_GET_STATE 25
16-Dec 12:44 sneakernet-dir JobId 54: >R NDMP4_TAPE_GET_STATE 25 (29)
16-Dec 12:44 sneakernet-dir JobId 54:    unsupp=3f error=NDMP4_DEV_NOT_OPEN_ERR flags=0x0 file_num=4294967295
16-Dec 12:44 sneakernet-dir JobId 54:    soft_errors=4294967295 block_size=4294967295 blockno=4294967295
16-Dec 12:44 sneakernet-dir JobId 54:    total_space=-1 space_remain=-1
16-Dec 12:44 sneakernet-dir JobId 54: ERR=reply-error
16-Dec 12:44 sneakernet-sd JobId 54: ERR=EOF
16-Dec 12:44 sneakernet-dir JobId 54: Error: Bareos sneakernet-dir 14.2.2 (12Dec14):
  Build OS:               x86_64-unknown-linux-gnu redhat CentOS release 6.2 (Final)
  JobId:                  54
  Job:                    Restore_SRCNETAPP:vol1_smtape_to_DSTNETAPP:SRCNETAPP_vol1.2014-12-16_12.44.45_07
  Restore Client:         DSTNETAPP
  Start time:             16-Dec-2014 12:44:47
  End time:               16-Dec-2014 12:44:47
  Elapsed time:           0 secs
  Files Expected:         1
  Files Restored:         0
  Bytes Restored:         0
  Rate:                   0.0 KB/s
  SD termination status:  Running
  Termination:            *** Restore Error ***
gglynn

2014-12-16 21:56

reporter   ~0001112

I've attached a text file with the messages output of another restore attempt with NDMP Snooping = yes and NDMP Loglevel = 7, as requested. In case it helps, the name of my Bareos server is "sneakernet" and its IP address is 192.168.3.157 (c0a8039d in hex). The IP address of the DSTNETAPP system is 192.168.3.216, and SRCNETAPP is the NetApp from which I backed up the volume (/vol/vol1 on SRCNETAPP, and /vol/SRCNETAPP_vol1 on DSTNETAPP). Note that I have the same problem when I try to restore to the same volume name (i.e. /vol/vol1) on DSTNETAPP, so it doesn't appear to be related to redirecting the restore to a different volume name.
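As an aside, the hex address that appears in the NDMP trace, e.g.
data_conn_addr=NDMP4_ADDR_TCP #0(c0a8039d,48651), can be decoded to
dotted-quad form with a small sketch like this (illustrative only, not
Bareos code):

```python
# Decode the 32-bit hex IPv4 address used in NDMP trace output.
def ndmp_hex_to_ip(hex_addr: str) -> str:
    value = int(hex_addr, 16)
    octets = [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return ".".join(str(octet) for octet in octets)

print(ndmp_hex_to_ip("c0a8039d"))  # 192.168.3.157, the "sneakernet" server
```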
mvwieringen

2014-12-16 22:27

developer   ~0001113

The NDMP trace indeed shows that the full handshake succeeds and that a
restore session is started. It just never seems to read a block, which must
be because the restore selected no media to read.

The next option would be to run the SD in debug mode, starting at level
100 or maybe even 200, though this produces a lot of output.

What does your restore session look like in bconsole? It would also be
interesting to see the BSR that gets created, as that describes what the
SD is going to read.
gglynn

2014-12-16 23:28

reporter   ~0001114

Yes, that's what I thought I was seeing in the NDMP trace, too, which is weird. Like I said, I know the data is intact in the FileStorage pool volumes, because I extracted it using bextract, and in fact, I used a BSR generated by a restore job attempt to do it. Note that when I did the extract, it put the extracted files relative to a top-level directory named "@NDMP", so when I extracted an SMTAPE backup of /vol/vol1 to /home/tmp, it actually put the file in /home/tmp/@NDMP/vol/vol1. I also manually copied that extracted "vol1" file to DSTNETAPP and successfully used it to restore a volume using the "smtape restore" command right on the NetApp CLI, so it's definitely not munged.

Anyway, here's what the restore job looks like in bconsole right before I run it:

Run Restore job
JobName: Restore SRCNETAPP:vol1 smtape to DSTNETAPP:SRCNETAPP_vol1
Bootstrap: /var/lib/bareos/sneakernet-dir.restore.3.bsr
Where: /vol/SRCNETAPP_vol1
Replace: Always
FileSet: SRCNETAPP_vol1-fileset
Backup Client: SRCNETAPP
Restore Client: DSTNETAPP
Format: smtape
Storage: File
When: 2014-12-16 14:13:16
Catalog: MyCatalog
Priority: 10
Plugin Options: *None*

I also attached the BSR referenced above to this bug.

Any idea why bextract is reading the FileStorage volumes properly, but SD seems to be choking on them and returning EOF right on the first read?
mvwieringen

2014-12-17 19:46

developer   ~0001118

As the same code is used, there must be something that it doesn't like,
which could also still be at the NDMP level. Only a debug run of the SD
will show what is going on.

So run bareos-sd -f -d 200 > logfile, and make sure only the NDMP restore
is running, as this gives quite a lot of debug output and becomes fully
unreadable when more than one job is running.

Keep in mind that the SD probably runs as user bareos, so you need to run
the debug session as that user too (as the user has no login shell, you
need su - bareos -s /bin/bash or similar).
gglynn

2014-12-17 20:47

reporter   ~0001119

Last edited: 2014-12-17 20:47

I've uploaded the bareos-sd log output (at debug level 200) for an attempt to restore vol2. I had to split it into two parts in order to upload it to this bug, but it's all from one run.

mvwieringen

2014-12-17 20:54

developer   ~0001120

OK, the log looks normal and the SD processes everything; only the NDMP
tape never opens.

The SD gets a cancel:

sneakernet-sd: dir_cmd.c:262-0 <dird: cancel Job=Restore_SRCNETAPP:vol2_smtape_to_DSTNETAPP:vol2.2014-12-17_11.16.47_10

So the DIR is canceling the restore for whatever reason.

So I think we can rule the SD out of the problem.
mvwieringen

2014-12-17 20:54

developer   ~0001121

I need a debug log from the Director, captured with more or less the same command line.
gglynn

2014-12-17 21:12

reporter   ~0001122

Director debug log (level 200) for another restore job with the same parameters as for the bareos-sd debug is attached.
gglynn

2014-12-17 21:17

reporter   ~0001123

Just want to throw this out there: I seem to remember something about SMTAPE on the NetApp using a 240KB tape block size, but in the debug logs I just uploaded, I see a lot of references to a max block size of 64KB. Any relevance?
mvwieringen

2014-12-17 22:00

developer   ~0001124

The logs at least confirm that everything is set up correctly and the tape
is opened by the NDMP DMA; the moment the data mover goes active, the
NetApp seems to abort the restore.

About the tape block size: as NDMP is so device-specific, you could be
right. The blocksize you see is the default, but it is configurable using
the NdmpBlockSize parameter in the client definition; it defaults to 64512.
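The parameter placement described above might look something like this
(the client name and address appear in this report; the rest is an assumed
sketch, and the exact directive spelling should be checked against your
Bareos version's documentation):

```
# bareos-dir.conf -- NDMP client definition (sketch)
Client {
  Name = DSTNETAPP
  Address = 192.168.3.216
  ...
  NdmpBlockSize = 245760   # 240 KB; defaults to 64512 when unset
}
```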
gglynn

2014-12-17 22:12

reporter   ~0001125

Last edited: 2014-12-17 22:13

Does the Director communicate the blocksize to the NetApp prior to the abort? With the FileStorage volume type, does the block size on the restore have to match the block size on the backup? I'd like to try to change the NdmpBlockSize parameter to 245760 bytes, but I want to make sure this won't cause any major problems on the NetApp.

mvwieringen

2014-12-17 22:24

developer   ~0001126

The NDMP blocksize is only used as the send size over the NDMP protocol.
As we don't support direct writing to NDMP Tape Agents (we emulate the
Tape Agent in the Bareos SD), the actual data is written in the blocksize
of the lower-level device anyway. So I think you can just change the
blocksize, but what impact that has on the NetApp I have no idea; then
again, it's not an enormous blocksize.
gglynn

2014-12-18 03:37

reporter   ~0001128

Last edited: 2014-12-18 03:43

I changed the NdmpBlockSize on the Client entry for DSTNETAPP to 245760 (240 KB), and now it no longer fails immediately! However, the NDMP Mover function of the SD doesn't actually move any data. The TCP connection established from the NetApp to the SD stays connected now, but they just keep exchanging 0-byte payload packets, and the restore job keeps logging "DATA: bytes 0KB MOVER: read 0KB record 0" every 30 seconds until I cancel the job.

gglynn

2014-12-18 04:18

reporter   ~0001129

Uploaded new debug logs for DIR and SD on new restore attempt described in my last note.
mvwieringen

2014-12-18 09:56

developer   ~0001131

OK, I removed the debugging output from the report, as it only shows
obvious things.

I looked at the whole blocksize issue with SMTAPE, and it seems it is
hardcoded to use blocksizes from 4 to 256 KB in multiples of 4 KB. I don't
remember it being like that before. But that does reveal a problem with
the default blocksize we use, which is 15.75 x 4 KB, so that is probably
why it's not happy.

The problem with replying at night is that you sometimes don't see the
whole picture. The current code will always return the saved data in the
same chunk size as the data was saved in, so changing it to 240 KB will
still return only chunks of the original size.

So I think your only option is to:

- Change the NDMP blocksize to something evenly divisible by 4 KB
- Create a Pool with a blocksize that matches the NDMP blocksize
- Create a new backup with these new settings.
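The SMTAPE constraint described above (4 to 256 KB, in multiples of 4 KB)
can be expressed as a quick check; the limits here are taken from this
note, not from NetApp documentation:

```python
# Check a proposed NDMP blocksize against the SMTAPE constraint described
# above: between 4 KB and 256 KB, and a multiple of 4 KB.
KB = 1024

def smtape_blocksize_ok(size_bytes: int) -> bool:
    return 4 * KB <= size_bytes <= 256 * KB and size_bytes % (4 * KB) == 0

print(smtape_blocksize_ok(64512))   # False: the Bareos default, 15.75 x 4 KB
print(smtape_blocksize_ok(245760))  # True: 240 KB, SMTAPE's own default
```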
gglynn

2014-12-18 23:08

reporter   ~0001134

Last edited: 2014-12-18 23:09

I'm still unclear as to what's happening here. Can the SD not read the backed-up data in the File-type Volumes that hold the backups for some reason? Is it trying to use the newly-set 240KB NdmpBlockSize to read them and failing to do so because it doesn't match the blocksize used to back up the data?

I'm still confused as to how the source NetApp was able to stream the SMTAPE back to the SD in the first place if the size of the 240KB blocks the NetApp was sending exceeded Bareos' default block size for File-type Volumes (is this 63KB?). Did the SD just ignore the blocking and take the data as a stream of bytes, reblocking it to whatever block size it uses for the File-type Volumes (is this also 63KB?)? Is the block size of File-type Volumes configurable, or is it the same for all such Volumes? If it is configurable, is there any way to tell what the block size of an existing File-type Volume is? No matter what I set the "Maximum Block Size" and "Minimum Block Size" for the "File" Device in bareos-sd.conf to, bls seems to be able to read the existing Volumes without issue, so I suspect this parameter is fixed and ignored for this Volume/Device type, but that's just a guess; I'm sure you know for certain.

mvwieringen

2014-12-19 10:15

developer   ~0001135

Those are a lot of questions and they mostly make little sense.

One thing is for sure: the NetApp is indeed buggy, as it allows you to use
a blocksize for backup that it then fails to restore. The 240 KB blocksize is
the default but is in no way a must; see my previous note on how it really
works. It could be that these are two different releases of the NetApp OS.
When you search for blocksize and NetApp, it looks as if 8.2 is somewhat
prominently mentioned.

As to why reading with 240 KB doesn't work: the SD reads the old
NDMP blocks (which are completely different from what Bareos calls a block)
back in their own size. The NDMP blocks are encoded as normal so-called
DATA streams (each piece of backup data has its own stream type, e.g. data,
metadata, compressed data, etc.). The current code doesn't try to assemble
multiple parts of data into one bigger block. That is also fine, as it's a
known limitation that you should use the same NDMP blocksize for backup and
restore.

As to why it works with bextract: that is fairly simple, as it just concatenates
all data blocks into one long data stream. That is something you can of course
read in any blocksize you want until it is fully processed.

Any setting of the minimum and maximum block size on the Bareos tape/file
device has zero impact on the whole thing, as the NDMP blocksize is a separate
configuration setting and the code will chunk things the right way: it
operates on NDMP data streams, which are opaque to the SD.

The SD just returns the exact same data as bextract, only in 63 KB blocks,
i.e. the POSIX default blocksize. The NetApp, for whatever reason (probably
because it's not a 4 KB-chunked piece of data), doesn't like the data coming
in over the NDMP protocol.

It would be possible to write re-assembly code that maps from one NDMP
blocksize to another on restore, but that is not something that is really high
on my priority list, especially given that the NDMP blocksize is a known fixed
thing for NDMP; such a translation is only possible because we emulate the
NDMP Tape Agent.

So I guess your only option is to perform the actions I mentioned in the
note above. The blocksize on the disk pool is not strictly needed, but for
performance reasons it probably makes sense to have a blocksize that is
a bit bigger than the NDMP blocksize, so a single chunk of backup data
fits and doesn't need to be cut into multiple pieces when saving it.
gglynn

2014-12-19 16:08

reporter   ~0001136

"One thing is for sure: the NetApp is indeed buggy, as it allows you to use
a blocksize for backup that it then fails to restore. The 240 KB blocksize is
the default but is in no way a must; see my previous note on how it really
works. It could be that these are two different releases of the NetApp OS.
When you search for blocksize and NetApp, it looks as if 8.2 is somewhat
prominently mentioned."

I think SMTAPE is somewhat different from a non-SMTAPE NDMP backup. The NetApp generates a single SMTAPE file for the volume you're backing up and transfers it via NDMP to the Tape Agent as it's being generated. The SMTAPE volume backup file seems to be somewhat like a tarfile, and it is generated with a 240 KB internal data block size by default. When an SMTAPE backup is conducted via NDMP, the NetApp doesn't seem to care if the NDMP Tape Agent session's block size is different from the SMTAPE block size. With an NdmpBlockSize of 63 KB configured on a session, the NDMP service on the NetApp just breaks the stream of bytes from its SMTAPE generation process into 63 KB blocks and transfers them to the Tape Agent.

It's on NDMP SMTAPE restore sessions that the NetApp seems to require an exactly 240 KB NDMP block size; otherwise it fails immediately with the error I originally reported, regardless of the DATA_BLOCK_SIZE meta option configured on the restore fileset in bareos-dir.conf (I tried setting that to 63 KB at one point, too). This seems to me to be an arbitrary and harmful limitation of the NetApp, so I agree with your "buggy" assessment in this case.
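The re-blocking behavior described above can be sketched as follows. This is purely illustrative code (not NetApp or Bareos source): it shows how a continuous SMTAPE byte stream would be cut into fixed-size NDMP records, with the final short record zero-padded here for simplicity.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration: break a continuous SMTAPE byte stream into
 * fixed-size NDMP records. 'out' must hold nblocks * block_size bytes.
 * Returns the number of records produced. */
size_t reblock_stream(const unsigned char *stream, size_t stream_len,
                      size_t block_size, unsigned char *out)
{
    size_t nblocks = (stream_len + block_size - 1) / block_size;

    for (size_t i = 0; i < nblocks; i++) {
        size_t off = i * block_size;
        size_t n = stream_len - off < block_size ? stream_len - off
                                                 : block_size;
        memcpy(out + off, stream + off, n);
        memset(out + off + n, 0, block_size - n); /* pad the last record */
    }
    return nblocks;
}
```

With a 63 KB NdmpBlockSize the NetApp would effectively do this to the SMTAPE stream on backup; the point of the thread is that on restore it refuses anything but 240 KB records.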

"The SD just returns the exact same data as bextract, only in 63 KB blocks,
i.e. the POSIX default blocksize. The NetApp, for whatever reason (probably
because it's not a 4 KB-chunked piece of data), doesn't like the data coming
in over the NDMP protocol."

I don't think that's what we're seeing here. I've run tcpdump during the restore attempts, and as far as I can tell, the SD isn't sending anything to the NetApp at all. The SD and the NetApp just exchange TCP packets with zero-byte payloads back and forth over the NDMP Tape Agent connection forever until I abort the job by stopping the Director. Is the SD not sending anything because the size of the blocks that it's reading from the backup Volumes is different from the NdmpBlockSize? Is there any indication of this in the SD debug logs?

I can live with making another NDMP SMTAPE backup of the source volume using a 240 KB NdmpBlockSize if I have to; I just want to make sure we understand why the SD isn't reading and sending blocks of data to the NetApp with the current configuration before I do so, because it's a big volume, and it will probably take a full 24 hours to back it up again.
mvwieringen

2014-12-19 16:36

developer   ~0001137

I have no idea, but I would say create a small test set. As NDMP is such a weird
protocol, it seems everyone has their own way of implementing things. I think it's
a protocol violation if the NetApp doesn't adhere to the correct blocksize given
on restore; then again, NetApp is one of the inventors of NDMP.

We have seen more strangeness from NetApp, but unless I get some money to
spend on it, this is how far I can debug it remotely. After all, this is open
source; you could try adding some more debugging to the NDMP tape callbacks in
the SD to get a better idea of what it's doing. We based this on top of a
reference implementation of NDMP which is also used by other open source
vendors.
mvwieringen

2014-12-19 16:37

developer   ~0001138

B.t.w., as we are entering the holiday season, I won't be doing much more this year.
gglynn

2014-12-19 16:50

reporter   ~0001140

Ok, thanks for your help, I appreciate it. You've pointed me in the right direction, and I'll post back to this bug after I've got it figured out. Cheers!
gglynn

2015-01-09 05:15

reporter   ~0001175

Just a quick update. I tried setting the NdmpBlockSize on the source NetApp's Client definition and the Max Block Size on a new Pool and Storage Device all to 245760 bytes (240 KB), but tripped over an issue with setting the NdmpBlockSize greater than 102400 bytes (100 KB). I think this is because the NDMP channel image stream buffer size is currently set to 100 KB in src/ndmp/ndmagents.h, here:

#ifndef NDM_N_IMAGE_STREAM_BUF
#define NDM_N_IMAGE_STREAM_BUF (100*1024)
#endif

As a result, when you specify an NdmpBlockSize bigger than NDM_N_IMAGE_STREAM_BUF on a backup, the image stream buffer fills up while it waits for the Bareos SD to start pulling data out of the buffer and writing it to disk. But the SD is waiting for a full NDMP block to be received, which will never happen, because the NDMP channel is out of buffer space to store more data, so the SD starts advertising a zero-byte TCP window to the client (the NetApp, in this case).

I'm going to increase the stream buffer size in the header file to 256KB and recompile to see if that allows me to take a backup with 240KB NDMP, tape, and SMTAPE image block sizes.
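An alternative to hardcoding a bigger constant is to size the buffer from the job's record size. Here is a minimal sketch of that idea; the struct and function names are invented for illustration and are not the actual Bareos code.

```c
#include <stdlib.h>

/* Sketch only (names are hypothetical): instead of a fixed 100 KB
 * NDM_N_IMAGE_STREAM_BUF, size the image-stream buffer from the NDMP
 * record size of the job, so any NdmpBlockSize fits and the channel
 * cannot stall on a full buffer. */
struct image_stream_buf {
    unsigned char *data;
    size_t size;
};

int image_stream_buf_alloc(struct image_stream_buf *b, size_t record_size)
{
    /* Fall back to the historical 100 KB default when no size is given. */
    b->size = record_size ? record_size : 100 * 1024;
    b->data = malloc(b->size);
    return b->data != NULL;
}

void image_stream_buf_free(struct image_stream_buf *b)
{
    free(b->data);
    b->data = NULL;
    b->size = 0;
}
```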
gglynn

2015-01-09 08:21

reporter   ~0001176

Increasing that NDM_N_IMAGE_STREAM_BUF value to 256 KB did indeed allow a backup to proceed with the NdmpBlockSize and the volume max block size set to 245760 bytes (note that I left the volume label block size at the new default value of 63 KB), but restores are still failing almost immediately with the same EOF error I originally reported. However, I am now almost certain that this error is being generated by the Bareos SD. Here's the relevant debug output from the bareos-sd process (run at debug level 999) during a restore attempt:

*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [5] >C NDMP4_MOVER_READ 11
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [6] offset=0 length=-1
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [5] R> NDMP4_MOVER_READ 11 (13)
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [6] error=NDMP4_NO_ERR
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [9] writing 32 ...
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [9] 80 00 00 1c 00 00 00 0d 54 af 66 78 00 00 00 01
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [9] 00 00 0a 06 00 00 00 0b 00 00 00 00 00 00 00 00
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [8] write=32 len=32

In the snippet above, the NetApp appears to tell the MOVER state machine within Bareos SD's NDMP TAPE service to start reading the backup from the file volume.

*** sneakernet-sd: record.c:669-0 Block=0 Ver=2 size=245760
*** sneakernet-sd: record.c:677-0 Enter read_record_block: remlen=245736 data_len=0 rem=0 blkver=2
*** sneakernet-sd: record.c:741-0 rd_rec_blk() got FI=SOS_LABEL SessId=1 Strm=128 len=223 remlen=245724 data_len=0
*** sneakernet-sd: record.c:805-0 Rtn full rd_rec_blk FI=SOS_LABEL SessId=1 Strm=128 len=223
*** sneakernet-sd: read_record.c:285-0 read-OK. state_bits= blk=0 rem=0 file:block=0:245973
*** sneakernet-sd: read_record.c:296-0 recno=1 state_bits= blk=0 SI=1 ST=1420780864 FI=-4
*** sneakernet-sd: read_record.c:74-0 Begin Session Record: VolSessionId=1 VolSessionTime=1420780864 JobId=128 DataLen=223

Then, the debug shows that the SD successfully read the SOS (Start of Session) record from the volume, and it correctly identifies the backup as JobId 128.

*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [5] C> NDMP4_NOTIFY_MOVER_PAUSED 14
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [6] reason=NDMP4_MOVER_PAUSE_EOF seek_position=0
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [9] writing 40 ...
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [9] 80 00 00 24 00 00 00 0e 54 af 66 78 00 00 00 00
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [9] 00 00 05 04 00 00 00 00 00 00 00 00 00 00 00 02
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [9] 00 00 00 00 00 00 00 00
*** sneakernet-sd: ndmp_tape.c:245-0 NDMP: [#C] [8] write=40 len=40

Right here is the problem: immediately after the SD successfully reads the SOS record, the SD notifies the NDMP client (the NetApp) that it has paused its MOVER because it reached an unexpected EOF reading from the volume, which is incorrect. As soon as the NetApp sees this EOF notification, it tells the Director to cancel the restore job.

So, any idea why the SD is pausing the MOVER and blaming it on a non-existent EOF read?
mvwieringen

2015-01-10 11:07

developer   ~0001177

I guess you'll only find out if you put things under a debugger; the interesting
entry points are bndmp_read_data_from_block() and bndmp_tape_read().

bndmp_tape_read() is the callback called by the NDMP code for reading tape
data.

I also added two patches that change the checking of the SMTAPE blocksize to
be in the range defined by NetApp, and one other patch that allocates the image
stream buffer dynamically according to the NDMP Job in use.

We spent quite some time on modernizing this reference implementation of the
NDMP protocol, but it's from 2001 and that shows in some cases.

Although this has my personal interest, there are currently a lot of things
preempting any time I can spend on it other than replying and giving some
hints.
gglynn

2015-01-12 16:15

reporter   ~0001179

Last edited: 2015-01-12 18:07

I think I got it figured out Friday night. In the main loop of bndmp_read_data_from_block(), when read_next_record_from_block() succeeds, the maskedStream field (an int32_t) of the read record is checked and matched against three possible values (defined in streams.h):

* STREAM_UNIX_ATTRIBUTES [1]
* STREAM_FILE_DATA [2]
* STREAM_NDMP_SEPERATOR [999]

STREAM_UNIX_ATTRIBUTES records are silently discarded (i.e. the read loop just continues) and STREAM_FILE_DATA records are passed to the DATA agent. STREAM_NDMP_SEPERATOR and all other record types (including SOS_LABEL records) are also discarded, but not silently. Rather, the data_length variable is set to zero, and the function returns true back to bndmp_tape_read() from which it was called. The data_length value in bndmp_read_data_from_block() is passed back to the done_count variable in bndmp_tape_read(), and if done_count equals zero (which it does, in this case), then bndmp_tape_read() returns an NDMP9_EOF_ERR to its caller, which kills the restore job as soon as it begins.

To fix this behavior, I modified bndmp_read_data_from_block() to silently discard the SOS_LABEL record (with a maskedStream of 128) in the same manner as STREAM_UNIX_ATTRIBUTES records. An EOF is still generated when the STREAM_NDMP_SEPERATOR record that appears after the SMTAPE file data is read, which causes Bareos to report that the restore failed, but it actually succeeds on the NetApp (i.e. the restore target volume is fully restored from the SMTAPE backup). So, some further modification is necessary to get Bareos to close out the restore job properly, but it appears to do what I need it to do now.
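The dispatch logic described above, with label records skipped by their negative FileIndex rather than by stream number, can be sketched like this. This is a simplified illustration, not the actual Bareos source; the record struct is a stand-in for Bareos' internal record type, and the stream values are the ones quoted above from streams.h (the SEPERATOR spelling matches the source).

```c
/* Stream types as quoted above from streams.h. */
#define STREAM_UNIX_ATTRIBUTES 1
#define STREAM_FILE_DATA       2
#define STREAM_NDMP_SEPERATOR  999

struct rec {                 /* stand-in for Bareos' record structure */
    int file_index;          /* negative for label records (SOS_LABEL = -4) */
    int masked_stream;
};

enum dispatch_result { REC_SKIP, REC_DATA, REC_EOF };

/* REC_SKIP continues the read loop; REC_EOF is what bndmp_tape_read()
 * turns into NDMP9_EOF_ERR when no data was produced. */
enum dispatch_result dispatch_record(const struct rec *r)
{
    if (r->file_index < 0)            /* label record: skip, don't signal EOF */
        return REC_SKIP;
    switch (r->masked_stream) {
    case STREAM_UNIX_ATTRIBUTES:      /* metadata: silently discarded */
        return REC_SKIP;
    case STREAM_FILE_DATA:            /* payload: handed to the DATA agent */
        return REC_DATA;
    case STREAM_NDMP_SEPERATOR:       /* valid end-of-data marker */
    default:                          /* anything else also ends the read */
        return REC_EOF;
    }
}
```

With the original code, the SOS_LABEL record (FileIndex -4, stream 128 in the debug log above) fell through to the EOF path; checking FileIndex first makes it harmless.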

mvwieringen

2015-01-13 10:47

developer   ~0001180

Thanks for the analysis. It seems this is a side effect of some things changing
over time in the upper levels, which means that the reader now needs to handle
those label records itself (i.e. ignore them). Although your workaround probably
works, I think the fix is a bit more subtle: we need to skip any label record,
i.e. check whether FileIndex is negative and skip those. I added a patch that does
that; it also prints an error now when it gets a data stream that it doesn't like,
which means we'll see these kinds of problems sooner in the debugging output
should they ever occur again.

Maybe you can test with these 3 patches and we can take it from there.

Regarding the STREAM_NDMP_SEPERATOR: that is a valid EOF marker, so I would be
interested as to why the restore Job is actually getting a failed state. As
each NDMP implementation is different, this could be some strange side effect.
I do remember that when writing the code I also got a strange error from the
Oracle NDMP implementation when it asked for a certain return status after the
restore, but when looking more closely at the NDMP RFC it was obvious that was
a faulty implementation in that NDMP server software.
mvwieringen

2015-02-02 13:27

developer   ~0001250

Could you elaborate a bit more on what error you see when restoring after
the fixes you made? Then I can either close this report or look at what needs
to be fixed.

Related Changesets

bareos: master 051522be

2014-12-18 11:36

Marco van Wieringen

Ported: N/A

Details Diff
Add checks for SMTAPE blocksizes. Affected Issues
0000374
mod - src/dird/ndmp_dma.c Diff File

bareos: master 694f1364

2015-01-10 11:46

Marco van Wieringen

Ported: N/A

Details Diff
Make NDMP ImageStream buffer dynamically allocated size.

Instead of hardcoding the Image Stream buffer to 100 KB, size it to the
actual recordsize in use by the NDMP Job. This way we don't have to
check whether the specified recordsize is outside the allowable size.
Affected Issues
0000374
mod - src/ndmp/ndma_image_stream.c Diff File
mod - src/ndmp/ndmagents.h Diff File

bareos: master 739c923d

2015-01-13 11:36

Marco van Wieringen

Ported: N/A

Details Diff
Ignore tape labels in NDMP code.

When we encounter some sort of tape label in the backup stream
(e.g. FileIndex < 0) then we should just skip that data and not
use the default handling, which leads to an EOF. When we now encounter an
unknown data stream we also print an error including the number of the
unwanted data stream.
Affected Issues
0000374
mod - src/stored/ndmp_tape.c Diff File

Issue History

Date Modified Username Field Change
2014-12-16 17:04 gglynn New Issue
2014-12-16 20:48 mvwieringen Note Added: 0001111
2014-12-16 20:48 mvwieringen Status new => feedback
2014-12-16 21:50 gglynn File Added: ndmp-snooping-output-201412161549.txt
2014-12-16 21:56 gglynn Note Added: 0001112
2014-12-16 21:56 gglynn Status feedback => new
2014-12-16 22:27 mvwieringen Note Added: 0001113
2014-12-16 22:27 mvwieringen Status new => feedback
2014-12-16 23:20 gglynn File Added: sneakernet-dir.restore.3.bsr
2014-12-16 23:28 gglynn Note Added: 0001114
2014-12-16 23:28 gglynn Status feedback => new
2014-12-17 19:46 mvwieringen Note Added: 0001118
2014-12-17 19:46 mvwieringen Assigned To => mvwieringen
2014-12-17 19:46 mvwieringen Status new => feedback
2014-12-17 20:45 gglynn File Added: bareos-sd-debuglog-part1.txt
2014-12-17 20:45 gglynn File Added: bareos-sd-debuglog-part2.txt
2014-12-17 20:47 gglynn Note Added: 0001119
2014-12-17 20:47 gglynn Status feedback => assigned
2014-12-17 20:47 gglynn Note Edited: 0001119
2014-12-17 20:54 mvwieringen Note Added: 0001120
2014-12-17 20:54 mvwieringen Note Added: 0001121
2014-12-17 20:54 mvwieringen Status assigned => feedback
2014-12-17 21:10 gglynn File Added: bareos-dir-debuglog.txt
2014-12-17 21:12 gglynn Note Added: 0001122
2014-12-17 21:12 gglynn Status feedback => assigned
2014-12-17 21:17 gglynn Note Added: 0001123
2014-12-17 22:00 mvwieringen Note Added: 0001124
2014-12-17 22:00 mvwieringen Status assigned => feedback
2014-12-17 22:12 gglynn Note Added: 0001125
2014-12-17 22:12 gglynn Status feedback => assigned
2014-12-17 22:13 gglynn Note Edited: 0001125
2014-12-17 22:24 mvwieringen Note Added: 0001126
2014-12-17 22:25 mvwieringen Status assigned => feedback
2014-12-18 03:37 gglynn Note Added: 0001128
2014-12-18 03:37 gglynn Status feedback => assigned
2014-12-18 03:43 gglynn Note Edited: 0001128
2014-12-18 04:17 gglynn File Added: bareos-dir-vol2restore-201412172200-debuglog.txt
2014-12-18 04:17 gglynn File Added: bareos-sd-vol2restore-201412172200-debuglog-part1.txt
2014-12-18 04:17 gglynn File Added: bareos-sd-vol2restore-201412172200-debuglog-part2.txt
2014-12-18 04:18 gglynn Note Added: 0001129
2014-12-18 09:49 mvwieringen File Deleted: bareos-sd-vol2restore-201412172200-debuglog-part2.txt
2014-12-18 09:49 mvwieringen File Deleted: bareos-sd-vol2restore-201412172200-debuglog-part1.txt
2014-12-18 09:49 mvwieringen File Deleted: bareos-dir-vol2restore-201412172200-debuglog.txt
2014-12-18 09:49 mvwieringen File Deleted: bareos-dir-debuglog.txt
2014-12-18 09:49 mvwieringen File Deleted: bareos-sd-debuglog-part2.txt
2014-12-18 09:50 mvwieringen File Deleted: bareos-sd-debuglog-part1.txt
2014-12-18 09:50 mvwieringen File Deleted: sneakernet-dir.restore.3.bsr
2014-12-18 09:56 mvwieringen Note Added: 0001131
2014-12-18 09:56 mvwieringen Status assigned => feedback
2014-12-18 14:32 mvwieringen Relationship added related to 0000345
2014-12-18 23:08 gglynn Note Added: 0001134
2014-12-18 23:08 gglynn Status feedback => assigned
2014-12-18 23:09 gglynn Note Edited: 0001134
2014-12-19 10:15 mvwieringen Note Added: 0001135
2014-12-19 10:15 mvwieringen Status assigned => feedback
2014-12-19 16:08 gglynn Note Added: 0001136
2014-12-19 16:08 gglynn Status feedback => assigned
2014-12-19 16:36 mvwieringen Note Added: 0001137
2014-12-19 16:37 mvwieringen Note Added: 0001138
2014-12-19 16:37 mvwieringen Status assigned => feedback
2014-12-19 16:50 gglynn Note Added: 0001140
2014-12-19 16:50 gglynn Status feedback => assigned
2015-01-09 05:15 gglynn Note Added: 0001175
2015-01-09 08:21 gglynn Note Added: 0001176
2015-01-10 11:01 mvwieringen File Added: 0001-Add-checks-for-SMTAPE-blocksizes.patch
2015-01-10 11:01 mvwieringen File Added: 0002-Make-NDMP-ImageStream-buffer-dynamically-allocated-s.patch
2015-01-10 11:07 mvwieringen Note Added: 0001177
2015-01-10 11:07 mvwieringen Status assigned => feedback
2015-01-10 12:19 mvwieringen File Deleted: 0002-Make-NDMP-ImageStream-buffer-dynamically-allocated-s.patch
2015-01-10 12:20 mvwieringen File Added: 0002-Make-NDMP-ImageStream-buffer-dynamically-allocated-s.patch
2015-01-12 16:15 gglynn Note Added: 0001179
2015-01-12 16:15 gglynn Status feedback => assigned
2015-01-12 18:07 gglynn Note Edited: 0001179
2015-01-13 10:47 mvwieringen Note Added: 0001180
2015-01-13 10:48 mvwieringen File Added: 0003-Ignore-tape-labels-in-NDMP-code.patch
2015-01-13 10:50 mvwieringen Status assigned => feedback
2015-01-14 20:34 mvwieringen File Deleted: 0003-Ignore-tape-labels-in-NDMP-code.patch
2015-01-14 20:35 mvwieringen File Added: 0003-Ignore-tape-labels-in-NDMP-code.patch
2015-01-28 12:44 mvwieringen Relationship added child of 0000393
2015-02-02 13:22 mvwieringen Changeset attached => bareos master 739c923d
2015-02-02 13:23 mvwieringen Changeset attached => bareos master 694f1364
2015-02-02 13:25 mvwieringen Changeset attached => bareos master 051522be
2015-02-02 13:25 mvwieringen File Deleted: 0003-Ignore-tape-labels-in-NDMP-code.patch
2015-02-02 13:26 mvwieringen File Deleted: 0002-Make-NDMP-ImageStream-buffer-dynamically-allocated-s.patch
2015-02-02 13:26 mvwieringen File Deleted: 0001-Add-checks-for-SMTAPE-blocksizes.patch
2015-02-02 13:27 mvwieringen Note Added: 0001250
2015-03-24 11:25 mvwieringen Assigned To mvwieringen =>
2015-03-24 11:25 mvwieringen Status feedback => closed
2015-03-24 11:25 mvwieringen Fixed in Version => 14.2.3