ZFS Replication - synchronization and error handling issues
Reported by: jal | Owned by: delphij
Description (last modified by delphij)
(This was originally based on code inspection, but is easy to reproduce using an invalid remote server.)
ZFS Replication (in autosnap.py) advances "lastsnapshot" even when replication fails for any reason (e.g. server offline / not responding). The next run then skips the filesystem updates that were never sent, and the destination loses synchronization with the source.
(1) The "zfs send -I" needs to be an incremental from the last successfully received snapshot. (As a special case, "lastsnapshot" needs to stay null until the first successful transfer.)
(2) Periodic snapshots are expired/destroyed by autosnap.py without regard to replication state. The two filesystems can therefore get permanently out of sync if the remote server is offline for an extended period of time and autosnap.py gets around to destroying the periodic snapshot that also happens to be the last replication snapshot, because no common base is left for an incremental send.
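To illustrate points (1) and (2), here is a minimal sketch of the intended rules. It is not the code in autosnap.py; the function names (build_send_cmd, next_lastsnapshot, expirable) and the example dataset/snapshot names are hypothetical and only for illustration.

```python
def build_send_cmd(dataset, lastsnapshot, target):
    """Build the 'zfs send' argv for one replication attempt.

    With no successfully received snapshot yet (lastsnapshot is None),
    a full stream is sent; otherwise an incremental (-I) from the last
    successfully received snapshot.
    """
    if lastsnapshot is None:
        return ["zfs", "send", "{}@{}".format(dataset, target)]
    return ["zfs", "send", "-I",
            "{}@{}".format(dataset, lastsnapshot),
            "{}@{}".format(dataset, target)]


def next_lastsnapshot(lastsnapshot, target, transfer_ok):
    """Point (1): advance 'lastsnapshot' only after a successful transfer."""
    return target if transfer_ok else lastsnapshot


def expirable(snapshots, lastsnapshot):
    """Point (2): never offer the last replicated snapshot for expiry."""
    return [s for s in snapshots if s != lastsnapshot]
```

With these rules, a failed run leaves "lastsnapshot" where it was (or null), so the next attempt sends from the same base, and the expiry pass cannot remove that base.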
It would be more robust if ZFS Replication did its own two-phase snapshot management: create a "newsnap" for replication, attempt to replicate the incremental from "oldsnap" to "newsnap", and then, only upon success, destroy "oldsnap" and set "lastsnapshot" to point to "newsnap".
(This would also allow for independent replication vs. primary snapshot schedules/frequency, which is also highly desirable!)
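A rough sketch of the two-phase cycle proposed above, under stated assumptions: replication_cycle and its callables (create_snap, send, destroy_snap) are hypothetical stand-ins for the real "zfs snapshot", "zfs send | zfs receive" and "zfs destroy" steps, and the state dict stands in for wherever "lastsnapshot" is stored. Destroying a failed "newsnap" is also an assumption; the proposal does not say what to do with it.

```python
def replication_cycle(state, new_name, create_snap, send, destroy_snap):
    """Run one two-phase replication attempt.

    state        -- dict holding 'lastsnapshot' (None before the first success)
    new_name     -- name of the replication snapshot to create ("newsnap")
    create_snap  -- callable(name): create the snapshot
    send         -- callable(old, new) -> bool: True only if the remote
                    side received the stream successfully
    destroy_snap -- callable(name): destroy the snapshot
    """
    old = state.get("lastsnapshot")

    # Phase 1: create "newsnap" and try to replicate oldsnap -> newsnap.
    create_snap(new_name)
    if not send(old, new_name):
        # Failure: drop the attempt, keep "oldsnap" as the incremental base.
        destroy_snap(new_name)
        return False

    # Phase 2: commit. Record the new base before destroying the old one,
    # so an interruption here never leaves 'lastsnapshot' pointing at a
    # snapshot that no longer exists.
    state["lastsnapshot"] = new_name
    if old is not None:
        destroy_snap(old)
    return True
```

Because the cycle owns its snapshots, it can run on its own schedule, independent of the periodic snapshot frequency.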