Opened 18 months ago

Closed 7 days ago

#1066 closed defect (invalid)

Not reporting a removed drive as a problem?

Reported by: jgreco
Owned by:
Priority: major
Milestone:
Component: Backend
Version: 8.0.2-RELEASE
Keywords:
Cc:

Description

Not quite understanding what's intended to happen here.

I have a very minimal FreeNAS 8.0.2R-i386 on a VM; it's got 512MB RAM and is tuned to serve two file extents via iSCSI. These file extents reside on two sets of ZFS mirrored drives, one's a pair of 500GB drives, the other's a pair of 240GB SSD's. Don't tell me it's too small a configuration, it's broken on a larger configuration too.

So I created the extent file on the SSD mirror. I then pulled one of the SSD's. It still works, of course, but no meaningful error reporting seems to have happened.

I would have appreciated an e-mailed warning of impending doom and nonredundancy of my data. But I could live without that, I guess.

What really didn't make any sense is that FreeNAS didn't seem to be at all fazed by the loss of a drive. The "Alert" stoplight is green. "View All Volumes" shows both volumes as healthy. "View Disks" (from the Active Volumes screen) shows it without a name and with "Unknown" for a serial number. "zpool status" from there just says "Sorry, an error occurred." The only real sign of problems in the GUI is the system log output, where it does say "kernel: (ada3:ahcich3:0:0:0): lost device".

Now maybe the problem is ZFS; when I ssh in, I get odd-seeming stuff like

    # zpool list hespssd
    NAME      SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
    hespssd   220G   216G  3.62G    98%  ONLINE  /mnt
    # zpool status
      pool: hesphyb
     state: ONLINE
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            hesphyb     ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                ada0p2  ONLINE       0     0     0
                ada1p2  ONLINE       0     0     0

    errors: No known data errors

      pool: hespssd
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error.  An
            attempt was made to correct the error.  Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            hespssd     ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                ada2p2  ONLINE       0     0     0
                ada3p2  ONLINE   6.92K 7.31K    41

    errors: No known data errors

So it's showing a drive I have in my hand as "ONLINE" but throwing errors, I guess. Ooookay.

I don't really know what ZFS is supposed to report in this case, but I'm guessing that someone's just parsing the "ONLINE" and moving forward with that as "HEALTHY". But it's not healthy if my redundancy doesn't exist.
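To illustrate the difference between "ONLINE" and "healthy", here is a minimal sketch of a check that also looks at the READ/WRITE/CKSUM counters rather than the state word alone. It is not how FreeNAS does it; the function name `suspect_devices`, the `SAMPLE` text (abridged from the output above), and the assumption that counters are abbreviated with K/M/G/T suffixes are all illustrative.

```python
import re

# Assumed abbreviations for large counters, e.g. "6.92K".
_SUFFIX = {"K": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}

# One config row: name, state, then three error counters.
_ROW = re.compile(
    r"^\s*(\S+)\s+([A-Z]+)"
    r"\s+(\d[\d.]*[KMGT]?)\s+(\d[\d.]*[KMGT]?)\s+(\d[\d.]*[KMGT]?)\s*$"
)
_POOL = re.compile(r"^\s*pool:\s*(\S+)")


def _count(text):
    """Convert a counter such as '0' or '6.92K' to a number."""
    if text[-1] in _SUFFIX:
        return float(text[:-1]) * _SUFFIX[text[-1]]
    return float(text)


def suspect_devices(status_text):
    """Return (pool, name, state) for each config row that is either not
    ONLINE or has a non-zero READ, WRITE, or CKSUM counter."""
    pool, found = None, []
    for line in status_text.splitlines():
        m = _POOL.match(line)
        if m:
            pool = m.group(1)
            continue
        m = _ROW.match(line)
        if not m:
            continue
        name, state = m.group(1), m.group(2)
        counters = m.groups()[2:]
        if state != "ONLINE" or any(_count(c) > 0 for c in counters):
            found.append((pool, name, state))
    return found


# Abridged from the 'zpool status' output quoted in this ticket.
SAMPLE = """\
  pool: hespssd
 state: ONLINE
config:
        NAME        STATE     READ WRITE CKSUM
        hespssd     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE   6.92K 7.31K    41
"""

print(suspect_devices(SAMPLE))  # [('hespssd', 'ada3p2', 'ONLINE')]
```

On the output above this flags ada3p2 even though every row says ONLINE, which is the case a state-only check would miss.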

Anyone who cares to explain how this is all supposed to work together to report a failed drive ... well, please feel free. :-) I ditched Nexenta too long ago to really remember how that handled this sort of thing, and I don't have any other ZFS implementations handy right now.

Change History (6)

comment:1 follow-up: Changed 18 months ago by gcooper

http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134491 - problem is ZFS + geom and it's only resolved in 9+. I thought this was documented somewhere..

zfs scrubs should fix this problem.

comment:2 follow-up: Changed 18 months ago by jgreco

I've seen that PR. From a user's point of view, that appears to be a problem with hot spare replacement, which would be annoying, yes, but I could definitely cope with that if the system were at least to provide warning that Something Bad(tm) was happening.

We're using some of those SandForce SSD's ("dodgy" is the polite word, I believe) mirrored to provide redundancy. Losing one of them isn't a problem, and I give it even odds that it will happen in the first year, but if and when it does, I need to know, so that we can migrate stuff off that VMware iSCSI datastore onto more conventional technology. All I need is for the second one to remain operable long enough to do that migration. But I need some sort of alert to that.

Now I can probably hang a notifier off the syslog "lost device" warning, so I'm not completely screwed by the drive removal scenario, but what happens when other forms of corruption occur? Will FreeNAS report those? Or do I need to be having our network monitoring log in every 15 minutes to try to ferret clues out of cmdline stuff? And what should I be looking for?
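A rough sketch of what such a syslog-based notifier might look for, assuming the message always has the shape of the one line quoted in the description ("kernel: (ada3:ahcich3:0:0:0): lost device"). Other drivers may word it differently, and the function name `lost_devices` is made up for illustration; actually sending the alert is left out.

```python
import re

# Pattern modeled on the single log line quoted in this ticket (assumption:
# other controllers/drivers may format the message differently).
LOST = re.compile(r"kernel: \((\w+):(\w+):\d+:\d+:\d+\): lost device")


def lost_devices(log_lines):
    """Return device names (e.g. 'ada3') reported as lost, in first-seen
    order and without duplicates."""
    seen = []
    for line in log_lines:
        m = LOST.search(line)
        if m and m.group(1) not in seen:
            seen.append(m.group(1))
    return seen


print(lost_devices(["kernel: (ada3:ahcich3:0:0:0): lost device"]))  # ['ada3']
```

This only covers the pulled-drive case; it says nothing about the other forms of corruption asked about above.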

Some documentation as to what happens during various error scenarios would be most helpful.

comment:3 in reply to: ↑ 1 Changed 18 months ago by gcooper

Replying to gcooper:

http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/134491 - problem is ZFS + geom and it's only resolved in 9+. I thought this was documented somewhere..

I stand corrected about this statement. This is only fixable via the zfsd branch, which still hasn't hit production 9.x.

comment:4 in reply to: ↑ 2 Changed 18 months ago by gcooper

Replying to jgreco:

I've seen that PR. From a user's point of view, that appears to be a problem with hot spare replacement, which would be annoying, yes, but I could definitely cope with that if the system were at least to provide warning that Something Bad(tm) was happening.

Indeed, and while it seems like a good idea to have hotspare replacement, it's a really hard problem to solve when we don't know the hardware configuration of the box that we've installed on. This is not so much the case with TrueNAS as we have fixed hardware configurations, but it's definitely the case with FreeNAS as people are using COTS hardware to accomplish the task of making a NAS.

The best recommendation I have for this is that if you need alerting, hotswap replacement, etc, use a fully manageable hardware RAID controller, like an LSI MegaRAID controller, or a 3Ware twa compatible controller.

We're using some of those SandForce SSD's ("dodgy" is the polite word, I believe) mirrored to provide redundancy. Losing one of them isn't a problem, and I give it even odds that it will happen in the first year, but if and when it does, I need to know, so that we can migrate stuff off that VMware iSCSI datastore onto more conventional technology. All I need is for the second one to remain operable long enough to do that migration. But I need some sort of alert to that.

I agree. It can be done to some degree via devd, but it's still a tricky proposition and resolving it that way can bring about unwanted consequences; sometimes devices drop on and off, then come back to life later on -- it's usually a sign of a driver or hardware issue, but it can make a bad situation worse than it needs to be if something hardware-wise gets twitchy.
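One way to picture the "drops off, then comes back" concern is a grace period before alerting. The sketch below is purely illustrative and not taken from devd or FreeNAS: the event tuples, the `should_alert` name, and the 60-second default are all assumptions.

```python
def should_alert(events, now, grace_seconds=60):
    """events: (timestamp, device, kind) tuples in time order, where kind is
    'lost' or 'found'. Return the devices whose most recent event is 'lost'
    and at least grace_seconds old, so brief dropouts stay quiet."""
    last = {}
    for ts, dev, kind in events:
        last[dev] = (ts, kind)  # keep only the latest event per device
    return sorted(
        dev
        for dev, (ts, kind) in last.items()
        if kind == "lost" and now - ts >= grace_seconds
    )


events = [(0, "ada3", "lost"), (10, "ada3", "found"), (20, "ada2", "lost")]
print(should_alert(events, now=100))  # ['ada2']
```

The trade-off is that a real failure is reported later by the length of the grace period.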

Now I can probably hang a notifier off the syslog "lost device" warning, so I'm not completely screwed by the drive removal scenario, but what happens when other forms of corruption occur? Will FreeNAS report those? Or do I need to be having our network monitoring log in every 15 minutes to try to ferret clues out of cmdline stuff? And what should I be looking for?

I tried hacking in a solution before. Let me see if I can do it as a part of the 8.2 release.

Some documentation as to what happens during various error scenarios would be most helpful.

Once I have more salient details I'll post them to the ticket.

comment:5 Changed 7 days ago by trininox

Anything new with this issue? It seems related to the issue I'm having. I have a 16-port 3ware controller that I'm using as JBOD, as recommended by FreeNAS; however, pulling a drive doesn't set off any alarms. It does start a replace process with a spare after a reboot.

comment:6 Changed 7 days ago by dwhite

  • Resolution set to invalid
  • Status changed from new to closed

The originally reported bug does not apply to RAID controllers. FreeNAS has no support for monitoring RAID controllers of any brand and as such no notification emails or other alarms will be posted for RAID controller issues, including drive failures.

Please do not necro old bugs to report unrelated issues.

I am closing this ticket as Invalid to avoid further abuse of this ticket. Please post followup discussion to the Forums. Thank you for your understanding.
