#1038 closed defect (invalid)

FreeNAS 8.0.2 crashed after 11 days

Reported by: henk
Owned by:
Priority: critical
Milestone:
Component: FreeBSD
Version: 8.0.4-RELEASE-p2
Keywords:
Cc:

Description

My FreeNAS crashed after 11 days.

text on console:

Fatal double fault
rip = 0xffff ffff 8053 61a0
rsp = 0xffff ff82 3dba 7000
rbp = 0xffff ff82 3dba 7030
cpu id = 1 ; apic id = 01
panic: double fault
cpuid = 1
Uptime = 11d6h51m42s

system information:
FreeNAS Build FreeNAS-8.0.2-RELEASE-amd64 (8288)
Platform AMD Athlon(tm) II Neo N36L Dual-Core Processor
Memory 8049MB
System Time Fri Nov 25 14:23:18 2011
Uptime 2:23PM up 12 mins, 0 users
Load Average 0.00, 0.07, 0.11
OS Version FreeBSD 8.2-RELEASE-p3

system: HP Proliant Microserver N36L
disks: USB-stick

5 x 2Tb, raidz2

Attachments (3)

debug-zatte-20120213193125.txt.bz2 (39.3 KB) - added by henk 16 months ago.
debug output
messages (50.2 KB) - added by casadeferro 12 months ago.
debug-freenas-20120517120554.txt.zip (52.9 KB) - added by casadeferro 12 months ago.
GA-MA69VM-S2 debug

Change History (18)

comment:1 Changed 16 months ago by jpaetzel

Double faults are notoriously hard to debug. Was there ever any resolution of this?

comment:2 Changed 16 months ago by henk

No, it crashed nearly every week with a similar text.
It got better after upgrading to 8.0.3. It no longer crashed, but after
35 days it stopped responding entirely. The only fix was to power it off and on.
After rebooting, rsyncd no longer worked.

This weekend I'm planning to boot a live-cd and do some hardware tests.

comment:3 Changed 16 months ago by delphij

Hi, henk,

Just wanted to confirm -- are you using ECC memory?

The hang after 35 days might coincide with a zpool "scrub", which validates your data to make sure there is no silent data corruption. This operation is quite I/O intensive.

comment:4 follow-up: Changed 16 months ago by henk

I'm not sure, but I think it is not ECC memory. Could that be my problem?

comment:5 in reply to: ↑ 4 Changed 16 months ago by delphij

Replying to henk:

I'm not sure, but I think it is not ECC memory. Could that be my problem?

Well, it's not necessarily the reason you are seeing these issues, but it's a possible cause, especially when your memory chips are not server-grade ones. Sometimes it's very hard to trigger a memory issue without a real workload.

By the way, I'd personally recommend using ECC memory for any storage system, because memory corruption can have very serious consequences for file systems and sometimes data can be lost permanently. Even with advanced filesystems that have validation and self-healing features, like ZFS, that is still a possibility.

Another thing I'd recommend is to double-check whether something is wrong with your USB stick. I have hit this a few times: after writing the firmware into another section, the problem went away. But as Josh said, this type of issue is hard to track down, especially when it can't be triggered reliably or takes a long time to recur.

Would you please also send the last few lines from 'zpool history tank' (replace tank with your zpool's name) and the whole 'zpool status -x' output?

comment:6 Changed 16 months ago by henk

I have been running a memory test for about a week. No errors!
I will buy some ecc-memory anyway in the near future.

The output of the zpool commands is:

# zpool history vol1
...
2012-01-21.00:13:19 zpool set cachefile=/data/zfs/zpool.cache vol1
2012-02-05.01:35:27 zpool import -o cachefile=none -R /mnt -f vol1
2012-02-05.01:35:35 zpool set cachefile=/data/zfs/zpool.cache vol1
2012-02-05.02:00:17 zfs snapshot -r vol1@auto-20120205.0200-50y
2012-02-05.02:00:32 zfs destroy vol1@auto-20120113.0200-10d
2012-02-05.02:00:48 zfs destroy vol1@auto-20120120.0938-10d
2012-02-05.02:01:04 zfs destroy vol1@auto-20120114.0200-10d
2012-02-05.02:01:19 zfs destroy vol1@auto-20120112.0200-10d
2012-02-05.02:01:35 zfs destroy vol1@auto-20120111.0200-10d
2012-02-05.02:01:51 zfs destroy vol1@auto-20120116.0200-10d
2012-02-06.02:00:04 zfs snapshot -r vol1@auto-20120206.0200-10d
2012-02-12.00:01:14 zpool import -o cachefile=none -R /mnt -f vol1
2012-02-12.00:01:18 zpool set cachefile=/data/zfs/zpool.cache vol1
2012-02-12.15:46:48 zpool import -o cachefile=none -R /mnt -f vol1
2012-02-12.15:46:50 zpool set cachefile=/data/zfs/zpool.cache vol1
2012-02-12.15:48:04 zfs snapshot -r vol1@auto-20120212.1548-50y

# zpool status -x
all pools are healthy
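
One way to read the history above: FreeNAS 8 appears to re-import the pool at every boot, so the `zpool import` entries can serve as a rough list of restart times. That interpretation is an assumption and is not confirmed anywhere in this ticket. A minimal sketch, where `list_imports` is a hypothetical helper name:

```sh
# Print the timestamp of every "zpool import" entry found in
# `zpool history` output read from stdin.
# Assumption: each import corresponds to a boot of the NAS.
list_imports() {
    awk '$2 == "zpool" && $3 == "import" { print $1 }'
}
```

It could be used as `zpool history vol1 | list_imports`. For the excerpt shown above it would list 2012-02-05.01:35:27, 2012-02-12.00:01:14 and 2012-02-12.15:46:48.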

comment:7 Changed 16 months ago by henk

I also noticed some periods when the NAS seems to be frozen. In /var/log/messages I found:

Feb 12 20:24:37 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2097588, size: 45056
Feb 12 21:52:39 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2097467, size: 24576
Feb 12 22:52:45 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2097447, size: 12288
Feb 12 23:18:06 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2102148, size: 4096
Feb 12 23:18:09 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2101355, size: 20480
Feb 12 23:18:14 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2102533, size: 32768
Feb 12 23:18:17 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2102177, size: 8192
Feb 12 23:18:26 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2102148, size: 4096
Feb 12 23:18:29 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2101355, size: 20480
Feb 12 23:18:34 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2102533, size: 32768
Feb 12 23:18:37 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2102177, size: 8192
Feb 12 23:18:46 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2102148, size: 4096
Feb 12 23:18:49 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2101355, size: 20480
Feb 12 23:18:54 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2102533, size: 32768
Feb 12 23:18:57 zatte kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 2102177, size: 8192
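
To see when these messages cluster, the log can be summarized per hour. The following is a rough sketch that assumes the standard syslog line layout shown above (month, day, HH:MM:SS, host); `count_swap_waits` is a hypothetical helper name, not an existing tool.

```sh
# Count "indefinite wait buffer" messages per hour.
# Reads syslog lines from stdin and prints "Mon DD HH:00 count".
count_swap_waits() {
    awk '/swap_pager: indefinite wait buffer/ {
        split($3, t, ":")                 # $3 is HH:MM:SS
        key = $1 " " $2 " " t[1] ":00"    # bucket by month, day, hour
        n[key]++
    }
    END { for (k in n) print k, n[k] }' | sort
}
```

It could be run as `count_swap_waits < /var/log/messages`. The output is sorted as plain text, so the ordering is only chronological within a single month.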

comment:8 Changed 16 months ago by gcooper

Can you please provide the contents of /var/log/messages and/or your debug output [1]?

  1. http://doc.freenas.org/index.php/Settings#Advanced_Tab , look for "Save Debug".

Changed 16 months ago by henk

debug output

comment:9 Changed 14 months ago by henk

I've replaced the memory with ecc-memory and did an upgrade to 8.0.4.
Since then I still see the "indefinite wait buffer" messages, but less frequently.
The nas is responsive, also during/after the scrub.

Changed 12 months ago by casadeferro

comment:10 follow-up: Changed 12 months ago by casadeferro

Hi, I'm trying to sort out what seems to be a similar problem.

My hardware profile is:
mobo: GA-MA69VM-SE
BIOS: F10e
RAM: 2x 2GB Crucial DDR2 PC2-5300 CL5 Unbuffered NON-ECC DDR2 667MHz

2x 1GB Kingston 667MHz DDR2 Non-ECC CL5 DIMM

System Information

FreeNAS Build FreeNAS-8.0.4-RELEASE-p2-x64 (11367)
Platform AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
Memory 6000MB
OS Version FreeBSD 8.2-RELEASE-p7

So, yes, it's non-ECC RAM. I'll upgrade it to ECC asap if this is the issue.

The thing is, I've run a stress test (actually we could call it a PRIME torture test) with an Ubuntu live system, booting from a USB pen in the same port the FreeNAS pen uses, and it doesn't crash.

I've attached the message log, but I can't find a way to extract it together with the panic log, because the path to these logs is volatile: /var/log

I'm investigating how to log to a syslog server (trying to configure my MacBook Pro with Lion as the server). If any of you have a better suggestion for dumping logs, let me know.

comment:11 Changed 12 months ago by casadeferro

  • Version changed from 8.0.2-RELEASE to 8.0.4-RELEASE-p2

Changed 12 months ago by casadeferro

GA-MA69VM-S2 debug

comment:12 in reply to: ↑ 10 Changed 12 months ago by delphij

Replying to casadeferro:

Hi, I'm trying to sort out what seems to be a similar problem.

Please create a new ticket as this will mess information up.

If possible please include your panic message (ideally a screenshot), thanks!

(Regarding your ECC RAM question: no, ECC memory is NOT required for normal operation, but we cannot tell whether your RAM is at fault without the panic message, etc. A simple way to find out is to run the system with less RAM and see if the problem still happens, then swap in your other RAM stick and check again.)

comment:13 Changed 12 months ago by casadeferro

Hi delphij, and thanks for your attention.

If possible please include your panic message (ideally a screenshot), thanks!

Unfortunately I can't. When it crashes the display loses signal and the machine dies. No power, and it doesn't boot when the reset button is pressed. I have to turn the power switch off and on again to boot. I'm starting to believe this is a power issue. I'll give it a try with a spare power supply next Monday.

a simple way of determining is to try running the system with less RAM and see if the problem still happens

I only have 6GB. Isn't 6GB of RAM a requirement for FreeNAS?

comment:14 Changed 12 months ago by casadeferro

Just to confirm that this was a hardware-related issue and had nothing to do with the OS. It was POWER related. I've installed a new power supply and it's up and running (as fast as it was before).
I've struck out all my comments to clear up the mess.

comment:15 Changed 12 months ago by jpaetzel

  • Resolution set to invalid
  • Status changed from new to closed