Stop the press… this is a must read from Cormac Hogan.
It describes what we were seeing, and to a degree still are seeing, except that it is now the HA (or aam) process that locks up. We no longer get a host not responding, so the issue has moved from complete disaster to just a broken HA for the cluster. The fix is often a hostd process restart and occasionally a full reboot of the vSphere host. This is still on ESXi 4.1 with the latest build.
With the changes from the above link we have reduced our issue significantly. Yes, we still get the disconnect and reconnect, but at a lower rate.
With the aid of Splunk I found that we are getting volume moves (auto-tiering) about 4-5 times a day on high load days, averaging about 2-3 a day (I did not realise it was moving so much data).
So with the slices of cheese constantly moving the mouse occasionally simply has to open its mouth and bite….. okay a bit off track there.
When a slice moves from member to member we see the disconnect and reconnect, or close and successful login, in Splunk.
This is expected behaviour and it is what both VMware and Dell say to ignore. The question is whether you are seeing excessive entries and whether they line up with what is occurring underneath, as in slices moving.
If you find that they do line up, then you have done everything you can. Do as VMware and Dell suggest, and if the log entries frustrate you then disable the alarm.
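One rough way to check whether the entries line up is to export the two sets of timestamps and see which reconnects have no slice move close by. The sketch below is only that: the timestamps are made-up sample data, the 60 second window is an assumption, and `unexplained_events` is a hypothetical helper, not anything from Splunk or the EqualLogic tools.

```python
from datetime import datetime, timedelta

def unexplained_events(reconnects, moves, window=timedelta(seconds=60)):
    """Return reconnect timestamps that have no slice move within `window`."""
    return [r for r in reconnects
            if not any(abs(r - m) <= window for m in moves)]

# Made-up sample data: two slice moves and three reconnect log entries.
moves = [datetime(2012, 1, 1, 9, 0, 0), datetime(2012, 1, 1, 14, 30, 0)]
reconnects = [
    datetime(2012, 1, 1, 9, 0, 20),   # 20s after the first move
    datetime(2012, 1, 1, 14, 30, 5),  # 5s after the second move
    datetime(2012, 1, 1, 18, 0, 0),   # nothing moved around this time
]

leftover = unexplained_events(reconnects, moves)
print(f"{len(leftover)} of {len(reconnects)} reconnects have no matching move")
```

If the leftover list stays empty day after day, the entries are most likely just the expected noise from slices moving.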
I have found some very interesting write-ups about the EqualLogic and, I hate to admit it, but I am a fan. It is my second favourite SAN now,
mainly due to the cost and ease of maintenance / setup.
Anybody still playing in the Dell EqualLogic space, I plead with you to talk to your network guys.
Jumbo frames are not always 9000. Please set the jumbo frame size on your switch ports to the Cisco device's max (within jumbo frames, not talking baby jumbos here).
Most Cisco equipment will run 9216, which allows the SAN and ESX to talk with a 9000 packet without it being fragmented when the switch adds its overhead.
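The arithmetic behind that is simple enough to sketch. The header sizes below are the standard Ethernet ones (14 byte header, 4 byte 802.1Q tag, 4 byte FCS); how a given switch counts its limit varies by platform, so treat the comparison as an assumption and check your own kit.

```python
ETH_HEADER = 14  # destination MAC + source MAC + EtherType
VLAN_TAG = 4     # 802.1Q tag, present on tagged ports
FCS = 4          # frame check sequence

def frame_size(mtu, vlan_tagged=True):
    """Size of the Ethernet frame carrying an `mtu`-byte IP packet."""
    return mtu + ETH_HEADER + (VLAN_TAG if vlan_tagged else 0) + FCS

def fits(mtu, switch_max_frame):
    """True if a tagged frame for this MTU fits in the switch limit."""
    return frame_size(mtu) <= switch_max_frame

print(frame_size(9000))   # 9022
print(fits(9000, 9216))   # True
print(fits(9000, 9000))   # False
```

So a 9000 byte packet needs a little more than 9000 on the switch, and 9216 leaves plenty of headroom.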
When you get your network boffins to change this, also plead with them (buy them coffee, Red Bull, Coke) to turn on LLDP so you can turn on DLP within the EqualLogic. This will dramatically reduce your retransmission count.
To test this, download Ethereal / Wireshark.
On your iSCSI virtual switch (hopefully you have a separate switch, but even capturing on the VLAN will be okay) create a virtual machine port group and run up either a Windows or Linux guest.
Set the port group to accept promiscuous mode and start the Ethereal capture.
Or use tcpdump on Linux.
By the way, VLAN 4095 on a port group passes traffic for all VLANs in VMware. Be careful with this one though, as you will capture a truckload of data in a large environment.
I like to capture around 1 million packets and get great enjoyment out of trying to hit the same number every time. Yes, you can automate this to capture a certain amount, but hey, there is no reason this cannot be fun, right?
So filter for “tcp.analysis.retransmission” and see how many you get. We hit 5000 retransmissions per 1 million packets and could repeat this count consistently.
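If you do want to automate it, something like the sketch below would do. It only builds the command lines and works out the rate; `tcpdump -c` stops after a packet count and `tshark -Y` applies the same display filter, while the interface name and file name are placeholders you would swap for your own.

```python
def capture_cmd(interface, count=1_000_000, outfile="iscsi.pcap"):
    """tcpdump command that stops after `count` packets."""
    return ["tcpdump", "-i", interface, "-c", str(count), "-w", outfile]

def retrans_filter_cmd(infile="iscsi.pcap"):
    """tshark command that prints one line per retransmitted packet."""
    return ["tshark", "-r", infile, "-Y", "tcp.analysis.retransmission"]

def retransmission_rate(retransmissions, total_packets):
    """Fraction of captured packets that were retransmissions."""
    if total_packets <= 0:
        raise ValueError("total_packets must be positive")
    return retransmissions / total_packets

print(" ".join(capture_cmd("eth1")))       # "eth1" is a placeholder
print(" ".join(retrans_filter_cmd()))

# The numbers from our capture: 5000 retransmissions in 1 million packets.
rate = retransmission_rate(5000, 1_000_000)
print(f"{rate:.2%}")                       # 0.50%
```

Count the lines the tshark command prints (pipe it to `wc -l`) and feed that into `retransmission_rate` to compare runs before and after the switch changes.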
Yes, you can get the graphs from the EqualLogic about retransmissions, but are you sure you are getting the whole picture?
Sadly we are moving away from the EqualLogic as my team does not have trust in it anymore, but hopefully others out there can benefit from my findings to get theirs humming.
And in case any of you are asking, my SAN weapon of choice is the NetApp.
“You mean avamar”
“no I mean dedupe”
“not backups, data on disk connected to ESX hosts as NFS or iSCSI or FC and dedupe’d”
“Okay let me paint a picture, 100 x Windows 2008 R2 servers all patched to the same level with their system drives on the same 3TB datastore”
“lol, why would you do that”
“because I can, now shut up and listen. Each system drive is 40GB in size, but installed only 12GB is consumed”
“okay go on”
“On the datastore those 100 x Windows 2008 R2 servers are consuming 40GB of disk if thick provisioned and less if thin provisioned”
“wait what, how?”