Recently I have had the pleasure of working on the following environment:
vCenter 4.1.0 build 345043
ESXi 4.1.0 build 800380
EqualLogic PS6510 (4 pools made up of 4 members)
Cisco WS-C4900M (X 2 10Gb)
This environment had some serious issues, so one week ago I started up the learning curve of the Dell MEM (Multipathing Extension Module) software.
This is what I learned.
I am happy to be corrected on any assumptions made, as well as on terminology.
So a setup without MEM installed results in the following design, with the red dots representing the paths. Keep in mind that these paths are actual physical connections as well. If you ran Wireshark or Ethereal across this subnet, you would see a number of ARP requests from the ESX hosts to the group lead.
With MEM version 1.1.0 installed we get the following configuration.
You will notice the “Volume to member table”. This replaces the ARP scenario described above. Think of it as replicated DNS servers, with the group lead being the master and all ESX hosts with MEM installed being the DNS slaves.
By default we get 2 paths per slice (MemberSessions), with up to six paths per volume in total (VolumeSessions).
So if the volume was spread across three members, we would see three paths from each vmk, with two going to each SANMember that holds a slice of the volume. In other words, we have doubled our iSCSI connection count.
With this configuration we quickly overshot our iSCSI connection limit. The workaround was to reduce MemberSessions from the default of 2 to 1.
Or, for ease of reading: the total sessions per slice was reduced to 1 whilst the per-volume session count (VolumeSessions) was left at its default.
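To make the arithmetic concrete, here is a minimal Python sketch of how I understand the session counts to work. It assumes that the sessions for one volume are simply MemberSessions multiplied by the number of members holding a slice, capped at VolumeSessions, and that a host is capped at TotalSessions. That model is my reading of the comments in the MEM config file, not documented behaviour, and the function names are made up for illustration.

```python
def sessions_per_volume(members_with_slice, member_sessions=2, volume_sessions=6):
    """Assumed model: MemberSessions per member, capped by VolumeSessions."""
    return min(volume_sessions, member_sessions * members_with_slice)


def host_sessions(volumes, members_with_slice, member_sessions=2,
                  volume_sessions=6, total_sessions=512):
    """Assumed model: per-volume sessions times volumes, capped by TotalSessions."""
    per_volume = sessions_per_volume(members_with_slice, member_sessions, volume_sessions)
    return min(total_sessions, volumes * per_volume)


# A volume sliced across three members:
print(sessions_per_volume(3))                     # defaults (2 and 6) -> 6
print(sessions_per_volume(3, member_sessions=1))  # the workaround -> 3
print(sessions_per_volume(3, volume_sessions=4))  # 2 per member, capped at 4 -> 4
```

Multiply the per-host figure by the number of hosts connected to the pool and it becomes clear how quickly the connection count climbs with the defaults.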
Now, in our situation, this change resulted in the following error.
[2012-12-21 00:21:44.982 146F2B90 info 'ha-eventmgr'] Event 288 : Lost path redundancy to storage device naa.xxxxxxxxxxxxxxxxxx. Path vmhba38:XX:XXX:XX is down. Affected datastores: "Datastore 1A".
I captured the below with Splunk running from vMA. The red is a host that we have now fixed (where the red columns stop), whilst the yellow is from another host we have not fixed.
To paint the picture, we were getting this error and its associated errors so often that our logs were saturated, and trying to find any other issue or troubleshoot was like looking for a needle in a haystack. Using the free version of Splunk, we would go over the 500 MB limit in a couple of hours whilst capturing only a couple of hosts.
These errors averaged 22 or more an hour per host.
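For anyone without Splunk, a rough per-hour count of these events can be pulled out of a log with a few lines of Python. This is only a sketch: the regular expression is written against the single log line shown above, so other builds may format the line differently, and the function name and sample lines are mine.

```python
import re
from collections import Counter

# Matches the timestamp and device in a "Lost path redundancy" event line.
EVENT_RE = re.compile(
    r"\[(?P<hour>\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}\.\d+ .*?"
    r"Lost path redundancy to storage device (?P<device>\S+)\. Path"
)


def lost_redundancy_per_hour(lines):
    """Count matching events, keyed by 'YYYY-MM-DD HH'."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[match.group("hour")] += 1
    return counts


sample = [
    "[2012-12-21 00:21:44.982 146F2B90 info 'ha-eventmgr'] Event 288 : "
    "Lost path redundancy to storage device naa.xxxxxxxxxxxxxxxxxx. "
    "Path vmhba38:XX:XXX:XX is down.",
    "[2012-12-21 00:45:10.100 146F2B90 info 'ha-eventmgr'] Event 289 : "
    "Some unrelated event.",
]
for hour, count in sorted(lost_redundancy_per_hour(sample).items()):
    print(hour, count)
```

Feeding it the lines of a real log file (for example `open(path)`) instead of the sample list gives the hourly totals for that host.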
So what was going on?
Okay, so by reducing the session count to 1 we had in effect achieved the same configuration as a host without MEM installed (but with one extra path if the volume is spread across three SANMembers). But there was a problem: DELL_PSP_EQL_ROUTED (that is, MEM) appears to be a load-balancing policy / protocol, so if two paths are on one NIC it will see the other NIC as being less busy and will try to balance.
Okay, so you can see that the volume is spread, or sliced, across three SANMembers and we have two paths out of vmk2. What appears to happen is that the second path on vmk2 will swap to vmk1, then swap back. What could be causing this? From what we can glean, it is trying to load balance across both vmks. Now let’s think about this for a second: if vmk2 has two paths it will get more throughput than vmk1, so to load balance we will take the second path and put it on vmk1. But wait, vmk1 is now working harder than vmk2, so let’s move the path back to vmk2.
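The back-and-forth described above can be shown with a toy model in Python. To be clear, this is not how MEM is implemented. It is only my guess at the behaviour, written as a naive rule that moves one path from the busiest vmk to the idlest whenever their path counts differ. The function name is made up.

```python
def rebalance_history(session_count, vmk_count=2, max_rounds=6):
    """Toy model: move one path from busiest to idlest vmk while counts differ."""
    # Start by dealing the paths out to the vmks in turn.
    counts = [0] * vmk_count
    for i in range(session_count):
        counts[i % vmk_count] += 1

    history = [tuple(counts)]
    for _ in range(max_rounds):
        busiest = counts.index(max(counts))
        idlest = counts.index(min(counts))
        if counts[busiest] == counts[idlest]:
            break  # balanced, nothing to move
        counts[busiest] -= 1
        counts[idlest] += 1
        history.append(tuple(counts))
    return history


print(rebalance_history(3))  # odd count over two vmks: flips between (2, 1) and (1, 2)
print(rebalance_history(4))  # even count: settles immediately at (2, 2)
```

With three paths over two vmks the rule never settles, and under this model every flip would be a path going down and coming back up. With four paths it is balanced from the start.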
This was driving me a little batty, having just inherited this environment and being a little OCD when it comes to events and errors. Apparently a support call raising this issue had been open for three-odd months, and the staff were told to ignore or filter out the events: it was an expected outcome and nothing to worry about.
The fix: change MemberSessions back to 2 (the default setting) but reduce VolumeSessions from 6 to 4.
So the config looks similar to the below.
# Copyright (c) 2009-2011 by Dell, Inc.
# All rights reserved. This software may not be copied, disclosed,
# transferred, or used except in accordance with a license granted
# by Dell, Inc. This software embodies proprietary information
# and trade secrets of Dell, Inc.
# Configuration file for Dell EqualLogic Multipathing Extension Module
# Location: /etc/cim/dell/ehcmd.conf
# EqualLogic Host Integration Tools Parameters
# Logging Level (0 = off, 3 = most verbose)
DebugLevel = 2
# Max size before rotating logs
MaxLogSizeMB = 50
# Location for log files
LogDir = /var/log/equallogic
# EqualLogic Multipathing Configuration Parameters
# Maximum number of MPIO sessions to be used per member per volume
MemberSessions = 2
# Maximum number of MPIO sessions to be used for entire volume
VolumeSessions = 4
# Maximum number of MPIO sessions to create from this host
TotalSessions = 512
# Frequency of iSCSI session evaluation and reconfiguration (in seconds).
Reconfigure = 240
# Frequency of retrieval of volume layout information (in seconds).
TableUpdate = 120
UseIPv4 = 1
UseIPv6 = 0
EnableSessionReconfiguration = 1
ReconfigBehavior = 1
ReconfigThreshold = 3
ReconfigBackoff = 10800
MinAdapterSpeed = 1000
IpcVersion = 2832
AdapterRescan = 240
UseMPIOForSnapshots = 1
MaxScsiXfer = 64512
ModuleName = EHCM
I am not sure of the implications of simply copying and pasting the above into vi, so below are the commands to make this change.
For ESXi 4.1
setup.pl --setparam --name=VolumeSessions --value=4
For vSphere 5
esxcli equallogic param set --name=VolumeSessions --value=4
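After making the change, one way to double check it is to read the values back out of the config file. The sketch below parses text in the `Name = Value` format shown in the config above and reports anything that does not match what you expect. It assumes you have a copy of /etc/cim/dell/ehcmd.conf somewhere Python can read it. The function names are mine and it has not been run against a live host.

```python
def parse_ehcmd_conf(text):
    """Parse 'Name = Value' lines, skipping blank lines and # comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip()
    return params


def mismatches(params, expected):
    """Return {name: (expected, actual)} for every value that differs."""
    return {
        name: (want, params.get(name))
        for name, want in expected.items()
        if params.get(name) != want
    }


sample = """
# Maximum number of MPIO sessions to be used per member per volume
MemberSessions = 2
# Maximum number of MPIO sessions to be used for entire volume
VolumeSessions = 4
"""
print(mismatches(parse_ehcmd_conf(sample),
                 {"MemberSessions": "2", "VolumeSessions": "4"}))  # {} means all good
```

To check a real copy of the file, pass `open(path).read()` to `parse_ehcmd_conf` instead of the sample text.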