[lopsa-tech] Server Recommendations

Tracy Reed treed at ultraviolet.org
Fri Jan 14 13:07:47 PST 2011


On Fri, Jan 14, 2011 at 01:59:39PM -0500, Brian Mathis spake thusly:
> There's a reason they are the big guys.

Slick marketing, which appeals to the kind of guys who like to buy cool-looking
servers from vendors that talk a good game. Then I get stuck with their junk,
which isn't the best fit for our situation but gave the big guy a warm fuzzy.
Supermicro system configurations are far more flexible. We often got a lot more
bells and whistles with Dell than we really needed. Even that silly LCD front
panel finds a way to be a hassle. The bottom edge of that thing was dropping
down just a tiny bit too low on one of our 1U systems and preventing a drive
from sliding in. I didn't want to just start prying on it, so I had to call a
Dell tech in the next day to fix it. He ended up replacing the part. Weird
stuff.

> My most recent work with Dells has led me to a lot of respect.
> Their rapid-rails mounting system shows they are really putting a lot
> of thought into all aspects of their system design.  You can really
> have a server in the rack in about 2 minutes.

I spend very little time racking machines, usually only doing it once.

Every Dell machine I have purchased in the last year has caused problems.
I'm not very happy with my Dell experience overall. 

First was the sales process. I don't want to have to haggle for a week to get a
good price. But that's what we did. And the price came down a fair bit.
Probably not as much as the time it cost us though. I don't want to have to pay
extortionate prices for hard drives either. I hate that a 6 bay hot swap
machine comes with blanks instead of drive trays. If you want more trays you
have to buy them from Dell with marked up Dell drives. I understand wanting to
only support drives known to work but tell me what model number that is so I
can get them wherever I want and give me drive trays with the machine. It's
games like this...

We bought a memory upgrade from Dell for our 2970's to bring them up to 32G of
RAM. After installing the RAM and rebooting, the computer said the memory
configuration was not optimal and prompted me to press F1 to continue. It would
then boot up just fine. But I can't have the servers requiring human
intervention for a reboot.  So I had to figure out what the problem was. I
called Dell support and it turned out that the BIOS did not properly support
32G without a BIOS upgrade.

We were told they supported up to 32G when we bought them but it turns out the
BIOS they were shipped with didn't properly support 32G. So...that's broken at
time of purchase in my book. 

Every one of our Dell servers has required a BIOS upgrade. The 610's would
spontaneously reboot after a couple of months in operation at first. They all
did it. Then I upgraded the BIOS. Now it has been at least 9 months since that
happened and I hope it is cured. I really don't expect to ever have to upgrade
BIOS in a server. If I do that means it was broken when I bought it.  Bugs
don't appear by themselves over time, they are there at time of shipment.  Not
only that but there is mainboard BIOS firmware, DRAC/BMC firmware, and RAID
controller firmware all in need of updating. That's just too much stuff
requiring post-sale fixing.
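
For anyone wanting to find out which machines still run a pre-fix BIOS, a
minimal sketch follows. It assumes dmidecode is installed and that Dell BIOS
versions are plain dotted numbers; the function names are made up for
illustration and are not part of any Dell tooling.

```shell
# Succeeds if dotted version $1 is strictly older than $2. Fields are
# compared numerically, so 4.10.0 counts as newer than 4.9.0.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 < y[i] + 0) exit 0
            if (x[i] + 0 > y[i] + 0) exit 1
        }
        exit 1
    }'
}

# Hypothetical check: complain when the running BIOS predates version $1.
# Needs root, since dmidecode reads the firmware tables.
check_bios_at_least() {
    current=$(dmidecode -s bios-version) || return 2
    if version_lt "$current" "$1"; then
        echo "BIOS $current is older than $1"
        return 1
    fi
}
```

Run as root it would be called like "check_bios_at_least 4.1.1", where the
version number is only an example.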

As for the process of doing the BIOS upgrade there is room for improvement.
First, I am happy that there are Linux executables for doing this. It used to
be that only DOS binaries were distributed for stuff like this. But the process
for obtaining and executing the upgrade is rather obscure.

The first step is to download the BIOS update. I was given this url by tech
support:

http://support.dell.com/support/downloads/download.aspx?c=us&cs=555&l=en&s=biz&r
T&osl=en&deviceid=11598&devlib=0&typecnt=0&vercnt=11&catid=-1&impid=-1&formatcnt
362396

Wow. That's a mess of a url. I don't like to have to download the BIN file on a
desktop or laptop and then scp the file over to the Linux server as it is
inconvenient.  We don't run a web browser or any GUI desktop at all on our
servers as it is a waste of resources and not best practice. But I pretty much
need one to copy and paste that url and navigate the webpage it points to.

It would be nice if Dell provided a simple direct download link. Or at least
didn't wrap the Download button with a javascript function. On my laptop I like
to right click the download link and select "Copy link location", then paste
the url into an ssh terminal on the server and pull the binary directly down to
it. Currently when I right click the download button and copy the link I get:

javascript:downloadslink('http://ftp.us.dell.com/bios/PE2970_BIOS_LX_4.1.1_1.BIN
verDownloadManager.application?c=us&l=en&fileid=362790&fileloc=ftp://ftp.us.dell
alse','PE2970_BIOS_LX_4.1.1_1.BIN');

Ugly and unusable. However, from this I can see that the actual path to the
file is:

http://ftp.us.dell.com/bios/PE2970_BIOS_LX_4.1.1_1.BIN

So on the server I can do:

# wget http://ftp.us.dell.com/bios/PE2970_BIOS_LX_4.1.1_1.BIN

and download the file directly onto the server.

Much more convenient. I can even type that by hand without copy and paste if I
really have to.
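
Fishing the real URL out of the copied javascript: link can also be scripted.
A minimal sketch, assuming the copied link always contains the plain http URL
of the .BIN file (true of the one link shown above, not verified for other
Dell downloads). The function name is made up for illustration, and "grep -o"
is a GNU/BSD extension rather than strict POSIX.

```shell
# Hypothetical helper: print the first http(s) URL ending in .BIN that
# appears in the copied "javascript:downloadslink(...)" text passed as $1.
extract_bin_url() {
    printf '%s\n' "$1" |
        grep -oE "https?://[^'?&[:space:]]+\.BIN" |
        head -n 1
}
```

With the copied link in a shell variable, the download would then be
something like: wget "$(extract_bin_url "$link")"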

When I execute this BIN file it produces an error indicating that it wants
another program called lockfile to be installed on the system. It took me a
while to remember this program. I had seen it before somewhere. Turns out it is
part of the procmail mail filtering program which we do not normally install
onto our servers. Most people shouldn't be installing that unless they need it
as part of a mail server. I had to install it to get the file to run.

Then I find that I also have to install compat-libstdc++-33-3.2.3-47.3.i386.rpm
but at least the BIN file gives me a useful error directing me to install it.
This is only needed for executables compiled against the old C++ library.
Moving to the newer one (why wouldn't they just use straight C for a firmware
installer?) would remove a barrier to getting the firmware update done.


This is pretty sweet:

Continue? Y/N:y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER DELL PRODUCTS WHILE
UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
.../tmp/PE2970_BIOS_LX_4.1.1_1.BIN-6001-9159/./UpdRollBack: error
while loading shared libraries: libxml2.so.2: cannot open shared
object file: No such file or directory
.
The update failed to complete

Oops...looks like it is complaining that it can't find libxml2.so.2 so I guess
there is some XML nuttiness in this firmware somewhere. Installing libxml2 with
yum resolved that.

Then the firmware update installed and I rebooted. Yay.
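
The three stumbling blocks above (lockfile, the old C++ library, libxml2) could
be checked in one go before running the BIN file. A minimal sketch with
made-up function names. It assumes a glibc system where "ldconfig -p" lists
the linker cache, and that compat-libstdc++-33 is the package that provides
libstdc++.so.5. It does not check whether the 32-bit or 64-bit variant of a
library is present.

```shell
# Print each of the given commands that is not on PATH, one per line.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Succeeds if the dynamic linker cache lists the named shared library.
# ldconfig often lives in /sbin, so it may not be on a non-root PATH.
have_shared_lib() {
    ldconfig -p 2>/dev/null | grep -qF -- "$1"
}

# Report whatever the BIOS updater is likely to trip over.
preflight_bios_update() {
    status=0
    for cmd in $(missing_commands lockfile); do
        echo "missing command: $cmd (lockfile ships with procmail)"
        status=1
    done
    for lib in libxml2.so.2 libstdc++.so.5; do
        have_shared_lib "$lib" || { echo "missing library: $lib"; status=1; }
    done
    return "$status"
}
```

Calling preflight_bios_update prints one line per missing piece and returns
non-zero if anything is absent.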

So that covers firmware.

The RAID card management tools leave MUCH to be desired as well. As far as I
can tell, the MegaCli package is the way to manage the PERC from the command
line in Linux. To work with it you have to hunt down the
MegaCli-1.01.39-0.i386.rpm tools since the tools are proprietary to LSI and
don't ship with RHEL.

Then you RPM install it and go looking for the software it installed.  MegaCli
is rarely used. Only when setting up disks. They didn't call it megacli or
something I might remember. They called it MegaCli64 (case sensitive) which is
installed in /opt/MegaRAID/MegaCli/MegaCli64.
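
One way to avoid having to remember that path is a small wrapper in the
shell profile. This is only a sketch: the name "megacli" and the MEGACLI_BIN
variable are made up here, and the default path is simply the one mentioned
above.

```shell
# Default to the install path from the post unless MEGACLI_BIN is already set.
: "${MEGACLI_BIN:=/opt/MegaRAID/MegaCli/MegaCli64}"

# Pass every argument straight through to the real binary.
megacli() {
    "$MEGACLI_BIN" "$@"
}
```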

Then you have to figure out how to use it.

# /opt/MegaRAID/MegaCli/MegaCli64
Fatal error - Command Tool invoked with wrong parameters

hmm...ok

# /opt/MegaRAID/MegaCli/MegaCli64 --help
Invalid input at or near token -

hmmm

# /opt/MegaRAID/MegaCli/MegaCli64 -h

Whoa! This gets you a massive amount of cryptic command line options with no
explanation as to their purpose. I have pasted the output here:

http://pastebin.ca/1968565

This is their idea of "help". I'm a command line commando of 20+ years and this
scares even me! It would have been nice if they at least tried to make it work
somewhat like the Linux mdadm command or at least provided some examples of
common use cases etc. Because of the oddity of this command various people out
on the net have compiled "cheat sheets" to help poor souls like me figure out
how to use this thing:

http://tools.rapidsoft.de/perc/perc-cheat-sheet.pdf

Usually I avoid using this command and just reboot the server into the BIOS and
configure the RAID card from there but often it is not a convenient time for a
server reboot. I also avoid it because it is so complicated and one wrong
command can lose all of the data in the server. Yes there are backups, which I
would really rather not have to restore.

I needed to add a couple of disks on the fly and did not want to reboot. The
command line I seemed to need, and the response it gave me, were:

# /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0 [32:4] -a0
Adapter 0: Configured the adapter!!

Not a very reassuring response. Configured it how with what? It would be nice
if it said "Added virtual disk number 4 as a RAID 0" since that is what that
command told it to do.

Using the command:

# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL

I was able to verify that it had in fact created virtual disk number 4 as a
RAID 0. However, I didn't have a file to work with in /dev representing the
disk. The operating system simply refused to see the disk so that I could
actually do something with it. I spent some time trying to figure out why but
couldn't come up with a solution.  So I called tech support.

Dell tech support people are always friendly and, thankfully, seem to be US
based. That is a big help when the tech support person and I are yelling
instructions at each other over a noisy datacenter on a mobile phone. They
don't always have the solution, though. In this case with the RAID controller
I had added a disk and was trying to make it usable/visible to the OS. The guy
first guessed that I needed to partition the disks. I explained that the disks
were not visible to the OS to be partitioned. Then he guessed at some MegaCli
commands which were not useful. Eventually I had to get off the phone and head
out for an appointment. Later I got an email explaining that he had the
solution: I needed to run partprobe. That command finds partitions. You can't
find partitions on disks which you can't see. Way off the mark.  Eventually it
became more convenient to reboot the server. So that is what I did and the
disks appeared.  Problem solved, sort of. Although with this hot swap stuff it
really should be possible to add disks on the fly. That's the whole point.
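
A generic Linux technique that is sometimes tried when a new disk does not
show up is to ask the kernel to rescan its SCSI hosts through sysfs. The post
does not say this was attempted, and whether it would have made the new PERC
virtual disk appear is not known, so treat the following purely as a sketch.
The function name is made up; the sysfs path and the "- - -" wildcard are the
standard kernel interface.

```shell
# Ask every SCSI host under the given sysfs directory to rescan its bus.
# "- - -" means: all channels, all targets, all LUNs.
rescan_scsi_hosts() {
    root=${1:-/sys/class/scsi_host}
    for scan in "$root"/host*/scan; do
        [ -e "$scan" ] || continue   # the glob matched nothing
        echo "- - -" > "$scan"
        echo "rescanned ${scan%/scan}"
    done
}
```

Run as root with no argument it writes to the real /sys/class/scsi_host tree.
Afterwards "dmesg" or "cat /proc/partitions" would show whether a new device
turned up.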

Speaking of RAID controllers, we have a pair of identical R410's. And they BOTH
consistently produce these errors:

mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)

They produce these errors at a rate of around 10 per day throughout the day.
Both machines produce the exact same error. Same hex codes, etc. Identical. I
don't think it is actually a drive failing because the chances of both machines
failing at the same time in exactly the same way are slim. One of these
machines recently had what looked like a RAID controller crash which lost data
and didn't do our filesystems any good.
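
To compare the two machines, or to watch whether the rate changes after a
firmware update, the messages can be tallied by their Code field. A minimal
sketch, assuming the log lines look exactly like the ones quoted above. The
function name is made up, and the log file location depends on the syslog
setup.

```shell
# Read kernel log text on stdin and print "count<TAB>code" for each
# distinct Code={...} value found in mptbase LogInfo lines.
tally_mpt_codes() {
    grep 'mptbase: ioc[0-9]*: LogInfo' |
        sed -n 's/.*Code={\([^}]*\)}.*/\1/p' |
        sort | uniq -c |
        awk '{ n = $1; $1 = ""; sub(/^ +/, ""); print n "\t" $0 }'
}
```

Typical use would be something like: tally_mpt_codes < /var/log/messages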

Whenever I call Dell tech support I always wonder why it is that Dell's phone
system always asks me for the long service code number instead of the shorter
service tag which is just the base-36 encoding (therefore much shorter) of the
service code. Sometimes I have one but not the other on hand. They are clearly
the same thing. Lots of people have even put up little webpages (which I have
used) that will convert from one to the other for you:

http://www.google.com/search?q=dell+service+tag+converter

Why would they ever ask for or deal in the long version and make me yell it at
them over a mobile phone in a noisy datacenter?
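
The conversion those webpages do is small enough to keep in a shell profile.
A minimal sketch in plain POSIX sh, assuming (as described above) that the
service code is nothing more than the service tag read as a base-36 number,
with digits 0-9 followed by A-Z. The function name is made up. A full 7
character tag needs 64-bit shell arithmetic, which common shells on 64-bit
Linux provide.

```shell
# Print the decimal value of a base-36 service tag given as $1.
# Returns non-zero if the tag contains a character outside 0-9 and A-Z.
tag_to_code() {
    tag=$(printf '%s' "$1" | tr 'a-z' 'A-Z')
    digits=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
    code=0
    while [ -n "$tag" ]; do
        rest=${tag#?}            # everything after the first character
        ch=${tag%"$rest"}        # the first character itself
        case $digits in
            *"$ch"*) ;;
            *) return 1 ;;
        esac
        prefix=${digits%%"$ch"*} # digits before ch; its length is ch's value
        code=$(( code * 36 + ${#prefix} ))
        tag=$rest
    done
    printf '%s\n' "$code"
}
```

For example, "tag_to_code ZZ" prints 1295, which is 35 * 36 + 35.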

Then the next person I talk to wants the service tag again even though I just
told the phone system the service code.

Then the NEXT person wants to confirm the service tag.

At least they tend to understand the ICAO phonetic alphabet so we don't have to
haggle over whether I said b, c, d, e, g, p, t, v, z or 3.

I need to learn more about the DRAC. I would love to be able to get remote
console. But I don't want to have to install special management software on a
Windows box on the same physical network as the machines, which I have a
feeling is what I would have to do.

I hate those pointless bezels that come with the machines. I try not to pay the
small amount of extra money for them anymore because they just go in a pile.
These machines sit in a datacenter, not a showroom.

My Supermicro machines occupy far less of my time as admin.

> Dells have a built in DRAC which gives you a remote KVM (HP has something
> similar), which work well for the few times you need it.  They don't, and no
> Intel server will, have a serial only console that works like a Sun console.

Apparently there are at least two different kinds of DRAC: iDRAC Enterprise and
iDRAC Express.

My machines have iDRAC Express. iDRAC used to be something called BMC. Not sure
why they changed the name. The iDRAC stuff is nice. It took me a while to get
around to learning how to use it but it is worth it. Reminds me of some old
systems I had worked with in the past such as Sun, HP, and even Pyramid which
had service processors. I have long awaited the day that x86 got this feature.

However, it has some weird limitations and is expensive compared to the latest
stuff from Supermicro. For example, it is odd that iDRAC Enterprise supports
public key auth and Express does not. The DRAC is a little processor (MIPS or
ARM on most platforms) running Linux or Busybox. Why not support public key? We
do everything with ssh keys. Without public key auth I have another password to
worry about.

-- 
Tracy Reed
http://tracyreed.org