What happened this weekend

Author: Ilie
Date: Mon May 5 09:39:08 2003
Id: 1944

As some of you noticed, MUME was down for about 40 hours this weekend.
(From May 1st, 20:10:09 GMT, to May 3rd, 11:49:25 GMT.)

Initially, the cooling system in the machine room at PVV failed. Since
the room is quite full, they had to do an emergency shutdown/power-off
of most of the HW in the machine room. They also had to move all the HW
out of one half of the room (the part under the cooling systems -
roof-mounted cooling systems are a really bad idea btw), including the
fire machine.

I noticed the machine was down, called PVV, and asked them to
power-cycle the machine for me. The joker told me "ok" instead of
telling me that they had a major disaster going on there. More about
this guy later.

I went up there the next day (Saturday) to look at the poor machine
myself, quickly discovered the above, and "talked" to the guy who had
answered the phone about the information he had left out. He agreed he
hadn't been very helpful, and decided he would be more helpful in the
future (I guess I looked angry).

After I found a new spot for the machine, it failed to come up again.
I opened it up, removed all CPUs/memory, cleaned the connectors and
re-seated everything, and it came up again. There was some file system
damage, which Dain fixed up later by rolling in an earlier backup.

The machine is getting quite old (more than 5 years now), so we might
have to think about finding a replacement soon. It is impossible to
get new parts for it if something fails; luckily nothing failed beyond
recovery this time.

- Ilie