Post by Tim Bradshaw
Post by Ian Collins
How mechanically robust does a box that sits on a bench and runs for 10
years have to be??
It doesn't, of course. But it still makes you (well, me) feel bad when
the machine is clearly made out of tinfoil. However I suspect there's
a correlation between systems that are physically well made and systems
that are reliable - things like the quality of connectors and the rigidity
of the parts which connect together count for a fair amount.
There must be a reason that 25ks are nicely made (though, perhaps, it
is because people *expect* such machines to be nicely made).
The whole question of the correlation between various things (physical
quality, "high-end-ness") and reliability is interesting. No-one
releases figures of course, but I sometimes wonder if the most reliable
systems are actually small, simple ones, with less to go wrong.
By default, when you add components to a system, the reliability drops.
It is possible to add components and have reliability improve, but that
needs more careful design and implementation. It's probably better at
this point to refer to availability than reliability. The reliability
of a DIMM memory module is probably not a whole lot different in a
desktop system than a large server. When it dies, your desktop probably
halts (or worse, carries on unknowingly using corrupt data). In a more
complex high availability server, the telemetry is likely to have noticed
the DIMM starting to go bad before data was lost and retired it from use,
and the system may even allow it to be replaced and brought back into use
whilst everything carries on running. So whilst the reliability of the DIMM
was probably no different, the availability of the server system was
vastly improved over that of the desktop. The server was more complex,
but that complexity went into telemetry monitoring of the components,
and hardware and software designs which allowed the faulty DIMM to be
hot swapped and brought back into service, so it was geared very
specifically at higher availability.
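
To put rough numbers on the DIMM example, here is a minimal sketch using the
usual steady-state formula, availability = MTBF / (MTBF + outage per failure).
The MTBF and outage figures are made up purely for illustration, and the
function and variable names are my own.

```python
def availability(mtbf_hours, outage_hours):
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + outage_hours)

# Hypothetical figure: the same DIMM, with the same MTBF, in both machines.
DIMM_MTBF_HOURS = 100_000.0

# Desktop: the box is down until someone diagnoses and swaps the module.
desktop = availability(DIMM_MTBF_HOURS, outage_hours=24.0)

# Server: the DIMM is retired and hot swapped, so the service barely notices.
# The small non-zero outage is an assumption, not a measured value.
server = availability(DIMM_MTBF_HOURS, outage_hours=0.05)

print(f"desktop: {desktop:.6f}  server: {server:.8f}")
```

The component reliability (MTBF) is identical in both cases; only the outage
caused by each failure changes, and that is what moves the availability.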
Another example I like, just because I've seen it _so_ many times...
After a few power cuts or brown-outs, someone decides that the solution
to their problems is to go out and buy a UPS. The UPS is duly installed.
There then follows a string of system outages due to the UPS, almost always
worse than the original problem of a few power cuts. What went wrong?
Well, the system was added to and became more complex, and as a result,
the reliability dropped. Now a UPS _can_ be used to improve system
reliability, but remember what I said: by default, when you add components
to a system, the reliability drops. To improve the reliability, you need
to stop and think about the system design in the light of a UPS, read the
manual on how to configure it, and implement the solution correctly. Then
you can achieve improved availability of a more complex system.
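
The UPS case can be sketched as a toy model. Everything here is an assumption
for illustration: the availability figures are invented, failures are treated
as independent, the battery is assumed to ride through every power cut, and
"bypass" stands in for whatever design lets the load fall back to the mains
when the UPS itself faults.

```python
# Hypothetical figures, not measurements.
A_MAINS = 0.999   # mains power with a few cuts a year
A_UPS = 0.995     # a UPS dropped in with default settings

def mains_only(a_mains):
    # No UPS: the load is up whenever the mains is up.
    return a_mains

def ups_inline(a_ups):
    # Load hangs off the UPS with no bypass: any UPS fault drops the load,
    # even while the mains is perfectly healthy.
    return a_ups

def ups_with_bypass(a_mains, a_ups):
    # Load loses power only if the UPS and the mains are down at the same time.
    return 1 - (1 - a_mains) * (1 - a_ups)

before = mains_only(A_MAINS)
naive = ups_inline(A_UPS)
designed = ups_with_bypass(A_MAINS, A_UPS)

print(f"mains only: {before:.6f}  naive UPS: {naive:.6f}  designed: {designed:.6f}")
```

With these made-up numbers the naive install is worse than having no UPS at
all, while the designed one is better than either, which is the pattern
described above.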
--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]