-
People being sysadmins? One of my biggest issues is over-training or excessive research. I seem to spend more time reading journals and feeds than I do actually working. It's a tough habit to break.
-
Just a few:
- not listening to others [this starts with requirements gathering...]
- not reading the f... manuals
- assuming everything will go well [who needs backups anyway?]
- assuming that if things work now, they will in the future [who needs monitoring?]
- relying blindly on third parties... "we have a 99.999% internet access SLA, so things will be fine." Or not. In the worst case you'll pay 10% less on your monthly bill in exchange for 10 hours of downtime; good luck convincing 'the business' to be satisfied with that (see the back-of-the-envelope sketch after this list)
- which leads us to: not explaining clearly to non-techies what the consequences, costs, risks and opportunities are, and not double-checking that they actually understood
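To put numbers on that SLA bullet, here is a quick back-of-the-envelope sketch (the 30-day month is my assumption) of how little downtime each availability tier actually permits:

    #!/usr/bin/env python3
    # Back-of-the-envelope: how much downtime each availability level
    # actually permits per month (assuming a 30-day month).

    MINUTES_PER_MONTH = 30 * 24 * 60

    for availability in (99.0, 99.9, 99.99, 99.999):
        allowed = MINUTES_PER_MONTH * (1 - availability / 100)
        print(f"{availability}% uptime allows about {allowed:.1f} minutes of downtime per month")

Ten hours of downtime against a 99.999% SLA overshoots the allowance by more than a thousand times, yet the penalty is still only that 10% credit.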
-
Rushing to conclusions...
Stop and think. Make sure your brain is in gear before engaging your mouth. Quit leading me on wild goose chases when troubleshooting problems and wasting my time reporting things that are really by-design behavior.
-
Making assumptions based on 'what seems sensible'.
E.g. a computer isn't turning on, so you think the power supply is shot because that's the sensible explanation. Instead, the user is calling from their mobile in the middle of a power cut.
-
- Underestimating Murphy
- Not using checklists
- Taking shortcuts
- Assuming that almost complete is "good enough"
- "Who would want to steal /our/ data?"
- Implementation without proper testing
-
Document first, then execute
-
This question is pretty close to the post: "Common mistakes made by System Administrators and how we can avoid them"
I'd like to add that one sure way to do it wrong is to NOT document your work.
-
Many of the "fiasco" scenarios that I've been called in to resolve come down to admins not applying consistent and scientific troubleshooting technique.
When you're troubleshooting a problem in a "black box" (read: closed source software/hardware, 3rd party system, etc), you should change one thing at a time (and document your changes) and exercise a consistent test case with each change. If your hypothesis doesn't bear out, return things back to their original state and start again.
Lather, rinse, repeat.
What I see, more often than not, are frazzled admins running around making random changes without documenting what has changed, and without testing whether or not their change made a difference. Before long, the initial conditions are lost. When the issue is finally resolved, no root cause analysis can ever be done because no one is sure what fixed the problem.
We make a bad name for our trade when we act that way.
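For what it's worth, here is a minimal sketch of the kind of change log and repeatable test I mean; the log path and test command are placeholders for whatever reproduces your actual problem:

    #!/usr/bin/env python3
    # A minimal sketch of disciplined troubleshooting: log every change and
    # re-run the same test case after each one. LOG_FILE and TEST_CMD are
    # placeholders -- substitute whatever check reproduces your problem.

    import datetime
    import subprocess

    LOG_FILE = "troubleshooting.log"                        # hypothetical log location
    TEST_CMD = ["curl", "-sf", "http://localhost/health"]   # hypothetical repeatable test case

    def log(entry: str) -> None:
        """Append a timestamped note so the initial conditions are never lost."""
        stamp = datetime.datetime.now().isoformat(timespec="seconds")
        with open(LOG_FILE, "a") as fh:
            fh.write(f"{stamp}  {entry}\n")

    def run_test() -> bool:
        """Exercise the exact same test case after every single change."""
        return subprocess.run(TEST_CMD, capture_output=True).returncode == 0

    def try_change(description: str) -> None:
        """Record one change, then immediately record whether the test now passes."""
        log(f"CHANGE: {description}")
        log("TEST PASSED" if run_test() else "TEST FAILED -- revert before trying the next idea")

    if __name__ == "__main__":
        try_change("example: restarted the application pool")   # one change at a time

The point isn't the script, it's the habit: one change, one log entry, one run of the same test, every time.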
-
overconfidence
-
Trusting your end users to tell the truth. Sometimes they think that if they tell you what really happened they'll suffer some sort of repercussions, or perhaps they just don't know which information is relevant. Either way, at the end of the day it's best to be skeptical and to ask as many questions as possible.
-
Motivating Staff
Get a contractor in to do their job, which you haven't given them enough time to do, and which you've asked them to do during a time when they're swamped with other work. Then complain that you're running over budget.
RAID
When trying to rebuild a mirrored RAID array, select the new disk as the one to mirror from....
Working with Crucial Services
- Make sure that the only environment a crucial service can run on (the one that takes payments) is an old PC that used to belong to an ex-developer.
- Make sure that this box has faulty RAM.
- Regularly play with the innards of this box.
- When people ask what happens if the box goes down, tell them it's a low priority.
- When the box actually does go down, tell people that it's OK, there are no issues, and that you'll have it fixed in 5 minutes...
Staff Motivation #2
Break your development team's dev environments, and delete half of the work they've done that week. A week later, ask why they didn't do unpaid overtime to keep up to speed.
Be the person to break things, then go on holiday
This morning, the first day our DBA is away on holiday for a week, we found that he'd changed the servers to master-master replication. We're learning fast about how to fix replication issues, but even we know that ignoring replication issues by default is a bad idea.
All things that have happened to me in the last... 4 months. Courtesy of http://www.stopyouredoingitwrong.com (my website, and the reason for this question!)
-
Plan for the worst, hope for the best - anything else is wrong in my book.
-
My two golden rules
Don't install from source on a binary distribution. It breaks your upgrade path and security patches.
Run the same versions of all packages on development and production.
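As a rough illustration of the second rule, here is a minimal sketch that diffs the package versions installed on two hosts. It assumes you've dumped each host's package list to a text file of "name version" lines (e.g. dpkg-query -W > dev-packages.txt on a Debian-style box); the file names are placeholders:

    #!/usr/bin/env python3
    # Diff package versions between two hosts, given a "name version" text
    # dump from each one. File names below are placeholders.

    def load_packages(path: str) -> dict:
        """Parse "name version" lines into a {name: version} dict."""
        packages = {}
        with open(path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) >= 2:
                    packages[parts[0]] = parts[1]
        return packages

    dev = load_packages("dev-packages.txt")     # hypothetical dump from the dev box
    prod = load_packages("prod-packages.txt")   # hypothetical dump from production

    for name in sorted(set(dev) | set(prod)):
        dev_ver, prod_ver = dev.get(name, "MISSING"), prod.get(name, "MISSING")
        if dev_ver != prod_ver:
            print(f"{name}: dev={dev_ver} prod={prod_ver}")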
-
From Deep Thoughts by SysAdmins... some of my favorites:
The Mack Truck Scenario: if no one else would be able to figure this out after you get hit by a Mack truck, you are doing something wrong.
If you haven't thought of at least one potential negative outcome of hitting 'enter' after the command you just typed, then you don't understand the command well enough to use it on a production system.
If you do it more than once, automate it. If you can't automate it, then document it. Document it anyway. (A small sketch follows below.)
AND MY FAVORITE - if it seems like someone else may have encountered this problem before, they probably have. Google for the answer.
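On the 'automate it, and document it anyway' point, here is a minimal sketch of what I mean, assuming the repeated chore is a morning disk-space check; the mount points and threshold are placeholders:

    #!/usr/bin/env python3
    # Automated (and documented) version of a daily manual check: warn when
    # any watched mount point crosses a usage threshold. Mounts and the
    # threshold are placeholders for whatever you actually look at.

    import shutil

    MOUNT_POINTS = ["/", "/var", "/home"]   # hypothetical mounts worth watching
    THRESHOLD = 0.90                        # warn above 90% used

    for mount in MOUNT_POINTS:
        usage = shutil.disk_usage(mount)
        used_fraction = usage.used / usage.total
        status = "WARN" if used_fraction >= THRESHOLD else "ok"
        print(f"{status:4} {mount:8} {used_fraction:.0%} used")

Drop it in cron and the script itself becomes the documentation.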
-
Two things:
Make sure you have a solid backup plan for your data (and see the restore-check sketch below).
Also, it is critical to have a solid support team around you when you get stuck. Being the lone ranger doesn't work.
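On the first point, a backup is only solid once you've proved it restores. Here is a minimal sketch, assuming the archive stores paths relative to / (e.g. created with tar -czf etc-backup.tar.gz -C / etc); the paths are placeholders, and a mismatch may simply mean the file changed after the backup was taken:

    #!/usr/bin/env python3
    # Restore a backup archive into a scratch directory and compare
    # checksums against the live copies. BACKUP is a placeholder and the
    # archive is assumed to store paths relative to /.

    import hashlib
    import tarfile
    import tempfile
    from pathlib import Path

    BACKUP = "etc-backup.tar.gz"   # hypothetical backup archive

    def checksum(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(BACKUP) as tar:
            tar.extractall(scratch)                      # restore somewhere harmless
        mismatches = 0
        for restored in Path(scratch).rglob("*"):
            if not restored.is_file():
                continue
            live = Path("/") / restored.relative_to(scratch)
            if not live.is_file() or checksum(live) != checksum(restored):
                mismatches += 1
                print(f"MISMATCH: {live}")
        print("restore verified" if mismatches == 0 else f"{mismatches} files differ")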
-
Assuming the user will have enough knowledge of how the software should work, and skimping on error handling as a result, comes back to bite you in the butt.
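A tiny illustration of that point, using a hypothetical 'port' field: validate what the user gives you and say exactly what you expected, instead of assuming they knew.

    #!/usr/bin/env python3
    # Don't assume the user knows what the software expects: validate the
    # input and explain exactly what was wrong. The 'port' field is a
    # hypothetical example.

    def parse_port(raw: str) -> int:
        """Turn user input into a port number, or fail with a clear explanation."""
        try:
            port = int(raw)
        except ValueError:
            raise SystemExit(f"'{raw}' is not a number; expected a port between 1 and 65535")
        if not 1 <= port <= 65535:
            raise SystemExit(f"{port} is out of range; expected a port between 1 and 65535")
        return port

    if __name__ == "__main__":
        print(parse_port(input("Port to listen on: ")))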
-
Here are my 9 rules for failure:
1 - DON'T think for yourself... lack of confidence.
2 - DON'T keep it simple... the best way is the hard way.
3 - DON'T control your inner chaos... AVOID becoming Zen about stress.
4 - IGNORE your environment... you're the only one!
5 - DON'T keep backups... why waste the disk space?
6 - DON'T test backups... skip this waste of time.
7 - AVOID exploring new paths of knowledge... better to walk familiar paths.
8 - DON'T seek extra motivation... winners suck... ordinary fits best!
9 - AVOID social media, troubleshooting, and reading manuals... trust only your own experience.
-
Forgetting that there are hundreds or thousands of people who will be affected by the consequences of your actions.
Failing to get the basics adequately covered in your blind rush to the exciting stuff.
-
Most common pitfalls that I've ever seen and fallen into myself, ranked on order of importance/criticality.
1.) Assumption. Example: "I assumed that the problem had to be with the network card and had been troubleshooting the device for an hour before it occurred to me to check the cable." This is one of the number-one killers I've ever seen. Never assume anything, and remember Mr. Holmes's lesson: 'When you have eliminated the impossible, whatever remains, however improbable, must be the truth.'
2.) Arrogance. Example: "I'm the freakin' senior admin, what does the junior think he/she can positively contribute to the troubleshooting?" During an ITR, I've had a web developer point out a very small yet critical problem in a router configuration that would have saved me hours of troubleshooting. Another set of eyes on a problem can't hurt, and many times it's even beneficial for training.
3.) Lack of RTFM. Example: "I've been working with Brocade Fibre Channel switches for years. I know how to zone a fabric, OK?" The tech in question ended up creating a zone for a tape library that consisted of a massive number of devices all trying to talk to the tape library at once, instead of a one-to-one zoning plan. Without a quick consult with 'El Manuel', the tech didn't know he was far outside best practices. The one-to-one example was in the first three pages.
4.) Poor change management / lack of communication and documentation. Example: the umpteenth email sent out to a group asking, "Did anyone mess around with the webserver over the weekend? Because it's down, we're out of clues and corporate wants it back up ASAP." This is another huge killer. No matter how good an admin you are, if you didn't document or communicate what you did to fix that 20-hour router outage, and another one goes down three hours after you've finally gone home to get some sleep, you're only a.) looking like a fool and b.) doing yourself harm.
5.) Bad management / dysfunctional team. Example: fear of looking stupid or of character assassination by co-workers causes things to be 'swept under the rug', etc. etc. A good team is a reflection of its leader and vice versa. A team's leader is responsible for a.) ensuring the entire team gets credit when someone does a stellar job, and rewarding the stellar worker, and b.) shielding the team (AND the responsible party) from the wrath of others when someone screws up, taking full responsibility for the problem, privately counseling the responsible one, and taking positive steps to ensure it never happens again. A good manager will also remove all obstacles in the path of his or her team.
Finally, A good leader/manager, especially in tech, is NEVER the smartest guy on the team. Good leaders surround themselves with smarter advisers. A leader's job is to enable the team, not become bogged down and responsible for every little detail, in effect carrying a team who can't get the job done. It becomes a self-defeating fallacy.
HTH.