[lopsa-discuss] Interruptions coverage...

sysadm-quatsch at LuftHans.com sysadm-quatsch at LuftHans.com
Tue Dec 20 15:31:47 PST 2005


On Dec 19, 2005, Betsy Schwartz chatted thus:

> At Genuity, I was part of a team that had 6-12 people over time. We had an 
> on-call rotation where one person was primary and one person was backup for a

That's somewhat how we operate for off-hours. One person is primary and we
had secondaries for each functional area. It's fallen apart due to team
reduction. Now the primary either fixes the problems or gets ahold of
the person responsible for that functional area.

> full week. The oncall person did little but respond to tpages, around the 
> clock. The backup oncall person handled the ticket queue, which was often 
> quite large. They did the quick tickets themselves and handed off others to 
> folks who had particular areas of responsibility. Plus, they sometimes got 
> paged by the primary oncall or engaged if the primary didn't respond within 
> 15 minutes.
>
> The down side of this was that everyone had long-term projects to work on, 
> and both the primary and backup oncall ended up sandbagged for an entire week 
> at a time. This could play hell with deadlines. The upside was that it was 
> always pretty clear whose responsibility the queue was. There was often a bit 
> of friction about how many open tickets were left in the queue when it was 
> handed over to the next person; I'd wonder if this could be an issue with 
> very frequent turnover of responsibility.

Having one person responsible for all of the issues during the week worked
well. We're a small group and had a need to understand what everyone else
was doing, so it was forced cross-training. It also gave us the time to go
after problems and fix them. During my first week on call we got more than
10x the normal number of pages, and most of the procedures for dealing with
them weren't documented. I got documentation written and fixed a lot of
configs and code to stop spurious paging.

In a better-designed system that shouldn't be necessary. With the hacked-up
infrastructure here, I used my on-call week as an excuse to make
infrastructure improvements. Yes, I tested things before rolling them out.
Heck, most of the work was getting things into revision management...

ciao,

der.hans
-- 
#  https://www.LuftHans.com/        http://www.CiscoLearning.org/
#  Join the League of Professional System Administrators! https://LOPSA.org/
#  "The reasons for my decision to quit were myriad, but central to the
#  decision was the realization that there are two kinds of companies:
#  Good ones ask you to think for them.
#  The others tell you to think like them." -- Benjy Feen

