You are a platform engineer at DigiTron! While working on the product, all of a sudden, you can no longer log in! All incidents and alerts are inaccessible through the platform.
<br class="show"/>
There are a few teams that you could reach out to for more information:
* [[DevOps->ops-management]]
* [[Front End Web->front-end]]
* [[Backend->middle-tier]]
(set: $ttr to 0)
(set: $stress to 0)
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>"We deployed config changes yesterday, but nothing recent. I don't think we can help with this issue. Maybe ask another team?"
* [[Front End Web->front-end]]
* [[Backend->middle-tier]]
(set: $ttr to $ttr + 2)
(set: $stress to $stress + 5)
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>"We noticed the issue too. No one has been able to log in, and we are all investigating what could be causing this."
* [[DevOps->ops-management]]
* [[Backend->middle-tier]]
(set: $ttr to $ttr + 3)
(set: $stress to $stress + 10)
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>"We did just deploy code changes and have been getting reports of others having issues! We're fairly sure that the issues are related."
* [[Look at diff of code changes from deploy->look-at-diff]]
* [[Look at dashboards->look-at-graphs]]
* [[Look at Splunk logs->look-at-logs]]
(set: $ttr to $ttr + 2)
(set: $stress to $stress + 5)
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>The diff shows changes irrelevant to this issue. Another change must have caused this failure.
* [[Look at dashboards->look-at-graphs]]
* [[Look at Splunk logs->look-at-logs]]
* [[Look at the config diff since last deploy->look-at-puppet]]
(set: $ttr to $ttr + 5)
(set: $stress to $stress + 20)
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>Graphs are showing nominal levels, except for a drop-off in websocket connections. This probably indicates that fewer users are logging in.
* [[Look at diff of code changes from deploy->look-at-diff]]
* [[Look at Splunk logs->look-at-logs]]
(set: $ttr to $ttr + 2)
(set: $stress to $stress + 15)
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>There is a spike in log volume starting at the time of the deploy! You notice most of the logs are related to the system. There might have been a config change.
* [[Look at diff of code changes from deploy->look-at-diff]]
* [[Look at dashboards->look-at-graphs]]
* [[Look at the config diff since last deploy->look-at-puppet]]
(set: $ttr to $ttr + 1)
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>There were changes to the system in yesterday's config deploy that were incorrect! This is definitely the cause of the outage!
* [[Yell at DevOps->yell]]
* [[Roll back config changes->rollback-puppet]]
* [[SSH to box to manually correct config->manual-correction]]
(set: $ttr to $ttr + 1)
(set: $stress to $stress - 10)
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>Everyone is now stressed and nothing more has been done to resolve the outage...
* [[Roll back changes->rollback-puppet]]
* [[SSH to box to manually correct config->manual-correction]]
(set: $ttr to $ttr + 10)
(set: $stress to $stress + 50)
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>You started the rollback, but config changes take 20 minutes to propagate.
* [[Wait for the rollback to take effect->wait-for-rollback]]
* [[SSH to box to manually correct config->manual-correction-2]]
(set: $ttr to $ttr + 3)
(set: $stress to $stress + 10)
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>You manually correct the config and the changes take effect immediately after you deploy. Customers are able to log in again!
* [[Go get a beer->beer]]
* [[Roll back config->rollback-puppet-2]]
(set: $ttr to $ttr + 5)
(set: $stress to $stress - 10)
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>You manually correct the config and the changes take effect immediately. Customers are able to log in again! Since you already made the changes to the config repo, you are set for the future!
(set: $ttr to $ttr + 5)
(set: $stress to $stress - 20)
(set: $url to "https://twitter.com/intent/tweet?text=I%20beat%20the%20@VictorOps%20DevOps%20game%20in%20"+(text: $ttr)+"%20minutes.%20Test%20your%20skillz%20at&url=https%3A%2F%2Fdevopsgame.victorops.com%2F&hashtags=devops,oncall&via=VictorOps&related=splunk%3ASplunk")
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>The rollback takes effect, but customers were affected for longer than necessary. Customers are mad, your boss is disappointed, and you are sad :(.
(set: $ttr to $ttr + 20)
(set: $stress to $stress + 40)
(set: $url to "https://twitter.com/intent/tweet?text=I%20beat%20the%20@VictorOps%20DevOps%20game%20in%20"+(text: $ttr)+"%20minutes.%20Test%20your%20skillz%20at&url=https%3A%2F%2Fdevopsgame.victorops.com%2F&hashtags=devops,oncall&via=VictorOps&related=splunk%3ASplunk")
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div>Wow! What a day! However, during your refreshing beer break, you get paged for another outage! It's the same issue! The backend team did another deployment, and because you did not roll back the config changes, your manual fixes were overwritten! As a result, you have been removed as the incident commander...
(set: $ttr to $ttr + 20)
(set: $stress to $stress + 30)
(set: $url to "https://twitter.com/intent/tweet?text=I%20beat%20the%20@VictorOps%20DevOps%20game%20in%20"+(text: $ttr)+"%20minutes.%20Test%20your%20skillz%20at&url=https%3A%2F%2Fdevopsgame.victorops.com%2F&hashtags=devops,oncall&via=VictorOps&related=splunk%3ASplunk")
<div class="timer">Time to resolve: *$ttr* minutes and counting...</div>
<div class="timer">Stress: *$stress*%</div>You started the rollback, but config changes take 20 minutes to propagate. Good thing you already corrected the config manually, so customers don't have to wait for the changes to propagate!
<br class="show"/>
Congratulations!
(set: $ttr to $ttr + 0)
(set: $stress to $stress - 20)
(set: $url to "https://twitter.com/intent/tweet?text=I%20beat%20the%20@VictorOps%20DevOps%20game%20in%20"+(text: $ttr)+"%20minutes.%20Test%20your%20skillz%20at&url=https%3A%2F%2Fdevopsgame.victorops.com%2F&hashtags=devops,oncall&via=VictorOps&related=splunk%3ASplunk")
<div class="timer">Time to resolve: *$ttr* minutes</div>
<div class="timer">Stress: *$stress*%</div><div class="header-block"><a href="https://victorops.com"><img src="assets/vo-80s.png" alt="VictorOps"></a><div id="custom-audio"><span class="audio-title">Audio:</span></div></div>
<br class="show"/>(set: $passage to (passage:))
(if: $passage's tags contains "end")[
(set: $url to "https://twitter.com/intent/tweet?text=I%20beat%20the%20@VictorOps%20DevOps%20game%20in%20"+(text: $ttr)+"%20minutes.%20Test%20your%20skillz%20at&url=https%3A%2F%2Fdevopsgame.victorops.com%2F&hashtags=devops,oncall&via=VictorOps&related=splunk%3ASplunk")
<div id="overlay-container" class="overlay">
<div class="popup">
<div><a class="close" id="close-popup">×</a></div>
<div class="content">
<div class="flex">
<div class="flex-right">
<h3>Think Your Score Sucks?</h3>
<p>Learn how to make on-call suck less by downloading our free webinar with 5 quick wins for a better on-call experience.</p>
<div class="vertical-space">
<a name="button" title="Get the webinar" class="subscribe-btn" href="https://victorops.com/webinars/how-to-make-on-call-suck-less/">Check It Out</a>
</div>
</div>
<div class="flex-left">
<img class="subscribe-img" src="assets/vogame-subscribe.jpg" alt="VictorOps Resource">
</div>
</div>
</div>
</div>
</div>
](else:)[
(set: $url to "https://twitter.com/intent/tweet?text=Think%20you%20got%20what%20it%20takes%20to%20resolve%20a%20system%20outage%3F%20Test%20your%20DevOps%20skillz%20with%20our%20On-Call%20Adventure%20game.&url=https%3A%2F%2Fdevopsgame.victorops.com%2F&hashtags=devops,oncall&via=VictorOps&related=splunk%3ASplunk")
]
<div class="row">
(if: $passage's tags contains "end")[(link-repeat: "Start Over")[(reload:)]]
(link-repeat: "Share")[(open-url: $url)]
(link-repeat: "Get Better")[(open-url: "https://victorops.com/ebooks/why-devops-matters-collaborative-transparency-in-incident-management/")]
</div>
<script>
var event = new Event('build');
window.dispatchEvent(event)
</script>