The Site Reliability Workbook: Practical Ways to Implement SRE

by Betsy Beyer

Other authors: See the other authors section.

Members	Reviews	Popularity	Average rating	Conversations
88	1	306,619	(4)	None
In 2016, Google's Site Reliability Engineering book ignited an industry discussion on what it means to run production services today, and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment. This new workbook not only combines practical examples from Google's experiences, but also provides case studies from Google's Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn't. Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is. You'll learn: how to run reliable services in environments you don't completely control, like cloud; practical applications of how to create, monitor, and run your services via Service Level Objectives; how to convert existing ops teams to SRE, including how to dig out of operational overload; methods for starting SRE from either greenfield or brownfield…

▾Members

Recently added by

libraryjk, gabeforster, mattparsons, jaredtbates, morganjaymee, amackera, RELEXFI

▾Tags

39 (1) 00724 (1) To read (12) Administration (1) Devops (3) dont-track-reading (2) Development (1) ENG (1) business (1) goodreads-20220324 (1) Software engineering (2) humble (1) Humble Book Bundle: DevOps by O'Reilly (1) Humble Bundle (2) Computing (1) Computing (1) Ebook (5) Unread (1) non-fiction (6) Computers (2) pdf-book (1) Owned (3) Programming (2) Recycled (1) REF (2) source/humble (1) system administration (3) Tech (2) tech-to-read (1) Work (1)

▾Member reviews

Contains "Foreword I", "Foreword II", "Preface", " Conventions Used in This Book", " Using Code Examples", " O'Reilly Safari", " How to Contact Us", " Acknowledgments",
"1. How SRE Relates to DevOps", " Background on DevOps", " No More Silos", " Accidents Are Normal", " Change Should Be Gradual", " Tooling and Culture Are Interrelated", " Measurement Is Crucial", " Background on SRE", " Operations Is a Software Problem", " Manage by Service Level Objectives (SLOs)", " Work to Minimize Toil", " Automate This Year's Job Away", " Move Fast by Reducing the Cost of Failure", " Share Ownership with Developers", " Use the Same Tooling, Regardless of Function or Job Title", " Compare and Contrast", " Organizational Context and Fostering Successful Adoption", " Narrow, Rigid Incentives Narrow Your Success", " It's Better to Fix It Yourself; Don't Blame Someone Else", " Consider Reliability Work as a Specialized Role", " When Can Substitute for Whether", " Strive for Parity of Esteem: Career and Financial", " Conclusion",
"Part I. Foundations",
"2. Implementing SLOs", " Why SREs Need SLOs", " Getting Started", " Reliability Targets and Error Budgets", " What to Measure: Using SLIs", " A Worked Example", " Moving from SLI Specification to SLI Implementation", " Measuring the SLIs", " Using the SLIs to Calculate Starter SLOs", " Choosing an Appropriate Time Window", " Getting Stakeholder Agreement", " Establishing an Error Budget Policy", " Documenting the SLO and Error Budget Policy", " Dashboards and Reports", " Continuous Improvement of SLO Targets", " Improving the Quality of Your SLO", " Decision Making Using SLOs and Error Budgets", " Advanced Topics", " Modeling User Journeys", " Grading Interaction Importance", " Modeling Dependencies", " Experimenting with Relaxing Your SLOs", " Conclusion",
"3. SLO Engineering Case Studies", " Evernote's SLO Story", " Why Did Evernote Adopt the SRE Model?", " Introduction of SLOs: A Journey in Progress", " Breaking Down the SLO Wall Between Customer and Cloud Provider", " Current State", " The Home Depot's SLO Story", " The SLO Culture Project", " Our First Set of SLOs", " Evangelizing SLOs", " Automating VALET Data Collection", " The Proliferation of SLOs", " Applying VALET to Batch Applications", " Using VALET in Testing", " Future Aspirations", " Summary", " Conclusion",
"4. Monitoring", " Desirable Features of a Monitoring Strategy", " Speed", " Calculations", " Interfaces", " Alerts", " Sources of Monitoring Data", " Examples", " Managing Your Monitoring System", " Treat Your Configuration as Code", " Encourage Consistency", " Prefer Loose Coupling", " Metrics with Purpose", " Intended Changes", " Dependencies", " Saturation", " Status of Served Traffic", " Implementing Purposeful Metrics", " Testing Alerting Logic", " Conclusion",
"5. Alerting on SLOs", " Alerting Considerations", " Ways to Alert on Significant Events", " 1: Target Error Rate ≥ SLO Threshold", " 2: Increased Alert Window", " 3: Incrementing Alert Duration", " 4: Alert on Burn Rate", " 5: Multiple Burn Rate Alerts", " 6: Multiwindow, Multi-Burn-Rate Alerts", " Low-Traffic Services and Error Budget Alerting", " Generating Artificial Traffic", " Combining Services", " Making Service and Infrastructure Changes", " Lowering the SLO or Increasing the Window", " Extreme Availability Goals", " Alerting at Scale", " Conclusion",
"6. Eliminating Toil", " What Is Toil?", " Measuring Toil", " Toil Taxonomy", " Business Processes", " Production Interrupts", " Release Shepherding", " Migrations", " Cost Engineering and Capacity Planning", " Troubleshooting for Opaque Architectures", " Toil Management Strategies", " Identify and Measure Toil", " Engineer Toil Out of the System", " Reject the Toil", " Use SLOs to Reduce Toil", " Start with Human-Backed Interfaces", " Provide Self-Service Methods", " Get Support from Management and Colleagues", " Promote Toil Reduction as a Feature", " Start Small and Then Improve", " Increase Uniformity", " Assess Risk Within Automation", " Automate Toil Response", " Use Open Source and Third-Party Tools", " Use Feedback to Improve", " Case Studies", " Case Study 1: Reducing Toil in the Datacenter with Automation", " Background", " Problem Statement", " What We Decided to Do", " Design First Effort: Saturn Line-Card Repair", " Implementation", " Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair", " Implementation", " Lessons Learned", " Case Study 2: Decommissioning Filer-Backed Home Directories", " Background", " Problem Statement", " What We Decided to Do", " Design and Implementation", " Key Components", " Lessons Learned", " Conclusion",
"7. Simplicity", " Measuring Complexity", " Simplicity Is End-to-End, and SREs Are Good for That", " Case Study 1: End-to-End API Simplicity", " Case Study 2: Project Lifecycle Complexity", " Regaining Simplicity", " Case Study 3: Simplification of the Display Ads Spiderweb", " Case Study 4: Running Hundreds of Microservices on a Shared Platform", " Case Study 5: pDNS No Longer Depends on Itself", " Conclusion",
"Part II. Practices",
"8. On-Call", " Recap of 'Being On-Call' Chapter of First SRE Book", " Example On-Call Setups Within Google and Outside Google", " Google: Forming a New Team", " Evernote: Finding Our Feet in the Cloud", " Practical Implementation Details", " Anatomy of Pager Load", " On-Call Flexibility", " On-Call Team Dynamics", " Conclusion",
"9. Incident Response", " Incident Management at Google", " Incident Command System", " Main Roles in Incident Response", " Case Studies", " Case Study 1: Software Bug - The Lights Are On but No One's (Google) Home", " Case Study 2: Service Fault - Cache Me If You Can", " Case Study 3: Power Outage - Lightning Never Strikes Twice... Until It Does", " Case Study 4: Incident Response at PagerDuty", " Putting Best Practices into Practice", " Incident Response Training", " Prepare Beforehand", " Drills", " Conclusion",
"10. Postmortem Culture: Learning from Failure", " Case Study", " Bad Postmortem", " Why Is This Postmortem Bad?", " Good Postmortem", " Why Is This Postmortem Better?", " Organizational Incentives", " Model and Enforce Blameless Behavior", " Reward Postmortem Outcomes", " Share Postmortems Openly", " Respond to Postmortem Culture Failures", " Tools and Templates", " Postmortem Templates", " Postmortem Tooling", " Conclusion",
"11. Managing Load", " Google Cloud Load Balancing", " Anycast", " Maglev", " Global Software Load Balancer", " Google Front End", " GCLB: Low Latency", " GCLB: High Availability", " Case Study 1: Pokémon GO on GCLB", " Autoscaling", " Handling Unhealthy Machines", " Working with Stateful Systems", " Configuring Conservatively", " Setting Constraints", " Including Kill Switches and Manual Overrides", " Avoiding Overloading Backends", " Avoiding Traffic Imbalance", " Combining Strategies to Manage Load", " Case Study 2: When Load Shedding Attacks", " Conclusion",
"12. Introducing Non-Abstract Large System Design", " What Is NALSD?", " Why 'Non-Abstract'?", " AdWords Example", " Design Process", " Initial Requirements", " One Machine", " Distributed System", " Conclusion",
"13. Data Processing Pipelines", " Pipeline Applications", " Event Processing/Data Transformation to Order or Structure Data", " Data Analytics", " Machine Learning", " Pipeline Best Practices", " Define and Measure Service Level Objectives", " Plan for Dependency Failure", " Create and Maintain Pipeline Documentation", " Map Your Development Lifecycle", " Reduce Hotspotting and Workload Patterns", " Implement Autoscaling and Resource Planning", " Adhere to Access Control and Security Policies", " Plan Escalation Paths", " Pipeline Requirements and Design", " What Features Do You Need?", " Idempotent and Two-Phase Mutations", " Checkpointing", " Code Patterns", " Pipeline Production Readiness", " Pipeline Failures: Prevention and Response", " Potential Failure Modes", " Potential Causes", " Case Study: Spotify", " Event Delivery", " Event Delivery System Design and Architecture", " Event Delivery System Operation", " Customer Integration and Support", " Summary", " Conclusion",
"14. Configuration Design and Best Practices", " What Is Configuration?", " Configuration and Reliability", " Separating Philosophy and Mechanics", " Configuration Philosophy", " Configuration Asks Users Questions", " Questions Should Be Close to User Goals", " Mandatory and Optional Questions", " Escaping Simplicity", " Mechanics of Configuration", " Separate Configuration and Resulting Data", " Importance of Tooling", " Ownership and Change Tracking", " Safe Configuration Change Application", " Conclusion",
"15. Configuration Specifics", " Configuration-Induced Toil", " Reducing Configuration-Induced Toil", " Critical Properties and Pitfalls of Configuration Systems", " Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem", " Pitfall 2: Designing Accidental or Ad Hoc Language Features", " Pitfall 3: Building Too Much Domain-Specific Optimization", " Pitfall 4: Interleaving 'Configuration Evaluation' with 'Side Effects'", " Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua", " Integrating a Configuration Language", " Generating Config in Specific Formats", " Driving Multiple Applications", " Integrating an Existing Application: Kubernetes", " What Kubernetes Provides", " Example Kubernetes Config", " Integrating the Configuration Language", " Integrating Custom Applications (In-House Software)", " Effectively Operating a Configuration System", " Versioning", " Source Control", " Tooling", " Testing", " When to Evaluate Configuration", " Very Early: Checking in the JSON", " Middle of the Road: Evaluate at Build Time", " Late: Evaluate at Runtime", " Guarding Against Abusive Configuration", " Conclusion",
"16. Canarying Releases", " Release Engineering Principles", " Balancing Release Velocity and Reliability", " What Is Canarying?", " Release Engineering and Canarying", " Requirements of a Canary Process", " Our Example Setup", " A Roll Forward Deployment Versus a Simple Canary Deployment", " Canary Implementation", " Minimizing Risk to SLOs and the Error Budget", " Choosing a Canary Population and Duration", " Selecting and Evaluating Metrics", " Metrics Should Indicate Problems", " Metrics Should Be Representative and Attributable", " Before/After Evaluation Is Risky", " Use a Gradual Canary for Better Metric Selection", " Dependencies and Isolation", " Canarying in Noninteractive Systems", " Requirements on Monitoring Data", " Related Concepts", " Blue/Green Deployment", " Artificial Load Generation", " Traffic Teeing", " Conclusion",
"Part III. Processes",
"17. Identifying and Recovering from Overload", " From Load to Overload", " Case Study 1: Work Overload When Half a Team Leaves", " Background", " Problem Statement", " What We Decided to Do", " Implementation", " Lessons Learned", " Case Study 2: Perceived Overload After Organizational and Workload Changes", " Background", " Problem Statement", " What We Decided to Do", " Implementation", " Effects", " Lessons Learned", " Strategies for Mitigating Overload", " Recognizing the Symptoms of Overload", " Reducing Overload and Restoring Team Health", " Conclusion",
"18. SRE Engagement Model", " The Service Lifecycle", " Phase 1: Architecture and Design", " Phase 2: Active Development", " Phase 3: Limited Availability", " Phase 4: General Availability", " Phase 5: Deprecation", " Phase 6: Abandoned", " Phase 7: Unsupported", " Setting Up the Relationship", " Communicating Business and Production Priorities", " Identifying Risks", " Aligning Goals", " Setting Ground Rules", " Planning and Executing", " Sustaining an Effective Ongoing Relationship", " Investing Time in Working Better Together", " Maintaining an Open Line of Communication", " Performing Regular Service Reviews", " Reassessing When Ground Rules Start to Slip", " Adjusting Priorities According to Your SLOs and Error Budget", " Handling Mistakes Appropriately", " Scaling SRE to Larger Environments", " Supporting Multiple Services with a Single SRE Team", " Structuring a Multiple SRE Team Environment", " Adapting SRE Team Structures to Changing Circumstances", " Running Cohesive Distributed SRE Teams", " Ending the Relationship", " Case Study 1: Ares", " Case Study 2: Data Analysis Pipeline", " Conclusion",
"19. SRE: Reaching Beyond Your Walls", " Truths We Hold to Be Self-Evident", " Reliability Is the Most Important Feature", " Your Users, Not Your Monitoring, Decide Your Reliability", " If You Run a Platform, Then Reliability Is a Partnership", " Everything Important Eventually Becomes a Platform", " When Your Customers Have a Hard Time, You Have to Slow Down", " You Will Need to Practice SRE with Your Customers", " How to: SRE with Your Customers", " Step 1: SLOs and SLIs Are How You Speak", " Step 2: Audit the Monitoring and Build Shared Dashboards", " Step 3: Measure and Renegotiate", " Step 4: Design Reviews and Risk Analysis", " Step 5: Practice, Practice, Practice", " Be Thoughtful and Disciplined", " Conclusion",
"20. SRE Team Lifecycles", " SRE Practices Without SREs", " Starting an SRE Role", " Finding Your First SRE", " Placing Your First SRE", " Bootstrapping Your First SRE", " Distributed SREs", " Your First SRE Team", " Forming", " Storming", " Norming", " Performing", " Making More SRE Teams", " Service Complexity", " SRE Rollout", " Geographical Splits", " Suggested Practices for Running Many Teams", " Mission Control", " SRE Exchange", " Training", " Horizontal Projects", " SRE Mobility", " Travel", " Launch Coordination Engineering Teams", " Production Excellence", " SRE Funding and Hiring", " Conclusion",
"21. Organizational Change Management in SRE", " SRE Embraces Change", " Introduction to Change Management", " Lewin's Three-Stage Model", " McKinsey's 7-S Model", " Kotter's Eight-Step Process for Leading Change", " The Prosci ADKAR Model", " Emotion-Based Models", " The Deming Cycle", " How These Theories Apply to SRE", " Case Study 1: Scaling Waze - From Ad Hoc to Planned Change", " Background", " The Messaging Queue: Replacing a System While Maintaining Reliability", " The Next Cycle of Change: Improving the Deployment Process", " Lessons Learned", " Case Study 2: Common Tooling Adoption in SRE", " Background", " Problem Statement", " What We Decided to Do", " Design", " Implementation: Monitoring", " Lessons Learned", " Conclusion",
"Conclusion", " Onward...", " The Future Belongs to the Past", " SRE + 'Insert Other Discipline'", " Trickles, Streams, and Floods", " SRE Belongs to All of Us", " On Gratitude",
"A. Example SLO Document", " Service Overview", " SLIs and SLOs", " Rationale", " Error Budget", " Clarifications and Caveats",
"B. Example Error Budget Policy", " Service Overview", " Goals", " Non-Goals", " SLO Miss Policy", " Outage Policy", " Escalation Policy", " Background",
"C. Results of Postmortem Analysis",
"Index".

Really great to have a description of what you can do if you have your hand on the tiller at, say, Google. Elsewhere (and probably at Google too), the screams from the engine room cannot be heard in the executive corridor.

bnielsen | Aug 27, 2021

▾Other authors

Author name	Role	Type of author	Work?	Status
Betsy Beyer	-	primary author	all editions	calculated
Kawahara, Kent	Editor	primary author	some editions	confirmed
Murphy, Niall Richard	Editor	primary author	some editions	confirmed
Rensin, David K.	Editor	primary author	some editions	confirmed
Thorne, Stephen	Editor	primary author	some editions	confirmed
Rensin, Dave	Author	secondary author	some editions	confirmed

▾Book descriptions

In 2016, Google's Site Reliability Engineering book ignited an industry discussion on what it means to run production services today, and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment. This new workbook not only combines practical examples from Google's experiences, but also provides case studies from Google's Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn't. Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is. You'll learn: how to run reliable services in environments you don't completely control, like cloud; practical applications of how to create, monitor, and run your services via Service Level Objectives; how to convert existing ops teams to SRE, including how to dig out of operational overload; methods for starting SRE from either greenfield or brownfield.
