

Week 13 - Ensuring Services Perform, Stay Available, and Meet the Promise

17 November 2025

When ITSM Becomes About Reliability and Trust

By Week 13, the course had moved past operations and transitions - now we were looking at what makes IT services reliable in the long term. This week shifted my focus toward the "steady state" practices that quietly determine whether users experience smooth, dependable services or frustrating downtime.

What I learned this week is that availability, performance, service levels, and catalogue clarity are not just technical concerns - they are promises. They are promises IT teams make to users, and promises users expect to be honored every single day. My assignment was to explore four practices that uphold those promises: Availability Management, Capacity & Performance Management, Service Level Management (SLM), and Service Catalogue Management. Each practice feeds into the others like pieces of a reliability ecosystem, and this week's work helped me see how they operate behind the scenes.

Part 1 - Availability Management

The first section of my assignment explored how organizations ensure that services are ready and functioning whenever they are needed. Availability Management is essentially the commitment to keep services up, accessible, and reliable according to business expectations.

What Availability Really Means

Availability is defined as the ability of a service or component to perform its agreed function whenever required. It is not just uptime - it is about readiness, reliability, and meeting user expectations.

Understanding the Availability Process

The Availability Management process breaks into key activities: setting targets, designing for reliability, monitoring and logging, planning improvement, validating recovery tests, and calculating metrics. This flow helped me see availability as something engineered, not accidental.

Core KPIs

I worked through five core KPIs: Percentage Availability, calculated as MTBF / (MTBF + MTRS) x 100; User Outage Minutes; Lost Transactions; Lost Business Value; and User Satisfaction (Availability Score). These KPIs taught me that availability is both technical and experiential - downtime affects productivity, operations, and business value.

MTBF vs MTRS

High MTBF (Mean Time Between Failures) means services rarely fail. Low MTRS (Mean Time to Restore Service) means services recover quickly. I realized that a service can fail often but still have "good availability" if recovery is extremely fast - something I had not fully understood before.
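To check this for myself, here is a minimal Python sketch of the MTBF / (MTBF + MTRS) calculation. The function name and the two sets of figures are my own hypothetical examples, not numbers from the assignment.

```python
def availability_pct(mtbf_hours: float, mtrs_hours: float) -> float:
    """Percentage availability = MTBF / (MTBF + MTRS) * 100."""
    if mtbf_hours <= 0 or mtrs_hours < 0:
        raise ValueError("MTBF must be positive and MTRS must not be negative")
    return 100.0 * mtbf_hours / (mtbf_hours + mtrs_hours)


# Hypothetical service A: fails rarely, but takes 10 hours to restore.
rarely_fails_slow_recovery = availability_pct(mtbf_hours=1000.0, mtrs_hours=10.0)

# Hypothetical service B: fails every 50 hours, but is restored in 3 minutes.
fails_often_fast_recovery = availability_pct(mtbf_hours=50.0, mtrs_hours=0.05)

print(f"Service A: {rarely_fails_slow_recovery:.2f}%")
print(f"Service B: {fails_often_fast_recovery:.2f}%")
```

With these made-up figures, the service that fails more often still ends up with the higher availability percentage, because its restore time is so short. The percentage alone hides how often users are interrupted, which is presumably why the other KPIs sit alongside it.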

Part 2 - Capacity & Performance Management

Next, I studied the practice responsible for ensuring systems run fast, stay responsive, and can scale with demand. If Availability is about "is the service up?", Capacity and Performance is "can the service handle what we throw at it?"

Performance vs Capacity

Performance is defined as what the service achieves in terms of speed, throughput, and responsiveness. Capacity is the maximum workload the service can support. These two work together - many outages are caused not by a component failing, but by demand exceeding what the service can handle.

Key Activities

The practice is split into two core activities. The first is Service Performance and Capacity Analysis, which includes monitoring live performance, modeling workload patterns, and detecting bottlenecks. The second is Service Performance and Capacity Planning, which involves forecasting future demand, preparing resources for growth, and scaling systems before they reach their limits. This reminded me of real-world scenarios like registration portals crashing on day one because no one modeled peak load.
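The planning idea of scaling before limits are reached can be sketched as a very simple forecast. This is only an illustration under my own assumptions: demand grows linearly, and 80% of capacity is a hypothetical "act now" threshold. Real capacity models are usually more sophisticated than this.

```python
import math
from typing import Optional


def periods_until_threshold(
    current_load: float,
    growth_per_period: float,
    max_capacity: float,
    threshold_pct: float = 80.0,  # hypothetical planning threshold
) -> Optional[int]:
    """Periods until load reaches the threshold, assuming linear growth.

    Returns 0 if the threshold is already reached, and None if the load
    is not growing (so the threshold is never reached under this model).
    """
    threshold_load = max_capacity * threshold_pct / 100.0
    if current_load >= threshold_load:
        return 0
    if growth_per_period <= 0:
        return None
    return math.ceil((threshold_load - current_load) / growth_per_period)


# Hypothetical portal: 600 concurrent users today, growing by 50 per week,
# with a tested maximum of 1000 concurrent users.
weeks_left = periods_until_threshold(
    current_load=600, growth_per_period=50, max_capacity=1000
)
print(f"Weeks until the 80% threshold: {weeks_left}")
```

Even a rough estimate like this turns "we might run out of capacity" into a date that someone can plan around.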

KPI Structure

My assignment lists four groups of KPIs: Performance Metrics such as response time, throughput, and latency; Capacity Metrics including utilization, headroom, and scalability; Forecasting Metrics measuring demand accuracy and growth indicators; and Stability Metrics tracking capacity-related incidents and bottleneck frequency. These KPIs made me understand why performance must be monitored continuously, so that problems are caught before they become user issues.
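A few of these KPIs are easy to express as small calculations. The sketch below shows one possible way to compute utilization, headroom, and demand forecast accuracy. The formulas are common conventions that I am assuming here, not definitions taken from the assignment, and the numbers are hypothetical.

```python
def utilization_pct(current_load: float, max_capacity: float) -> float:
    """Share of total capacity currently in use."""
    if max_capacity <= 0:
        raise ValueError("max_capacity must be positive")
    return 100.0 * current_load / max_capacity


def headroom_pct(current_load: float, max_capacity: float) -> float:
    """Share of total capacity still free (assumed: 100% minus utilization)."""
    return 100.0 - utilization_pct(current_load, max_capacity)


def forecast_accuracy_pct(forecast: float, actual: float) -> float:
    """Assumed definition: 100% minus the absolute percentage error."""
    if actual <= 0:
        raise ValueError("actual demand must be positive")
    return 100.0 * (1.0 - abs(forecast - actual) / actual)


# Hypothetical figures: 600 of 1000 units in use, and a demand forecast
# of 900 requests against an actual peak of 1000.
print(f"Utilization: {utilization_pct(600, 1000):.1f}%")
print(f"Headroom: {headroom_pct(600, 1000):.1f}%")
print(f"Forecast accuracy: {forecast_accuracy_pct(900, 1000):.1f}%")
```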

Part 3 - Service Level Management (SLM)

SLM shifted the narrative from engineering to expectations. This practice ensures that both provider and customer agree on what "good service" means - and that the provider continuously monitors and meets that target.

Purpose of SLM

Service Level Management sets business-based service targets and ensures service delivery is evaluated and managed against those targets. This includes commitments like uptime, response time, resolution time, and customer experience.

Core SLM Processes

The SLM lifecycle includes: establishing shared views with customers, collecting and analyzing metrics, conducting regular reviews, capturing issues, engaging with customers, and gathering insights from multiple sources. This showed me that SLM is truly customer-centric - not just operational.

SLA Requirements

A good SLA must align with defined services in the catalogue, must be outcome-oriented, must reflect real agreements between parties, and must be clear and easy to understand. I learned about the Watermelon SLA effect (green outside, red inside) - meaning SLAs can look good on reports but deliver poor user experience.
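One way to picture the Watermelon effect is to put SLA compliance next to a user satisfaction score and flag services where the first is green but the second is not. The sketch below is purely illustrative: the class, the field names, the 1-to-5 survey scale, and the 3.5 cut-off are all my own assumptions.

```python
from dataclasses import dataclass


@dataclass
class ServiceReport:
    name: str
    sla_target_pct: float      # agreed target, e.g. 99.5
    sla_achieved_pct: float    # measured result for the period
    satisfaction_score: float  # hypothetical survey average on a 1-5 scale


def is_watermelon(report: ServiceReport, min_satisfaction: float = 3.5) -> bool:
    """True when the SLA is met on paper but users are still unhappy."""
    sla_green = report.sla_achieved_pct >= report.sla_target_pct
    users_unhappy = report.satisfaction_score < min_satisfaction
    return sla_green and users_unhappy


# Hypothetical monthly reports.
reports = [
    ServiceReport("Email", 99.5, 99.9, 4.4),
    ServiceReport("Student Portal", 99.0, 99.2, 2.1),
    ServiceReport("VPN", 99.5, 98.0, 2.8),
]

watermelons = [r.name for r in reports if is_watermelon(r)]
print("Watermelon SLAs:", watermelons)
```

In this made-up data only the Student Portal is a watermelon: its SLA target is met, yet its satisfaction score is low. The VPN is simply red, because it misses its target outright.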

SLM KPIs

The KPIs include customer experience, service performance, business outcome metrics, SLA compliance, and service review improvements. This practice ties the technical world to the business world - translating metrics into meaning.

Part 4 - Service Catalogue Management

To close the week, I explored how organizations document and communicate their available services. The catalogue is the single source of truth for everything IT provides.

Purpose of Service Catalogue

The Service Catalogue provides consistent information about all active services and service offerings available to relevant audiences. It is the bridge between IT operations and business understanding.

Catalogue Management Activities

I worked through five main activities: Publishing new services, Editing and Updating entries, Maintaining service descriptions, Providing tailored views such as user view, customer view, and IT-to-IT view, and Avoiding isolated, fragmented catalogues. Clear catalogues prevent confusion, duplication, and misaligned expectations.

Real-World Examples

The assignment included examples of real catalogues such as GovTech Singapore and UNSW MyIT. These helped me visualize how professional IT organizations present services to users - with clarity, organization, and comprehensive information.

Reflection

Week 13 connected everything we have learned about reliability, performance, and user expectations into a cohesive picture. I realized that stable IT services do not just happen - they are engineered, monitored, forecast, reviewed, and clearly communicated. Availability keeps systems running, performance keeps them fast, SLM keeps them aligned with business expectations, and catalogues keep them understandable. Together, these practices shape the everyday experience users have with technology, and this week helped me appreciate how much precision and care go into making IT "just work."