Week 5 – Problem Management

September 23, 2025

After spending the last week exploring the Service Desk, this week my focus shifted to Problem Management. If Incident Management is about putting out fires quickly, Problem Management is about asking why the fire started in the first place and making sure it never happens again.

By the end of the week, I could clearly see how Problem Management differs from Incident Management. Incidents are about restoring service immediately, while Problems are about digging into the root cause, preventing repetition, and reducing long-term risk. ITIL 4 taught me that Problem Management is structured around three big activities: Problem Identification, Problem Control, and Error Control.


Understanding Problem Management

The first task asked me to define Problem Management in my own words. I described it as a structured practice focused on identifying, analyzing, and eliminating the root causes of incidents.

As I wrote, I realized its value lies in being proactive. Users often care only about whether the service comes back. But behind the scenes, IT must care about preventing the same disruption tomorrow. That’s where Problem Management begins.

At first, I thought this was just another layer of Incident Management. But as I reflected, I saw the contrast:

  • Problems are about prevention: understand and eliminate the cause.

  • Incidents are about speed: restore the service quickly.

The Lifecycle of a Problem

When I reached this part of the assignment, I realized Problem Management is not one big activity but a set of sub-processes, each with its own role in preventing recurring incidents. Writing them out made me feel like I was mapping the different gears in a machine: every piece turns to keep the system running smoothly.

| Sub-Process | Purpose |
| --- | --- |
| Proactive Problem Identification | Increase service availability by proactively identifying Problems, finding issues before incidents recur. |
| Problem Categorization and Prioritization | Record and prioritize Problems accurately to ensure quick and effective resolution. |
| Problem Diagnosis and Resolution | Identify the root cause of a Problem and initiate the most suitable and efficient solution. Provide a temporary workaround if possible. |
| Problem and Error Control | Monitor unresolved Problems and their processing status to enable corrective action if needed. |
| Problem Closure and Evaluation | After a permanent solution is applied, ensure the Problem Record contains a complete historical description and update Known Error Records as needed. |
| Major Problem Review | Conduct a formal review of major Problem resolutions to ensure the issue is fully addressed and generate lessons learned to prevent recurrence. |
| Problem Management Reporting | Provide Problem reports, including status, available workarounds, and a list of outstanding problems, to other service management processes and IT management. |

Measuring Success with KPIs

The next task made me think about how we measure success. With incidents, the KPI is usually speed: how fast did you respond, and how fast did you close the ticket?

But with Problem Management, the story is different. It’s measured by how many problems were eliminated, how many root causes were identified, and whether repeat incidents declined. Even customer satisfaction became an important measure, because if problems are solved permanently, users begin to notice stability and reliability.

| KPI | Description |
| --- | --- |
| Number of problems recorded | Measures the number of problems identified in a given period. |
| Average resolution time | Average time taken to resolve problems. |
| % of problems with identified root cause | Percentage of problems where the root cause was successfully identified. |
| % of problems resolved with workaround | Percentage of problems temporarily addressed with a workaround. |
| Number of repeat incidents | Measures the effectiveness of problem management in preventing recurring incidents. |
| Customer satisfaction | Level of user satisfaction regarding problem resolution. |

To me, these KPIs told a bigger story: success here is not about racing to the finish line, but about making sure the race track itself is safe for the future.
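To make the KPI table a little more concrete, here is a minimal Python sketch of how a few of those numbers could be computed from a list of problem records. The `ProblemRecord` fields, the IDs, and the sample dates are assumptions made up for illustration; a real ITSM tool would have its own schema and reporting.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ProblemRecord:
    """Hypothetical problem record; field names are illustrative only."""
    problem_id: str
    opened: datetime
    resolved: Optional[datetime] = None   # None while the problem is still open
    root_cause: Optional[str] = None      # None until a root cause is identified
    has_workaround: bool = False


def problem_kpis(problems: list[ProblemRecord]) -> dict[str, float]:
    """Compute a few of the KPIs from the table for one reporting period."""
    total = len(problems)
    resolved = [p for p in problems if p.resolved is not None]
    # Average resolution time in days, over resolved problems only.
    avg_days = (
        sum((p.resolved - p.opened).total_seconds() for p in resolved)
        / len(resolved) / 86400
        if resolved else 0.0
    )
    with_cause = sum(1 for p in problems if p.root_cause)
    with_workaround = sum(1 for p in problems if p.has_workaround)
    return {
        "problems_recorded": total,
        "avg_resolution_days": avg_days,
        "pct_root_cause_identified": 100 * with_cause / total if total else 0.0,
        "pct_with_workaround": 100 * with_workaround / total if total else 0.0,
    }


# Made-up sample data for one reporting period.
sample = [
    ProblemRecord("PRB-1", datetime(2025, 9, 1), datetime(2025, 9, 5),
                  "firmware incompatibility", True),
    ProblemRecord("PRB-2", datetime(2025, 9, 2), datetime(2025, 9, 4),
                  "application bug on older browsers", True),
    ProblemRecord("PRB-3", datetime(2025, 9, 10)),
    ProblemRecord("PRB-4", datetime(2025, 9, 12), has_workaround=True),
]
kpis = problem_kpis(sample)
print(kpis)
```

Repeat incidents and customer satisfaction are left out of the sketch because they depend on incident and survey data rather than on the problem records themselves.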

Known Error Database (KEDB)

This was my “aha” moment of the week, and one of its most fascinating discoveries. The KEDB works like the organization’s memory bank: a repository where every identified problem, its root cause, and its workaround are stored. Instead of starting from scratch every time a repeated incident occurs, Service Desk agents can check the KEDB and apply a known solution right away.

For example, we discussed a case where a bug in an HR application caused login failures when users opened it on older browsers. The root cause had already been identified, and the workaround was simple: advise users to switch to the latest browser version. Once this information was recorded in the KEDB, the Service Desk could respond in seconds instead of wasting time reanalyzing the same bug.
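The lookup described in the HR application example could be sketched roughly as follows. This is only an illustration: the `KnownError` structure, the ID `KE-001`, and the keyword-matching rule are assumptions, not how any particular KEDB product works.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    """Hypothetical KEDB entry; real tools use their own schemas."""
    error_id: str
    symptom_keywords: set[str]
    root_cause: str
    workaround: str


# Entry modelled on the HR application example; the ID and keywords are made up.
KEDB = [
    KnownError(
        error_id="KE-001",
        symptom_keywords={"hr", "login", "browser"},
        root_cause="HR application bug triggered by older browsers",
        workaround="Advise the user to switch to the latest browser version",
    ),
]


def find_known_errors(description: str, kedb: list[KnownError]) -> list[KnownError]:
    """Return entries whose keywords all appear in the incident description."""
    # Very naive tokenization: lowercase, drop commas and periods, split on spaces.
    words = set(description.lower().replace(",", " ").replace(".", " ").split())
    return [entry for entry in kedb if entry.symptom_keywords <= words]


matches = find_known_errors("User cannot login to HR app, old browser", KEDB)
for entry in matches:
    print(entry.error_id, "->", entry.workaround)
```

Exact keyword matching is deliberately simplistic here; the point is only that a recorded workaround can be retrieved from the incident description instead of being rediscovered.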

This is the KEDB entry that I created during the lab session:

[Image: KEDB entry from the lab session]

Workarounds as Survival Tools

One of the most important lessons I picked up this week was the role of workarounds. At first, I thought of them as “half-baked fixes,” something you do when you can’t solve the problem properly. But the more I studied, the more I realized that workarounds are actually survival tools. They buy time, reduce impact, and keep users productive while IT works on a permanent solution.

Take the example of the HR application bug that caused login failures on outdated browsers. The permanent fix required patching the application, which would take time. But the workaround, advising users to update to the latest browser, gave employees immediate access to the system. It didn’t remove the bug, but it restored productivity almost instantly.

In my group’s lab exercise, I saw this play out again with the VPN router connectivity problem. The root causes included firmware incompatibility and lack of redundancy. A permanent solution meant rolling back firmware versions and redesigning the network. But in the meantime, the workaround was to manually restart the router and redirect traffic through alternative access points. It wasn’t elegant, but it kept critical healthcare services like Telemedicine and EMR online.

These stories helped me see that workarounds are not failures; they are acts of resilience. They prove that in ITSM, success is not always about perfection, but about flexibility and pragmatism. A good workaround can make the difference between an organization grinding to a halt and one that keeps moving while engineers quietly prepare the permanent cure.

The Link Between Incident and Problem Management

| Aspect | Incident Management | Problem Management |
| --- | --- | --- |
| Focus | Restoring services as quickly as possible. | Finding the root cause of incidents to prevent recurrence. |
| Role | Process for restoring service fast. | Process for eliminating causes and preventing future incidents. |

Relationship: Frequent incidents can trigger problem management analysis. Results from problem management (such as workarounds or KEDB) are used by incident management to speed up service restoration.

At first, I thought Incident Management and Problem Management were two names for the same thing. But this week made me realize they are different roles in the same story. Incident Management is about speed: when a system goes down, the priority is to restore service as quickly as possible so users can keep working. Problem Management takes over afterward, asking why the failure happened and how to stop it from happening again. Together, they form a cycle: one solves the immediate pain, the other works to remove the cause.
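The idea that frequent incidents can trigger problem analysis can be sketched as a simple counting rule. The threshold of three and the category names below are assumptions chosen for illustration; in practice an organization would define its own trigger criteria.

```python
from collections import Counter

# Assumed rule for illustration: 3 or more incidents in the same category
# within the review period suggests opening a problem record.
RECURRENCE_THRESHOLD = 3


def problem_candidates(incident_categories: list[str],
                       threshold: int = RECURRENCE_THRESHOLD) -> list[str]:
    """Return categories that recur often enough to warrant problem analysis."""
    counts = Counter(incident_categories)
    return sorted(cat for cat, n in counts.items() if n >= threshold)


# Made-up incident log for one review period.
incidents = ["vpn", "hr-login", "vpn", "printer", "vpn", "hr-login"]
candidates = problem_candidates(incidents)
print(candidates)
```

With this sample log, only the category that appears three times crosses the threshold, while the one-off incident stays purely within Incident Management.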

Incident Management is a process: the structured set of steps designed to restore service as fast as possible. The Service Desk is a function: the team that carries out those steps, communicates with users, and makes sure the process feels human.


Reflection

Week 5 was the week ITSM began to feel like detective work. Incidents taught me to act fast, but Problem Management taught me to slow down, look deeper, and think long term. I learned that resilience isn’t just about restoring service, but about ensuring the same issue doesn’t haunt users again and again.

The lab especially left an impression on me. Cleaning messy data taught me discipline. Identifying patterns taught me focus. Building the Fishbone Diagram showed me how problems are systemic, not isolated. And writing the KEDB entry showed me how lessons, once captured, become assets for the future.

By the end of the week, I began to see Problem Management not just as a process, but as a mindset: always ask why, always look deeper, and always turn pain into learning.