A postmortem by ojeifodavid842.hashnode.dev

Issue Summary

Duration: 45 minutes
Start Time: November 8, 2023, 03:45 PM WAT
End Time: November 8, 2023, 04:30 PM WAT

Impact:
During the outage, the primary service affected was the web server, resulting in slow response times and intermittent connectivity issues. Approximately 20% of users experienced degraded performance and occasional service unavailability.

Timeline

  • 03:45 PM WAT:
    • Issue Detected: An engineer received user complaints about slow website performance.
  • 03:50 PM WAT:
    • Investigation Started: The team initiated an investigation, checking server logs and monitoring metrics.
  • 04:00 PM WAT:
    • Root Cause Assumption: Initial analysis pointed towards a potential firewall issue.
  • 04:10 PM WAT:
    • Escalation: As the issue persisted, the incident was escalated to the network and infrastructure team.
  • 04:15 PM WAT:
    • Misleading Paths: Initially, the investigation focused on server misconfigurations and application issues.
  • 04:20 PM WAT:
    • Further Escalation: With no improvement, the incident was escalated to senior system administrators.
  • 04:30 PM WAT:
    • Issue Resolved: The root cause, a misconfigured firewall blocking essential traffic, was identified and rectified.

Root Cause and Resolution

Root Cause:
The root cause was a misconfigured firewall rule that blocked incoming traffic on port 80, the well-known port assigned to HTTP (Hypertext Transfer Protocol) and the default port for unencrypted web traffic.

Resolution:
The firewall rule was corrected to allow incoming traffic on port 80. This immediately restored normal web server functionality.
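
A fix like this can be confirmed from outside the server by attempting a TCP connection to the port. The following is a minimal sketch, not the check the team actually ran; the function name `port_is_open` and its default timeout are assumptions made for illustration.

```python
import socket


def port_is_open(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration. A refused or timed-out connection
    is reported as False; this is what a firewall rule blocking the port
    typically looks like from the client side.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling `port_is_open("www.example.com", 80)` (a placeholder hostname) before and after the rule change would show whether inbound HTTP traffic is getting through. A successful connection only shows that the port is reachable, not that the web server is answering requests correctly.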

Corrective and Preventative Measures

Improvements/Fixes:

  1. Automated Firewall Audits: Use an automated tool to regularly audit and validate firewall rules. Puppet, a configuration management tool (CMT), can manage and enforce firewall configurations across multiple servers, keeping them consistent and compliant with predefined rules.

  2. Enhanced Monitoring: Strengthen monitoring on key services and ports to detect anomalies quickly. Datadog, a cloud-based monitoring and analytics platform, provides visibility into infrastructure, applications, and logs, and offers features such as anomaly detection and correlation.
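
The idea behind an automated firewall audit can be sketched in a few lines: compare the active rules against the ports the service must accept and report any that are blocked. The example below is an illustration only. The `Rule` structure, the first-match-wins evaluation, and the default-deny policy are assumptions, and it does not use the rule format of Puppet or of any particular firewall.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical inbound firewall rule (not a real firewall's format)."""
    action: str           # "allow" or "deny"
    protocol: str         # e.g. "tcp"
    port: Optional[int]   # None means the rule matches any port


def first_match(rules: List[Rule], protocol: str, port: int) -> Optional[Rule]:
    """Return the first rule matching protocol and port (first match wins)."""
    for rule in rules:
        if rule.protocol == protocol and rule.port in (None, port):
            return rule
    return None


def audit(rules: List[Rule],
          required: List[Tuple[str, int]],
          default_action: str = "deny") -> List[Tuple[str, int]]:
    """Return the required (protocol, port) pairs that are not allowed."""
    blocked = []
    for protocol, port in required:
        match = first_match(rules, protocol, port)
        # Fall back to the default policy when no rule matches.
        action = match.action if match else default_action
        if action != "allow":
            blocked.append((protocol, port))
    return blocked
```

With a rule set containing `Rule("deny", "tcp", 80)`, `audit(rules, [("tcp", 80), ("tcp", 443)])` would include `("tcp", 80)` in its result. Running a check like this on a schedule, or before a rule change is applied, could surface the kind of misconfiguration described in this postmortem.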

Tasks:

  1. Implement Firewall Change Management: Establish a process for reviewing and approving firewall rule changes.

  2. Documentation Update: Ensure comprehensive documentation for firewall configurations is maintained and up-to-date.

This incident highlights the critical importance of robust monitoring and swift response in maintaining the availability of web services. By automating routine tasks and enhancing monitoring capabilities, we can fortify our infrastructure against similar issues in the future.

This postmortem serves as a learning opportunity, emphasizing the need for proactive measures and collaborative problem-solving. By addressing the root cause, implementing the corrective measures above, and fostering a culture of continuous improvement, we aim to prevent similar incidents and improve the reliability and performance of our web stack.