Observability in Action: Case Study

    Observability in the cloud is revolutionizing how businesses monitor and maintain their infrastructure. One of our customers from the automobile industry was facing frequent performance issues with their application, causing dissatisfaction among their clients.

    As a managed service provider (MSP), our first priority when we took over the project was to set up robust observability tools. This allowed the platform's operations team to gain real-time insights into system performance, track detailed metrics, and correlate logs and traces to pinpoint the root cause of the issues without substantial cost overruns.

Typical Requirement

Pillars

Top-Level Terminologies

The Monitoring Layers

> Application Layer

    A typical scenario --

    [Figure: a typical application-layer scenario]
  • Request Tracing:
    >> OpenTelemetry -- The Savior:

    The OpenTelemetry project is an open-source framework that standardizes the way observability data (metrics, logs, and traces) is gathered, processed, and exported. It provides a vendor-agnostic pathway to nearly any back-end for insight and analysis.
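    To make request tracing concrete, here is a minimal sketch of how a trace ID can travel between services in the W3C Trace Context `traceparent` header, which OpenTelemetry uses by default for propagation. It uses only the Python standard library rather than the OpenTelemetry SDK, and `handle_request` is a hypothetical handler invented for illustration.

```python
import re
import secrets

# W3C Trace Context "traceparent" header: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: keep the trace ID, mint a new span ID."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Malformed header: start a fresh trace rather than fail the request.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def handle_request(headers: dict) -> dict:
    """Hypothetical service handler: returns headers for the downstream call."""
    incoming = headers.get("traceparent")
    outgoing = child_traceparent(incoming) if incoming else new_traceparent()
    return {"traceparent": outgoing}
```

    Because every hop keeps the same trace ID and only changes the span ID, a tracing back-end can stitch the spans from all services into one end-to-end request trace.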

    The diagram below is for illustration only and relates to OpenTelemetry.

    [Figure: OpenTelemetry overview]

    => Primary components of the OpenTelemetry collector

    These components can be linked together to create a logical, human-readable observability data pipeline within the collector’s configuration.
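    The pipeline idea can be sketched in a few lines of Python. The OpenTelemetry Collector documentation describes receivers, processors, and exporters as its main pipeline components; the toy model below only mimics that flow. The real collector is configured declaratively rather than coded like this, and every function and field name here is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

Record = dict  # one telemetry item (a span, a metric point, or a log line)


@dataclass
class Pipeline:
    """Toy model of a collector pipeline: receivers -> processors -> exporters."""

    receivers: List[Callable[[], Iterable[Record]]]
    processors: List[Callable[[List[Record]], List[Record]]]
    exporters: List[Callable[[List[Record]], None]]

    def run_once(self) -> None:
        # 1. Receivers bring data into the pipeline.
        batch = [record for receive in self.receivers for record in receive()]
        # 2. Processors transform the batch in the configured order.
        for process in self.processors:
            batch = process(batch)
        # 3. Every exporter gets the processed batch.
        for export in self.exporters:
            export(batch)


def app_receiver() -> Iterable[Record]:
    """Hypothetical receiver that yields two sample spans."""
    yield {"name": "GET /orders", "duration_ms": 120, "env": None}
    yield {"name": "GET /health", "duration_ms": 2, "env": None}


def drop_health_checks(batch: List[Record]) -> List[Record]:
    return [r for r in batch if r["name"] != "GET /health"]


def add_environment(batch: List[Record]) -> List[Record]:
    return [{**r, "env": "prod"} for r in batch]


exported: List[Record] = []  # stands in for a real back-end

pipeline = Pipeline(
    receivers=[app_receiver],
    processors=[drop_health_checks, add_environment],
    exporters=[exported.extend],
)
pipeline.run_once()
```

    After `run_once()`, `exported` holds only the `/orders` span, tagged with the environment. Swapping the exporter changes the destination without touching the receivers or processors, which is the point of the vendor-agnostic design.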

    ---

    => Types of Implementation

    ----

    > Infrastructure Layer

    The approach and configuration for logs and metrics overlap between the application and infrastructure layers. Log and metrics collection is implemented with agents such as Fluent Bit or Logstash, which collect logs and export them to a centralized logging solution.
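    Agents such as Fluent Bit or Logstash are easier to work with when the application emits structured logs. As a sketch, assuming the service writes one JSON object per line to stdout and the agent picks it up from there, Python's standard `logging` module is enough. The `service` field and the logger names are hypothetical.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # "service" is a hypothetical field supplied via `extra=`.
            "service": getattr(record, "service", "unknown"),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep the root logger from re-printing the line
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


logger = build_logger("orders")
logger.info("order created", extra={"service": "orders-api"})
```

    Emitting JSON at the source means the agent does not need fragile regular expressions to parse the line, and fields such as `level` and `service` become directly searchable in the centralized store.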

    [Figure: infrastructure layer]

    >> Logs / Metrics

  • Microservices:
  • Monolithic:
  • Functions:
    >> Security Monitoring

    [Figure: SIEM]

    We can also configure security monitoring and alerting to aggregate and analyze telemetry in real time for threat detection and compliance. Collect event data from varied sources such as endpoints, network devices, cloud workloads, and applications for broader security coverage. This includes:

    ----

    > Tools Comparison

    1. Tools Comparison (Importance to Tracing)

    Depending on the use case, other available tools are also an option. OpenTelemetry is still the preferred choice, especially for tracing, because it is open source and has robust community support.

    [Figure: tracing tools comparison]

    2. Tools Comparison for Logging

    [Figure: logging tools comparison]

    -- The choice is yours:

    [Figures: matrix, choice]

    3. Tools for Metrics

      Many tools and vendors are available on the market. If you want an open-source option that still covers most capabilities, I would recommend Grafana with Prometheus, a mature and flexible combination.
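    To show what Prometheus actually scrapes, here is a minimal sketch of a counter rendered in the Prometheus text exposition format, written with the standard library only. In a real service you would more likely use an official client library; the `Counter` class and the metric name below are illustrative assumptions, and label values are not escaped.

```python
from collections import defaultdict
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


class Counter:
    """Minimal labelled counter rendered in Prometheus text format."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values: Dict[LabelSet, float] = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        # Sort labels so the same label set always maps to the same key.
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self) -> str:
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for labels, value in sorted(self._values.items()):
            # Note: label values are not escaped in this sketch.
            label_str = ",".join(f'{key}="{val}"' for key, val in labels)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value:g}")
        return "\n".join(lines) + "\n"


requests_total = Counter("http_requests_total", "Total HTTP requests handled.")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="POST", status="500")
print(requests_total.render())
```

    Prometheus would scrape this text from an HTTP endpoint at a fixed interval, and Grafana would then query Prometheus to chart, for example, the per-status request rate.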

    > Dashboard / Alerting / KPIs

    > Alerts

    [Figure: alerts]

    The KPIs and alerts have to be defined based on the requirements.
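    As an illustration of turning requirements into alerts, the sketch below computes two common KPIs over a window of requests, error rate and 95th-percentile latency, and checks them against thresholds. The thresholds, alert names, and the `Request` type are placeholders, not values from this project.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    duration_ms: float
    status: int


def error_rate(requests: List[Request]) -> float:
    """Fraction of requests that ended in a 5xx status."""
    if not requests:
        return 0.0
    return sum(r.status >= 500 for r in requests) / len(requests)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate_alerts(
    requests: List[Request],
    max_error_rate: float = 0.05,  # placeholder threshold
    max_p95_ms: float = 500.0,     # placeholder threshold
) -> List[str]:
    """Return the names of the alerts that should fire for this window."""
    alerts = []
    if error_rate(requests) > max_error_rate:
        alerts.append("HighErrorRate")
    if requests and percentile([r.duration_ms for r in requests], 95) > max_p95_ms:
        alerts.append("HighLatencyP95")
    return alerts


window = [Request(120, 200)] * 18 + [Request(900, 500)] * 2
print(evaluate_alerts(window))
```

    In practice these rules would live in the alerting tool rather than in application code, but writing them out this way helps agree on the KPI definitions and thresholds with the application team first.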

    [Figure: incident]

    > Dashboards

    The log dashboards will be created with input from the application team.

    > The Advantages:

    With OpenTelemetry, we can achieve

    ----