New｜CanWay BlueWhale Full-Stack Observability Center V5.0: Smarter Detection, Deeper Visibility, Closed-Loop Operations

As digital transformation deepens, enterprise IT systems are facing unprecedented complexity challenges — the proliferation of microservice architectures, multi-cloud hybrid deployments, and the deep coupling of business and technology have made traditional monitoring methods increasingly inadequate for handling high-frequency, hidden, and cross-domain failure scenarios. Shifting from "passive response" to "proactive prediction" and from "fragmented observation" to "holistic control" has become a core imperative for ensuring business continuity.

CanWay BlueWhale Full-Stack Observability Center pursues the mission of "making operations more intelligent and business more stable" and continuously refines its product capabilities. The V5.0 release is a comprehensive upgrade in three directions: intelligence-driven detection, precise sensing, and closed-loop resolution. It delivers an intelligent observability closed loop covering the entire chain of "collection → detection → analysis → resolution → optimization", helping enterprises build a future-ready observability system.

Release Overview

CanWay BlueWhale Full-Stack Observability Center V5.0 (King Eye) introduces five major capability enhancements, delivering smarter detection, richer contextual insights, more integrated operational workflows, and more granular health assessment.

PART 01

Full-Stack Visualization

1.1 Business Architecture Topology Overview

Have you ever had that moment of frustration?

You open a business topology, and it feels like unfolding a treasure map — nodes are densely scattered across the screen, service names are truncated to “…”, and IPs are hidden in corners. Trying to find an abnormal node? You have to rely on scrolling, zooming, and sheer luck.

Observability Center V5.0 puts an end to this inefficient experience once and for all!

Intelligent Hierarchical Layout: Displays the architecture structure of all monitored objects under a business across eight layers (Applications, Services, Components, Infrastructure Resources, Hardware Resources, etc.), presenting the complete dependency chain in layers. This turns the topology into a navigation guide for business stability, shifting from fragmented troubleshooting to panoramic insight.

Multi-Dimensional Node Aggregation: Faced with a massive number of nodes, the topology provides three aggregation strategies (Custom Aggregation, CMDB Aggregation, Cluster Aggregation) to easily manage complex architectures. Clicking on any aggregated node instantly expands its child nodes into a high-density honeycomb array, increasing space utilization by 300%, with key information clear at a glance!

1.2 The "Natural Barrier" for Heterogeneous Environments

In previous business panoramic topology views, cluster nodes automatically discovered by APM were often mixed at the same level as CMDB instances such as databases and middleware. Horizontal connections crisscrossed across multiple columns, making it difficult to identify relationships between service call chains and monitoring data sources at a glance.

Observability Center V5.0 restructures the topology hierarchy: cluster nodes discovered by APM are displayed independently, and connection directions are vertically aligned to present the data flow of "Instance → Component Cluster → Service → Application". This allows rapid upward tracing from a faulty instance to its associated APM service and business application when an anomaly occurs, making the root cause path clearly visible.
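The upward tracing described above can be pictured as a walk over parent links. The following is a minimal sketch, assuming each node records a single parent; the node names and the `PARENT_OF` mapping are hypothetical and do not reflect the product's data model.

```python
def trace_upward(node: str, parent_of: dict[str, str]) -> list[str]:
    """Follow parent links from a faulty node up to the top of the hierarchy."""
    path = [node]
    # Stop at the root, or if a parent was already visited (guards against cycles).
    while node in parent_of and parent_of[node] not in path:
        node = parent_of[node]
        path.append(node)
    return path


# Hypothetical hierarchy: Instance -> Component Cluster -> Service -> Application
PARENT_OF = {
    "mysql-instance-01": "mysql-cluster",
    "mysql-cluster": "order-service",
    "order-service": "shop-app",
}

print(" -> ".join(trace_upward("mysql-instance-01", PARENT_OF)))
```

Real topologies usually allow several parents per node; a breadth-first search over an adjacency list would be the natural extension of this sketch.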

1.3 Expand Business Observability Coverage

You no longer need to switch between multiple systems to verify — when user access goes wrong, it is now displayed in the most prominent position of the business panorama!

"Website Service" is incorporated into the business panorama as an independent resource model. For websites configured with HTTP/HTTPS dial tests and host-level TCP/UDP/ICMP dial tests, their overviews and alerts are directly rendered on the nodes. Dial test alerts are no longer isolated but are directly embedded into the business context.

PART 02

Monitoring: Multi-Dimensional Fusion, Accurately Defining Anomalies

2.1 Multi-Metric Combination

Have you ever been jolted awake in the middle of the night by an alert like this? — Disk usage 95%, only to rush over and find 500GB still free, the business running just fine. This is not a failure of monitoring, but a limitation of single‑metric thresholds in complex business scenarios.

The detection engine in Observability Center V5.0 moves from "single-point thresholds" to "business state". It now supports detection on logical combinations of multiple metrics, so a composite alert fires only when the combined conditions reflect a real problem, for example high disk usage together with low absolute free space. This reduces both false positives and false negatives and makes anomaly criteria more comprehensive.
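To make the idea concrete, here is a minimal sketch of the disk scenario above. It assumes two metrics per volume and an AND combination; the class, function names, and thresholds are illustrative assumptions, not the product's rule syntax.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    usage_percent: float  # share of the volume in use, 0-100
    free_gb: float        # absolute free space in GB


def single_metric_alert(sample: DiskSample, usage_threshold: float = 90.0) -> bool:
    """Classic single threshold: fires on usage alone."""
    return sample.usage_percent >= usage_threshold


def combined_alert(sample: DiskSample, usage_threshold: float = 90.0,
                   min_free_gb: float = 50.0) -> bool:
    """Composite condition: high usage AND little absolute space left."""
    return sample.usage_percent >= usage_threshold and sample.free_gb < min_free_gb


large_volume = DiskSample(usage_percent=95.0, free_gb=500.0)
small_volume = DiskSample(usage_percent=95.0, free_gb=12.0)

print(single_metric_alert(large_volume), combined_alert(large_volume))
print(single_metric_alert(small_volume), combined_alert(small_volume))
```

The large volume trips the single threshold but not the composite rule, which is the night-time false alarm the section describes.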

2.2 Enhanced Hardware Monitoring Capabilities

In today’s rapidly advancing era of cloud-native and hyperconverged infrastructures, traditional protocols like IPMI and SNMP have reached their limits – they allow you to "see the device" but fail to reveal its true health. Basic metrics such as temperature and fan speed are easily accessible, yet critical risks – aging RAID cache batteries, degrading disk arrays, or latent storage controller failures – remain hidden until a service outage occurs.

Observability Center V5.0 introduces native support for two next-generation hardware management standards – Redfish and SMI-S – elevating infrastructure monitoring from superficial "body temperature" checks to in-depth "organ-level" diagnostics:

Redfish: Comprehensive Visibility for Server Hardware

Built on a RESTful API + JSON architecture, Redfish is naturally suited for DevOps and cloud-native environments. It fully supports mainstream servers released after 2015, such as Dell iDRAC9+ and HPE iLO5+. With Redfish, you can monitor the health status of storage controllers, predictive failure of disk drives, and operational anomalies of RAID cards in real time. This enables full visualization and end-to-end traceability of server hardware status, greatly improving operational efficiency.
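Because Redfish resources are JSON documents that carry a `Status` object with a `Health` value, component health can be read with ordinary JSON handling. The sketch below assumes a payload that has already been fetched; the sample document is hand-written for illustration and is not output from a real BMC, and `find_unhealthy` is a hypothetical helper.

```python
from typing import Any, Iterator


def find_unhealthy(resource: Any, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (json_path, health) for every nested object whose Health is not OK."""
    if isinstance(resource, dict):
        status = resource.get("Status")
        if isinstance(status, dict):
            health = status.get("Health")
            if health not in (None, "OK"):
                yield (path or "/", health)
        for key, value in resource.items():
            if key != "Status":
                yield from find_unhealthy(value, f"{path}/{key}")
    elif isinstance(resource, list):
        for index, item in enumerate(resource):
            yield from find_unhealthy(item, f"{path}/{index}")


# Illustrative payload only; real resources contain many more properties.
sample = {
    "Id": "1",
    "Status": {"Health": "OK", "State": "Enabled"},
    "StorageControllers": [
        {"Name": "RAID Controller", "Status": {"Health": "Warning", "State": "Enabled"}}
    ],
    "Drives": [
        {"Name": "Disk 0", "Status": {"Health": "OK"}},
        {"Name": "Disk 1", "Status": {"Health": "Critical"}},
    ],
}

problems = list(find_unhealthy(sample))
print(problems)
```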

SMI-S Protocol Support: The Common Language for Storage Devices

Strictly conforming to the SNIA international standard, SMI-S allows a single standardized interface to manage multi‑vendor storage devices, including Dell EMC and Huawei OceanStor. It effectively breaks down vendor‑specific barriers, transforming storage operations from “experience‑driven” to “data‑driven”, reducing both complexity and cost.

Redfish + SMI‑S + SNMP + IPMI: Deep integration of the four protocols – fully upgrading hardware observability capabilities, opening a new era of intelligent hardware monitoring, and building a solid security foundation for infrastructure operations.

2.3 Blackbox Monitoring

Built on probing capabilities based on the Prometheus Blackbox Exporter, Blackbox Monitoring supports TCP and ICMP probes. It actively checks the availability of web services, API endpoints, middleware, and other components from an external perspective, closing a blind spot of internal availability monitoring. Hidden failures such as hung services, firewall blocking, or network isolation, where a service looks normal internally but is unusable, are detected as soon as they occur, helping ensure business continuity from the user's perspective.
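The idea behind a TCP probe is simple: try to open a connection from outside and record whether it succeeded and how long it took. The sketch below illustrates that idea with the Python standard library; it is not the Blackbox Exporter's implementation, and `tcp_probe` and its result keys are hypothetical names.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> dict:
    """Attempt a TCP connection and report success plus elapsed time."""
    started = time.monotonic()
    try:
        # create_connection raises OSError on refusal, timeout, or unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            success = True
    except OSError:
        success = False
    return {"success": success, "duration_seconds": time.monotonic() - started}


if __name__ == "__main__":
    # Hypothetical target; replace with an endpoint you are allowed to probe.
    print(tcp_probe("127.0.0.1", 8080, timeout=1.0))
```

A process that is running but no longer accepting connections fails this check even when host-level metrics look healthy, which is the kind of hidden failure the section refers to.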

2.4 Intelligent Activation

Detection policies now support multiple time windows synchronized with calendars — you can flexibly configure different trading sessions (e.g., morning, afternoon, night sessions) and link them to custom calendars (such as the Shanghai/Shenzhen stock exchange calendar). Monitoring is precisely enabled on trading days and automatically enters silent mode on non‑trading days. As a result, alerts are no longer distracting — they only speak when it matters, enabling the monitoring system to truly understand the pulse of high‑sensitivity businesses like securities and futures.
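A minimal sketch of calendar-linked activation follows. The session times, the holiday set, and the function name are illustrative assumptions; a real deployment would load them from the custom calendar configured in the product.

```python
from datetime import date, datetime, time

# Hypothetical trading sessions (start inclusive, end exclusive).
TRADING_SESSIONS = [(time(9, 30), time(11, 30)), (time(13, 0), time(15, 0))]

# Hypothetical non-trading days taken from a custom calendar.
NON_TRADING_DAYS = {date(2025, 10, 1)}


def is_detection_active(moment: datetime) -> bool:
    """Detection runs only inside a session on a trading day."""
    if moment.weekday() >= 5 or moment.date() in NON_TRADING_DAYS:
        return False  # weekend or calendar holiday: stay silent
    return any(start <= moment.time() < end for start, end in TRADING_SESSIONS)


print(is_detection_active(datetime(2025, 9, 30, 10, 0)))  # inside the morning session
print(is_detection_active(datetime(2025, 10, 1, 10, 0)))  # listed as a non-trading day
```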

2.5 Out-of-the-Box

No configuration required, no need to import templates — built-in Kubernetes core resource monitoring views: from CPU/memory/network usage at the individual Pod level, to Node resource utilization and eviction risks, and up to Cluster-level resource scheduling health. Get a full-stack view of operational status at a glance, moving operations from “jigsaw‑style troubleshooting” to “panoramic insight”.

PART 03

Alerting: Intelligent Correlation, Closed-Loop Driven Resolution

3.1 Intelligent Decision Support with Change Context and Knowledge Integration

The alert details page allows one-click access to the associated "Change Ticket + Knowledge Base", breaking down information silos. Every alert is automatically enriched with dual intelligent context, turning each resolution into a basis for preventing future incidents.

Customer Value:

Accelerated Root Cause Analysis: Change-related fault localization is reduced from hours to minutes.
Reduced Repetitive Handling: Resolution efficiency for similar issues increases by 60%, avoiding repeated mistakes.
Strengthened Change Closure: Fault verification reversely validates change quality, driving continuous improvement in operations.

3.2 Alerts with Business Topology Context

Alert details are automatically enriched with the dependency links and business context of the alert object from CMDB, including its upstream dependencies, downstream impact, associated business system, operations owner, environment tags, and other key attributes. Static configuration data thus becomes dynamic diagnostic clues. Each alert is no longer isolated but comes with a "built-in business topology map", helping operations teams locate the root cause within seconds and shift from "firefighting-style response" to "precise, governance-oriented resolution".

3.3 Intelligent Alert Aggregation

When an incident occurs, what operations teams fear most is not the alerts themselves, but being overwhelmed by an alert storm — hundreds of similar alerts flooding the screen, diluting critical information, and wasting precious remediation time as they sift through the chaos.

The Observability Center V5.0 introduces intelligent alert aggregation that restores order to massive alert streams:

Precise Grouping: Supports intelligent grouping based on any field such as alert object, CMDB business, cluster, cloud region, etc. Similar alerts are precisely aggregated, while differing details are retained intact — avoiding the loss of focus caused by “mixed bundling”, and clearly revealing the scope of impact.
Timely Triggering: High‑priority incidents trigger immediate aggregation, delivering notifications within seconds to seize the golden time for remediation. Low‑priority anomalies use periodic aggregation, with complete information collected and pushed as a unified summary to avoid fragmented interruptions. The balance between timeliness and completeness is defined by your business scenarios.
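The grouping step can be sketched as bucketing alerts by the values of the chosen fields while keeping each original alert intact inside its bucket. The field names and the `aggregate_alerts` helper below are assumptions for illustration only.

```python
from collections import defaultdict


def aggregate_alerts(alerts: list[dict], group_by: tuple[str, ...]) -> dict:
    """Group alerts by the selected fields; details of each alert are preserved."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(field) for field in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"business": "payments", "cluster": "k8s-a", "object": "pod-1", "message": "CPU high"},
    {"business": "payments", "cluster": "k8s-a", "object": "pod-2", "message": "CPU high"},
    {"business": "reports", "cluster": "k8s-b", "object": "pod-9", "message": "Disk full"},
]

groups = aggregate_alerts(alerts, group_by=("business", "cluster"))
for key, members in groups.items():
    print(key, [a["object"] for a in members])
```

Immediate versus periodic triggering would sit on top of this: high-priority groups are flushed at once, while low-priority groups are collected and sent as one summary per interval.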

3.4 Alert Suppression Preview

In complex IT environments, alert suppression policies are a critical tool for reducing noise and focusing on key issues. However, traditional suppression rules — effective as soon as they are configured — often lead to critical alerts being accidentally suppressed due to misconfiguration, or require repeated tuning to achieve the desired effect.

To address this, Observability Center V5.0 introduces the Alert Suppression Preview capability — before a policy takes effect, it simulates the suppression results in real time, allowing you to clearly see “which alerts will be suppressed and which will still trigger”. This greatly improves configuration accuracy and operational confidence.
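A preview of this kind amounts to a dry run: apply the draft rule to recent alerts without changing anything, and report both outcomes. The sketch below assumes a rule is a set of exact field matches; the rule format and names are hypothetical.

```python
def preview_suppression(rule: dict, alerts: list[dict]) -> dict:
    """Dry-run a suppression rule and list which alert IDs it would silence."""
    suppressed, still_firing = [], []
    for alert in alerts:
        # An alert is suppressed only if every field in the rule matches exactly.
        matches = all(alert.get(field) == value for field, value in rule.items())
        (suppressed if matches else still_firing).append(alert["id"])
    return {"suppressed": suppressed, "still_firing": still_firing}


recent_alerts = [
    {"id": "a1", "env": "test", "severity": "warning"},
    {"id": "a2", "env": "test", "severity": "critical"},
    {"id": "a3", "env": "prod", "severity": "warning"},
]

draft_rule = {"env": "test", "severity": "warning"}
print(preview_suppression(draft_rule, recent_alerts))
```

Seeing `a2` in the "still firing" list before saving the rule is what guards against silencing a critical alert by accident.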

3.5 Online Debugging for Alert Enrichment

Have you ever been in this situation?

You write an alert enrichment rule, save and deploy it with high expectations — only to find that key fields in the triggered alert are still empty. Is it an expression error? CMDB data not synchronized? Or a matching logic issue? You have to delete, retry, repeatedly trigger, dig through logs… A single debugging session takes half a day.

Observability Center V5.0 introduces the brand‑new Online Debugging for Alert Enrichment capability, ensuring every alert configuration is accurate. It supports field‑level before/after comparison and real‑time effect preview, delivering an efficient configuration experience of “what you configure is what you see, what you see is what you get”. Every step of the enrichment logic is clearly visible, key fields are perfectly aligned, and differences are highlighted.
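The field-level comparison can be pictured as a diff between the alert before and after enrichment. This is a minimal sketch under assumed names: `enrich`, `field_diff`, and the CMDB lookup table are hypothetical stand-ins for the product's rule engine.

```python
def enrich(alert: dict, cmdb: dict) -> dict:
    """Copy owner and business from the CMDB record of the alert's host."""
    record = cmdb.get(alert.get("host"), {})
    enriched = dict(alert)
    enriched["owner"] = record.get("owner")
    enriched["business"] = record.get("business")
    return enriched


def field_diff(before: dict, after: dict) -> dict:
    """Return {field: (old, new)} for every field whose value changed."""
    changes = {}
    for field in sorted(set(before) | set(after)):
        if before.get(field) != after.get(field):
            changes[field] = (before.get(field), after.get(field))
    return changes


cmdb = {"db-01": {"owner": "alice", "business": "payments"}}

matched = {"host": "db-01", "message": "replication lag"}
print(field_diff(matched, enrich(matched, cmdb)))

unmatched = {"host": "db-99", "message": "replication lag"}
print(field_diff(unmatched, enrich(unmatched, cmdb)))  # empty diff: nothing was filled in
```

An empty diff for the second alert points straight at the cause, a host missing from the CMDB, without redeploying the rule and waiting for a real alert.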

PART 04

APM: Health Visibility, Service Autonomy

4.1 Intelligent API Governance: More Accurate Health Scores, Clearer Service Visibility

In complex business scenarios, certain special endpoints — such as high-frequency health checks, load testing entry points, or debugging endpoints — can severely distort the overall health score of an application if they are included in global calculations. Meanwhile, large numbers of dynamic paths (e.g., /api/user/123, /api/user/456) actually belong to the same logical API, leading to metric fragmentation and increased alert noise. To address this, we provide business‑semantic‑oriented API governance capabilities that solve two core problems with one click:

1. Accurate interference endpoint suppression: Suppress span data for specified endpoints, excluding them from health score calculation, preventing their high frequency or abnormal fluctuations from dragging down the overall health score.

2. Intelligent aggregation of multiple API paths into a single logical API: The health score is calculated based on the business‑logic unit, no longer disrupted by ID explosion. This not only reflects true service performance but also allows alert policies to apply to the “logical API”, significantly reducing noise.

This is not simply “hiding data” – it is a deep calibration of the observability system.

For operations: Health score = real business availability, no longer misled by “noise endpoints”.
For SREs: Metric aggregation = business‑semantic view, saying goodbye to fragmented dashboards.
For products: Fault localization = focus on core paths within seconds, improving MTTR.
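Both governance steps from section 4.1 can be sketched in a few lines: drop spans from suppressed endpoints, then collapse numeric path segments so that `/api/user/123` and `/api/user/456` count as one logical API. The suppression list, the `{id}` placeholder, and the numeric-only rule are assumptions; the product's actual matching rules are not described here.

```python
import re
from collections import Counter

# Hypothetical endpoints excluded from health calculation.
SUPPRESSED_ENDPOINTS = {"/healthz"}

# A path segment made only of digits, followed by "/" or the end of the path.
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def to_logical_api(path: str) -> str:
    """Replace numeric path segments with a placeholder."""
    return _NUMERIC_SEGMENT.sub("/{id}", path)


def logical_call_counts(paths: list[str]) -> Counter:
    """Count calls per logical API, skipping suppressed endpoints."""
    counts = Counter()
    for path in paths:
        if path in SUPPRESSED_ENDPOINTS:
            continue
        counts[to_logical_api(path)] += 1
    return counts


observed = ["/api/user/123", "/api/user/456", "/healthz", "/healthz", "/api/orders"]
print(logical_call_counts(observed))
```

Non-numeric identifiers such as UUIDs would need additional patterns or explicit route templates.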

4.2 Health Scoring System Upgrade

The health scoring system has been upgraded from a “unified weight, unified standard” model to a configurable, service‑level weight and differentiated health criteria model. Operations teams can now flexibly configure the health weights of individual microservices within an application based on business criticality, and independently define health criteria for each service. The system then dynamically calculates a weighted overall application health score based on each service’s real‑time health status, custom health criteria, and its assigned weight — a critical service anomaly will significantly impact the total score, while fluctuations in non‑critical services have a controlled effect. This makes the assessment results more closely reflect actual business conditions, providing fine‑grained, scenario‑driven data support for capacity planning, resource scheduling, and fault priority assessment.
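One simple way to picture service-level weighting is a weighted mean of per-service scores. The exact formula used by the product is not stated, so the calculation below is an assumption, as are the service names, weights, and scores.

```python
def application_health(services: list[dict]) -> float:
    """Weighted mean of service scores; each service has 'score' and 'weight'."""
    total_weight = sum(s["weight"] for s in services)
    if total_weight <= 0:
        raise ValueError("at least one service must have a positive weight")
    return sum(s["score"] * s["weight"] for s in services) / total_weight


# Critical service degraded, non-critical service healthy.
critical_down = [
    {"name": "payment", "weight": 5, "score": 40},
    {"name": "report", "weight": 1, "score": 100},
]

# Critical service healthy, non-critical service degraded.
minor_down = [
    {"name": "payment", "weight": 5, "score": 100},
    {"name": "report", "weight": 1, "score": 40},
]

print(application_health(critical_down))  # (40*5 + 100*1) / 6
print(application_health(minor_down))     # (100*5 + 40*1) / 6
```

The same 60-point drop moves the application score far more when it hits the heavily weighted service, which matches the behaviour described above.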

4.3 Trace/Span Visualization Insight

The Trace/Span search page has been enhanced with three new core charts:

Request Volume Chart: Displays real‑time trends of service/API call counts.
Error Count Chart: Visualizes the distribution of abnormal requests, quickly pinpointing incident periods.
Latency Heatmap: Uses color intensity to show response time distribution across different time periods and APIs, making performance bottlenecks instantly identifiable.

These enhancements help users quickly establish a global understanding from massive amounts of span data, upgrading call chain analysis from a “text list” to “visual insight”.

4.4 Fine-Grained Trace Data Management

To meet the differentiated needs of enterprises regarding APM data storage cost, performance, and high availability, we have introduced application‑level storage configuration capabilities. These support per‑application storage cluster isolation, allowing Trace data of an APM application to be written to a dedicated Elasticsearch cluster for resource isolation and cost optimization, with flexible configuration of index shards and replicas.
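As a rough picture of what a per-application storage configuration might contain, the sketch below pairs a target cluster with index settings. `number_of_shards` and `number_of_replicas` are standard Elasticsearch index settings; the cluster names, application names, and the surrounding structure are hypothetical and do not represent the product's configuration format.

```python
def build_index_settings(shards: int, replicas: int) -> dict:
    """Build an Elasticsearch index settings body with basic validation."""
    if shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {"settings": {"index": {"number_of_shards": shards,
                                   "number_of_replicas": replicas}}}


# Hypothetical mapping from APM application to storage target.
APP_STORAGE = {
    "order-app": {"cluster": "es-dedicated-orders", **build_index_settings(3, 1)},
    "internal-tool": {"cluster": "es-shared", **build_index_settings(1, 0)},
}

print(APP_STORAGE["order-app"])
```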

PART 05

Logging: Reliable Archiving, Insight That Prevents

5.1 Log Archival Retry

In complex production environments, transient issues such as network flapping, temporary storage unavailability, or intermittent permission anomalies can cause log archival tasks to fail, leading to the loss of critical audit or troubleshooting data — severely impacting post‑incident analysis and compliance traceability.

We have introduced a smart log archival retry mechanism: when archival fails, it automatically retries multiple times at the collection item level and records detailed failure reasons. This lets log data reach the archive despite transient faults, strengthening the observability data foundation for log reliability and data governance.

Operations: No more manually patching logs at 3 AM — bid farewell to "data black hole" anxiety.
Security: 100% traceability of critical operational logs — audit inspections handled with ease.
Business: Post-incident reviews backed by solid evidence, with MTTR cut in half.
Enterprises: Strengthening the observability data foundation, making "data‑driven operations" a reality.
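The retry behaviour described in 5.1 can be sketched as a bounded loop that records the reason for each failed attempt. The exponential backoff, the attempt limit, and every name below are assumptions for illustration; the product's actual retry policy is not specified.

```python
import time


def archive_with_retry(archive_once, item_id: str, max_attempts: int = 3,
                       base_delay: float = 1.0, sleep=time.sleep) -> dict:
    """Retry one collection item's archival, recording why each attempt failed."""
    failures = []
    for attempt in range(1, max_attempts + 1):
        try:
            archive_once(item_id)
            return {"item": item_id, "archived": True,
                    "attempts": attempt, "failures": failures}
        except Exception as exc:  # transient network, storage, or permission errors
            failures.append(f"attempt {attempt}: {exc}")
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return {"item": item_id, "archived": False,
            "attempts": max_attempts, "failures": failures}


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_archive(item_id: str) -> None:
        # Hypothetical archiver that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("storage temporarily unavailable")

    print(archive_with_retry(flaky_archive, "nginx-access", base_delay=0.01))
```

The `sleep` parameter is injectable so that the delay can be replaced in tests.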

5.2 One-Click Monitoring Policy from Log Search

When you discover a critical anomaly on the log search page, there is no need to switch to the configuration interface — simply click “Add as Monitoring Policy” to instantly convert the current search keyword into a monitoring policy, turning every log insight into a sentinel that prevents future incidents. This shifts the paradigm from “post‑mortem troubleshooting” to “proactive prevention”. Logs are no longer just historical records — they become intelligent real‑time alert sentinels.

5.3 Unified Search Across Log Topics

When an incident occurs, the most time‑consuming part is often not the repair itself, but the traceback — clues in real‑time logs break off, historical archives hold data dormant, and operations teams are forced to switch back and forth between “present” and “past”, while the root cause quietly hides in the time gap.

Observability Center V5.0 introduces Log Topic Federated Search, breaking down time barriers:

A single search simultaneously covers both non‑archived real‑time logs and already‑archived historical logs.
No need to switch data sources or remember archival time points — seamlessly correlate from minutes ago to months ago.
The full anomaly propagation chain is presented, tracing back from the “alert at this moment” to the “first signs hours earlier”.
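Conceptually, a federated search runs the same query against both stores and merges the hits into one timeline. The sketch below assumes each store yields `(timestamp, message)` pairs already sorted by time; the function name, the labels, and the in-memory lists are illustrative stand-ins for real log backends.

```python
import heapq


def federated_search(keyword: str, realtime_logs: list, archived_logs: list) -> list:
    """Search both sources and return matching lines in one time-ordered list."""
    def matches(source, label):
        for timestamp, message in source:
            if keyword in message:
                yield (timestamp, label, message)

    # heapq.merge combines already-sorted streams without loading or re-sorting them.
    return list(heapq.merge(matches(archived_logs, "archive"),
                            matches(realtime_logs, "realtime")))


archived = [(10, "timeout warming cache"), (20, "service started")]
realtime = [(100, "timeout calling db"), (105, "request ok")]

for hit in federated_search("timeout", realtime, archived):
    print(hit)
```

The early archived hit at timestamp 10 appears on the same timeline as the current one at timestamp 100, which is the "first signs hours earlier" view described above.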

Conclusion

CanWay BlueWhale Full-Stack Observability Center V5.0 is anchored in business value and powered by intelligent technology, helping enterprises:

Shift from “passive firefighting” to “active foresight”
Elevate from “data stacking” to “decision insights”
Evolve from “operations tools” into “business growth partners”

CanWay BlueWhale will continue to deepen the integration of observability, AIOps, and business operations, driving the deep application of large language models in scenarios such as root cause reasoning, incident management, and anomaly prediction, so that observability systems can deliver tangible momentum for enterprise digital transformation.
