Ten years ago, CanWay redefined enterprise IT O&M management with its industry-leading "PaaS+SaaS" DevOps framework. Today, an even more profound transformation is underway: the PaaS architecture has evolved into an agent-based platform, with AI agents powering new productivity.
About ten years ago, CanWay pioneered a new model of O&M development through its PaaS platform and SaaS products, enhancing the productivity of enterprise IT O&M organizations. Since then, integration and platformization have become our core strengths and the technical foundation of our work in this field.
Today, we are witnessing another historic moment where IT O&M applications flourish. But this time, what's emerging is not traditional O&M software, but AI agents—a new software paradigm capable of perceiving their environment, reasoning autonomously, driving tools, and completing multi-step tasks.

Figure 1: From PaaS software to intelligent SaaS
This transformation is far more than a partial efficiency improvement brought by knowledge Q&A or process automation tools. It truly points to another restructuring of the production relations within O&M organizations: evolving from "human-driven tools" to "agents autonomously completing tasks", with human value shifting to strategy formulation, boundary control and exception decision-making.
However, this path is not smooth in enterprise practice. The following are the experiences and insights we have accumulated from promoting the implementation of AgentOps.
I. Industry Challenges: Why Is Implementing AgentOps Difficult?
Over the past two years, almost every IT O&M team has attempted to apply AI in IT O&M scenarios. There have been numerous Demos and POCs, but very few cases have achieved large-scale implementation in production, delivering quantifiable value. Where does the problem lie? We have summarized the core difficulties as follows:
1.1 Value Cannot Be Measured
When enterprises promote AgentOps, the first major obstacle they encounter is the lack of visible value. How much has AI improved efficiency? How much virtual manpower has been added? How much manpower has been saved? How many failures have been avoided? These questions cannot be answered during project approval. Without credible ROI data, budget requests lack persuasiveness, and scenarios remain stuck in the demo phase instead of being deployed to production. Yet without investment, it is impossible to accumulate enough data and scenarios to prove value. This chicken-and-egg dilemma is the most common barrier to AgentOps implementation.
1.2 Insufficient Basic Data and Tool Capabilities
The upper limit of an AI agent’s performance largely depends on the quality of accessible data and tools. However, many enterprises have IT O&M data scattered across multiple platforms with low standardization. MCP (Model Context Protocol) tool interfaces are not yet established, and historical event knowledge bases lack structured sorting. Agents operating on such a foundation cannot deliver true operation-level capabilities.
1.3 Dual Resistance from Organization and Trust
Handing over tasks originally performed by humans to AI for autonomous execution requires overcoming two thresholds: the technical threshold (Can the agent perform correctly?) and the trust threshold (Who is responsible if something goes wrong?). The former requires time for verification, while the latter requires a shift in organizational culture. Without mature manual checkpoints and audit frameworks, IT O&M managers can hardly trust autonomous operation on core production systems.
1.4 Legacy Issues
The long development path from automated O&M, to AI-assisted O&M, and then to AI-autonomous O&M involves not only continuous technological iteration, but also the skills development of employees and the need for O&M organization management to keep pace with the times. Any missing link in this chain can become a weakness in the AI transformation. For example, the maturity of the integrated O&M architecture affects the accuracy and success rate of AI decision-making; the degree to which monetized management has been realized influences how rationally AI manpower investment can be planned and how its value can be measured; and the capacity for O&M computing power scheduling determines how much long-term, stable investment an O&M team can dedicate to AI development.
II. Practice in a Large Organization
Faced with these difficulties, one large organization systematically implemented AgentOps on the BlueWhale integrated O&M platform. By building an agent development ecosystem, it has so far generated 20+ digital employees, each working over 8 hours per day, orchestrating 500+ agents and covering the entire O&M scenario chain from test environment control to production environment reliability assurance.
Core Approach: Scenarios driven by efficiency (digital IT O&M employees) + Scenario construction based on platformization (IT O&M agent development) + Rich integrated O&M capabilities (MCP + Data + Knowledge).
Key Strategy:
01)Rich Integrated O&M Capabilities: MCP + Data + Knowledge
The true capability of an agent stems from the support of underlying integrated O&M capabilities. By reusing the API gateway, the original integrated O&M platform APIs are rapidly transformed into MCP interfaces callable by LLMs, enabling agents to directly orchestrate automated O&M. At the same time, the platform has accumulated structured O&M data and historical events in its database, providing high-quality data for agent reasoning. Encapsulating Skills keeps O&M operations standardized and reusable, while the RAG knowledge base helps agents obtain precise information and reduce misjudgments. Together, these capabilities are what make quantifiable value achievable in production environments.
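As a rough illustration of how an existing platform API might be exposed as an MCP tool, the sketch below maps an endpoint description onto the three fields an MCP tool definition carries: `name`, `description`, and `inputSchema` (a JSON Schema object). The `ApiParam` and `ApiEndpoint` structures, the `to_mcp_tool` helper, and the example endpoint are all hypothetical and stand in for whatever the API gateway actually stores.

```python
from dataclasses import dataclass, field


@dataclass
class ApiParam:
    name: str
    type: str          # JSON Schema type, e.g. "string", "integer", "array"
    description: str
    required: bool = True


@dataclass
class ApiEndpoint:
    """Hypothetical description of an API already registered in the gateway."""
    operation_id: str
    summary: str
    params: list[ApiParam] = field(default_factory=list)


def to_mcp_tool(endpoint: ApiEndpoint) -> dict:
    """Build an MCP-style tool definition (name, description, inputSchema)."""
    properties = {
        p.name: {"type": p.type, "description": p.description}
        for p in endpoint.params
    }
    required = [p.name for p in endpoint.params if p.required]
    return {
        "name": endpoint.operation_id,
        "description": endpoint.summary,
        "inputSchema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


# Hypothetical endpoint, used only to show the mapping.
job_api = ApiEndpoint(
    operation_id="execute_job_plan",
    summary="Run a predefined automation job plan on a set of hosts.",
    params=[
        ApiParam("plan_id", "integer", "ID of the job plan to run"),
        ApiParam("host_ips", "array", "Target host IP addresses"),
        ApiParam("timeout_s", "integer", "Execution timeout in seconds", required=False),
    ],
)
tool = to_mcp_tool(job_api)
```

Because the tool description is what the LLM reads when deciding which tool to call, the quality of the `summary` and parameter descriptions in the gateway matters as much as the schema itself.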
02)Platform-Based Construction: O&M Agent Development
Based on the architectural concept of "PaaS + Scenarios," the foundation is the O&M Agent Development Platform, which provides a unified base for agent R&D. The platform integrates the key capabilities required for large model application development, including the RAG knowledge base, MCP interface management, prompt management, and Skill management. Through this platform-based design, O&M teams can rapidly build multi-scenario agents and reuse existing components and Skills, avoiding redundant development of underlying capabilities and thereby significantly improving agent R&D efficiency and implementation speed.
03)Efficiency-Driven Scenarios: Digital SRE
Agent development is always centered on improving IT O&M efficiency, rather than on building conversational or retrieval applications for their own sake. Each agent is designed according to the service catalog of the O&M organization, with clear responsibilities in O&M processes. Typical scenarios include business inspection, SQL risk checking, role-based automation, and root cause analysis. All focus on reducing repetitive manual work, shortening response time, and lowering error rates. Driven by such scenarios, digital employees effectively take over jobs originally done by humans, realizing the shift from manual O&M to intelligent O&M.
2.1 Typical Application Scenarios
The following are several representative scenarios that have been operated on a large scale in both test and production:






2.2 Key Insights
The implementation of these scenarios has validated a core judgment: taking measurable working-hour efficiency as the starting point drives more practical organizational progress than pursuing grand top-level designs. Meanwhile, the rollout of scenarios has in turn improved data and tool capabilities: the high requirements that agents place on data quality and tool interfaces have forced the continuous improvement of integrated O&M capabilities.
III. O&M Agent Capability Upgrade Roadmap
Returning to the core of the IT O&M business, we have mapped out feasible scenarios for AI substitution by domain, connecting business analysis, engineering design, and long-term planning. We define the construction of AgentOps in four stages, each with a corresponding maturity level, as shown in Figure 2:

Figure 2: AgentOps Maturity Stages
3.1 Task-Level AI Replacement in O&M Processes: An Incident Management Perspective
As the table above shows, 6 of the 10 tasks in the incident management process reach Lv.4 (Agent Autonomous) or Lv.3 (Highly Intelligent). This distribution generally applies to other O&M domains as well: tasks that are structured and can be governed by rules are prioritized for substitution, while tasks requiring cross-role consensus retain manual judgment.

3.2 Conducting Task-Level Process Design and Designing Manual Checkpoints
Based on the process design, design the Skills, MCP interfaces, knowledge bases, and prompts. For example, for the incident task:
Execution Process: Alert Trigger → Alert Aggregation and Noise Reduction → Parallel: Metric Analysis + Log Analysis + Trace Analysis + CMDB Correlation → Root Cause Candidate Ranking → Historical Case Retrieval → LLM Comprehensive Root Cause Inference → Push Diagnostic Report + SOP Recommendations → Record Diagnosis Process for Post-Mortem.
Manual Checkpoint: For P0-level incidents, the root cause inference result must be confirmed by the Incident Commander before automatic remediation can be triggered. When root cause confidence is below 60%, the result is flagged as uncertain and manual intervention is requested.
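The checkpoint rule above can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the `Diagnosis` structure, the returned step names, and the choice to evaluate the confidence check before the P0 check are illustrative and not taken from the actual platform. Only the P0 confirmation requirement and the 60% threshold come from the text.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.60  # from the text: below 60% is flagged as uncertain


@dataclass
class Diagnosis:
    severity: str      # e.g. "P0", "P1", "P2"
    root_cause: str
    confidence: float  # 0.0 to 1.0


def next_step(diagnosis: Diagnosis, commander_confirmed: bool = False) -> str:
    """Decide whether automatic remediation may proceed."""
    # Assumption: low confidence always goes to a human, regardless of severity.
    if diagnosis.confidence < CONFIDENCE_THRESHOLD:
        return "flag_uncertain_and_request_manual_intervention"
    # P0 incidents need the Incident Commander's confirmation first.
    if diagnosis.severity == "P0" and not commander_confirmed:
        return "await_incident_commander_confirmation"
    return "trigger_automatic_remediation"
```

Keeping the rule in one pure function makes it easy to review, test, and tighten as trust in the agent's diagnoses grows.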
3.3 Sustainably Upgradable Agent
The following table shows the capability definition and core value of core agents in various domains:

3.4 Multi-Agent Collaboration
In certain scenarios, multiple agents must collaborate to complete a task. Take multi-agent fault diagnosis, which we are still exploring, as an example: based on a prioritized list of hypotheses, subtasks are routed to sub-agents for parallel execution; the sub-agents call tools for troubleshooting analysis, after which the evidence is analyzed and a diagnostic report is generated.

Figure 3: Multi-Agent Collaborative Fault Diagnosis Scenario
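The routing pattern described above (prioritized hypotheses, parallel sub-agents, evidence merged into a report) might look roughly like the following. It is only a sketch: the sub-agents are placeholder coroutines that return canned evidence instead of calling real tools, and all names (`Hypothesis`, `Evidence`, `SUB_AGENTS`, `diagnose`) are hypothetical.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    domain: str    # which sub-agent should investigate, e.g. "database"
    priority: int  # lower number = higher priority


@dataclass
class Evidence:
    hypothesis: str
    supported: bool
    detail: str


async def database_agent(h: Hypothesis) -> Evidence:
    # Placeholder for real tool calls (slow-query logs, connection counts, ...).
    await asyncio.sleep(0)
    return Evidence(h.description, True, "connection pool at limit")


async def network_agent(h: Hypothesis) -> Evidence:
    # Placeholder for real tool calls (latency probes, packet-loss checks, ...).
    await asyncio.sleep(0)
    return Evidence(h.description, False, "no packet loss observed")


SUB_AGENTS = {"database": database_agent, "network": network_agent}


async def diagnose(hypotheses: list[Hypothesis]) -> dict:
    """Route each hypothesis to its sub-agent, run them in parallel, then summarize."""
    ordered = sorted(hypotheses, key=lambda h: h.priority)
    # gather() runs the sub-agents concurrently and returns results in input order.
    evidence = await asyncio.gather(*(SUB_AGENTS[h.domain](h) for h in ordered))
    return {
        "likely_causes": [e.hypothesis for e in evidence if e.supported],
        "evidence": [f"{e.hypothesis}: {e.detail}" for e in evidence],
    }


report = asyncio.run(diagnose([
    Hypothesis("network partition between app and DB", "network", 2),
    Hypothesis("DB connection pool exhausted", "database", 1),
]))
```

In a real deployment the evidence analysis step would likely be another LLM call rather than a simple filter; the filter here only marks where that step sits in the flow.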
IV. Architectural Approach: Agent Development Platform + Integration + Large Models
Our overall architectural design adheres to one core principle: stay aligned with the direction in which large model capabilities are improving. This means avoiding redundant development of capabilities that large models already handle well, and instead focusing on the connection layer between large models and the O&M business: data, tool interfaces, knowledge accumulation, and orchestration frameworks.

Figure 4: CanWay AgentOps Overall Architecture

Figure 5: Agent Development Platform Function
The architectural design focuses on the "connection layer" between large models and O&M business—data standardization, tool interface governance, knowledge accumulation, and agent orchestration—rather than duplicating capabilities already possessed by large models. The specific design principles are as follows:
01)MCP First: All external system calls are uniformly accessed through the MCP protocol, avoiding direct coupling between agents and underlying APIs, reducing maintenance costs, and improving tool reuse rates.
02)Encapsulation of Skills: Encapsulate frequently used atomic O&M operations (such as alert aggregation, RunBook matching, and impact assessment) into standard Skills that can be invoked by multiple agents, avoiding redundant development.
03)RAG-Driven Knowledge: Fully vectorize historical O&M events, O&M manuals, and system architecture documents. Agents obtain precise context through semantic retrieval during inference, significantly reducing the hallucination rate.
04)Agent Scenarios: Design based on O&M business scenario processes, gradually forming an ecological model of single agents, multiple agents, and agent collaboration to solve more complex scenarios.
05)Human-in-the-Loop: Mandatory insertion of manual approval checkpoints in high-risk O&M nodes (such as deleting resources, switching traffic, modifying core configurations). The more autonomous the AI, the more precise the control required.
06)Audit First: All agent O&M operations are recorded in immutable O&M logs to meet compliance requirements and to serve as the infrastructure for building organizational trust.
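One common way to approximate the "immutable log" in the Audit First principle is a hash chain, in which each entry commits to the hash of the previous one so that any later modification becomes detectable. The sketch below illustrates that technique only; it is not a description of how the platform stores its logs, and the `AuditLog` class and its field names are hypothetical. Note that a hash chain makes tampering evident rather than impossible, so production systems usually pair it with write-once storage.

```python
import hashlib
import json
from typing import Optional

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


class AuditLog:
    """Append-only log in which each entry commits to the previous entry's hash."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    @staticmethod
    def _digest(record: dict) -> str:
        # Hash every field except the stored hash itself, in a stable key order.
        payload = {k: v for k, v in record.items() if k != "hash"}
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def append(self, agent: str, action: str, approved_by: Optional[str] = None) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        record = {"agent": agent, "action": action,
                  "approved_by": approved_by, "prev_hash": prev_hash}
        record["hash"] = self._digest(record)
        self._entries.append(record)
        return record

    def verify(self) -> bool:
        """Return True only if no entry has been altered, removed, or reordered."""
        prev_hash = GENESIS_HASH
        for record in self._entries:
            if record["prev_hash"] != prev_hash or record["hash"] != self._digest(record):
                return False
            prev_hash = record["hash"]
        return True
```

Recording `approved_by` alongside each action ties the Human-in-the-Loop checkpoints to the audit trail: a reviewer can see both what the agent did and who authorized it.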
V. Systematic Approach: How to Promote AgentOps Implementation within an Enterprise
Based on our practices, the following promotion methods are summarized for reference:
5.1 Let the Organization Form New Habits
The promotion of AgentOps cannot wait until a "perfect solution" is ready. It is recommended to select 2-3 scenarios with clear pain points and visible value for rapid implementation, converging scenario-first approaches with top-level design. This allows engineers to refine requirements and form new collaboration habits through actual use. When team members actively create and use agents, the ecosystem truly begins to operate.
Practical Suggestions: Prioritize scenarios that are cumbersome in process but logically clear, such as account provisioning, service inspections, and SQL reviews. These scenarios have a high success rate for AI substitution, value can be quickly quantified, and risks are controllable.
5.2 Sorting Out Business Processes Is Still the Top Priority
The root cause of failure in many teams' AgentOps exploration is skipping the step of business process modeling. AI cannot automatically clarify chaotic processes; it can only execute clear processes faster. Before introducing agents, the following must be completed: sort out the current process → conduct task-level AI substitution analysis → establish a baseline for quantifiable value.
Specific Method: Take the O&M business domain as a unit (ITSM, change, monitoring, release, etc.), break it down task by task, evaluate the feasibility of AI substitution for each task (Is the data available? Are the tool interfaces ready? Are the risks controllable?), and form a scenario priority matrix.
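A scenario priority matrix of this kind can start as something very simple. The sketch below scores each task by how many of the three feasibility questions are answered "yes" and breaks ties by current manual effort. The scoring rule, the `TaskAssessment` fields, and the sample figures are assumptions for illustration; a real matrix would use whatever criteria and weights the organization agrees on.

```python
from dataclasses import dataclass


@dataclass
class TaskAssessment:
    domain: str
    task: str
    data_available: bool
    tools_ready: bool
    risk_controllable: bool
    weekly_hours: float  # manual effort spent on the task today (the baseline)

    @property
    def feasibility(self) -> int:
        """Number of feasibility questions answered 'yes' (0 to 3)."""
        return sum([self.data_available, self.tools_ready, self.risk_controllable])


def prioritize(tasks: list[TaskAssessment]) -> list[TaskAssessment]:
    """Most feasible first; among equally feasible tasks, largest manual effort first."""
    return sorted(tasks, key=lambda t: (-t.feasibility, -t.weekly_hours))


# Hypothetical assessments, not real measurements.
tasks = [
    TaskAssessment("change", "SQL review", True, True, True, 12.0),
    TaskAssessment("ITSM", "account provisioning", True, True, True, 20.0),
    TaskAssessment("release", "production traffic switch", True, False, False, 6.0),
]
ranked = [t.task for t in prioritize(tasks)]
```

Capturing `weekly_hours` at assessment time also gives each scenario its "Before" baseline for later value measurement.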
5.3 Foundational Capability Building is a Long-Term Investment
The upper limit of an agent's intelligence depends on the quality of the data it can access and the richness of the tool capabilities available to it. The integrated O&M platform provides AgentOps not just with a display interface, but more importantly, with a data foundation and an MCP tool ecosystem. Specifically:
01)Data Layer: Data such as monitoring metrics, logs, change records, and CMDB relationships need to be standardized and available in real-time so that agents have the necessary data to perform.
02)Tool Layer: Establishing an MCP tool registry that covers read and write O&M interfaces for major systems is a prerequisite for autonomous agent execution.
03)Knowledge Layer: The vectorization and storage of historical events, RunBooks, and system architecture documents in a knowledge base is the foundation for the accuracy of RAG retrieval.
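To make the Knowledge Layer concrete, the toy example below shows the retrieval step of RAG: documents and the query are turned into vectors, and the closest documents are returned as context for the agent. It deliberately swaps in plain word-count vectors for a real embedding model and an in-memory dictionary for a vector database, so it only demonstrates the shape of the technique. The document IDs and contents are invented.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: dict[str, str], top_k: int = 1) -> list[str]:
    """Return the IDs of the top_k documents most similar to the query."""
    q = vectorize(query)
    ranked = sorted(
        documents,
        key=lambda doc_id: cosine(q, vectorize(documents[doc_id])),
        reverse=True,
    )
    return ranked[:top_k]


# Invented knowledge base entries: a RunBook, a historical event, an architecture note.
knowledge_base = {
    "runbook-redis": "Redis memory usage high: check eviction policy and large keys",
    "incident-db-pool": "Database connection pool exhausted after release; rolled back change",
    "arch-gateway": "API gateway routes traffic to backend services",
}
hits = retrieve("database connection pool exhausted alerts", knowledge_base)
```

Retrieval quality here depends entirely on what was stored, which is why structuring historical events and RunBooks before vectorizing them pays off.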
5.4 Start with Simple Measurement Methods and Evolve Gradually
In the initial stages, do not pursue a sophisticated measurement system, as it can become an obstacle to progress. A two-phase approach is recommended:
Phase 1 – Efficiency Measurement: Track metrics such as the number of agent executions, task success rate, processing time (Before vs. After), and rate of manual intervention. These data are easy to collect and directly reflect efficiency improvements.
Phase 2 – Quality Measurement: Focus on metrics such as agent decision accuracy, error rate, SLA compliance rate, and the rate of human substitution. These metrics require a longer period of data accumulation but truly reflect the depth of AI substitution.
Key Principle: Before each agent goes live, a value baseline (Before data) must be established. After going live, regularly review and compare the data. Otherwise, the scenario remains forever a demo.
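The Phase 1 metrics can be computed from nothing more than a list of execution records and the baseline captured before go-live. The following is a minimal sketch; the `Execution` record and the metric names are assumptions, and the sample numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Execution:
    succeeded: bool
    manual_intervention: bool
    minutes: float  # processing time of this agent run


def phase1_metrics(executions: list[Execution], baseline_minutes: float) -> dict:
    """Efficiency metrics for one agent, compared against its pre-launch baseline."""
    total = len(executions)
    if total == 0:
        raise ValueError("no executions recorded yet")
    avg_minutes = sum(e.minutes for e in executions) / total
    return {
        "executions": total,
        "success_rate": sum(e.succeeded for e in executions) / total,
        "manual_intervention_rate": sum(e.manual_intervention for e in executions) / total,
        "avg_minutes_after": avg_minutes,
        "time_saved_pct": (baseline_minutes - avg_minutes) / baseline_minutes * 100,
    }


# Invented sample: four agent runs against a 30-minute manual baseline.
runs = [
    Execution(True, False, 4.0),
    Execution(True, True, 6.0),
    Execution(False, True, 8.0),
    Execution(True, False, 6.0),
]
metrics = phase1_metrics(runs, baseline_minutes=30.0)
```

Without the `baseline_minutes` value there is nothing to compare against, which is the practical reason for recording Before data ahead of each launch.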
5.5 Organization Change and Technological Change Advance in Tandem
The essence of AgentOps is a transformation of the production relations within the IT O&M organization, with technology serving merely as the vehicle. The corresponding organizational changes are just as important as the technological advancement:
01)Establish the Role of Agent Owner: Agents for each business scenario require someone responsible for continuous iteration and maintenance.
02)Foster an Agent Development Ecosystem: Lower the barrier for ordinary engineers to build and publish agents, allowing Skills and agent capabilities to be freely shared and reused within the organization.
03)Redefine SRE Value: The focus of SRE needs to shift from execution to strategy configuration, boundary approval, and agent quality management, which requires clear adjustments in role definitions.
5.6 Emphasize Permission Control and Process Transparency
As AI gains the right to act (for example, executing scripts on servers or performing self-healing actions), the risks also increase. When implementing AI agents, it is essential to emphasize security control and governance.
01)Implement Permission Control: Strictly control O&M permissions in production. Interfaces requiring manual secondary confirmation can be provided at the API layer for the AI agent.
02)Transparency: More detailed logging and audit functions are needed, allowing human administrators to trace back why an agent made a particular decision.
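One way to provide an interface that requires manual secondary confirmation is to wrap every write operation so that it refuses to run unless a human confirmer is named. The decorator below is a sketch of that idea in plain Python; the `guarded` decorator, the `ConfirmationRequired` exception, and the two example tools are hypothetical, and in practice the check would live in the API layer rather than in the agent's own process.

```python
import functools
from typing import Callable, Optional


class ConfirmationRequired(Exception):
    """Raised when a write operation is attempted without human confirmation."""


def guarded(write: bool) -> Callable:
    """Wrap a tool so that write operations need an explicit human confirmer."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, confirmed_by: Optional[str] = None, **kwargs):
            if write and not confirmed_by:
                raise ConfirmationRequired(
                    f"{func.__name__} changes production state and needs manual confirmation"
                )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@guarded(write=False)
def get_host_status(host: str) -> str:
    # Read-only: the agent may call this freely.
    return f"{host}: ok"


@guarded(write=True)
def restart_service(host: str, service: str) -> str:
    # Write operation: only runs once a human has confirmed it.
    return f"restarted {service} on {host}"
```

Passing the confirmer's identity, rather than a bare boolean, means the same value can be written to the audit log to explain who authorized the action.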
VI. Current Assessment: Understanding the Present and Judging the Future
A truly "unattended O&M closed loop" (Lv.4) is still a considerable distance from large-scale implementation. The core challenges lie in:
01)Extremely High Reliability Requirements: Autonomous O&M in production systems means that a single misjudgment could be catastrophic. The reliability of agent decisions needs to be validated across thousands of executions.
02)Complexity of Cross-Domain Collaboration: Lv.4 relies on seamless collaboration between multiple agents (A2A), which requires standardized protocols, shared context, and coordination mechanisms. The industry is still exploring these areas.
03)Organizational Trust Takes Time to Build: Even if it is technically feasible, building genuine organizational trust in AI autonomously executing high-risk O&M operations requires a substantial accumulation of successful cases and a gradual expansion of the boundary of authorization.
04)Data and Tool Barriers Will Become Core Competitiveness: Organizations that are the first to establish high-quality O&M data assets and a rich MCP tool ecosystem will gain a sustained competitive advantage in the AgentOps era.
AgentOps is not a gamble on the future, but a pragmatic investment in strengthening the foundations. Starting with efficiency tools, ensuring data quality, connecting tool interfaces, and accumulating knowledge: these fundamental tasks undertaken today are precisely the firm foundation for large-scale agent autonomy three years from now.



























