To address the new international competitive environment and domestic regulatory landscape, industry digital transformation is accelerating. Technologies such as cloud computing, big data, and AI are increasingly moving from back-office applications to front-line business, penetrating every link of investment research, marketing, risk control, and operations. As this wave of digital transformation advances, the application systems and infrastructure supporting business operations grow increasingly complex, raising maintenance costs and risks and posing a significant challenge to operations management. Building a safer, more efficient, and smarter operations management system has therefore become a key issue in industry IT construction.
Penghua Fund incorporated digital transformation into its corporate development strategy, practicing a "System + Process + Data" trinity approach that tightly integrates system construction with business processes. Because this involved building new systems and upgrading existing ones, the objects and scenarios managed by the operations team were constantly changing.
Constrained by a "stovepipe" operations tool ecosystem, operations management faced three major challenges: isolated tools and low efficiency, difficulty in enforcing management standards, and bottlenecks in organizational improvement.
01. Integrated Operations Platform Construction
To resolve the challenges faced by the operations team during digital transformation, Penghua Fund prioritized tool development. Leveraging an advanced Operations PaaS platform concept, it conducted a comprehensive upgrade and optimization of its operations management system, focusing on the tool support system, process management system, metrics system, and personnel organization system.
1) Tool Support System: Building an Autonomous, Integrated, and Unified Operations Platform
The approach to building operations support tools shifted from a "stovepipe" model to a "platform-based" model. Built upon a PaaS foundation, it established an operations tool system centered on the concepts of "Unified Control and Management, Unified Platform, and Unified Linkage".
Unified Control and Management: All upper-layer scenarios, such as monitoring collection and automated execution, are implemented via a single Agent to avoid the performance overhead caused by multiple Agents on servers.
Unified Platform: The PaaS architecture governs and consolidates core capabilities, including job execution, the automation orchestration engine, the workflow engine, and container management. This prevents siloed systems and redundant investment, ensuring sustainable construction. Furthermore, the platform's Ops Development capabilities help the team achieve autonomous control and capability upgrades.
Unified Linkage: Using the operations base platform as the foundation and the CMDB as the core, the fund built an integrated and unified operations tool ecosystem through the construction of IT governance processes, monitoring and alerting, operations automation tools, and visualization. This enhances operations assurance efficiency and creates a differentiated competitive advantage.
① Building a Consumption-Centric CMDB
The team standardized overall configuration information, finalized model design and calibration, and achieved automated discovery and collection of configurations for cloud platforms, operating systems, databases, and middleware to support consumption scenarios like monitoring and inspection. Configuration change management was automated and integrated with automation scenarios to achieve a closed loop from approval flow to automated execution and configuration data write-back (e.g., in resource application scenarios). After a requester submits a ticket and an administrator approves it, the resources are automatically registered in the CMDB. By linking with the process management platform, configuration data entry and changes are made procedural, standardized, and automated, ensuring data uniformity and accuracy.
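The approval-to-write-back loop described above can be sketched as follows. This is a minimal illustration, not Penghua Fund's actual system: the ticket fields and the in-memory CMDB structure are assumptions made for the example.

```python
# Hypothetical sketch: after a resource ticket is approved, the instance is
# registered in the CMDB automatically rather than entered by hand.

def register_resource(ticket: dict, cmdb: dict) -> dict:
    """Write an approved resource ticket back into a (toy) CMDB store."""
    if ticket["status"] != "approved":
        raise ValueError("only approved tickets may be registered")
    instance = {
        "model": ticket["resource_type"],   # e.g. "host", "mysql"
        "name": ticket["resource_name"],
        "owner": ticket["requester"],       # ownership recorded at entry time
        "source": "itsm-workflow",          # provenance: procedural entry, not manual
    }
    cmdb.setdefault(instance["model"], []).append(instance)
    return instance
```

The key property is that every instance carries its owner and its provenance from the moment it enters the CMDB, which is what keeps the data uniform and accurate.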
② Building an Integrated Monitoring System Driven by Events and Data
The integrated monitoring system is driven by a dual core of events and data, aiming to solve the problem of dispersed monitoring and alerting. The Monitoring Center organizes monitoring objects by dimensions such as hardware, OS, component services, and application performance. All monitoring objects originate from the CMDB. When integrating performance data from monitoring sources like Zabbix and Prometheus, the data is tagged with CMDB model and instance labels, creating an object-oriented unified monitoring metrics management system, including metric definition and custom enable/disable controls. The centralized monitoring data also serves as a data source for operations big data analysis, consumed in scenarios like unified visualization, web view display, and report statistics. The Alert Center centralizes alert event aggregation and achieves efficiency improvements through seamless linkage with the CMDB, Automation, and ITSM tools for alert convergence, dispatch, work order conversion, and self-healing.
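The tagging step, where performance samples from a source such as Zabbix or Prometheus are stamped with CMDB model and instance labels, can be sketched like this. The field names (`host`, `labels`, `cmdb_model`) are illustrative assumptions, not the platform's actual schema.

```python
# Illustrative sketch: enrich a raw metric sample with CMDB object labels,
# so all downstream consumers see metrics organized by CMDB object.

def tag_with_cmdb(sample: dict, cmdb_index: dict) -> dict:
    """Look up the sample's host in a CMDB index and attach object labels."""
    ci = cmdb_index.get(sample["host"])
    labels = dict(sample.get("labels", {}))
    if ci:
        labels["cmdb_model"] = ci["model"]       # e.g. "linux_host", "mysql"
        labels["cmdb_instance_id"] = ci["id"]    # ties the metric to one CI
    return {**sample, "labels": labels}
```

Once every sample carries these labels, "object-oriented" metric management falls out naturally: metrics can be grouped, enabled, or disabled per CMDB model or instance.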
③ Building an Agile IT Service Management (ITSM) Tool
The fund implemented IT governance processes for demands, changes, and incidents, as well as internal IT service processes. It integrated these with monitoring, CMDB, and automation tools to link management processes with operations tools. Combined with the adoption of the Feishu mobile application, this enhanced process agility, significantly improving the department's process handling efficiency and closure rate.
④ Building Autonomous and Controllable Operations Automation Tools
Based on actual operations scenario needs and the front-end and back-end development framework provided by the operations platform, the team quickly built support tools and operations systems through the flexible invocation and assembly of platform services.
⑤ Building Operations Visualization
A series of dashboards were released to display data from engineering and management perspectives, covering IT assets, processes, monitoring, and security. These include an IT process cockpit, a CMDB resource dashboard, a monitoring dashboard, and an information security dashboard, all independently designed and connected to data sources via the platform's dashboard designer.
2) Process Management System: Service Catalog Planning
The service catalog was planned by service type, divided into three categories: Internal IT Services, IT Governance Services, and Built-in Approval Services. For users, this makes it easier to understand and select services, allowing for quick ticket submission via web or IM. For process managers, it helps in managing and optimizing services based on their technical characteristics.
3) Personnel Organization System: Exploration of Ops Development (OpsDev) Transformation
The transformation to Ops Development originated in the internet industry, where companies sought to enhance operational efficiency and reduce costs to cope with fierce competition. Shifting to Ops Development to achieve autonomous tool control is one way to boost efficiency and cut costs. After introducing the Operations PaaS system, Penghua Fund actively conducted training and learning activities to encourage the team to transform and upgrade.
During this process, individuals with tool development capabilities and a pursuit of growth transitioned from traditional operations to Ops Development (OpsDev). They independently developed the Automated Health Check Center SaaS, replacing the previously fragmented inspection approach and effectively supporting the unified management of the department's daily inspections.
The entire tool construction process can be divided into three main steps:
Step 1: Inventory Atomic Capabilities. Inspection scenarios can be broken down into operations such as retrieving inspection objects and executing inspection commands. Retrieving inspection objects requires the atomic capability of the CMDB to batch-retrieve objects like hosts and databases. Executing inspection commands requires the atomic capability of job execution to run commands on controlled machines via the Agent to complete the inspection.
Step 2: Orchestrate Atomic Capabilities. The atomic capabilities mentioned above are linked together using logical controls such as serial, parallel, conditional (if/then), and loop to build the operations automation workflow.
Step 3: SaaS Capability Encapsulation. Based on the front-end/back-end development framework and the maintenance-free hosting environment provided by the PaaS platform, the inspection automation workflow is encapsulated as a SaaS tool to enhance usability.
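The three steps above can be sketched in miniature. The two functions below are stand-ins for the platform's atomic capabilities (CMDB batch query and Agent-based job execution); their names and the data shapes are assumptions made for illustration.

```python
# Minimal sketch of Steps 1-2: atomic capabilities chained with
# loop and serial controls to form an inspection workflow.

def get_inspection_objects(cmdb: list) -> list:
    """Atomic capability 1: batch-retrieve inspection targets from the CMDB."""
    return [ci for ci in cmdb if ci["type"] in ("host", "database")]

def run_inspection(ci: dict) -> dict:
    """Atomic capability 2: stand-in for running a check command via the Agent."""
    return {"object": ci["name"], "ok": ci.get("disk_used", 0) < 90}

def inspection_workflow(cmdb: list) -> list:
    """Orchestration: loop over retrieved objects, executing checks serially."""
    results = []
    for ci in get_inspection_objects(cmdb):
        results.append(run_inspection(ci))
    return results
```

Step 3 then wraps such a workflow behind a web front end on the PaaS hosting environment, so users trigger inspections from a page rather than a script.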
With the success story of automated inspection, a "tool culture" is gradually permeating every corner of the team, accelerating the transformation of the operations organization.
Inspection Center: Leveraging the platform's development framework, the team self-developed the Inspection Center to unify inspection tasks across all businesses, along with a companion Feishu mobile application for on-the-go access.
Performance Dashboard: A self-developed performance portal displays R&D efficiency metrics using ticket data.
Operations Management Notification Center: Integrates with Feishu group robots to provide notifications for relevant nodes in the IT governance process at scheduled times.
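A scheduled notification of the kind described above can be sketched with Feishu's custom group bot, which accepts a JSON payload posted to a webhook URL. The webhook token and message fields below are placeholders; a real deployment would also verify the response and handle retries.

```python
# Illustrative sketch: build and send a Feishu group-bot reminder for an
# IT governance process node. The webhook URL is a placeholder.
import json
from urllib import request

FEISHU_WEBHOOK = "https://open.feishu.cn/open-apis/bot/v2/hook/<your-token>"

def build_text_message(process: str, node: str, due: str) -> dict:
    """Build a Feishu custom-bot text payload for a process-node reminder."""
    return {
        "msg_type": "text",
        "content": {"text": f"[IT Governance] {process} / {node} is due by {due}"},
    }

def notify(payload: dict) -> None:
    """POST the payload to the group bot (fire-and-forget in this sketch)."""
    req = request.Request(FEISHU_WEBHOOK, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)
```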
4) Metrics System: Continuous Operation and Improvement
CMDB construction is a process of continuous improvement. Data operations is a method that effectively helps the configuration manager "monitor" the state of the CMDB during its construction. Data operations techniques include monitoring for data islands, relational integrity, attribute completeness, and attribute standardization.
Data Island: Refers to an instance that is not associated with any other instance, has no configured owner maintaining its data, or represents a resource object that is not currently in use.
Relational Integrity: Means that some objects may have important but unmaintained relationships, which can affect the consumption of core scenarios, such as fault impact analysis.
Attribute Completeness: Essentially checks whether attributes deemed important by the configuration administrator are populated, for example: checking if the owner attribute of a resource object is filled in.
Attribute Standardization: CMDB construction is often a continuous and iterative process. In the early stages, model attribute standards are often not clearly defined, leading to poor data quality during initial collection. As the construction progresses and attribute rules are more clearly defined based on consumption scenarios, checks must be run against the current CMDB data using defined rules—which is essentially a post-implementation task.
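Two of the checks above, data-island detection and attribute completeness, reduce to simple set logic over the configuration data. The instance and relation schemas below are assumptions made for the sketch.

```python
# Illustrative data-operations checks over a toy CMDB snapshot.

def find_data_islands(instances: list, relations: list) -> list:
    """Instances that participate in no relationship are candidate data islands."""
    linked = {i for rel in relations for i in (rel["source"], rel["target"])}
    return [ci["id"] for ci in instances if ci["id"] not in linked]

def attribute_completeness(instances: list, attr: str) -> float:
    """Share of instances with the given attribute populated (e.g. 'owner')."""
    if not instances:
        return 1.0
    filled = sum(1 for ci in instances if ci.get(attr))
    return filled / len(instances)
```

Running such checks on a schedule and charting the results is what lets the configuration manager "monitor" the CMDB's health rather than audit it by hand.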
The team tracks IT Service Management (ITSM) metrics such as total incidents, SLA (Service Level Agreement), and satisfaction surveys to continuously analyze and improve IT service quality. By leveraging the valuable data accumulated in ITSM tickets for statistical analysis of demands, incidents, and changes, an efficiency metrics system is established to effectively aid internal IT management.
Monitoring management metrics primarily focus on three aspects: Monitoring Coverage Rate, Alert Classification, and Fault Management.
Monitoring Coverage Rate involves two metrics: monitoring object collection coverage and monitoring metric policy coverage, which primarily measure the adoption and use of monitoring. Object collection coverage is usually calculated by comparing the number of object instances covered by monitoring tasks to the total number of instances of that object in the CMDB; metric policy coverage is usually calculated by comparing the number of metric policies configured for a specific instance to the total number of metrics collected for that instance.
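The two ratios follow directly from those definitions; a sketch with assumed inputs (lists of instance IDs and metric names):

```python
# The two monitoring coverage ratios, computed as described above.

def object_collection_coverage(monitored_ids: list, cmdb_ids: list) -> float:
    """Instances covered by monitoring tasks / total CMDB instances of the object."""
    if not cmdb_ids:
        return 1.0
    return len(set(monitored_ids) & set(cmdb_ids)) / len(set(cmdb_ids))

def metric_policy_coverage(policy_metrics: list, collected_metrics: list) -> float:
    """Metrics with a configured policy / total metrics collected for the instance."""
    if not collected_metrics:
        return 1.0
    return len(set(policy_metrics) & set(collected_metrics)) / len(set(collected_metrics))
```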
Alert Classification is generally divided into three levels: Fatal, Warning, and Notification. Fatal alerts typically indicate that a service is already abnormal and requires immediate attention; Warning alerts usually mean that the service will soon become abnormal if not addressed promptly; Notification alerts generally represent potential issues that require attention or preemptive action to prevent anomalies.
Fault Management refers to metrics related to the fault handling process, including MTTI (Mean Time to Identify) = Discovery Time - Occurrence Time; MTTA (Mean Time to Acknowledge) = Acknowledgment Time - Discovery Time; MTTR (Mean Time to Recover) = Recovery Time - Occurrence Time; and MTBF (Mean Time Between Failures), the average uptime between consecutive failures. The fundamental goal of fault management is to reduce MTTR and increase MTBF.
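Given incident records with the four timestamps, the per-incident durations defined above average out as follows. Timestamps are epoch seconds and the field names are illustrative.

```python
# Average the fault-management durations over a list of incident records,
# using the per-incident definitions given in the text.

def fault_metrics(incidents: list) -> dict:
    """Return mean MTTI, MTTA, and MTTR across incidents (in seconds)."""
    n = len(incidents)
    mtti = sum(i["discovered"] - i["occurred"] for i in incidents) / n
    mtta = sum(i["acknowledged"] - i["discovered"] for i in incidents) / n
    mttr = sum(i["recovered"] - i["occurred"] for i in incidents) / n
    return {"MTTI": mtti, "MTTA": mtta, "MTTR": mttr}
```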
02. Benefits and Outlook
To meet the demands of business digital transformation, Penghua Fund adopted a combination of multiple technologies and management tools based on the advanced concepts of "Unified Platform, Unified Control, and Unified Linkage". This achieved unified management and coordination of data, resources, and personnel, solving the problems caused by previous "stovepipe" tool construction. The platform's Ops Development capabilities also initiated the operations team's path toward Site Reliability Engineering (SRE) transformation.
Looking ahead, in response to the trend of Cloud Native transformation, the fund can build a Full-Stack Observability System (integrating Trace, Metric, Log) to solve fault management issues in complex distributed architectures.
Facing the impact of AI technology, the team can explore using deep learning and Natural Language Processing in operations scenarios, such as intelligent thresholds in monitoring and intelligent alert clustering in alert management.
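As one simple stand-in for the "intelligent thresholds" idea, a threshold can be learned from recent history instead of fixed by hand, here a moving mean plus or minus k standard deviations. Real systems use far richer models (seasonality-aware baselines, ML-based detectors); this is only a minimal sketch.

```python
# Minimal dynamic-threshold sketch: bounds derived from recent samples
# rather than a static, hand-set limit.
from statistics import mean, stdev

def dynamic_threshold(history: list, k: float = 3.0) -> tuple:
    """Return (low, high) bounds learned from recent samples."""
    m, s = mean(history), stdev(history)
    return m - k * s, m + k * s

def is_anomalous(value: float, history: list, k: float = 3.0) -> bool:
    """Flag a value that falls outside the learned bounds."""
    low, high = dynamic_threshold(history, k)
    return not (low <= value <= high)
```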