To address the new international competitive environment and domestic regulatory landscape, industry digital transformation is accelerating. Technologies such as cloud computing, big data, and AI are increasingly moving from back-office applications to front-line business, penetrating every link of investment research, marketing, risk control, and operations. As this wave of digital transformation advances, the application systems and infrastructure supporting business operations grow increasingly complex, raising maintenance costs and risks and posing a significant challenge to operations management. Building a safer, more efficient, and smarter operations management system has therefore become a key issue in industry IT construction.
Penghua Fund incorporated digital transformation into its corporate development strategy, practicing a "System + Process + Data" trinity approach that tightly integrates system construction with business processes. Because this involved building new systems and upgrading existing ones, the objects and scenarios managed by the operations team were constantly changing.
Constrained by a "stovepipe" operations tool ecosystem, operations management faced three major challenges: isolated tools and low efficiency, difficulty in enforcing management standards, and bottlenecks in organizational improvement.
01. Integrated Operations Platform Construction
To resolve the challenges faced by the operations team during digital transformation, Penghua Fund prioritized tool development. Leveraging an advanced Operations PaaS platform concept, it conducted a comprehensive upgrade and optimization of its operations management system, focusing on the tool support system, process management system, metrics system, and personnel organization system.
1) Tool Support System: Building an Autonomous, Integrated, and Unified Operations Platform
The approach to building operations support tools shifted from a "stovepipe" model to a "platform-based" model. Built upon a PaaS foundation, it established an operations tool system centered on the concepts of "Unified Control and Management, Unified Platform, and Unified Linkage".
Unified Control and Management: All upper-layer scenarios, such as monitoring collection and automated execution, are implemented via a single Agent to avoid the performance overhead caused by multiple Agents on servers.
Unified Platform: The PaaS architecture governs and consolidates core capabilities, including job execution, the automation orchestration engine, the workflow engine, and container management. This prevents siloed systems and redundant investment, ensuring sustainable construction. Furthermore, the platform's Ops Development capabilities help the team achieve autonomous control and capability upgrades.
Unified Linkage: Using the operations base platform as the foundation and the CMDB as the core, the fund built an integrated and unified operations tool ecosystem through the construction of IT governance processes, monitoring and alerting, operations automation tools, and visualization. This enhances operations assurance efficiency and creates a differentiated competitive advantage.
① Building a Consumption-Centric CMDB
The team standardized overall configuration information, finalized model design and calibration, and achieved automated discovery and collection of configurations for cloud platforms, operating systems, databases, and middleware to support consumption scenarios like monitoring and inspection. Configuration change management was automated and integrated with automation scenarios to achieve a closed loop from approval flow to automated execution and configuration data write-back (e.g., in resource application scenarios). After a requester submits a ticket and an administrator approves it, the resources are automatically registered in the CMDB. By linking with the process management platform, configuration data entry and changes are made procedural, standardized, and automated, ensuring data uniformity and accuracy.
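The approval-to-write-back loop described above can be sketched as follows. This is a minimal illustration, not Penghua Fund's actual system: the ticket fields and the in-memory CMDB structure are assumptions made for the example.

```python
# Hypothetical sketch: after a resource ticket is approved, the instance is
# registered in the CMDB automatically rather than entered by hand.

def register_resource(ticket: dict, cmdb: dict) -> dict:
    """Write an approved resource ticket back into a (toy) CMDB store."""
    if ticket["status"] != "approved":
        raise ValueError("only approved tickets may be registered")
    instance = {
        "model": ticket["resource_type"],   # e.g. "host", "mysql"
        "name": ticket["resource_name"],
        "owner": ticket["requester"],       # ownership recorded at entry time
        "source": "itsm-workflow",          # provenance: procedural entry, not manual
    }
    cmdb.setdefault(instance["model"], []).append(instance)
    return instance
```

The key property is that every instance carries its owner and its provenance from the moment it enters the CMDB, which is what keeps the data uniform and accurate.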
② Building an Integrated Monitoring System Driven by Events and Data
The integrated monitoring system is driven by a dual core of events and data, aiming to solve the problem of dispersed monitoring and alerting. The Monitoring Center organizes monitoring objects by dimensions such as hardware, OS, component services, and application performance. All monitoring objects originate from the CMDB. When integrating performance data from monitoring sources like Zabbix and Prometheus, the data is tagged with CMDB model and instance labels, creating an object-oriented unified monitoring metrics management system, including metric definition and custom enable/disable controls. The centralized monitoring data also serves as a data source for operations big data analysis, consumed in scenarios like unified visualization, web view display, and report statistics. The Alert Center centralizes alert event aggregation and achieves efficiency improvements through seamless linkage with the CMDB, Automation, and ITSM tools for alert convergence, dispatch, work order conversion, and self-healing.
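The tagging step, where performance samples from a source such as Zabbix or Prometheus are stamped with CMDB model and instance labels, can be sketched like this. The field names (`host`, `labels`, `cmdb_model`) are illustrative assumptions, not the platform's actual schema.

```python
# Illustrative sketch: enrich a raw metric sample with CMDB object labels,
# so all downstream consumers see metrics organized by CMDB object.

def tag_with_cmdb(sample: dict, cmdb_index: dict) -> dict:
    """Look up the sample's host in a CMDB index and attach object labels."""
    ci = cmdb_index.get(sample["host"])
    labels = dict(sample.get("labels", {}))
    if ci:
        labels["cmdb_model"] = ci["model"]       # e.g. "linux_host", "mysql"
        labels["cmdb_instance_id"] = ci["id"]    # ties the metric to one CI
    return {**sample, "labels": labels}
```

Once every sample carries these labels, "object-oriented" metric management falls out naturally: metrics can be grouped, enabled, or disabled per CMDB model or instance.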
③ Building an Agile IT Service Management (ITSM) Tool
The fund implemented IT governance processes for demands, changes, and incidents, as well as internal IT service processes. It integrated these with monitoring, CMDB, and automation tools to link management processes with operations tools. Combined with the adoption of the Feishu mobile application, this enhanced process agility, significantly improving the department's process handling efficiency and closure rate.
④ Building Autonomous and Controllable Operations Automation Tools
Based on actual operations scenario needs and the front-end and back-end development framework provided by the operations platform, the team quickly built support tools and operations systems through the flexible invocation and assembly of platform services.
⑤ Building Operations Visualization
A series of dashboards were released to display data from engineering and management perspectives, covering IT assets, processes, monitoring, and security. These include an IT process cockpit, a CMDB resource dashboard, a monitoring dashboard, and an information security dashboard, all independently designed and connected to data sources via the platform's dashboard designer.
2) Process Management System: Service Catalog Planning
The service catalog was planned by service type, divided into three categories: Internal IT Services, IT Governance Services, and Built-in Approval Services. For users, this makes it easier to understand and select services, allowing for quick ticket submission via web or IM. For process managers, it helps in managing and optimizing services based on their technical characteristics.
3) Personnel Organization System: Exploration of Ops Development (OpsDev) Transformation
The transformation to Ops Development originated in the internet industry, where companies sought to enhance operational efficiency and reduce costs to cope with fierce competition. Shifting to Ops Development to achieve autonomous tool control is one way to boost efficiency and cut costs. After introducing the Operations PaaS system, Penghua Fund actively conducted training and learning activities to encourage the team to transform and upgrade.
During this process, individuals with tool development capabilities and a pursuit of growth transitioned from traditional operations to Ops Development (OpsDev). They independently developed the Automated Health Check Center SaaS, replacing the previously fragmented inspection approach and effectively supporting the unified management of the department's daily inspections.
The entire tool construction process can be divided into three main steps:
Step 1: Inventory Atomic Capabilities. Inspection scenarios can be broken down into operations such as retrieving inspection objects and executing inspection commands. Retrieving inspection objects requires the atomic capability of the CMDB to batch-retrieve objects like hosts and databases. Executing inspection commands requires the atomic capability of job execution to run commands on controlled machines via the Agent to complete the inspection.
Step 2: Orchestrate Atomic Capabilities. The atomic capabilities mentioned above are linked together using logical controls such as serial, parallel, conditional (if/then), and loop to build the operations automation workflow.
Step 3: SaaS Capability Encapsulation. Based on the front-end/back-end development framework and the maintenance-free hosting environment provided by the PaaS platform, the inspection automation workflow is encapsulated as a SaaS tool to enhance usability.
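The three steps above can be sketched in miniature. The two functions below are stand-ins for the platform's atomic capabilities (CMDB batch query and Agent-based job execution); their names and the data shapes are assumptions made for illustration.

```python
# Minimal sketch of Steps 1-2: atomic capabilities chained with
# loop and serial controls to form an inspection workflow.

def get_inspection_objects(cmdb: list) -> list:
    """Atomic capability 1: batch-retrieve inspection targets from the CMDB."""
    return [ci for ci in cmdb if ci["type"] in ("host", "database")]

def run_inspection(ci: dict) -> dict:
    """Atomic capability 2: stand-in for running a check command via the Agent."""
    return {"object": ci["name"], "ok": ci.get("disk_used", 0) < 90}

def inspection_workflow(cmdb: list) -> list:
    """Orchestration: loop over retrieved objects, executing checks serially."""
    results = []
    for ci in get_inspection_objects(cmdb):
        results.append(run_inspection(ci))
    return results
```

Step 3 then wraps such a workflow behind a web front end on the PaaS hosting environment, so users trigger inspections from a page rather than a script.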
With the success story of automated inspection, a "tool culture" is gradually permeating every corner of the team, accelerating the transformation of the operations organization.
Inspection Center: Leveraging the platform's development framework, the team self-developed the Inspection Center to unify inspection tasks across all businesses, along with a companion Feishu mobile application for on-the-go access.
Performance Dashboard: A self-developed performance portal displays R&D efficiency metrics using ticket data.
Operations Management Notification Center: Integrates with Feishu group robots to provide notifications for relevant nodes in the IT governance process at scheduled times.
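A scheduled notification of the kind described above can be sketched with Feishu's custom group bot, which accepts a JSON payload posted to a webhook URL. The webhook token and message fields below are placeholders; a real deployment would also verify the response and handle retries.

```python
# Illustrative sketch: build and send a Feishu group-bot reminder for an
# IT governance process node. The webhook URL is a placeholder.
import json
from urllib import request

FEISHU_WEBHOOK = "https://open.feishu.cn/open-apis/bot/v2/hook/<your-token>"

def build_text_message(process: str, node: str, due: str) -> dict:
    """Build a Feishu custom-bot text payload for a process-node reminder."""
    return {
        "msg_type": "text",
        "content": {"text": f"[IT Governance] {process} / {node} is due by {due}"},
    }

def notify(payload: dict) -> None:
    """POST the payload to the group bot (fire-and-forget in this sketch)."""
    req = request.Request(FEISHU_WEBHOOK, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)
```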
4) Metrics System: Continuous Operation and Improvement
CMDB construction is a process of continuous improvement. Data operations is a method that effectively helps the configuration manager "monitor" the state of the CMDB during its construction. Data operations techniques include monitoring for data islands, relational integrity, attribute completeness, and attribute standardization.
Data Island: Refers to an instance that is not associated with any other instance, has no configured owner maintaining its data, or represents a resource object that is not currently in use.
Relational Integrity: Means that some objects may have important but unmaintained relationships, which can affect the consumption of core scenarios, such as fault impact analysis.
Attribute Completeness: Essentially checks whether attributes deemed important by the configuration administrator are populated, for example: checking if the owner attribute of a resource object is filled in.
Attribute Standardization: CMDB construction is often a continuous and iterative process. In the early stages, model attribute standards are often not clearly defined, leading to poor data quality during initial collection. As the construction progresses and attribute rules are more clearly defined based on consumption scenarios, checks must be run against the current CMDB data using defined rules—which is essentially a post-implementation task.
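Two of the checks above, data-island detection and attribute completeness, reduce to simple set logic over the configuration data. The instance and relation schemas below are assumptions made for the sketch.

```python
# Illustrative data-operations checks over a toy CMDB snapshot.

def find_data_islands(instances: list, relations: list) -> list:
    """Instances that participate in no relationship are candidate data islands."""
    linked = {i for rel in relations for i in (rel["source"], rel["target"])}
    return [ci["id"] for ci in instances if ci["id"] not in linked]

def attribute_completeness(instances: list, attr: str) -> float:
    """Share of instances with the given attribute populated (e.g. 'owner')."""
    if not instances:
        return 1.0
    filled = sum(1 for ci in instances if ci.get(attr))
    return filled / len(instances)
```

Running such checks on a schedule and charting the results is what lets the configuration manager "monitor" the CMDB's health rather than audit it by hand.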
The team tracks IT Service Management (ITSM) metrics such as total incidents, SLA (Service Level Agreement), and satisfaction surveys to continuously analyze and improve IT service quality. By leveraging the valuable data accumulated in ITSM tickets for statistical analysis of demands, incidents, and changes, an efficiency metrics system is established to effectively aid internal IT management.
Monitoring management metrics primarily focus on three aspects: Monitoring Coverage Rate, Alert Classification, and Fault Management.
Monitoring Coverage Rate involves two metrics: monitoring object collection coverage and monitoring metric policy coverage, which primarily measure the adoption and use of monitoring. Object collection coverage is usually calculated by comparing the number of object instances covered by monitoring tasks to the total number of instances of that object in the CMDB; metric policy coverage is usually calculated by comparing the number of metric policies configured for a specific instance to the total number of metrics collected for that instance.
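The two ratios follow directly from those definitions; a sketch with assumed inputs (lists of instance IDs and metric names):

```python
# The two monitoring coverage ratios, computed as described above.

def object_collection_coverage(monitored_ids: list, cmdb_ids: list) -> float:
    """Instances covered by monitoring tasks / total CMDB instances of the object."""
    if not cmdb_ids:
        return 1.0
    return len(set(monitored_ids) & set(cmdb_ids)) / len(set(cmdb_ids))

def metric_policy_coverage(policy_metrics: list, collected_metrics: list) -> float:
    """Metrics with a configured policy / total metrics collected for the instance."""
    if not collected_metrics:
        return 1.0
    return len(set(policy_metrics) & set(collected_metrics)) / len(set(collected_metrics))
```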
Alert Classification is generally divided into three levels: Fatal, Warning, and Notification. Fatal alerts typically indicate that a service is already abnormal and requires immediate attention; Warning alerts usually mean that the service will soon become abnormal if not addressed promptly; Notification alerts generally represent potential issues that require attention or preemptive action to prevent anomalies.
Fault Management refers to metrics related to the fault handling process, including MTTI (Mean Time to Identify) = Discovery Time - Occurrence Time; MTTA (Mean Time to Acknowledge) = Acknowledgment Time - Discovery Time; MTTR (Mean Time to Recover) = Recovery Time - Occurrence Time; and MTBF (Mean Time Between Failures), the average uptime between consecutive failures. The fundamental goal of fault management is to reduce MTTR and increase MTBF.
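Given incident records with the four timestamps, the per-incident durations defined above average out as follows. Timestamps are epoch seconds and the field names are illustrative.

```python
# Average the fault-management durations over a list of incident records,
# using the per-incident definitions given in the text.

def fault_metrics(incidents: list) -> dict:
    """Return mean MTTI, MTTA, and MTTR across incidents (in seconds)."""
    n = len(incidents)
    mtti = sum(i["discovered"] - i["occurred"] for i in incidents) / n
    mtta = sum(i["acknowledged"] - i["discovered"] for i in incidents) / n
    mttr = sum(i["recovered"] - i["occurred"] for i in incidents) / n
    return {"MTTI": mtti, "MTTA": mtta, "MTTR": mttr}
```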
02. Benefits and Outlook
To meet the demands of business digital transformation, Penghua Fund adopted a combination of multiple technologies and management tools based on the advanced concepts of "Unified Platform, Unified Control, and Unified Linkage". This achieved unified management and coordination of data, resources, and personnel, solving the problems caused by previous "stovepipe" tool construction. The platform's Ops Development capabilities also initiated the operations team's path toward Site Reliability Engineering (SRE) transformation.
Looking ahead, in response to the trend of Cloud Native transformation, the fund can build a Full-Stack Observability System (integrating Trace, Metric, Log) to solve fault management issues in complex distributed architectures.
Facing the impact of AI technology, the team can explore using deep learning and Natural Language Processing in operations scenarios, such as intelligent thresholds in monitoring and intelligent alert clustering in alert management.
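As one simple stand-in for the "intelligent thresholds" idea, a threshold can be learned from recent history instead of fixed by hand, here a moving mean plus or minus k standard deviations. Real systems use far richer models (seasonality-aware baselines, ML-based detectors); this is only a minimal sketch.

```python
# Minimal dynamic-threshold sketch: bounds derived from recent samples
# rather than a static, hand-set limit.
from statistics import mean, stdev

def dynamic_threshold(history: list, k: float = 3.0) -> tuple:
    """Return (low, high) bounds learned from recent samples."""
    m, s = mean(history), stdev(history)
    return m - k * s, m + k * s

def is_anomalous(value: float, history: list, k: float = 3.0) -> bool:
    """Flag a value that falls outside the learned bounds."""
    low, high = dynamic_threshold(history, k)
    return not (low <= value <= high)
```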