Antonio Castillo Nieto, ATOS Spain
1 March 2021
With the advent of the microservices architecture paradigm and the development of cloud-native containerized applications, the operational management of applications has become more complex, but at the same time more flexible. An application composed of multiple microservices does not need to scale all of its services at the same time, nor keep the same number of replicas or instances of each service. This differs from traditional or monolithic applications, where it is necessary to scale the entire application, with the consequent inefficient use of resources, or at best to scale a particular layer if the application follows an N-tier architecture (frontend, business logic or backend, and data tiers).
Current container-based orchestration systems (Kubernetes, OpenShift, Docker Swarm, etc.) allow us to deploy microservices-based applications and manage their operating lifecycle at the level of individual microservices. To carry out this management efficiently, you need to define the expected quality of service for each component or microservice in terms of KPIs or metrics. Most orchestrators base these metrics on compute resources (CPU, memory) but lack the intelligence needed to use metrics based on other types of resources, such as storage or network. In addition, there are no tools that let the application owner define their own quality-of-service metrics and incorporate them into decision-making during the operation of the application or of a particular microservice.
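As an illustration, an owner-defined metric could be as simple as a named value with a target threshold and a direction. The sketch below is hypothetical (the class and field names are ours, not part of any orchestrator's API) and only shows the idea of declaring a custom, non-compute metric for a microservice:

```python
from dataclasses import dataclass

@dataclass
class QoSMetric:
    """A hypothetical application-owner-defined metric with a QoS target."""
    name: str
    unit: str
    threshold: float
    higher_is_worse: bool = True  # e.g. latency: lower is better

    def violated(self, observed: float) -> bool:
        """Return True when the observed value breaks the QoS target."""
        if self.higher_is_worse:
            return observed > self.threshold
        return observed < self.threshold

# Example: a storage-bound microservice cares about I/O latency, not CPU,
# so its owner declares a custom metric with a 20 ms target.
io_latency = QoSMetric(name="disk_read_latency_p95", unit="ms", threshold=20.0)
print(io_latency.violated(35.2))  # → True: 35.2 ms exceeds the 20 ms target
```

A metric like this would still need to be collected and exposed to the orchestrator; the point is only that the definition belongs to the application owner, not to the platform.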
When a quality of service defined by the application owner is violated, orchestration systems often have few alternatives, typically consisting of scaling out the application or microservice (a measure that relieves computing capacity). Sometimes, however, the bottleneck is not computing power or memory consumption: it can be the slowness of I/O operations in the storage layer, the transfer rate of the network interface, or heavy processing that goes beyond what a CPU provides, as in graphics, crypto and AI workloads, where a GPU or other specialized hardware is needed. To this list we can add the location of resources relative to the consumer, as in edge or IoT computing, where latency is sometimes critical.
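The idea that the remediation should match the bottleneck, rather than defaulting to scale-out, can be sketched as a simple decision function. The metric names and thresholds below are invented for illustration, not taken from any real orchestrator:

```python
def choose_remediation(metrics: dict) -> str:
    """Pick a remediation action based on which resource is the bottleneck,
    instead of always scaling out when QoS degrades (illustrative sketch)."""
    if metrics.get("cpu_utilisation", 0.0) > 0.85:
        return "scale-out"                # classic horizontal scaling
    if metrics.get("disk_io_wait_ms", 0.0) > 50.0:
        return "move-to-faster-storage"   # e.g. reschedule onto SSD-backed nodes
    if metrics.get("network_rtt_ms", 0.0) > 100.0:
        return "relocate-closer-to-edge"  # latency-critical edge/IoT consumers
    if metrics.get("gpu_queue_depth", 0) > 10:
        return "schedule-on-gpu-node"     # graphics, crypto and AI workloads
    return "no-action"

# A storage-bound service should not be scaled out; it should be moved.
print(choose_remediation({"disk_io_wait_ms": 120.0}))  # → move-to-faster-storage
```

A rule table like this is of course only the heuristic baseline; the next paragraph argues for richer decision-making on top of it.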
Because of all of the above, orchestration systems need tools that allow smarter decisions about resource utilization while meeting the expectations of both the application owner (QoS) and the potential consumers of the application (QoE). The tools needed include those for defining any type of metric, collecting those metrics, detecting QoS violations based on them, making complex decisions not only with rules or heuristics but also, for example, with machine learning (ML) systems, and letting the orchestrator perform more complex actions, such as defining network, security, storage, compute and resource-location policies, in near real time.
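One concrete aspect of the "detecting QoS violations" step is that decisions taken in near real time must not react to every transient spike, or the orchestrator would trigger expensive actions constantly. A common way to express this (the class below is our own sketch, not a Pledger component) is to require several consecutive breaches before declaring a violation:

```python
from collections import deque

class ViolationDetector:
    """Flag a QoS violation only after `window` consecutive breaches,
    so short-lived spikes do not trigger costly orchestration actions."""
    def __init__(self, threshold: float, window: int = 3):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record one sample; return True once the window is full of breaches."""
        self.samples.append(value)
        return (len(self.samples) == self.samples.maxlen
                and all(v > self.threshold for v in self.samples))

# Latency samples in ms against a 200 ms target: the single 180 ms
# sample resets the streak, so only the last reading fires a violation.
det = ViolationDetector(threshold=200.0, window=3)
for latency in (250, 180, 260, 270, 290):
    print(det.observe(latency))  # → False False False False True
```

Whether the violation then feeds a rule engine or an ML-based decision component is exactly the kind of choice the text above leaves open.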
At Pledger, we intend to offer some of these tools as outcomes of the project and to share them with open-source communities. If you confront these challenges in your workday, stay connected to Pledger to follow our progress.