How to Achieve AWS Operational Excellence in Your Cloud Workload

In today’s landscape, achieving operational excellence can be difficult, but not impossible. Operations is often viewed as distinct from the rest of the business, so it sometimes isn’t integrated into the flow of work the way other departments are.

We have seen the industry recognize this divide with the creation of DevOps—combining development and IT operations into one process to enable more streamlined creation and implementation of software throughout the software development lifecycle (SDLC).

Amazon Web Services (AWS) continues to publish design principles for building applications that adhere to its Well-Architected Framework. The best practices for the AWS Well-Architected Framework are based on five different pillars.

These are:

  1. Operational Excellence
  2. Security
  3. Reliability
  4. Performance Efficiency
  5. Cost Optimization

Focusing on the pillar of operational excellence, AWS has defined five design principles that spread across the areas of “organization,” “prepare,” “operate,” and “evolve.”

The five Operational Excellence design principles:

1. Perform operations as code. The beauty of the cloud is that you can apply the same scripting skills you use to code applications to your entire environment, including operations. This means you can reduce the need for human intervention by scripting code that will automate operations and trigger appropriate responses to any events or incidents.

2. Make frequent, small, reversible changes. When multiple, large changes are made at once, it becomes exceedingly difficult to troubleshoot a problem when things don’t work in production. When designing your workloads, allow for small, frequent deployments that are easily reversible, so that the source of a problem can be identified quickly when something isn’t running as intended in production.

3. Refine operations procedures frequently. There is always room for improvement. Continually analyzing and poking holes in your processes and procedures helps you to constantly increase the efficiency with which you serve your customers’ needs.

4. Anticipate failure. It is always better to expect failure, rather than assuming that what you’ve created is flawless. If you don’t anticipate errors, how can you catch them before deployment? This is effectively the process of threat modeling and risk assessment.

5. Learn from all operational failures. The point of going back and analyzing a failure is to learn from it. It is important to set up structures and processes that enable the sharing of learnings across teams and the business.
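As a rough illustration of the first principle, operations as code can start as a table that maps event types to scripted responses. The sketch below is plain Python; the event names, the `RUNBOOK` table, and the handler functions are hypothetical and stand in for whatever your monitoring actually emits.

```python
from typing import Callable, Dict, List

# Hypothetical responses, for illustration only. Real handlers would call
# your own tooling or cloud APIs instead of returning a description.
def restart_service(event: dict) -> str:
    return f"restart requested for {event['resource']}"

def open_ticket(event: dict) -> str:
    return f"ticket opened for {event['resource']}"

# Hypothetical runbook: each event type maps to an ordered list of responses.
RUNBOOK: Dict[str, List[Callable[[dict], str]]] = {
    "health_check_failed": [restart_service, open_ticket],
    "disk_nearly_full": [open_ticket],
}

def respond(event: dict) -> List[str]:
    """Run every scripted response registered for the event's type."""
    handlers = RUNBOOK.get(event["type"], [])
    return [handler(event) for handler in handlers]
```

Because the runbook is ordinary code, it can be version controlled, reviewed, and changed in small, reversible steps like any other part of the workload.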

The area of “organization” is critical to your success. It concerns the way your business organizes who is responsible for what, in relation to your engineering and operations departments. You want to ask: Who is responsible for the platform? Who is responsible for applications? How do we communicate between our different departments? At the end of the day, you need to be organized in a way that enables you to build the software and applications that fulfill your business’ strategy.

In order to make any decisions about organization, the following seven high-level organization priorities, as delivered by AWS, must first be reviewed and determined:

I. Evaluate external customer needs. Involve key stakeholders, including business, development, and operations teams, to determine where to focus efforts on external customer needs. This will ensure that you have a thorough understanding of the operations support that is required to achieve your desired business outcomes.

II. Evaluate internal customer needs. Engage key stakeholders to identify internal customer needs and operational support required for business outcomes. Prioritize improvement areas, such as skill development, workload performance, cost reduction, automation, and monitoring enhancement, based on established priorities. Continuously update priorities to adapt to changing needs.

III. Evaluate governance requirements. Organizational governance encompasses policies, rules, and frameworks guiding business goals. These requirements internally influence technology choices and workload operations. Incorporate them into your workload to ensure compliance and demonstrate implementation of governance requirements.

IV. Evaluate compliance requirements. Compliance requirements shape organizational priorities, potentially limiting technology and geographic choices. Conduct due diligence if no external frameworks exist. Validate compliance through audits/reports. For advertised compliance, establish internal processes for continuous adherence. Standards like PCI DSS, FedRAMP, and HIPAA vary based on data types and supported regions.

V. Evaluate threat landscape. Assess business threats, maintain a risk registry, and consider their impact when prioritizing efforts. The Well-Architected Framework offers a consistent approach to evaluate and scale architectures. Our Cloud Risk Self-Assessment gives you insight on how to improve your cloud risk posture in three simple steps.

VI. Evaluate tradeoffs. Evaluate tradeoffs and alternatives to make informed decisions when prioritizing efforts or selecting a course of action. For instance, prioritize speed to market over cost optimization or choose a relational database for non-relational data to simplify migration instead of using an optimized database.

VII. Manage benefits and risks. Balance benefits and risks when prioritizing efforts. Consider deploying workloads with unresolved issues to provide significant new features but mitigate associated risks. Address unacceptable risks as needed. Emphasize specific priorities when necessary. Maintain a balanced approach for long-term capability development and risk management. Update priorities based on changing needs.

Determine your business’s risk by looking at the possible attacks that could occur, as well as the likelihood of each one coming to fruition. While the cloud has been around for a while, we need to pay close attention to managing the risks it can introduce, as it is still considered a new ecosystem that we are all learning to manage. How we deploy software and manage patches and updates has an impact on the business’s threat landscape.
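One way to picture a risk registry is as a list of entries ranked by a simple score. The sketch below assumes a likelihood-times-impact score on 1 to 5 scales, which is a common convention rather than anything AWS prescribes; the `Risk` class and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact

def prioritize(risks: List[Risk]) -> List[Risk]:
    """Return the risks ordered from highest score to lowest."""
    return sorted(risks, key=lambda r: r.score, reverse=True)
```

Sorting the registry this way gives a starting order for discussion; the final priorities still depend on the tradeoffs and business context described above.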

In its report, Operational Excellence Pillar, AWS looks at engineering as the process of developing and testing applications and the infrastructure. Operations is then responsible for the deployment and ongoing maintenance of the applications and infrastructure in production. But it isn’t always this straightforward, and every business has its own processes, which is why AWS discusses four different operating models in the context of engineering and operations that businesses can use:

I. Fully Separated Operating Model
II. Separated Application Engineering and Operations (AEO) and Infrastructure Engineering and Operations (IEO) with Centralized Governance
III. Separated AEO and IEO with Centralized Governance and a Service Provider
IV. Separated AEO and IEO with Decentralized Governance

Note that it may be necessary to alter your business culture to conform to any one of these models.

The “prepare” area is where you start to get into the work that software developers are more familiar with. However, just because it is more familiar doesn’t mean it is more important than the area of organization. Without proper organization in your business and processes, it would be very difficult to address the other three areas required to fulfill your business’ strategy. AWS has broken this area into four actions:

I. Design telemetry into your cloud workloads

Telemetry provides you with information on the current health and risk level of your applications and infrastructure, giving you the ability to better manage and respond effectively to events or incidents. This is done predominantly with logs and metrics. Our Trend Micro Knowledge Base provides steps that you can take to confirm that AWS CloudTrail is enabled or that Amazon CloudWatch Logs are encrypted, with instructions on how to remediate according to best practice. It is also good to ensure that you have metrics configured to monitor things like the functional status of your APIs.
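A check like the CloudWatch Logs one can be sketched in a few lines. The function below takes log group records shaped like the `logGroups` entries returned by the CloudWatch Logs `describe_log_groups` call and lists the groups with no KMS key associated. Treating "no `kmsKeyId`" as "not encrypted to best practice" is an assumption made for this sketch, and the function name is hypothetical.

```python
from typing import Iterable, List

def log_groups_without_kms_key(log_groups: Iterable[dict]) -> List[str]:
    """Return the names of log groups that have no KMS key associated."""
    return [g["logGroupName"] for g in log_groups if not g.get("kmsKeyId")]

# With AWS credentials configured, the input could be collected with boto3:
#   import boto3
#   pages = boto3.client("logs").get_paginator("describe_log_groups").paginate()
#   groups = [g for page in pages for g in page["logGroups"]]
#   print(log_groups_without_kms_key(groups))
```

Keeping the check as a pure function over the response data makes it easy to test without touching a real account.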

You can audit your environment manually with 750+ industry best practices articles or give our free trial a shot and have your entire environment audited automatically in real time and continuously.

II. Improve your cloud workload flow

AWS says we need to adopt approaches that “enable refactoring, fast feedback on quality, and bug fixing.” Improving the way changes flow into production is what AWS is pointing to here. So, it is essential to have version control and ensure that you test and validate any changes before they reach production.

As a result, configuration management is a crucial topic. This relates back to one of the design principles: making small, frequent, and reversible changes is critical to build into our processes. It is good to set up services such as Amazon Simple Notification Service (Amazon SNS) to receive messages for services like AWS CloudFormation. Receiving a notification when stack events occur, such as create, update, and delete, allows for a faster response to unauthorized actions.
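To make the SNS idea concrete, the sketch below does two things: it builds the keyword arguments for a CloudFormation `create_stack` call with an SNS topic attached through `NotificationARNs`, and it flags existing stacks that have no topic attached. The helper names, the stack name, and the topic ARN are hypothetical; the stack records are assumed to be shaped like the `Stacks` entries returned by `describe_stacks`.

```python
from typing import Iterable, List

def stack_request(name: str, template_body: str, topic_arn: str) -> dict:
    """Build create_stack arguments that send stack events to an SNS topic."""
    return {
        "StackName": name,
        "TemplateBody": template_body,
        "NotificationARNs": [topic_arn],
    }

def stacks_without_notifications(stacks: Iterable[dict]) -> List[str]:
    """Return the names of stacks that have no SNS topic attached."""
    return [s["StackName"] for s in stacks if not s.get("NotificationARNs")]

# With AWS credentials configured, the request could be sent with boto3:
#   import boto3
#   boto3.client("cloudformation").create_stack(**stack_request(...))
```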

III. Deployment risk mitigation processes

There are many steps that can be taken to mitigate deployment risks. Before those, it is crucial to have the attitude that changes pushed to production don’t always work. This will help you to always be prepared. Before pushing to production, always look for what would cause a failure:

i. Test
ii. Validate
iii. Use deployment management systems
iv. Deploy small changes
v. Know how to reverse your changes before they are done
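The steps above can be sketched as a single function: apply a small change, validate it, and reverse it if validation fails. This is a minimal sketch in plain Python, not a deployment system; the `apply`, `validate`, and `rollback` callables are hypothetical placeholders for your own tooling.

```python
from typing import Callable

def deploy_with_rollback(
    apply: Callable[[], None],
    validate: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply a change and keep it only if validation passes.

    Returns True when the change is kept and False when it was rolled back.
    """
    try:
        apply()
        if validate():
            return True
    except Exception:
        # An error while applying or validating counts as a failed deployment.
        pass
    rollback()
    return False
```

Requiring the `rollback` callable up front reflects the last step in the list: the way to reverse a change is known before the change is made.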

IV. Understand your operational readiness

Once you understand what operational readiness is, the next step is to verify that your personnel are just as knowledgeable, so they can provide operational support. From there, you’ll want to determine whether or not you’ve automated everything you can.

The “operate” area includes three key understandings that are required to ensure you achieve your business outcomes. AWS says that it is critical to:

I. Understand workload health
II. Understand operational health
III. Respond to events

Understanding the health of your workloads or operations comes down to metrics. In order to know how to improve, it is critical to be able to show how things are functioning and how your customers are interacting with your sites. Enabling logging with Amazon CloudWatch Logs and then aggregating those logs for analysis is very important. These logs can help generate the information needed to produce the metrics you need to improve operations and can be delivered through AWS Health Events on the AWS Personal Health Dashboard. Our Trend Micro Knowledge Base also has rules to assist in the creation of logs and health events. It is possible to use these rules manually, or to use an automated tool, which is always looking for misconfigurations.
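As a small example of turning logs into metrics, the sketch below aggregates request log records into a request count and a server error rate. The record format (a dictionary with an integer `status` field holding an HTTP status code) is an assumption made for illustration, not a CloudWatch Logs format.

```python
from collections import Counter
from typing import Iterable

def status_metrics(records: Iterable[dict]) -> dict:
    """Summarize request logs into a count and a server error rate."""
    # Group status codes by their class: 2 for 2xx, 4 for 4xx, 5 for 5xx.
    classes = Counter(r["status"] // 100 for r in records)
    total = sum(classes.values())
    server_errors = classes[5]
    return {
        "requests": total,
        "server_errors": server_errors,
        "error_rate": server_errors / total if total else 0.0,
    }
```

A metric like the error rate is the kind of number that can be tracked over time to show whether operations are improving.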

Once the logs are created, delivered, and analyzed, it is possible to respond to an event. In ITIL language, an event is a change of state. Events may be planned and monitored, or unplanned and problematic. With the latter, we need to ensure that we are able to respond effectively.

AWS Systems Manager OpsCenter is a central place to manage issues. You can view, investigate, and resolve issues within this tool, while ensuring that information is kept confidential. There is a Trend rule for this: SSM Parameter Encryption. And as with all the rules, it is included in our automated tool. When beginning on the path to operational effectiveness, having an automated tool to analyze our cloud looking for missing configurations is essential.

Automating responses to detected events is the next step. You can utilize Amazon CloudWatch Events (now part of Amazon EventBridge) to create rules that respond to specific triggers. Without automated rules, alarms might get missed. For example, our Trend Knowledge Base and the tool have alarms to alert us when costs are reaching a threshold we have defined.
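A rule of this kind is driven by an event pattern. The sketch below defines a pattern for EC2 instances entering the stopped state, plus a deliberately simplified local matcher so the pattern can be tried against sample events. The matcher handles only exact-value lists and nested objects, not the full pattern syntax, and the rule name in the comment is hypothetical.

```python
import json

# Pattern for EC2 instance state changes where the new state is "stopped".
STOPPED_INSTANCE_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must be present in the event,
    with a value in the allowed list or a nested object that also matches."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

# With AWS credentials configured, the rule could be created with boto3:
#   import boto3
#   boto3.client("events").put_rule(
#       Name="stopped-instances",  # hypothetical rule name
#       EventPattern=json.dumps(STOPPED_INSTANCE_PATTERN),
#       State="ENABLED",
#   )
```

A rule only selects events; a target such as an SNS topic or a function still has to be attached for anything to happen in response.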

With the “evolve” area, AWS believes that, in the context of the cloud, to properly evolve, you must learn, share, and improve. For example, use your post-incident meetings to learn from what has occurred and make improvements for the future. There needs to be a process to manage and promote continuous improvement in an effort to change behaviors that are not working.

As more security breaches hit the news and data protection becomes a key focus, ensuring your organization adheres to the Well-Architected Framework’s design principles is crucial. Trend can help you stay compliant with the Well-Architected Framework with its 750+ best practice rules. As mentioned above, if you are interested in knowing how well-architected you are, see your own security posture in 15 minutes or less. Learn more by reading the other articles in the series: 1) Overview of All 5 Pillars 2) Security 3) Performance Efficiency 4) Reliability 5) Cost Optimization.

