The Azure Cloud Native Architecture Mapbook

Copyright © 2025 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Early Access Publication: The Azure Cloud Native Architecture Mapbook

Early Access Production Reference: B32644

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK

ISBN: 978-1-80580-505-2

www.packt.com

Table of Contents

  1. The Azure Cloud Native Architecture Mapbook, Second Edition: Design and build Azure architectures for infrastructure, applications, data, AI, and security
  2. 1 Getting Started as an Azure Architect
    1. Join our book community on Discord
    2. Getting to know architectural duties
      1. Enterprise architects
      2. Domain architects
      3. Solution architects
      4. Data architects
      5. Technical architects
      6. Security architects
      7. Infrastructure architects
      8. Platform architects
      9. Application architects
      10. Azure architects
      11. Architects versus engineers
    3. Getting started with the essential cloud vocabulary
      1. Cloud service models map
      2. IaaS (Infrastructure as a Service)
      3. PaaS (Platform as a Service)
      4. FaaS (Function as a Service)
      5. CaaS (Containers as a Service)
      6. DBaaS (database as a service)
      7. XaaS or *aaS (anything as a service)
    4. Introducing Azure Architecture Maps
      1. How to read a map
    5. Understanding the key factors of a successful cloud journey
      1. Defining the vision with the right stakeholders
      2. Defining the strategy with the right stakeholders
      3. Starting implementation with the right stakeholders
      4. Practical scenario
    6. Summary
  3. 2 Solution Architecture
    1. Join our book community on Discord
    2. Technical requirements
    3. The Solution Architecture Map
    4. Zooming in on the different workload types
      1. Understanding Systems of Engagement
      2. Understanding Systems of Record
      3. Understanding Systems of Insight
      4. Understanding Systems of Intelligence
      5. Understanding Systems of Integration (IPaaS)
    5. Zooming in on containerization
    6. Looking at cross-cutting concerns and non-functional requirements
      1. Learning about monitoring
      2. Learning about factories (CI/CD)
      3. Learning about identity
      4. Learning about security
      5. Learning about networking
      6. Learning about governance/compliance
    7. Retiring or retired services
    8. Solution architecture use case
      1. Use case scenario
      2. Using keywords
      3. Using the Solution Architecture Map against the requirements
      4. Building the target reference architecture
      5. Understanding the gaps in our reference architecture
    9. Real-world observations
    10. Summary
  4. 3 Infrastructure Design
    1. Join our book community on Discord
    2. Technical requirements
    3. The Azure Infrastructure Architecture Map
    4. Zooming in on networking
      1. Looking at the hybrid connectivity options
      2. Looking at the most common architectures
      3. Looking at the routing options
      4. Looking at the DNS options
    5. Zooming in on monitoring
    6. Zooming in on High Availability and Disaster Recovery
    7. Zooming in on HPC
    8. Zooming in on Azure Hybrid solutions
    9. Key advice from the field
    10. Global API Platform Use Case (PaaS)
    11. Hub and Spoke Use Case (IaaS)
    12. Summary
  5. 4 Working with Azure Kubernetes Service (AKS)
    1. Join our book community on Discord
    2. Technical requirements
    3. The AKS Architecture Map
    4. Zooming in on fundamental architectural concepts and the essential Kubernetes resource types
    5. Zooming in on networking
    6. Zooming in on cluster management and deployment
    7. Zooming in on scaling and monitoring
    8. Zooming in on high availability and disaster recovery
    9. Zooming in on the main add-ons, extensions and options
    10. Key advice from the field
    11. Use case – Multi-tenant SaaS on AKS
      1. Scenario and first analysis
      2. Diagrams
      3. Code samples
    12. Summary
  6. 5 Other Container Services
    1. Join our book community on Discord
    2. Technical requirements
    3. A quick look at Azure Container Instances (ACI)
    4. A quick look at Azure Web App for Containers (AWC)
    5. A quick look at Azure Function Containers (AFC)
    6. A quick look at Azure Red Hat OpenShift (ARO)
    7. A quick look at Azure Container Apps (ACA)
    8. Extensive comparison between container services
    9. Use case – Microservices
      1. Scenario and first analysis
      2. Diagrams
      3. Code samples
      4. Testing the solution
    10. Summary
  7. 6 Developing and Designing Applications with Azure
    1. Join our book community on Discord
    2. Technical requirements
    3. What does it mean to be a cloud-native developer?
    4. The Azure Application Architecture Map
    5. Zooming in on the local development experience
    6. Zooming in on core services and concepts developers must master
    7. Zooming in on some cloud-native application design patterns and architecture styles
    8. Use case – Serverless invoice processing pipeline
      1. Diagrams
      2. Code samples
      3. Testing the solution
    9. Summary
  8. 7 Data Architecture
    1. Join our book community on Discord
    2. Technical requirements
    3. The Azure Data Architecture Map
    4. Zooming in on data platform capabilities
      1. Data ingestion
      2. Data processing
      3. Data storage (raw)
      4. Serving and data visualization
      5. Other data capabilities
      6. Patterns to services
    5. Zooming in on HTAP
    6. Zooming in on miscellaneous data aspects
      1. Infrastructure enablers
      2. Operational Database Systems
      3. Governance
    7. Use case – Smart Fridge monitoring in retail stores
      1. Diagrams
      2. Code samples
      3. Testing the solution
    8. Summary

The Azure Cloud Native Architecture Mapbook, Second Edition: Design and build Azure architectures for infrastructure, applications, data, AI, and security

Welcome to Packt Early Access. We’re giving you an exclusive preview of this book before it goes on sale. It can take many months to write a book, but our authors have cutting-edge information to share with you today. Early Access gives you an insight into the latest developments by making chapter drafts available. The chapters may be a little rough around the edges right now, but our authors will update them over time. You can dip in and out of this book or follow along from start to finish; Early Access is designed to be flexible. We hope you enjoy getting to know more about the process of writing a Packt book.

  1. Chapter 1: Getting Started as an Azure Architect
  2. Chapter 2: Solution Architecture
  3. Chapter 3: Infrastructure Design
  4. Chapter 4: Working with Azure Kubernetes Service (AKS)
  5. Chapter 5: Other Container Services
  6. Chapter 6: Developing and Designing Applications with Azure
  7. Chapter 7: Data Architecture

1 Getting Started as an Azure Architect

Join our book community on Discord

https://packt.link/0nrj3

In this chapter, we will focus on what an architect's role entails and explain the various cloud service models that are made available by the Microsoft Azure platform. We will describe how the numerous maps in this book are built, what they intend to demonstrate, and how to make sense of them. More specifically, in this chapter, we will cover the following topics:

  • Getting to know architectural duties
  • Getting started with the essential cloud vocabulary
  • Introducing Azure Architecture Maps
  • Understanding the key factors of a successful cloud journey

Our purpose is to help you learn the required vocabulary that is used across the book. You will also understand the duties of an Azure architect. We will explain the most frequently used service models and their typical associated use cases, which every Azure architect should know. We start smoothly but beware that the level of complexity will increase as we go. Let's start by getting acquainted with the definition of an architect.

Getting to know architectural duties

Before we define what an Azure architect is, let's first define what an architect's role is and how our maps specialize to reflect these different profiles. The word architect is used everywhere on the IT planet. Many organizations have their own expectations when it comes to defining the tasks and duties of an architect. Let's share our own definitions as well as some illustrative diagrams.

Enterprise architects

Enterprise architects oversee the IT and business strategies, and they make sure that every IT initiative is in line with the enterprise business goals. They report directly to IT leadership and are sometimes scattered across business lines. They are also the guardians of building coherent and consistent overall IT landscapes for their respective companies. Given their broad role, enterprise architects have a helicopter view of the IT landscape; they do not directly deal with deep-dive technical topics, nor do they look in detail at specific solutions or platforms, such as Azure, unless a company puts all its assets in Azure. In terms of methodology, they often rely on TOGAF (The Open Group Architecture Framework) and on ArchiMate for modeling. The typical type of diagram they deal with looks like the following:

Figure 1.1 – Capability Viewpoint: ArchiMate

As you can see, this is very high level and not directly related to any technology or platform. Therefore, this book focusing on Azure is not intended for enterprise architects, but they are, of course, still welcome to read it!

Domain architects

Domain architects own a single domain, such as the cloud. In this case, the cloud is broader than just Azure, as it would probably encompass both public and private cloud providers as well as SaaS (Software as a Service) solutions. Domain architects are tech-savvy, and they define their domain roadmaps while supervising domain-related initiatives. Compared to enterprise architects, their scope is more limited, but it is still too broad to master the bits and bytes of an entire domain. This book, and more particularly our generic maps, will certainly be of great interest to cloud domain architects. Diagram-wise, domain architects also rely on TOGAF and other architecture frameworks, but scoped to their domain.

Solution architects

Solution architects help different teams build solutions. They have T-shaped skills, which means that they are specialists in a given field (the base of the T) but can also collaborate across disciplines with other experts (the top of the T). Solution architects are usually in charge of designing solution diagrams, and they tackle non-functional requirements, such as security, performance, and scalability. Solution architects may build both high-level, technology-agnostic diagrams, referred to as ABBs (Application Building Blocks), and more concrete implementations of ABBs, called SBBs (Solution Building Blocks). Azure solution architects typically focus more specifically on SBBs. While the concepts of ABBs and SBBs are borrowed from TOGAF, one might attempt an analogy with the C4 model, where ABBs roughly correspond to C4's System Diagrams and SBBs to C4's Container Diagrams. Figure 1.2 is an example of a C4 System Diagram that could be seen as an ABB:

Figure 1.2 – C4 System Diagram illustrating a Hybrid API pattern

Figure 1.2 depicts a high-level view of a hybrid API pattern where on-premises clients reach out to cloud-hosted APIs. It shows the different systems involved in this scenario, but there is no information about the technologies involved. Conversely, Figure 1.3 gives much more detail about what a concrete implementation would look like:

Figure 1.3 – C4 Container Diagram illustrating a Hybrid API pattern

From this diagram, we identify several Azure building blocks, including Virtual Networks, Azure Firewall, and Azure API Management. Entra ID is used as the authorization server. Azure Solution Architects will be primarily interested in our next chapter, Solution Architecture.

Data architects

Data architects oversee the entire data landscape. They mostly focus on designing data platforms for storage, insights, and advanced analytics. They deal with data modeling, data quality, and business intelligence, which consists of extracting valuable insights from data in order to realize substantial business benefits. A well-organized data architecture should ultimately deliver the DIKW (Data, Information, Knowledge, Wisdom) pyramid shown in Figure 1.4:

Figure 1.4 – DIKW pyramid

Organizations have a lot of data, from which they try to extract valuable information, knowledge, and gain wisdom over time. The more you climb the pyramid, the higher the value. Consider the following scenario to understand the DIKW pyramid:

Figure 1.5 – DIKW pyramid example

Figure 1.5 shows that we start with raw data, which does not really make sense without context. These are just numbers. At the information stage, we understand that 31 stands for the day, 3 for the month of March, and 3,000 for the number of visits. Now, these numbers mean something. The knowledge block is self-explanatory. We have analyzed our data and noticed that, year after year, March 31 is a busy day. Thanks to this valuable insight, we can make the wise decision to restock our warehouses up front to make sure we do not run short on goods. Data architects are also responsible for developing reference architectures that support a variety of data processing needs, including file-based batch ingestion, real-time streaming, Internet of Things (IoT) data flows, and both ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) pipelines. Their role extends beyond technical design, as they also guide how modern technologies like Artificial Intelligence (AI) can be strategically leveraged to gain deeper business insights, enhance decision-making capabilities, and foster a data-driven culture throughout the organization. Figure 1.6 is an example of an IoT diagram a data architect might build:

Figure 1.6 – Example of IoT architecture

Figure 1.6 illustrates the flow of data from sensors within an industrial network, passing through an IT network via Azure IoT Edge, and ultimately reaching Azure IoT Hub. From there, the data ingestion process begins, enabling further analysis. That is, among other things, the work of a data architect: helping organizations learn from their data. Data is the new gold rush, and Azure, as well as Microsoft Fabric, has a ton of data and AI services in its catalog, which we will cover in Chapter 7, Data Architecture, as well as in Chapter 8, Artificial Intelligence Architecture.
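To give a feel for the device side of such a flow, here is a minimal sketch using the Azure IoT Hub device SDK for Python. The connection string, device ID, and payload are illustrative placeholders, not values from this architecture:

```python
# Minimal sketch: a sensor pushing telemetry to Azure IoT Hub.
# The connection string and payload are illustrative placeholders.
import json

from azure.iot.device import IoTHubDeviceClient, Message

client = IoTHubDeviceClient.create_from_connection_string(
    "<device-connection-string>"
)
client.connect()

# A single telemetry reading; a real device would loop on a schedule.
reading = {"deviceId": "sensor-001", "temperatureC": 4.2}
client.send_message(Message(json.dumps(reading)))

client.shutdown()
```

Once messages land in IoT Hub, the ingestion and analysis stages shown on the right-hand side of the diagram can take over.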

Technical architects

Technical architects have deep vertical knowledge of a platform or technology stack, and they have hands-on practical experience. They usually coach developers, IT professionals, and DevOps engineers in the day-to-day implementation of a solution. They contribute to reference architectures by zooming inside some of the high-level components. They often act as a liaison between solution architects and engineers. For instance, if a reference architecture designed by a solution architect makes use of Azure Kubernetes Service (AKS), the technical architect might zoom inside the AKS cluster to bring extra information on the cluster internals and some specific technologies. To illustrate this, Figure 1.7 shows an excerpt from a high-level diagram a solution architect might have created:

Figure 1.7 – Reference architecture example

Figure 1.8 shows the corresponding blueprint created by the technical architect:

Figure 1.8 – Equivalent blueprint from a technical architect

Figure 1.8 contains precise technologies, such as KEDA (Kubernetes-based Event-Driven Autoscaling), and some Azure components, such as Virtual Networks, Azure Kubernetes Service, and Azure Blob Storage. Technical architects will mostly be interested in our detailed maps and the different use cases distributed throughout the book.

Security architects

In this hyper-connected world, security has become a top priority for every organization. Security architects have vertical knowledge of the security field. They usually deal with regulatory or in-house compliance requirements. The cloud, and more particularly the public cloud, often heightens security concerns (much more than equivalent on-premises systems and applications do). With regard to diagrams, security architects will add a security view (or request one) to the reference solution architectures, such as the following:

Figure 1.9 – Simplified security view example

In Figure 1.9, the focus is set on pure security concerns: encryption in transit (TLS 1.3) between the browser and the WAF (Web Application Firewall). The WAF ensures minimal protection against the well-known OWASP (Open Web Application Security Project) vulnerabilities. The virtual network is divided into multiple subnets to clearly identify the different layers of the solution and to be able to define Network Security Group rules for every subnet. The security-minded person in you will immediately spot a transition from TLS 1.3 to TLS 1.2, which is due to the fact that, at the time of writing this book, API Management is not yet compatible with TLS 1.3. The API gateway acts as a PEP (Policy Enforcement Point) before it forwards the request to the backend service. The backend authenticates to the database using MSI (Managed Service Identity), and the database is encrypted at rest with customer-managed keys. Azure Key Vault is used to store the key and must allow trusted services access in the public firewall. Figure 1.9 effectively integrates three key dimensions (networking, identity, and encryption), highlighting the primary areas of focus for security architects. Security architects not only evaluate various solutions but also play a key role in strengthening the organization's overall security posture. They will assess Azure and container-based solutions using frameworks like MITRE ATT&CK, oversee SAST (Static Application Security Testing), DAST (Dynamic Application Security Testing), and penetration testing, and ensure seamless integration with SIEM (Security Information and Event Management) and SOAR (Security Orchestration, Automation and Response) systems, among other things. However, as we will explore further in Chapter 9, Security Architecture, mastering cloud and cloud-native security is a tough challenge for a traditional (on-premises) security architect. Cloud-native defense in depth primarily relies on identity, while traditional defense in depth heavily relies on the network perimeter. Over the past few years, Azure has also evolved towards a more perimeter-centric approach. Nevertheless, there are still gaps between the cloud and non-cloud worlds.
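As a hedged illustration of two of the building blocks in Figure 1.9, the following Python sketch shows how a backend could obtain a database token through its managed identity and read a secret from Azure Key Vault. The vault URL and secret name are placeholders invented for this example:

```python
# Sketch: managed identity (MSI) authentication and Key Vault access.
# The vault URL and secret name are illustrative placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Picks up the managed identity when running on Azure compute.
credential = DefaultAzureCredential()

# Token the backend presents to Azure SQL instead of storing a password.
token = credential.get_token("https://database.windows.net/.default")

# Keys and secrets live in Key Vault, never in application code or config.
secrets = SecretClient(
    vault_url="https://contoso-kv.vault.azure.net", credential=credential
)
api_key = secrets.get_secret("backend-api-key").value
```

The point of the sketch is that no credential ever appears in code or configuration; identity is the backbone of cloud-native defense in depth.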

Infrastructure architects

Infrastructure architects focus on building IT systems that host applications, or systems that are sometimes shared across workloads. They play a prominent role in setting up hybrid infrastructures, which bridge both the cloud and the on-premises world. Their diagrams reflect an infrastructure-only view, often related to the concept of a landing zone, which consists of defining how and where business assets will be hosted. A typical infrastructure diagram that comes to mind for a hybrid setup is the Hub and Spoke architecture (https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/hybrid-networking/hub-spoke):

Figure 1.10 – Hub and spoke architecture

Figure 1.10 is a simplified view of the hub and spoke, which, in reality, is often much more complex than this. Infrastructure architects are also involved in topics such as backup/restore, disaster recovery, and so on, which we will explore further in Chapter 3, Infrastructure Design. We will also stress some important aspects related to legacy processes, so as to maximize the chances of a successful cloud journey.

Platform architects

Platform architects have become increasingly important in recent years, playing a key role in building cloud platforms. Beyond managing infrastructure, they specialize in designing efficient, fully CI/CD-driven platforms that empower application teams to accelerate development and deployment. Their growing importance coincides with the rise of container orchestration platforms, which bring new challenges and an entirely new mindset nurtured by DevOps and GitOps practices. Platform architects can be seen as infrastructure architects 2.0, who have already fully embraced cloud-native practices and automation in their daily routine. They often have a mixed background, allowing them to bridge the traditional silos of infrastructure and application teams.

Application architects

Application architects focus on building features that are requested by the business. Unlike other architects, they are not primarily concerned with non-functional requirements. Their role is to enforce industry best practices and coding design patterns to build maintainable and readable applications. Their primary focus is on code quality and adherence to well-known principles, such as SOLID (https://en.wikipedia.org/wiki/SOLID), DRY (Don't Repeat Yourself), and Clean Code. With regard to Azure, their primary concerns are to integrate with the various Azure services and SDKs, as well as to leverage cloud and cloud-native patterns that help them support their business case. Beyond this book, a good source of information for them is the Microsoft documentation on cloud design patterns (https://docs.microsoft.com/en-us/azure/architecture/patterns/). The main challenge for application architects is to correctly understand the broader Azure ecosystem. Today, there is a clear shift toward breaking down monolithic architectures. This involves decomposing systems into multiple decoupled components, resulting in highly distributed architectures. In modern applications, many common responsibilities are offloaded to specialized services, which are mainstream in the cloud but often unfamiliar to traditional application architects. For example, API gateways come with built-in policies for API throttling, token validation, and caching, eliminating the need for custom implementations. Message brokers, such as Azure Service Bus, have built-in features, such as entity forwarding and sessions, which would be a pain to implement in code. Application architects should familiarize themselves with the extensive capabilities offered by the Azure service catalog, which is not an easy task. Another key consideration for application architects is the cloud's horizontal scaling model, which requires applications and services to be multi-instance aware, an aspect rarely addressed in monolithic designs, which still prevail in many on-premises data centers. We will explore cloud-native practices further in Chapter 6, Application Architecture.
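To make the broker point concrete, here is a minimal sketch, assuming the azure-servicebus Python SDK and a placeholder connection string and queue name, of sessions providing per-key ordered processing out of the box, something that would otherwise require custom code:

```python
# Sketch: session-aware messaging with Azure Service Bus.
# The connection string and queue name are illustrative placeholders.
from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONN = "<service-bus-connection-string>"

with ServiceBusClient.from_connection_string(CONN) as client:
    # Messages sharing a session_id are delivered in order to one consumer.
    with client.get_queue_sender("orders") as sender:
        sender.send_messages(
            ServiceBusMessage(b'{"orderId": 1}', session_id="customer-42")
        )

    # Lock onto that session and process its messages sequentially.
    with client.get_queue_receiver("orders", session_id="customer-42") as receiver:
        for msg in receiver.receive_messages(max_wait_time=5):
            receiver.complete_message(msg)
```

A traditional application architect would be tempted to code this ordering guarantee by hand; in the cloud, it is a queue setting plus one keyword argument.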

Azure architects

From the top to the bottom of our enumeration, the IT landscape shrinks, from the broadest to the narrowest scope. It would be very surprising to ever meet an Azure enterprise architect. Similarly, it is unlikely that we will stumble upon an Azure domain architect, since the parent domain would rather be - the cloud - which is much broader than just Azure.However, it makes sense to have Azure-focused solution architects, technical architects, and data architects, because they get closer to the actual implementation of a solution or platform. Ideally, Azure architects should be subject matter experts in all architectural disciplines depicted earlier, as well as in all service models described in the following sections.

Architects versus engineers

Before we move on, we need to address the engineer that we all have inside of us! What differentiates architects from engineers is probably the fact that most architects have to deal with non-functional requirements. In contrast, engineers, such as developers and IT engineers, are focused on delivering and maintaining the features and systems requested by the business, which makes them very close to the final solution. While this book primarily focuses on architecture, it also aims to provide value to engineers by featuring in-depth technical explanations. Now that we are familiar with the different architect roles, it is time to get started with the different service models and acquire the essential vocabulary that every Azure architect should know.

Getting started with the essential cloud vocabulary

In this section, we will cover the essential basic skills every Azure architect should have. The cloud has different service models, which all serve different purposes. It is very important to understand the advantages and drawbacks of each model, and to get acquainted with the jargon relating to the cloud.

Cloud service models map

Figure 1.11 is a sample map, which depicts the different cloud service models and introduces some vocabulary. This map adds two extra dimensions (costs and operations) to each service model, as well as some typical use cases:

Figure 1.11 – Cloud service models

In terms of cost models, we see two big trends: consumption and pre-paid compute/plans. The consumption billing model is based on the actual consumption of dynamically allocated resources. Pre-paid plans allocate compute capacity at all times, independent of the actual resource consumption. In terms of operations, the map highlights what is done by the cloud provider and what you still have to do yourself. For instance, very low means that you have almost nothing to do yourself.
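To make the cost difference tangible, here is a small illustrative calculation in Python. All prices are hypothetical round numbers made up for the sake of the example; they are not actual Azure rates:

```python
# Break-even sketch: pre-paid plan versus consumption billing.
# All prices are hypothetical, for illustration only -- not Azure rates.

PREPAID_MONTHLY = 150.00           # fixed fee, capacity always allocated
PRICE_PER_MILLION_CALLS = 0.20     # consumption: per-execution charge
PRICE_PER_GB_SECOND = 0.000016     # consumption: memory x duration charge

def consumption_cost(calls_per_month: int, avg_gb_seconds: float) -> float:
    """Monthly consumption bill for a given workload profile."""
    return (calls_per_month / 1_000_000) * PRICE_PER_MILLION_CALLS \
        + calls_per_month * avg_gb_seconds * PRICE_PER_GB_SECOND

# A spiky workload that idles most of the day stays far below the plan...
print(consumption_cost(2_000_000, 0.5))   # ~16.40 -> consumption wins
# ...while a sustained workload overshoots it.
print(consumption_cost(60_000_000, 0.5))  # ~492.00 -> pre-paid wins
```

We will now walk through each service model.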

IaaS (Infrastructure as a Service)

IaaS (Infrastructure as a Service) is probably the least disruptive model. It essentially amounts to renting data center capacity from a cloud provider: business as usual (the most common scenario), but in the cloud. IaaS is not the service model of choice to accomplish a digital transformation, but there are a number of scenarios that we can tackle with IaaS:

  • The lift-and-shift of existing workloads to the cloud.
  • IaaS is a good alternative for smaller companies that do not want to invest in their own data center.
  • In the context of a disaster recovery strategy, when adding a cloud-based data center to your existing on-premises servers.
  • When you are short on compute in your own data center(s).
  • When launching a new geography (for which you do not already have a data center), and to inherit the cloud provider's compatibility with local regulations.
  • To speed up time to market a little, by optimizing some legacy practices and processes to align with the cloud delivery model.
  • By leveraging Infrastructure as Code to automate most deployments (see the sketch after this list).
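The following is a minimal sketch of that last point, assuming the azure-identity and azure-mgmt-resource Python packages; the subscription ID, names, and the (empty) ARM template are placeholders:

```python
# Sketch: Infrastructure as Code with the Azure SDK for Python.
# Subscription ID, names, and the template are illustrative placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, "<subscription-id>")

# Ensure the target resource group exists.
client.resource_groups.create_or_update("rg-demo", {"location": "westeurope"})

# An empty ARM template; in practice it would declare VMs, disks, networks...
template = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [],
}

poller = client.deployments.begin_create_or_update(
    "rg-demo",
    "demo-deployment",
    {"properties": {"mode": "Incremental", "template": template, "parameters": {}}},
)
poller.result()  # block until the deployment completes
```

The same effect can be achieved with declarative tooling such as Bicep or Terraform; the point is that the whole environment is reproducible from source control.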

With regard to costs and operations, they are almost equivalent to on-premises, although it is very hard to compare the total cost of ownership (TCO) of IaaS versus on-premises. Of course, facilities, physical access to the data center, and other such concerns are all managed by the cloud provider. It is no longer necessary to buy and manage the hardware and infrastructure software yourself. Many companies today have a hybrid strategy, which consists of keeping a certain number of assets on-premises while gradually expanding their cloud footprint. In this context, IaaS is often required to bridge the on-premises and cloud worlds.

PaaS (Platform as a Service)

PaaS (Platform as a Service) is a fully managed service model that helps you build new solutions or refactor existing ones much faster. PaaS features off-the-shelf services that already come with built-in functionalities and whose underlying infrastructure is fully outsourced to the cloud provider. PaaS is quite disruptive with regard to legacy systems and practices. Because PaaS is very different from traditional IT, it demands strong engagement across all levels of the organization and the backing of a top executive sponsor to ensure the necessary mindset shift occurs. Make no mistake: going to the cloud is a journey, especially when adopting cloud-native service models. With PaaS, much of the infrastructure and most operations are delegated to the cloud provider. The multi-tenant offerings are cost-friendly, and you can easily leverage the economies of scale, provided you adopt PaaS for what it is. PaaS is suitable for many scenarios:

  • Green-field projects
  • Internet-facing workloads
  • The modernization of existing workloads
  • API-driven architectures
  • IoT and AI workloads
  • A mobile-first user experience
  • An anytime-anywhere scenario, and on any device

The preceding list of use cases is far from exhaustive, but it should give you an idea of what this service model's value proposition is.

FaaS (Function as a Service)

FaaS (Function as a Service) is also known as serverless. It all started with stateless functions that were executed on shared multi-tenant infrastructures. Nowadays, FaaS has expanded far beyond functions and is the most elastic flavor of cloud computing. While the infrastructure is also completely outsourced to the cloud provider, the associated costs are calculated based on the actual resource consumption (unlike PaaS, where the cloud consumer pre-pays a monthly fee based on a pricing tier). From a cost perspective, FaaS is ideal for non-sustained workloads, meaning workloads that frequently idle throughout the day. If the workload remains consistently active without idle periods, FaaS can become more expensive than pre-paid PaaS plans. FaaS is ideal in numerous scenarios:

  • Event-driven architectures: Receive event notifications and trigger activities accordingly. For example, an Azure Function triggered by the arrival of a blob on Azure Blob Storage, parsing it, and notifying other processes about the current status of activities (see the sketch after this list).
  • Messaging: Azure Functions, Logic Apps, and even Event Grid can all be hooked to Azure Service Bus, handle upcoming messages, and, in turn, push their outcomes back to the bus.
  • Batch jobs: You might trigger Azure Logic Apps or scheduled Azure Functions to perform some jobs.
  • Asynchronous scenarios of all kinds: Performing activities by leveraging the Fan-Out/Fan-In pattern is a good example.
  • Unpredictable system resource growth: When you do not know in advance what the usage of your application is, but you do not want to invest too much in the underlying infrastructure, FaaS may help absorb this sudden resource growth in a cost-friendly fashion.
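Here is a minimal sketch of that blob-triggered scenario, using the Azure Functions Python v2 programming model; the container path and connection setting name are illustrative placeholders:

```python
# Sketch: an Azure Function triggered by a new blob (Python v2 model).
# The container path and connection setting are illustrative placeholders.
import logging

import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob", path="incoming/{name}",
                  connection="AzureWebJobsStorage")
def on_blob_arrived(blob: func.InputStream):
    # Parse the blob and notify downstream processes here.
    logging.info("New blob: %s (%s bytes)", blob.name, blob.length)
```

Note that there is no infrastructure to declare at all: the platform scales instances up and down (to zero) with the arrival rate of blobs.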

Some services are solely based on the FaaS service model, but many PaaS services have an equivalent serverless pricing tier that allows cloud consumers to focus on building their applications while benefiting from elastic scaling in a cost-friendly way. While FaaS offers elasticity, it is best suited for workloads that can tolerate moderate latency and temporary capacity reductions. In recent years, Azure has experienced significant capacity constraints across multiple regions, affecting all compute models, though FaaS has been particularly impacted. Additionally, although FaaS is an outstanding service model, it is often shunned by large enterprises because of reduced control over the security perimeter. FaaS is an even more disruptive service model than PaaS with regard to traditional IT and security practices, which may slow down its adoption in large companies.

CaaS (Containers as a Service)

CaaS (Containers as a Service) sits between PaaS and IaaS. Containerization has become the new normal, and cloud providers could not miss that train. CaaS often involves more operations than PaaS. For example, AKS (Azure Kubernetes Service) and ARO (Azure Red Hat OpenShift) involve frequent upgrades of the Kubernetes version on both the control plane and the worker nodes, although recent versions feature an auto-upgrade mechanism. The operational workload increases even more when sharing a cluster among multiple applications, as it involves resource sharing, cost reporting, and dealing with challenges such as noisy neighbors, network isolation, and more. Moreover, it is not uncommon to see solutions built around services such as MongoDB or RabbitMQ, also hosted within the cluster, requiring mechanisms for backup, replication, resilience, and so on, whereas services like Azure Service Bus and Cosmos DB natively include these features. Working with services such as AKS and ARO is like opening Pandora's box, as the CNCF (Cloud Native Computing Foundation) ecosystem is vast and complex. Here is a pointer to the CNCF landscape: https://landscape.cncf.io/. On the other hand, services such as Web App for Containers, ACA (Azure Container Apps), and Azure Container Instances (ACI) are fully managed by Microsoft and tend to reduce the level of complexity. ACI even sits between serverless (the consumption-pricing model) and CaaS. Admittedly, CaaS is probably the hardest model when it comes to evaluating both costs and the level of operations, because both vary according to how you use these services. It is, nevertheless, suitable for the following scenarios:

  • Lift-and-shift: During a transition to the cloud, a company might want to simply lift and shift its assets, which means migrating them as containers. Most assets can be packaged as containers without the need to refactor them entirely.
  • Cloud-native workloads: Building modular, stateless and resilient solutions by leveraging Kubernetes and its broad ecosystem.
  • Portability: CaaS offers a greater portability, and helps to reduce the vendor lock-in risk to some extent.
  • Microservices: Most microservice architectures rely on service meshes, which are mainstream in the world of container orchestrators. We will cover them in Chapter 4, Working with Azure Kubernetes Service.
  • Modern deployment: CaaS uses modern deployment techniques, such as A/B testing, canary releases, and blue-green deployment. These techniques prevent and reduce downtime in general, through self-healing orchestrated containers.
  • Event-driven applications: CNCF solutions such as KEDA enable intelligent scaling of workloads based on system and custom events, optimizing compute resource utilization. This, combined with the agility and robustness of container orchestrators, makes CaaS a first-class citizen for any event-driven architecture.

Additionally, ACI is great for running isolated, short-lived batch jobs without the overhead of a full Kubernetes cluster. We will explore the CaaS world in Chapter 2, Solution Architecture; AKS in Chapter 4, Working with Azure Kubernetes Service; and the remaining services in Chapter 5, Other Container Services.

DBaaS (database as a service)

DBaaS (Database as a Service) is a fully managed service model that exposes storage capabilities. Data stores, such as Azure SQL, Cosmos DB, and Azure Database for MySQL, significantly reduce operations while offering strong high availability and disaster recovery options. Other services, such as Databricks, Data Factory, and Synapse, do not strictly belong to the DBaaS category, but we will combine them for the sake of simplicity. Azure DBaaS was initially based on pre-paid resource allocation, but Microsoft introduced the serverless model in order to offer more elastic databases. DBaaS brings the following benefits:

  • A reduced number of operations, since backups are automatically taken by the cloud provider, while remaining configurable.
  • Fast processing with Table Storage
  • Advanced replication mechanisms, thanks to built-in zonal and regional redundancy.
  • Potentially infinite scalability with Cosmos DB, provided proper engineering practices were adopted up front (see the sketch after this list)
  • Cost optimization, when the pricing model is well chosen and fits the scenario
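The following is a minimal sketch of that Cosmos DB point, assuming the azure-cosmos Python SDK; the endpoint, key, and names are illustrative placeholders. The partition key chosen at container creation is the up-front engineering decision that determines how the data scales out:

```python
# Sketch: Cosmos DB scalability hinges on a well-chosen partition key.
# Endpoint, key, and names are illustrative placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(
    "https://contoso.documents.azure.com", credential="<account-key>"
)
db = client.create_database_if_not_exists("retail")

# The partition key decided here drives how the container scales out later.
container = db.create_container_if_not_exists(
    id="store-visits", partition_key=PartitionKey(path="/storeId")
)
container.upsert_item({"id": "2025-03-31", "storeId": "store-42", "visits": 3000})
```

Pick a partition key with high cardinality and an even access pattern; changing it later effectively means migrating the data.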

We will explore DBaaS in Chapter 7, Data Architecture.

XaaS or *aaS (anything as a service)

Other service models exist, such as Model as a Service (MaaS) and Identity as a Service (IDaaS), to such an extent that the acronym XaaS, or *aaS, was born around 2016 to designate all the possible service models. It is important for an Azure architect to grasp these different models, as they serve different purposes, require different skills, and directly impact the cloud journey of a company.

Important note

We do not cover SaaS in this book. SaaS is a fully managed suite of business software that often relies on a cloud platform for its underlying infrastructure. SaaS examples include Salesforce and Adobe Creative Cloud, as well as Microsoft's own Office 365, Power BI, and Dynamics 365 (among others).

Now that we reviewed the most important service models, let's dive a little more into the rationale behind our maps.

Introducing Azure Architecture Maps

Although we have already presented a small map, let's explain how Azure Architecture Maps were born and how to make sense of them. However rich the official Microsoft documentation might be, most of it is textual and straight to the point, with walk-throughs and some reference architectures. While this type of information is necessary, it is quite hard to grasp the broader picture. An exception to this is the Azure Machine Learning Algorithm cheat sheet (https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet), which depicts, in a concise way, the different algorithms and their associated use cases. Unfortunately, Microsoft did not create cheat sheets for everything, nor is there any other real visual representation of the impressive Azure service catalog and its ecosystem. That gap leaves room for some creativity on the matter… and that is how Azure Architecture Maps were born. Given the success of the first edition of this book, we decided to publish a second edition to update the maps with the latest Microsoft trends and enterprise-grade best practices. The primary purpose of Azure Architecture Maps is to help architects find their way in Azure, and to grasp, in the blink of an eye, the following:

  • Available services and components: Since there are so many services and products out there, our primary purpose is to classify them and associate them with the most common customer concerns and use cases. However, keep in mind that Azure is a moving target! We will try to be as comprehensive as possible, but we can never guarantee exhaustive or complete coverage. It simply isn't possible.
  • Possible solutions: These architecture maps are like a tree with multiple branches and sub-branches, where the branches finally end with flourishing leaves. On many occasions, there are multiple ways to tackle a single situation. That is why we will map alternative use cases, based on real-world experiences. However, we strongly encourage you to form your own perspective. You need to exercise critical reflection on every topic, so as not to apply the map recommendations blindly. The unique particularities of your own use case will often require a different solution, or at least a modified one.
  • Sensitivity and trade-off points: Architecting a solution is sometimes about choosing the lesser of two evils. Some of your choices or non-functional requirements might affect your final solution, or they might lead you to face some additional challenges. We will highlight sensitive trade-off points with specific marks on the maps.

Given the size of the Azure service catalog, a single map would not suffice. Hence, we created specialized maps. They are not restricted to Microsoft services and, when applicable, may also refer to marketplace and third-party solutions. Let's jump to the next section, which explains how to read and make sense of the maps.

How to read a map

The maps proposed in this book will be your Azure compass. It is therefore important to understand the fundamentals of how to read them. We will walk through a sample map to explain its semantics and inner workings. Figure 1.12 presents a very tiny sample map:

Figure 1.12 – A sample map

The central point that the diagram depicts is the Master Domain (MD), which is the central topic of the map. Each branch represents a different area belonging to the MD. Under the sub-domains, you can find the different concerns. Directly underneath the concerns, the different options (the tree's leaves) might help you address the concerns (see POSSIBLE OPTION in Figure 1.12). There might be more than one option to address a given concern. For example, CONCERN 2 in the diagram offers two options: ALTERNATIVE 1 and ALTERNATIVE 2. From time to time, dotted connections are established between concerns or options that belong to different areas, which indicates a close relationship. In the preceding example, we see that the ALTERNATIVE 2 option connects down to the SUB DOMAIN 2 concern. To give a concrete example of such a connection, we might find a Dapr (Distributed Application Runtime) leaf under the microservice architecture concern that is connected (by a dotted line) to a Logic Apps leaf under the integration concern. The rationale for this connection is that Dapr has a wrapper for self-hosted Logic Apps workflows. Let's now see how, as an architect, you can get started with your cloud journey.

Understanding the key factors of a successful cloud journey

The role of the Azure architect is to help enterprises leverage the cloud to achieve their goals. This implies that there is some preparation work up front, as there is no such thing as a one-size-fits-all cloud strategy. As we have just seen, the various cloud service models do not respond to the same needs and do not serve similar purposes. It is, therefore, very important to first define a vision that reflects which business and/or IT goals are pursued by your company before you start anything with the cloud. The same principles apply if you're already on your journey and want to pause to reflect on your progress so far. You should always come back to the initial drivers that led your company to the cloud. As an example, typical transversal drivers are cost optimization and a faster time to market. Cost optimization can be achieved by leveraging the economies of scale of multi-tenant infrastructures. Faster time to market is achieved by outsourcing as much as possible to the cloud provider. Should these two drivers be key for your company, rushing to a pure IaaS strategy would be an anti-pattern. Whatever your drivers, a possible recipe for success is the following: Define Vision → Define Strategy → Start Implementation. An effective strategy requires a well-defined vision. Let's now go through a few key aspects, starting with the vision.

Defining the vision with the right stakeholders

Write a vision paper to identify what you are trying to solve with the cloud. Here are a few example questions for problems you might want to solve:

  • Do you have pain points on-premises?
  • Do you want to monetize data through APIs?
  • Do you want to outsource your infrastructure and operations?
  • Is the hardware in your data center at its end of life?
  • Are you about to launch new digital services to a B2C (Business-to-Consumer) audience?
  • Are your competitors faster than you to launch new services to consumers, making you lose market share?

Finding answers to these questions helps identify the main business and IT drivers that serve as input for your strategy. Business drivers should come from the company's board of directors (or other corporate leaders). IT drivers should come from the IT leadership. Enterprise architecture may play a role in identifying both the IT and business drivers. Once the vision is clear to everyone, the main business and IT drivers should emerge and form the core of the strategy.

Defining the strategy with the right stakeholders

In order to achieve the vision, the strategy should be structured and organized around it. To ensure that you do not deviate from the vision, the strategy should include a cloud roadmap, cloud principles, and cloud governance. You should conduct a careful selection of candidate assets (greenfield, brownfield, and so on). Keep in mind that this will be a learning exercise too, so start small and grow over time, before you reach your cruising speed. You should conduct a serious financial capacity analysis. Most of the time, the cloud makes companies transition from CAPEX to OPEX, which is not always easy. Moreover, building a cloud platform initially increases the overall expenditure, as you must still keep the lights on on-premises at the same time. It will undoubtedly cost more upfront, but you may achieve a return on investment over time. You should see the cloud as a new platform. Some transversal budgets must be made available, and they should not be too tightly coupled to a single business project. The new platform you are building should have its own lifecycle, independent of the projects in scope. Lastly, do not underestimate the organizational changes, as well as the impact of company culture on the cloud journey. Make sure that you integrate a change management practice as part of your strategy. In terms of stakeholders, the extent to which the executive committee is involved should depend on the balance between business drivers and pure IT drivers. The bare minimum requirement is a strong business sponsor, so that you are empowered to manage the different layers. You should also involve the Chief Information Officer or, even better, the Chief Digital Officer.

Starting implementation with the right stakeholders

This phase is the actual implementation of the strategy. Depending on the use case, such as a group platform, the implementation often starts with a scaffolding exercise. This consists of setting up the technical foundations (such as connectivity, identity, and so on). It is often a good idea to have a separate sandbox environment, to let teams experiment with the cloud. Do not default to old habits by reaching for the products you already use on-premises. Do your homework and analyze Azure's built-in capabilities. Only fall back to your usual tools after having assessed the cloud-native solutions. Stick to the strategy and the principles that were defined up front. In terms of stakeholders, make sure you involve your application, security, and infrastructure architects (all together) from the start. Usually, the Azure journey starts by synchronizing Active Directory with Azure Active Directory for Office 365, which is performed by infrastructure teams. Since they start the cloud journey, infrastructure teams often tend to work on their own and look at the cloud with infrastructure eyes only, without consulting the other stakeholders. Most of the time, this results in a clash between the different teams, which creates a lot of rework. Make sure that all the parties using the cloud are involved from the ground up, to avoid having a single perspective when designing your cloud platform. The above advice is useful when building a cloud platform for a company. However, these factors are also often important for third-party suppliers, who may be engaged on a smaller RFP (request for proposal). To deliver their solution, they might have to adhere to the broader platform design, and the sooner they know, the better. Let us now go through a practical scenario.

Practical scenario

As stated in the previous sections, crafting a few key principles that are signed off by top management may represent a solid architecture artifact when engaging with various stakeholders in the company. Let's now go through a business scenario for which we will try to create an embryonic strategy.

Contoso is currently not using the cloud. They have all their assets hosted on-premises, and these are managed in a traditional-IT way. The overall quality of their system is fine, but their consumer market (B2C) has drastically changed over the past 5 years. They used to be one of the market leaders, but competitors are now showing up and acquiring a substantial market share year after year. Contoso's competitors are digital natives and do not have to deal with legacy systems and practices, which enables them to launch new products faster than Contoso, responding faster to consumer needs. Young households mostly use mobile channels and modern digital platforms, which is lacking in the Contoso offering. On top of this, Contoso would like to leverage artificial intelligence as a way to anticipate consumer behavior and develop tailor-made services that propose a unique customer experience by providing digital personal assistants to end users. However, while the business has some serious ambitions, IT is not able to deliver in a timely fashion. The business asked the IT department to conduct both an internal and external audit so as to understand the pain points and where they can improve. Some facts emerging from the reports include, but are not limited to, the following:

  • The adoption of modern technologies is very slow within Contoso.
  • Infrastructure management relies entirely on the ITIL framework, but the existing processes and SLAs have not been reviewed for the past 5 years. They are no longer in line with the new requirements.
  • The total cost of ownership is rather high at Contoso. The operational team headcount grows exponentially, while some highly qualified engineers leave the company to work in more modern environments.
  • Some historical tools and platforms used by Contoso have reached end of life and been discontinued by vendors in favor of their cloud counterparts, which made Contoso opt for different on-premises solutions, leading to integration challenges with the existing landscape.

As a potential solution, the auditors proposed a magical recipe: the cloud (Azure in our case)! Now, it's up to you, the Azure architect, to manage expectations and advise Contoso on the next steps. We will see an example of this work in the next sections.

The drivers

Some drivers emerge rather quickly out of this business scenario. The business wants to launch products faster, so time to market is critical. Costs are never mentioned by the business, but the audit reveals a TCO (Total Cost of Ownership) increase due to growing operational teams. So, costs are not a strong focus, but we should keep an eye on them. The features the business wants to expose as part of their services rely on top-notch technologies, which are hard to make available on-premises. So, technology could be a business enabler for Contoso. In summary, the drivers that emerge are time to market, new capabilities (enabled by top-notch technologies), and, to a lesser extent, cost optimization.

Strategy

We could write an entire book on how to conduct a proper strategy, so we will simplify the exercise and give you some keys to get started with yours. To understand all the aspects that you have to keep an eye on, you can look at the Microsoft Cloud Adoption Framework (https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/). This is a very good source of information, since it depicts all the aspects to consider when building an Azure cloud platform. Another interesting source of information is the documentation on enterprise-scale landing zones (https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/enterprise-scale/). Back to the strategy: you could also leverage governance frameworks such as COBIT (Control Objectives for Information and Related Technologies) (https://www.isaca.org/resources/cobit) to help transform verbal intentions into a well-documented strategy, and to consolidate the different aspects so as to present them to executives. COBIT also connects the dots between the business goals and the IT goals in a tangible fashion. One of the key COBIT artifacts is what they call the seven enablers, which are applicable to any governance/strategy plan:

Figure 1.13 – COBIT's seven enablers

The diagram offers a short definition of each enabler and its relative impact on the journey. You can easily map them to the dimensions you see in the CAF:

  1. Principles, Policies, and Frameworks: This could be summarized as such: what is clearly thought is clearly expressed. You should identify your core principles and policies that are in line with the business drivers. These will later be shared and reused among all involved parties. Writing a mission statement is also something that may help everyone understand the big picture.
  2. Processes: The actual means of executing policies and transforming the principles into tangible outcomes.
  3. Organizational Structures: A key enabler to putting the organization in motion toward the business and IT goals. This is where management and sponsorship play an important role. It includes defining a team (or virtual cloud team), a stakeholder map, and, first and foremost, a platform owner who is accountable for everything that happens in the cloud and can steer activities.
  4. Culture, Ethics, and Behavior: This is the DNA of the company. Is it a risk-averse company? Are they early adopters? The mindset of people working in the company has a serious impact on the journey. Sometimes, the DNA is even inherited from the industry (banking, military, and so on) that the company operates in. With a bit of experience, it is easy to anticipate most of the obstacles you will be facing just by knowing the industry practices.
  5. Information: This enabler leverages information flows and communication as a way to spread new practices more efficiently.
  6. Services, Infrastructure, and Applications: Designing and defining services is not an easy thing. It is important to re-think your processes and services to be more cloud-native, and not just lift and shift them as is.
  7. People, Skills, and Competencies: Skills are always a problem when you start a cloud journey. You might rely on different sourcing strategies: in-staffing, outsourcing, …, but overall, you should always try to answer the question everyone is asking themselves: what's in it for me in that cloud journey? In large organizations, a real change management program is required to accompany people on that journey.

Developing a strategy around all these enablers is beyond the scope of this book. Out of real-world experience, we can say that you should work on all of them, and you should not underestimate the organizational impacts and the cultural aspects, as they can be key enablers, or disablers should you neglect them. A cloud journey is not only about technology; that's probably even the easiest aspect. To develop our strategy a little further, let's start with some principles that should help meet the business drivers expressed by Contoso, for whom time to market is the most important one:

  • SaaS over FaaS over PaaS over CaaS over IaaS: In a nutshell, this principle means that we should buy rather than build first, since buying is usually faster than building. If we build, we should start from the most provider-managed service model and move toward the most self-managed flavor only when needed. Here again, the idea is to gain time by delegating most of the infrastructure and operational burden to the provider, which does not mean that there is nothing left to do as a cloud consumer. This should help address both the time to market driver and the exponential growth of the operational teams. From left to right, the service models are also ordered from the most to the least cloud-native. CaaS is an exception to this ordering, but its level of operational work remains quite high, which could play against our main driver here.
  • Best of suite versus best of breed: This principle forces people to first check what is native to the platform before bringing their own solutions. Bringing on-premises solutions to the cloud inevitably costs time. Best of suite ensures higher compatibility and integration with the rest of the ecosystem. Following this principle will admittedly tie you more closely to the cloud provider, but leveraging built-in solutions is more cost- and time-efficient. In the Contoso scenario, this approach makes perfect sense. However, if your company already has a strong multi-cloud presence and relies on third-party systems for a unified view across cloud environments, this principle may not be the best fit.
  • Aim at multi-cloud but start with one: In the longer run, aim at multi-cloud to reduce vendor lock-in. However, start with one cloud. The journey will already be difficult, so it is important to concentrate the efforts and stay focused on the objectives. In the short term, try to make smart choices that do not impact cost and time: do not miss low-hanging fruit.
  • Design with security in mind: This principle should always be applied, even on-premises, but the cloud makes it a primary concern. With this principle, you should make sure to involve all the security stakeholders from the start, so as to avoid any unpleasant surprises.
  • Leverage automation: Shipping faster means having an efficient CI/CD toolchain. The cloud offers unique infrastructure-as-code capabilities that help you deploy faster.
  • Multi-tenant over single-tenant building blocks: While single-tenant building blocks might give you more control, they also carry the risk of reintroducing your on-premises practices in the cloud. Given the audit reports we had, this might not be a good idea. Leveraging multi-tenant PaaS services that have been designed for millions of organizations worldwide is a better response to the business drivers.

This is not necessarily where the list ends; other principles could be created, and different drivers would give us different principles. The most important thing is to have concise, self-explanatory, and straightforward principles. Now that you have this first piece done, you can build on it to further develop your policies and the rest of your strategy. This will not be covered in this book, so do work on it in your own time. The time has now come to recap this chapter.

Summary

In this chapter, we reviewed the architecture landscape and the different types of architects we may work with in our day-to-day Azure architecture practice. Knowing the different profiles, and being able to speak to each of them while addressing their interests and preoccupations, is what every Azure architect should do. We also explained the value proposition of the maps and how to read them, which will be very useful in the next chapters. We shed some light on the various service models that exist in the cloud and the different purposes they serve, and we tried to grasp the important differences across them in terms of functionality, operations, and cost. All these models constitute the cornerstone of Azure (as well as any other cloud) and should be thoroughly mastered by the Azure architect, as they represent the minimal, vital must-have skills. Finally, we looked at the key success factors of a cloud journey, drawn from real-world observations, through a fictitious enterprise scenario. In the next chapter, we will start to get closer to the actual implementation of an Azure-based solution.

2 Solution Architecture

Join our book community on Discord

https://packt.link/0nrj3

The Azure ecosystem is vast, offering a wide range of services that address diverse business and technical needs. For Solution Architects, understanding how to navigate this extensive catalog is essential to designing effective and scalable solutions. More specifically, we will cover the following topics:

  • The Solution Architecture map
  • Zooming in on the different workload types
  • Zooming in on containerization
  • Looking at cross-cutting concerns and governance
  • Looking at retiring or retired services since the first edition of this book
  • Solution Architecture use case

This chapter provides a high-level and concise overview of the most commonly used Azure services, helping you map key offerings to specific use cases. Rather than providing an exhaustive view and diving deep into each service, our goal is to equip you with a helicopter view, allowing you to recognize which services are best suited for different scenarios. To make it more than just descriptive, we will provide a few diagrams that should help you grasp how to assemble some of the services depicted in the maps. Specialized maps as well as deeper technical insights and code samples will follow in subsequent chapters. Let's explore the landscape of Azure and identify the right tools for the right challenges.

Technical requirements

This chapter is the least technical one in the book. Our use case will be oriented towards building a diagram that assembles different Azure services for a given business scenario. The solution diagram will be provided in both Visio and PNG formats, so you only need Microsoft Visio if you want to open the provided diagrams. The solution diagram is available at https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/tree/master/Chapter02.

The Solution Architecture Map

The purpose of the Solution Architecture Map is to help solution architects find their way in Azure. We defined the duties of a solution architect in the previous chapter: typically, an architect who assembles the different building blocks and services of a solution while considering the non-functional requirements. Solution architects engage with their specialized peers, who are often application, infrastructure, and security architects. The Solution Architecture Map, illustrated in Figure 2.1, regroups all the areas that a solution architect should explore to have a complete overview of a solution:

Figure 2.1 – The Solution Architecture Map

Important note

To see the full Solution Architecture Map (Figure 2.1), you can download the PDF file that is available here: https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/blob/master/Chapter02/maps/Azure%20Solution%20Architect%20Map.pdf. You can use the zoom-in and zoom-out features to get a better view.

In Figure 2.1, the Azure Solution Architect Map starburst acts as the hub of a wheel. There are eight spokes of the wheel, which we will cover in the next sections:

  • Monitoring
  • Workload Types
  • Governance / Compliance
  • Containerization
  • Identity
  • CI/CD
  • Security
  • Network

For the sake of readability, we have split the global map, available on GitHub, into multiple smaller maps, which we will begin exploring in the next section. Let us start with the different workload types.

Zooming in on the different workload types

In the following sections, we will take a closer look at some typical use cases and cross-cutting concerns that solution architects deal with. Before exploring the various workload types, let's start by defining them.

  • Systems of Engagement (SoE) allow companies to engage with first and third parties. SoEs materialize through various channels that help organizations increase their reach and interact with stakeholders.
  • Systems of Record (SoR) represent the authoritative data source for a given system, the source of truth. SoRs are typically subject to compliance and auditing requirements.
  • Systems of Insights (SoIns) leverage advanced analytics to derive actionable insights. They typically take into account historical data.
  • Systems of Intelligence (SoI) enable near real-time decision-making and enhance operational efficiency, whereas SoIns are typically based on historical data. SoIs are often integrated with SoEs to enrich the end-user or customer experience.
  • Systems of Integration (SoInt) enable different systems and applications to interact and work together seamlessly. They mostly rely on message brokers and APIs.

Let's get started with SoE.

Understanding Systems of Engagement

This category regroups services that allow a company to engage with its own employees as well as with external parties. Software components such as user interfaces, mobile apps, APIs, etc. are channels that help engage with first and third parties:

Figure 2.2 – The SoE category

In Figure 2.2, the SoE category map includes the following top-level groups:

  • Systems of Intelligence (SoI)
  • Frontend
  • API
  • Near Real-Time
  • Notifications

The Systems of Intelligence group in Figure 2.2 includes several subgroups, one of which is CHATBOTS. Over the years, chatbots have gained popularity because they provide instant, automated interactions that enhance customer experience and reduce operational costs. Azure has a lot to offer in that area. Azure Bot Service integrates with many channels (Alexa, Direct Line, Facebook, Microsoft Teams, Slack, etc.) and can be seen as the transport layer of your chatbots: it helps increase their reach. Azure OpenAI handles the conversational aspect of chatbots, while other Azure AI Services facilitate richer interactions. These services are also part of our Personal Assistants subgroup, as they can be used to propose tailor-made services to users. Real-world project examples include a recipe assistant that helps customers generate shopping lists from cooking recipes, and a legal chatbot designed to assist jurists in specialized legal domains. All types of assistants and agents can be built with Azure AI Services, and in particular Azure OpenAI. We will see in more detail how to build Retrieval Augmented Generation (RAG) and Cache Augmented Generation (CAG) solutions later, in Chapter 8, Artificial Intelligence Architecture. While the Bot Framework allows you to build custom bots, it is also possible to build them in minutes with Copilot Studio, without prior development experience. Another interesting service that can bring extra intelligence and capabilities to any SoE is Azure Communication Services (ACS). ACS enables developers to integrate voice, video, chat, SMS, and email communication into their applications, all enriched with AI capabilities. SoIs add intelligent capabilities to SoEs and are often coupled with Frontend applications and devices, which is our next group.

For building regular frontends, we can use Azure App Service as well as Static Web Apps, which are typically used to host web apps. Static Web Apps were built from the ground up to host SPAs (Single Page Apps), while Azure App Service is rather geared toward MVC (Model View Controller) web apps, although it can also host SPAs. While Static Web Apps promise a streamlined deployment process for SPAs, they introduce some complexity, as you can feel locked into Azure's way of doing things. Last but not least, any container-based system can also be used to host either an SPA or an MVC application. No matter which building block you choose for your frontend, most internet-facing web apps require a Content Delivery Network (CDN) component to speed up content delivery. This task, among other things, can be handled by Azure Front Door.

Frontends usually talk to APIs, which brings us to the next subgroup, namely API. Azure API Management (APIM) is the preferred service to manage APIs at scale, and its gateway proxies the API backends. The API gateway acts as a Policy Enforcement Point (PEP), which allows us to enforce controls such as rate limiting, JWT token validation, etc. Additionally, APIM streamlines the management of APIs by facilitating version control (both minor and major versions), debugging, and organizing APIs into products for seamless exposure to consumers. Backends can be hosted in Azure App Service, which supports a wide set of programming languages (at the time of writing, this includes .NET, Java, Node.js, PHP, and Python). App Service can easily be integrated with a deployment factory for Continuous Integration and Continuous Deployment (CI/CD) purposes.
App Service allows for zero-downtime deployments and provides rollback mechanisms through deployment slots. App Service is ideal for the following scenarios:

  • MVC web applications
  • API backend services
  • Lift and shift of legacy .NET applications, by using Azure Web App for Containers (AWC) with Windows
  • Pre-container-orchestration scenarios, by using AWC with Linux or Windows

App Service relies on App Service plans, which define the compute resources that are allocated to the service(s). A plan is a group of virtual machines of a certain size, managed by the Azure platform itself. As a user, you don't have access to the machines, but you decide on the size and the number of instances by selecting the appropriate pricing tier. Plans can be single- or multi-tenant and support both vertical and horizontal scaling. They can also be shared across web applications. The advantage of App Service over container orchestrators is its simplicity and full management by Microsoft. It is also one of the oldest and most mature Azure services. If you need a stable and straightforward solution to host web apps, choose App Service.

In addition to App Service, Function Apps can be used to host HTTP-triggered functions, effectively acting as micro-APIs; a minimal sketch of such a function follows below. However, the Azure Functions runtime introduces some complexity and overhead, which may not be justified for simple HTTP APIs, making App Service a more suitable option in such cases, unless you are already leveraging functions for event-driven scenarios, where they excel. Function apps primarily run on App Service plans but are also available under the Consumption plan, which corresponds to the serverless model (FaaS) introduced in the previous chapter. The Consumption plan is mostly used for short-lived and resource-efficient tasks. A newer serverless model called Flex Consumption is now available. It combines some of the strengths of the Consumption tier while supporting virtual network integration, which is more in line with enterprise-grade practices.
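Here is a minimal sketch of such a micro-API, using the Azure Functions Python v2 programming model; the route, payload shape, and names are illustrative assumptions rather than a prescribed design:

```python
import json

import azure.functions as func

# Hypothetical function app exposing a single micro-API endpoint.
app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

@app.route(route="orders/{order_id}", methods=["GET"])
def get_order(req: func.HttpRequest) -> func.HttpResponse:
    # Route parameters are extracted from the URL template declared above.
    order_id = req.route_params.get("order_id")
    body = json.dumps({"orderId": order_id, "status": "submitted"})
    return func.HttpResponse(body, mimetype="application/json")
```

Before exploring alternatives for backends, let's look at what an SoE channel might look like with App Service. Here is a simplified diagram showing how to articulate some of the previously described services together: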

Figure 2.3 – End to end SPA example with App Service

We see that the client (browser) talks to Front Door, acting as a CDN, WAF, and reverse proxy. Front Door either returns a response from its CDN cache or fetches the origin, which is a Single Page App (SPA) made available by an App Service. When the SPA calls its Backend for Frontend (BFF), the request goes back to Front Door, which in turn forwards it to APIM, where policies can be enforced. Finally, once all controls have passed, APIM forwards the request to the BFF, which is also hosted on an Azure App Service. When the BFF component needs to call other API services (for example, microservices or backend APIs), it does so via APIM rather than calling them directly. An alternative for the frontend component would have been to use Static Web Apps but, as explained earlier, the way you transition from the frontend to the backend is more convoluted and less controllable.

For advanced scenarios, such as high-scale microservice and distributed architectures, container orchestrators are often a better choice than Azure App Service and Function Apps. They offer built-in support for service meshes and advanced add-ons supporting these architectures. Service meshes understand the application layer and are able to differentiate HTTP/1 from HTTP/2, gRPC from REST, etc., optimizing traffic across services and components. Additional frameworks and solutions such as Distributed Application Runtime (Dapr) and Kubernetes Event Driven Autoscaler (KEDA) are native to container orchestrators but not usable in App Service or Function Apps. However, Azure Functions can be self-hosted within a container orchestrator and, in this case, integrate with Dapr and other solutions from the container ecosystem. Remember that for anything you self-host, you are on your own to ensure high availability, capacity, etc., potentially increasing the complexity and the burden of IT operations. Further details will be covered in the Containerization section of this chapter, with additional insights in Chapters 4 and 5. Here is a revised version of Figure 2.3, where only AKS is used to host both the SPA and the backend services:

Figure 2.4 – End to end SPA example with AKS

We see more or less the same traffic flow as in Figure 2.3, but an important difference is that we may decide to rely on a service mesh for traffic flowing between the BFF and the other backend services. We hope that these two examples help you grasp some of the key differences between the different technology stacks.

Next, let's move on to the Notifications top-level group of our SoE category in Figure 2.2. Azure Notification Hubs (ANH) enables sending push notifications to various systems. It is most often used as a way to push messages to mobile apps.

Next, let us explore the Near Real-Time top-level group. Modern frontends provide feedback mechanisms to end users for every asynchronous task. Users might trigger actions that take time to complete but still expect feedback without refreshing the page or their screen in a mobile app. Services such as Azure SignalR Service and Azure Web PubSub provide a bi-directional channel between frontends and backends for near real-time communication, where backends typically send back task-related information that is surfaced on frontends, without requiring any action from the user. Here is an example that introduces SignalR to the previous use case:

Figure 2.5 – End to end SPA example with App Service and feedback loop through SignalR

The client (browser) establishes a bi-directional connection with Azure SignalR, which we use as a one-way channel to receive notifications from the backend. While the authentication details are beyond the scope of this section, a built-in mechanism ensures secure authentication between the frontend and the Azure SignalR service. To enhance communication, we introduced Azure Service Bus, enabling the BFF to queue commands, which are then processed by backend handlers. Once execution is complete, the backends send results back to Service Bus, allowing the BFF to pick them up and notify the frontend via Azure SignalR. This approach not only creates a feedback loop but also improves scalability by decoupling the BFF from other backend services; a minimal sketch of the command-queuing step follows.
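To make the queuing step tangible, here is a minimal sketch, using the azure-servicebus package, of how the BFF could enqueue a command; the queue name and message shape are hypothetical:

```python
import json

from azure.servicebus import ServiceBusClient, ServiceBusMessage

# Hypothetical values; in practice these would come from configuration.
CONN_STR = "<service-bus-connection-string>"
QUEUE = "commands"

command = {"type": "SubmitOrder", "orderId": "123", "replyChannel": "signalr"}

with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_queue_sender(queue_name=QUEUE) as sender:
        # The BFF fires the command and returns immediately; the result
        # reaches the browser later through Azure SignalR.
        sender.send_messages(ServiceBusMessage(json.dumps(command)))
```

In the next section, we are going to discuss Systems of Record.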

Understanding Systems of Record

Strictly speaking, SoRs refer to databases and data integrity. An SoR might (or might not) be transactional or mission-critical:

Figure 2.6 – The SoR category

For our SoR category map (Figure 2.6), we have the following top-level groups:

  • Compute Models
  • Data Warehousing
  • Relational
  • NoSQL
  • Unstructured

Let's get started with the Compute Models top-level group. The serverless compute model available in Azure SQL and Cosmos DB enables on-demand scaling based on the actual needs of the workloads. This is very convenient for solutions that can tolerate variable performance and occasional limited capacity. Use cases for serverless data stores include – but are not limited to – asynchronous calculations/operations performed by scheduled jobs, time-based peaks, and non-user-facing applications in general. Serverless is cost-friendly and is ideal for development and testing. Both the vCore and DTU (Database Transaction Unit) compute models apply to Azure SQL. The DTU model bundles compute and storage into a single unit, making it less granular than vCore, which allows you to independently choose compute, memory, and storage resources based on your workload needs. The following comparison table should help you choose the most appropriate model:

| | DTU | vCore (Serverless) | vCore (Provisioned) |
| --- | --- | --- | --- |
| Scalability | Fixed, but can be adjusted. | Dynamic within a predefined range (between 0.5 and n vCores); deallocates when the predefined idle time is reached. | Fixed, but can be adjusted. Provisioned compute can be higher than with serverless. |
| Granularity | Bundled compute and storage. | Independent storage and compute. | Independent storage and compute. |
| Scale to zero | No | Yes; make sure to have periods of inactivity allowing the system to pause. | No |
| Cost-friendly | Yes; the Basic tier is very cheap and convenient for non-production. | Yes | No, it is the most expensive among all offerings. The General Purpose tier is the most budget-friendly but still more expensive than DTU and serverless. |
| Use cases | Predictable workloads. | Burstable workloads, dev/test, non-user-facing systems. | Predictable workloads with higher scalability requirements. |

Table 2.1 – Compute models comparison table

Next, we have the Flexible Server model, which applies to Azure MySQL and Azure PostgreSQL. This compute model is similar to Azure SQL's vCore and has replaced the Single Server mode, which you might still see in many organizations. Finally, Flexible Server includes an auto-stop mode similar to Azure SQL's serverless mode; however, unlike serverless, it does not support autoscaling.

In the Data Warehousing group, we find Azure Synapse Analytics, which has replaced the former Azure Data Warehouse service. We will cover it in more depth in Chapter 7, Architecting Data Solutions.

That brings us to the next two top-level groups, namely, Relational and NoSQL. The rationale for using a NoSQL engine versus a SQL one is beyond the scope of this book, but we can try to summarize the main reasons that would lead you to choose one or the other. NoSQL systems are designed to scale, and mostly rely on the Basically Available, Soft State, Eventual Consistency (BASE) model, whose biggest impact is eventual consistency. Their purpose is to respond fast to read requests, at the cost of data accuracy. They are suitable for big data, in other words, huge data volumes. In Azure, some NoSQL data stores, such as Azure Table storage, also enable strong consistency, but they will not scale as much as eventual-consistency-based stores. A good example is Cosmos DB, for which you can choose between different consistency levels, from eventual to strong. The more you move towards strong, the less scalable and performant the store becomes. On the other hand, traditional SQL engines rely on the Atomicity, Consistency, Isolation, Durability (ACID) model, which is centered around transactions and data accuracy, at the cost of speed. ACID-based data stores are convenient for most workloads, and scalability is only a problem for massive volumes. There is no definitive threshold at which the volume of data becomes too large for an ACID-based data store, as it depends on factors such as indexing strategies and sharding. However, as a concrete example, Azure SQL Hyperscale supports a maximum storage capacity of 128 TB. One thing is certain, though: you shouldn't rush to NoSQL without careful consideration, as a poorly designed Cosmos DB can cause significant issues in production.

Since SQL engines are nothing new, let us focus on the NoSQL ones. Cosmos DB is Azure's flagship NoSQL storage solution. It supports multiple API models: SQL, Mongo, Table, Cassandra, and Gremlin. The native MongoDB API is only available through Atlas, a marketplace product, or by self-hosting MongoDB. Cosmos DB's preferred API model is SQL, as it offers the best performance, so you should try to favor this option first. Going for Cassandra or Mongo introduces additional overhead, incurred by the translation from Mongo or Cassandra to SQL behind the scenes. However, this might be acceptable in lift and shift scenarios consisting of migrating a Mongo database to Azure, when the migrated database (and thus application) can survive the overhead. The Table API for Cosmos DB is meant to be an alternative to Azure Storage Tables. Azure Storage Tables are a cost-effective solution for high-capacity storage in a single region (from a write perspective), while Cosmos DB allows you to achieve global distribution (read/write). The following table gives an indication of when to use relational versus NoSQL data stores:

| | Relational | NoSQL |
| --- | --- | --- |
| Scalability | Mostly vertical | Horizontal |
| Use cases | Transactional applications; data integrity applied at the database level | Big data; eventual-consistency-friendly; IoT; streaming; structured documents |

Table 2.2 – Relational vs NoSQL use cases
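Before moving on to unstructured data, here is a minimal sketch, using the azure-cosmos package, of the consistency trade-off discussed above; the account, database, and container names are hypothetical, and note that a client can only relax, never strengthen, the account-level default consistency:

```python
from azure.cosmos import CosmosClient

# Hypothetical endpoint and key; a weaker level trades accuracy for
# speed and scale, per the BASE discussion above.
client = CosmosClient(
    url="https://<account>.documents.azure.com:443/",
    credential="<account-key>",
    consistency_level="Eventual",
)

container = client.get_database_client("shop").get_container_client("orders")
item = container.read_item(item="123", partition_key="customer-42")
```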

Finally, in our Unstructured group, we find Azure Blob Storage, which allows storing raw blobs. It is commonly used alongside other data stores: metadata is persisted in Azure SQL or Cosmos DB, while the related blobs are kept in Blob Storage to avoid saturating the database with binaries. Blob Storage itself has built-in system metadata and supports user-defined metadata as well. Azure Data Lake is nothing else but Azure Blob Storage with Hierarchical Namespace (HNS) enabled. Without HNS, Blob Storage is just a flat file structure behind the scenes, the concept of folders being entirely emulated. With HNS enabled, files are organized in folders and subfolders, making Azure Data Lake faster and a better choice than plain Azure Blob Storage for high volumes. Last but not least, Azure Files is a sub-resource of Azure Storage that allows Azure Virtual Machines, Azure Kubernetes Service, and even on-premises environments to mount file shares, using both the SMB and NFS protocols. Azure NetApp Files is an alternative that is more suitable for very low-latency and high-throughput workloads. It is an option to explore, especially if your company already uses NetApp on-premises. Many of the services we have depicted so far have built-in activity logs and data protection features such as versioning, soft delete, automated backups, etc., making them suitable for SoRs. There are many other types of data stores in Azure, such as Redis Cache, Service Bus, Event Hubs, etc., but these are mostly used for data in transit and are not meant to be used directly as SoRs; a minimal sketch of the blob-plus-metadata pattern follows.
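Here is a minimal sketch, using the azure-storage-blob package, of the blob-plus-metadata pattern; the container, blob, and metadata values are hypothetical:

```python
from azure.storage.blob import BlobServiceClient

# Hypothetical names; the binary stays in Blob Storage while only its URL
# and metadata would be persisted in Azure SQL or Cosmos DB.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="invoices", blob="2025/inv-123.pdf")

with open("inv-123.pdf", "rb") as data:
    blob.upload_blob(
        data,
        overwrite=True,
        metadata={"customerId": "42", "documentType": "invoice"},
    )

print(blob.url)  # reference to store alongside the relational record
```

Let us now explore Systems of Insight (SoIns).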

Understanding Systems of Insight

This category regroups data analysis services, helping to extract valuable business insights out of both SoRs and specialized analytics data stores:

Figure 2.7 – The SoI category

In our SoIn category map (Figure 2.7), we have the following top-level groups:

  • Analytics
  • ETL/ELT
  • AI
  • Unified

Let's start with the easiest one, the AI group. At this stage, we simply refer you to the dedicated chapter on artificial intelligence. However, it was important to include AI in the SoIns to highlight its role in extracting valuable insights from data.

Let us move on to the Analytics group, which comprises the Kusto Query Language (KQL)-based Azure Data Explorer (ADX) service. KQL is a powerful query language that is used everywhere in Azure to query logs and create alert rules for monitoring purposes. ADX is a fully managed, fast, and scalable data analytics service. It efficiently ingests and analyzes large volumes of structured and semi-structured data. In IoT scenarios, it can be used as an output data store for Azure Stream Analytics (ASA), which will be described later.

Now, let's move on to our All-in-One sub-group, where we find Azure Synapse Analytics, described earlier as it is also part of the SoRs. Beyond having replaced Azure Data Warehouse, Synapse glues many data services together (Azure Data Lake, Power BI, Spark clusters, Azure Machine Learning, etc.) in order to analyze both enterprise and big data. Azure Synapse is intended to be used by both data scientists and traditional Business Intelligence (BI) analysts in a single consolidated service. A quick detour to our Traditional BI sub-group featuring Azure Analysis Services, which you can still use if you come from a traditional BI stack on-premises and want to gradually move to the cloud. Azure HDInsight is an easy and cost-effective way to run open analytics frameworks, such as Apache Hadoop, Spark, and Kafka.

In our Computed sub-group, Power BI is a comprehensive service that allows you to create reports and dashboards, including real-time dashboards, at an enterprise scale. Power BI's real-time dashboards can be very handy in any Business Activity Monitoring (BAM) scenario. Azure Databricks is another very comprehensive service that you can use to analyze data at scale. It relies on flexible clusters and can be fed by any data source. It encompasses AI capabilities by leveraging data science frameworks and can also be used as a vector database. Databricks requires very specific skills, although its SQL language is accessible to non-data scientists.

In the Extract Transform Load (ETL)/Extract Load Transform (ELT) sub-group, Azure Data Factory (ADF) is another fully managed service that can be used for both ETL and ELT purposes. It can be combined with Databricks notebooks and Azure Functions. ADF pipelines can be triggered through API calls, and they can react to Azure Event Grid notifications. ADF is mostly involved in data-in-movement scenarios, such as pulling data from on-premises systems to the cloud using the Self-hosted Integration Runtime (SHIR), which is part of the Hybrid sub-group. Another hybrid solution is the on-premises Data Gateway, which acts as a bridge between the cloud and on-premises data sources. The gateway comes in two flavors: personal and shared. The personal gateway allows users to create Power BI reports against on-premises data, for their own use. The shared gateway is a cross-user and cross-service gateway. A more recent flavor is the Virtual Network Data Gateway, which is used to bridge Power BI, now part of Microsoft Fabric, with Azure-hosted workloads that have been privatized. It has become a common practice to shield any data service behind private endpoints, making those services inaccessible from Power BI without a gateway to bridge the two worlds.

Azure Stream Analytics (ASA) is a serverless service that belongs to the Near Real-Time subgroup. In essence, ASA processes data streams by analyzing and applying transformations in real time. It ingests data through inputs and sends the filtered or transformed results to specified output destinations. ASA is often used in conjunction with Power BI real-time dashboards for BAM and IoT scenarios, as well as with Synapse for further data analysis. For more advanced scenarios, or when input channels are not supported by ASA, Azure Databricks Structured Streaming represents a good alternative for processing data as it flows.

To conclude this section, let's look at the Unified top-level group. In recent years, Microsoft has made efforts to unify all data services under a single umbrella called Microsoft Fabric. It is a SaaS platform that integrates data and AI services. We will zoom in deeper on the data services in a dedicated chapter. Before moving on, here is a quick sketch of the ADF API-triggering capability mentioned above.
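This minimal sketch uses the azure-mgmt-datafactory package to start a pipeline run; the subscription, resource group, factory, pipeline, and parameter names are hypothetical:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Kick off a pipeline run; parameters are whatever the pipeline declares.
run = client.pipelines.create_run(
    resource_group_name="rg-data",
    factory_name="adf-contoso",
    pipeline_name="copy-onprem-to-lake",
    parameters={"ingestionDate": "2025-01-01"},
)
print(run.run_id)  # can be used later to poll the run status
```

Let us now focus on SoIs.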

Understanding Systems of Intelligence

We saw earlier that SoIs can be tightly coupled to SoEs because frontends, mobile apps, etc. – in other words, customer-facing apps – are enriched with intelligent capabilities surfacing on client devices. However, SoI also exists as an independent category when used from within backends. They differ from SoIns in that they are used to make near-real-time decisions based on the current context as well as on historical data. Scenarios such as industrial automation, autonomous decision systems, etc. are all part of SoI and can be achieved through the use of Azure AI Services. We will explore them in Chapter 8, Artificial Intelligence Architecture. Let us now explore Systems of Integration (SoInt).

Understanding Systems of Integration (IPaaS)

SoInt, illustrated in Figure 2.8, represents the way systems are integrated together. In Azure, we can refer to this as Integration Platform as a Service (IPaaS). This category regroups services that enable integration between different layers of a single solution or across multiple solutions and systems.

Figure 2.8 – The IPaaS category

In the Systems of Integration (IPaaS) category map (Figure 2.8), we have the following top-level groups:

  • Event Driven Handlers
  • API
  • Point-to-Point
  • Hybrid
  • Pub/Sub
  • Orchestration

In the Event Driven Handlers top-level group, we find Azure Functions, which have many built-in bindings and triggers allowing interactions with other services and reactions to events, such as when a message is delivered, a blob has landed in Blob Storage, and so on. Azure Functions glue services together. As seen earlier, functions can be hosted on App Service plans and are also available as serverless offerings through the Consumption and Flex Consumption pricing tiers. Beyond Azure Functions, which are centered on application events, you can also rely on Azure Event Grid System Topics to react to events happening in Azure itself. More than 25 Azure services produce events that can be captured by Azure Event Grid. You can find an exhaustive list here: https://learn.microsoft.com/en-us/azure/event-grid/system-topics. This can be used from within applications, for example, triggering an action when a new client connects to Azure SignalR, as well as for monitoring, such as when a Key Vault certificate has expired or a new Kubernetes version is available.

Nowadays, integration is mostly done through APIs. Azure API Management (APIM), which is part of our API top-level group, is Azure's first-class citizen for exposing APIs to other systems and organizations. The following list highlights the main APIM features:

  • Versioning: Dealing with multiple versions of the same API, for backward compatibility.
  • Revisions: Testing API changes, with the ability to promote them later on as the live/deployed version.
  • Products: Grouping APIs together into a product, and then letting API consumers subscribe to the product.
  • Policies: Enforcing controls, such as JWT token validation, throttling, HTTP header check, request/response transformations, and so on. Policies are enforced by the API gateway, which is also known as the Policy Enforcement Point (PEP). API gateways can also be self-hosted, mostly in hybrid scenarios (on-premises, cloud, or cloud-to-cloud).
  • Workspaces: Sharing a single APIM instance across multiple business projects.
  • Developer portal: Letting consumers discover and subscribe to your APIs.
  • Publisher portal (Azure Portal): Managing your APIs.

The preceding list is not exhaustive, but the key thing to remember is that all API management systems (not only Azure's APIM) play an important role when integrating different systems together. APIM is mostly used in Business-to-Business (B2B) contexts, when exposing APIs to other parties, or as part of an integration landing zone, when applications from different domains must interact. APIM supports a wide variety of protocols and specifications such as OpenAPI, GraphQL, gRPC, WSDL, and WADL over HTTP/1.1 and HTTP/2. Each APIM instance maintains its own catalog, which can become difficult to manage at scale when multiple instances are in use. Azure API Center addresses this challenge by serving as a centralized catalog for the entire organization, encompassing both APIM-managed and non-APIM APIs.

Azure has native services to deal with all types of integrations. In the PUB/SUB top-level group in Figure 2.8, two services emerge: Azure Service Bus and Azure Event Grid (again). Both are suitable for Event-Driven Architectures (EDAs). The boundaries between a message and an event are sometimes blurred, because an event itself is a message. An event is generally used to tell others that something happened, in a fire-and-forget way, while messages are rather commands to be processed by handlers. Both Azure Event Grid Custom Topics and Service Bus can be used to handle events using the Pub/Sub pattern. The major difference between them is that Event Grid is based on a push delivery model, while Service Bus is mostly pull-based, although it also supports push-based delivery (via AMQP's link-credit feature), though this is rarely used in practice. In any case, Pub/Sub is used for discrete events. An example of a discrete event could be: an order has been created. It is reasonable to consider that, for a given application, discrete event types can emerge as outcomes of an Event Storming session, a key practice in Domain-Driven Design. In contrast, Azure Event Hubs is ideal for telemetry, event streaming, and event series requiring high throughput, such as in IoT scenarios; IoT Hub is in fact internally based on Event Hubs. Kafka-enabled Event Hubs can also be used for integrating on-premises Kafka instances with the cloud, or for migrating on-premises workloads that use Kafka clients. Kafka is a special case because it is suitable for both discrete events and streaming.

Pub/Sub is primarily used when multiple services might need to capture the same discrete event, as explained earlier, as well as integration events. In microservice architectures, it is common for each service to maintain its own data store. While the use of a ubiquitous language helps ensure that data concepts are unique and aligned with each service's domain, the separation of data stores often introduces the need for data synchronization, typically handled through what we refer to as integration events. A minimal publishing sketch follows.
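Here is a minimal sketch, using the azure-eventgrid package, of publishing such a discrete event to an Event Grid custom topic; the endpoint, key, and event type are hypothetical:

```python
from azure.core.credentials import AzureKeyCredential
from azure.eventgrid import EventGridPublisherClient, EventGridEvent

# Hypothetical custom topic endpoint and access key.
client = EventGridPublisherClient(
    "https://<topic-name>.<region>-1.eventgrid.azure.net/api/events",
    AzureKeyCredential("<topic-key>"),
)

client.send(
    EventGridEvent(
        subject="orders/123",
        event_type="Contoso.OrderCreated",  # a discrete, fire-and-forget fact
        data={"orderId": "123"},
        data_version="1.0",
    )
)
```

Figure 2.9 illustrates how Azure Event Grid and Service Bus differ from that perspective: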

Figure 2.9 – Pub/Sub pattern with Event Grid and Service Bus

They both use topics and subscriptions, but the main difference is the way they deliver events: push-based versus pull-based. Here is a comparison of both services:

| | Azure Event Grid | Azure Service Bus |
| --- | --- | --- |
| Delivery mode | Push and pull (pull is possible through Event Grid Namespaces, but not to the extent of Service Bus) | Pull |
| Potential architecture bottleneck | Event handlers, as they might not be able to keep up with the pace of Event Grid | Service Bus can buffer messages for a limited time, but not indefinitely; overly aggressive senders may face rejection, while slow receivers risk causing messages to be dead-lettered |
| Subscription filters | Yes | Yes |
| Subscriber endpoints | Internet-facing for the push-based delivery mode; a common practice is to route events to message stores, preventing direct exposure of handlers to the internet | Internet and private |

Table 2.3 – Comparison between Event Grid and Service Bus

Beyond semantics, the fundamental difference between commands and events lies in the messaging patterns they follow. Events are based on the Pub/Sub pattern, while commands are based on the Point-to-Point (P2P) one. Historically, P2P was often used for both, because Pub/Sub did not even exist back in the day. You will still encounter many on-premises legacy integrations where different systems exchange so-called events via P2P rather than Pub/Sub. This bad habit tends to repeat itself in the cloud, especially when migrating on-premises workloads, and that is where you step in as an Azure Solution Architect. In cloud-native solutions, we try to avoid this situation and strictly use P2P within a single domain or within a single application. For example, P2P is ideal when relying on the Load Levelling pattern illustrated in Figure 2.10:

Figure 2.10 – Load Levelling Pattern

In this example, Application A's users interact with the frontend, which ultimately generates commands, such as submitting a shopping list for ordering. The commands are transmitted to the BFF, which in turn queues them to a Service Bus queue. One or more background handlers listen to that queue and process the messages, doing the heavy-lifting work at their own pace. This ensures proper scaling during peak load, preventing the frontend and the BFF from being saturated. The message broker is the buffer between the frontend, the BFF, and the actual backend. Ultimately, a feedback loop can be established from the backends to the frontend through services such as Azure SignalR, which we described earlier. Another valid use case for P2P communication is SAGA choreographies within the same domain, where participants execute actions in response to queue-based commands. Azure Storage Queues and Azure Service Bus Queues are the two main services enabling P2P; a minimal receiver sketch follows.
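Here is a minimal sketch, using the azure-servicebus package, of a background handler draining such a queue at its own pace; the queue name and the handle function are hypothetical:

```python
from azure.servicebus import ServiceBusClient

# Hypothetical queue; handlers scale independently of the BFF.
with ServiceBusClient.from_connection_string("<connection-string>") as client:
    receiver = client.get_queue_receiver(queue_name="commands", max_wait_time=30)
    with receiver:
        for message in receiver:  # blocks until messages arrive or times out
            try:
                handle(message)  # hypothetical heavy-lifting business logic
                receiver.complete_message(message)
            except Exception:
                # After repeated failures, the broker dead-letters the message.
                receiver.abandon_message(message)
```

Below is a summary of the main differences between the two: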

| | Service Bus Queues | Storage Queues |
| --- | --- | --- |
| Message size | Up to 100 MB with the Premium tier | Up to 64 KB |
| Number of messages | Up to 5 GB per queue; you can calculate the number based on your average message size | No clear limit; millions per queue |
| Automatic dead-lettering | Yes | No |
| Ordered delivery | Yes, using sessions | No |
| Supported protocols | AMQP, REST over HTTPS | REST over HTTPS |

Table 2.4 – Comparison between Service Bus and Storage Queues

You can find an exhaustive comparison here: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-azure-and-service-bus-queues-compared-contrasted.

Now, let's dig into the Orchestration top-level group (see the bottom of Figure 2.8), as it allows us to handle orchestration-based SAGAs. For these workflow-like workloads, Azure Durable Functions and Logic Apps are the main services you would use. Logic Apps ships with hundreds of pre-built connectors and allows you to create workflows in minutes. Durable Functions require development skills, as they make use of the Durable Framework, which you integrate into your application. It is worth mentioning that Microsoft launched a preview of the Durable Task Scheduler in May 2025, which helps manage and monitor durable functions at scale. You should use Logic Apps whenever you target integrations with well-known third-party services (SaaS platforms) or integrate different systems together. You could use Durable Functions whenever you need orchestrations that are scoped to a single application; a minimal orchestration sketch follows after this section.

Hybrid services, our last top-level group, rely on on-premises agents that initiate outbound connections to Azure, creating a bi-directional communication channel for exchanging data. This approach is especially useful when there is no private peering between Azure data centers and on-premises environments, or when a cloud-based service needs to interact with on-premises systems. By running the agent locally, hybrid services can support protocols like NTLM and Kerberos, which are common in on-premises setups but not natively supported by most PaaS offerings. For example, Azure Hybrid Connections allow an Azure App Service to consume an on-premises database seamlessly, without the need to expose the on-premises database to the internet, nor to have private connectivity between the two worlds. Admittedly, this type of integration has become less common between PaaS services and on-premises systems, because most PaaS services now integrate fully with virtual networks, which in turn leverage private connectivity. Yet SaaS systems such as Power BI still leverage this type of integration, as we described in the Systems of Insight section.
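To give a feel for the developer experience, here is a minimal sketch of a SAGA-style orchestrator with Durable Functions in Python; the activity names are hypothetical, and each activity would live in its own function:

```python
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    # Each yield checkpoints the orchestration; the framework replays
    # completed steps deterministically after restarts.
    reservation = yield context.call_activity("ReserveStock", {"orderId": "123"})
    payment = yield context.call_activity("ChargePayment", reservation)
    return payment

main = df.Orchestrator.create(orchestrator_function)
```

Let's now explore the Azure container landscape.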

Zooming in on containerization

Containers are ubiquitous and a hot topic of discussion! In the following sections, we'll explore Azure's container offerings, but only at a high level, as two dedicated chapters will dive deeper into this subject. The Azure platform supports different flavors, which range from single-container support to full orchestrators. Let's zoom into the containerization area of the Solution Architecture Map (see Figure 2.11):

Figure 2.11 – Zoom in on containers

Needless to say, in terms of Use Cases, anything is possible with containers, but some are particularly appropriate, such as Microservices and Distributed Architectures. One of the reasons why AKS (Azure Kubernetes Service), ACA (Azure Container Apps), and ARO (Azure Red Hat OpenShift), all based on K8s (Kubernetes) behind the scenes, are extremely suitable for running such workloads is the K8s pod. A K8s pod can be seen as the hosting unit of a microservice or a single component of a distributed architecture. Pods can be scaled out/in and deployed independently. K8s supports many deployment methods ensuring zero downtime. Additionally, the CNCF landscape, introduced in the first chapter, is a rich ecosystem that offers many solutions and frameworks to run such architectures at scale. One of these frameworks is Dapr (Distributed Application Runtime), which acts as an abstraction layer between the application code and services such as message brokers, state stores, secret stores, and so on. Dapr also brings the actor model, which provides a framework for stateful, single-threaded objects (actors) that encapsulate behavior and state, making it ideal for event-driven applications and microservices. Dapr is multi-cloud and has connectors to many stores, as illustrated by Figure 2.12:

Figure 2.12 – Dapr bindings
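To give a feel for the Dapr programming model shown in Figure 2.12, here is a minimal sketch of publishing an event through a Dapr sidecar using the Python SDK; the component and topic names are hypothetical, and the underlying broker (Service Bus, Kafka, Redis, etc.) is just a configuration choice:

```python
import json

from dapr.clients import DaprClient

# Assumes a running Dapr sidecar and a pub/sub component named "order-pubsub";
# swapping the broker behind it is a configuration change only.
with DaprClient() as client:
    client.publish_event(
        pubsub_name="order-pubsub",
        topic_name="orders",
        data=json.dumps({"orderId": "123"}),
        data_content_type="application/json",
    )
```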

Most Cloud Native Computing Foundation (CNCF) solutions going beyond the boundaries of K8s are multi-cloud. This is also the case for KEDA, a smart metric server allowing you to scale pods, from zero to any number, based on a wide variety of metrics, such as the length of a message queue, the count of blobs in an Azure Blob Storage container, etc. KEDA works with scalers targeting many different clouds. You can find the list of supported scalers here: https://keda.sh/docs/2.16/scalers/.

AKS is the only service that allows you to leverage the CNCF ecosystem to its full extent. That being said, while you can technically set up anything, you take on the responsibility from a support perspective. Microsoft provides support for the API server, the virtual machines used as worker nodes, and the built-in add-ons that you can enable. Unfortunately, many of those add-ons are rather limited compared with their CNCF counterparts. One of the common challenges with AKS in the enterprise is its supportability. You can either stick to the pre-built add-ons, provided you can live with their limitations, or have a support contract with a company other than Microsoft. On the other hand, ACA enables a few CNCF solutions (such as Dapr and KEDA) by default, but you are strictly limited to Microsoft's selected integrations. For example, in ACA, you cannot use the latest KEDA scalers, as the Microsoft-managed version often lags behind the upstream KEDA release for stability reasons. ACA is also based on Kubernetes, but the cluster is fully managed by Microsoft, meaning you don't have direct cluster-level access. On one hand, this is convenient because it reduces operational overhead; on the other hand, it can be frustrating due to the limited control and customization options. You can only deploy your applications and integrate with the built-in components that Microsoft decided to incorporate. Depending on your inclination toward CNCF solutions, you might choose AKS or ARO for greater flexibility, or opt for ACA if its built-in features meet your requirements.

Beyond CNCF solutions, container orchestrators can be used to self-host Azure services. Many Azure services can be deployed as containers, which is often required for Edge computing, multi-cloud environments, and hybrid deployments. However, self-hosting an Azure service means taking full responsibility for high availability and disaster recovery. While you must provide your own compute and take on more responsibility, you gain the advantage of operating within a controlled perimeter. It is essential to carefully weigh the benefits and challenges of self-hosting. Finally, ARO adheres to OpenShift's supported operators. OpenShift's philosophy is to make a careful selection of CNCF projects, with an extra focus on the security, maturity, and enterprise readiness of each solution.

Other use cases, such as running background jobs, can be handled by any container orchestrator. However, ACI (Azure Container Instances) is a very cost-friendly way to perform such tasks. ACI is a serverless offering that lets you allocate up to 256 GB of RAM and 32 CPU cores, depending on the Azure region you create the instances in. You are charged per second of execution. ACIs can be fluently provisioned and destroyed once a job is completed. You can run up to 100 ACIs in parallel per subscription by default, and this limit can be increased upon support request. ACIs can also be used as virtual nodes for AKS to bring a serverless flavor to clusters.
For hyperscale jobs requiring even more compute, we can use Azure Batch, a job scheduler able to perform parallel executions using dozens or hundreds of virtual machines behind the scenes. Finally, Azure Web App for Containers (AWC) shines for simple applications and when migrating legacy .NET applications that might require Windows-specific features that are not available on Linux-based container platforms. Note that AKS and ACI also support Windows-based containers but, unless you match the specific use cases of AKS or ACI, you should rather consider AWC. It is worth mentioning that one of the latest features brought to AWC is support for sidecar containers. While you can containerize Azure Functions and host them in AKS or ARO, you also have the option to use Azure Functions for Containers (AFC). AFC enables you to package functions as containers and deploy them to either an Azure Container Apps environment or Azure Function Apps. Moving forward, any mention of AFC refers exclusively to these two deployment options and does not include running functions in AKS or ARO. Figure 2.13 is a summary table outlining the best container options for different architecture styles and use cases:

Figure 2.13 - Container services mapped to use cases (empty circle means no match, full circle means high match)

Detailed explanations of the rationale that led to this mapping will be provided in Chapter 5, Other Container Services, but this already gives you a good idea of which container service to use according to your specific situation. Whatever the use case, the Solution Architect is not always the sole decision-maker when selecting a container platform. The choice may depend on how well a given service aligns with company standards, which could encompass security requirements, control levels, operational capabilities, and more. This is why we have regrouped the different options under the Platform Requirements top-level group. One example of such constraints could be that mTLS should be enforced for pod-to-pod communication. In that case, service meshes are the easiest way to enforce such behavior, which would be possible with both AKS and ARO but not with the other container options. AKS is vanilla enough for most use cases, especially if you want standard Kubernetes behavior, but it is not pure vanilla because it is wrapped in Azure-specific tooling and managed services (managed control plane, Azure CNI, etc.). ARO provides you with a lot of control, but you must adhere to OpenShift's way of working. Conversely, if minimizing operational effort is a priority and there are no specific requirements, ACA could be the optimal choice. Company standards are often mapped to non-functional requirements for security, resilience, and more. Figure 2.14 is a mapping between container services and Software Quality Attributes:

Figure 2.14 – Container services mapped to Quality Attributes (empty circle means no match, full circle means high match)

Figure 2.14 highlights the built-in features that simplify achieving a given quality attribute with minimal effort on your part. For instance, ACI does not include a built-in auto-scaling feature, hence the empty circle. However, you can still implement your own solution to scale ACIs programmatically based on collected metrics. This simply means that achieving auto-scaling with ACI requires more effort compared to Azure App Service, where it is handled automatically. All the details and rationale that led to this classification will be provided in Chapter 5. Regardless of the container platform, Azure Container Registry should be used to store container images, and Defender for Containers to enhance overall security. Beyond technology, being an architect also involves addressing cross-cutting concerns, which we will explore in the next section.

Looking at cross-cutting concerns and non-functional requirements

Building a solution is not only about programming or blindly assembling Azure services. You may have the best developers ever, producing the best code, and still end up with a very poor customer experience. Beyond the workload classification itself, you need to look at the cross-cutting concerns and non-functional requirements, which all contribute to a production-grade application, for which quality is a must. Software quality is the balance between being fit for purpose and fit for use. To illustrate this, consider the analogy of a washing machine. Suppose you purchase one that delivers exceptional cleaning results but fails if used twice a day or breaks down every two weeks. Despite excelling at its primary function (cleaning clothes), its unreliability makes it a poor product. In this case, it is fit for purpose but not fit for use, ultimately leading to poor overall quality. That is why our Solution Architecture Map includes multiple non-functional areas, which we will focus on in the next sections.

Learning about monitoring

Monitoring might not be the most exciting topic, but it is mandatory for any system that runs in production. The purpose of monitoring systems and applications is to detect adverse events as soon as possible and respond, manually or automatically. Monitoring also makes it possible to have global oversight of the running systems and to efficiently troubleshoot any issue, at either the application or the infrastructure level. Additionally, you can collect Key Performance Indicators (KPIs) to evaluate service quality against your proposed Service-Level Agreements (SLAs). See Figure 2.15 to zoom in on the monitoring category:

Figure 2.15 – Zoom in on monitoring

In Figure 2.15, we have three top-level groups: Common 3rd Parties, Native, and DSC. Let's discuss the Common 3rd Parties top-level group first. Most organizations already have monitoring in place on-premises, and many of them want to keep using the same monitoring infrastructure across all environments, cloud(s) and non-cloud. This is why we are often required to make diagnostic logs as well as application logs available to the usual suspects, such as Splunk and QRadar, the latter being focused on security logs. The good news is that such integration is rather easy, since both Splunk and QRadar have built-in connectors allowing them to ingest Azure (security) logs. The only thing you have to do is export the logs to Azure Event Hubs. System Center Operations Manager (SCOM), a very common tool on-premises, has a management pack for Azure. However, even when integrating with third-party monitoring solutions, it is advisable to leverage Azure Monitor, as it is the native monitoring tool. Relying on external systems for alerts can be risky: massive log ingestion often causes on-premises systems to become overwhelmed, potentially disrupting them or triggering alerts long after an issue has occurred. A more effective approach is to define alert rules in Azure Monitor and let it notify the other systems as needed. This way, you lower the delay between the occurrence of a problem and its detection, while still integrating with on-premises systems to keep a 360° view. Another benefit of this approach is that you do not necessarily need to map all your KQL-based queries to Splunk ones. Moreover, it still allows you to centralize all logs in a platform such as Splunk, while not relying on it to generate alerts.

This naturally brings us to the Native top-level group. Azure Monitor is the main service that collects all the metrics, while Log Analytics is usually used as a central log repository to query both diagnostic and application logs through KQL. Optionally, Azure Event Hubs can be used to centralize all logs to be ingested by Splunk. You can define alerts on both metrics and logs. Alerts are sent through action groups, which range from sending emails and SMS to triggering automated responses via Logic Apps, which integrate with virtually anything. Here is a high-level view of how this could look:

Figure 2.16 – Integrating Azure Monitor with on-premises systems

Workloads send their logs to both Log Analytics and Event Hubs. Log Analytics and Azure Monitor Metrics serve as the input for alert rules. Logic Apps propagate alerts to a solution such as IBM Tivoli, while Splunk pulls all the logs from Event Hubs for centralization purposes.

Azure Network Watcher is a set of tools to diagnose network-related problems. Application Insights focuses on application logs, enabling real-time observability (OpenTelemetry), request tracking, and dependency tracking, and allows developers to define custom traces. Application Insights integrates with Log Analytics workspaces. Azure Workbooks are useful to quickly get a visual representation of a solution's health. Over recent years, the rise of container-based solutions has led to the integration of Grafana and Prometheus as managed services. Grafana is centered around dashboarding, while Prometheus is a metric collector that is omnipresent in the container world.

Lastly, in our DSC group (see Figure 2.15), Azure Automation is the only native Azure service that ensures a Desired State Configuration (DSC), with a focus on virtual machines. DSC consists of defining a desired configuration and letting the service auto-correct deviations from that desired state. Azure Automation meets similar needs as Ansible, Chef, Puppet, and so on. It can also be used to provision Azure resources through Automation runbooks, either directly or from CI/CD pipelines through webhooks. Its hybrid workers also make it easy to interact with on-premises systems. Note that Azure Automation's DSC features are set to be replaced by Azure Machine Configuration as of 2027.

From the very beginning, an Azure solution architect should envision a monitoring strategy, whether scoped to a single solution or integrated with a strategy defined for the entire platform. Not considering monitoring from the start will impact deadlines. If a landing zone requirement is to redirect logs to Azure Event Hubs so that Splunk can ingest them on-premises, you, the Azure solution architect, must supervise the activities accordingly and reflect this in an initial set of solution diagrams.
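To make this strategy tangible, here is a minimal Terraform sketch of the pattern described above. All names are hypothetical, and the referenced web app, Log Analytics workspace, Event Hub, and resource group are assumed to be defined elsewhere in the configuration:

```hcl
# Route diagnostic logs to Log Analytics (for KQL queries and alerting)
# and to Event Hubs (for ingestion by an on-premises SIEM such as Splunk).
resource "azurerm_monitor_diagnostic_setting" "workload" {
  name                           = "diag-workload"
  target_resource_id             = azurerm_linux_web_app.workload.id
  log_analytics_workspace_id     = azurerm_log_analytics_workspace.central.id
  eventhub_authorization_rule_id = azurerm_eventhub_namespace_authorization_rule.splunk.id
  eventhub_name                  = azurerm_eventhub.logs.name

  enabled_log {
    category = "AppServiceHTTPLogs"
  }

  metric {
    category = "AllMetrics"
  }
}

# Action group: who or what gets notified when an alert fires.
resource "azurerm_monitor_action_group" "ops" {
  name                = "ag-ops"
  resource_group_name = azurerm_resource_group.monitoring.name
  short_name          = "ops"

  email_receiver {
    name          = "oncall"
    email_address = "oncall@contoso.com" # hypothetical address
  }
}

# The alert rule lives in Azure Monitor, close to the source, as advised above.
resource "azurerm_monitor_metric_alert" "http_5xx" {
  name                = "alert-http-5xx"
  resource_group_name = azurerm_resource_group.monitoring.name
  scopes              = [azurerm_linux_web_app.workload.id]
  description         = "Too many HTTP 5xx responses returned by the workload."

  criteria {
    metric_namespace = "Microsoft.Web/sites"
    metric_name      = "Http5xx"
    aggregation      = "Total"
    operator         = "GreaterThan"
    threshold        = 10
  }

  action {
    action_group_id = azurerm_monitor_action_group.ops.id
  }
}
```

With this in place, Splunk consumes the Event Hub for log centralization, while detection itself stays in Azure Monitor, keeping the delay between problem and alert as short as possible.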

Learning about factories (CI/CD)

Continuous Integration/Continuous Deployment (CI/CD) has become mainstream and is a key enabler for a successful cloud journey. The ultimate goal is to do Anything as Code but, before getting there, a substantial amount of work must be invested upfront. In this second edition of the book, we decided to only shed some light on the CI/CD topic, so as to focus more on architecture. It remains important, nevertheless, for Solution Architects to assess the level of readiness of the existing CI/CD factory (if any), as it will influence the transition from the drawing board to a tangible workload running in production. Poor CI/CD practices will impact your velocity as well as the resulting quality. Figure 2.17 zooms in on the CI/CD area of our Solution Architecture Map:

Figure 2.17 – Zoom in on CI/CD

Our CI/CD zoom-in map includes five elements:

  • Delivery Modes
  • Pipelines
  • K8s/OpenShift
  • IAC
  • Quality Gates

Let's get started with the Delivery Modes top-level group. You might be surprised by how much DevOps and GitOps differ and how impactful those differences are when organizing a CI/CD factory. In essence, the pursued objective is the same: we want to maximize automation and empower application teams to achieve more in less time. GitOps is pull-based, which means that the source of truth is the Git repositories themselves. This eliminates the need for pipelines. Repos are monitored by a tool that deploys what is declared in the repos to the target environments, upon pull request (for control) or straight commit to a branch. Conversely, DevOps is push-based, which means that you still need to configure pipelines to build the code and deploy it to the target environments.

Speaking of Pipelines, we typically use Azure DevOps YAML pipelines for both building and deploying apps. Classic pipelines were typically used for releases only and should not be used anymore. Another popular tool is GitHub, which relies on GitHub Actions to build and deploy applications. There are of course other platforms, such as Jenkins, that can be used for CI/CD, but since Azure DevOps and GitHub are part of the Microsoft ecosystem, they are first-class citizens when deploying to Azure.

Container-based platforms, such as K8s/OpenShift, are natively built on top of GitOps principles. ArgoCD and Flux are two CNCF projects that help deploy both infrastructure and applications to the clusters. In Azure, we typically provision AKS with an IaC (Infrastructure as Code) language, but the internals of the clusters are typically managed by either Flux or ArgoCD. Both solutions support Helm and Kustomize, two of the most popular ways to deploy components to K8s. Although GitOps is a best practice for K8s and OpenShift, it is possible to use push-based pipelines as well. A notable difference between GitOps and DevOps for K8s is that a tool like ArgoCD also supports DSC, detecting any drift between the desired state (the repos) and the actual state (the cluster).

In the IAC top-level group, we find the most frequently used languages to deploy to Azure, namely Terraform and Bicep. Pulumi differentiates itself by reusing well-known programming languages such as Python, TypeScript, and so on. ARM templates used to be the only way to deploy Azure infrastructure back in the day, but they are definitely not the primary choice anymore because of their complexity. If you use Bicep or Terraform, you should look at the Azure Verified Modules, a collection of IaC modules built according to best practices. Here is the link to the website: https://azure.github.io/Azure-Verified-Modules/.

Finally, the overall purpose of automation is to improve velocity while improving quality. To do so, Quality Gates can be enforced before deploying anything to production. SonarQube and SonarCloud remain very popular and are focused on code quality. They impact quality attributes such as maintainability, evolvability, and changeability, and make sure the code adheres to well-known principles such as SOLID, DRY, and Clean Code. Selenium is often used for integration tests. Snyk is a very good solution for anything security-related. Whether you want to scan source code or container images, Snyk is a modern and developer-friendly solution. JFrog is a well-known platform that covers the end-to-end software supply chain. Organizations using JFrog will also typically use JFrog's Xray to scan container images.
An alternative to Snyk and JFrog's Xray for scanning container images is to use Microsoft Defender for Cloud's Agentless Container Vulnerability Assessment. The list of solutions presented in the map is by no means exhaustive, but it represents some of the most commonly used ones when building and deploying solutions to Azure and its container ecosystem.
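As an illustration of the pull-based GitOps model on AKS, here is a hedged Terraform sketch that installs the Flux extension on a cluster and points it at a Git repository. The cluster and repository names are hypothetical and assumed to be defined elsewhere:

```hcl
# Install the Flux cluster extension on an existing AKS cluster.
resource "azurerm_kubernetes_cluster_extension" "flux" {
  name           = "flux"
  cluster_id     = azurerm_kubernetes_cluster.main.id
  extension_type = "microsoft.flux"
}

# Point Flux at the repository that acts as the source of truth.
resource "azurerm_kubernetes_flux_configuration" "apps" {
  name       = "apps"
  cluster_id = azurerm_kubernetes_cluster.main.id
  namespace  = "flux-system"

  git_repository {
    url             = "https://github.com/contoso/gitops-apps" # hypothetical repo
    reference_type  = "branch"
    reference_value = "main"
  }

  kustomizations {
    name = "apps"
    path = "./clusters/production"
  }

  depends_on = [azurerm_kubernetes_cluster_extension.flux]
}
```

From that point on, Flux continuously reconciles the cluster against the repository, which is exactly the pull-based model described earlier. Now, let's explore the identity solutions that Azure provides.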

Learning about identity

In the cloud, identity is one of the most important layers, yet it is often overlooked or poorly understood by solution and security architects. There is a huge gap between on-premises identity systems and cloud-based identity systems, where OpenID Connect and OAuth are the dominant protocols. Figure 2.18 zooms in on the identity category:

Figure 2.18 – Zoom in on identity

For the IDENTITY zoom-in map, we have four top-level groups:

  • Hybrid
  • B2E
  • B2B
  • B2C

In the Hybrid group, the Entra ID Application Proxy gives you remote, secure access to your on-premises apps. It allows you to authenticate cloud users to on-premises web apps and APIs transparently, without changing anything in the on-premises solutions. While this can be handy at times, you must accept the fact that the endpoint made available by the proxy is internet-facing, somewhat playing the role of a DMZ (Demilitarized Zone). The proxy is able to handle protocol transition, such as signing a user in with OpenID Connect and generating a corresponding Kerberos ticket on their behalf to target the on-premises systems.

Active Directory and Entra ID do not have that much in common when you look at them closely. IAM (Identity and Access Management) teams have a solid learning curve ahead to master Entra, which is the cornerstone of the Microsoft Cloud. Identity is one of the key foundations that you must define before you host anything in Azure. It is an important part of the Cloud Adoption Framework and a solid foundational asset. An Azure Solution Architect must have a fundamental understanding of modern authentication mechanisms, as they bridge two key layers: the application layer, which handles user authentication and token requests/validation, and the infrastructure layer, where client applications and APIs must be registered with the identity provider before they can be used. A good knowledge of that domain helps guide both developers and cloud engineers.

That brings us to our second group in the identity zoom-in map. In a B2E (Business-to-Enterprise or Business-to-Employees) context, Entra ID is used to authenticate internal employees and collaborators to systems and applications. Microsoft Entra Connect provides hybrid identity features, such as password hash sync, pass-through authentication, and synchronization between your on-premises directory and Entra ID, to let you leverage single sign-on. Entra Connect Pass-through Authentication is an alternative to ADFS (Active Directory Federation Services) that allows you to validate user credentials against your on-premises directory without having to plan for a full ADFS infrastructure. Additionally, Pass-through Authentication can be combined with password hash sync as a fallback, should the on-premises agents become unavailable.

Next is Business-to-Business (B2B), our third group from Figure 2.18. Entra ID can also be used in a B2B context through B2B invites. That is typically how you control whom Office 365 users can invite in a B2B collaboration context. The same applies to custom workloads.

Finally, we have our fourth and bottom group. In a B2C (Business-to-Consumers) market with many public-facing apps and APIs, Entra External ID is the preferred choice and is gradually replacing Azure Active Directory B2C for that matter. You can use Entra External ID's own user store or integrate with well-known identity providers such as Facebook and Google, and any other OIDC-compliant identity provider.

Because identity is an important security pillar in the cloud, we will explore it further in Chapter 9, Security Architecture.
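Because app registrations sit at the boundary between the application and infrastructure layers, here is a minimal sketch of what registering a client application for OpenID Connect sign-in looks like with the Terraform azuread provider. The display name and redirect URI are hypothetical:

```hcl
# App registration for a web app that signs users in with OpenID Connect.
resource "azuread_application" "webapp" {
  display_name = "contoso-webapp"

  web {
    redirect_uris = ["https://app.contoso.com/signin-oidc"] # hypothetical URI
  }
}

# The service principal is the application's identity within the tenant.
resource "azuread_service_principal" "webapp" {
  client_id = azuread_application.webapp.client_id
}
```

Now, let's take a high-level look at the security landscape in the next section.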

Learning about security

Security is a primary concern for every organization. Figure 2.19 is a very high-level and condensed view of Azure's security landscape:

Figure 2.19 – Zoom in on security

We have six top-level groups for our SECURITY zoom-in map (Figure 2.19):

  • Live Threat Detection
  • Posture Management
  • SIEM/SOAR
  • CASB
  • PAM
  • Appliances

Let's start with the Posture Management top-level group. Azure Policy is the cornerstone of governance in Azure. In a nutshell, policies allow you to make sure deployed resources remain compliant with corporate standards. For example, you may want to make sure that every App Service is configured with encryption in transit (HTTPS), or check that no public IPs can be created by default. Since policies govern all deployments in Azure, as a Solution Architect, you must ensure compliance with organizational standards. Certain services may be restricted or prohibited, requiring an exemption request, which could introduce challenges from the outset. This is especially critical when working for a consultancy firm, where you may not be fully aware of the client's corporate policies. Similarly, we have seen countless K8s-based vendor solutions that do not comply with the default built-in AKS policies, which forces either the vendor to rework its product or you to lower your security level. It is your duty as a Solution Architect to anticipate this.

Another product that helps maintain a good security posture is Microsoft Defender. It is versatile and granular, as you can enable Defender per workload type. For example, you may use it as an anti-virus for Storage Accounts, or let it analyze your API traffic and detect potential security flaws through its Live Threat Detection mechanisms.

Microsoft Sentinel is a cloud-native service in Azure that helps organizations detect, investigate, and respond to security threats. It acts as both a Security Information and Event Management (SIEM) system and a Security Orchestration, Automation, and Response (SOAR) platform. As a SIEM, it collects and analyzes security data from various sources to identify potential threats. As a SOAR, it streamlines and automates security operations, helping teams respond to incidents more quickly and efficiently. Although it is built into Azure, Sentinel can also integrate with on-premises systems and other cloud environments, making it a flexible solution beyond just Azure-based workloads. Automated responses, named playbooks in this context, are performed through Logic Apps, which has built-in connectors to hundreds of first and third parties. One of Sentinel's biggest benefits is that it automatically scales as needed. A few years ago, Sentinel did not even show up in Gartner's Magic Quadrant, but it filled the gaps at an incredible speed and is now among the leaders. Yet, the same usual suspects as for monitoring, namely Splunk and QRadar, are still often used in the enterprise for SIEM/SOAR purposes.

In terms of Cloud Access Security Broker (CASB) solutions, organizations often use Netskope and Zscaler. The primary focus of such solutions is to enforce zero-trust, a principle that aims at validating each and every end user/system activity, as well as data leak prevention, with a strong focus on the cloud. In the past, CASB solutions were mostly used on client devices, but they are now moving to backend systems as well. A concrete example could be egress internet traffic coming from Azure backend services, such as API Management or AKS, being sent to a Netskope SaaS tenant for further verification.
The impact of such practices is not to be neglected, as they might force all your Azure components to trust the Enterprise Certificate Authorities that sign the intermediate certificates sent by Netskope, which can be a serious pain. Microsoft's CASB is a SaaS solution named Defender for Cloud Apps and aims to address the exact same concerns. As a Solution Architect, it is important to anticipate this, because you will have to integrate with these systems or might even be unable to use certain services, especially if TLS inspection is enforced at the CASB level.

Our next top-level group is Privileged Access Management (PAM), for which Azure features Privileged Identity Management (PIM). The main purpose of PAM tools is to elevate user privileges only when required and for a short duration, while tracking sessions. PIM is rather straightforward to use, which is not the case for CyberArk, a worldwide leader in that space that is quite challenging to integrate with all aspects of Azure.

Our last top-level group is Appliances, under which we find native and common third-party solutions. Azure Firewall is a built-in service that lets you apply layer-4 and layer-7 firewall rules. Its premium tier can enforce TLS inspection and IDS/IPS (Intrusion Detection System/Intrusion Prevention System), the same way as you would traditionally do with niche players such as Palo Alto, Fortinet, and Check Point. These third-party offerings are typically hosted on virtual machines. Recently, Palo Alto introduced Cloud NGFW for Azure, a SaaS offering (also available for AWS and Google Cloud) that integrates seamlessly with Azure while eliminating the need for infrastructure management.

When it comes to WAF (Web Application Firewall), Azure features WAF Policies, which can be attached to either Azure Front Door or Application Gateway. The core ruleset is based on OWASP (Open Worldwide Application Security Project), a well-known authority that catalogs the most common web vulnerabilities, among other things. Additionally, we can define custom rules to achieve a tailored configuration. Alternatives on the market are F5, Imperva, Barracuda, and of course Cloudflare, a well-known authority in that field. Similar to firewalls, WAFs have an even greater impact on custom solutions, often generating false positives that block legitimate traffic and trigger arbitrary 403 (Forbidden) errors. To minimize issues, it is crucial to test solutions in a non-production environment using the same WAF rules, ensuring alignment with WAF policies. The choice between native and third-party services typically falls to the network and security teams.
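To give a feel for the native WAF option, here is a minimal Terraform sketch, with hypothetical names, of a WAF policy based on the OWASP core ruleset; an Application Gateway could then reference this policy (Front Door uses its own, very similar, policy resource):

```hcl
resource "azurerm_web_application_firewall_policy" "waf" {
  name                = "wafpolicy-prod"
  resource_group_name = azurerm_resource_group.network.name
  location            = azurerm_resource_group.network.location

  policy_settings {
    enabled = true
    # Start in "Detection" in non-production to spot false positives
    # before switching to "Prevention".
    mode = "Prevention"
  }

  managed_rules {
    managed_rule_set {
      type    = "OWASP"
      version = "3.2"
    }
  }
}
```

Running the same policy in Detection mode against a non-production environment is an easy way to surface false positives before they start blocking legitimate traffic. We will explore all the security concerns more deeply in Chapter 9, Security Architecture.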

Learning about networking

Like identity, networking is a very important pillar. While Solution Architects do not necessarily need to know the bits and bytes of Azure networking, they should understand what role it plays in the overall solution. Every asset might deal with public and private endpoints, and with different types of reverse proxies and firewalls. When it comes to hybrid workloads, say, for instance, a frontend in the Cloud talking to an on-premises backend, connectivity becomes even more crucial. Cross-cutting concerns, such as performance and resilience, are directly related to the underlying network plumbing. Figure 2.20 shows the most important connectivity options at our disposal to bridge Azure data centers to on-premises ones, as well as to route and secure traffic:

Figure 2.20 – Zoom in on connectivity

For the networking zoom-in map, we have four top-level groups:

  • Reverse Proxies, WAF, and Load Balancers
  • Routing
  • Data Center, which you have to understand as on-premises data center
  • Topologies

Let's start with Data Center. Azure ExpressRoute guarantees a certain bandwidth, a specific level of resilience, and a quality of service, but a mere VPN connection does not. Solution Architects should know what is already set up (if anything), in order to evaluate whether the underlying network plumbing satisfies the non-functional requirements, especially when dealing with hybrid workloads. Azure ExpressRoute is the de facto connectivity choice made by many organizations.

The second top-level group is Reverse Proxies, WAF, and Load Balancers, where we find Azure Front Door and Application Gateway. Both are reverse proxies, with or without WAF enabled, but Azure Application Gateway is a regional service while Front Door is global. Any geo-distributed application should use either Cloudflare (if Azure is discarded), Front Door, or Traffic Manager, which is a DNS-based load balancer. Unlike Front Door, Traffic Manager does not proxy traffic but rather redirects the client to the right backend using DNS-based rules. Commonly used third-party reverse proxies and WAFs are F5 and Imperva, already introduced in the previous section. For load balancing, besides Traffic Manager, Azure has multiple types of internal (ILB)/external (ELB) load balancers based on different pricing tiers. Interestingly, it is possible to bind ILBs to Azure Front Door using Private Link Service. Load balancers are everywhere in Azure, whether managed by you or by Microsoft.

One of the key considerations for a Solution Architect is identifying the Network Topology used in the Azure environment, as it significantly influences workload design and can even determine which services are viable or restricted. One of the most widely adopted topologies is Hub and Spoke, where workloads are deployed to spokes and where hubs are virtual networks dealing with specific duties (firewalls, hybrid connectivity, DNS, etc.). You can choose to fully manage the Hub and Spoke architecture yourself or opt for partial management using Azure Virtual WAN, which introduces Virtual Hubs that are entirely managed by Microsoft. An alternative to Hub and Spoke is the Mesh Network topology, a decentralized network architecture where nodes are interconnected. This can easily be achieved using Azure Virtual WAN. The initial value proposition of Virtual WAN was to connect all the branch offices of an organization together with the least possible effort. Since then, Virtual WAN has drastically evolved and even allows you to bring your own appliances. We will dive much deeper into Azure and Kubernetes networking in the next chapters.
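As a small illustration of DNS-based load balancing, here is a hedged Terraform sketch, with hypothetical names and endpoints, of a Traffic Manager profile routing users to the closest healthy regional deployment:

```hcl
resource "azurerm_traffic_manager_profile" "global" {
  name                   = "tm-contoso-global"
  resource_group_name    = azurerm_resource_group.network.name
  traffic_routing_method = "Performance" # route to the lowest-latency endpoint

  dns_config {
    relative_name = "contoso-global"
    ttl           = 60
  }

  monitor_config {
    protocol = "HTTPS"
    port     = 443
    path     = "/health"
  }
}

# One endpoint per regional deployment. Traffic Manager only answers DNS
# queries; it never proxies the actual traffic, unlike Front Door.
resource "azurerm_traffic_manager_external_endpoint" "weu" {
  name              = "weu"
  profile_id        = azurerm_traffic_manager_profile.global.id
  target            = "weu.contoso.com" # hypothetical regional hostname
  endpoint_location = "westeurope"
}
```

Now, let's take a quick look at governance.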

Learning about governance/compliance

Whether you like it or not, a properly managed Cloud environment must be governed. Governance is key to keeping things under control and manageable at scale. While not directly in charge of the overall governance yourself as a Solution Architect, you must understand the landing zone model that has been adopted by the organization. Some well-organized companies might decide which services you can use depending on non-functional requirements such as RTO (Recovery Time Objective) and RPO (Recovery Point Objective), as they might have already mapped Azure services to RTO/RPO levels. RTO/RPO play a key role in defining the service level your solution must adhere to, which is even more important for customer-facing applications (B2B or B2C). Figure 2.21 depicts the main services involved in defining governance:

Figure 2.21 – Zoom in on governance

The GOVERNANCE zoom-in map has five top-level groups:

  • Pre-Configured Compliant Workloads
  • RBAC
  • Hierarchy
  • Global Oversight
  • Frameworks

Azure governance is directly linked to the strategy of the Cloud journey, which we discussed in Chapter 1, Getting Started as an Azure Architect. The good news is that a documented strategy can be enforced in a tangible manner through Azure Policy, which is part of our Global Oversight top-level group. As a Solution Architect, whether designing a solution for your own organization or for a customer, you must know which policies are enforced in the hosting environment. This must be anticipated, in order to avoid unwanted surprises later on. Azure Policy lets you define virtually any rule that controls how resources are deployed. It makes it possible to deny non-compliant deployments, to fix them on the fly, or to simply audit deviations and remediate them later. Azure Policy is very powerful but can be very constraining at times. Common policies include managing deployment regions for services, restricting virtual machine sizes, preventing public access, enforcing service pricing tiers, and verifying whether resources are properly tagged, among other controls.

Another way to rule the system is through Azure RBAC (Role-Based Access Control), which lets you define who/what can access Azure resources. Azure RBAC is very granular, as it has hundreds of built-in roles and lets you create custom roles if needed. Both RBAC and Azure Policy are tightly coupled to the Hierarchy, as they can be bound to each level of the Azure hierarchy, starting from the Root Management Group.

The hierarchy reflects the structure of an organization, as well as all the scopes over which Azure policies and RBAC are applied. The hierarchy is composed of management groups, subscriptions, and resource groups. Each of these levels is an RBAC and policy scope. In practice, you might have hierarchies that are business-driven, as shown by Figure 2.22:

Figure 2.22 – Hierarchy with business lines

Hierarchies might also be more IT-driven, as shown in Figure 2.23:

Figure 2.23 – IT-driven hierarchy

You may combine both business and IT groups by using nested management groups. There is no one-size-fits-all approach, and each company defines a hierarchy that best reflects its own internal organization. In addition to Azure Policy for governing the platform and a structured hierarchy for organizing resources, you can leverage Microsoft Defender to evaluate the overall security posture and Azure Resource Graph to query resource-related information.

To make sure your governance is based on solid foundations, you can rely on the CAF, introduced in the first chapter. The Well-Architected Framework (WAF) can be considered a governance tool focused on workloads, as it provides best practices and reusable patterns applicable across various solution types. Finally, you can boost your compliance by using Pre-Configured Compliant Workloads, through Template Specs and Deployment Stacks, but these are mostly focused on ARM or Bicep. While it is possible to use them with Terraform, we would recommend staying away from that, as it implies using the AzApi provider, which should generally be restricted to the bare minimum in favor of the AzureRm provider.

While the design and definition of landing zones are primarily the responsibility of the infrastructure and security teams, Solution Architects must proactively consider all aspects of a solution. They should gather comprehensive information about the landing zones, as these will inevitably influence the solution's design. In an ideal world, Solution Architects should even be active contributors to the landing zone design, to integrate application dimensions as well. Another key consideration, though not strictly architectural, is cost management, which typically falls under the broader umbrella of FinOps. While FinOps is generally aligned with platform governance, Solution Architects have a crucial role to play in designing cost-efficient solutions. Amazon CTO Dr. Werner Vogels outlines seven guiding principles for cost-conscious architecture (https://thefrugalarchitect.com/laws/), which are well worth exploring.
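To show how the hierarchy and policies come together, here is a minimal Terraform sketch, with hypothetical management group names, that nests two management groups and assigns the built-in Allowed locations policy at the top scope:

```hcl
resource "azurerm_management_group" "corp" {
  display_name = "corp"
}

resource "azurerm_management_group" "online" {
  display_name               = "online"
  parent_management_group_id = azurerm_management_group.corp.id
}

# Look up the built-in policy definition by its display name.
data "azurerm_policy_definition" "allowed_locations" {
  display_name = "Allowed locations"
}

# Every subscription placed under "corp" inherits this assignment.
resource "azurerm_management_group_policy_assignment" "allowed_locations" {
  name                 = "allowed-locations"
  management_group_id  = azurerm_management_group.corp.id
  policy_definition_id = data.azurerm_policy_definition.allowed_locations.id

  parameters = jsonencode({
    listOfAllowedLocations = {
      value = ["westeurope", "northeurope"]
    }
  })
}
```

Now, let's examine which Azure services from the first edition of the Azure Solution Architecture Map have been retired or are scheduled for retirement, as this also affects solution designs over time.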

Retiring or retired services

Over the years, several Azure services featured in the first edition of the Solution Architecture Map have either been deprecated or are approaching retirement.

  • Microsoft LUIS and QnA Maker, which were mostly used in conversational chatbots, have been replaced by Azure AI Language. You may, of course, also use Azure OpenAI for this.
  • Azure Media Services was typically used for media sharing, encoding, and so on. Microsoft advises customers to move to partner solutions. Here is a link detailing migration strategies: https://learn.microsoft.com/en-us/previous-versions/azure/media-services/latest/azure-media-services-retirement.
  • Azure Front Door has replaced the Azure CDN offerings, including Azure CDN Standard from Akamai, although it does not offer the same set of features.
  • Azure Time Series Insights has been replaced by the broader Azure Data Explorer service. Third-party solutions such as InfluxDB or TimescaleDB may be considered as well.
  • Azure Database for MariaDB is retiring in September 2025, and Microsoft suggests customers migrate to Azure Database for MySQL.
  • Azure Functions Proxies are considered legacy and should be replaced by Azure API Management.
  • Azure AD B2C will be officially retired as of 2030 and replaced by Entra External ID.

The retirement of these services highlights the importance of evaluating a service's adoption rate before integrating it into our solutions, ensuring it remains supported long after deployment. Let us now go through a use case.

Solution architecture use case

The following sections will guide you through a hands-on use case where you'll build a solution diagram for a specific scenario. Try to make use of the maps and the comparison tables provided in this chapter to find your way.

Use case scenario

Contoso needs a configurable workflow tool that allows them to orchestrate multiple resource-intensive tasks, based on blobs landing in a blob storage. Each task must load large datasets to perform in-memory calculations. For some reason, the datasets cannot be chunked into smaller pieces, which means that memory contention could quickly become an issue under a high load. A single task may take between a few minutes and an hour to complete. Workflows are completely unattended (no human interaction) and are asynchronous. The business needs a way to check the workflow status and be notified upon completion. The solution must be portable, but minimal rework is acceptable in case of migration to another system. Contoso's manpower is a pool of .NET developers. They do not have data scientists and do not want to invest in such skills. They have a limited budget and want a solution in a timely fashion. Now, let's try to extract the most important keywords from this scenario.

Using keywords

As an Azure solution architect, you must capture the essential parts of a story. Of course, in reality, you would have a few back-and-forth discussions with the business and be given an opportunity to ask questions and challenge their assumptions. Nevertheless, here are a few keywords that can structure your train of thought before designing a solution. Let's review them one by one:

  • Portability: Whenever portability comes as a requirement, containers represent a good candidate.
  • .NET: They only have .NET skills in house and do not want to invest in data scientists/architects profiles. This discards data services such as Azure Databricks, which could have been the perfect candidate for such data calculations.
  • Resource-intensive tasks: In the cloud, as in any environment, computing power comes at a cost. With our limited budget, high-memory and high-CPU virtual machines are not an option. Instead, we must consider more flexible compute solutions, such as serverless or auto-scaling systems. The scenario does not specify whether guaranteed capacity is required on a permanent basis.
  • Task duration: This criterion is a structuring factor, as it eliminates some hosting options, such as Azure Functions hosted on the Consumption pricing tier (where executions cannot exceed 10 minutes). One option could be Azure Functions on one of the pre-paid pricing tiers.
  • Workflow: A workflow is a sequence of steps that are executed in a coordinated way. This aspect is important because it reduces the field of possibilities.

Let's now see how to make use of the map to progress in our thinking process.

Using the Solution Architecture Map against the requirements

Now that we have highlighted the important keywords, let's take a look at our map to try and make sense of it. We'll build a reference architecture, which could be reused by other projects later. We know that we must be able to handle orchestrations. Figure 2.24 is a subset of the Solution Architecture Map:

Figure 2.24 – The workflow capabilities

A workflow being an orchestration, we see two possibilities (under Orchestration in Figure 2.24): Durable Functions and Logic Apps. Logic Apps seems more appropriate for integration scenarios. Since our workflow is scoped to a single application and our team is made of .NET developers, Durable Functions might be a fit. At this stage, it is hard to make a choice based on this map alone. By the end of the book, you'll have a much deeper understanding of the options, but for now, you're still required to do a bit of extra research before making a decision. After a bit of extra reading, you manage to get a deeper comparison between Logic Apps and Durable Functions and end up with the following sub-map:

Figure 2.25 – A map focused on Logic Apps and Durable Functions

The manpower you have at your disposal is one of the factors to consider. Logic Apps is fully declarative and does not require any programming skills. Durable Functions are mostly developed for the scope of a single application and require development skills. We have .NET developers, and we do not have to deal with a workflow that goes beyond the scope of our single solution. The power of Logic Apps, which ships with hundreds of connectors to first and third parties, is not required for our scenario, making Durable Functions the most suitable option. Because we know we will use a container-based solution to satisfy the portability requirement, we look at the Containerization branch of our map:

Figure 2.26 – The container options

For our use case, orchestrations are triggered by blobs landing in the blob storage. Given the limited budget, we should look at services that allow us to perform resource-intensive tasks at a reasonable cost. We can safely assume that each orchestration can be seen as a multi-step job that performs calculations. We see that both Azure Batch and Container Instances are a good fit for jobs. As we previously observed, Container Instances can be easily created and destroyed, offering substantial compute resources while being billed per second of execution. They check all the boxes, as they are cost-friendly, suited to compute-intensive work, and able to run Windows/Linux containers. We have identified our two primary building blocks, namely Durable Functions and Container Instances. In the next section, we will see how to infer a reference architecture from these preliminary conclusions.

Building the target reference architecture

With this progress, you are ready to build your reference architecture. Of course, there is never a one-size-fits-all approach, but Figure 2.27 shows you a possible solution:

Figure 2.27 – A sample reference architecture

There are small numbers on the diagram that we explain as follows:

  1. Azure Blob Storage, part of a Storage account, receives the incoming blobs to be treated by the workflow. These are the large files we referred to in our scenario.
  2. A durable client, with a blob storage trigger, is kicked off whenever a blob lands in the blob storage. The durable client starts a new orchestration, passing in some orchestration configuration (such as the number of steps, retries, timeouts, and so on).
  3. The main orchestrator, which could have sub-orchestrators, in turn provisions one container instance per workflow step and passes a callback URL that ACI uses to report its status and, optionally, to return a pointer to the input data of the next task (see the sketch after this list). An alternative to a direct callback URL could be to put a message broker between the orchestrator and the ACIs: the orchestrator could queue commands in a Service Bus or Storage Account queue, and the ACIs could handle those commands and report their status using queuing as well.
  4. ACI gets the input blob that it needs to handle.
  5. ACI writes output to another Blob Storage.
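Although the container groups would be created and deleted at runtime by the orchestrator (through the Azure SDK or the REST API), the following hedged Terraform sketch, with a hypothetical image and URLs, shows the shape of the per-step container instance the orchestrator would provision:

```hcl
resource "azurerm_container_group" "step" {
  name                = "aci-workflow-step-1"
  resource_group_name = azurerm_resource_group.workflow.name
  location            = azurerm_resource_group.workflow.location
  os_type             = "Linux"
  restart_policy      = "Never" # a step runs once and reports back

  container {
    name   = "worker"
    image  = "contoso.azurecr.io/workflow-worker:latest" # hypothetical image
    cpu    = 4  # sized per step, possibly based on the input blob size
    memory = 16 # in GB

    environment_variables = {
      # Hypothetical values passed by the orchestrator at creation time.
      CALLBACK_URL = "https://func-orchestrator.azurewebsites.net/api/callback"
      INPUT_BLOB   = "https://stworkflow.blob.core.windows.net/input/dataset.bin"
    }
  }
}
```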

The overall process contains one or more steps, and each step is allocated a dedicated ACI with the required compute, which could be decided dynamically based on the size of the input blob. Each task may execute in parallel or one after the other, depending on the orchestrator logic. Figure 2.28 is an example of a sequential workflow that could match our scenario:

Figure 2.28 – A sequential workflow example with Durable Functions and ACI

The main orchestrator starts one sub-orchestrator per workflow task. Each task creates a container instance and waits for its feedback through the WaitForExternalEvent operation (or using queue-based mechanisms). The Durable Functions framework allows you to define retries and timeouts. Whatever the workflow step status is (such as failed or timed out), the corresponding ACI is deleted, and the main orchestrator either retries, stops the orchestration, or goes to the next step, according to the workflow configuration passed dynamically by the orchestration client. Note that the exact same workflow logic could be defined with Logic Apps, which is also able to orchestrate the creation and deletion of ACIs, but considering a pure .NET team, they will likely prefer Durable Functions. Now that we have some sort of reference architecture, let's see what is still missing.

Understanding the gaps in our reference architecture

We may think we did a great job earlier when designing our reference architecture, but it suffers from important gaps. Look again at Figure 2.29:

Figure 2.29 – Where are the NFRs?

We have not covered the cross-cutting concerns. We mostly focused on the building blocks and their interactions, but we did not cover anything about monitoring, security, resilience, and so on. Sometimes, it can be challenging to reflect everything in a single diagram, because it makes that diagram either too big or too complex to understand. To overcome this, a possibility is to work with different views, scoped to specific areas. Doing so will also help you engage with your peers, as well as with more specialized architects.

The Solution Architecture Map cannot go too deep into each domain, because it would simply be too broad. Covering all the NFRs with this map alone is not possible. As a Solution Architect, you will also need to refer to other maps in this book to find your way in other fields, and you'll perhaps also need to ask for extra help from infrastructure and security architects. The views that should be added to this architecture are as follows:

  • Monitoring view: This is where you add Azure Monitor, Log Analytics, and dashboards into the mix. You might have to add Splunk or any other on-premises tool that your organization (or your customer) is using.
  • Security view: This is where you indicate the different authentication mechanisms, such as Shared Access Signature (SAS) tokens, managed identities, OAuth flows, and so on to authenticate against Blob Storage as well as to create and destroy ACIs. You would add protocol-related information (TLS 1.xx, etc.). The security map in Chapter 9, Security Architecture, will help you here.
  • High-availability and disaster recovery views: Here, you focus on the availability and resilience of your solution. The infrastructure map of the next chapter will help you here.

As mentioned earlier, we started with the basics, but the complexity will gradually increase as we progress through this book. Before wrapping up this chapter, let's examine some real-world insights into trends and practices that have emerged over the past few years.

Real-world observations

Here is a collection of real-world observations. While not scientifically rigorous (more empirical than statistical), they represent common patterns and practices we have observed across many of our customers.

  • The adoption of containerization keeps rising. Services such as AKS and ACA remain very popular. We will cover them in depth in Chapter 4, Working with Azure Kubernetes Services, and Chapter 5, Working with Other Container Services.
  • Every organization is exploring Generative AI, making Azure OpenAI and Azure AI Search two of the most used services. We will cover this in detail in Chapter 8, Artificial Intelligence Architecture.
  • Isolation from the internet has become the new norm, sometimes adding extra complexity for Solution Architects, who must find the right services that integrate with virtual networks. The next chapter will cover this in detail.
  • Cybersecurity remains an important topic regardless of the industry sector. We will cover this in Chapter 9, Security Architecture.

Summary

In this chapter, we described the Solution Architecture Map and its different classification categories, which are SoEs, SoRs, SoIs, SoInts, and SoIns. This architectural categorization simplifies the distribution of services, making design decisions more structured. We provided detailed explanations and focused maps to refine alternatives, ensuring a clear understanding of trade-offs. We then explored containerization, presenting a focused map of Azure's container ecosystem and evaluating each option based on cost, complexity, and operational overhead.

Additionally, we emphasized cross-cutting concerns relevant to all solutions and discussed which ones solution architects should prioritize. Since addressing every concern from day one can be overwhelming, we introduced the concept of maturity levels, outlining how they can be incorporated into a roadmap to manage stakeholder expectations effectively. We then briefly covered key security and governance aspects, highlighting how numerous third-party solutions (the list is by no means exhaustive) can contribute to these areas. Remember a key takeaway from Chapter 1: establish principles and adhere to them! A principle like "Best of suite over best of breed" can help minimize the number of third-party integrations required for your cloud platforms, especially now that most Azure security/governance services are usable across environments, including on-premises and other clouds.

Finally, we engaged in a concrete use case, designing a solution architecture based on a business scenario, leveraging the Solution Architecture Map, and implementing it through sample diagrams. In the next chapter, we will shift focus to infrastructure-related aspects, such as monitoring, connectivity, and disaster recovery, helping you further enhance the reference architecture developed in this chapter.

3 Infrastructure Design

Join our book community on Discord

https://packt.link/0nrj3

In this chapter, we will focus on infrastructure architecture with Azure. Here, we will review the different concerns that every infrastructure engineer and architect has to deal with on a daily basis. More specifically, we will cover the following topics:

  • The Azure Infrastructure Architecture Map
  • Zooming in on networking
  • Zooming in on monitoring
  • Zooming in on high availability and disaster recovery
  • Zooming in on backup and restore
  • Zooming in on HPC
  • Use cases

We will provide a 360° view of what it means to build infrastructure with Azure, including the most common practices and pitfalls. Azure networking is such a vast topic that we've created a dedicated map for it, and we'll be devoting a significant portion of this chapter to exploring it in depth. We will also explore two different use cases, inspired by real-world situations. The first use case is more conceptual, guiding you through the design of a global API platform that spans multiple continents and leverages the PaaS service model. In contrast, the second use case focuses on implementing a multi-hub architecture; we'll provide both the architectural diagrams and the deployment code so you can set it up and see it in action in your own tenant. Let us now explore the technical requirements.

Technical requirements

In this chapter, we will be using Microsoft Visio files, but the corresponding PNGs are also provided. To test the provided code for our second use case, you will need the following:

  • Visual Studio Code is needed to open the provided solution. You can download it here for free: https://code.visualstudio.com/download
  • Terraform is needed to deploy the provided code to Azure. You can download it for free here: https://developer.hashicorp.com/terraform/install
  • An Azure subscription with owner permissions is needed to deploy the provided code. You can start a free trial if necessary. Follow this link: https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account
  • Azure CLI is needed to authenticate against Azure prior to deploying the solution. You can find the installation instructions here: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli.

Maps, diagrams, and code are available at https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/tree/master/Chapter03.

The Azure Infrastructure Architecture Map

The Azure Infrastructure Architecture Map, shown in Figure 3.1, is intended as your Azure infrastructure compass. It should help you deal with the typical duties of an infrastructure architect, which we covered in Chapter 1, Getting Started as an Azure Architect. Unlike the Solution Architecture Map, which was more high-level, this map is a vertical exploration of infrastructure topics. It is by no means the holy grail, but it should help you grasp the broad infrastructure landscape at a glance. Throughout this chapter, we will describe its various elements and bring in real-world context whenever possible.

Figure 3.1 – The Azure Infrastructure Architecture Map

Important note

To see the full Infrastructure Architecture Map (Figure 3.1), you can download the PDF file available at https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/blob/master/Chapter03/maps/Azure%20Infrastructure%20Architect.pdf.

The Azure Infrastructure Architecture Map (Figure 3.1) has several top-level groups:

  • Network: This foundation is ubiquitous and part of all the debates and trade-offs. Traditional IT is entirely based on the perimeter approach, which sometimes conflicts with the cloud's zero-trust approach. We will detail our options further in the Zooming in on networking section and will tackle the more specific security concerns in Chapter 9, Security Architecture. Since Azure networking is such a broad topic, we've created a dedicated map that we'll explore in this chapter (covering foundational aspects) and revisit throughout the remaining chapters.
  • Monitoring: Since this topic was introduced in the previous chapter, we will focus on a concrete monitoring example.
  • Governance/Compliance and Security: Governance and security are both key aspects of well-managed infrastructure. For the sake of brevity and to avoid repetitions, we will explore this topic in Chapter 9, Security Architecture.
  • High Availability: Every system/application might require some level of resilience. We will explore the different options and work on a specific use case.
  • Disaster Recovery and Backup/Restore: Disaster recovery is a common topic in every infrastructure shop, and Azure is no exception, regardless of the Cloud service model. Our first use case will be exactly about this.
  • HPC: HPC stands for high-performance computing. We'll analyze the different HPC options in our Zooming in on HPC section.
  • CI/CD: Since this topic was introduced in the previous chapter, we will focus on a concrete Infrastructure as Code example using Terraform for our second use case.
  • Hybrid: Azure Local and Azure Arc are two ways to "Azurify" any environment: on-premises with both Azure Local and Azure Arc, and other clouds with Azure Arc.

Writing a book involves making choices, so we'll only give a brief overview of the HPC and Hybrid sections, as each would warrant a dedicated chapter of its own. We've also chosen not to emphasize them too much, since the other topics we'll cover are essential knowledge regardless of the organization you work for.

Before we delve deeper into our map, let's take a moment to further clarify the concept of landing zones, which was introduced in the previous chapter. Landing zones provide the foundational framework for deploying and operating applications in the Cloud. They establish essential baselines for security, monitoring, and governance, prior to provisioning any actual workloads. To support different application needs, distinct landing zone archetypes should be defined. For instance, an internet-facing application will need to be exposed through a web application firewall. A hybrid application must integrate with on-premises systems, while a cloud-only internal application may simply require secure access for corporate users. Each of these scenarios demands a unique set of capabilities, which should be encapsulated within tailored landing zone archetypes. In the Cloud Adoption Framework, Microsoft has defined a few archetypes, which you can refine further to meet your unique needs.

Everything we implement, whether it's networking, monitoring, or governance, is ultimately in support of building and managing landing zones at scale. In addition to networking and security principles, landing zones can also align with service levels. For instance, you might choose to provide bronze, silver, gold, or platinum SLAs, each requiring varying levels of availability and business continuity capabilities. Now that we've clarified the purpose and outlined the key topics of our map, it's time to dive into each area in detail. We'll begin with networking, one of the most critical aspects of cloud infrastructure.

Zooming in on networking

Networking is one of the essential foundations of any Azure landscape. Figure 3.2 shows the topics we will focus on in this chapter. Note that we have also provided a much more comprehensive extra map for your reference. Links to both maps are provided below.

Figure 3.2 – Zooming in on networking

Important note

To see the full Azure Networking Maps, you can download the PDF file available at https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/blob/master/Chapter03/maps/Azure%20Network%20Architect.pdf and

We introduced the landing zone concept as well as the Azure networking topic in Chapter 2, Solution Architecture. We briefly explained that the purpose of a landing zone is to structure, govern, and rule the Azure platform for the assets that will be hosted on it. Controlling network flows is one of the key governance aspects. Controlling the network means mastering internal and external traffic, inbound and outbound, flow logs, and so on. This is a vast topic and an important challenge. Let's now dive deeper. The network section has the following top-level groups:

  • Data Center connectivity
  • Most common architectures
  • DNS
  • Routing

Since connecting the on-premises data center(s) to Azure is typically one of the initial steps when establishing new foundations, let's begin by exploring the available data center connectivity options.

Looking at the hybrid connectivity options

Many enterprise customers using Azure operate in hybrid environments, where some workloads remain on-premises while others run in Azure. In some cases, individual workloads span both environments, with components deployed both in the cloud and on-premises. As a result, it's essential to establish connectivity between on-premises data centers and Azure. The DC Connectivity node of Figure 3.3 illustrates various options for achieving this.

Figure 3.3 – Data center connectivity options

Microsoft positions ExpressRoute (ER) as a key network enabler for hosting mission-critical workloads in Azure. ER is considered the best choice primarily because it offers predictable latency, allows precise control over bandwidth consumption, and supports priority management through Quality of Service (QoS), all made possible by its foundation on the physical layer. ER is also globally accessible through Cloud Exchange Brokers, with whom organizations often already have established connections, making it easier to extend their network to Azure.

Additionally, ER offers several resilience models to accommodate varying levels of redundancy and availability. The Single-Homed model connects to a single highly available ExpressRoute circuit in one peering location, providing minimal redundancy and making it suitable only for non-critical workloads. The Dual-Homed model improves fault tolerance by connecting to two independent circuits in different peering locations, ensuring continued connectivity even if one site experiences a failure. The Dual-Homed Metro model strikes a balance by leveraging two circuits in separate facilities within the same metropolitan area (for example, Amsterdam), offering higher resilience than Single-Homed while maintaining lower latency and simplified management compared to full Dual-Homed deployments. Because these hybrid connectivity models are essential foundations, you should anticipate your Cloud roadmap and the business pipeline to make the right decisions.

On the other hand, a Site-to-Site (S2S) VPN is mostly used for small-scale production workloads, in terms of hybrid traffic. Besides an S2S VPN, you can also use a Point-to-Site (P2S) VPN, but this is typically used to allow external collaborators to access resources hosted in your Azure environment. For example, you may use Azure Virtual Desktop (AVD) to give access to external collaborators, who should set up a VPN client on their machine prior to accessing the AVD machines.

Choosing between an S2S VPN and ExpressRoute is not always easy, but here is a use case table that should help you (note that + means preferred option):

Use case | ExpressRoute | S2S VPN
Many hybrid workloads (such as corporate applications and data exchanges between on-premises and the Cloud) | + | -
Mostly internet-facing workloads with few interactions (resource management aside) between on-premises and the Cloud | - | +
Global reach | + | -
Latency-sensitive workloads | + | -
High SLA requirements | + | -
Low budgets | - | +
High resilience | + | -
Encryption | - with regular ExpressRoute; + with ExpressRoute Direct (MACsec) | + (IPsec)

Table 3.1 – Comparison table between ExpressRoute and S2S VPN

Regardless of whether you use ExpressRoute or a Site-to-Site VPN, and depending on your chosen resiliency model, it's important to plan for a break-glass access method to your Azure workloads. Azure Bastion provides a secure and reliable way to achieve this. Azure Bastion is accessible directly from the Azure Portal and enables secure connections to your virtual machines without requiring public IP addresses or exposing RDP/SSH ports. While it can be useful for routine access, it becomes especially valuable in adverse scenarios where traditional connectivity paths are unavailable.
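As a hedged illustration, here is what provisioning Bastion in a hub network could look like in Terraform, assuming a hypothetical hub virtual network and resource group defined elsewhere:

```hcl
# Bastion requires a dedicated subnet literally named "AzureBastionSubnet"
# (minimum size /26).
resource "azurerm_subnet" "bastion" {
  name                 = "AzureBastionSubnet"
  resource_group_name  = azurerm_resource_group.hub.name
  virtual_network_name = azurerm_virtual_network.hub.name
  address_prefixes     = ["10.0.1.0/26"]
}

resource "azurerm_public_ip" "bastion" {
  name                = "pip-bastion"
  resource_group_name = azurerm_resource_group.hub.name
  location            = azurerm_resource_group.hub.location
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_bastion_host" "hub" {
  name                = "bas-hub"
  resource_group_name = azurerm_resource_group.hub.name
  location            = azurerm_resource_group.hub.location

  ip_configuration {
    name                 = "ipconfig"
    subnet_id            = azurerm_subnet.bastion.id
    public_ip_address_id = azurerm_public_ip.bastion.id
  }
}
```

Once you have bridged the two worlds – on-premises and Azure – you must concentrate on how to route and organize traffic within Azure itself, which is what we will look at in the next two sections.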

Looking at the most common architectures

Let's start by exploring the most common architectures, illustrated in Figure 3.4.

Figure 3.4 – The most common architectures

Let's start with the Hub and Spoke Architecture.

Hub and Spoke Topologies

As we briefly described in Chapter 1, Getting Started as an Azure Architect, the Hub and Spoke architecture, presented in Figure 3.5, is the most frequent hybrid setup:

Figure 3.5 – A simplified view of the Hub and Spoke architecture – Single-Hub

At the bottom of Figure 3.5, we have the on-premises network, which is connected to a single Azure Hub through ExpressRoute, VPN, or sometimes both at the same time, with an IPsec overlay. The Azure Hub is itself connected to the different spokes, in which assets are deployed. The Azure Hub is nothing but a regular Azure Virtual Network (VNET). Its role is to route traffic between spokes (east-west) and between the on-premises and cloud data centers (hybrid north-south traffic). Spokes, which are simply VNETs, are peered with the Hub to allow mutual awareness of each other's address space, enabling the Hub to recognize each Spoke's address range and vice versa. VNET peering is non-transitive in Azure, meaning that if spoke 1 is peered with spoke 2 and spoke 2 is peered with spoke 3, then spoke 1 cannot talk directly to spoke 3, as illustrated by Figure 3.6.

Figure 3.6 – Peering is not transitive in Azure
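To make the peering relationship concrete, here is a minimal Terraform sketch, with hypothetical names, of the bidirectional peering between a hub and a spoke. The gateway-related flags assume the hub hosts a VPN or ExpressRoute gateway:

```hcl
# Peering must be declared in both directions to become effective.
resource "azurerm_virtual_network_peering" "hub_to_spoke1" {
  name                      = "hub-to-spoke1"
  resource_group_name       = azurerm_resource_group.hub.name
  virtual_network_name      = azurerm_virtual_network.hub.name
  remote_virtual_network_id = azurerm_virtual_network.spoke1.id
  allow_forwarded_traffic   = true
  allow_gateway_transit     = true # expose the hub's gateway to the spoke
}

resource "azurerm_virtual_network_peering" "spoke1_to_hub" {
  name                      = "spoke1-to-hub"
  resource_group_name       = azurerm_resource_group.spoke1.name
  virtual_network_name      = azurerm_virtual_network.spoke1.name
  remote_virtual_network_id = azurerm_virtual_network.hub.id
  allow_forwarded_traffic   = true
  use_remote_gateways       = true # reach on-premises through the hub
}
```

Removing the spoke-side peering instantly severs every capability the spoke inherited from the hub, which is exactly the isolation lever discussed in the following paragraphs.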

Consequently, the Hub and Spoke architecture is used to simplify the structure and to have a central VNET (the hub) that is connected to all the others as a way to bridge them all. Additionally, the hub might even deal with internet ingress and egress, also known as north-south traffic. Peering spokes directly is also technically possible, but it goes against the purpose of the Hub and Spoke topology. VNET peering is global, meaning that you can peer a VNET that sits in the West US region with another one sitting in the West Europe region, but beware that cross-regional peering incurs extra outbound traffic costs. Each spoke usually hosts the services of one single asset, while the shared system services (DNS, for instance, as highlighted in our previous section) are hosted in the hub.

Besides the basic single-hub architecture, you might want to split hubs further into a multi-hub architecture. As you may have understood by now, virtual network peering is a powerful way to bring network capabilities to a spoke. However, centralizing all network functions, including east-west and north-south traffic, into a single hub automatically extends these capabilities to connected spokes, making it unclear what type of traffic each spoke truly needs. A security breach could also have a larger and more difficult-to-contain blast radius, as changes to the central firewall may affect other spokes. Additionally, removing a spoke's peering immediately strips it of all capabilities, including operation channels leveraging on-premises connectivity, unless an out-of-band management process is in place. In security-driven organizations, working with a single hub might fall short. It is therefore interesting to look at multiple alternatives that enable a better segregation of duties between the hubs. One such example is the so-called Double-Hub architecture, shown in Figure 3.7:

Figure 3.7 – Simplified view of the Double-Hub architecture

In the Double-Hub architecture represented in Figure 3.7, we have two hubs, namely the hybrid hub and the ingress hub. Each hub deals with its own duties. The ingress hub handles internet-facing traffic and includes services like web application firewalls and standard layer-4 firewalls. Internet-facing assets hosted in online landing zones (Online Archetype) are peered with both the ingress hub and the hybrid hub, whereas corporate applications (Corporate Archetype) are connected only to the hybrid hub. This Double-Hub architecture ensures that all inbound internet traffic flows through the ingress hub, preventing spokes from directly exposing public-facing components—a restriction that can be enforced using Azure Policy. This architecture simplifies the separation of internet-facing and internal workloads by allowing quick identification through peerings with the ingress hub. It also minimizes the blast radius in case of a firewall misconfiguration and enables swift isolation of any workload from the internet by simply removing its peering with the ingress hub.

Yet, the hybrid hub still deals with multiple duties, such as managing east-west traffic across spokes, internet egress traffic, as well as hybrid traffic (from on-premises to the cloud and vice versa). To take it a step further, you can leverage the Triple Hub architecture, as shown in Figure 3.8:

Figure 3.8 – The Triple Hub architecture

The Triple Hub architecture keeps using the ingress hub for internet-facing workloads but ensures that outbound internet traffic is routed through a dedicated egress hub. This egress hub is equipped with a firewall, which can optionally forward traffic to zero-trust and Cloud Access Security Broker (CASB) players like Netskope, which can handle traffic filtering, enforce access policies, and ensure Data Leak Prevention (DLP). Yet, the hybrid hub still deals with multiple duties, such as managing east-west traffic across spokes as well as hybrid traffic. To take it a step further, you can leverage the Quadruple Hub architecture, as shown in Figure 3.9:

Figure 3.9 – Quadruple Hub architecture

The Quadruple Hub architecture ensures that each hub is dedicated to a specific function: the ingress hub handles inbound internet traffic, the egress hub manages outbound internet traffic, the hybrid hub is reserved for hybrid connectivity, and the newest addition—the integration hub—is responsible for enabling spoke-to-spoke communication. The underlying objective remains unchanged: to minimize the blast radius of any security breach or misconfiguration, ensure clear visibility into the network capabilities assigned to each workload, and maintain the ability to selectively and precisely disable those capabilities when necessary.

You might have noticed that in each of the four topologies we explored, all the spokes and the hubs themselves were peered to the hybrid hub, regardless of the landing zone archetype. This is required to be able to operate, manage, and deploy workloads to the spokes and the hubs themselves, which makes the hybrid hub deal with both application-level and management-level hybrid traffic.

Regardless of the topology you adopt, it's important to separate production from non-production environments—typically by replicating the chosen architecture in both. Some organizations take this a step further by using separate Entra ID tenants for production and non-production to enhance isolation. Here is a comparison table to help you choose the best topology (single hub versus multiple hubs) for your specific needs.

Criterion | Single-Hub | Multi-Hub
Scalability | - | +
Security | - | +
Blast Radius | - | +
Observability | - | +
Manageability | - | +
Multi-region | + | -
Costs | + | -

Table 3.2 – Single-Hub vs Multi-Hub architectures

Multi-hub architectures offer significant benefits for security-focused organizations. However, they come with substantial costs and can become increasingly expensive, especially when extending across multiple regions. These topologies mostly apply to financial institutions, the insurance sector, and similarly regulated industries; they would definitely not be suitable for smaller organizations with limited budgets. We wanted to demonstrate that the real world often extends well beyond what's covered in online documentation. Now that we've explored hub-and-spoke topologies, let's shift our focus to how you can implement them.

Hub and Spoke methods

You can choose to manage the hubs yourself for full control, or delegate their management to Microsoft by using Virtual WAN (VWAN). In VWAN, hubs—referred to as virtual hubs—are preconfigured with a default router (not a firewall). You can secure them by attaching Azure Firewall instances or your own firewall solution, like Palo Alto, transforming them into secured virtual hubs. Routing between hubs and spokes is greatly simplified using routing intent, which automatically directs internet and/or private traffic to the firewalls. Additionally, VWAN is ideal for mesh networks, thanks to the built-in default router in each virtual hub, which enables seamless connectivity between all spokes attached to the same hub. When multiple hubs belong to the same Virtual WAN instance, inter-hub communication is also supported by default, together forming a fully meshed network architecture. Given its significant evolution in recent years, VWAN is a compelling option to consider when building new network foundations.

Be aware that establishing your network foundations can take several months, a factor often underestimated by customers. It's far from a trivial task, as it requires coordination with network and security teams, who may have limited experience with Azure.
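Coming back to the self-managed approach, here is a minimal Terraform sketch of the basic hub-to-spoke building block: two peering resources, one per direction. The hub, spoke1, and rg names are hypothetical:

resource "azurerm_virtual_network_peering" "hub_to_spoke1" {
  name                      = "peer-hub-to-spoke1"
  resource_group_name       = azurerm_resource_group.rg.name
  virtual_network_name      = azurerm_virtual_network.hub.name
  remote_virtual_network_id = azurerm_virtual_network.spoke1.id
  allow_forwarded_traffic   = true # let traffic routed through the hub firewall pass
  allow_gateway_transit     = true # only meaningful if the hub hosts a VPN/ER gateway
}

resource "azurerm_virtual_network_peering" "spoke1_to_hub" {
  name                      = "peer-spoke1-to-hub"
  resource_group_name       = azurerm_resource_group.rg.name
  virtual_network_name      = azurerm_virtual_network.spoke1.name
  remote_virtual_network_id = azurerm_virtual_network.hub.id
  allow_forwarded_traffic   = true
  use_remote_gateways       = true # reuse the hub's hybrid connectivity gateway
}

Now that we have a basic understanding of the different topologies, let's see how to route traffic.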

Looking at the routing options

Based on the previous section, the ability to route both east-west and north-south traffic between a source and a destination is a fundamental requirement. The routing node of Figure 3.10 highlights the routing options and mechanisms available in Azure to support this.

Figure 3.10 – Routing options

Azure creates system routes by default. These routes cannot be deleted, but they can be overridden. Let us first take a concrete example to understand how they work. Imagine we create the following virtual network with two subnets and one virtual machine in each subnet:

Figure 3.11 – Isolated virtual network with two subnets and two VMs

All virtual machines (and PaaS components) will use the default route shown in Figure 3.11 when targeting either of the subnets, because both subnet ranges belong to the default prefix advertised by Azure. By default, every component of a given virtual network will leverage the virtual network next hop type when connecting to any other component of that virtual network. Similarly, when we peer our VNET with another one, an additional route is propagated, as shown in Figure 3.12.

Figure 3.12 – New route added because of peering

We see that this time, a new route leveraging the VNet peering next hop type is created to let both virtual networks know about each other's address space. Numerous other network-related events can trigger the creation of additional system routes, and other system routes exist by default, but we want to keep things simple and focus on basic examples. The main point to remember is that, unless explicitly overridden, system routes are always used by default and cannot be deleted. To override this default behavior, we must advertise either Border Gateway Protocol (BGP) routes or User-Defined Routes (UDRs). Back to our initial example, we might choose to route intra-VNet traffic through a firewall. In other words, we may want to enforce firewall inspection or validation for communication between subnets. This requires us to modify Azure's default behavior by overriding system routes, as shown in Figure 3.13.

Figure 3.13 – Using UDRs to override system routes

As shown in Figure 3.13, the addition of a new route of type User, telling Azure to send traffic destined for 192.168.0.0/24 to 192.168.1.4 (our firewall), causes the system route to be marked as Invalid. This means that whenever A tries to connect to B, the traffic first goes to the firewall, which should allow or deny it. We'll cover BGP routes later in this section, but for now, we hope these basic examples helped you understand how Azure's routing behavior works. A minimal Terraform sketch of this UDR follows at the end of this section.

UDRs are commonly used to override system routes, but they can be tedious to manage at scale, and that is where Azure Virtual Network Manager (AVNM) comes to the rescue. This versatile service allows you to define default routes that are propagated to the virtual networks that match the routing rule policies. This is also a way to enforce a default standard, which can be controlled further using Azure Policy. When using Azure Virtual WAN, a common practice is to define routing intent for public and/or private traffic and let VWAN advertise these routes automatically to every spoke connected to the virtual hub. Finally, BGP is a well-known protocol used to exchange routing information between different autonomous systems, aimed at determining the best possible data path between systems. Azure Route Server acts as a BGP peer and dynamically exchanges BGP routes between Azure virtual networks and Network Virtual Appliances (NVAs). No matter how you manage routes, keep in mind that system routes are present by default, and only routes that are equally or more specific—whether from BGP or UDRs—can override them. Also keep in mind that UDRs always take precedence over the other route types.
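Here is the promised sketch of the UDR from Figure 3.13 in Terraform, assuming the firewall's private IP is 192.168.1.4 and hypothetical rg and subnet_a resources:

resource "azurerm_route_table" "force_firewall" {
  name                = "rt-subnet-a"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name

  route {
    name                   = "to-subnet-b-via-firewall"
    address_prefix         = "192.168.0.0/24"     # destination subnet
    next_hop_type          = "VirtualAppliance"   # overrides the system route
    next_hop_in_ip_address = "192.168.1.4"        # the firewall's private IP
  }
}

# The route table only takes effect once associated with a subnet
resource "azurerm_subnet_route_table_association" "subnet_a" {
  subnet_id      = azurerm_subnet.subnet_a.id
  route_table_id = azurerm_route_table.force_firewall.id
}

Now let's dive into DNS, a system capability just as crucial as connectivity and routing.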

Looking at the DNS options

DNS in Azure is a critical topic and serves as a fundamental building block. Figure 3.14 illustrates the available options.

Figure 3.14 – DNS options

In Figure 3.14, public and private DNS are clearly differentiated. Public DNS management can be delegated to Azure through DNS zones, which are global, highly scalable, and support automation. This enables Azure to handle tasks like domain validation for public certificate management and ensures seamless integration with other Azure services. Private DNS Zones provide similar benefits but are designed for custom private domains, such as contoso.internal. They also enable the privatization of PaaS services by leveraging private endpoints via Azure Private Link (APL).

It's worth taking some time to understand how APL works, as it's not entirely straightforward. To privatize a PaaS service, such as Azure Service Bus, you need to enable APL directly on the service. This action prompts Microsoft to create an alias that points to the associated private link domain. For example, if you enable APL for contososb.servicebus.windows.net, Microsoft, as the authoritative party, will add the alias record contososb.privatelink.servicebus.windows.net to its public DNS. From that point forward, clients without access to the private DNS zones will resolve the service to its public IP, while those with access will receive the private endpoint in the DNS resolution. Figure 3.15 illustrates this.

Figure 3.15 – Simplified DNS resolution

In Figure 3.15, both clients—the VM within the VNet and the one outside—attempt to resolve the same FQDN: contososb.servicebus.windows.net. The VM outside the VNet receives the public IP address because it doesn't have visibility into the private DNS zone. In contrast, the VM inside the VNet first queries 168.63.129.16, a non-routable IP used by Azure DNS as the primary resolver within each VNet (unless specified otherwise, as we'll see in the next section). Since Azure DNS is aware of the Private DNS Zone attached to the VNet, it returns the private IP that matches an A record. If the private DNS record is unavailable, the Private DNS Zone can be configured to return the public IP as a fallback by enabling the NxDomainRedirect property.

However, because conditional forwarding is not supported with private DNS zones, we need an additional service to resolve private Azure DNS from on-premises to Azure and vice versa. This is where we can use either DNS Private Resolver or a third-party vendor such as Infoblox. The most common DNS architectures are centralized and distributed. In a centralized model, all DNS queries are directed to a central hub where the Private DNS Zones are attached. In a distributed model, each spoke has its own DNS zones attached locally. Since these two approaches are not mutually exclusive, a hybrid architecture is also possible—where zones are linked to the hub as well as to the spokes (or a few spokes only). Figure 3.16 is a very simplified view of what a centralized DNS architecture looks like.

Figure 3.16 – Simplified view of a Centralized DNS architecture

All spokes route DNS queries to the inbound endpoint of the DNS Private Resolver. For Azure-specific domains, the resolver delegates to Azure DNS. Non-Azure queries are handled based on the DNS forwarding ruleset, which specifies where to forward the requests—for example, to on-premises DNS servers for corporate domains. When it comes to resolving Azure domains from the on-premises environment, the on-premises DNS servers must be configured with conditional forwarding to direct Azure-related queries back to the DNS Private Resolver's inbound endpoint. From a connectivity standpoint, the on-premises DNS servers must be able to reach the inbound endpoint, while the DNS Private Resolver's outbound endpoint must be able to access the on-premises DNS servers and any external (non-Azure) DNS infrastructure, if required. Notably, Azure Firewall can function as a DNS proxy. In this setup, both spokes and on-premises DNS servers forward their DNS queries to Azure Firewall, which then relays them to the DNS Private Resolver's inbound endpoint. This is the optimal architecture when exclusively leveraging Azure-native services, as you have full control over DNS-related traffic and can leverage Azure Firewall's logging capabilities.
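To make the earlier Service Bus example concrete, here is a minimal Terraform sketch of a private endpoint wired to its matching Private DNS Zone; the rg, hub, endpoints, and contososb resource names are hypothetical:

resource "azurerm_private_dns_zone" "servicebus" {
  name                = "privatelink.servicebus.windows.net"
  resource_group_name = azurerm_resource_group.rg.name
}

# Link the zone to the hub VNet so its resolver can answer private queries
resource "azurerm_private_dns_zone_virtual_network_link" "hub" {
  name                  = "link-hub"
  resource_group_name   = azurerm_resource_group.rg.name
  private_dns_zone_name = azurerm_private_dns_zone.servicebus.name
  virtual_network_id    = azurerm_virtual_network.hub.id
}

resource "azurerm_private_endpoint" "contososb" {
  name                = "pe-contososb"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  subnet_id           = azurerm_subnet.endpoints.id

  private_service_connection {
    name                           = "psc-contososb"
    private_connection_resource_id = azurerm_servicebus_namespace.contososb.id # Premium tier required for Private Link
    subresource_names              = ["namespace"]
    is_manual_connection           = false
  }

  # Registers the A record in the Private DNS Zone automatically
  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [azurerm_private_dns_zone.servicebus.id]
  }
}

Now, let's see how to monitor our workloads.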

Zooming in on monitoring

Figure 3.17 is the same map extract we saw in Chapter 2, Solution Architecture. In this section, we will explain a typical approach to monitoring Azure applications with native tools. The usage of Splunk, or any other third party, is beyond the scope of this book.

Figure 3.17 – Zooming in on monitoring

When an application is deployed to Azure, we must:

  • monitor the application events. This can be achieved with Application Insights, which ultimately sends its logs to Log Analytics.
  • monitor the Azure platform health and signals. This can be achieved by redirecting diagnostic logs to Log Analytics.
  • define alerts on standard and custom metrics and/or specific diagnostic log events.

Firstly, it is important to distinguish between logs and metrics. Log data can be used to perform root-cause analysis/troubleshooting of a problem or to analyze how users are using a given application. Conversely, Azure Monitor automatically captures specific service metrics and allows you to define alerts when certain thresholds are reached. Let's now focus on logging. Azure has the following types of log data:

  • Activity logs: They are common to every Azure service. They allow you to monitor the categories shown in Figure 3.18:
Figure 3.18 – Activity log categories

These logs are also interesting from a security perspective, as they keep track of who does what against the resource.

  • Diagnostic logs: They can be service-specific or end up in the shared AzureDiagnostics log category. Figure 3.19 shows an example of diagnostic logs for Azure SQL:
Figure 3.19 – Azure SQL diagnostic logs

Figure 3.20 is another example, showing Azure Front Door's log categories:

Figure 3.20 – Azure Front Door log categories

As you can see, they are very different from one service to another. You should always have a look at the specifics of each service. Most services offer the possibility to redirect logs to one or multiple targets, as illustrated in Figure 3.21.

Figure 3.21 – Sending diagnostic logs to a target repository
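In Terraform, such a redirection could look like the following minimal sketch, which sends two log categories to a Log Analytics workspace. The category names assume a Front Door Standard/Premium profile, and the afd and law resources are hypothetical:

resource "azurerm_monitor_diagnostic_setting" "afd_to_law" {
  name                       = "diag-afd-law"
  target_resource_id         = azurerm_cdn_frontdoor_profile.afd.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.law.id

  # Category names differ per service; always check the service's documentation
  enabled_log {
    category = "FrontDoorAccessLog"
  }

  enabled_log {
    category = "FrontDoorWebApplicationFirewallLog"
  }
}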

This makes it possible to centralize most service logs in Log Analytics for analysis and alerting, and to use Azure Storage for archiving or as a cheaper log destination. Event Hubs can be used to enable third parties, such as Splunk, to ingest log data by retrieving it directly from the hub. Once the data is in Log Analytics, it is possible to perform advanced queries and even to define alerts on query results. Figure 3.22 shows an example that uses Kusto Query Language (KQL) to detect Azure Front Door's firewall security events:

Figure 3.22 – Azure Front Door firewall logs

Figure 3.22 shows that some SQL injection attacks were detected by Azure Front Door. KQL can also be used to render charts, as shown in Figure 3.23:

Figure 3.23 – Rendering charts with KQL

Such charts can easily be pinned to Azure Dashboards or help build more elaborate Azure Workbooks, and new alert rules can be defined against KQL queries. When it comes to pure metrics, Azure Monitor Metrics is the easiest way to keep track of a service and to define alerts when a given threshold is reached. For example, Figure 3.24 shows how to define an alert on Front Door's backend latency.

Figure 3.24 – Front Door backend latency

This figure shows the definition of an alert that should fire whenever the average backend latency exceeds 2,000 milliseconds over an aggregation period of 15 minutes. This evaluation happens every 5 minutes. Such alerts are directly or indirectly bound to one or more Action Groups, which define who to notify and how to handle the event. Notifications range from mere emails to SMS/voice messages. Note that it is possible to leverage Azure Monitor's alert processing rules to decouple alert rules from action groups and override the default action group behavior, if required. The associated actions can be hooked to many different services and systems, as illustrated in the following figure:

Figure 3.25 – Action group action types

You can automate the alert handling using any of the Azure services present in Figure 3.25. Webhooks are a way to reach out to any system—for instance, Dynatrace, which has an Azure integration module. ITSM is certainly a very interesting option, as it allows you to create a ticket in your preferred ITSM tool, such as ServiceNow. If you combine KQL with the pre-defined metrics and alert mechanisms of Azure, you have a very powerful monitoring system. Additionally, it is also possible to create alert rules based on Azure Resource Graph queries. For example, you may count the number of resources of a certain type and generate an alert if you get close to a subscription-level quota. Besides using Azure Monitor as a first-party monitoring service, you can still use Event Hubs to let third-party tools gather all the log data. As mentioned in the previous chapter, rather than letting third-party services ingest monitoring data and trigger alerts, it's more effective to define alerts directly in Azure and propagate them to other systems. This approach typically offers a faster response to issues as they arise.
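To recap the alerting chain, here is a minimal Terraform sketch of the Front Door latency alert and its action group. The OriginLatency metric name assumes a Front Door Standard/Premium profile, and all resource names and the mailbox are hypothetical:

resource "azurerm_monitor_action_group" "ops" {
  name                = "ag-ops"
  resource_group_name = azurerm_resource_group.rg.name
  short_name          = "ops"

  email_receiver {
    name          = "oncall"
    email_address = "oncall@contoso.com" # hypothetical mailbox
  }
}

resource "azurerm_monitor_metric_alert" "afd_latency" {
  name                = "alert-afd-origin-latency"
  resource_group_name = azurerm_resource_group.rg.name
  scopes              = [azurerm_cdn_frontdoor_profile.afd.id]
  severity            = 2
  frequency           = "PT5M"  # evaluate every 5 minutes
  window_size         = "PT15M" # over a 15-minute aggregation window

  criteria {
    metric_namespace = "Microsoft.Cdn/profiles"
    metric_name      = "OriginLatency"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 2000 # milliseconds
  }

  action {
    action_group_id = azurerm_monitor_action_group.ops.id
  }
}

Let us now zoom in on High Availability and Disaster Recovery.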

Zooming in on High Availability and Disaster Recovery

Let's start by clarifying the difference between High Availability (HA) and Disaster Recovery (DR). HA focuses on maximizing the availability of a solution by ensuring it can withstand adverse events such as a service instance crash, a hardware failure, and similar disruptions. DR, on the other hand, is about ensuring a solution can withstand major incidents—such as data center fires, flooding, earthquakes, and other large-scale disruptions.

In Azure, HA is typically achieved by using Availability Zones within a single region, ensuring resilience against localized failures. For DR, workloads are deployed across multiple regions to protect against the complete unavailability of an entire region or a regional service outage. Because DR-compliant architectures span multiple regions, they are also highly available by nature, especially in the case of an ACTIVE/ACTIVE multi-region setup.

Whether you design a solution for HA or DR depends on the expected Recovery Time Objective (RTO) and Recovery Point Objective (RPO) defined by the business or expected by your customers (if you provide the service). RTO is the maximum acceptable time it takes to restore a service after an outage, while RPO is the maximum acceptable amount of data loss. For example, if you do not have a continuous backup mechanism, restoring your last backup might make you lose a day of data, which may not fit the business requirements. Two other concepts, Recovery Time Actual (RTA) and Recovery Point Actual (RPA), are the corresponding actual measures of RTO and RPO, captured during an exercise or following a real outage. Let's have a closer look at HA.

High Availability

Figure 3.26 is a zoom-in on HA in Azure:

Figure 3.26 – Zoom-in on high availability

Most Azure regions offer three Availability Zones—physically separate data centers, each with independent power, cooling, and networking. If one data center goes down, the remaining two maintain service continuity. Availability Zones support ACTIVE/ACTIVE scenarios natively, without requiring any manual failover or failback from the cloud consumer. However, this doesn't mean there will be zero impact. For instance, existing connections to data stores may be disrupted, and load balancers might take a moment to detect unhealthy backends. So, while no infrastructure-level action is needed, your application should still be designed to gracefully handle any exceptions that may arise from a zone failure.

A roundtrip between two Availability Zones typically adds around 2 milliseconds of latency. While this is generally acceptable, it might not meet the needs of workloads requiring ultra-low latency. In such cases, alternative options like Availability Sets and Proximity Placement Groups should be considered. Availability Sets ensure that VMs are distributed across different racks to improve availability, whereas Proximity Placement Groups prioritize minimizing latency and may or may not span multiple racks. If low latency is more critical than high availability, Proximity Placement Groups are a better choice; otherwise, stick with Availability Sets.

Most PaaS services support zone-redundancy. For some, it's an all-or-nothing approach—enabling zone-redundancy means at least one instance per zone, which can increase costs. Others offer more flexibility, allowing you to select specific zones (for example, just two), which can help reduce the financial impact. Another approach to achieving HA is through Auto Scaling—either by using Auto Scaling Plans for PaaS services or Virtual Machine Scale Sets (VMSS) for IaaS. VMSS allow you to define scaling boundaries, such as maintaining a minimum of two instances (one per zone) at all times, while dynamically adding more during peak loads. This helps strike a balance between availability and cost efficiency.

Most data services in Azure support zone-redundancy at a minimum, and some even offer geo-replication for added resilience. When designing solutions, it's essential to align the compute and data layers—there's little value in deploying zone-redundant compute resources if they rely on a non-redundant data store. For instance, running three API instances on a zone-redundant App Service plan that all target a Storage account using Locally Redundant Storage (LRS) would be a clear design flaw. Ensuring consistency between tiers is key to achieving true high availability.
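To illustrate that alignment, here is a minimal Terraform sketch pairing a zone-redundant App Service plan with zone-redundant storage; all names and SKUs are hypothetical:

resource "azurerm_service_plan" "api" {
  name                   = "asp-api"
  location               = azurerm_resource_group.rg.location
  resource_group_name    = azurerm_resource_group.rg.name
  os_type                = "Linux"
  sku_name               = "P1v3" # zone balancing requires a Premium tier
  worker_count           = 3      # one instance per Availability Zone
  zone_balancing_enabled = true
}

resource "azurerm_storage_account" "data" {
  name                     = "stcontosodata" # must be globally unique
  location                 = azurerm_resource_group.rg.location
  resource_group_name      = azurerm_resource_group.rg.name
  account_tier             = "Standard"
  account_replication_type = "ZRS" # LRS here would defeat the zone-redundant compute
}

Let us now explore the various features available to ensure Disaster Recovery.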

Disaster Recovery

Figure 3.27 shows the available options to ensure Disaster Recovery.

Figure 3.27 – Disaster Recovery in Azure

To keep things concise, we'll focus on the key topics and critical decision factors you should be aware of as an Azure Infrastructure Architect. You can explore the rest of the Disaster Recovery group at your own pace. When it comes to Disaster Recovery, we must deploy workloads to at least two regions. The following table outlines potential strategies for two-region deployments, along with their respective advantages and disadvantages:

Strategy | Pros | Cons
Active/Passive with everything pre-deployed to the secondary region | Infrastructure is already available when the primary region fails, minimizing the effort required during the failover process. | The secondary region incurs additional costs while not being actively used (best viewed as an insurance expense). Failover/failback scenarios must be prepared and rehearsed.
Active/Passive with no compute deployed but data replication configured | Cost savings. | This is a risky strategy, as capacity may be unavailable in the secondary region, since you are unlikely to be the only one initiating a failover. It can work if you can choose any region as a secondary at the time of the primary outage; cloud-only workloads may fit this strategy.
Active/Active | Cost optimization (you pay for what you use). No actual failover/failback process; however, as with Availability Zones, the application should still be designed to gracefully handle any exceptions that may arise from a regional failure. | The most challenging part is the data layer, as it is far from easy to find data services that support concurrent writes in both regions at the same time. That said, depending on the situation (for example, our first use case in this chapter), this may be the best strategy.
Active/Active + Zone Redundancy | The most robust and fault-tolerant design. | Costs.
Active/Passive + Zone Redundancy | Robust and fault-tolerant. | Costs.

Table 3.3 – Multi-region deployment strategies

No matter which strategy is selected, implementing a multi-region solution introduces substantial complexity—particularly when it comes to data management. Most data services propose a geo-replication feature that allows you to select the regions to which data should be replicated. You should always look at the replication model, as it may vary from one service to another. Here are the key criteria to consider with regard to replication:

  • Does the service support write/write replication or read/write only?
  • What are the achievable RTO and RPO?
  • How do you fail over?
  • Is the failover process initiated by Microsoft or by yourself?
  • What is the risk of data loss?
  • How long does it take to have the secondary region fully online and usable in read/write mode?
  • Does the service perform a full or partial replication? For example, Azure Service Bus historically replicated only entities (such as queues, topics, and subscriptions) but not the actual messages themselves. At the time of writing this book, the service has a geo-replication feature in preview that allows you to choose whether to replicate data synchronously or asynchronously, balancing performance against reliability.
  • What are the available consistency models (strong, eventual, session)?

Many of these questions involve intricate dependencies and cannot be addressed without comprehensive upfront analysis and architectural planning. At last, because Azure isn't for the faint of heart, other types of replication exist, such as the ones proposed by Azure Storage:

Replication Model | Definition
Locally Redundant Storage (LRS) | Replicates data synchronously within a single physical location.
Zone-Redundant Storage (ZRS) | Replicates data synchronously across all Availability Zones.
Geo-Redundant Storage (GRS) | Replicates data synchronously using LRS in the primary region and asynchronously to a single physical location in the secondary region.
Geo-Zone-Redundant Storage (GZRS) | Replicates data synchronously using ZRS in the primary region and asynchronously to a single physical location in the secondary region.
Read Access Geo-Redundant Storage (RA-GRS) and Read Access Geo-Zone-Redundant Storage (RA-GZRS) | Variants of GRS and GZRS that enable read/write in the primary region and read-only access in the secondary.

Table 3.4 – Azure Storage replication options

Important note

GRS, GZRS, RA-GRS, and RA-GZRS are replication options that are only fully supported by historical regions such as West Europe and North Europe. Newer regions such as Sweden Central and Sweden South only partially support them, and even newer regions do not support them at all anymore. This means that you should define a workload placement strategy that is aligned with the intrinsic region capabilities and with your RTO/RPO requirements.

In addition to compute and data replication, you must also foresee backup and restore processes to recover from data corruption, accidental data loss, or ransomware attacks. Keep in mind that this is a fundamental aspect of disaster recovery that applies universally, regardless of the deployment model (single-region, multi-region, etc.). You must always ensure the ability to restore data under any circumstances! To keep things concise, we can say that PaaS data services mostly feature two types of backups: continuous and Long-Term Retention (LTR). Continuous backups are built into the data service itself. They typically support Point-in-Time Restore (PITR), enabling you to recover data from any moment between a defined start time – often below 40 days – and the present. Time windows vary from one service to another. In contrast, LTR backups, also built into the data service itself, support extended retention policies—up to 10 years in services like Azure SQL. These backups are stored in Azure Storage, which can also be geo-replicated using GRS, when available.

Additionally, Backup Vault and Recovery Services Vault are distinct backup solutions that enable backups for various types of resources. Furthermore, some PaaS offerings—such as Azure API Management—provide API endpoints for backup/restore or import/export operations. These processes are often automated using Logic Apps.
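As a concrete example of the two built-in backup types, here is a minimal Terraform sketch for an Azure SQL database, with hypothetical names and retention values and an assumed logical server defined elsewhere:

resource "azurerm_mssql_database" "app" {
  name      = "sqldb-app"
  server_id = azurerm_mssql_server.main.id # hypothetical logical server
  sku_name  = "S0"

  # Continuous backups driving Point-in-Time Restore
  short_term_retention_policy {
    retention_days = 14
  }

  # Long-Term Retention, up to 10 years for Azure SQL
  long_term_retention_policy {
    weekly_retention  = "P4W"
    monthly_retention = "P12M"
    yearly_retention  = "P10Y"
    week_of_year      = 1 # which weekly backup is kept for the yearly retention
  }
}

Sometimes, some workloads go beyond typical resource needs and require much more power. HPC, our next topic, is a possible answer to this.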

Zooming in on HPC

High Performance Computing (HPC) is a pure infrastructure topic, because it boils down to bringing an unusual amount of compute and memory to a given workload. In general, HPC jobs are handled by dozens, hundreds, or even thousands of machines in parallel. Figure 3.28 shows most of the current Azure HPC landscape:

Figure 3.28 – Zoom-in on HPC

For memory-driven workloads, such as Computational Fluid Dynamics (CFD), you may rely on HBv2/v3-series virtual machines, which are bandwidth-optimized. For FLOPS-driven (floating-point operations per second) workloads, which require a fast and optimized CPU, you can rely on the HC series. If you are unsure whether your workload is memory- or FLOPS-driven, you might rely on HPE Cray, a supercomputer delivered as a managed service. When it comes to basic job scheduling, you can rely on Azure Batch. Batch is fully managed and provides both the underlying infrastructure and the scheduler. If you already have HPC workloads on-premises and want to migrate them to the cloud, Azure CycleCloud is a better choice. Azure CycleCloud is an HPC cluster manager that does not have any built-in scheduler, but it integrates with third parties such as Slurm, PBS, and others. It also integrates with Azure Batch, which means that both options are not mutually exclusive. In terms of storage, which plays a crucial role in HPC solutions, you can work with Azure NetApp Files, which brings massive capabilities in terms of I/O. For read-intensive workloads, often relying on HPC caching, and when Azure NetApp Files doesn't meet your performance needs, you'll likely need to integrate third-party solutions—since most legacy Azure options (such as Azure FXT Edge Filer, Azure HPC Cache, and Avere vFXT for Azure) are being deprecated. To wrap up our infrastructure map, let's briefly touch on what Azure Arc and Azure Local bring to the table.

Zooming in on Azure Hybrid solutions

Azure Arc is becoming increasingly common, whereas Azure Local, formerly known as Azure Stack HCI, remains more specialized at this stage of Azure adoption.

Azure Arc extends Azure's management and governance capabilities to any environment. It allows you to manage servers, databases, and Kubernetes clusters as if they were Azure resources. In short, enabling Azure Arc in a non-Azure environment requires installing the Azure Connected Machine Agent. This agent communicates over the internet—optionally via a proxy—with well-known Azure and Entra ID endpoints (such as management.azure.com). Through this bi-directional connection, it shares information about the remote servers and receives commands from Azure. Azure Arc allows you to apply Azure policies, leverage managed identities, and integrate with Defender for Cloud, which we'll cover in Chapter 9, Security Architecture. Arc enables unified governance across connected environments using the Azure control plane.

Azure Local is designed for hybrid and edge computing scenarios, and is focused on running a subset of Azure services on your own validated hardware. It automatically includes Azure Arc and is particularly suitable for scenarios requiring ultra-low latency. Services such as Azure App Service, IoT services, and Azure Storage can all run on Azure Local, but a significant number of Azure services remain cloud-only. Azure Local is primarily used in highly regulated environments subject to data sovereignty constraints, which still want to bring some shiny Azure features and services within their own walls.

We have now covered most of the Azure Infrastructure Map. It's time to step back and reflect on what we have learned so far before diving into our use cases.

Key advice from the field

As you may have noticed, Azure Infrastructure is vast, and we've only just begun to explore it. To leverage Azure services effectively at scale, it's important to plan ahead for their configuration and deployment. This includes evaluating how each service integrates with the Hub & Spoke architecture, identifying relevant log categories, establishing monitoring and security baselines, and defining appropriate Azure policies. All of this should be captured in a technical standard for each service before its adoption. Once these standards are in place, Infrastructure as Code (IaC) templates can be developed to enforce them, ensuring consistent and compliant usage across your organization. Furthermore, the standardization approach should remain flexible, allowing for reviews at least twice a year. Azure evolves rapidly, and a standard defined today may no longer be relevant five years down the line. The time has come to explore some concrete, real-world-inspired use cases.

Global API Platform Use Case (PaaS)

Scenario: The Contoso group wants to expose APIs to its customers (marketplaces) in the trade sector for product tracking purposes. They span three continents—North America, Europe, and Asia—but will begin with deployments in North America and Europe. Marketplaces mostly query (read) the APIs to verify whether products are known by Contoso, while Contoso's member organizations mainly push (write) product identification information to Contoso. The read/write ratio is about 90%/10%. Contoso's SLA towards its customers is 99.99% availability with a Recovery Time Objective (RTO) of <=30 minutes. Member organizations requested a Recovery Point Objective (RPO) of <=15 minutes. Contoso wants to leverage geo-proximity to optimize response times, as marketplaces are spread around the world. The solution should survive a regional outage. Lastly, Contoso wants to use only Azure-native services.

First analysis

For this use case, inspired by a real-world situation, we will only analyze and explain a solution diagram that should satisfy Contoso's requirements. Let's first extract the keywords in this scenario that demand our focus:

  • RTO of <= 30 min: An RTO of 30 minutes is an aggressive target. Earlier in this chapter, we discussed the different disaster recovery strategies and we know that redeploying assets in case of outage is feasible but represents a risky strategy because resources might not be available at the time of re-deployment. Moreover, in this context, achieving a 30-minute RTO would imply the ability to redeploy the entire workload from scratch within that timeframe—an expectation that is largely unrealistic. Consequently, the focus should shift toward ensuring system resilience without requiring redeployment. Leveraging Availability Zones becomes essential to distribute workload components across at least two zones, thereby enhancing fault tolerance within a region. However, given that Contoso cannot tolerate prolonged downtime for all marketplaces in the event of a regional Azure outage, a multi-region deployment strategy must also be considered to ensure continuity at scale.
  • Geo-proximity: Contoso wants to make sure marketplaces are served by the closest API. This implies an ACTIVE/ACTIVE multi-region setup.
  • RPO of <= 15 minutes: The RPO is also an aggressive target, especially for an ACTIVE/ACTIVE multi-region setup. This potentially restricts our possibilities in terms of data stores we can use.
  • North America and Europe: These are the first two regions considered by Contoso, which means that we are restricted to those geographies in Azure too.

After a deeper analysis and discussions with Contoso, we realized that the following services are in line with the requirements:

  • Azure Front Door, a global service with hundreds of points of presence worldwide, is an ideal candidate to distribute traffic across the two geographies as well as the third one to come (Asia) in the near future.
  • Azure API Management can be used to manage APIs exposed to both marketplaces and member organizations and is also able to span multiple regions.
  • Azure App Service can be used to host the backend service code. It is a robust, mature, and cost-friendly option. It does not have any multi-region feature but app services could be deployed separately to multiple regions along with the application code. Additionally, Azure App Service also supports Availability Zones, which allows us to distribute regional instances across different zones.
  • Cosmos DB's global distribution is a perfect fit for this scenario as it not only directly covers Europe and America but can easily be extended to other regions later on. Depending on the chosen consistency model, Cosmos DB's continuous backup allows us to restore containers and databases with an RPO of <=15 minutes, as expected by Contoso. Additionally, the Cosmos DB Client SDK is designed to automatically detect region unavailability and switch seamlessly between regions in case of regional outages. Contoso could also take advantage of Cosmos DB's change feed feature to react easily to new product information being pushed by member organizations, which opens doors for further application growth.

Drawing board and detailed explanations

Now that we have identified the different services that meet Contoso's requirements, we return to the drawing board and arrive at the following high-level diagram:

Figure 3.29 – Contoso Global API Platform - High-level diagram

We'll begin by analyzing the high-level architecture diagram from left to right, which outlines the flow of an API call initiated by either marketplaces or member organizations. In the following sections, we'll break down this diagram into smaller parts to explore each segment in greater detail.

In Figure 3.29, dotted arrows depict the API call flow from European marketplaces, while solid lines represent the flow from North America, following Contoso's geo-proximity-based design under normal operating conditions. In the event of a regional failure, traffic may be rerouted—EU to US or vice versa. Azure Front Door's latency-based routing algorithm facilitates this behavior by default, directing requests to the nearest available backend and automatically failing over to the alternate region if the primary is unavailable.

The next component in line is a premium instance of Azure API Management (APIM) with multi-region and zone-redundancy enabled. With both zone-redundancy and multi-region, the solution is resistant to both zone-level and region-level failures; losing a single zone might not even cause any visible disruption. In Figure 3.29, APIM's primary region is West Europe with 3 gateway units (one per zone), while the secondary region is North Central US, also with 3 gateway units. If the primary region is lost, the secondary region's gateway units will still perform correctly, but APIM's management plane will no longer be usable. This means that deploying new APIs or changing anything in APIM's configuration won't be possible until the primary region is back. Additionally, APIM is also used to manage the API versions exposed to marketplaces and to enforce API policies, such as coarse-grained JWT token validation.

The next component in line is Azure App Service, which hosts the backend services where the application code is deployed. The underlying App Service plan must also be made zone-redundant to achieve consistent behavior. The application code should also re-validate JWT tokens and optionally perform additional fine-grained validations.

Finally, a single Cosmos DB account configured with both West Europe and North Central US as active write regions (multi-master mode) and zone-redundancy ensures best-in-class resilience. Combining multi-region writes with session consistency aligns well with the expected RPO requirements.

Now that we have analyzed the high-level flows and the role of each component in this architecture, let's zoom deeper into each part of the diagram, starting with Front Door.

Figure 3.30 – The Front Door layer

Front Door works with endpoints, which must be mapped to public domain names. In our case, products.contoso.com must be mapped to the endpoint <frontdoor-prefix>.azurefd.net by defining an alias DNS record in the Contoso domain. We can even use Azure DNS zones to manage our public domain. Note that this is usually never the case in the enterprise world, but it is technically feasible. Next to each public Fully Qualified Domain Name (FQDN), there is a certificate, which can be entirely managed by Front Door. Azure Front Door-managed certificates are issued by DigiCert in a fully automated way, provided we use Azure DNS for our public DNS. When using our own domain provider, we, as the domain owner, still have to validate ownership by defining a TXT record in our environment. Alternatively, we can bring our own certificate, store it in Azure Key Vault, and define a Front Door secret that points to the Key Vault. In this case, Front Door's managed identity must be granted the Key Vault Certificate User role to pull the certificate from Key Vault.

Next, we must define an Origin Group, which is some sort of backend pool. In our case, the backend pool is made of the gateway units deployed to West Europe and North Central US. API Management exposes one load balancer per region. The three US gateway units are behind contoso-apim.northcentralus-01.regional.azure-api.net and the three EU units are behind contoso-apim-westeurope-01.regional.azure-api.net. APIM also provides a region-neutral DNS name, contoso-apim.azure-api.net, which uses Azure Traffic Manager behind the scenes to redirect traffic to the appropriate regional backend. This FQDN can be specified in the Origin Group instead of the regional ones. However, making that choice gives APIM the responsibility of ensuring that marketplace API requests are correctly routed to the corresponding region. Working with regional backends instead gives Front Door that responsibility, which leaves us full control over the routing algorithm.

In our diagram, we opted for the latency-based algorithm by setting the additional latency tolerance to 0, which means that Front Door should always pick the fastest backend. Next, we defined a route by mapping products.contoso.com to the Origin Group. One tricky aspect to be aware of is health probing. APIM exposes a specific endpoint, /status-0123456789abcdef, to indicate its own health status. However, this endpoint does not reflect the health of our underlying APIs but rather that of the APIM gateway—so APIM might report as healthy even when our APIs are not. Relying on this endpoint for Azure Front Door health checks can be misleading, as Front Door could continue routing requests to unhealthy backends. A better approach is to implement a custom status endpoint via an APIM policy that checks the actual API backend health, and have Front Door call it using an API key in the query string. We must make sure to always use the HEAD HTTP verb to minimize the bandwidth consumption incurred by the probing. Finally, Front Door is not only used to route traffic to the closest backend but also to inspect incoming traffic, thanks to WAF policies that can be associated with the endpoints. This is necessary because, although APIM can enforce various security controls, it does not include a built-in Web Application Firewall (WAF).
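Here is a minimal Terraform sketch of such an origin group, with the two regional APIM gateways as origins and an additional latency tolerance of 0. The /status probe path stands in for the custom status endpoint (the API key query string is omitted), and the afd profile is hypothetical:

resource "azurerm_cdn_frontdoor_origin_group" "apim" {
  name                     = "og-apim"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.afd.id

  health_probe {
    protocol            = "Https"
    request_type        = "HEAD" # minimizes probing bandwidth
    path                = "/status"
    interval_in_seconds = 30
  }

  load_balancing {
    additional_latency_in_milliseconds = 0 # always pick the fastest origin
    sample_size                        = 4
    successful_samples_required        = 3
  }
}

resource "azurerm_cdn_frontdoor_origin" "apim_weu" {
  name                           = "apim-westeurope"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.apim.id
  host_name                      = "contoso-apim-westeurope-01.regional.azure-api.net"
  origin_host_header             = "contoso-apim-westeurope-01.regional.azure-api.net"
  certificate_name_check_enabled = true
  enabled                        = true
}

resource "azurerm_cdn_frontdoor_origin" "apim_ncus" {
  name                           = "apim-northcentralus"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.apim.id
  host_name                      = "contoso-apim.northcentralus-01.regional.azure-api.net"
  origin_host_header             = "contoso-apim.northcentralus-01.regional.azure-api.net"
  certificate_name_check_enabled = true
  enabled                        = true
}

Let us now take a closer look at APIM and the backend services, the next components in our design.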

Figure 3.31 – The APIM layer and its regional backends

Figure 3.31 only shows the primary region, but the secondary region is configured the same way. At the time of writing this book, it is not yet possible to use the Private Link Service (PLS) between Azure Front Door and APIM premium, which means that APIM cannot be totally isolated from the internet. This is why our APIM instance is integrated with a virtual network in external mode, causing its exposure to the internet through a public IP (hence the warning sign in the design). It is nevertheless possible to restrict access to our Front Door instance by combining Network Security Group (NSG) rules with the validation of the X-Azure-FDID HTTP header. Front Door systematically adds this HTTP header with its own unique instance identifier, making sure we know it is our Front Door calling; a sketch of the NSG half of this restriction appears at the end of this section. If you are ever confronted with such a design in the future, you should double-check whether Front Door can take APIM as a fully private backend through PLS, as that would be an even better option.

As stated earlier, we enabled zone-redundancy as well as multi-region for our APIM premium instance, and we enforced multiple security controls using policies. An important aspect to consider is that you must explicitly define the target backend (West Europe or North Central US) using the set-backend-service policy. This is because APIM uses a single configuration across both regions, which means we can't hard-code a backend service URL—doing so would cause all gateways (EU and US) to route traffic to the same backend, regardless of region.

For the backend services, we use multi-tenant App Service plans to minimize costs and complexity. However, this forces us to integrate with the virtual network using a private endpoint on the one hand, and virtual network integration on the other. Enabling Private Link for our app service allows us to deny public traffic and let APIM call the backend service through its private endpoint. We need outbound virtual network integration because, in turn, our backend must call Cosmos DB, which is also isolated from the internet. This requires a dedicated subnet delegated to Microsoft.Web/serverFarms. Our single web app (with one instance per Availability Zone, for a total of three instances) spans two subnets, one for inbound traffic and one for outbound. Here again, we can enforce NSG rules to restrict access to APIM only.
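The network-level half of that restriction might look like the following Terraform sketch, which allows inbound HTTPS from the AzureFrontDoor.Backend service tag only (the X-Azure-FDID check itself must be enforced in an APIM policy); all resource names are hypothetical:

resource "azurerm_network_security_rule" "allow_frontdoor" {
  name                        = "Allow-FrontDoor-Inbound"
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "443"
  source_address_prefix       = "AzureFrontDoor.Backend" # Front Door service tag
  destination_address_prefix  = "VirtualNetwork"
  resource_group_name         = azurerm_resource_group.rg.name
  network_security_group_name = azurerm_network_security_group.apim.name
}

resource "azurerm_network_security_rule" "deny_other_inbound" {
  name                        = "Deny-All-Other-Inbound"
  priority                    = 4000 # evaluated after the allow rule
  direction                   = "Inbound"
  access                      = "Deny"
  protocol                    = "*"
  source_port_range           = "*"
  destination_port_range      = "*"
  source_address_prefix       = "*"
  destination_address_prefix  = "*"
  resource_group_name         = azurerm_resource_group.rg.name
  network_security_group_name = azurerm_network_security_group.apim.name
}

Lastly, let's have a closer look at our data layer, namely Cosmos DB.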

Figure 3.32 – The database layer

First of all, we deny public access at the Cosmos DB level, enable Private Link (as explained earlier in the DNS section), and deploy private endpoints into each regional subnet. This allows the backend (App Service) to talk to the private endpoints, thanks to the virtual network integration. Interestingly, we end up with three private DNS entries on each side. This is due to how Cosmos DB clients establish connections to Cosmos DB. The generic FQDN contosoproducts.privatelink.documents.azure.com is required by the client SDK. Additionally, the client code can indicate a preferred region, which in this case requires the presence of contosoproducts-westeurope.privatelink.documents.azure.com for the code running in West Europe and contosoproducts-northcentralus.privatelink.documents.azure.com for the code running in North Central US. In case of unavailability of the local region, the SDK is smart enough to automatically switch to the other region for both reads and writes, which causes a performance degradation incurred by the cross-region roundtrip. However, despite the performance impact, the solution is highly resilient at both the infrastructure and application levels. Regarding the RPO, we can leverage Cosmos DB's continuous backup feature, which supports an RPO of ≤15 minutes in a multi-region write setup with session consistency. However, additional effort is required at the Cosmos DB level to ensure the data model is designed to accommodate the read/write ratio and achieve an even distribution across logical partitions—this is essential for maintaining performance and scalability, and for avoiding hot partitions.

This use case demonstrated how to take advantage of built-in PaaS resilience by leveraging service-level features such as zone redundancy, multi-region support, and backup/restore capabilities. However, it's important to note that not all PaaS services offer the same level of resilience, and achieving a fully consistent, disaster recovery-compliant solution can be quite challenging when integrating multiple PaaS services within a single architecture.
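Pulling these data-layer decisions together, here is a minimal Terraform sketch of such a Cosmos DB account; attribute names follow recent azurerm provider versions, and all names are hypothetical:

resource "azurerm_cosmosdb_account" "products" {
  name                             = "cosmos-contoso-products" # must be globally unique
  location                         = "westeurope"
  resource_group_name              = azurerm_resource_group.rg.name
  offer_type                       = "Standard"
  kind                             = "GlobalDocumentDB"
  automatic_failover_enabled       = true
  multiple_write_locations_enabled = true  # multi-master mode
  public_network_access_enabled    = false # traffic flows through private endpoints only

  consistency_policy {
    consistency_level = "Session"
  }

  geo_location {
    location          = "westeurope"
    failover_priority = 0
    zone_redundant    = true
  }

  geo_location {
    location          = "northcentralus"
    failover_priority = 1
    zone_redundant    = true
  }

  backup {
    type = "Continuous" # enables point-in-time restore
  }
}

Let us now explore a pure IaaS use case.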

Hub and Spoke Use Case (IaaS)

Scenario: Contoso is a security-driven company that wants to adopt the Hub & Spoke architecture. Contoso will primarily deal with the online and corporate landing zone archetypes. They want a clear separation between these archetypes and the ability to quarantine an online application if needed. Their goal is to minimize the blast radius in the event of a security breach or misconfiguration. Because Contoso is highly regulated, they want to be able to easily identify network flows within their cloud environment.

In this real-world-inspired use case, we'll examine and explain a solution diagram designed to meet Contoso's requirements. We'll also include Terraform code to help you deploy the solution in your own tenant.

First analysis

Let's review the keywords in this scenario that demand our focus:

  • Security-driven: When security is embedded in a company's DNA, we naturally think of strong Identity and Access Management (IAM), comprehensive auditing, advanced encryption mechanisms, and a solid network security foundation. In this use case, however, we'll focus exclusively on the network aspects. We'll revisit this scenario in Chapter 9 to incorporate additional security elements into the architecture.
  • Online and corporate archetypes: This clearly indicates that Contoso will deal with internet-facing assets as well as internal workloads.
  • Segregation and quarantine: The initial keywords, combined with these new ones, clearly point towards a multi-hub architecture.

In the networking section of this chapter, we explored various Hub & Spoke topologies and emphasized that assigning specific responsibilities to each hub, and peering spokes according to their needs, helps minimize the blast radius in case of misconfigurations or security breaches. While the Quadruple Hub model offers maximum separation of duties, Contoso has not explicitly required a dedicated egress hub or integration hub. Therefore, we'll go with the Double-Hub architecture—also keeping cost-efficiency in mind for when you test this in your own tenant.

Drawing board and detailed explanations

After a few exchanges with Contoso, we agreed on building a small proof of concept with one online and one corporate archetype. We went to the drawing board and ended up with this diagram:

Figure 3.33 – Hub and Spoke

Let us decompose this diagram. There are a lot of NOT IMPLEMENTED mentions in the diagram because this is how we would typically design such an architecture, but the provided code will not deploy these components, for the sake of cost savings in your tenant. Let's dig into the details from left to right.

The first component of the ingress hub is Application Gateway, which acts as a reverse proxy and WAF. The Application Gateway takes the web application sitting in the Online Archetype spoke as a backend pool. The traffic from Application Gateway is sent to the Azure Firewall of the ingress hub through the use of a route table. We see two destinations: 10.0.0.0/24 (intra-VNet) and 10.10.0.0/26 (spoke). These routes are required to override the default system routes made available by the peering and to make sure traffic is routed to the firewall. The ingress firewall has an Application Rule defined to let the Application Gateway subnet talk to the inbound subnet of the web application in the online spoke. Note that application rules enforce Source Network Address Translation (SNAT) by default, which means that the destination will see the firewall IP instead of the actual source (Application Gateway in this case). Figure 3.34 is a focused view of the flow that we just described:

Figure 3.34 – Internet ingress flow

For the sake of simplicity and demo purposes, communication is done over port 80. In the real world, this would of course be restricted to port 443, and TLS >=1.2 would be enforced. Additionally, we would order a public certificate or let Azure manage this for us. Another usual suspect in an ingress hub could be an instance of API Management for APIs that are exposed to external customers.

The next hop is the Online Archetype, peered to our ingress hub, where we have a single web application that is behind a private endpoint and integrated with an outbound subnet to route traffic to the private perimeter. For a real application, the web app would probably talk to a database or some other data store, for which the private endpoint would be in the data subnet. We make sure the inbound subnet is sensitive to NSG rules and custom user-defined routes by enabling the corresponding network policies at the subnet level. We restrict traffic to the Azure Firewall by adding a specific allow rule and a deny-all rule to block everything else. We apply a specific route table to the outbound subnet to send all traffic to the Main Hub's firewall through a 0.0.0.0/0 rule. Note that this rule does not override the system routes brought by the peerings. We must define the target destinations of the web app in the Main Hub's firewall, if any.

Next, the Corporate Archetype is an empty shell, but it is there to show that we only peer it to the Main Hub, unlike the Online Archetype, which is peered to both hubs. Should we have a real application hosted there, we would also add the required route tables to make sure east-west and north-south traffic is sent to the Azure Firewall of the Main Hub.

Finally, we've enabled Private Link for the web application. The Private DNS Zone privatelink.azurewebsites.net contains the web app's record and is currently linked to all virtual networks. In a real-world setup, we would typically link this zone only to the Main Hub and configure the other virtual networks to forward DNS queries to the shared DNS service in the hub (for example, Infoblox, DNS Private Resolver, etc.).
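As a taste of what the provided code does, here is a minimal Terraform sketch of the ingress firewall's Application Rule, assuming a hypothetical ingress firewall policy and the demo FQDN from the configuration file:

resource "azurerm_firewall_policy_rule_collection_group" "ingress" {
  name               = "rcg-ingress"
  firewall_policy_id = azurerm_firewall_policy.ingress.id # hypothetical policy
  priority           = 100

  application_rule_collection {
    name     = "allow-appgw-to-online-spoke"
    priority = 100
    action   = "Allow"

    rule {
      name              = "appgw-to-webapp"
      source_addresses  = ["10.0.0.0/24"] # Application Gateway subnet
      destination_fqdns = ["contoso-mapbook.azurewebsites.net"] # demo web app

      protocols {
        type = "Http" # port 80 for demo purposes only; use Https/443 in production
        port = 80
      }
    }
  }
}

Let's now explore how you can deploy this architecture in your own tenant.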

Deploying and testing the solution in your tenant

Important note

To simplify testing in your own tenant, all resources are deployed within a single resource group. This use case focuses on networking and firewalling—specifically, how to peer landing zone spokes based on the desired network capabilities and how to involve the corresponding firewall. In the real world, you should also adhere to your company-specific governance.

To test this code in your own tenant, you must have the following:

  • An Azure subscription with Owner permissions. If needed, follow this link to start a new trial: https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account.
  • Install Azure CLI. Follow the instructions available here: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli.
  • Install Terraform. Follow this link https://developer.hashicorp.com/terraform/install where you can find binaries for all operating systems. This only takes a few minutes!
  • Make sure the terraform command works and is known in your system. In other words, make sure the PATH environment variable points to the location where you downloaded the Terraform client.
  • Clone the map book repo or download the code.
  • Open the cloned/downloaded repository with Visual Studio Code.
  • Navigate (cd) to this folder: Chapter 3\use-cases\IaC.

Before you can deploy the code, you must make sure the Microsoft.Network and Microsoft.Web resource providers are enabled in your subscription. To enable a resource provider, go to your subscription, look for the provider, and enable it, as illustrated in Figure 3.35:

Figure 3.35 – Enabling resource providers

Before deploying the code, let's have a quick look at it and focus on the important bits. The entire configuration is passed through the config.yaml file:

Figure 3.36 – Extract of the configuration file

In this configuration file, you must make sure to replace both the backend-prefix and the destination-fqdns attributes. In the provided default file, the values for those settings are contoso-mapbook and contoso-mapbook.azurewebsites.net, respectively. You must change these values to something unique because Azure enforces worldwide-unique names for PaaS resources like Azure App Service. If I had to run this example myself, I would use my trigram - sey-contoso-mapbook - to guarantee uniqueness. Additionally, you can also change the location. We used swedencentral but feel free to choose a location that is closer to you. The sketch below illustrates the kind of entries to look for.
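The following extract is only a sketch of the relevant settings; the exact keys and nesting may differ in the repository's config.yaml, so treat the attribute names mentioned above as the source of truth:

# Hypothetical extract of config.yaml - adjust to the actual file structure
location: swedencentral
app-service:
  backend-prefix: sey-contoso-mapbook                      # must be globally unique
firewall-rules:
  destination-fqdns: sey-contoso-mapbook.azurewebsites.net # must match the prefix above

Here is the folder structure of the solution: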

Figure 3.37 – Folder structure

A few Terraform modules are provided. The main.tf file is where we parse the provided configuration file and call our Terraform modules accordingly to deploy the actual architecture. We invite you to read this demo code carefully and to deploy it using the following steps once ready:

  1. Open a PowerShell command prompt and run az login using a subscription owner account. Just log in as usual. This ensures Terraform can reuse your cached credentials when deploying the code.
  2. In Visual Studio Code, launch the terminal (View | Terminal).
  3. Make sure you are in the right folder (cd '.\Chapter 3\use-cases\IaC\')
  4. Modify line 13 of main.tf to add your own subscription ID.
  5. Run terraform init
  6. Run terraform apply --auto-approve

The last command performs the actual deployment. After about 10 minutes, you should be able to view all resources in your resource group, as illustrated in Figure 3.38:

Figure 3.38 – Use case resources

Now, you can click on the Application Gateway resource to grab its URL and browse it (make sure to use HTTP, not HTTPS). This should show the default page of the Online Archetype landing zone:

Figure 3.39 – Testing the deployed solution

Feel free to explore the solution by checking the different route tables, firewall rules, and so on, and make sure to delete the resource group once done to prevent excessive costs.

Important note

This script has been deployed successfully multiple times, but a transient error may always happen. In case of an error, just make sure you chose a unique name for your web app and redeploy. Immediately after deployment, you may see a 502 – Bad Gateway error when accessing the web app via the Application Gateway. This happens because the gateway might have attempted to probe the web app before the ingress hub firewall's allow rule was fully in place. Wait a few minutes, and traffic should begin to flow as expected.

Let's now conclude this chapter.

Summary

In this chapter, we took a deep dive into infrastructure practices on Azure. We covered a range of key topics, including networking, monitoring, backup and restore, high availability, and disaster recovery. We also examined two practical use cases: one centered on High Availability and Disaster Recovery using PaaS services, and another showcasing a security-first approach built around a Double-Hub architecture.

With this foundation, you should now be better equipped to handle core Azure infrastructure topics and navigate the trade-offs that come with architectural decisions—an essential skill for any Azure Infrastructure Architect. That said, becoming truly proficient will require continued exploration on your part. In the next chapter, we'll shift our focus to Azure Kubernetes Service (AKS), a unique offering in the Azure ecosystem that brings its own set of strengths and challenges.

4 Working with Azure Kubernetes Service (AKS)

Join our book community on Discord

https://packt.link/0nrj3

In this chapter, we turn our attention to AKS—an ecosystem in its own right. We'll explore the various challenges that infrastructure engineers and architects routinely face. Specifically, we'll delve into the following topics:

  • The AKS Architecture Map
  • Zooming in on fundamental architectural concepts and the essential Kubernetes resource types
  • Zooming in on networking
  • Zooming in on cluster management and deployment
  • Zooming in on scaling and monitoring
  • Zooming in on high availability and disaster recovery
  • Zooming in on main add-ons and extensions
  • A use case about building multi-tenant AKS clusters

We'll provide a comprehensive guide to building solutions with AKS, highlighting best practices, common pitfalls, and key networking considerations. Given the breadth of AKS, we'll begin with an introduction to what makes container orchestrators so powerful and unique. From there, we'll take a closer look at networking, explain why AKS can be considered the elephant in the Hub & Spoke model, and present reference architectures alongside widely adopted Cloud Native Computing Foundation (CNCF) tools. Finally, we'll bring all the key concepts together through a practical use case. Let us now explore the technical requirements.

Technical requirements

In this chapter, we will be using Microsoft Visio for the diagrams, but the corresponding PNGs are also provided. While not strictly required, you may use Visual Studio Code (https://code.visualstudio.com/download) to open the code samples provided in this chapter. Maps, diagrams, and code are available at https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/tree/master/Chapter04.

The AKS Architecture Map

Figure 4.1 presents the AKS Architecture Map, designed to serve as your compass for navigating AKS. While it's not a definitive guide, it offers a high-level overview to help you quickly understand the broader AKS ecosystem. Throughout this chapter, we'll break down its components and provide real-world context wherever possible.

Figure 4.1 – The AKS Architecture Map

Important note

To see the full AKS Architecture Map (Figure 4.1), you can download the PDF file available at https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/blob/master/Chapter04/maps/AKS%20Architecture.pdf.

The AKS Map discusses the following topics:

  • Networking: In the previous chapter, we introduced the Hub and Spoke model along with the fundamentals of Azure networking. While AKS clusters are deployed within Virtual Networks (VNets), they also incorporate their own internal networking mechanisms that require a solid understanding. This section will dive into those specifics.
  • High Availability (HA) and Disaster Recovery (DR): AKS clusters leverage the same high availability and disaster recovery foundations as other Azure services. However, achieving true HA and DR also depends heavily on how workloads are deployed within the clusters.
  • Deployment: This section focuses on cluster deployment and release strategies.
  • Main Add-ons/Extensions/Options: AKS comes with a variety of built-in add-ons and features that can enhance security and support the development of distributed solutions.
  • Management, Scaling, and Monitoring: We will highlight a few tools that help manage, scale, and monitor AKS clusters.

Now that we've outlined the key topics of our map, it's time to dive into each area in detail. We'll begin with an introduction to AKS and the essential resource types to know.

Zooming in on fundamental architectural concepts and the essential Kubernetes resource types

Container orchestrators, like AKS, are designed to automate application deployment while providing built-in capabilities for scaling, self-healing, and dynamic resource allocation to enhance overall resilience. They also enable economies of scale by allowing multiple applications to share compute resources, thereby optimizing overall infrastructure costs. Additionally, they simplify the adoption of cutting-edge technologies and protocols, as applications are packaged as containers along with all their dependencies. This is especially valuable for third-party vendors aiming to build polyglot cloud-neutral or even data center-agnostic solutions to maximize their reach. Now that we have highlighted some of the key benefits, let's have a look at some basic, but fundamental, technical concepts. Figure 4.2 is a very high-level view of an AKS cluster.

Figure 4.2 – High-level AKS architecture

Microsoft takes care of the underlying infrastructure of the API server (left side of Figure 4.2), which is the cluster's brain. We are partially responsible for the worker nodes (right side of Figure 4.2). Microsoft manages the base images of the worker nodes—Ubuntu, Azure Linux, or Windows Server—while we are responsible for selecting their size, which will vary according to the type of workloads and the environment (non-production, production). Base images include a container runtime, as well as the kubelet and kube-proxy components. Worker nodes are essentially VMs running within Virtual Machine Scale Sets, which are tied to node pools. We determine how many node pools to use and choose their type—system, user, or gateway. These node pools can be deployed either to a shared virtual network subnet or to dedicated subnets.

One of the best practices is to separate system nodes from user nodes. In other words, we want to ensure that application workloads do not run on nodes designated for critical system components, preventing resource contention and maintaining cluster stability and observability. Figure 4.3 is an example of node pool segregation:

Figure 4.3 – Segregation of system and workload nodes

While assigning node pools to separate subnets isn't strictly required, it's widely regarded as a best practice. This separation creates a clearer boundary between system and user workloads, enabling more granular and manageable application of NSG rules. Beyond system and user pools, additional purpose-built node pools and corresponding subnets can be added to the architecture to support specific needs—for example, dedicated pools for ingress or egress traffic in security-driven setups, or GPU and high-compute node pools tailored for performance-intensive workloads. This modular design enhances both security posture and operational efficiency. Let's fast-forward a bit and explore the essential Kubernetes resource types, shown in Figure 4.4, that are available for deploying and running applications in clusters.

Figure 4.4 – Essential K8s resource types to know

While there are many other resource types in Kubernetes, these are the essential ones you absolutely need to know. In AKS, every deployment must target a Namespace—a logical construct used to organize and manage how applications share the cluster. Namespaces provide isolation between applications and can also be used to allocate compute resources such as CPU and memory. Deployment and ReplicaSet are both used to deploy workloads and specify the desired number of application instances (replicas). However, Deployment has become the de facto standard, as it serves as a higher-level abstraction that automatically manages the underlying ReplicaSet. Batch or one-off tasks can be seamlessly deployed using the Job resource type. StatefulSet is used for stateful applications or applications whose pods must start in a specific sequence.

Whatever deployment method you use, every application container will be hosted inside a Pod. Pods are the smallest deployable unit and can run one or more containers sharing the same network namespace and storage. A common best practice is to deploy a single application container per pod. Multiple application containers are used only when they are tightly coupled, need to communicate over localhost, and one of them should remain internal to the pod. A more frequent use case for multiple containers per pod is to include a sidecar that handles cross-cutting concerns like telemetry collection or network traffic interception (for example, implementing the ambassador pattern). Because pod IP addresses are not static, we typically make use of a Service to expose the pod to other pods (ClusterIP) or to the outside world (LoadBalancer, NodePort). Kubernetes services get assigned a static IP address that is used to forward traffic to the actual pod. Services do not proxy traffic; they are just a stable entry point for callers. Forwarding is made possible by kube-proxy (iptables or IPVS) or by eBPF (for example, Cilium), which is the most performant approach. Long story short, remember that Kubernetes Services provide an abstraction over pods.

ConfigMap and Secret are both ways to store non-sensitive and sensitive information for applications. Keep in mind that although Kubernetes secrets are base64-encoded, they can be easily decoded by anyone with access. A more secure approach is to store secrets in Azure Key Vault and use the CSI Driver to mount them into pods at runtime. This avoids storing secrets directly in AKS and leverages the advanced capabilities of a dedicated secret, certificate, and key management service. Third-party solutions often stick with Kubernetes secrets to maintain cloud agnosticism. However, for custom-built workloads, it's recommended to use Azure Key Vault for enhanced security and better integration with Azure-native services.

Access to Kubernetes resources, including secrets, can be controlled using Azure RBAC, a combination of Azure Authentication and Kubernetes RBAC, or Kubernetes RBAC alone. Kubernetes RBAC alone should be limited to local development and sandbox environments. Since Azure RBAC currently lacks fine-grained permissions, most enterprises opt for Azure Authentication combined with Kubernetes RBAC to strike a balance between centralized identity and granular access control. We typically use Role, ClusterRole, RoleBinding, and ClusterRoleBinding to map Kubernetes roles to the Entra ID security groups that users belong to, as sketched below.
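As an illustration, a minimal Role and RoleBinding pair granting an Entra ID group read access within a namespace could look like this (the namespace and group object ID are hypothetical):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-reader
  namespace: orders                 # hypothetical namespace
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-reader-binding
  namespace: orders
subjects:
- kind: Group
  # Object ID of the Entra ID security group (hypothetical value)
  name: "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-reader
  apiGroup: rbac.authorization.k8s.io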
Whenever a user interacts with the cluster through kubectl, AKS checks that user's permissions against the existing RoleBindings and ClusterRoleBindings.

When connecting workloads to Azure services like Azure SQL or Azure Key Vault, a best practice is to use Workload Identities. This concept, not unique to Azure, establishes trust between the AKS cluster and Entra ID by configuring the cluster as an OIDC token issuer. The link between the cluster and Entra ID is established using a user-assigned managed identity (per pod) that uses federated credentials. In essence, it maps an AKS service account mounted to a pod to the managed identity we created for that pod. We then grant permissions to the managed identity over the target Azure resource. Once the link is established, the application can exchange service account tokens for Entra ID access tokens (see the sketch after the next paragraph).

Finally, Volume, PersistentVolume, PersistentVolumeClaim, and StorageClass are Kubernetes resource types used to define and provision storage for pods. While stateless applications are typically preferred in AKS, there are scenarios where persisting state is necessary. In such cases, you should prefer the StorageClass resource, as AKS provides many of them out of the box, and they can be extended to suit your specific needs.
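As a sketch, enabling workload identity for an application boils down to annotating a service account with the managed identity's client ID and opting pods in via a label (all names and GUIDs below are hypothetical):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-sa
  namespace: orders
  annotations:
    # Client ID of the user-assigned managed identity (hypothetical value)
    azure.workload.identity/client-id: "11111111-2222-3333-4444-555555555555"
---
# In the pod template, opt in to workload identity and mount the service account:
#   metadata.labels:          azure.workload.identity/use: "true"
#   spec.serviceAccountName:  orders-sa

Ultimately, a stateless application deployed to AKS might resemble the architecture shown in Figure 4.5.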

Figure 4.5 – Example of stateless application using workload identity to access the message broker

The application pod uses workload identity to exchange its projected service account token for an access token from Entra ID. With this token, it authenticates to an Azure Service Bus instance located outside the cluster. This approach enhances security and lets you take advantage of a message broker fully managed by Microsoft. Conversely, Figure 4.6 shows a corresponding stateful application.

Figure 4.6 – Example of stateful application

Figure 4.6 provides a simplified illustration of a stateful application communicating with an in-cluster message broker (RabbitMQ), using credentials stored in a Kubernetes secret for authentication. RabbitMQ persists its state to external Azure Disks. If deployed on AWS, the same setup would use Amazon EBS instead. Although this design is cloud-neutral, managing RabbitMQ—including high availability and disaster recovery—becomes your responsibility, unlike with Azure Service Bus, which offers these capabilities out of the box. Moreover, stateful applications typically exhibit higher error rates and longer pod startup times than their stateless counterparts.

As you might have noticed from this brief overview, AKS offers deep integration with Azure services, yet it remains vanilla enough to provide a complete and flexible Kubernetes experience. Now that you have a better understanding of AKS and its essential resource types, let's dive into the specifics of networking.

Zooming in on networking

Mastering AKS networking is essential to understanding how to run clusters in the broader Azure landscape. Before diving deeper into that part of our map, it's important to highlight that, by default, AKS clusters operate as flat networks with no built-in restrictions. Applications can freely communicate with each other and with system components. Likewise, there is no default encryption or layer-7 (application layer) authorization in place. As you've probably gathered by now, AKS clusters are highly permissive out of the box. This is a key reason why AKS often stands out as the elephant in the Hub and Spoke model.

In the previous chapter, we discussed landing zones and single or multi-hub topologies, where each application resides in its own spoke, and traffic between applications or domains is controlled via firewalls such as Azure Firewall or other NVAs. While not strictly required to run workloads in Azure, the Hub and Spoke topology is extremely common in most enterprises, precisely because it enables network micro-segmentation. However, AKS doesn't enforce such segmentation by default, since east-west traffic within the cluster is completely open. The only way to strictly adhere to the Hub and Spoke model would be to deploy one AKS cluster per application (thus per spoke)—but this approach is difficult to manage at scale and can significantly increase operational costs. Moreover, even with dedicated clusters, you still wouldn't have Network Security Groups (NSGs) to control intra-spoke traffic. This shows how important it is to understand and master the internals of Kubernetes networking.

For the remainder of this chapter, we'll proceed with the assumption that AKS clusters are shared environments hosting multiple applications, which challenges the traditional Hub and Spoke model. Figure 4.7 shows the topics that belong to the networking section:

Figure 4.7 – Zooming in on AKS networking

As mentioned earlier, while the API server is fully managed by Microsoft, our worker nodes and CI/CD pipelines still need to establish a connection to it. We can control this access by configuring Authorized IP Ranges over the internet or by fully privatizing the connection using Private Link. The second option is the default choice in most organizations. When it comes to Network Plugins, there are several options to choose from. Table 4.1 outlines the main advantages and disadvantages of each.

                     Kubenet                  CNI Overlay              Azure CNI
IP-Friendly          Yes (1 IP per node       Yes (1 IP per node       No (1 IP per pod plus 1 IP
                     instead of one per pod)  instead of one per pod)  per node), leading to high
                                                                       IP address consumption
Scalability          Up to 400 nodes          Up to 5000 nodes         Up to 5000 nodes
Network Policies     No, but Calico Policy    Yes                      Yes
                     can be enabled
SNAT                 Yes                      Yes                      No (pods use routable IPs)
Virtual Nodes        No                       No                       Yes
UDRs required        Yes (hence the limit     No                       No
                     of 400 nodes max.)

Table 4.1 – AKS network options

Additionally, it's possible to bring your own CNI, but this option is not supported by Microsoft. To proceed with this approach, you should have a dedicated support agreement in place with the chosen CNI vendor. Azure CNI's performance can be further enhanced by leveraging the Powered by Cilium mode, which is built on top of eBPF, a Linux kernel feature, behind the scenes. Microsoft published a benchmark that compares clusters with and without Cilium. More information is available here: https://azure.microsoft.com/en-us/blog/azure-cni-with-cilium-most-scalable-and-performant-container-networking-in-the-cloud/. Most organizations choose to configure AKS with CNI Overlay because it solves the IP address exhaustion issue and eliminates the need for UDRs to enable node-to-node communication.

Unless you're using Azure CNI, pods inside the cluster aren't visible externally because they use internal cluster IPs that Azure doesn't recognize. Even with Azure CNI, you shouldn't directly target pod IPs because they are not static and are highly subject to change across restarts, rescheduling, and so on. Instead, to expose a service like an API outside of the cluster, you need to use the Ingress or Gateway Kubernetes APIs. The component responsible for handling ingress traffic is called an Ingress Controller. There are several options available, including built-in solutions like Application Gateway for Containers and the Istio add-on, as well as popular third-party options such as NGINX, Traefik, and HAProxy. Figure 4.8 shows how we typically handle ingress according to enterprise-grade practices.

Figure 4.8 – Handling ingress traffic with AKS

To expose our Ingress Controller outside the cluster, we need to deploy a Kubernetes Service of type LoadBalancer. By default, Azure assigns it a public IP, but most organizations prefer to keep it internal. This can easily be achieved by adding specific annotations to the service to specify the target private IP address and subnet. Once a service is privately exposed outside the cluster, any component within the private network perimeter that has connectivity to the load balancer will be able to access it. For services that must be accessible from the internet, it's common practice to add a WAF-enabled service such as Application Gateway or Azure Front Door.

Ingress traffic management in Kubernetes can be complex, but in essence, setting externalTrafficPolicy to Local helps preserve the client's source IP and limits ingress routing to only those nodes hosting the relevant ingress controller pods. Depending on the ingress controller, traffic might be routed through a ClusterIP service that fronts the pod, or it may reach the pod directly—as in the case of the Istio service mesh. Regardless, Kubernetes services do not proxy traffic themselves; they are used by kube-proxy or eBPF to forward traffic appropriately. While it's not strictly necessary to isolate ingress nodes into a dedicated subnet, it's highly recommended, as it enables more efficient and manageable application of NSG rules. In practice, this often translates to creating a dedicated ingress node pool that spans three availability zones and is used exclusively for handling ingress traffic. This offers a good boundary between ingress-specific pods and regular workload pods.

Choosing the right ingress controller technology largely depends on your broader plans for AKS. If you're aiming to manage ingress, east-west (internal pod-to-pod) traffic, and egress with a unified solution, Istio—as a full-featured service mesh—can handle all these scenarios, making it a strong candidate for comprehensive traffic management. Of course, like any feature-rich solution, Istio comes with a relatively steep learning curve. Its powerful capabilities for traffic management, security, and observability can be highly beneficial—but they also require a solid understanding of its components and configuration model. If your focus is primarily on ingress, NGINX is a solid and flexible choice that works well in a variety of use cases. On the other hand, Traefik is particularly well-suited for microservices architectures, offering dynamic configuration and seamless integration with service discovery tools.

For pod-to-pod (east-west) traffic within the cluster, a service mesh like Istio or Linkerd can be used to provide encryption and authentication via mutual TLS (mTLS). While there are several service mesh options available, these two are among the most widely adopted. As mentioned earlier, Istio is particularly compelling from a holistic standpoint, as it not only manages internal traffic but also offers robust support for handling ingress and egress traffic.

In AKS, egress traffic—meaning traffic leaving the cluster—is enabled by default. It's important to distinguish between two types of external destinations: private and public. For private destinations, Azure's underlying networking stack handles routing using system routes and/or BGP/UDRs, as discussed in the previous chapter. For public destinations (that is, internet-bound traffic), AKS typically provisions an external load balancer, which serves as the default egress endpoint.
An alternative approach involves using a managed NAT Gateway, which generally offers improved SNAT port scalability. However, a key drawback, as of mid-2025, is that it is not zone-redundant by default, unlike the standard load balancer, which is. While zone redundancy can be configured for the NAT Gateway, doing so requires a more complex setup. At the time of writing, a feature called Static Egress Gateway is in preview. It offers a more controlled approach to public egress by assigning specific public IP addresses to designated egress nodes, which reside in a dedicated node pool configured with the gateway mode. However, most companies route internet-bound traffic through an NVA for inspection, logging, or policy enforcement. This common practice reduces the appeal of the Static Egress Gateway, as the NVA will only see the nodes' private IPs, not the public IPs of the egress load balancer. Figure 4.9 shows a very common approach to handling egress traffic:

Figure 4.9 – Using Azure Firewall or NVA to handle internet-bound egress traffic

The NVA will observe either the node's private IP or the pod's IP as the source of the traffic, depending on the CNI plugin configured for the AKS cluster. With Azure CNI (not Overlay), pod IPs are routable and visible, while with CNI Overlay or Kubenet, the NVA typically sees the node's IP instead. If we do not route internet-bound traffic to a firewall, the actual destination would see the default egress load balancer or the specific public IPs made available by the Static Egress Gateway.

While a single ingress gateway and a single egress path can be sufficient for many deployments, certain industries—especially those with strict compliance or security requirements—demand stronger isolation, similar to the principles of the Hub and Spoke network model. This concept can be extended to AKS by applying landing zone principles, using Istio as a unified solution to manage ingress, east-west, and egress traffic across isolated application domains. Figure 4.10 is a high-level view, showcasing how Istio enables consistent policy enforcement and traffic control while maintaining clear boundaries between workloads:

Figure 4.10 – Controlling ingress, east-west, and egress traffic with Istio

Each landing zone can have its own dedicated ingress and egress controllers, with Istio enforcing landing zone-specific authorization policies. This approach provides a strong balance between control and autonomy: the platform team is responsible for deploying and managing the landing zones, while application teams are empowered to define their own ingress and egress gateway rules according to their needs. At the same time, both inbound and outbound internet traffic remains under centralized control via NVAs, ensuring consistent security, compliance, and observability across the entire environment. As of mid-2025, the Static Egress Gateway feature does not support assigning static private IPs to Istio egress gateways. This limitation makes it difficult for upstream NVAs to uniquely identify traffic from specific landing zones, as they can only see the egress node IP—which may be shared across multiple gateways from different landing zones. While assigning a dedicated egress node pool is a possible workaround, it's neither scalable nor cost-effective. We strongly recommend monitoring the evolution of the Static Egress Gateway feature, as future updates may address this limitation.

Additionally, services such as Advanced Container Networking Services, Calico, or plain Kubernetes network policies can be used to enforce network segmentation and prevent communication between landing zones—or even to further segment the landing zones themselves. One of the key advantages of Calico is its support for GlobalNetworkPolicies, which apply cluster-wide by default and provide a strong baseline of security across all namespaces. This level of enforcement is not possible with standard Kubernetes network policies, which are namespace-scoped and lack global policy capabilities.

While network policies (layer-4) and Istio controls (layer-7) both contribute to a defense-in-depth strategy by enforcing logical isolation within the container environment, their protections are limited to the Kubernetes layer. In contrast, NSGs operate at the Azure networking layer, providing an additional security boundary that can help mitigate threats such as container escape scenarios or malicious actors gaining access to the underlying worker node. It's technically possible to route all east-west traffic within the cluster through a firewall—either as an alternative or in addition to NSG rules—but this approach is not considered cloud-native. It introduces significant complexity, latency, and operational overhead, and goes against the principles of distributed, scalable microservices architectures. We'll dive deeper into Calico and Istio when we discuss our use case; a first taste of a Calico GlobalNetworkPolicy is sketched below.
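To make the GlobalNetworkPolicy concept more tangible, here is a minimal sketch of a cluster-wide default-deny policy that still allows DNS resolution. The namespace exclusions are assumptions and must be adapted to your cluster:

apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny-all
spec:
  # Exclude system namespaces so that cluster components keep working
  namespaceSelector: kubernetes.io/metadata.name not in {"kube-system", "calico-system", "istio-system"}
  types:
  - Ingress
  - Egress
  egress:
  # Allow DNS lookups; everything else is denied by default
  - action: Allow
    protocol: UDP
    destination:
      ports: [53]

For now, let's shift our focus to cluster management, monitoring, and scaling—key aspects of maintaining a healthy, efficient, and resilient AKS environment.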

Zooming in on cluster management and deployment

Ensuring that clusters are deployed in accordance with defined standards can be achieved through various approaches. Figure 4.11 illustrates the different mechanisms available for deploying and managing clusters, as well as releasing workloads—ranging from IaC and CI/CD pipelines to GitOps and policy-driven governance:

Figure 4.11 – Cluster management and workload deployment

IaC templates and Azure Policy play a key role in deploying AKS clusters and ensuring deployments remain compliant with organizational standards. It's important to distinguish between cluster deployment and workload deployment, as they typically involve different tools and processes. For cluster deployment, tools like Terraform, Bicep, or other IaC languages are commonly used to automate and standardize the provisioning process. Imperative tools such as Azure CLI can also be used.

For workload deployment, the first step is to choose a deployment strategy—DevOps or GitOps. In cloud-native environments, GitOps is the preferred approach. In GitOps, Git repositories serve as the single source of truth for the desired state of the system. Agents such as Flux CD or Argo CD continuously watch specified branches, and any change to a branch triggers synchronization with the cluster. If the actual state of the cluster (for example, a deployed workload or system component) deviates from the desired state stored in Git, the agent automatically reconciles the difference to bring the cluster back into alignment with the defined configuration. The Git repositories used in a GitOps workflow typically contain a collection of Helm charts, raw YAML manifests, and/or Kustomize snippets—each representing declarative definitions of the Kubernetes resources to be deployed. However, the container image build process is separate. It involves building the application, packaging it into a container image, and pushing it to a container registry. This process doesn't inherently trigger updates to the Git repository. To ensure GitOps agents detect and deploy new container images, it's common practice to update the image tag in the Git repository. This change prompts the GitOps agent to reconcile the new state, triggering a fresh deployment of the updated application (a minimal Flux configuration is sketched at the end of this section).

Conversely, the DevOps way of deploying to AKS is push-based and doesn't differ from what we do with other types of Azure services. We manually or automatically trigger a release pipeline (GitHub Actions or a YAML stage) when we want to change the configuration or release a new version of an application.

Regardless of the chosen method, you still must plan for a Release Strategy, ranging from mere Rolling Updates (out of the box) to Blue/Green, A/B, or Canary, which require additional solutions. Every strategy aims at ensuring the least possible disruption to existing workloads. Rolling Updates gradually replace old versions of an application with new ones, minimizing downtime. Blue/Green Deployment runs two identical environments (Blue and Green), where the new version is deployed to the idle one and traffic is switched over once validated. This corresponds to swapping deployment slots when using App Services. A/B testing and Canary releases expose a small subset of users to the new version to monitor for issues and analyze user behavior before gradually increasing the rollout to the entire user base.

Before releasing a new application—or a new version—it's essential to scan container images for vulnerabilities. This step ensures that known security issues are identified and addressed early in the pipeline. Commonly used tools for this purpose include JFrog Xray and Microsoft Defender for Containers. Snyk is also a top-tier vulnerability scanner, known for its developer-friendly interface and strong integration with CI/CD pipelines. It's definitely worth evaluating as part of a comprehensive container security strategy. At last, Fleet Manager can help operate clusters at scale, as can Azure Arc, although the latter typically applies to Kubernetes clusters hosted outside of Azure.
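To illustrate the GitOps flow described above, here is a minimal Flux sketch that watches a Git repository and reconciles the manifests found under a given path. The repository URL and names are hypothetical:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: tenant-apps
  namespace: flux-system
spec:
  interval: 1m                                  # how often to poll the repository
  url: https://github.com/contoso/aks-apps      # hypothetical repository
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: tenant-apps
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: tenant-apps
  path: ./deploy                                # folder containing the manifests
  prune: true                                   # delete cluster resources removed from Git

Let's now explore how to scale and monitor AKS clusters.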

Zooming in on scaling and monitoring

Scaling and monitoring AKS are essential for maintaining performance and reliability over time. Figure 4.12 shows the main options available for this purpose:

Figure 4.12 – Scaling and monitoring AKS

Prometheus and Grafana are commonly used with AKS due to their strong integration with many CNCF open-source projects. Prometheus serves as a powerful metrics server, while Grafana is widely adopted for building dashboards and visualizing data. This native compatibility is one of the key reasons Microsoft introduced managed Prometheus and Grafana services, allowing you to leverage these tools without the operational overhead of hosting and maintaining them yourself. To make things even easier, Microsoft has launched Azure Monitor Managed Prometheus (AMMP), which comes with preconfigured, but customizable, alert rules and dashboards. The default rules are already rather comprehensive, as they target the cluster, the nodes, the pods, and even persistent volumes.

Historically, Container Insights was used to collect both metrics and logs from AKS clusters. However, it is now recommended to transition to AMMP for metrics collection, while continuing to use Container Insights primarily for container logs. Applications running in containers can log to stdout, which is automatically captured by Container Insights. That said, a more robust (and complementary) option is to use Application Insights (OpenTelemetry compliant), which provides advanced troubleshooting features such as distributed tracing, performance metrics, and dependency mapping—greatly enhancing observability at the application level.

When it comes to scaling, AKS enforces a cluster-wide scaling profile that applies to all node pools with autoscaling enabled. Each individual node pool can define its own node count boundaries (minimum and maximum node count), but all are bound to the scaling profile defined at the cluster level. This profile allows you to define scan intervals, scale-down delays, and more. The cluster autoscaler watches all node pools that have autoscaling enabled and ignores those that haven't. The official documentation is a bit confusing in that regard but, as a rule of thumb, you should define your scaling profile at the cluster level and your autoscaling settings at the node pool level, according to the type of workloads scheduled on the node pool. As we saw earlier, node pools are a set of nodes of a given size that is chosen at node pool creation time. One downside of this approach is that any new node added to a given node pool will be of that static virtual machine size, independently of the actual needs of the pods that are to be scheduled. Recently, Microsoft introduced the Node Autoprovisioning (NAP) feature, which is based on the Karpenter open-source project. NAP helps you overcome this static approach by selecting the most appropriate virtual machine size for the actual needs of the pods to be scheduled. NAP provisions such nodes dynamically as standalone nodes, meaning that they are not part of any Azure node pool. You can specify the type of virtual machines you want NAP to use by deploying Karpenter-specific resources. Karpenter also relies on a node pool concept, but this is not to be confused with the AKS built-in node pools.

At the pod level, scaling is handled natively by the Horizontal Pod Autoscaler (HPA), which typically reacts to CPU and memory usage. When the HPA kicks in, pod instances are added or removed depending on whether it is scaling out or in. For more advanced and event-driven scenarios, Kubernetes Event-Driven Autoscaling (KEDA) provides a powerful alternative. As a CNCF project, KEDA integrates with a variety of external systems and metrics sources, such as Azure Service Bus, Kafka, or Prometheus. It works by manipulating HPA resources behind the scenes, enabling dynamic scaling of pods based on custom events or metrics—making it particularly well-suited for distributed applications (see the sketch at the end of this section). Beyond KEDA and the HPA, you can also rely on the Vertical Pod Autoscaler (VPA) to adjust CPU/memory at runtime. The VPA checks the actual CPU or memory consumption of pods and detects whether pods are reaching their limits, in which case it typically recreates an instance with adjusted resource requests and limits. From experience, I recommend using it only in observation mode, starting in the development environment, to evaluate whether your resource requests and limits make sense. I have found the VPA pretty unstable and disruptive and would not recommend it for production use, other than in observation mode.

Additionally, Virtual Nodes represent the serverless compute option within AKS and can complement KEDA very effectively. KEDA can scale workloads out to Virtual Nodes, which start faster than regular nodes, during peak demand. Once the tasks are completed, the nodes are destroyed. This dynamic scalability makes Virtual Nodes particularly useful for running jobs at scale or handling burst workloads that don't require persistent resources.
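As an illustration of KEDA, the following sketch scales a hypothetical orders-worker Deployment based on the depth of an Azure Service Bus queue (all names are assumptions, and the referenced TriggerAuthentication must exist separately):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-scaler
  namespace: orders
spec:
  scaleTargetRef:
    name: orders-worker              # the Deployment to scale (hypothetical)
  minReplicaCount: 0                 # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
  - type: azure-servicebus
    metadata:
      queueName: orders
      namespace: contoso-servicebus  # Service Bus namespace (hypothetical)
    authenticationRef:
      name: servicebus-trigger-auth  # TriggerAuthentication, for example based on workload identity

Let's now look at high availability and disaster recovery with AKS.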

Zooming in on high availability and disaster recovery

As with any other Azure service, High Availability (HA) and Disaster Recovery (DR) are essential dimensions to consider when working with AKS. Figure 4.13 shows key features that help maintain a reliable AKS environment:

Figure 4.13 – High Availability and Disaster Recovery with AKS

Because applications running in AKS clusters are all containerized, it is essential to enable geo-redundancy for Azure Container Registries to ensure continued availability of the container images in case of a regional disaster, making it possible for a cluster in a secondary region to pull images when required.

As with many other services, we can configure node pools to be zone-redundant to ensure a good worker node distribution across the different availability zones and withstand zonal failures. It is also very common to taint node pools for specific purposes—for example, applying system taints to the system node pool to prevent regular pods from being scheduled on those nodes. Similarly, we can apply this principle to workloads by segregating high-SLA applications into dedicated node pools with specific taints, tolerated only by those high-SLA pods. Beyond taints and tolerations, pods can also be scheduled based on affinity or anti-affinity rules that match nodes or group tightly coupled pods together. Similarly, the topologySpreadConstraints field in a pod specification can be used to ensure an even distribution of pods across nodes in different Availability Zones. While AKS makes a best-effort attempt to spread pods by default, using topologySpreadConstraints is necessary if you want to strictly enforce that, for example, three replicas each run in a separate zone. All these specifications help guide the Kubernetes scheduler, ensuring that our applications are deployed in alignment with the desired level of resilience.

PodPriority and PodDisruptionBudget (PDB) resource types help define the priority of pod scheduling or eviction, as well as the minimum acceptable availability for a set of pods, ensuring that critical applications maintain a baseline level of uptime. This applies regardless of the cause of disruption, including voluntary disruptions such as maintenance or cluster upgrades. A sketch combining topologySpreadConstraints and a PDB follows below.

Finally, AKS Backup, built on top of the Velero solution, enables backup of both AKS resources and persistent volumes based on Azure Disks. Alternatively, tools like Argo CD or Flux CD can redeploy resources from source control, making this approach better suited for stateless applications (or those keeping their state outside the cluster), while AKS Backup might be more suitable for stateful workloads. No matter how you proceed, the most important aspect to look at is how the application state is persisted, if any.
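The following sketch shows how a hypothetical three-replica web Deployment could strictly spread its pods across availability zones, while a PodDisruptionBudget keeps at least two replicas up during voluntary disruptions (all names are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone   # spread across availability zones
        whenUnsatisfiable: DoNotSchedule            # strictly enforce the spread
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: contoso/web:1.0.0                    # hypothetical image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: web
spec:
  minAvailable: 2                  # always keep two replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: web

Let's now zoom in on the main AKS add-ons, extensions, and options.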

Zooming in on the main add-ons, extensions and options

Microsoft continues to expand its catalog of AKS add-ons and extensions, offering both Azure-native capabilities and Microsoft-packaged versions of popular CNCF projects like Istio and KEDA. These add-ons, along with extensions and cluster configuration options, provide a streamlined way to adopt pre-tested, integrated features supported by Microsoft. Figure 4.14 illustrates the most interesting add-ons, extensions, and options as of mid-2025:

Figure 4.14 – AKS add-ons, extensions, and options

We have already introduced most of the technologies and features illustrated in Figure 4.14, but it was important to show that they can be used out of the box. The Distributed Application Runtime (Dapr) extension is worth considering for any distributed, event-driven, or microservice-based architecture. We'll explore Dapr and KEDA more in the next chapter's use case, as both are also natively supported by Azure Container Apps, another container service. We introduced both Istio and Calico earlier in this chapter and will explore them further—with code samples—in the chapter's use case. Similarly, we already discussed Flux, AKS Backup, and Workload Identities.

Azure Policy is a must-have for any well-managed cluster. It leverages the Gatekeeper admission controller to ensure that deployments remain compliant. We strongly recommend enabling and enforcing policies early in the application lifecycle to catch compliance issues upfront. Delaying this step often leads to non-compliant deployments being blocked later on, resulting in significant rework. Workload Identities and the Key Vault Secret Provider both contribute to better security. Figure 4.15 shows them working together to make a secret available to a given pod:

Figure 4.15 – Using the Key Vault Secret Provider

The CSI Driver DaemonSet (a pod running on each node) uses the SecretProviderClass referenced by the application pod to fetch secrets from Azure Key Vault. It does so using the workload identity associated with the service account mounted to the pod. As a result, the secret is made available as a mounted volume and can also be surfaced to the application as an environment variable. A sketch of a SecretProviderClass follows below.

When it comes to storing application configuration, we can use the built-in Kubernetes ConfigMap resource or the more centralized Azure App Configuration service, which can be integrated through a dedicated extension. Interestingly, Azure App Configuration can also reference secrets stored in Azure Key Vault using a feature called Key Vault References. However, this mechanism is limited to secrets and does not support auto-rotation. In contrast, the CSI Key Vault Secret Provider—despite its name—can handle more than just secrets and does support automatic rotation, making it a more robust option for secret management.
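As an illustration, a SecretProviderClass pulling a single secret from Key Vault using workload identity could look like the following sketch (vault name, GUIDs, and namespace are hypothetical):

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-kv
  namespace: orders
spec:
  provider: azure
  parameters:
    clientID: "11111111-2222-3333-4444-555555555555"  # workload identity client ID (hypothetical)
    keyvaultName: contoso-orders-kv                   # hypothetical vault name
    tenantId: "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"  # Entra tenant ID (hypothetical)
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret

The pod then mounts this class through a csi volume (driver secrets-store.csi.k8s.io) that references the class name in its secretProviderClass attribute.

Before diving into our use case, let's take a moment to gather some key lessons and best practices from real-world field experience. These insights can help guide decisions, avoid common pitfalls, and ensure a more resilient and manageable AKS implementation.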

Key advice from the field

AKS, along with the broader CNCF ecosystem, can feel like an overwhelming Pandora's box—rich with possibilities but also complex and multifaceted. It offers immense flexibility and power, but it must be adopted thoughtfully and deliberately to avoid unnecessary complexity, misconfigurations, or operational overhead. A careful, phased approach grounded in clear requirements and sound architectural principles is essential to make the most of what AKS and the CNCF landscape have to offer.

Additionally, regardless of the technologies you choose to adopt in your AKS clusters, it's crucial to consider Microsoft's supportability. While tools like NGINX and Traefik may be appealing for their features and flexibility, they are not supported by Microsoft. The same applies to running stateful workloads such as RabbitMQ, Redis, or MongoDB directly within the cluster—these are not supported either, especially when it comes to disaster recovery and high availability concerns. Choosing supported solutions, like add-ons and extensions, ensures smoother troubleshooting, better resilience, and alignment with best practices. So, to make your AKS life easier, you may consider the following:

  • Use add-ons/extensions that are packaged and maintained by Microsoft.
  • Consider a unified support contract (if possible) for the other CNCF solutions you want to work with.
  • Consider security from day one, as neglecting security early on can lead to major challenges down the line.
  • Leverage Azure PaaS services for anything state-related, along with their built-in backup/restore capabilities.
  • Keep an eye on AKS Automatic, which is still in preview at the time of writing but should greatly ease cluster management.
  • Take a moment to look at all the dimensions highlighted in the map—we've only scratched the surface. There's no need to rush; understanding AKS is a journey, not a race.

Now, let's dive into our use case.

Use case – Multi-tenant SaaS on AKS

Let's start with the scenario and first analysis.

Scenario and first analysis

Contoso is planning to launch an internet-facing B2B SaaS platform targeting up to one hundred customers, all located in Europe. To maintain strong isolation between tenants, Contoso has decided on a replication strategy where each tenant will have its own set of dedicated application components, while sharing a single AKS cluster. Within the cluster, a given tenant should not be able to interact with any other tenant. The solution includes a web portal, several backend services, and a separate data store for each tenant. Contoso aims to optimize infrastructure costs by pooling compute resources through AKS, while still ensuring data isolation and operational independence—for example, to enable tenant-specific backup and restore operations of the data layer. Contoso should also have dedicated internal access to the admin portal of the SaaS solution, which allows them to perform actions targeting all tenants and must not be exposed to the internet. You have been brought in to design the AKS architecture that supports these requirements. Let's review the keywords in this scenario that demand our focus:

  • Replication strategy: Each tenant has its own set of components, including the database. In AKS terms, this means an ingress path, a few frontend and backend pods, and an egress path to the tenant's database.
  • Cost optimization: Contoso aims to keep costs under control by sharing a single AKS cluster across all tenants, which rules out the use of dedicated node pools for each tenant.
  • Strict boundaries: Although the tenants share the same cluster, they should not be able to connect with each other. A defense-in-depth approach calls for the use of network policies and potentially a service mesh.
  • Internal access: Contoso operators need to access an admin UI and potentially other tenant-specific services. This private access cannot be exposed to the internet.

Earlier in this chapter, we examined various strategies for managing ingress, east-west, and egress traffic within AKS clusters. In the context of our scenario, a cloud-native approach that combines Istio for Layer 7 traffic management with Calico network policies for Layer 4 isolation appears to be the most suitable solution to enforce tenant separation.

Diagrams

After a few exchanges with Contoso, we went to the drawing board and ended up with the high-level end-to-end view shown in Figure 4.16:

Figure 4.16 – High-level end-to-end view

Although your responsibility is limited to AKS, you aimed to capture an end-to-end view that clearly illustrates the various tenant stacks. The shared infrastructure includes several core services such as Front Door, API Management, firewalls, and, of course, AKS itself. Zooming in on the AKS virtual network, we can distinguish a few node pools:

  • The ingress, workload, egress, and system node pools are each mapped to their own subnet. These node pools correspond to what we discussed earlier in the networking section. Needless to say, all node pools are zone-redundant.
  • The management node pool and its associated subnet are dedicated to internal-only access. Given Contoso's strong focus on security, a separate, purpose-built node pool is provided to reduce the risk of misconfigurations and conflicts with the broader ingress node pool—used to expose workloads to Front Door, API Management, and ultimately, to the internet.
  • An additional subnet, not part of AKS, holds all private endpoints for the data stores. Each tenant has its own database.
  • NSGs are attached to their respective subnets, restricting access.

NSGs will enforce Azure-level network controls. Here is a detailed explanation of the rules we can define per NSG to restrict lateral movement at the node level:

  • For all AKS-related subnets, a rule that allows communication over the Kubernetes system ports is required to let core components such as the kubelet API function. Each NSG will also explicitly declare a DenyAll rule with a lower priority number (and therefore higher precedence) than the default system rules.
  • The ingress NSG will restrict access to Front Door, API Management, and the Azure Load Balancer and/or Azure Firewall, depending on whether SNAT is performed at the ingress firewall level.
  • The workload and egress NSGs allow the entire pod CIDR range. Because pod-to-pod traffic with Azure CNI Overlay isn't encapsulated, the NSGs see the pod IPs, and not allowing such traffic would break it. We could be even more restrictive, but things must remain manageable, and we'll rely on logical isolation (next section) to further govern internal traffic.
  • The private endpoint NSG will only accept traffic from the egress subnet.

From an Azure standpoint, a certain degree of segregation has already been implemented. Now, let's explore how to enforce tenant-level isolation within the shared AKS cluster itself. Figure 4.17 shows how to achieve this using Istio and Calico.

Figure 4.17 – Multi-tenant AKS low-level view

Important Note

Given the size of this diagram, do not hesitate to open the corresponding PNG or Visio available here: https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/tree/master/Chapter04.

Everything defined in the istio-system namespace has cluster-wide scope. To enforce strict security, we configure outboundTrafficPolicy as REGISTRY_ONLY and apply a PeerAuthentication resource set to STRICT, as sketched after the list below.

  • REGISTRY_ONLY ensures that pod egress traffic is limited to services registered with Istio—typically, default Kubernetes services.
  • STRICT enforces mutual TLS (mTLS), preventing communication from non-Istio pods to Istio-injected ones.
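A minimal sketch of the mesh-wide STRICT policy could look as follows; the REGISTRY_ONLY setting itself lives in the mesh configuration (outboundTrafficPolicy), which is managed differently depending on how Istio was installed:

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # applying it to the root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT            # plaintext traffic from non-mesh pods is rejected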

Additionally, we define two default Calico GlobalNetworkPolicy resources to deny all traffic by default, except for DNS resolution. It's important to exclude system namespaces from these policies, as shown in the accompanying code sample. Other cluster-wide resources—such as ServiceEntry (Istio)—can also be defined when all tenants require access to the same external service, such as Microsoft Entra ID. These default settings establish strong security boundaries because, as is, a pod can neither send nor receive any traffic beyond DNS resolution. The goal is to open things up little by little to meet the actual needs of each tenant.

Each tenant landing zone is composed of an ingress namespace, one or more workload namespaces, and one egress namespace. All namespaces are labeled with the tenant identifier. For example, a workload namespace belonging to tenant 1 could look like this:

apiVersion: v1
kind: Namespace
metadata:
  name: tenant1-ns1
  labels:
    landing-zone: tenant1
    istio-injection: enabled
  annotations:
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"operator": "Equal", "value": "workload", "effect": "NoSchedule", "key": "usage"}]'
---
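
Here is a minimal sketch of one of these default-deny policies. The policy name and the list of excluded namespaces are assumptions; the version in the book's repository is authoritative:

apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny-except-dns
spec:
  # exclude system namespaces from the default deny
  namespaceSelector: projectcalico.org/name not in {'kube-system', 'calico-system', 'istio-system'}
  selector: all()
  order: 1000
  types:
  - Ingress
  - Egress
  # only DNS lookups to kube-dns are allowed; anything else is implicitly denied
  egress:
  - action: Allow
    protocol: UDP
    destination:
      namespaceSelector: projectcalico.org/name == 'kube-system'
      selector: k8s-app == 'kube-dns'
      ports:
      - 53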

The landing-zone label is used to identify the landing zone. The istio-injection label ensures that workloads deployed to this namespace are injected with the Istio sidecar. Additionally, all pods deployed to that namespace will be scheduled onto the workload node pool thanks to the appropriate toleration.

In the ingress layer, a dedicated ingress controller will be deployed for each tenant. Each ingress LoadBalancer service will be assigned a tenant-specific IP and automatically scheduled to the ingress node pool. To ensure high availability, each ingress controller will run with at least three replicas, distributed across the three availability zones. This distribution can be explicitly enforced using topologySpreadConstraints.

In the workload layer, we define namespace-scoped Calico NetworkPolicy resources to ensure that each tenant can communicate with its own components. For finer-grained control, we can enforce stricter access rules using Istio's AuthorizationPolicy resource. For instance, we might allow component 1 of tenant 1 to communicate with component 2, while denying access to component 3, as sketched after this paragraph. Additionally, regular workload pods are scheduled onto the dedicated workload node pool using the appropriate tolerations.

In the egress layer, each tenant is assigned a dedicated egress controller that only accepts traffic originating from its own landing zone. Using the Istio Sidecar resource, we restrict the controller's visibility to just the egress namespace and the istio-system namespace.
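
As an illustration of the component-level rule just mentioned, here is a minimal AuthorizationPolicy sketch; the component names and service account are hypothetical:

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-component1-to-component2
  namespace: tenant1-ns1
spec:
  # the policy protects component 2
  selector:
    matchLabels:
      app: component2
  action: ALLOW
  rules:
  - from:
    - source:
        # only component 1's identity is accepted; component 3 is implicitly denied
        principals: ["cluster.local/ns/tenant1-ns1/sa/component1"]

Because a workload covered by an ALLOW policy rejects any request that matches no rule, component 3 needs no explicit deny. Note that principals are derived from mTLS identities, which works here because STRICT mode is enforced mesh-wide.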

Note:

To maintain strict control, Kubernetes RBAC permissions should ensure that only the platform team can manage the Sidecar resources—application teams must not have modification rights.
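
A minimal sketch of such a setup, with an assumed Role name, grants application teams the common Istio networking resources except sidecars:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-team-istio
  namespace: tenant1-ns1
rules:
- apiGroups: ["networking.istio.io"]
  # note the deliberate absence of "sidecars"
  resources: ["virtualservices", "gateways", "serviceentries", "destinationrules"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

A matching RoleBinding would tie this Role to the application team's group, while the platform team keeps broader permissions.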

As with the ingress and workload layers, egress controller pods are scheduled onto the dedicated egress node pool using the appropriate tolerations. Each tenant-specific egress controller is restricted to accessing only the tenant's database and other PaaS resources. This is enforced through ServiceEntry resources defined within the tenant's dedicated egress namespace. In addition, local Calico network policies are applied to allow traffic solely to the private endpoints of the tenant's PaaS services, ensuring tight control over outbound connectivity.

Finally, we have the management node pool (not shown in Figure 4.18), which is decoupled from tenant landing zones. It hosts an ingress controller that is not exposed to Front Door or API Management, but is reserved exclusively for internal access. With this setup, we can be confident that traffic is tightly controlled—Istio, Calico, and NSGs work together to govern both Layer 4 and Layer 7 traffic across the environment.

Before we explore some code samples, let's take a closer look at how Istio resources should be deployed—and more importantly, who should be responsible for managing them. Figure 4.18 illustrates where Istio resources belong and who is responsible:

Figure 4.18 – Use case – Istio view

Since Contoso developers are responsible for building the SaaS product, they require a certain level of flexibility—particularly given Contoso's decision to fully isolate tenants. This approach opens the possibility for some tenants to become early adopters or move ahead of others in terms of features and updates. That is why application teams should be able to deploy ingress, workload, and egress resources, whether Istio or Calico related, except for the Sidecar resource type. The platform team is fully responsible for managing cluster-wide resources, as well as the initial landing zone setup, which consists of creating namespaces, defining quotas, assigning private IPs, deploying ingress and egress controllers, and so on. For greater control, it is always possible to enforce a pull request as part of the lifecycle of a landing zone. Similarly, roles and responsibilities must be defined for Calico network policies, as shown in Figure 4.19:

Figure 4.19 – Use case – Calico view

In a flexible organization, global network policies are governed by the platform team, while namespace-scoped policies can be managed by application teams. However, the landing zone should provide default network policies to ensure internal communication within each tenant. Before diving into the code, it's important to emphasize that, in addition to using tools like Istio and Calico, we must also enforce best practices—such as adopting workload identities (sketched below), integrating the Key Vault Secret Provider, and other security measures covered in previous sections.
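
For instance, enabling a workload identity essentially boils down to annotating a Kubernetes service account with the client ID of a user-assigned managed identity; the names below are hypothetical:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: tenant1-workload-sa
  namespace: tenant1-ns1
  annotations:
    # client ID of the user-assigned managed identity federated with this service account
    azure.workload.identity/client-id: "<managed-identity-client-id>"

Pods referencing this service account must also carry the azure.workload.identity/use: "true" label so that the AKS webhook injects the federated token.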

Code samples

We've included some code snippets (https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/tree/master/Chapter04) and a sample folder structure to help you get started with Istio and Calico in the context of tenant landing zones. To keep things concise, we'll focus on the most critical aspects and leave the deeper exploration to you. The folder structure is shown in Figure 4.20:

Figure 4.20 – Folder structure of the code samples

The folder structure reflects our landing zone setup, with one folder per tenant, each containing the ingress, egress, and workload subfolders. Let's outline the responsibilities for each team, starting with the platform team.

Platform team work

In addition to defining the global settings (explained earlier) and the landing zone namespaces, the platform team should define the default Calico and Istio settings required to allow intra-tenant traffic. Let's have a quick look at the default network policies, starting with the ingress namespace:

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-ingress-traffic
  namespace: tenant1-ingress
spec: 
  selector: all()
  order: 10
  types:
  - Ingress  
  - Egress
  ingress:
  - action: Allow   
    source:
      nets:
        - <IP range from the ingress hub>
  egress:
  - action: Allow   
    destination:     
      namespaceSelector: landing-zone == 'tenant1'

The ingress rule permits traffic originating from the ingress hub, while the ingress controller itself must be able to communicate with the landing zone. This is accomplished using a namespaceSelector condition that targets the landing-zone label defined at the namespace level—just as we covered earlier. Every namespace of type workload should allow intra-landing-zone traffic as well, as shown in the following code snippet:

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-intra-lz-traffic
  namespace: tenant1-workload-ns1
spec: 
  selector: all()
  order: 10
  types:
  - Ingress
  - Egress
  ingress:
  - action: Allow
    source:
      namespaceSelector: landing-zone == 'tenant1'
  egress:
  - action: Allow
    destination:     
      namespaceSelector: landing-zone == 'tenant1'

This time, both ingress and egress rules allow traffic for the entire tenant. Note that we could be more restrictive, but our goal is mostly to prevent cross-tenant traffic. Finally, the egress namespace should allow intra-tenant communication, while also permitting the egress gateway to access destinations outside the cluster. This is achieved through the following egress rule:

egress:
  - action: Allow
    destination:     
      nets:
      - 0.0.0.0/0
      notNets:
      - 10.0.0.0/8
      - 172.16.0.0/12
      - 192.168.0.0/16

This rule might look overly permissive because it allows internet-bound traffic, but it is only used by the egress gateway of each landing zone. No other pods should be deployed to that namespace. Additionally, the egress lockdown enforced by Istio's REGISTRY_ONLY mode makes sure that, by default, nothing is reachable. To make sure tenants are restricted to their own egress resources only, the platform team must deploy a Sidecar resource in every namespace of type workload:

apiVersion: networking.istio.io/v1
kind: Sidecar
metadata:
  name: restrict-sidecar-visibility
  namespace: tenant1-workload-ns1
spec:
  egress:
  - hosts:     
    - "istio-system/*"
    - "tenant1-egress/*"
    - "*/*.svc.cluster.local"

The egress section tells Istio that Istio-injected pods deployed to the tenant1-workload-ns1 namespace can only see Istio resources deployed to the tenant1-egress namespace, as well as to istio-system. We also allow visibility over every internal service to stay in line with the REGISTRY_ONLY mode and to keep things simple enough. In addition to the default landing zone settings, the platform team should provide one ingress and one egress controller per tenant. The ingress service will be of type LoadBalancer:

apiVersion: v1
kind: Service
metadata:
  name: istio-ingressgateway
  namespace: tenant1-ingress 
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: ingress   
spec:
  type: LoadBalancer 
  loadBalancerIP: <IP from the ingress subnet>
  externalTrafficPolicy: Local
  selector:
    istio: tenant1-ingress-gateway
  ports:
  # you might want to remove port 80 but it's good for testing
    - name: http
      port: 80
      targetPort: 8080
    - name: https
      port: 443
      targetPort: 8443
    - name: status-port
      port: 15021
      targetPort: 15021

The LoadBalancer service is mapped to the ingress controller pod (not shown here) and defines all ports used for incoming traffic. The service is explicitly marked as internal and mapped to the ingress subnet, and a specific private IP of that subnet is specified. Conversely, the egress service is of type ClusterIP:

# here you should deploy the egress controller on top of the service
apiVersion: v1
kind: Service
metadata: 
  labels:
    app: tenant1-egressgateway   
    istio: tenant1-egressgateway   
  name: tenant1-egressgateway
  namespace: tenant1-egress   
spec:
  ports:
  - name: sql
    port: 1433
    protocol: TCP
    targetPort: 1433
  - name: https
    port: 443
    protocol: TCP
    targetPort: 8443
  selector:
    app: tenant1-egressgateway
    istio: tenant1-egressgateway
  sessionAffinity: None
  type: ClusterIP

We simply map the service to the tenant's egress gateway and specify the ports used by the SaaS product. Additional ports—such as those required for protocols like AMQP—can also be defined as needed. This concludes the setup of the landing zone. Now, let's see how application teams can start consuming the landing zone.

Application teams work

Each tenant is responsible for defining its own ingress and egress requirements—something that can be handled by the respective application teams. Overall, application teams should be able to manage their own landing zones, except for the Sidecar (Istio) resource, which they must not touch. You may grant them read-only access to cluster-wide resources, but not more. We know that each tenant within the SaaS platform has its own database, so let's check how to allow tenant 1 to talk to its own database. To achieve this, tenant 1 must define both a Calico network policy and an Istio service entry within its egress namespace:

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: tenant1-sql-egress
  namespace: tenant1-egress
spec:   
  selector: "app == 'tenant1-egressgateway'"
  order: 10
  types:  
  - Egress 
  egress:   
    - action: Allow     
      protocol: TCP
      destination:       
        nets:
          - <SQL DB private endpoint>/32         
        ports:         
          - 1433

Since the default network policy in the egress namespace only permits internet-bound traffic, we must explicitly authorize the egress gateway to connect to the private endpoint of the Azure SQL Database. Additionally, we must enrich Istio's service registry with the FQDN of the database:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-sql-db
  namespace: tenant1-egress
spec:
  hosts:
  - <tenant1>.database.windows.net
  location: MESH_EXTERNAL
  ports:
  - number: 1433   
    protocol: TLS
    name: sql
  resolution: DNS

Note that the TLS protocol is used and is required to let Istio inspect the Server Name Indication (SNI) host when filtering traffic. At this stage, pods in the workload namespaces can resolve the database hostname and attempt to connect, but they will be blocked by Calico, which doesn't allow traffic destined for the private endpoint of the database. Additionally, NSG rules applied to the private endpoint subnet only allow traffic from the egress subnet. Similarly, for internet-bound traffic, the upstream NVA restricts access to the egress subnet. As a result, both internal and external traffic must be explicitly routed through the egress gateway—exactly the behavior we aim to enforce. No pod should reach destinations outside the cluster without being properly authorized. To route traffic to the egress gateway, a few extra Istio resources are required. The first one is the Istio gateway:

apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: sql-gateway
  namespace: tenant1-egress
spec:
  selector:
    istio: tenant1-egressgateway
  servers:
  - port:
      number: 1433
      name: tls
      protocol: TLS
    hosts:
    - <tenant1>.database.windows.net
    tls:
      mode: PASSTHROUGH

The Gateway configuration in Istio defines the protocol to use and how TLS should be handled. In this scenario, since the application pod initiates the TLS connection to the Azure SQL Database—and due to the nature of the TLS protocol—the Istio egress gateway functions in PASSTHROUGH mode. It does not terminate TLS but instead verifies that the SNI host in the TLS handshake matches the one defined in the corresponding ServiceEntry. Note that Istio is also able to perform TLS origination, acting as a bridge between an HTTP client and an HTTPS destination. The last step consists of defining the virtual service, which routes the traffic to the egress gateway:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: sql-virtual-service
  namespace: tenant1-egress
spec:
  hosts:
  - <tenant1>.database.windows.net
  gateways:
  - mesh
  - sql-gateway
  tls:
  - match:
    - sniHosts:
      - <tenant1>.database.windows.net
      gateways:
      - mesh
      port: 1433
    route:
    - destination:
        host: "tenant1-egressgateway.tenant1-egress.svc.cluster.local"
        port:
          number: 1433
  - match:
    - sniHosts:
      - <tenant1>.database.windows.net
      gateways:
      - sql-gateway
      port: 1433
    route:
    - destination:
        host: <tenant1>.database.windows.net
        port:
          number: 1433
      weight: 100

In essence, Istio will route traffic destined for <tenant1>.database.windows.net to tenant1-egressgateway.tenant1-egress.svc.cluster.local, which is the egress gateway ClusterIP service. A match rule is performed against the SNI host that corresponds to the service entry. The same resource types (VirtualService, Gateway) are used for ingress traffic. We'll leave it to you to explore the provided code further on your own.

To test this code, you'll need to set up an AKS cluster within a virtual network that follows the structure shown in the diagrams. You'll also need to install both Istio and Calico. Once your environment is ready, apply the Calico-related YAML files using the calicoctl command-line tool, and use kubectl for the rest. For analyzing and troubleshooting Istio configurations, the istioctl CLI will be helpful.

In real-world scenarios, these YAML templates would typically be packaged as Helm charts with values files or structured using Kustomize, and deployed via tools like Argo CD or Flux CD—though that's beyond the scope of this book. We wanted to explain and demonstrate the key concepts regardless of the deployment method. Let's now summarize this chapter.

Summary

In this chapter, we explored AKS from both a foundational and an advanced perspective. Starting with the core architectural elements and essential Kubernetes resource types, we laid the groundwork for understanding how AKS operates. We then focused on networking—where AKS diverges most from traditional Azure practices—diving into both layer-4 and layer-7 concerns. Through detailed explanations and diagrams, we examined how tools like Calico and Istio address network policy enforcement, service-to-service communication, and traffic control in a shared cluster environment.

We covered the key operational capabilities of AKS, including scaling strategies for clusters, node pools, and pods, as well as monitoring basics. High Availability and Disaster Recovery principles were discussed to help ensure resilience in production environments. We also reviewed the main AKS add-ons, extensions, and provisioning options.

The chapter concluded with a hands-on use case: hosting a multi-tenant SaaS solution on AKS. This brought together theory and practice, demonstrating how to apply Istio and Calico in tandem to achieve secure and observable tenant isolation within a shared cluster—highlighting the challenges and opportunities that arise when adapting hub-and-spoke segmentation principles to the AKS model.

With this chapter, we gained not only a panoramic view of AKS but also a deeper understanding of the network and security complexities that are critical to real-world deployments. In the next chapter, we're going to explore the other container services available in Azure.

5 Other Container Services

Join our book community on Discord

https://packt.link/0nrj3

In this chapter, we'll explore the broader landscape of container services offered by Azure, focusing on options that simplify container deployment without the operational overhead of managing Kubernetes clusters. Since the container landscape was already introduced in Chapter 2, Solution Architecture, we'll now dive straight into the specifics. We'll begin with a brief overview of Azure Container Instances (ACI) for quick, serverless container execution, and Azure Web App for Containers, which brings containers into the PaaS world of App Service.

We'll also touch on Azure Red Hat OpenShift (ARO), a managed OpenShift platform for enterprises seeking Red Hat's container orchestration ecosystem. However, our primary focus will be on Azure Container Apps (ACA)—a serverless container platform built specifically for microservices and event-driven architectures. ACA abstracts away the underlying infrastructure and Kubernetes complexity while enabling advanced capabilities like Dapr for service invocation, pub/sub, and state management, and KEDA for event-driven autoscaling.

To help guide service selection, we'll include a comprehensive comparison table outlining the strengths and limitations of each container offering—including AKS—followed by an in-depth analysis of when and why to choose each. We'll conclude with a hands-on use case built on Azure Container Apps that showcases its power when combined with Dapr and KEDA, illustrating how to build resilient, scalable, and decoupled applications with minimal infrastructure management.

More specifically, we'll look at:

  • The Container Services Map
  • A quick look at Azure Container Instances
  • A quick look at Azure Web App for Containers
  • A quick look at Azure Function Containers
  • A quick look at Azure Red Hat OpenShift
  • A quick look at Azure Container Apps
  • Extensive comparison between the container services
  • A use case about building and deploying a Dapr-enabled microservices application to ACA.

Let us now explore the technical requirements.

Technical requirements

Here are the technical requirements for this chapter:

  • We will be using Microsoft Visio for the diagrams but the corresponding PNGs are also provided.
  • Visual Studio Code and Terraform to open and deploy the ACI sample application.
  • Visual Studio 2022 or later to open and build the .NET solution. Since it is possible to deploy and test the provided solution without opening and rebuilding the application, do not feel obliged to install Visual Studio.
  • An Azure subscription with owner permissions is needed to deploy the provided code. You can start a free trial if necessary. Follow this link https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account
  • Azure CLI is needed to execute the provided script. You can find the installation instructions here https://learn.microsoft.com/en-us/cli/azure/install-azure-cli.
  • Maps, diagrams, and code are available at https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/tree/master/Chapter05.

A quick look at Azure Container Instances (ACI)

As previously discussed in Chapter 2, Solution Architecture, ACI excels at running background tasks—whether resource-intensive or lightweight—such as processing large blobs, performing computations, or hosting lightweight components like self-hosted Azure DevOps agents. To give you a more concrete idea of how to use ACIs, let's walk through an example. Figure 5.1 illustrates a sample use case that combines Logic Apps and ACI, and for which the code is provided for you to test in your own tenant.

Figure 5.1 – Orchestrating ACIs to handle blobs with Logic Apps

In Figure 5.1, two independent Logic Apps (or two separate workflows within a Standard Logic App) are triggered on a schedule. The first Logic App, responsible for ACI orchestration, iterates through the blobs of a specified storage container, provisions one ACI per blob in parallel, and passes the blob name as an environment variable to each ACI. Each ACI then processes its assigned blob. A separate Logic App independently handles cleanup by listing the ACIs in the resource group and looping through them to delete any ACI whose container state is Terminated. Both Logic Apps have system-assigned identities, which are assigned the ACI Manager role (for both) and the Blob Storage Contributor role for the ACI orchestration. The code executed within the ACI is a dummy app that pretends to handle the blob:

string blobName = Environment.GetEnvironmentVariable("BlobName");
if (string.IsNullOrEmpty(blobName))
    throw new ApplicationException("No blob transmitted");
//pretend to handle the blob
Thread.Sleep(new Random().Next(2000,15000));
Console.WriteLine("job done");

This concrete example shows how you can spin up ACIs to handle a given task and orchestrate them with Logic Apps for creation, monitoring, and deletion. In a real-world context, you would handle failing ACIs or let them report their own status to a telemetry service such as Event Hubs. To keep things simple, we limited ourselves to the orchestration bits and the fluent creation and destruction of ACIs.

Important note

The code is available here: https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/tree/master/Chapter05/code.

To simplify testing in your own tenant, all resources are deployed within a single resource group. Both Logic Apps have a manual trigger to avoid unnecessary costs should you forget to disable or delete them. To test this code in your own tenant, you must:

  • Have an Azure subscription with owner permissions. Follow this link to start a new trial if required https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account.
  • Install Azure CLI. Follow the instructions available here https://learn.microsoft.com/en-us/cli/azure/install-azure-cli.
  • Install Terraform. Follow this link https://developer.hashicorp.com/terraform/install where you can find binaries for all operating systems. This only takes a few minutes!
  • Make sure the terraform command works and is known in your system. In other words, make sure the PATH environment variable points to the location where you downloaded the Terraform client.
  • Clone the map book repo or download the code.

Before you can deploy the code, you must make sure the Microsoft.Logic and Microsoft.ContainerInstance resource providers are enabled in your subscription. Please refer to the Hub and Spoke use case section of Chapter 3, where we already explained how to enable resource providers. Once ready, just follow these steps to deploy the provided example:

  • Run az login from the Visual Studio Code terminal and log in as usual. This will ensure Terraform can reuse your cached credentials when deploying the code.
  • Open Visual Studio Code and locate .\Chapter 5\code\ACI\IaC\. Adjust the config.yaml file to your own needs.

The provided config.yaml file contains the following information:

location: "swedencentral"
resourceGroup: "mapbook-aci"
subscriptionId: "<your-subscription-id>"
storageAccountName: "seymapbookacisa"
containerImage: "stephaneey/acihandler:dev"

You are free to change any of the parameters, but make sure to add your own subscription identifier and to change the storage account name, which must be globally unique. You may optionally build your own container image with the provided .NET code and corresponding Dockerfile instead of using the provided one, although this is not required to understand how ACIs and Logic Apps can work together. We recommend sticking to the provided image, and we'll make sure not to delete it. To deploy the infrastructure, follow these steps:

  • In Visual Studio Code, launch the terminal (View | Terminal).
  • Make sure you are in the right folder (cd '.\Chapter 5\code\ACI\IaC\')
  • Run terraform init
  • Run terraform apply --auto-approve

The last command performs the actual deployment, which takes about a minute to complete. Once done, you should be able to view all resources in your resource group as illustrated in Figure 5.2:

Figure 5.2 – Resources deployed by IaC

You can upload a few blobs into the storage account's blobs container and trigger the aci-orchestration Logic App. A successful execution looks like Figure 5.3.

Figure 5.3 – Execution of the aci-orchestration Logic App

In Figure 5.3, we see that two iterations (because we uploaded two blobs) took place in parallel, and soon after the execution of the aci-orchestration Logic App, new resources (our ACIs) appear in the resource group:

Figure 5.4 – New ACIs in the resource group

Clicking on one of the ACIs and going to its container section, we can see its logs:

Figure 5.5 – Container logs

Once done, you can trigger the aci-cleanup Logic App, which will delete any completed ACI as illustrated in Figure 5.6.

Figure 5.6 – Execution of the aci-cleanup Logic App

We hope that this concrete example helps you grasp how ACIs can be used in a real-world use case. Let's now look at Azure Web App for Containers.

A quick look at Azure Web App for Containers (AWC)

AWC is well-suited for lifting and shifting legacy .NET applications and handling basic deployment scenarios. It supports both Windows and Linux as underlying operating systems, making it particularly appropriate for older .NET apps that rely on GAC-deployed assemblies. Although AWC isn't the only service capable of running Windows-based containers, it offers a much simpler experience compared to a full-fledged container orchestration platform. With AWC, you simply package your app into a container, push the image to a registry, and let the service handle the rest. You can still take advantage of App Service Plan features like autoscaling without any additional configuration. AWC can, of course, also be used for brand-new apps using the latest and greatest technologies.An application deployed to AWC could look like Figure 5.7.

Figure 5.7 – Possible architecture of an application deployed to AWC.

In Figure 5.7, the application is deployed into its own spoke. Public access to the web app is disabled, and Private Endpoints are enabled to secure access. VNet Integration is configured to allow the app to communicate with private endpoints for both the data store (SQL) and the container registry (ACR). The web app uses its managed identity to pull the container image from ACR. To ensure this image pull works over the VNet integration outbound path, you must set the WEBSITE_PULL_IMAGE_OVER_VNET configuration setting. While it's possible to keep all services publicly accessible, we wanted to give you a pro tip and highlight this crucial setting—without it, image pulls will fail in a private network setup. Let's now explore AFC.

A quick look at Azure Function Containers (AFC)

From a hosting model perspective, the key difference between AFC and AWC lies in AFC's broader hosting flexibility. While AWC runs exclusively on App Service Plans (and Azure Arc-enabled systems), AFC can be hosted not only on App Service Plans—just like regular web apps—but also on Functions Premium Plans and within Azure Container Apps Environments. Furthermore, Azure Functions offer the additional advantage that they can be packaged and self-hosted anywhere. Since we'll dive deeper into Azure Functions in the next chapter and cover Azure Container Apps Environments later in this one, we'll now move directly to Azure Red Hat OpenShift.

A quick look at Azure Red Hat OpenShift (ARO)

ARO is a full-fledged container platform similar to AKS, with the key distinction being the OpenShift layer on top of Kubernetes. This layer adds enterprise-grade features and tooling tailored for large-scale container deployments, but it also introduces a more opinionated and less flexible environment compared to AKS. While everything discussed in Chapter 4, Working with Azure Kubernetes Service applies to ARO conceptually, the implementation details differ to align with the OpenShift stack. ARO also has fewer integrations with the broader Azure ecosystem and comes at a significantly higher cost. In general, organizations already invested in OpenShift—whether on-premises or in other clouds—are likely to adopt ARO in Azure. Others will typically default to AKS. It's worth noting that ARO and AKS are not mutually exclusive. Since ARO is a substantial topic deserving its own book, we'll pause here and turn our focus to more native Azure services. Let's now explore Azure Container Apps.

A quick look at Azure Container Apps (ACA)

ACA is a fully managed container service that supports both Windows and Linux containers. While it is built on Kubernetes, the underlying infrastructure is completely abstracted from the cloud consumer. ACA is purpose-built for microservices and distributed application architectures, offering native integration with tools like Dapr for building event-driven apps and KEDA for autoscaling based on demand. More recently, Microsoft introduced features such as Container Apps Jobs and Container App Session Pools, enabling the execution of untrusted code within isolated sandbox environments. Microsoft is strongly positioning ACA as its flagship container platform, aiming to provide a scalable and flexible solution that maximizes resource utilization while minimizing operational complexity.

Additionally, ACA provides integrated revisions, allowing blue-green and canary deployments, and can be fully deployed into virtual networks for private networking and security. ACA fully integrates with broader Azure features such as managed identities, authentication with Entra ID, and Azure Monitor. A Container App Environment, based on Consumption or Workload Profiles, is the hosting piece of Container Apps and Container Apps Jobs. Figure 5.8 shows different workload profile sizes:

Figure 5.8 – ACA workload profile sizes.

In Figure 5.8, we see Consumption-based profiles as well as pre-paid profiles, each with a corresponding amount of vCPU and RAM, as well as GPU when applicable. This gives us a lot of options to host tiny but also more resource-intensive workloads. Figure 5.9 shows an example of a distributed architecture using ACA.

Figure 5.9 – Distributed architecture using ACA

One container app is exposed through HTTP, and apps communicate directly through Dapr's invoke API or asynchronously using Dapr's pub/sub API building block. While Dapr is optional, it comes in handy for such scenarios. Notice, however, that at this stage, the Container Job cannot leverage Dapr and must directly subscribe to Azure Service Bus using the SDK. Another key point to note is that, as of May 2025, applications within the same environment can communicate with each other without any restrictions—except for those imposed by the application code itself. Each container app has its own scaling rules and is completely independent from the others in terms of configuration, managed identities, and so on; a scaling rule sketch follows.
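
As a minimal sketch, assuming the YAML specification accepted by az containerapp create --yaml and hypothetical topic, subscription, and secret names, a KEDA-backed scale rule for a Service Bus consumer could look like this:

properties:
  template:
    scale:
      minReplicas: 0          # scale to zero when the topic is idle
      maxReplicas: 10
      rules:
      - name: servicebus-topic-rule
        custom:
          type: azure-servicebus
          metadata:
            topicName: orders
            subscriptionName: shipping
            messageCount: "10"   # target number of messages per replica
          auth:
          - secretRef: servicebus-connection   # assumed secret holding the connection string
            triggerParameter: connection

Since the upcoming section provides a comprehensive comparison of container services—and given that our use case is centered around Azure Container Apps—we'll now move on to the next section.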

Extensive comparison between container services

Before diving into our use case, let's compare the container services against various criteria. Given the rapid pace of change in the cloud, comparisons between Azure services are inherently temporary and must be revisited regularly to remain accurate. Nevertheless, this comparison should highlight key points of attention that you can independently verify when faced with a specific use case. We already introduced Figure 5.10 in our second chapter.

Figure 5.10 - Container services mapped to use cases

Figure 5.10 shows empty circles for no alignment and full circles for full alignment. The assessment reflects how easily each architectural style can be realized with the service's built-in functionality, taking costs and friction into account. Before diving into the detailed explanation, let's briefly summarize each architecture style:

  • Microservices are an architecture style aimed at developing and deploying independent services, making each service immune to changes or adverse events happening in other services. Microservices also favor agility and flexibility over reusability. While not tightly coupled with Domain-Driven Design (DDD), it can be useful to follow some of the DDD principles to identify each service's boundaries.
  • Service-Oriented Architecture (SOA) is an architectural style that promotes reusability at the enterprise level. In the context of this comparison, we'll only focus on the hosting piece for the services themselves, not on the related Enterprise Service Bus (ESB) concept.
  • NTIER is an architecture style that splits an application into different layers (frontend, backend, data), with possible additional layers. It promotes modularity, allowing each tier to be developed and maintained by different teams.
  • Event-Driven Architecture (EDA) is an architecture style that relies on message and event brokers to let components react asynchronously to changes. It maximizes robustness and scalability.

The last item in Figure 5.10—resource-intensive tasks—is not an architecture style but rather a recurring requirement across all the architecture styles we've discussed. For this reason, we also evaluated container services in relation to this common need.

Table 5.1 provides a detailed explanation of how each container service aligns with the microservices architecture style:

Microservices
ACA ACA includes Dapr and KEDA, making it well-suited for microservices. It provides built-in service discovery, per-service autoscaling, and native integration with message brokers and event stores.
ACI ACI lacks native support for microservices and is not ideal for long-term use due to cost and stability limitations. Microservices are expected to be long-lived and stable.
AFC AFC alone does not have any specialized feature for running microservices, but its built-in triggers and bindings make it useful for performing tasks that handle command and integration events. Functions can also be handy in microservices communication chains.
AKS/ARO AKS and ARO support microservices architectures well but require greater setup effort and expertise than ACA.
AWC AWC lacks built-in service discovery, bindings, and triggers, but it can still be deployed and scaled independently, as expected in microservices architectures.

Table 5.1 – Evaluation of container services against the microservices architecture style

Table 5.2 provides a detailed explanation of how each container service aligns with the SOA architecture style:

SOA
ACA ACA is well-suited for hosting reusable and enterprise-wide backend services.
ACI Because reusable, enterprise-wide services are expected to be continuously available, ACI is not an ideal choice, as running it permanently raises concerns about cost and stability.
AFC Due to the function runtime overhead and complexity, AFC is not ideal for hosting simple web services.
AKS/ARO AKS and ARO are well-suited for hosting reusable and enterprise-wide backend services with a maximum level of control.
AWC AWC is well-suited for hosting reusable and enterprise-wide backend services.

Table 5.2 – Evaluation of container services against the SOA architecture style

Table 5.3 provides a detailed explanation of how each container service aligns with the NTIER architecture style:

NTIER
ACA ACA is not a good fit for NTIER architectures because there is no built-in way to split the different application layers from a network perspective, not even using Dapr (for now). Every container app hosted in a container app environment has access to all the other container apps sharing the environment. One may define layers using three different Container App Environments, but this is largely overkill, especially when we know that there is a fully managed K8s cluster behind each environment. Dapr provides service-level access policies, which could be used to split the layers (that is, the frontend can talk to the backend but not to the data layer), but this is not yet available in ACA, as discussed on GitHub: https://github.com/microsoft/azure-container-apps/issues/303
ACI ACI does not have any built-in feature to split the application layers. You could create one group with three different containers, but this wouldn't represent an actual layer split.
AFC Here again, AFC's runtime overhead makes it ill-suited for the backend layer of an NTIER architecture, and AFC lacks frontend support.
AKS/ARO

AKS is a perfect fit for NTIER because you can easily:

Leverage internal network policies to split the application layers

Leverage fine-grained service-level authorizations with ecosystem solutions such as Dapr or any Service Mesh (Open Service Mesh, Istio, Linkerd, and more.)

AKS is often used to host multiple business applications within a single cluster, using logical isolation to handle intra/cross-application communication.

AWC App Services can be hosted on dedicated App Service Plans (one for the frontend, one for the backend) in different subnets, and you can use Azure networking features to control and restrict traffic from one layer to another. Traffic destined for data services may as well be restricted using Network Security Groups. AWC integrates very well with Azure networking, especially if hosted on App Service Environments.

Table 5.3 – Evaluation of container services against the NTIER architecture style

Table 5.4 provides a detailed explanation of how each container service aligns with the EDA architecture style:

EDA
ACA ACA easily binds to message brokers and event stores, thanks to Dapr, which makes it a perfect fit for EDA.
ACI No built-in bindings, triggers, or autoscaling features. However, when combined with either Logic Apps or Durable Functions, ACIs can be used to perform event-related tasks.
AFC A ton of built-in triggers and bindings, making AFC a perfect fit for EDA.
AKS/ARO AKS/ARO are also a perfect fit for EDA but require more work and knowledge. Unlike with ACA, Dapr and KEDA have to be manually configured.
AWC Deployed as an API, any AWC can handle events but there is nothing built-in to make your life easier.

Table 5.4 – Evaluation of container services against the EDA architecture style

Table 5.5 provides a detailed explanation of how each container service accommodates resource-intensive tasks:

Resource-intensive tasks
ACA ACA workload profiles accommodate most resource-intensive jobs and provide GPU support.
ACI ACI is burstable by design, allowing up to 32 CPU cores and 256 GB of RAM to be assigned to a single process, with optional GPU support depending on the region. They can be easily created and deleted using the Fluent API or Logic Apps once a task finishes, making them a cost-effective solution. They are particularly interesting because they can be created and destroyed in less than a minute and we only pay for the actual execution time.
AFC AFC on Flex, Functions Premium, or App Service plans offers limited computing power. While AFC can run within a Container App Environment, it's often better to switch to an ACA job in that scenario.
AKS/ARO AKS, combined with virtual nodes (ACI in Azure) and KEDA, is a perfect fit for resource-intensive tasks, but ACI's availability depends on the chosen network plugin. It takes only between 20 and 90 seconds to start a new virtual node container (ACI), and it gets destroyed once the task completes, making it a very cost-friendly approach. ARO lacks virtual node support. ARO and AKS alone are still a good fit thanks to the granularity offered by node pools/machine sets and the cluster autoscaler.
AWC AWC is well-suited for handling resource-intensive tasks when paired with a high-performance underlying app service plan but this is not a cost-friendly option.

Table 5.5 – Evaluation of container services for resource-intensive tasks

Now that we have a better understanding of how to map container services to specific purposes, let's see how they respond to Software Quality Attributes. Figure 5.11 shows a consolidated view, followed by detailed explanations:

Figure 5.11 - Container services mapped to Quality Attributes

Table 5.6 provides a detailed explanation of how AKS and ARO align with the chosen Quality Attributes:

AKS/ARO
Scalability AKS and ARO are highly scalable, thanks to their use of independent node pools/machine sets that can be scaled vertically (different VM SKUs) and horizontally. Additionally, they both incorporate built-in features like the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA) to automatically scale pods in and out. Advanced ecosystem solutions such as KEDA can also be used.
Availability AKS and ARO are highly available as they can leverage Azure's Availability Zones.
Resilience Both AKS and ARO benefit from the underlying K8s self-healing capabilities and both can be supervised by Argo CD's desired state configuration.
Security Red Hat OpenShift adds a security-focused layer on top of Kubernetes and is considered more secure by default. However, as shown in the previous chapter, AKS offers numerous add-ons and extensions that enhance its overall security posture.
Performance AKS and ARO performance depends on the VM size used in each node pool or machine set. However, this level of granularity enables fine-tuning for optimal performance.
Observability AKS and ARO both rely on Prometheus and Grafana, and they both have integrations with Azure Monitor. AKS also benefits from the Container Insights feature.
Reliability Both are mature and reliable platforms.
Testability Container orchestrators streamline automated testing by making it easy to deploy test containers into the execution environment.
Deployability Modern deployment techniques such as blue/green, canary, and Rolling Updates are all supported.
Interoperability AKS and ARO fully integrate with virtual networks with no restriction, which makes it possible to integrate with any other internal system.

Table 5.6 – AKS/ARO mapped to Quality Attributes

Table 5.7 provides a detailed explanation of how ACA aligns with the chosen Quality Attributes:

ACA
Scalability Container Apps run within Container Apps Environments (CAEs), which are fully managed Kubernetes clusters provided by Microsoft. Compared to AKS or ARO, ACA offers less control over cluster-level scaling since it lacks node pool or machine set configurations. However, it's possible to deploy multiple CAEs to accommodate scaling needs. Regarding pod-level scaling, ACA has built-in tools like KEDA, enabling advanced scaling rules that allow workloads to scale from zero to any number of instances.
Availability CAEs support Availability Zones.
Resilience ACA being based on K8s, the same built-in self-healing features are available. Additionally, Dapr users can leverage Dapr's resilience policies.
Security As of now, ACA lacks the internal security boundary capabilities found in AKS or ARO. Until at least mid-2025, all applications within the same CAE can communicate freely with each other, without enforced isolation. While it's possible to achieve segregation by deploying multiple environments, this approach significantly increases IP address consumption, making it difficult to scale. Additionally, mTLS is not enforced for applications that do not use Dapr and sometimes not even enforced by Dapr-injected applications (see explanation just after this table).
Performance The current capabilities of ACA do not match those of AKS in terms of selecting specialized virtual machines. However, for generic purpose and AI applications, ACA offers a good level of performance.
Observability ACA fully integrates with Log Analytics, which gives a lot of visibility of what is going on inside the cluster. Application telemetry can be sent to Application Insights or any other OpenTelemetry compliant tool.
Reliability ACA is still a young product within the Azure landscape, but is gaining in stability over time.
Testability Integration tests can be conducted thanks to sidecar test containers. ACA supports A/B testing but does not support jobs yet.
Deployability ACA supports some of the modern deployment techniques, such as blue/green and rolling updates.
Interoperability Due to certain limitations in the integration of virtual networks and less control over network traffic, there may be potential risks to interoperability in certain integration scenarios.

Table 5.7 – ACA mapped to Quality Attributes

When it comes to mTLS, there's often confusion between Dapr's implementation and that of service meshes like Istio, as they offer overlapping functionality. While the underlying mTLS mechanism is fundamentally the same, the key distinction lies in how their sidecar containers handle traffic.

Service mesh sidecars follow the Ambassador pattern, meaning they intercept and control all ingress and egress traffic to and from the pod. This traffic flow is enforced by default and can only be bypassed through specific configurations or annotations.

In contrast, Dapr sidecars do not enforce such interception. They operate alongside the application container but do not filter traffic by default, making it possible for traffic to reach the application container directly—bypassing the Dapr sidecar entirely. This can leave the application exposed unless additional network or security controls are in place. To better understand this, let's look at Figure 5.12, which illustrates how things work with the Istio service mesh:

Figure 5.12 – Istio sidecar implementing the Ambassador pattern

A pod that is not injected by Istio will never be able to communicate with the application container directly, nor even with the Istio sidecar if STRICT mode is enabled. Conversely, direct connections to the application container are permitted with Dapr by default, as shown in Figure 5.13.

Figure 5.13 – Daprd's default behaviour

In a Dapr-enabled application, any container from any pod, injected or not, can directly communicate with the application container, but not with the daprd sidecar when mTLS is enforced. This is because Dapr does not intercept traffic; it only processes traffic explicitly directed to its sidecar. Naturally, when using Dapr's invoke API, mTLS remains enforced, but a malicious actor would go through the backdoor.

From a security standpoint, this is significant. Since the application container typically listens on 0.0.0.0:<port> by default rather than 127.0.0.1:<port>, it becomes accessible to other pods, bypassing the protections mTLS would otherwise provide via the Dapr sidecar. This behavior stands in contrast to service meshes, where the sidecar transparently intercepts all traffic and enforces security policies uniformly. Long story short, the only reliable way to enforce Dapr's mTLS is to bind the application container to 127.0.0.1. This ensures that all inbound traffic must go through the Dapr sidecar, allowing mTLS to be effectively enforced. Without this, the application remains exposed on the pod network, bypassing Dapr's security mechanisms.

Another key difference between Dapr and a service mesh is that Dapr requires integration within the application code, whereas applications remain completely unaware of the presence of a service mesh. While embedding Dapr might be acceptable for in-house developed solutions, it's not something we can enforce on third-party or vendor applications that we may need to host. Keep that in mind should you decide to use Dapr's mTLS as the sole option.

Finally, it's also possible to run Dapr alongside a service mesh, disabling Dapr's mTLS in favor of the mesh's mTLS. This setup allows the service mesh to handle all traffic encryption and policy enforcement. However, this comes at a cost: you'll be running two sidecar containers per pod, which can impact performance and increase resource consumption. It's important to weigh the security and operational benefits against the added complexity and overhead.

Now that we clarified how Dapr's mTLS differs from service meshes, let's look at how ACI responds to our Quality Attributes. Table 5.8 provides a detailed explanation of how ACI aligns with the chosen Quality Attributes.

ACI
Scalability Container Instances are scalable in that they can be run in parallel to execute various tasks. There is a default limit of 100 container instances per subscription, which can be extended to 5,000. However, the service does not have any autoscaling feature.
Availability Container instances are ideal for short-lived and idempotent tasks. Although the service promises 99.9% uptime, based on personal observations at various customers, it may not be suitable for long-running operations such as an API or a website. ACI is a zonal service, meaning that you can explicitly deploy an ACI to a given zone, but it doesn't support zone redundancy out of the box.
Resilience ACI has a basic container restart mechanism.
Security ACI integrates with virtual networks with some limitations that do not represent a real impediment from a network perspective. However, the service does not feature any security-specific functionality.
Performance

Once started, ACI executes according to the number of cores and the amount of memory allocated to it. However, the start of an ACI takes between 20 and 90 seconds under normal circumstances. This slow startup is due to two reasons:

The typical cold start effect of serverless infrastructures

The fact that, in 99.9% of cases, ACI pulls images from a registry at every startup. The time it takes to start heavily depends on the image size. It only uses the image cache for some Microsoft-managed images.

Observability ACI integrates with Log Analytics, with basic information about running container groups, but there is no other feature that helps achieve better observability.
Reliability ACI is not the most reliable service for long-running workloads. This is mostly due to its serverless nature.
Testability Integration tests can be conducted thanks to sidecar test containers. There is, however, no other built-in feature that can help testability.
Deployability Like any other Azure service, deployments can be done fluently through infrastructure as code and CI/CD pipelines, but the service does not offer any mechanism to support blue/green, canary, etc.
Interoperability Due to certain network limitations, there may be potential risks to interoperability in certain integration scenarios, though to a lesser extent than with ACA.

Table 5.8 – ACI mapped to Quality Attributes

Table 5.9 provides a detailed explanation of how AFC aligns with the chosen Quality Attributes:

AFC
Scalability The serverless architecture of AFC allows for automatic scaling out and in without any manual input, although some adjustments can still be made for optimal performance.
Availability As AFC is designed for brief, event-based operations, immediate availability is not as critical. Transient errors are typically handled by message brokers and event producers, which retry or keep items in queues/topics if the subscriber is unavailable. From an infrastructure standpoint, AFC is a zone-redundant service.
Resilience Similarly, resilience is less of an issue but the service itself has no built-in way to ensure a desired state.
Security AFC integrates with virtual networks, but extra effort is required to integrate both inbound and outbound traffic, except when hosted on an ASE, which is natively inside a virtual network.
Performance The performance depends on the chosen plan. Isolated plans running on ASE have high performance profiles. Function premium plans are also interesting from that aspect.
Observability AFC integrates natively with Application Insights, which in turn integrates with Log Analytics. Both function-scoped and global-scoped observability are achieved.
Reliability AFC is a mature service and is reliable. However, the function runtime evolves quite frequently and some early day issues might come to the surface with new bindings/triggers.
Testability Deployment slots can help test safely with no interference with production.
Deployability AFC features deployment slots, which allow for modern deployment scenarios.
Interoperability No interoperability limitations when integrated with virtual networks.

Table 5.9 – AFC mapped to Quality Attributes

Table 5.10 provides a detailed explanation of how AWC aligns with the chosen Quality Attributes:

AWC
Scalability AWC are supported by App Service Plans that feature autoscaling. Multi-tenant plans can scale quickly, but isolated plans can take considerably longer (about 13 minutes to add a new instance). It is recommended to use scheduled autoscaling for isolated plans to optimize performance. Scaling time aside, Isolated Plans can scale to many more instances than the other plans.
Availability From an infrastructure standpoint, AWC is a zone-redundant service.
Resilience AWC do not have any built-in feature for maintaining a desired state. However, application-level health checks can be configured to remove unhealthy instances from the load balancer, and restart them.
Security AWC integrates with virtual networks, but extra effort is required to integrate both inbound and outbound traffic, except when hosted on an App Service Environment, which is natively inside a virtual network.
Performance Performance varies according to the selected App Service Plan. This is especially true with Isolated plans, which offer the best possible performance, because they are backed by powerful virtual machines.
Observability AWC integrates natively with Application Insights, which in turn integrates with Log Analytics. Both app-scoped and global-scoped observability are achieved.
Reliability AWC are extremely reliable.
Testability Deployment slots can help test safely with no interference with production.
Deployability AWC feature deployment slots, which allow for modern deployment scenarios.
Interoperability No interoperability limitations when integrated with virtual networks.

Table 5.10 – AWC mapped to Quality Attributes

Now that you are able to map some key Quality Attributes to Azure container services, let's proceed to our use case.

Use case – Microservices

Let's look at the use case scenario.

Scenario and first analysis

Contoso is a newly formed logistics startup aiming to disrupt the same-day delivery market with a fully cloud-based, microservice-driven backend. With no existing infrastructure—on-premises or in the cloud—they're starting from scratch and want to quickly build a proof-of-concept (PoC) to validate their architecture and business idea without incurring high upfront costs. Their PoC's objectives and constraints are:

  • Prove out a scalable event-driven microservice architecture and only focus on the order and shipping services first.
  • Keep operational overhead and cost to a minimum.
  • Avoid managing complex infrastructure (for example, Kubernetes clusters).
  • Enable fast iteration and developer productivity.

Let's first extract the keywords in this scenario that demand our focus:

  • Microservices: We should focus on a service that helps deal with distributed architectures and service discovery.
  • Minimal overhead and low costs: We should look at a serverless offering, especially for the PoC. We can always transition to something else later on.

After a bit of digging, you realized that ACA could be a good fit, as it has a serverless offering and includes both Dapr and KEDA, which may help boost development productivity. You went to the drawing board and ended up with Figure 5.14.

Diagrams

Figure 5.14 illustrates a possible solution.

Figure 5.14 – Contoso's PoC Architecture
Figure 5.14 – Contoso's PoC Architecture

You foresee a single Container App Environment into which you will deploy the Order, Shipping, and OrderQuery services. All the services communicate asynchronously through Azure Service Bus using a Dapr Pub/Sub component. The namespace has the following topics:

  • order.placed: The order service subscribes to this topic to handle any incoming order, which could be placed by an external API call or by an external component pushing a new order placement request to the topic.
  • order.paid: The shipping service subscribes to this topic, since shipping can start as soon as an order is paid.
  • order.shipped: The shipping service publishes a message to this topic once the shipping process completes.

Before starting the shipping process, the service calls the orderquery service synchronously to fetch order details using Dapr's invoke API. Components leverage Azure Managed Identity to interact with the Service Bus entities. Let's see a bit more concretely what the code looks like.

Code samples

A full .NET application, focusing only on the usage of Dapr in its simplest expression, is available in the GitHub repo, but let's focus on the essential parts. The first thing any Dapr-enabled API has to do is enable Dapr and CloudEvents:

public void ConfigureServices(IServiceCollection services)
{
    // Register Dapr integration with the ASP.NET Core controllers
    services.AddControllers().AddDapr();
}

public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
{
    // Unwrap the CloudEvents envelopes used by Dapr pub/sub
    app.UseCloudEvents();
}

Dapr makes use of the CloudEvents message format for its pub/sub API building block. Once Dapr is enabled for a given service, we can simply leverage the pub/sub channel very easily:

[Topic("daprsb", "order.placed")] //this route subscribes to the order topic
[HttpPost]
[Route("dapr")]
public async Task<IActionResult> ProcessOrder([FromBody] Order order){
    _logger.LogInformation($"Order with id {order.Id} processed!");
    //we'll pretend it is already paid
    _logger.LogInformation($"Order with id {order.Id} paid!");
    await PublishOrderPaidEvent(order.Id, OrderEvent.EventType.Paid);
    return StatusCode(StatusCodes.Status201Created);
}

This code subscribes to the order.placed topic, pretends to handle the order, and publishes a message to the order.paid topic, also pretending that the order has been paid. For direct service invocation, we can simply use Dapr's invoke API as used from within the shipping service to fetch order details:

// Build a service-to-service GET request: app-id "orderquery", method = the order identifier
var request = _dapr.CreateInvokeMethodRequest<object>(
    HttpMethod.Get, "orderquery", id.ToString(), null);
var response = await _dapr.InvokeMethodWithResponseAsync(request);

The call is made towards the orderquery service (its Dapr application identifier) and the method being invoked is the actual order identifier. We'll let you discover the provided code on your own.

From an infrastructure perspective, we must provision a Container App Environment and a Dapr component of type pub/sub:

name: daprsb
componentType: pubsub.azure.servicebus
version: v1
metadata:
- name: namespaceName
  value: <namespace>
- name: azureClientId
  value: <identity>

We must specify the Service Bus namespace <namespace> and the identity <identity>, which are set dynamically after having provisioned them (not shown here). Once the component's YAML file is updated, we must deploy it:

az containerapp env dapr-component set `
  --name $envName `
  --dapr-component-name daprsb `
  --resource-group $resourceGroupName `
  --yaml ./daprsb.yaml

Both the order and shipping apps will consume that component. Our apps must be Dapr-enabled and be identified with a unique application identifier. Here is an example for the shipping service:

az containerapp create `
  --name shipping `
  --resource-group $resourceGroupName `
  --environment $envName `
  --image $shippingImage `
  --enable-dapr true `
  --dapr-app-port 8080 `
  --dapr-app-id shipping `
  --min-replicas 1 `
  --user-assigned mapbookidentity

We also have to define the port number the application container listens on (8080), and we link the app to the identity to which we have already granted access to the Service Bus (not shown here). Now that you should have a better idea of how the solution works, let's see how to test it.

Testing the solution

You can use the provided Docker images or decide to rebuild the application code yourself and use your own images. Since this step is optional and not directly linked to Azure, we'll leave it to you. To test this solution in your own tenant, you must:

  • Have an Azure subscription with owner permissions. Follow this link to start a new trial if required: https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account.
  • Install Azure CLI. Follow the instructions available at https://learn.microsoft.com/en-us/cli/azure/install-azure-cli.
  • Clone the mapbook repo or download the code.
  • Have Visual Studio Code with the PowerShell extension (also available on Linux).
  • Open the cloned/downloaded folder with Visual Studio Code. Make sure to run az login before any other activity.
  • Navigate (cd) to this folder: Chapter 5\code\use-case\IaC.

Once in the correct folder, open deploy.ps1 and adjust the following variables to your needs:

$resourceGroupName = "mapbook-aca-rg"
$envName = "mapbook-aca-env"
$location = "swedencentral"
$orderImage ="stephaneey/order-aca:dev"
$orderQueryImage ="stephaneey/orderquery-aca:dev"
$shippingImage ="stephaneey/shipping-aca:dev"
$serviceBusNamespace = "seymapbookacans"

Stick to the provided container images and only modify them if you decided to rebuild and repackage the application yourself. The only parameter that you should change is $serviceBusNamespace, as it should be unique worldwide. Just use something like <your trigram>mapbookns.

From there on, you can run the script ./deploy.ps1 if you are on Windows or have the PowerShell extension; otherwise, feel free to extract the Azure CLI commands and run them yourself. The only dependency on PowerShell is the following code block:

$yamlPath = "./daprsb.yaml"
$yamlContent = Get-Content $yamlPath -Raw
$updatedYamlContent = $yamlContent -replace '<namespace>', "`"$serviceBusNamespace.servicebus.windows.net`""
$updatedYamlContent = $updatedYamlContent -replace '<identity>', "`"$clientId`""
$updatedYamlContent | Set-Content $yamlPath

This code replaces the <namespace> and <identity> tokens in the daprsb.yaml file with the Service Bus namespace's FQDN and the identity's client ID, respectively. Feel free to replace those values manually if needed.

The execution of the .ps1 file takes only a few minutes. Once deployed, you should find the following resources in the resource group:

Figure 5.15 – resources deployed by the PowerShell script
Figure 5.15 – resources deployed by the PowerShell script

From there on, you should be able to click on any Container App and check that everything is running fine in the revisions and replicas section:

Figure 5.16 – Revisions and replicas of the order service.
Figure 5.16 – Revisions and replicas of the order service.

In the Overview tab of the order service, you should be able to grab the Application URL, which looks like https://order.<random>.<location>.azurecontainerapps.io.

With the URL in hand, you can place an order with the order service using the following raw HTTP payload (provided as well in the code):

POST https://order.<random>.<location>.azurecontainerapps.io/order
Content-Type: application/json

{
    "id": "4aadc0f8-eeda-4ee7-9c26-a6d39cbfbc28",
    "products": [
        {
            "id": "5678f982-2ae4-408c-92ff-6af45118d159",
            "name": null
        },
        {
            "id": "6678f982-2ae4-408c-92ff-6af45118d159",
            "name": null
        }
    ]
}

Make sure not to forget to add /order at the end of your endpoint, since this is the API entry point of the order service. Repeat the request a few times. The flow should be as follows:

  • An order is placed with the order service through the API.
  • The order service pushes the incoming order to the order.placed topic.
  • The order service processes the order.
  • The order service pushes an event to the order.paid topic.
  • The shipping service starts the shipping process, fetches the order details through an API call to orderquery, and pushes a message to the order.shipped topic upon successful completion.

In a real-world scenario, there would typically be an additional payment service, but we simplified the setup to focus solely on synchronous and asynchronous communication using ACA and Dapr. For your convenience, we added an extra subscription named foryourtocheck on the order.shipped topic, as shown in Figure 5.17.

Figure 5.17 – Checking if the communication chain works well.
Figure 5.17 – Checking if the communication chain works well.

Important note

You could go the extra mile yourself by adjusting the solution and adding KEDA scaling rules that scale the shipping app based on the number of messages landing in the shipping subscription. Figures 5.18 and 5.19 give you a hint on how to get there, and an indicative CLI sketch follows the figures.

Figure 5.18 – Scaling rule on Azure Service Bus
Figure 5.18 – Scaling rule on Azure Service Bus
Figure 5.19 – Full scaling configuration.
Figure 5.19 – Full scaling configuration.
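
For reference, a roughly equivalent Azure CLI command could look like the following sketch. The subscription name and the sb-connection secret (holding a Service Bus connection string) are assumptions for illustration, not part of the provided script:

az containerapp update `
  --name shipping `
  --resource-group $resourceGroupName `
  --min-replicas 0 `
  --max-replicas 10 `
  --scale-rule-name servicebus-rule `
  --scale-rule-type azure-servicebus `
  --scale-rule-metadata "topicName=order.paid" "subscriptionName=shipping" "messageCount=5" `
  --scale-rule-auth "connection=sb-connection"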

With this scaling configuration, the shipping service would scale to 0 if no activity takes place and would scale out to a maximum of 10 replicas should we receive a lot of order.paid events.

Let's now summarize this chapter.

Summary

In this chapter, we examined the diverse set of container services available in Azure, focusing on solutions that streamline container deployment without the burden of managing Kubernetes infrastructure. We explored Azure Container Instances for rapid, serverless workloads; Azure Web App for Containers for lift-and-shift and simple scenarios; and Azure Red Hat OpenShift for enterprises seeking a managed OpenShift experience. Our main focus was on Azure Container Apps, which provides a serverless platform tailored for microservices and event-driven architectures, with built-in support for Dapr and KEDA. Through detailed comparisons and a practical use case, we highlighted how to select and apply the right Azure container service for various scenarios.

With a clear understanding of Azure's container services, we will now shift our focus to application architecture. The next chapter will explore how to design applications that are modular, scalable, and resilient—whether built on containers or other compute models. We'll cover key cloud-native architectural patterns.

6 Developing and Designing Applications with Azure

Join our book community on Discord

https://packt.link/0nrj3

In this chapter, we begin by outlining the key core services every developer must grasp to be proficient in their daily work. Then, we take a step back to explore the broader ecosystem and examine how it helps deal with the most common cloud-native patterns. Finally, our use case will be about how to leverage Azure Functions in an event-driven application.

More specifically, we'll look at:

  • The Application Architecture Map
  • Zooming in on local developer experience
  • Zooming in on core Azure services and concepts every developer should know
  • Zooming in on some cloud-native patterns.
  • Use case

Let us now explore the technical requirements.

Technical requirements

  • We will be using Microsoft Visio for the diagrams but the corresponding PNGs are also provided.
  • Visual Studio Code and Azure CLI to open and deploy the sample application.
  • Visual Studio 2022 or later to open and build the .NET solution. Since we do not rely on CI/CD pipelines to deploy the code, you'll have to use either Visual Studio or the raw dotnet commands.
  • An Azure subscription with owner permissions is needed to deploy the provided code. You can start a free trial if necessary. Follow this link https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account
  • Azure CLI is needed to execute the provided script. You can find the installation instructions here https://learn.microsoft.com/en-us/cli/azure/install-azure-cli.

Maps, diagrams, and code are available at https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/tree/master/Chapter06.

What does it mean to be a cloud-native developer?

Over the years, we've encountered numerous situations where traditional developers struggled to appreciate the broader ecosystem they were entering. Many preferred to concentrate solely on writing application code, but in today's Everything as Code era, that mindset is no longer sustainable. Fully embracing cloud-native development requires stepping beyond your comfort zone and engaging with the full spectrum of the development lifecycle and application-specific Azure services. This consideration applies even more to Application Architects.

That's why our Azure Application Architecture Map highlights not only some of the cloud-native patterns, but also the essential knowledge every developer needs to master for effective and modern application development.

The Azure Application Architecture Map

The Azure Application Architecture Map, shown in Figure 6.1 should help you deal with the typical duties of an application architect, which we covered in Chapter 1, Getting Started as an Azure Architect.

Figure 6.1 – The Azure Application Architecture Map
Figure 6.1 – The Azure Application Architecture Map

Important note

To see the full map (Figure 6.1), you can download the PDF file available at https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/blob/master/Chapter06/maps/Azure%20Application%20Architecture.pdf.

Figure 6.1 has the following top-level groups:

  • Local development: This section is primarily geared towards developers, with a focus on the local debugging experience that most developers look for.
  • Core services and concepts developers must master: This section represents the so-called extra mile that true cloud-native developers must take, as outlined in the previous section. In our view, it reflects the bare minimum that any proficient Azure developer should be familiar with.
  • Common architecture styles: We'll shed light on the key architectural styles that define cloud-native development and examine the common frictions between microservices and distributed monoliths—without going too deep, as this book is not primarily about software architecture.
  • Common patterns: Since this book focuses on architecture, we'll take a step back from tooling and core services to briefly explore common patterns that have become paradigms in the cloud.
  • AI, data, security: We included these sub-domains because most applications involve data, AI, and must adhere to security requirements. However, since this chapter does not focus on these areas in depth, we encourage you to refer to the dedicated maps and chapters for a more comprehensive view.

Let's get started with the local development experience.

Zooming in on the local development experience

One of the most recurrent pain points when developers start embracing the cloud, or even a container ecosystem, is how they can test their code locally and which tools they need to install to make this possible.

Figure 6.2 lists a set of tools that are needed for local development and debugging.

Figure 6.2 – Required local development tools and features
Figure 6.2 – Required local development tools and features

Visual Studio Code is clearly positioned by Microsoft as the ultimate development tool for anything Azure and anything AI. While many traditional developers rely on Visual Studio or the Eclipse Integrated Development Environment (IDE), they should first and foremost get acquainted with Visual Studio Code, which makes it easy to install extensions for Azure and non-Azure environments. Visual Studio is of course still usable and has a good level of integration with Azure, but Visual Studio Code is more frequently updated, as are its extensions. To connect to an Azure database or to a local database, we can use SQL Server Management Studio as well as Azure Data Studio, but the latter is on the deprecation path, again in favor of Visual Studio Code.

Most emulators and runtimes are available in the form of containers, thus requiring Docker to be set up on the development machine. These emulators are essential for enabling local debugging. For instance, if you're developing an application that integrates Azure SQL, Azure Service Bus, and Azure Functions, you can run all the corresponding emulators locally—eliminating the need for direct Azure dependencies. This contrasts with services that lack emulator support, which require a live Azure environment to test against.

Alternatives exist to help you get a preconfigured development environment faster by leveraging Microsoft Dev Box or by using Azure Virtual Desktop (AVD) instead of a classic Virtual Desktop Infrastructure (VDI). AVD is particularly handy when working with remote parties or when being part of a service provider to gain access to the customer environment in a timely fashion. Finally, since many components are now containerized, you may want to consider using Dev Containers, which allow you to spin up a fully configured remote development environment—without needing to install everything locally (a minimal example follows). If you are developing with .NET, the Aspire orchestration tool can help you with the orchestration of the different containers that are required by your code when testing it.
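
As an indicative sketch, a minimal .devcontainer/devcontainer.json for a .NET and Azure workflow could look like this; the image, features, and extension IDs are common choices, not prescriptions:

{
  "name": "azure-dotnet-dev",
  "image": "mcr.microsoft.com/devcontainers/dotnet:8.0",
  "features": {
    "ghcr.io/devcontainers/features/azure-cli:1": {},
    "ghcr.io/devcontainers/features/docker-in-docker:2": {}
  },
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-azuretools.vscode-azurefunctions",
        "ms-dotnettools.csdevkit"
      ]
    }
  }
}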

Zooming in on core services and concepts developers must master

A good Azure developer should master the services shown in Figure 6.3.

Figure 6.3 – Core services and concepts developers must master
Figure 6.3 – Core services and concepts developers must master

As highlighted earlier, it's essential for developers to embrace the broader ecosystem to work more efficiently in their day-to-day activities. In most enterprises, CI/CD has become the standard, and developers are often expected to contribute by building or maintaining pipelines using tools like Azure DevOps or GitHub Actions. At the very least, developers should understand how these pipelines function so they can effectively troubleshoot when things go wrong. Similarly, with the widespread adoption of containerized applications, developers are expected to have at least a basic understanding of how to package an application into a container, the purpose of multi-stage builds, how to run containers locally, and the core concepts behind container images.

One area where developers often struggle is OpenID Connect (OIDC) and OAuth—both of which are foundational in cloud-native applications. A solid grasp of the different token types and authorization flows is essential, since lacking this knowledge can significantly hinder your effectiveness as an Azure developer. In Azure, the main Identity Providers (IdPs) are Entra ID and Entra External ID—typically used for B2B and B2C scenarios, respectively. However, many solutions also make use of third-party providers like Duende or Keycloak. Regardless of the provider, all of them are built upon the same standards: OIDC and OAuth. Here are the essential notions you must remember about these standards:

  • Access tokens are used to authorize access to protected resources, whether they're Azure services or your own APIs. They are commonly used by Single Page Applications (SPAs) to call APIs directly or are managed by a Backend-for-Frontend (BFF) that acts as the SPA's gateway. Access tokens are also frequently used in service-to-service communication scenarios. When acting as a client, your application must request an access token to consume a protected resource. When acting as a resource (for example, an API), it must validate the token presented by the caller to ensure the request is authorized.
  • Refresh tokens are long-lived tokens used to obtain new access tokens on behalf of a user when previously acquired access tokens have expired.
  • ID tokens are the result of a user authentication and often contain profile information although it is not strictly required by the standard. They are used by client applications only and should not be sent to APIs.

All these tokens can be acquired through various authorization flows, such as Proof Key for Code Exchange (PKCE), Authorization Code Grant, and Client Credentials Grant, which are the most commonly used flows. Gaining a solid understanding of both token types and authorization flows will greatly enhance your developer experience in Azure and the cloud in general.

A best practice in Azure is to use either managed identities or workload/federated identities to access resources securely. Both techniques require the infrastructure and the permissions to be pre-configured according to the application needs. Regardless of the identity type, the client credentials grant flow is always used to obtain access tokens in these scenarios.

Managed identities are typically used in two different ways. The first option consists of having the code use the identity directly to access the resource, as shown in Figure 6.4 and sketched in the code below.
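
Here is a minimal sketch of that first option, assuming the identity has already been granted a data-plane role (such as Storage Blob Data Reader) on the target storage account; the account name and client ID are placeholders:

using Azure.Identity;
using Azure.Storage.Blobs;

// DefaultAzureCredential transparently requests tokens from the local
// managed identity endpoint; no secret is stored in configuration.
var credential = new DefaultAzureCredential(new DefaultAzureCredentialOptions
{
    ManagedIdentityClientId = "<user-assigned-identity-client-id>" // omit for system-assigned
});

var blobService = new BlobServiceClient(
    new Uri("https://<account>.blob.core.windows.net"), credential);

The same credential type also covers workload identities on AKS, which is one reason the Azure.Identity package is the recommended entry point.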

Figure 6.4 – Code accessing a resource using Managed Identities.
Figure 6.4 – Code accessing a resource using Managed Identities.

The first step is to request an access token for the target resource. When using the Azure SDK (typically Azure.Identity), the request is made transparently by the SDK to a local endpoint, not directly to Entra ID. This local endpoint, provided by Microsoft, acts as a proxy that performs the actual token request to Entra ID and returns the token to the application. Note that this proxy layer caches tokens for up to 24 hours.

Once the token is obtained, the application can use it to access the resource. The managed identity used in this process can be either system-assigned or user-assigned. System-assigned identities are tied to the lifecycle of the resource, such as a web app, and are automatically created and deleted with it. User-assigned identities are standalone Azure resources that can be associated with one or more services. The main benefit of using managed identities is that the identity credentials are entirely managed by Azure itself.

For accessing Azure Key Vault, which is commonly used to store sensitive information, we can either follow the approach illustrated in Figure 6.4, where the code explicitly retrieves secrets through the proxy, or rely on Azure to fetch the secret on our behalf—making the code agnostic of where the secret is stored—as shown in Figure 6.5.

Figure 6.5 – Using Key Vault References to fetch secrets
Figure 6.5 – Using Key Vault References to fetch secrets

Although this might look similar to Figure 6.4, it is not at all the case. Dotted circles (steps 1 and 2) are actions performed by Azure, not by our code. In this case, our code accesses sensitive information as any regular configuration setting. Azure itself pulls the secret from Key Vault when the web app starts, thanks to a Key Vault Reference that we define in the settings. Such a reference looks like this: @Microsoft.KeyVault(SecretUri=https://myvault.vault.azure.net/secrets/mysecret). Key Vault references can be used from various services, including App Services, Function Apps, API Management, and Azure App Configuration. With this approach, your code doesn't need to handle managed identity authentication or connect directly to Key Vault—Azure resolves the secret for you. However, a limitation of this method is that Azure doesn't automatically refresh the reference when a secret is rotated, since it re-fetches Key Vault every 24 hours or, at best, every 4 hours. In contrast, when accessing Key Vault programmatically, you have more control—for example, caching the secret value and implementing a retry mechanism if an expired secret causes a failure. You might choose a specific method based on how often the secret is rotated.

When managed identities are not available—such as when the application runs in AKS—we must use workload identities instead. While the core principle remains the same (avoiding the need to store credentials in application configuration), the implementation differs significantly. With workload identities, Azure uses federated identity credentials linked to a Kubernetes ServiceAccount, which in turn is linked to a pod. The Azure identity is bound to the ServiceAccount via a trust relationship, allowing the application to authenticate securely to Azure without storing secrets. This approach relies on OIDC federation between the AKS cluster and Entra ID. Your application then requests an access token from Entra ID in exchange for the projected Kubernetes token. No proxy is involved in this case. Note that AKS used to support pod-managed identities in the past, but this has been deprecated, making workload identities the only option.

All of this, whether using managed or workload identities, requires apps to be registered in Entra ID. You can rely on the Azure.Identity SDK for both managed and workload identities. As you can see, fully grasping these concepts requires a significant investment of effort and attention, but it is mandatory knowledge.

From a service perspective, you should become familiar with Azure Functions, Service Bus, Azure SQL, Azure Storage, Azure Key Vault, Redis Cache, API Management, App Configuration, and Application Insights, as these services are commonly used in Azure-hosted solutions. Importantly, they all support integration with managed identities. Needless to say, Azure Functions is a way to interact with most other Azure services through its bindings and triggers: https://learn.microsoft.com/en-us/azure/azure-functions/functions-triggers-bindings. Each of these services has a corresponding SDK, which you should look at. Our use case, supported by demo code, will demonstrate how some of these services can be used in practice, hence why we will not focus more on them right now.

It's worth highlighting that Application Insights is a one-stop shop for developers, providing valuable insights for troubleshooting, performance monitoring, and overall application health. It has a built-in rich user interface and supports KQL queries, enabling quick identification of issues. This service should be integrated from the development environment onward to analyze behavior and catch potential problems early in the application lifecycle.

Figure 6.6 – Screenshot of Application Insights
Figure 6.6 – Screenshot of Application Insights

Figure 6.6 illustrates the various blades available to us for monitoring performance, catching failures, and measuring availability. Application Insights makes it easy to identify dependencies and lets you drill down into API operations, up to the number of SQL calls performed by a given API operation. Application Insights alone makes it possible to track down problems such as N+1 queries, which are often the consequence of improper use of Object Relational Mapping (ORM) libraries such as Entity Framework.

Another very handy tool for troubleshooting is Kudu, which is available for Function Apps, Logic Apps, and App Services. Kudu is available from the service's development tools blade, and you can use it to verify whether a package was deployed correctly by inspecting its contents. You can also run handy commands such as nslookup and Invoke-WebRequest if you want to test something from within the execution context of your code.

When it comes to databases, solutions often involve Azure SQL or other relational database engines, along with NoSQL options like Cosmos DB. Redis, another NoSQL service, can serve as either a non-persistent or persistent data store based on durability needs, though it is most commonly used in a non-persistent manner. Finally, Azure Storage is used in most Azure solutions, either as a technical component—when used with Azure Functions, Logic Apps, and so on—or as an application integration component where business data is explicitly persisted by the application. Whatever the use case, you should avoid storing blobs in databases and use Azure Storage instead.
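As a quick illustration of the KQL-based drill-down mentioned above, a query against the standard Application Insights requests table could look like this sketch:

// Top 10 slowest API operations over the last 24 hours
requests
| where timestamp > ago(1d)
| summarize avgDurationMs = avg(duration), callCount = count() by name
| top 10 by avgDurationMs desc

Let's now look at some cloud-native application design patterns.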

Zooming in on some cloud-native application design patterns and architecture styles

Figure 6.7 illustrates some recurrent architecture styles and design patterns that we can easily leverage when building cloud solutions.

Figure 6.7 – Common patterns and architecture styles
Figure 6.7 – Common patterns and architecture styles

While there are no hard rules for designing applications specifically for on-premises or cloud environments, on-premises systems are often monolithic, whereas cloud applications tend to be more distributed. This book will not delve into the debate over which approach is superior, as that is beyond its scope. Instead, it's important to note that the cloud can support any architectural style—including monoliths. One step toward breaking monoliths is adopting so-called modular monoliths, which are a way to introduce logical modularity at the level of the code itself while still deploying the application as a whole. Modular monoliths improve the readability and maintainability of the application code, but do not allow for independent deployment, scaling, or component-specific resilience mechanisms, unlike distributed architectures and microservices. Distributed monoliths, on the other hand, are the other extreme, as they give a false impression of autonomy and independence but are still tightly coupled. An example of a distributed monolith could be the split of a single service into three different services that call each other synchronously at runtime and that all read and write to the same database. Some call this microservices but, let's face it, these are just different pieces of the same puzzle that still highly depend on each other. Since the main subject of this book is not software architecture, let's simply look at common patterns that help decouple application components and bring higher resilience.

Most modern applications are API-centric, and Azure API Management supports patterns such as Gateway Offloading, Gateway Aggregation, Gateway Routing, and Circuit Breaker. All of these patterns are either built into the service or made possible through the use of policies. There is, however, one rule of thumb to respect: keep it simple and do not write business logic in policies. APIM policies are very powerful but not so easy to test. For example, there is no code coverage (unit testing); integration tests can be put in place in CI/CD pipelines using Postman Collections, but this requires extra effort and skills. Interestingly, APIM policies can be debugged using Visual Studio Code. However, simple aggregation using custom logic or GraphQL, as well as dynamic backend routing, can easily be put in place without compromising maintainability and testability. With the rise of Generative AI, Microsoft is positioning Azure API Management as an AI Gateway, for which many of the above patterns are used. Figure 6.8 shows a routing example across two different instances of Azure OpenAI, making the client totally agnostic of the underlying AI architecture.

Figure 6.8 – Gateway Routing across multiple Azure OpenAI instances
Figure 6.8 – Gateway Routing across multiple Azure OpenAI instances
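
Such a policy could look roughly like the following sketch; the backend pool name openai-pool is an assumption, and the exact policy depends on your setup:

<policies>
  <inbound>
    <base />
    <!-- Authenticate to Azure OpenAI with APIM's managed identity -->
    <authentication-managed-identity resource="https://cognitiveservices.azure.com" />
    <!-- Route to a load-balanced pool of Azure OpenAI backends -->
    <set-backend-service backend-id="openai-pool" />
  </inbound>
  <backend>
    <!-- Fail over to another backend on throttling or unavailability -->
    <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode == 503)"
           count="2" interval="1" first-fast-retry="true">
      <forward-request buffer-request-body="true" />
    </retry>
  </backend>
  <outbound>
    <base />
  </outbound>
</policies>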

In Figure 6.8, Azure API Management is configured with a backend pool consisting of two Azure OpenAI instances. A simple policy is applied to handle HTTP status codes 429 (Too Many Requests) and 503 (Service Unavailable), enabling seamless request failover between backends—completely transparent to the client. This pattern also allows Azure API Management to take over the authentication process: using its managed identity, it acquires access tokens and authenticates with the OpenAI instances. As a result, clients do not interact directly with the AI backends, enhancing both security and abstraction. As an Application Architect, you must consider such patterns whenever possible and offload such duties to Azure services instead of reinventing the wheel in code. The AI gateway pattern principles also apply to your own backend services. You can find a lot of samples in the GitHub repo https://github.com/Azure-Samples/AI-Gateway. Azure API Management goes far beyond these basic patterns and also helps standardize, monitor, secure, and design APIs regardless of the application use case.

Distributed applications often involve message and event brokers such as Azure Service Bus and Event Grid, and their related point-to-point, pub/sub, and load-levelling patterns, which we already explained in Chapter 2. Another common messaging pattern is the Claim Check pattern, which handles large messages by simply storing the large payload in an external system such as Blob Storage and embedding a reference to it in the message sent to the message broker, such as Azure Service Bus. The message handler can then fetch the actual large payload from Azure Blob Storage instead of saturating Service Bus or storage account queues with large messages. The Claim Check pattern is very commonly used, so make sure to master it. A few years ago, using Claim Check was mandatory because Azure Service Bus message sizes were rather limited. Nowadays, we can store large messages directly in Service Bus, but it is nevertheless a questionable approach to do so.

When working with data stores, it's useful to distinguish between code-level design patterns—such as Command Query Separation (CQS) and Event Sourcing—and native data store features like Change Data Capture (CDC) or Cosmos DB's Change Feed. Command Query Responsibility Segregation (CQRS) bridges both domains, for instance by routing write operations to a primary Azure SQL database while directing read-only queries to the replicas that Azure SQL can make automatically available (a small sketch follows). As an application architect, it is essential to evaluate and leverage built-in data capabilities that support or enhance design patterns. For example, the Change Feed in Cosmos DB can support event sourcing scenarios and facilitate the creation of materialized views or cache invalidation when used with services like Azure Functions.
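
To make the CQRS example concrete, here is a hedged sketch relying on Azure SQL read scale-out (available on the Premium, Business Critical, and Hyperscale tiers); the server and database names are placeholders:

// Writes go to the primary replica, reads to a readable secondary.
// The only difference is the ApplicationIntent keyword.
const string writeConnection =
    "Server=tcp:contoso.database.windows.net,1433;Database=Orders;" +
    "Authentication=Active Directory Default;ApplicationIntent=ReadWrite;";
const string readConnection =
    "Server=tcp:contoso.database.windows.net,1433;Database=Orders;" +
    "Authentication=Active Directory Default;ApplicationIntent=ReadOnly;";

Let's now look at a practical example with our use case.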

Use case – Serverless invoice processing pipeline

Scenario and first analysis

Contoso is a financial services firm that receives hundreds of customer invoices daily from partners. These files are uploaded in JSON format and must be validated, parsed, and then routed to different back-office systems based on their contents (for example, domestic versus international payments, amounts subject to further approval, and so on). They are looking for a scalable solution that can handle peak loads while remaining cost friendly.

Let's first extract the keywords in this scenario that demand our focus:

  • Hundreds of invoices per day: Hundreds, not thousands or millions. Given that invoices are also typically rather small documents, we know that the overall throughput will remain quite limited.
  • Peak loads: We know that the load will not be linear. If we have hundreds of invoices per day, this might be 200 in the morning and 200 at the end of the day, or there might be busier days than others. In any case, we remain at relatively low volumes but should still be able to scale when needed.
  • Financial services: Contoso is in the financial sector, so we can imagine that the solution should adhere to high security standards, although none are explicitly expressed in the scenario.
  • Validation and parsing: We must foresee a validation and parsing process.

After a bit of digging, you went to the drawing board and ended up with a logical view shown in Figure 6.9.

Diagrams

Figure 6.9 shows a high-level diagram of our invoice processing pipeline:

Figure 6.9 – Contoso invoice processing logical view
Figure 6.9 – Contoso invoice processing logical view

We assume that customers will submit their invoices through Contoso's existing Managed File Transfer (MFT) solution—such as GoAnywhere, a widely used platform. The MFT solution will handle antivirus scanning and, once the files are cleared, upload the invoices to the application's storage account. In the unlikely event that Contoso, a financial organization, would not have any MFT solution, you could suggest leveraging the built-in storage account SFTP features, with Defender for Storage as the antivirus. The latter could also be used as a second-line antivirus scanning tool. You might as well suggest an API layer that customers use to send their invoices instead of relying on SFTP. However, most financial organizations still rely on good old file transfer services as ingestion channels. Regardless of how files land in our storage account, it is interesting to see how to handle them.

We know that files should be relatively small, so we can decide to use Azure Functions Elastic Premium to parse and validate the files and get our code triggered thanks to the Blob Trigger. The validation layer will send messages to the invoices topic with some promoted properties that are used by the subscriptions. Our back-office components will have three subscriptions on the invoices topic (a sample filter rule sketch follows the list):

  • The invoices.auto-approved subscription, which filters all valid invoices for which the amount is less than or equal to 10,000 euros.
  • The invoices.approval-required subscription, for which the amount is greater than 10,000 euros.
  • The invoices.invalid subscription, which receives invalid invoices.
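
As an illustration, a filter rule for the auto-approved subscription could be created with an Azure CLI command like the following sketch; the property names (amount, version) are indicative, and the actual rules ship with the IaC script in the repo:

az servicebus topic subscription rule create `
  --resource-group $resourceGroupName `
  --namespace-name $serviceBusNamespace `
  --topic-name invoices `
  --subscription-name invoices.auto-approved `
  --name auto-approved-v1 `
  --filter-sql-expression "amount <= 10000 AND version = 'v1'"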

We used a single topic because we know we are not dealing with large amounts of messages, so we're far from hitting any of the entity-level limits. An alternative could have been to split the topics further. We basically delegate the filtering and appropriate routing to Azure Service Bus itself using subscription filters. You might have noticed that we also filter based on the message version, as it is a best practice. Using message versions ensures no breaking change takes place should the message format change, provided the publisher updates the version number accordingly. This allows subscribers to gradually adapt and adopt the new version.

All our subscriptions have a dead-letter queue that our back office must monitor. Messages are dead-lettered by Azure Service Bus itself whenever they cannot be delivered for some reason (the max delivery attempt is reached, the max retention time has passed, and so on). To make sure we do not lose a single message, we must monitor the dead-letter queues and decide what to do with them. Back-office processing may or may not utilize Azure Functions to consume messages from the Service Bus subscriptions.

Beyond making sure messages are published with the correct properties, we also leverage the Claim Check pattern, as we will encapsulate the blob location in the message body instead of including the entire invoice body in our payload. This part will be more visible in the provided .NET code. In the provided code sample, the focus is limited to message validation and Service Bus configuration, allowing back-office components the flexibility to implement their own preferred message handling mechanisms.

For the sake of simplicity and cost control, we have not hardened the provided example from a network perspective, but beware that the corresponding physical view could look like Figure 6.10.

Figure 6.10 – Contoso invoice processing physical view
Figure 6.10 – Contoso invoice processing physical view

The MFT solution should be able to talk to our storage account, which is isolated from the Internet like any other component of our application. The MFT infrastructure might be hosted in Azure or anywhere else, but connectivity should be in place. It might use its own dedicated firewall or reuse the hub's one. In any case, this is beyond the control of the Application Architect, hence why we did not include this as part of the provided sample solution, which is entirely public facing. The focus here is on how to link Azure Functions to Azure Service Bus and leverage built-in features such as subscription filters. Let's look closer at some code.

Code samples

The Azure Functions runtime can be quite complex and isn't exactly beginner-friendly, as it comes with a steep learning curve. First of all, we must use the so-called isolated worker mode, which is gradually replacing the older in-process mode. This must be defined explicitly when creating the Azure Functions project.

In our scenario, we're using the runtime in a straightforward manner—to trigger code execution when a blob is added to a storage container and to enqueue a message to a Service Bus topic. At first glance, this should be simple, thanks to the built-in Blob Storage trigger and Service Bus output binding. However, as is often the case, the devil is in the details. Our design requires promoting custom metadata properties so that subscribers can filter messages and receive only the invoices relevant to them. Unfortunately, the Service Bus output binding doesn't currently support promoting such properties. To meet this requirement, we must instead use the native Service Bus SDK directly.

Secondly, while we haven't yet applied strict network hardening to our solution, we aim to follow best practices for authentication by leveraging Managed Identities to access both our business storage account and Service Bus. This approach, though more secure, can be challenging to implement—especially since most online examples rely on basic connection strings rather than identity-based access. Figure 6.11 shows our needs in terms of identity-based resource access.

Figure 6.11 – Identity-based resource access
Figure 6.11 – Identity-based resource access

We create a separate user-assigned identity to which we grant the Storage Blob Data Contributor and Storage Queue Data Contributor roles on the storage account that contains the invoices. Next, we grant the Azure Service Bus Data Sender role to the same identity over the Service Bus where we'll publish messages. For the sake of simplicity, we continued using key-based authentication for our technical storage account, as it does not hold any business-critical data. The technical storage account is used by the function app to log some information as well as keep track of the different triggers. It typically doesn't contain sensitive data, especially when not using Durable Functions. However, it's important to note that in enterprise environments, Azure policies may enforce restrictions that completely block key-based authentication by default. In such cases, you'll need to either switch to managed identity or formally request a policy exception.

The creation of the managed identity as well as the role assignments is done at the infrastructure level, but as an Application Architect, you must know your exact needs and should be able to document them properly. You must also understand the implications such a configuration has on the application code.

In our example, we use the default Blob Trigger, which is shown in the following function signature:

[Function(nameof(InvoiceFunctions))]
public async Task InvoiceLanded(
    [BlobTrigger("invoices/{name}", Connection = "AzureWebJobsBusinessStorage")] BlobClient blobClient,
    string name,
    FunctionContext context)

While this looks simple enough, the default behavior is that the AzureWebJobsBusinessStorage setting is expected to be a plain connection string using key-based authentication, not identity-based authentication. The convention to make the runtime "understand" that it should switch to identity-based authentication consists of defining a few extra environment variables (a sample configuration follows the list):

  • AzureWebJobsBusinessStorage__blobServiceUri: This setting indicates the full blob URL of the storage account, such as https://mapbook.blob.core.windows.net/.
  • AzureWebJobsBusinessStorage__queueServiceUri: This setting indicates the full queue URL of the storage account, such as https://mapbook.queue.core.windows.net/. At this stage, you might wonder why you need to pass the queue service URL although you're only dealing with a blob trigger. This is required by the runtime as it pushes poisoned messages into a queue. In our use case, it means that when the function fails to handle a blob, a corresponding poison message will be added to a storage account queue.
  • AzureWebJobsBusinessStorage__credential: This setting is required to indicate that we want to use identity-based authentication.
  • AzureWebJobsBusinessStorage__clientId: This setting is only required when using user-assigned identities instead of system-assigned ones. The client ID of the user-assigned identity must be specified.
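
Put together, the four settings look like the following sketch when expressed in local.settings.json form; locally the runtime falls back to your developer credentials, while in Azure the deploy script sets the same keys as Function App application settings:

{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "dotnet-isolated",
    "AzureWebJobsBusinessStorage__blobServiceUri": "https://<account>.blob.core.windows.net/",
    "AzureWebJobsBusinessStorage__queueServiceUri": "https://<account>.queue.core.windows.net/",
    "AzureWebJobsBusinessStorage__credential": "managedidentity",
    "AzureWebJobsBusinessStorage__clientId": "<user-assigned-identity-client-id>"
  }
}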

Transitioning from key-based to identity-based authentication introduces its own layer of complexity. Once our function is triggered by the arrival of a blob, it is supposed to handle it and queue a message to the invoices topic. As explained earlier, we must use the plain Service Bus SDK to do so, because the default provided output binding doesn't allow us to publish messages with custom properties. Therefore, we need to perform a bit of dependency injection plumbing to get things done:

var host = new HostBuilder()
    .ConfigureFunctionsWorkerDefaults()
    .ConfigureAppConfiguration((hostingContext, config) =>
    {
        config.AddEnvironmentVariables();
    })
    .ConfigureServices((context, services) =>
    {
        var configuration = context.Configuration;
        var serviceBusFQDN = configuration["ServiceBusConnectionFQDN"];
        var managedIdentityId = configuration["ServiceBusConnectionclientId"];

        // Register a single ServiceBusClient authenticated with the user-assigned identity
        services.AddSingleton(serviceProvider =>
        {
            var credential = new DefaultAzureCredential(new DefaultAzureCredentialOptions
            {
                ManagedIdentityClientId = managedIdentityId
            });
            return new ServiceBusClient(serviceBusFQDN, credential);
        });
        services.AddScoped<InvoicePublisher>();
    })
    .Build();

We basically add a singleton for our Service Bus client and we leverage the DefaultAzureCredential class (from the Azure.Identity package) to switch to identity-based authentication. The settings defined at the level of the Function App are used here. We also inject a custom InvoicePublisher that we can use from within the functions. The core of our function is the following code:

string blobUrl = blobClient.Uri.ToString();
_logger.LogInformation($"Pretend to handle blob \n Name:{name} \n");

// Claim Check pattern: the message only carries a reference to the blob
var payload = new
{
    BlobUrl = blobUrl
};
var json = JsonSerializer.Serialize(payload);
var message = new ServiceBusMessage(BinaryData.FromString(json))
{
    Subject = "CustomerInvoice",
    ContentType = "application/json"
};

// Publish with randomized custom properties so that every subscription receives traffic
await _publisher.PublishInvoiceAsync(
    blobUrl,
    new Random().Next(10) < 5, // valid or not
    new Random().Next(15000),  // amount
    "v1");                     // message version

We pretend to handle the blob (the point here is not to teach you how to parse a JSON file), then we add the link to the blob to the message (Claim Check pattern) and publish a message to the invoices topic with our custom properties. We set them randomly to make sure our different subscriptions will receive something. Our publisher object is just a wrapper around the Service Bus client. Note that this code does not include any custom exception handling—exceptions will bubble up by default. If you choose to implement your own exception handling, it is crucial to re-throw the exception afterward. Failing to do so may cause the Azure Functions runtime to treat a failed execution as successful, which directly affects how poison and non-poison messages are handled. This can lead to undetected issues in your processing pipeline. In fact, it's safer to avoid custom exception handling altogether than to implement it incorrectly and risk suppressing genuine errors.
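
If you do add handling around the publishing logic, a safe pattern is sketched below; the isValid and amount variables stand in for the randomized values shown in the sample above:

try
{
    await _publisher.PublishInvoiceAsync(blobUrl, isValid, amount, "v1");
}
catch (Exception ex)
{
    _logger.LogError(ex, "Invoice processing failed for {BlobUrl}", blobUrl);
    throw; // re-throw so the runtime records the failure and poison handling kicks in
}

Now, let's see how to test this code.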

Testing the solution

To test this solution in your own tenant, you must:

  • Have an Azure subscription with owner permissions. Follow this link to start a new trial if required: https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account.
  • Install Azure CLI. Follow the instructions available at https://learn.microsoft.com/en-us/cli/azure/install-azure-cli.
  • Clone the mapbook repo or download the code.
  • Have Visual Studio Code with the PowerShell extension (also available on Linux).
  • Open the cloned/downloaded folder with Visual Studio Code. Make sure to run az login before any other activity.
  • Navigate (cd) to this folder: Chapter 6\use-case\IaC.

Once in the correct folder, open deploy.ps1 and adjust the following variables to your needs:

$resourceGroupName = "mapbook-chapter6"
$location = "swedencentral"

All the resources will be created using a unique name by default. You may still come up with your own names if you want, in which case, feel free to adjust the other variables as well. If both the default resource group and location are OK, just leave them untouched. Once done, you can run the script ./deploy.ps1 from the Visual Studio Code terminal. The script takes only a few minutes to complete and deploys the resources shown in Figure 6.12.

Figure 6.12 – Resources deployed by our use case script
Figure 6.12 – Resources deployed by our use case script

The script also deploys the invoices container as well as a dummy invoice to facilitate the testing.

Figure 6.13 – Dummy invoice deployed into the invoices container
Figure 6.13 – Dummy invoice deployed into the invoices container

The script also assigns the required roles to the managed identity assigned to the Function App and deploys the environment variables that are picked up by the function code.

Figure 6.14 – Environment variables for identity-based authentication
Figure 6.14 – Environment variables for identity-based authentication

With the infrastructure fully deployed, you still need to publish the code. In a real-world situation, this step would be handled by CI/CD pipelines: the code would first be built and then deployed using Azure DevOps tasks or GitHub Actions. To keep things simple enough, we will deploy the code manually using the publishing profile of the Function App. Here are the steps required to publish the code:

  1. In the Azure Portal, navigate to the Function App and click on Get publish profile:

    Figure 6.15 – Getting the publishing profile of the function app
    Figure 6.15 – Getting the publishing profile of the function app
  2. Download it somewhere on your disk.
  3. Open Visual Studio 2022 or later by double-clicking on Chapter 6\use case\code\InvoiceFunctions\InvoiceFunctions.sln.
  4. Right click on the project and select Publish:

    Figure 6.16 – Importing the publishing profile
    Figure 6.16 – Importing the publishing profile
  5. Next, click on Import Profile and select the downloaded file. Accept all the default options.
  6. Next, click on the Publish button:

    Figure 6.17 – Publishing the code to the function app
    Figure 6.17 – Publishing the code to the function app
  7. Once the publish process is complete, you should see the function appear in green:

    Figure 6.18 – The function deployed to the function app
    Figure 6.18 – The function deployed to the function app
  8. Since we provided a default dummy invoice, you should see an invocation (it might take a few minutes before you see it). You must click on the link labeled Invocations and more, shown in Figure 6.18.

You will be redirected to the function invocation page and should see one successful execution:

Figure 6.19 – A successful function invocation
Figure 6.19 – A successful function invocation

If everything went fine, the function should have published a message to our Service Bus topic and this should land in one of the subscriptions:

Figure 6.20 – A message landing in one of the Service Bus subscriptions
Figure 6.20 – A message landing in one of the Service Bus subscriptions

You can upload a few extra files (they don't need to be real invoices, as we do not parse them anyway) to see how messages are distributed across subscriptions. Feel free to click on the different subscriptions and double-check their filter settings, and so on.

This seemingly simple use case underscores the importance of close collaboration with infrastructure teams, as they often carry out substantial configuration work behind the scenes to ensure your code runs properly. At the same time, it's equally critical for developers to understand the infrastructure requirements and constraints involved. Making this work is a shared responsibility—it requires coordinated effort from both sides. Let's summarize this chapter!

Summary

This chapter provided practical guidance for managing the local development experience in Azure, addressing common debugging challenges and the importance of staying productive while working with cloud services. It highlighted essential Azure services that every developer and architect should be familiar with, alongside foundational identity concepts like Managed Identities and OAuth/OIDC—often underappreciated yet critical in cloud-native design.

A few architectural patterns such as CQRS and Event Sourcing were discussed, with a strong emphasis on how Azure-native capabilities—like Cosmos DB's Change Feed—can simplify their implementation. The message is clear: don't reinvent the wheel in code when the Azure ecosystem already provides powerful building blocks and features.

A concrete invoice processing use case tied it all together, demonstrating how Azure Functions, Blob Triggers, and Service Bus can be orchestrated effectively—while also exposing real-world limitations, such as those found in output bindings. Ultimately, the chapter encourages Application Architects and developers to embrace Azure's ecosystem with a realistic mindset: while Azure enables rapid development, achieving robust, production-grade solutions comes with a steep learning curve and attention to subtle, critical details. In the next chapter, we will explore Data Architecture.

7 Data Architecture

Join our book community on Discord

https://packt.link/0nrj3

In this chapter, we start by mapping the core data platform capabilities to corresponding Azure services to help you select the right tools for your data architecture. We then explore Hybrid Transactional and Analytical Processing (HTAP)—a long-standing goal for data architects. Finally, we conclude by highlighting key aspects such as data governance and infrastructure enablers that support hybrid data scenarios.

More specifically, we'll look at:

  • The Data Architecture Map
  • Zooming in on data platform capabilities
  • Zooming in on HTAP
  • Zooming in on miscellaneous data aspects
  • Use case

Our use case focuses on ingesting data streams and processing them in real time to detect anomalies, while also enabling analysis of historical data.

It is important to note that Microsoft Fabric is Microsoft's unified, end-to-end data platform that integrates data engineering, data science, real-time analytics, and business intelligence into a single SaaS offering. It abstracts the underlying infrastructure, enabling organizations to focus on delivering insights rather than managing components. While Microsoft Fabric is becoming a strategic pillar of Microsoft's data vision, this book focuses on the architectural building blocks and services available within the Azure platform itself. As a result, Microsoft Fabric—which is positioned as a higher-level, opinionated Software as a Service—is considered beyond the scope of this book. Nevertheless, a few Fabric features will be discussed throughout this chapter.

Let us now explore the technical requirements.

Technical requirements

  • We will be using Microsoft Visio for the diagrams but the corresponding PNGs are also provided.
  • Visual Studio Code, Azure CLI, and Terraform to open and deploy the sample application.
  • Visual Studio 2022 or later if you want to rebuild the provided applications (device emulator and an Azure function) yourself. This is optional since the application artefacts are provided separately.
  • An Azure subscription with owner permissions is needed to deploy the provided code. You can start a free trial if necessary. Follow this link https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account
  • Terraform and Azure CLI.

Additional information will be provided in our use case section. Maps, diagrams, and code are available at https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/tree/master/Chapter07.

The Azure Data Architecture Map

The Azure Data Architecture Map, shown in Figure 7.1, should help you deal with the typical duties of a data architect, which we covered in Chapter 1, Getting Started as an Azure Architect.

Figure 7.1 – The Azure Data Architecture Map

Important note

To see the full map (Figure 7.1), you can download the PDF file available at https://github.com/PacktPublishing/The-Azure-Cloud-Native-Architecture-Mapook-Second-Edition/blob/master/Chapter07/maps/Azure%20Data%20Architecture.pdf.

Figure 7.1 has the following top-level groups:

  • Data platform capabilities: This section maps typical data capabilities to Azure services. Capabilities are the ingredients you must assemble to build your data architectures.
  • Hybrid Transactional and Analytical Processing (HTAP): HTAP is itself a capability, but its special flavor warrants a dedicated section. This will be explained later in the HTAP section.
  • Miscellaneous: This section regroups a few data-related topics such as data governance, infrastructure enablers, and more.

Let's get started with the data platform capabilities.

Zooming in on data platform capabilities

Patterns such as Kappa, Lambda, Medallion, and Lakehouse, to name a few, are built upon a common set of underlying capabilities. By aligning these capabilities with Azure services, we aim to simplify the process of selecting the right service for your specific architectural objectives. A few data services, such as Azure Databricks and Synapse Analytics, are versatile and allow you to tackle most data scenarios. Databricks is slightly more comprehensive, but both offerings are very complete. Organizations that have a Databricks ecosystem on-premises typically favor Azure Databricks, while Microsoft shops typically favor Synapse. Databricks originally emerged with a focus on data science, whereas Synapse was designed to address data warehousing and traditional business intelligence needs. Over time, Synapse has grown significantly to also cover modern data science aspects. Figure 7.2 illustrates the mapping between data platform capabilities and Azure services:

Figure 7.2 – Data platform capabilities mapped to Azure services

There are a few subcategories, such as data ingestion, data processing, serving layers, storage, and so on. Let's start with the data ingestion category.

Data ingestion

IoT Hub and Event Hubs are both highly scalable ingestion layers. As its name indicates, IoT Hub is the Azure service for any IoT scenario. The service is able to establish bi-directional communications with IoT devices and can be supplemented with IoT Edge and IoT Edge Modules to accommodate enterprise-grade scenarios matching industry standards, where industrial networks typically do not connect directly to the Internet. Event Hubs is a general-purpose ingestion service for high-throughput data streaming, commonly used for application telemetry and real-time data collection. For example, Azure API Management has a built-in log-to-eventhub policy that allows us to track any API operation call in near real time without compromising performance. Event Hubs can also be handy in Business Activity Monitoring (BAM) scenarios. Events landing in both IoT Hub and Event Hubs can be persisted in other data services, such as Azure Data Explorer, Cosmos DB, or ADLS Gen2, among others, using Stream Analytics or other processing services to process and transfer the data to downstream stores.
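To make the bi-directional nature of IoT Hub more tangible, here is a minimal device-side sketch based on the Azure IoT device SDK (the Microsoft.Azure.Devices.Client NuGet package); the connection string, transport, and payload are placeholders rather than values from our samples:

// Assumes the Microsoft.Azure.Devices.Client NuGet package.
using System.Text;
using Microsoft.Azure.Devices.Client;

var deviceClient = DeviceClient.CreateFromConnectionString(
    "<device connection string>", TransportType.Mqtt);

// Device-to-cloud: send one telemetry message.
await deviceClient.SendEventAsync(
    new Message(Encoding.UTF8.GetBytes("{\"temperature\": 4.2}")));

// Cloud-to-device: wait briefly for a command and acknowledge it.
Message command = await deviceClient.ReceiveAsync(TimeSpan.FromSeconds(30));
if (command != null)
{
    Console.WriteLine(Encoding.UTF8.GetString(command.GetBytes()));
    await deviceClient.CompleteAsync(command);
}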

Data processing

Data processing can take various forms—batch processing, stream processing, or event-driven processing of discrete data events. For scenarios that don't require Massively Parallel Processing (MPP), Azure Functions and Fabric Dataflow Gen2 are viable options. Azure Functions, already introduced in previous chapters, offer a code-based approach. In contrast, Fabric Dataflow Gen2 enables users to build ingestion and transformation pipelines using Power Query, without writing any code. This service is designed for power users and citizen developers who prefer a low-code experience.

When it comes to batch processing, both Azure Databricks and Synapse Analytics with the Spark runtime are very good options. One of the key decision factors in choosing Databricks or Synapse for batch processing is whether you need to look up reference data from external relational stores. At this stage, only Databricks provides this capability. Databricks is also the only one to offer an auto-scaling feature, which can be particularly useful when processing in bursts. Data Factory, being an ETL/ELT tool, is particularly suited for processing batches too, especially when the data sources are located on-premises, as we can rely on its Self-Hosted Integration Runtime (SHIR) to bridge the two worlds. Microsoft Fabric pipelines share many familiar concepts with Azure Data Factory and can also be used for batch processing.

Finally, for stream processing, Azure's established service is Stream Analytics, which integrates seamlessly with the broader Azure ecosystem. It supports various input sources such as IoT Hub, Azure Event Hubs, Blob Storage, and Data Lake Storage Gen2, and offers no fewer than fourteen output options. Stream Analytics is particularly well-suited for bridging the ingestion layer with downstream data stores. Stream Analytics is available in two forms: jobs and clusters. Clusters provide dedicated resources (single tenant) and can scale to process up to 400 MB per second in real time. Jobs leverage Azure's underlying multi-tenant infrastructure to run your stream processing logic and cannot digest as much data as clusters. Choosing between jobs and clusters depends on your specific needs, but if you are looking for guaranteed throughput, you should go for clusters.

For more advanced streaming scenarios, both Synapse and Databricks rely on Spark Structured Streaming, which can be customized further. When the downstream data services are part of the Fabric world, you should look at Fabric Real-Time Intelligence.
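As an illustration of event-driven processing of discrete events, here is a hedged sketch of an Event Hubs-triggered Azure Function using the isolated worker model; the hub name (telemetry) and the EventHubConnection app setting are assumptions for the example:

using Microsoft.Azure.Functions.Worker;
using Microsoft.Extensions.Logging;

public class TelemetryProcessor
{
    private readonly ILogger<TelemetryProcessor> _logger;

    public TelemetryProcessor(ILogger<TelemetryProcessor> logger) => _logger = logger;

    [Function("ProcessTelemetry")]
    public void Run(
        [EventHubTrigger("telemetry", Connection = "EventHubConnection")] string[] events)
    {
        // Events are delivered in batches; each item is a raw JSON payload.
        foreach (var body in events)
        {
            _logger.LogInformation("Processing event: {Body}", body);
        }
    }
}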

Data storage (raw)

Unlike operational databases such as Azure SQL or PostgreSQL—which are optimized for structured, transactional workloads—raw storage solutions like Azure Blob Storage, Azure Data Lake Storage (ADLS), or Microsoft OneLake are designed to store large volumes of unstructured or semi-structured data efficiently. These storage systems serve as cost-effective repositories for data that may later be transformed, analyzed, or processed in downstream analytics pipelines. Azure Files and Azure NetApp Files can be used as recipients of data transformation outputs (for example, reports), which can be shared across applications through mounted volumes. Azure NetApp Files is particularly suited for very high-throughput and ultra-low-latency workloads. Blob Storage is widely used across data and non-data applications for storing raw binary objects. ADLS builds on the same underlying service but introduces a true hierarchical namespace and is therefore optimized for data operations.
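The hierarchical namespace is easiest to grasp with a small sketch; the following assumes the Azure.Storage.Files.DataLake and Azure.Identity packages, and the account, file system, and paths are illustrative:

using Azure.Identity;
using Azure.Storage.Files.DataLake;

var serviceClient = new DataLakeServiceClient(
    new Uri("https://<account>.dfs.core.windows.net"),
    new DefaultAzureCredential());

DataLakeFileSystemClient fileSystem = serviceClient.GetFileSystemClient("bronze");
await fileSystem.CreateIfNotExistsAsync();

// Real directories, as opposed to the flat prefixes of plain Blob Storage.
DataLakeDirectoryClient directory = fileSystem.GetDirectoryClient("sales/2025/06");
await directory.CreateIfNotExistsAsync();

DataLakeFileClient file = directory.GetFileClient("invoices.json");
await file.UploadAsync(new BinaryData("[{\"id\": 1}]").ToStream(), overwrite: true);

Renaming or moving a directory is a single atomic operation on ADLS, whereas the same action on flat blob storage means copying every blob under a prefix; this is one reason ADLS is preferred for data pipelines.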

Serving and data visualization

The data serving and visualization layers are distinct yet closely related components. The serving layer exposes processed, analyzed, or transformed data to applications, services, and users. The visualization layer builds on this by presenting the data through dashboards, reports, charts, and other graphical formats. Power BI—now integrated into Microsoft Fabric—is the flagship service for data visualization, designed to empower business users with self-service analytics. Power BI enables the creation of reports and dashboards, which can also be easily added to external applications using Power BI Embedded. Alternatively, Azure Data Explorer Dashboards offer lightweight visualization capabilities, which are primarily geared toward data engineers and developers.

Other data capabilities

Azure AI Search is a general-purpose search service increasingly used as a vector database and a hybrid search engine, combining the strengths of both vector and keyword-based search. While Cosmos DB and Azure SQL also offer capabilities for full-text and vector search, Azure AI Search remains the most versatile for search-centric scenarios (a short sketch follows at the end of this section).

For time series data, Azure Data Explorer stands out as the most suitable service. It can ingest billions of rows and respond to queries within milliseconds, making it ideal for high-throughput, low-latency analytics.

When it comes to sharing data across internal teams or platforms, Synapse Workspace Federation and Databricks Delta Sharing are effective solutions. For external data sharing with third parties, Azure Data Share offers a more appropriate option.
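Returning to hybrid search, here is a hedged sketch using the Azure.Search.Documents package; the index name, vector field, and embedding helper are assumptions for the example:

using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

var searchClient = new SearchClient(
    new Uri("https://<service>.search.windows.net"),
    "wines-index",
    new AzureKeyCredential("<api key>"));

// GetEmbedding is a hypothetical helper returning the query's vector representation.
ReadOnlyMemory<float> embedding = GetEmbedding("full-bodied red");

var options = new SearchOptions
{
    Size = 5,
    VectorSearch = new()
    {
        Queries =
        {
            new VectorizedQuery(embedding)
            {
                KNearestNeighborsCount = 5,
                Fields = { "contentVector" }
            }
        }
    }
};

// Supplying both a text query and a vector query makes the search hybrid.
SearchResults<SearchDocument> results =
    await searchClient.SearchAsync<SearchDocument>("full-bodied red", options);

Now that we have identified a few capabilities, let's see how we could combine them to match common data patterns.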

Patterns to services

It is nearly impossible to list all the possible service combinations we can use to achieve typical data patterns, but we can at least try to identify a few. The Medallion pattern is best achieved with ADLS combined with Azure Databricks, whose native support for Delta Lake makes it easy to apply data transformation and enrichment across the bronze, silver, and gold layers. When it comes to Lambda and Kappa, there are many possibilities. An example of the Lambda pattern could be achieved using the services shown in Figure 7.3.

Figure 7.3 – Possible Lambda Architecture

We use Kafka-enabled Azure Event Hubs for data ingestion, combined with Databricks Structured Streaming to process the data in real time and store it in Delta Tables. For the batch layer, data lands in ADLS via an ETL process orchestrated by Azure Data Factory. Databricks pulls data from ADLS, applies the required transformations, and stores the results in Delta Tables. Power BI serves as the consumption layer, enabling end users to explore and visualize the prepared data. Alternatively, a traditional SQL database can be used as the final output destination—particularly when reporting needs or scalability requirements call for it.
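A note on the Kafka-enabled part: Event Hubs exposes a Kafka-compatible endpoint, so standard Kafka clients work unchanged. Here is a minimal, hedged producer sketch using the Confluent.Kafka package, with placeholder namespace and topic names:

using Confluent.Kafka;

var config = new ProducerConfig
{
    BootstrapServers = "<namespace>.servicebus.windows.net:9093",
    SecurityProtocol = SecurityProtocol.SaslSsl,
    SaslMechanism = SaslMechanism.Plain,
    SaslUsername = "$ConnectionString", // literal value expected by Event Hubs
    SaslPassword = "<event hubs connection string>"
};

using var producer = new ProducerBuilder<Null, string>(config).Build();

// The Event Hub behaves as a Kafka topic.
await producer.ProduceAsync("telemetry",
    new Message<Null, string> { Value = "{\"reading\": 42}" });

An alternative way to build a Lambda architecture is shown in Figure 7.4.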

Figure 7.4 – Alternative to build a Lambda architecture

In this scenario, we continue to use Azure Event Hubs as the ingestion layer, with or without Kafka protocol support. Optionally, Azure Stream Analytics can sit between Event Hubs and Azure Data Explorer to apply filtering or support multiple output targets. However, Data Explorer is also capable of ingesting data directly from Event Hubs. For the batch layer, Azure Data Factory is used to land raw data in ADLS. Data Explorer then pulls the data from ADLS and applies the necessary transformations. As in the previous architecture, Power BI is used as the serving layer to deliver insights to end users. For a Kappa architecture, you can follow the same approach as Lambda, focusing on the speed and serving layers only. Synapse and Fabric could also be used to realize both Lambda and Kappa. Let's now zoom in on HTAP.

Zooming in on HTAP

Traditionally, Online Transaction Processing (OLTP) systems like relational databases and Online Analytical Processing (OLAP) systems like data warehouses are separated due to their differing performance needs and data models. HTAP eliminates the need to separate these workloads, enabling real-time analytics directly on live data. HTAP is transformative because it eliminates the need for data copies, which are time-consuming to maintain, prone to staleness, and a source of inconsistencies. Additionally, in traditional architectures, application teams typically only consider their operational system as the system of record, while data teams rely on the raw layer available in their Data Lake. With HTAP, all teams can align on a single, shared data store as the authoritative source. However, this advanced capability—long sought after by data architects—has yet to become mainstream. As you can see in Figure 7.5, there are currently only a handful of options.

Figure 7.5 – HTAP capabilities in Azure

Since true HTAP means that there should be no copy from the source system to an external analytical one, the only fully compliant HTAP option in Azure (as of June 2025) is Synapse Link for Cosmos DB, as Synapse directly interacts with the analytical store of Cosmos DB. The analytical store must be enabled at the container level; once done, Cosmos DB automatically syncs the transactional store with its analytical store. Thanks to Synapse Link, we can query the data using either Spark notebooks or SQL serverless endpoints. For best performance, you should try to keep Synapse Link in the same region as your Cosmos DB.

The Synapse Link for SQL feature is not true HTAP because data is copied from SQL to Synapse. However, functionally speaking, it is HTAP-like, as you do not have to build custom ETL flows to land data in Synapse. Finally, Fabric Mirrored Databases function the same way as Synapse Link for SQL, but with many more source systems and near real-time data replication.
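As a hedged sketch, enabling the analytical store at container creation with the .NET SDK (Microsoft.Azure.Cosmos package) could look as follows; the account, names, and throughput are illustrative:

using Microsoft.Azure.Cosmos;

var client = new CosmosClient("<account endpoint>", "<key>");
Database database = await client.CreateDatabaseIfNotExistsAsync("sales");

var properties = new ContainerProperties("orders", "/customerId")
{
    // -1 retains analytical data indefinitely; Cosmos DB then syncs the
    // transactional store to the analytical store automatically.
    AnalyticalStoreTimeToLiveInSeconds = -1
};

Container container =
    await database.CreateContainerIfNotExistsAsync(properties, throughput: 400);

Let's now explore other data-related aspects.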

Zooming in on miscellaneous data aspects

Figure 7.6 illustrates various other aspects related to data architectures.

Figure 7.6 – Various aspects related to data architectures

For anything AI-related, we encourage you to read Chapter 8, which is fully dedicated to AI. Chapter 9 covers the security-related aspects of data and non-data workloads. Let's see how infrastructure enablers can help bridge environments.

Infrastructure enablers

Today, most data sources are isolated from the public Internet, requiring the use of infrastructure components to establish secure connectivity between data services and those sources. Historically, the On-premises Data Gateway (ODG) was the primary solution for enabling cloud-based services like Power BI to access on-premises data sources. It remains widely used today for the exact same purpose but is no longer the only option. To use it, the gateway must be installed on one or more on-premises servers with connectivity to the target systems. Beyond data access, ODG can also integrate with systems like IBM MQ. One of its key advantages is support for authentication protocols such as NTLM and Kerberos—something that is difficult to achieve with PaaS or SaaS components. Figure 7.7 illustrates how we can take advantage of ODG:

Figure 7.7 – On-premises Data Gateway

ODG initiates an outbound connection to the Azure Relay service, which bridges ODG and the target cloud services such as Power BI and Logic Apps. The outbound connection is bi-directional, and ODG can receive commands to execute from the cloud components. Connection strings and configuration settings are defined in the cloud and executed on-premises.

The Self-Hosted Integration Runtime (SHIR), introduced earlier, works more or less the same way but is specific to Data Factory and Synapse Pipelines (or their Fabric counterparts) as the target cloud services. Any ETL/ELT flow between on-premises and the cloud involves the SHIR. Of course, the SHIR can be installed in other clouds or even on Azure IaaS. For example, two different Azure-hosted solutions running self-hosted SQL servers might leverage a cloud-based SHIR to exchange data.

The Managed Virtual Network (MVN) is particularly useful for enabling services like Data Factory or Microsoft Purview to connect to PaaS data sources that are not exposed to the public Internet. MVN acts as a secure tunnel between the calling service and the target data source. Within the MVN, the service can deploy Managed Private Endpoints, which must be explicitly approved by the target data service. Once approved, the calling service gains access to the data source through the private endpoint it created, ensuring secure and controlled connectivity.

Last but not least, the Virtual Network Data Gateway (VNDG) is specific to Azure and requires Virtual Network integration. Its primary purpose is to enable Power BI and Fabric to access Azure data stores that are not exposed to the public Internet. For instance, if the Power BI service needs to retrieve data from a Synapse Workspace secured behind private endpoints, VNDG serves as the secure bridge between the two worlds. Figure 7.8 illustrates an end-to-end integration between Power BI and Synapse through VNDG.

Figure 7.8 – End-to-end flow between Power BI and Synapse using VNDG

Figure 7.8 illustrates that your primary concern is deciding where to deploy the VNDG. Configuration is handled directly within the Power BI service, and a single gateway can be shared across multiple Power BI workspaces. You can choose between a shared or dedicated VNDG, depending on your architecture. In the scenario shown in Figure 7.8, a dedicated VNDG is assumed, as it resides in the same Virtual Network as the Synapse Workspace.

The most critical requirement is ensuring that the VNDG has network connectivity to the target data sources and that firewall rules allow outbound traffic to Entra ID, which is essential for acquiring the tokens required by Power BI connectors. A common misconfiguration by infrastructure and data teams is overlooking Entra ID traffic, which prevents the gateway from completing authentication and establishing a successful connection. This configuration error frequently results in support requests to Microsoft, so it's important to keep it in mind.

In conclusion, all these infrastructure enablers allow you to bridge cloud services with the data center where they are installed. Let's now look at operational (OLTP) databases.

Operational Database Systems

Azure offers a variety of OLTP systems, including relational databases like Azure SQL and PostgreSQL. However, the flagship OLTP service is undoubtedly Cosmos DB. As of 2025, it remains the only Azure service that supports multi-master writes across multiple regions at global scale, while also delivering sub-2-millisecond latency for both reads and writes. Cosmos DB is packed with features that make it a uniquely powerful offering.

That said, as is often the case with feature-rich platforms, teams sometimes adopt Cosmos DB too quickly without fully understanding its design principles. This can lead to suboptimal performance, escalating costs, and ultimately, a need to revisit the design. Let's take a closer look at this powerful yet complex service to better understand how to use it effectively through a mini use case scenario.

Let's imagine that you are developing a mobile app to manage wine cellars. The app is made available through the different app stores, and you expect potentially many users, so you have to make it scalable and accessible worldwide. A high-level solution could be something like we explained in Chapter 3, Infrastructure Design, in our Global API Platform use case, where you proxy your backend services with a multi-region API Management instance and have backend services in corresponding regions, each talking to their own regional Cosmos DB. Figure 7.9 shows a simplified version of it to refresh your memory.

Figure 7.9 – Simplified multi-region solution

As a global service, Azure Front Door is inherently multi-region aware. Our API layer builds on this by leveraging Azure API Management's native multi-region capabilities. On the backend, services are deployed across multiple regions, and Cosmos DB is configured with multi-master writes in all enabled regions. Together, these elements ensure that our architecture consistently delivers the fastest and most responsive user experience—regardless of the user's location worldwide.

However, global distribution alone is not enough—performance also depends heavily on how your Cosmos DB is designed. Without following a few key design principles, even the most elegant and appealing mobile app can quickly become sluggish and less engaging for end users. Here are a few essential guidelines to consider when using Cosmos DB:

  • Avoid hot partitions: Cosmos DB scales horizontally by distributing data across multiple logical partitions, which are then mapped to physical partitions. If data is not evenly distributed—often due to a poorly chosen partition key—some partitions may receive a disproportionate volume of requests or data. This results in hot partitions, which can degrade performance and eventually lead to throttling or full partitions.
  • Avoid cross-partition queries: Cosmos DB cross-partition queries occur when a query needs to access data from multiple logical partitions, typically because the query lacks a partition key filter or spans multiple partition key values. These queries increase latency and RU (Request Unit) consumption, leading to slower performance and increased costs.
  • Evaluate read/write ratio: This aspect is often overlooked by application and data teams, yet it plays a critical role in performance tuning. By default, Cosmos DB indexes all attributes, which benefits read-heavy workloads but can negatively impact write-heavy ones. In write-intensive scenarios, it is recommended to define a custom indexing policy that limits indexing to only the fields required for queries (see the sketch after this list). This reduces write latency and optimizes resource consumption.
  • Mind hard limits: You must always pay attention to hard limits such as the maximum document size, maximum partition size, and so on.
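As announced in the read/write ratio guideline above, here is a hedged sketch of a write-optimized indexing policy with the .NET SDK (Microsoft.Azure.Cosmos package); the container, partition key, and paths are illustrative:

using Microsoft.Azure.Cosmos;

var client = new CosmosClient("<account endpoint>", "<key>");
Database database = await client.CreateDatabaseIfNotExistsAsync("appdb");

var properties = new ContainerProperties("orders", partitionKeyPath: "/customerId");

// Exclude everything by default, then index only the fields used in queries.
properties.IndexingPolicy.IncludedPaths.Clear();
properties.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/status/?" });
properties.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/*" });

Container container = await database.CreateContainerIfNotExistsAsync(properties);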

Many of these considerations can be addressed by choosing an appropriate partition key strategy. However, some best practices may conflict with one another. For example, selecting a GUID as the partition key can ensure even data distribution and help avoid hot partitions—but it may not align with your query patterns. This misalignment can result in frequent cross-partition queries, which may increase latency and RU consumption.

Back to our wine cellar mobile app scenario, let's review these considerations against our use case. We can safely assume the following:

  • Avoid hot partitions: Overall, data should be reasonably evenly distributed, as it is unlikely we'll encounter extreme imbalances, such as some cellars containing millions of bottles while others have only a few.
  • Avoid cross-partition queries: Each user will only access their own collection of wines, meaning that the majority of queries will target a single cellar. While some back-office reporting scenarios may require cross-cellar queries, these are expected to represent a small fraction of the overall query workload.
  • Evaluate read/write ratio: Users will most likely generate many more reads than writes, as they will be searching for wines, looking at their KPIs, and so on. Writes should be limited to adding/removing bottles from the cellar. We can safely consider this scenario as read-heavy.
  • Mind hard limits: A single cellar should not exceed the current hard limit of 20 GB per logical partition. Given typical usage patterns, it is unlikely that a single cellar would contain enough wine documents (wine references) to reach this limit. Likewise, a single wine reference should not exceed the current limit of 2 MB per document, provided we store bottle pictures (if any) outside of Cosmos.

From a functional perspective, we can expect the mobile app to offer a dashboard view as the home screen to help users quickly get aggregates such as the total number of wines in the cellar, average price, quantity per color, and similar metrics. The app should also facilitate the addition of a new bottle by reusing reference data such as wineries, appellations, merchants, and more. With all of that in mind, we may consider the design illustrated by Figure 7.10:

Figure 7.10 – Multi-tenant wine cellar example with Cosmos DB

We have two containers: WineContainer and RefDataContainer. Each container holds multiple document types. WineContainer uses tenantId as the partition key; this is the userId extracted from an access token obtained by the mobile app, which uniquely identifies the user (the owner of the cellar). This means that we'll have as many logical partitions as users (even distribution). WineContainer has two document types, both sharing the same partition key:

  • The Wine document type represents individual bottles, with each document corresponding to a single bottle. To optimize search capabilities within the app and avoid cross-partition queries, a few reference metadata attributes, such as the winery name, are embedded directly within the document. In Cosmos DB, redundancy and denormalization are the norm.
  • A critical point is that every query should always include the tenantId partition key, which ensures that queries are scoped to a single partition. Additional filters (for example, wineryName) can be applied without negatively impacting performance, as long as the query remains within the same partition. In this regard, the design is sound and efficient.
  • WineAggregates is a separate document type with its identifier (id) set to the tenantId, enabling efficient point reads for the mobile app's home screen. As noted earlier, point reads and writes in Cosmos DB provide sub-2-millisecond latency, ensuring a fast and responsive user experience, while also being the most cost-friendly operations.

To compute aggregates, we can rely on Azure Functions with a Cosmos DB trigger that uses the Change Feed to capture changes made to wine documents. It's important to note that the Change Feed does not capture delete operations. Therefore, a soft delete strategy—such as marking a wine as deleted with a flag—should be used to ensure the function sees those deletions and updates aggregates accordingly. We can also optionally leverage the Change Feed to capture reference data changes and update wines accordingly. Because reference metadata updates are typically infrequent, this shouldn't significantly impact overall performance.

For reference data, we use a dedicated container with the type attribute as the partition key. This approach allows us to easily distinguish between different document types within the same container. As a result, we will have only three logical partitions—one each for merchants, wineries, and appellations. Given the relatively small size of reference data, we are confident that none of these partitions will approach the 20 GB limit. Furthermore, every query against RefDataContainer should include the type attribute, ensuring that operations target a single partition for optimal performance and cost efficiency.

For more advanced scenarios, consider using hierarchical partition keys and/or further separating containers and document types to better align with access patterns and scalability needs.
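To make the aggregate computation concrete, here is a hedged sketch of such a Change Feed-triggered function in the isolated worker model; the Wine and WineAggregates shapes, names, and app settings are assumptions based on the design above, and the aggregation logic is deliberately naive:

using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Functions.Worker;

public record Wine(string id, string tenantId, int quantity, bool isDeleted);

public class WineAggregates
{
    public string id { get; set; }        // equals the tenantId for point reads
    public int totalBottles { get; set; }
}

public class WineAggregatesUpdater
{
    private readonly Container _container; // WineContainer, injected at startup

    public WineAggregatesUpdater(Container container) => _container = container;

    [Function("UpdateWineAggregates")]
    public async Task Run(
        [CosmosDBTrigger(databaseName: "winedb", containerName: "WineContainer",
            Connection = "CosmosConnection",
            LeaseContainerName = "leases",
            CreateLeaseContainerIfNotExists = true)] IReadOnlyList<Wine> changes)
    {
        foreach (var group in changes.GroupBy(w => w.tenantId))
        {
            // Point read: id and partition key are both the tenantId.
            WineAggregates aggregates = (await _container
                .ReadItemAsync<WineAggregates>(group.Key, new PartitionKey(group.Key))).Resource;

            foreach (var wine in group)
            {
                // Soft deletes arrive as updates carrying the isDeleted flag.
                aggregates.totalBottles += wine.isDeleted ? -wine.quantity : wine.quantity;
            }

            await _container.UpsertItemAsync(aggregates, new PartitionKey(group.Key));
        }
    }
}

We hope this example has helped illustrate just how critical thoughtful design is when working with Cosmos DB. Let's now look at some governance aspects.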

Governance

In our governance branch, we find Databricks Unity Catalog and Microsoft Purview. Both aim to provide centralized data governance; however, Unity Catalog is specific to Databricks environments, whereas Microsoft Purview is designed to manage data across a wide range of environments. Purview excels at metadata management, data cataloging, and lineage across a broad set of data services, while Unity Catalog is best for fine-grained data access control and governance within Databricks environments. Purview relies on some of the infrastructure enablers (SHIR, MVN) described earlier to gain access to a variety of data sources. Let's now bring some of the learned concepts to life by applying them to our use case.

Use case – Smart Fridge monitoring in retail stores

Scenario and first analysis: Contoso, a large supermarket chain, wants to proactively monitor all refrigeration units (for example, dairy fridges and frozen food freezers) across its stores. Each unit is equipped with temperature and humidity sensors that send telemetry every 30 seconds to a centralized system. To reduce product loss, ensure food safety, and optimize maintenance efforts, the company wants to automatically detect anomalies in temperature and humidity levels using sensor data. The company wants to react in near real time to issues, store historic data for audit purposes, and trigger alerts when anomalies are detected. The primary focus should be set on the anomaly detection mechanism.

Let's first extract the keywords in this scenario that demand our focus:

  • Detect anomalies in near real-time: Any deviation of fridge temperature should ultimately raise alerts. This means that we must foresee a speed layer in our architecture.
  • Auditing and historical data: We must find a data store that can scale to support both operational requirements and serve as a source for historical data analysis. This will be our serving layer.
  • Sensors: While the scenario doesn't explicitly mention it, the presence of sensors naturally points to an IoT context involving time series data. This implies the need for a solution that can both respond quickly to anomalies and scale to accommodate high data volumes. Although you haven't been provided with specific details regarding the number of devices or their data transmission frequency, you can reasonably anticipate a substantial volume of data—making scalability a key design consideration.

After some investigation, you concluded that a Kappa Architecture seems to fit the requirements, as the emphasis is primarily on real-time data—prompting a focus on the speed and serving layers. You went to the drawing board and ended up with a high-level view shown in Figure 7.11, discussed in the next section.

Diagrams

While there are often multiple ways to achieve the same goal, the Data Architecture Map guides us toward suitable services. For time series storage, Azure Data Explorer is a strong fit. Similarly, since sensors are IoT devices, they should communicate with an IoT-compliant service—namely Azure IoT Hub, which appears under the ingestion capability in our architecture map. We have identified the input and output channels, and we still need a service to bridge the two worlds. Although this isn't explicitly shown on the map, we learned that Stream Analytics can ingest data from various sources and supports multiple output options—including Azure Data Explorer, our Time Series Database (TSDB). Since it processes data in motion, Stream Analytics also enables us to compute aggregates and detect temperature anomalies in near real time, with the possibility of forwarding anomalies to a dedicated alerting system. Figure 7.11 is our high-level diagram.

Figure 7.11 – Kappa Architecture for Contoso

Here are the services we use:

  • IoT Hub as the entry point for the ingestion layer. IoT Hub is known to be scalable.
  • Event Hubs as the default IoT Hub backend that stores the events.
  • Stream Analytics as the message processing layer. Stream Analytics jobs or clusters are highly scalable and have been designed from the ground up to analyze, transform, and route data streams to many different destinations. In our design, Stream Analytics pulls data from Event Hubs and stores everything into Azure Data Explorer, our TSDB. It also fires alerts to a custom Azure Function whenever temperature anomalies are detected for a few minutes in a row; the purpose is to avoid a noisy system that raises too many alerts. Stream Analytics serves as the processing layer that bridges raw data ingestion with downstream output channels.
  • Note that Azure Data Explorer is able to ingest data from both IoT Hub and Event Hubs directly, but we still put Stream Analytics in the middle as we want to route our alerts directly to Azure Functions. Additionally, while Figure 7.11 is a workable design, an enterprise-grade solution would rather call for VNet-integrated components, with only IoT Hub exposed to the Internet.

However, for the sake of simplicity and cost control, the code sample makes use of public-only services and does not provision the IoT Hub, since the real ingestion layer from our processing pipeline's perspective is Event Hubs. For the same reason, we did not use Power BI, as it requires specific licenses, but it would be a valid option for our serving layer. The provided example provisions exactly what is shown in Figure 7.12.

Figure 7.12 – Detailed diagram of the provided sample

Our Infrastructure as Code provides the services illustrated in Figure 7.12. It also grants the required permissions to Stream Analytics to get input data and push it to Data Explorer. We also provide two .NET applications, namely, a device emulator and the function app's code to handle alerts raised by our Stream Analytics job. Let's look at the application code.

Code samples

The device emulator is a console program, which you can run to send events to Event Hubs.

// Top-of-file usings (not shown): System.Text.Json, Azure.Messaging.EventHubs,
// Azure.Messaging.EventHubs.Producer.
private const string eventHubName = "temperatures";
private static readonly string[] deviceIds = { "store01-fridgeA", "store01-fridgeB", "store02-freezerA" };
private static readonly Random random = new();
static async Task Main(string[] args){
    string connectionString = "";
    if (args.Length == 0 || string.IsNullOrEmpty(args[0]))
        throw new ApplicationException("Usage: ./DeviceEmulator <event hub connection string>");
    connectionString = args[0];
    var producerClient = new EventHubProducerClient(connectionString, eventHubName);
    while (true){
        using EventDataBatch eventBatch = await producerClient.CreateBatchAsync();
        foreach (var deviceId in deviceIds){
            var payload = new{
                deviceId,
                storeId = deviceId.Split('-')[0],
                temperature = Math.Round(5 + 5 * random.NextDouble(), 2), // e.g., 5–10 °C
                humidity = random.Next(60, 90),
                timestamp = DateTime.UtcNow
            };
            string json = JsonSerializer.Serialize(payload);
            var eventData = new EventData(json);
            if (!eventBatch.TryAdd(eventData)){
                Console.WriteLine("Event too large, skipping.");
                continue;
            }
            Console.WriteLine($"Sending: {json}");
        }
        await producerClient.SendAsync(eventBatch);
        await Task.Delay(TimeSpan.FromSeconds(5));
    }
}

The program expects to receive the Event Hubs connection string as an argument. It then generates and sends a series of random events every five seconds.

As stated earlier, anomalies should be detected by Stream Analytics and sent to a function. Here is the code of that function:

[Function("TemperatureFailureAlert")]
public async Task<IActionResult> Run([HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req){
    try{
        string body = await new StreamReader(req.Body).ReadToEndAsync();
        var events = JsonSerializer.Deserialize<List<SensorData>>(body, new JsonSerializerOptions{
            PropertyNameCaseInsensitive = true
        });
        foreach (var e in events){
            _logger.LogWarning(
                "Abnormal temperature {0} detected for Device {1} from store {2} alert: {3}.",
                e.AvgTemp, e.DeviceId, e.StoreId, e.AlertMessage);
        }
        return new OkResult();
    }
    catch (Exception ex){
        _logger.LogError(ex, "Exception occurred during function execution.");
        throw;
    }
}

This code is an HTTP-triggered Azure Function (POST) that receives a batch of events from our Stream Analytics job. Its current role is limited to logging the incoming events. In a real-world scenario, it would naturally process them further, but our objective here is simply to demonstrate the communication flow. As previously mentioned, Power BI Real-Time Dashboards could be a suitable alternative, enabling a monitoring team to detect anomalies almost immediately.

Paradoxically, the hardest and heaviest part of the code is dedicated to the Infrastructure as Code setup. For the sake of brevity, we'll focus only on the Stream Analytics queries, but you can explore the full code further on your own. Here are the queries performed by Stream Analytics:

SELECT deviceId, storeId, temperature, humidity, timestamp
INTO adx
FROM temperatures TIMESTAMP BY timestamp;

SELECT deviceId, storeId, System.Timestamp AS alertGeneratedAt,
       AVG(temperature) AS avgTemp,
       'ALERT: Avg temp > 3°C over 5 minutes' AS alertMessage
INTO alert
FROM temperatures TIMESTAMP BY timestamp
GROUP BY deviceId, storeId, TumblingWindow(minute, 5)
HAVING AVG(temperature) > 3;

The first query ingests all events from the Event Hub source (temperatures) and writes them to Azure Data Explorer (adx). The second query filters events and sends them to our function (alert) only when the average temperature exceeds 3 degrees over a 5-minute tumbling window. We intentionally set this low threshold to trigger alerts frequently, purely for demonstration purposes; in the real world, you would probably set the threshold to 6 degrees.
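The columns projected by the alert query map one-to-one to the SensorData model deserialized by the function. The actual class ships with the sample code; a plausible shape looks like this:

// Illustrative model; property names match the alert query's output columns,
// and PropertyNameCaseInsensitive takes care of the camelCase JSON.
public class SensorData
{
    public string DeviceId { get; set; }
    public string StoreId { get; set; }
    public DateTime AlertGeneratedAt { get; set; }
    public double AvgTemp { get; set; }
    public string AlertMessage { get; set; }
}

The remaining parts of the code are more complex and offer valuable insights, particularly in terms of automation—which proved to be far from straightforward. Let's now see how to test the solution.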

Testing the solution

To test this solution in your own tenant, you must:

  • Have an Azure subscription with owner permissions. Follow this link to start a new trial if required: https://azure.microsoft.com/en-us/pricing/purchase-options/azure-account.
  • Install Azure CLI. Follow the instructions available here https://learn.microsoft.com/en-us/cli/azure/install-azure-cli.
  • Install Terraform. Follow this link https://developer.hashicorp.com/terraform/install where you can find binaries for all operating systems. This only takes a few minutes!
  • Clone the map book repo or download the code.
  • Have Visual Studio Code.
  • Open the cloned/downloaded folder with Visual Studio Code. Make sure to run az login before any other activity.
  • Navigate (cd) to this folder: Chapter 07\use-case\IaC.

Once in the correct folder, open the config.yaml file and adjust the following variables to your needs:

location: "swedencentral"
resourceGroup: "mapbook-chapter07"
tenantId: "<tenantId>"

Note that the deployment script has only been tested against the swedencentral region, so you'd better stick to this region, as not all regions have the same pricing tiers available, especially for Data Explorer. If you use another region and encounter a failure, either adjust the service SKUs that we defined or switch back to Sweden Central. The only variable that you must adjust is tenantId. You can get your tenant ID with the following Azure CLI command:

az account show --query tenantId --output tsv

Once done, from the Visual Studio Code terminal, you can run the following commands:

terraform init
terraform apply --auto-approve

The script takes about 15 minutes to complete and deploys the resources shown in Figure 7.13.

Figure 7.13 – Resources deployed by our use case script

All the resources of our diagram are deployed. Note that the names are unique and will not be exactly the same in your environment. The first thing to do is to get the connection string of our Event Hubs (highlighted in Figure 7.13) that we will pass to our device emulator at a later stage. Follow these steps:

  • Click on the Event Hubs namespace.
  • Then, click on Shared access policies under the Settings blade.
  • At last, click on RootManageSharedAccessKey and get the primary connection string, as shown in Figure 7.14:

Figure 7.14 – Retrieving the Event Hubs connection string

Keep the value somewhere as we will reuse it later. For now, you should start the Stream Analytics Job. To do so, just click on the Stream Analytics resource available in your resource group and click on Start job. Take a moment to explore inputs and outputs, as well as the query, all shown in Figure 7.15.

Figure 7.15 – Stream Analytics Job

Now that you have started the job, you can send events to Event Hubs. Our job should pick them up, analyze them, and send everything to Data Explorer and anomalies to our Azure Function. Note that our function's code was also deployed automatically by the Terraform script.

To start the emulator, you can either open the provided .sln file with Visual Studio and rebuild the application, which will generate an executable file, or unzip the provided zip file named DeviceEmulator.zip. Once you have the executable file, open a PowerShell command line and execute the program:

./DeviceEmulator.exe <event hub primary connection string>

This should start sending events as shown in Figure 7.16.

Figure 7.16 – Sending events to Event Hubs using the emulator application.

Note:

The Event Hubs key was removed from Figure 7.16 but should be passed as a parameter. Additionally, the screenshot has been truncated because of size constraints.

It may take some time before data appears in Azure Data Explorer and before our Azure Function gets called. This delay is typically due to the startup time of the Stream Analytics job and the use of free or lower-tier pricing plans. So, don't worry if the data doesn't show up right away—this behavior is expected. After a few minutes, go to your Azure Data Explorer instance and check whether you have data, as illustrated in Figures 7.17 and 7.18.

Figure 7.17 – Checking anomalies in Data Explorer

Figure 7.18 shows how to view the entire dataset:

Figure 7.18 – Exploring sensor data in Data Explorer

The last step is to check whether our function got fired by our Stream Analytics job. Go to the Azure Function App and click on the link labelled Invocations and more. You should land on this page:

Figure 7.19 – Azure Functions invocation window

Clicking on any row will open a window detailing the execution of the function.

Figure 7.20 – Anomalies sent to our Azure function.

Note that Stream Analytics invokes the Azure Function at each query evaluation interval, which results in some function executions receiving an empty array ([]) when no anomalies are detected. This behavior is expected. You can see what is sent by Stream Analytics by testing the query in the Azure Portal and checking the function output.

Feel free to explore the end-to-end solution, and do not forget to delete the resource group afterwards to avoid unexpected costs. Let's summarize this chapter!

Summary

This chapter examined the essential building blocks of data architectures in Azure, mapping core capabilities—such as ingestion, processing, storage, and serving—to corresponding Azure services. We explored how these capabilities support established patterns like Lambda, Kappa, and Medallion, and how services like Azure Databricks, Synapse Analytics, Event Hubs, Stream Analytics, and Data Explorer play distinct roles depending on the scenario.

We also covered Hybrid Transactional and Analytical Processing (HTAP), identifying Synapse Link for Cosmos DB as the only Azure-native service that currently enables real-time analytics directly on transactional data without data duplication between the OLTP and OLAP systems.

Beyond platform services, we reviewed infrastructure enablers like the Self-Hosted Integration Runtime (SHIR), Managed Virtual Network (MVN), and Virtual Network Data Gateway (VNDG) that bridge secure connectivity gaps between cloud and on-premises or private environments.

The use of Cosmos DB as a flagship OLTP store was also explored in detail through an example of a multi-tenant mobile app for managing wine cellars, emphasizing best practices such as partitioning, indexing strategy, and leveraging the Change Feed for aggregate updates.

The chapter concluded with a retail use case involving smart fridges. A Kappa architecture was implemented using Terraform and .NET, with telemetry ingested through Event Hubs, processed in real time with Stream Analytics, and stored in Azure Data Explorer. Anomalies were forwarded to an Azure Function for alerting, illustrating a complete, scalable IoT analytics pipeline.

In the next chapter, we'll explore Azure's vast AI ecosystem with a focus on Generative AI.