Azure AD Architecture Explained


Some Components of Azure AD


In this blog, we will learn about the architecture of Azure AD and see how various design patterns are used in its design. Check out my Azure AD Explained blog to get a basic understanding of Azure Active Directory.


The Azure AD architecture uses several design patterns to ensure:

  • High Availability

  • Fault Tolerance and Fault Isolation

  • Scalability

  • Security

  • Collection of logs and metrics

  • Automated Recovery


At the end of this blog, I have added a Key Takeaways section; you can jump directly to that section as well.

Azure AD has stateless gateways, front-end services, and back-end services in all available datacenters. Each datacenter also has sync servers.

Overview



Azure AD partitions


Azure AD is composed of independent, scalable units (aka partitions). Front-end servers provide read and write capability through geographically distributed datacenters.


There are two kinds of replicas:

  • Primary Replicas

  • Secondary Replicas


Primary Replica


It handles all write operations, and every write operation is performed from the nearest datacenter.


It is further classified into two:

  • Active Primary: It is a single clustered write replica. In normal operation, all write requests are directed to this replica. Once a write is completed, the data is written to the Passive Primary as well.

  • Passive Primary: It is also a single clustered write replica. In normal operation, it receives data from the Active Primary. If the Active Primary fails, the Passive Primary takes over the Active role, and once the old Active Primary is back, that replica takes the Passive role. (The process of changing roles is also known as leader election.)

Data must be written to at least one more datacenter, besides the one that receives the write operation, to avoid data loss if a failure occurs.
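The active/passive behaviour described above can be sketched in a few lines. This is only an illustration under my own assumptions, not Azure AD's actual implementation: the class names `PrimaryReplica` and `PrimaryPair`, the in-memory dictionaries, and the `healthy` flag are all hypothetical.

```python
class PrimaryReplica:
    """Hypothetical in-memory stand-in for one primary replica in one datacenter."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True


class PrimaryPair:
    """Sends writes to the active primary and copies them to the passive primary."""

    def __init__(self, active, passive):
        self.active = active
        self.passive = passive

    def fail_over(self):
        # The passive primary takes the active role; the old active becomes passive.
        if not self.passive.healthy:
            raise RuntimeError("no healthy primary available")
        self.active, self.passive = self.passive, self.active

    def write(self, key, value):
        """Write a value and return how many primaries now hold it."""
        if not self.active.healthy:
            self.fail_over()
        self.active.data[key] = value
        copies = 1
        if self.passive.healthy:
            # Second copy in another datacenter, so one failure loses nothing.
            self.passive.data[key] = value
            copies += 1
        return copies

    def recover_passive(self):
        # When the failed replica comes back it stays passive and catches up.
        self.passive.healthy = True
        self.passive.data = dict(self.active.data)
```

A caller could treat a return value below 2 as "not yet durable" and hold the acknowledgement until `recover_passive()` has run; the sketch leaves that policy to the caller.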




Secondary Replica


It handles all read operations, and every read operation is performed from the nearest datacenter.


It comprises multiple clustered read replicas, located in different geographical locations.

All read replicas receive data asynchronously, which gives eventual consistency, not strong consistency. (With eventual consistency, a write does not reach every replica immediately; with strong consistency, a read never returns data older than the latest acknowledged write.)

With eventual consistency, there is always a chance of reading old data.

Azure AD uses the Graph API for writing. Each Graph API service maintains a logical session with affinity to a particular secondary replica, and during a write operation it always pulls the response synchronously from that secondary replica only. Once the data is returned, the other replicas are updated asynchronously, as discussed above.
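To make the "chance of getting old data" concrete, here is a small simulation of asynchronous replication. It is a sketch under my own assumptions: `Directory`, `ReadReplica`, and the explicit `sync()` step are hypothetical stand-ins for the real sync service.

```python
from collections import deque


class ReadReplica:
    """Hypothetical secondary replica that applies updates only when synced."""

    def __init__(self):
        self.data = {}
        self.pending = deque()  # updates received but not yet applied

    def apply_pending(self):
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value


class Directory:
    """One primary store plus several asynchronously updated read replicas."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [ReadReplica() for _ in range(replica_count)]

    def write(self, key, value):
        self.primary[key] = value
        for replica in self.replicas:
            # The update is only queued here; it is not visible to readers yet.
            replica.pending.append((key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        for replica in self.replicas:
            replica.apply_pending()


directory = Directory(replica_count=2)
directory.write("user:1", "v1")
directory.sync()
directory.write("user:1", "v2")
directory.replicas[0].apply_pending()  # only replica 0 has caught up
fresh = directory.read("user:1", 0)    # "v2"
stale = directory.read("user:1", 1)    # still "v1" until replica 1 syncs
```

A client that keeps reading from replica 0 sees its own write, which is the point of keeping a session tied to one replica.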



NOTE: I have named only a few of the design patterns used in designing AAD; there may be more.


High Availability


To ensure a highly available architecture, it uses the following design patterns:

  • Health Endpoint Monitoring: It continuously monitors the health of all services, at regular intervals.

  • Deployment Stamps: It has independent copies of services, along with databases.

  • Geodes: It has services that are distributed in a set of different geographical nodes.

  • Throttling: It controls how much of a resource a single application or caller can consume.
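As an illustration of the Throttling pattern from the list above, here is a token bucket, which is one common way to implement it. I am not claiming this is the algorithm Azure AD uses; the class name and parameters are my own, and the caller passes in the current time so the example stays deterministic.

```python
class TokenBucket:
    """Allows short bursts up to `capacity`, then `refill_per_second` requests per second."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Add the tokens earned since the last call, capped at the bucket size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = max(self.last, now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller is throttled
```

With `TokenBucket(capacity=2, refill_per_second=1.0)`, two requests at the same instant pass, a third is rejected, and one more is allowed a second later.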


Geographically distributed datacenters (using Deployment Stamps and Geodes) play a significant role in high availability.


Continuous monitoring of services (using Health Endpoint Monitoring) ensures that unhealthy services are detected. When a service is unhealthy, the Gateway Service performs load balancing and routes requests to healthy services.


It uses a single-master system (the Active Primary), with a carefully orchestrated, deterministic failover to the Passive Primary.
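The routing behaviour described above can be sketched as a gateway that periodically probes its backends and load-balances only across the healthy ones. Everything here is an assumption made for illustration: the `Gateway` class, the `probe` callable (standing in for a call to a health endpoint), and the round-robin policy.

```python
class Gateway:
    """Round-robins requests across backends that passed the last health check."""

    def __init__(self, backends, probe):
        self.backends = list(backends)
        self.probe = probe              # callable(backend) -> bool
        self.healthy = list(self.backends)
        self._next = 0

    def run_health_checks(self):
        # In a real system this would run at regular intervals.
        self.healthy = [b for b in self.backends if self.probe(b)]

    def route(self):
        if not self.healthy:
            raise RuntimeError("no healthy backend available")
        backend = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return backend


status = {"dc-a": True, "dc-b": True, "dc-c": True}
gateway = Gateway(status.keys(), probe=lambda name: status[name])
status["dc-b"] = False           # dc-b starts failing its health endpoint
gateway.run_health_checks()
targets = {gateway.route() for _ in range(4)}  # dc-b is never chosen
```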




Fault Tolerance And Isolation


To ensure fault tolerance and fault isolation, it uses the following design patterns:

  • Health Endpoint Monitoring: It continuously monitors the health of all services, at regular intervals.

  • Circuit Breaker: It prevents errors from cascading when a failure occurs.

  • Compensating Transaction: It undoes the steps already completed if a failure occurs in the middle of a write operation.



Each Azure AD service works in a de-correlated mode, which prevents the failure of a single service from bringing down the entire system (using Circuit Breaker).
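Here is a minimal circuit breaker to show the idea of failing fast instead of letting errors cascade. It is a sketch, not Azure AD's implementation: the thresholds, the `CircuitOpenError` name, and the injected `clock` callable are my own choices.

```python
class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_after, clock):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # callable returning the current time in seconds
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: reject immediately without touching the failing service.
                raise CircuitOpenError("circuit is open; failing fast")
            # Otherwise half-open: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In production code the clock would typically be `time.monotonic`; passing it in makes the breaker easy to test.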


Health Endpoint Monitoring ensures that unhealthy services are detected; when a service is unhealthy, the Gateway Service performs load balancing across the healthy ones.


If a failure occurs in the middle of a write operation, a Compensating Transaction undoes the steps that have already been completed.


Together, High Availability and Fault Tolerance and Isolation contribute to the continuous availability of Azure AD.
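The Compensating Transaction idea can be sketched as a list of steps, each paired with an undo action. The helper name `run_with_compensation` and the toy `store` are hypothetical and only illustrate the pattern.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs in order.

    If any action raises, the compensations of the steps that already
    succeeded are run in reverse order and False is returned.
    """
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        for compensation in reversed(completed):
            compensation()
        return False
    return True


store = {"users": set(), "groups": set()}


def failing_step():
    raise RuntimeError("write to the third store failed")


succeeded = run_with_compensation([
    (lambda: store["users"].add("alice"), lambda: store["users"].discard("alice")),
    (lambda: store["groups"].add("admins"), lambda: store["groups"].discard("admins")),
    (failing_step, lambda: None),
])
# succeeded is False and store is back to its initial, empty state
```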

Scalability


To ensure scalability, it uses the following design patterns:

  • CQRS (Command Query Responsibility Segregation): It uses different replicas for read and write operations. The Command part covers operations that change data (create, update, delete), and the Query part covers fetching data from the data stores.

  • Sharding (aka horizontal partitioning): It uses different data stores for different sets of clients.

  • Caching: It stores data in a key-value store (e.g. Redis), from which it can be fetched faster.


Partitioning (see the Overview section) plays a key role in write scalability (using the Active Primary and Passive Primary). For read operations, Azure AD maintains multiple replicas (using Secondary Replicas). Using separate replicas for reads and writes is an application of CQRS.


Using different data stores for different sets of clients ensures that each client can work without affecting the work of others (using Sharding).
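As a rough sketch of sharding, the snippet below maps each tenant to one partition with a stable hash, so all of a tenant's data lives in the same store. The hashing scheme, `partition_for`, and `ShardedStore` are assumptions made for illustration; the source does not say how Azure AD assigns clients to partitions.

```python
import hashlib


def partition_for(tenant_id, partition_count):
    """Map a tenant to a partition index using a stable (non-random) hash."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


class ShardedStore:
    """Keeps each tenant's records in exactly one of several independent stores."""

    def __init__(self, partition_count):
        self.partitions = [{} for _ in range(partition_count)]

    def _partition(self, tenant_id):
        return self.partitions[partition_for(tenant_id, len(self.partitions))]

    def put(self, tenant_id, key, value):
        self._partition(tenant_id)[(tenant_id, key)] = value

    def get(self, tenant_id, key):
        return self._partition(tenant_id).get((tenant_id, key))


store = ShardedStore(partition_count=4)
store.put("contoso", "user:1", "alice")
store.put("contoso", "user:2", "bob")
store.put("fabrikam", "user:1", "carol")
```

A plain modulo like this reshuffles tenants when the partition count changes; real systems usually add a lookup table or consistent hashing on top.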


Security


To ensure security, it uses the following:

  • MFA (Multi-Factor Authentication)

  • Auditing

  • Just-in-time Privileged Access Management


To sign in with MFA, the user first registers their account in an Authenticator app. Whenever the user wants to log in, Azure AD sends an approval request to the user's phone, or the user can enter the passcode provided by the Authenticator app (using MFA).
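To give a feel for how a time-based passcode can work, here is the standard TOTP algorithm (RFC 6238, built on RFC 4226) using only the Python standard library. I am assuming the passcode is a standard TOTP code, as in most authenticator apps; this is not taken from Azure AD's own implementation, and the function names are mine.

```python
import hashlib
import hmac
import struct


def totp(secret, unix_time, step=30, digits=6):
    """Compute a time-based one-time passcode (RFC 6238, HMAC-SHA1)."""
    counter = int(unix_time) // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation from RFC 4226
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify(secret, submitted, unix_time, step=30, window=1):
    """Accept codes from the current time step and `window` steps either side."""
    for drift in range(-window, window + 1):
        t = unix_time + drift * step
        if t >= 0 and hmac.compare_digest(totp(secret, t, step), submitted):
            return True
    return False


# Shared secret from the RFC 6238 test vectors (a real secret would be random).
secret = b"12345678901234567890"
code = totp(secret, 59)
```

The server and the app share the secret at registration time, so both can compute the same code from the current time without any network round trip.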


If someone needs access temporarily, Azure AD uses a just-in-time elevation system to grant it for a limited time (using Just-in-time Privileged Access Management).
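Just-in-time elevation boils down to a role grant that carries an expiry. The sketch below is illustrative only; `JitAccess`, its method names, and the use of plain seconds for time are my assumptions.

```python
class JitAccess:
    """Grants a role for a limited time; the grant lapses on its own."""

    def __init__(self):
        self._grants = {}  # (user, role) -> expiry time in seconds

    def elevate(self, user, role, now, duration_seconds):
        self._grants[(user, role)] = now + duration_seconds

    def has_role(self, user, role, now):
        expiry = self._grants.get((user, role))
        return expiry is not None and now < expiry


access = JitAccess()
access.elevate("alice", "Global Administrator", now=0, duration_seconds=3600)
during = access.has_role("alice", "Global Administrator", now=1800)  # True
after = access.has_role("alice", "Global Administrator", now=3600)   # False
```

Because the check compares against the expiry on every call, nobody has to remember to revoke the role.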


Logging and Metrics Collection Capability


Logging and metrics collection play a very significant role in keeping the system highly available, scalable, and secure.


They show how the other design patterns are behaving, so that the system can be tuned to provide the best customer experience. In particular, they:


  • Continuously analyze and monitor all of the key health metrics of a service.

  • Help in tuning the system: for example, if CPU usage is high, the system takes the necessary action to bring it down.

  • Help in restoring a service that is not working properly.

  • Quickly detect problems in the live site and instruct the system to take the necessary actions.


Azure AD focuses on minimizing time to detect (TTD), which as of now is less than five minutes (TTD < 5 mins). Once an issue is identified, time to mitigate (TTM) is less than thirty minutes (TTM < 30 mins).
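As a small worked example, the snippet below computes TTD and TTM for one incident and checks them against the targets quoted above. I am assuming TTD is measured from the start of the issue to its detection and TTM from detection to mitigation; the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

TTD_TARGET = timedelta(minutes=5)
TTM_TARGET = timedelta(minutes=30)


def incident_metrics(started, detected, mitigated):
    """Return TTD, TTM, and whether each one meets its target."""
    ttd = detected - started
    ttm = mitigated - detected
    return {
        "ttd": ttd,
        "ttm": ttm,
        "ttd_ok": ttd < TTD_TARGET,
        "ttm_ok": ttm < TTM_TARGET,
    }


# Hypothetical incident timeline.
metrics = incident_metrics(
    started=datetime(2021, 1, 1, 10, 0),
    detected=datetime(2021, 1, 1, 10, 3),
    mitigated=datetime(2021, 1, 1, 10, 25),
)
# TTD is 3 minutes and TTM is 22 minutes, so both targets are met.
```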


Key Takeaways


  • Azure AD is mainly used for Authentication and lookups.

  • It has 2 types of replicas: Primary and Secondary.

  • The Primary Replica is further classified into Active and Passive Primary.

  • All write operations are performed by the Primary Replica (Active Primary).

  • Secondary Replicas perform all read operations.

  • If the Active Primary fails, the Passive Primary takes over its role.

  • Secondary Replicas receive data asynchronously using the Sync Service, and this results in eventual consistency.

  • Data needs to be written to at least two datacenters before it is acknowledged.

  • All services work in a de-correlated mode that prevents cascading of error in case of failure.

  • Time to Detect any issue is less than 5 mins.

  • Time to mitigate is less than 30 mins.

  • Azure AD uses soft delete instead of hard delete, to prevent accidental loss of data.
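For the last takeaway, this is the general idea of soft delete: a deleted item is hidden from reads but kept around so it can be restored. The `SoftDeleteStore` class is a hypothetical sketch, not how Azure AD stores deleted objects.

```python
class SoftDeleteStore:
    """Deleted items are moved aside instead of being destroyed."""

    def __init__(self):
        self._items = {}
        self._deleted = {}  # items held for a possible restore

    def put(self, key, value):
        self._items[key] = value

    def get(self, key):
        return self._items.get(key)

    def delete(self, key):
        # Soft delete: the item disappears from reads but is not lost.
        self._deleted[key] = self._items.pop(key)

    def restore(self, key):
        self._items[key] = self._deleted.pop(key)


users = SoftDeleteStore()
users.put("alice", {"department": "Finance"})
users.delete("alice")
hidden = users.get("alice")    # None: the user no longer shows up
users.restore("alice")
restored = users.get("alice")  # the original record is back
```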


Please reach out to me if you would like a blog on a specific Azure AD or Azure topic; I would be happy to write it.


Please let me know if you find any flaws in this article.


Comments and feedback are most welcome.


Follow me on LinkedIn and GitHub, and join our newsletter to keep yourself updated.


Thanks for reading. Happy Learning 😊

