Multi-Region DS Topology

ldengiam · February 8, 2024, 3:30pm

Hi Folks, I have a question regarding multi-region DS topology design. This could be an interesting design question, and I welcome any input and advice.

We are setting up a new ForgeRock stack that involves 4 Data Centers (let’s say Data Center A, B, C and D). A and B are both on the East Coast and are close to each other. C and D are both on the West Coast and are close to each other. East Coast and West Coast are two regions and hence distant from each other. Within each Data Center, there will be one primary DS instance and one secondary DS instance for each DS profile (Config, CTS and User, each on separate servers). Taking CTS as an example, each Data Center has 2 DS CTS instances (DS1 and DS2), so with 4 Data Centers that is 8 instances in total.

Then, we want to set up replication for all DS instances, meaning all 8 instances go into one replication pool, so that any Data Center can serve as a backup when things go south. The question is: since this creates a fair number of replication connections (replication between any two DS instances), should we use dedicated standalone replication servers to avoid having too many replication connections?

Assuming we use dedicated standalone replication servers, at least two instances are needed in each data center for high availability. In that case, should the instances of all DS profiles (Config, CTS and User) replicate through those two replication servers in their own data center?

If we don’t use dedicated standalone replication servers, meaning the replication service is deployed alongside each DS service on the same server, will performance become a concern? Furthermore, if we decide to add a third instance within each Data Center (DS3), will a replication pool with 12 instances become a problem? I guess the general question is: is there a threshold number of instances in the DS pool above which dedicated standalone replication servers should be used?
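
To put rough numbers on the connection counts being asked about, here is a minimal Python sketch. It rests on a simplifying assumption that is not stated in this thread: replication servers (RS) connect to each other in a full mesh, and each directory server (DS) holds one connection to a single RS at a time. The function names are hypothetical and exist only for this illustration.

```python
def full_mesh_links(n: int) -> int:
    """Number of pairwise links among n fully meshed servers."""
    return n * (n - 1) // 2


def replication_links(ds_count: int, standalone_rs: int = 0) -> int:
    """Estimate replication connections for one DS profile.

    Assumption (not from the thread): RSs form a full mesh and each DS
    connects to exactly one RS.

    standalone_rs == 0: every DS also runs an RS on the same server, so
    the mesh has ds_count members and DS-to-RS traffic stays local.
    standalone_rs > 0: only the standalone RSs are meshed, and each DS
    adds one network connection to an RS.
    """
    if standalone_rs == 0:
        return full_mesh_links(ds_count)
    return full_mesh_links(standalone_rs) + ds_count


if __name__ == "__main__":
    for ds in (8, 12):
        print(ds, "DS, RS colocated:", replication_links(ds))
        print(ds, "DS, 8 standalone RS:", replication_links(ds, standalone_rs=8))
```

Under these assumptions, 8 colocated instances need 28 links and 12 need 66, while 12 directory servers behind 8 standalone replication servers (2 per data center) need 40. This only counts connections; it says nothing about throughput or latency on those links.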

The target user number is 10 million. I appreciate any input on this. Thanks!

Best,
Le

edward.johnson · February 15, 2024, 12:04am

Hello Le,

Thanks for reaching out to the Community!

Have you considered the use of Replication Groups in your environment? According to DS documentation:

Define replication groups so that replicas connect first to local replication servers, only going outside the group when no local replication servers are available. This limits the replication traffic over slow network links to messages between replication servers, except when all local replication servers are down.
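
As an illustration of the behavior described in that passage, here is a small Python model of "local group first, then anywhere". It is only a sketch of the idea, not DS code: the class, the function, and the group names "east" and "west" are hypothetical, and the selection logic in a real server likely weighs more factors than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReplicationServer:
    name: str
    group_id: str
    available: bool = True


def choose_replication_server(
    replica_group: str, servers: list[ReplicationServer]
) -> Optional[ReplicationServer]:
    """Prefer an available RS in the replica's own group.

    Falls back to any available RS only when no local one is up, and
    returns None when every RS is down.
    """
    up = [rs for rs in servers if rs.available]
    local = [rs for rs in up if rs.group_id == replica_group]
    candidates = local or up
    return candidates[0] if candidates else None


if __name__ == "__main__":
    servers = [
        ReplicationServer("rs-east-1", "east", available=False),
        ReplicationServer("rs-east-2", "east"),
        ReplicationServer("rs-west-1", "west"),
    ]
    chosen = choose_replication_server("east", servers)
    print(chosen.name if chosen else "no replication server available")
```

In this model a replica in the "east" group only crosses to "west" when both east replication servers are marked unavailable, which is the property that keeps replica traffic off the slow inter-region link.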

Please review the following for further details:
https://backstage.forgerock.com/docs/ds/7.4/deployment-guide/plans.html#standalone_replication_servers

https://backstage.forgerock.com/docs/ds/7.4/config-guide/repl-groups.html

For further reference, here are some details on DS deployment patterns:
https://backstage.forgerock.com/docs/ds/7.4/deployment-guide/patterns.html

I hope this helps!

Warm Regards,
Ed

ldengiam · February 16, 2024, 4:29pm

Hi Edward,

Thanks very much for your response. Yes, I did see the Replication Group documentation. If dedicated replication servers are used in each DC, then the DS instances in each DC would be set up as a replication group so that they connect to the local replication servers first. The same probably applies when DS and RS run together on the same server, that is, without dedicated replication servers, so that each DS connects to its local group first too. However, what I’m trying to decide is whether it is justified to add one more layer of dedicated replication servers. If so, what are the catches? Let me know if this makes sense.

Best,
Le

mwtech · February 16, 2024, 9:43pm

Hi @ldengiam

I hate to give you the generic consultant answer for this, but it depends. Let’s take a step back and look at what you are actually replicating.

Config Store - first off, are you sure you are going to be replicating this in production? The model I am seeing most frequently now is to keep your configuration static in production, which means replication would not be needed. Assuming you are going to use replication, how frequently are you expecting changes to be made?
Core Token Service - does the business actually require session tokens to synchronize across data centers? What I typically observe is user traffic routed to a specific data center via a load balancer (assuming hot/hot) based on geography or other conditions, with traffic remaining within that data center for a user’s session. Only in the event of the data center becoming unavailable would a user be routed to a different data center, and in those cases it can be viewed as acceptable to ask the user to log in again. Do you truly need CTS replication across data centers?
User Store - You have 10 million users, but what data is actually in your user store, and how frequently are you making adds/removes/updates? This information certainly should be replicated across data centers, and the transaction volume would be the key metric I’d look at before even considering a dedicated replication server.
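
For the user store, a back-of-the-envelope estimate of transaction volume might look like the Python sketch below. Every input here is a made-up placeholder (the 0.5 writes per user per day and the peak factor of 5 are not from this thread), so substitute measured values from your own environment before drawing any conclusion.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_writes_per_second(users: int, writes_per_user_per_day: float) -> float:
    """Average write rate if changes were spread evenly over a day."""
    return users * writes_per_user_per_day / SECONDS_PER_DAY


def peak_writes_per_second(average: float, peak_factor: float) -> float:
    """Scale the average by an assumed ratio of peak to average load."""
    return average * peak_factor


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    avg = average_writes_per_second(users=10_000_000, writes_per_user_per_day=0.5)
    peak = peak_writes_per_second(avg, peak_factor=5.0)
    print(f"average: {avg:.1f} writes/s, assumed peak: {peak:.1f} writes/s")
```

With these placeholder inputs the average works out to roughly 58 writes per second. A figure like this is what you would then compare against replication delay observed during performance testing.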

Generally speaking, I tend to lean towards not introducing dedicated replication servers (even within data centers) and instead would look at utilizing them if we observe a need arise during performance testing, such as large replication delays. Even then I’d focus on other factors external to DS such as network latency or OS configuration before I look at a dedicated replication server. With this in mind, I can’t stress enough how important it is to approach performance testing with a clear definition of your service level objectives.

I know this doesn’t really give you a clear-cut answer, but I hope it helps you establish some criteria to inform your decision on whether or not to use dedicated replication servers.

ldengiam · March 1, 2024, 8:26pm

Hi @mwtech ,

I’m sorry that I somehow missed your answer. Thanks very much for the detailed reply and the information provided. I agree that it’s difficult to give feedback or suggestions on performance-related design without target specs. The thing is that it’s also difficult to get a good estimate of the specific numbers right now. The platform is used by B2B customers and they are dynamic, and since they are all global customers, that makes it even more difficult. The 10 million users figure is based on the current situation and the next several years.

We should probably be fine for now without dedicated RS servers. The config is not changing much; for CTS we are only using authentication sessions (user sessions are managed by another platform); and the only challenging one is the 10 million users. But even with that number of users, a powerful server should still be able to handle it, and I doubt the total number of users will eventually surpass 10 million.

Another reason to add dedicated RS servers is maintainability. I could be wrong, but I think they provide an isolated replication layer. When replication runs into issues, that might help with troubleshooting. One question though: do we need to back up the data from the RS servers, e.g. the change logs? We have backups from DS, but I’m not sure about the RS servers.

Best,
Le

grpensa · March 27, 2024, 4:59pm

Incidentally, in the FR DS model, there is no such thing as “the primary” and “the secondary”. All servers are multi-master.
Though of course, the notion of “primary” and “secondary or failover” is an LDAP client configuration.
But I digress. Your question concerning optimal replication strategies is very well covered in the DS course, where we describe the criteria by which such decisions are made. At the end of the day, the RS is the provider of the changelog to the DS. Need you have many copies of this changelog (1 per DS), or would a reduced set (and thus reduced network communications) suffice?