vRealize Automation 6.2 High Availability

Introduction

For the past few months, my cloud automation team and I have been focused on one of the most difficult tasks we've faced: upgrading from vCloud Automation Center 5.2 to vRealize Automation 6.2.  This upgrade is especially challenging because the 5.x version was really a relic of the DynamicOps product that VMware acquired, while 6.x is almost a completely new product, with the old DynamicOps .NET code remaining in the background as the IaaS components.  Because the new product is based on VMware appliances running SuSE Enterprise Linux and vFabric tcServer, while the old product was based on Windows .NET, you can probably imagine that achieving a highly available design requires a completely different architecture.

As a starting point, we read VMware's vCAC/vRA reference architecture document.

Which Deployment Profile is Right for You?

In the vCAC/vRA reference architecture, there are three sizes of deployment profile: small, medium, and large.

Small Deployment

  • 10,000 managed machines.
  • 500 catalog items.
  • 10 concurrent deployments.
  • 10 concurrent application deployments:
    • Each deployment has approximately 3 to 14 VM nodes.

Medium Deployment

  • 30,000 managed machines.
  • 1,000 catalog items.
  • 50 concurrent deployments.
  • Up to 20 concurrent application deployments:
    • Each deployment has approximately 3 to 14 VM nodes.

Large Deployment

  • 50,000 managed machines.
  • 2,500 catalog items.
  • 100 concurrent deployments.
  • 40 concurrent application deployments:
    • Each deployment has approximately 3 to 14 VM nodes.

For McKesson, we chose the Medium Deployment profile.  This meets our current needs; however, at some point in the future we may need to transition to a Large Deployment.  It's worth noting that you can move between deployment models by adding more components to an existing configuration, so you're not stuck with what you started with.

Design Requirements

Here are some design requirements for our vRealize Automation deployment:

  1. Single Outside URL – we must present all of our cloud services behind a single URL, https://onecloud.mckesson.com (note: this isn't Internet accessible; we're not a public provider).  This keeps things simple for our customers: they don't need to remember anything more than "OneCloud" to reach any of our current and future cloud services.
  2. Security – We must provide a high degree of security by encrypting all traffic with SSL, using AES-256 ciphers where clients support them, and disabling all weak ciphers that are vulnerable to known exploits (POODLE, etc.).
  3. Custom Portal Content Redirection – Not every cloud service we deliver is provided through vRA; we must also present custom content developed in-house, which resides on a different set of web servers, using layer 7 content redirection.
  4. Support High Availability for vRA Appliances – Since the majority of front-end requests go directly to the vRA appliances, we need the vFabric tcServer instances running there to be highly available.
  5. Support High Availability for IaaS Components – The IaaS components also provide critical functionality, and while the Manager service itself can only run active/standby, the other services such as DEM workers and vSphere agents can run active/active.
  6. Support High Availability for SSO – The SSO service and authentication must also be highly available.
  7. Support High Availability for vRealize Orchestrator – Workflows and automations executed during provisioning or machine lifecycle state changes are critical to the availability of the overall solution, and must be made highly available.

High Availability Design

Below, you’ll see our vRealize Automation HA design:

[Diagram: vRealize Automation High Availability Design – IP addresses removed to protect the innocent 🙂]

There are a few design decisions worth noting, which I'll talk through below.

Load Balancer Placement

We have three load balancers in our design:

  • One in front of the vRealize Automation appliances, performing SSL termination on 443/HTTPS with an Intranet SSL certificate.
    • This load balancer also listens on port 7444 and sends traffic to the SSO back-end, again terminating SSL with the same Intranet certificate.
    • This load balancer also delivers custom content from another set of web servers, which I'll explain later.
  • One in front of the IaaS servers on port 443/SSL.
  • One in front of our vRealize Orchestrator Cluster (not shown) on port 443/HTTPS, terminating SSL with a different Intranet SSL certificate.

As a side note, vRealize Automation 6.2 only supports vRealize Orchestrator 6.x, which is not publicly available – you will need to contact GSS to get a pre-release build.

One Load Balancer to Rule Them All…

…and in the OneCloud bind them!  Ok, that's enough of my Lord of the Rings jokes.  I did want to explain why the front-end load balancer serves three purposes.  We want to present a single front-end URL to our customers, onecloud.mckesson.com (not publicly available), which gives them a portal to all of our cloud services.  Sometimes we can't do everything we want in the vRealize Automation portal, and we have to develop custom mini-portals of our own:

  • Support ticketing portal.
  • CloudMirror – our Disaster Recovery as a Service portal.
  • Update virtual machine support contacts portal – lets our operations team know who to contact when a VM has application or guest OS issues.
  • Documentation and training materials.

So, we do some custom content switching in order to achieve the goal of providing a single, easily remembered URL to our customers.  I’ll go over how this is done in a bit.

Another reason is that we want the SSO experience to be seamless for our customers; redirecting them to another server, with a different SSL certificate, that asks for their Active Directory credentials might cause our more security-conscious customers to head for the exits.

HAproxy Instead of NSX Edge Load Balancer

We made the design decision to leverage HAproxy, an excellent open source load balancer, rather than the NSX Edge load balancer.  Why?  We need to provide disaster recovery of the complete solution by replicating it to another datacenter (we use Zerto for this).  While Zerto can easily replicate the VMware appliances and IaaS components to our secondary datacenter, the NSX Edge components are managed by a single NSX Manager and control cluster, which is tied to our primary datacenter's vCenter.  If we fail them over to the other datacenter, they become unmanageable.  To achieve better recoverability and manageability, we deploy HAproxy in Linux VMs and replicate them, along with the VMware appliances and IaaS components, as a single Virtual Protection Group to our secondary datacenter.  This allows us to fail over our entire cloud management platform in minutes in the event of a datacenter outage, with no dependency on the NSX management components in our primary datacenter.

Three vRealize Automation Appliances?

In the vCAC reference architecture document published by VMware, there are two appliances, with the vPostgres database active on one and standby on the other.  In our design, we decided to disable the vPostgres database on two vRA appliances and deploy a third vRA appliance that runs only vPostgres, with tcServer and the other front-end services disabled.  We made this design decision to obtain a more modular architecture and to scale the performance of the data storage layer separately from the front-end servers.
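
For reference, here's a minimal sketch of what disabling the local database on the two front-end appliances looks like from an SSH session.  The vpostgres service name is an assumption for the 6.2 appliance – confirm it with chkconfig --list on your build before running this:

vr01:~ # service vpostgres stop       # stop the local vPostgres instance (service name assumed)
vr01:~ # chkconfig vpostgres off      # prevent it from starting at boot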

We also considered deploying a fourth vRA appliance to serve as a standby vPostgres database server and increase the availability of the solution; however, we decided that vSphere HA is fast enough to meet the recoverability requirements of our environment, and we did not want to reduce manageability by introducing a more complex clustered database setup.

We also considered deploying a separate Linux server running Postgres, but decided to use the vRA appliance version, as it is fully supported by VMware GSS.

Scale-up or Scale-out the IaaS Components?

In our vCAC 5.2 design, we had separated out all components of the IaaS solution, such as DEM workers and vSphere Agents, in order to increase the scalability of the solution:

[Diagram: vCAC 5.2 High Availability Architecture – IP addresses and hostnames removed to protect the innocent]

What we discovered after years of operation, deploying and managing thousands of VMs across two datacenters, is that the DEM workers and vSphere Agents are idle almost all the time – only a few hundred MB of RAM consumed and consistently under 5% CPU usage.  They really don't require dedicated servers; that recommendation seems to date from the vSphere 4.x days, when 4 vCPU and 8GB of RAM was considered a large VM.

We made the design decision to combine the DEM workers and vSphere Agents onto a single pair of IaaS servers running the Manager service in active/standby.  This simplifies the solution and reduces complexity, increasing manageability.  We are confident we can scale these servers up if needed – we're starting with 8 vCPU and 12GB of RAM each – and we can always add a third or fourth server and scale out if this ever becomes an issue.  This decision differs from the reference architecture, so I wanted to explain our reasoning.

High Availability Deployment and Configuration

To deploy the HA configuration, we need to make sure we have taken care of some things first:

  1. Deploy your load balancers first, and make sure they are working properly before configuring any appliances.  This will save you from having to reconfigure them later.
  2. Take vSphere snapshots of all appliances and IaaS components prior to configuration.  This will give you an easy way to roll back the entire environment to a clean starting point in case something goes wrong.

HAproxy Global and Default Settings

vRealize Automation has some unique load balancer requirements.  Because the HTTP requests can be very large (more than the default 16K buffer size), we must tune HAproxy to increase the default buffer size to 64K.  If you don’t do this, you’re likely to get some HTTP 400 errors when requests that are too large get truncated by the load balancer.  We also set some reasonable defaults for maximum connections and client/server timeouts, as well as syslog settings.  Here are the settings we’re using:

global
log 127.0.0.1 local0
log 127.0.0.1 local1 notice
maxconn 2048
tune.ssl.default-dh-param 2048
tune.bufsize 65535
user nobody
group nobody

defaults
log global
option forwardfor
option http-server-close
retries 3
option redispatch
timeout connect 5000
timeout client 50000
timeout server 300000
maxconn 2048
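
Note that the log 127.0.0.1 lines assume a local syslog daemon listening on UDP 514.  Here's a minimal rsyslog sketch to capture the HAproxy logs (the drop-in path and log file location are assumptions for your distribution):

# /etc/rsyslog.d/haproxy.conf (hypothetical drop-in)
$ModLoad imudp
$UDPServerAddress 127.0.0.1
$UDPServerRun 514
local0.* /var/log/haproxy.log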

HAproxy Load Balancer Front-End Configuration

First, let’s cover the front-end load balancer, which provides load balancing services to portal users.  We will need to configure front-ends for the following listeners:

  • 80/HTTP – redirecting all users to 443/SSL.

frontend www-http
bind *:80
mode http
reqadd X-Forwarded-Proto:\ http
default_backend vra-backend
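
One subtlety: in this layout, the actual redirect to HTTPS is issued by the "redirect scheme" rule in the vra-backend (shown later), not by the frontend itself.  If you'd rather keep the redirect at the edge, an equivalent sketch handles it entirely in the frontend:

frontend www-http
bind *:80
mode http
# answer all port-80 requests with a permanent redirect to HTTPS,
# so no backend is ever consulted for plain-HTTP traffic
redirect scheme https code 301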

  • 443/SSL – terminating SSL encryption for portal users.
    • This frontend must route all content to the vRA appliances, except our portal content located at https://onecloud.mckesson.com/portal.  I'm using an acl with path_beg (path begins with) matching to accomplish this:

frontend www-https
bind *:443 ssl crt /path/to/cert.pem ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:DHE-RSA-AES256-SHA:!NULL:!aNULL:!RC4:!RC2:!MEDIUM:!LOW:!EXPORT:!DES:!MD5:!PSK:!3DES
mode http
reqadd X-Forwarded-Proto:\ https
acl url_portal path_beg /portal
use_backend portal-https if url_portal
default_backend vra-backend
stats enable
stats uri /stats
stats realm HAproxy\ Statistics
stats auth username:password

Note: I'm not going to include the ciphers on every front-end, for ease of formatting this post.  If you'd like to understand the reasoning behind this exact set of ciphers, please read the excellent blog post Getting an A+ with SSLLabs, which describes how to achieve broad browser compatibility while using the strongest encryption practical.  Sorry, Internet Explorer 6 on Windows XP users, you're out of luck!  Time to upgrade your OS… or run Chrome or Firefox.
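
To confirm the weak ciphers really are rejected at the VIP, you can probe it with openssl from any host that can reach the load balancer (assuming onecloud.mckesson.com resolves to the VIP from where you run this):

# this handshake should FAIL, because RC4 is disabled on the frontend
openssl s_client -connect onecloud.mckesson.com:443 -cipher RC4 < /dev/null
# this handshake should succeed with one of the strong ECDHE suites
openssl s_client -connect onecloud.mckesson.com:443 -cipher ECDHE-RSA-AES256-SHA < /dev/null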

  • 7444/SSL – terminating SSL encryption for SSO.

frontend sso-https
bind *:7444 ssl crt /path/to/cert.pem ciphers (snip)
mode http
reqadd X-Forwarded-Proto:\ https
default_backend sso-backend

  • 5480/SSL – terminating SSL encryption for management of the vRA appliances.
    • Note that this is required in order to properly configure the vRA appliances for HA clustering.

frontend 5480-https
bind *:5480 ssl crt /path/to/cert.pem ciphers (snip)
mode http
reqadd X-Forwarded-Proto:\ https
default_backend vra-mgmt-backend

HAproxy Load Balancer Back-End Configuration

Now, here are the back-ends that the front-ends configured above connect to:

  • The main vRA appliance back-end on port 443.
    • Note that the "redirect scheme…" rule redirects any unencrypted traffic that arrived on port 80 to SSL on port 443.
    • Also note that "balance source" and "hash-type consistent" keep traffic pinned to the same back-end server (sticky sessions).  If you have a network where client source IPs change frequently (mobile users, etc.), you might need a different persistence method.  vRA isn't designed for mobile use (yet), so this isn't a big concern on our network:

backend vra-backend
redirect scheme https if !{ ssl_fc }
balance source
hash-type consistent
mode http
server vr01 (ip address):443 check ssl verify none
server vr02 (ip address):443 check ssl verify none
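
The "check ssl verify none" probe only confirms that the SSL listener answers.  If you want a deeper, application-level health check, HAproxy's "option httpchk" can probe a status URL instead.  A hedged sketch – the health path below is taken from VMware's vRA 6.x load balancing guidance, so verify it against your exact build before relying on it:

backend vra-backend
redirect scheme https if !{ ssl_fc }
balance source
hash-type consistent
mode http
# application-level probe (path is an assumption -- verify for your version)
option httpchk GET /vcac/services/api/health
server vr01 (ip address):443 check ssl verify none
server vr02 (ip address):443 check ssl verify none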

  • The vRA management back-end on port 5480:

backend vra-mgmt-backend
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server vrmgmt01 (ip address):5480 check ssl verify none
server vrmgmt02 (ip address):5480 check ssl verify none

  • The SSO back-end on port 7444:

backend sso-backend
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server sso01 (ip address):7444 check ssl verify none
server sso02 (ip address):7444 check ssl verify none

  • The portal content back-end on port 443.
    • You might recall, this leads to our custom portal content on a different set of webservers:

backend portal-https
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server web01 (ip address):443 check ssl verify none
server web02 (ip address):443 check ssl verify none

IaaS Components Load Balancer Configuration

For the HAproxy load balancer front-ending the IaaS components, we have a much simpler configuration.  Because we don't want to break Windows authentication protocols, we're doing plain TCP load balancing: packets are passed straight to the back-end with no SSL termination or re-encryption at all.  This was required in vCAC 5.2 because the reverse proxy functionality broke Windows authentication protocols; I'm not sure it's still required in vRA 6.x, given the way the vRA appliance redirects content from the IaaS components.  For now, we'll keep this configuration simply because it works:

listen ssl_webfarm
bind *:443
mode tcp
balance source
hash-type consistent
option ssl-hello-chk
server iaas01 (ip address):443 check
server iaas02 (ip address):443 check backup

Note that when load balancing IaaS components, only one Manager service can be active at once, so the “backup” directive tells HAproxy to only send traffic to the secondary server if the primary server is down.
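
If you'd rather base the failover decision on the Manager service itself instead of a bare SSL hello, the health check can be upgraded to an HTTPS probe while the data path stays in TCP passthrough mode.  A hedged sketch – the /VMPSProvision path comes from VMware's IaaS load balancing guidance, so verify it for your version:

listen ssl_webfarm
bind *:443
mode tcp
balance source
hash-type consistent
# probe the Manager service over HTTPS; data traffic is still passed through untouched
# (health-check path is an assumption -- verify against VMware's docs for your build)
option httpchk GET /VMPSProvision
server iaas01 (ip address):443 check check-ssl verify none
server iaas02 (ip address):443 check check-ssl verify none backup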

vRealize Orchestrator Cluster Load Balancer Configuration

The HAproxy load balancer front-ending the vRealize Orchestrator cluster is set up very similarly to the primary front-end load balancer, with separate front-end and back-end sections.

  • Front-end configuration:

frontend vro-https
bind *:80
bind *:443 ssl crt /path/to/cert.pem ciphers (snip)
mode http
reqadd X-Forwarded-Proto:\ https
default_backend vro-backend
stats enable
stats uri /stats
stats realm HAproxy\ Statistics
stats auth username:password

  • Back-end configuration:

backend vro-backend
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server vro01 (ip address):8281 check ssl verify none
server vro02 (ip address):8281 check ssl verify none
server vro03 (ip address):8281 check ssl verify none backup

This back-end configuration reflects our HA configuration for the vRO cluster, which I might go into more detail about in a future post.  Essentially, we have a two-node cluster with a single dedicated vPostgres database, plus a third standalone vRO appliance with its own vPostgres database, so that in the event of an outage on the primary database, we can still run workflows on the standalone appliance.  Thus, the "backup" directive will only send traffic to the standalone vRO appliance if the cluster is down.

vRealize Appliance Configuration

In order to configure the vRealize appliances properly, it’s important to do a few things.

Hosts File Settings on vRA Appliances

On all of the vRA appliances, add a hosts file entry that points your primary URL (in our case, onecloud.mckesson.com) at the IP address of your front-end load balancer.  You'll need to log in as root via SSH to do this, and it should look like this on each appliance when you're done:

vr01:~ # cat /etc/hosts
(ip of load balancer) onecloud.mckesson.com
# VAMI_EDIT_BEGIN
127.0.0.1 localhost
(ip of local appliance) vr01.na.corp.mckesson.com vr01
# VAMI_EDIT_END
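
A quick sanity check on each appliance confirms the override is being picked up (getent consults /etc/hosts before DNS under the default nsswitch ordering):

vr01:~ # getent hosts onecloud.mckesson.com
(ip of load balancer)  onecloud.mckesson.com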

SSO Appliance Configuration

The SSO or VM Identity Appliances should be configured with the load balancer's name, as follows:

[Screenshot: Identity Appliance host settings showing the load balancer name]

You still need to join the SSO appliances to your Active Directory domain using their own hostnames, but the hostname above is what they present to the vRA appliances for URL redirection, so it needs to match your front-end load balancer.

vRealize Automation Appliance Configuration

On each vRA appliance, you'll need to configure the SSO settings with the front-end load balancer name as well:

[Screenshot: vRA appliance SSO settings showing the front-end load balancer name]

Again, it's important that the load balancer is up and running before you begin configuring the SSO settings on your vRA appliances.

IaaS Components Installation

When installing the IaaS components, it's very important that you browse to the front-end load balancer address from each individual IaaS server to download the IaaS components installer.  The URL is encoded in the binary executable you download, so if you download it directly from a vRA appliance, you won't end up with a highly available configuration.

Conclusion

When all is said and done, we now have a highly available vRealize Automation 6.2 cloud management platform!  More importantly, we’ve met the design requirements for security with strong SSL encryption:

[Screenshot: SSL scan results showing only strong ciphers accepted]

And the user experience meets requirements by delivering a single outside URL for all services, including Single-Sign-On:

[Screenshot: Single Sign-On via the single outside URL]

We’re still in the midst of our upgrade and roll-out of vRealize Automation 6.2, so hopefully all goes well.  Database migration has turned out to be the most challenging aspect of our upgrade, rather than the design and installation itself.

Are you deploying vRA 6.2 in your environment?  Please let me know how it goes in the comments section.

15 Responses

  1. Hari February 28, 2015 / 12:57 am

    Great article on vCAC drawing on real production experience.

    I have some questions about it:
    1. Which server is the Manager service running on? Is it on the IaaS web server or on the DEM/Agent server?
    2. The VMware reference architecture says the Manager service is active/standby, with the service disabled on the standby server; if the active service fails, we have to manually enable it on the standby. Is this the same in your case?
    3. You have not included a load balancer for the DEM/Agent servers.

    Keep coming up with these good articles.

    Regards
    Hari

    • VCDXpert February 28, 2015 / 10:54 am

      Hi Hari,

      Thanks for the kind words. Let me try to answer your questions:

      1. The Manager service is running on the first IaaS server, along with the Web server. The second IaaS server is running as a cold standby, with the service installed but set to “Manual” so it can be started manually in the event of a failure of the primary server.

      2. That is correct – only one Manager service can run at a time. In the event that the first IaaS server fails, the Manager service on the second server will need to be started manually.

      3. A load balancer is not required for the DEM orchestrators, workers, and vSphere Agents. They can run active/active (2 copies of each) and each one will connect directly to the SQL database to look for work items to execute.

      To summarize, IaaS server 1 has the following services:

      Manager (Active set to automatic startup)
      IIS
      DEM Orchestrator
      DEM Worker
      vSphere Agent (1 for each vCenter)

      IaaS server 2 has the following:

      Manager (standby, set to Manual start)
      IIS
      DEM Orchestrator
      DEM Worker
      vSphere Agent (1 for each vCenter)

      I hope this makes sense, but please let me know if you have any followup questions.

      • shawn jin July 24, 2015 / 9:37 am

        1. A load balancer is not required for the DEM orchestrators, workers, and vSphere Agents; they can run active/active. So why use a load balancer for the vRealize Orchestrator cluster in your environment?
        2. Could I use self-signed certificates in a distributed environment?

        • VCDXpert January 6, 2016 / 4:59 pm

          1. You are correct, the DEM orchestrators, workers, and agents don’t need a load balancer. The vRealize Orchestrator cluster does have a load balancer because we have some legacy vRO webviews that we wanted to keep highly available.

          2. Theoretically, you could use self-signed certificates, however, I wouldn’t recommend it as you may end up breaking many of the vRA components that require trusted SSL certificates throughout. You will need to ensure that the self-signed certificates are trusted by all components, including the IaaS servers and all vRA appliances, which would be very problematic. Troubleshooting SSL failures would also be difficult.

  2. Rebecca March 25, 2015 / 11:21 am

    Certificate wise, have you used a single one (with all of the names in the SAN) for the Identity Appliance, Virtual Appliances, IaaS Web, and IaaS Manager? I see you mention you did use a different cert for vRO. We are looking to see if we can simplify how many certificates we are spinning up for our 6.2 environment.

    • VCDXpert March 25, 2015 / 11:28 am

      So, we actually have a single SSL certificate for our front-end, with a single common name, onecloud.mckesson.com, and the HAproxy front-end terminates SSL on port 7444 for the identity appliance, so it looks like onecloud.mckesson.com to the customers coming in. Port 443 terminates SSL for the vRA appliances, so they also look like onecloud.mckesson.com to the outside world.

      We have a separate certificate for the IaaS servers that is bound to 443 on both IIS servers. If you wanted to simplify your SSL configuration, you could put all the names in the SAN as VMware suggests. We didn’t have this option available to us, because we are using Symantec Intranet SSL certificates and they don’t support multiple common names in SAN on a single certificate. Hope this helps.

  3. Houa April 29, 2015 / 12:25 pm

    Hey Luke,

    For DEM02 and DEO02, how are you pointing them to the standby IaaS server, given that the Manager service on the standby is not running? Also, for the Model Manager web service, are you using the VIP for both DEM/DEO installations on the iaas01 and iaas02 servers?

    Thanks!

    • VCDXpert May 1, 2015 / 7:30 am

      DEM orchestrator and worker #2 point to the IaaS load balancer address, which allows them to pick up work items even though the local Manager service is not running. In fact, all DEMs and Agents point to the IaaS load balancer VIP. Guest Agents also need to point to this same VIP, so that they can keep running during a failover.

  4. arielik May 13, 2015 / 5:46 pm

    Hi, how do you achieve SSO from your custom portal to vRA?

  5. donjo June 12, 2015 / 3:52 pm

    Hi Luke,
    Great article! Just wondering about the load-balanced Identity Appliance: from what I can see in the VMware vRA 6.2 reference architecture doc, they say the ID appliance cannot be run in an HA config. Is this something you have done here, and if so, have you got VMware to support it?

    • VCDXpert June 13, 2015 / 6:36 am

      You are correct; VMware doesn’t currently support running the ID appliance in an HA config. We are only running a single instance of the ID appliance. In some of my chats with the vRA GSS team, this might be supported in the future.

  6. Rishab Mehta December 28, 2015 / 5:30 am

    Hi,

    I am trying to set up vRA 7 in an HA configuration. As you must be aware, the SSO appliance has been removed and its functionality folded into the vRA appliance itself, so I assume the whole setup will change. Could you share the HAProxy settings for such a setup? I am a total noob with HAProxy, so it would take me a lot of time to learn it on my own.

    My setup has the following
    vRA appliance 1 on IP x.x.x.86
    vRA appliance 2 on IP x.x.x.87
    IaaS 1 on IP x.x.x.153
    IaaS 2 on IP x.x.x.154
    vRA LB on x.x.x.211 (centos 7 VM with HAproxy installed)
    IaaS LB on x.x.x.240 (centos 7 VM with HAproxy installed)

    Also, could you explain the cipher section and how you arrived at those values?

    Thanks
    Rishab

    • VCDXpert January 6, 2016 / 4:51 pm

      vRA 7 has a significantly changed architecture, so this article definitely wouldn’t apply to a vRA 7 deployment. I need to spend some time working with vRA 7 before writing a follow up article.

    • VCDXpert January 6, 2016 / 4:56 pm

      Regarding the ciphers, if you are using SSL, many ciphers that are enabled by default are used by older browsers and are considered insecure. By disabling certain ciphers, you can improve the overall security of your solution, however, you will end up blocking certain older browsers, like IE 6 on Windows XP (which you shouldn’t be using in the first place). The linked article goes into much more detail on this topic: https://raymii.org/s/tutorials/Strong_SSL_Security_On_nginx.html

  7. Mitra February 7, 2016 / 8:55 am

    Hello Sir ,

    I am a beginner on the cloud platform.
    For an easy start, I just want to know the following:
    1. How each component in IaaS works – DEM, Orchestrator, Model Manager, the IaaS website, vRA, etc.
    2. How they depend on each other.

    It will help me a lot to understand how the cloud components work.
    I have lots of questions in mind, but will ask them afterwards.
    Looking forward to a positive reply from you.

    Regards,
    Mitra
