Automating NSX Security Groups with vCAC/vRA

Brian Ragazzi has posted a great two-part blog series on how to automate both provisioning and the day-2 operation of moving VMs into NSX security groups defined in Service Composer:

Part 1:

https://brianragazzi.wordpress.com/2015/03/23/automating-nsx-security-groups-with-vcacvra-part-1/

Part 2:

https://brianragazzi.wordpress.com/2015/03/25/automating-nsx-security-groups-with-vcacvra-part-2/

vCAC to vRealize Automation Upgrade Notes

Introduction

Well, that was a long weekend!  My cloud management automation team and I started work bright and early Saturday morning, after notifying our business unit customers several days in advance that the portal would be down on Saturday while we completed the upgrade.  We spent approximately 30 hours over the weekend pushing through the upgrade.  Most of the time was simply due to the large scale of our deployment (thousands of VMs, over 1,000 blueprints, 219 business groups, and hundreds of entitlements), and the need to do proper testing to ensure our customers would have a healthy environment on Monday morning.

Here are some notes that I took during the upgrade process – it is anything but simple, and the vRealize engineering team could make a lot of improvements to enhance the customer experience.

Pre-Migration Notes

First, take snapshots of every VM, source and target.  Take snapshots before and after the Pre-Migration, and before and after the Migration.  This will allow you to completely roll back your environment in case something goes horribly wrong.  Take SQL database backups as well, prior to each snapshot, so the backups are captured inside the snapshots.
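If you have PowerCLI handy, a quick sketch like the following can snapshot everything in one pass before you start; the vCenter name and VM name patterns are illustrative, so substitute your own:

Connect-VIServer vcenter.example.com
$vms = Get-VM -Name "vcac-*", "vra-*"
$vms | New-Snapshot -Name "Pre-Migration $(Get-Date -Format yyyyMMdd-HHmm)" -Description "Before vRA 6.2 migration"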

Preparing the Source System

Before you begin the Pre-Migration and make any changes to the environments, understand that any lease extension requests or machine approval requests in flight will be lost during the migration process.  For us this meant that we went ahead and approved any pending lease extension requests, notifying the Business Group Managers that we had done so, as a lost lease extension request could cause expiration of a machine.  We deemed that new machine requests were less critical, as the owner could simply request the machine again.

Allow all pending workflows to complete, and uninstall any custom workflow stubs by using cloudutil from the CDK.

Self-Signed SSL Certificates and Pre-Migration

If you are migrating and want to keep your portal website URL the same, you’ll need to do this:

  1. Generate a self-signed certificate on your existing IaaS server.
  2. Import the self-signed certificate into both the existing IaaS server, as well as your new IaaS server, so that it is a trusted certificate.  If you’re not sure how to do this, see this article.
  3. Edit the binding in IIS manager for port 443 to use the self-signed certificate, instead of your current certificate.
  4. You’ll also need to edit both the Manager.config and Web.config files for the Manager service and Repository service to point to the FQDN used for the self-signed certificate, then restart the services, recycle the IIS application pools, and do an “iisreset.”
  5. Verify that you can browse to your existing vCAC portal, using the FQDN, from the new vRA IaaS server, and that:
    1. You can still load the portal (Model Manager/Repository is working).
    2. The SSL certificate is trusted and you don’t receive any security warnings.

The reason you have to do this is that both the Pre-Migration and the actual Migration require a trusted SSL certificate on the source system.  Both our source system and target system use “onecloud.mckesson.com” for the URL, but during the migration I need to address the source and target individually.  I can’t just hack around this with a hosts file, because then the FQDN won’t match the SSL certificate’s common name, which makes the certificate untrusted.

Once you have the SSL trust re-established using self-signed certificates, you can proceed with Pre-Migration.
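If you’d like to script the certificate steps above, here is a minimal PowerShell sketch.  It assumes Windows Server 2012 or later with the PKI and WebAdministration modules available (on older servers, use the Create Self-Signed Certificate option in IIS Manager instead), and the FQDN and file path are illustrative:

# Step 1: generate a self-signed certificate for the portal FQDN on the existing IaaS server
$cert = New-SelfSignedCertificate -DnsName "onecloud.mckesson.com" -CertStoreLocation "Cert:\LocalMachine\My"

# Step 2: export it, then import it into the Trusted Root store on both the old and the new IaaS servers
Export-Certificate -Cert $cert -FilePath "C:\Temp\onecloud-selfsigned.cer"
Import-Certificate -FilePath "C:\Temp\onecloud-selfsigned.cer" -CertStoreLocation "Cert:\LocalMachine\Root"

# Step 4 (after editing Manager.config and Web.config): recycle the application pools and restart IIS
Import-Module WebAdministration
Get-ChildItem IIS:\AppPools | ForEach-Object { Restart-WebAppPool -Name $_.Name }
iisreset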

Database Cleanup

One of the byproducts of having a long-running vCAC installation (ours has been up since 2013) is that you will inevitably accumulate orphaned work items in the dbo.WorkItems table.  Since these work items are likely months or even years old, they are never going to complete successfully and can be safely dropped.  First, confirm that the table doesn’t contain many recent work items; once you’ve verified that, you can safely delete them:

DELETE from dbo.WorkItems
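If you’d like a safety net around that delete, a sketch like the following (plain T-SQL, nothing vRA-specific) lets you review the row count and roll back before committing:

BEGIN TRANSACTION;
SELECT COUNT(*) AS WorkItemCount FROM dbo.WorkItems;  -- sanity-check what you're about to remove
DELETE FROM dbo.WorkItems;
-- COMMIT TRANSACTION;    -- run this once you're satisfied
-- ROLLBACK TRANSACTION;  -- or this if the counts look wrong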

Performing the Migration

During the actual Migration, point the target system at the primary IaaS server, not the load balancer.  That server needs a trusted self-signed certificate, so you might need to update its IIS binding, as you did above on the source IaaS server.

After the Migration

One of the first things you need to do after the Migration, when you bring up all of the components in the new system, is to do an Inventory Data Collection across all of your Compute Resources so that the MoRef (Managed Object Reference) of each VM can be updated in the vRA database.  I found a trick to do this automatically without having to manually kick it off from the portal:

UPDATE dbo.DataCollectionStatus SET LastCollectedTime=NULL, CollectionStartTime=NULL

What this does is set LastCollectedTime and CollectionStartTime to NULL for each Compute Resource, so vRA will immediately initiate a new Inventory Data Collection on each one, just as if those Compute Resources were brand new and had never run data collection.

Guest Agent Reconfiguration

If you’re upgrading from 5.x to 6.x, you’ll need to update your Guest Agents to point to the IaaS load balancer, instead of the primary portal URL.  This is due to the architectural change that placed the tcServer in front of the IaaS tier in the application.  Here are steps to do this for each type of Guest Agent:

Windows Guest Agent

If you are using an older Windows Guest Agent, unfortunately, you’ll most likely have to upgrade to the new version that is a port of the Linux Guest Agent.  Download the new Guest Agent from https://vra-appliance/i/ and be sure to Unblock the downloaded file… otherwise Windows will block the Guest Agent from running and you might waste a lot of time trying to figure out why (I know I did…).  Right-click the Zip file you downloaded from the vRA Appliance, select Properties, then click Unblock:

[Screenshot: Zip file Properties dialog showing the Unblock button]

By the way, this wasn’t much of a problem in 5.x, since the file came from the vCAC binaries.  Now that you download the file from the vRA Appliance’s website, it will automatically be blocked from execution unless that site is in your Trusted Sites list.
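If you’d rather do this from PowerShell than the Properties dialog, Unblock-File accomplishes the same thing (the Zip file name below is just an example):

Unblock-File -Path .\GuestAgentInstaller.zip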

To uninstall the old Windows Service, run the following command:

C:\VRMGuestAgent\winservice.exe -u

To install the new Windows Service, run the following command:

C:\VRMGuestAgent\winservice.exe -i -p SSL -h iaas_load_balancer:443

If your SSL certificate has changed, you can remove the cert.pem file from C:\VRMGuestAgent, and the Guest Agent will automatically download the cert.pem from the IaaS load balancer upon execution.

Linux Guest Agent

We didn’t have to update the Linux Guest Agent from the 5.2 version.  It still seems to work fine; however, you will need to reconfigure it to point at the IaaS load balancer:

# rm -rf /usr/share/log
# cd /usr/share/gugent
# ./installgugent.sh iaas_load_balancer:443 ssl
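Before you shut the template back down, it can be worth confirming that the guest can actually reach the IaaS load balancer over SSL.  This is just a generic connectivity check, not part of the documented procedure:

# openssl s_client -connect iaas_load_balancer:443 < /dev/null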

Please note that the documentation for Linux Guest Agent installation is currently incorrect; we filed a PR with VMware, and it will be updated to reflect the installation method shown above.  The documentation says to put a hyphen in front of the IaaS hostname, but the hyphen is not necessary and will actually cause the connection to fail because the resulting hostname is invalid.

Success!

I hope you have a successful migration.  We were able to complete ours; however, we still have a number of PRs open with VMware and are awaiting some critical patches.  The good news is that if you are performing your upgrade after 6.2 Service Pack 1 is released, I’m told they will try to include most of the patches created for us in that release, so the product should be much more stable.  Here is our shiny new Service Catalog:

[Screenshot: the new vRealize Automation Service Catalog]

vRealize Automation 6.2 High Availability

Introduction

For the past few months, my cloud automation team and I have been very focused on accomplishing one of the most difficult tasks we’ve faced: upgrading from vCloud Automation Center 5.2 to vRealize Automation 6.2.  This upgrade is especially challenging because the 5.x version was really a relic of the DynamicOps product that VMware acquired, while 6.x is almost a completely new product, with the old DynamicOps .NET code remaining in the background as the IaaS components.  Because the new product is based on VMware appliances running SUSE Linux Enterprise and vFabric tcServer, and the old product was based on Windows and .NET, you can probably imagine that achieving a highly available design requires a completely different architecture.

As a starting point, we read the vCAC/vRA reference architecture document.

Which Deployment Profile is Right for You?

In the vCAC/vRA reference architecture, there are three sizes of deployment profile: small, medium, and large.

Small Deployment

  • 10,000 managed machines.
  • 500 catalog items.
  • 10 concurrent deployments.
  • 10 concurrent application deployments:
    • Each deployment has approximately 3 to 14 VM nodes.

Medium Deployment

  • 30,000 managed machines.
  • 1,000 catalog items.
  • 50 concurrent deployments.
  • Up to 20 concurrent application deployments:
    • Each deployment has approximately 3 to 14 VM nodes.

Large Deployment

  • 50,000 managed machines.
  • 2,500 catalog items.
  • 100 concurrent deployments.
  • 40 concurrent application deployments:
    • Each deployment has approximately 3 to 14 VM nodes.

For McKesson, we chose to go with a Medium Deployment profile.  This meets our current needs; however, at some point in the future we may need to transition to a Large Deployment.  It’s worth noting that you can move between deployment profiles by adding more components to an existing configuration, so you’re not stuck with what you started with.

Design Requirements

Here are some design requirements for our vRealize Automation deployment:

  1. Single Outside URL – we must present all our cloud services with a single URL to our customers, https://onecloud.mckesson.com (note: this isn’t Internet accessible; we’re not a public provider).  This keeps it simple for our customers and they don’t need to remember anything more than “OneCloud” to get to any of our current and future cloud services.
  2. Security – We must provide a high degree of security by encrypting all traffic over SSL, using AES-256 ciphers where clients support them, and disabling all weak ciphers that are vulnerable to exploits (POODLE, etc.).
  3. Custom Portal Content Redirection – Not every cloud service we deliver is provided through vRA; we must present custom content that we’ve developed in-house, which resides on a different set of web servers, using layer 7 content redirection.
  4. Support High Availability for vRA Appliances – Since the majority of front-end requests go directly to the vRA appliances, we need the vFabric tcServer instances running there to be highly available.
  5. Support High Availability for IaaS Components – The IaaS components also provide critical functionality, and while the Manager service itself can only run active/standby, the other services such as DEM workers and vSphere agents can run active/active.
  6. Support High Availability for SSO – The SSO service and authentication must also be highly available.
  7. Support High Availability for vRealize Orchestrator – Workflows and automations executed during provisioning or machine lifecycle state changes are critical to the availability of the overall solution, and must be made highly available.

High Availability Design

Below, you’ll see our vRealize Automation HA design:

[Diagram: vRealize Automation High Availability Design – IP addresses removed to protect the innocent 🙂]

There are a few design decisions to note, which I’ll talk through below.

Load Balancer Placement

We have three load balancers in our design:

  • One in front of the vRealize Automation Appliances, performing SSL termination with an Intranet SSL certificate on 443/HTTPS.
    • This load balancer also listens to port 7444 and sends traffic to the SSO backend, again, terminating SSL with the same Intranet certificate.
    • This load balancer also delivers custom content from another set of web servers, which I’ll explain later.
  • One in front of the IaaS servers on port 443/SSL.
  • One in front of our vRealize Orchestrator Cluster (not shown) on port 443/HTTPS, terminating SSL with a different Intranet SSL certificate.

As a side note, vRealize Automation 6.2 only supports vRealize Orchestrator 6.x, which is not publicly available – you will need to contact GSS to get a pre-release build.

One Load Balancer to Rule Them All…

…and in the OneCloud bind them!  Ok, that’s enough of my Lord of the Rings jokes.  I did want to mention the reason why the front-end load balancer serves 3 purposes. We want to present a single front-end URL to our customers, onecloud.mckesson.com (not publicly available), which gives them a portal to all of our cloud services.  Sometimes we can’t do everything we want in the vRealize Automation portal, and have to develop custom mini-portals of our own:

  • Support ticketing portal.
  • CloudMirror – our Disaster Recovery as a Service portal.
  • Update virtual machine support contacts portal – to let our operations team know who to contact when your VM has application or guest OS issues.
  • Documentation and training materials.

So, we do some custom content switching in order to achieve the goal of providing a single, easily remembered URL to our customers.  I’ll go over how this is done in a bit.

Another reason is that we want the SSO experience to be seamless to our customers, and redirecting them to another server, with a different SSL certificate, that asks for their Active Directory credentials might cause our more security conscious customers to head for the exits.

HAproxy Instead of NSX Edge Load Balancer

We made the design decision to leverage HAproxy, an excellent open source load balancer, rather than using the NSX Edge load balancer.  Why did we do this?  We need to provide Disaster Recovery of our complete solution by replicating it to another datacenter (we use Zerto for this).   While Zerto can easily replicate the VMware appliances and IaaS components to our secondary datacenter, the NSX Edge components are managed by a single NSX Manager and control cluster, which is tied to our primary datacenter’s vCenter.  If we fail them over to the other datacenter, they become unmanageable.  In order to achieve better recoverability and manageability, we deploy HAproxy in Linux VMs and replicate them along with the VMware appliances and IaaS components, as a single Virtual Protection Group, to our secondary datacenter.  This allows us to failover our entire cloud management platform in minutes in the event of a datacenter outage, with no dependency on the NSX management components in our primary datacenter.

Three vRealize Automation Appliances?

In the vCAC Reference Architecture document published by VMware, they have 2 appliances, and the vPostgres database on one is active and the other is standby.  In our design, we decided to disable the vPostgres database on two vRA appliances, and deploy a third vRA appliance that is only running vPostgres, and has the tcServer and other front-end services disabled.  This was a design decision we made in order to obtain a more modular architecture, and scale the performance of the data storage layer separately from the front-end servers.

We also considered deploying a 4th vRA appliance to serve as a standby vPostgres database server and increase availability of the solution, however, we decided that vSphere HA is fast enough to provide the recoverability requirement of our environment, and did not want to reduce manageability by introducing a more complex clustered database setup.

We also considered deploying a separate Linux server running Postgres, however, we decided to use the vRA appliance version, as it would be fully supported by VMware GSS.

Scale-up or Scale-out the IaaS Components?

In our vCAC 5.2 design, we had separated out all components of the IaaS solution, such as DEM workers and vSphere Agents, in order to increase the scalability of the solution:

[Diagram: vCAC 5.2 High Availability Architecture – IP addresses and hostnames removed to protect the innocent]

What we discovered after years of operation, deploying and managing thousands of VMs across 2 datacenters, was that the DEM workers and vSphere Agents are idle almost all the time.  By idle I mean only a few hundred MB of RAM consumed and less than 5% CPU usage all the time.  They really don’t require dedicated servers, and this recommendation seems to be from the vSphere 4.x days when 4 vCPU and 8GB of RAM was considered a large VM.

We made the design decision to combine the DEM workers and vSphere Agents onto a single pair of IaaS servers running the Manager service in active/standby.  This simplifies the solution and reduces complexity, increasing manageability.  We are confident we can scale these servers up if needed, and are starting with 8 vCPU and 12GB of RAM each.  In addition, we can always add a 3rd or 4th and scale-out if this ever becomes an issue.  This decision differs from the reference architecture, so I wanted to explain our reasoning.

High Availability Deployment and Configuration

To deploy the HA configuration, we need to make sure we have taken care of some things first:

  1. Deploy your load balancers first, and make sure they are working properly before configuring any appliances.  This will save you from having to reconfigure them later.
  2. Take vSphere snapshots of all appliances and IaaS components prior to configuration.  This will give you an easy way to roll back the entire environment to a clean starting point in case something goes wrong.

HAproxy Global and Default Settings

vRealize Automation has some unique load balancer requirements.  Because the HTTP requests can be very large (more than the default 16K buffer size), we must tune HAproxy to increase the default buffer size to 64K.  If you don’t do this, you’re likely to get some HTTP 400 errors when requests that are too large get truncated by the load balancer.  We also set some reasonable defaults for maximum connections and client/server timeouts, as well as syslog settings.  Here are the settings we’re using:

global
log 127.0.0.1 local0
log 127.0.0.1 local1 notice
maxconn 2048
tune.ssl.default-dh-param 2048
tune.bufsize 65535
user nobody
group nobody

defaults
log global
option forwardfor
option http-server-close
retries 3
option redispatch
timeout connect 5000
timeout client 50000
timeout server 300000
maxconn 2048
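Whenever you change the configuration, it’s worth validating it before reloading HAproxy.  The config path below is a typical default, and the reload command will vary with your distribution’s init system:

haproxy -c -f /etc/haproxy/haproxy.cfg
service haproxy reload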

HAproxy Load Balancer Front-End Configuration

First, let’s cover the front-end load balancer, which provides load balancing services to portal users.  We will need to configure front-ends for the following listeners:

  • 80/HTTP – redirecting all users to 443/SSL.

frontend www-http
bind *:80
mode http
reqadd X-Forwarded-Proto:\ http
default_backend vra-backend

  • 443/SSL – terminating SSL encryption for portal users.
    • This frontend must route all content to the vRA appliances, except our portal content located at https://onecloud.mckesson.com/portal.  I’m using an acl with path_beg (path begins with) matching to accomplish this:

frontend www-https
bind *:443 ssl crt /path/to/cert.pem ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:DHE-RSA-AES256-SHA:!NULL:!aNULL:!RC4:!RC2:!MEDIUM:!LOW:!EXPORT:!DES:!MD5:!PSK:!3DES
mode http
reqadd X-Forwarded-Proto:\ https
acl url_portal path_beg /portal
use_backend portal-https if url_portal
default_backend vra-backend
stats enable
stats uri /stats
stats realm HAproxy\ Statistics
stats auth username:password

Note: I’m not going to include the ciphers on every front-end, for ease of formatting this post.  If you’d like to understand the reason for choosing this exact set of ciphers, please read this excellent blog post Getting an A+ with SSLLabs, which describes how to achieve excellent browser compatibility while using the highest grade encryption possible.  Sorry Internet Explorer 6 on Windows XP users, you’re out of luck!  Time to upgrade your OS… or run Chrome or Firefox.
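If you want to spot-check the cipher policy yourself, openssl s_client from any Linux host works well; the hostname below is ours, so substitute your own.  The first command should fail to negotiate, and the second should complete a handshake:

openssl s_client -connect onecloud.mckesson.com:443 -cipher RC4
openssl s_client -connect onecloud.mckesson.com:443 -cipher ECDHE-RSA-AES256-SHA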

  • 7444/SSL – terminating SSL encryption for SSO.

frontend sso-https
bind *:7444 ssl crt /path/to/cert.pem ciphers (snip)
mode http
reqadd X-Forwarded-Proto:\ https
default_backend sso-backend

  • 5480/SSL – terminating SSL encryption for management of the vRA appliances.
    • Note that this is required in order to properly configure the vRA appliances for HA clustering.

frontend 5480-https
bind *:5480 ssl crt /path/to/cert.pem ciphers (snip)
mode http
reqadd X-Forwarded-Proto:\ https
default_backend vra-mgmt-backend

HAproxy Load Balancer Back-End Configuration

Now, here are the back-ends that the above configured front-ends connect to:

  • The main vRA appliance back-end on port 443.
    • Note that the redirect from unencrypted port 80 traffic to SSL on port 443 happens via the “redirect scheme…” directive.
    • Also note that “balance source” and “hash-type consistent” ensure that traffic stays sticky to the same back-end server (sticky sessions). If you have a network where client source IPs change frequently (mobile users, etc.), you might need to use another persistence method – see the cookie-based sketch after the configuration below.  vRA isn’t designed for mobile use (yet), so this isn’t a big concern on our network:

backend vra-backend
redirect scheme https if !{ ssl_fc }
balance source
hash-type consistent
mode http
server vr01 (ip address):443 check ssl verify none
server vr02 (ip address):443 check ssl verify none
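For comparison, here is what cookie-based persistence would look like if source-IP stickiness doesn’t suit your network (clients behind NAT pools, frequently changing source IPs).  This is only a sketch, not the configuration we’re running:

backend vra-backend
redirect scheme https if !{ ssl_fc }
cookie SRVID insert indirect nocache
mode http
server vr01 (ip address):443 check ssl verify none cookie vr01
server vr02 (ip address):443 check ssl verify none cookie vr02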

  • The vRA management back-end on port 5480:

backend vra-mgmt-backend
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server vrmgmt01 (ip address):5480 check ssl verify none
server vrmgmt02 (ip address):5480 check ssl verify none

  • The SSO back-end on port 7444:

backend sso-backend
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server sso01 (ip address):7444 check ssl verify none
server sso02 (ip address):7444 check ssl verify none

  • The portal content back-end on port 443.
    • You might recall, this leads to our custom portal content on a different set of webservers:

backend portal-https
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server web01 (ip address):443 check ssl verify none
server web02 (ip address):443 check ssl verify none

IaaS Components Load Balancer Configuration

For the HAproxy load balancer that is front-ending the IaaS components, we have a much simpler configuration.  Because we don’t want to break Windows authentication protocols, we are actually doing TCP load balancing where the packets are transmitted directly to the back-end with no SSL termination or re-encryption at all.  This was required on vCAC 5.2 due to the way that the reverse proxy functionality breaks Windows authentication protocols, but I’m not so sure it’s required in vRA 6.x due to the way the vRA appliance redirects content from the IaaS components.  For now, we’ll keep this configuration simply because it works:

listen ssl_webfarm
bind *:443
mode tcp
balance source
hash-type consistent
option ssl-hello-chk
server iaas01 (ip address):443 check
server iaas02 (ip address):443 check backup

Note that when load balancing IaaS components, only one Manager service can be active at once, so the “backup” directive tells HAproxy to only send traffic to the secondary server if the primary server is down.

vRealize Orchestrator Cluster Load Balancer Configuration

The HAproxy load balancer that is front-ending the vRealize Orchestrator cluster is set up very similarly to the primary front-end load balancer, with separate front-end and back-end sections.

  • Front-end configuration:

frontend vro-https
bind *:80
bind *:443 ssl crt /path/to/cert.pem ciphers (snip)
mode http
reqadd X-Forwarded-Proto:\ https
default_backend vro-backend
stats enable
stats uri /stats
stats realm HAproxy\ Statistics
stats auth username:password

  • Back-end configuration:

backend vro-backend
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server vro01 (ip address):8281 check ssl verify none
server vro02 (ip address):8281 check ssl verify none
server vro03 (ip address):8281 check ssl verify none backup

This back-end configuration reflects our HA setup for the vRO cluster, which I might go into more detail about in a future post.  Essentially, we have a two-node cluster with a single dedicated vPostgres database, plus a third standalone vRO appliance with its own vPostgres database, so that in the event of an outage on the primary database we can still run workflows on the standalone appliance.  Thus, the “backup” directive will only send traffic to the standalone vRO appliance if the cluster is down.

vRealize Appliance Configuration

In order to configure the vRealize appliances properly, it’s important to do a few things.

Hosts File Settings on vRA Appliances

On all of the vRA appliances, add a hosts file entry that points your primary URL – in our case, onecloud.mckesson.com – at the IP address of your front-end load balancer.  You’ll need to log in as root over SSH to do this, and the file should look like this on each appliance when you’re done:

vr01:~ # cat /etc/hosts
(ip of load balancer) onecloud.mckesson.com
# VAMI_EDIT_BEGIN
127.0.0.1 localhost
(ip of local appliance) vr01.na.corp.mckesson.com vr01
# VAMI_EDIT_END
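A quick way to confirm the entry is being picked up on each appliance is getent, which consults /etc/hosts the same way the services will:

vr01:~ # getent hosts onecloud.mckesson.com
(ip of load balancer) onecloud.mckesson.com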

SSO Appliance Configuration

The SSO or VM identity appliances should be configured with the load balancer’s name as follows:

[Screenshot: SSO appliance configured with the front-end load balancer name]

You still need to join the SSO appliances to your Active Directory domain using their own hostnames, but the hostname configured here is what they present to the vRA appliances for URL redirection, so it needs to match your front-end load balancer.

vRealize Automation Appliance Configuration

On each vRA appliance, you’ll need to configure the SSO settings with the front-end load balancer name as well:

[Screenshot: vRA appliance SSO Settings pointing at the front-end load balancer]

Again, it’s important that the load balancer is up and running before you begin configuration of the SSO Settings on your vRA appliances.

IaaS Components Installation

When installing the IaaS components, it’s very important that you browse to the front-end load balancer address from each individual IaaS server to download the IaaS components installer.  The load balancer URL is encoded into the executable you download, so if you download it directly from a vRA appliance, you won’t end up with a highly available configuration.

Conclusion

When all is said and done, we now have a highly available vRealize Automation 6.2 cloud management platform!  More importantly, we’ve met the design requirements for security with strong SSL encryption:

[Screenshot: SSL encryption test results]

And the user experience meets requirements by delivering a single outside URL for all services, including Single-Sign-On:

[Screenshot: single sign-on through the single OneCloud URL]

We’re still in the midst of our upgrade and roll-out of vRealize Automation 6.2, so hopefully all goes well.  Database migration has turned out to be the most challenging aspect of our upgrade, rather than the design and installation itself.

Are you deploying vRA 6.2 in your environment?  Please let me know how it goes in the comments section.

Building a Cloud Isn’t Easy

I found a very insightful article from William Huber, on VCDXblog.com:

http://www.vcdxblog.com/blog/?p=249

This is the reality that we’ve faced over the last couple of years. It’s very easy to provision Linux or Windows VMs through an automated portal like vRealize Automation, however, what about the other 20 or 30 things that have to be done to make a system ready for production? What about day 2 operations? These are the things that take serious effort and innovation to address.