Immutable Infrastructure with AWS and Ansible – Part 3 – Autoscaling

Introduction

In part 1 and part 2 of this series, we set up our workstation so that we could provision infrastructure with AWS and Ansible, and then created a simple play that provisioned an Ubuntu workstation in EC2.  That’s a great first step, but it’s not immutable.  The workstation we provisioned in EC2 is just like any other server: it isn’t resilient to failure, and if it gets terminated for any reason, it’s simply gone.  In this part, we’re going to extend our Deploy Workstation play to create an AMI (Amazon Machine Image) from the workstation after it’s configured, then create a launch configuration pointing at that AMI, and finally create an autoscaling group that points at the launch configuration.

Even if we’re only running a single instance, autoscaling groups are beneficial because they declare the desired state of your system.  You can tell AWS you want a minimum of 1 instance and a maximum of 1 instance, and AWS will ensure exactly 1 is always running.  This makes your instance resilient to many kinds of failure: if it gets terminated for any reason, a replacement is launched automatically, and when you need to do upgrades, they can be performed in a rolling manner, so you always have at least 1 healthy instance in service.

Update:  The source code for this playbook is on Github here:  https://github.com/lyoungblood/immutable

The Create AMI Role

The first new role we are going to add to our playbook will allow us to create an AMI by snapshotting our running instance.  We do this after our previous workstation role has fully configured the instance, so we are capturing the golden master state that we want to preserve immutably.  Create new folders under your playbook folder called roles/create-ami/tasks, and place the following main.yml file in it:
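A minimal sketch of such a task file, using Ansible’s ec2_ami module: the region and group_name variables come from our group_vars and the command line, and ec2_id is assumed to have been saved (for example with set_fact) by the launch role when it started the instance:

---
# roles/create-ami/tasks/main.yml (illustrative sketch)
# Snapshot the freshly configured instance into a new, timestamped AMI.
- name: create AMI from the configured instance
  ec2_ami:
    region: "{{ region }}"
    instance_id: "{{ ec2_id }}"
    name: "{{ group_name }}-{{ lookup('pipe', 'date +%Y%m%d%H%M%S') }}"
    wait: yes
  register: ami

- name: remember the new AMI id for the launch configuration role
  set_fact:
    new_ami_id: "{{ ami.image_id }}"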

The Create Launch Config Role

The second new role we are going to add to our playbook will allow us to create a launch configuration that points at the AMI we just created.  This is a necessary next step before we can create the autoscaling group.

Create new folders under your playbook directory called roles/create-launch-config/tasks, and create a file in that folder called main.yml, with the following content in it:
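A minimal sketch using the ec2_lc module. Here new_ami_id is the fact saved by the create-ami role, the other variables come from our group_vars, and the noaccess instance profile is the IAM role we created in part 1 (this is also why the PassRole permission discussed in the next section is needed):

---
# roles/create-launch-config/tasks/main.yml (illustrative sketch)
# Point a new launch configuration at the AMI we just created.
- name: create launch configuration
  ec2_lc:
    region: "{{ region }}"
    name: "{{ group_name }}-{{ new_ami_id }}"
    image_id: "{{ new_ami_id }}"
    key_name: "{{ keypair }}"
    security_groups: "{{ security_groups }}"
    instance_type: "{{ instance_type }}"
    instance_profile_name: noaccess
    assign_public_ip: yes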

Update the PowerUsers IAM Group

Before we can successfully create the launch configuration, our PowerUsers group needs permission to perform iam:PassRole for our “noaccess” role (the instance profile that the launch configuration references).  Without this permission, creating the launch configuration will fail.  Go into the Identity and Access Management screen in the AWS console, and click Groups on the left-hand side.  Select the PowerUsers group we created in part 1, then click the Permissions tab, then click Attach Policy.  Type “iam” in the filter box, and check the box next to IAMFullAccess:
Screen Shot 2016-01-25 at 4.04.46 AM

Your IAM group’s policies should look like this after you’re done:
Screen Shot 2016-01-25 at 4.04.56 AM

The Autoscaling Role

The third new role we are going to create is the one that actually creates the autoscaling group.  It first checks whether an autoscaling group with the same name already exists, and if so, it simply updates it.  By updating the autoscaling group to point at the new launch configuration, with its new AMI, the group automatically performs a rolling upgrade: it starts a new instance, waits until the OS is loaded and healthy, then terminates an old instance, repeating this process until every instance in the autoscaling group is running the new AMI.  You can tune this behavior by changing replace_batch_size; we’ve set a sensible default of the group size divided by 4, so if you had an autoscaling group with 8 running instances, autoscaling would deploy 2 new instances at a time to speed up the rolling upgrade.

If the role is creating a new autoscaling group, it also sets some CloudWatch metric alarms based on CPU utilization, and links those alarms to the scaling policies.  As we’ve configured the alarms, if average CPU utilization is greater than 50% for 5 minutes, the group scales up by adding another instance; if average CPU utilization is less than 20% for 5 minutes, the group scales down by terminating an instance.  There are also cooldown periods so this doesn’t happen too often: scaling up can only happen every 3 minutes, and scaling down can only happen every 5 minutes.

Create new folders roles/auto-scaling/tasks under your playbook folder, and create a file named main.yml in this folder, with the following content in it:
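A sketch of these tasks using the ec2_asg, ec2_scaling_policy, and ec2_metric_alarm modules. The launch configuration name matches the one built by the previous role, the sizing variables come from group_vars, and the policy ARN return values may need adjusting for your Ansible version:

---
# roles/auto-scaling/tasks/main.yml (illustrative sketch)
- name: create or update the autoscaling group
  ec2_asg:
    region: "{{ region }}"
    name: "{{ group_name }}"
    launch_config_name: "{{ group_name }}-{{ new_ami_id }}"
    availability_zones: [ "{{ zone }}" ]
    # or use vpc_zone_identifier with subnet ids if launching into a VPC
    min_size: "{{ min_size }}"
    max_size: "{{ max_size }}"
    desired_capacity: "{{ desired_capacity }}"
    health_check_type: EC2
    health_check_period: 300
    replace_all_instances: yes
    # roughly a quarter of the group at a time, but never less than one instance
    replace_batch_size: "{{ ((desired_capacity | int) // 4) if (desired_capacity | int) >= 4 else 1 }}"
    wait_timeout: 600

- name: create scale-up policy
  ec2_scaling_policy:
    region: "{{ region }}"
    name: "{{ group_name }}-scale-up"
    asg_name: "{{ group_name }}"
    adjustment_type: ChangeInCapacity
    scaling_adjustment: 1
    cooldown: 180
  register: scale_up

- name: create scale-down policy
  ec2_scaling_policy:
    region: "{{ region }}"
    name: "{{ group_name }}-scale-down"
    asg_name: "{{ group_name }}"
    adjustment_type: ChangeInCapacity
    scaling_adjustment: -1
    cooldown: 300
  register: scale_down

- name: alarm that scales up when average CPU is above 50% for 5 minutes
  ec2_metric_alarm:
    region: "{{ region }}"
    state: present
    name: "{{ group_name }}-cpu-high"
    metric: CPUUtilization
    namespace: AWS/EC2
    statistic: Average
    comparison: ">="
    threshold: 50.0
    period: 300
    evaluation_periods: 1
    unit: Percent
    dimensions:
      AutoScalingGroupName: "{{ group_name }}"
    alarm_actions: [ "{{ scale_up.arn }}" ]

- name: alarm that scales down when average CPU is below 20% for 5 minutes
  ec2_metric_alarm:
    region: "{{ region }}"
    state: present
    name: "{{ group_name }}-cpu-low"
    metric: CPUUtilization
    namespace: AWS/EC2
    statistic: Average
    comparison: "<="
    threshold: 20.0
    period: 300
    evaluation_periods: 1
    unit: Percent
    dimensions:
      AutoScalingGroupName: "{{ group_name }}"
    alarm_actions: [ "{{ scale_down.arn }}" ]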

Cleaning up after ourselves

We’re also going to add three more roles that purge all but the last 5 AMIs and launch configurations, and terminate our amibuild instance (the instance we just used to configure our golden AMI).  Keeping the last 5 AMIs and launch configurations around is extremely useful: if you deploy a breaking change to your infrastructure, you can simply point your autoscaling group at the most recent working launch configuration, and your application will be back up and running quickly.  You can easily configure this to keep more than the 5 most recent launch configurations and AMIs if you like.

The Delete Old Launch Configurations Role

Create a new folder called roles/delete-old-launch-configurations/tasks and create a file named main.yml in it, with the following content:
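A sketch of what these tasks might look like, assuming the custom lc_find module described below returns a results list sorted oldest-first:

---
# roles/delete-old-launch-configurations/tasks/main.yml (illustrative sketch)
- name: find launch configurations for this group (oldest first)
  lc_find:
    region: "{{ region }}"
    name_regex: "{{ group_name }}.*"
    sort_order: ascending
  register: old_lcs

- name: delete all but the 5 most recent launch configurations
  ec2_lc:
    region: "{{ region }}"
    name: "{{ item.name }}"
    # some Ansible versions insist on instance_type even when deleting
    instance_type: "{{ instance_type }}"
    state: absent
  with_items: "{{ old_lcs.results[:-5] }}"
  when: old_lcs.results | length > 5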

This role also requires a Python script called lc_find.py to be placed in its library folder.  Create a folder called roles/delete-old-launch-configurations/library and create a file named lc_find.py in it, with the following content:
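A minimal sketch of such a module using boto: it lists the launch configurations in a region, filters them with a name regex, and returns them sorted by creation time. Pagination and error handling are omitted for brevity, and the interface is our own convention, matching the task sketch above:

#!/usr/bin/env python
# roles/delete-old-launch-configurations/library/lc_find.py (illustrative sketch)
import re

import boto.ec2.autoscale

from ansible.module_utils.basic import AnsibleModule


def main():
    module = AnsibleModule(
        argument_spec=dict(
            region=dict(required=True),
            name_regex=dict(required=False, default='.*'),
            sort_order=dict(required=False, default='ascending',
                            choices=['ascending', 'descending']),
        )
    )
    conn = boto.ec2.autoscale.connect_to_region(module.params['region'])
    pattern = re.compile(module.params['name_regex'])

    # Collect matching launch configurations and sort them by creation time.
    configs = [lc for lc in conn.get_all_launch_configurations()
               if pattern.match(lc.name)]
    configs.sort(key=lambda lc: lc.created_time,
                 reverse=(module.params['sort_order'] == 'descending'))

    results = [dict(name=lc.name, image_id=lc.image_id,
                    created_time=str(lc.created_time)) for lc in configs]
    module.exit_json(changed=False, results=results)


if __name__ == '__main__':
    main()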

The Delete Old AMIs Role

This role simply deletes any AMIs other than the 5 most recently created ones for the particular autoscaling group we are deploying.  Create a folder called roles/delete-old-amis/tasks, and create a file named main.yml in it, with the following content:
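A sketch of these tasks using ec2_ami_find to list our own AMIs newest-first (the name pattern assumes the naming convention from the create-ami sketch), and ec2_ami to deregister everything past the 5 most recent:

---
# roles/delete-old-amis/tasks/main.yml (illustrative sketch)
- name: find AMIs for this group (newest first)
  ec2_ami_find:
    region: "{{ region }}"
    owner: self
    name: "{{ group_name }}-*"
    sort: creationDate
    sort_order: descending
    no_result_action: success
  register: old_amis

- name: deregister all but the 5 most recent AMIs
  ec2_ami:
    region: "{{ region }}"
    image_id: "{{ item.ami_id }}"
    delete_snapshot: yes
    state: absent
  with_items: "{{ old_amis.results[5:] }}"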

The Terminate Role

This role is very simple.  Now that we’ve captured the AMI snapshot of our fully configured system, created a launch config, and created an autoscaling group based on it, we no longer need our temporary amibuild system.  This role will terminate it.

Create new folders named roles/terminate/tasks under your playbook folder, and create a file named main.yml in it, with the following content:
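A minimal sketch, assuming the instance id saved by the launch role is still available as ec2_id:

---
# roles/terminate/tasks/main.yml (illustrative sketch)
- name: terminate the temporary amibuild instance
  ec2:
    region: "{{ region }}"
    instance_ids: [ "{{ ec2_id }}" ]
    state: absent
    wait: yes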

Putting it all together

In order to put all of the new roles we’ve created together, we need to update our deployworkstation.yml play located in the root of our playbook folder.  The new deployworkstation.yml play should have the following content in it:
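Here is a sketch of what the finished play might look like, with the new roles appended after the original launch and workstation plays (role and group names follow the conventions used in the earlier sketches):

---
# deployworkstation.yml (illustrative sketch)
# 1) Launch a temporary instance, 2) configure it over SSH, 3) snapshot it,
# roll the autoscaling group to the new image, and clean up.
- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - launch

- hosts: launched
  remote_user: ubuntu
  become: yes
  roles:
    - workstation

- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - create-ami
    - create-launch-config
    - auto-scaling
    - delete-old-launch-configurations
    - delete-old-amis
    - terminate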

Execute your play by typing the following command:

ansible-playbook -vv -e group_name=test deployworkstation.yml

After your playbook has run, you should see output like the following if everything was successful:
Screen Shot 2016-01-25 at 4.30.18 AM

Conclusion

Congratulations!  You’ve now successfully provisioned an immutable autoscaling group in Amazon Web Services!  If you run the playbook again, it will create a new AMI, and perform a rolling deploy/upgrade to the new image.  One of the beautiful things about immutable infrastructure is that, when you need to patch or upgrade your system, you don’t have to touch the existing server – you simply run the automation that created it, and you get a brand new immutable image, updated to the latest security patches and versions.

In future articles, we’ll continue to expand our playbook with more functionality beyond simply provisioning immutable workstations in AWS.

Immutable Infrastructure with AWS and Ansible – Part 2 – Workstation

Introduction

In the first part of this series, we set up our workstation so that it could communicate with the Amazon Web Services APIs, and set up our AWS account so that it was ready to provision EC2 compute infrastructure.  In this section, we’ll start building our Ansible playbook that will provision immutable infrastructure.

Update:  The source code for this playbook is on Github here:  https://github.com/lyoungblood/immutable

Ansible Dynamic Inventory

When working with cloud resources, Ansible can use a dynamic inventory system to find and configure all of your instances within AWS, or any other cloud provider.  In order for this to work properly, we need to set up the EC2 external inventory script in our playbook.

  1. First, create the playbook folder (I named mine ~/immutable) and the inventory folder within it:
    mkdir -p ~/immutable/inventory;cd ~/immutable/inventory
  2. Next, download the EC2 external inventory script from Ansible:
    wget https://raw.github.com/ansible/ansible/devel/contrib/inventory/ec2.py
  3. Make the script executable by typing:
    chmod +x ec2.py
  4. Configure the EC2 external inventory script by creating a new file called ec2.ini in the inventory folder alongside the ec2.py script.  If you specify the region you are working with, you will significantly decrease the execution time, because the script will not need to scan every EC2 region for instances.  A minimal ec2.ini configured to use the us-east-1 region is sketched just after this list.
  5. Next, create the default Ansible configuration for this playbook by editing a file named ansible.cfg in the root of the playbook directory (in our example, ~/immutable).  A minimal example is also sketched after this list.
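Here is a minimal ec2.ini along the lines of what step 4 describes; only a handful of the script’s many options are shown, and the values are illustrative:

[ec2]
regions = us-east-1
regions_exclude = us-gov-west-1, cn-north-1
destination_variable = public_dns_name
vpc_destination_variable = ip_address
route53 = False
rds = False
cache_path = ~/.ansible/tmp
cache_max_age = 300

And a minimal ansible.cfg for step 5, pointing Ansible at the dynamic inventory and at the SSH key created in part 1 (adjust the key path to wherever you saved yours):

[defaults]
inventory = ./inventory
remote_user = ubuntu
private_key_file = ~/.ssh/immutable.pem
host_key_checking = False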

To test and ensure that this script is working properly (and that your boto credentials are set up properly), execute the following command:
./ec2.py --list
You should see the following output:
Screen Shot 2016-01-22 at 12.06.19 PM

Ansible Roles

Roles are a way to automatically load certain variables and tasks into an Ansible playbook, and allow you to reuse your tasks in a modular way.  We will heavily (ab)use roles to make the tasks in our playbook reusable, since many of our infrastructure provisioning operations will use the same tasks repeatedly.

Group Variables

The following group variables will apply to any tasks configured in our playbook, unless we override them at the task level.  This will allow us to specify a set of sensible defaults that will work for most provisioning use cases, but still have flexibility to change them when we need it.  Start by creating a folder for your playbook (I called it immutable, but you can call it whatever you like), then create a group_vars folder underneath it:
cd ~/immutable; mkdir group_vars
Now, we can edit the file all.yml in the group_vars folder we just created.  Please note that indentation is significant in YAML syntax:
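A minimal group_vars/all.yml might look like the following; the variable names are our own conventions (the role sketches later in this series reference them), so adjust values like the key pair and security group to match your account:

---
# group_vars/all.yml (illustrative defaults; override per role or at run time)
region: us-east-1         # AWS region to provision in
zone: us-east-1a          # comment this out to spread instances across zones
keypair: immutable        # the EC2 key pair created in part 1
security_groups: default  # security group applied to launched instances
instance_type: t2.micro   # small instance used only while baking the image
min_size: 1               # autoscaling group sizing, used later in the series
max_size: 1
desired_capacity: 1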

Now that our group_vars are setup, we can move onto creating our first role.

The Launch Role

The launch role performs an important first step: it searches for the latest Ubuntu 14.04 LTS (long-term support) AMI (Amazon Machine Image) published by Canonical, the creators of Ubuntu, then launches a new EC2 compute instance in the region and availability zone specified in our group_vars file.  Note that it launches a very small instance type (t2.micro), because the instance only lives for a short time while it is configured by subsequent tasks and then baked into a golden master AMI snapshot that lives in S3 object storage.

A quick note about Availability Zones: if you comment out the zone variable in our group_vars file, your instances will be launched in a random zone within the region specified.  This can be useful if you want to ensure that an outage in a single AZ doesn’t take down every instance in your auto-scaling group, but there is a trade-off as data transfer between zones incurs a charge, so if your database, for example, is in another zone, you’ll pay a small network bandwidth fee to access it.

Create a new folder under your playbook directory called roles, and create a launch folder within it, then create a tasks folder under that, then edit a file called main.yml in this tasks folder:
mkdir -p roles/launch/tasks
Now, put the following contents in the main.yml file:
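Here is a minimal sketch of what the launch tasks might contain. It uses the ec2_ami_find and ec2 modules from Ansible 2.0; the variables (region, zone, keypair, security_groups, instance_type, group_name) are assumed to come from group_vars or the command line, and the in-memory "launched" group is our own convention for handing the new instance to the next play:

---
# roles/launch/tasks/main.yml (illustrative sketch)
- name: find the latest Ubuntu 14.04 LTS AMI published by Canonical
  ec2_ami_find:
    region: "{{ region }}"
    owner: "099720109477"   # Canonical's AWS account id
    name: "ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-*"
    sort: name
    sort_order: descending
    sort_end: 1
  register: ubuntu_ami

- name: launch a temporary instance from that AMI
  ec2:
    region: "{{ region }}"
    zone: "{{ zone }}"
    image: "{{ ubuntu_ami.results[0].ami_id }}"
    instance_type: "{{ instance_type }}"
    key_name: "{{ keypair }}"
    group: "{{ security_groups }}"
    instance_profile_name: noaccess
    instance_tags:
      Name: "{{ group_name }}-amibuild"
    wait: yes
  register: ec2

- name: remember the instance id for the roles that run later
  set_fact:
    ec2_id: "{{ ec2.instances[0].id }}"

- name: add the new instance to an in-memory group for the next play
  add_host:
    name: "{{ ec2.instances[0].public_dns_name }}"
    groups: launched

- name: wait for SSH before we try to configure the instance
  wait_for:
    host: "{{ ec2.instances[0].public_dns_name }}"
    port: 22
    delay: 10
    timeout: 320
    state: started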

You’ll notice that this launch role also waits for the instance to boot by waiting for port 22 (ssh) to be available on the host. This is useful because subsequent tasks will use an ssh connection to configure the system, so we want to ensure the system is completely booted before we proceed.

The Workstation Role

Now that we have a role that can launch a brand new t2.micro instance, our next role will allow us to configure this instance to be used as a workstation.  This workstation configuration will be fairly simple, but you can easily customize it as much as you want later.  It is mainly intended to illustrate how you would configure the golden image.

There are two directories we need to create for this role: the tasks directory and the files directory, since there is an init script we want to place on the workstation that will create a swap file on first boot:
mkdir -p roles/workstation/tasks;mkdir roles/workstation/files
Next, we’ll create the task:
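A sketch of roles/workstation/tasks/main.yml follows. The package list is just an example of the kind of customization you might do; the important part is copying the aws-swap-init script described in the next section and enabling it so it runs at boot:

---
# roles/workstation/tasks/main.yml (illustrative sketch)
- name: install a few useful workstation packages
  apt:
    name: "{{ item }}"
    state: present
    update_cache: yes
  with_items:
    - git
    - tmux
    - htop
  become: yes

- name: copy the swap-file init script onto the instance
  copy:
    src: aws-swap-init
    dest: /etc/init.d/aws-swap-init
    owner: root
    group: root
    mode: 0755
  become: yes

- name: enable the swap-file init script at boot
  service:
    name: aws-swap-init
    enabled: yes
  become: yes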

Initializing a swap file automatically

When you provision the Ubuntu 14.04 LTS instance, it won’t have a swap file by default.  This is a bit risky, because if you run out of memory your system could become unstable.  The script below should be placed in roles/workstation/files/aws-swap-init; the task above copies it to your workstation during the configuration process, so that a swap file is created when the system boots for the first time.
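A minimal version of such an init script might look like this; the swap file size and path are illustrative, and the LSB header is what lets the service module enable it at boot:

#!/bin/sh
### BEGIN INIT INFO
# Provides:          aws-swap-init
# Required-Start:    $local_fs
# Required-Stop:
# Default-Start:     2 3 4 5
# Default-Stop:
# Short-Description: Create and enable a swap file on first boot
### END INIT INFO
# roles/workstation/files/aws-swap-init (illustrative sketch)

SWAPFILE=/var/swapfile
SIZE_MB=2048

case "$1" in
  start)
    # Only create the swap file the first time the system boots
    if [ ! -f "$SWAPFILE" ]; then
      dd if=/dev/zero of="$SWAPFILE" bs=1M count="$SIZE_MB"
      chmod 600 "$SWAPFILE"
      mkswap "$SWAPFILE"
      echo "$SWAPFILE none swap sw 0 0" >> /etc/fstab
    fi
    swapon -a
    ;;
  *)
    ;;
esac
exit 0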

The DeployWorkstation Play

Now, we’ll create a play that calls these tasks in the right order to provision our workstation and configure it.  This file will be created in the root of your playbook, and I named it deployworkstation.yml.
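A sketch of deployworkstation.yml at this stage of the series: the first play runs the launch role locally against the AWS API, and the second play configures the new instance over SSH, using the in-memory "launched" group that the launch role populated:

---
# deployworkstation.yml (illustrative sketch)
- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - launch

- hosts: launched
  remote_user: ubuntu
  become: yes
  roles:
    - workstation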

Testing our Playbook

To test our work so far, we simply need to execute it with Ansible and see if we are successful:
ansible-playbook -vv -e group_name=test deployworkstation.yml
You should see some output at the end of the playbook run like this:
Screen Shot 2016-01-22 at 10.16.05 PM

Next, connect to your instance with ssh by typing the following command:
ssh ubuntu@ec2-52-90-220-60.compute-1.amazonaws.com
(hint: substitute the public hostname from your own playbook output)

You should see something like the following after you connect with ssh:
Screen Shot 2016-01-22 at 10.17.01 PM

That’s it!  You’ve now created a workstation in the Amazon public cloud.  Be sure to terminate the instance you’ve created so that you don’t incur any unexpected fees.  You can do this by navigating to the EC2 (top left) dashboard from the AWS console, then selecting any running instances and choosing to terminate them:
Screen Shot 2016-01-22 at 10.18.32 PM

After selecting to Terminate them from the Instance State menu, you’ll need to confirm it:
Screen Shot 2016-01-22 at 10.18.52 PM

Now that you’ve terminated any running instances, in the next part, we’ll learn how to create snapshots, launch configurations, and auto-scaling groups from our immutable golden master images.

Immutable Infrastructure with AWS and Ansible – Part 1 – Setup

Introduction

Immutable infrastructure is a very powerful concept that brings stability, efficiency, and fidelity to your applications through automation and the use of successful patterns from programming.  The general idea is that you never make changes to running infrastructure.  Instead, you ensure that all infrastructure is created through automation, and to make a change, you simply create a new version of the infrastructure and destroy the old one.  Chad Fowler was one of the first to mention this concept on his blog, and I believe it resonates with anyone who has spent a significant amount of time doing system administration:

“Why? Because an old system inevitably grows warts…”

They start as one-time hacks during outages. A quick edit to a config file saves the day. “We’ll put it back into Chef later,” we say, as we finally head off to sleep after a marathon fire fighting session.

Cron jobs spring up in unexpected places, running obscure but critical functions that only one person knows about. Application code is deployed outside of the normal straight-from-source-control process.

The system becomes finicky. It only accepts deploys in a certain manual way. The init scripts no longer work unless you do something special and unexpected.

And, of course the operating system has been patched again and again (in the best case) in line with the standard operating procedures, and the inevitable entropy sets in. Or, worse, it has never been patched and now you’re too afraid of what would happen if you try.

The system becomes a house of cards. You fear any change and you fear replacing it since you don’t know everything about how it works.  — Chad Fowler – Trash Your Servers and Burn Your Code: Immutable Infrastructure and Disposable Components

Requirements

To begin performing immutable infrastructure provisioning, you’ll need a few things first.  You need some type of “cloud” infrastructure.  This doesn’t necessarily mean you need a virtual server somewhere in the cloud; what you really need is the ability to provision cloud infrastructure with an API.  A permanent virtual server running in the cloud is the very opposite of immutable, as it will inevitably grow the warts Chad mentions above.

Amazon Web Services

For this series, we’ll use Amazon Web Services as our cloud provider.  Their APIs and services are frankly light years ahead of the competition.  I’m sure you could provision immutable infrastructure on other public cloud providers, but it wouldn’t be as easy, and you might not have access to the wealth of features and services available that can make your infrastructure provisioning non-disruptive with zero downtime.  If you’ve never used AWS before, the good news is that you can get access to a “free” tier that gives you limited amounts of compute resources per month for 12 months.  750 hours a month of t2.micro instance usage should be plenty if you are just learning AWS in your free time, but please be aware that if you aren’t careful, you can incur additional charges that aren’t covered in your “free” tier.

Ansible

The second thing we’ll need is an automation framework that allows us to treat infrastructure as code.  Ansible has taken the world by storm due to its simplicity and rich ecosystem of modules that are available to talk directly to infrastructure.  There is a huge library of Ansible modules for provisioning cloud infrastructure.  The AWS specific modules cover almost every AWS service imaginable, and far exceed those available from other infrastructure as code tools like Chef and Puppet.

OS X or Linux

The third thing we’ll need is an OS X or Linux workstation to do the provisioning from.  As we get into the more advanced sections, I’ll demonstrate how to provision a dedicated orchestrator that can perform provisioning operations on your behalf, but in the short term, you’ll need a UNIX-like operating system to run things from.  If you’re running Windows, you can download VirtualBox from Oracle and Ubuntu Linux from Canonical, then install Ubuntu Linux in a VM.  The following steps will get your workstation set up properly to begin provisioning infrastructure in AWS:

Mac OS X Setup

  1. Install Homebrew by executing the following command:
    ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    You should see output like the following:Screen Shot 2016-01-08 at 11.14.05 AM
  2. Install Ansible with Homebrew by executing the following command:
    brew install ansible
    You should see output like the following: (note, I’m actually running Ansible 2.0.0.2 now, but this output was for an older version; use Ansible 2.0+ as it’s the future 🙂 )
    Screen Shot 2016-01-08 at 11.14.32 AM
  3. Install the AWS Command Line Interface (CLI) with Homebrew by executing the following command:
    brew install awscli
    You should see output like the following:
    Screen Shot 2016-01-08 at 2.15.32 PM
  4. Install wget through homebrew by executing the following command:
    brew install wget
    You should see output like the following:
    Screen Shot 2016-01-22 at 11.56.37 AM

Linux Setup

  1. Install Ansible by executing the following command:
    sudo pip install ansible
  2. Install the AWS Command Line Interface (CLI) by executing the following command:
    sudo pip install awscli
  3. Install wget using your package manager.

Generic Workstation Setup

These steps need to be followed whether you’re running a Mac or Linux for your workstation.

  1. Install a good text editor.  My favorite is Sublime Text 2, but you can use whatever you want.
  2. Install the yaegashi.blockinfile Ansible role from Ansible galaxy.  This is a very useful role that will allow us to add blocks of text to configuration files, rather than simply changing single lines.  Type the following command to install it:
    sudo ansible-galaxy install yaegashi.blockinfile
    You should see output like the following:
    Screen Shot 2016-01-08 at 11.24.55 AM

Amazon Setup

There are a few things you’ll need to begin provisioning infrastructure in your AWS account.  First, you’ll need to make sure the default security group in your VPC allows traffic from your workstation.  This is necessary because Ansible will configure your EC2 compute instances over SSH, and needs network connectivity to them from your workstation.

  1. Login to your AWS Console and select VPC from the bottom left of the dashboard.
  2. Click on Security Groups on the bottom left hand side under Security.
  3. Select/highlight the security group named “default”, and select the Inbound Rules tab.  Click the Edit button, then click Add another rule, and for the rule type, select “ALL Traffic”, and insert your workstation’s Internet IP address, with a /32 at the end to indicate the CIDR netmask.  If you don’t know your workstation’s true Internet IP address, you can find it at this website.
    Screen Shot 2016-01-08 at 3.34.08 PM
    Note: I blanked my IP address in the image above.
  4. Click Save to Save the Inbound Rules.
  5. Go back to the AWS Console dashboard, and click “Identity & Access Management.”  It is located towards the middle of the second column, under Security & Identity.
  6. Click on Users on the left, then click “Create New Users.”  Enter a username for yourself, and leave the checkbox selected to Generate an access key for each user.  Click the Create button:
    Screen Shot 2016-01-09 at 5.49.04 PM
  7. Your AWS credentials will be shown on the next screen.  It’s important to save these credentials, as they will not be shown again:
    Screen Shot 2016-01-09 at 5.49.31 PM
  8. Using your text editor, create a file named ~/.boto containing the credentials you were just given, in the format shown in the first sketch after this list.
  9. At the command line, execute the following command, and input the same AWS credentials, along with the AWS region you are using:
    aws configure
    For most of you, this will be either “us-east-1” or “us-west-1”.  If you’re not in the US, use this page to determine what your EC2 region is.
  10. Click on Groups, then click “Create New Group”:
    Screen Shot 2016-01-09 at 5.59.18 PM
  11. Name the group PowerUsers, then click Next:
    Screen Shot 2016-01-09 at 5.59.35 PM
  12. In the Attach Policy step, search for “PowerUser” in the filter field, and check the box next to “PowerUserAccess”, then click “Attach Policy”:
    Screen Shot 2016-01-09 at 6.00.09 PM
  13. Click Next to Review, and save your group.
  14. Select/Highlight the PowerUsers group you’ve just created, and click Actions, then “Add Users to Group”:
    Screen Shot 2016-01-09 at 6.00.41 PM
  15. Select the user account you just created, and add that user to the group:
    Screen Shot 2016-01-09 at 6.00.59 PM
  16. Now, we’ll need to create an IAM policy that gives zero access to any of our resources.  The reason for this is that we’ll be provisioning EC2 instances with an IAM policy attached, and if those instances get compromised, we don’t want them to have permission to make any changes to our AWS account.  Click Policies on the left hand side (still under Identity & Access Management), then click Get Started:
    Screen Shot 2016-01-09 at 6.06.26 PM
  17. Click Create Policy:
    Screen Shot 2016-01-09 at 6.06.39 PM
  18. Select “Create Your Own Policy” from the list:
    Screen Shot 2016-01-09 at 6.07.11 PM
  19. Give the policy a name, "noaccess", and a description, then paste a deny-all policy document into the policy editor (a sketch appears after this list).
  20. Click Validate Policy at the bottom.  It should show “This policy is valid,” as you see below:
    Screen Shot 2016-01-10 at 7.27.10 AM
  21. Click Create Policy, then click Roles on the left-hand side of the screen.
    Screen Shot 2016-01-12 at 9.01.41 AM
  22. Click Create New Role, then type in a role name, “noaccess”:
    Screen Shot 2016-01-12 at 9.01.54 AM
  23. Under the Select Role Type screen, select “Amazon EC2”:
    Screen Shot 2016-01-12 at 9.02.06 AM
  24. On the Attach Policy screen, filter for the “noaccess” policy we just created, and check the box next to it to select it:
    Screen Shot 2016-01-12 at 9.02.22 AM
  25. On the Review screen, click the Create Role button at the bottom right:
    Screen Shot 2016-01-12 at 9.02.33 AM
  26. Now, go back to the main screen of the AWS console, and click EC2 in the top left.
  27. Click “Key Pairs” under the Security section on the left:
    Screen Shot 2016-01-12 at 1.38.35 PM
  28. Click “Create Key Pair”, then give the Key Pair a name:
    Screen Shot 2016-01-12 at 1.38.56 PM
  29. The private key will now be downloaded by your browser.  Save this key in a safe place, like your ~/.ssh folder, and make sure it can’t be read by other users by changing the mode on it:
    mv immutable.pem ~/.ssh
    chmod 600 ~/.ssh/immutable.pem
  30. Run ssh-agent, and add the private key to it, by executing the commands shown in the sketch after this list.

    You should see output like the following:
    Screen Shot 2016-01-12 at 1.49.16 PM
  31. Next, install pip using the following command:
    sudo easy_install pip
    You should see output like the following:
    Screen Shot 2016-01-22 at 11.52.01 AM
  32. Then, install boto using the following command:
    sudo pip install boto
    You should see output like the following:
    Screen Shot 2016-01-22 at 11.54.04 AM
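For reference, here are minimal sketches of the three items referenced above; treat them as illustrative examples rather than exact copies.

The ~/.boto file simply holds your AWS credentials in INI format (substitute your own keys):

[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

The "noaccess" policy document only needs to deny everything; a deny-all policy along these lines should validate:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*"
    }
  ]
}

To start ssh-agent and add the private key we just saved, something like the following works:

eval $(ssh-agent)
ssh-add ~/.ssh/immutable.pem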

The setup of your environment is now complete.  To test and ensure you can communicate with the AWS EC2 API, execute the following command:
aws ec2 describe-instances

You should see output like the following:
Screen Shot 2016-01-12 at 10.00.17 AM

In the next article, we’ll begin setting up our Ansible playbook and provisioning a test system.

Automating NSX Security Groups with vCAC/vRA

Brian Ragazzi posts a great 2 part blog on how to automate both provisioning and the day 2 operation of moving VMs into NSX security groups defined in Service Composer:

Part 1:

https://brianragazzi.wordpress.com/2015/03/23/automating-nsx-security-groups-with-vcacvra-part-1/

Part 2:

https://brianragazzi.wordpress.com/2015/03/25/automating-nsx-security-groups-with-vcacvra-part-2/

vCAC to vRealize Automation Upgrade Notes

Introduction

Well, that was a long weekend!  My cloud management automation team and I started work bright and early Saturday morning, after notifying our business unit customers several days before that the portal would be down on Saturday while we completed the upgrade.  We spent approximately 30 hours over the weekend pushing through the upgrade.  Most of the time was simply due to the large scale of our deployment (thousands of VMs, over 1,000 blueprints, 219 business groups, and hundreds of entitlements), and the need to do proper testing to ensure our customers have a healthy environment on Monday morning.

Here are some notes that I took during the upgrade process – it is anything but simple, and a lot of improvements could be made by the vRealize engineering team to enhance the customer experience.

Pre-Migration Notes

First, take snapshots of every VM, source and target.  Take snapshots before and after Pre-Migration, and before and after Migration.  This will allow you to completely rollback your environment in case something goes horribly wrong.  Take SQL database backups as well, prior to the snapshot, so they are included in the snapshot.

Preparing the Source System

Before you begin the Pre-Migration and make any changes to the environments, understand that any lease extension requests or machine approval requests in flight will be lost during the migration process.  For us this meant that we went ahead and approved any pending lease extension requests, notifying the Business Group Managers that we had done so, as a lost lease extension request could cause expiration of a machine.  We deemed that new machine requests were less critical, as the owner could simply request the machine again.

Allow all pending workflows to complete, and uninstall any custom workflow stubs by using cloudutil from the CDK.

Self-Signed SSL Certificates and Pre-Migration

If you are migrating and want to keep your portal website URL the same, you’ll need to do this:

  1. Generate a self-signed certificate on your existing IaaS server.
  2. Import the self-signed certificate into both the existing IaaS server, as well as your new IaaS server, so that it is a trusted certificate.  If you’re not sure how to do this, see this article.
  3. Edit the binding in IIS manager for port 443 to use the self-signed certificate, instead of your current certificate.
  4. You’ll also need to edit both the Manager.config and Web.config files for the Manager service and Repository service to point to the FQDN used for the self-signed certificate, then restart services, recycle IIS application pools, and do an “iisreset.”
  5. Verify that you can browse to your existing vCAC portal, using the FQDN, from the new vRA IaaS server, and that:
    1. You can still load the portal (Model Manager/Repository is working).
    2. The SSL certificate is trusted and you don’t receive any security warnings.

The reason you have to do this is that both the Pre-Migration and the actual Migration require a trusted SSL certificate on the source system.  Both our source system and target system will use “onecloud.mckesson.com” for the URL, but during the migration, I need to address source and target individually.  I can’t just hack around this with a hosts file, because then the FQDN won’t match the SSL certificate’s common name, which makes the SSL certificate untrusted.

Once you have the SSL trust re-established using self-signed certificates, you can proceed with Pre-Migration.

Database Cleanup

One of the byproducts of having a long-running vCAC installation (ours has been up since 2013) is that you will inevitably have orphaned work items in the dbo.WorkItems table.  Since these work items will likely be months or even years old, they are never going to complete successfully, and can be safely dropped.  First, confirm that there are not a lot of recent work items in the table, and after doing so, you can safely delete them:

DELETE from dbo.WorkItems

Performing the Migration

During the actual Migration, point the target system to the primary IaaS server, not the load balancer.  It needs a self-signed certificate that is trusted, so you might need to update the IIS binding, as you did above on the source IaaS server.

After the Migration

One of the first things you need to do after the Migration, when you bring up all of the components in the new system, is to do an Inventory Data Collection across all of your Compute Resources so that the MoRef (Managed Object Reference) of each VM can be updated in the vRA database.  I found a trick to do this automatically without having to manually kick it off from the portal:

UPDATE dbo.DataCollectionStatus SET LastCollectedTime=NULL, CollectionStartTime=NULL

What this does is set the LastCollectedTime and CollectionStartTime to NULL for each Compute Resource, so that vRA will immediately initiate a new Inventory Data Collection on each, just as if those Compute Resources were brand new and had never run data collection.

Guest Agent Reconfiguration

If you’re upgrading from 5.x to 6.x, you’ll need to update your Guest Agents to point to the IaaS load balancer, instead of the primary portal URL.  This is due to the architectural change that placed the tcServer in front of the IaaS tier in the application.  Here are steps to do this for each type of Guest Agent:

Windows Guest Agent

If you are using an older Windows Guest Agent, unfortunately, you’ll most likely have to upgrade to the new version that is a port of the Linux Guest Agent.  Download the new Guest Agent from https://vra-appliance/i/ and be sure to Unblock the downloaded file… otherwise UAC will block the Guest Agent from running and you might waste a lot of time trying to figure out why (I know I did…).  Right-click the Zip file you downloaded from the vRA Appliance, select Properties, then click Unblock:

Screen Shot 2015-03-04 at 3.27.58 PM

By the way, this didn’t seem to be as much of a problem in 5.x, since the file came from the vCAC binaries.  Now that you get the file from the website of the vRA Appliance, unless that website is in your Trusted Websites list, it will get blocked from execution automatically.

To uninstall the old Windows Service, run the following command:

C:\VRMGuestAgent\winservice.exe -u

To install the new Windows Service, run the following command:

C:\VRMGuestAgent\winservice.exe -i -p SSL -h iaas_load_balancer:443

If your SSL certificate has changed, you can remove the cert.pem file from C:\VRMGuestAgent, and the Guest Agent will automatically download the cert.pem from the IaaS load balancer upon execution.

Linux Guest Agent

We didn’t have to update the Linux Guest Agent from the 5.2 version.  It still seems to work fine, however, you will need to reconfigure it to point at the IaaS load balancer:

# rm -rf /usr/share/log
# cd /usr/share/gugent
# ./installgugent.sh iaas_load_balancer:443 ssl

Please note that the documentation for Linux Guest Agent installation is actually incorrect and will be updated (we filed a PR with VMware) to reflect this correct way to install the Linux Guest Agent.  The document says you should put a hyphen in front of the IaaS hostname, however, the hyphen is not necessary and will actually cause connection to fail because the hostname is invalid.

Success!

I hope you have a successful migration.  We were able to complete ours, however, we still have a number of PRs that are open with VMware and are awaiting some critical patches.  The good news is that if you are performing your upgrade after 6.2 Service Pack 1 is released, I am told that they will try to get most of the patches we’ve had created into that release, so the product should be much more stable.  Here is our shiny new Service Catalog:

Screen Shot 2015-03-04 at 1.12.44 PM

vRealize Automation 6.2 High Availability

Introduction

For the past few months, my cloud automation team and I have been very focused on accomplishing one of the most difficult tasks we’ve faced: upgrading from vCloud Automation Center 5.2 to vRealize Automation 6.2.  This upgrade is especially challenging, because the 5.x version was really a relic of the DynamicOps product that VMware acquired, and 6.x was almost a completely new product, with the old DynamicOps .NET code remaining in the background as the IaaS components.  Because the new product is based on VMware appliances running SuSE Enterprise Linux and vFabric tcServer, and the old product was based on Windows .NET, you can probably imagine that achieving a highly available design requires a completely different architecture.

As a starting point, we read this document:

Which Deployment Profile is Right for You?

In the vCAC/vRA reference architecture, there are three sizes of deployment profile: small, medium, and large.

Small Deployment

  • 10,000 managed machines.
  • 500 catalog items.
  • 10 concurrent deployments.
  • 10 concurrent application deployments:
    • Each deployment has approximately 3 to 14 VM nodes.

Medium Deployment

  • 30,000 managed machines.
  • 1,000 catalog items.
  • 50 concurrent deployments.
  • Up to 20 concurrent application deployments:
    • Each deployment has approximately 3 to 14 VM nodes.

Large Deployment

  • 50,000 managed machines.
  • 2,500 catalog items.
  • 100 concurrent deployments.
  • 40 concurrent application deployments:
    • Each deployment has approximately 3 to 14 VM nodes.

For McKesson, we chose to go with a Medium Deployment profile.  This meets our current needs, however, at some point in the future we may need to transition to a large deployment.  It’s worth noting that you can upgrade between deployment models by adding more components to an existing configuration, so you’re not stuck with what you started with.

Design Requirements

Here are some design requirements for our vRealize Automation deployment:

  1. Single Outside URL – we must present all our cloud services with a single URL to our customers, https://onecloud.mckesson.com (note: this isn’t Internet accessible; we’re not a public provider).  This keeps it simple for our customers and they don’t need to remember anything more than “OneCloud” to get to any of our current and future cloud services.
  2. Security – We must provide a high degree of security by encrypting all traffic through SSL using AES-256 bit ciphers where supported by clients, and disabling all weak ciphers that are vulnerable to exploit (Poodle, etc).
  3. Custom Portal Content Redirection – Not every cloud service that we deliver is done through vRA, we must present custom content that we’ve developed in-house that resides on a different set of web servers, using layer 7 content redirection.
  4. Support High Availability for vRA Appliances – Since the majority of front-end requests go directly to the vRA appliances, we need the vFabric tcServer instances running there to be highly available.
  5. Support High Availability for IaaS Components – The IaaS components also provide critical functionality, and while the Manager service itself can only run active/standby, the other services such as DEM workers and vSphere agents can run active/active.
  6. Support High Availability for SSO – The SSO service and authentication must also be highly available.
  7. Support High Availability for vRealize Orchestrator – Workflows and automations executed during provisioning or machine lifecycle state changes are critical to the availability of the overall solution, and must be made highly available.

High Availability Design

Below, you’ll see our vRealize Automation HA design:

vRealize Automation High Availability Design

vRealize Automation High Availability Design – IP addresses removed to protect the innocent :-)

There are a few design decisions to note, that I’ll talk through below.

Load Balancer Placement

We have three load balancers in our design:

  • One in front of the vRealize Automation Appliances, performing SSL termination with an Intranet SSL certificate on port 443/HTTPS.
    • This load balancer also listens to port 7444 and sends traffic to the SSO backend, again, terminating SSL with the same Intranet certificate.
    • This load balancer also delivers custom content from another set of web servers, which I’ll explain later.
  • One in front of the IaaS servers on port 443/SSL.
  • One in front of our vRealize Orchestrator Cluster (not shown) on port 443/HTTPS, terminating SSL with a different Intranet SSL certificate.

As a side note, vRealize Automation 6.2 only supports vRealize Orchestrator 6.x, which is not publicly available – you will need to contact GSS to get a pre-release build.

One Load Balancer to Rule Them All…

…and in the OneCloud bind them!  Ok, that’s enough of my Lord of the Rings jokes.  I did want to mention the reason why the front-end load balancer serves 3 purposes. We want to present a single front-end URL to our customers, onecloud.mckesson.com (not publicly available), which gives them a portal to all of our cloud services.  Sometimes we can’t do everything we want in the vRealize Automation portal, and have to develop custom mini-portals of our own:

  • Support ticketing portal.
  • CloudMirror – our Disaster Recovery as a Service portal.
  • Update virtual machine support contacts portal – to let our operations team know who to contact when your VM has application or guest OS issues.
  • Documentation and training materials.

So, we do some custom content switching in order to achieve the goal of providing a single, easily remembered URL to our customers.  I’ll go over how this is done in a bit.

Another reason is that we want the SSO experience to be seamless to our customers, and redirecting them to another server, with a different SSL certificate, that asks for their Active Directory credentials might cause our more security conscious customers to head for the exits.

HAproxy Instead of NSX Edge Load Balancer

We made the design decision to leverage HAproxy, an excellent open source load balancer, rather than using the NSX Edge load balancer.  Why did we do this?  We need to provide Disaster Recovery of our complete solution by replicating it to another datacenter (we use Zerto for this).   While Zerto can easily replicate the VMware appliances and IaaS components to our secondary datacenter, the NSX Edge components are managed by a single NSX Manager and control cluster, which is tied to our primary datacenter’s vCenter.  If we fail them over to the other datacenter, they become unmanageable.  In order to achieve better recoverability and manageability, we deploy HAproxy in Linux VMs and replicate them along with the VMware appliances and IaaS components, as a single Virtual Protection Group, to our secondary datacenter.  This allows us to failover our entire cloud management platform in minutes in the event of a datacenter outage, with no dependency on the NSX management components in our primary datacenter.

Three vRealize Automation Appliances?

In the vCAC Reference Architecture document published by VMware, they have 2 appliances, and the vPostgres database on one is active and the other is standby.  In our design, we decided to disable the vPostgres database on two vRA appliances, and deploy a third vRA appliance that is only running vPostgres, and has the tcServer and other front-end services disabled.  This was a design decision we made in order to obtain a more modular architecture, and scale the performance of the data storage layer separately from the front-end servers.

We also considered deploying a 4th vRA appliance to serve as a standby vPostgres database server and increase availability of the solution, however, we decided that vSphere HA is fast enough to provide the recoverability requirement of our environment, and did not want to reduce manageability by introducing a more complex clustered database setup.

We also considered deploying a separate Linux server running Postgres, however, we decided to use the vRA appliance version, as it would be fully supported by VMware GSS.

Scale-up or Scale-out the IaaS Components?

In our vCAC 5.2 design, we had separated out all components of the IaaS solution, such as DEM workers and vSphere Agents, in order to increase the scalability of the solution:

vCAC 5.2 HA Architecture

vCAC 5.2 High Availability Architecture – IP addresses and hostnames removed to protect the innocent.

What we discovered after years of operation, deploying and managing thousands of VMs across 2 datacenters, was that the DEM workers and vSphere Agents are idle almost all the time.  By idle I mean only a few hundred MB of RAM consumed and less than 5% CPU usage all the time.  They really don’t require dedicated servers, and this recommendation seems to be from the vSphere 4.x days when 4 vCPU and 8GB of RAM was considered a large VM.

We made the design decision to combine the DEM workers and vSphere Agents onto a single pair of IaaS servers running the Manager service in active/standby.  This simplifies the solution and reduces complexity, increasing manageability.  We are confident we can scale these servers up if needed, and are starting with 8 vCPU and 12GB of RAM each.  In addition, we can always add a 3rd or 4th and scale-out if this ever becomes an issue.  This decision differs from the reference architecture, so I wanted to explain our reasoning.

High Availability Deployment and Configuration

To deploy the HA configuration, we need to make sure we have taken care of some things first:

  1. Deploy your load balancers first, and make sure they are working properly before configuring any appliances.  This will save you from having to reconfigure them later.
  2. Take vSphere snapshots of all appliances and IaaS components prior to configuration.  This will give you an easy way to roll back the entire environment to a clean starting point in case something goes wrong.

HAproxy Global and Default Settings

vRealize Automation has some unique load balancer requirements.  Because the HTTP requests can be very large (more than the default 16K buffer size), we must tune HAproxy to increase the default buffer size to 64K.  If you don’t do this, you’re likely to get some HTTP 400 errors when requests that are too large get truncated by the load balancer.  We also set some reasonable defaults for maximum connections and client/server timeouts, as well as syslog settings.  Here are the settings we’re using:

global
log 127.0.0.1 local0
log 127.0.0.1 local1 notice
maxconn 2048
tune.ssl.default-dh-param 2048
tune.bufsize 65535
user nobody
group nobody

defaults
log global
option forwardfor
option http-server-close
retries 3
option redispatch
timeout connect 5000
timeout client 50000
timeout server 300000
maxconn 2048

HAproxy Load Balancer Front-End Configuration

First, let’s cover the front-end load balancer, which provides load balancing services to portal users.  We will need to configure front-ends for the following listeners:

  • 80/HTTP – redirecting all users to 443/SSL.

frontend www-http
bind *:80
mode http
reqadd X-Forwarded-Proto:\ http
default_backend vra-backend

  • 443/SSL – terminating SSL encryption for portal users.
    • This frontend must route all content to the vRA appliances, except our portal content located at https://onecloud.mckesson.com/portal.  I’m using an acl with path_beg (path begins with) regex to accomplish this:

frontend www-https
bind *:443 ssl crt /path/to/cert.pem ciphers ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:DHE-RSA-AES256-SHA:!NULL:!aNULL:!RC4:!RC2:!MEDIUM:!LOW:!EXPORT:!DES:!MD5:!PSK:!3DES
mode http
reqadd X-Forwarded-Proto:\ https
acl url_portal path_beg /portal
use_backend portal-https if url_portal
default_backend vra-backend
stats enable
stats uri /stats
stats realm HAproxy\ Statistics
stats auth username:password

Note: I’m not going to include the ciphers on every front-end, for ease of formatting this post.  If you’d like to understand the reason for choosing this exact set of ciphers, please read this excellent blog post Getting an A+ with SSLLabs, which describes how to achieve excellent browser compatibility while using the highest grade encryption possible.  Sorry Internet Explorer 6 on Windows XP users, you’re out of luck!  Time to upgrade your OS… or run Chrome or Firefox.

  • 7444/SSL – terminating SSL encryption for SSO.

frontend sso-https
bind *:7444 ssl crt /path/to/cert.pem ciphers (snip)
mode http
reqadd X-Forwarded-Proto:\ https
default_backend sso-backend

  • 5480/SSL – terminating SSL encryption for management of the vRA appliances.
    • Note that this is required in order to properly configure the vRA appliances for HA clustering.

frontend 5480-https
bind *:5480 ssl crt /path/to/cert.pem ciphers (snip)
mode http
reqadd X-Forwarded-Proto:\ https
default_backend vra-mgmt-backend

HAproxy Load Balancer Back-End Configuration

Now, here are the back-ends that the above configured front-ends connect to:

  • The main vRA appliance back-end on port 443.
    • Note that redirection to SSL on port 443 if port 80 (unencrypted) traffic is seen happens via the “redirect scheme…” command.
    • Also note that “balance source” and “hash-type consistent” ensure that traffic will remain sticky on the same back-end server (sticky sessions). If you have a network where client source IPs change frequently (mobile users, etc) you might need to use another persistence method.  vRA isn’t designed for mobile use (yet), so this isn’t a big concern on our network:

backend vra-backend
redirect scheme https if !{ ssl_fc }
balance source
hash-type consistent
mode http
server vr01 (ip address):443 check ssl verify none
server vr02 (ip address):443 check ssl verify none

  • The vRA management back-end on port 5480:

backend vra-mgmt-backend
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server vrmgmt01 (ip address):5480 check ssl verify none
server vrmgmt02 (ip address):5480 check ssl verify none

  • The SSO back-end on port 7444:

backend sso-backend
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server sso01 (ip address):7444 check ssl verify none
server sso02 (ip address):7444 check ssl verify none

  • The portal content back-end on port 443.
    • You might recall, this leads to our custom portal content on a different set of webservers:

backend portal-https
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server web01 (ip address):443 check ssl verify none
server web02 (ip address):443 check ssl verify none

IaaS Components Load Balancer Configuration

For the HAproxy load balancer that is front-ending the IaaS components, we have a much simpler configuration.  Because we don’t want to break Windows authentication protocols, we are actually doing TCP load balancing where the packets are transmitted directly to the back-end with no SSL termination or re-encryption at all.  This was required on vCAC 5.2 due to the way that the reverse proxy functionality breaks Windows authentication protocols, but I’m not so sure it’s required in vRA 6.x due to the way the vRA appliance redirects content from the IaaS components.  For now, we’ll keep this configuration simply because it works:

listen ssl_webfarm
bind *:443
mode tcp
balance source
hash-type consistent
option ssl-hello-chk
server iaas01 (ip address):443 check
server iaas02 (ip address):443 check backup

Note that when load balancing IaaS components, only one Manager service can be active at once, so the “backup” directive tells HAproxy to only send traffic to the secondary server if the primary server is down.

vRealize Orchestrator Cluster Load Balancer Configuration

The HAproxy load balancer that is front-ending the vRealize Orchestrator cluster is set up very similarly to the other primary front-end load balancer, with separate front-end and back-end configurations.

  • Front-end configuration:

frontend vro-https
bind *:80
bind *:443 ssl crt /path/to/cert.pem ciphers (snip)
mode http
reqadd X-Forwarded-Proto:\ https
default_backend vro-backend
stats enable
stats uri /stats
stats realm HAproxy\ Statistics
stats auth username:password

  • Back-end configuration:

backend vro-backend
redirect scheme https if !{ ssl_fc }
mode http
balance source
hash-type consistent
server vro01 (ip address):8281 check ssl verify none
server vro02 (ip address):8281 check ssl verify none
server vro03 (ip address):8281 check ssl verify none backup

This back-end configuration reflects our HA configuration for the vRO cluster, which I might go into more detail about in a future post.  Essentially, we have a two-node cluster with a single dedicated vPostgres database, and there is a 3rd standalone vRO appliance with its own vPostgres database, so that in the event of an outage on the primary database, we can still run workflows on the standalone appliance.  Thus, the "backup" directive will only send traffic to the standalone vRO appliance if the cluster is down.

vRealize Appliance Configuration

In order to configure the vRealize appliances properly, it’s important to do a few things.

Hosts File Settings on vRA Appliances

On all of the vRA appliances, add a hosts file entry that points your primary URL, in our case, onecloud.mckesson.com, at the IP address of your front-end load balancer.  You’ll need to login as root with ssh to do this, and it should look like this on each appliance when you’re done:

vr01:~ # cat /etc/hosts
(ip of load balancer) onecloud.mckesson.com
# VAMI_EDIT_BEGIN
127.0.0.1 localhost
(ip of local appliance) vr01.na.corp.mckesson.com vr01
# VAMI_EDIT_END

SSO Appliance Configuration

The SSO or VM identity appliances should be configured with the load balancer’s name as follows:
Screen Shot 2015-02-14 at 9.57.59 AM

You still need to join the SSO appliances to your Active Directory domain with their own hostname, but this hostname is what they present to the vRA appliances for URL redirection, so it needs to match your front-end load balancer.

vRealize Automation Appliance Configuration

On each vRA appliance, you’ll need to configure the SSO settings with the front-end load balancer name as well:
Screen Shot 2015-02-14 at 9.58.26 AM
Again, it’s important that the load balancer is up and running before you begin configuration of the SSO Settings on your vRA appliances.

IaaS Components Installation

When installing the IaaS components, it’s very important that you browse to the front-end load balancer address from your individual IaaS servers to download the IaaS components installer.  The URL will be encoded in the binary executable you download, and if you downloaded it directly from a vRA appliance, you won’t have a highly available configuration.

Conclusion

When all is said and done, we now have a highly available vRealize Automation 6.2 cloud management platform!  More importantly, we’ve met the design requirements for security with strong SSL encryption:

Screen Shot 2015-02-14 at 2.11.12 PM

And the user experience meets requirements by delivering a single outside URL for all services, including Single-Sign-On:

Screen Shot 2015-02-14 at 2.21.31 PM

We’re still in the midst of our upgrade and roll-out of vRealize Automation 6.2, so hopefully all goes well.  Database migration has turned out to be the most challenging aspect of our upgrade, rather than the design and installation itself.

Are you deploying vRA 6.2 in your environment?  Please let me know how it goes in the comments section.

Going over the Edge with your VMware NSX and Cisco Nexus

Brad Hedlund, from the VMware NSBU, just published a great article that really digs in deeply to all of the available Edge routing options available to NSX customers that are using Cisco Nexus physical network gear.

http://bradhedlund.com/2015/02/06/going-over-the-edge-with-your-vmware-nsx-and-cisco-nexus/

Which Edge topology are you using in your environment?

What’s new with vSphere Data Protection 6

I’m really digging all of the improvements with vSphere 6 that VMware is sharing during their 28 Days of February.  It seems like one of the biggest improvements in the release of vSphere 6 is that VMware decided to move all of the features of vSphere Data Protection Advanced into the core product, as long as you own vSphere 6.0 Essentials Plus Kit or higher editions.

This is huge for ELA customers like us, because it gives us the option of leveraging proven Avamar technology for backing up and restoring virtual machines, at no extra charge.  VMware has really fired the warning shot across the bow of Veeam Availability Suite with this release!

And, with VDP Advanced (now just called VDP), we can do some of the things we could previously do with Veeam, like application-consistent backups of SQL Server, Exchange, and SharePoint.  We can also truncate logs during the backup process for the above-mentioned databases.  File-level restore is now available, and even individual SQL tables can be restored through the vSphere Web Client:
Screen Shot 2015-02-12 at 8.49.59 PM
Another feature is automated backup verification, which can be done on every backup job.  This automates the process of booting the VM from backup storage in an isolated network bubble, and verifying that the guest OS boots by checking for VMware Tools heartbeats.

Altogether, I’m very excited about the possibilities of VDP in vSphere 6.0.  You can read more about it here:

http://blogs.vmware.com/vsphere/2015/02/whats-new-vdp6-and-vr6.html

Another Great Enterprise Feature in vSphere 6 – vMotion for MSCS

VMware’s blog covers details around another great enterprise feature in vSphere 6 – the ability to run shared disk Microsoft clusters and still use vMotion. This finally allows us to properly virtualize business critical apps that use SQL clustering to store data, without having to pin them to specific physical hosts and schedule downtime whenever we need to do maintenance.

http://blogs.vmware.com/apps/2015/02/say-hello-vmotion-compatible-shared-disks-windows-clustering-vsphere.html