
Tuesday, 17 November 2009

Behaviour Driven Infrastructure

I've been following the development of puppet for many years and this gem of a thread caught my attention recently. Martin Englund asks the Puppet Users mailing list:
how do you validate that puppet has done what it is supposed to, and, even more troublesome, how do you validate that it has done what you intended it to do?
This is something I've struggled with over the years with my JASS/SST-based jumpstart build system. I've gone so far as to automate the build testing process using buildbot, passing or failing a build by using regexps to search for errors in the installation output. But my testing ends when the console login prompt appears. Validating whether a system build functions as intended is beyond what I currently test, and even beyond the capabilities of puppet.

This is where Martin's Behaviour Driven Infrastructure (BDI) approach comes into play. Martin is using Cucumber, a Behaviour Driven Development testing tool, to describe a system's behaviour in natural language that is readable and easy to understand by non-technical users (e.g. your IT helpdesk or even your business stakeholders).

Where this really gets interesting is combining puppet, cucumber and a monitoring system such as Nagios to do Test Driven Infrastructure. For example, you can use cucumber-nagios to integrate your cucumber tests with Nagios, then write a test for a new feature you want the system to have, e.g. "Should be able to send email". Initially this would result in Nagios marking the system as having a fault because you haven't yet implemented the feature that passes the test. You then proceed to implement the feature using puppet such that the test passes.
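
To make this concrete, a first cut at such a feature might look like the sketch below, written as a shell snippet that drops a feature file into a cucumber-nagios project. The feature wording and step phrasing here are purely illustrative; cucumber-nagios ships its own step definitions, so the exact steps available will differ.
# Sketch: a hypothetical "should be able to send email" feature.
mkdir -p features
cat > features/email.feature <<'EOF'
Feature: Outbound email
  In order to alert administrators of problems
  The system
  Should be able to send email

  Scenario: The local MTA accepts connections
    When I connect to port 25
    Then I should see an SMTP banner
EOF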

Over time I would expect the test coverage to grow to the point where system behaviours like DNS resolution, LDAP authentication and host-based firewall policies are all tested. Any change to the system that broke one of these tests could be quickly pinpointed and fixed.

A perfect example in the Solaris world is the application of the latest recommended patches to a box. At a minimum it tends to break the sendmail and snmpd configs on my systems, and I know to manually back up these files before applying the patches and restore them afterwards. With BDI combined with a monitoring system, you could be alerted to these breakages, or to any others you weren't previously aware of, and rapidly respond to them.

Brilliant!

Tuesday, 1 April 2008

One rule to bind them - part III

Apologies for the length of time between this article and the last in the series. Home and work life have been hectic and haven't left much time for writing.

In the previous article in the series I wrote a how-to on DHCP booting a Solaris installation and passing custom DHCP options. In this article I'll conclude the series and explain what you can do with this information to make your installation hands-free.

What does a begin script do?

A begin script allows you to execute a series of commands before the installation commences, such as backing up files before an upgrade or creating a derived jumpstart profile. I'll be discussing the latter.

In my jumpstart environment a build list is generated which provides the minimum information about each Solaris build I maintain. It looks like this:
DNS:5.10:RNET:Domain Name Server
FIREWALL:5.9:CORE:Firewall-1
CORE:5.10:CORE:Hardened CORE
CORE:5.9:CORE:Hardened CORE
DEVELOPER:5.10:DEVELOPER:Hardened DEVELOPER
DEVELOPER:5.9:DEVELOPER:Hardened DEVELOPER
ENTIRE:5.10:ENTIRE:Hardened ENTIRE
ENTIRE:5.9:ENTIRE:Hardened ENTIRE
MINIMIZED:5.10:CORE:Hardened Minimized
MINIMIZED:5.9:CORE:Hardened Minimized
The first column corresponds to the JSBuild DHCP option that is sent to the client. The second column is the Solaris version. The third column is the Solaris metacluster (e.g. Core, Reduced Networking, User, Developer, Entire, Entire OEM). The fourth column is a short description of the build.
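
Splitting a chosen build's entry out of this list takes a couple of lines of shell. This is just a sketch: the build.list file name and the variable names are my own placeholders, ${BUILD} is determined as described below, and SI_CONFIG_DIR is the standard jumpstart variable pointing at the mounted configuration directory.
# Sketch: extract the fields for the chosen build from the build list.
ENTRY=`grep "^${BUILD}:" ${SI_CONFIG_DIR}/build.list | head -1`
OSVER=`echo "${ENTRY}" | cut -d: -f2`        # e.g. 5.10
METACLUSTER=`echo "${ENTRY}" | cut -d: -f3`  # e.g. CORE
DESC=`echo "${ENTRY}" | cut -d: -f4`         # e.g. Hardened CORE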

As the name of the series implies, there is only one rule and therefore only one begin script. This script does all the heavy lifting required to generate the derived jumpstart profile on a per-build basis.

Determining what to build

The first thing the script does is extract the JSBuild DHCP option, which tells it which build it is installing. If the JSBuild DHCP option is empty (due to an incorrect DHCP server configuration, or use of an old-style bootparams installation) then the begin script falls back to the build list above and generates an interactive menu for the user to choose a build from, along the lines of the sketch below.
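
A minimal sketch of that logic, assuming build.list from the previous sketch (the real menu is more elaborate than this):
# Sketch: take the build from DHCP, or fall back to asking the user.
BUILD=`/sbin/dhcpinfo JSBuild`
if [ -z "${BUILD}" ]; then
    echo "No JSBuild DHCP option found; available builds are:"
    cut -d: -f1,4 ${SI_CONFIG_DIR}/build.list | tr ':' '\t'
    echo "Enter a build name: \c"
    read BUILD
fi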

Deriving the profile

Now that the system knows what it is, it needs to know how to install itself, and this is what the profile defines. The begin script tests whether it has to perform an initial installation or a flash archive installation by NFS mounting a directory from the jumpstart server and looking for a flash archive. I have an automated process that builds and patches flash archives for each Solaris metacluster (more about this in a future article). If no flash archive is present then the script falls back to an initial install.
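
In outline the test looks something like this (a sketch: the share path, mount point and archive naming scheme are placeholders, and ${METACLUSTER} and ${OSVER} come from the build list sketch above):
# Sketch: prefer a flash archive install when one is available.
mkdir -p /tmp/flar
mount -F nfs -o ro jumpstart:/srv/flar /tmp/flar
FLAR=${METACLUSTER}-${OSVER}.flar
if [ -f /tmp/flar/${FLAR} ]; then
    echo "install_type flash_install" >> ${SI_PROFILE}
    echo "archive_location nfs jumpstart:/srv/flar/${FLAR}" >> ${SI_PROFILE}
else
    echo "install_type initial_install" >> ${SI_PROFILE}
fi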

The initial install logic looks for a metacluster include file which defines all of the packages to install. The include file looks like:
#!/bin/sh
#
# Minimal Solaris 10 build
# Start with the minimal required number of packages
echo "cluster SUNWCreq" >> ${SI_PROFILE}

# For performance monitoring
echo "cluster SUNWCacc add" >> ${SI_PROFILE}

# To support Compression (gzip)
echo "cluster SUNWCfwcmp add" >> ${SI_PROFILE}

# To support the Network Time Protocol
echo "cluster SUNWCntp add" >> ${SI_PROFILE}
...
For Solaris 10 initial installations the script also outputs a profile patch keyword like:
patch patch_order nfs://jumpstart/path/to/10_Recommended
This applies the recommended patch bundle during the jumpstart. For Solaris releases prior to 10, the same job is done by the JASS install-recommended-patches.fin finish script.
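
In the begin script this is a conditional append (a sketch; ${OSVER} as before, and the NFS path is the same placeholder shown above):
# Sketch: the patch profile keyword is only valid for Solaris 10.
if [ "${OSVER}" = "5.10" ]; then
    echo "patch patch_order nfs://jumpstart/path/to/10_Recommended" >> ${SI_PROFILE}
fi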

Disk layout and mirroring

The next step is to see if the build has a custom partition configuration. Here is an example partition configuration file:
s0:6144:/:logging:
s1:1024:swap::
s3:4096:/var:logging,nosuid:
s6:free:/srv:zpool:srv
zfs1:free:/opt::srv/opt
zfs2:free:/var/opt::srv/var-opt
This file defines the disk slice, size, file system mount point, and mount options. You'll notice that it also caters for creating ZFS file systems, even though the current 8/07 release of Solaris 10 can't create them at install time. If the custom configuration doesn't exist then a default one is used.

The begin script (well, actually another nawk script) parses the partition configuration and outputs a filesys profile keyword for each slice. It also outputs SVM mirroring information for Solaris 10 if a second disk is present with the same geometry as the root disk; again, for older Solaris releases this is done by a script on first boot.
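
The slice-handling part of that nawk script reduces to something like the following sketch. ${PARTITION_CFG} is a placeholder for the configuration file above, the ZFS lines are assumed to be handled separately, and the mount options column is left for a finish script to apply via /etc/vfstab:
# Sketch: emit a filesys keyword for each plain UFS/swap slice.
nawk -F: '$1 ~ /^s[0-7]$/ && $4 != "zpool" {
    printf("filesys rootdisk.%s %s %s\n", $1, $2, $3)
}' ${PARTITION_CFG} >> ${SI_PROFILE}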

Preserving information between begin and finish

Once the disk layout has been derived then the begin script has done its job and can exit and hand the rest of the installation over to the installer and finish script(s).

What if you want to use some of the information you just derived in your finish scripts?

You can't just export an environment variable and have it persist from the begin script to the finish script. You have to store the information somewhere that the finish script can read it from. The simplest way to do this is to write the values out in VAR=value format to a file under /tmp, then source that file at the beginning of your finish script. The information I keep between stages is the build, metacluster, root disk device, mirror disk device, patch cluster release date, and whether the system was built from a flash archive or not.
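
For example (a sketch; the file name and variable names are placeholders, and /tmp in the install environment is shared by both scripts):
# In the begin script: save the derived values.
cat > /tmp/build.env <<EOF
BUILD=${BUILD}
METACLUSTER=${METACLUSTER}
ROOTDISK=${ROOTDISK}
MIRRORDISK=${MIRRORDISK}
EOF

# In the finish script: read them back.
. /tmp/build.env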

Are we there yet?

Although I have managed to modularize most aspects of a build you can still see the complexity for a build developer in maintaining all of these files. In a future article I'll discuss a new layer on top of JASS that I have developed that consolidates all of this information into a single, easy to maintain configuration file.

Monday, 28 May 2007

One rule to bind them - part II

In part I of this article I tried reducing the number of steps a user was required to perform to get a system installed via jumpstart and addressed a couple of the goals I mentioned in Building a manageable jumpstart infrastructure using the Solaris Security Toolkit.

My ultimate goal was to reduce steps 1 through 5, plus selecting from the build menu, to a single interaction such as entering the jumpstart client details into a web page or a configuration file. To eliminate the build menu I had to find a way to inform the client which build it was to install. The /etc/bootparams file doesn't allow custom parameters to be passed to the client, so after some research and experimentation I switched to DHCP with my own custom client options.

I started by working out the minimum amount of information a client required. I settled on the following DHCP option definitions in dhcp_inittab(4) format:
JSBuild     SITE, 128, ASCII, 1, 0, sdmi
JSRevision  SITE, 130, ASCII, 1, 0, sdmi
JSOSVersion SITE, 131, ASCII, 1, 0, sdmi
JSBuild
A single word such as FIREWALL, DNS, WEBSERVER etc.

JSRevision
The version of the jumpstart environment. I have all of my finish scripts and configurations under source control using subversion. This option is so that I can build hosts from a specific version of my jumpstart environment.

JSOSVersion
The OS version, e.g. Solaris_8, Solaris_9, Solaris_10.

Both JSRevision and JSOSVersion are used on the DHCP server side to craft the paths of a number of jumpstart-specific DHCP client options.

These new DHCP options were added to the dhcp inittab in the OS install miniroot:
/srv/install/OS/Solaris_10/Solaris_10/Tools/Boot/etc/dhcp/inittab
This was necessary to allow me to run '/sbin/dhcpinfo JSBuild' or '/sbin/dhcpinfo 128' in my begin script.
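
If you script your miniroot preparation, this is just an append. A sketch (the path is the Solaris 10 miniroot shown above; repeat for each OS version you serve):
# Sketch: add the custom option definitions to the miniroot's dhcp
# inittab so dhcpinfo(1) can resolve them by name during installation.
BOOT=/srv/install/OS/Solaris_10/Solaris_10/Tools/Boot
cat >> ${BOOT}/etc/dhcp/inittab <<'EOF'
JSBuild     SITE, 128, ASCII, 1, 0, sdmi
JSRevision  SITE, 130, ASCII, 1, 0, sdmi
JSOSVersion SITE, 131, ASCII, 1, 0, sdmi
EOF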

I chose ISC dhcpd for my DHCP server as it is extremely easy to configure and very flexible. The dhcpd.conf is:
#
# DHCP Configuration
#

pid-file-name "/var/run/dhcpd.pid";
lease-file-name "/srv/dhcp/dhcpd.leases";

ping-check true;
ddns-update-style none;
authoritative;

default-lease-time 86400;
max-lease-time 86400;

# MY Custom Options
option space MY;
option MY.jumpstart-build code 128 = text;
option MY.jumpstart-revision code 130 = text;
option MY.jumpstart-osversion code 131 = text;

# SUN's Jumpstart DHCP Vendor options
option space SUNW;
option SUNW.root-mount-options code 1 = text;
option SUNW.root-server-address code 2 = ip-address;
option SUNW.root-server-hostname code 3 = text;
option SUNW.root-path-name code 4 = text;
option SUNW.swap-server-address code 5 = ip-address;
option SUNW.swap-file-path code 6 = text;
option SUNW.boot-file-path code 7 = text;
option SUNW.posix-timezone-string code 8 = text;
option SUNW.boot-read-size code 9 = unsigned integer 16;
option SUNW.install-server-address code 10 = ip-address;
option SUNW.install-server-hostname code 11 = text;
option SUNW.install-path code 12 = text;
option SUNW.sysidcfg-path code 13 = text;
option SUNW.jumpstart-cfg-path code 14 = text;
option SUNW.terminal-type code 15 = text;
option SUNW.boot-uri code 16 = text;
option SUNW.http-proxy code 17 = text;

subnet 192.168.1.0 netmask 255.255.255.0 {
    authoritative;
    deny unknown-clients;
    next-server jumpserver;
    server-name "jumpserver";
    option tftp-server-name "jumpserver";
    option domain-name "example.com";
    option routers 192.168.1.254;
    option subnet-mask 255.255.255.0;
    option broadcast-address 192.168.1.255;
    option domain-name-servers 192.168.1.1, 192.168.1.2;
    option ntp-servers 192.168.1.1, 192.168.1.2;

    option SUNW.root-server-address jumpserver;
    option SUNW.root-server-hostname "jumpserver";
    option SUNW.posix-timezone-string "Australia/NSW";

    # default OS is Solaris 10
    option MY.jumpstart-osversion "Solaris_10";
}
#
# Solaris
#
class "SUNW" {
match if substring(option vendor-class-identifier, 0, 4) = "SUNW";

#
# Serve the correct inetboot file to sun4v hardware platforms.
#
# Note: T2000 is actually SUNW.Sun-Fire-T200
#
if option vendor-class-identifier = "SUNW.Sun-Fire-T1000" or
option vendor-class-identifier = "SUNW.Sun-Fire-T200" {
filename = concat ("inetboot.SUN4V.",
config-option MY.jumpstart-osversion, "-1");
} else {
filename = concat ("inetboot.SUN4U.",
config-option MY.jumpstart-osversion, "-1");
}

    option dhcp-parameter-request-list 1,3,6,12,15,42,43,128,130,131;

    site-option-space "MY";

    vendor-option-space SUNW;

    option SUNW.terminal-type "vt100";
    option SUNW.root-mount-options "rsize=32768";
    option SUNW.install-path = concat("/srv/install/OS/",
        config-option MY.jumpstart-osversion);
    option SUNW.install-server-address = config-option SUNW.root-server-address;
    option SUNW.install-server-hostname = config-option SUNW.root-server-hostname;

    # the path to the miniroot
    option SUNW.root-path-name = concat(config-option SUNW.install-path,
        "/", config-option MY.jumpstart-osversion, "/Tools/Boot");

    # the path to the correct version of the jumpstart scripts
    option SUNW.jumpstart-cfg-path = concat(config-option SUNW.root-server-hostname,
        ":/srv/jass/", config-option MY.jumpstart-revision);

    # the path to the OS specific sysidcfg file
    option SUNW.sysidcfg-path = concat(config-option SUNW.jumpstart-cfg-path,
        "/Sysidcfg/", config-option MY.jumpstart-osversion);

    # there is always a symlink in /srv/jass to the latest release.
    option MY.jumpstart-revision "latest";
}

# Solaris host declarations
include "/srv/dhcp/hosts/solaris";
A host declaration typically looks like this:
host testfw01 {
    hardware ethernet 8:0:20:ab:cd:e1;
    fixed-address 192.168.1.10;
    option MY.jumpstart-build "FIREWALL";
    option MY.jumpstart-revision "2.555";
    option MY.jumpstart-osversion "Solaris_9";
}
I include the host declarations from a separate file so that I can assign group privileges for users to edit just that file, or generate it from the information the user enters into a Django web front-end to a database.

Now on the client I can drop down to the ok prompt and type:
ok boot net:dhcp -s
and when it gets to the root prompt run:
# /sbin/dhcpinfo JSBuild
FIREWALL
# /sbin/dhcpinfo JSRevision
2.555
There you have it. The client now knows which build it is meant to be. You'll notice that I haven't actually saved the user any work yet. In the next part of this series I'll demonstrate how I use this information in the begin script to select a jumpstart profile and pass this information to the JASS driver for post-installation processing.

Monday, 7 May 2007

One rule to bind them - part I

One of the major developments in my jumpstart system's life was the move from having a jumpstart rule per client to a single rule for all clients. Up until then when a user was building a new system they had to:
  1. Login to the jumpstart server.
  2. Obtain root privileges.
  3. Edit /etc/hosts and add an entry for their client.
  4. Edit /etc/ethers and add an entry for their client.
  5. Run JASS add-client script to add the client configuration to /etc/bootparams.
  6. Edit the rules file and add the client to it with the appropriate Begin and Driver scripts for the system build they wanted installed.
  7. Regenerate the rules.ok file with the check script.
  8. Connect to the client system's console and type 'boot net - install'.
  9. Come back an hour or so later to a fully installed system.
This worked well for a few people, but handing out root privileges to the growing number of users was not really desirable. Also, if any of the required steps was missed or an error/typo was made, the build would fail and possibly cause other users' builds to fail as well. I needed to reduce this to at most 3 steps, including steps 8 & 9.

My first attempt resulted in a more sophisticated begin script that eliminated steps 6 & 7. The user would perform steps 1 through 5, then step 8, wait a few minutes while the system booted from the network miniroot, and then be presented with a menu of different system builds (firewall, DNS, web server etc) to choose from. The new menu system allowed a great deal of flexibility for our users and our system build developers.

The main advantage was that the jumpstart rules file no longer needed to be maintained on a per-host basis; instead a single rule like this was used:
any - Begin/begin = Drivers/driver
What this says is: for any client, run the Begin/begin script, which will generate (=) the jumpstart profile for installation; then, after installation, run the Drivers/driver finish script to perform post-installation tasks such as the standard JASS hardening.

The majority of the magic lies in the Begin/begin script. This script determines which OS version is being installed and presents the user with a menu listing the builds for that OS. Based on the user's selection it then generates the jumpstart profile for that build, covering the base Solaris installation cluster (Core, User, Developer, Entire, or Entire plus OEM), additional SUNW clusters and packages, and the root file system layout and swap allocation; a sample is shown below.
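
A generated profile might look something like this (an illustrative sketch rather than an actual build's profile):
install_type initial_install
system_type standalone
cluster SUNWCreq
cluster SUNWCacc add
cluster SUNWCntp add
filesys rootdisk.s0 6144 /
filesys rootdisk.s1 1024 swap
filesys rootdisk.s3 4096 /var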

The begin script also stores the user's selection in a file which is read by the Drivers/driver script to determine which build-specific driver to execute post-install.

If you look at what I achieved above you'll see I only reduced the number of steps to build a system by two (from 9 to 7), but I laid the groundwork for much more extensive automation. In part II of this article I'll discuss how I developed the system further to reduce both the work required of the user and the issues caused by human error.

Monday, 19 February 2007

Building a manageable jumpstart infrastructure using the Solaris Security Toolkit

Luke Kanies has been stirring the hornets' nest lately about why system administration isn't evolving despite the huge growth in the development communities around the applications we manage and use every day. The tools system administrators use get nowhere near the attention, if any, that the applications do. Nor do system administrators often publish the tools they use, because they feel the tools are too specific to the environment they were written for, are one-off hacks, they don't have time, and so on. The main aim of this blog is to address my own lack of sharing the sysadmin love over the years.

When I started working for my current employer a few years ago, one of my first tasks was to deploy around 20 Check Point firewalls. I had used the Solaris Security Toolkit (aka JASS) many times before as a post-installation hardening tool, so I set forth building a jumpstart server that used JASS to automate the build of these systems. After a couple of weeks I had a build system that could have a firewall up and running from bare metal in 20 minutes.

It wasn't long before a colleague had to do a similar roll-out of another application, so I quickly added it. He was soon followed by another colleague... and then another.

Cut to a year later and there were 20 different system builds being done with JASS and it was starting to become a little difficult to manage. The kinds of issues I started facing were:
  1. Modularity - the JASS drivers could be modularized to a degree, but doing so was error-prone and difficult for new developers to grasp.
  2. One rule to bind them - having to manage the rules file was a pain. A single rule like "any - Begin/soe.beg = Drivers/soe.driver" that would allow a client system to boot, automatically select the build it was meant to have, and install would be ideal.
  3. Flexibility - varying disk layouts, users, third-party packages, etc on a per build basis.
  4. Testing - When there were 5 individual system builds it was easy for one person to test each one and make sure they built correctly. Once it got to 20 and beyond (now closer to 40) it became very time consuming and error prone. Add to that a few developers checking in changes and you end up with bugs being introduced that could potentially impact many systems.
  5. Builds were slow - initial installation and applying recommended patches were painfully slow.
To solve 1 and 3, I developed a simple configuration layer on top of JASS called slant. Slant removes the need to edit JASS drivers directly: instead, a simple configuration file is created, and from it a driver file is generated from a template.

For 2, DHCP based booting rather than bootparamd was needed. A custom DHCP option is provided to the jumpstart client during installation so that it could determine which system build it was installing.

For testing, I discovered a great tool called buildbot that is normally used to automate the compile/test phase of software projects. I've developed a buildbot configuration that allows me to test all of my system builds from bare metal to login prompt and report any errors, warnings, core dumps and more that could possibly happen during an installation.

To address the slow build times I developed an automated flash archive building system that checks for new recommended patch bundles, downloads them, and triggers a rebuild of the flash archives with the new patches applied. New systems being built look for the presence of a flash archive and use it; otherwise they do a standard (slow!) installation. One of the additional problems faced here was making a flash archive that could be built on a V210 and deployed onto a V440 with all the necessary drivers. The archiving step itself is sketched below.
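
The archive creation itself is essentially one flarcreate(1M) invocation per metacluster build; something like this sketch, where the naming scheme and paths are my own placeholders:
# Sketch: archive a freshly installed and patched metacluster build.
# -n names the archive, -c compresses it, -x keeps the archive
# directory itself out of the archive.
flarcreate -n "CORE-5.10-`date +%Y%m%d`" -c \
    -x /srv/flar /srv/flar/CORE-5.10.flar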

All of these developments have culminated in a system that is now used by dozens of people every day, with a nice Django web front end for provisioning that also generates our DHCP configurations and DNS zones. There is also a jumpstart "appliance" build that is put onto loan or permanent servers for other groups in the organisation to use.

I plan to discuss each of these topics in much greater detail in later articles so stick around.