Monday 19 February 2007

Building a manageable jumpstart infrastructure using the Solaris Security Toolkit

Luke Kanies has been stirring the hornet's nest lately about why system administration isn't evolving despite the huge growth in the development communities around the applications we manage and use every day. The tools system administrators use get nowhere near the attention, if any, that the applications do. Nor do system administrators often publish the tools they use because they feel they are too specific to the environment they were written for, are one-off hacks, they don't have time, etc, etc. The main aim of this blog is to address my lack of sharing the sysadmin love over the years.

When I started working for my current employer a few years ago, one of my first tasks was to deploy around 20 Check Point firewalls. I had used the Solaris Security Toolkit (aka JASS) many times before as a post-installation hardening tool and set forth building a jumpstart server using JASS to automate the build of these systems. After a couple of weeks I had a build system that could have firewall up and running from bare metal in 20 minutes.

It wasn't long before a colleague had to do a similar roll out of another application so I quickly added it. He was soon to be followed by another colleague...and then another.

Cut to a year later and there were 20 different system builds being done with JASS and it was starting to become a little difficult to manage. The kinds of issues I started facing were:
  1. Modularity - the JASS drivers could be modularized to a degree but it was error prone and difficult for new developers to grasp.
  2. One rule to bind them - having to manage the rules file was a pain. A single rule like "any - Begin/soe.beg = Drivers/soe.driver" that would allow a client system to boot, automatically select the build it was meant to have, and install would be ideal.
  3. Flexibility - varying disk layouts, users, third-party packages, etc on a per build basis.
  4. Testing - When there were 5 individual system builds it was easy for one person to test each one and make sure they built correctly. Once it got to 20 and beyond (now closer to 40) it became very time consuming and error prone. Add to that a few developers checking in changes and you end up with bugs being introduced that could potentially impact many systems.
  5. Builds were slow - initial installation and applying recommended patches were painfully slow.
To solve 1 and 3, I developed a simple configuration layer on top of JASS called slant. Slant removes the need to edit JASS drivers directly. Instead a simple configuration file is created and from it a driver file is generated from a template.

For 2, DHCP based booting rather than bootparamd was needed. A custom DHCP option is provided to the jumpstart client during installation so that it could determine which system build it was installing.

For testing, I discovered a great tool called buildbot that is normally used to automate the compile/test phase of software projects. I've developed a buildbot configuration that allows me to test all of my system builds from bare metal to login prompt and report any errors, warnings, core dumps and more that could possibly happen during an installation.

To address the slow build times I developed an automated flash archive building system that would check for new recommended patch bundles, download them, and trigger a rebuild of the flash archives with the new patches applied. New systems being built look for the presence of a flash archive and use it, otherwise they do a standard (slow!) installation. One of additional problems faced here was making a flash archive that could be built on a V210 and deployed onto a V440 with all the necessary drivers.

All of these developments have culminated in a system that is now used by dozens of people every day with nice a django web front end for provisioning that also generates our DHCP configurations and DNS zones. There is also a jumpstart "appliance" build that is put onto loan or permanent servers for other groups in the organisation to use.

I plan to discuss each of these topics in much greater detail in later articles so stick around.

1 comment: