#c9d9 panel on Huge Scale Deployments

A few weeks back I had the great pleasure of joining a #c9d9 panel hosted by Sam Fell on Continuous Delivery in Huge Scale Deployments. Alongside me were Andrew Siemer, Malcolm Isaacs & Seb Rose, who all brought a wealth of practical experience on maintaining velocity in large environments. I thoroughly enjoyed it!  If Continuous Everything is your current boggle, go take a look at all of the past recordings & their post-packed Continuous Delivery blog; it's well worth a read!

Introducing kitchen-salt, a salt provisioner for test-kitchen

Over the last week I've been working on kitchen-salt, a SaltStack
provisioner for Test Kitchen, which allows you to perform integration
testing on salt formulas. Test Kitchen will create a VM (through
Vagrant, LXC, OpenStack, EC2 etc.), install a version of salt of your
choosing, apply a given set of states & then optionally let you
perform some automated validation of the states via the supported
testing frameworks (bats, serverspec etc.).

The tests can all be run from your workstation, but we’ve also just
started plugging this into our CI system, so that changes are gated on a
successful test run, hopefully preventing bad states from ever making it
into a branch.

I've started a walkthrough of setting up a working Test Kitchen
environment & adding test-kitchen support to an existing formula; you
can see it here: http://tinyurl.com/mts5uo2

kitchen-salt supports multiple salt install vectors (apt & bootstrap;
the apt vector lets you specify a version & repo to pull from) as well
as setting pillars & state_top. Test Kitchen allows you to define suites
of tests, and you can change the pillars, state_top and installed
version of salt per suite too, so you can test many different scenarios.
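
To give a rough idea of what this looks like, here's a minimal .kitchen.yml sketch; the provisioner keys follow the project README at the time of writing, so treat the exact names as illustrative, and the beaver formula, salt version & pillar values are just placeholders:

---
driver:
  name: vagrant

provisioner:
  name: salt_solo
  formula: beaver                 # the formula under test (placeholder name)
  salt_install: apt               # or "bootstrap"
  salt_version: 0.17.5            # placeholder version
  state_top:
    base:
      '*':
        - beaver
  pillars:
    top.sls:
      base:
        '*':
          - beaver
    beaver.sls:
      beaver:
        transport: tcp            # placeholder pillar data

platforms:
  - name: ubuntu-12.04

suites:
  - name: default

From there, running "kitchen test" spins up the VM, applies the states and runs any bats/serverspec tests defined for the suite.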

You can find the full project here:
https://github.com/simonmcc/kitchen-salt

Looking forward to some feedback & hopefully some PRs!

I’m not going to SaltConf, but I will be at @cfgmgmtcamp in Gent in a
few weeks, see you soon!

SHOCKER: Vagrant base box using RPMs

I've been using some of the base boxes available from http://www.vagrantbox.es/ as a starting point for lots of Vagrant VMs recently, but came unstuck when the version of puppet on the base box (2.7) was substantially different from the 2.6.8 we run in production. (I was working on alt_gem, an alternate package provider for maintaining gems outside the RVM in use by puppet.)

At first I thought it would be simple enough to downgrade puppet on one of my Vagrant VMs, but then I discovered that nearly all of the CentOS/Red Hat Vagrant boxes install ruby & puppet from tarballs, which is balls, frankly; shouldn't we be using packages for everything?!  (Kris Buytaert says so, so it must be true.)

So instead of ranting, I tweaked an existing veewee CentOS template to install puppet & chef from RPMs: for puppet it uses the official puppetlabs yum repo, and for chef it uses the frameos packages. (I'm a puppet user, so I've only tested the puppet stuff; chef is at least passing the "veewee validate" tests.)

You can grab the box here: https://dl.dropbox.com/u/7196/vagrant/CentOS-56-x64-packages-puppet-2.6.10-chef-0.10.6.box


To use it in your Vagrant config, make sure this is in your Vagrantfile:

  # Every Vagrant virtual environment requires a box to build off of.
  config.vm.box = "CentOS-56-64-packages"

  # The url from where the 'config.vm.box' box will be fetched if it
  # doesn't already exist on the user's system.
  config.vm.box_url = "https://dl.dropbox.com/u/7196/vagrant/CentOS-56-x64-packages-puppet-2.6.10-chef-0.10.6.box"


I’ve sent a pull request to Patrick to get the new template included in veewee, and a pull request to Gareth to get the box listed on www.vagrantbox.es.

Now, time to go back to what I was doing originally before I got sidetracked 🙂


An Average Day

There is no such thing as an average or normal day, but here's what yesterday looked like (the first day back after a 6-day break):

  • 1000-1200 Wiki Gardening – moving some WIP from my head/evernote to team wiki pages
  • 1200-1230 Monitoring & Measuring Catchup – a quick check around the stuff we don't get alerted about, and a check-up on some new nodes I added to Cacti before I finished up last week.
  • 1230-1330 Reading – closing out a bunch of open tabs etc
  • 1330-1430 Lunch (the joy of working from home: soup & sandwiches with the family :-))
  • 1430-1600 Open CM tickets for upcoming changes, upgrade Puppet Dashboard, publish some more strategy information on our internal wiki.  Work through the dreaded email backlog.
  • 1600-1700 Baby Dr Appointment
  • 1700-1800 Weekly Team conference call (mostly around some major work scheduled for this weekend)
  • 1800-2100 Family Time – help the kids tidy up & get them to bed, get something to eat
  • 2100-2200 Finish up some puppet work, mostly tidying up & committing some work to git, review steps for some upcoming work with a Pacific Time colleague
  • 2200-2300 TV Break  – Michael McIntyre 🙂
  • 2300-0030 more follow up with PT colleague, ironed out plan for moving from DAS to NAS for a pilot set of machines, committed plan to wiki for tracking, discussed the general meanness of some of the people we work with.
After all that brain-searching trying to remember what I did, I've re-installed RescueTime…

Faking Production – database access

One of our services has been around for a while, a really long time.  It used to get developed in production, and there is an awful lot of work involved in making the app self-contained, to where it could be brought up in a VM and run without access to production or some kind of fake supporting environment.  There's lots of stuff hard-coded in the app (like database server names/IPs etc.), and there's a lot of code designed to handle inaccessible database servers in some kind of graceful manner.

We've been taking bite-sized chunks out of all of this over the last few years, and we're now on the home straight.

One of the handy tricks we used to make this application better self-contained was to avoid changing the database access layer (hint: there isn't one) and just use iptables to redirect requests bound for the production database servers to either a local empty database schema on the VM, or to shared database servers with realistic amounts of data.

We manage our database pools (master-dbs.example.com, slave-dbs.example.com, other-dataset.example.com etc.) using DNS (PowerDNS with a MySQL back end). In production, if you make a DNS request for master-dbs.example.com you will get 3+ IPs back, one of which will be in your datacentre, the others in other datacentres; the app has logic for selecting the local DB first, and using an offsite DB if there is some kind of connection issue.  We also mark databases as offline by prepending the relevant record in MySQL with OUTOF, so that a request for master-dbs.example.com will return only 2 IPs, and a DNS request for OUTOFmaster-dbs.example.com will return any DB servers marked out of service.
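
To make that concrete, here's a purely illustrative sketch of the two lookups using Ruby's stock Resolv library (the hostnames are the example pool names above; the script further down uses TCPSocket.gethostbyname instead, this is just the short version):

require 'resolv'

# in-service pool (in production this returns 3+ IPs, one per datacentre)
in_service = Resolv.getaddresses('master-dbs.example.com')

# databases currently marked out of service (their records are prefixed with OUTOF)
out_of_service = Resolv.getaddresses('OUTOFmaster-dbs.example.com')

puts "in service:     #{in_service.join(', ')}"
puts "out of service: #{out_of_service.join(', ')}"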

Why am I telling you all of this?  Well, it's just not very straightforward for us to update a single config file and have the entire app start using a different database server. (Fear not, our production databases aren't actually accessible from the dev environments.)

But what we can do is easily identify the IP:PORT combinations that an application server will try to connect to.  And once we know those, it's pretty trivial to generate a set of iptables statements that will quietly divert that traffic elsewhere.

Here's a little ruby that generates some iptables statements to divert access to the remote production databases to local ports, where you can either use ssh port-forwarding to forward on to a shared set of development databases, or run several local empty-schema MySQL instances:

require "rubygems"
require "socket"

# map FQDNs to local ports
fqdn_port = Hash.new
fqdn_port["master-dbs.example.com"]    = 3311
fqdn_port["slave-dbs.example.com"]     = 3312
fqdn_port["other-dataset.example.com"] = 3314

fqdn_port.each do |fqdn, port|
  puts "#"
  puts "# #{fqdn}"

  # addresses for this FQDN (both the in-service and OUTOF pools)
  fqdn_addr = Array.new

  # get the in-service addresses for the FQDN
  addr = TCPSocket.gethostbyname(fqdn)
  addr[3, addr.length].each { |ip| fqdn_addr << ip }

  # and the addresses currently marked out of service
  addr = TCPSocket.gethostbyname("OUTOF" + fqdn)
  addr[3, addr.length].each { |ip| fqdn_addr << ip }

  # divert every production IP for this pool to the local port
  fqdn_addr.each do |ip|
    puts "iptables -t nat -A OUTPUT -p tcp -d #{ip} --dport 3306 -j DNAT --to 127.0.0.1:#{port}"
  end
end

And yes, this only generates the statements; pipe the output into bash if you want the commands actually run.  Want to see what it's going to do?  Just run it.  Simples.

State of the Java Onion

I'm sitting on my flight home from my first devopsdays in Goteborg, so firstly, many thanks to the awesome Patrick Debois, Ulf & the many many others that put the effort into organising the conference, and everybody that turned up and made the event so worthwhile! My primary reason for going was to hear other people's experience with configuration management and general ops deployment experience. (I'm in the process of adding puppet to our large legacy LAMP stack.)

I kind of expected to be the fuddy-duddy in the room (my group runs 4 SaaS services: our largest is a LAMP+JBoss SIP stack, plus a Solaris/Tomcat/Oracle/Coherence stack, a Linux/Tomcat/MySQL stack and an Apache/Weblogic/Cognos/Oracle stack, all hosted on our own hardware, how retro), so I was prepared to hear stories of how easy it is to deploy services built on modern interpreted stacks to the cloud, but I was pleasantly surprised to hear that plenty of people are using java application servers of all shapes & sizes in production. I was less pleased to hear, but somewhat comforted, that everybody running java stacks in production is suffering pain somewhere (damn, no silver bullet to take home).


Deployment Pain

Lots of people were good enough to share their success & horror stories about how their current java stacks get into production. Some of the recurring topics:

Orchestration

I think this deserved a talk or open space of its own, but John E. Vincent covered chunks of it in his great tools talk, and it came up in the "deploying java artifacts" open space.

I've got some take-away reading to do about tools like Apache Whirr & UrbanCode's deployment & configuration tools, but everybody has similar problems: needing a controlled, reliable method of automating the pre- & post-deployment steps (traffic bleed-off, deploy, service verification, data load, back in service) and managing service availability during the deployment (or managing the stress on systems affected by the post-deployment steps).

Hot/Cold deployments?

In general, hot deployments never seem to work reliably as planned. They are highly desirable for some services due to session requirements, but most people observed that they are prone to problems, often leaking memory, which makes hot deployment something you can only get away with a few times, if at all, depending on your memory overhead.
The guys from ZeroTurnaround demoed their latest JRebel/LiveRebel tools. JRebel is a developer-focused tool that allows a jar to be hot-updated in a running JaS, for quicker iterative java development. LiveRebel is built on similar technology but aimed at use in production, to do hot updates (I'm not sure how this differs from hot deployment of war & ear files etc., but that's a gap in my JaS understanding).

war/ear or exploded webapps directory?

Currently we do both, and each has its pros & cons: exploded webapp directories have a tendency to build up undocumented cruft essential to smooth running, and war/ear deploys have a tendency to break your heart with environment issues (what do you mean we need a new build to use a different database server!?).

For our next service going into production, we need to be able to vary the number of tomcats running on a physical host, each running on a different port. To support this we've extended our existing in_service hook, which in our simpler environments just lets the load balancer know that a host is good to take traffic; now it will also build out the multiple tomcat CATALINA_HOME trees from scratch, going as far as grabbing the ant & tomcat tarballs required (version numbers pulled from a central config db, allowing per-host overrides for piloting versions on individual machines). The aim here is twofold: have a clearly documented process for building a working CATALINA_HOME, and be able to dynamically vary our tomcat count without lots of manual preparation.

Environment & Configuration

Lots of issues & lots of different solutions to this one. Best case: the war file ships with 3 environments configured, defaults to production, and you override on the command line for other environments (downside: production passwords are in the artefacts & therefore in source control).

There was some discussion over externalising the config, various methods (XML includes in context.xml/server.xml) and providing a standard-ish API to get/set properties (some commercial JaS already do this).

Horror stories included a deploy process that has to start the server to explode the war, stop it, remove the war, fix the config and restart. Another involved a post-restart data load that took 30-40 minutes before the tomcat was ready for traffic again.

The general consensus was that involving dev in more of the ops deployment pain helped highlight areas that needed some improvement.

Config/Properties APIs

There was a little bit of discussion around APIs for managing configs in a running AS; some of the commercial AS products already have this, but there was little support around the room for single-vendor solutions. Although most agreed that practically nobody changes AS after initial selection, the desire was for a single tool to gain momentum instead of fragmented tooling.

No Ops in the Java Servlet Steering Committee?

I missed the exact names of the standards & people involved; I'll update this if you have specifics.
One of the participants in the java artifacts open space is on the Java Community Process mailing list; he pointed out that there was practically no one representing ops on the ML, and some of the proposed changes horrified the ops people in the room.

the platform/application split

There were some ardent supporters of deploying nothing but packages: people using FPM and other tools to build RPMs (and other package formats) of the AS, plus another package for the application.

Part of this also comes down to your orchestration tools and how you run your CM. Ramon of Hyves highlighted that they don't run puppet continuously; they run it once a day, due to orchestration requirements and scale (they have 3,000+ application servers in production).

Most people agreed that the CM goes as far as preparing the AS for an application to be dropped in, although most of these environments ran a single AS per host.

Windows 7 Essentials

I've just rebuilt my laptop (a combination of McAfee Whole Disk Encryption slowing the current build down & a Crucial RealSSD 128GB that was too cheap to resist forced me to, honest guv), so it's time to refresh & re-document the essential software list:

  1. Windows 7 Professional 64bit
  2. VistaSwitcher (better alt-tab)
  3. WindowSpace (snap windows to screen edges & other windows, extended keyboard support for moving/resizing)
  4. Launchy
  5. Thunderbird6
    1. Lightning (required for work calendars)
    2. OBET
    3. Provider for Google Calendar (so I can see my personal calendar)
    4. Google Contacts (sync sync sync)
    5. Mail Redirect (bounce/redirect email to a ticketing system)
    6. Nostalgy (move/copy mail to different folders from the keyboard)
    7. Phoenity Shredder or Littlebird (the default theme is a bit slow, these are lighter and quicker)
    8. Hacked BlunderDelay & mailnews.sendInBackground=true
  6. Chrome + Xmarks
  7. Xmarks for IE
  8. Evernote
  9. Dropbox & Dropbox Folder Sync
  10. PuTTY (remember to export HKEY_CURRENT_USER\Software\SimonTatham\PuTTY\Sessions)
  11. WinSCP
  12. Pidgin + OTR
  13. gVim
  14. Cisco AnyConnect (main work VPN)
  15. Cisco VPNClient (backup & OOB VPN)

I think that’s it for now.

Apple Mac Toolbox

Following up on my recent post on the engineer's toolbox, I've just rebuilt my Apple MacBook (a newer, bigger hard disk was the perfect opportunity for a fresh Snow Leopard install and to fix some annoying iPhoto index & thumbnail corruption), so here's my list of essentials for my MacBook, in no particular order:

  • Evernote
  • SpanningSync
  • Dropbox
  • Last.fm
  • Xmarks for Safari
  • Xmarks for Firefox
  • Panic Coda
  • Thunderbird 3.x
  • Adium
  • VMWare Fusion
  • iPhoto
  • iMovie
  • QuickSilver
  • Microsoft Office
  • Google Chrome
  • Flickr Uploader
  • Skype
  • Google Picasa
  • Get iPlayer Automator
  • Cyberduck
  • MacVim
  • Spotify
  • iSquint
  • AudioHub
  • VisualHub
  • SuperSync
  • TweetDeck
  • ClickToFlash
  • Growl