Jo Rhett

508 Page St · San Jose, CA 95126


I use Agile processes and continuous integration practices to architect and automate self-healing infrastructure that provides seamless service to developers and customers alike.

I have 25 years of hands-on experience delivering resilient, large-scale Internet services:


I am an author of practical, self-help books from O’Reilly and Packt, such as:

Learning Puppet 4
Learning MCollective

I contributed and did technical editing for the following books:

Things I can contribute to the community are at

Experience:

03/2004 - Current
Net Consonance: DevOps Architect (consultant)

Provide training on configuration management, orchestration, Agile, DevOps, and continuous release practices using Puppet, Docker, Kubernetes, Chef, Ansible, MCollective, and related tools.

Ad-hoc training, consulting, and contract jobs to advise and train managers on Agile practices, architecture, design, and automation goals.

Many companies spend thousands of dollars every year on generic training programs where the staff is unable to apply the knowledge to create value within their working environment. I go onsite and spend time learning the operations environment and working with the team, then create customized training programs tailored specifically to their day-to-day needs. These programs have delivered positive change for the customer every time.

Consulting Engagements:

03/2017 - 04/2018
Nuance: Senior Manager / Principal Automation Engineer

Manage tools and automation engineering team for Nuance Enterprise global services. Migrate team to Agile Sprint practices, create framework and roadmap for project and release management.

Nuance was engaging in a radical shift of their operations model from data centers to cloud-based services. I consulted and trained directors and managers in the new approach, and created tools and automation to support the new business structure. I also had the pleasure of bringing together and focusing some of the brightest, most creative minds I've ever worked with.

Consult, assist, and create tools for seamless Docker application deployments in Kubernetes pods. Create Puppet types and providers for Azure API provisioning and infrastructure management.

Designed a dynamically managed Azure implementation intended for Dockerized applications deployed in Kubernetes pods. An entire service region could be bootstrapped by Puppet, from nothing to a complete replication of all services, in a matter of minutes.
Created new Puppet types and providers for Azure which made direct use of the ARM API. My contributions were folded into the Puppet-supported Azure module, and can be seen at
10/2016 - 02/2017
Quixey: Principal Engineer

Improve and automate Jenkins and Docker CI/CD on a global AWS / datacenter hybrid cloud. Evaluate and implement Mesos and Docker Swarm for seamless service upgrade and rollback.

Migrate Quixey release processes from an error-prone Jenkins-based manual CI/CD process to automatic, version-controlled updates across a dynamic, heterogeneous blend of development, canary, and release targets. This was built on the Git Flow branching model with branch-based Docker image tagging. The Docker "ships" were built and maintained using Puppet. Engineers had direct control of deployed versions from their release dashboard.
Evaluated Docker Swarm, Mesos, and then Swarm mode for managing distributed availability of services. Created Puppet modules to provide Docker facts, and to manage Swarm deployments. Created Jenkins processes for rolling out version changes.
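
As an illustrative sketch of the branch-based image tagging mentioned above: a Git Flow branch name can be turned into a Docker-safe image tag roughly like this. The mapping rules and the `image_tag` function name are assumptions for illustration only, not the scheme actually used at Quixey.

```python
import re

# Hypothetical mapping from Git Flow branch names to Docker image tags.
# The rules below are illustrative assumptions, not the original scheme.
def image_tag(branch: str, short_sha: str) -> str:
    """Return a Docker-safe image tag for a Git Flow branch."""
    if branch == "master":
        return "latest"
    if branch == "develop":
        return f"develop-{short_sha}"
    # feature/foo, release/1.2.0, hotfix/bar -> feature-foo, release-1.2.0, ...
    # Docker tags allow letters, digits, '_', '.', and '-', must not start
    # with '.' or '-', and are limited to 128 characters.
    safe = re.sub(r"[^A-Za-z0-9_.-]+", "-", branch).strip("-.").lower()
    return safe[:128] or f"sha-{short_sha}"
```

A CI job could then call `image_tag("feature/new-login", "abc1234")` to obtain the tag for the image it builds.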

Advise and guide Agile Scrum implementation for Operations and Cloud Engineering. Create usage guides and implementation documentation for operations and developer teams.

Consulted and engaged with management to implement minimal Agile Scrum processes for work management. Advised management on story creation and prioritization. Assisted engineers with understanding how the processes and thought model support and bolster their work.

I centralized, updated, and vastly expanded the documentation to include not just how it was done, but usage guides for how to develop for and make use of best practices.

At every place I've worked in the last 20 years, I have created and improved more documentation than all other team members combined.

12/2015 - 09/2016
Ooyala: Site Reliability Engineer

Design and optimize automation for a worldwide, auto-scaling AWS / OpenStack hybrid cloud. Create node-based registration and self-healing functionality to minimize alarms and on-call support. Improve documentation, advocate for, and train engineering teams on best practices through an SRE model of engineering support.

Ensuring that any node can run in any cloud environment can be tricky. I used AWS storage services for node metadata sufficient for automatic region detection and assignment.
The goal was to ensure every service was redundant, and no single node mattered.
Persuasion is essential when control is distributed. It's easier to influence thoughts and minds than to get consensus on global rules.

Build new Chef Automate Workflow-based testing and release management system. Create and rebuild cookbooks that implement worldwide deployments for dozens of engineering teams. Create a Rails-based portal for TLS certificate creation, registration, and management within Chef Vault.

Chef Automate is a bleeding-edge release management system designed around the specialized needs of the Chef cookbook release process. Ooyala was customer #2 and deeply involved in the early feedback process.
Design something specific enough to deploy in an Ooyala way, but general enough that every team could use it.
This was a Rails 4 portal that would create, generate, renew, and store TLS certificates in Chef Vault for use within cookbooks. It defaulted to using Let's Encrypt for free certificates, but allowed certificates to be purchased through any service.
12/2014 - 11/2015
Chegg: Senior DevOps Engineer

Created automated development environments using Chef in multiple large AWS installations. Rebuilt existing services in place, and designed incremental improvements to improve service. Advised and collaborated on DevOps/Agile methodology and team evolution.

Dozens of globally diverse development teams with different technology stacks. Multiple and ongoing acquisitions.

Move a classic closed-door Operations team into a new era of DevOps/Agile transparency and collaboration. Create automated drop-in replacements for classic monolith applications. Design and consult on release and deployment strategies for zero downtime requirements.

Architected change management tools to handle deployment, tracking, and scaling of AWS resources. Designed and implemented flexible, multi-tier data management beyond Chef attribute precedence.

Built and enhanced tools for managing AWS EC2, SQS, Redis, Route53, ELB, etc. Designed application and role cookbooks for internal and public application services.

Published open-source enhancements shared with the community. Worked with vendors and third parties to integrate patches and enhancements back upstream.

Shared enhancements and improvements with partners, vendors, and the community.
07/2013 - 02/2014
StubHub! (eBay): Senior Operations Architect

Shepherd for new technology projects from conceptualization through to deployment. Strategized progressive changes for a global cloud of more than 6,000 VMware hosts. Trained and socialized changes in workflow across the business.

As an Operations Architect I was involved with every project from initial concept through construction and handoff to the production support teams. We provided advice during project planning, and shepherded projects through the documentation to deliver their project for Production Handoff. We strategized changes for a global environment of more than 6,000 hosts in production facilities. I was responsible for improving the process to increase velocity and accountability. We reduced the documentation while improving the value of what was received. I either created or advised on all new infrastructure to resolve existing gaps, and provide more automation.

Consulted on strategic technology implementations. Coordinated between teams and provided tier-3 support for complex problem resolution. Evaluated new technology implementations to increase velocity and reduce cost.

Unlike previous jobs, at StubHub I did very little direct tools programming but instead interfaced between teams to provide direction on initiatives, and to suggest solutions from experience. However, during major outages or when backfilling for lack of personnel I was often directly involved in problem analysis and in writing new automation tools.

The main objective was to use my experience to best focus potential solutions to achieve business directives.

02/2012 - 02/2013
Pinger: Principal Operations Engineer

Managed deployment 24x7 for the Textfree and Pinger applications, which service millions of concurrent users around the world. Restructured the network to improve availability and security.

Pinger provides free texting and calling to millions of concurrent interactive app users. Any slowness or failure in the backend service is immediately visible to users in every timezone. Every change to the service had to be evaluated and implemented carefully.

The network used Juniper SRX and EX switches for external gateways, Cisco internal switches and NetScaler load balancers. At the time I arrived there was very little documentation of the network, and everything was done cowboy on the network units. I restructured the network to isolate the out-of-band management systems and built tools to simplify common operations.

Implemented Puppet management of diverse global sites. Created tools to ease hands-on management and automate common processes. Automated creation of application-specific data points and alerts in Zenoss.

The Pinger service had two production sites for the backend service, with voice media systems spread out around the world. I revised the manual, one-view-of-everything configuration push system at Pinger to accommodate multiple diverse internal and external configurations around the globe.

All network management was done via Zenoss or provided data to Zenoss. I cleaned up a lot of issues with Zenoss and tracked open bugs with the Zenoss team. I wrote Python code to make changes to the Zendmd backend store, and to automate creation and removal of Zenoss data points and graphs.

02/2010 - 09/2011
Equinix: Senior Network OSS Tools Developer

Designed completely automated, vendor-agnostic, world-wide provisioning system for Equinix Carrier Ethernet Exchange. Delivered 3 months ahead of schedule. Created an automated customer self-help OAM and RFC-2544 testing system.

The initial provisioning scope deliverable was due by October, 2010. I delivered the initial deliverables in July 2010. This service was Equinix's first completely automated provisioning system, so I had to break a lot of new ground here. This involved significant amounts of data exchange between many business units in the company.

The main code had to be vendor-independent since the provisioning system had to deploy on Alcatel-Lucent ESS-7200 switches at first, but also support Juniper switches within a year.

The code acquired operating parameters from Oracle network port information and the Oracle Financial InstallBase module, storing them in local databases and synchronizing with related sources.

The code base was built to be environment smart, allowing it to function in NetDev, SwDev, QA, UAT, PILOT and PROD instances without any code changes.
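
A minimal sketch of how this kind of environment detection can work, assuming a hypothetical hostname convention in which the environment name appears as a dash- or dot-separated token. The original libraries were written in Perl and their detection rules are not described here; Python and the `detect_environment` name are used purely for illustration.

```python
import socket

# Environment names are taken from the description above. The hostname
# convention (an environment token somewhere in the name) is an assumption.
ENVIRONMENTS = ("netdev", "swdev", "qa", "uat", "pilot", "prod")

def detect_environment(hostname=None, default=None):
    """Guess the environment from dash- or dot-separated hostname tokens."""
    name = (hostname or socket.gethostname()).lower()
    tokens = name.replace(".", "-").split("-")
    for env in ENVIRONMENTS:
        if env in tokens:
            return env
    # No token matched: let the caller decide what an unknown host means.
    return default
```

Code that selects backend resources can then key its configuration on the returned name, so the same code runs unchanged in every instance.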

Designed an integrated code base which allowed automated OAM and RFC-2544 network testing from the customer portal. This code dynamically provisioned new connections on the Alcatel gear to Accedian NIDs and ran test suites for the customer without any human involvement.

Developed libraries of code for the Network Tools team standards. Enhanced a diverse variety of statistics systems using Memcache. Tested other NoSQL implementations for a distributed SNMP poller.

Built new common libraries for use by the entire Network Tools development team:
  1. Consistent interface for creating and updating Remedy tickets using SOAP protocols.
  2. Simple Perl interface for creating Monolith events using SNMP traps.
  3. Simple Perl interface for using SOAP-provided services -- wrapped around SOAP::Lite.
  4. Clean database constructor which provides an enhanced DBI object for any of hundreds of internal Oracle, MySQL, Postgres databases.
  5. Environment detection library for auto-sensing which backend resources to use.

Enhanced existing statistics systems to use new common libraries for reporting errors and notifying the NOC of issues. Built requirements and a test lab to evaluate NoSQL implementations for a distributed SNMP poller.

07/2004 - 09/2009
Silicon Valley Colocation: Senior Network Engineer

Implemented production IPv6 allocation, routing, and name service for customers.

  1. Acquired a /32 allocation from ARIN for routing.
  2. Updated RADB route set to list IPv6 testbed routes.
  3. Set up internal routing infrastructure using Quagga host for a discrete IPv6 testbed network.
  4. Set up IPv6 BGP peering with all providers and peers able to do IPv6 routing.
  5. Set up a nameserver that holds all production domains but answers only to IPv6 queries.
  6. Mirrored production website to an IPv6-only site.
  7. Wrote new code for the colo control panel to allow customers to add/modify AAAA and IP6.INT records.

Redesigned the multi-site backbone to provide greater redundancy and flexibility. Upgraded the switching platform to Force10 for ASIC-based forwarding and hitless failover.

When I arrived the network depended on proxy ARP to the border gateway, which required configuration on every Extreme Black Diamond switch in the network for each customer. I simplified the configuration to use local routing and OSPF, which allowed me to improve flexibility and redundancy in case of link failures between the switches.

After this was done, I removed some expensive peering points and links which were logically redundant but not physically so. Then I arranged truly redundant peering with multiple tier-1 providers and settlement-free peering on the PAIX public mesh.

The Extreme switches had a number of fairly horrible bugs in BGP routing which made them unable to handle full feeds from all providers (which I built test cases for and proved to Extreme in their own lab). I engineered a tiered BGP solution which was able to provide full routing tables to customers while providing a limited view of functional routes to the Extreme Black Diamond switches.

The net effect of these changes was to reduce total cost-per-megabit from $135 to $32, and to increase by an order of magnitude the amount of traffic customers passed through the network. Reliability improved to nearly 3 years without a single customer-visible network outage, and the outage that ended that run was due to a power failure under the control of facilities staff.

I also acquired an ARIN IPv4 block and migrated all customers from provider-specific IP space to the PI block, which involved many months of wrangling customers.

This was a massive, multi-year project to replace the aging Extreme Black Diamond switches at the core of the network.

I created and continuously updated business requirements. I identified all market solutions capable of handling datacenter core duties. I provided all findings and total cost-of-ownership details directly to the board of directors. I technically evaluated solutions provided by Cisco, Juniper, Force10, Extreme (BD 10k and Alpine), and Foundry.

When we selected a potential solution, I built a test lab and evaluated performance and functionality of the solution in extreme detail. In the Force10 test, I identified 6 bugs in the routing protocols of the FTOS software, which compared favorably with dozens in the Cisco 6500 product at the time.

After acquisition of the Force10 E300 switches I created and extensively tested a Perl script which converted all customer configuration from Extreme to Force10/Cisco format. This allowed us to simply move cables at migration time.

I replicated all standard customer functions in the new environment using before/after Wiki documentation. As Force10 did not have a function to download its configuration at a specified time, I wrote a small shell script suitable for nightly cron invocation which caused the unit to TFTP the configuration.

Implemented a distributed Nagios network monitoring system. Wrote custom plug-ins for specialized tests of old-world facilities management gear. Unified all systems to a cfengine-controlled FreeBSD standard.

Built a set of Nagios monitoring systems with instances in each datacenter. Configured all Nagios instances to report via NSCA to a central monitoring system used by NOC staff.

Created custom Nagios event handlers to only report problems visible from multiple instances (where appropriate).
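
The idea of reporting only problems visible from multiple instances can be sketched as a simple quorum check. This is an illustrative assumption of the logic, not the original event handler code; the `should_notify` name and the default threshold of two instances are hypothetical.

```python
# Hypothetical sketch: suppress alarms that only one monitoring instance sees,
# since those usually point at a problem local to that vantage point.
def should_notify(observations, min_instances=2):
    """observations maps instance name -> True if that instance sees the problem."""
    failing = [name for name, is_down in observations.items() if is_down]
    return len(failing) >= min_instances
```

For example, `should_notify({"dc1": True, "dc2": False})` stays quiet, while the same problem seen from both datacenters raises the alarm.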

Created a Nagios check utility to alarm based on a variety of data stored in pre-existing RRD files. This utility does not require Nagios to gather the RRD data, and deals appropriately with both kibibytes and kilobytes.
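
To illustrate why the kibibyte/kilobyte distinction matters for thresholds, here is a small sketch that normalizes values to bytes before comparing them and returns the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). The function names and unit table are assumptions for illustration; reading the RRD file itself is left out.

```python
# Multipliers for decimal (SI) and binary (IEC) unit suffixes.
UNITS = {"B": 1, "kB": 1000, "KiB": 1024, "MB": 1000**2, "MiB": 1024**2}

def to_bytes(value, unit):
    """Convert a value with an SI or IEC unit suffix to bytes."""
    return value * UNITS[unit]

def check(value, unit, warn_bytes, crit_bytes):
    """Return a Nagios-style (exit_code, label) for a value in the given unit."""
    n = to_bytes(value, unit)
    if n >= crit_bytes:
        return 2, "CRITICAL"
    if n >= warn_bytes:
        return 1, "WARNING"
    return 0, "OK"
```

With a warning threshold of 1,000,001 bytes, a reading of 1 MB passes while a reading of 1 MiB warns, which is the kind of discrepancy a unit-unaware check would miss.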

Created Nagios check utilities to gather useful alarm data from facilities systems including large APC units, old and new MGE power systems, IT watchdogs (and some random other) environment sensors, and a variety of telnet or custom protocol units.

Built another Nagios environment for customer monitoring. Built an easy to use monitoring setup and management UI inside the Colo Control Panel so that customers could add/remove monitoring of hosts with zero NOC staff involvement. This involved reading in and writing out Nagios configurations, as well as careful tests and loading of the new configurations on the fly.

The SVcolo environment had dozens of systems with few common functions. These units were all built with the same script, but had diverged significantly over time. I built a test lab and evaluated cfengine2, Puppet and bcfg2 as possible management solutions.

After selecting cfengine, I rebuilt the entire multi-hundred line build script as a cfengine ruleset to maintain the systems over time. Each system's unique functions became documented and described within the policy language, providing a centralized repository of configuration information tracked in a source code tree.

As cfengine did not have FreeBSD package management at the time, I engineered code to properly support FreeBSD and got it integrated into the main source tree. I followed up by enhancing cfengine's package management to include removing and upgrading packages for all platforms, including a significant amount of optimizing for Yum environments.

Some other tasks:
  1. Created a common hardware standard and a standard pool of spares for easy swap by non-technical staff.
  2. Upgraded all systems to a common FreeBSD standard.
  3. Unified logging to a centralized server.
  4. Set up log analysis using Simple Event Correlator to create NOC tickets, activate Nagios alarms or start tracking other events as appropriate.

Designed and implemented a LAMP solution to track all internal systems, customers and resources. Created a LAMP-based Control Panel to track customers, ports, cross connects, and power usage.

Upgraded a simple spreadsheet of customer data into a de-centralized MySQL database with dozens of tables containing all customer resources and assignments.

Built a LAMP (Perl) server setup to handle both internal/staff and customer requests. Designed a set of OO modules around each set of data, and another set to handle interactive/Ajax requests. Created a UI for the staff to review and maintain this data. Upgraded the customer UI from a 1993-style 3-page web UI to a 100-plus page modern Ajax-enhanced customer portal. Continuously upgraded both environments to provide more customer-management functions to the NOC staff and self-help functions for customers, including:

Set up an rwhois server answering all queries with customer IP allocation data from the database.

Created various hourly/nightly/weekly/monthly reports based on customer power and network usage allowing NOC staff to proactively prevent power circuit overload, system compromise/abuse issues, etc.

04/1999 - 06/2004
Isite Services: Chief Technology Officer

Theorized and created an integrated e-commerce application suite in C++ and Perl/LAMP. Designed tools used for zero-downtime upgrades.

Developed a set of tools to pause the HTTP servers, push out new code, refresh template caches and restart HTTP service so as to provide no visible downtime to the user.

This required custom enhancements of rdist, and later evolution of a wrapper system around rsync.

Consulted with the investors on all technology and business choices. Managed the teams implementing all technology directives.

Lots to say here, but the short and sweet is that I managed the business objectives with the board of directors, wrote all network and security related code for the e-commerce products, and performed project management and Q/A work on the other products. I didn't know these terms at the time, but using Agile terminology we worked in a style now known as Scrum sprint mode, and I was ScrumMaster for all projects including ones in which I didn't have a single line of code.

Implemented and managed a high-uptime managed co-location service. Enhanced Mon and Cricket to create a unified network management infrastructure. Automated tools to reduce manual effort.

Built out a colocation environment from scratch, starting from pencil and paper to a finished cage environment. I actually did this three separate times, once in 1999, again in 2001 and again in 2003 due to massive business growth and the ever-changing evolution of colocation providers in that time period. Each instance was completely different.

Short and sweet: built enhancements to request, bind, mon, cricket, majordomo, apache, cgiwrap and a wide variety of other tools we used to manage the environment. Submitted all of these changes back to the original developers or published for the community to use.

As Isite ran with a minimal staff and sold only through web developers, I automated every customer management process to avoid hiring low-skilled workers.

05/1993 - 04/1999
Visionary Management Technologies: Senior Network Engineer (consultant)

Managed Cisco, Juniper, Extreme, Nortel, Foundry, HP, Wellfleet, 3Com, Proteon and other routers.

I've touched a lot of routers over the last 20 years. Thankfully, the user interface has simplified around IOS lookalikes. Unfortunately, the use of ASIC-based forwarding makes many issues more complicated.

Redesigned an IPSec VPN network to use policy routing for best performance. Implemented with no downtime in a 24-hour production network, including 3 European and 4 Asia-Pacific sites. Stabilized a Cisco VoIP call manager and implemented media gateways to improve voice quality in 24 offices worldwide.

I was brought into this project by Cisco when the best of the CCIEs couldn't figure out how to make Cisco's design work. The simple part of the project was properly documenting the actual usage of the network (instead of the executive theory) and redesigning the packet flow to support this. This involved thousands of lines of traffic engineering and packet shaping configuration on each node point, but that was the easy part.

The harder part was fixing all the bugs in the T-train (not yet production) Cisco VoIP software. I ended up creating so many bug reports that I was given direct source code access, and a new line of development code (Q-train? I forget...) was created to track the issues we identified in production. In a very real and practical sense every site using Cisco's VoIP solution with Cisco routers has benefited from the work I did on this project.

Designed and implemented firewalls using context-based filtering routers and transparent proxies. Maintained a multi-organization firewall segregating distinct internal security domains.

I've been doing network security work for 18 years now. In addition to the common Internet/IP firewalls, I've done internal/corporate firewalls for IP, IPX, XNS, HP-NS, AppleTalk, and NetBIOS protocols.

For Internet security, I've come to the conclusion that simple packet filtering is basically useless if you actually want to protect the internal networks from damage (which isn't always a break-in). Many of the commercial firewall packages are neat products, but they sometimes make too many assumptions about the network structure (or are simply too permissive).

I have built back-to-back proxy systems with three layers of packet filtering routers. It sounds paranoid, but the protected zone cannot be directly attacked, even by denial-of-service attacks. Having the source code lets me review the security of the code itself, and make changes as necessary. If nothing else, I often change what is logged and when.

Tested performance and reliability of technologies in laboratory and site-specific configurations.

I love lab work. Knowing what is really involved in network transactions is essential to having a clue when there are problems in the implementation.

Provided advanced system administration, network configuration, and host security management for FreeBSD, Linux, Solaris, HP/UX, Irix, SCO Unix V and UnixWare systems.

Textbook (SAGE) definition of Senior System Administrator, with experience in FreeBSD, NetBSD, Solaris, HP/UX, SCO Unix, UnixWare and Linux distributions Red Hat, CentOS, Debian, Gentoo and some small image distros I've used on flash drive systems.

I rarely just use an operating system. I often end up being actively involved in the development of the OS while supporting an environment. You'll find my name in patches to everything from Solaris boot code to FreeBSD package management tools.

Installed and maintained Novell and Windows servers and clients. Designed the network to provide single logon and seamless operation between Unix, Windows, and NetWare environments.

I was supporting LANs back when NetWare 2.11 was common and 3+/Open was still a viable(?) platform. I continued supporting both through NetWare's NDS and Microsoft ADS platform implementations and growth into the scalable solutions they are today.

Most networks grow by attrition, rather than design or plan. Besides the obvious network topology issues, many issues regarding data access present themselves. A business finds crucial business data on a variety of PC networks, Unix workstations, minicomputers, and legacy systems. What I'm really good at is working with multiple platforms, providing transparent access to information. Cross-platform data access can be handled through various connectivity products, but a good solution is dependent on the needs of the users trying to access the data -- and these are rarely technophiles.

My home environment is probably a good example of interconnectivity. We have a FreeBSD server, Solaris server, 2 different Linux distribution desktops, two Windows XP/Vista desktops, 2 Macs and both Mac and Windows laptops. There is a single logon to all network resources, and they are available on all platforms. Since I use my home as a test lab, the network is configured to be flexible for expansion as needed.

Designed custom plug-ins for SunNet Manager and HP OpenView to test unique resources and implement custom alert notifications.

Network monitoring is the one place where a single decent application that does what everyone needs seems to be an impossible hurdle.

Things that fail to alarm are bad. Things that alarm too often get ignored. Very few tools seem to do this correctly out of the box. And what is correct changes based on the organization, the team, even the resource in question.

I spend a lot of time documenting people's ideas of what they want checked and how they want to be alarmed about it, and writing plug-ins for the various monitoring tools to give them what they need.

1992 - 1993
Technology, Management & Analysis Corp: Systems Engineer

Created Secret security-level WAN links between the Pentagon City SEA05 Metropolitan Area Network and shipyard offices throughout the US. Implemented mission-critical military systems with Sun and HP Unix. Created advanced IP and IPX protocol filters to control link utilization on large multivendor networks.

I can't remember what parts I'm allowed to discuss here so I'll be brief. 16 towering office buildings just south of the Pentagon, legally known as Crystal City but affectionately known as Pentagon City. All of these buildings were connected in a single large fiber network known as NAVSEA SEA04/SEA05. Each department and/or contractor on this network had their own lowest-cost-bidder contracts for network help, so implementing both milnet and interior/classified routing was often precarious because you couldn't trust that the department next door wouldn't just re-use your unique IDs.

Now multiply that by an order of magnitude when you connect shipyards all around the US and bring them into the same network. Network clue was fairly low, so we spent more time firewalling against mistakes than against intruders.

Implemented mail transfer and directory synchronization on all Navy ccMail systems, which created the largest unified ccMail system in the world at that time.

We linked together pretty much every Navy office in the continental United States, each of which had their own separate ccMail installation. The Navy had given up on phone call synchronization since it simply couldn't keep up, so the network upgrade made synchronization possible again.

This pushed the boundaries farther than ccMail (prior to acquisition by Lotus) had imagined them. We were getting custom patches from them daily for several months to address issues we identified and documented for them. I also did a lot of redesign work on timers and settings for large synchronizations. I was later told that this document floated around Lotus for a while, and was used without significant change for the default Notes synchronization settings.

1991 - 1992
Network Alternatives: Network Administrator

Installed and maintained SCO Unix, Novell NetWare, LAN Manager, 3+/Open, and Lantastic environments using Ethernet, ARCNet, TCNS, and G-Net networks. Performed server installations and migrations as technology changed.

I was supporting LANs back when NetWare 2.11 was common and 3+/Open was still a viable(?) platform. I continued supporting LANs as 10base2 evolved into 10baseT and network cabling finally started being done by the facilities folks. Thankfully, I haven't seen ARCNet, G-Net, or thick Ethernet in over 16 years.

Anyway, LAN support in the early '90s tended to be one-stop shopping; we did it all. We ran the network wire, configured the servers, installed the applications, and supported the users. Things have diversified since then, and I've focused on network/application/voice issues, and Internet/Intranet services.

On all projects since May of 1992 I have been the project lead or solely responsible for my portion of the project.
I work well with either independent goals or as part of a team effort.

Other Information:

Presenter at Bay Area Large Scale Production Engineering (LSPE)
Presenter and published by Usenix LISA Conference,
Presenter at BayLISA.
Founding member of the League of Professional System Administrators (LOPSA)
Participation in NANOG (North American Network Operators Group)
Participation in FIRST (Forum of Incident Response and Security Teams)
Novell Certified Network Engineer


Marc Kodama, DevOps

Phil Clark, CEO

Mark Izillo, Site Operations Manager

David Graziano, General Manager
Silicon Valley Colocation

Craig Michaelis, Vice President of Engineering
Visionary Management Technologies

More references and contact information available on request.