Nowadays, most companies use one Agile methodology or another for their software development projects. That makes people involved in such projects at least aware of Agile principles - whether they truly try to follow Agile practices or just pay lip service to them for a variety of reasons remains debatable. To avoid any association with tainted practices, I’d rather use the name "Exploratory Development". As with software development, exploration starts with a vague idea of the final destination and a more or less detailed understanding of how to get there.[1] Plus, on a map, the plotted path is generally not a straight line.
However, even with the rise of the DevOps movement, the operational side of projects still seems oblivious to what happens on the other side of the curtain. This post aims to provide food for thought on how true Agile can be applied to Ops.
The legacy approach
As an example, let’s consider an e-commerce application. Sometimes, bad stuff happens, and the development team needs to access the logs to analyze what went wrong. In general, for "security" reasons, developers cannot directly access the system and need to ask the operations team to send the log file(s). This all-too-common process creates frustration, wastes time, and contributes to building even taller walls between Dev and Ops. To improve the situation, an option could be to set up a datastore to hold the logs, and a webapp to make them available to developers.
Here are some hypotheses regarding the source architecture components:
- Load-balancer
- Apache Web server
- Apache Tomcat servlet container, where the e-commerce application is deployed
- Solr server, which provides search and faceting features to the app
In turn, relevant data/logs that need to be stored include:
- Web server logs
- Application logs proper
- Tomcat technical logs
- JMX data
- Solr logs
Let’s implement those requirements with the Elastic stack. In the target infrastructure, each component ships its logs to a central Elasticsearch instance, with Kibana as the web front-end for developers.
Defining the architecture is generally not enough. Unless one works for a dream company that empowers employees to improve the situation on their own initiative (if you know one, please send it my resume), chances are there’s a need for some estimates regarding the setup of this architecture. And that goes double when it’s done for a customer company.
Engineers will probably think that managers are asking them to stick their necks out to produce estimates for no real reason, and will push back. But the latter will kindly remind the former to "just" base estimates on assumptions, as if that were a real solution instead of plausible deniability. In other words, an assumption is a way to escape blame for a wrong estimate - and yes, that sounds much closer to contract law than to engineering. It means that if any of the listed assumptions is not fulfilled, then it’s acceptable for the estimates to be wrong. It also implies estimates are then not considered a deadline anymore - which they were never supposed to be in the first place, if only from a semantic viewpoint.[2]
Notwithstanding all that, and for the sake of the argument, let’s try to review possible assumptions pertaining to the proposed architecture that might impact implementation time:
- Hardware location: on-premise, cloud-based or a mix of both?
- Underlying operating system(s): *nix, Windows, something else, or a mix?
- Infrastructure virtualization degree: is the infrastructure physical, virtualized, or both?
- Component co-location: are there requirements regarding the placement of components on physical systems?
- Hardware availability: will physical machines need to be purchased?
- Automation readiness: is there any solution already in place to automate infrastructure management? If not, how many environments will the implementation need to be set up in, and if more than 2, will replication be handled manually?
- Clustering: is any component clustered? Which one(s)? For the application, is there a session-replication solution in place? Which one?
- Infrastructure access: does it require being on-site? A hardware security token? A software one?
And those are quite basic items regarding hardware only. Other wide areas include software (volume of logs, criticality, hardware/software mis/match, versions, etc.), people (in-house support, etc.), planning (vacation seasons, etc.) - and I’m probably forgetting some important ones too. Given the sheer number of items - and assuming they have all been listed - it stands to reason that at least one assumption will prove wrong, making the final estimates dead wrong. In that case, playing the estimate game is just another way to provide plausible deniability. A theoretically sounder alternative would be to create an n-dimensional matrix of all items and estimate every possible combination; with just the eight items above, each admitting only two possible answers, that’s already 2^8 = 256 combinations to estimate. As in software projects, the event space simply has too many parameters to cover in an acceptable timeframe.
Proposal for an alternative
That said, what about a working alternative that satisfies not dashboard managers but the underlying business? It would start by implementing the most basic requirement, then add more features until the result is good enough - or the budget has been spent. Here are some possible steps from the above example:
Foundation setup

The initial goal of the setup is to enable log access, and the most important logs are the application’s. Hence, the first setup scrapes the application logs with a local Logstash agent and indexes them into Elasticsearch.
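As a minimal sketch - assuming the application writes to the hypothetical path /var/log/ecommerce/app.log and Elasticsearch runs co-located on its default port - the Logstash pipeline could look like this:

    input {
      file {
        path => "/var/log/ecommerce/app.log"   # hypothetical application log path
        start_position => "beginning"          # index existing lines on first run
      }
    }
    output {
      elasticsearch {
        hosts => ["localhost:9200"]            # Elasticsearch co-located for now
        index => "logs-%{+YYYY.MM.dd}"         # one index per day
      }
    }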
More JVM logs

From this point on, adding Tomcat’s own logs is a near-zero effort, and helps incident analysis by allowing correlation between application and container events.
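Assuming a standard Tomcat installation (the path below is hypothetical), it boils down to one more entry in the existing file input:

    input {
      file {
        path => [
          "/var/log/ecommerce/app.log",    # application logs
          "/opt/tomcat/logs/catalina.out"  # Tomcat technical logs, hypothetical path
        ]
        start_position => "beginning"
      }
    }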
Machine decoupling

The next logical step is to move the Elasticsearch instance to its own dedicated machine, to add an extra level of modularity to the overall architecture.
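On the Logstash side, only the output section needs to change - the host name below is, of course, an assumption:

    output {
      elasticsearch {
        hosts => ["elasticsearch.internal:9200"]  # hypothetical dedicated machine, was localhost
      }
    }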
Even more logs

At this point, additional logs from other components - load balancer, Solr server, etc. - can be sent to Elasticsearch to improve the solving of issues involving different components.
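With events flowing in from several sources, tagging each one makes later filtering easier. A sketch with hypothetical paths:

    input {
      file {
        path => "/var/log/apache2/access.log"  # hypothetical web server log path
        tags => ["webserver"]
      }
      file {
        path => "/opt/solr/logs/solr.log"      # hypothetical Solr log path
        tags => ["solr"]
      }
    }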
Performance improvement

Given that Logstash is written in Ruby, there might be performance issues when running it directly alongside a component, depending on each machine’s specific load and performance. Elastic realized this some time ago and now offers better performance through dedicated Beats: every Logstash instance can be replaced by Filebeat.
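A minimal Filebeat configuration covering the same hypothetical paths could look like this (recent Filebeat versions use filebeat.inputs; older ones used filebeat.prospectors):

    filebeat.inputs:
      - type: log
        paths:
          - /var/log/ecommerce/app.log     # hypothetical application log path
          - /opt/tomcat/logs/catalina.out  # Tomcat logs

    output.elasticsearch:
      hosts: ["elasticsearch.internal:9200"]  # hypothetical dedicated machine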
Not only logs

With the Jolokia library, it’s possible to expose JMX beans through an HTTP interface. Unfortunately, only a few Beats are available and none of them handles HTTP. However, Logstash with the http_poller input plugin gets the job done.
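For example, the following sketch polls a hypothetical Jolokia agent (on its default port 8778) for heap memory usage once per minute:

    input {
      http_poller {
        urls => {
          # hypothetical Jolokia endpoint exposing the JVM heap MBean
          jvm_heap => "http://app-server:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage"
        }
        schedule => { every => "1m" }  # poll once per minute
        codec => "json"
      }
    }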
Reliability

In order to improve reliability, Elasticsearch can be clustered.
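As a sketch, on Elasticsearch 7 or later, clustering two nodes requires only a few lines in each node’s elasticsearch.yml (names and hosts below are hypothetical):

    cluster.name: ecommerce-logs  # must be identical on every node
    node.name: es-node-1          # unique per node
    discovery.seed_hosts: ["es-node-1.internal", "es-node-2.internal"]
    cluster.initial_master_nodes: ["es-node-1", "es-node-2"]  # first bootstrap only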
The good thing about those steps is that they can be implemented in (nearly) any order. This means that after laying out the foundation - the first step - stakeholders can decide which one makes sense for their specific context, or stop once the added value is deemed sufficient.
At this point, estimates might still make sense for the first step. But with most complexity (and its related uncertainty) eliminated, estimating the setup of an Elastic stack in a specific context feels much more comfortable.
Conclusion
As stated above, whether Agile principles are truly implemented in software development projects is subject to debate. However, my feeling is that they have not reached the Ops sphere yet. That’s a shame because, as in development, projects can truly benefit from real Agile practices. To prevent any association with the Agile cargo cult, though, I proposed the term "Exploratory Infrastructure". This post described a proposal to apply such an exploratory approach to a sample infrastructure project. The main drawback of this approach is that it will cost more, as the path is not a straight line; the main benefit is that at every step, the stakeholder can choose to continue or stop, taking into account the law of diminishing returns.