Watching Westworld recently, I couldn’t stop seeing parallels with software engineering. In particular, the series is rife with cautionary tales about how not to do software development. In this article, I’ll share 9 lessons we can take from the show to improve the effectiveness of software engineering teams – and hopefully prevent them from going “full Westworld”…
Like a lot of people, I’m watching a lot more TV at the moment because of the Covid-19 lockdown. Despite being recommended to me ages ago (thanks Alex!), I’ve only just got around to watching Westworld. When we finally started watching, it immediately displaced all other shows and we crashed through season 1 in no time at all. (We haven’t started season 2 yet, so this article is just about season 1.) I’ll try to avoid spoilers where possible, but, well… the episodes I’m talking about aired back in 2016.
For those that haven’t seen the show, Westworld is basically a massive open-world RPG made reality; a western-themed park inhabited by robot NPCs. Guests pay large sums of money to come and play out their wildest fantasies, including killing and raping the robotic “Hosts”. A lot of the story in the show takes place behind the scenes, giving us a window into how the “Body Shop”, “Behaviour”, “Narrative”, and “QA” teams interact, as well as the effect this has on the smooth running of the park.
As someone interested in processes that can enable development teams to be effective and the cultural traits effective teams exhibit, I regularly found myself spotting areas for improvement whilst watching Westworld.
NB: Without giving too much away, there are some really interesting ethical questions raised by the show. For the purposes of this article, I’m going to ignore these and just assume we all want the park to be run effectively. I’m also going to gloss over some of the twists and simply take the park at face value.
So, without any further ado, 9 things that Westworld (season 1) teaches us about software engineering:
1. Dev, Ops, QA should all work together
At several points in the story, we see problems arising due to poor communication – or even outright hostility – between various teams of Westworld staff. If the Body Shop had been more able to communicate freely with the other teams, perhaps they would have raised the alarm when a Host starts displaying behaviours that should be impossible. Or, if there hadn’t been such a power struggle between QA and Behaviour, perhaps they could have collaborated to more quickly resolve a botched update.
The parallels are clear: The Body Shop look after hardware, Behaviour is the coders, and QA is, well, QA. Time and again the show reveals the consequences of the adversarial relationships between these fictional teams. I bet most engineers have experienced some degree of this in the workplace, as well as some of the consequences (though hopefully not so catastrophic). The lesson here is to remember we’re really all on the same team, with the same overall goal. If QA, DevOps, Development (and all the other associated teams) can all pull in the same direction we’re all going to be able to do our jobs better.
2. Complexity is the enemy
Westworld is effectively a massive interconnected collection of computers – each Host has many layers of its own set of motivations stacked on top of a common set of “Prime Directives”. This complexity causes tension even when the park is running without bugs, bad updates, or glitches. A team of analysts nervously keeps an eye on activity, whilst others act as enforcers with itchy trigger fingers.
The problem with complexity is that it decreases the chance that anyone will be able to be rational when debugging. We can only hold so many causal links in our heads, only so many triggers and reactions. When a system grows to the sort of complexity that exceeds this limit, we begin to rely on heuristics rather than reason. Even if we think we understand the system, we can encounter “emergent behaviours” when features interact in unanticipated ways.
How do we keep systems simple? Some might say architecture, but you can have a complex set of microservices just as you can have complex monoliths. In part, we can reduce complexity by carefully designing components so that they interact in simple ways. And we can try to manage any residual complexity with tests, monitoring, and documentation.
3. Make sure you understand legacy code
One of the big themes of season one is the possibility that a former contributor to the team has left in some code. This is a possible explanation when the Hosts appear to gain abilities they shouldn’t have. Malicious intent is suspected, but this is hardly the point. The fact that it’s a mystery is the real problem here.
When we are new to a legacy codebase, we really need to make the effort to try and understand it. Too often we find ourselves producing layers of new code before we get a chance to really become familiar with the old code. This can be especially true when we’re working on a very complex, unreadable, impenetrable codebase.
Certain circumstances make it especially important to really comb through the code of an inherited legacy system. First, if we suspect that our predecessors were either malicious or incompetent (or both!) then we really must do our due diligence. Secondly, if the application is particularly high-risk, the consequences of not understanding the code can make it foolish not to do so. What makes the Westworld case so egregious is that both these factors applied!
4. Don’t skip code review
Early in the season, we hear that one of the original developers, now very senior, has bypassed the code review process to add in a feature he’s developed. Out of deference to the developer’s seniority, the head of the Behaviour team permits the feature to be deployed to production. This backfires when the code is later linked to a serious defect. The story shows how important code review can be – and how it should even apply to very experienced developers.
Code review would also have helped the developers to better understand the legacy codebase. Further, code review can be an important tool in managing a complex system, as the knowledge of how each part operates becomes more distributed throughout the team.
5. Remember why we test
Despite having a well-staffed and powerful QA team, Westworld’s testing processes seem to leave a lot to be desired. As mentioned above, senior employees are apparently free to deploy features into production without any oversight, let alone testing. Where were the acceptance criteria? Why was there no regression testing?
When developing software, quality is everyone’s responsibility. Before writing any code, we should ensure that all parties share a common understanding of what a feature entails. Developers should then test their own code as much as possible (e.g. manual testing, unit tests) before handing it over to QA.
In Westworld, many of the park staff seem to view the Hosts with a degree of superstition. Likewise, the QA team’s approach to testing seems largely ritualistic. This leads to a very superficial form of testing. Following processes is an important part of QA, but it isn’t why we test. Your list of checkboxes is there as a foundation, but effective QA can only be achieved if all contributors approach the system with an inquisitive mind.
Ultimately, we all want to ship the best product possible. That’s why we test.
6. Tidy up after yourself (and maybe beforehand too)
The Hosts are supposed to be completely reset between deployments, but we find that in some cases their memories aren’t fully wiped. In some cases, even when a Host is radically reconfigured to play a different character, elements of the old character may linger. This laziness isn’t of much consequence until some of these memories start to interact with new behaviours in unexpected ways.
The equivalent in day-to-day software engineering is to have a nice, empty context to work in whenever our program runs. If we cause some files to be written to the filesystem, for instance, we may have to tidy these up to prevent them from affecting a future task. Or if we need to create a temporary table in a database, this should be dropped when it is no longer needed. Where our code has side effects, we should manage these carefully to prevent pollution of future scopes.
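To make the filesystem case concrete, here is a minimal sketch in Python. The function name `run_task` and the file name `intermediate.txt` are made up for illustration; the point is just that a context manager can guarantee the scratch space is wiped, even if the task fails halfway through.

```python
import tempfile
from pathlib import Path


def run_task(payload: str) -> tuple[str, Path]:
    """Do some work in a scratch directory that is removed afterwards.

    `run_task` and `intermediate.txt` are hypothetical names for illustration.
    """
    with tempfile.TemporaryDirectory() as scratch:
        scratch_dir = Path(scratch)
        work_file = scratch_dir / "intermediate.txt"
        work_file.write_text(payload)           # side effect on the filesystem
        result = work_file.read_text().upper()  # stand-in for the real work
    # Leaving the with-block deletes the directory and everything in it,
    # whether the block finished normally or raised an exception.
    return result, scratch_dir
```

The path is returned only so that a caller (or a test) can confirm nothing lingers for the next run. The same idea applies to temporary database tables: create them and drop them within one clearly bounded scope.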
7. Assess the logs with a healthy dose of scepticism
Without giving too much away, it becomes clear that the action we’re watching isn’t all playing out at the same time. Events that appear contemporaneous are actually, in some cases, separated by decades.
This reminded me of trying to solve a particular bug. The logs were apparently telling me that events had taken place in an impossible sequence. How could B have happened before A? It didn’t make sense! Well, like the rest of the application, the logging had been written by developers and turned out to be less reliable than expected.
Separately, we find that (at least some) members of the Behaviour team can edit the interaction histories of the Hosts to cover their tracks. Depending on how your logging system is implemented, consider whether it would be possible for a colleague to alter your logs, either accidentally or intentionally.
Either way, remember that your logs (like everything else) have the potential to be imperfect. If your gut tells you something doesn’t add up, try to bring in evidence from other sources and see if it tallies.
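On the question of altered logs, one possible approach is to chain entries together with hashes, so that editing an old entry breaks every check after it. The sketch below is an illustration rather than a recommendation of a specific tool: `append_entry`, `verify` and the entry layout are all names I've invented, and it only helps if the latest hash is also kept somewhere the would-be editor can't reach, since otherwise they could simply recompute the whole chain.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the very first entry


def _digest(prev_hash: str, message: str) -> str:
    """Hash an entry together with the hash of the entry before it."""
    payload = json.dumps({"prev": prev_hash, "msg": message}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list[dict], message: str) -> None:
    """Append a message, linking it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"msg": message, "prev": prev_hash,
                "hash": _digest(prev_hash, message)})


def verify(log: list[dict]) -> bool:
    """Return True only if every entry still matches its recorded hash."""
    prev_hash = GENESIS
    for entry in log:
        expected = _digest(prev_hash, entry["msg"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```

If someone quietly rewrites an earlier message, `verify` returns `False`, which is exactly the sort of second opinion you want when the logs don't add up.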
8. Make your application (easily) observable
Aside from the logging issues noted above, we also find out that aspects of the legacy system are rather hard to inspect. Hosts are, it seems, debugged primarily via voice commands. Westworld staff frequently have to enter the park to identify issues/malfunctions with the hosts. One member of the Behaviour team even has to travel to a location within the park to access logs from a particular part of the infrastructure.
It should go without saying that having to be physically close to a system to extract diagnostic information is hardly ideal. It might be tolerable in some consumer electronics, but Westworld was built by one corporation on a single (admittedly large) site. I could perhaps forgive them the lack of remote diagnostics if it weren’t for the fact they seem to have advanced telecommunications throughout the park.
Where possible, having all logs, monitoring and diagnostic data sent to a central, accessible location (e.g. the cloud) is going to make it loads easier to manage your application. Simple questions about the general health of your application are much easier to answer when you’ve got a steady stream of data coming in.
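As a rough sketch of what this can look like in practice, the snippet below uses Python's standard `logging` module to emit each log record as one line of JSON, which is a shape most log collectors can ingest. The names `JsonFormatter` and `build_logger` are my own, and the choice of fields is an assumption; actually shipping the lines to a central store is left to whatever collector you use.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical field set)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace any handlers from earlier calls
    return logger
```

In a real service, `stream` would typically be standard output or a file that a log shipper tails, so nobody has to trek out into the park to read it.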
9. Beware the 10x developer
Remember the very senior dev we mentioned earlier on? To my mind, he’s the embodiment of a “10x developer” – and I don’t mean that as a compliment. Although undeniably highly intelligent, he sees himself as above the rest of the team, hoards knowledge, and acts unilaterally. He’s a maverick in all the wrong ways.
Many of the points raised in this article can stem from (or at least be exacerbated by) this sort of behaviour. Having a single senior developer with this much knowledge and influence can quickly lead to an infantilization of their colleagues if they have the wrong mindset. This can lead to the decline of critical thinking and the decay of healthy processes. Code review ceases to apply to the senior dev. No-one else understands the system well enough to test it effectively. Corners get cut and teams become siloed. Intelligent, collaborative software development gets reduced to the performative following of process.
Worst of all, 10x developers and unmanageably complex systems seem to be symbiotic. 10x developers often favour their own baroque style of coding, which can be impenetrable to others. This can lead to a vicious circle of an increasingly complex system, understood by an ever-shrinking ring of developers.
So, the obvious thing to do is to avoid hiring 10x developers. The problem with that is that 10x developers are rarely hired – many 10x developers are born in the crucible of bad culture. This means that the advice here is more nuanced than simply “don’t hire 10x developers”. Instead, the advice is to be vigilant for the red flags that might signal that your team’s culture is souring, that your application is becoming unmaintainable, or that one of your developers is on the road to 10x.
So there we have it, 9 cautionary lessons I think we can all learn from the ill-fated Westworld team. I really hope you found them useful and that they gave you something to think about in the context of your own team.
Bizarrely, Westworld was one of the most relatable representations of professional software development I’ve seen on screen in a long time. Even though the premise is so fantastical, the cultural and organisational problems exhibited by the team are very realistic.
Hopefully, your team isn’t in as bad a place as the Westworld team, but it pays to be vigilant for the sorts of behaviour that can lead to problems. From time to time, it helps to step back and reflect on what makes a good team, and whether your team exemplifies this.
After starting to write this article, I found that there are already versions for DevOps and QA. There is some overlap with this post, but I hope I’ve given a useful developer perspective. Since one of the main points is to collaborate with allied teams, understanding what lessons our DevOps and QA colleagues took from the show can be really informative – go check them out!
Think I’ve missed something? Leave me a comment below.