What is dogfooding?
There is a term we use in the software industry, ‘dogfooding’. This refers to the phrase ‘eating your own dog food’ or, less metaphorically, using the product you produce. The origin of the term is uncertain but its popularisation is commonly credited to Microsoft in the late 1980s. (As the story goes, a couple of decades later Microsoft’s CIO tried to rebrand this practice as ‘ice-creaming’. It appears not to have caught on.) Dogfooding is usually an intentional activity designed to flush out bugs, pain-points and quality issues in pre-release software before it reaches the public.
Can we eat our own dogfood?
Here at Hyper Anna our business, like most, has multiple functional areas: sales, marketing, finance and ‘product development’. We know our product works for Finance and Marketing use-cases. We also have “test datasets” for internal development drawn from a diverse range of industries such as retail, commercial banking and accounting. But as product developers and software engineers, these use-cases do not give us an opportunity to use our own product on data that is near and dear to our own hearts.
As we develop Hyper Anna we always strive to put ourselves in our customers' shoes. We want to walk a mile in them and pop the blisters. Our mission is to democratise data. To connect you with data stories and insights that matter. How do we know if we’re doing a good job? How do we know if we’re telling a good story?
To really get a feel for how well we are doing I need some data that sings to me, that is personally meaningful. As a technical engineering manager my daily work-life is often consumed with details of who is building what, when and how. I am keenly interested in the personal growth of our developers, but also in the effectiveness of the system as a whole: the efficiency of our development efforts and whether the choices we make when deciding how to build software are helping us or hindering us. It is natural to wonder whether there might be something about our development environment or processes we can measure.
Developer Productivity and Statistics
First, a disclaimer. It is a common fallacy in software engineering to imagine it might be possible to measure developer productivity. This arises from a flawed analogy with the construction world. Walls are made of bricks. The faster you can lay bricks, the faster you can build a wall. If you can build a wall fast you must be ‘productive’. Now software is made out of ‘code’, so surely the faster you can write code and the more code you write, the more productive you are as a developer?
Unfortunately not. Building a product is primarily an exercise in problem solving. One would instinctively never think to consider “number of mock-ups produced” as a meaningful way to measure the contribution of a Product Designer. So, too, with software engineering. There are no individual measures, metrics or statistics that will give us a meaningful view of engineering productivity.
But contemporary wisdom suggests that looking at measurements for a team or company as a whole can be useful. So fortunately for our quest to find some palatable dog food, there are plenty of things we can measure at an aggregate level that may give us a good overview of what constitutes “business as usual”. Although we can’t draw any direct conclusions from these metrics we can use them to get a feel for the pulse and flow of the team.
At Hyper Anna we utilise Git for version control and practice both Continuous Integration and Continuous Delivery (a combination commonly referred to as CI/CD). We continually release our software “off trunk” using a combination of feature-flags and small, short-lived ‘pull requests’ [PRs].
The PR process involves a developer making a copy of the latest code, applying their own changes to it and then requesting that another developer “review” their proposed changes. A reviewer can comment on the proposed changes and then either approve them or request amendments. Once the changes are approved they are applied back to the latest code by ‘merging’. At any point during this process the developer may choose to ‘commit’ their code. This is equivalent to saving a copy at a point in time. You must commit your code before it can be reviewed and merged, but a developer may choose to make multiple commits as they work. When they are ready, all the commits in their PR are rolled up and reviewed together.
This PR process immediately gives us something to measure. Setting aside momentarily the question of whether we should measure these things, and what they might mean, we could look at the following attributes of a pull-request:
Number of files changed; lines of code changed, added or removed; number of commits; and days before merge.
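To make those attributes concrete, here is a sketch of the kind of record each pull-request could yield. The class and field names are illustrative assumptions, not our actual schema; note that GitHub counts a modified line as both a deletion and an addition.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PullRequestMetrics:
    """One row per pull-request (illustrative field names)."""
    pr_number: int
    files_changed: int
    additions: int
    deletions: int
    commits: int
    created_at: datetime
    merged_at: datetime

    @property
    def changes(self) -> int:
        # A modified line counts as one deletion plus one addition.
        return self.additions + self.deletions

    @property
    def days_before_merge(self) -> float:
        return (self.merged_at - self.created_at).total_seconds() / 86400.0
```

Everything we chart below can be derived from a table of records like this.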
It would be nice if ‘days before merge’ was ‘1’ or less. That fairly unambiguously tells us that reviews are happening in a timely fashion. The others are more nuanced.
Like any review process, the more changes we have to review the longer the review will take and the more chances there are that the reviewer will lose concentration, get bored, impatient, lost or overwhelmed. So intuitively we feel that keeping each pull request small would be a Good Thing. However, there is a risk that if changes are too small and granular the reviewer loses context. They lose the forest for the trees, so to speak, and can only comment on surface-level issues like syntax and idiom. An analogy would be if I sent this blog post off to a copy editor and they remarked only upon my grammar and use of punctuation, completely neglecting to pull me up when I go off on a tangent and talk for three pages about the life cycle of cicadas.
So it seems sensible to expect a metric based around ‘changes’, ‘changed files’ and ‘commits’ to be low, but not too low, for some definition of low. In reality it is not possible to pick a single value that sharply separates a ‘good’ pull request from one that is too large or too small. We might refer to this range as the “goldilocks zone”. Not too big, not too small, just right. But we are talking about reviewing changes to a set of files, line-by-line. It feels sensible to want “changes per pull-request” to be in the hundreds at most. Higher than that and we’re probably entering the territory where reviews become burdensome and overwhelming.
Integrating Anna with GitHub
I am acutely aware that perhaps not everyone out there is as interested as I am in the ebb and flow of data through the electronic rivers of our digital lives. Nevertheless, I feel that a high-level explanation of how we fed our pull-request metrics into Hyper Anna illustrates in general how Hyper Anna can be set up to ingest and refresh arbitrary datasets.
At Hyper Anna we host our code in private GitHub repositories. Whenever something happens on GitHub an ‘event’ is triggered. It is trivially possible to configure GitHub to send those events to a receiving service. In technical terms, we configured a GitHub webhook to send pull-request events to an AWS lambda receiver. Our little lambda extracts the metrics we’re interested in from the GitHub event and writes them into a database table.
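For anyone who wants to build something similar, here is a minimal sketch of what such a receiver might look like. The payload fields follow GitHub’s documented `pull_request` webhook event; the `save_to_database` helper, the chosen actions and the metric names are hypothetical stand-ins, not our actual code.

```python
import json
from typing import Optional

def extract_pr_metrics(event_body: dict) -> Optional[dict]:
    """Pull the fields we track out of a GitHub 'pull_request' webhook event.

    Returns None for actions we don't record. The payload shape follows
    GitHub's documented webhook format; which actions to keep is our choice.
    """
    if event_body.get("action") not in {"opened", "closed", "synchronize"}:
        return None
    pr = event_body["pull_request"]
    return {
        "pr_number": pr["number"],
        "repository": event_body["repository"]["full_name"],
        "changed_files": pr["changed_files"],
        "additions": pr["additions"],
        "deletions": pr["deletions"],
        "commits": pr["commits"],
        "created_at": pr["created_at"],
        "merged_at": pr["merged_at"],  # null until the PR is merged
    }

def handler(event, context):
    """AWS Lambda entry point; API Gateway delivers the webhook body as a string."""
    metrics = extract_pr_metrics(json.loads(event["body"]))
    if metrics is not None:
        save_to_database(metrics)  # hypothetical helper: writes one row to our table
    return {"statusCode": 200}
```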
Hyper Anna can be configured to import data on a regular schedule from a database table, so it is a simple matter to set up a nightly refresh. Backfilling historical pull-request data was accomplished by writing a small script that pulled the information down using GitHub’s GraphQL API.
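For the curious, the backfill boils down to paging through the `pullRequests` connection in GitHub’s GraphQL API. This is a simplified sketch rather than our actual script; the injectable `post` hook is there purely so the pagination logic can be exercised offline.

```python
import json
from urllib.request import Request, urlopen

API_URL = "https://api.github.com/graphql"

# These fields exist in GitHub's GraphQL schema; trim or extend to taste.
QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    pullRequests(first: 100, after: $cursor, states: MERGED) {
      pageInfo { hasNextPage endCursor }
      nodes {
        number additions deletions changedFiles
        commits { totalCount }
        createdAt mergedAt
      }
    }
  }
}
"""

def fetch_all_prs(owner, name, token, post=None):
    """Page through every merged pull-request in a repository.

    `post` sends one GraphQL payload and returns the decoded response;
    by default it talks to GitHub, but it can be swapped out for tests.
    """
    if post is None:
        def post(payload):
            req = Request(API_URL, data=json.dumps(payload).encode(),
                          headers={"Authorization": "bearer " + token,
                                   "Content-Type": "application/json"})
            with urlopen(req) as resp:
                return json.load(resp)
    prs, cursor = [], None
    while True:
        data = post({"query": QUERY,
                     "variables": {"owner": owner, "name": name, "cursor": cursor}})
        conn = data["data"]["repository"]["pullRequests"]
        prs.extend(conn["nodes"])
        if not conn["pageInfo"]["hasNextPage"]:
            return prs
        cursor = conn["pageInfo"]["endCursor"]
```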
‘Discovering’ our pull-request data
After importing the data the first thing we see in Hyper Anna is the “Discover” page, designed to give us a high level overview of our key metrics. This page immediately shows we had an uptick in pull-requests in September compared to August.
If we were dealing with revenue, sales, units shipped or other straightforward business drivers, this uptick would likely be an obviously good thing. But recalling our earlier caveat regarding development metrics, it would be wise for us not to focus on the actual number and instead look at longer-term trends.
So we’ll scroll down to the “Overview” section and look at how the number of pull-requests has been changing over time.
In this graph what jumps out at me is a massive and sustained spike in the number of pull-requests starting in April 2018. This happens to be a period when we increased our engineering head-count so it may very well have been caused or at least influenced by that. The other thing that happened in April was that I joined Hyper Anna and began strongly advocating for smaller pull-requests, feature flags and continuous integration. So could it be the case that the same amount of code was being written but it was spread across a larger number of smaller pull-requests? We’ll come back to investigate this later. Before we move on, let’s have a quick flick through the rest of our metrics.
The next metric of interest to us is the average number of changed lines per pull-request. The “Discover” page handily calculates this for us.
We can see we’re averaging under 200 changes per pull-request and that number has remained steady for about a year, except for a blip in March that we could dig into were we so inclined. The drop compared to the year prior is very sharp and could be interesting to investigate further.
We have a number now, but is it a good number? Anecdotally it seems okay. We regularly hold ‘retrospectives’ where we reflect on what went well and what could have gone better. Based on our retrospectives and informal chats the size of our pull-requests feels okay. So what this graph might be telling us is that if we can hold steady around this value we should be fine - we seem to be in a “goldilocks zone” at the moment. If we see a serious spike it could mean someone accidentally checked in a million lines of log files (which has happened before!), or it could be a sign our discipline is slipping and we need to reexamine how we’re working.
We’ll look next at “average commits per pull-request”. The graph is a little lumpy in the past but has been intriguingly stable for over a year. Of course looking at that second graph alone you wouldn’t know how many changes were in each of those commits, but when we combine it with the graph above we gain a feeling that we’ve been doing a good job at working at a steady, regular cadence.
Just to round it off, we can have a quick look at “days before merge” over the last year.
The graph alone seems a little lumpy, but if we look at the scale we can see the values hovering between 2 and 0.5. While personally I would love to see this stay below ‘1’, seeing it under ‘2’ is not bad at all.
Earlier we were wondering whether 200 changes per pull-request was a good number or not. Putting it all together, it seems we’re working at a pace that typically sees a line of code flow from an engineer’s hands to our integration environment in under a day. I don’t believe these metrics are predictive or even necessarily indicative of anything, but they do appear to paint a picture of a relatively steady flow of code out through to production.
Moving on from our overview I’m still curious about that spike in pull-requests back last April. Was it just more work? Or was it a change in habit? Or both?
We could eyeball all our graphs on the “Discover” page but Anna is smart enough to plot these things together and tell us whether they are correlated. So we switch to the “Search” tab and look for “number of Pr Identifier and number of changes”. Note that I have limited the search date range to October 2018, which is a couple of months after our number of pull-requests peaked.
Anna immediately shows us a strong correlation across this period and moreover tells us that the number of changes (as a percentage rise over historical values) was increasing much, much faster than the number of pull requests.
Does this mean the spike was likely due to increasing our head-count and working at a furious pace? The null-hypothesis here would be that we worked at the same pace but refined how we sized our pull-requests.
Let’s look for more evidence either way by examining how commits correlate with pull-requests, and how commits correlate with changes.
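For readers who want to reproduce this kind of check outside Anna, the underlying statistic is just a correlation coefficient over the aggregated series. A minimal sketch of Pearson’s r, with invented weekly numbers standing in for our real data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented weekly aggregates, purely for illustration.
weekly_prs     = [30, 42, 55, 61, 70]
weekly_commits = [150, 200, 280, 310, 340]
weekly_changes = [5000, 9000, 16000, 21000, 27000]
```

A value near 1 means the two series rise and fall together, which is the “linked very tightly” behaviour described below.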
We can see that not only do changes, commits and pull-requests all correlate, but they are linked very tightly. I find this fascinating. Over this period we doubled our engineering head-count and embarked on several major new initiatives. Yet despite this disruption and absorbing a large number of new developers, the amount of work that goes into each individual commit remained roughly the same. This might not be directly actionable, but it is a very interesting insight into the way we were operating.
At this point I’m really curious to see what “average changes per commit” might look like over time. In Anna’s parlance we call this a “ratio calculation”. Hyper Anna has ‘behind the scenes’ support for ratio calculations but “self-service” options around this are still under development. It would be ‘cheating’ for me to use my inside knowledge and special powers to configure a ratio calculation right now, because the whole point of dogfooding is to feel the same pains as our customers. As a customer right now I’d really love to configure a ‘ratio calculation’!
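In the meantime, for anyone following along at home, a ratio calculation of this kind is simple to sketch outside the product. The monthly totals below are invented for illustration:

```python
def ratio_series(numerator, denominator):
    """Period-by-period ratio of two aggregated series; None where the
    denominator is zero. Dividing total changes by total commits per
    month gives the 'changes per commit' trend described above."""
    return [n / d if d else None for n, d in zip(numerator, denominator)]

# Invented monthly totals, purely for illustration.
monthly_changes = [42000, 38000, 15000, 16000]
monthly_commits = [300, 290, 310, 305]
changes_per_commit = ratio_series(monthly_changes, monthly_commits)
```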
For now, though, let’s see what happens to our metrics after this historic high around July/August. We’ll keep the comparison the same but change the date range to look at October 2018 until now.
Anna reports a weak correlation but shows a huge drop in “changes” without a corresponding drop in the number of commits. So although we’re hitting the “commit” button just as often, a lot less code is being changed. This was a “Well, huh!” moment. It was something I did not expect to see in the data as it breaks the pattern we had observed prior to October 2018.
We have to be careful here. It is a naive assumption to think that this might mark a drop in productivity because there are suddenly fewer changes per commit. But it is simply not possible to directly attribute a rise or fall in the amount of code being written to a rise or fall in productivity. When you build a new interstate highway you might begin by blasting through mountains, shifting tons of rock and laying miles of asphalt. But you can’t drive on it until you’ve painted all the road markings and laid out the speed signs. Neither side of the project is more important than the other; both must be completed before the road can be used. But if you’re only measuring tons of rock moved, obviously one half of the project will stand out more than the other.
“Changes” includes both code added and code deleted. So although I’m reasonably certain these ‘changes’ are going to be additions we’ll just double-check that by asking Anna to show us additions and deletions separately.
But no! I am truly surprised. Additions and deletions are strongly correlated! There could be multiple interpretations of this, but the one that springs to mind most immediately is “rework”. Modifying one existing line of code counts as two changes: a ‘deletion’ and an ‘addition’. So this could mean we are reworking a lot of code, either by rewriting existing code or by adding new code and removing old code at the same time.
This intrigues me to the point where I want to see this graph for our entire history. Easily done using our ‘time controls’ by expanding the range back out to “all data”.
This graph also reveals a truly fascinating glimpse at our company’s history through code. We can see in our early history a few large spikes of development frenzy where ‘additions’ jumped, followed by, one might guess, a bit of clean-up where deletions spike. We then see a truly massive jump in additions that is sustained for a few months before sliding down into a much more regular pattern.
As much as I’d like to dive deeper into this pattern of leap-frogging additions and deletions, I would be mostly indulging my curiosity. It would be wonderful to create a ratio calculation of additions over deletions, which might be indicative of ‘code churn’. Definitely something I will add to my data once improved support for ratio calculations rolls through the roadmap. But for now there are still a couple of things I am keen to learn. One is to focus in on that peak around October 2018 and see what caused it. The other is to investigate whether the recent strong correlation in additions and deletions implies a lot of rework.
Let’s ‘zoom in’ on the months around those peaks by restricting our date range but changing the granularity to ‘weekly’ so we can see more data points across that time range. I haven’t included a picture of that graph because the very moment I did this I realised what I ought to be looking at were ‘daily’ changes over this time period.
Well okay! One large spike on the 7th of April. It’s looking like our “frenzy of development” might have been an outlier.
To check this out I’ll likely have to go back to our raw data source, but first I need to know where to look. I asked Anna to show me the additions by repository on the 7th of April and it’s pretty clear where the anomaly lies.
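The aggregation behind that “by repository” view is a simple group-and-sum. A sketch, with hypothetical repository names:

```python
from collections import defaultdict

def additions_by_repository(rows):
    """Sum additions per repository, largest first -- the aggregation
    behind a 'by repository' breakdown for a given day's pull-requests."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["repository"]] += row["additions"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

An anomaly like ours shows up as one repository’s total towering over the rest.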
Having narrowed down the spike to a particular day and a particular repository I go back to GitHub. There are only seven pull-requests to check, easily done by hand. Very quickly I find the culprit. Someone has committed a model of the English language alongside the code.
Now suspicious, I review the entire “spike” period, broken down by repository, at a weekly granularity.
Each of those spikes is so far above the baseline as to make me believe they are similar in nature, developers accidentally committing something they shouldn’t. I invest a few minutes in investigating each one and determine my suspicions were correct. In each case we find a single commit in which log files, or test data, or some other enormous machine-generated file was introduced.
Was our dog food tasty?
There was definitely a delectable crunch to many parts of my experience, but overall I would say this experience was a chewy sort of ‘interesting…’ From this exercise I did learn a few things. Firstly, that looking at ‘changes’ alone is a terrible idea as the data is not a reasonable or useful proxy for amount of human effort involved. Secondly, that viewing aggregated data can result in some nice, smooth graphs but the story you invent in your mind when trying to interpret them is very easy to get wrong. Thirdly, that every assumption is worth double-checking. Fourthly, that spotting these outliers casts doubt on my original interpretation of the data and what I really should do now is remove the outliers and retrace my steps.
In today’s Anna that would require me to go back to the source database and remove or change rows of data. Although I have the freedom to do this, I imagine in a vast number of workplaces, for very many datasets, this idea would raise issues around data governance and lineage. Wouldn’t it be nice if Anna had some way for me to tell her to disregard these points and recompute?
Wouldn’t it be even nicer if Anna proactively warned me about these outliers and how they might screw up the narrative I was carefully constructing in my head? I must add that Anna already has proactive anomaly detection for the latest month of data and we are currently hard at work on improving this even further. Anna did provide me with the tools to discover the flaws in my assumptions, but it required a healthy dose of curiosity and willingness to explore on my part.
Even this small data-set is so rich I could go on digging practically forever. In particular I’d love to investigate the rise in pull-requests and commits after the historic peak. The peak, I now know, was due to outliers; even ignoring that, we still see a sustained rise in our number of pull-requests. But I think my patience for diving into our code history vastly exceeds your patience in reading about it. So if you’ll indulge me one last time… I happen to know we currently release from our beta environment to production once a week. I am curious to know if this leaves a footprint in the data.
And yes! It does. Can you look at this graph and guess when we release our software to production? If you’d guessed “Wednesday night” or “Thursday” you’d be close enough! We currently cut our releases after close-of-business on Wednesday, test internally on Thursday and deploy to our clients on Thursday after close-of-business. I can clearly still see the influence of this process in a flurry of activity on Wednesday to “make the release.”
While I’m pleased our engineering team is so eager to get their changes out to our customers, ‘haste makes waste’ as they say! I would hope to see that as we continue on our CI/CD journey this mid-week spike will even out and we’ll see a more even, sustained and ideally elevated pace.
How This Helps Us Helps You
In the course of writing this article, I spent about an hour digging through my data in Anna. I feel that overall I accomplished what I set out to learn, although getting there - I must admit - could have been a smoother experience. I yearned to be able to easily set up a ratio calculation and I would love to see the outlier detection we are working on integrated across the product. I would have loved to have a way to easily “zoom in” on our line chart without having to go back to the date-picker.
I also wonder if some of the time I spent thinking, “Hmm - what do I look at next?” could have been assisted by Anna herself. We aim to build Anna into a virtual data-analyst. Into a team-mate more than a tool. This feels like a rich area we could explore for future enhancements to Anna.
In just a single short hour I learnt more about our strengths and weaknesses than I would have in months of listening to our product manager articulate requirements. (No offence, Rob!) I feel very vulnerable admitting a gap in our product, and I do so hope our Marketing champion will still agree to let me publish this - but this cuts to the heart of why we “dogfood”. Why all of us here at Hyper Anna look for ways to bring more data into our lives and to look at that data through Anna. In this way we hit the same barriers, pains and problems as our customers and through it make a better product.