
    Fighting Chance: A Statistical Analysis of the Halo 5 Multiplayer

    By gamer_152, Moderator

    The mechanical fabric of Halo 5 rewards consciousness of the combat above all else, and as a shooter package, it's stuffed to the gills with modes and features to please all species of players. I think this is a shining beacon of action game design and art direction, but I find it hard to reconcile that view with my many memories of coming away from the game feeling cheated and frustrated. Some of that frustration came from the way that my allies would handle themselves in fights or from scratches on the design like the inconsiderate spawn system, but what consistently rubbed me the wrong way was the difficulty in some matches feeling insurmountable. It's not that Halo 5 would always overestimate my skill, but that I could fall into an occasional rut where it did because match difficulty was erratic. At least, that's how I perceived it. No one was happier than me to hear that 343 Industries was revamping their player matching, and on paper, the improved TrueSkill system looks phenomenal. How it feels from the perspective of a player is, however, something else.


    Here, I've compared stats from 50 matches before the introduction of TrueSkill2 and 50 matches from after to explore the Halo 5 of the past and the Halo 5 of the present. I also hope that this article can serve as a real-world example of how to analyse multiplayer stats, particularly with the goal of determining a game's fairness. If you want to see the dataset I'm working from, you can find it on Google Sheets at this link, while my full player history is available on Halo Waypoint. If you want to nerd out over all the specifics of the data sampling, check the first note at the bottom of this article,[1] otherwise, just know that I based this on my performance in one-hundred random "Balanced" Team Arena games. Let's start with the most straightforward aspect of these games: wins and losses.

    • TrueSkill Win Rate: 50%
    • TrueSkill2 Win Rate: 50%

    I saw an ideal 50% win rate (a 1.0 Win/Loss ratio) under TrueSkill, and not surprisingly, that continued into TrueSkill2. Remember, if we value fairness and challenge, then the ideal success rate is not the highest success rate; it's the one showing that matches have an equal chance of going either way, and Halo 5 has no problems there. But we don't just view the fairness of match-ups in terms of whether we win or lose but also in terms of whether we feel like we can hold our own as players. If we're carrying or being carried by our allies, we're not going to view the match favourably. In shooters, kill rates are a widely applicable measure of individual player success, so let's look at those.

    • Mean K/D Ratio Under TrueSkill: 1.396304162
    • Mean K/D Ratio Under TrueSkill2: 1.044855431

    First off, let's talk about why we use K/D ratio rather than raw K/D when assessing player performance. That is, why we use kills divided by deaths rather than kills minus deaths. Feel free to skip to the next paragraph if you already understand. Imagine that you finish one game with 10 kills and 5 deaths and another game with 20 kills and 10 deaths. If we subtract deaths from kills to reach a K/D total, our first game gives us +5 K/D and our second game gives us +10 K/D. So it looks like we did twice as well in the second game as in the first, but it's harder to view it that way when we realise that in the second game we had double the deaths we did in the first. If we want a fair measure of our performance, then we can't just use an equation that mashes our successes and failures together; we have to find an equation that considers our successes relative to our failures, that is, our kills relative to our deaths. We can do that by dividing kills by deaths. In the above example, 10/5 = 2 and 20/10 = 2, revealing that you did proportionally as well in both matches.
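
    As a quick illustration, here is the arithmetic above as a short Python sketch. The kill and death counts are the hypothetical ones from the example, not real match data:

```python
# Two hypothetical games: 10 kills / 5 deaths, then 20 kills / 10 deaths.
games = [(10, 5), (20, 10)]

for kills, deaths in games:
    raw_kd = kills - deaths    # kills minus deaths
    kd_ratio = kills / deaths  # kills divided by deaths
    print(f"{kills}K/{deaths}D: raw K/D {raw_kd:+d}, K/D ratio {kd_ratio:.1f}")

# Raw K/D doubles (+5 vs +10) even though the K/D ratio (2.0 in both
# games) shows the two performances were proportionally identical.
```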

    As with the Win/Loss ratio, the optimum mean K/D ratio is 1.0. I'm not going to turn my nose up at my positive K/D pre-TrueSkill2, but me getting that 1.4 does mean players elsewhere were finding it harder to compete. You could say that the only thing that matters is skill and that if I got that leg up on other Spartans, I did it fair and square, and that's a valid opinion. If you want a less balanced multiplayer experience, Halo 5 and some other games do let you select an "Expanded" search option, but matchmaking systems are about balancing, and that 1.4 represents an imbalance. By comparison, TrueSkill2 gets me shockingly close to a 1.0 K/D ratio. There is no criticism I can make based on this statistic, but you also have to understand that it's possible for the mean to come down around 1.0 while many individual matches leave me with a K/D ratio far above or below that average. You can skip the next paragraph if you're familiar with statistical analysis.


    Imagine that I play three matches where I get a 1.8 K/D ratio and then three matches where I get a 0.2 K/D ratio. The first three matches are statistically a breeze for me to complete, while I'm getting demolished in the last three. The difficulty is all over the place, and yet, if we calculate the mean of these six K/D ratios, it resolves to an equal-looking 1.0.[2] The statistics suggest that matches are in the sweet spot for challenge, but the experience is one of wildly mismatched difficulty. This is because, while the mean can give us an insightful overview of the multiplayer experience, we can't guarantee by default that the K/D ratio in any given match will fall close to that average. Although a mean 1.0 K/D ratio is what we desire and we don't want matches to stray far from the mean, we should also remember that we don't want every single game to land at exactly 1.0 either. We need some variation in the difficulty so we don't get bored, but not so much that it feels like the game is rolling a die when deciding how hard combat is going to be. Remember, one of my strongest criticisms of Halo 5's multiplayer is that it's too inconsistent in its difficulty. To work out how far, on average, individual data points tend to fall from the mean, we look at a variable called standard deviation. Below, I've calculated the standard deviation of my K/D ratio in these games.

    • K/D Ratio Standard Deviation Under TrueSkill: 0.9378690375
    • K/D Ratio Standard Deviation Under TrueSkill2: 0.5853917466

    When measuring standard deviation, the closer the number is to 0, the more uniformity we see in the data, while the further away it gets, the more spread out that data is. The standard deviation of 0.9 that TrueSkill gave me meant that my K/D ratio in the typical match was 0.9 higher or lower than the mean 1.4 K/D ratio we looked at above. Put another way, it would be normal to find some of my matches at a 2.3 K/D ratio and others at 0.5; that's a copiously wide swing and would explain why the difficulty of games felt random. It is worth noting, though, that with both means and standard deviations, a few extreme data points can disproportionately influence the figures. For Halo 5, that implies that one game where I had an unusually high or low K/D ratio would count more towards the standard deviation than others where I had a more reasonable one. Still, between this and just scrolling through the K/D ratios on the spreadsheet, I think most people would agree that TrueSkill was unreliable at matching for individual player performance. TrueSkill2 is over a third better, which is nothing to sniff at, and we must admit that there's no objective answer for what the right amount of deviation should be; this is a matter of personal preference. However, an average variation of K/D that closes in on 0.6 is too high for my liking; I wish we saw a little more stability in this number.
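
    To make the mean-versus-deviation point concrete, here is the six-match example from above in Python. The K/D ratios are the invented ones from the example, and statistics.pstdev gives the population standard deviation, treating these six matches as the whole dataset:

```python
import statistics

# Hypothetical K/D ratios: three easy matches, then three brutal ones.
kd_ratios = [1.8, 1.8, 1.8, 0.2, 0.2, 0.2]

mean = statistics.mean(kd_ratios)      # 1.0: looks perfectly balanced
spread = statistics.pstdev(kd_ratios)  # 0.8: exposes the wild swings

print(f"mean K/D ratio: {mean:.1f}, standard deviation: {spread:.1f}")
```

    The identical-looking 1.0 mean hides swings of 0.8 in either direction, which is exactly why the article reports both numbers.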


    Now that we understand standard deviation, we can drill down a little further into how win rates and skill gaps may differ between the gametypes. While Halo 5 now gives roughly 1.0 Win/Loss rates and K/D ratios, it was my impression for the majority of my time with the game that teams understood the basic strategy of Slayer but didn't grasp the tactics of objective-based gametypes as well. I compared thirty-five Slayer matches under TrueSkill with thirty-five Slayer matches under TrueSkill2 and then compared forty Strongholds matches under TrueSkill with forty Strongholds matches under TrueSkill2. I'd have liked to collect a full fifty for both gametypes, but even after playing well over one hundred matches across two seasons, this wasn't possible with the time I had available. I chose Slayer and Strongholds rather than Capture the Flag or Oddball (which also appear in Team Arena) as CTF is only scored from 0-3, and so you don't get a lot of specificity in the figures, while Oddball was added at the tail end of TrueSkill's reign, rendering me unable to collect a lot of data for it. Let's see how the team scores from these matches compare.

    TrueSkill

    • Slayer Win Rate: 60%
    • Mean Slayer Score Difference: 14.4
    • Slayer Score Difference Standard Deviation: 8.073923166
    • Strongholds Win Rate: 45%
    • Mean Strongholds Score Difference: 56.675
    • Strongholds Score Difference Standard Deviation: 28.58849284

    TrueSkill2

    • Slayer Win Rate: 57.14%
    • Mean Slayer Score Difference: 11.77142857
    • Slayer Score Difference Standard Deviation: 7.219604496
    • Strongholds Win Rate: 45%
    • Mean Strongholds Score Difference: 57.975
    • Strongholds Score Difference Standard Deviation: 28.17116866

    I understand these stats may look a little intimidating upfront, but they represent concepts we already understand. With both TrueSkill and TrueSkill2, I was getting a roughly 60% win rate in Slayer and a roughly 45% win rate in Strongholds, both close enough to 50%. Under TrueSkill, in Slayer, I was seeing one team get, on average, 14 more kills than the other team; with TrueSkill2, that dropped to 12. Matches tended to come out 7 or 8 points to one or the other side of that average, which created some stinkers, but the typical Slayer match is nothing to write home about. Strongholds, on the other hand, is a statistical car crash. Keep in mind, we'd expect the scores in Strongholds to be double what we see in Slayer because Slayer is scored on a range of 0-50 while Strongholds is scored on a scale of 0-100, but even when you compensate for that, the gaps between teams in Strongholds are off the charts, and the standard deviations of scores don't look much better. Bringing in TrueSkill2 didn't budge this metric a notch.

    This maths cannot tell us where these gulfs of ability in Strongholds come from, but I have an educated guess. It may be that Strongholds demands skills of players that the other modes in Team Arena don't, or at least, demands them in higher amounts than other modes, so the talents that let you get ahead in Slayer or CTF don't count for as much in this gametype. This means that the matchmaker may assume that two teams are roughly matched because they have performed similarly across Team Arena as a whole, but that tells the system little about whether they play well in Strongholds specifically. The gametype does rely on concepts of resource management and area control which are not as prominent in other Team Arena modes, so this explanation is, at least, internally consistent.


    Obviously, player empowerment games are about agency, so we don't just care about how evenly distributed wins and scores are or about whether the code gives us a fighting chance as individuals, but also about whether our contributions to a match translate into wins or losses for our team. If fighting well doesn't make our team likely to win and fighting poorly doesn't make them prone to lose, then our efforts are pointless, and there's not much reason to play. You should skip to the figures below if you're a maths buff; for everyone else, let's talk about correlation coefficients. A correlation coefficient is a measurement of the relationship, or lack thereof, between two sets of numbers. The coefficient will always come out somewhere from 1 to -1. A 1 indicates that as one of the numbers increases, so does the other; a 0 indicates that one variable does not seem to move in line with the other; and a -1 tells us that as one of them increases, the other decreases. The two variables we're looking for a correlation between are K/D ratio and Win/Loss ratio.[3] The closer the coefficient is to 1, the more strongly my individual performance was driving my team's performance; the closer to 0, the weaker that link; and a negative value would mean my better performances actually coincided with my team losing.

    • TrueSkill K/D Ratio/Win Correlation:[4] 0.3763704913
    • TrueSkill2 K/D Ratio/Win Correlation: 0.3390783394

    The ideal for this number would be 0.25 as there are four Spartans on a team in Team Arena and if they all contributed 0.25 of the individual effort to the final score, that would be equal pushes by everyone. However, you also get a lot of matches where players quit out, and so individual player performances sometimes end up swaying the game more. The multiplayer not correctly accounting for players quitting out has been its own problem, but all things considered, 0.38 is a reasonable correlation between the competence I displayed and the outcomes of matches. Again, I can raise no complaint here. The correlation under TrueSkill2 is marginally lower, but a difference of 0.04 isn't enough to raise any eyebrows and could be reasonably explained by chance.


    But I have one last experiment before we go, and it's not going to involve comparing TrueSkill and TrueSkill2 this time. Play enough Halo 5, and you can't help but pay attention to the burden that is put on players when their teammates quit out. While it wasn't my original plan to investigate it, it felt like such a relevant factor in team success that I began recording games where players left. In the end, I had forty-three games, all played under TrueSkill2, and a way to test for a connection between players quitting out and the affected team's likelihood of winning. In all forty-three of these games, I correlated[5] how many more players my opponent's team had than my team with whether my team won or lost and ended up with this:

    • Quits/Win Correlation: -0.8147018826

    I also crunched some other figures:

    • Quit Rate: 24%
    • Mean K/D Ratio in Matches with Quitters: 1.740024069
    • K/D Ratio Standard Deviation in Matches with Quitters: 1.789220406

    Remember, the closer a correlation coefficient is to -1, the more one number rising is connected with another falling, so that -0.8 suggests a crushingly strong correlation between people on a team quitting and that team losing. In almost a quarter of my TrueSkill2 games, someone quit out, and in 41/43 of the quitter games I measured, you could predict who the winner of the match was going to be just by looking at who had more players on their team. So while the matchmaking system may be free of bias overall, matches in which players quit out are largely deterministic: The vast majority of the time, the outcome is decided the moment a player leaves and the losing team just has to suffer towards it. Additionally, as one team becomes outnumbered by the other and the advantaged team can pick them off with less resistance, K/D ratios begin pulling away from that mean of 1.0 and will deviate wildly depending on whether you're on a team with quitters or facing a team with quitters.


    That's all I have for now. To summarise, Halo 5 has always been a game which balances teams evenly overall, although that's less true in Strongholds than it is in the other gametypes and is almost never true when players begin quitting out, a common occurrence in the multiplayer. It is also the case that the difficulty of matches is inconsistent, even if that inconsistency has calmed somewhat with the introduction of TrueSkill2. Despite this, Halo 5 has always been a game where you see an equal number of winning and losing games, and where each team member's contributions are weighted roughly equally. TrueSkill2 has also now brought the average game far closer to your aptitude, with your team, on the whole, neither carrying you nor needing you to carry them. Thanks for reading.

    Notes

    1. I collected all matches for the dataset from my personal game history, and all individual-based data reflects my performance, not that of any other team member. The criteria for a match to be included in the dataset were that I found it via the "Balanced" matchmaking preference, I played it within the Team Arena playlist, I did not quit out of it, and it did not end in a tie. Tied games would have complicated the data analysis considerably and are rare enough that our data isn't significantly less representative of normal Halo 5 play when we exclude them. All TrueSkill matches were collected within one season and all TrueSkill2 matches within another to ensure that changing seasons did not create inconsistencies between the data beyond the unavoidable season change between TrueSkill and TrueSkill2.

    The TrueSkill dataset includes this match and the forty-nine matches played directly before it that fit the above criteria, while the TrueSkill2 dataset includes this match and the forty-nine matches played directly after it that fit the above criteria. Collection for gametype-specific datasets also started on the above dates, again, with TrueSkill data moving in reverse chronological order and TrueSkill2 data moving in regular chronological order. Capture of TrueSkill2 data did not commence immediately after TrueSkill2 was implemented as player matching was unreliable for roughly the first two months while the system adjusted. I believed that data in this period would be unrepresentative of users' overall experience with the multiplayer and so did not capture it.

    2. To calculate any mean, we add all relevant variables together and divide them by the number of variables in the set. 0.2 + 0.2 + 0.2 + 1.8 + 1.8 + 1.8 = 6.0. 6.0 / 6 = 1.0.

    3. We can turn wins and losses into a number that we can operate on by assigning a "1" to wins and a "0" to losses.

    4. We're using the Pearson correlation coefficient formula for these calculations, the most common measure of correlation.

    5. I compiled the "quits" dataset over the same period as the other TrueSkill2 data. The only match that contained a player quit in that period that I did not record was this one where the quitting player only removed themselves from the game in the final seconds. For each match, the number of quits on the opposing team was subtracted from the number of quits on my team, giving the opposing team's player advantage. The correlation also used the method of representing wins with 1s and losses with 0s. This quit difference figure across the matches was then correlated with the wins and losses across the matches to reach a final correlation coefficient.
