Rating System Editorials - True Average

John 'Gestalt' Bye

True Means, or All We Are Saying Is Give Means A Chance

For the last year NWVault has been using medians to figure out the scores of modules that appear in the Top Rated list. This method works by listing all the ratings a module has received in order from lowest to highest, and then picking the one in the middle as the average. This has several major problems: the scores come out a lot higher than they should be; the lower half of the votes has little or no effect on the final score; it's easy to rig the vote to unfairly boost a module's rating; and scores can only go up or down in steps of 0.1, so dozens of modules come out with exactly the same score, making it impossible to sort them accurately.
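To see why the lower half of the votes barely matters under a median, here is a minimal Python sketch. The two sets of ratings are invented for illustration and are not real NWVault data, and `median_score` is a hypothetical helper name.

```python
from statistics import median


def median_score(ratings):
    """Sort the ratings and return the one in the middle (the median)."""
    return median(ratings)


# Hypothetical ratings: both modules get six perfect 10s from fans,
# but the remaining five voters disagree sharply.
fans = [10, 10, 10, 10, 10, 10]
module_a = fans + [9, 9, 9, 8, 8]  # the other voters liked it too
module_b = fans + [5, 4, 3, 2, 1]  # the other voters disliked it

print(median_score(module_a))  # 10
print(median_score(module_b))  # 10
```

Both modules come out at a perfect 10: once more than half the votes are 10s, nothing in the lower half can move the middle value.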

The result was easy to see - we ended up with a chart where pretty much the entire Top 20 had a perfect 10.0 average, and you had to scroll down several pages to find anything with a rating of less than 9.0. This was good for the egos of the builders who made those modules, but absolutely useless if you wanted to know which of those modules were actually the best.

Last Friday the ranking system was changed from using medians to means. For a couple of days (until they reverted to using medians) the Top Rated pages were suddenly a lot more useful. The scores were more spread out, so we weren't stuck with 19 modules all rated at a perfect 10. The modules with consistently high ratings floated to the top, rather than the ones with the most rabid fans. And the whole system was a lot more representative.

A True Mean is the simplest, fairest and most common form of average - you just take the scores, add them all up, and then divide by the number of votes. This is the only one of the four ranking methods on offer this week that counts every vote and treats them all equally. It doesn't discard your vote because it thinks you've made a mistake, and it doesn't change the score you've given because it doesn't trust your opinion. With True Means, when you rate a module you can be sure that your vote will be treated just the same as everyone else's, and that the score you gave the module is the score that will be counted when the final ratings are worked out.
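As a minimal sketch of that calculation in Python (the function name `true_mean` and the sample votes are made up for illustration):

```python
def true_mean(ratings):
    """Add up every rating and divide by the number of votes."""
    if not ratings:
        raise ValueError("a module needs at least one vote")
    return sum(ratings) / len(ratings)


# Hypothetical votes for one module; every vote counts, including the 3.
votes = [10, 10, 9, 8, 3]
print(true_mean(votes))  # 8.0
```

Every vote goes into the sum unchanged, so the low vote pulls the score down and the high votes pull it up, each in proportion to the rating given.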

Now, some people see this as a disadvantage. They say that we can't be trusted, and that counting all of the votes equally will encourage flamers and trolls. The suggestion is that there's a plague of flame voting going on that we need to be protected against. This is absolute nonsense. Out of 9000 votes cast for the top 100 modules in the all-time Top Rated list, about 120 look like they MAY be flame votes. 120 possible flames out of 9000 votes. That's about 1.3%. Almost all of those flames are against modules which have 200 or more votes, so they have very little impact on the modules' overall scores. And many of them are from the same dozen trolls, whose accounts will no doubt be deleted soon anyway.

And yet the people backing the "Trimmed Mean" system suggest that we should automatically throw out 10% of the votes for every module on NWVault to solve this non-existent problem. That means that for every dud vote which is removed, nine perfectly valid ones will also be discarded. Is this really a sensible way to do things?
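A rough Python sketch of what a 10% trim does to a module with a single flame vote is shown here. The proposal as described only says that 10% of the votes are thrown out, so this sketch assumes the usual symmetric form, dropping 5% from each end of the sorted list. The function name `trimmed_mean` and the 100 sample votes are hypothetical.

```python
def trimmed_mean(ratings, trim_percent=10):
    """Mean after discarding trim_percent of the votes in total.

    Assumption: half of the discarded votes come from the bottom of the
    sorted list and half from the top.
    """
    ordered = sorted(ratings)
    k = len(ordered) * trim_percent // 200  # votes dropped from EACH end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical module: 100 votes, exactly one of which is a flame (a 1).
votes = [1] + [8] * 29 + [9] * 40 + [10] * 30

discarded = len(votes) * 10 // 200 * 2
print(discarded)            # 10 votes thrown away, only 1 of them a flame
print(trimmed_mean(votes))  # 9.0
print(sum(votes) / len(votes))
```

In this made-up case the plain mean is about 8.93 and the trimmed score is 9.0, so nine honest votes are discarded to move the score by less than a tenth of a point.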

The fact is that flame votes are rare, and when they do occur they're usually very easy to root out. We already have a system for doing this: repeat offenders can be reported to Maximus and, if there's clear evidence that someone is abusing the system by consistently flaming popular modules, or by registering multiple accounts to inflate the rating of their own module, their voting accounts will be deleted and their votes will disappear. There's no need to indiscriminately throw away thousands of our votes all over the site just to get rid of this handful of troublemakers.

The other supposed problem with True Means is that a module that has just entered the charts might appear near the top. Again, I don't see this as a problem. A module already needs ten votes to get into the chart at all, so if it's got (say) a 9.7 average from those ten votes it must be doing something right! And if it does turn out that the module is overrated, it will rapidly sink down the charts to its correct position as other people play it and give it more realistic ratings.

The proposed fix for this "problem" is to use the Bayesian estimate, a complex-looking piece of algebra that shifts all of the scores around depending on how many people have rated the module, how many votes a module needs to get into the charts, and what the average score of every other module on NWVault is. What it does is take the average score of a module and then shift it towards this super-average of all the scores of every module on NWVault. How much the score is shifted depends on the number of votes the module has received. The idea is that the fewer votes a module has got, the less accurate its average score is. Which is true. But how is the average score of every other module on NWVault any more accurate than the individual ratings that people have given that particular module? The answer is, it isn't.
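The exact formula being proposed isn't reproduced here, so the following Python sketch assumes the weighted form commonly used for chart rankings of this kind, which takes the same three inputs described above: the module's own mean, its number of votes, the minimum votes needed to chart, and the site-wide mean. The function name, the parameter names and the site-wide mean of 8.0 are all assumptions for illustration.

```python
def bayesian_estimate(module_mean, votes, min_votes, site_mean):
    """Weighted blend of a module's own mean and the site-wide mean.

    Assumed form: (v * R + m * C) / (v + m), where R is the module's
    mean, v its vote count, m the minimum votes to chart and C the mean
    rating across all modules.
    """
    return (votes * module_mean + min_votes * site_mean) / (votes + min_votes)


MIN_VOTES = 10   # votes needed to enter the chart (from the text)
SITE_MEAN = 8.0  # hypothetical average across all modules

# A great new module: 9.7 from its first ten votes.
print(round(bayesian_estimate(9.7, 10, MIN_VOTES, SITE_MEAN), 2))   # 8.85
# The same 9.7 average once 200 people have voted.
print(round(bayesian_estimate(9.7, 200, MIN_VOTES, SITE_MEAN), 2))  # 9.62
# A poor new module: 5.0 from its first ten votes.
print(round(bayesian_estimate(5.0, 10, MIN_VOTES, SITE_MEAN), 2))   # 6.5
```

Under these assumed numbers, a module with only ten votes is pulled halfway towards the site-wide mean, whichever direction that happens to be, while a module with 200 votes barely moves.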

What the Bayesian estimate does in practice is to reduce the score of great new modules, shifting them maybe 30 or 40 places down the charts, simply because they don't have as many votes yet as the Dreamcatcher or Penultima modules. But modules further down the charts inevitably get downloaded less often, and therefore rated less often, so it will take new modules even longer to get to their rightful place. It's hard enough to get the ten votes you need to get into the chart already. Using the Bayesian scoring system will make it even harder for new releases to get the attention they deserve, and will discourage new designers by giving them much lower scores than they should have. So as well as warping the ratings you've given modules using a series of fiddle factors which are almost impossible to work out without a spreadsheet, this method will also hurt the community in the long run by penalising new talent. And bizarrely, the Bayesian system will also artificially boost the scores of bad new modules, putting them a lot higher up the charts than they deserve to be.

None of these methods is perfect. They all have advantages and disadvantages, and none of them can deal with every situation that is thrown at them. But vote rigging, flaming and other problems are thankfully very rare, and when they do occur we already have ways of dealing with them. There's no need to use a lot of inaccurate and indiscriminate statistical trickery to try to compensate for these issues, especially when the "solutions" being proposed will cause more problems than they solve.

I believe that the True Mean is by far the best of the four choices for NWVault. It's fair, easy to understand, and produces a chart which is both useful for players and representative of the votes that have been cast. It doesn't penalise new releases or discard thousands of votes on a whim, like the other methods that have been suggested. And the True Mean is the only method that counts every vote, and treats every vote equally.

Sometimes the simplest method really is the best...