As many of you have experienced or at least heard, our peace time didn’t really bring peace into the game this year. Unfortunately, this was not the first time this has happened in Grepolis history, so this blog post aims to explain why peace times have been a recurring issue in Grepolis, what went wrong this year and how we’re going to do better in the future.
Peace times going wrong? Again?
Unfortunately, Grepolis has quite a history regarding peace times – but this year, something was different! In the more distant past, we had problems with the logic and checks for those special occasions; more recently, the problems came from peace times not being set up properly. And indeed, the system for setting up such a “peace” was very cumbersome: our Community Managers couldn’t really plan ahead, and it had to be configured manually on every single world, so this process was bound to fail at some point (forgetting to tick a checkbox for one of the worlds, forgetting to paste one of the digits of a timestamp…). That is why we reworked the setup of peace times to use the same proven mechanic we use for our ingame events (such as the recent Wheel of Fortune).
Why did we still have issues? Did we break it while refactoring? Didn’t we test enough?
While moving the system, we wrote a lot of automatically executed tests (known as Unit and Integration Tests) to ensure the mechanic works, covering in particular many of the scenarios we had issues with in the past, including the one that failed this year. And everything was passing. Of course, we also tested the peace time manually: we basically simulated Christmas on our internal servers, and everything worked there.
Wait, what? So, what did go wrong?
To explain this, I need to briefly explain how Grepolis works in the background. Grepolis is a game whose “Backend” (the “program” running on our servers, which manages e.g. all the ongoing actions and fights) is written in PHP, so it’s not actually “running” all the time; instead, we have to build up the game with every request from your browser or app or from our own system (e.g. when a command arrives). As you can imagine, Grepolis is a very big and complex game, so building it up could take quite some time. To improve this (and therefore have lower response times for your requests) we use a lot of different techniques, and one of them is caching. You might know caching from your web browser, where the browser remembers images or other files so you don’t have to load them again and again if they don’t change. We basically do the same in our backend, especially in this “happening system” (the system we use for our ingame events, as each of them is a kind of special happening).
The idea is: asking the database (the place where we store all the information about the game) is quite “expensive” (it costs time), so why ask again if we already know the answer? Instead, we try to remember the answer we already got.
Okay, what does that have to do with the problems we experienced?
Just to recall the issue we had: starting 24 hours before a peace time, it should no longer be possible to send colony ships to other towns (so basically it shouldn’t be possible to conquer new cities unless the colony ships are already on their way).
To ensure this and a lot of other rules, we have a list of checks to go through when a player tries to send an attack – and only if none of the rules are violated do we actually send the attack.
Here is the relevant check: when sending an attack, we first check whether there would be peace at the arrival time of the attack (let’s call that timestamp1) and then, if and only if a colony ship is being sent along with the attack, whether there would be peace 24 hours later (timestamp2). If either check finds peace, the attack is not sent.
And here comes the problem with our caching: by the time of the second check, we had already loaded the peace data for timestamp1 and cached it. Our caching then assumed it already knew everything about peace times – so if a new peace time started at some point between timestamp1 and timestamp2, our game just wouldn’t see it.
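To make the failure mode more concrete, here is a minimal Python sketch of this kind of caching bug. It is not our actual backend code (which is PHP and far more involved); all names, the peace window and the two cache classes are made up for illustration. The first cache remembers a single answer regardless of which timestamp was asked for, while the second one remembers an answer per timestamp, which is one possible way to avoid the problem.

```python
from typing import Callable, Dict, Optional

HOUR = 3600  # seconds

# Hypothetical peace time: starts at hour 30 and ends at hour 78.
PEACE_START = 30 * HOUR
PEACE_END = 78 * HOUR


def load_peace_from_db(timestamp: int) -> bool:
    """Stand-in for the 'expensive' database query."""
    return PEACE_START <= timestamp < PEACE_END


class BuggyPeaceCache:
    """Remembers the first answer and reuses it for every later question."""

    def __init__(self) -> None:
        self._cached: Optional[bool] = None

    def is_peace(self, timestamp: int) -> bool:
        if self._cached is None:
            self._cached = load_peace_from_db(timestamp)
        # Bug: the answer for timestamp1 is returned for timestamp2 as well.
        return self._cached


class KeyedPeaceCache:
    """Remembers one answer per timestamp (one possible fix)."""

    def __init__(self) -> None:
        self._cached: Dict[int, bool] = {}

    def is_peace(self, timestamp: int) -> bool:
        if timestamp not in self._cached:
            self._cached[timestamp] = load_peace_from_db(timestamp)
        return self._cached[timestamp]


def may_send_attack(
    is_peace: Callable[[int], bool], arrival: int, has_colony_ship: bool
) -> bool:
    """Apply the two peace checks described above."""
    if is_peace(arrival):  # timestamp1
        return False
    if has_colony_ship and is_peace(arrival + 24 * HOUR):  # timestamp2
        return False
    return True


# A colony ship arriving at hour 10: no peace yet, but peace 24 hours later.
arrival = 10 * HOUR
buggy_allows = may_send_attack(BuggyPeaceCache().is_peace, arrival, True)
keyed_allows = may_send_attack(KeyedPeaceCache().is_peace, arrival, True)
```

In this sketch the buggy cache lets the colony ship through, because the "no peace" answer for timestamp1 is reused for timestamp2, while the keyed cache correctly blocks it.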
Okay, got that – but how could our tests pass then?
This is actually the reason why solving this bug took so long (we started debugging at 9pm on the 23rd of December and it took us until 1am on the 24th to actually have a fix for that bug). The problem is: this caching is disabled on our test system, as we don’t really need the performance boost there and want to keep our testing process simple (I won’t go deeper into that topic). So even though we were testing a lot, we weren’t testing exactly what was happening in the live game. This was definitely our fault, no argument about that – and we are really sorry about this!
Yay, bug fixed, but then we made another mistake ☹
In order to reduce the impact of this bug on you all, we wrote a script that would check all running attacks including a colony ship AND all running conquests to see when the underlying attack had been launched.
Canceling the running attacks: worked like a charm. Canceling the running conquests: not so much – and again, we are very sorry about that. The logic of our repair script was basically this: get all conquests, check when they were started, calculate the travel time from the origin town to the town under conquest and check whether this could have been a legit attack. That was fine so far, except for one problem: to find out when a conquest started (we don’t actually store that anywhere), we had to calculate backwards from the moment it will end (we do store this) and subtract the hours a conquest takes on that world. The problem here was: on most worlds, this value is 0.
Why 0? Surely there are no worlds with instant conquests?! Correct: 0 indicates the default behavior, which means that we should calculate 24 hours divided by the world speed. It was already 4am when we deployed that script (deploying takes some time, and of course we again tested everything: first our fix under live conditions, then our script; it was working, so go!) and by that time we just couldn’t remember this detail anymore, while we were also under severe time pressure because we wanted to minimize the impact of our bug as much as possible. On our test servers everything worked, because we usually set this value explicitly there (we just don’t want to test something conquest-related and then have to wait for 8 hours or more), so we couldn’t catch this default case there either.
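As a rough illustration of that default case, here is a small Python sketch. Again, this is not the real repair script: the function names and the numbers are made up, and the only rule taken from above is that a configured value of 0 means "24 hours divided by the world speed".

```python
HOUR = 3600  # seconds


def conquest_duration_hours(configured_hours: float, world_speed: float) -> float:
    """Return how long a conquest takes on a world.

    A configured value of 0 does not mean 'instant'; it means
    'use the default', which is 24 hours divided by the world speed.
    """
    if configured_hours == 0:
        return 24 / world_speed
    return configured_hours


def conquest_start_naive(end_ts: int, configured_hours: float) -> int:
    """What a script that forgets the default does: it subtracts the raw value."""
    return end_ts - int(configured_hours * HOUR)


def conquest_start(end_ts: int, configured_hours: float, world_speed: float) -> int:
    """Calculate the start time backwards from the stored end time."""
    duration = conquest_duration_hours(configured_hours, world_speed)
    return end_ts - int(duration * HOUR)


# Hypothetical conquest ending at second 100000 on a speed-3 world
# where the conquest duration is left at its default (0).
end_ts = 100_000
naive_start = conquest_start_naive(end_ts, 0)
correct_start = conquest_start(end_ts, 0, 3)
```

With the raw value 0, the naive calculation puts the start of the conquest at the very moment it ends, whereas the default rule puts it 8 hours earlier on a speed-3 world. Any legitimacy check based on the naive start time is therefore working with the wrong launch time.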
This is why we then deployed another script on the 25th of December to compensate all players affected by our faulty cancellations (this time not developed at 3am and with close to no time pressure) and executed it at 2:30pm.
First of all, we owe you an apology – we are deeply sorry, you all deserve better and we promise to take this issue seriously, reflect on it and also take appropriate actions to prevent such issues in the future (and most important for now: prevent such an issue during the New Year’s peace time).
Second, this issue revealed some oddities and design problems in our existing code – which is absolutely normal for a game as old as Grepolis, but of course we have to tackle them so we don’t make the same mistakes again somewhere else.
Third, thank you to everyone who upheld Fair Play by not sending any more colony ships even though it would have been possible!