When the number of reports increased, it became clear that this was an outage. Several Lindens jumped in to diagnose the reports. After some digging, we discovered that our back end infrastructure was being overloaded. Once this was resolved, the positive impact was almost immediate. The data now looks good on the back end, and we are no longer receiving inventory reports. We’re taking steps, including a deploy late last week, to prevent these issues in the future, and we have already seen progress in making the service more robust.

Every week we restart the simulator servers to keep them running smoothly. We restart only a certain number at the same time, allow them to finish, then start another batch. That’s why we call them “rolling restarts.” But on Tuesday, a recent upgrade to the simulators meant that the usual number of simultaneous restarts was too much for the system to handle. The result was load spikes and numerous regions going down. Ultimately more than 12,000 regions, or 40% of the grid, were stuck in restart mode. The team came together quickly to bring simulators back up in smaller batches, and then manually fixed blocks of regions. After some trials and monitoring, we found a smaller number of concurrent restarts that worked better. By 1:00 PM all regions were restored and operating normally.
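For readers curious about what "rolling" means in practice, here is a minimal sketch of the batching idea in Python. It is only an illustration under stated assumptions, not the tooling used on the grid: the names `rolling_restart`, `restart`, and `wait_until_done` are hypothetical, and the batch size is whatever number of concurrent restarts the system can tolerate.

```python
from typing import Callable, Iterator, Sequence


def batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def rolling_restart(
    simulators: Sequence[str],
    batch_size: int,
    restart: Callable[[str], None],                    # hypothetical: kicks off one restart
    wait_until_done: Callable[[Sequence[str]], None],  # hypothetical: blocks until the batch is back
) -> int:
    """Restart simulators one batch at a time; return the number of batches run."""
    batch_count = 0
    for batch in batches(simulators, batch_size):
        for sim in batch:
            restart(sim)
        # Do not start the next batch until this one has finished.
        wait_until_done(batch)
        batch_count += 1
    return batch_count


if __name__ == "__main__":
    sims = [f"sim{i}" for i in range(1, 6)]  # made-up simulator names
    n = rolling_restart(
        sims,
        batch_size=2,
        restart=lambda s: print("restarting", s),
        wait_until_done=lambda b: print("batch done:", list(b)),
    )
    print("batches:", n)
```

Lowering `batch_size` is the knob described above: fewer simultaneous restarts means less load at any one moment, at the cost of a longer overall restart window.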