Background Threads

Nyegomi supports a system with a load of actor objects tied to a bus, and supporting huge numbers of concurrent users. Once per second, the server looks at all the active objects and calls their update method, which gives them a chance to do vital housekeeping. Many of the objects may spin up background threads during that time.

Like a lot of threading code, this leads to loads of problems in the wrong hands. Extra problematic in Python.

class Service(object):
    …
    def update(self):
        if not self.task_complete_yet or not self.can_do_background_task(self.target): 
            Thread(target=self.try_to_do_background_task).start()
        if self.will_do_background_task:
            Thread(target=self.do_background_task).start()

    def try_to_do_background_task(self):
        while not self.task_complete_yet:
            time.sleep(0.1)
        time.sleep(0.15) 
        if not self.task_complete_yet and not self.can_do_background_task(self.target):
            if not self.task_complete_yet:
                self.will_do_background_task = True
            Thread(target=self.try_to_do_background_task).start()
            return
        self.update()

    def do_background_task(self):
        if not self.can_do_background_task(self.target):
            return
        self.world.do_background_task(self.target)
        self.task_complete_yet = True
        self.will_do_background_task = False
        def _end_task():
            time.sleep(1.0)
            self.task_complete_yet = False
        Thread(target=_end_task).start()

The details of what the background task is or what the target is are really irrelevant. Let’s see if we can trace through what this thing is actually doing.

On an update, if its task isn’t complete or it can’t do the task at all, it calls try_to_do_background_task as a thread. This thread just idle loops until the task is finished. Then it sleeps a little longer, once the task is done, then it checks: if the task isn’t complete (again) and it can’t do the background task, and then if the task isn’t complete yet (checking a third time), we set a flag to say we will do the background task. Then we start a new copy of this thread.

Now, here’s the key problem with this. The resources needed for performing the background task (the logic behind can_do_background_task) are shared, and this particular actor might not get access to them for many seconds. But every second, this spins up a new copy of the try_to_do_background_task thread. Which sleeps until the task is complete, but the task is completed in do_background_task, which is only invoked if will_do_background_task is set to true, which only happens after the task_complete_yet flag is true. And if two trains meet at a crossing, neither may continue until the other has passed.

Even if it wasn’t just an endless loop factory, nothing about this uses any locking or mutexes. Now, while Python’s GIL prevents that from being a problem in many cases, if a thread is sleeping the variables it shares with other threads may mutate a bunch- which means a thread could easily sleep through the window when task_complete_yet is true.

And yes, this baffling code does end up creating huge piles of race conditions and conflicts with other threads in the system. Nothing about this threading logic makes sense.

Remy Porter

Source link