Welcome back. Today I wanted to share with you some thoughts about the problem with SQLAlchemy used in conjunction with Pylons framework based application that has been getting on my nerves for some time. However, before I’ll talk about it I want to lay the basic foundation for further elaboration on this problem.
My company has been developing pretty advanced Content Management System for almost two years now. This CMS is based on Pylons framework which is probably the most flexible stable Python web framework existing currently. Of course Pylons developers aren’t sleeping and their Pyramid framework, which is loosely based on the experiences from Pylons project and ideas from various other micro-frameworks, is probably the next big thing happening in Python web development world. Everybody’s jumping around and praising the new Pyramid. I’ve even heard that some other frameworks are willing to rewrite their cores using Pyramid. What will happen we’ll see in the near future. For me personally it’s good to hear that, but since Pyramid is currently under development, I’m still willing to use Pylons for some time, before Pyramid foundations stabilize. After all Pylons is and will be supported by Pylons developers indefinitely.
So. Let me speak about this CMS system we’ve been developing. As I’ve said it’s based on Pylons web framework. The framework itself was heavily extended and rewritten in some areas to better suit our needs. From the start we’ve been aiming at the modular architecture. We’re using SQLElixir on top of heavily acclaimed SQLAlchemy ORM for our database model and operations. Of course there are many advanced mechanisms and libraries that have been developed and used for this CMS. But I’m not going to talk about it for now. We had few successful client website deployments using this product.
As you probably may know Pylons is based upon Paster application server, but it’s usage is not recommended for production environments. It’s mostly intended to be used as development server. This doesn’t matter however, because any application server can be used for Pylons based application. This is the primary component of the Pylons based application – the application server. There’s also another component which is the database itself, but Pylons application can be written without any database. Our CMS however is using PostgreSQL for it’s database backend.
The application server itself is the core of the system, but there is also another component that is required for proper functioning of the CMS, which is neither standard nor included in the Pylons framework. We have developed this component in house and that is a supporting server – Scheduling Server.
This is very important part of the system, despite the fact that application server can be run without it. Scheduling Server is a secondary, independent daemon process which is used in conjunction with some advanced administrative panel capabilities of the primary application. It’s main purpose is to offload resource intensive and/or time consuming tasks that can be performed with some delay and don’t require immediate user interaction. It can be also used for periodic tasks and checks on the primary application server itself.
So. The Scheduling Server is basically a Python based Cron daemon for our CMS and on the other hand the biggest source of problems for us. Despite it’s rather simple architecture it has seen numerous rewrites for the last fifteen months or so. Why? Let me explain a little bit how this supporting daemon works and what it does for the system.
Scheduling daemon is relatively simple server with primary loop that periodically checks for new tasks. It launches checks for new tasks (jobs) with the interval of few seconds. Each check is launched in a different thread, so multiple of them can be done at the same time. It mostly interacts with database through the database model developed for primary application. The model is shared between the two, so you can say it’s being reused within the Scheduler. Before scheduling server enters primary loop it loads pylons configuration and environment by the means of pylons.config and pylons.load_environment functions. Those functions use the same .ini file that’s used for application server configuration before it’s started. So basically it also reuses the configuration of primary application. After configuration, the database model is initialized, the SQLElixir/SA engine is created, connection to database is established and you can use the model the same way you use it within the primary app. After all these things the primary loop starts.
Scheduler is used mostly for performing checks on new translations using i18n facilities within Pylons, making database backup dumps and restoration, email sending, queuing tasks, external services synchronization among other things. Some of those tasks like compiling new i18n translation files or restoring database dumps requires the Scheduler to restart the primary server and ensure it is working after the restart. During most of these operations the Scheduler performs database interactions with the Tasks table.
And here comes the problem. For most of the time, the Scheduler after launch is performing it’s tasks without any problems. However the longer it was running, the less tasks were performed, up to the point where it seemed that it’s not doing anything, despite the primary loop was functioning o.k. As I have said, checks are performed in separate threads, so it’s quite difficult to debug what’s going on, despite having extensive log facilities in place.
For most of the time there was no tracebacks, no STDERR or STDOUT information within the log files and since the server is daemonic we couldn’t easily debug. The problem was really elusive for a long time, because Scheduling server seemed to work fine after restarting it and we couldn’t find any errors whatsoever. We’ve implemented interactive mode just to debug, without any STDERR or STDOUT redirects, so we would immediately see what’s going on. We’ve reimplemented the server more than six times, three times using our in-house built solutions, three others using available scheduling libraries. Without any success. The problem disappeared for some time, but after clients started using either translation facilities or backup facilities the problem appeared after some time. We couldn’t exactly trace the origins of the problem and we didn’t have any clue as to what may cause such an odd behavior.
Few times I have tried to comment out portions of the code and run tasks, without any threads whatsoever, only sequentially directly one after another in the primary loop, because I had suspected that there’s a strange bug in the tasks code themselves. Even when running for few hours and testing it this way no problem appeared at all. No single error. No traceback. Not even a single warning. Everything was working as it was intended to. I was clueless. Until today.
Once again I wanted to attack this problem with a fresh perspective and open mind. Up until today it was irritating, but we got around it using very nasty hack, which restarted the Scheduler after some time. However I didn’t want it that way, so I’ve wasted whole day for testing, looking for the bug, testing even more, rewriting, trying, getting nervous, breaking the code to pieces, analyzing, testing, uploading to server, retesting, searching at the other place, spitting curses, getting mad, debugging and everything again ad infinitum.
When I got to the problem I thought after some time that the problem lies within the threads. Or at least with the task code behavior when wrapped around in the threads. This enlightening thought came to me after I’ve rewritten most of the Scheduler core to use APScheduler library and getting rid of the daemonizing code. I’ve left purely interactive code I could more easily debug. I’ve also rewritten logging facilities which finally gave me some meaningful output. This allowed me to see that after some time, when debug running Scheduling server with it’s guts ripped out, that after doing some operations in primary application administration panel, task threads began to throw tracebacks.
The traceback stated: “TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30”.
Quickly googling for answer brought only a slight clue, that the connection pool queue is probably exhausted. Reason still unknown. So I’ve used two other previous implementations of scheduling server to see if this problem with persists or is it simply some kind of implementation caused bug. It turned out that on the other implementations the problem is still appearing. I’ve tried to fiddle with sqlalchemy.pool_timeout parameter, but to no avail. After spending few hours trying to resolve the issue I thought that explicitly closing or removing model.session would help. Still no luck.
So I’ve used poor man’s debug, namely “print variable” and “print dir(variable)” statements to see if I can learn something from it. There was still no clue whatsoever for the reason the bug might happen. I’ve turned out to review the system’s model initialization functions and tried to learn where the connection pool might reside, checking for function explanations in SQLAlchemy documentation. I’ve deduced, that when I close the session inside the thread nothing happens. But when I get rid of database engine variables the next task that is ran by Scheduler is throwing exceptions about not being able to use the connection to database. Step by step I’ve thought that it must be something between the model’s engine and session instances. I knew it was the connection problem, but since I was using SQLElixir, there was no simple way to get into connection pool debugging or at least I didn’t know how to do it easily.
I thought I’d give a model.engine functions a try and see if I can find anything relevant to the connection closing. After looking at the documentation immediately one function grabbed my attention: model.engine.dispose(). I reasoned that if the problem is related to threads, the connection pool must be somehow left open after the thread has finished. Or maybe the task is finished, but the thread is still running, because garbage collection is not performed properly by Python and the SQLAlchemy connection QueuePool is left somewhere in the memory without cleanup.
After adding model.engine.dispose() after each executed threaded task and testing extensively, the problem went away. I’ve immediately understood what was causing the bug. It was related to… imports.
Each task thread had it’s own module with a class for this thread. Each of those modules had “from cms import model” statement. We have successfully used this statement within the primary application, but since the primary application is built on Pylons and the application is by default threaded, the problem didn’t occur, because the model was shared by Pylons among the threads and reused from memory (I know this is oversimplification). However, in our supporting Scheduling Server when scheduler launched new task, it probably didn’t cleanup model after thread termination (I’m not sure if the thread was terminated at all) and the model.engine with it’s connection pool was still active and residing in memory. Since each new thread launched a new model instance and not reusing the previous one (since it was using module and not Singleton based class), without previous model instances cleanup, the connection pool was exhausted quickly. That’s why forcing engine shutdown by the means of model.engine.dispose() cleaned up the stale connections within the thread and allowed it to finish. This is just a guess, because I didn’t have time and will to debug the cause further, since I’ve already wasted too much time on it. My bad, I know. 😉
Anyway. This note may not be very useful to you, but at least shows that sometimes really absurd things can cause havoc and be the cause for many hours of debugging and cursing everything around. Some quite simple and (mostly) logical things may be your nemesis when programming. I’m still not very convinced about the explanation I gave to you here, but it sounds probable. The other thing may be that the SQLAlchemy or Python itself has some nasty bug that causes this problem in a specific situation. As I have mentioned. I’m not willing to investigate it further for now. Debugging multi-threaded programs is a pain in the ass and in Python it’s especially true.
That’s all for now. Thanks for reading.