Debugging Stuck Threads in Multi-Process Python Programs

I’ve been fiddling with the Journal thread in networking-odl lately.
One of the changes caused the unit tests to get stuck, and with no idea what was stuck where, I started looking for a good solution.

A very nice project addressing this is pystuck, which runs a debugging server in the background and lets you connect to it to dump the stack trace of the program.

It’s very simple to use. You just need to:

pip install pystuck

And then to use it, just add this line to a Python script:

import pystuck; pystuck.run_server()

Then, while the Python program is running, just execute pystuck and it will open an IPython shell that lets you query the stack trace (and presumably do some other stuff which I haven’t looked at).
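To give a feel for what "dumping the stack trace" of a hung program means, here is a minimal standard-library sketch. It is not pystuck's implementation and it doesn't run a server; the names `dump_all_stacks` and `stuck_worker` are made up for illustration, and the "hung" thread is simply one waiting on an event.

```python
import sys
import threading
import traceback


def dump_all_stacks():
    """Return a formatted stack trace for every live thread in this process."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):\n" % (names.get(ident, "?"), ident))
        chunks.extend(traceback.format_stack(frame))
    return "".join(chunks)


def stuck_worker(started, release):
    """Hypothetical stand-in for a hung thread: blocks until released."""
    started.set()
    release.wait()


started = threading.Event()
release = threading.Event()
worker = threading.Thread(
    target=stuck_worker, args=(started, release), name="stuck-worker", daemon=True
)
worker.start()
started.wait()  # make sure the worker is inside stuck_worker before dumping

report = dump_all_stacks()
print(report)

release.set()  # let the worker finish so the script exits cleanly
worker.join()
```

The printed report lists each thread with the file, line and function it is currently sitting in, which is the kind of information you want when something hangs.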

This was quite nice, although somehow placing it in the wrong place caused the test runner to flip out with errors such as “WARNING: missing Worker 0! Race in testr accounting.” and, oddly, the tests didn’t get stuck.

The way the OpenStack testing framework works is that by default it spreads the tests across all available CPUs. In my case the laptop has 4 physical cores, and with hyper-threading active that doubles to a total of 8 logical cores available for execution.
I ended up placing pystuck in one of the test modules that has enough tests to make sure it gets spread across all the test processes that are forked from the runner.

Also, it’s important to note that OpenStack uses tox to execute the tests in an isolated virtual environment, so installing pystuck outside of it won’t work.
What you need to do is add pystuck to the test-requirements.txt file and then run tox (if you still get errors, delete the .tox directory and rerun tox).

OK, so great work until now, but when running pystuck while the tests were hanging I got:

$ pystuck
unable to connect to the server, please follow the instructions:

After some moments of thinking it struck me that while some thread does get stuck somewhere in the tests, finding it will be a bit more challenging: because of the global interpreter lock, Python threads don’t really run in parallel, and hence the test framework runs the tests in different processes forked from the main one.
By default pystuck connects to port 6666, but it is easy to specify a different one. What I ended up doing for simplicity is using the PID as the port. This works quite well, since only the stuck test processes remain running while the others have finished, so determining which port(s) to connect to is as easy as running

ps -ef | grep python
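As an aside, one plausible reason a single default port falls short here is that only one listener can hold a given TCP port at a time, so several worker processes can't all serve on it. The sketch below shows this with plain sockets rather than pystuck; `try_listen` is a hypothetical helper, and both listeners live in one process only to keep the example short.

```python
import socket


def try_listen(port, host="127.0.0.1"):
    """Try to listen on (host, port); return the socket, or None if it is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError:
        sock.close()
        return None
    return sock


first = try_listen(0)  # port 0 asks the OS for any free port
port = first.getsockname()[1]
second = try_listen(port)  # a second listener on the same port

print("first listener on port", port)
print("second listener started:", second is not None)

first.close()
```

Giving each process its own port avoids this, and the PID is a convenient value that is already unique per process.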

So, to have pystuck use the PID as the port, you need to use this code in your program:

import os; import pystuck; pystuck.run_server(port=os.getpid())

And then you simply find the stuck PID and run

pystuck --port={PID}

to connect to that process.

Now, this might not work in all situations, since the PID could conflict with a port that is already in use (or fall outside the usable port range), but then you can just rerun the test and the PID will change.
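If rerunning until the PID happens to be a usable port gets tedious, one option is to derive the port from the PID instead of using it directly. This is only a sketch under my own assumptions: `port_for_pid` and `port_is_free` are hypothetical helpers, not part of pystuck, and the 1024 to 65535 range is simply the unprivileged TCP port range.

```python
import os
import socket


def port_for_pid(pid, low=1024, high=65535):
    """Map a PID into [low, high]; PIDs already in range map to themselves."""
    if low <= pid <= high:
        return pid
    return low + pid % (high - low + 1)


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


pid = os.getpid()
port = port_for_pid(pid)
print("pid", pid, "-> port", port, "free:", port_is_free(port))
```

The result would then be passed along as `pystuck.run_server(port=port_for_pid(os.getpid()))`. The catch is that for a PID outside the range you have to apply the same mapping by hand before running `pystuck --port=...`, so printing the chosen port from the test process is probably worth it.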

Needless to say, this helped me find out why the tests were getting stuck.
Happy debugging!

