Heroku memory quota killing machine

When I was migrating a suite of Exvo apps to Heroku's Cedar stack, I noticed some very strange behaviour. For two apps, during the first "run", the main process spun out of control and tried to consume way too much memory. After a manual restart everything went back to normal and the problem was basically gone (we haven't seen it recur since).
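For reference, the manual restart was nothing fancy, just the standard CLI command; something along the lines of this, with my-app standing in for the actual app name:

heroku ps:restart worker.1 --app my-app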

Here is the log of such an out-of-control worker:

2012-04-02T11:52:32+00:00 heroku[worker.1]: Process running mem=550M(107.5%)
2012-04-02T11:52:32+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-04-02T11:52:52+00:00 heroku[worker.1]: Process running mem=811M(158.5%)
2012-04-02T11:52:52+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-04-02T11:53:12+00:00 heroku[worker.1]: Process running mem=858M(167.7%)
2012-04-02T11:53:12+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-04-02T11:53:32+00:00 heroku[worker.1]: Process running mem=1099M(214.8%)
2012-04-02T11:53:32+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-04-02T11:53:52+00:00 heroku[worker.1]: Process running mem=1539M(300.6%)
2012-04-02T11:53:52+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-04-02T11:54:12+00:00 heroku[worker.1]: Process running mem=1580M(308.7%)
2012-04-02T11:54:12+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-04-02T11:54:32+00:00 heroku[worker.1]: Process running mem=1709M(333.8%)
2012-04-02T11:54:32+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-04-02T11:54:53+00:00 heroku[worker.1]: Process running mem=2115M(413.1%)
2012-04-02T11:54:53+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-04-02T11:55:13+00:00 heroku[worker.1]: Process running mem=2301M(449.6%)
2012-04-02T11:55:13+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-04-02T11:55:33+00:00 heroku[worker.1]: Process running mem=2302M(449.6%)
2012-04-02T11:55:33+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-04-02T11:55:53+00:00 heroku[worker.1]: Process running mem=2747M(536.6%)
2012-04-02T11:55:53+00:00 heroku[worker.1]: Error R15 (Memory quota vastly exceeded)
2012-04-02T11:55:53+00:00 heroku[worker.1]: Stopping process with SIGKILL
2012-04-02T11:55:54+00:00 heroku[worker.1]: Process exited with status 137

This is just a worker, but I've observed the same behaviour for web processes as well. Be aware that as soon as your application starts throwing R14/R15 errors, it stops processing any new requests (they simply time out).
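A cheap way to keep an eye on this (assuming the standard heroku CLI and a Unix shell) is to tail the logs and filter for the memory errors:

heroku logs -t --app my-app | grep -E 'R1[45]'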

Previously: Introduction to Heroku H12 timeouts.

Introduction to Heroku H12 timeouts

At Exvo we were experiencing problems while trying to send an email to all of our 150K+ users. This led to a careful investigation of what exactly happens during this process. So we set up our loggers (heroku logs -t | tee output.log), fired up heroku consoles, scaled up to 20 dynos and 10 workers and began the sending process.
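For the record, this was roughly the set of commands involved; the exact scaling syntax depends on the stack (heroku dynos / heroku workers on Bamboo, heroku ps:scale web=20 worker=10 on Cedar), and my-app is again just a placeholder:

heroku logs -t --app my-app | tee output.log
heroku console --app my-app
heroku dynos 20 --app my-app
heroku workers 10 --app my-app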

It all started with a POST request to send those emails:

app[web.10]: Started POST ...

That request timed out, and there was never a matching:

app[web.10]: Completed 200 OK ...

or similar log entry later on, just silence.

This is Heroku's killing machine in action. It silently kills every process that runs for longer than 30 seconds (we're using the Bamboo stack). This is more or less how Heroku deals with long-running requests. I don't blame them for this behavior, even though I very much dislike it. But I digress.

When Heroku kills such a process it does so in a very nice way, i.e. it lets it finish its business first. In other words the process will continue in the background, and after finishing it will just die/stop/restart. Our email-collecting job (selecting 150K users from the database…) ran uninterrupted for over 40 minutes. So this is the good part.

The bad part is that while the killed process is still running in the background, the dyno it was connected to stops serving new requests until that process finally dies, yet Heroku keeps sending new requests to it! This shows up as the infamous H12 timeouts:

heroku[router]: Error H12 (Request timeout) -> GET auth.exvo.com/users/sign_in dyno=web.10 queue= wait= service=30000ms status=503 bytes=0

Also notice the 503 status code (Service Unavailable), which is returned by the Heroku router.

This is really bad, because all sorts of random web requests will just keep failing without you knowing what's going on.

So while I greatly appreciate that Heroku lets such processes finish in the background, I really don't like that it keeps sending traffic their way.

PS: Yes, sending emails should be done as a background action by the Heroku workers. I know. But it’s not. Yet.
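For what it's worth, the background version could be as simple as the following sketch. It assumes Delayed::Job and a hypothetical NewsletterMailer; the names and details are illustrative, not our actual code:

# A hypothetical job object: Delayed::Job calls #perform later, on a worker, outside the web request
class NewsletterJob < Struct.new(:user_id)
  def perform
    user = User.find(user_id)
    NewsletterMailer.announcement(user).deliver
  end
end

# In the controller: enqueue one job per user instead of sending inline,
# so the POST returns well within Heroku's 30 second window
User.find_each do |user|
  Delayed::Job.enqueue(NewsletterJob.new(user.id))
end

The workers then drain the queue at their own pace, and no web dyno ever gets stuck behind a 40-minute request.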