Ok guys, I think I know what the problem might have been. We run daily backups for the site and database in the early hours of the morning. A large database backup job, which temporarily locks the database while it runs, would have been running around 3am PST (7am AST), so that already sounds very suspect.
Looking at my logs over time, this job took < 1 minute to complete when the site was smaller. Last night it took about 12 minutes. So... that's bad.
I've done three things:
1) moved that backup job to an earlier time
2) purged the database of a bunch of stuff to get it down to ~10% of its former size
3) set up an alarm for when this job takes too long to complete
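For anyone curious what the alarm in (3) could look like, here is a minimal sketch, not the actual setup on this site. It assumes the backup is started by a wrapper script, and the names `timed_run`, `too_slow` and `MAX_SECONDS` are made up for illustration. The 60 second limit is an assumed threshold based on the ~60s target.

```python
import subprocess
import time

# Assumed threshold, taken from the ~60s target; not a measured value.
MAX_SECONDS = 60.0


def timed_run(command, runner=subprocess.run, clock=time.monotonic):
    """Run the backup command and return how long it took, in seconds."""
    start = clock()
    # check=True makes subprocess.run raise CalledProcessError if the job fails.
    runner(command, check=True)
    return clock() - start


def too_slow(elapsed, limit=MAX_SECONDS):
    """Return True when the job ran longer than the allowed limit."""
    return elapsed > limit


# Example wiring (hypothetical command; the real backup command is not shown here):
#   elapsed = timed_run(["/path/to/backup-script"])
#   if too_slow(elapsed):
#       print(f"ALERT: backup took {elapsed:.0f}s")
```

How the alert actually gets delivered (email, chat, pager) is left out, since that depends on whatever is already in place.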
The idea is to have a smaller window (~60s) of unavailability at a time that is less likely to be busy (middle of the night). Yes, there are more complicated ways to go about getting better than 99.999% availability, but I think this is ok for now.
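To put rough numbers on that, here is a quick back-of-the-envelope calculation. It assumes the backup is the only outage and that it happens once per day, which is a simplification. The function names are just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def daily_availability(downtime_seconds):
    """Availability in percent, given one outage of this length per day."""
    return 100.0 * (1 - downtime_seconds / SECONDS_PER_DAY)


def allowed_downtime(availability_percent):
    """Seconds of downtime per day that a given availability target allows."""
    return SECONDS_PER_DAY * (1 - availability_percent / 100.0)


print(round(daily_availability(12 * 60), 2))  # 12 minute backup: 99.17
print(round(daily_availability(60), 2))       # ~60s backup: 99.93
print(round(allowed_downtime(99.999), 3))     # five nines: 0.864 seconds per day
```

So under those assumptions a ~60s window works out to roughly 99.93%, and five nines would leave less than a second per day, which is why it needs the more complicated approaches.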
Can any of you kind folks check tomorrow morning @ 7am AST and let me know if this is still an issue? Appreciate it.