We experienced the same problems, and I've mostly resolved them by doing the following:
- We pushed cleanup back to 3 AM (from 12 AM) and shortened it to 2.5 hours. There are nights where it doesn't complete all of the cleanup, but over time it does keep up. Also, the 2.5-hour cutoff is not a hard cutoff: cleanup continues until it finishes the batch it's currently working on. Generally it's finished by 6 AM, which is when our user activity starts to kick in. A couple of times it has run really long, which caused issues.
- There are nights where I know the nightly integrations / processes will be running during the 3 AM time frame. For example, our EOM JE process runs all night. I manually turn off the cleanup on those nights. Afterwards I'll sometimes trigger an extra BO Cleanup on evenings or weekends to help it catch up. I check the server log every morning to see whether cleanup completed and whether it is behind.
Some nights our integrations/processes are still running at 3 AM and cause deadlocks. I have a report that runs every morning and looks for WF Warnings / Failures, and these deadlocks show up there. I review them to see if any action needs to be taken. I keep a spreadsheet of all these failures so I can identify trends over time.
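As a rough illustration of the trend tracking, here is a minimal sketch that counts logged failures per ISO week. The function name and the idea of feeding it a plain list of failure dates (for example, one column exported from the spreadsheet) are assumptions for the example, not how the report or spreadsheet is actually built.

```python
from collections import Counter
from datetime import date


def failures_per_week(failure_dates):
    """Count failures per (ISO year, ISO week).

    `failure_dates` is a hypothetical input: one date per logged
    warning/failure, e.g. a column exported from the tracking spreadsheet.
    """
    counts = Counter()
    for d in failure_dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


# Example with made-up dates: two failures in one week, one in the next.
weekly = failures_per_week([date(2024, 1, 1), date(2024, 1, 3), date(2024, 1, 8)])
for (year, week), n in sorted(weekly.items()):
    print(f"{year}-W{week:02d}: {n} failure(s)")
```

A week whose count jumps compared with the previous weeks is a cue to check whether an integration started overlapping the cleanup window.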
My long-term plan is to use the new Admin Console APIs to control when cleanup runs each night. When nightly processing is done, cleanup can be kicked off automatically. On nights where processing runs too long, the cleanup will be skipped.
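The run-or-skip decision in that plan could look something like the sketch below. It only covers the timing logic. `start_cleanup` is a hypothetical callable standing in for whatever the Admin Console API call turns out to be, and the 30-minute safety margin is an assumption added because the 2.5-hour cutoff is not a hard stop.

```python
from datetime import datetime, timedelta

CLEANUP_WINDOW = timedelta(hours=2, minutes=30)  # cleanup duration from the post
SAFETY_MARGIN = timedelta(minutes=30)            # assumed slack for the last batch


def should_run_cleanup(processing_done_at, users_start_at,
                       window=CLEANUP_WINDOW, margin=SAFETY_MARGIN):
    """Return True if cleanup plus the margin fits before users start work."""
    return processing_done_at + window + margin <= users_start_at


def run_or_skip(processing_done_at, users_start_at, start_cleanup):
    """Kick off cleanup if there is time, otherwise skip it for the night.

    `start_cleanup` is a hypothetical placeholder for the real API call.
    """
    if should_run_cleanup(processing_done_at, users_start_at):
        start_cleanup()
        return "started"
    return "skipped"


# Example: users start at 6 AM; processing finished at 2 AM one night, 4 AM another.
users_start = datetime(2024, 1, 2, 6, 0)
print(run_or_skip(datetime(2024, 1, 2, 2, 0), users_start, lambda: None))
print(run_or_skip(datetime(2024, 1, 2, 4, 0), users_start, lambda: None))
```

Passing the cleanup trigger in as a callable keeps the timing rule testable on its own, independent of how the API is eventually called.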