*sigh*
This would have been way too easy, right?
So ... again, sorry for the downtime.
Post-mortem (as of now):
MySQL has an issue with queries that use limits on very large datasets.
Someone (or more than one person?) was querying all threads of the forum. Because of the MySQL bug, a single query took something like 6 minutes. After a few of them were running, the system maxed out the CPU (more cores or RAM just push the point where it locks up again a bit further out). The database was handling nearly 1GB/s of select queries.
Normally we are idling at ~10-15% of 4 cores and something like 10MB/s read queries.
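For anyone wondering what "limits on very large datasets" can look like in practice, here is a minimal sketch. It is NOT the actual XenForo query or the plugin's fix; it assumes the problem is the well-known pattern where a large OFFSET forces the database to walk past all skipped rows before returning a page. The table name `thread`, its columns, and the helper functions are made up for illustration, and it uses SQLite from Python's standard library instead of MySQL so it runs anywhere.

```python
import sqlite3


def make_db(n_rows=1000):
    """Build an in-memory table of fake threads (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE thread (thread_id INTEGER PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO thread (thread_id, title) VALUES (?, ?)",
        ((i, f"thread {i}") for i in range(1, n_rows + 1)),
    )
    return conn


def page_by_offset(conn, page, page_size):
    """Offset paging: the engine has to step over page * page_size rows
    before it can return anything, so deep pages get slower and slower."""
    return conn.execute(
        "SELECT thread_id, title FROM thread "
        "ORDER BY thread_id LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()


def page_by_keyset(conn, last_seen_id, page_size):
    """Keyset paging: seek straight to the row after the last one seen
    via the primary key, so the cost does not grow with page depth."""
    return conn.execute(
        "SELECT thread_id, title FROM thread "
        "WHERE thread_id > ? ORDER BY thread_id LIMIT ?",
        (last_seen_id, page_size),
    ).fetchall()
```

Both functions return the same rows for the same page; the difference is only in how much work the database does to get there. Whether this is exactly what was hitting us is an assumption on my part.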
Timetable:
Identified the issues on the database.
Restarting the database only got us about 1 hour of a somewhat working system. (see last post)
I tried to update XenForo (since there was an update) -> no success
I tried to update the Linux OS of the main system, database, redis-cache -> no success
Unfortunately the VPN connection between the frontend systems and the backend crashed (not sure why, because all other connections were working, except the one to the frontend), so I had time to update the firewall, since everything was down anyway.
The VPN connections came back online, but the frontend still didn't want to talk to the backend.
So ... time to update the frontends, reboot them and clear all caches.
Still no success...
I tried to upgrade MySQL to the newest version -> no success
I tried going back to the old PHP versions, because I thought this could have been an issue after the first update attempt. -> nope.
I was 2s away from opening up the replication-nodes for read-access for this query or rolling back all the changes (yay for backups).
Then I started googling and found the issues with MySQL and limited searches on huge datasets ... I should have googled way before all that.
So after that I at least knew where to look to resolve this issue.
There was a plugin for XenForo that already optimizes some of the queries that are affecting us.
So we are now using a plugin that I had hoped we wouldn't need. But well...
It took me like 20 min to write this (lots of people coming into my office today), and in that time the systems seem to have stayed stable.
Let's hope for the best.