The situation is as bad as it gets. After several hours of diagnostics, the OVH team concluded that there is nothing wrong with the server.
Shortly after they rebooted it, we received a new alert that the server had frozen again (how ironic…).
As I’ve already mentioned in some tickets, the situation is particularly complicated because this node is under a 12-month contract. It's the most expensive node we operate, and migrating to a new one would mean paying for this one for a full year without being able to use it, effectively operating at a loss.
Following last night’s intervention, I can't even access the OVH UI anymore. This prevents me from attempting the last resort: a full reinstall of Proxmox to see whether that changes anything.
If the reinstall doesn't solve the issue either, I will purchase a new node for migration and absorb the financial loss for this one.
I sincerely apologize to all affected clients and want to assure you that I'm doing everything I can to resolve the issue. I hope that, at the very least, being transparent about the situation helps a little.
TO DO: Full backup of the VMs and complete reinstallation of the virtualization system.
UPDATE 1 [05.06.2025 13:00]: The datacenter's technical team is intervening for the second time today, specifically to check the RAM and the motherboard.
Unlike yesterday, we now have concrete evidence: kernel logs on the host clearly show NULL pointer errors.
Result: As requested, OVH replaced the RAM. We are monitoring closely for any further errors. All services should be UP and RUNNING; if anything is not, please contact us.
UPDATE 2 [06.06.2025 17:30]: We have reached an agreement with the provider to cancel the contract on our old server, on the condition that we lease a new one with a commitment of at least the same 12 months. The new server has already been delivered, and we are going to migrate all clients to it (1-5 minutes of downtime per VM, with no reboot required). We chose this route because the RAM modules on the old server failed twice in 20 days, and we feared it would happen again: replacing the RAM treated the symptoms, not the underlying cause, which we still do not know.
Thank you for your understanding!