I Ripped Out Docker Compose from Our ML Platform and Put Everything on EKS. Here's What Actually Happened.

Source: DEV Community

I'll be honest: I resisted this for longer than I should have. Our ML pipeline on Docker Compose was working. Not perfectly, but it was working. I knew where everything lived. I could debug it. The data science team understood it. And every time someone suggested moving to Kubernetes, I'd think, "that's a lot of complexity for a problem we don't have yet."

Then we had the problem. Three data scientists started running concurrent training jobs. One job consumed all the GPU memory, and the other two failed silently with no useful error messages. Our serving container kept getting OOMKilled under load, and nobody knew why because there was no proper metrics collection. We had a Friday afternoon incident where a model that had been in production for 4 months started returning garbage predictions. It turned out the feature distribution had shifted weeks earlier and we had no monitoring to catch it. We only found out when the product team noticed the fraud catch rate had dropped. That was the mome