Ninja Van CTO on switching cloud providers

Ninja Van is a tech-enabled express delivery company providing delivery services for businesses of all sizes across Southeast Asia. Launched in 2014, Ninja Van started operations in Singapore and has a network currently spanning six countries across Southeast Asia: Singapore, Malaysia, Philippines, Indonesia, Thailand and Vietnam.

Ninja Van’s Co-founder and CTO Shaun Chong is responsible for crafting their technological architecture and managing their engineering and product management team. Shaun’s engineering team of 90 is split across Vietnam, Indonesia, and Singapore, with the latter being the main engineering centre. We spoke with him about his experience using – and migrating between – Amazon Web Services (AWS) and Google Cloud Platform (GCP), the particular demands their workloads make on the cloud, and his advice for new startups.

Can you tell us about Ninja Van’s experience with cloud services?

We started off at the end of 2014. We’re a digital-native company, so our very first production deployment was already in the cloud. We started off with Amazon because at that point in time they were the market leader: in this part of the world, AWS was the only brand big enough for us to rely on and trust. Things could’ve been different if Google had already had a data centre in Singapore.

Fast-forward two years: Google announced that they were coming to Singapore, and that’s when we decided to give the service a try, running it out of Taiwan at that point in time. And we were not disappointed at all.

What actually led us to even consider Google — who wouldn’t? You have Google.com and YouTube.com being powered by the exact same platform; you have a tech giant behind your back if you need any support.

We’re very experimental, so when we launched our services in Taiwan, we played around with the platform. For us as developers it made a lot more sense. Amazon had tons of products, but for us we focused a lot on just bare metal performance: basic resources like RAM, disk, compute, reliability…

If you look at the way Google thinks about cloud infrastructure, it’s fundamentally very different from what we experienced with other cloud providers.

For example, when we tried to create a Virtual Machine (VM) in GCP, we could choose any number of cores with any amount of RAM. This was not present anywhere else.

With our previous cloud provider, we had to choose a specific machine type. The thing about cost reduction there is that you have to figure out how much resource you want to buy in advance, right? You have the concept of reserved instances, which made it very difficult for us. Our business goes up and down, we scale up during sales periods, and our traffic is very unpredictable. Do other cloud providers allow that kind of scaling? Sure, but in terms of cost optimisation it’s not great.

We can’t predict how much resource we want to buy. With Google, we saw an immediate reduction in our cost because of the automatic sustained use discounts. Obviously with committed use discounts you can bring that down even further.
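To make the sustained use idea concrete, here is a minimal sketch of how an automatic, usage-based discount can be computed without any upfront commitment. The tier table is an assumption: it mirrors incremental rates Google has published for some machine families (each successive quarter of the month billed at 100%, 80%, 60% and 40% of the base rate), but actual rates vary by machine type and over time. The function name and the example price are hypothetical.

```python
# Assumed incremental tiers: (share of the month, multiplier on the base rate).
# These numbers are illustrative, not an authoritative price list.
ASSUMED_TIERS = [(0.25, 1.0), (0.25, 0.8), (0.25, 0.6), (0.25, 0.4)]


def sustained_use_cost(base_monthly_cost: float, usage_fraction: float) -> float:
    """Return the bill for a VM that ran for `usage_fraction` of the month.

    `base_monthly_cost` is the undiscounted price of running the VM for the
    whole month; `usage_fraction` is between 0.0 and 1.0.
    """
    if not 0.0 <= usage_fraction <= 1.0:
        raise ValueError("usage_fraction must be between 0.0 and 1.0")

    remaining = usage_fraction
    cost = 0.0
    for width, multiplier in ASSUMED_TIERS:
        used = min(remaining, width)  # usage that falls inside this tier
        cost += base_monthly_cost * used * multiplier
        remaining -= used
        if remaining <= 0:
            break
    return cost


if __name__ == "__main__":
    base = 100.0  # hypothetical undiscounted monthly price
    for fraction in (0.25, 0.5, 1.0):
        print(f"ran {fraction:.0%} of the month -> {sustained_use_cost(base, fraction):.2f}")
```

Under these assumed tiers, a VM that runs all month is billed at roughly 70% of the base price, while a VM that runs only briefly during a sales peak simply pays the base rate for the hours it used. Nothing has to be forecast or reserved in advance, which is the property described above.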

We shifted our entire production network to GCP from AWS in just a day, and that was when we started to deploy in Kubernetes.

When we went to production, we were looking at our telemetry data and said ‘hey how come our services are performing much better than they were at Amazon?’ We were actually doing like-for-like machines, and wondered ‘hey — does Google actually have better Intel processors or something?’ After some digging around, we realised that the network infrastructure at Google was way superior.

That led me to go and research white papers on what Google was doing: everything is software defined in the data centre. They can guarantee you latencies and capacities inside their data centre — we were looking at five, even ten times lower latencies across zones. That to me was ridiculous and we basically got an almost free upgrade in performance.

We have AWS serving us on the DNS level with Route 53. Route 53 has geo DNS, which allows us to route some of our Chinese traffic, but 100% of our compute is on GCP. And we still use a little bit of S3, but we’re starting to move everything to Google. We also use cloud players like Ali and Tencent specifically just for our China business.

What let us move to GCP so easily was that we never really made use of a lot of the AWS products like RDS, SES, SQS. It was really just compute, which let us do lift-and-shift very easily. Even right now on GCP we follow that mentality, but this mindset is slowly changing.

We’ve shifted entirely to Google Kubernetes Engine (GKE), where we host Jira, Bitbucket, everything. We run multiple clusters, and to me, GKE is really a godsend: it allows us to bootstrap clusters within minutes. We used to have to write lots of scripts and configurations, and figure out what kind of virtual IPs we wanted on the service layer, the port layer and all that. Today it’s just a few clicks away; we can even start to automate that right now.

The beautiful thing about GKE is that upgrades become almost trouble free. You have versions coming out on an almost weekly basis and if you’re self-managing it, it’s a very scary moment when you upgrade a cluster, because your entire production network can go down.

With GKE, the promise is that when they release an update, it has already been tested by customers and by Google’s internal team, and that’s why they are confident to release it to you. It’s also why you don’t always get the latest bleeding-edge version on GKE. But I can tell you that we used to have three people managing self-hosted Kubernetes, and today we only need one. In fact, we can even bring someone on board who doesn’t know Kubernetes and teach them very easily how to manage a cluster, or even create clusters.

We have come up with a new project called project Gaia that creates entire environments for us and we’re spinning up full clusters to allow QA to test. We have so many features going on in parallel that we have to create lots of separate environments to test project features independently, and GKE is helping us do that very efficiently.

How do you decide on an environment? Do you care about the profile of the other customers on board, or do you look at it as a developer: purely on the merits of the platform?

I take the developer viewpoint — it’s just like choosing a technology.

It’s about trying it out, POC-ing it, doing some R&D and figuring it out. It’s useful to hear what people say about products and that’s the general hygiene stuff you do: if someone says it sucks, don’t just trust it immediately — try it out.

For us it’s all about data. We go in and we try it out — we don’t really mind if we’re almost the first. Obviously, reliability is important, but no one is 100% reliable.

My advice is always: build for failure. Things are going to fail at the worst time and your architecture just has to support the cloud going down. Some things are just beyond the control of even the cloud vendors sometimes. Like if an earthquake happens at a data centre you better make sure you don’t put all your eggs in one basket.

Why did your workload require the mix and match of compute and memory?

We unfortunately have some applications which have not been memory-optimised — the ratio of RAM to CPU generally is quite high. With certain fixed ratios in Amazon, to get that amount of RAM you need to have so many CPUs. And that impacts cost a lot, because you are deploying an application that does not need so many CPUs, but you have no choice.
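A small worked example shows why a fixed RAM-to-CPU ratio gets expensive for memory-heavy applications. Everything in this sketch is hypothetical: the catalogue of fixed shapes (4 GB of RAM per vCPU), the per-vCPU and per-GB prices, and the application's needs are made-up numbers for illustration, not any provider's real offering or pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    vcpus: int
    ram_gb: int


# Hypothetical catalogue with a fixed ratio of 4 GB RAM per vCPU, smallest first.
FIXED_SHAPES = [Shape(v, 4 * v) for v in (2, 4, 8, 16)]

# Hypothetical monthly unit prices.
PRICE_PER_VCPU = 20.0
PRICE_PER_GB = 3.0


def monthly_cost(shape: Shape) -> float:
    """Price a machine as the sum of its vCPU and RAM components."""
    return shape.vcpus * PRICE_PER_VCPU + shape.ram_gb * PRICE_PER_GB


def smallest_fixed_shape(need: Shape) -> Shape:
    """Return the first catalogue shape that covers both the CPU and RAM need."""
    for shape in FIXED_SHAPES:
        if shape.vcpus >= need.vcpus and shape.ram_gb >= need.ram_gb:
            return shape
    raise ValueError("no fixed shape is large enough")


# A memory-hungry application: few cores, a lot of RAM.
need = Shape(vcpus=4, ram_gb=24)

fixed = smallest_fixed_shape(need)  # forced up to a larger shape by the RAM need
custom = need                       # a custom shape matches the need exactly

print(f"fixed shape:  {fixed}, cost {monthly_cost(fixed):.2f}")
print(f"custom shape: {custom}, cost {monthly_cost(custom):.2f}")
```

With these made-up numbers, the application has to rent an 8 vCPU machine just to get enough memory, paying for four cores it never uses. A custom shape sized to the workload avoids that overhead.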

What advice would you give new startups?

Don’t over-engineer too early.

When we started off, we were just consuming VMs. We automated our deployment, but it was not containerised; everything was done with Ansible. We maintained an inventory of 10-15 servers that would deploy app A, app B and so on.

It came to a point where we started breaking up our monolith into microservices. As the service count went from three to ten, we knew that figuring out which host, which node to deploy each application to was not a scalable approach.

We started looking at getting Docker into production. At that point, two container technologies were fighting for supremacy: Rocket (rkt) and Docker. Internally, our development environment was already using Docker. I think during those early 2013-2014 days people were a bit afraid of pushing Docker into production. We took the leap of faith sometime in early 2015: we adopted a project from CoreOS called Fleet, and that’s where we started our journey into containerisation.

So that was working pretty well — our applications were in production in Docker, everything was automated. No doubt it was a very rudimentary approach, because we’d just schedule a container into the host that had the least number of containers: it did not care about whether there was enough RAM or CPU in that machine. It was very basic.
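The difference between the rudimentary placement described here and a resource-aware one can be sketched in a few lines. This is an illustration only: it is not Fleet's actual algorithm or the Kubernetes scheduler, and the `Host` type, function names and numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    ram_free_mb: int
    containers: int = 0


def pick_least_loaded(hosts: List[Host]) -> Host:
    """Naive placement: choose the host running the fewest containers,
    ignoring whether it has enough free RAM for the new one."""
    return min(hosts, key=lambda h: h.containers)


def pick_resource_aware(hosts: List[Host], ram_needed_mb: int) -> Optional[Host]:
    """Resource-aware placement: first filter out hosts that cannot fit the
    container, then choose the least loaded of the remaining ones."""
    candidates = [h for h in hosts if h.ram_free_mb >= ram_needed_mb]
    if not candidates:
        return None  # nothing fits; a real scheduler would leave the container pending
    return min(candidates, key=lambda h: h.containers)


hosts = [
    Host("host-a", ram_free_mb=256, containers=1),   # nearly full, but few containers
    Host("host-b", ram_free_mb=4096, containers=3),  # plenty of room, more containers
]

naive_choice = pick_least_loaded(hosts)
aware_choice = pick_resource_aware(hosts, ram_needed_mb=1024)

print("naive placement:         ", naive_choice.name)
print("resource-aware placement:", aware_choice.name if aware_choice else "unschedulable")
```

In this made-up case, counting containers sends a 1 GB workload to a host with only 256 MB free, while the filter-then-rank approach places it on the host that can actually run it.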

Towards the end of 2016, CoreOS announced that they were going to stop developing Fleet, since Kubernetes was coming up and was going to become the de facto standard. We played with Kubernetes and it was very simple: we figured it out in less than a month and realised that’s what we had to do.

When I talked about moving from Amazon to GCP in that one day, we actually did two things. One: change cloud providers; and two: shift everything from Fleet to Kubernetes in the same day.

A lot of engineering effort went into making sure that when we made that switch, there was zero downtime.

All those Saturday nights?

Yup! But in the kind of environment we’re in, we don’t really follow that old school thing where we deploy every two months on a Sunday. We make tens, hundreds of deploys a day sometimes, and we have to do it with zero downtime, so it could be easily done on a Monday afternoon as well.