Extreme-scale data needs an extreme-scale brain

As the IoT grows – especially in the form of smart city projects – so does the complexity of smart-city platforms and the amount of data generated by hundreds of millions of devices. Steve Marsh, founder and chief technology officer at GeoSpock, talks to Frontier Enterprise about his company’s geospatial platform, which has its roots in a supercomputer built to simulate the human brain and now promises to help smart cities and businesses alike not only handle the big data tsunami headed their way, but also extract valuable insights from it in real time.

Steve Marsh, Co-Founder & CTO, GeoSpock.

According to your bio, the technology behind GeoSpock actually started as an attempt to build a supercomputer that could mimic a human brain – tell us about that.

Way back before the company started, when I was doing my PhD at Cambridge, we were building a custom supercomputer architecture to carry out real-time simulation of human brain functions – trying to take the brain’s hundred billion neurons and hundred trillion synapses and compute them at the same rate as the brain does, which means everything must be updated every millisecond.

At the time, approaches like Nvidia’s graphics card solutions and IBM’s Blue Gene took a compute-centric view of the problem – which meant throwing more hardware at it. What the brain can do in one second took the state of the art two weeks to execute.

Our research group realized that it was a communication problem, so we designed a communication-centric supercomputer with custom vector processors built on FPGAs and high-speed serial links, and we were able to get the execution time down from two weeks to a single second, which is pretty cool.

I realized that shipping custom hardware wasn’t a particularly scalable way to address the neuroscience problem.

I set about making biologically inspired, massively parallel architectures run purely in software on automated systems using standard commodity server hardware.

But the neuroscientists didn’t have a brain-scale model to stress test our machine with. We were probably two decades ahead of the market on that one.

So how did you go from that to applying geospatial technology to things like IoT-generated data and smart cities?

Around 2011-2012, I got very excited by the rise of smartphones and the fact that with every smartphone in our pocket, we would also have a GPS chip in our pocket. And then I started thinking about the future (or what is now the present), where it wouldn’t just be GPS in the devices we carried; it would be in the devices we drove – millions of connected cars – and in the tens of millions of sensors in the physical environment allowing us to measure that environment.

Many of those will be static sensors – they don’t necessarily have to have GPS chips, but the location still matters, because there’s no point putting a sensor in the environment if you have no context as to where it is. A temperature reading means nothing unless you can tell me where that temperature is.

Imagine a connected vehicle driving through a city. As it’s moving, it is passing static sensors – IoT-connected streetlights, congestion-monitoring cameras, pollution monitors. A city that can dynamically manage congestion and street traffic needs knowledge of all the moving things and all the static things. So a static sensor still needs a location to give it context, maybe not for itself in isolation, but for the broader complete system.

Anyway, geospatial systems were built around the use case of digitizing paper maps, not tracking billions of moving objects and giving each one of them contextual intelligence on demand in a second.

I realized that, with all this knowledge of how to build custom supercomputers out of commodity systems, part of the solution was already there. So I decided to just go and tackle it.

To clarify, it seems you’ve got two problems to tackle there – not just tracking all those components, but also handling the extreme amount of data they generate, right?

Absolutely. Gartner is predicting that by 2023, we’re going to be generating 54 exabytes of IoT data a year. That’s more than the projected storage capacity that we’re going to manufacture. And they also predict that by 2035, we’re going to have a trillion connected IoT devices. The data those devices generate is going to be insanely large. And the biggest problem that we have in terms of data analytics is getting value from that data, especially in real time.

Coming back to the congestion example: to monitor congestion, I need to know where every vehicle is in a city on a second-by-second basis. I need to know where they were over the past five minutes so I can see where congestion is emerging. I need to know historically whether that’s normal or unusual behaviour. And I need to be able to combine all of those things together in a second, because I need to know whether I can intervene to get better outcomes.
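
To make those three time horizons concrete, here is a minimal sketch in plain SQL, run over synthetic data in SQLite. The table and column names (vehicle_positions, road_segment, ts) and the 1.5x threshold are invented for illustration; this is not GeoSpock’s engine or schema, just the shape of the comparison Marsh describes.

```python
# Toy sketch: compare the last five minutes against a historical baseline,
# per road segment, in a single query. Schema and threshold are invented.
import random
import sqlite3
import time

now = int(time.time())
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vehicle_positions (vehicle_id TEXT, road_segment TEXT, ts INTEGER)"
)

# Synthetic vehicle sightings over the past hour on three made-up road segments.
rows = [
    (f"v{i}", random.choice(["A14-J31", "A1303-E", "M11-J12"]),
     now - random.randint(0, 3600))
    for i in range(20_000)
]
conn.executemany("INSERT INTO vehicle_positions VALUES (?, ?, ?)", rows)

query = """
WITH recent AS (      -- where vehicles have been over the past five minutes
    SELECT road_segment, COUNT(*) AS n_recent
    FROM vehicle_positions
    WHERE ts >= :now - 300
    GROUP BY road_segment
),
baseline AS (         -- historical norm: average five-minute count over the past hour
    SELECT road_segment, COUNT(*) / 12.0 AS n_typical
    FROM vehicle_positions
    WHERE ts >= :now - 3600
    GROUP BY road_segment
)
SELECT r.road_segment, r.n_recent, b.n_typical,
       r.n_recent > 1.5 * b.n_typical AS congestion_emerging
FROM recent AS r JOIN baseline AS b USING (road_segment)
"""
for row in conn.execute(query, {"now": now}):
    print(row)
```

The point of the sketch is simply that the live window and the historical baseline are combined in one query whose answer comes back immediately, rather than in a batch job.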

The trouble with big data systems today, when they rely on the batch-processing model, is that it may take two weeks to get an insight.

So it’s always after-the-fact insights. And better luck next time, because when you measure the physical world, the likelihood of that scenario ever occurring exactly the same way again is close to zero.

So our thing is to use geospatial data to give people contextual intelligence on the fly, in real time. From that base point, you can do AI modeling of future scenarios, choose the one that you most want the world to look like, and intervene to change the outcome for the better. And this is where we start getting into really intelligent city platforms, intelligent automotive platforms, intelligent maritime platforms: things that can change the outcome and not just say, ‘Well, I should do it better next time.’

Also, our big thing is that we’re really good at de-siloing data. So when we go into a city – let’s use Cambridge as an example, because our headquarters is there – they have 86 different types of IoT sensors: connected buses, connected bus stops, smart streetlights, smart traffic lights, connected rubbish bins, microclimate and pollution sensors, ANPR [automatic number-plate recognition] cameras, Bluetooth flow control, all sorts. All of it is siloed, so it’s not very easy to draw new insights.

We went in with our geospatial data platform, and we were able to de-silo that data, bring it all into one place, store it on AWS, and then open it up through a programmatic API. Using SQL, we were able to basically turn that city into an innovation platform, so anyone can come along with a new question and just focus on extracting the value. Our ability to do real-time value extraction – because we operate in seconds, not weeks – means you can ask new questions on the fly, unknown questions, and have them come back really quickly.
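
As an illustration of the kind of cross-silo question this enables, the toy sketch below joins two feeds that would normally sit in separate systems, ANPR counts and pollution readings, on a shared location and time key. The schemas, keys, and numbers are invented for illustration; the real platform runs SQL over data held on AWS, not SQLite.

```python
# Toy sketch of de-siloed, ad-hoc querying: two feeds loaded into one store
# and joined by a shared grid cell and hour. All names and values are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE anpr_counts (grid_cell TEXT, hour INTEGER, vehicles INTEGER);
CREATE TABLE pollution   (grid_cell TEXT, hour INTEGER, no2_ugm3 REAL);

INSERT INTO anpr_counts VALUES ('cell_12_07', 8, 1240), ('cell_12_07', 9, 1810),
                               ('cell_13_07', 8,  310), ('cell_13_07', 9,  290);
INSERT INTO pollution   VALUES ('cell_12_07', 8, 38.0), ('cell_12_07', 9, 55.5),
                               ('cell_13_07', 8, 21.0), ('cell_13_07', 9, 20.5);
""")

# A "new question" asked on the fly: where does high traffic volume coincide
# with elevated NO2? Neither silo can answer this on its own.
query = """
SELECT a.grid_cell, a.hour, a.vehicles, p.no2_ugm3
FROM anpr_counts AS a
JOIN pollution AS p ON p.grid_cell = a.grid_cell AND p.hour = a.hour
WHERE p.no2_ugm3 > 40
ORDER BY p.no2_ugm3 DESC
"""
for row in conn.execute(query):
    print(row)
```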

So you’re actually doing all this on AWS – in the public cloud?

Absolutely. We partner with AWS.

We have dynamically scalable compute, and it’s really low-end compute, which allows us to just completely change the unit economics of these big data queries.

And because we’re on AWS, we send our software to where the data is – shipping exabytes of data around for every new use case is expensive. So if customers already have their data there, we send our algorithms to it, index it, and then open it up.

The thing that’s going to scale the most is data, so we scale data the cheapest way possible: on S3. And the compute is right-sized depending on the complexity of the question you’re asking: the more complex the question, the more machines you spin up, so you can always get an answer in a second.
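
The interview doesn’t describe GeoSpock’s index format, but the general pattern of keeping data cheap on S3 while right-sizing compute usually relies on laying objects out under space-and-time prefixes, so a query only has to touch the relevant slices. The sketch below shows one common approach, assuming a coarse lat/lon grid cell plus an hourly bucket encoded into the object key; the key scheme and prefix are hypothetical, not GeoSpock’s.

```python
# One common way to lay out sensor events on object storage so queries can
# prune by space and time: encode a coarse grid cell and an hourly bucket in
# the S3 key prefix. The scheme and prefix below are hypothetical examples.
from datetime import datetime, timezone

GRID_DEG = 0.01  # roughly 1 km cells at mid latitudes

def object_key(lat: float, lon: float, ts: datetime, event_id: str) -> str:
    cell_x = int((lon + 180.0) / GRID_DEG)
    cell_y = int((lat + 90.0) / GRID_DEG)
    hour = ts.astimezone(timezone.utc).strftime("%Y/%m/%d/%H")
    return f"sensor-events/{hour}/cell_{cell_x}_{cell_y}/{event_id}.json"

# A reading from central Cambridge, UK at 13:05 UTC on an arbitrary date:
key = object_key(52.2053, 0.1218,
                 datetime(2021, 6, 15, 13, 5, tzinfo=timezone.utc), "evt-0001")
print(key)
```

Under a layout like this, answering “central Cambridge, 13:00-14:00” only means listing and reading the objects under the matching hour and cell prefixes, and a more complex question simply fans out over more prefixes with more workers.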

What are some of the most interesting or unexpected insights that you have gathered from working with Cambridge, Singapore and other cities?

With smart cities, we’re seeing a really big push on measuring automotive pollution output. However, 90% of the world’s trade is maritime, and those engines are big polluters as well. We’re working with maritime too, so by being able to track that and combine that data with city data, we can suddenly start understanding where the pollution is coming from, who’s causing it, and whether we can actually address those issues to help prevent climate change.

Our data platform separates data production and the sensors from the applications, bringing it all into one platform and then enabling a million next-generation citizen services, city insights, new commercial opportunities, services and applications. So now we’re seeing this separation between application and data generation for the first time – data being repurposed.

And that’s actually a bigger thing – in the past, no one had the ability to do universal insight extraction, so they were focused on data sharing and data selling. We can’t do that anymore – the data is too big. We have to get the insights where the data is being produced, and that’s what our technology allows. So, people can move away from a data-sharing model to an insight-sharing model, and that’s actually a lot better.

If you’re an automotive manufacturer gathering an exabyte of data, you can’t sell that to every city that you operate in – they don’t know what to do with it. What they want to know is, is my road congested?

That’s a yes or no question if you can solve it in real time, right? So we shift the model to insights rather than data. People are much more willing to share insights – it helps protect privacy and allows commercially sensitive data to remain in a secure environment.

Moore’s Law is going to come to an end at some point, so the amount of power required to crunch these huge amounts of data is going to be an issue. Do you think quantum computing is going to be the answer to that?

My take on quantum computing is that it produces a probabilistic output – you still have to take the output and put it into a conventional system. I see it as a bolt-on capability to the traditional compute model. So just like you have CPUs, GPUs and FPGAs, and now people are talking about neural processing units, you’ll have quantum processing units. I don’t think the framework of computation will be changed. I think it will be augmented with quantum.

And quantum doesn’t solve the data storage issue. However, the advent of solid-state technology is allowing us to store vast amounts of data more effectively than traditional magnetic storage media like spinning disks, and it’s actually making access to that data faster. You still have the issues of how to manage that data – where do you put it? How do you get maximum performance in terms of data extraction? But there are advances. If we look at NVMe (non-volatile memory express) technology, which is flash-based storage, I see that as the next wave.