Dell CTO on rethinking data along new dimensions

Image courtesy of Dell Technologies.

Patricia Florissi, Vice President, Global CTO for Sales and Distinguished Engineer, Dell Technologies, believes the future lies in distributed analytics. Geographically dispersed data will need to be analyzed in entirely novel ways because of the sheer volumes of data we’re dealing with.

We sat down with her to find out more about the new ways of thinking about the edge and the core, and the nine most important technologies of the future, of which only one will be talked about in a decade — because the rest will be ubiquitous and woven into the fabric of society.

Could you expound on your idea of the dimensionality of data?

I’d like to use the example of autonomous driving, because it makes the idea a little bit more tangible. In autonomous driving, the original idea was actually twofold. Number one, how can the car have at least the same amount of information as the driver has, in order to make decisions? And the second point is, if you consider 40 or 50 years from now, where the universe is natively digital, you won’t need most of the sensor technology that cars have today in order to drive.

I’ll give you specific examples. Today, the reason a car has cameras is because humans actually look at traffic lights that are not digital and, based on the color of that traffic light, make a decision on whether they should stop, continue, or slow down.

In the future, traffic lights will not exist because cars are going to communicate with other cars.

And if they are going over an intersection, they are going to coordinate on whether or not they should stop because the other cars have precedence.

The cars have eight cameras, plus lidar, radar, sonar, and so on. Therefore they have a lot more information — 360 degrees across four dimensions. First of all, the breadth they can capture is much more than our field of vision. The depth, the attention to detail, is much more than what we can see. They have a notion of height; I live in a small town and there is a tiny bridge there, and every two months there is a truck that gets stuck under that bridge. And because cars communicate with other cars, they actually know what the traffic ahead of them is, and can make better decisions on how fast or how slow they should go to either avoid the traffic or pass through it.

These are the dimensions. They come as a result of the fact that, in order to acquire some of the senses that humans have, cars have exceeded [our capabilities]. One of the advantages of that is not just the precision — with that level of accuracy in the perception of the environment, you also get elements of redundancy. A mistake or an oversight in one image can easily be overcome by another angle that you are seeing. AI models are going to have much more precise data, and they are going to have redundancy to eliminate some of the errors in prediction.

I believe that some of the activities that we worry about today will not exist in the near future.

People talk a lot about how to cleanse or unify the data, or make it uniform, and so on. You may not need to do any of that, because you have so much more data with which to resolve conflict or doubt.

Also, because you have more standardized sensor information, you don’t need to normalize the data anymore.

How do you see the federated analytics model working in the future? Do you think we’ll have all these disparate data sources and computing close to the data sources? How does everything get consolidated into a single whole, or is that not even necessary?

As a scientist, I believe there will be three modes of computation: centralized, decentralized, and federated. It will be centralized whenever you can bring some data together, and you’re going to be able to do some analysis that is autonomous in a single place.

Decentralized is when you have data that is completely partitioned — you have an edge here and an edge there. The models have to be trained locally and make local decisions, so each edge becomes a centralized entity in and of itself.

And the third model is federated, where you actually have a virtual fabric of those disparate collections of data that we call data zones, and then they will collaborate not on raw data but only on intermediate results, to try to address a business need.

Should I put all the data in a central location? What are the incentives for doing that? In the past, we brought all the data together just because we either owned the data or because there was some regulatory compliance requirement on the data, and so on. Today we no longer have that need. So when and where to move the data, or the results, is going to be completely business-driven. What is the use case you’re trying to solve? What is the information you’re trying to gain? What is the insight you’re trying to generate?

One of the big points I make very vehemently is that we need to decouple the design of the model from how the model is going to run, because these are two different decisions in different moments in time.

A designer should design their model training and inferencing assuming that it can be centralized, decentralized, or federated. And there are ways of doing that.

Then the decision about whether that training is going to be centralized, decentralized, or federated is made at deployment time, based on where the data is going to be collected, the network connecting those data sources, the quality of service there, and the computing and storage available close to the edge. The decision needs to be made at runtime, and the environment may actually change.

We at Dell have a saying: you have to design for the future. You cannot design for the moment that you are living in. And the way you do that is to architect for choice and engineer for change.

How did you come up with the idea of federated analytics?

I started thinking in those terms at a genomics conference. Someone was presenting about the challenge of creating genomic data for the population of Russia. They said the only way they could do that was by creating six genomic centers. And I thought: number one, it would be impossible to bring all that genomic data to a single location. Number two, each center would have enough computing and storage capacity to do analytics on its own. And number three, there was tremendous value if they could collaborate in order to gain information. That’s where the idea came from.

IoT, at that time, was not really all the rage. And when IoT came along, we thought, it’s exactly the same model except with a need to actually do analysis and training in real time, in an edge environment.

The flexibility in the approach that we’re using comes from the fact that the IoT sources of data are very ephemeral. You have sensors that come and go. You have some sensors that only measure the data during the day and you have some other sensors that only measure the data during the night. How do you design a system that would adapt to changes in real time?

When you talk about federated analytics, only some of the information will be passed back from the edge. What decides what information this is? What kind of model do you use?

The data scientist and the platform decide together.

For example, imagine I am a data scientist and I want a model that is going to calculate a global histogram. A histogram has, let’s say, five bars, and you want to count how many entities fall into each one of them. We see this example a lot in electronic medical records or in clinical studies for FDA approval. You have populations that are very dispersed, and you want to understand those populations, and also to understand the distribution of the population globally. But you don’t want all the data to be shared — each participating hospital is going to send the number of samples in each one of the five categories. And then the data scientist who started the computation will do the combination of all of that.

Each site can have as much data as it can store, and only needs to convert that data into five values, one count for each histogram bar. You’re transforming the data into a finite amount. That finite, deterministic amount of data is privacy-preserving because, in many situations, as long as the data sample is big enough, you cannot reverse engineer the individual points from that summary information. And the platform can make sure, through encryption and some other mechanisms, that even that consolidated information comes in a way that is more privacy-preserving.
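To make that exchange concrete, here is a minimal Python sketch of the histogram example. The bin edges, hospital names, and data are hypothetical illustrations, not Dell’s actual platform; the point is simply that each site reduces its raw records to five counts, and only those counts travel to the coordinator.

```python
from typing import Dict, List

# Hypothetical bin edges defining the five histogram bars (e.g. age groups).
BIN_EDGES = [0, 20, 40, 60, 80, 120]

def local_histogram(ages: List[int]) -> List[int]:
    """Runs at each hospital: reduce raw records to five bin counts."""
    counts = [0] * (len(BIN_EDGES) - 1)
    for age in ages:
        for i in range(len(counts)):
            if BIN_EDGES[i] <= age < BIN_EDGES[i + 1]:
                counts[i] += 1
                break
    return counts  # only these five numbers leave the site

def global_histogram(site_counts: Dict[str, List[int]]) -> List[int]:
    """Runs at the coordinating data scientist: combine per-site summaries."""
    total = [0] * (len(BIN_EDGES) - 1)
    for counts in site_counts.values():
        total = [t + c for t, c in zip(total, counts)]
    return total

# Example: three hypothetical hospitals; the raw data never leaves each site.
summaries = {
    "hospital_a": local_histogram([23, 45, 67, 12, 81]),
    "hospital_b": local_histogram([34, 56, 78]),
    "hospital_c": local_histogram([19, 25, 90, 41]),
}
print(global_histogram(summaries))  # the combined, global histogram
```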

So in that particular case, the data scientist decides what data comes and the platform decides the safest way to actually send that data. When you are training machine learning and deep learning models, each location is going to train its own model and share the weights, and then you combine the weights and send the updated weights back to each one of them.
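The weight-sharing pattern described here is essentially what the literature calls federated averaging. Below is a minimal, hypothetical sketch using NumPy and a plain linear model, with every site weighted equally; a real platform would add secure aggregation, encryption, and weighting by each site’s sample count.

```python
import numpy as np

def local_update(global_weights: np.ndarray, X: np.ndarray, y: np.ndarray,
                 lr: float = 0.1, epochs: int = 5) -> np.ndarray:
    """Runs at each site: a few gradient-descent steps on a linear model,
    starting from the current global weights. Raw X and y never leave the site."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w  # only the updated weights are shared

def federated_round(global_weights: np.ndarray, sites: list) -> np.ndarray:
    """Runs at the coordinator: average the weights returned by each site."""
    updates = [local_update(global_weights, X, y) for X, y in sites]
    return np.mean(updates, axis=0)  # new global weights, sent back to every site

# Example with two hypothetical sites holding different slices of data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
sites = []
for _ in range(2):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    sites.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, sites)
print(w)  # converges toward [2.0, -1.0] without pooling any raw data
```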

Is what we see as AI right now truly autonomous in a sort of human sense, or is it still basically brute force, with algorithms and applications that are better defined?

You could divide AI across two dimensions. In the first dimension, you have what we call narrow AI or for-purpose AI, and then you have the general AI, or Artificial General Intelligence.

In for-purpose AI, you are trying to use AI to solve a particular problem. You end up with many solutions because it’s not generic. Let me give you a concrete example from my household. We have one of those robots that are constantly cleaning the floor — it has a lot of intelligence about what it should be doing. I have many devices, and each one of them fits a particular purpose. There will be a million of these, because honestly, it is too much of an ask to have one robot in my house that controls everything, and to have to re-engineer the house so all these sensors communicate.

I think we are stuck on general AI, not only from a design perspective of how you are going to design for it, but also because of challenges from a societal perspective. But in terms of narrow AI, I firmly believe that we are there.

Along the other dimension, you can divide AI according to whether it’s undifferentiated or differentiated. You have undifferentiated AI that everybody must have, not because they want to, but because it’s an equalizer. For example, everybody is going to have a voice recognition system to answer phone calls, or Natural Language Processing for a chatbot. Companies are going to buy these from other companies the same way we buy CRM from Salesforce.

Then you have the differentiated, generic AI: a kind of robot that is made to solve all of the problems of your business. It may not do anything in another industry.

The differentiated ones are our focus at Dell Technologies, because in my opinion every company will have at least a handful of differentiated AI applications in their business. And that differentiated AI is going to run on data that the company doesn’t want to share with everybody else.

That’s where they get their edge. They may share analytics on that data with other partners in the industry, so airlines that are members of the Star Alliance might want to share some of the analytics as long as it serves a common purpose.

When you open a door remotely with an application, for example, or even use a robot, aren’t these largely still rule-based rather than very deep decision-making entities?

If I may rephrase the question: for some of the narrow, undifferentiated AI, can’t rule-based systems solve it? The answer is yes. It’s not about rule-based systems versus something else; it’s whether or not it has the capacity to learn a different behavior and to be coached into learning. I call it not just continuous integration and continuous delivery, but continuous learning, because my habits change.

What sort of technologies really excite you?

I believe there are nine technological forces that are going to play the biggest role in the near-digital future. But none of those nine technologies are going to be a topic of conversation 10 years from now, because they are going to be completely interwoven [into society].

The nine technology forces are:

  • AI, Machine Learning, Deep Learning – I think that we are going to sort it out, and in ten years we are either going to be talking about general AI or not talking about it at all.
  • IoT – Over the next 7-10 years we are going to put sensors everywhere and just take them for granted.
  • Augmented Reality, Virtual Reality, and Mixed Reality – We will absorb information through these, and it’s going to be a new way of interfacing humans and technology.
  • Robots – We are going to see many, many more robots helping us at airports, railway stations, and so on.
  • Accelerators – We are going to be developing specialized cores for AI, gaming, Virtual Reality, for cars, etc.
  • A full continuum – Today, we don’t talk about whether or not you have a particular processor in your tablet or your phone. You just don’t care; you buy the latest model. In theory you don’t really know what you have inside it. It is going to be the same way for the underlying infrastructure across the cloud continuum that goes from the edge cloud to the private cloud to the public cloud. You know that technology infrastructure is going to be everywhere, and you do not care. There will be a single fabric for data storage, processing, and networking, and we will sit on top of it and no longer care how it happens.
  • 5G – Everything is going to be 5G. Whether there is a 6G, 7G, or 8G will depend on how we can monetize 5G, or maybe on some combination of that.
  • Blockchain – I believe that blockchain, this idea of a distributed ledger, is going to be how we transact business in the digital era.
  • Quantum computing – Last but not least, we are going to see quantum computing solving very specific problems 9 or 10 years from now.

The only one that we may still be talking about in 9-10 years is quantum computing. The other eight are going to be irrelevant. We are going to be worried about other problems, and I don’t know what they are.