Simplifying machine learning through data virtualization

Machine learning, in which systems solve problems by recognizing patterns rather than by following explicit sets of instructions, has recently been making headlines. It is supporting diverse domains such as business intelligence, earth science, and online personalization, and organizations are using it to facilitate complex use cases such as speech recognition, fraud detection, and demand forecasting. Enabling machine learning requires sophisticated infrastructure that can quickly integrate and process large amounts of data from disparate sources, often involving multiple data platforms, tools, and processing engines. Establishing and maintaining such infrastructure can be complex and costly, but data virtualization can greatly simplify the data integration process while reducing costs, thus accelerating machine learning initiatives.

A Need for Change

To support machine learning, many organizations leverage data lakes, which can collect large volumes of data from multiple sources, both structured and unstructured, and store the data in its original format. However, storing data in different formats does not by itself facilitate discovery: the data must first be integrated before it can be leveraged for machine learning. And as data infrastructure in today's enterprises becomes increasingly distributed, data integration only grows more complex. Data scientists can spend up to 80 percent of their time on these tasks, suggesting that it is time for a new approach.

In addition, companies may have multiple data repositories distributed across different cloud providers and on-premises systems, and the slow, costly replication of data from these systems of origin can mean that only a small subset of the relevant data ever reaches the data lake.

The burden of adapting the data for machine learning then falls on data scientists who, while able to access the necessary processing capacity, tend not to have the skills required for data integration. The past few years have seen the emergence of data preparation tools designed to help data scientists carry out simple integration tasks, but many tasks require more advanced skills. An organization's IT team may be called in to create new data sets in the data lake specifically for machine learning purposes, but this can significantly slow down the overall initiative.

If organizations are to unlock the full benefits of data lakes and other diverse sources to support machine learning, new technologies are needed.

The Many Benefits of Data Virtualization

Rather than moving data from multiple sources into a new, centralized repository, data virtualization creates real-time, consolidated views of the data while leaving it in its original locations. Data sources can be on-premises or in the cloud, structured or unstructured, and data virtualization can support myriad data silos, including data lakes. Data virtualization enables data scientists to access more data, in the format that best suits their needs. Because it automatically generates the SQL scripts and APIs needed for data access, data scientists can integrate data without having to learn complex new data integration protocols and procedures.
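As a minimal sketch of what this looks like in practice, the snippet below queries a consolidated view through a virtualization layer that exposes a standard SQL endpoint. The server address, credentials, and view name (vw_customer_360) are hypothetical, and the PostgreSQL-compatible connection string is an assumption; actual endpoints vary by product.

```python
# A minimal sketch of access through a data virtualization layer. The
# host, credentials, and view name (vw_customer_360) are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

# The virtualization server presents itself as an ordinary SQL endpoint,
# so standard drivers and tools work unchanged.
engine = create_engine(
    "postgresql://analyst:secret@dv-server.example.com:5432/virtual_db"
)

# One query against a consolidated view; the layer fetches and combines
# the underlying data from its source systems at query time.
df = pd.read_sql("SELECT * FROM vw_customer_360 LIMIT 1000", engine)
print(df.head())
```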

Data virtualization provides a single access point to any data, regardless of its location and native format. By applying a combination of functions on top of the physical data, data virtualization provides different logical views of the same physical data, without the need to create additional copies of the source data. It offers a fast and inexpensive way to meet the data needs of different users and applications, and it can help to address some of the specific challenges faced by data scientists in integrating data for machine learning.
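The following self-contained sketch illustrates the idea of multiple logical views over a single physical copy. An in-memory SQLite database stands in for the virtualization layer purely to keep the example runnable; a real deployment would define comparable views over remote sources rather than a local table.

```python
# Self-contained illustration of "many logical views, one physical copy",
# using in-memory SQLite as a stand-in for the virtual layer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, region TEXT, amount REAL, ts TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "EMEA", 120.0, "2023-01-05"),
     (2, "APAC", 80.0, "2023-01-06"),
     (3, "EMEA", 200.0, "2023-02-01")],
)

# Two different logical shapes of the same physical rows; neither view
# stores an additional copy of the data.
conn.execute(
    "CREATE VIEW revenue_by_region AS "
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region"
)
conn.execute(
    "CREATE VIEW recent_orders AS "
    "SELECT id, amount FROM orders WHERE ts >= '2023-02-01'"
)

print(conn.execute("SELECT * FROM revenue_by_region").fetchall())
print(conn.execute("SELECT * FROM recent_orders").fetchall())
```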

Best-of-breed data virtualization tools also offer a searchable catalog of all available data sets, including extensive metadata on each data set, such as tags and column descriptions, as well as information on who uses each data set, when, and how.
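Programmatic catalog search might look something like the hypothetical sketch below. The REST endpoint, query parameters, and response fields are all assumptions for illustration; each product exposes its own catalog API.

```python
# Hypothetical sketch of programmatic catalog search. The endpoint and
# response fields are assumptions, not a specific vendor's API.
import requests

resp = requests.get(
    "https://dv-server.example.com/catalog/search",
    params={"tag": "customer", "q": "churn"},
    timeout=10,
)
resp.raise_for_status()

for dataset in resp.json().get("results", []):
    # Typical catalog metadata: name, description, tags, usage stats.
    print(dataset["name"], "-", dataset.get("description", ""))
```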

Keeping It Simple

Data virtualization offers clarity and simplicity to the data integration process. Regardless of whether data is originally stored in a relational database, a Hadoop cluster, a SaaS application, or a NoSQL system, data virtualization will expose the data according to a consistent data representation and query model, enabling data scientists to view it as if it were stored in a single relational database.
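To make that consistency concrete, the hedged sketch below joins a view assumed to be backed by a Hadoop cluster (hive_sales) with one assumed to be backed by a SaaS CRM (crm_accounts), written exactly as if both lived in a single relational database. The view names and endpoint are hypothetical.

```python
# Hedged sketch: one SQL statement spanning two views whose underlying
# sources differ (here assumed to be a Hadoop cluster and a SaaS CRM).
# View names and the endpoint are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://analyst:secret@dv-server.example.com:5432/virtual_db"
)

query = """
    SELECT a.account_name, SUM(s.amount) AS total_sales
    FROM hive_sales AS s
    JOIN crm_accounts AS a ON a.account_id = s.account_id
    GROUP BY a.account_name
    ORDER BY total_sales DESC
"""
# The virtualization layer plans and executes the cross-source join;
# the data scientist writes ordinary SQL.
sales_by_account = pd.read_sql(query, engine)
print(sales_by_account.head())
```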

Data virtualization also makes it possible to clearly and cost-effectively separate the responsibilities of IT data architects and data scientists. IT data architects can create reusable logical data sets that expose information in forms useful for many different purposes, and such logical data sets take considerably less effort to create and maintain than traditional physical copies. Data scientists can then adapt these data sets to meet the individual needs of different machine learning processes, as sketched below.
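The sketch illustrates this division of labour under stated assumptions: IT publishes a reusable logical view (vw_customer_features, a hypothetical name), and a data scientist adapts it into a training set for a churn model without touching the source systems.

```python
# Illustrative division of labour: IT publishes a reusable logical view
# (vw_customer_features, hypothetical); the data scientist adapts it into
# a model-specific training set.
import pandas as pd
from sqlalchemy import create_engine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

engine = create_engine(
    "postgresql://analyst:secret@dv-server.example.com:5432/virtual_db"
)

# Task-specific adaptation of the shared view: pick features and a label.
df = pd.read_sql(
    "SELECT tenure_months, monthly_spend, support_tickets, churned "
    "FROM vw_customer_features",
    engine,
)

X_train, X_test, y_train, y_test = train_test_split(
    df[["tenure_months", "monthly_spend", "support_tickets"]],
    df["churned"],
    test_size=0.2,
    random_state=42,
)

model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```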

Machine learning may still be in its relative infancy, but the market is expected to grow by 44 percent over the next four years as businesses look to analytics as a way to drive operational efficiencies through deeper insight. As its adoption continues to grow, and as data lakes become more prevalent, data virtualization will become increasingly necessary for optimizing the productivity of data scientists.

By enabling data scientists to access more data and leverage catalog-based data discovery, and by simplifying data integration, data virtualization enables them to focus on their core skills rather than being burdened with data management tasks. By simplifying data access, data virtualization simplifies machine learning. In due course, the whole organization will enjoy the full benefits of cost-effectively gleaning real-time business insights.