Why and How To Enable Data Science with an Independent Semantic Layer

Written by Kevin Petrie, VP of Research, Eckerson Group | November 13, 2023


Nobel Prize-winning author Thomas Mann observed that “order and simplification are the first steps toward the mastery of a subject.” Although Mann died in 1955, his observation perfectly describes the challenge of modern analytics: how do you simplify complex data into something business managers can understand?

The semantic layer is an abstraction layer that aims to do just that. It derives consistent business metrics from underlying data and presents them to BI tools, AI/ML tools such as notebooks, or applications containing analytical functions. Many BI, data warehouse, and data lake products include a semantic layer. However, most practitioners—57% according to a recent poll by Eckerson Group—now prefer an independent semantic layer that can unify data from all their platforms. TimeXtender, for example, offers an independent semantic layer as part of its data management product.

The need for an independent semantic layer continues to rise as data science gains traction in the enterprise. This blog post examines how an independent semantic layer supports AI/ML use cases as part of data science projects. We’ll consider the five primary elements of a semantic layer: metrics, caching, metadata management, application programming interfaces (APIs), and access controls.

Metrics

The semantic layer presents metrics such as average revenue per sales rep by region, operating costs per factory, or annual growth in unit sales per customer. These metrics describe market trends and business performance as part of BI projects, and also serve as features for ML models as part of data science projects. For example, data scientists might identify and refine features such as annual sales per customer, transaction size vs. average, or historical market prices. The semantic layer then presents these metrics and their values to ML models that segment customers, detect fraudulent transactions, predict market prices, and so on.

The semantic layer presents metrics that serve as features for machine learning models.
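
To make the idea concrete, here is a minimal Python sketch of metrics that double as ML features. Everything in it is an assumption made for illustration: the sample transactions, the function names, and the `METRICS` registry are hypothetical and do not come from any particular semantic layer product. The point is only that a metric is defined once and then reused by every consumer.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction records; in practice these would come from
# the platforms that the semantic layer unifies.
transactions = [
    {"customer": "acme", "year": 2023, "amount": 120.0},
    {"customer": "acme", "year": 2023, "amount": 80.0},
    {"customer": "globex", "year": 2023, "amount": 500.0},
    {"customer": "globex", "year": 2023, "amount": 100.0},
]


def annual_sales_per_customer(rows, year):
    """Metric: total sales per customer for one year."""
    totals = defaultdict(float)
    for row in rows:
        if row["year"] == year:
            totals[row["customer"]] += row["amount"]
    return dict(totals)


def transaction_size_vs_average(rows):
    """Metric: each transaction amount divided by that customer's average."""
    by_customer = defaultdict(list)
    for row in rows:
        by_customer[row["customer"]].append(row["amount"])
    averages = {name: mean(amounts) for name, amounts in by_customer.items()}
    return [row["amount"] / averages[row["customer"]] for row in rows]


# One registry of definitions (hypothetical), so a BI dashboard and an
# ML pipeline both compute each metric the same way.
METRICS = {
    "annual_sales_per_customer": lambda rows: annual_sales_per_customer(rows, 2023),
    "transaction_size_vs_average": transaction_size_vs_average,
}

# An ML pipeline would read these values as model features.
features = {name: compute(transactions) for name, compute in METRICS.items()}
```

A fraud model, for instance, could read `features["transaction_size_vs_average"]` to flag transactions far above a customer's norm, while a dashboard reads the same registry for reporting.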

Caching

The semantic layer pre-fetches high-priority metrics into cache (i.e., memory) along with their supporting tables, columns, and records. While caching can push up cloud costs, it also reduces latency for real-time ML use cases such as fraud prevention or customer recommendations. In use cases like these, an ML platform might use the semantic layer to pre-fetch metrics and calculate feature values in memory based on inputs from a streaming data pipeline. The ML model then uses those feature values to produce its real-time predictions or recommendations. Caching plays a critical role in performance when metrics and features derive from distributed, even far-flung data sources.

Caching speeds the processing of ML features that derive from far-flung data sources.
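
The pre-fetching pattern can be sketched in a few lines of Python. This is an illustrative toy, not how any specific semantic layer implements caching: the `MetricCache` class, the `slow_loader` function, and the 60-second time-to-live are all assumptions. Real products add eviction policies, invalidation, and distributed storage.

```python
import time


class MetricCache:
    """Tiny in-memory cache with a time-to-live (TTL) per entry."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # slow path: metric name -> value
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # metric name -> (value, expiry time)

    def prefetch(self, names):
        """Load high-priority metrics before any request arrives."""
        for name in names:
            self._store(name)

    def get(self, name):
        """Return a cached value, reloading it if missing or expired."""
        entry = self._entries.get(name)
        if entry is not None and entry[1] > self._clock():
            return entry[0]
        return self._store(name)

    def _store(self, name):
        value = self._loader(name)
        self._entries[name] = (value, self._clock() + self._ttl)
        return value


calls = []  # records every trip to the (hypothetical) remote source


def slow_loader(name):
    """Stand-in for an expensive query against a far-flung data source."""
    calls.append(name)
    return {"avg_transaction_size": 42.5}.get(name, 0.0)


cache = MetricCache(slow_loader, ttl_seconds=60.0)
cache.prefetch(["avg_transaction_size"])
value = cache.get("avg_transaction_size")  # served from memory, no new query
```

The trade-off described above shows up directly in the TTL: a longer TTL means fewer trips to the source and lower latency, but more memory held and staler feature values.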

Metadata Management

Metadata includes the names, locations, structures, ownership, usage, and other properties of datasets. The semantic layer catalogs metadata to help data scientists find, prepare, and analyze the right data for AI/ML models. To detect fraud, data scientists need metadata that shows the lineage of credit ratings, IP blocklists, or other data sources. To predict housing market prices, they need metadata that explains the content of available datasets on recent home sales. Metadata like this gives data science teams the intelligence they need to select the right AI/ML inputs. (For more context on metadata, read this blog post from Micah Horner.)

Metadata helps data scientists select the right inputs for AI/ML models.
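
As a rough illustration, the Python sketch below models a metadata catalog as a dictionary of records and answers two questions a data scientist might ask: which datasets mention a keyword, and where does a dataset come from? The dataset names, locations, owners, and the `search` and `lineage` helpers are all hypothetical; commercial catalogs expose far richer models.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    name: str
    location: str
    owner: str
    description: str
    upstream: list = field(default_factory=list)  # names of source datasets


# Hypothetical catalog entries, invented for this example.
catalog = {
    m.name: m
    for m in [
        DatasetMetadata("credit_bureau_feed", "s3://example-bucket/credit/",
                        "risk-team", "Raw ratings from an external bureau"),
        DatasetMetadata("credit_ratings", "warehouse.risk.credit_ratings",
                        "risk-team", "Cleaned ratings per customer",
                        upstream=["credit_bureau_feed"]),
        DatasetMetadata("ip_blocklist", "warehouse.security.ip_blocklist",
                        "security-team", "Known malicious IP ranges"),
        DatasetMetadata("fraud_features", "warehouse.ml.fraud_features",
                        "ml-team", "Model inputs for fraud detection",
                        upstream=["credit_ratings", "ip_blocklist"]),
    ]
}


def search(catalog, keyword):
    """Find datasets whose name or description mentions the keyword."""
    keyword = keyword.lower()
    return sorted(
        m.name for m in catalog.values()
        if keyword in m.name.lower() or keyword in m.description.lower()
    )


def lineage(catalog, name):
    """Return every upstream dataset of `name`, nearest sources first."""
    seen, queue = [], list(catalog[name].upstream)
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(catalog[current].upstream)
    return seen


credit_datasets = search(catalog, "credit")
fraud_lineage = lineage(catalog, "fraud_features")
```

Here `fraud_lineage` traces the fraud features back through the cleaned credit ratings and the IP blocklist to the raw bureau feed, which is the kind of lineage check described above.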

Application Programming Interfaces (APIs)

No data science project runs in isolation; rather, these projects rely on diverse AI/ML tools, libraries, and datasets. The semantic layer provides access to such ecosystems via myriad APIs for Python, R, or other data science communities. This helps data scientists apply a rich array of inputs and algorithms to their AI/ML models. They use APIs to download ML models from public platforms such as TensorFlow or MLflow, then train them on their own data to produce a customer recommendation engine. They also use APIs to access data inputs from internal databases, merchant networks, or public repositories.

APIs provide access to a broad ecosystem of AI/ML tools, libraries, and datasets.

Access Controls

As with BI projects, data science projects must not enable rogue users, expose sensitive data, or raise compliance risks. A strong semantic layer includes role-based access controls to ensure that only authenticated users perform only authorized actions. Access controls are especially important for AI/ML applications that might otherwise perform risky actions based on the wrong user instructions or the wrong dataset.

Access controls reduce the risk that AI/ML applications perform risky actions.
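
A role-based check can be as simple as the Python sketch below. The roles, users, and action names are made up for illustration, and a production semantic layer would pair this with real authentication, auditing, and row-level or column-level policies.

```python
# Hypothetical roles and the actions each one may perform.
ROLE_PERMISSIONS = {
    "analyst": {"read_metric"},
    "data_scientist": {"read_metric", "train_model"},
    "admin": {"read_metric", "train_model", "edit_metric"},
}

# Hypothetical mapping of authenticated users to roles.
USER_ROLES = {"dana": "data_scientist", "alex": "analyst"}


class AccessDenied(Exception):
    """Raised when a user is unknown or lacks permission for an action."""


def authorize(user, action):
    """Allow the action only for a known user whose role permits it."""
    role = USER_ROLES.get(user)
    if role is None:
        raise AccessDenied(f"unknown user: {user}")
    if action not in ROLE_PERMISSIONS.get(role, set()):
        raise AccessDenied(f"role '{role}' may not perform '{action}'")
    return True


def can(user, action):
    """Boolean convenience wrapper around authorize()."""
    try:
        return authorize(user, action)
    except AccessDenied:
        return False
```

An AI/ML application would call `authorize` before reading a metric or training a model, so an analyst such as `alex` could view metrics but could not trigger a training run.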

What’s Next

For years, companies have applied the semantic layer to tabular data within databases or data warehouses. The rise of AI/ML, and generative AI in particular, increases the use of unstructured data such as images and text files within cloud object stores. The semantic layer is evolving to support GenAI by deriving metrics from these new data types. For example, in the near future a semantic layer might derive sentiment scores from text descriptions of customer conversations. Data scientists and NLP engineers would then use these scores to fine-tune their language model, making the GenAI outputs more accurate.
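
To picture what deriving a metric from text could look like, here is a deliberately simple Python sketch. It swaps in a toy word-list scorer for the real NLP model a production system would use, and the word lists, conversation snippets, and `sentiment_score` function are all invented for this example.

```python
import re

# Toy word lists (assumptions for illustration, not a real lexicon).
POSITIVE = {"great", "helpful", "fast", "love", "resolved"}
NEGATIVE = {"slow", "broken", "angry", "cancel", "refund"}


def sentiment_score(text):
    """Score in [-1, 1]: (positive - negative) / matched words, or 0.0."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    matched = positive + negative
    return (positive - negative) / matched if matched else 0.0


# Hypothetical text descriptions of customer conversations.
conversations = {
    "c1": "The agent was helpful and my issue was resolved fast.",
    "c2": "The app is broken and I want a refund.",
    "c3": "I asked about my invoice.",
}

# The derived metric: one sentiment score per conversation.
scores = {cid: sentiment_score(text) for cid, text in conversations.items()}
```

The scores then behave like any other metric in the semantic layer: they can be cached, cataloged, governed, and served to downstream models.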

Our hyper-digital world makes order and simplification more elusive than Thomas Mann could have imagined decades ago. It’s no surprise that the inputs and methods of modern data science baffle most business managers. Designed and implemented well, an independent semantic layer can help data science teams take a much-needed step toward simplicity.

Kevin Petrie is the VP of Research at Eckerson Group, where he manages the research agenda and writes about topics such as data integration, data observability, machine learning, and cloud data platforms. For 25 years, Kevin has deciphered what technology means to practitioners, as an industry analyst, instructor, marketer, services leader, and tech journalist. He launched a data analytics services team for EMC Pivotal in the Americas and EMEA, and ran field training at the data integration software provider Attunity (now part of Qlik). A frequent public speaker and co-author of two books about data management, Kevin most loves helping startups educate their communities about emerging technologies.
