Building a Semantic Layer for Disparate Data Consumers

There’s a pattern I have come across that seems to work quite well for the majority of organizations that meet two basic criteria:

  • There are multiple platforms that consume the same datasets (BI tools, CRM, APIs, etc.)

  • There’s potential for embedding data into an external- or internal-facing product, either via an iframe or a custom web app

Businesses that meet these criteria can find success by implementing a semantic layer. If you’re unclear on what a semantic layer is, it’d be good to read up on someone else’s description. (A word of caution - there are many people out there with strong opinions on the matter - I’d stay away from LinkedIn for this specific learning journey.) I’ve referenced it before, but here’s one definition from Airbyte.

The pitch for a semantic layer is simple - data that makes its way into multiple downstream consumers should be well governed and allow for easy but controlled updates. Business logic should be ironclad, especially when served to paying customers, and nobody should ever see data they aren’t allowed to access. Semantic layers address these problems with their out-of-the-box functionality.

But what happens when the types of data consumers are inherently disparate? What if one of them is a standard dashboard tool, another is a custom web app, and a third is an LLM? How do we design the semantic layer to support these different types of downstream tools?

Build Good Practices

For starters, we need to adopt some principles of software engineering. DRY code becomes more important than ever, and a well-oiled CI/CD process is essential. These are table stakes, but they aren’t the secret sauce.

On the topic of DRY code and CI/CD - semantic layers allow us to modularize our SQL code base, such that we have reusable components that ultimately build the final SQL statement to run against the database. SQL itself isn’t written by hand, but is abstracted into the semantic layer (via YAML, JS, Python, LookML, etc.). Templating languages like Jinja, Mustache, Liquid, and others further our ability to make a truly dynamic and DRY semantic layer. This ability, coupled with full integration into git, allows us to build a solid foundation for pushing business logic to downstream consumers in a reliable, governed, and scalable fashion.
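To make the "reusable components" idea concrete, here is a minimal sketch in plain Python of how a semantic layer might assemble a final SQL statement from definitions that are each written once. It is not how any particular product works. The Measure and Dimension classes, the build_query helper, and the column names (amount, stage, closed_at) are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    name: str
    sql: str  # aggregate expression, defined once and reused everywhere


@dataclass(frozen=True)
class Dimension:
    name: str
    sql: str  # column or expression to group by


# Hypothetical definitions; in a real semantic layer these would live in
# YAML, LookML, or similar, and be versioned in git.
MEASURES = {
    "total_amount": Measure("total_amount", "SUM(amount)"),
    "opp_count": Measure("opp_count", "COUNT(DISTINCT id)"),
}
DIMENSIONS = {
    "stage": Dimension("stage", "stage"),
    "close_month": Dimension("close_month", "DATE_TRUNC('month', closed_at)"),
}


def build_query(table: str, dimensions: list[str], measures: list[str]) -> str:
    """Assemble a SELECT statement from named, reusable components."""
    dims = [DIMENSIONS[d] for d in dimensions]
    meas = [MEASURES[m] for m in measures]
    select_items = [f"{x.sql} AS {x.name}" for x in [*dims, *meas]]
    sql = f"SELECT {', '.join(select_items)} FROM {table}"
    if dims:
        # Group by ordinal position so expressions are not repeated.
        positions = ", ".join(str(i + 1) for i in range(len(dims)))
        sql += f" GROUP BY {positions}"
    return sql


print(build_query("all_opps", ["stage"], ["total_amount"]))
```

Every consumer that asks for total_amount gets the same SUM(amount) expression, so a change to the definition happens in one place and ships through the same review and CI/CD process as any other code change.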

The Importance of Metadata

With a foundation in place, we can move on to what makes a semantic layer really shine - metadata. Semantic layers are an incredible place to store metadata that downstream consumers can leverage.

Think of the LLM use case - how can an LLM obtain enough context to take a natural language prompt from a user and create a reliable SQL query from it? The answer is the rich descriptions we can add into our field definitions in a semantic layer. There are rarely any limits on how much text you can add into a field description, and these objects are available via API - meaning an LLM can read them before deciding which field to choose for its final query construction.
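As a rough sketch of that flow, assume the semantic layer's metadata API hands back a list of field objects with a name, a type, and a description. The shape of FIELDS below is an assumption, not any vendor's actual response format, and build_context and unknown_fields are hypothetical helper names. The descriptions get flattened into context for the LLM, and whatever fields the LLM picks are checked against the known set before a query is built.

```python
# Hypothetical metadata, shaped the way a semantic layer API might return it.
FIELDS = [
    {
        "name": "opp_count",
        "type": "measure",
        "description": "Number of distinct opportunities. Use for pipeline volume questions.",
    },
    {
        "name": "stage",
        "type": "dimension",
        "description": "Current sales stage of the opportunity.",
    },
]


def build_context(fields: list[dict]) -> str:
    """Flatten field metadata into text an LLM can read before choosing fields."""
    lines = ["Available fields:"]
    for field in fields:
        lines.append(f"- {field['name']} ({field['type']}): {field['description']}")
    return "\n".join(lines)


def unknown_fields(fields: list[dict], selected: list[str]) -> list[str]:
    """Return any field names the LLM chose that the semantic layer does not define."""
    known = {field["name"] for field in fields}
    return [name for name in selected if name not in known]


print(build_context(FIELDS))
print(unknown_fields(FIELDS, ["opp_count", "revenue"]))
```

The validation step matters as much as the context: the LLM only ever selects governed fields by name, and the semantic layer remains responsible for turning them into SQL.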

Then take the custom web app example: different users with wildly different backgrounds, internal or external, can be using the same data app. A manager of a sales team might log into a portal to start identifying their next set of territories, with the intention of exporting a list of email addresses and supplemental information to their CRM. This user may have questions about how the different objects they are viewing relate to each other, but might not have a good grasp on what a “Join” is.

Meanwhile, an analyst might log in with similar questions, but need to know the specific technical details under the hood. Templating your semantic layer lets the development team dynamically display information based on key:value pairs associated with a user’s profile.

The sales manager might see “Opportunities are related to addresses, such that a single opportunity may have multiple associated addresses.”

And the analyst might see “The all_opps table is left joined to contact_info on all_opps.id = contact_info.opp_id”

One Data Model to Rule Them All

Both of these definitions can be served from the same object in the semantic layer, so we keep our code DRY and still meet users where they live.
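Here is a minimal sketch of what "the same object" could look like, using the join from the example above. The JOIN_OBJECT structure, the "audience" user attribute, and the describe_join helper are assumptions made up for illustration. A real semantic layer would more likely express this with its own templating syntax (Jinja, Liquid, and so on).

```python
# One hypothetical join definition that carries both the technical details
# and a plain-language description.
JOIN_OBJECT = {
    "left": "all_opps",
    "right": "contact_info",
    "type": "left",
    "on": "all_opps.id = contact_info.opp_id",
    "business_description": (
        "Opportunities are related to addresses, such that a single "
        "opportunity may have multiple associated addresses."
    ),
}


def describe_join(join: dict, user_attributes: dict) -> str:
    """Pick the description that matches the key:value pairs on the user's profile."""
    if user_attributes.get("audience") == "technical":
        return (
            f"The {join['left']} table is {join['type']} joined to "
            f"{join['right']} on {join['on']}"
        )
    # Default to the plain-language description for everyone else.
    return join["business_description"]


print(describe_join(JOIN_OBJECT, {"audience": "business"}))
print(describe_join(JOIN_OBJECT, {"audience": "technical"}))
```

The technical sentence is generated from the join definition itself, so it cannot drift from the SQL that actually runs, while the business description sits right next to it in the same object.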

With the ability to create dynamic semantic layers, we can ensure that downstream consumers, whether human or machine, can reliably ask and answer questions of the data.
