Identity Has a Language Problem

If you spend enough time in identity conversations, you start to notice a pattern.

Two smart teams can spend thirty minutes agreeing with each other before realizing they are describing completely different things. One might be talking about household clustering of digital identifiers, while the other is focused on resolving digital identifiers deterministically to offline personally identifiable information (PII). Someone references a 70% match rate. No one clarifies whether that refers to deterministic versus probabilistic matching, person versus household resolution, or a match rate versus a find rate.

The meeting ends with polite nods. The confusion surfaces three weeks later when expectations are not met.

Advertising does not have an identity crisis. It has a vocabulary crisis.

The technology itself is complicated, and the environment around it keeps getting more constrained. Regulation shapes what can be collected and shared, signal loss has changed how systems perform, and the mechanics underneath identity are not simple. Yet the thing that most often derails otherwise productive conversations is far more basic. It is the language we use to describe the system. The same words are used to refer to different layers of the stack, and people move through discussions assuming shared understanding that often is not actually there.

I am writing this ahead of a fireside chat at Marketecture Live III titled “Deterministic? Prove It.” The conversation will touch on CTV identity integrity, signal degradation across the supply chain, and the measurement expectations that sit on top of both. None of that works if participants mean different things when they use the same terms.

This glossary is an attempt to establish a baseline.

What follows walks through identity the way it functions in practice. It begins with the raw identifiers that act as the underlying signals. From there it moves into the infrastructure that organizes those signals, then into the matching mechanics and collaboration layers that connect systems together. The goal is to ground the discussion in how identity operates in the market, not just how it is described in theory.

These are not official industry definitions. They are working definitions. If we can at least agree on what layer we are discussing, the rest of the debate becomes a lot more productive.

The Foundation: Identifiers Are Not Identity

Everything in identity starts with identifiers. But identifiers themselves are not identity. They function as signals within the system. Some remain stable for long periods of time, while others decay quickly as browsers, devices, or user behavior change. Some of these signals are deterministic and tied directly to PII, while others are statistical approximations that rely on patterns and probabilities rather than an observed link between an identity attribute and a digital identifier.

Cookie:

a small text file stored in a browser and assigned by a domain. Cookies allow a browser to be recognized across subsequent visits.

Cookies powered digital advertising for two decades. But confusion starts even here. First party cookies set by a publisher on their own domain behave very differently from third party cookies set across domains by external platforms. When someone says cookies are dead, the first question should be which type they mean.

Cookie Half Life:

the average time before 50% of a cookie pool disappears due to expiration, deletion, or browser restrictions.

This metric matters more than it gets credit for. A cookie pool with a 30-day half life behaves very differently from one with a 3-day half life, even if both technically exist. Match rates are often quoted without any reference to decay, which makes the numbers sound more stable than they actually are.
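A rough way to see the difference is to model decay as a simple exponential. This is a sketch under that assumption only; real cookie pools rarely decay this cleanly, and the function name `surviving_fraction` is made up for illustration.

```python
def surviving_fraction(days_elapsed: float, half_life_days: float) -> float:
    """Fraction of a cookie pool still present after days_elapsed.

    Assumes simple exponential decay, which is an illustrative
    simplification and not a measured decay curve.
    """
    return 0.5 ** (days_elapsed / half_life_days)


# Two pools that both "exist" on day 0 but decay at different speeds.
for half_life in (30, 3):
    remaining = surviving_fraction(30, half_life)
    print(f"half life of {half_life} days: {remaining:.2%} of the pool left after 30 days")
```

Under this assumption, half of the 30-day pool is still present after a month, while the 3-day pool has gone through ten halvings and is down to roughly a tenth of a percent. A match rate measured on day 0 describes both pools equally well and says nothing about that gap.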

Mobile Advertising Device IDs:

operating system level advertising identifiers assigned to mobile devices.

MAID (mobile advertising ID) is the umbrella term used across the industry. It typically refers to Apple’s Identifier for Advertisers (IDFA) on iOS and Google’s Advertising ID on Android (GAID, sometimes written AAID).

These identifiers are resettable by the user and have historically decayed much more slowly than cookies, often around 50% annually. Because they are device-level identifiers tied to the operating system (OS), their stability and persistence across multiple applications make them valuable when user opt-ins allow.

On iOS, IDFA is governed by Apple’s App Tracking Transparency (ATT) framework introduced in iOS 14.5. Apps can access the identifier only if the user explicitly opts in to tracking. Industry opt-in rates have settled below 30%. The identifier still exists at the device level, but the consent layer dramatically limits the scale at which it can be used for cross-app advertising.

Apple also maintains IDFV, the Identifier for Vendors. IDFV is scoped to the developer account that publishes an app. All apps released by the same developer share the same IDFV on a device. Meta’s ecosystem is the clearest example. Facebook, Instagram, and WhatsApp can recognize the same device through IDFV regardless of ATT opt-in status.

IDFV is not passed in OpenRTB bid requests and cannot be used across developers. It functions as a first party signal within a developer’s own environment and is commonly used for analytics, attribution, and fraud detection.

On Android, GAID plays a role comparable to Apple’s IDFA as a resettable device-level advertising identifier. Google has begun shifting the ecosystem toward alternatives through Privacy Sandbox initiatives, though the transition has been slower and less definitive than Apple’s ATT changes, with broader regulatory scrutiny potentially influencing its pace.

The shift is not the disappearance of device level identifiers. It is the increasing constraint around how broadly those identifiers can be used outside the environments where they originate.

IFA:

a standardized naming convention used across connected TV (CTV) devices for advertising identifiers.

In practice, IFA can refer to Roku’s RIDA, Samsung’s TIFA, Amazon Fire TV identifiers, and similar device level IDs. The shorthand is useful but incomplete. A CTV identifier generated by the operating system behaves differently from one generated at the app layer or by a network provider. Precision requires specifying the environment where the identifier originates.

PPID (Publisher Provided ID):

a first party identifier created and managed by a publisher, most often derived from an authenticated user session.

PPIDs are deterministic within the publisher’s own environment but are not inherently interoperable across publishers or platforms. Their value lies in persistent recognition inside a controlled ecosystem rather than portability to other properties.

IP Address (IPv4 and IPv6):

a numerical label assigned to a device when it connects to the internet.

It identifies the network location the device is communicating from at that moment, allowing data to be routed to the right destination. You can think of it less like a permanent identity and more like a temporary return address for where traffic should be sent back.

There are two main versions in use today. IPv4 is the older format, built around four numbers separated by periods (for example, 192.168.1.1). IPv6 is the newer standard, created because the internet eventually ran out of available IPv4 addresses. IPv6 uses a much larger address space and appears as a longer string of numbers and letters.

Where IP addresses become confusing in identity conversations is that they are often treated as if they uniquely represent a person, device, or household. In reality, they represent a network connection.

Most internet providers conserve scarce IPv4 addresses using a technique called Carrier Grade Network Address Translation (CGNAT). Under CGNAT, thousands of devices across homes or mobile networks can appear under the same outward-facing IP address. Inside the provider’s network those devices each have private addresses, but externally their traffic is translated through a shared public one. The result is that a single IPv4 address may represent hundreds or even thousands of devices at the same time.

Even in situations where the IP address appears more stable, it is rarely permanent. Residential IP addresses are routinely reassigned by internet providers, mobile networks change them frequently as devices move between towers, and corporate networks often route entire offices through one shared address.

IPv6 improves the situation somewhat because its vastly larger address space allows providers to assign addresses more granularly. In practice, however, adoption is uneven across regions, networks, and devices. Privacy extensions within IPv6 also intentionally rotate interface identifiers over time, which limits its usefulness as a long-term identifier.

For these reasons, IP addresses function best as contextual signals rather than durable identity keys. They can help answer questions like:

Are two events likely coming from the same household network?
Is this device appearing in a location consistent with previous activity?
Are multiple devices connecting through the same access point?

When used inside probabilistic models alongside other signals such as user agent attributes, timestamps, or behavioral patterns, IP addresses can help strengthen confidence in identity resolution.

What they cannot reliably do on their own is represent a persistent user, device, or household. Treating them that way tends to produce false precision and unstable identity graphs.
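One way to treat an IP address as a contextual signal rather than a key is to reduce it to a coarse network-level value before it enters a model. The sketch below uses Python's standard `ipaddress` module. The choice of a /64 prefix for IPv6 is an assumption made for illustration (it drops the interface identifier that privacy extensions rotate), and the function name `network_key` is hypothetical.

```python
import ipaddress


def network_key(raw_ip: str) -> str:
    """Return a coarse, network-level key for an IP address.

    IPv4: the address itself, which may still be shared by many
    devices behind CGNAT.
    IPv6: the /64 prefix, which discards the rotating interface
    identifier. The /64 boundary is an assumption for this sketch.
    """
    ip = ipaddress.ip_address(raw_ip)
    if ip.version == 4:
        return str(ip)
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{ip}/64", strict=False)
    return str(network)


# Two IPv6 addresses whose interface identifiers differ but whose
# network prefix is the same collapse to one key.
print(network_key("2001:db8:abcd:12:1111:2222:3333:4444"))
print(network_key("2001:db8:abcd:12:aaaa:bbbb:cccc:dddd"))
print(network_key("192.0.2.10"))
```

The output of a function like this is still only a hint that two events may share a network. It does not say who, or how many devices, sit behind that network.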

Another dynamic worth understanding is the gradual shift in internet traffic from IPv4 toward IPv6. IPv4 still carries a large portion of global traffic, but its share is slowly declining as more networks enable IPv6 connectivity. Measurements from Cloudflare Radar, the company’s public Internet observability platform, show that in the United States roughly half of internet traffic now travels over IPv6 when it is available, reflecting a steady increase in IPv6 usage over the past decade.

The growth is gradual because the internet is effectively running in a dual-stack world. Most services, networks, and devices support both IPv4 and IPv6 simultaneously. When both are available, modern operating systems typically prefer IPv6, which slowly increases its share of total traffic. Globally, about 29% of IPv6-capable requests are currently served over IPv6, with the percentage rising incrementally each year.

Statistical Identifier (Fingerprint):

a probabilistic identifier constructed from a combination of browser or device attributes such as screen resolution, installed fonts, operating system version, and language settings.

No single attribute is unique on its own, but when many of them are observed together they can form a probabilistic signature that appears distinctive for a period of time. That signature is not stable, however. As devices change, software updates occur, browsers evolve, or network conditions shift, the underlying attributes change as well, which causes the fingerprint to gradually degrade.
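The fragility described above is easy to demonstrate with a toy example. The sketch below hashes a handful of attributes into a single value. The attribute names and values are invented, and real fingerprinting systems use many more signals and fuzzier comparison than an exact hash, so treat this as an illustration of the decay problem, not of any production method.

```python
import hashlib
import json


def fingerprint(attributes: dict[str, str]) -> str:
    """Derive a short signature from a set of device or browser attributes."""
    # Sorting keys makes the signature independent of attribute order.
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical attribute values, for illustration only.
before = {
    "screen": "1920x1080",
    "os_version": "14.4",
    "language": "en-US",
    "fonts": "Arial,Helvetica,Times",
}
# The same device after a routine operating system update.
after = {**before, "os_version": "14.5"}

print(fingerprint(before))
print(fingerprint(after))
print(fingerprint(before) == fingerprint(after))  # False
```

A single software update is enough to produce a different signature for the same device, which is why a fingerprint-derived identifier has to be re-linked or re-modeled over time.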

It is important to separate two concepts that often get collapsed together. Probabilistically matching two distinct and persistent identifiers is one process. Generating an identifier from statistical signals alone is another. The confidence levels and use cases are very different.

All of these are building blocks. None of them are identity on their own.

From Signals to Structure: Spines and Graphs

Identifiers become identity only when they are structured and linked together. This is where the distinction between spines and graphs becomes important.

Identity Spine:

a deterministic backbone built on verified linkages anchored in personally identifiable information (PII) such as email addresses, phone numbers, or physical addresses.

A spine is not probabilistic. If modeling is introduced into the linkage process, you are no longer describing a spine. You are describing a graph.

Cluster ID (Spine Identifier):

a persistent identifier that serves as a label for grouping multiple identifiers together.

Think of it like a table number at a wedding reception. The number does not describe who the guests are or how they know each other. It simply gives the planner a way to keep a group together. If the seating chart changes, the planner can move “Table 8” across the room without needing to individually track every guest sitting there. The number is just a label that allows the whole group to move as a unit.

A cluster ID functions the same way. Every linked cookie, device ID, or email maps back to this central key. The cluster can represent a device, a person, or a household depending on the resolution model.

Identity Graph:

a network of linked identifiers associated with a device, person, or household.

Graphs may be deterministic, probabilistic, or hybrid. Most commercial graphs are hybrid. The important questions when evaluating a graph are where deterministic linkages stop and modeled inference begins, and how the graph flags which mechanism produced each link.

Identity Graph Health:

a measure of how accurate and current a graph remains over time.

Identifiers decay. Devices reset. People move and change email addresses. Graph health attempts to capture whether linkages still reflect reality.

Typical evaluation dimensions include precision, recall, and stability over time.
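Precision and recall only make sense against some reference set of links believed to be correct. The sketch below assumes such a truth set exists (for example, a panel or an authenticated sample), which is itself a significant assumption. The link pairs are invented.

```python
def precision_recall(predicted_links: set, true_links: set) -> tuple[float, float]:
    """Compare a graph's links against a reference set of known-correct links.

    precision: share of the graph's links that are correct.
    recall: share of the correct links that the graph actually contains.
    """
    true_positives = len(predicted_links & true_links)
    precision = true_positives / len(predicted_links) if predicted_links else 0.0
    recall = true_positives / len(true_links) if true_links else 0.0
    return precision, recall


# Each link is an (identifier, entity) pair. All values are hypothetical.
graph_links = {("c1", "p1"), ("c2", "p1"), ("c3", "p2"), ("c4", "p2")}
truth_links = {("c1", "p1"), ("c2", "p1"), ("c3", "p2"), ("c5", "p3"), ("c6", "p3")}

precision, recall = precision_recall(graph_links, truth_links)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```

Stability over time can be approximated by running the same comparison on successive snapshots of the graph and watching how the two numbers move.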

Matching: Where Most Confusion Lives

If there is one place where identity conversations reliably fall apart, it is around matching. Numbers get thrown around quickly, usually in the form of a headline match rate, and the room often treats that number as if it represents a single, simple reality. In practice, it rarely does.

Match rates are quoted without clarifying which identifiers are being matched, how many platforms sit between the datasets, or what methodology produced the linkage in the first place.

Consider a common example. A brand sends its CRM data to an onboarding partner and receives a 60% match rate. That data is then passed from the onboarder to a DSP, where another 60% match rate is reported. On paper both numbers look healthy, but they do not translate into 60% reachable scale inside the platform. Once the steps are compounded, the effective overlap is closer to 36%, and that is before asking the next practical question: how many of those identifiers can actually be found tied to available inventory in the channels where the campaign is being activated.
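The compounding in that example is plain multiplication across hops. The sketch below assumes each reported rate applies to the records that survived the previous hop; the record count and the function name `effective_reach` are illustrative.

```python
def effective_reach(records: int, hop_match_rates: list[float]) -> float:
    """Estimate how many records remain reachable after a chain of matches.

    Assumes each rate is measured on the output of the previous hop.
    """
    reachable = float(records)
    for rate in hop_match_rates:
        reachable *= rate
    return reachable


crm_records = 1_000_000          # hypothetical CRM file size
hops = [0.60, 0.60]              # onboarder match, then DSP match
remaining = effective_reach(crm_records, hops)
print(f"{remaining:,.0f} of {crm_records:,} records remain ({remaining / crm_records:.0%})")
```

Adding a third element to `hops` for the share of matched identifiers actually found on available inventory shrinks the number further, which is the practical question the headline match rates never answer.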

Understanding how these layers interact is critical, so it is worth unpacking what matching actually means in practice.

Deterministic Match:

a match made using a shared verified identifier such as the same hashed email appearing in both datasets.

At its core this follows the transitive property of equality: if identifier A is known to belong to person B, and person B is also known to be tied to identifier C, then identifier A can be linked to identifier C through that shared reference point. The confidence comes from the fact that the relationship is grounded in an observed fact rather than an inferred pattern. Turns out the transitive property from high school algebra still shows up at work!

Scale, however, depends entirely on how often those shared identifiers exist across the datasets being compared.
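A minimal sketch of a deterministic match on hashed email follows. The normalization step (trim and lowercase before SHA-256) is a common convention but is an assumption here, and both parties must apply the same rules or identical emails will fail to match. All records are invented.

```python
import hashlib


def hash_email(email: str) -> str:
    """Normalize an email address and hash it with SHA-256."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Each side keys its own internal ID by the hashed email it holds.
crm = {
    hash_email("Ana@example.com"): "crm-1",
    hash_email("bo@example.com"): "crm-2",
}
publisher = {
    hash_email(" ana@example.com "): "ppid-77",
    hash_email("cy@example.com"): "ppid-78",
}

# The shared hash is the reference point that links the two IDs.
shared_hashes = crm.keys() & publisher.keys()
matches = {crm[h]: publisher[h] for h in shared_hashes}
match_rate = len(matches) / len(crm)

print(matches)      # {'crm-1': 'ppid-77'}
print(match_rate)   # 0.5
```

The link between `crm-1` and `ppid-77` is certain because both sides observed the same email. The 50% rate simply reflects how many CRM emails the publisher also happens to hold.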

Probabilistic Match:

a match inferred between two identifiers, such as a cookie and a device ID, using behavioral or contextual signals rather than a deterministic match.

The identifiers themselves are stable. The linkage between them is based on modeled logic.
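To make "modeled logic" concrete, here is a deliberately simple toy score: how often a cookie and a device ID are seen on the same IP address within a short time window. The scoring rule, the ten-minute window, and the sample events are all assumptions for illustration; production models weigh many more signals.

```python
def link_score(cookie_events, device_events, window_seconds=600):
    """Share of cookie events with a device event on the same IP within the window.

    Each event is an (ip_address, unix_timestamp) tuple.
    """
    if not cookie_events:
        return 0.0
    hits = 0
    for ip, ts in cookie_events:
        if any(ip == d_ip and abs(ts - d_ts) <= window_seconds
               for d_ip, d_ts in device_events):
            hits += 1
    return hits / len(cookie_events)


# Hypothetical observations for one cookie and one device ID.
cookie_events = [("203.0.113.5", 1000), ("203.0.113.5", 5000), ("198.51.100.9", 9000)]
device_events = [("203.0.113.5", 1200), ("203.0.113.5", 5300), ("192.0.2.44", 9100)]

score = link_score(cookie_events, device_events)
print(f"co-occurrence score: {score:.2f}")  # 0.67
```

Whatever threshold turns a score like this into a link is a modeling decision, and it is exactly the kind of detail a headline match rate leaves out.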

Cookie Sync:

the process through which adtech platforms exchange mappings between their respective cookie identifiers.

A pixel fired by one platform allows another platform to record a mapping between its user ID and the first platform’s ID. Cookie syncing has been foundational to interoperability on the open internet.
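The result of a sync is a translation table held by the receiving platform. The sketch below assumes a typical flow in which Platform A appends its own ID to the pixel URL and Platform B reads its own cookie on the same request. The function names and ID values are hypothetical.

```python
from typing import Optional

# Platform B's sync table: Platform A's user ID -> Platform B's own user ID.
sync_table: dict[str, str] = {}


def handle_sync_pixel(partner_id: str, own_cookie_id: str) -> None:
    """Record a mapping when the sync pixel request reaches Platform B.

    partner_id is assumed to arrive as a URL parameter added by Platform A;
    own_cookie_id is read from Platform B's own cookie on that request.
    """
    sync_table[partner_id] = own_cookie_id


def translate(partner_id: str) -> Optional[str]:
    """Look up Platform B's ID for a user that Platform A refers to by its ID."""
    return sync_table.get(partner_id)


handle_sync_pixel("A-123", "B-987")
print(translate("A-123"))  # B-987
print(translate("A-555"))  # None, this user was never synced
```

Every unsynced or expired cookie shows up as a `None` here, which is one reason match rates between platforms fall short of 100% even when both have seen the same browser.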

Match Rate:

the percentage of records in one dataset that successfully match records in another dataset at a single point in time.

Match rate is a snapshot. It says nothing about persistence, determinism, accuracy, or downstream addressability.

Collaboration and Translation: Alternative IDs and Clean Rooms

As traditional signals weaken, interoperability layers have expanded.

Alternative IDs:

identifiers designed to supplement or replace cookies and device IDs in environments where those signals are declining.

Two widely discussed examples are LiveRamp’s RampID and Unified ID 2.0 (UID2).

UID2 is an encrypted representation of an email address or phone number created through an open source framework. It represents the underlying credential rather than a modeled person. In practice, a UID2 token is derived from a specific authenticated identifier, most commonly an email address, and can only be resolved back to that specific and singular email when the appropriate permissions and keys are present within the UID2 ecosystem.

RampID, by contrast, is a people-based identifier derived from multiple pieces of PII and centrally managed by LiveRamp. Rather than representing a single email address, RampID attempts to resolve multiple attributes associated with the same individual into a single persistent person-level identifier. That may include multiple email addresses, devices, or other identifiers that LiveRamp has determined belong to the same person through deterministic linkage within its identity graph.

Both aim to support persistent identity in a consent driven ecosystem, but they operate at different conceptual layers. UID2 represents an encrypted version of a credential such as an email address. RampID represents a resolved person behind one or more credentials. Understanding that difference is important when evaluating how identity flows through an advertising workflow.

Clean Room:

a secure environment where multiple parties can match and analyze datasets without exposing raw user level data to one another.

Clean rooms are collaboration environments, not identity systems themselves. Identity resolution occurs inside them under strict data handling constraints.

Synthetic ID:

a neutral identifier generated inside a clean room to represent a matched entity between two or more parties once their respective datasets have been compared inside the controlled environment.

In practice, each party enters the clean room with its own identifiers. A publisher may bring hashed email addresses of its authenticated subscribers, while an advertiser may bring CRM records or transaction data tied to its customers. Inside the clean room, those datasets are compared to identify overlapping users. Rather than exposing the publisher’s identifiers to the advertiser or the advertiser’s identifiers to the publisher, the clean room generates a brand new identifier: a match key.

That newly created identifier is the synthetic ID. It becomes a temporary reference point that both parties can use for analysis, measurement, or activation within the clean room environment. Because the synthetic ID is newly created and only exists within that collaboration space, neither party gains access to the other’s underlying identity assets. The clean room can therefore enable joint analysis, audience creation, or performance measurement while preserving the separation of each party’s original identifiers.
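A minimal sketch of that step: find the overlap between two sets of match keys and mint a fresh random identifier for each overlapping record. The use of random UUIDs and the function name `build_synthetic_ids` are assumptions for illustration; actual clean room products differ in how they generate and scope these IDs.

```python
import uuid


def build_synthetic_ids(publisher_keys: set[str], advertiser_keys: set[str]) -> dict[str, str]:
    """Create a new random ID for every key present in both datasets.

    The returned mapping is meant to stay inside the clean room.
    The parties would only ever see the synthetic IDs (the values).
    """
    overlap = publisher_keys & advertiser_keys
    return {key: uuid.uuid4().hex for key in sorted(overlap)}


# Hypothetical hashed-email match keys brought by each party.
publisher_keys = {"hash_a", "hash_b", "hash_c"}
advertiser_keys = {"hash_b", "hash_c", "hash_d"}

mapping = build_synthetic_ids(publisher_keys, advertiser_keys)
synthetic_ids = list(mapping.values())
print(len(synthetic_ids))  # 2
```

Because each synthetic ID is random, it cannot be reversed into either party's original key. It is only useful for as long as the clean room keeps the mapping.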

What Identity Ultimately Enables

All of this infrastructure ultimately exists to enable two things the advertising ecosystem depends on: addressability and accountability.

Addressability:

the ability to recognize and activate against a known entity within a specific media environment.

Addressability exists on a spectrum. A logged in publisher environment may have strong person level recognition. Open internet environments may only support device or browser-level recognition.

The meaningful question is always resolution level: browser, device, person, or household.

Real Identity:

a verified representation of an actual person or household behind a cluster of digital identifiers.

Real identity sits at the top of the resolution hierarchy. It moves beyond anonymous signals to a point where the system can associate identifiers with a specific human being or household with high confidence.

Not every advertising use case requires this level of resolution, but measurement, attribution, and audience based buying increasingly depend on it.

Alignment Before Optimization

The industry does not need universal agreement on definitions, but it does need clarity about the ones being used. Too often conversations move forward with everyone assuming the same understanding, only to discover later that each party was describing a different layer of the system.

If discussions begin by declaring what is actually meant by terms like identity, match rate, graph, or person-level resolution, a surprising amount of confusion disappears. Identity challenges are real. Signal loss is real. But many of the frustrations that surface in day-to-day collaboration are semantic long before they are technical.

Before comparing match rates, evaluating graphs, or debating interoperability, it is worth pausing to confirm that both sides mean the same thing when they use the same words. The rest of the conversation becomes far more productive once that foundation is established.

That is part of the motivation behind the conversation at Marketecture Live III next week. The topic of the session is deterministic identity in CTV, but the broader goal is the same one that motivated this glossary: making sure we are speaking the same language when we describe how identity actually works, what it can do, and the constraints it operates under.

The more the industry can align on the vocabulary that describes these systems, the easier it becomes to have honest debates about the technology, the trade-offs, and the future direction of identity infrastructure. The conversation will not end at one conference session, and it should not. But if more of those conversations start from a shared understanding of the terminology, the rest of the work becomes meaningfully easier.