Publisher transformation: Machine learning experts

October 03, 2023

  • David Reischer

    Senior Product Manager at Permutive

This article explores what is required to operationalise modelling to scale data and increase audience reach.

  • As regulators and browsers increasingly focus on privacy practices in digital advertising, advertisers can apply audience targeting to only 30% of the open web. Equipped with the right tools and expertise in machine-learned modelling, publishers can fill the growing data gap and ensure advertisers don’t miss out on valuable audiences while growing revenue from advertising.

    Publishers are becoming the new generation of data providers for audience targeting. This recognition goes beyond publishers’ ability to collect data in a privacy-compliant way and to recognise all users to create endemic and non-endemic audiences. It’s about their ability to model out niche and hard-to-scale datasets, including high-quality, self-declared ones. 

    In this article, the second in a two-part series, we’re exploring what is required to operationalise modelling at web scale. See part one for three types of modelling that publishers can leverage to build a scaled data offering.

    Taking this a step further and operationalising these models – making them scale to millions of users in real time – requires machine learning expertise and the right system design. 

    Operationalising Models

    In the new era of digital advertising, publishers have to become experts in machine learning. This is a complex task that requires not only data science expertise but also the technical infrastructure to operationalise models and make them scale to millions of users in real time. Generally, there are four components to this:

    • Seed Data

    The most important part of any ML effort is access to the right seed data. A model can only be as good as the data you feed into it. For publishers, this can be interest data collected from user interactions, declared data from registered users or surveys, or declared data from data partners. Generally, the larger the seed dataset, the better the model can be trained. However, you don’t want to compromise quality for quantity. If you use loads of low-quality data, don’t expect great results.

    • Feature Engineering

    If you want to maximise scale and quality, you will have to make predictions in real time. This requires you to think about feature engineering, meaning the aggregation of raw events into a format (user state) that a model trains against and that is then used to make predictions. This step can be the hardest part: As a publisher, you need to update millions of user states in real time, enabling you to target users before they leave your site (and may never return). Edge computing can help with this: Moving the feature engineering to the device removes the network round trip and is highly scalable, because each device maintains only its own user state.

    • Model Selection

    It’s critical to select the most appropriate model for your given use case. This requires an understanding of the problem at hand and a pragmatic decision of how complex your model needs to be. Luckily, there are a lot of algorithms to pick from, but it requires expertise, experience and experimentation to find the right mix of complexity and practicability. More complexity isn’t always better: The simpler your model can be to achieve a desired outcome, the better.

    • Inference

    Once you have trained your model and you have access to a user’s real-time state, you can feed that state into the model to get predictions. This requires an inference service that returns predictions in milliseconds. For simpler models, like logistic regression, the inference could happen on device, avoiding a network round trip altogether. For large, complex models such as deep neural networks, it might be more practical for the inference to happen in the cloud.
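
    To make the feature engineering and inference components above more concrete, here is a minimal sketch in Python. It assumes a user state that is updated one event at a time and a logistic regression model whose coefficients have already been trained elsewhere. The event shape, the category names, and the weights are hypothetical and chosen purely for illustration; they do not describe any particular product or dataset.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class UserState:
    """Aggregated features for one user, updated one event at a time."""
    pageviews: int = 0
    category_counts: Counter = field(default_factory=Counter)

    def update(self, event: dict) -> None:
        # Fold a single raw event into the state; no event history is re-read.
        self.pageviews += 1
        self.category_counts[event["category"]] += 1

    def features(self, categories: List[str]) -> List[float]:
        # Share of pageviews per category, in the fixed order the model expects.
        total = max(self.pageviews, 1)
        return [self.category_counts[c] / total for c in categories]


def predict(features: List[float], weights: List[float], bias: float) -> float:
    """Logistic regression inference: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical categories and pre-trained coefficients (assumptions).
CATEGORIES = ["sport", "finance", "travel"]
WEIGHTS = [-0.5, 2.0, 0.3]
BIAS = -1.0

state = UserState()
events = [{"category": "finance"}, {"category": "finance"}, {"category": "sport"}]
for event in events:
    state.update(event)  # real-time feature engineering
    score = predict(state.features(CATEGORIES), WEIGHTS, BIAS)  # inference
    print(f"after {state.pageviews} pageview(s): score = {score:.3f}")
```

    Because both the state update and the prediction are a handful of arithmetic operations, a model of this size could plausibly run on the device itself. A larger model would keep the same state-update step but send the feature vector to a cloud inference service instead.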

    Sourcing the right seed data and building the appropriate model are fundamental steps. But scaling the model to millions of users and making real-time predictions is the hard part. And real-time predictions are crucial for publishers: If you get your predictions in a nightly batch job, you will always be one step behind. This also means you’ve lost your chance to address all those users who have one session and then never return to your site. Slow predictions aren’t so much an issue when you want to send personalised emails, but they have a direct impact on your revenue when you need to serve campaigns to users while they are on your site.
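
    The effect of a nightly batch on single-session users can be illustrated with a small sketch. It assumes a hypothetical session log and a batch job that runs at 02:00 each night, and it treats a user as addressable by batch only if they start a session after a batch run has scored one of their earlier sessions. The data and the batch time are invented for illustration and are not measurements.

```python
from datetime import datetime, timedelta

# Hypothetical session log: (user_id, session_start). Illustrative data only.
SESSIONS = [
    ("a", datetime(2023, 10, 2, 9, 0)),
    ("a", datetime(2023, 10, 3, 9, 0)),
    ("b", datetime(2023, 10, 2, 14, 0)),  # single session, never returns
    ("c", datetime(2023, 10, 2, 20, 0)),  # single session, never returns
    ("d", datetime(2023, 10, 2, 23, 30)),
    ("d", datetime(2023, 10, 3, 8, 0)),
]


def next_batch_run(ts, hour=2):
    """Time of the first nightly batch run (at hour:00) strictly after ts."""
    run = ts.replace(hour=hour, minute=0, second=0, microsecond=0)
    if run <= ts:
        run += timedelta(days=1)
    return run


def addressable_with_batch(sessions):
    """Users who start a session after a batch has scored an earlier one."""
    first_scored = {}  # user -> time their first session gets scored
    reachable = set()
    for user, start in sorted(sessions, key=lambda s: s[1]):
        if user in first_scored and start >= first_scored[user]:
            reachable.add(user)
        first_scored.setdefault(user, next_batch_run(start))
    return reachable


def addressable_in_real_time(sessions):
    """Assumes a prediction is available during the user's first session."""
    return {user for user, _ in sessions}


batch_users = addressable_with_batch(SESSIONS)
realtime_users = addressable_in_real_time(SESSIONS)
print("batch:", sorted(batch_users))
print("real time:", sorted(realtime_users))
print("missed by batch:", sorted(realtime_users - batch_users))
```

    In this toy log, the users missed by the batch approach are exactly the ones who visit once and never return, which is the group the paragraph above describes.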

    The opportunity ahead 

    Privacy is becoming one of the most important macro trends in the industry, driven by both regulators and browser vendors. This creates a huge opportunity for publishers, as they are in a unique position to allow advertisers to continue reaching their customers and to do so in a privacy-safe and scalable way across all platforms on the open web. However, publishers need the right tools to create a robust and differentiated audience offering, and machine learning is an important tool to achieve that. Publishers need access to the right modelling solutions for specific modelling problems, and they need the right infrastructure to run these models at web scale.

