E015 - Taking Data Collection & Data Privacy Seriously, with Snowplow

By Rick Dronkers in Jordan Peck — Oct 11, 2022

Life After GDPR Podcast #E015 with Jordan Peck @ Snowplow

In this episode, I interview Jordan Peck from Snowplow.

Jordan is a Solutions Architect at Snowplow with a background in digital marketing analytics. He went from using Google Tag Manager and Google Analytics for clients to becoming a Snowplow client himself and experiencing the power of a customized data collection platform. In 2020 he decided to join Snowplow to help customers adopt the platform and get the most out of the data they collect. He’s active on LinkedIn, Twitter and in MeasureSlack.

In this episode we discuss:

What Snowplow is
What specific parts of Snowplow are relevant from a data privacy perspective
What configuration options there are
What and Who you need to get started with Snowplow
And much more

Make sure you follow the show:

Follow LifeAfterGDPR on Twitter and on LinkedIn
Follow the host Rick Dronkers on Twitter & LinkedIn.
Subscribe to the show on Apple Podcast or Spotify or wherever you listen to podcasts by searching for "Life after GDPR"
If you’d rather get notified by email, subscribe for updates via the lifeaftergdpr.eu website

If you want to help us out, please share the link to this episode page with anyone you think might be interested in learning about Digital Marketing in a Post-GDPR world.

Transcription Disclaimer PLEASE NOTE LEGAL CONDITIONS: Data to Value B.V. owns the copyright in and to all content in and transcripts of the Life after GDPR Podcast, with all rights reserved, as well as the right of publicity.

WHAT YOU’RE WELCOME TO DO: You are welcome to share the below transcript (up to 500 words but not more) in media articles, on your personal website, in a non-commercial article or blog post (e.g., Medium), and/or on a personal social media account for non-commercial purposes, provided that you include attribution to “Life After GDPR” and link back to the https://lifeafterGDPR.eu URL. For the sake of clarity, media outlets with advertising models are permitted to use excerpts from the transcript per the above.

WHAT IS NOT ALLOWED: No one is authorized to copy any portion of the podcast content or use the Life after GDPR Podcast name, image or likeness for any commercial purpose or use, including without limitation inclusion in any books, e-books, book summaries or synopses, or on a commercial website or social media site (e.g., Facebook, Twitter, Instagram, etc.) that offers or promotes your or another’s products or services without written explicit consent to do so.

Transcripts are based on our best efforts but will contain typos and errors. Enjoy.

[MUSIC SOUND EFFECT BEGINS AND FADES]

[00:00:00] Rick Dronkers: Hey everybody. Thanks for tuning into another episode of the Life After GDPR podcast, where we discuss digital marketing in a post-GDPR world. In today's episode, I interview Jordan Peck. Jordan is a Solutions Architect at Snowplow. And this is actually the first time that I decided to talk to a vendor on the podcast.

[00:00:28] So please share your thoughts with me about how we handled this topic and if it was valuable to you as a listener. I personally like Snowplow a lot. I wish we would do more with it in our day-to-day business. Hopefully we will in the future. I think it's a really powerful platform that has a lot of possibilities, especially if the maturity of the organization moves up a bit. In today's episode, we mainly focus on how Snowplow can help you from a data privacy perspective. Let's assume you are on Google Analytics right now, and you're evaluating whether you want to move away from that for obvious reasons, how Snowplow could potentially be an alternative, and what features it has. Short disclaimer before we dive in: I am not a lawyer, Jordan is not a lawyer, and nothing that we say on this podcast is legal advice. So, without any further ado, let's dive in with Jordan Peck from Snowplow.

[00:01:35] Welcome to the podcast.

[00:01:36] Jordan Peck: Hi, Rick. How you doing?

[00:01:37] Rick Dronkers: Doing well, doing well. So you work for Snowplow. You guys are the first vendor on the Life After GDPR podcast so let's get that out of the way.

[00:01:38] Jordan Peck: I'll take the chance to introduce myself. I'm Jordan Peck. I am a Strategic Solutions Architect at Snowplow Analytics, or Snowplow.io as we are now. I have worked for Snowplow for very close to two years now. However, before joining the Snowplow team, I was a real-life web analyst.

[00:01:58] Did a lot of Google Analytics work, a lot of GTM, BigQuery, all that kind of stuff. Was an Analytics Consultant for a little while, with a background in digital marketing and digital analytics as well. Was also a Snowplow customer about four or five years ago, when I first came across Snowplow, when we were significantly smaller than we are now.

[00:02:18] Really, really loved the product. I thought it was one of the best that I'd seen in the space. And then when the opportunity came a couple of years ago to come join the team, I was ecstatic to get the chance to join.

[00:02:28] Rick Dronkers: Yeah, I can imagine. So, I have a very, very similar background. It's also why I invited you on the podcast, because I know that we see the problems I ran into as a digital analytics consultant with my clients in the same way. Before we dive into Snowplow and privacy, which is obviously the angle for the podcast, could you give a quick overview for the listeners: what is Snowplow? And perhaps also highlight the difference between the open source part and what you guys are doing, cuz there is a for-profit business next to it.

[00:03:01] Jordan Peck: So Snowplow is at heart a behavioral data platform. It is a system for creating and collecting the best quality behavioral data about your users and your customers across various different platforms. So whether that be web, mobile, IoT, server-side applications, desktop applications, wearables, all that kind of stuff.

[00:03:25] We are cloud native, so we run on AWS or Google Cloud Platform and load data in real time into a warehouse of your choice. Supported destinations are BigQuery, Redshift, Snowflake, and Databricks. As you mentioned, the core of our product is open source. All of our code is on GitHub: all the code for our tracking SDKs and the

[00:03:45] cloud applications that we use to process, enrich, and validate the data and load it into the data warehouse is all open source. If you are a technically advanced company with a set of smart data engineers, you can take all of our source code and run it all by yourself on your cloud, which several people do.

[00:04:05] We have a very, very large and engaged open source community, and it's hard to track sometimes, but we think it's in the region of several thousand businesses using Snowplow open source. So, Gousto in the UK do; Trello, who are owned by Atlassian, certainly used to; and I think the New York Times are big open source users as well.

[00:04:25] Yeah. As you mentioned, I work for Snowplow. We are a for-profit business, so we offer a paid alternative where essentially you come to us and we manage that Snowplow pipeline for you. So, as you mentioned, all of the code's on GitHub, but to use a slightly outdated term, it's a big data platform, right?

[00:04:43] There's lots of services. I think there's like [unclear] or 70 services across the two clouds that we build and maintain. It can be a challenge to run that effectively at high scale and make sure everything continues to work and doesn't fall over. So you can offload that responsibility to us.

[00:04:58] We will run it for you. However, and this is probably gonna be quite a key point of our discussion today, as part of our paid service we are not pure SaaS. What we are, we coin it private SaaS. So the tech is SaaS, we build it, we maintain it; however, we deploy all of our pipeline infrastructure into the customer's cloud accounts.

[00:05:19] Customer comes along, they open up a brand new empty AWS sub-account or GCP project. We deploy our infrastructure into that environment, and then we remotely monitor it. So we make sure your service expands if it needs to, and we will patch and update for you, send new upgrades and things like that.

[00:05:37] And keep the lights on for you. So you offload all of that management of running the pipeline to us for a fee. And you also get people like me to tell you how to use it most effectively.

[00:05:49] Rick Dronkers: We have a couple of clients that use this service from you. So basically they, you could say they outsource their data engineering to you guys.

[00:05:58] Jordan Peck: Yeah, for this aspect. And even very, very technically sophisticated companies still do this, mainly because they just don't want to have to do it themselves. They've got all of their own systems that they're busy managing and running, and if they can get an enterprise-grade cloud piece of software that is just managed and run for them, and they don't have to worry about it,

[00:06:17] they just send data to it and consume data out of it, they're very, very happy. It's quite a nice model. This deployment model lends itself to a lot of privacy benefits, which we're gonna talk about today, as well as a lot of features we've built across the product for this kind of consideration.

[00:06:32] Rick Dronkers: It's interesting. I just thought about it: it's not real SaaS, like you said. It's a slightly different model, but from a privacy perspective, it's exactly what you would want, because the benefit of course is that I create my own Google Cloud Platform or AWS account. I own it. I own everything inside it, and I give you guys rights.

[00:06:55] And if I wanna throw you guys out, [laughs] I could throw you guys out. Still probably have to pay you if I throw you out. But the data is mine, right?

[00:07:03] Jordan Peck: That does happen. Sometimes we've had customers kick us out and then we go, you're paying for a managed service and we're not able to manage it. That does happen from time to time. Thankfully not as much, I think, as it used to. But yeah, the data never leaves your own servers inside your cloud, and we only have access to perform monitoring; we don't need access to the data to do that.

[00:07:24] It never flows through our servers. So the data that you're collecting on your own users never hits any server infrastructure that we own. Sometimes we do get access to customers' data, but only with permission, and with time-bounded permission rights for us to do specific actions. We take it very seriously. We take it extremely seriously because it's very important, now more than ever.

[00:07:43] Rick Dronkers: So let's look at Snowplow through my lens, right? So I am a consultant helping a lot of companies with Google Tag Manager and Google Analytics. Depending on when this airs, we will know in how many countries Google Analytics is or is not legal, right? But that's an issue. So people have this fear and they have alternatives. I presented on this framework, and you were there as well, so we had a nice discussion about it at MeasureCamp, which is: you have these three options. Either you're gonna fight it, like you wanna stay in the Google ecosystem and you're gonna hire lawyers and you're gonna optimize your implementation, make it as, I don't know, proof as possible.

[00:08:23] I would say let's park that option cuz we're not gonna discuss that today, but that's one of them.

[00:08:29] And then you can take flight: you can either move backwards or forwards. Backwards, I would say, is a little bit of downgrading to, let's call them, simpler analytics solutions. So there's a bunch of solutions out there. The French DPA published a whole list of solutions that they say are GDPR proof.

[00:08:50] Jordan Peck: Gentis. Objective. There's a number of them. Favom, I think is another one.

[00:08:55] Rick Dronkers: Yeah. And Matomo, Piwik, those kinds of solutions. I'm not necessarily saying they're bad, but I do think they're either on par with Google Analytics or a little bit behind, I would normally say.

[00:09:09] Jordan Peck: They haven't had the development a company like Google can afford to put into a product.

[00:09:13] Rick Dronkers: Exactly, but I also feel like they are geared more towards, let's call it simple web analytics, or not necessarily simple, but really geared at web analytics. Whereas with more complex use cases, you often see that the need arises to integrate multiple sources and to create a wider thing. And you saw Google Analytics with GA4 also moving in that direction, with the more event-based model. And then Snowplow comes into the picture, right? Once you start looking at, okay, I want something more advanced, even before all the privacy stuff, Snowplow usually came on the short list for people. Like, hey, this is something that is gonna take more work to do, but it has some additional value when you look at it like that.

[00:10:04] Jordan Peck: And just to put your listeners' and viewers' minds at rest, I'm not gonna dive into a sales pitch. That's not why I'm here. I would think this, but I think Snowplow is the most advanced and best product in the market for doing the more advanced analytics from multiple sources in real time,

[00:10:20] in the most granular detail, with the most custom properties and flexible schemas, et cetera, et cetera. If you're interested, speak to a salesperson. [laughs] That has definitely been the way we've been positioned, certainly since when I was a customer: if you're really, really serious about taking your web analytics as seriously as a business takes the rest of its data. Data warehousing has been around for like 30 years now.

[00:10:44] People have been doing data science on business data for a very long time, like logistics and supply chain and trading and stuff like that. If you wanna take your web analytics and your mobile analytics as seriously as that, a tool like Snowplow is one of the best options for it because of the flexibility that you.

[00:11:03] Rick Dronkers: It's the trade-off: there are a lot of things you can customize and it is very flexible, but then obviously it's not plug and play, here's your JavaScript snippet and good luck. That's the trade-off you're gonna have to make. You need a team of people to maintain it, definitely if you're gonna go the open source route.

[00:11:22] Jordan Peck: We've made efforts to make Snowplow easier to use. But again, I would certainly say the general opinion of Snowplow has always been that it is more effort, but it will provide more value at the end of it if you're willing to put that effort in. You need people who can write SQL.

[00:11:34] You need people who can understand how to derive value and create meaningful data out of a big data warehouse. We are putting efforts in to make it quicker to get value, easier to set up, easier to use for more people, more personas. But in essence, it is a more technical product than: install GA and GTM, get a pretty report in the GA interface. That's not the market we're filling.

[00:11:58] Rick Dronkers: I recorded a podcast with Timo Dechau last week. In there we were also exploring what type of analytics is valuable, and one of the things we figured out is that you have to highlight the core value of data to your company. If you take, let's say, a travel company, you figure out that their algorithm, like what trip to show to the visitor at what moment, based on what they know about the user and their search behavior, can actually be a huge improvement on conversion rates.

[00:12:34] And then you realize data is no longer just about calculating: did my Google ad spend, did my click, generate this much money? It's an integral part of your business, and there's a lot of business models where that's the case.

[00:12:49] Jordan Peck: It's building data products and data apps. And even in the analytics space, like the reporting you can still think of it as a data product or an application in that sense. This is my trading reporting data application. Or to use your example in terms of real activation, like my search product that I am optimizing through the use of data we're observing from our users.

[00:13:15] Recommendation engines are another popular one that we have a lot of people interested in. Which I think, having been an analyst for a long time, is a really nice thing to see in the market: that data and analysts aren't cost centers.

[00:13:32] Like, people always think data's just in this little bubble: requests come in and PowerPoints come out [laughs], and that can be a bit demoralizing really. I know, I did it for a long time. And to see that the perspective now is that you can build data products which actually drive value and increase

[00:13:50] top line, bottom line, improve customer experience, improve stickiness, is a really, really nice thing to see in the industry.

[00:13:57] Rick Dronkers: Yeah, I think all tools are slowly moving towards that point, but I think Snowplow was really designed from the ground up, or invented from the ground up, with these kinds of use cases in mind. You look at the early presentations of Alex and Yali, their early presentations already, with the old logo, with the actual snowplow. [laughs]

[00:14:17] Jordan Peck: With the actual snowplow. [laughs] Yeah. I love the story. Alex and Yali, our two co-founders, used to be consultants, and they would go into a business and they'd say, Oh, can we have access to your transactional data? And they'd go, Sure, here's the data warehouse.

[00:14:30] And they'd have all this flexibility to slice and dice and join and manipulate the data as they wished. They used to be given DVDs with exports of the data warehouse so they could do the analysis. That wouldn't fly in a GDPR environment, would it? [laughs] But yeah, and then they'd say, Oh, you've got a really nice website.

[00:14:47] You've got a lot of users on it. Can we analyze that? And they'd say, Oh yeah, sure, here's a GA login. And it'd be like, Oh, it's not quite the sort of experience that we wanted really. Like, it would be so good to be able to do the same type of analytics we're doing with the other transactional and operational data.

[00:15:02] We would love to own the data, and therefore there's gotta be a tool for that. There wasn't. That was what Snowplow was built for: those more advanced use cases, more complex, but ones that ultimately, generally, drive more value. The GDPR perspective, data privacy and security, is also something else we've taken very seriously.

[00:15:22] People wanna build data apps like we've been discussing, while still being GDPR compliant and respecting users' privacy. It seems like that might make it more unattainable, because it seems like you've got even more things to overcome before you can start doing these things.

[00:15:37] So we take it very seriously, making sure that you can still use Snowplow to do those advanced use cases while still respecting users' privacy.

[00:15:46] Rick Dronkers: The first obvious thing is what we already tackled, right? So with Snowplow, whether you guys manage it or I do it open source, it should always be in my ownership. Like technically, I could set it up on somebody else's cloud and then put it on my website, but let's assume I don't do such a thing. So that's the first part, right? The data ownership is always first party. So where the data ends up, BigQuery or Redshift, whatever, the collectors, the whole pipeline will run on your own cloud infrastructure.

[00:16:18] Jordan Peck: Yeah, that's correct.

[00:16:19] Rick Dronkers: That's by default, by design.

[00:16:22] Jordan Peck: That is by design. Yeah. And you can add in your governance policies, retention policies, cleaning and archiving, or moving data to cold storage or deletion. If you say, we don't want our logs to contain more than 30 days' worth of data, we'll archive them or delete them after a certain time period.

[00:16:42] We only want people to access data with these particular policies: only these types of people can access this type of data under these types of circumstances. All of that's possible. We don't enforce any of that. It's entirely up to you how you wanna control your data and your data access and the governance around it.
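As a rough illustration of the 30-day retention rule Jordan mentions, here is a minimal Python sketch. It is not Snowplow code: the function name, the object keys and the in-memory list are all hypothetical, and in practice a rule like this would more likely be expressed as a lifecycle policy in the cloud storage service itself.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window; the 30 days comes from the example in the interview.
RETENTION = timedelta(days=30)


def expired_objects(objects, now, retention=RETENTION):
    """Return the keys of log objects written before the retention cutoff."""
    cutoff = now - retention
    return [key for key, written_at in objects if written_at < cutoff]


# Made-up object keys and timestamps, purely for illustration.
now = datetime(2022, 10, 11, tzinfo=timezone.utc)
logs = [
    ("logs/2022-08-01.gz", datetime(2022, 8, 1, tzinfo=timezone.utc)),
    ("logs/2022-09-20.gz", datetime(2022, 9, 20, tzinfo=timezone.utc)),
    ("logs/2022-10-10.gz", datetime(2022, 10, 10, tzinfo=timezone.utc)),
]

# Only objects older than 30 days are selected for archiving or deletion.
to_archive = expired_objects(logs, now)
```

Whether the selected objects are archived to cold storage or deleted outright is the governance decision Jordan says is left to the customer.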

[00:17:01] Rick Dronkers: Yeah. It's all possible. I would say it might be interesting for you guys to create a GDPR best-practice implementation for Snowplow, where you basically suggest: if you wanna take privacy as seriously as possible, then you would wanna do all these settings in this way.

[00:17:20] Jordan Peck: It's difficult, right? I mean, first of all, there's a legal side to that, isn't there? We don't ideally wanna be told at some point by a customer, Snowplow told us to do this, and then we end up on the hook if they've done something horrendously wrong.

[00:17:35] There are also considerations, right? Like, if you are saying, we'll trash all data in storage and in our logs or our streaming platform after, like, seven days. Which is fine, except there are other implications of doing that, right?

[00:17:48] I was speaking to a customer yesterday about this. One of the things Snowplow does is, if an event fails for whatever reason, cause it's not the right format or something, we don't silently drop it; we actually store it in cloud storage. Let's say you stick a retention policy or a lifecycle policy on it to delete data every seven days. Those failed events can be reprocessed.

[00:18:07] You can essentially correct them and send them back to the pipeline, but if you've only got seven days worth of retention, you've only got seven days to action that.

[00:18:13] Rick Dronkers: Yeah, you need to be on top of it.

[00:18:15] Jordan Peck: Yeah, exactly. So there is a balance if you really wanna be privacy conscious, you delete it every day or between files or something.

[00:18:22] But if you do that, then you end up opening yourself up to some of these gaps in your data because you don't have time to reprocess it. So there is a balancing act to be made there.

[00:18:32] Rick Dronkers: Yeah. And you definitely wouldn't wanna give a recommendation that this is GDPR compliant, but like you said, you give a recommendation on: this is the most privacy-friendly way you could configure Snowplow in theory. And this is not legal advice. Right. That kinda…

[00:18:50] Jordan Peck: Yeah. I'm not a lawyer. I find myself saying that a lot these days. [laughs]

[00:18:54] Rick Dronkers: That's on the data retention policy side. Let's talk about the data collection. Cause if we take Google Analytics again as an example, then all the issues start with the identifier, the cookie, whatever way you're gonna stitch it all together. And by default, the framework, well, at least for Universal Analytics, was user, session, page view. And now for GA4, there's still the user identifier, still the GA cookie.

[00:19:24] Jordan Peck: The client id. Yeah.

[00:19:25] Rick Dronkers: So how does this look by default or how can this look as well from a Snowplow point of view?

[00:19:32] Jordan Peck: On the web, we do use cookie identifiers. On mobile we use device identifiers: a user identifier that's tied to the installation of the app. We do it in a slightly different way than Google. So we actually place multiple cookies. Focusing on web for a second.

[00:19:45] We place multiple cookies. We place a first-party, JavaScript-created cookie, which is very similar to, like, the FBQ for Facebook or the GA client ID. document.cookie sets the cookie, and by default it'll expire after a year. So that's the normal out-of-the-box cookie identifier.

[00:20:04] We also place a session ID, cause we do client-side sessionization. That first one, domain user ID is what we call it. So that cookie ID is like every other web analytics cookie, like the client ID. It's set in the browser. And it's susceptible to ITP, for one. So ITP took it down to seven days, or in some cases just 24 hours, which is a bit problematic.

[00:20:27] Yeah, so that's the main one. We also place another cookie, which is actually one set by our pipeline, set by our server. A very common pattern for Snowplow users and customers is, for the whole Snowplow pipeline, you stick tracking.rickdronkers.com, you stick a first-party subdomain in front of it.

[00:20:46] We probably don't recommend people call it tracking.something, but like, analytics.mysite.com. And then the whole Snowplow pipeline and the collection servers all sit behind that domain, so then that server can set a cookie back to the browser. So if I'm on rickdronkers.com and tracking.rickdronkers.com sends a cookie back, then that is also first party.

[00:21:08] This is where I'm not hugely familiar with the wording around this, but because this is set by the same domain that the user's visiting, it's your own infrastructure, it's all first party. Rather than it being set by google-analytics.com, rather than it being sent back

[00:21:29] by a third party, like Facebook or something. I know that this is very popular as well with GTM server-side. So people are setting up their GTM server behind one of their own domains as well. And I know that people are essentially trying to use it for exactly this purpose, right? To get around ad blockers and to stop tracking prevention methods from identifying them as tracking measures.
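To make the server-set cookie idea more concrete, here is a minimal Python sketch of the kind of Set-Cookie value a collector running behind a first-party subdomain could send back. This is an assumption-laden illustration, not Snowplow's collector: the cookie name `sp`, the `network_user_cookie` helper, the one-year lifetime and the `Secure`/`HttpOnly` flags are all chosen for the example.

```python
from http.cookies import SimpleCookie


def network_user_cookie(user_id, site_domain, max_age_days=365):
    """Build a Set-Cookie value for a server-set, first-party identifier."""
    cookie = SimpleCookie()
    cookie["sp"] = user_id  # cookie name is an assumption for this sketch
    morsel = cookie["sp"]
    # Scoping the cookie to the site's own domain is what keeps it first party
    # when the collector lives on a subdomain such as tracking.<site_domain>.
    morsel["domain"] = site_domain
    morsel["path"] = "/"
    morsel["max-age"] = max_age_days * 24 * 60 * 60
    morsel["secure"] = True
    morsel["httponly"] = True
    return morsel.OutputString()


# Hypothetical identifier value; the domain is the one used in the conversation.
header = network_user_cookie("3f2a9c", "rickdronkers.com")
```

Because the response comes from a host under the same registrable domain as the site, the browser treats the cookie as first party, which is the point Jordan makes about tracking.rickdronkers.com.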

[00:21:53] Rick Dronkers: That sounds a lot like the way server-side Tag Manager is being used right now. Disclaimer: you should not use it without consent of the user, right?

[00:22:04] Jordan Peck: Yeah. I mean, we will get into the consent management in a little bit, but maybe we can get into it right now. Like, yeah, you can use this, we call it network user ID. Basically it's server-set. Yeah, that will supersede things like ITP, and it will probably duck ad blockers and stuff.

[00:22:19] But just because you can doesn't mean you should. You should still respect users' preferences. You should still only place those identifiers when relevant, when the user is happy for you to do so. I know Google have done some work in this space with consent mode.

[00:22:40] We have even more options than that. We allow you to do full cookieless tracking should you wish, and again, I'm not a lawyer, but you can make the argument that even if the user doesn't grant consent, we'll just track raw events without cookie identifiers.

[00:22:57] So you can still do like event based analysis, pageview based analysis conversion. Did a page view take place on this page? Did a conversion happen? But no user identifiers, no cookies, no nothing. And then you might, you might say that that's still legitimate interest in improving your site, improving your experience, but you don't know anything about the user who performed the action.
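The idea of tracking raw events without identifiers can be sketched as follows. This is a simplified Python illustration, not the Snowplow tracker API: `build_event` and the field names are hypothetical, and the only point is that identifiers are attached when consent is given and left out otherwise.

```python
def build_event(event_type, page_url, consent_given, user_id=None, session_id=None):
    """Return an event payload; identifiers are attached only with consent."""
    event = {"event": event_type, "page_url": page_url}
    if consent_given:
        event["user_id"] = user_id
        event["session_id"] = session_id
    return event


# Without consent the payload carries the event itself and nothing about the user.
anonymous = build_event("page_view", "/pricing", consent_given=False,
                        user_id="u-123", session_id="s-456")

# With consent the same event also carries user and session identifiers.
identified = build_event("page_view", "/pricing", consent_given=True,
                         user_id="u-123", session_id="s-456")
```

Both payloads still answer "did a page view take place on this page?", which is the event-level analysis Jordan describes.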

[00:23:16] Rick Dronkers: Let's put the legal side of this on hold, right? Let's assume that there's a way to do this from a legal point of view. So then there would be a large part of my data set, hopefully, that has a user identifier, right? From the people who consent to it. And then I can basically plot out how those users went through sessions and also how they had multiple sessions, perhaps over longer periods of time. But then another part of the data set would not be tied to users. What would then be the main aggregation? Would it be on a page level, or is that up to you to decide, like up to you to model?

[00:23:55] Jordan Peck: This is one of the things I often think about. I actually mentioned it in my MeasureCamp talk when we were both in London a few months ago: cookieless tracking sounds great from a privacy perspective, but if you want to do session-based analysis and you don't place a session ID... [laughs]

[00:24:10] Or you wanna do unique users or new versus returning and you're not placing the user identifier. You’re out of luck. But that isn't to say that the data isn't valuable. You can still say most viewed pages, you can still say, like, did this form submission happen?

[00:24:26] Cause you can still track that the form was submitted and maybe some of the values that were put into the form, depending on the use case. You can still see that maybe a purchase was made and what products were in it, et cetera, et cetera. You can still track those as events. But tying those conversion events or those events to a marketing channel

[00:24:45] might not be possible. We will still track UTM parameters. So you could probably still count up page views by source equals email or something. So you can still do that kind of analysis as well. But this kind of stitching it together into a coherent user journey wouldn’t be enabled in this fashion.
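Counting page views by source without any user identifier might look like the Python sketch below. The URLs are made up, and `utm_source` here is just a small helper written for this example; in a real setup this count would more likely be a query over the events table in the warehouse.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse


def utm_source(url):
    """Extract utm_source from a page URL, or None when it is absent."""
    values = parse_qs(urlparse(url).query).get("utm_source")
    return values[0] if values else None


# Hypothetical page-view URLs collected without any cookie identifiers.
page_views = [
    "https://example.com/?utm_source=email&utm_medium=newsletter",
    "https://example.com/pricing?utm_source=email",
    "https://example.com/blog?utm_source=twitter",
    "https://example.com/about",
]

# Aggregate per source; no user or session ID is needed for this count.
views_by_source = Counter(utm_source(url) for url in page_views)
```

What this cannot tell you is whether the two email page views came from one visitor or two, which is the user-journey gap described above.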

[00:25:03] Having said that, so we allow configuration options to do anonymous or cookieless tracking. There's some nuance into the terminology there, but essentially what we allow you to do is you to actively, as I mentioned, turn off or on User ID tracking session ID tracking even like IP address tracking at the point of collection actually in the tracking SDK on the website.

[00:25:29] One quite popular pattern might be: I land on the website for the first time and I get the consent banner, and I don't grant consent immediately because it's really small or I'm just not bothered. And I click around, I view a couple of pages, generate a few events, and then I click consent, I click agree.

[00:25:48] At that point, you then deactivate anonymous tracking and start placing those user identifiers and session identifiers. And this in theory will allow you to actually back-stitch, again assuming that you've figured out how to do this in a legal way. So you would have no user identifiers placed until consent, but you would have session identifiers in place

[00:26:11] pre-consent. Then the user grants consent, then you start placing user identifiers. And then, because we load the data into your warehouse row by row, one row per event, you can theoretically stitch back across those events that didn't have a user identifier, cuz the user's now consented.

[00:26:28] You can use your session ID to stitch that back, which is a very popular pattern. And we provide the tools to do that.
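The back-stitching pattern can be sketched in a few lines of Python. This is only an illustration under simple assumptions: events are plain dictionaries with hypothetical `session_id` and `user_id` fields, every event has a session ID, and a session belongs to at most one user. In a warehouse the same idea would typically be a join or an update keyed on the session ID.

```python
def back_stitch(events):
    """Fill in missing user IDs using the session ID shared with consented events."""
    # Learn which user each session belongs to from events recorded after consent.
    session_to_user = {
        e["session_id"]: e["user_id"] for e in events if e["user_id"] is not None
    }
    # Return new rows; pre-consent events inherit the user ID of their session.
    return [
        {**e, "user_id": e["user_id"] or session_to_user.get(e["session_id"])}
        for e in events
    ]


# Hypothetical rows, one per event, as they might land in the warehouse.
events = [
    {"session_id": "s1", "user_id": None, "event": "page_view"},
    {"session_id": "s1", "user_id": None, "event": "page_view"},
    {"session_id": "s1", "user_id": "u1", "event": "consent_granted"},
    {"session_id": "s2", "user_id": None, "event": "page_view"},
]
stitched = back_stitch(events)
```

Session `s2` never receives consent, so its event stays anonymous after stitching.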

[00:26:35] Rick Dronkers: Whenever somebody starts working with Google Analytics and then they get a few years of experience and then they, the wish for able to update the data set, like after it's being collected, that trigger is probably what drives most people to look at something like Snowplow at one point in time because that,

[00:26:53] Jordan Peck: I cannot tell you, and I'm sure most of your listeners will be familiar with it, but I cannot tell you the amount of times I've screamed at my computer when I was a GA Analyst. Like you've said, the wrong data. Oh no, it's like this forever now isn't.

[00:27:06] Rick Dronkers: Maybe with Google Analytics 4, with BigQuery, on the back of it, you can do a little bit of this, but of course the user interface will still reflect whatever you send in. But yeah, especially with use cases like this, identifying a user later on and then perhaps more interesting or is the cross device journey, right?

[00:27:26] Where eventually you figure out that this is my laptop, this is my smartphone and this is my tablet. But you don't figure that out at the same time. And maybe I've already been browsing on my smartphone first and then on my tablet, and then I convert on the laptop, and then of course the laptop gets all the conversion, cuz the rest never gets tied back to it. That's also one of those use cases where a tool like Snowplow and being able to back-stitch is so valuable.

[00:27:51] Jordan Peck: Yeah, user stitching is an extremely popular use case. It's a relatively classic standard. If you have an authenticated user who signs in and generates a user ID, you can track that in Snowplow on the web as well as in the mobile app. So even though there's no such thing as a cookie in an iOS app, if I sign in and say, I am jordan@snowplow.io.

[00:28:12] And then I do the same on desktop. Then I know that these two completely unrelated devices are now actually the same person. So I can do that cross-browser, cross-device activity stitching.

[00:28:22] Rick Dronkers: On historic data, right? Once people are logged in on all devices, the future data, that's the easy part. Also updating the old data, that's where the interesting stuff is. Yeah. Going back to this consent banner: I land as an anonymous user on the website. I did not consent yet. At that moment, I have a session ID, and then let's say I don't consent. Does that session ID get destroyed?

[00:28:45] Jordan Peck: So it will. It's a session cookie, so it's a 30-minute timeout. If you don't do anything for 30 minutes, it will expire, and the next time you come back it will generate a new one. So that does place a cookie, cause it's the only way it can keep track of the time. But session cookies are considered a different type of cookie anyway, since they expire after 30 minutes.

[00:29:04] Yeah, I think it's generally fine. You can even actually configure it down to, like, an actual browser session. So if you set the timeout to zero, it will expire as soon as you close the browser.

[00:29:14] Rick Dronkers: Probably like from a privacy point of view, probably a good option. The more I talk to people who are deep into the whole privacy topic. The more I wonder like, is anything allowed? So even like a session cookie? We're gonna park that for this discussion.

[00:29:29] But I do the setup where, okay, I come in as an anonymous user or a first-time visitor. I have a session cookie. If then, a few clicks in, I do decide to accept, to opt into everything, then the user cookie gets placed. And then later on, in the processing of the data, the few hits that I had before consenting get stitched to it. And otherwise, if I don't opt in, it just stays separated.

[00:29:54] Jordan Peck: Yeah. And as far as the Snowplow user, like an analyst in BigQuery, knows, that's just a session. We can count it as a session. You can even, and this is the wonderful thing about SQL, populate a random value as a user number.

[00:30:09] So you can still like count distinct users. We even go one stage further. So one of the options that we also have is a consent context where I won't go too much into our context, but the idea here is that you can actually attach to every single event you collect. You can attach the level of consent that was granted.

[00:30:29] So you then have all your events row by row, and then a column in your warehouse that says: this event was not consented. This event was not consented. This event was consented to marketing but not advertising. This one was consented to everything. And you actually bundle that in. We call it self-describing data, right?

[00:30:48] You can actually bundle that into the event that you send from the browser to your pipeline. So that becomes really easy to have. Say you're a marketing analyst, you're a digital ads guy. You do banner ads or something. You wanna send an audience up to DoubleClick for a customer match.

[00:31:02] You can filter the consent column for just marketing and advertising, generate your segments and export them up to Google Ads, excluding everybody who didn't consent to it.
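As a rough illustration of filtering on a consent column before exporting an audience, here is a small sketch. The `consent` field and the scope names `marketing` and `advertising` are assumptions made for the example, not a real Snowplow schema.

```python
def audience(events, required_scopes):
    """Return the distinct users whose events carry every required consent scope."""
    required = set(required_scopes)
    return sorted({e["user_id"] for e in events if required <= e["consent"]})


events = [
    {"user_id": "a", "consent": set()},                         # nothing consented
    {"user_id": "b", "consent": {"marketing"}},                 # marketing only
    {"user_id": "c", "consent": {"marketing", "advertising"}},  # everything
    {"user_id": "c", "consent": {"marketing", "advertising"}},
]

# Only users who consented to both scopes end up in the export segment.
segment = audience(events, {"marketing", "advertising"})
```

Because the consent level travels with every event, the filter is a plain column condition rather than a lookup against a separate consent system.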

[00:31:13] Rick Dronkers: I do this right now for an implementation where I think we use Cookiebot, but yeah, they all give the same kind of output. The consent management platform basically gives you a couple of values back: what did you consent to, yes or no? And then we pass it along to GA4, not as a custom dimension, but in the BigQuery export.

[00:31:34] We also pass the unique consent string from the consent management platform, so that when you get a data deletion request, at least in theory, you could delete it all. You still have to build a mechanism to actually do that. But at least the key is there to actually be able to do it.

[00:31:52] Jordan Peck: When this first kind of came up in my career, we were using OneTrust, a similar consent platform, but that was all the consideration there was: use OneTrust to fire or not fire events as a blocking method inside GTM, rather than what we've just described, actually bundling that consent level into the data itself.

[00:32:15] It's the next logical step. And as we move more into this sort of space, and we work with a lot of customers all over the world. In America, they're not quite as concerned about this, because they haven't had anything hit them as hard as GDPR has.

[00:32:28] But in Europe especially, it's obviously a very, very big concern. It's something you really should be considering. Whichever tool you're using, like you're saying, Google Analytics, you really should be bundling in the consent. We even give you the option to, let's say you've got multiple versions of your privacy policy or your documents, refer to them.

[00:32:45] You can even bundle which version of that document into your tracking. So as you make an update to your privacy policy, you change your tracking. So now we're referring to version six, not version five, of the privacy policy. So you can actually audit through history how your users have consented over time and what they've consented to.
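A sketch of what attaching a consent level and a policy version to an event could look like. It uses the self-describing JSON shape (a `schema` reference plus a `data` payload), but the schema URI `iglu:com.example/consent/jsonschema/1-0-0`, the field names and the helper functions are all hypothetical and only illustrate the idea.

```python
from collections import Counter


def with_consent(event, scopes, policy_version):
    """Return a copy of the event with a consent context attached."""
    context = {
        "schema": "iglu:com.example/consent/jsonschema/1-0-0",  # hypothetical schema
        "data": {"scopes": sorted(scopes), "policy_version": policy_version},
    }
    return {**event, "contexts": event.get("contexts", []) + [context]}


def consent_audit(events):
    """Count events per (policy_version, scopes) pair for a simple audit view."""
    counts = Counter()
    for e in events:
        for ctx in e.get("contexts", []):
            data = ctx["data"]
            counts[(data["policy_version"], tuple(data["scopes"]))] += 1
    return counts


events = [
    with_consent({"event": "page_view"}, {"marketing"}, 5),
    with_consent({"event": "page_view"}, {"marketing", "advertising"}, 6),
    with_consent({"event": "purchase"}, {"marketing", "advertising"}, 6),
]
audit = consent_audit(events)
```

Keeping the policy version next to the consent scopes is what makes it possible to say later which wording a user actually agreed to.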

[00:33:03] Rick Dronkers: I think it's a no-brainer that if you wanna continue to collect personal data, and there was a form of consent, which I think there probably should be if it concerns personal data, then documenting that consent will also make your life a lot easier if you ever need to explain something or do an audit.

[00:33:25] Jordan Peck: So I think people who don't do it today will regret it in some years' time. It will very much bite them in the backside.

[00:33:34] Rick Dronkers: Probably, unfortunately, the people that follow them in that job will regret it. That's likely what's gonna happen, right? [Laughs]

[00:33:41] Jordan Peck: That's probably much fairer to say, Rick. I guess also on the topic of people asking, are you GDPR compliant, etcetera: all of the tooling that I just talked about, and maybe this goes without saying, you can just abuse it. People say, like, Snowplow's GDPR compliant.

[00:34:00] Well, if you send unconsented data and/or personally identifiable information of your users to Snowplow, just using Snowplow doesn't make you GDPR compliant. You can still abuse the tracking policies. You can still not listen to user consent and what users' wishes are.

[00:34:18] I shouldn't be here, and I'm certainly not gonna be here, telling your listeners that if you sign up to use Snowplow you're gonna be GDPR compliant. Cause that's just not true. We offer a number of ways of dealing with sensitive data or private data that users generate.

[00:34:34] So we've talked a lot so far about the tracking side, the data collection side. We also offer a number of tools in the middle of the pipeline, during the processing, over and above what we discussed earlier about it already being in your environment.

[00:34:47] So one of the things that we offer. The big thing that kicked off the GA issue, who was it first? Was it France or Austria? They were the first one. Yeah, Austria was the first one. It was because GA contained IP address information and it was going off to Google. So one of the things that you can do is, at a tracker level, you can block IP addresses from ever being stored or collected.

[00:35:08] I do find it funny talking about IP addresses to certain privacy people, because every HTTP request made on the internet has an IP address on it. That's how the internet works. You can't just not collect IP addresses. That's not how it works. But you can disable storing it right at the level of the tracker, which is quite appealing.

[00:35:30] We also offer real-time truncation, so if your IP address is 1.2.3.4, you can say, I'll truncate the last octet. So it's 1.2.3.x. So you don't know the user's IP address, but you know certain users are on the same block, for instance, or on the same subnet. I don't know networking. So you can still say, these groups of users, maybe they're all in the same office or something.
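The truncation described here can be sketched in a few lines. This is only an illustration of the idea, not the pipeline's implementation; the function name `truncate_ipv4` and the `x` placeholder are assumptions.

```python
import ipaddress


def truncate_ipv4(ip, keep_octets=3):
    """Replace the trailing octets of an IPv4 address with 'x'."""
    if not 0 <= keep_octets <= 4:
        raise ValueError("keep_octets must be between 0 and 4")
    # IPv4Address raises ValueError for anything that is not a valid address.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(octets[:keep_octets] + ["x"] * (4 - keep_octets))


masked = truncate_ipv4("1.2.3.4")          # keeps the first three octets
coarser = truncate_ipv4("1.2.3.4", 2)      # keeps only the first two
```

Two addresses that share the same first three octets produce the same truncated value, which is what still allows grouping users by network block.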

[00:35:55] You can do that in real time. The other one, which I think is really cool, is we do real-time pseudonymization. So let's say you sign up to a website, and the user ID is an email address, as is common for most places. You can actually hash that value in real time whilst the event is being processed, before it lands and is stored at rest.

[00:36:19] So you can choose everything. I think we support like MD2, MD5, SHA1, all the way up to like SHA256. You can salt it as well. User ID is an obvious one, but you can do any field, any value that you collect. With Snowplow, you can do real-time obfuscation of it. And what's great about that is, because you hash it, every unique value will still remain unique.

[00:36:41] So you can still do count distincts, you can still count unique users, count unique account IDs or something, or addresses or something. And if you've got that data in the backend somewhere, like, again, we look at email address as an example. If you've got the users' email addresses in the backend system that you wanna join with Snowplow, all you have to do is use BigQuery's SHA256 function, rehash all of your emails from your backend database, and then you can run effective joins and still merge all that data without necessarily ever knowing what the email addresses were.

[00:37:11] Which is very attractive to people who are concerned about holding onto information like that.
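A minimal sketch of that hash-then-join idea using Python's standard `hashlib`. The salt value, the way the salt is appended, and the field names are assumptions for the example; whatever the pipeline does, the backend has to apply exactly the same salt and hash for the join to match.

```python
import hashlib

SALT = "example-salt"  # hypothetical; must be identical on both sides of the join


def pseudonymize(value, salt=SALT):
    """Return the hex SHA-256 digest of the value with the salt appended."""
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()


# Events as they would land in the warehouse: only the hash, never the email.
events = [
    {"event": "page_view", "user_hash": pseudonymize("jordan@example.com")},
    {"event": "purchase", "user_hash": pseudonymize("rick@example.com")},
    {"event": "page_view", "user_hash": pseudonymize("jordan@example.com")},
]

# Backend rows still hold the raw email; rehash them to build the join key.
backend = [
    {"email": "jordan@example.com", "plan": "pro"},
    {"email": "rick@example.com", "plan": "free"},
]
plan_by_hash = {pseudonymize(row["email"]): row["plan"] for row in backend}

joined = [{**e, "plan": plan_by_hash.get(e["user_hash"])} for e in events]
distinct_users = len({e["user_hash"] for e in events})
```

The same input always yields the same digest, so distinct counts and joins keep working on the hashed column. When doing the rehash in BigQuery, note that its `SHA256` function returns bytes, so the result usually needs converting to a hex string before comparing it with a hex digest.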

[00:37:17] Rick Dronkers: I think that's attractive not only from the privacy point of view, but also from the security point of view. Like, if you follow security Twitter, which is always fascinating to me, you can follow all these hackers who are reporting on what they find.

[00:37:29] Jordan Peck: Anyone who isn't following @malwaretech on Twitter, go find him. He's fantastic. He's the guy who found the ransomware that hit the NHS and deactivated it for a while. He's great.

[00:37:38] Rick Dronkers: Zach Edwards is also a researcher on that topic. They publish great content, but if you follow them, then you become aware of how many things get hacked. Companies obviously don't want this stuff to get out, so they try to downplay it as much as possible.

[00:37:52] But it's probably best to operate from the assumption that you will get hacked eventually. So hashing anything you can is probably the route you wanna go down.

[00:38:03] Jordan Peck: Everyone. One of the sayings I heard from a technology person speaking once was: build things like you're being attacked, because you will be. You cannot assume that no one is gonna try and get into your data, because they will.

[00:38:14] And even if you think no one's gonna be interested in our web behavioral data in our BigQuery, I don't wanna sound offensive, but it's pretty naive and a pretty poor way to think about it. You always have to assume the worst and code for the worst. SHA is the de facto way nowadays, SHA-2. Don't do anything less. Don't do SHA128. Definitely don't do SHA1. 256 or 512, SHA512, sorry, definitely is the way to go for anything that you think might be sensitive data, really.

[00:38:44] Rick Dronkers: I also used to think, for a lot of my clients, there's no value in this data set, except perhaps to their direct competitors. But then once I started following this Zach Edwards that I just told you about, he is basically exposing online how they are using all those identifiers to mimic real hits. So there's value to all data sets. We just don't realize what kind of nefarious use they're gonna take it for.

[00:39:11] Jordan Peck: One of your first questions was about user identifiers and the browser and cookie values. You can hash them using the same mechanism, so you can use the same PII pseudonymization enrichment. We actually do this on our own site. If you wanna go and look, well, I won't say, but all of the cookie values, which are normally UUIDs, are obviously stored in the browser.

[00:39:29] When we collect them, we hash them with SHA256 with a salt. So to us, it's just a random string. As analysts, we don't really care. You can still do count distincts, you can still do session stitching, but I'm looking at a completely different user ID value, cookie ID value, than what actually is stored in the user's browser. So I can't even theoretically find out what the user's cookie is.

[00:39:53] Rick Dronkers: Now almost all browsers have locked down what third-party scripts can do. But before, with the share buttons that everybody included on their website, they would just harvest everything they could find and then figure out how you would browse the web. Those were the real privacy-infringing techniques, and that could have been stopped by these kinds of techniques.

[00:40:19] Jordan Peck: Cookie stealing is still a valid attack on web browsers, slightly less now because it requires cross-site scripting and most people are a bit aware of it. But essentially, if you can remotely execute some JavaScript in a user's browser and get back their cookie, you can impersonate them and you can pretend to be them

[00:40:36] when you go and try to log into their Amazon. So we're taking a bit of a tangent, but I find this stuff really interesting as well.

[00:40:42] Rick Dronkers: People who listen to the podcast probably also do. Like, we're the audience, right? So this is good. If I distill all of this: basically options and more flexibility. And also, you guys developed most of it, right? I think it's open source, so probably there are also contributions from others, but I think the Snowplow core team is definitely doing the bulk of it, I would say.

[00:41:06] Jordan Peck: Yeah, we do, we get contributions back every so often. It depends on the piece of tech, right? The Scala SDK doesn't get a lot of contributions outside the core team, because not many people are doing tracking in Scala or something. But the JavaScript tracker, the iOS and Android trackers, we get quite a lot of contributions. We had a contribution from a customer last week on one of our dbt models. That's really cool. But yeah, to answer your question, it's mainly the core engineers. Yeah.

[00:41:32] Rick Dronkers: So the value is there already. It's not like building from scratch, which would be the most extreme alternative, right? [Laughs] This is, I feel, in between. Off-the-shelf SaaS, right, a Google Analytics snippet, throw it on, is on one end, build it yourself is on the other end of the spectrum, and Snowplow is right in between there.

[00:41:53] Jordan Peck: We get quite a few prospects every now and then who have come from a world where they have done that. They've built their own. They've gone, we're a big company, we've got a big set of engineers and SREs, we reckon we can do this ourselves. How hard can it be? I would potentially argue, in fact I would certainly argue, that first of all we're better at it, cuz we've been building a product to do this specific type of thing for 10 years now.

[00:42:18] We've been around for 10 years now. Yeah, I'm sure you've got really good engineers and SREs and developers, but they've been building your products, they've been building your things. Why do you think they would be as good at this as we would be? Which might sound a bit arrogant, but I don't think it's an unreasonable thing to say. But I would also potentially argue that if you have rolled it yourself, and I actually mentioned this in my MeasureCamp talk as well, there's probably loads of things that you haven't even thought of that will only become apparent to you when you start building that.

[00:42:48] And privacy and security are two big things that you'll have to consider when you're building your tooling. Data access, like persistence, retention policies, access rights. What values are you storing? What's the likelihood of those values being unique, the likelihood of them being tracked back to who the users were?

[00:43:08] That's so much stuff. And that's not even really considering actually building the core functionality, right? That's not even thinking about how you actually track page views, or link clicks, or conversions, right? And how you manage it and stuff. So I would potentially argue that even if you do go to the other end of that spectrum, build it yourself, you're still potentially liable, maybe I shouldn't use the word liable.

[00:43:26] You're still potentially running the risk of not building it in a privacy-centric or secure way. Whereas obviously we've taken a lot of time and a lot of effort building these features to give users this functionality.

[00:43:38] Rick Dronkers: You told the story of Alex and Yali thinking, we should build this ourselves. The first thing I thought was, well, they probably thought that, and then they figured out, oh shit, this is quite a bit harder than we thought. [laughs]

[00:43:48] Jordan Peck: Yeah, I know. Yeah. Actually, for something we were doing internally, I had to go back to the very first commit on GitHub.com/snowplow/snowplow, which was back in 2012. And it's a very different product. [laughs] It was extremely basic. It was a JavaScript tracker, a collector application, then an S3 loader and a Redshift loader.

[00:44:13] I think that was it. Literally, like, four components. And now we have like 27 tracking SDKs in 27 different languages, two clouds, four warehouses. It's a really, really big tech estate we have these days.

[00:44:24] Rick Dronkers: The privacy and the security are two things, but then also just think about having to deal with all these different browsers and different browser versions, supporting all that. And then the different, you know, mobiles. It's like whenever a company says to me, yeah, we're gonna build our own API extractor to get the data out of Facebook and Google and whatever.

[00:44:44] I'm like, okay, but please, I would use Stitch or Fivetran or Supermetrics or whatever, because you don't wanna support it. Like, why would you? Right. It's probably not gonna be worth it.

[00:44:56] Jordan Peck: I know, I see that. I am guilty of doing things like that. In the past I wrote extractors using R to do stuff like that. We had people in my organization doing the same with Python scripts, and I think that's very prevalent. I think there are a lot of people around the world who do that. But just give it to someone who's done it.

[00:45:13] Rick Dronkers: Yeah, and who will maintain it. You know that when Facebook changes the way the API works, you're gonna find out three weeks later and it's already been broken for three weeks. That stuff's gonna happen. Most of the time it's not worth it.

[00:45:25] So in that case, where Snowplow is positioned, in between building it all yourself and off-the-shelf SaaS, that middle ground I feel is a good place to be, or I think it has become a good place to be. I think Snowplow grew into that role, so to say.

[00:45:46] Jordan Peck: Hmm. I think, yeah, I think so. One of our positioning messages is that you get a tool built by engineers of the quality of the Netflixes and the Amazons of the world, the Spotifys of the world, who have built all of this stuff themselves because they have an army of developers who they can dedicate to this particular task.

[00:46:03] Most organizations can't dedicate that kind of resource. They don't have the money, the people in place, the organization, size, time, priorities. They can't do that themselves. They can come to us and get a product of that scale and complexity and flexibility, but still run it in their own cloud, right?

[00:46:21] Still have it integrated with their own applications. And build their own recommendation engines, build their own personalization decision engines and stuff like that. There's lots of companies who want to do those things that don't have the engineering capabilities to even know where to start.

[00:46:35] If we can help 'em get to that point without having to worry about maintenance and getting up in the middle of the night and at the weekend when your service is falling over, and having expensive SREs monitor it 24/7, if we can take that away from them, then we think we add a lot of value.

[00:46:51] Rick Dronkers: Well, that ties in nicely to what I wanna ask you. Let's say people listening to this podcast have an affinity with digital marketing, digital analytics, and they are interested in this privacy topic and how it's affecting them.

[00:47:04] So now they're considering Snowplow, right, after this raving review from you and me. They have two options, right? They have the open source option, or they get it via you guys and have you guys run it for them. So take that last option first, cause I think that's the easiest. It will cost them money to hire you guys. What would they need on their end? You guys run the stack for them, right? So you take that out of their hands. What would you recommend for, like, the marketing manager or the CMO listening to this, who's like, okay, maybe I wanna go down that route? What kind of people would he need if he goes down that route?

[00:47:42] Jordan Peck: Most of these businesses will have like good front-end resource. We're seeing it become more popular, to actually build Snowplow tracking into your application, into your website, into the source code. The front-end developers do that. A lot of benefits to that. Loads faster, less weight in the browser.

[00:47:58] Nice developer experience, developer checks and tests, all that kind of stuff. But we can also run in GTM. We published a GTM template which generates a GUI inside GTM for adding a new tag. So if you've already got front-end developers, or you've already got GTM analysts and people who can do that.

[00:48:15] You can get up and running with Snowplow. That shouldn't be a blocker. Most companies have that kind of well covered at this point. What I would say, from my experience, is that the gap I see the most in organizations, that maybe prevents Snowplow from being a fantastic fit for them right now, is on the opposite end of the pipeline.

[00:48:33] It's consuming the data in BigQuery. I will actually have to tip my hat to Google here. Giving the BigQuery export out for free with GA4 is a wonderful thing. I know it's not free anymore, it is free to a limit, but the idea that they've put event-level data in the hands of more people, so more people are getting used to writing queries in BigQuery and using BI tools like Looker and generating cool data models in the data warehouse with web and behavioral data and mobile data, is fantastic.

[00:49:03] Well done, Google. I think it's a really, really good move. It is growing in that space. People maybe like ourselves, who are from digital marketing, digital analytics backgrounds, wouldn't have those skills if that hadn't happened. So that is a space where you do need people.

[00:49:19] You need people who are comfortable writing SQL. There's a tool, you've most likely heard of it, called dbt, which we leverage quite a lot for writing SQL data models, converting the event-by-event data in a very, very deep, very, very wide table into a set of more easy-to-understand, easy-to-consume tables that are easier for an analyst to look at, easier for an analyst to query and easier for your tools to consume.

[00:49:44] Because you don't want Tableau having to run very, very complicated table calculations or whatever to calculate something very straightforward. You want it to query a very nice, sanitized, clean, aggregated table. The title of these people has now basically been settled on as analytics engineer.

[00:50:03] The idea of someone who can sit and write production-grade SQL against raw data sets to convert them into simple-to-use tables. That is borderline mandatory, I would say, because that's the main way that we deliver data. We think we deliver the best quality data to the data warehouse from behavioral applications, but it isn't always usable straight out of the box.

[00:50:25] It depends on your use case. And we do have dbt models that do a lot of this for you. But yeah, if you want to be able to go, okay, we've just launched a new feature on our website, we want a funnel chart to show usage of that new feature over time, sliceable by device and by customer type. Well, someone's gonna have to turn that event data into something that Tableau or Looker or Holistics or Power BI or whatever tool you're using, Data Studio, whatever, can consume. Yeah, that's probably the biggest thing that you'd wanna look for. And then there's other things, like being able to translate business requirements into tracking designs.

[00:50:59] We have this idea of custom schemas, where we have custom events and custom entities. You decide what a user looks like. You decide what a product looks like, you decide what an organization looks like. You decide what a listing looks like, whatever these things are on your website or your apps.

[00:51:14] You decide what they are, you decide what all the properties are and how they should look. And then you translate that into actual tracking code. And then that's what will land in your warehouse. So there's linking up business data strategists, those kinds of people, business analysts, who convert or translate business requirements into

[00:51:34] Snowplow entities, Snowplow custom schemas and concepts that can then be translated into tracking code, which can then be translated into a data model in the warehouse. But the main technical role is that analytics engineer or SQL engineer working in the warehouse. And then obviously, I didn't touch on it as much cause it's in place in more places, as I've seen, but yeah, the front-end people who can implement good quality tracking in your website or your applications.

[00:52:00] Rick Dronkers: I just realized we didn't mention this, but for people who are not really aware of Snowplow: with Snowplow, you don't get a graphical user interface, right? So you need to choose your tool of choice for how you wanna visualize on top of it. That's important.

[00:52:13] Like if you compare it to Google Analytics, you get the data model, you get everything that's under the hood, but you don't get the graphical user interface. So you have a lot of choice there, which is also great, right? And, and you can switch them, right? It doesn't really matter. So that's a benefit, but it is something to keep in the back of your head.

[00:52:29] Jordan Peck: Sorry, I may have glossed over that. Yeah. Looker and Tableau are very, very popular in the space. Power BI, again, very popular. There are some nice upstarts. Holistics is a nice tool. They're out of somewhere in East Asia, Singapore I think. They've got a new tool that's really nice.

[00:52:46] Looker and Tableau, and then, if you've got a bunch of data scientists, R, Python, they get access to that same data. Use ggplot if you use R, like I used to do, or whatever tool you use.

[00:52:59] Rick Dronkers: If you go down the route of hiring you guys for the data engineering part, then you have an analytics engineer, who is gonna be in between the people who consume the data, like people who need reports, who need answers, and who is basically handling that broad data set that Snowplow will deliver.

[00:53:14] And modeling on top of that, and making sure that in the end they get something in a graphical user interface of choice that they can make a decision upon. So that's an essential person to have in the organization.

[00:53:28] Jordan Peck: Exactly. You can't expect a product manager or a marketing analyst to know SQL to a high level. It's just not realistic. I think every analyst, everyone who's got analyst in their job title should know how to write some SQL.

[00:53:40] Rick Dronkers: That's not the case.

[00:53:41] Jordan Peck: It makes sense, right? They're not in it to write code. They're here to figure out what's best to do for their campaign, or what next decision they want to make on this feature that they've just launched. They shouldn't necessarily have to be able to write SQL to get those answers. So having somebody who can serve those answers to them is very important.

[00:54:00] Rick Dronkers: Okay. And let's say we go down the open source route. I don't wanna hire you guys, but I do wanna use your cool open source product. I would still need the analytics engineer, right? Because the end result will be the same. What would be the extra people I would need to set this up? Like a bare minimum kind of setup?

[00:54:20] Jordan Peck: So you would need some SREs. For everyone who's not familiar, SREs are Site Reliability Engineers, or DevOps engineers, or cloud ops or cloud engineers essentially. Snowplow is made up of multiple components. There's a collector, we have a validator, we have an enrichment app, we've got warehouse loaders. Those applications need to run somewhere. So when we deploy Snowplow, we are opinionated about how we deploy it. We have a view on what type of servers it should be running on, how it should be running, how it should be configured. That's just our approach. If you have some SREs, you can take our collector application and choose to run it on whatever server you wish.

[00:54:59] So if you're using GCP, maybe you'll run it on App Engine, maybe you'll spin up a cloud VM. Maybe you'll use managed Kubernetes, which is what we do. You'll run it there, and you'll need SREs who are familiar with Kubernetes, or if you're on Amazon, like EC2 or whatever, to be able to set those servers up, install the application and all the other applications, and network them together properly so they speak to each other and they do the right thing in the right order.

[00:55:26] A data engineer as well. So the difference, I would say, between a data engineer and a site reliability engineer: the data engineers, in our case anyway, build those applications. They build our collector, they build the data warehouse loader. And then the SRE

[00:55:40] would configure what that application runs on. So if you're running that tech open source, maybe you wanna tweak what we've done, cuz you have some other use cases. Having a data engineer who's familiar with Scala or Java helps. Most of our components are written in Scala, which is very popular in the data engineering space.

[00:55:56] Having some Scala engineers who can look through our code and make any necessary adjustments or hacks, if they need to or want to, is probably also not a bad idea, but definitely DevOps or SREs in terms of setting it up and running it and making sure it keeps running.

[00:56:12] Rick Dronkers: In that case, you're talking about a team of at least three people, right? Including the analytics engineer: three, likely four.

[00:56:19] Jordan Peck: Yeah, I would say so. Depends on your scale. We have customers that send 5 billion events a day to their Snowplow pipeline. And that pipeline is significantly bigger than someone who sends hundreds of millions a month. And also more expensive. [laughs] But also like potentially more business critical, right?

[00:56:36] Which means it needs to be up all the time. Now, we manage the pipeline for them, but the bigger the scale, the more volume that you are plowing through your pipeline, the more servers you need to cope with it. You need to network together more physical bits, physical machinery, to keep them running all the time. So I don't think it's quite linear, but as your volume goes up, you will start to need more and more people to manage it.
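
[Editor's sketch] To put the two volumes Jordan mentions side by side, here is a rough back-of-the-envelope conversion to average events per second. The 5 billion events a day figure is from the conversation. The 300 million events a month figure and the 30-day month are assumptions picked to stand in for "hundreds of millions a month". Real traffic is bursty, so peak rates would be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_second(events: float, days: float) -> float:
    """Average event rate for a total event count spread evenly over `days`."""
    return events / (days * SECONDS_PER_DAY)


# Figure quoted in the conversation: 5 billion events per day.
large = events_per_second(5_000_000_000, days=1)

# Assumed stand-in for "hundreds of millions a month": 300 million over 30 days.
small = events_per_second(300_000_000, days=30)

print(f"large pipeline: {large:,.0f} events/s on average")
print(f"small pipeline: {small:,.0f} events/s on average")
print(f"ratio: {large / small:,.0f}x")
```

Under these assumptions the larger pipeline averages roughly 58,000 events per second against roughly 116, a factor of about 500.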

[00:57:00] Rick Dronkers: Which makes sense.

[00:57:01] Jordan Peck: Yeah. And the business criticality increases. Because for some people, if Snowplow goes down for a few hours, that's fine. They're doing daily reports.

[00:57:13] That's fine, right? We can wait for it to come back up online, re-batch everything, and then we're up and going. Some businesses use Snowplow to power their own product and monetize the data that comes out of Snowplow, and they can't afford any downtime. So as business criticality goes up, the more people you'll probably wanna dedicate to that.

[00:57:31] Rick Dronkers: I especially wanted to get your insights on the entry levels, right? Because then people can imagine, okay, what are we talking about? Like from a budget point of view: okay, you would need this amount of people to consider that move, to see if it makes sense for you.

[00:57:46] Jordan Peck: If you're interested in open source, then the route I would recommend is something that we call Open Source Quick Start. So we've got AWS Quick Start and GCP Quick Start. These are sets of Terraform scripts. So Terraform is infrastructure as code. Essentially, it's a set of commands that will go off to your AWS account and spin up all of the things for you.

[00:58:10] You run two commands in a terminal window, and you could have a full Snowplow pipeline up and running for you. It'll be all the components. It'll be all of the applications within the pipeline. So the collector, validator, enrich, a loader, and the database for you to query data. It won't be production scale.

[00:58:28] It's designed to be quick to spin up. So I wouldn't point a full production website's traffic at it, but if you want to see what our open source looks like, and you want to see what applications run and what they're doing and how they work together, I think it's github.com/snowplow/quickstart/examples.

[00:58:49] Rick Dronkers: We're gonna put it in the links.

[00:58:52] Jordan Peck: Yep. Just clone that repo. Choose whichever you want, AWS or GCP. You do need access to an environment. You fill in a few variables and values in a couple of the scripts, TF apply, and then wait a few minutes and you've got a Snowplow endpoint to send data to.

[00:59:09] Rick Dronkers: I have personally done this, mucked around with it a bit. It's a lot of fun. It's a nice weekend project to explore.

[00:59:15] Jordan Peck: I demoed this in my MeasureCamp talk, in real time, in front of everybody, off of my phone's Wi-Fi, cuz I couldn't get the building's Wi-Fi to work. And I also had to connect to my company's VPN to get it to work. It still worked, and it spun up in, I think, about four or five minutes.

[00:59:32] It's quite a nice, fun little thing to try. And if you are considering Snowplow, either open source or coming to us to manage it, it's a nice way to look under the hood and see what it's doing. You can send data to, I think we spin up a Postgres database, so it'll send data to there.

[00:59:47] You can run some queries. We load in real time as well, so you can literally put the tracking on your website in GTM, click a few things, and then do SELECT * FROM events and you get all the things that you've just triggered. So yeah, if you want to get a flavor of what it looks like and what it's like to use, that's a really good approach.

[01:00:03] Rick Dronkers: Definitely the place to start. Okay. Last question, then we're gonna drop off. How about running all of Snowplow not on AWS, not on GCP, but on some yet unknown to me, EU-based cloud provider?

[01:00:25] Jordan Peck: Oh, on premise?

[01:00:27] Rick Dronkers: No, not necessarily on premise, but fully EU based. I don't know if there is a good competitor to AWS and GCP in the EU, but let's assume there is. Is that even a possibility? Is it on the roadmap?

[01:00:39] Jordan Peck: So it's not really possible right now. The reason for that is we leverage some cloud-specific technology on each of AWS and GCP. So on AWS we leverage Amazon Kinesis for the real-time stream. Same on GCP: we utilize Pub/Sub as our streaming platform. A hypothetical EU-based cloud company doesn't have those services, so we'd have to build our real-time application using whatever that cloud could provide.

[01:01:07] However, it is definitely on the roadmap. What we're trying to do, and I hope I don't get too technical for people, is basically take all of these applications that we've built and make them transportable. Rather than being tied to AWS services or tied to GCP services, essentially make each of them a Dockerized container, which you can then run in Kubernetes.

[01:01:29] And if it's a Dockerized container, you can run it anywhere that you can run Docker, which is basically anywhere. So on a hypothetical EU cloud, obviously you need managed Kubernetes or a way of running servers. Spin up Docker, install our applications, all the individual applications as Docker containers, and now you can run Snowplow, in theory anyway, anywhere you wish. You could run it

[01:01:51] on a separate cloud. You can run it on Azure, you can run it on prem. A lot of businesses still run a lot of things on premise. If you want, you could hypothetically run it there as well. We're unfortunately quite a bit away from that being a reality, simply because we've got such a big tech stack, right?

[01:02:06] We've got such a huge estate, and we need to battle-test it. We need to make it production ready. However, it's definitely on our roadmap. I'm not gonna say roughly when I think it will be, because my engineering team will shout at me. But it is definitely something that we're working towards.

[01:02:22] Rick Dronkers: Okay. Really cool. Cause yeah, from the privacy angle, the big fear for me as, like, the marketing manager of a certain company would be: oh no, I've now migrated to Snowplow to get away from this, but I'm still hosting Snowplow on AWS or Google, and it's still a US-based company, so it didn't result in any positive outcome for me. So being able to do it on an EU cloud would be really the best outcome.

[01:02:48] Jordan Peck: There's a lot to be said, isn't there? I mean, there was a time, I think a while ago, where people were saying the privacy advocates are coming after the public clouds. Because Amazon are still an American company, Google are obviously an American company. So even if the servers are in the EU, you're potentially sending data to these American companies.

[01:03:07] I think that's slightly overhyped. I don't think that's likely. You could do things like bring your own key on Amazon and GCP, so even Amazon or Google can't break through your encryption and can't access your data. And also, if they did that, then the internet would stop working. [laughs] If you decided that you're not allowed to use Amazon or GCP, then the internet in Europe stops working.

[01:03:29] Rick Dronkers: It seems like we were gunning for it. So, if you could be one step ahead, I think the solution you just described, with everything running via Kubernetes and Docker: if people wanna go the extreme route and they find some cloud host in their own country where they can host it all and feel totally, a hundred percent safe about it, I could imagine, if you're gonna make this investment, you might wanna take that extra step as well.

[01:03:53] Jordan Peck: That's very future proof, right? Like I said, you could just spin up some servers in your office and run it there if you really wanted to. Then you're really not risking anything. But that's a pretty hard line to take, I would say. But yeah, it depends on your appetite for risk, I guess.

[01:04:08] Rick Dronkers: Cool that you guys are at least working on it. I've taken up a lot of your time. It was a cool talk. Thanks a lot. Is there anything else you wanna share before we drop off? We'll throw all the links in the show notes to everything that you referenced. But is there anything else you wanna add?

[01:04:28] Jordan Peck: Nothing much more to add. Just, I think you're doing a great job of bringing a lot of this to more mass attention. As I mentioned earlier, just because you use X tool does not mean you're GDPR compliant. Look at the mechanisms that are available to you to make sure you're doing this correctly. We put a lot of effort into options and flexibility to make those things possible. But whatever tool you're using, make sure that it's front of mind, more than ever. I think it is; I think people are starting to realize that this can't be an afterthought.

[01:04:58] Rick Dronkers: I think that's a good thing to close the podcast on. Thank you for joining.

[01:05:02] Jordan Peck: No worries. Thanks Rick. Thanks for having me.

[01:05:04] [MUSIC SOUND EFFECT BEGINS AND FADES]

Life After GDPR EP015 Transcript