We sat down with engineering leadership to discuss analytics pipelines and serverless cloud infrastructure versus self-managed infrastructure.
James Trunk - Head of Engineering, Znipe esports
Daniel Hilmersson - Senior Engineering Manager, Spotify
Marc Marais - Engineering Manager, Tink
Thomas Nilsson - CTO, Marakanda
Daniel Hilmersson's Question:
"How do you design a data infrastructure and organisation so it does not become the bottleneck of your new data-driven organization?"
So partly, this comes from my last two companies, where I've been part of trying to build up data organizations: forming teams and recruiting data engineers and data scientists, looking at pipelines and all the classic infrastructure. But then I came across a really interesting paper from someone at ThoughtWorks. I'm not sure what her name was, I don't remember, but it's on data mesh. And it kind of resonated with me, with the struggle I see and also the tendency I'm starting to see, where you build all these nice pipelines, you build this infrastructure, but all of a sudden you're the bottleneck for the organization, because they realize that you can give them stuff but you don't have enough engineers. You don't have the insight. Sometimes, depending on whether it's a mobile app, it's about getting the instrumentation and the tracking in, or whether you're doing it in a backend service, and so forth. So I kind of wanted to pose the question: what are your thoughts on this? How do you not become the bottleneck of your organization by being the only one who can work with the data? Because I think, at some point, it will block you from being truly data-driven. So that was kind of the background for the question.
Marc Marais' Question:
"How do you review the differences between running your pipeline on serverless cloud infrastructure versus self-managed infrastructure?"
So the context is that my data engineering setup is, I think, quite small at the moment, but we're scaling quite quickly. In fact, too quickly in a sense, and now it's a matter of how we determine how those pipelines should be built. Is it better to build them using a serverless architecture? Or should we be looking to run less managed or self-managed services? So, let's say we take Spark: it's a choice between a managed Spark instance, or EMR in AWS, or something like Glue.
James Trunk's Question:
"What’s your process to go from data collection to generating and sharing actionable data insights?"
Yeah, I guess this is kind of the classic question around metrics and data and dashboards and pipelines and big data, which is that it needs to be usable, right? At the end of the day, it needs to be something that pushes the business forward. Otherwise, what value is it adding? And I think that's been a challenge for a few of the companies where I've been part of this process. At Znipe, we're quite early in our data pipeline lifecycle; it's basically this year that we've got it in place. So I think this is a question that we're going to come up against pretty soon. We spent a lot of time in the beginning thinking about what metrics we care about, making sure that they're connected to decisions that we could make around the business and around the product. So we've put a lot of time in, but I'm just wondering how it is to maintain that over time, and to not let stakeholders push us towards things that they feel they have to have, where they don't want to give you a good explanation of why they need it; they just need it, and they need it fast. And then over time, you're polluting the metrics and the dashboards with things that aren't useful. I was just wondering if people had experienced that, and what kind of strategies they use to try and keep the quality really high even after this initial honeymoon period that we're in now?
Thomas Nilsson's Question:
"How to go from R&D pipelines to proper production end-to-end ML pipelines. Pitfalls, architecture and configuration management?"
This is the question where we have had most of our discussions in the team, among our tech leads and architects, while trying to design our production pipelines. Again, as a startup, we've done the R&D, we have produced models, we're getting the AUCs that we want, and so forth, and we've been ad hoc: we've been running things in notebooks, and we're moving into Glue, etc. But I'm not a data engineer. Where I come from, I think about traditional software development and configuration management, how to isolate components and look at dependencies between components. I have this traditional way of thinking, and moving into the data domain, especially when you're doing machine learning, it just seems like, well, there's this famous quote from some paper, right, that changing anything changes everything. Have you heard that story? So it just seems like, unless you're really careful, your pipelines basically become one gigantic monolith where you cannot change anything. And then you need to keep in mind that the smallest change can actually destroy the AUCs of your beautiful models in the end. So obviously, we cannot have this gigantic monolith monster of all the pipelines. You need to make some trade-offs, partition these pipelines into smaller architectural pieces, and do configuration management on those smaller pieces. So I'm just interested in hearing people's experience. What kind of trade-offs do you make? Do you actually run big monoliths? Or can you deconstruct your pipelines? And what are your thoughts around configuration management?