Knowing that you have data and can use it is one thing, but how do you get the best from it? In this episode, Zoe Cunningham shares some insight into how you should approach data before delving into using it for machine learning and other functions. Zoe is joined by Softwire Head of Consulting and Director, Jonathan Artus, and Technical Lead, Rob Owen.
Read the summary of this podcast episode: What is Data Engineering and how can it help your business?
Techtalks brings you industry insights, opinions, features and interviews impacting the tech industry. Follow us on SoundCloud to never miss an episode.
***
Transcript
Zoe: Hello, and welcome to Softwire TechTalks. I’m Zoe Cunningham. Today, I’m delighted to welcome Jon and Rob. Jon and Rob, can I ask you to introduce yourselves? Tell us what you do at Softwire and maybe also an interesting fact about yourself.
Jonathan: Hi, so yes. I'm Jonathan Artus. I run the consulting team at Softwire. I've been around for about six years now, and I was the person who came up with the idea of doing Softwire's first-ever real escape room. We basically demolished a meeting room, put in an escape room, and made over 100 people escape from it over the course of about a month.
Zoe: Amazing. We have a whole podcast about our escape room-type things that we have gone on to do since that escape room, which was amazing. Great fact, Jon.
Rob: Hello there. My name's Rob Owen. I am a technical lead at Softwire. I've been working here for about three years now, and I specialize in cloud technologies and data engineering, and data storage technologies more broadly. An interesting fact about myself is that in 2015, I ran a marathon through the Sahara desert.
Zoe: Wow. That’s hardcore. Was it really tough? Obviously, it must have been really tough.
Rob: It was difficult, and it got harder because you start running early in the morning, and the longer it takes, the warmer it gets. I was not breaking any records, but I did complete the thing.
Zoe: Wow, amazing. Whew. Well, in today's episode, we are going to talk about data engineering, which now seems really easy compared to running through the Sahara. Data is a hot topic for us right now, as businesses realize that their data can hold immense value. You could be sitting on a data goldmine. According to PwC, the world is producing about 2.5 quintillion bytes of data per day, with 90% of all data having been produced in just the last two years. There's a lot of data, and an ever-increasing amount of it.
Technology research outfit Gartner has found that nearly 97% of data remains untouched and unused by companies. Knowing that you have data and can use it is one thing, but how do you get the best from it and what do we mean when we talk about engineering in a data context? I guess that is the question for our guests. What do we mean by data engineering?
Jonathan: I think that's a good question, Zoe, and I think it's one where you'd get different answers depending on who you ask. I think the big picture is that it is about applying some structure to the way that you manage, control, and hold your data, in the same way that people tend to do on software projects, or indeed anything else with the word engineering in it. When you build a bridge, you don't just chuck some boards together and hope that it holds up. You sit down and think about the structure, what the load is, what the span has to be, and you plan it all up front.
Similarly, with data engineering, it's about taking that engineering rigor and thinking upfront about what you need to be able to do with that data, where it's going to come from, how much of it there is, and planning for that in advance so that you can then do exciting things with it, be that machine learning, be that reporting at the other end, or be it consuming that data from websites or other platforms.
Rob: I totally agree with everything Jon's saying there. To add my own spin on it, I often think that there are two halves to data engineering. There's the technical aspect: how do I arrange a business's data such that it is useful to analysts, for machine learning, and for reporting purposes? But there's also how I can bring the correct governance to that data, so that I can use it in a way that is legally correct and in agreement with the privacy policies I've set out to the users of my software, and any other interested parties in that space.
On the technical front, I often think about data engineering as removing the smell of the source from business data, and I say that coming from a software engineering background. Traditionally, software systems will be backed by a database, and that database will store data in a way that reflects how that software creates and updates the data. That is exactly what a data analyst does not want to see. They don't want to look at a table of insurance policies or products on a website and see lots of trace details about how that data was created and managed internally by one application.
It’s about taking data and abstracting it from the details of how it was made and presenting it in relevant business terms that can be aggregated with data from other sources and used to derive additional value and insight.
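To make that idea of "removing the smell of the source" concrete, here is a minimal sketch in Python. All the column names (pol_id, wf_state, premium_pence and so on) are invented for illustration, not taken from any real system Rob describes: raw rows carry internal workflow fields that an analyst should never have to reason about, and the transform keeps only business-level fields, renamed into business terms.

```python
# A hedged sketch: strip application-internal columns from raw source rows and
# publish only business-level fields. All field names are hypothetical.

RAW_ROWS = [
    {"pol_id": "P-1001", "wf_state": 7, "created_by_batch": True,
     "premium_pence": 42500, "cust_ref": "C-77", "soft_deleted": False},
    {"pol_id": "P-1002", "wf_state": 3, "created_by_batch": False,
     "premium_pence": 18000, "cust_ref": "C-12", "soft_deleted": True},
]

def to_business_record(raw: dict) -> dict | None:
    """Map one raw application row onto the warehouse's business vocabulary."""
    if raw["soft_deleted"]:  # internal housekeeping flag, not a real policy
        return None
    return {
        "policy_id": raw["pol_id"],
        "customer_id": raw["cust_ref"],
        "annual_premium_gbp": raw["premium_pence"] / 100,  # consistent units
    }

business_rows = [r for r in (to_business_record(row) for row in RAW_ROWS) if r]
print(business_rows)
```

The point of the sketch is simply that the internal workflow state, batch flags and soft-delete markers never leave the ingestion layer; analysts only ever see the business vocabulary.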
Jonathan: I think it's a hot topic at the moment in particular because of the explosion of novel use cases like machine learning. What we're seeing is a real explosion in the set of technologies which can be used in data engineering. We've seen stream processing technologies like Kafka and Spark become really popular for processing live data at scale. All of the main cloud platforms are releasing things quickly in this space: there's Redshift on AWS, there's BigQuery on Google Cloud, there's Synapse from Azure.
There’s a real explosion in the set of tools out there. They’re sufficiently big and complicated and sufficiently different from what’s come before that it is worth specializing and focusing and really understanding what the use cases are for the new tools being developed.
Zoe: It’s about the approach you take to dealing with your data. Actually, this starts before you even adopt machine learning or other processes with the data, you’ve got to get the data to the place where you can start using it. You have to deodorize it as Rob says. What are the kind of common problems that you see with people’s data sets?
Rob: Some of the common problems we see with data sets where there hasn't been an overarching data strategy and data engineering oversight are, for example, missing fields, where data just isn't present in the final output for analytics purposes. That can be really hard to retrofit. We also sometimes see problems with data quality constraints not being enforced. This can be really relevant, particularly with relational data models. If I've got several applications that work with SQL databases, you might have null values in a certain column. To take an example, perhaps I have several systems that are involved in the sale of books, and there's an author field on a book.
What does it mean to have an author that is null? Perhaps you have one system where null means we don't currently have the name of that author; we haven't validated that that's the correct source. For another, it might mean that this is a collection of books and there is no single author. Those are totally valid meanings, but if you aggregate the data from those systems together and both just end up as null, your data analysts have no idea what you mean.
A big part of data engineering is about enforcing consistent data quality constraints across multiple systems, as you bring their data together into the kind of data warehouse technologies Jon was talking about.
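A small sketch of what enforcing that kind of consistency might look like in practice, using Rob's book-author example. The two source system names and the null semantics assigned to them are assumptions for illustration: the ingestion step makes each system's meaning of null explicit before the rows are merged, so a bare null never reaches the analysts.

```python
# A hedged sketch of normalising "missing" values from two hypothetical systems
# before merging: system A's null means "author not yet verified", system B's
# null means "anthology, no single author". Both become explicit statuses.

from enum import Enum

class AuthorStatus(Enum):
    KNOWN = "known"
    UNVERIFIED = "unverified"        # system A's meaning of null
    NO_SINGLE_AUTHOR = "anthology"   # system B's meaning of null

def normalise_author(source: str, author: str | None) -> tuple[AuthorStatus, str | None]:
    """Turn each source system's null convention into an explicit status."""
    if author is not None:
        return AuthorStatus.KNOWN, author
    if source == "system_a":
        return AuthorStatus.UNVERIFIED, None
    if source == "system_b":
        return AuthorStatus.NO_SINGLE_AUTHOR, None
    raise ValueError(f"Unknown source system: {source}")

print(normalise_author("system_a", None))  # (AuthorStatus.UNVERIFIED, None)
print(normalise_author("system_b", None))  # (AuthorStatus.NO_SINGLE_AUTHOR, None)
```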
Jonathan: I think there's also a risk with multiple source systems in particular that terms don't have a well-understood meaning. You get fields like revenue in a database, where it has radically different meanings between multiple systems. As soon as you add it together, it no longer makes sense. A big part of data engineering is understanding what each field in each source system actually means and making sure that you don't produce invalid results by adding the wrong things together.
Zoe: If you are thinking about how to store this central data truth, what kind of features are you looking for it to have in order to be an ideal way of storing it?
Jonathan: That is an incredibly hard and complicated question to answer. I think that's part of the reason why it inevitably gets gotten wrong in some respects. A big part of this depends on what questions you are going to need to ask. We've done projects in the past where a surprising requirement is the ability to get a point-in-time view of the data, so to speak: not just what does the data look like now, but what did the data look like two years ago on this particular date? That can be really important for answering business or regulatory questions, but if you don't build that time facility into your database upfront, it can be incredibly hard and complicated to retrofit later.
A lot of this is thinking about what are the questions you want to ask. Do you want to segregate by time? Do you want to segregate by geography? Do you want to be able to have point-in-time snapshots? Those sorts of questions are really important to tease out of the anticipated use cases upfront.
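Here is a minimal sketch of the point-in-time idea Jon describes, in Python. The record shape (customer segment with validity dates) is an assumption made up for illustration: every change is stored as a new row with a validity window rather than overwriting the old one, so "what did this look like two years ago?" becomes a simple filter.

```python
# A hedged sketch of a point-in-time ("as of") query over versioned records.
# Field names and the customer-segment example are hypothetical.

from datetime import date

CUSTOMER_HISTORY = [
    {"customer_id": "C-77", "segment": "smb",
     "valid_from": date(2018, 1, 1), "valid_to": date(2020, 6, 30)},
    {"customer_id": "C-77", "segment": "enterprise",
     "valid_from": date(2020, 7, 1), "valid_to": None},  # current version
]

def as_of(history: list[dict], customer_id: str, when: date) -> dict | None:
    """Return the version of the record that was valid on the given date."""
    for row in history:
        ends = row["valid_to"] or date.max
        if row["customer_id"] == customer_id and row["valid_from"] <= when <= ends:
            return row
    return None

print(as_of(CUSTOMER_HISTORY, "C-77", date(2019, 3, 1))["segment"])  # smb
print(as_of(CUSTOMER_HISTORY, "C-77", date.today())["segment"])      # enterprise
```

As Jon notes, the hard part is not the query itself but deciding up front that history must be kept; if the old rows were overwritten, no amount of clever querying can bring them back.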
Rob: Jon, I'm glad you handled this question first because, as you say, it is a tricky one. I'd like to highlight some other considerations that should feed into deciding how to store your data; these are more on the infrastructure side. If you have a central source of truth about your business's data, you need to make sure it meets certain security criteria. You've got to make sure it's accessible, not just to your in-office staff, but potentially to other suppliers you work with, who might access segments of it. You need to make sure it's scalable, because every business wants to grow and you need your data infrastructure to be capable of growing alongside your business, and it needs to be cost-effective.
This is often where we get into discussions of on-premises storage versus, for example, cloud providers.
Zoe: It's really interesting, because I feel like I've just realized I asked the wrong question, because I was saying, well, what's the ideal way to store your data? Actually, it's a bit like saying, what's the ideal employee? It depends what you want them to do. You can have some general hints, like you want them to be friendly and a team player, but actually it's very dependent on what you use it for. So should you be aiming to say we have a central single truth about our data, or is it more a case of managing multiple data systems? What shape do you want your data in?
Rob: I think it comes down to, as you say, what are you going to do with it? We've already discussed how data engineering isn't really an end product in itself. What it does is power data aggregation tasks such as analytics or machine learning, and if your business has a use case for that, or believes you can derive particular insights from having that, then there is a data engineering system that needs to be in place to power it. I wouldn't really advise doing it just for the sake of it; you need to know what you're trying to get out of the system, because that both determines whether it's worth doing and how you would architect it.
If you have no need for central analytics of your data, and you are sure of that, then I don’t see an issue with not pursuing a central source of truth in your data storage.
Jonathan: There are definitely cases where it's not worth the effort of having a single source of truth, and when you've got well-segregated business domains, you can definitely have multiple silos for your data and not pay the price. Having said that, I think the further you go and the more data you get, the harder it can be to join across them. The second you ask that first question that crosses multiple data silos and find you actually can't join the data together, that can be a really painful experience, and I know, Rob, you've just finished a project where that was really one of the root causes that got us involved.
Rob: Yes, I've just rolled off what was almost a typical example of how these problems are really hard to address retrospectively. I think one of the issues with this particular business was that it was only once they had scaled to a certain size that they realized the value that could be drawn from connecting up all their data silos. Perhaps they did that cost analysis early on and decided it wasn't worth it, but then they grew, they were very successful, and it meant they had a lot of data engineering work to catch up on before they could really derive any value from their data. It wasn't impossible, but it was probably more costly than it would have been to address early on.
Jonathan: The general problem there was that they had a massive silo of very rich advertising data, and then quite a complex business domain with lots of venues, lots of countries, lots of cities that they wanted to do analytics on, but they had no way of joining across those silos. When they wanted to ask really important and pertinent business questions, they weren't able to do that without a lot of manual analysis. That's what you often find as well: the way you know there's a need to join things up is where people are doing ad-hoc, manual analysis. This joining-up of data silos happens on a case-by-case basis, normally in Excel as well.
[laughter]
Rob: Yes, I think if Excel’s become part of a routine data transformation workflow of any scale, that’s usually a good sign that there are efficiencies to be gained by bringing in some data engineering expertise.
Zoe: It's very interesting to think about this parallel with software engineering again, and the complexity of it. You could also ask the question, well, how do you do good software engineering? There are some answers, but it depends on what you want to do. I'm really struck by this notion that, just like in software engineering, you need to implement something well so that it scales and lasts for the long term, but you also want to implement something in an agile and flexible manner, where you don't design for all kinds of cases that you don't need and will never need, which actually ends up hamstringing you.
This sounds very similar to me in terms of, "Yes, obviously, if you will need to connect your data in the future, you need to think about that now, but also, don't just build a massive data system for the sake of it if you don't know what you're going to use it for."
Jonathan: I think I'd add to that another parallel between the software and data engineering sides, which is that if you don't do it, if you don't implement it, it will happen anyway. I think we've all seen businesses where they don't have a software application to do a certain thing, normally in the financial domain for some reason, but what you then find is complex spreadsheets evolve, people learn to write macros, and the software happens anyway. You end up with organic, free-range software, which eventually hits a scalability limit and then needs some proper software engineering, and the same thing's happening in data.
If you are sitting there in your business and seeing people spend a substantial fraction of their lives bringing together and joining data in Excel, that is normally a situation you need to have a think about. Could you do it better?
Rob: I think there's a parallel to be drawn between the two: the skill sets involved, the way of thinking about problems, and the solutions to those problems are very similar. I think you can think of data engineering as being where software engineering was maybe 20 or 30 years ago. I think that's a reasonable parallel in terms of ways of working, at a high level.
Zoe: If you've got some data, you're almost definitely going to need to change it, [giggles] but you also have systems running off it already, so how do you go about improving your data storage and your data modeling when there will be dependencies on that data already?
Rob: This is one of those areas where the answer already exists in software engineering. There are two aspects: the first is testing, the second is defining interfaces between your data system and source or client applications. In the world of software engineering, you define an interface for your application and you say, great, I can change anything inside that interface. That's internal, that's okay; as long as the external behavior is unchanged, it's fine.
You have a very different process if you are changing the behavior for a client, and then you might look at things like schema versioning or different API versions. These are approaches you see commonly in the world of software engineering, and the same thing can be applied to data engineering.
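A small sketch of how that interface idea might translate to data: downstream consumers read from a versioned "contract" (the column names and types below are invented for illustration), and a lightweight test fails if an internal refactor accidentally changes that contract. This is only one way to express the pattern Rob describes, not the only one.

```python
# A hedged sketch of treating a published schema as an interface with a test
# guarding it. Internal column names (pol_id, premium_pence) are hypothetical.

PUBLISHED_SCHEMA_V1 = {
    "policy_id": str,
    "customer_id": str,
    "annual_premium_gbp": float,
}

def publish(internal_row: dict) -> dict:
    """Map whatever the internal model looks like today onto the v1 contract."""
    return {
        "policy_id": internal_row["pol_id"],
        "customer_id": internal_row["cust_ref"],
        "annual_premium_gbp": internal_row["premium_pence"] / 100,
    }

def test_contract_unchanged() -> None:
    sample = publish({"pol_id": "P-1", "cust_ref": "C-1", "premium_pence": 100})
    assert set(sample) == set(PUBLISHED_SCHEMA_V1), "published columns changed"
    for column, expected_type in PUBLISHED_SCHEMA_V1.items():
        assert isinstance(sample[column], expected_type), f"{column} changed type"

test_contract_unchanged()  # internals can change freely while this still passes
```

Changing the contract itself would then be handled deliberately, for example by publishing a v2 alongside v1, exactly as you would version an API.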
Jonathan: I think, it being a newer field, and with the explosion of technologies and tools and techniques, a lot of this is still up for grabs, and like Rob says, I think it's a bit behind software engineering in having well-trodden paths that we know to be successful. There are some very well-defined software architectures, like the classic three-tier architecture, which people know how to build and which solve the problem well; I think data engineering is still a bit more of a bleeding-edge discipline.
Zoe: People have evolved those over tens of years of software engineering. It’s not like someone could just think about the problem and you immediately know the answer. You learn by working with it and coming up with new models and new paradigms of how to solve these challenges that we’re talking about.
Jonathan: I think the explosion in data engineering has been driven by a few different things. One is that storage has gotten really fast and also really cheap, so it's suddenly become viable to collect huge amounts of data, store it, and actually access it at a sensible speed; you're not having to ship things off to tape when it gets too big like you did back in the day. But I think also the explosion in web-scale companies, companies with tens or hundreds of millions of customers, has meant that people want to be collecting and analyzing data at that scale, and that wasn't the case 10 years ago. If you think about it, 10 years ago we didn't really have a Netflix.
This is a new thing. Data at this scale is a new thing, and it's still the case that the brilliant engineers at companies like Netflix and Amazon are solving these problems from first principles, and what they are building is starting to trickle out into the wider development ecosystem.
Rob: There's another side to this discussion, and that's the regulatory and governance landscape in which you are handling your data. For example, we are in the UK and we have the European GDPR regulations in place, and similar legislation exists across much of the world and is continually being drafted, often using GDPR as a model for new privacy laws. You need a sensible change control process to make sure you're adhering to those regulations and any other privacy policies you may have, while also being able to distribute the data from your central data store in a controlled manner to approved users.
Jonathan: That again is driven by the fact that data is now so massively valuable. Obviously, regulators have now realized that it's something that needs to be protected and that users should be protected from unscrupulous uses of that data, but it does present huge engineering challenges. The GDPR, for example, gives you the right to have your data removed or anonymized in business systems, and if that is 10 separate applications glued together with a load of Excel spaghetti, actually removing or anonymizing someone's data is really hard, and it's difficult to prove that you've done it right as well.
Or indeed to service requests for the disclosure of the data that a company holds. Proper data engineering approaches make those sorts of challenges much easier to overcome as well.
Zoe: What we’re really talking about here is having governance, so what kind of governance practices do you need around managing data in general?
Jonathan: It's a really key area, because I think people are increasingly appreciating their right to privacy in terms of data. I think 10 or 15 years ago this was not a consideration, but now people do have a reasonable expectation that their data won't be used beyond the purpose for which it was provided, that personal and sensitive data must be well protected, and that any breaches or errors around that are disclosed and properly dealt with. We're doing a few interesting bits of work at the moment with Moorfields Eye Hospital on systems that use patient data, and the governance processes around patient data in the NHS and with the researchers we're working with are extremely tight.
There is a very stringent set of controls to make sure that anything is fully de-identified before it is shared with researchers, and that's one where we have data lineage in place. That means that in the data warehouse we have built to hold patient data, we can trace back where every bit of data has come from. What that means is that we've got quite carefully segregated databases for patient data and anonymized data, we've got a stringent set of controls about what data can flow between those, and we have detailed policies signed off with the NHS information governance controller to ensure that we've done absolutely everything necessary to protect patient data.
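To illustrate the kind of de-identification step that can sit between a patient-data store and a research store, here is a minimal sketch. It is emphatically not the actual Moorfields implementation, and every field name and the keyed-pseudonym approach are assumptions: direct identifiers are dropped and the patient identifier is replaced with a keyed pseudonym, so rows can still be linked within the research set but cannot be traced back without the secret key held on the governed side.

```python
# A hedged, illustrative de-identification step. Field names, the coarsening
# choices, and the pseudonymisation scheme are all hypothetical.

import hashlib
import hmac

PSEUDONYM_KEY = b"kept-only-in-the-governed-environment"  # illustrative secret

def de_identify(patient_row: dict) -> dict:
    """Replace direct identifiers with a keyed pseudonym and coarsen the rest."""
    pseudonym = hmac.new(PSEUDONYM_KEY, patient_row["patient_id"].encode(),
                         hashlib.sha256).hexdigest()[:16]
    return {
        "patient_pseudonym": pseudonym,
        "year_of_birth": patient_row["date_of_birth"][:4],  # coarsened, not exact
        "diagnosis_code": patient_row["diagnosis_code"],
    }

print(de_identify({"patient_id": "P-0001", "date_of_birth": "1984-02-17",
                   "diagnosis_code": "H40.1"}))
```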
Rob: I think that's a really neat example of how, even in the most regulated environments, and rightly so, where patient confidentiality and privacy are hugely important, you can still do really interesting work with data, provided the proper protocols and processes are in place.
Zoe: Are there any other examples of what you need for good governance?
Jonathan: I think traceability is the big one. For anything that you're doing that is driving decisions, or anything that you are publishing, if somebody questions a number, being able to say where that number comes from is really key. We've seen this over the past few months with the global handling of the COVID pandemic: a lot of numbers get published which journalists or fact-checkers will dig into, and find that there's very little justification behind them.
Actually, people can't explain where that number has come from, and you see government statistical departments under a lot of pressure to produce figures, but it is also important to be able to trace right back to the source data and say, this is how we derived those numbers. I think the global reliance on high-quality data in tackling the current pandemic really shines a light on why that's important, and why being able to trust data and trust where it's coming from is key in shaping policy.
Rob: Data provenance is a huge aspect of maintaining, and being able to prove, data quality. It also cuts across into the governance concerns, for example the Moorfields work Jon talked about. If you need to delete data for a certain user, that user might have entered data across dozens of different systems that you've aggregated data from, or perhaps you are retaining data from one specific system but removing it if it has been sourced from others. That selective trimming of your data is impossible without appropriate data lineage, and on the technical front there are other concerns.
Imagine you have three or four systems all pulling data into the same warehouse. You might discover that, for three months of last year, there was a data quality problem in one of the source systems that has since been corrected, but you'd like to go in and adjust the historical data you are currently storing so that it reflects what you now understand to be true. In that case, you need to know about the date range of the data you're sourcing and which system it came from. Provenance and time control can be very important in those regards.
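Here is a small sketch of why those provenance columns matter in Rob's scenario. The source system names, row shape, and the "double-counted revenue" correction are all invented for illustration: because every warehouse row carries the system it came from and the date it relates to, a quality problem in one source over a known window can be corrected without touching anything else.

```python
# A hedged sketch: correct only the rows from one source system within a date
# range. Source names, columns, and the correction itself are hypothetical.

from datetime import date

warehouse = [
    {"source_system": "orders_eu", "business_date": date(2023, 2, 10), "revenue": 120.0},
    {"source_system": "orders_eu", "business_date": date(2023, 7, 3),  "revenue": 80.0},
    {"source_system": "orders_us", "business_date": date(2023, 2, 11), "revenue": 95.0},
]

def reprocess(rows, source, window_start, window_end, fix):
    """Apply a correction only to rows from one source within a date range."""
    return [
        fix(row) if row["source_system"] == source
        and window_start <= row["business_date"] <= window_end else row
        for row in rows
    ]

# Suppose the orders_eu feed double-counted revenue early in 2023.
corrected = reprocess(warehouse, "orders_eu", date(2023, 1, 1), date(2023, 3, 31),
                      fix=lambda r: {**r, "revenue": r["revenue"] / 2})
print(corrected)
```

Without the source_system and business_date columns there is nothing to filter on, and the only options are a blanket correction or a full reload.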
Zoe: If I have got my data system and I'm like, "Oh, actually, I don't know if I have good provenance on this data. What can I do about it?"
Rob: The first step is to really dig into how that data got to where it is now, and then you'll be able to evaluate what your current processes are and are not giving you. If that is not adequate, then you need to be looking at some change process, as we discussed previously, on how to expand your data provenance, because that's essentially a missing field, which is one of the first problems I flagged when we talked about common problems earlier.
Jonathan: I'd add to that, it doesn't necessarily have to be complicated. I think the principles behind data provenance and traceability can be applied to good old Excel spreadsheets, as long as you're keeping copies and you know that, for a particular published set of numbers, you have got the set of spreadsheets that were used to derive them; then that's actually absolutely fine, it doesn't have to be complex. It's where you are modifying spreadsheets over time, where the same ones get used cycle after cycle for your financial reporting, that it gets problematic. A lot of this is around discipline and making sure that you have rubber-stamped the version that was used for a purpose and can go back to it.
Zoe: Essentially this is super important, because what we're saying is, if you have a dataset that doesn't include information on its provenance, that dataset could just be totally useless to you. You may need to either validate it line by line or go and recreate all of that data again, so it's super important to think about it at the start.
Rob: That situation is the worst of both worlds: you don't even know whether the data needs to be discarded and ignored. It might be really insightful, but you have no way of being sure, and that's what we try to avoid when doing data engineering work.
Jonathan: Absolutely. I think the other nightmare scenario that I've seen a number of times is where you try to recreate a set of figures and actually find that you can't. What that's typically caused by is people having made manual adjustments, having overtyped values, or having brought in versions which have since been overwritten, and you can just completely lose the history. That can be a real problem, particularly in regulated industries like financial services, where you've got things like the Sarbanes-Oxley Act that require you to have traceability of data.
Zoe: Right, or a problem for anyone trying to extract insights that are true.
Jonathan: More generally, yes.
Rob: We didn't explicitly discuss it earlier, but I think one of the core considerations when architecting a data platform is making sure that your data is immutable. It can be updated but not overwritten; that consideration generally makes it far easier to reconstruct point-in-time snapshots or analyses at a later date.
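A final sketch of the append-only idea Rob describes, with invented record names: an "update" writes a new version of the record rather than overwriting the old one, so the full history is retained and earlier snapshots can always be reconstructed.

```python
# A hedged sketch of immutable, append-only storage: updates append new
# versions, nothing is modified in place. Identifiers are hypothetical.

from datetime import datetime, timezone

ledger: list[dict] = []

def upsert(record_id: str, payload: dict) -> None:
    """Append a new version of the record; never modify or delete in place."""
    ledger.append({"record_id": record_id, "payload": payload,
                   "recorded_at": datetime.now(timezone.utc)})

def latest(record_id: str) -> dict | None:
    """Return the most recently recorded version of the record, if any."""
    versions = [row for row in ledger if row["record_id"] == record_id]
    return versions[-1]["payload"] if versions else None

upsert("C-77", {"segment": "smb"})
upsert("C-77", {"segment": "enterprise"})  # a later correction, not an overwrite
print(latest("C-77"))                      # {'segment': 'enterprise'}
print(len(ledger))                         # 2: the full history is retained
```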
Zoe: Fantastic. Thank you both so much. I hope we have provided you with some tips that can help you find some nuggets of gold in your data mine. So, yes, thanks again to Jon and Rob, and join us next time on Softwire TechTalks.