Monica Manney
Welcome to UVA Data Points, a new series where we explore the foundations, practice, and impact of data science. I'm your host, Monica Manney. In this four-part series, we'll explore topics ranging from open research commons to the role of body image in hiring, and from sports medicine to the Qʼeqchiʼ Mayan book of creation. Perhaps most importantly, we explore ideas for training and empowering the next generation of data scientists. But before we get into all that, we should probably answer the question: what is data science?
Raf Alvarado
It's kind of weird that you'd have a School of Data Science for something that doesn't have a definition.
Monica Manney
That was Raf Alvarado.
Raf Alvarado
Hi, my name is Raf Alvarado. I'm an associate professor in the School of Data Science, and I am the director of the residential master's degree program.
Monica Manney
And Raf has developed the 4 + 1 Model to define the field of data science.
Raf Alvarado
Yeah, so the purpose is, in a very general way, basically to scratch an itch and actually come up with a definition. And it's also kind of annoying when people say, "Oh, it's just a buzzword." I mean, to some degree, it is a buzzword, but I'm an anthropologist, and when people say things with sincerity, even if they're not actually academically correct, they usually have a good reason for it. And so, you know, I tend to take something like that seriously and say, "Well, what is behind this? Why are people using this term?" It can't just be a buzzword. Because if it were simply an empty term, it would probably dissipate very quickly; it wouldn't have the legs that it has. And it certainly wouldn't have the money following it the way that it does now. So it was really kind of a good-faith effort to flesh out and provide a definition, to sort of put an end to some of this dismissal of the field.
Monica Manney
The 4 + 1 Model is based on four areas of data science surrounding the central component of practice. The development of this model began with the examination of the data pipeline.
Raf Alvarado
So the data science pipeline is an idea that you'll find in the literature, and probably the most famous one is Chris Wiggins's definition. I believe he calls it the OSEMN pipeline; that's an acronym for the different phases of data science. And he makes a strong case that data science is essentially expertise in this pipeline. And it consists of these different phases. So I think the O stands for obtain, and then S is scrub, and there's explore, model, and then interpret. And so that's what the pipeline is. And you'll see this definition in different places. And you'll see lots of variations, but pretty much they stick to the same sort of logic. And so that became the basis for the model: okay, let's take this pipeline and let's deconstruct it and see what's common to it. And it turns out, I feel like you can break it down into four different areas that emerge from that.
Monica Manney
From here, Raf began to deconstruct the pipeline and extrapolate the foundational components of data science. It's this model that serves as the structure of this series.
Raf Alvarado
The 4 + 1 Model, it's important to understand that it is an analytical tool. And so what I did was I took this sort of composite view of what data science is, and realized that the pipeline that was being described is really more of an arc. In other words, if you look at the point where data comes into the system, and compare that to where data comes out of the system, they're not on opposite sides of anything. They're actually in the same place, which is the world, right? So data comes from some place in the world, some domain of research, like physics or finance, and then whatever is analyzed in that pipeline goes back into that same world and has an influence on it. So you can think of it as an arc, where data moves into the system and then comes back out. If you look at it as an arc, you can actually see that there are parts of that pipeline that fold back on themselves, that are similar to each other but kind of look different if you don't think of it this way. And so as a result of this analysis, I ended up discovering that there were four major areas that you could sort of deconstruct from the pipeline. And the reason we say plus one is because all those parts are just abstractions; they have to be integrated. And they're always related to each other in any kind of application of data science.
Monica Manney
So that's an outline and history of the 4 + 1 Model. But it's important to understand the four different areas and how they contribute to the practice of data science. So let's start with the first area: value.
Raf Alvarado
By value, I mean, why are you acquiring data in the first place? Right? What's the business proposition? What's the scientific motivation? What is it in the world that you're interested in studying or affecting, that you're acquiring data for and doing analysis for? So we call that the area of value, because that's where the purpose of working with data comes from. And also it's where data has an influence on the world, where it can either do good or harm. And so that's where ethics comes in, right?
Monica Manney
After value, we have the area of design.
Raf Alvarado
And the area of design is this area of translating data as it is coming from the world in such a way that the machine can understand it. A lot of what we do in data science is translating between human ways of representing things and machine ways of representing things. And then taking it from the sort of the machine way of representing things back into the world to the human way. So that's what we call design.
Monica Manney
Next, the third area: systems.
Raf Alvarado
And then the systems part is kind of self-evident: it's about technology.
Monica Manney
And finally, the area people commonly associate with data science: analytics.
Raf Alvarado
That's where statistics is. That's where computer science is, when you think of computer science as a branch of discrete mathematics. That's where simulations are. That's where systems engineering is, in the sense of operations research and things like that. And that's the whole analytical part.
Monica Manney
So those are the four areas of the 4 + 1 Model. But this still leaves the plus one, which sits right in the middle of it all: practice.
Raf Alvarado
None of these things function autonomously. There's no such thing as a data modeling process that's done independently of value, let's say, because that's where bias can be introduced. And obviously, when you're designing a data model to support some research, you're looking forward to how you're going to model this data analytically, and what you're going to do with it. So every area of the model is always connected to every other area in practice. So that's why we think of having a plus one, which is the area of practice, where all these things are integrated. But the key there is that what's introduced in practice, that's not part of the four plus one, is connecting the four areas to the world of other domains. So we're talking about when you start applying the model to medicine, or to finance, or to education, or to politics, or to whatever. That connection to those other areas is what sort of drives the integration of the four areas into a practical reality. So that's what we think of when we think of the plus one part of the model.
Monica Manney
So the 4 + 1 Model is how we answer the question, what is data science? And with this shared definition, we can begin our exploration of the real-world application of data science. Each episode in this first series of UVA Data Points will focus on a particular area of the 4 + 1 Model, examining how it works in practice and how it connects to the real world around us. We'll be starting in the value space for Episode One, featuring an interview between Brian Wright, a faculty member and the Director of Undergraduate Programs at the UVA School of Data Science, and Cathy O'Neil, data scientist and New York Times bestselling author. To stay up to date with current episodes, click subscribe wherever you listen to podcasts. For more information about the UVA School of Data Science, visit us at datascience.virginia.edu. And if you have a data science topic you'd like us to explore, email us at [email protected]. We'll see you next time.