4 + 1 Model of Data Science

Show Notes

Before diving into the complex world of data science, it seemed wise to establish a shared definition of the field. Here at the UVA School of Data Science, we have defined data science with the 4 + 1 Model. This model serves as an outline for the first series of UVA Data Points. It also serves as a guiding definition within the School of Data Science, touching everything from research to course planning.

In this introduction trailer, host Monica Manney discusses the history, development, and function of the 4 + 1 Model of Data Science with its main author, Raf Alvarado.

Below is a brief excerpt from An Outline of the 4 + 1 Model of Data Science by Raf Alvarado:

“The point of the 4 + 1 model, abstract as it is, is to provide a practical template for strategically planning the various elements of a school of data science. To serve as an effective template, a model must be general. But generality is often purchased at the cost of intuitive understanding. The following caveats may help make sense of the model when considering its usefulness when applied to various concrete activities.

The model describes areas of academic expertise, not objective reality. It is a map of a division of labor writ large. Although each of the areas has clear connections to the others, the question to ask when deciding where an activity belongs is: who would be an expert at doing it? The realms help refine this question: the analytics area, for example, contains people who are good at working with abstract machinery. The four areas have the virtue of isolating intuitively correct communities of expertise. For example, people who are great at data product design may not know the esoteric depths of machine learning, and adepts at machine learning are not usually experts in understanding human society and normative culture.

Each area in the model contains a collection of subfields that need to be teased out. Some areas will have more subfields than others. Although some areas may be smaller than others in terms of number of experts (faculty) and courses, each area has a major impact on the overall practice of data science and the quality of the school’s activities. In addition, these subfields are in an important sense “more real” than the categories. We can imagine them forming a dense network in which the areas define communities with centroids, and which are more interconnected than the clean-cut image of the model implies.

The areas of the model are like the components of a principal component analysis of the vector space of data science. They capture the variance that exists within the field, and, crucially, provide a framework for realigning (rebasing) the academy along a new set of axes. One effect of this is to both disperse and recombine older fields, such as computer science, statistics, and operations research, into new clusters. Thus we separate computer science subfields such as complexity analysis and database design. One possible salutary result of this will be the formation of new syntheses of fields that share concerns but differ in vocabularies and customs..."

Episode Transcript

Monica Manney: Welcome to UVA Data Points, a new series where we explore the foundations, practice, and impact of data science. I'm your host, Monica Manney. In this four-part series, we'll explore topics ranging from open research commons to the role of body image in hiring, and from sports medicine to the Qʼeqchiʼ Mayan book of creation. Perhaps most importantly, we explore ideas for training and empowering the next generation of data scientists. But before we get into all that, we should probably answer the question: what is data science?

Raf Alvarado: It's kind of weird that you'd have a School of Data Science for something that doesn't have a definition.

Monica Manney: That was Raf Alvarado.

Raf Alvarado: Hi, my name is Raf Alvarado. I'm an associate professor in the School of Data Science, and I am the director of the residential master's degree program.

Monica Manney: And Raf has developed the 4 + 1 Model to define the field of data science.

Raf Alvarado: Yeah, so the purpose is, in a very general way, it's basically to scratch an itch and actually come up with a definition. And it's also kind of annoying when people say, "Oh, it's just a buzzword." I mean, to some degree, it is a buzzword, but I'm an anthropologist, and when people say things with sincerity, even if they're not actually academically correct, they usually have a good reason for it. And so, you know, I tend to take something like that seriously and say, "Well, what is behind this? Why are people using this term?" It can't just be a buzzword. Because if it's simply an empty term, it would probably dissipate very quickly; it wouldn't have the legs that it has. And it certainly wouldn't have the money following it the way that it does now.
So it was really kind of a good-faith effort to flesh out and provide a definition, to sort of put an end to some of this dismissal of the field.

Monica Manney: The 4 + 1 Model is based on four areas of data science surrounding the central component of practice. The development of this model began with the examination of the data pipeline.

Raf Alvarado: So the data science pipeline is an idea that you'll find in the literature, and probably the most famous one is Chris Wiggins's definition. I believe he calls it the OSEMN pipeline; that's an acronym for the different phases of data science. And he makes a strong case that data science is essentially expertise in this pipeline. And it consists of these different phases. So I think the O stands for obtain, and then S is scrub, and there's explore, model, and then interpret. And so that's what the pipeline is. And you'll see this definition in different places. And you'll see lots of variations, but pretty much they stick to the same sort of logic. And so that became the basis for the model: okay, let's take this pipeline and let's deconstruct it and see what's common to it. And it turns out, I feel like you can break it down into four different areas that emerge from that.

Monica Manney: From here, Raf began to deconstruct the pipeline and extrapolate the foundational components of data science. It's this model that serves as the structure of this series.

Raf Alvarado: The 4 + 1 Model, it's important to understand that it is an analytical tool. And so what I did was I took this sort of composite view of what data science is, and realized that the pipeline that was being described is really more of an arc. In other words, if you look at the point where data comes into the system, and compare that to where data comes out of the system, they're not on opposite sides of anything; they're actually in the same place, which is the world, right?
So data comes from some place in the world, some domain of research, like physics or something, or finance, and then whatever is analyzed in that pipeline goes back into that same world and has an influence on it. So you can think of it as an arc where data moves into the system and then comes back out. If you look at it as an arc, you can actually see that there are parts of that pipeline, which folds back on itself, that are similar to each other but kind of look different if you don't think of it this way. And so as a result of this analysis, I ended up discovering that there were four major areas that you could sort of deconstruct from the pipeline. And the reason we say plus one is because all those parts are just abstractions; they have to be integrated. And they're always related to each other in any kind of application of data science.

Monica Manney: So that's an outline and history of the 4 + 1 Model. But it's important to understand the four different areas and how they contribute to the practice of data science. So let's start with the first area: value.

Raf Alvarado: By value, I mean, why are you acquiring data in the first place? Right? What's the business proposition? What's the scientific motivation? What is it in the world that you're interested in studying or affecting, that you're acquiring data for and doing analysis for? So we call that the area of value, because that's where the purpose of working with data comes from. And also it's where data has an influence on the world, where it can either do good or harm. And so that's where ethics comes in, right?

Monica Manney: After value, we have the area of design.

Raf Alvarado: And the area of design is this area of translating data as it is coming from the world in such a way that the machine can understand it. A lot of what we do in data science is translating between human ways of representing things and machine ways of representing things.
And then taking it from the machine way of representing things back into the world, to the human way. So that's what we call design.

Monica Manney: Next, the third area: systems.

Raf Alvarado: And then the systems part is kind of self-evident; it's about technology.

Monica Manney: And finally, the area people commonly associate with data science: analytics.

Raf Alvarado: That's where statistics is, that's where computer science is, when you think of computer science as a branch of discrete mathematics. That's where simulations are, that's where systems engineering is, in the sense of operations research and things like that. And that's the whole analytical part.

Monica Manney: So that's the four areas of the 4 + 1 Model. But this still leaves the plus one, which sits right in the middle of it all: practice.

Raf Alvarado: None of these things function autonomously. There's no such thing as a data modeling process that's done independently of value, let's say, because that's where bias can be introduced. And obviously, when you're designing a data model to support some research, you're looking forward to how you're going to model this data analytically, and what you're going to do with it. So every area of the model is always connected to every other area, in practice. So that's why we think of having a plus one, which is the area of practice where all these things are integrated. But the key there is what's introduced in practice that's not part of the four plus one: connecting the four areas to the world of other domains. So we're talking about when you start applying the model to medicine, or to finance, or to education, or to politics, or to whatever. That connection to those other areas is what sort of drives the integration of the four areas into a practical reality. So that's what we think of when we think of the plus one part of the model.

Monica Manney: So the 4 + 1 Model is how we answer the question, what is data science? And with this shared definition, we can begin our exploration of the real-world application of data science. Each episode in this first series of UVA Data Points will focus on a particular area of the 4 + 1 Model, examining how it works in practice and how it connects to the real world around us. We'll be starting in the value space for Episode One, featuring an interview between Brian Wright, a faculty member and the Director of Undergraduate Programs at the UVA School of Data Science, and Cathy O'Neil, data scientist and New York Times bestselling author. To stay up to date with current episodes, click subscribe wherever you listen to podcasts. For more information about the UVA School of Data Science, visit us at datascience.virginia.edu. And if you have a data science topic you'd like us to explore, email us at [email protected]. We'll see you next time.
