Image by Gerd Altmann from Pixabay.

People Data Science

Ana Bedacarratz
5 min read · Sep 15, 2021


This one is for all of you data people with an interest in people data.

Most anyone who's heard me speak about data science for even a few minutes knows that the kind of data that really makes me tick is data about or produced by humans. The kind that can speak to cognition, emotion, values, beliefs, attitudes, personality, culture, social influence, and everything that produces or explains human behavior. I'm also very interested in finding ways to overcome the stubborn biases that characterize the human mind and hold us back.

I consider myself a data scientist, but one with a rather unusual profile. Before immersing myself in the technology and digital transformation world, I studied psychological science, from undergraduate all the way to the doctoral level. Not the psychotherapy kind, but the systematic, stats-heavy, knowledge-producing kind. The edge this gives me in data science is a deep understanding of people data: how to collect it, how to analyze it, and what kinds of insights can be derived from it under what conditions.

Data about or produced by people are some of the most complex data organizations have. So, what are some of the quirks of this data that may not be so obvious to someone who's used to nerding out over other kinds of data?

1. People lie. Doctor House was not the first to worry about this. There is actually a theory that lying and lie detection were, at least in part, responsible for the evolution of the human brain. Our species has taken lying to such heights that we even lie to ourselves, and that’s actually associated with living a happier, better-adjusted life. Lying is even considered a social developmental milestone for children. But I digress…

The point is, when inferring anything from data provided by people, especially data collected through self-report, we must either rule out or factor in dishonesty. The concepts of measurement validity and reliability are colored by this basic fact when it comes to people data. In a science like physics or chemistry, aside from faulty instruments, the main cause of low validity or reliability tends to be the observer. In psychology and the social sciences, there is a second possible cause, the observed, and a third, the interaction between the observer and the observed.

It's not all grim; there is good news too. Psychological scientists have developed numerous methods involving data triangulation to ensure that the conclusions we draw from data are not biased by lies. The not-so-good news? Most of these tools function primarily at the level of data collection. If you were planning on working with self-report data you already have on hand, but that data was collected without expert input, you might benefit from an expert's assistance.

The expert's advice might just be that you should disregard some of the data you already have, or collect other kinds of data, before you can confidently draw inferences you can base decisions on. If you are not too far along the data collection path, they might advise you on how to modify your data collection strategy so that you can still obtain the valuable kinds of insights you seek before collection is concluded. The risks of bypassing expertise? You could be ignoring big underlying issues with how your data were collected that bias the conclusions you draw from the exercise, or you could be well on your way to reinventing the wheel. Neither is a good prospect for success.

2. People are not very good at predicting the future. You might as well call this one "people lie even when they try to tell the truth". Emotion is a domain rife with examples of this phenomenon, and the problem applies to the past as much as it does to the future. People tend to underestimate how bad something made them feel in the past, and overestimate how bad the same event will make them feel in the future.

For those inclined to believe data is only useful when it has a straightforward interpretation, this can be particularly disheartening. Nonetheless, psychological science has dealt with this problem for decades and, for the most part, successfully moved past it. The gist of it? Evaluating stable patterns over time. For example, knowing that the human mind is not rational is not particularly surprising, nor is it particularly useful. But knowing the particular ways in which it is irrational, and being able to quantify each one? On one occasion, that was worthy of a Nobel Prize.

Back to the future: knowing that people are not very good at predicting how they will feel, think, or behave forces the data professional to look for decision-making methods that go beyond the literal interpretation of answers to hypothetical questions (e.g., “What percentage of your salary would you be willing to give up to be able to work remotely?”) and to really consider the questions behind the questions (if we're talking about self-report). More broadly, one needs to consider what constitutes weak evidence that barely passes statistical scrutiny versus strong evidence made up of different kinds of data that all converge towards the same conclusion.
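To make that last idea a little more concrete, here is a minimal sketch, using simulated numbers and entirely hypothetical measures, of what checking for converging evidence might look like: several noisy indicators of the same underlying construct should correlate with one another and point in the same direction.

```python
# Minimal sketch of a convergence check across different kinds of measures.
# The column names below are hypothetical stand-ins for whatever data you have,
# and the numbers are simulated purely for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 300

# Simulate a latent "preference for remote work" driving three noisy measures:
# a self-report item, a behavioral log metric, and a manager rating.
latent = rng.normal(size=n)
df = pd.DataFrame({
    "self_report_remote_pref": latent + rng.normal(scale=1.0, size=n),
    "days_worked_remotely": latent + rng.normal(scale=1.2, size=n),
    "manager_rated_remote_fit": latent + rng.normal(scale=1.5, size=n),
})

# If the measures tap the same underlying construct, their pairwise
# correlations should all be positive and reasonably sized.
print(df.corr().round(2))
```

If those correlations hover near zero or point in different directions, at least one of your measures probably isn't capturing what you think it is, and the "strong evidence" case hasn't been made yet.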

3. People change. Even when people tell the truth and have good insight, or when a researcher is skilled at distilling that insight with the whole arsenal at their disposal, data collected yesterday won't be as relevant for decision-making as data collected today. This is because people change. Attitudes, preferences, even personality and memory - they all change.

One of the most useful concepts for data scientists working with behavioral or people data to get acquainted with is model drift. In simple terms, model drift happens when the factors that predicted or explained your outcome of interest yesterday no longer predict or explain it today: the context in which your model operates has undergone a dramatic shift. Sometimes this shift is tied to a large societal-level event, such as a pandemic or a change in political regime, something the data scientist can't model until it has already happened. Other times, it's a matter of a relevant variable having been left out of the model, such as whether it was winter or summer.
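As an illustration only, here is a minimal sketch, with made-up numbers and a hypothetical "engagement_score" feature, of one common way to watch for drift: compare the distribution of an input (or of the model's errors) in a recent window against a reference window.

```python
# Minimal drift-detection sketch: compare a feature's distribution between a
# reference window and the most recent window. Feature name, window sizes and
# the significance threshold are all hypothetical choices for this example.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Pretend "engagement_score" was measured last year (reference window) and
# again this month (current window), and that the recent data has shifted.
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)
current = rng.normal(loc=0.4, scale=1.2, size=1_000)

# Two-sample Kolmogorov-Smirnov test: has the feature's distribution changed?
stat, p_value = ks_2samp(reference, current)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.4g}")

if p_value < 0.01:
    print("Distribution shift detected: time to investigate and possibly retrain.")
else:
    print("No strong evidence of drift in this feature.")
```

In practice you would run this kind of check on a schedule, across many features and on model performance itself, and treat a flag as a prompt to investigate rather than as a verdict.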

How different kinds of change unfold over time, and the subtle and not-so-subtle ways in which contextual factors influence behavior, provide some of the most interesting lenses through which to view our past and our future, because they speak to how we might influence positive change, for individuals as well as for societies.

Are you a people data scientist? What are your favorite quirks about people data?


Ana Bedacarratz

Psychological scientist working at the intersection of people & technology.