Researchers predict users’ income level based on their Twitter behavior

The words people use on social media can reveal hidden meaning to those who know where to look.

Linguists have long been fascinated by this notion, connecting a person’s words to age, gender, even socioeconomic status. Now computer scientists from the University of Pennsylvania and elsewhere have gone a step further, linking the online behavior of more than 5,000 Twitter users to their income bracket. They published their results in the journal PLOS ONE.

Two Greek researchers are among the team. Daniel Preotiuc-Pietro, a post-doctoral researcher in Penn’s Positive Psychology Center in the School of Arts & Sciences, led the research, collaborating with Svitlana Volkova of Johns Hopkins University, Vasileios Lampos and Nikolaos Aletras of University College London, and Yoram Bachrach of Microsoft Research.

The team took the opposite approach to the one psychologists and linguists have historically used: rather than asking participants direct questions, the scientists looked at their social media posts, which are often full of intimate details despite the lack of privacy these outlets afford. Researchers from Penn’s World Well-Being Project, of which Preotiuc-Pietro is a part, are curious about social media as a research tool that can support, or even replace, expensive, limited and potentially biased surveying.

“The multimodal user data available in social media, from patterns of interactive behaviour to discussion themes, enable the automatic deduction of interesting conclusions. In the current research paper we propose a model that can be used to predict the income level of a Twitter user. Apart from the self-intuitive web products (e.g. personalised advertisement) that may emerge or improve from this kind of research, the main benefit may be located elsewhere. Such methods could be used in the future to enhance data-driven analyses in sociology as they offer a much larger pool of data compared to the traditional questionnaire-based approaches,” stated Dr. Lampos.

For this experiment, the researchers started by looking at Twitter users’ self-described occupations.

In the United Kingdom, a job code system sorts occupations into nine classes. Using that hierarchy, the researchers determined the average income for each code, then sought a representative sample of users from each. After manually removing ambiguous profiles (for example, listings referencing the film Coal Miner’s Daughter that had been grouped under “coal miner” as a profession), the team ended up with 5,191 Twitter users and more than 10 million tweets to analyze.
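The labeling step described above can be sketched roughly as follows. This is only an illustration, not the researchers' code: the keyword table, the class numbers and the income figures are all invented, and the function name label_user is hypothetical. It matches a profile description against occupation keywords and attaches the mean income of the matching class, which also shows why a manual check was needed.

```python
# Hypothetical lookup tables: the keywords, class numbers and incomes
# below are invented for illustration and are not figures from the study.
OCCUPATION_KEYWORDS = {
    "software developer": 2,
    "sales assistant": 7,
    "coal miner": 8,
}
MEAN_INCOME_BY_CLASS = {2: 40000, 7: 18000, 8: 24000}


def label_user(profile_description):
    """Return (job_class, mean_income) for the first keyword found, else None."""
    text = profile_description.lower()
    for keyword, job_class in OCCUPATION_KEYWORDS.items():
        if keyword in text:
            return job_class, MEAN_INCOME_BY_CLASS[job_class]
    return None


profiles = [
    "Software developer and weekend cyclist",
    "Huge fan of the film Coal Miner's Daughter",  # false match on "coal miner"
    "Tea drinker",                                 # no occupation stated
]
labels = [label_user(p) for p in profiles]
```

The second profile is labeled as a coal miner even though it only names a film, which is the kind of ambiguous match the team removed by hand.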

From there, they created a statistical natural language processing algorithm that pulled in words that people in each code class use distinctly. Most people tend to use the same or similar words, so the algorithm’s job was to “understand” which were most predictive for each class. Humans analyzed these groupings and assigned them qualitative signifiers.
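One simple way to find words that a group uses distinctively is to compare how often each word appears inside the group with how often it appears everywhere else. The sketch below uses a smoothed log-ratio of word frequencies for that purpose. It is a stand-in chosen for illustration, not the algorithm the researchers built, and the function name distinctive_words and the toy tweets are assumptions.

```python
import math
from collections import Counter


def distinctive_words(tweets_by_class, top_n=3):
    """Rank words per class by a smoothed log-ratio of in-class
    versus out-of-class relative frequency."""
    counts = {
        c: Counter(w for tweet in tweets for w in tweet.lower().split())
        for c, tweets in tweets_by_class.items()
    }
    vocab = set().union(*counts.values())
    result = {}
    for c, own in counts.items():
        # Pool the word counts of every other class.
        rest = Counter()
        for other, cnt in counts.items():
            if other != c:
                rest.update(cnt)
        # Add-one smoothing keeps unseen words from producing log(0).
        own_total = sum(own.values()) + len(vocab)
        rest_total = sum(rest.values()) + len(vocab)
        scores = {
            w: math.log((own[w] + 1) / own_total)
            - math.log((rest[w] + 1) / rest_total)
            for w in vocab
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[c] = ranked[:top_n]
    return result


# Toy input, invented for illustration only.
toy = {
    "higher": ["the board approved the budget", "the budget vote is today"],
    "lower": ["the bus is late again", "late for the bus again"],
}
top = distinctive_words(toy)
```

On this toy input, “budget” ranks first for the “higher” group, while a word both groups use equally, such as “is”, scores zero and drops out of the ranking.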

Some of the results confirmed what is already known: for instance, that a person’s words can reveal age and gender, and that these are tied to income. But Preotiuc-Pietro said there were also some surprises. For example, those who earn more tend to express more fear and anger on Twitter, and perceived optimists have a lower mean income. Text from those in lower income brackets includes more swear words, whereas those in higher brackets more frequently discuss politics, corporations and the nonprofit world.

Aletras noted an overall picture that emerged about Twitter use.

“Lower-income users or those of a lower socioeconomic status use Twitter more as a communication means among themselves,” he said. “High-income people use it more to disseminate news, and they use it more professionally than personally.”

“We need to note that our research does not present a method that is mature enough to guess the income of individual users with high accuracy. At this stage, our derivations are based on sets of multiple users, where the statistical error is reduced significantly,” Dr. Lampos concluded.