Detecting Latent User Properties in Social Media
Delip Rao and David Yarowsky (2010)
Keywords: Academic Paper, Gender, Twitter, SVM, stacked SVM, Text Features
Classifier: Support vector machine
Features: Text features (Twitter user language in tweets and profiles)
Citation: Rao, D., Yarowsky, D.: Detecting latent user properties in social media. In: Proceedings of the NIPS MLSN Workshop (2010)
Rao and Yarowsky set out to infer the gender, age and political orientation of Twitter users using a stacked SVM classification method (combining SVM models) over a set of features. The authors emphasize the importance of language in identifying latent user attributes and propose a mixture of lexical and sociolinguistic features for classifying Twitter users based on their informal textual communication.
They use users' status messages as the basis for inference, and they test and compare three classification models: (1) a sociolinguistic feature model, in which digital sociolinguistic cues such as emoticons or certain punctuation (ellipses and exclamation marks) are used as features; (2) a lexical n-gram model, in which unigrams and bigrams are derived from the tweet text; and (3) a stacked model whose features are derived from the predictions of the previous two models.
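The three feature sets described above can be illustrated with a minimal sketch. It is not the authors' implementation: the regular expressions, the function names and the choice of cues are assumptions made for illustration, and the stacking step only shows how two base-model scores (hypothetical values here) could be assembled into the input of a second-level classifier.

```python
import re
from collections import Counter

# Hypothetical cue patterns; the paper's full feature inventory is not listed here.
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")
ELLIPSIS = re.compile(r"\.{3,}")
TOKEN = re.compile(r"[a-z0-9']+")


def sociolinguistic_features(tweets):
    """Average per-tweet counts of a few informal-writing cues."""
    n = max(len(tweets), 1)  # avoid division by zero for users with no tweets
    return {
        "emoticons": sum(len(EMOTICON.findall(t)) for t in tweets) / n,
        "ellipses": sum(len(ELLIPSIS.findall(t)) for t in tweets) / n,
        "exclamations": sum(t.count("!") for t in tweets) / n,
    }


def ngram_features(tweets):
    """Unigram and bigram counts over lowercased tweet tokens."""
    counts = Counter()
    for tweet in tweets:
        tokens = TOKEN.findall(tweet.lower())
        counts.update(tokens)
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts


def stacked_features(socio_score, ngram_score):
    """Feature vector for a second-level model, built from two base-model scores."""
    return [socio_score, ngram_score]


tweets = ["So excited!!! :)", "well... not sure :("]
socio = sociolinguistic_features(tweets)
ngrams = ngram_features(tweets)
# The scores below are placeholders standing in for base-classifier outputs.
stacked = stacked_features(0.8, -0.2)
```

In a full pipeline each base feature set would feed its own SVM, and the second-level SVM would be trained on the stacked vectors; those training steps are omitted here.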
For gender, using the status text alone, they found that their sociolinguistic model (71.76% accuracy) performed better than their lexical model (68.70% accuracy); their stacked model did marginally better, achieving 72.33% accuracy. The authors also examined social network structure (e.g. follower-followee ratio) and communication behaviour (e.g. reply rate) and determined that these were not valuable in inferring latent author attributes.