For my project, I researched machine learning approaches to personality identification using online social network (OSN) data. The project consisted of two key stages: data acquisition and data analysis.
To acquire data, I built spider applications that ran in parallel on a 20-server cluster. Using the search APIs provided by OSN platforms, the spiders retrieved historical posts and relevant users for specific search queries. Using live-streaming APIs, I was also able to track posts on chosen topics in real time. The spiders wrote to a graph-oriented NoSQL database, which facilitates indexing and querying the complex relationships between posts and users.
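As a rough illustration of why a graph model suits this data, the sketch below stores posts, users and topics as nodes joined by labelled edges, then answers a relationship query by walking those edges. It is a minimal in-memory stand-in, not the database used in the project. The class, the node identifiers and the relation names (AUTHORED, MENTIONS) are all hypothetical.

```python
from collections import defaultdict


class SocialGraph:
    """Tiny in-memory labelled graph (hypothetical stand-in for a graph database)."""

    def __init__(self):
        # Index edges in both directions so lookups do not scan every edge.
        self._out = defaultdict(set)  # (source, relation) -> targets
        self._in = defaultdict(set)   # (target, relation) -> sources

    def add_edge(self, source, relation, target):
        self._out[(source, relation)].add(target)
        self._in[(target, relation)].add(source)

    def targets(self, source, relation):
        return set(self._out.get((source, relation), ()))

    def sources(self, target, relation):
        return set(self._in.get((target, relation), ()))


def users_posting_about(graph, topic):
    """Follow MENTIONS edges back to posts, then AUTHORED edges back to users."""
    posts = graph.sources(topic, "MENTIONS")
    return {user for post in posts for user in graph.sources(post, "AUTHORED")}


# Hypothetical sample data: three posts by three users, two of them about coffee.
graph = SocialGraph()
graph.add_edge("user:alice", "AUTHORED", "post:1")
graph.add_edge("user:bob", "AUTHORED", "post:2")
graph.add_edge("user:carol", "AUTHORED", "post:3")
graph.add_edge("post:1", "MENTIONS", "topic:coffee")
graph.add_edge("post:2", "MENTIONS", "topic:coffee")
graph.add_edge("post:3", "MENTIONS", "topic:tea")

coffee_users = users_posting_about(graph, "topic:coffee")
```

Here the two-hop query (topic to posts to users) is a pair of index lookups, which is the kind of traversal a graph-oriented store is designed to make cheap.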
I used self-organising feature maps, an unsupervised machine learning technique, to cluster users based on a variety of attributes, including language usage, interests, mutual friends, emoticon usage, sentiment, and account activity. In addition, I trained a one-class support vector machine to distinguish ordinary users from brands and celebrities. This allowed me to exclude advertisements and other irrelevant content from my training datasets.
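To make the clustering step more concrete, the sketch below trains a very small self-organising map in plain Python and assigns each user to the map node whose weight vector is closest (the best-matching unit). It is a minimal illustration of the general technique, not the project's implementation. The function names, the two-feature vectors, the 2x2 grid and the linear decay schedule are all assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((wi - xi) ** 2 for wi, xi in zip(weights[node], x)),
    )


def train_som(samples, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small self-organising map and return {grid coordinate: weight vector}."""
    rng = random.Random(seed)
    dim = len(samples[0])
    sigma0 = max(rows, cols) / 2.0
    # One weight vector per grid node, initialised uniformly in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        order = list(samples)
        rng.shuffle(order)
        for x in order:
            # Learning rate and neighbourhood radius shrink linearly over time.
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.5)
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Nodes near the winner on the grid are pulled towards x more strongly.
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
            step += 1
    return weights


# Hypothetical, already-normalised features: [emoticon rate, posting activity].
quiet_users = [[0.10, 0.12], [0.08, 0.15], [0.12, 0.09]]
active_users = [[0.90, 0.88], [0.92, 0.85], [0.87, 0.91]]

som = train_som(quiet_users + active_users, rows=2, cols=2)
quiet_nodes = {best_matching_unit(som, u) for u in quiet_users}
active_nodes = {best_matching_unit(som, u) for u in active_users}
```

Users whose best-matching units coincide or sit next to each other on the grid can be treated as one cluster. In practice the feature vectors would carry many more attributes and the grid would be considerably larger.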
Once users have been clustered, it is possible to identify and extract more specific personal details and personality traits for each cluster, such as gender, average age, location, profile activity, and shared interests. I am particularly interested in how this information can inform targeted advertising and marketing campaigns. I intend to build a web application that helps brands identify their target demographics by visualising the OSN data surrounding a topic or their brand name.