Adrian Graap, Johannes Herkner, Oliver Schmidtke - Data Science Semester Project

Analysis of a web forum using the example of the Arch Linux forum

This project analyzes the structure of an online forum, using the Arch Linux forum as an example. To this end, we analyzed the discussions and the user groups participating in them. To collect the data, we developed a web crawler using the Python library PyQuery and stored the data in a MongoDB document collection using PyMongo. In total, we collected 107892 discussions. Because of the large number of discussions, we were not able to fetch all of them and restricted the two largest topics to discussions starting from 2014.

Using NMF (Non-Negative Matrix Factorization) as provided by scikit-learn, we were able to extract 20 "meta topics" from the collected discussions; these overlap with many of the predefined topics of the forum. Time-dependent changes in the relevance of individual topics are also visible. Building on these results, we created a network graph of the keywords shared between the "meta topics" to identify connections between them. Future research could examine the relationships between the "meta topics" in more detail.

We used various approaches to cluster the users of the forum. Our goal was to find users who frequently interact with each other. Among other things, we represented the users with a TF-IDF index and then applied K-means and DBSCAN clustering. Most users were assigned to one large cluster, which indicates that there are no clearly separated groupings in the forum. Another idea was to use a shopping cart analysis (market basket analysis) to find pairs of users, but these results had very low support and precision, with only groups of 3 people. This could be due to the large number of users or to the fact that new, seemingly random groups of users form for each discussion, without fixed compositions.
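The shopping cart idea can be illustrated with a small sketch in which each discussion is treated as a "basket" of users. The toy data and helper names are hypothetical, and the sketch computes the standard support and rule confidence measures (confidence may be what the text above refers to as precision); it is not the project's actual implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each discussion is the set of users who posted in it.
discussions = [
    {"alice", "bob", "carol"},
    {"alice", "bob"},
    {"bob", "dave"},
    {"alice", "carol"},
    {"erin"},
]

user_counts = Counter()
pair_counts = Counter()
for users in discussions:
    user_counts.update(users)
    # Sort so that each pair has a single canonical key.
    pair_counts.update(combinations(sorted(users), 2))

n = len(discussions)


def support(pair):
    """Fraction of all discussions in which both users appear."""
    return pair_counts[tuple(sorted(pair))] / n


def confidence(antecedent, consequent):
    """Share of the antecedent's discussions that also contain the consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / user_counts[antecedent]


for pair, count in pair_counts.most_common(3):
    print(pair, count, round(support(pair), 2))
```

Because support divides by the total number of discussions, a pair that shares even dozens of threads still has tiny support when the collection holds on the order of 100,000 discussions, which is consistent with the low values reported above.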

We also analyzed the behavior of the users in the forum and found, among other things, that users are most active at 3 p.m. and 9 p.m., while there are local drops in activity at 6 p.m. and 4 a.m. The forum appears to have been used most actively between 2008 and 2012. Throughout the year, activity is about the same, but it decreases sharply on New Year's Eve. Most users are active in only one or two topics, and most users have written around 500 comments, with a few extremely active users whose comment counts reach into the 10,000s.
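An hour-of-day activity profile like the one described above can be computed with a simple count over post timestamps. The timestamp format and the sample values below are assumptions for illustration; in practice the result also depends on which time zone the stored timestamps refer to.

```python
from collections import Counter
from datetime import datetime

# Hypothetical post timestamps; the format is an assumption.
timestamps = [
    "2014-03-01 15:02:11",
    "2014-03-01 15:47:30",
    "2014-03-01 21:05:00",
    "2014-03-02 04:12:45",
    "2014-03-02 15:30:00",
]

# Count posts per hour of day (0-23), ignoring the date.
posts_per_hour = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour for ts in timestamps
)

# Dense 24-slot profile so that hours without posts show up as zero.
hourly = [posts_per_hour.get(h, 0) for h in range(24)]
peak_hour = max(range(24), key=hourly.__getitem__)

print("posts per hour:", hourly)
print("most active hour:", peak_hour)
```

The same pattern works for yearly or day-of-year profiles by grouping on `.year` or on `(month, day)` instead of `.hour`.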

Finally, we created a relationship graph between users who were involved in several discussions using NetworkX and Bokeh, but this approach requires further investigation. For future research, our data could be loaded into a graph database to enable easier analysis and visualization of user relationships in the web forum.
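One way to build such a relationship graph is sketched below with NetworkX only (the Bokeh visualization is left out). The toy data, the use of shared discussions as edge weights, and the `min_shared` threshold are assumptions for illustration rather than the project's actual construction.

```python
from itertools import combinations

import networkx as nx

# Hypothetical toy data: discussion id -> users who posted in it.
discussions = {
    "t1": ["alice", "bob", "carol"],
    "t2": ["alice", "bob"],
    "t3": ["bob", "dave"],
    "t4": ["alice", "bob", "dave"],
}

# Edge weight = number of discussions in which both users took part.
graph = nx.Graph()
for users in discussions.values():
    for u, v in combinations(sorted(set(users)), 2):
        if graph.has_edge(u, v):
            graph[u][v]["weight"] += 1
        else:
            graph.add_edge(u, v, weight=1)

# Keep only pairs that met in several discussions (threshold is an assumption).
min_shared = 2
strong = nx.Graph()
strong.add_edges_from(
    (u, v, data) for u, v, data in graph.edges(data=True)
    if data["weight"] >= min_shared
)

for u, v, data in strong.edges(data=True):
    print(u, v, data["weight"])
```

Filtering by a minimum number of shared discussions keeps the graph small enough to draw, at the cost of dropping users who only interact occasionally.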