Christian Dillage, Brian C. Rodorff - Data Science (FB5) Semester Project

Citation Network Analysis with the Open Academic Graph

Data science projects today often have to cope with very large amounts of data. In our case, this was the Open Academic Graph dataset, provided by Microsoft. Among other things, it contains data on scientific publications, their authors, and their venues. According to the official homepage, the graph is "a large knowledge graph unifying two billion-scale academic graphs". The unpacked data totals over 300 GB. We were also drawn to this dataset because graph data is comparatively easy to visualize.
We decided to focus on the AMiner part of the dataset. We quickly discovered that data cleaning and preparation is a very time-consuming task, and over the course of the project we shifted our focus more and more toward this part of the data science pipeline. After finishing the data pre-processing, we loaded the resulting dataset into the Neo4j graph database for initial data exploration. In addition, we used Gephi to generate visualizations of the citation graph. In the future, the graph will be analyzed further with respect to co-author relationships, research networks, and citation analysis.
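To give a rough idea of what such a pre-processing step can look like, the following is a minimal sketch and not the pipeline actually used in the project. It assumes the paper records come as one JSON object per line and carry the fields `id`, `title`, `year` and `references`; these field names and the function name `extract_citations` are assumptions made for illustration. The function streams the records one at a time, so the whole file never has to fit in memory, and writes a node table and an edge table as CSV, a format that graph tools such as Neo4j and Gephi can generally import.

```python
import csv
import io
import json


def extract_citations(lines, nodes_out, edges_out):
    """Stream JSON-lines paper records into a node CSV and an edge CSV.

    Field names ("id", "title", "year", "references") are assumptions
    about the record layout, not taken from the project itself.
    Returns a tuple (papers_written, edges_written, records_skipped).
    """
    node_writer = csv.writer(nodes_out)
    edge_writer = csv.writer(edges_out)
    node_writer.writerow(["id", "title", "year"])
    edge_writer.writerow(["source", "target"])

    papers = edges = skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are ignored, not counted as skipped
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # malformed line: count it and move on
            continue
        if not isinstance(record, dict) or not record.get("id"):
            skipped += 1  # a record without an id cannot become a node
            continue

        paper_id = record["id"]
        node_writer.writerow(
            [paper_id, record.get("title", ""), record.get("year", "")]
        )
        papers += 1

        # One edge per cited paper; a missing or null list means no edges.
        for cited_id in record.get("references") or []:
            edge_writer.writerow([paper_id, cited_id])
            edges += 1
    return papers, edges, skipped


if __name__ == "__main__":
    # Tiny in-memory example instead of the real multi-gigabyte file.
    sample = [
        '{"id": "p1", "title": "A", "year": 2001, "references": ["p2"]}',
        '{"id": "p2", "title": "B"}',
        "this line is not valid JSON",
    ]
    nodes, edge_table = io.StringIO(), io.StringIO()
    print(extract_citations(sample, nodes, edge_table))
    print(edge_table.getvalue())
```

For a real file, the `lines` argument can simply be an open file handle, since iterating over a text file yields one line at a time. Counting skipped records rather than failing on the first bad line is a design choice that suits very large, imperfect inputs.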