Understanding the network of author affiliations.

For those of you who frequent arXiv, you will have noticed that papers published on arXiv also list author affiliation information. I was intrigued by the prospect of scraping together such lists of author affiliations from astronomy-related papers published on arXiv. To start with, I obtained the first 1000 such papers using arXiv's public API — that's the `max_results` field in the query URL in the highlighted code.
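If you want to play with the query yourself, the API takes its parameters in the URL's query string, and large result sets can be paged with the `start` field. A minimal sketch — `build_query_url` is a hypothetical helper of my own naming, not part of the arXiv API:

```python
from urllib.parse import urlencode

API_BASE = 'http://export.arxiv.org/api/query'

def build_query_url(search_query='all:astro', start=0, max_results=1000):
    """Build an arXiv API query URL; page through results by bumping `start`."""
    params = {'search_query': search_query, 'start': start, 'max_results': max_results}
    return API_BASE + '?' + urlencode(params)

# Fetch 1000 results as two pages of 500 each
urls = [build_query_url(start=s, max_results=500) for s in (0, 500)]
```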

The code was written in Python: I used the NetworkX library to generate the graph, and D3.js to visualize it! I first tried visualizing the graph with matplotlib and NetworkX, and then with the graphcanvas library, but neither attempt produced a graph that looks half as good as the one I made with D3!

To be a little more specific, the graph below is an undirected graph that simply shows which nodes are linked to which others! In reality, such a dataset is best represented as an edge-weighted graph, where each edge's weight records how many times that edge appears in the dataset, i.e. how frequently people from the two universities collaborate and publish on arXiv.
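The weighting itself is easy to sketch. Below I count pair co-occurrences with a plain `Counter` over made-up affiliation lists (in NetworkX you'd store the same count as a `weight` attribute on each edge, e.g. `G.add_edge(a, b, weight=w)`):

```python
from collections import Counter
from itertools import combinations

# Toy data: one affiliation list per paper (hypothetical, not scraped)
papers = [
    ['Caltech', 'NASA', 'JHU'],
    ['Caltech', 'NASA'],
    ['Caltech', 'NASA'],
]

edge_weights = Counter()
for affiliations in papers:
    # Count each unordered pair once per paper; sorting makes (A, B) == (B, A)
    for a, b in combinations(sorted(set(affiliations)), 2):
        edge_weights[(a, b)] += 1

# ('Caltech', 'NASA') ends up with weight 3: they co-appear on all three papers
```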

I've been working on and off on this pet project for a while now, and I think I'm going to take a break from it, given that I feel I've reached a safe point I can resume from later on, when I have the time and resources! So you're welcome to explore it, add to it, play around with it and make any additions/suggestions you see fit. If you want to send them to me, you can open an issue in the GitHub repository here. If you are really interested, or bored, or whatever you want to call it, you can read about my thought process behind this pet project and about the progress I made here, here and here.

Last but not least, the embedded code was highlighted using hilite.me.

Enough chit-chat. The graph you see below is interactive: you can drag nodes to move their network around, and hovering over a node will show the name of the university it represents! It is obvious at first glance that there are a lot of small closed networks of 4–6 universities, but you can also see one big cluster with nodes that appear to connect its various parts. Hover over them and you'll recognize some of the big institutions, for example UCSD (the University of California, San Diego), NASA, Caltech, JHU (Johns Hopkins University) and so on.
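One way to check that those "connector" nodes really are the big institutions is to rank nodes by degree, i.e. by how many edges touch them. A minimal sketch over a made-up edge list — with the real graph, NetworkX's `G.degree` gives the same counts:

```python
from collections import Counter

# Hypothetical university pairs, standing in for the scraped edge list
edges = [
    ('Caltech', 'NASA'), ('Caltech', 'JHU'), ('Caltech', 'UCSD'),
    ('NASA', 'JHU'), ('UCSD', 'SDSU'),
]

# Each edge contributes one degree to each of its two endpoints
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# The highest-degree nodes are the hubs stitching the cluster together
hubs = [node for node, d in degree.most_common(2)]
```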

Note: Install the relevant Python packages, change the number of papers you scrape off arXiv by changing the `max_results` field in the query URL, and run the code yourself to generate a larger, more connected graph! I would also like you to pay attention to the charge parameter defined in the force attribute of the JavaScript! Take a look at the page source and you'll understand. The positions of the various nodes are autogenerated by D3, which keeps the graph in equilibrium using a pseudo charged-particles-in-a-box kind of approach: every node carries a charge, which dictates the equilibrium distances between nodes! The reason I'm telling you all this is that when you generate your own graph, a few of the nodes may drift out of the canvas, above or below, and become unreachable! To bring them back into the canvas, you will have to reduce the charge attribute I mentioned appropriately!

import json
from urllib.request import urlopen

import networkx as nx
from networkx.readwrite import json_graph
from bs4 import BeautifulSoup  # bs4 replaces the old BeautifulSoup 3; XML mode needs lxml
import matplotlib.pyplot as plt

# Fetch the first 1000 astronomy-related entries from the arXiv API (an Atom XML feed)
url = 'http://export.arxiv.org/api/query?search_query=all:astro&start=0&max_results=1000'
data = urlopen(url).read()
soup = BeautifulSoup(data, 'xml')

entries = soup.find_all('entry')

# Keep one list of affiliations per paper that reports any
affiliation_lists = []
for entry in entries:
    affiliations = [tag.string for tag in entry.find_all('arxiv:affiliation')]
    if affiliations:
        affiliation_lists.append(affiliations)

G = nx.Graph()

# Link every pair of affiliations that appear together on the same paper
for affiliations in affiliation_lists:
    if len(affiliations) > 5:
        print(affiliations)
        for pos, node1 in enumerate(affiliations):
            for node2 in affiliations[pos + 1:]:  # pos + 1 skips node1 itself, avoiding self-loops
                G.add_edge(node1, node2)

for n in G:
    G.nodes[n]['name'] = n

# Quick preview of the graph with matplotlib
pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos, node_size=100, node_color='blue')
nx.draw_networkx_edges(G, pos, edge_color='green')
nx.draw_networkx_labels(G, pos, font_color='red')

# Dump the graph as node-link JSON for the D3.js visualization
d = json_graph.node_link_data(G)
with open('force.json', 'w') as f:
    json.dump(d, f)

plt.axis('off')
plt.show()