class: logo-slide --- class: title-slide ## NetworkX ### Applications of Data Science - Class 8 ### Giora Simchoni #### `gsimchoni@gmail.com and add #dsapps in subject` ### Stat. and OR Department, TAU ### 2020-03-01 --- layout: true <div class="my-footer"> <span> <a href="https://dsapps-2020.github.io/Class_Slides/" target="_blank">Applications of Data Science </a> </span> </div> --- class: section-slide # Why NetworkX? --- ## Pros and Cons of `NetworkX` Pros: - Maintained! - Friendly: Community, Docs, Installation (as opposed to `igraph`) - Relatively easy interface (as opposed to `igraph`) - Speaks pandas, numpy and other important python tools - Nodes can be anything, including python objects Cons: - Slow, probably not for large networks (as opposed to `igraph`, `graph-tool`) - Unimpressive viz (as opposed to Gephi, ggraph, others) - Not the most comprehensive algorithms selection --- class: section-slide # Creating a Graph(): Manually --- ## Four types of Graph objects Undirected: ```python import networkx as nx G = nx.Graph() ``` Directed: ```python G = nx.DiGraph() ``` Multiedge undirected: ```python G = nx.MultiGraph() ``` Multiedge directed: ```python G = nx.MultiDiGraph() ``` --- ## Add Nodes One or more nodes: ```python G = nx.Graph() G.add_node('John') G.add_nodes_from(['Paul', 'George', 'Ringo']) ``` Watch a Graph's nodes: ```python G.nodes ``` ``` ## NodeView(('John', 'Paul', 'George', 'Ringo')) ``` Print nicer as a list: ```python list(G.nodes) ``` ``` ## ['John', 'Paul', 'George', 'Ringo'] ``` --- ## Add Edges One or more edges: ```python G.add_edge('John', 'Paul') G.add_edges_from([('Paul', 'George'), ('John', 'Ringo')]) ``` Surprisingly, this would also work: ```python G.add_edge('John', 'Brian') ``` Watch a Graph's edges: ```python print(list(G.edges)) ``` ``` ## [('John', 'Paul'), ('John', 'Ringo'), ('John', 'Brian'), ('Paul', 'George')] ``` --- ## Remove nodes or edges ```python G.remove_node('Brian') ``` Did `NetworkX` bother to remove the edge as well? ```python print(list(G.edges)) ``` ``` ## [('John', 'Paul'), ('John', 'Ringo'), ('Paul', 'George')] ``` Remove an edge, remove from a list: ```python G.remove_edge('Paul', 'John') # why did it work? G.remove_nodes_from(['John', 'George']) ``` Break the Beatles altogether: ```python G.clear() ``` --- ## Good to know - Adding an edge also adds its nodes if they didn't exist before - Adding a duplicate node/edge is ignored unless `MultiGraph()` - Removing an edge does not remove its nodes - Removing a node removes all edges to/from it - Removing a non-existent node/edge raises an error unless it is part of a list in which case - ignored --- ## Nodes and Edges Attributes While adding to Graph: ```python G.add_node('Ringo', alive=True) G.add_nodes_from([('George', {'alive': False}), ('John', {'alive': False})]) # can also do: G.add_nodes_from(['George', 'John'], alive=False) G.add_edge('John', 'Paul', year=1957) G.add_edges_from([('Ringo', 'John'), ('Ringo', 'Paul'), ('Ringo', 'George')], year=1960) ``` Watch attributes with the `nodes()` and `edges()` methods: ```python print(G.nodes(data=True)) ``` ``` ## [('Ringo', {'alive': True}), ('George', {'alive': False}), ('John', {'alive': False}), ('Paul', {})] ``` ```python print(G.nodes(data='alive')) ``` ``` ## [('Ringo', True), ('George', False), ('John', False), ('Paul', None)] ``` --- ## Nodes and Edges Attributes Setting an attribute of an existing node/edge through the `nodes` and `edges` dictionary attributes: ```python print(G.nodes['Paul']) ``` ``` ## {} ``` ```python G.nodes['Paul']['alive'] = True #? ``` ```python G.add_edge('John', 'George') G.edges[('John', 'George')]['year'] = 1958 ``` As a dictionary you can also delete an attribute this way: ```python del G.nodes['Paul']['alive'] del G.edges[('John', 'George')]['year'] ``` --- ## Nodes and Edges Attributes Finally you can set an attribute from a simple dictionary: ```python year_met = { ('John', 'Paul'): 1957, ('Paul', 'Ringo'): 1960, ('Paul', 'George'): 1958, ('John', 'George'): 1958 } nx.set_edge_attributes(G, year_met, 'year') ``` Or multiple attributes with a nested dictionary: ```python node_attrs = { 'John': {'alive': False, 'born': 1940}, 'Paul': {'alive': True, 'born': 1942}, 'George': {'alive': False, 'born': 1943}, 'Ringo': {'alive': True, 'born': 1940} } nx.set_node_attributes(G, node_attrs) ``` --- ### Good to have The edge attribute `weight` is so important it got its own method: ```python G = nx.Graph() G.add_weighted_edges_from([(0, 1, 3.0), (1, 2, 7.5)]) ``` Quick methods to know: ```python print(G.number_of_nodes()) ``` ``` ## 3 ``` ```python print(G.number_of_edges()) ``` ``` ## 2 ``` ```python print(G.is_directed()) ``` ``` ## False ``` --- ### Good to have ```python print(G.has_node(3)) ``` ``` ## False ``` ```python print(G.has_edge(0, 1)) ``` ``` ## True ``` ```python print(G.subgraph([0, 1]).number_of_edges()) ``` ``` ## 1 ``` ```python print(list(G.neighbors(1))) ``` ``` ## [0, 2] ``` --- ## What about `DiGraph()`? Convert an undirected graph to a directed graph: ```python D = nx.DiGraph(G) ``` Edges direction matters: ```python D.add_edge(2, 3, weight=10.0) print(D.has_edge(2, 3)) ``` ``` ## True ``` ```python print(D.has_edge(3, 2)) ``` ``` ## False ``` Other than that... --- class: section-slide # Visualize a Graph(): matplotlib --- ### `draw()`: The very basic ```python import matplotlib.pyplot as plt G = nx.erdos_renyi_graph(n=5, p=0.6) # a.k.a binomial_graph() nx.draw(G) plt.show() ``` <img src="images/Basic-1.png" width="45%" /> --- ### `draw_networkx()`: Some more options ```python nx.draw_networkx(G, node_size=800, node_color='red', edge_color='pink', width=5, font_size=16) plt.show() ``` <img src="images/Basic2-1.png" width="45%" /> --- You may like the unpacking a dictionary option better: ```python options = { 'node_color': 'black', 'node_size': 100, 'width': 3, } nx.draw_networkx(G, **options) plt.show() ``` <img src="images/Basic3-1.png" width="45%" /> --- ### Layouts Galore ```python plt.subplot(121) limits = plt.axis('off') nx.draw_networkx(G, pos=nx.kamada_kawai_layout(G)) plt.subplot(122) limits = plt.axis('off') nx.draw_networkx(G, pos=nx.fruchterman_reingold_layout(G)) plt.show() ``` <img src="images/Basic4-1.png" width="45%" /> --- ### Use a Layout directly ```python nx.draw_kamada_kawai(G) plt.show() ``` <img src="images/Basic5-1.png" width="45%" /> --- ### For nicer graphs go to... - [Grave](https://networkx.github.io/grave/)? - [nxviz](https://nxviz.readthedocs.io/en/latest/index.html)? - [Plotly](https://plot.ly/python/) (interactive graphs for Python or R) - [Gephi](https://gephi.org/) (a Desktop app, neither R nor Python but both export to Gephi format) - [d3](https://d3js.org/) (JS delight) - [ggraph](https://ggraph.data-imaginist.com/) (R)? --- class: section-slide # Create a Graph(): In Real Life --- ### Numpy ```python import numpy as np A = np.array([ [0, 1, 1, 0, 0], [1, 0, 1, 0, 1], [1, 1, 0, 0, 1], [0, 0, 0, 0, 0], [0, 1, 1, 0, 0]]) G = nx.from_numpy_matrix(A) nx.draw_networkx(G, node_size=0, font_size=40) plt.show() ``` <img src="images/Numpy-1.png" width="30%" /> --- ### Edgelist - a list of edges ```python beatles_edgelist = [('John', 'Paul', {'weight': 1.0}), ('John', 'George', {'weight': 0.5}), ('John', 'Ringo', {'weight': 0.5})] G = nx.from_edgelist(beatles_edgelist) weights = [10 * attrs['weight'] for u, v, attrs in G.edges(data=True)] nx.draw_networkx(G, width=weights) plt.show() ``` <img src="images/Beatles1-1.png" width="40%" /> --- ### Edgelist - a file This is how the `beatles.edgelist` file looks like: ```python John Paul 1 John George 0.5 John Ringo 0.5 ``` Read it with `read_edgelist()` or `read_weighted_edgelist()`: ```python with open('../data/beatles.edgelist', 'rb') as edgelist_file: G = nx.read_edgelist(edgelist_file, data=(('weight', float),)) print(G.edges(data=True)) ``` ``` ## [('John', 'Paul', {'weight': 1.0}), ('John', 'George', {'weight': 0.5}), ('John', 'Ringo', {'weight': 0.5})] ``` ```python with open('../data/beatles.edgelist', 'rb') as edgelist_file: G = nx.read_weighted_edgelist(edgelist_file) print(G.edges(data=True)) ``` ``` ## [('John', 'Paul', {'weight': 1.0}), ('John', 'George', {'weight': 0.5}), ('John', 'Ringo', {'weight': 0.5})] ``` --- ### CSV This is how the `beatles.csv` file looks like: ```python John,Paul,1957,1.0 John,George,1958,0.5 John,Ringo,1960,0.5 ``` You can still use `read_edgelist()` to read it: ```python with open('../data/beatles.csv', 'rb') as edgelist_file: G = nx.read_edgelist(edgelist_file, delimiter=',', data=(('year', int),('weight', float),)) print(G.edges(data=True)) ``` ``` ## [('John', 'Paul', {'year': 1957, 'weight': 1.0}), ('John', 'George', {'year': 1958, 'weight': 0.5}), ('John', 'Ringo', {'year': 1960, 'weight': 0.5})] ``` Or, you can just use Pandas `read_csv()` and... --- ### Pandas ```python import pandas as pd beatles_df = pd.read_csv('../data/beatles.csv', header=None) beatles_df.columns = ['source', 'target', 'year', 'weight'] G = nx.from_pandas_edgelist(beatles_df, 'source', 'target', ['year', 'weight']) nx.draw_networkx(G) plt.show() ``` <img src="images/Beatles2-1.png" width="40%" /> --- ### What about writing? All methods we've seen have a `write_` or `to_` complementary function: ```python print(nx.to_edgelist(G)) ``` ``` ## [('John', 'Paul', {'year': 1957, 'weight': 1.0}), ('John', 'George', {'year': 1958, 'weight': 0.5}), ('John', 'Ringo', {'year': 1960, 'weight': 0.5})] ``` ### What about `DiGraph()`? All methods we've seen have a `create_using` parameter: ```python D = nx.from_pandas_edgelist(beatles_df, 'source', 'target', ['year', 'weight'], create_using=nx.DiGraph) ``` .warning[ ⚠️ When creating `DiGraph()` NetworkX assumes `\(A_{ij}\)` corresponds to the edge from i to j, contrary to our convention. Therefore, you should use `A.transpose()` ] --- ### Bipartite For creating a bipartite network, use the `bipartite` node attribute: ```python G = nx.Graph() G.add_nodes_from([1, 2, 3, 4], bipartite=0) G.add_nodes_from(['a', 'b', 'c'], bipartite=1) G.add_edges_from([(1, 'a'), (1, 'b'), (2, 'b'), (2, 'c'), (3, 'c'), (4, 'a')]) nx.draw_networkx(G, pos=nx.bipartite_layout(G, [1, 2, 3, 4]), node_size=0, font_size=20) plt.show() ``` <img src="images/Bipartite1-1.png" width="40%" /> It is your responsibility though to check that indeed the two sets of nodes are not connected. --- "In real life" you would probably have something like the Marvel incidence matrix in a CSV, what `NetworkX` calls a *biadjacency_matrix*, or a list of edges you could make a Scipy *sparse matrix* with, from which you can create a bipartite graph: ```python from scipy import sparse marvel = pd.read_csv("../data/marvel_incidence_matrix.csv") B = marvel.iloc[:, 1:].values sB = sparse.csr_matrix(B) G = nx.bipartite.from_biadjacency_matrix(sB) print(G.nodes(data=True)) ``` ``` ## [(0, {'bipartite': 0}), (1, {'bipartite': 0}), (2, {'bipartite': 0}), (3, {'bipartite': 0}), (4, {'bipartite': 0}), (5, {'bipartite': 0}), (6, {'bipartite': 0}), (7, {'bipartite': 0}), (8, {'bipartite': 0}), (9, {'bipartite': 0}), (10, {'bipartite': 0}), (11, {'bipartite': 0}), (12, {'bipartite': 0}), (13, {'bipartite': 0}), (14, {'bipartite': 0}), (15, {'bipartite': 0}), (16, {'bipartite': 0}), (17, {'bipartite': 0}), (18, {'bipartite': 0}), (19, {'bipartite': 0}), (20, {'bipartite': 0}), (21, {'bipartite': 1}), (22, {'bipartite': 1}), (23, {'bipartite': 1}), (24, {'bipartite': 1}), (25, {'bipartite': 1}), (26, {'bipartite': 1}), (27, {'bipartite': 1}), (28, {'bipartite': 1}), (29, {'bipartite': 1}), (30, {'bipartite': 1}), (31, {'bipartite': 1}), (32, {'bipartite': 1}), (33, {'bipartite': 1}), (34, {'bipartite': 1}), (35, {'bipartite': 1}), (36, {'bipartite': 1})] ``` --- ```python films = marvel['Film'].to_list() characters = marvel.columns[1:].to_list() id_to_label = {id: char for id, char in enumerate(films + characters)} G = nx.relabel_nodes(G, id_to_label) nx.draw_networkx(G, pos=nx.bipartite_layout(G, films), node_size=0, font_size=10) plt.show() ``` <img src="images/Bipartite3-1.png" width="50%" /> --- You've seen how to project an incidence matrix with numpy: ```python P = B.transpose() @ B np.fill_diagonal(P, 0) Gc = nx.from_numpy_matrix(P) id_to_char = {id: char for id, char in enumerate(characters)} Gc = nx.relabel_nodes(Gc, id_to_char) nx.draw_networkx(Gc, node_size=0, font_size=10) plt.show() ``` <img src="images/Bipartite4-1.png" width="50%" /> --- But NetworkX can do all that for you: ```python Gc = nx.bipartite.projected_graph(G, characters) nx.draw_networkx(Gc, node_size=0, font_size=10) plt.show() ``` <img src="images/Bipartite5-1.png" width="50%" /> --- ### Similarity Let's recreate the sci-fi similarity network. We'll do some filtering: - The original CSV includes 156 books, we'll take only those marked as "popular" - Take only books with at least 2 features marked > 0 - Won't be looking at the author's name, gender, or date of release ```python scifi = pd.read_csv('../data/sci_fi_books.csv', index_col='book') scifi['n_features'] = (scifi.iloc[:, 6:17] > 0).sum(axis=1) scifi = scifi[scifi['popular'] == 1] scifi_fil = scifi[scifi['n_features'] >= 2] cols_to_drop = ['date', 'author', 'frequency', 'author_gender', 'quarter_century', 'century', 'popular', 'n_features'] scifi_fil = scifi_fil.drop(cols_to_drop, axis=1) # to use Pandas corr function need to transpose data frame scifi_corr = scifi_fil.T.corr(method='spearman') ``` --- We reached a 25x25 symmetric correlation matrix for 25 books: ```python scifi_corr.values[:5, :5] ``` ``` ## array([[ 1. , 0.62769528, 0.2975283 , 0.69310571, -0.2111842 ], ## [ 0.62769528, 1. , 0.68760573, 0.3303204 , 0.23530427], ## [ 0.2975283 , 0.68760573, 1. , 0.0037037 , 0.48989795], ## [ 0.69310571, 0.3303204 , 0.0037037 , 1. , -0.04082483], ## [-0.2111842 , 0.23530427, 0.48989795, -0.04082483, 1. ]]) ``` Turn all upper traingle to `Nan`, so we'll get only 25 * 24 / 2 correlations: ```python scifi_corr.values[np.triu_indices_from(scifi_corr.values)] = np.nan scifi_corr.values[:5, :5] ``` ``` ## array([[ nan, nan, nan, nan, nan], ## [ 0.62769528, nan, nan, nan, nan], ## [ 0.2975283 , 0.68760573, nan, nan, nan], ## [ 0.69310571, 0.3303204 , 0.0037037 , nan, nan], ## [-0.2111842 , 0.23530427, 0.48989795, -0.04082483, nan]]) ``` --- Convert this to a long edgelist, filter out the `Nan`: ```python scifi_edgelist = scifi_corr.reset_index().melt(id_vars='book', value_vars=scifi_fil.index, var_name='book2', value_name='corr') scifi_edgelist = scifi_edgelist[~pd.isna(scifi_edgelist['corr'])] print(scifi_edgelist.shape) ``` ``` ## (300, 3) ``` ```python scifi_edgelist.head(5) ``` ``` ## book book2 corr ## 1 Twenty Thousand Leagues Under the Sea Frankenstein 0.627695 ## 2 Journey to the Center of the Earth Frankenstein 0.297528 ## 3 Brave New World Frankenstein 0.693106 ## 4 The Lion, the Witch and the Wardrobe Frankenstein -0.211184 ## 5 I, Robot Frankenstein 0.700617 ``` --- Filter all correlations above 0.5 and convert to a `Graph()`: ```python G = nx.from_pandas_edgelist(scifi_edgelist[scifi_edgelist['corr'] > 0.5], 'book', 'book2', ['corr']) nx.draw_networkx(G) limits = plt.axis('off') plt.show() ``` <img src="images/Sci-Fi-1.png" width="45%" /> --- class: section-slide # Let NetworkX work --- ### Adjacency Matrix You can get a Numpy matrix with `to_numpy_matrix()`: ```python A = nx.to_numpy_matrix(G) print(A.shape) ``` ``` ## (23, 23) ``` ```python A[:4, :4] ``` ``` ## matrix([[0., 1., 0., 0.], ## [1., 0., 1., 1.], ## [0., 1., 0., 1.], ## [0., 1., 1., 0.]]) ``` .insight[ 💡 Say, didn't we have 25 books? ] --- But a realistic network would be large and sparse so the `adjacency_matrix()` method returns a SciPy sparse matrix: ```python A = nx.adjacency_matrix(G) print(A.shape) ``` ``` ## (23, 23) ``` ```python print(A[:4, :4]) ``` ``` ## (0, 1) 1 ## (1, 0) 1 ## (1, 2) 1 ## (1, 3) 1 ## (2, 1) 1 ## (2, 3) 1 ## (3, 1) 1 ## (3, 2) 1 ``` .insight[ 💡 Do you know what a sparse matrix is? ] --- ### Easy to check if... ```python print(nx.is_directed(G)) ``` ``` ## False ``` ```python print(nx.is_directed_acyclic_graph(G)) ``` ``` ## False ``` ```python print(nx.is_bipartite(G)) ``` ``` ## False ``` ```python print(nx.is_connected(G)) ``` ``` ## True ``` --- ### Degree and Density Degree is a big deal, there's an attribute *and* a method for that: ```python G.degree ``` ``` ## DegreeView({'Twenty Thousand Leagues Under the Sea': 4, 'Frankenstein': 5, 'Brave New World': 5, 'I, Robot': 6, 'Solaris': 5, 'The Hunger Games': 5, 'Journey to the Center of the Earth': 5, 'Flowers for Algernon': 2, 'The Dune Chronicles': 3, 'The Princess Bride': 6, 'A Song of Ice and Fire': 6, 'Harry Potter Series': 5, '2001: A Space Odyssey': 6, "Ender's Game": 6, 'American Gods': 1, 'The Lord of the Rings': 5, 'The Lion, the Witch and the Wardrobe': 3, "The Hitchhiker's Guide to the Galaxy": 6, 'Contact': 6, 'Watership Down': 3, 'Babel 17': 4, 'World War Z': 2, "The Handmaid's Tale": 3}) ``` ```python list(G.degree) ``` ``` ## [('Twenty Thousand Leagues Under the Sea', 4), ('Frankenstein', 5), ('Brave New World', 5), ('I, Robot', 6), ('Solaris', 5), ('The Hunger Games', 5), ('Journey to the Center of the Earth', 5), ('Flowers for Algernon', 2), ('The Dune Chronicles', 3), ('The Princess Bride', 6), ('A Song of Ice and Fire', 6), ('Harry Potter Series', 5), ('2001: A Space Odyssey', 6), ("Ender's Game", 6), ('American Gods', 1), ('The Lord of the Rings', 5), ('The Lion, the Witch and the Wardrobe', 3), ("The Hitchhiker's Guide to the Galaxy", 6), ('Contact', 6), ('Watership Down', 3), ('Babel 17', 4), ('World War Z', 2), ("The Handmaid's Tale", 3)] ``` Average degree: ```python np.mean(list(dict(G.degree).values())) ``` ``` ## 4.434782608695652 ``` --- Get the degree of a specific node: ```python G.degree['Contact'] ``` ``` ## 6 ``` Get the density of a network: ```python nx.density(G) ``` ``` ## 0.2015810276679842 ``` For directed graphs, use `in_degree` and `out_degree`: ```python D = nx.DiGraph(G) D.in_degree() D.out_degree() ``` --- ### Paths and Diameter Is there a path between two nodes? ```python nx.has_path(G, 'Frankenstein', 'Contact') ``` ``` ## True ``` What is the shortest path distance between two nodes: ```python nx.shortest_path_length(G, 'Frankenstein', 'Contact') ``` ``` ## 2 ``` Give me (one, there could be more) shortest path: ```python nx.shortest_path(G, 'Frankenstein', 'Contact') ``` ``` ## ['Frankenstein', 'I, Robot', 'Contact'] ``` --- Give me all shortest paths between two nodes: ```python all_sps = nx.all_shortest_paths(G, 'Frankenstein', 'Contact') # generator! list(all_sps) ``` ``` ## [['Frankenstein', 'I, Robot', 'Contact'], ['Frankenstein', 'Solaris', 'Contact']] ``` Get the diameter of a network: ```python nx.diameter(G) ``` ``` ## 5 ``` --- ### Components How many connected components? ```python nx.number_connected_components(G) ``` ``` ## 1 ``` All connected components, each is a set of nodes: ```python cc = nx.connected_components(G) # generator! list(cc) ``` ``` ## [{'Solaris', '2001: A Space Odyssey', 'The Hunger Games', 'The Princess Bride', 'Journey to the Center of the Earth', 'The Lion, the Witch and the Wardrobe', "The Hitchhiker's Guide to the Galaxy", 'The Dune Chronicles', 'Frankenstein', 'Brave New World', "Ender's Game", 'Babel 17', 'Watership Down', 'I, Robot', "The Handmaid's Tale", 'American Gods', 'World War Z', 'Twenty Thousand Leagues Under the Sea', 'Harry Potter Series', 'Contact', 'The Lord of the Rings', 'Flowers for Algernon', 'A Song of Ice and Fire'}] ``` Get component of a specific node: ```python nx.node_connected_component(G, 'Contact') ``` ``` ## {'Solaris', '2001: A Space Odyssey', 'The Hunger Games', 'The Princess Bride', 'Journey to the Center of the Earth', 'The Lion, the Witch and the Wardrobe', "The Hitchhiker's Guide to the Galaxy", 'The Dune Chronicles', 'Frankenstein', 'Brave New World', "Ender's Game", 'Babel 17', 'Watership Down', 'I, Robot', "The Handmaid's Tale", 'American Gods', 'World War Z', 'Twenty Thousand Leagues Under the Sea', 'Harry Potter Series', 'Contact', 'The Lord of the Rings', 'Flowers for Algernon', 'A Song of Ice and Fire'} ``` --- Get largest component (a.k.a Giant Connected Component) ```python gcc = max(nx.connected_components(G), key=len) len(gcc) ``` ``` ## 23 ``` For directed graphs use `strongly_`/`weakly_` prefixes with previous functions: ```python D = nx.DiGraph(G) nx.number_strongly_connected_components(D) nx.strongly_connected_components(D) ``` --- ### Laplacian ```python L = nx.laplacian_matrix(G) L.todense()[:5, :5] ``` ``` ## matrix([[ 4, -1, 0, 0, 0], ## [-1, 5, -1, -1, -1], ## [ 0, -1, 5, -1, 0], ## [ 0, -1, -1, 6, -1], ## [ 0, -1, 0, -1, 5]], dtype=int32) ```