Mapping Wikileaks' Cablegate topics using Python, MongoDB, Neo4j and Gephi

Paris.pm, March 16th 2011

Goals

  • to analyse cables' full-text, not using meta-data as a structure

  • to produce occurrence networks of bi-grams and cables

  • to visualize how the discussions within the cables are composed and relate to each other

A preview of the result

How ?

  • analyse the ~5500 released cables with Python, and a set of productive libraries (NLTK)

  • use MongoDB and Neo4j for document and network storage

  • explore the network with Gephi

Import cables to MongoDB

  • Fortunately, Wikileaks' archive follows a simple structure, making the data hackers' job easy !

  • first used BeautifulSoup then switched to Cablemap

    • based on regular expressions and very fast
    • automatically decodes to unicode, a Python text type with the same API as str

  • NLTK : clean_html() and you're done

    • but re-encoding still has to be handled manually
    • Python provides safe built-in functions for that : unicode(), encode()

  • MongoDB : document storage

    • records are inserted and read transparently as Python dicts, a hash-map type without type constraints
    • built-in types are serialized/deserialized automatically : unicode, nested lists and dicts, datetime, reg-exp objects...
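The re-encoding step and the record shape can be sketched as follows. This is a minimal illustration written for Python 3, where `bytes.decode` and `str.encode` play the role of Python 2's `unicode()` and `encode`. The helper name `to_text`, the order of candidate encodings and the record fields are assumptions, not the talk's actual code.

```python
from datetime import datetime

def to_text(raw, encodings=("utf-8", "latin-1")):
    """Decode raw bytes to text, trying each candidate encoding in turn."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: drop undecodable bytes rather than fail.
    return raw.decode("utf-8", "ignore")

# Hypothetical cable record: a plain dict, the shape a MongoDB driver
# accepts as a document (nested lists/dicts, datetime, text).
raw_body = "Réunion with the minister".encode("latin-1")
cable = {
    "_id": "EXAMPLE-ID",  # placeholder identifier
    "date": datetime(2011, 3, 16),
    "tags": ["example"],
    "content": to_text(raw_body),
}
print(cable["content"])
```

Decoding once at import time means that every later step (tokenizing, stemming, storage) only ever sees text, never raw bytes.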

Extract topics (1) : tokenize

  • decompose text into tokens : nltk.sent_tokenize and nltk.TreebankWordTokenizer

    >>> import nltk
    >>> sentences = nltk.sent_tokenize("WikiLeaks is a non-profit media organization dedicated to bringing important news and information to the public. We provide an innovative, secure and anonymous way for independent sources around the world to leak information to our journalists.")
    >>> sentences
    ['WikiLeaks is a [...] to the public.', 'We provide [...] journalists.']
    
    
    >>> nltk.TreebankWordTokenizer().tokenize(sentences[0])
    ['WikiLeaks', 'is', 'a', 'non-profit', 'media', 'organization', 'dedicated', 'to', 'bringing', 'important', 'news', 'and', 'information', 'to', 'the', 'public', '.']
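Since the goal is a network of bi-grams, a token list like the one above can be turned into bi-grams with a sliding window. A minimal sketch in plain Python 3: the helper name `bigrams` is only an illustration (NLTK ships its own n-gram utilities), and the tokens are copied from the output above.

```python
def bigrams(tokens):
    """Return the list of consecutive token pairs."""
    return list(zip(tokens, tokens[1:]))

tokens = ['WikiLeaks', 'is', 'a', 'non-profit', 'media', 'organization']
for pair in bigrams(tokens):
    print(" ".join(pair))
```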
                            

Extract topics (2) : stemming

  • an easy way to de-duplicate words

  • group words by their radical

  • use nltk.PorterStemmer

  • >>> print nltk.PorterStemmer().stem("language")
    languag
                    
  • compute a sha256 hash (built-in hashlib module) to use as a database index
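One way to derive such an index is sketched below with the standard-library `hashlib` module (Python 3 syntax). Which string gets hashed (here, the stemmed form) and the helper name `topic_id` are assumptions made for the example.

```python
import hashlib

def topic_id(stem):
    """Return the hexadecimal SHA-256 digest of a stem, usable as a key."""
    return hashlib.sha256(stem.encode("utf-8")).hexdigest()

key = topic_id("languag")
print(len(key), key[:8])
```

A hex digest only contains the characters 0-9 and a-f, so it can safely be concatenated with a separator such as `_` to build composite ids.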

Extract topics (3) : part of speech tagging with nltk.tag

  • >>> nltk.tag.pos_tag(['Help','Wikileaks','keep','governments','open'])
    [('Help', 'NNP'), ('Wikileaks', 'NNP'), ('keep', 'VB'),
    ('governments', 'NNS'), ('open', 'JJ')]
    

  • Compose and save a quality tagger

Choose more relevant topics

  • a DIY regular expression for filtering tags, or "useless" n-grams
    ^(
        (VB,|VBD,|VBG,|VBN,|CD.?,|JJ.?,|\?,){0,2}?
        (N.?,|\?,)+?
        (CD.,)?
    )+?$
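A hedged sketch of how this expression might be applied (Python 3). It assumes that the tags of an n-gram are serialized as one string with a comma after every tag, which is what the pattern suggests. The `re.VERBOSE` flag and the helper name `keep_ngram` are likewise assumptions.

```python
import re

# The pattern from the slide, compiled in verbose mode so that the
# line breaks and indentation are ignored.
TAG_FILTER = re.compile(r"""
^(
    (VB,|VBD,|VBG,|VBN,|CD.?,|JJ.?,|\?,){0,2}?
    (N.?,|\?,)+?
    (CD.,)?
)+?$
""", re.VERBOSE)

def keep_ngram(tags):
    """Return True if the tag sequence passes the filter."""
    serialized = "".join(tag + "," for tag in tags)
    return TAG_FILTER.match(serialized) is not None

print(keep_ngram(["JJ", "NN"]))   # adjective + noun
print(keep_ngram(["NN", "NN"]))   # noun + noun
print(keep_ngram(["VB", "DT"]))   # verb + determiner
```

In this sketch the first two sequences pass and the third is rejected, since every accepted n-gram must contain at least one noun-like or unknown (`?`) tag.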
                      

Create the network : from MongoDB to Neo4j (1)

  • Writing to MongoDB :

    • (mis-)used as key/value storage : update and modifiers are the key to success

      mongodb.cooc.update({'_id': some_id}, {"$inc":{"value":1}})
                          

    • compose id patterns to organize records : the edge example

      mongodb.cooc.save({'_id': node_source_id +"_"+ node_target_id, "value":1})
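The counting logic behind these two calls can be sketched without a database. In the Python 3 example below, a `collections.Counter` stands in for the `cooc` collection, and incrementing a key plays the role of `$inc`. The helper names and the sample ids are hypothetical, and storing each pair once under sorted ids is a choice of this sketch, not something the slides state.

```python
from collections import Counter
from itertools import combinations

def edge_id(source_id, target_id):
    """Compose the edge key from its two node ids."""
    return source_id + "_" + target_id

def count_cooccurrences(documents):
    """Count, for each pair of ids, the documents that contain both."""
    cooc = Counter()
    for topic_ids in documents:
        # Sorting makes the pair order deterministic: one key per pair.
        for a, b in combinations(sorted(set(topic_ids)), 2):
            cooc[edge_id(a, b)] += 1
    return cooc

docs = [["aa", "bb", "cc"], ["aa", "bb"], ["bb", "cc"]]
cooc = count_cooccurrences(docs)
print(cooc["aa_bb"], cooc["bb_cc"], cooc["aa_cc"])
```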
                          

Create the network : from MongoDB to Neo4j (2)

  • Querying MongoDB :

    • example : extract the heaviest co-occurrence edges from a node
      mystartswith_regexp = re.compile("^"+mysha256+"_[a-z0-9]+$")
      cooc_curs = mongodb.cooc.find(
          {"_id":{
              "$regex": mystartswith_regexp
          }},
          timeout=False,
          sort=[("value",pymongo.DESCENDING)],
          limit=MAXEDGES)
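What this query computes can be mimicked in memory, which may help when reading it. The Python 3 sketch below runs the same three steps (prefix match on `_id`, descending sort on `value`, limit) over a plain dict standing in for the collection. The sample ids, the data and the function name `heaviest_edges` are made up for the illustration.

```python
import re

def heaviest_edges(cooc, node_id, max_edges):
    """Return the max_edges heaviest (edge_id, value) pairs starting at node_id."""
    starts_with = re.compile("^" + re.escape(node_id) + "_[a-z0-9]+$")
    matching = [(key, value) for key, value in cooc.items()
                if starts_with.match(key)]
    # Heaviest first, like sort=[("value", pymongo.DESCENDING)].
    matching.sort(key=lambda item: item[1], reverse=True)
    return matching[:max_edges]

cooc = {"ab12_cd34": 5, "ab12_ef56": 9, "ab12_0a0b": 1, "cd34_ef56": 7}
print(heaviest_edges(cooc, "ab12", 2))
```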
                            

Create the network : from MongoDB to Neo4j (3)

  • About Neo4j :

    • use the official neo4j.py component (using python-jpype)

    • use wide transactions to reach maximum performance

      with graphdb.transaction as trans:
          node = graphdb.node()
          node[key.encode("ascii","ignore")] = value.encode("ascii","ignore")
                        

    • control the types written to node properties : use ascii

    • Neo4j for Gephi as a plugin : the direct connection
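Two of the points above lend themselves to a small sketch (Python 3, no Neo4j required): forcing property keys and values to ASCII, and grouping many writes so that each group can be committed in one wide transaction. The helper names `to_ascii` and `batches` and the batch size are assumptions. The actual write inside each batch would use the `graphdb` calls shown above.

```python
def to_ascii(text):
    """Drop every non-ASCII character, as encode("ascii", "ignore") does."""
    return text.encode("ascii", "ignore").decode("ascii")

def batches(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

properties = [("label", "café society"), ("label", "central bank"),
              ("label", "naïve policy")]
for batch in batches(properties, 2):
    # One transaction per batch would wrap this inner loop.
    for key, value in batch:
        print(to_ascii(key), "=", to_ascii(value))
```

Note that the `ignore` error handler silently drops accented characters rather than transliterating them.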

Introduction to Gephi

  • An AGPL3 desktop app for visualization of complex networks
  • Dedicated toolset for social network analysis and network map creation
  • Based on Java, NetBeans Platform and OpenGL (JOGL)
  • Also available as headless library: Gephi Toolkit
  • Connects nicely with Jython, JPype...
  • Plugin center, with connectors for Neo4j, SQL, some social network APIs...
  • To learn more: http://gephi.org

Summary of our Gephi workflow

  1. Import network from database

  2. Basic data laboratory features (sort, delete)

  3. Rank category and occurrences to colors

  4. Filter data on graph topology (degree and weight)

  5. Spatialize the network using an OpenOrd layout

  6. Remove artifacts directly from the visualization

  7. Preview the final map, to tweak appearance

  8. Finally export the map to PDF and GEXF
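Step 4 is done interactively in Gephi, but the idea of a topology filter can be illustrated in a few lines of Python 3. The sketch below is not Gephi's implementation: the function name, the thresholds and the sample edges are hypothetical, and degrees are computed only once, after the weight filter.

```python
from collections import Counter

def filter_graph(edges, min_weight, min_degree):
    """Keep edges of sufficient weight whose endpoints have sufficient degree."""
    heavy = [(s, t, w) for s, t, w in edges if w >= min_weight]
    # Degree of each node in the weight-filtered graph.
    degree = Counter()
    for s, t, _ in heavy:
        degree[s] += 1
        degree[t] += 1
    return [(s, t, w) for s, t, w in heavy
            if degree[s] >= min_degree and degree[t] >= min_degree]

edges = [("a", "b", 5), ("a", "c", 4), ("b", "c", 3), ("c", "d", 1), ("a", "e", 6)]
print(filter_graph(edges, min_weight=2, min_degree=2))
```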

To sum up:

  • ~600 lines of GNU/GPL one-shot code, 5 external libraries, 2 databases, and 1 Gephi

  • ~1 full week coding, ~5 hours executing the whole process

  • 2 networks obtained without much science :

    • bi-grams topics and cables, linked by occurrences : 43 179 nodes, 237 058 edges
    • bi-grams topics only, linked by co-occurrences : 39 808 nodes, 177 023 edges

  • 2 maps online, to be explored by all topic-map lovers

  • 1 talk, 1 FOSDEM, 1 big leak ;-)

A zoom on Egypt's cables

A zoom on "Central Bank" neighbourhood

Thanks

Special thanks to Wikileaks