Paris.pm, March 16th 2011
to analyse the cables' full text, without relying on meta-data for structure
to produce occurrence networks of bi-grams and cables
to visualize how the discussions within the cables are composed and relate to each other
analyse the ~5500 released cables with Python, and a set of productive libraries (NLTK)
use MongoDB and Neo4j for document and network storage
explore the network with Gephi
first used BeautifulSoup then switched to Cablemap
NLTK : clean_html() and you're done
MongoDB : document storage
decompose text into tokens : nltk.sent_tokenize and nltk.TreebankWordTokenizer
>>> sentences = nltk.sent_tokenize("WikiLeaks is a non-profit media organization dedicated to bringing important news and information to the public. We provide an innovative, secure and anonymous way for independent sources around the world to leak information to our journalists.")
['WikiLeaks is a [...] to the public.', 'We provide [...] journalists.']
>>> nltk.TreebankWordTokenizer().tokenize(sentences[0])
['WikiLeaks', 'is', 'a', 'non-profit', 'media', 'organization', 'dedicated', 'to', 'bringing', 'important', 'news', 'and', 'information', 'to', 'the', 'public', '.']
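The tokens above are the raw material for the bi-grams used in the networks. A minimal sketch of the pairing step, assuming a bi-gram is simply two adjacent tokens (the helper name `bigrams` is ours; NLTK also ships its own `nltk.bigrams`):

```python
def bigrams(tokens):
    """Pair each token with its successor: [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


tokens = ['WikiLeaks', 'is', 'a', 'non-profit', 'media', 'organization']
print(bigrams(tokens)[:3])
# [('WikiLeaks', 'is'), ('is', 'a'), ('a', 'non-profit')]
```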
an easy way to de-duplicate words
group words by their stem (root form)
use nltk.PorterStemmer
>>> nltk.PorterStemmer().stem("language")
'languag'
compute a sha256 hash with the built-in hashlib module and use it as a database index
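A minimal sketch of the hashing step, assuming the hashed value is the space-joined stems of an n-gram (the slides do not say exactly what is hashed; `ngram_id` is a hypothetical helper):

```python
import hashlib


def ngram_id(stemmed_tokens):
    """Return a hex SHA-256 digest usable as a stable record id.

    Assumption: the n-gram label is its stems joined by single spaces.
    """
    label = " ".join(stemmed_tokens)
    return hashlib.sha256(label.encode("utf-8")).hexdigest()


node_id = ngram_id(["media", "organ"])
print(len(node_id))  # 64 hexadecimal characters
```

The digest only contains the characters 0-9 and a-f, so it can be safely embedded in composed ids and matched with a simple regular expression.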
>>> nltk.tag.pos_tag(['Help','Wikileaks','keep','governments','open'])
[('Help', 'NNP'), ('Wikileaks', 'NNP'), ('keep', 'VB'),
('governments', 'NNS'), ('open', 'JJ')]
NLTK comes with many pre-tagged corpora to learn from. We used 2 corpora (~68000 sentences)
The nltk.tag.SequentialBackoffTagger chains many taggers together
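The backoff idea can be illustrated without NLTK. The sketch below is a pure-Python stand-in, not NLTK's implementation: each tagger answers when it knows the word and otherwise defers to the next one in the chain. The class names `LookupTagger` and `FallbackTagger` and the tiny lookup tables are hypothetical; NLTK's own taggers expose the same idea through their `backoff=` argument.

```python
class FallbackTagger:
    """Last link of the chain: always answers with the same tag."""

    def __init__(self, tag):
        self.default_tag = tag

    def tag_one(self, word):
        return self.default_tag


class LookupTagger:
    """Tags words found in its table, else defers to the backoff tagger."""

    def __init__(self, table, backoff=None):
        self.table = table
        self.backoff = backoff

    def tag_one(self, word):
        if word in self.table:
            return self.table[word]
        if self.backoff is not None:
            return self.backoff.tag_one(word)
        return None  # nobody in the chain could decide

    def tag(self, tokens):
        return [(word, self.tag_one(word)) for word in tokens]


# Most specific tagger first, most general last.
chain = LookupTagger(
    {"Wikileaks": "NNP"},
    backoff=LookupTagger(
        {"keep": "VB", "open": "JJ"},
        backoff=FallbackTagger("NN"),
    ),
)

print(chain.tag(["Wikileaks", "keep", "governments", "open"]))
# [('Wikileaks', 'NNP'), ('keep', 'VB'), ('governments', 'NN'), ('open', 'JJ')]
```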
^(
(VB,|VBD,|VBG,|VBN,|CD.?,|JJ.?,|\?,){0,2}?
(N.?,|\?,)+?
(CD.,)?
)+?$
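A sketch of how such a pattern can be applied, assuming each candidate n-gram is encoded as its POS tags with a comma after every tag (an encoding inferred from the pattern itself, not stated on the slide). The helper `is_candidate` is hypothetical, and the pattern is compiled in verbose mode so that it can keep the multi-line layout:

```python
import re

# Same pattern as above; re.VERBOSE ignores the whitespace and the comments.
NOUN_PHRASE = re.compile(r"""
    ^(
      (VB,|VBD,|VBG,|VBN,|CD.?,|JJ.?,|\?,){0,2}?   # at most two modifiers
      (N.?,|\?,)+?                                  # one or more noun-like tags
      (CD.,)?                                       # optional trailing cardinal
    )+?$
""", re.VERBOSE)


def is_candidate(tags):
    """True if the tag sequence, encoded as 'TAG,TAG,', matches the pattern."""
    encoded = "".join(tag + "," for tag in tags)
    return NOUN_PHRASE.match(encoded) is not None


print(is_candidate(["JJ", "NN"]))   # True: adjective + noun
print(is_candidate(["DT", "NN"]))   # False: determiners are not accepted
print(is_candidate(["VB"]))         # False: no noun at all
```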
Writing to MongoDB :
(mis-)used as key/value storage : update and modifiers are the key to success
mongodb.cooc.update({'_id': some_id}, {"$inc":{"value":1}})
compose id patterns to organize records : the edge example
mongodb.cooc.save({'_id': node_source_id + "_" + node_target_id, "value": 1})
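The composed-id idea can be tried without a running database. In this sketch a `collections.Counter` stands in for the `cooc` collection (an assumption made only to keep the example self-contained), and incrementing a key plays the role of `$inc` combined with an upsert. The helper names and the short ids are hypothetical:

```python
from collections import Counter

cooc = Counter()  # in-memory stand-in for the MongoDB `cooc` collection


def edge_id(source_id, target_id):
    """Compose the edge key as '<source>_<target>'."""
    return source_id + "_" + target_id


def record_cooccurrence(source_id, target_id):
    """Create the edge with value 1, or add 1 to an existing edge."""
    cooc[edge_id(source_id, target_id)] += 1


record_cooccurrence("a1", "b2")
record_cooccurrence("a1", "b2")
record_cooccurrence("a1", "c3")
print(cooc["a1_b2"])  # 2
```

Because the source id is the prefix of the key, all edges leaving one node share a common prefix that can later be queried.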
Querying MongoDB :
mystartswith_regexp = re.compile("^" + mysha256 + "_[a-z0-9]+$")
cooc_curs = mongodb.cooc.find(
    {"_id": {"$regex": mystartswith_regexp}},
    timeout=False,
    sort=[("value", pymongo.DESCENDING)],
    limit=MAXEDGES)
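What this query returns can be reproduced in memory. The sketch below is not a pymongo call: it assumes the edges live in a plain dict (`cooc`, with made-up ids and values) and applies the same three steps, namely prefix match on the id, descending sort on the value, and a limit. `edges_from` is a hypothetical helper:

```python
import re


def edges_from(cooc, source_id, max_edges):
    """Return the heaviest edges whose id starts with '<source_id>_'."""
    pattern = re.compile("^" + re.escape(source_id) + "_[a-z0-9]+$")
    matching = [(key, value) for key, value in cooc.items() if pattern.match(key)]
    matching.sort(key=lambda item: item[1], reverse=True)  # heaviest first
    return matching[:max_edges]


cooc = {"a1_b2": 5, "a1_c3": 9, "a1_d4": 1, "b2_a1": 7}
print(edges_from(cooc, "a1", 2))
# [('a1_c3', 9), ('a1_b2', 5)]
```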
About Neo4j :
use the official neo4j.py component (using python-jpype)
use wide transactions to reach the maximum performance
with graphdb.transaction as trans:
    node = graphdb.node()
    node[key.encode("ascii","ignore")] = value.encode("ascii","ignore")
control the types written to node properties : use ascii
Neo4j for Gephi as a plugin : the direct connection
Import network from database
Basic data laboratory features (sort, delete)
Rank category and occurrences to colors
Filter data on graph topology (degree and weight)
Spatialize the network using an OpenOrd layout
Remove artifacts directly from the visualization
Preview the final map, to tweak appearance
Finally export the map to PDF and GEXF
~600 lines of GNU/GPL one-shot code, 5 external libraries, 2 databases, and 1 Gephi
~1 full week coding, ~5 hours executing the whole process
2 networks obtained without much science :
2 maps online to be explored by all topic maps lovers
1 talk, 1 FOSDEM, 1 big leak ;-)
Special thanks to Wikileaks
Learn more at: