DBLP dataset details.
================================================================================

1) DBLP dataset downloaded on Aug 13, 2012 from http://dblp.uni-trier.de/xml/

2) dblp-orig.xml is the original file. dblp.xml has been obtained with:
$ sed 's/></>#</g' dblp.xml | tr '#' '\n' > dblp1.xml
This splits every line that contains more than one tag, since such lines are not supported by our ultra-fast, ultra-naive XML parser.
An example is "</proceedings><article md...". We expect "<article..." to always start on a new line.
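
For illustration, the same splitting step can be sketched in Python. This is not part of the original tooling: the function name split_adjacent_tags is hypothetical. Unlike the sed/tr pipeline, it does not use '#' as a temporary marker, so any '#' characters already in the data are left alone.

```python
def split_adjacent_tags(text: str) -> str:
    """Insert a newline between any two directly adjacent tags ("><")."""
    # Hypothetical helper, not taken from xml2dat.py.
    return text.replace("><", ">\n<")


sample = '</proceedings><article mdate="2011-12-29" key="k">'
print(split_adjacent_tags(sample))
```

After this step every "<article..." in the sample starts on its own line, which is what the parser expects.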

3) Some articles have zero authors, e.g.:

<article mdate="2011-12-29" key="tr/trier/MI92-17" publtype="informal publication">
<title>18. Workshop &uuml;ber Komplexit&auml;tstheorie, effiziente Algorithmen und Datenstrukturen, Universit&auml;t Trier, 20. Oktober 1992 (Abstracts)</title>
<journal>Universit&auml;t Trier, Mathematik/Informatik, Forschungsbericht</journal>
<volume>92-17</volume>
<year>1992</year>
</article>
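
A minimal sketch of how such records could be detected with a line-based scan. It assumes one tag per line (see item 2) and that each author appears on a line starting with "<author>". The function authors_per_article is hypothetical and is not the logic of xml2dat.py.

```python
from typing import Iterable, Iterator, Tuple


def authors_per_article(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (opening <article ...> line, number of <author> lines) per record."""
    header = None  # opening line of the article currently being read
    count = 0
    for raw in lines:
        line = raw.strip()
        if line.startswith("<article"):
            header, count = line, 0
        elif header is not None and line.startswith("<author>"):
            count += 1
        elif header is not None and line.startswith("</article>"):
            yield header, count
            header = None


# Shortened toy record with no <author> line.
record = [
    '<article mdate="2011-12-29" key="tr/trier/MI92-17">',
    "<year>1992</year>",
    "</article>",
]
for header, n_authors in authors_per_article(record):
    if n_authors == 0:
        print("zero authors:", header)
```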

4) Some stats.

Total number of processed articles: 850128
Graph stats: 677098 nodes, 4122556 edges, 14 labels.
Total number of labeled nodes: 153022 (~22%)
Total number of components: 73588
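
The labeled-node percentage above can be rechecked from the raw counts. The variable names below are only for illustration.

```python
total_nodes = 677098
labeled_nodes = 153022

# Share of nodes that carry a label, in percent.
labeled_pct = 100.0 * labeled_nodes / total_nodes
print(f"labeled nodes: {labeled_pct:.1f}%")
```

This gives about 22.6%, which is reported above as ~22%.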

5) Nodes are authors. Nodes are labeled using conference keywords in different research fields of computer science.

6) To generate dblp.dat, edit xml2dat.py, set dblpPathnamePrefix = "dblp", and run $ ./xml2dat.py with no parameters.

7) Total number of nodes that can be reached from labeled nodes: 547857 (~81%). The remaining ~19% of the nodes cannot be reached from any labeled node and can be discarded/ignored.
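
A minimal sketch of how the reachable set could be computed with a breadth-first search started from all labeled nodes at once. It assumes an undirected graph stored as an adjacency dict with both directions listed. The function reachable_from and the toy graph are hypothetical, not the code used to produce the figure above.

```python
from collections import deque


def reachable_from(adjacency, sources):
    """Return the set of nodes reachable from any node in `sources`."""
    seen = set(sources)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Toy graph: component {a, b, c} contains a labeled node, {d, e} does not.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
labeled = {"a"}
kept = reachable_from(toy, labeled)
discarded = set(toy) - kept
print(sorted(kept), sorted(discarded))
```

In the toy graph, d and e play the role of the ~19%: no labeled node lies in their component.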



EOF.
