Mining Bio4j data: finding topological patterns in PPI networks

Hi everyone!

After writing this post in December, I’ve been thinking of doing something similar, yet different, using Neo4j’s Cypher query language.

That’s how I came up with the idea of looking for topological patterns across a large subset of the protein-protein interaction (PPI) network included in Bio4j, rather than focusing on a few proteins selected a priori.

I decided to mine the data in order to find circuits (simple cycles) of length 3 where at least one protein is from the Swiss-Prot dataset.

I would like to point out that direction matters here, so these two cycles:

A --> B --> C --> A
A --> C --> B --> A

are not the same. OK, now that this has been said, let’s see what the Cypher query looks like:

START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
MATCH d <-[r:PROTEIN_DATASET]- p,
      circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
RETURN p.accession, p2.accession, p3.accession

As you can see, it’s really simple and straightforward. The START and MATCH clauses first select the proteins that belong to the Swiss-Prot dataset and then retrieve those that form a cycle of length 3 as described above. Once the query has finished, you should get something like this:

cypher> 
==> +-------------------------------------------+
==> | p.accession | p2.accession | p3.accession |
==> +-------------------------------------------+
==> | Q08465      | P35189       | P3421        |
==> | Q08465      | P34218       | P35189       |
==> | Q8GXA4      | Q8L7E5       | Q9LE82       |
==> | Q8GXA4      | Q9FH18       | Q8L7E5       |
==> ....
==> +-------------------------------------------+
==> 6632 rows, 1019211 ms

As you can see, the query took about 17 minutes to complete on a 100% fresh DB (no information had been cached yet), running on an m1.large AWS machine with 7.5 GB of RAM.

Not bad, right!?
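If you want to see what the pattern computes without a Neo4j instance at hand, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy graph, the protein names A, B and C, the choice of which ones belong to Swiss-Prot, and the helper name directed_3_cycles. It also requires the three proteins to be distinct, which is a simplification and not an exact copy of Cypher's matching rules.

```python
# Toy directed interaction graph; the protein names are made up for illustration.
interactions = {
    "A": {"B"},
    "B": {"C"},
    "C": {"A"},
}

# Hypothetical assumption: A and B are Swiss-Prot entries, C is not.
swiss_prot = {"A", "B"}


def directed_3_cycles(graph, start_nodes):
    """Return (p, p2, p3) tuples with p -> p2 -> p3 -> p and p in start_nodes."""
    rows = []
    for p in sorted(start_nodes):
        for p2 in sorted(graph.get(p, ())):
            for p3 in sorted(graph.get(p2, ())):
                closes_cycle = p in graph.get(p3, ())
                all_distinct = len({p, p2, p3}) == 3
                if closes_cycle and all_distinct:
                    rows.append((p, p2, p3))
    return rows


rows = directed_3_cycles(interactions, swiss_prot)
for row in rows:
    # Print each row in the same arrow notation used in this post.
    print(" --> ".join(row + (row[0],)))
```

Since the edges are only followed in their own direction, the reversed cycle A --> C --> B --> A is never reported for this toy graph.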

There is one thing to be aware of, though: this query returns cycles such as:

A --> B --> C --> A
B --> C --> A --> B

as different cycles when they are actually the same one, just starting at a different protein. That’s why I developed a simple program to remove these repetitions and to gather some statistics. After running the program you get two files:

PPICircuitsLength3NoRepeats file: download it here
PPICircuitsProteinsFreq file: download it here.

After filtering, the 6632 rows returned by the query were reduced to 2226 distinct circuits.
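The program itself is not shown here, but the idea behind removing the repetitions can be sketched in a few lines of Python. This is only one possible approach, not necessarily what the actual program does: every cycle is rotated so that it starts at its alphabetically smallest accession, which keeps the direction intact while making all rotations of the same cycle identical. The function names are hypothetical, and the second sample row is a made-up rotation of the first one.

```python
def canonical_rotation(cycle):
    """Rotate a cycle so it starts at its smallest accession, keeping direction."""
    cycle = tuple(cycle)
    i = cycle.index(min(cycle))
    return cycle[i:] + cycle[:i]


def remove_rotations(rows):
    """Keep one representative per cycle, in order of first appearance."""
    seen = set()
    unique = []
    for row in rows:
        key = canonical_rotation(row)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


rows = [
    ("Q08465", "P34218", "P35189"),
    ("P34218", "P35189", "Q08465"),  # made-up rotation of the previous row
    ("Q8GXA4", "Q8L7E5", "Q9LE82"),
]

unique = remove_rotations(rows)
for cycle in unique:
    print("\t".join(cycle))
```

Note that a cycle and its reverse still end up with different canonical forms, so they are kept as two separate circuits, as they should be.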

Finally, I also created a really simple chart showing the absolute frequency of the 20 proteins with the most occurrences in the cycles that were found.
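For reference, counting how often each protein appears could look roughly like the sketch below. It uses Python's collections.Counter on three rows taken from the sample output above; the function name protein_frequencies is hypothetical, and the real frequency file was of course built from the full set of filtered circuits.

```python
from collections import Counter


def protein_frequencies(cycles):
    """Count in how many cycle positions each accession appears."""
    counts = Counter()
    for cycle in cycles:
        counts.update(cycle)
    return counts


# Three rows from the sample output shown earlier in this post.
cycles = [
    ("Q08465", "P34218", "P35189"),
    ("Q8GXA4", "Q8L7E5", "Q9LE82"),
    ("Q8GXA4", "Q9FH18", "Q8L7E5"),
]

freq = protein_frequencies(cycles)

# most_common(20) returns at most the 20 accessions with the highest counts.
top = freq.most_common(20)
for accession, count in top:
    print(accession, count)
```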

Well, that’s all for now. Have a good day!

@pablopareja