Brummell: For a unique field 1, collapse non-unique entries in another field

Hi Everyone (Hi Dr. Nick)
I have a data set that is the result of a left outer join (intersection) of
two data sets. Each entry from the first data set is now repeated once for
every entry in the second data set that it overlaps. Note that Assembly.1000
appears three times below; I want to collapse those rows into one:
Assembly.1000 chrX 560000 575000 ABC1 20
Assembly.1000 chrX 560000 575000 IL15RA 3.2
Assembly.1000 chrX 560000 575000 BRCA1 20
Assembly.1038 chrX 780000 829000 . .
Assembly.1338 chrX 960000 999000 ACTIN 3800
Assembly.1338 chrX 960000 999000 ACTIN 4000
As you can see, the File 1 entry for Assembly.1000 is repeated three times,
once for each File 2 entry (ABC1, IL15RA, BRCA1).
The output I would like is:
Assembly.1000 chrX 560000 575000 ABC1;IL15RA;BRCA1 20;3.2;20
Assembly.1038 chrX 780000 829000 . .
Assembly.1338 chrX 960000 999000 ACTIN;ACTIN 3800;4000
I can accomplish this with a shell "while read" loop that compares each line
with the previous entries, but for large files (~1e6 entries) this is simply
not efficient enough. Does anyone have suggestions for doing this
efficiently?
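
For reference, the collapse described above could be done in a single pass with a dictionary keyed on the first four columns. This is only a minimal sketch in Python, not a tested solution for the real files: it assumes whitespace-delimited input with exactly six columns per row, and the function name collapse and its parameters joiner and out_sep are made up for illustration.

```python
def collapse(lines, joiner=";", out_sep="\t"):
    """Group rows on columns 1-4 and join columns 5 and 6 with `joiner`.

    Assumes whitespace-delimited rows with at least six columns;
    shorter (blank or malformed) rows are skipped.
    """
    groups = {}  # dicts keep insertion order, so output follows first appearance
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue
        key = tuple(fields[:4])
        # One pair of lists per File 1 entry: (names, values)
        names, values = groups.setdefault(key, ([], []))
        names.append(fields[4])
        values.append(fields[5])
    return [
        out_sep.join(key + (joiner.join(names), joiner.join(values)))
        for key, (names, values) in groups.items()
    ]


# The sample rows from the post
sample = [
    "Assembly.1000 chrX 560000 575000 ABC1 20",
    "Assembly.1000 chrX 560000 575000 IL15RA 3.2",
    "Assembly.1000 chrX 560000 575000 BRCA1 20",
    "Assembly.1038 chrX 780000 829000 . .",
    "Assembly.1338 chrX 960000 999000 ACTIN 3800",
    "Assembly.1338 chrX 960000 999000 ACTIN 4000",
]

for row in collapse(sample, out_sep=" "):
    print(row)
```

Each line is read once and every lookup is a hash-table access, so the cost grows roughly linearly with the number of rows, unlike a shell loop that spawns work per line. An open file handle can be passed in place of the sample list. If the join output is already sorted so that repeated entries are adjacent, the same idea works with itertools.groupby and only one group held in memory at a time.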
Cheers!

Brummell

Tuesday, 1 October 2013
