Screencast --- pandas groupby and apply functions

Jun 26, 2013 • Martin Schilling

Initially, I was thinking about using a problem I had been working on recently: I have a GBS (genotyping-by-sequencing) pd.DataFrame with 160,000+ SNPs (rows) and 150 individuals from different populations (columns), with each individual’s genotype (for example A/A, A/T, or N) in the cells. I created a pandas groupby object that groups the data by population (using re.match on the pop_names).
With that, I can calculate all kinds of functions for each population and locus. In this case, I calculated Hardy-Weinberg equilibrium to get a picture of population differentiation (I wasn’t really interested in the HWE estimates themselves).
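
To give a rough idea of that layout, here is a minimal sketch on made-up data. It is not my actual analysis code: the column names (popA_1, ...), the regex, and the helper functions population and observed_het are all hypothetical, and it computes the observed heterozygote fraction per population and locus as a simple stand-in statistic rather than a full HWE test.

```python
import re

import pandas as pd

# Hypothetical toy data: rows are SNPs, columns are individuals named "<pop>_<id>".
geno = pd.DataFrame(
    {
        "popA_1": ["A/A", "A/T", "N"],
        "popA_2": ["A/T", "A/T", "G/G"],
        "popB_1": ["T/T", "A/A", "G/C"],
        "popB_2": ["A/T", "A/A", "N"],
    },
    index=["snp1", "snp2", "snp3"],
)


def population(col_name):
    """Extract the population label from a column name (assumed naming scheme)."""
    return re.match(r"(pop[A-Z])_", col_name).group(1)


def observed_het(genotypes):
    """Fraction of heterozygotes among called genotypes; 'N' counts as missing."""
    called = genotypes[genotypes != "N"]
    if called.empty:
        return float("nan")
    alleles = called.str.split("/", expand=True)
    return (alleles[0] != alleles[1]).mean()


# Transpose so individuals become rows, then group those rows by population.
by_pop = geno.T.groupby(population)

# For each population, apply the statistic to every SNP column.
het = pd.DataFrame({pop: block.apply(observed_het) for pop, block in by_pop})
print(het)
```

The result has one row per SNP and one column per population, so a locus where the populations differ strongly stands out when reading across a row.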

I’m starting to think that this might be too much for 3 minutes, as it might take that long just to explain the structure of the data. So I might just take a simple DataFrame, like this:

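import numpy as np
import pandas as pd

# note: 'alive' holds the strings 'True'/'False', not booleans
df = pd.DataFrame({
    'location': ['A', 'A', 'B', 'C', 'C', 'D', 'E', 'F', 'F', 'F'],
    'alive': ['True', 'True', 'False', 'False', 'True', 'True', 'False', 'False', 'True', 'False'],
    'mass': np.random.randn(10) / 0.8 + 12,
})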

and do some groupby-shenanigans with that, like the following:

grpd = df.groupby('alive')

# apply functions on the object:

grpd.get_group('True').sort_values('mass')

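# 'location' holds strings, so select the numeric 'mass' column before aggregating
grpd['mass'].sum()
grpd['mass'].mean()
grpd['mass'].max()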
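
If there is time left, a custom function could go in as well. This is only a sketch on the same toy columns: the mass_range helper is made up for illustration, and the random generator is seeded (my assumption, not part of the snippet above) so the numbers stay the same between runs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the toy numbers are reproducible
df = pd.DataFrame({
    "location": ["A", "A", "B", "C", "C", "D", "E", "F", "F", "F"],
    "alive": ["True", "True", "False", "False", "True",
              "True", "False", "False", "True", "False"],
    "mass": rng.standard_normal(10) / 0.8 + 12,
})


def mass_range(masses):
    """Hypothetical helper: spread between the heaviest and lightest in a group."""
    return masses.max() - masses.min()


# Several summaries at once, one row per value of 'alive'.
summary = df.groupby("alive")["mass"].agg(["count", "mean", mass_range])
print(summary)

# Grouping by two columns works the same way and yields a two-level index.
per_site = df.groupby(["location", "alive"])["mass"].mean()
print(per_site)
```

Passing a list to agg keeps the built-in reductions and the custom one side by side in a single table, which is probably easier to show on screen than three separate calls.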

What do you think?

Thanks,
m