Tuesday 26 November 2019

aggregate - How to calculate population prevalence in R using data.table

I hoping someone can point me in the right direction of how I can calculate a population prevalence proportion without having to do it in excel.



I´m currently working on a project that requires me to find the annual prevalence for eating fruit stratified by sex, agegr. To complicate matters unique id's can eat fruit from different kinds of fruitgroups but two of the four fruits belong to the same fruitgroup and therefor can only be counted once per id.



To further complicate the issue I need to report the prevalence by





  • total fruits per year per 1000 in the population,

  • per individual fruits per year per 1000,

  • per the fruitgroup that has the two fruit in,

  • by age

  • by sex



Here is my dataset - you might have figured out that due to restrictions on my database I had to change the actual values - special thanks to this reproduce code here and to this answer on how to make your data anonymous




    #### Data ####
require(data.table)
anonDT <- data.table(structure(list(id = c("E1998", "E2308", "E1421", "E676", "E5061","E4225", "E2600", "E658", "E2331", "E982", "E4790", "E408", "E1048","E3937", "E4554", "E3357", "E2637", "E178", "E3734", "E1217","E3771", "E1954", "E4928", "E3566", "E1106", "E3835", "E1505","E668", "E4083", "E3066", "E3356", "E4910", "E2801", "E1074","E5097", "E610", "E995", "E1001", "E3824", "E3427", "E3885","E648", "E1986", "E4777", "E2546", "E909", "E1954", "E634", "E2602","E531", "E67", "E2418", "E3863", "E4266", "E196", "E657", "E1516","E4722", "E3077", "E3732", "E1556", "E112", "E924", "E2801","E2742", "E3362", "E1880", "E3645", "E3357", "E2519", "E2450","E5162", "E1483", "E3846", "E4539", "E2452", "E282", "E4604","E226", "E5043", "E3909", "E88", "E51", "E1925", "E2776", "E3835","E4746", "E1631", "E4052", "E1128", "E220", "E1390", "E4908","E1385", "E1003", "E5181", "E3835", "E4910", "E3240", "E4380","E3357", "E963", "E706", "E5142", "E2869", "E3839", "E5271","E2584", "E194", "E4366", "E2621", "E932", "E1104", "E1964","E928", "E4377", "E1418", "E2940", "E3420", "E3958", "E4130","E790", "E3667", "E934", "E3356", "E5203", "E3835", "E3356","E3297", "E5203", "E4380", "E668", "E2856", "E4502", "E1054","E3644", "E4641", "E5204", "E2597", "E4432", "E2716", "E2422","E1964", "E1327", "E2028", "E2727", "E1868", "E638", "E88", "E4892","E706", "E5147", "E3130", "E4099", "E4239", "E341", "E593", "E4746","E2291", "E2240", "E2481", "E3846", "E2602", "E1673", "E4772","E2140", "E5024", "E1137", "E2182", "E4366", "E2386", "E648","E3118", "E8", "E2813", "E3422", "E3982", "E2", "E2940", "E2035","E4746", "E5134", "E4380", "E4615", "E1372", "E2249", "E1954","E2418", "E1122", "E3485", "E934", "E3611", "E2665", "E2961","E2108", "E4432", "E2447", "E3413", "E531", "E1751"),
sex = structure(c(1L,2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L,2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L,1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L,1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L,2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L,1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L,1L, 2L, 1L, 2L, 1L, 1L, 1L), .Label = c("male", "female"), class = "factor"),group = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L,1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 3L, 1L,2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L,2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 3L,2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,2L, 2L, 2L, 3L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 1L, 3L, 2L, 1L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 3L, 2L, 1L, 1L, 2L, 2L, 2L,2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 3L,2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L), .Label = c("fruitgr1","fruitgr2", "fruitgr3"), class = "factor"),
subgroup = structure(c(4L,3L, 3L, 3L, 3L, 4L, 4L, 4L, 2L, 4L, 3L, 3L, 4L, 3L, 3L, 3L,4L, 1L, 3L, 4L, 3L, 1L, 3L, 3L, 1L, 1L, 1L, 4L, 3L, 4L, 4L,3L, 4L, 1L, 4L, 3L, 3L, 4L, 2L, 1L, 4L, 3L, 3L, 4L, 3L, 3L,1L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 3L, 3L, 3L,4L, 3L, 4L, 4L, 3L, 4L, 3L, 3L, 4L, 3L, 1L, 3L, 1L, 4L, 2L,1L, 3L, 1L, 4L, 4L, 4L, 4L, 4L, 1L, 4L, 4L, 3L, 3L, 4L, 3L,4L, 3L, 4L, 3L, 2L, 1L, 3L, 4L, 2L, 3L, 4L, 3L, 4L, 3L, 4L,1L, 3L, 4L, 4L, 4L, 4L, 3L, 1L, 4L, 3L, 3L, 3L, 2L, 2L, 1L,4L, 4L, 3L, 4L, 4L, 4L, 4L, 3L, 4L, 3L, 3L, 2L, 3L, 2L, 4L,2L, 3L, 4L, 1L, 2L, 4L, 1L, 4L, 3L, 4L, 3L, 4L, 4L, 3L, 4L,2L, 1L, 2L, 3L, 1L, 1L, 4L, 3L, 3L, 4L, 1L, 3L, 4L, 4L, 4L,1L, 3L, 3L, 4L, 4L, 3L, 3L, 3L, 4L, 4L, 3L, 1L, 3L, 4L, 4L,4L, 3L, 4L, 4L, 1L, 1L, 4L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 1L,4L, 1L, 4L, 4L), .Label = c("apple", "orange", "banana","kiwi"), class = "factor"),
agegr = structure(c(2L, 3L, 2L,2L, 3L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 3L,1L, 2L, 2L, 3L, 2L, 1L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 1L, 3L, 3L, 3L, 2L,2L, 1L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,3L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 3L, 1L, 3L, 3L, 1L, 3L, 2L,3L, 3L, 2L, 1L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L,2L, 2L, 2L, 3L, 2L, 2L, 3L, 2L, 1L, 2L, 3L, 2L, 1L, 3L, 1L,1L, 2L, 1L, 2L, 2L, 3L, 2L, 1L, 2L, 1L, 2L, 2L, 3L, 2L, 2L,1L, 2L, 2L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 2L, 2L,2L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L,2L, 2L, 3L, 3L, 3L, 2L, 1L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 2L,2L, 2L, 3L, 2L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 3L,3L, 2L, 3L, 3L, 3L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,1L, 2L), .Label = c("19-24", "25-49", "50+"), class = "factor"),year = c(2004, 2007, 2008, 2006, 2008, 2007, 2008, 2007,2007, 2007, 2005, 2005, 2006, 2006, 2006, 2006, 2006, 2004,2008, 2006, 2006, 2007, 2006, 2006, 2005, 2006, 2005, 2005,2006, 2007, 2006, 2008, 2008, 2006, 2004, 2004, 2007, 2005,2008, 2008, 2005, 2007, 2008, 2008, 2008, 2008, 2005, 2008,2008, 2005, 2005, 2006, 2007, 2007, 2006, 2006, 2007, 2007,2008, 2008, 2005, 2007, 2007, 2005, 2008, 2007, 2004, 2008,2007, 2008, 2005, 2005, 2008, 2005, 2007, 2008, 2008, 2005,2004, 2008, 2004, 2007, 2005, 2008, 2004, 2008, 2008, 2006,2008, 2006, 2007, 2008, 2005, 2007, 2007, 2007, 2006, 2007,2007, 2008, 2006, 2005, 2008, 2004, 2008, 2008, 2006, 2005,2004, 2007, 2006, 2004, 2005, 2004, 2006, 2005, 2008, 2004,2007, 2004, 2006, 2008, 2006, 2007, 2008, 2005, 2007, 2007,2006, 2005, 2008, 2004, 2008, 2007, 2008, 2004, 2008, 2007,2007, 2004, 2004, 2008, 2004, 2007, 2008, 2005, 2007, 2005,2008, 2006, 2006, 2007, 2005, 2005, 2006, 2006, 2005, 2007,2006, 2008, 2005, 2006, 2008, 2007, 2005, 2006, 2006, 2007,2007, 2006, 2008, 2007, 2007, 2008, 2005, 2008, 2007, 2004,2005, 2006, 2007, 2006, 2008, 2006, 2008, 2004, 2006, 2008,2007, 2004, 2008, 2008, 2004, 2008, 2007, 2006, 2005, 2006,2004, 2004)), .Names = c("id", "sex", "group", "subgroup","agegr", "year"), class = c("data.table", "data.frame"), row.names = c(NA,-200L)))


And the population data is here:




#### Population data ####
populDT <- data.table(structure(list(year = 2004:2008, total = c(210696L, 216192L,223472L, 230629L, 233625L), men = c(104770L, 108390L, 113597L,117629L, 118275L), women = c(105926L, 107802L, 109875L, 113000L,115350L), agegrp1 = c(25721L, 25558L, 25933L, 27457L, 28083L), agegrp2 = c(104933L, 107935L, 111796L, 114852L, 115102L),agegrp3 = c(80042L, 82699L, 85743L, 88320L, 90440L)), .Names = c("year","total", "men", "women", "agegrp1", "agegrp2", "agegrp3"), sorted = "year", class = c("data.table","data.frame"), row.names = c(NA, -5L), key='year'))


I have managed to get all the relevant prevalence counts into a single data.table



allcount <- anonDT[,.N, keyby=list(year,agegr,sex,group,subgroup,id)][,.N,by=list(year,agegr,sex,group,subgroup)]
allcount



as well as making several subsets of data.tables including the counts I need.



As for my questions,




  1. is there a simple way of calculating the prevalence proportion (formula is 1000*those who eat fruit/total population) inside the data.table for all the relevant stratifications and groups.

  2. Would I need to merge the populDT with the allcount?

  3. If so what would be the best way forward

  4. And as a bonus conundrum what would be the best way to create publication table/graph from the data.table?




Thank you all in advance... as this is my first time posting I hope I've done everything correctly :)



Here is a data.table of the almost desired outcome. I only managed to get it in this state by using data.frame not table. To be more precise I created count variables for all and then used aggregate to get it into shape



### first using max ### 
prevAN <- aggregate(anon[,7:17],
by = list(anon$year,anon$id,anon$group, anon$subgroup),
FUN = max)

### and then by sum ###
prevAN1 <- aggregate(prevAN[,5:15],
by = list(prevAN$year),
FUN = sum)


### the count dataset ###
DT2 <- data.table(structure(list(year = c(2004, 2005, 2006, 2007, 2008), count = c(865,1095, 1355, 1602, 1749), men = c(470, 616, 748, 863, 946), women = c(395,479, 607, 739, 803), agegr1 = c(141, 220, 272, 316, 385), agegr2 = c(552,657, 826, 1001, 1040), agegr3 = c(172, 218, 257, 285, 324), c_fruitgr2 = c(703,910, 1130, 1304, 1451), c_banana = c(153, 397, 618, 798, 950),c_kiwi = c(550, 513, 512, 506, 501), c_apple = c(121, 114,97, 110, 112), c_orange = c(41, 71, 128, 188, 186), total = c(210696L,216192L, 223472L, 230629L, 233625L), men.1 = c(104770L, 108390L,113597L, 117629L, 118275L),
women.1 = c(105926L, 107802L,109875L, 113000L, 115350L), agegrp1 = c(25721L, 25558L, 25933L,27457L, 28083L), agegrp2 = c(104933L, 107935L, 111796L, 114852L,115102L), agegrp3 = c(80042L, 82699L, 85743L, 88320L, 90440L)), .Names = c("year", "count", "men", "women", "agegr1","agegr2", "agegr3", "c_fruitgr2", "c_banana", "c_kiwi", "c_apple","c_orange", "total", "men.1", "women.1", "agegrp1", "agegrp2","agegrp3"), sorted = "year", class = c("data.table", "data.frame"), row.names = c(NA, -5L), key='year'))



When I get to this stage I need to calculate the prevalence from the figures in each year. I would do this using a data.frame by simply doing the following:



total.prev <- DF2$count*1000/DF2$total


Back to the questions at hand, firstly I love how fast data.table works and even though I am able to roughly get where I need to go with my data using aggregate I would like to learn how I can do this in data.table as I have a feeling that it is quicker and more productive.



Secondly there is the issue of the calculation, calculating within a data.table or between data.tables.




And thirdly there is the issue of graphs and tables for publication but that might be a second post.



UPDATE
I have figured out a way to calculate the prevelance within the data.table by doing the following



DT2[,':='(p1= count*1000/total,
pM= men*1000/men.1,
pW= women*1000/women.1,
pA1= agegr1*1000/agegrp1,
pA2= agegr2*1000/agegrp2,

pA3= agegr3*1000/agegrp3,
pFgr2= c_fruitgr2*1000/total,
pB= c_banana*1000/total,
pK= c_kiwi*1000/total,
pA= c_apple*1000/total,
pO= c_orange*1000/total),]

DT2<-round(DT2, 2) # to round the prevalence numbers



Again if someone has anything simpler to offer :)

No comments:

Post a Comment

php - file_get_contents shows unexpected output while reading a file

I want to output an inline jpg image as a base64 encoded string, however when I do this : $contents = file_get_contents($filename); print &q...