Can anyone in the SOCNET community recommend recent work that examines the impact of missing data on community detection algorithms and assesses the biases that arise?  I am particularly interested in work that focuses on relatively small interpersonal or interorganizational networks studied through some form of survey.  Thus, the work on sampling of very large social media networks is less relevant.

Kossinets (2006) demonstrated through simulation that survey non-response leads to a downward bias in the estimated clustering coefficient, which should in turn make it less likely that communities are identified.  On the other hand, Yan and Gregory (2011) found that community detection was fairly robust given moderate levels of survey non-response.  A far more sobering conclusion is found in van Gennip et al. (2013), who tried to identify communities of gang members based on police reports and geographical data but met with limited success (although they found that combining social network and geographical data greatly improved their estimates).
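
For anyone who wants to reproduce the flavor of the clustering-coefficient result on a toy case, here is a minimal sketch in Python (standard library only).  It is not Kossinets's actual design.  The graph generator, the reporting rule (a tie is observed if at least one endpoint responds), and all names and parameter values are my own assumptions for illustration.

```python
import random
from itertools import combinations

def grouped_graph(n_groups, group_size, p_in, p_out, rng):
    """Toy network with planted groups (hypothetical generator, not from the cited papers)."""
    n = n_groups * group_size
    edges = set()
    for u, v in combinations(range(n), 2):
        same_group = (u // group_size) == (v // group_size)
        if rng.random() < (p_in if same_group else p_out):
            edges.add((u, v))
    return edges, n

def observe_with_nonresponse(edges, n, response_rate, rng):
    """Assumed reporting rule: a tie is observed if at least one endpoint responds."""
    respondents = {v for v in range(n) if rng.random() < response_rate}
    return {(u, v) for u, v in edges if u in respondents or v in respondents}

def transitivity(edges):
    """Global clustering coefficient: closed connected triples / all connected triples."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    closed = triples = 0
    for ns in nbrs.values():
        for a, b in combinations(ns, 2):
            triples += 1
            if b in nbrs[a]:
                closed += 1
    return closed / triples if triples else 0.0

if __name__ == "__main__":
    rng = random.Random(1)
    edges, n = grouped_graph(4, 10, p_in=0.5, p_out=0.05, rng=rng)
    print("true network:", round(transitivity(edges), 3))
    for rate in (0.9, 0.7, 0.5):
        # Average over repeated draws of who responds.
        runs = [transitivity(observe_with_nonresponse(edges, n, rate, rng))
                for _ in range(200)]
        print("response rate", rate, "mean:", round(sum(runs) / len(runs), 3))
```

Comparing the printed means against the value for the true network shows how far the estimate drifts as the response rate falls, under these assumptions only.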

So, I am left uncertain about the consensus in the field concerning the value of attempting to identify communities based on incomplete survey data.  One could argue that it is a futile exercise prone to large and unidentified biases.  On the other hand, there may be an argument that valid communities can still be found because non-response is more likely to omit inter-group edges than intra-group edges, which preserves community structure (the Yan and Gregory finding).
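
To make the second argument concrete, here is a rough Python sketch (standard library only) that deletes ties at different rates depending on whether they fall within or between groups, and then scores the known grouping with Newman's modularity.  This is not the Yan and Gregory procedure.  The generator, the miss probabilities `miss_in` and `miss_out`, and all function names are hypothetical choices of mine.

```python
import random
from itertools import combinations

def grouped_graph(n_groups, group_size, p_in, p_out, rng):
    """Toy network with planted groups (hypothetical generator)."""
    n = n_groups * group_size
    group = {v: v // group_size for v in range(n)}
    edges = set()
    for u, v in combinations(range(n), 2):
        if rng.random() < (p_in if group[u] == group[v] else p_out):
            edges.add((u, v))
    return edges, group

def drop_edges(edges, group, miss_in, miss_out, rng):
    """Assumed missingness: each tie is lost with a probability that depends on its type."""
    kept = set()
    for u, v in edges:
        miss = miss_in if group[u] == group[v] else miss_out
        if rng.random() >= miss:
            kept.add((u, v))
    return kept

def modularity(edges, group):
    """Newman modularity of the partition `group` on an undirected edge set."""
    m = len(edges)
    if m == 0:
        return 0.0
    within, degree = {}, {}
    for u, v in edges:
        degree[group[u]] = degree.get(group[u], 0) + 1
        degree[group[v]] = degree.get(group[v], 0) + 1
        if group[u] == group[v]:
            within[group[u]] = within.get(group[u], 0) + 1
    return sum(within.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

if __name__ == "__main__":
    rng = random.Random(1)
    edges, group = grouped_graph(4, 10, p_in=0.5, p_out=0.05, rng=rng)
    print("complete data:", round(modularity(edges, group), 3))
    equal = drop_edges(edges, group, miss_in=0.3, miss_out=0.3, rng=rng)
    print("equal loss:", round(modularity(equal, group), 3))
    skewed = drop_edges(edges, group, miss_in=0.2, miss_out=0.6, rng=rng)
    print("more inter-group loss:", round(modularity(skewed, group), 3))
```

Whether real survey non-response resembles the skewed case is exactly the empirical question I am unsure about; the sketch only shows how one might check the sensitivity of a partition to that assumption.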

Any thoughts?



Kossinets, G. (2006). "Effects of missing data in social networks." Social Networks 28(3): 247-268.

Yan, B. and S. Gregory (2011). "Finding missing edges and communities in incomplete networks." Journal of Physics A: Mathematical and Theoretical 44(49): 495102.

van Gennip, Y., B. Hunter, et al. (2013). "Community detection using spectral clustering on sparse geosocial data." SIAM Journal on Applied Mathematics 73(1): 67-83.

