Location-Based Social Network Data Generation

The detailed investigation of collective social phenomena requires an inordinate amount of data, because the patterns underlying human behavior can be quite complex and are therefore hard to predict. The late Nobel laureate Murray Gell-Mann summed this up aptly: “Think how hard physics would be if particles could think.” To some extent, research on location-based social networks (LBSNs) attempts to grapple with this challenge by leveraging social network data for predictive tasks such as Point-of-Interest recommendation [10][11][12], social link prediction [9], and location prediction [1]. The major obstacle, however, is that comprehensive real-world LBSN data sets are hardly available due to privacy implications. In addition, because such data are considered operational data, businesses are unwilling to share them or make them publicly available. The largest publicly available LBSN data set is the Gowalla data set [1], which has 36M check-ins. However, after removing users with fewer than 15 check-ins and locations with fewer than 10 visitors, only 18.7k users and 1.29M check-ins remain [8]. Spread over 20 months, that is an average of only 2.1k check-ins per day. Spread across the globe, this leaves only a handful of check-ins per day per city, hardly enough to model, explain, and predict mobility. A recent study by Li et al. [7] concludes that "Researchers working with LBSN data sets are often confronted by themselves or others with doubts regarding the quality or the potential of their data sets" and that "it is reasonable to be skeptical."
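The per-day figure quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers stated in the text; the average month length of 30.4 days and all identifiers are assumptions made for illustration.

```python
# Figures quoted in the text for the filtered Gowalla data set [8].
CHECKINS_AFTER_FILTERING = 1_290_000
USERS_AFTER_FILTERING = 18_700
MONTHS_COVERED = 20

# Assumption (not from the source): an average month has about 30.4 days.
DAYS_PER_MONTH = 30.4


def checkins_per_day(total_checkins, months, days_per_month=DAYS_PER_MONTH):
    """Average number of check-ins per day over the covered period."""
    return total_checkins / (months * days_per_month)


rate = checkins_per_day(CHECKINS_AFTER_FILTERING, MONTHS_COVERED)
per_user = CHECKINS_AFTER_FILTERING / USERS_AFTER_FILTERING

print(f"about {rate:,.0f} check-ins per day in total")
print(f"about {per_user:.0f} check-ins per remaining user over the whole period")
```

Under these assumptions the result is roughly 2.1k check-ins per day worldwide, which is consistent with the figure given in the text.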

To address this challenge, we have developed an agent-based simulation framework that exhibits realistic social behavior, grounded in data about real-world phenomena and in social science theories. We simulate plausible numbers of agents over years of simulation time and can potentially generate LBSN data for entire generations. The simulation generates high-fidelity LBSN data sets that contain complete, uncertainty-free location and temporal social network data collected over long periods of time. We have published this agent-based simulation framework, run many simulations, and made many such data sets available [5]. LBSN research is certainly an important use of our data sets, but it is only one of many. Given the current pandemic crisis, for example, our framework could potentially be used for disease spread simulations.

Figure 1. Environments Populated with Agents. Clockwise from Top Left: GMU, NOLA, Large and Small Synthetic Villages

The four maps in Figure 1 are (i) New Orleans, LA (NOLA), (ii) the George Mason University (GMU) campus, (iii) a small synthetic town (TownS), and (iv) a large synthetic town (TownL). The synthetic maps were created using a spatial network and place generator, described in [6], that is based on a generative grammar similar to L-systems. A more detailed description of these data sets can be found in [5].

We ran multiple settings for each study area. Table 1 specifies the run-time settings in detail: the area simulated, the type and number of sites simulated, the number of neighborhoods, and the count of agents. We simulated five types of sites: Schools, Pubs, Workplaces, Restaurants, and Apartments. The actual number of sites of each type and the number of neighborhoods result from an internal computational process; they are derived indirectly from the user's choice of parameters and are not directly accessible to the user at setup time.

Table 1. Location-Based Social Network Simulation Settings.
Settings  Maps    Area    # of Sites                                               # of  # of
                          Total  School  Recreation  Workplace  Restaurant  Apt.
GMU-1K    GMU     3.36    1,781  1       10          250        20          1,500  1     874
GMU-3K    GMU     3.36    5,341  1       30          750        60          4,500  1     2,589
GMU-5K    GMU     3.36    8,901  1       50          1,250      100         7,500  1     4,648
NOLA-1K   NOLA    6.49    1,781  2       10          250        20          1,500  2     863
NOLA-3K   NOLA    6.49    5,342  2       30          750        60          4,500  2     2,720
NOLA-5K   NOLA    6.49    8,904  4       50          1,250      100         7,500  2     4,728
TownS-1K  TownSm  58.41   1,788  4       12          252        20          1,500  4     876
TownS-3K  TownSm  58.41   5,348  4       32          752        60          4,500  4     2,645
TownS-5K  TownSm  58.41   8,908  4       52          1,252      100         7,500  4     4,349
TownL-1K  TownLg  126.2   1,789  6       12          253        18          1,500  6     853
TownL-3K  TownLg  126.2   5,346  6       30          750        60          4,500  6     2,550
TownL-5K  TownLg  126.2   8,904  6       48          1,248      102         7,500  6     4,216

To demonstrate the feasibility of generating LBSN data over time, Table 2 gives an overview of the output data generated by the location-based social network simulation [5]. All data sets are available at the Open Science Framework (OSF) repository. Each data set can be downloaded directly. For low-bandwidth connections, a pre-compiled executable of each simulation can be downloaded to regenerate the data locally.

The table shows the number of agent check-ins and the number of social links for each scenario; the scenarios combine different numbers of agents (1K, 3K, and 5K) with four different maps. Note that the number of social links may exceed the square of the number of users. This is due to the temporal nature of the data set: social links may emerge and break over time. Agents make new friends but slowly forget them if the friendship is not reinforced by further meetings. Thus, each link comes with a start timestamp and an end timestamp.
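To make the notion of a temporal link concrete, the following minimal sketch models a link as a record with a start and an end timestamp and shows how the same pair of agents can contribute more than one link. The class name, field names, and the use of integer simulation steps as timestamps are illustrative assumptions; they do not describe the file format of the published data sets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemporalLink:
    """Hypothetical record for one friendship episode between two agents."""
    agent_a: int
    agent_b: int
    start_step: int
    end_step: Optional[int] = None  # None means the link has not ended yet

    def active_at(self, step: int) -> bool:
        """True if the friendship exists at the given simulation step."""
        return self.start_step <= step and (
            self.end_step is None or step < self.end_step
        )


def snapshot(links, step):
    """Set of undirected agent pairs that are friends at the given step."""
    return {frozenset((l.agent_a, l.agent_b)) for l in links if l.active_at(step)}


links = [
    TemporalLink(1, 2, start_step=0, end_step=100),  # friendship forms, then fades
    TemporalLink(1, 2, start_step=250),              # the same pair befriends again
    TemporalLink(2, 3, start_step=50, end_step=300),
]

distinct_pairs = {frozenset((l.agent_a, l.agent_b)) for l in links}
print(len(links), "link records for", len(distinct_pairs), "distinct pairs")
```

In this toy example three link records cover only two distinct pairs, which illustrates why a count of links over the whole simulation period is not bounded by the number of agent pairs.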

We observe that, for all study areas, the number of check-ins increases linearly with the number of agents. This is plausible, as the number of hours per day that an agent can spend satisfying its needs and visiting sites is independent of other agents. However, the number of social links increases super-linearly with the number of agents. This can be explained by more agents leading to larger co-locations, which create a chance for each pair of agents in the same co-location to become friends. The generated temporal social network may also have more edges than there are agent pairs. This is again due to the temporal nature of the network: it records changes over time, so a single pair of agents can have multiple friend and unfriend events. The number reported corresponds to the number of new edges added to the temporal social network, regardless of the duration of these events. The super-linear growth of the social network also explains the super-linear run-time needed to create each data set, which ranges from less than one hour for the 1,000-agent instances to 10.5 hours for 5,000 agents. Besides (i) the number of check-ins and (ii) the number of social links, Table 2 also reports (iii) the run-time of each simulation and (iv) the resulting data size.
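The co-location argument can be illustrated with a small counting exercise: a group of k agents at the same site contains k(k-1)/2 pairs, so doubling the group size roughly quadruples the number of pairs that could become friends. The sketch below is a toy calculation under the simplifying assumption that agents are spread evenly over a fixed number of sites; it is not the friendship model used by the simulation, and the function name is hypothetical.

```python
from math import comb


def pair_opportunities(group_sizes):
    """Number of agent pairs sharing a site, summed over all co-locations."""
    return sum(comb(k, 2) for k in group_sizes)


# Toy assumption: 10 sites, agents spread evenly, number of sites held fixed.
small = pair_opportunities([10] * 10)  # 100 agents in groups of 10
large = pair_opportunities([20] * 10)  # 200 agents in groups of 20

print(f"100 agents: {small} co-located pairs")
print(f"200 agents: {large} co-located pairs ({large / small:.1f}x)")

# Upper bound on the number of distinct pairs among 1,000 agents. The number
# of link events can exceed this bound because a pair may befriend repeatedly.
max_pairs_1k = comb(1000, 2)
print(f"distinct pairs among 1,000 agents: {max_pairs_1k:,}")
```

In this toy setting, doubling the number of agents more than quadruples the number of co-located pairs, which is the kind of super-linear growth described above.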

Table 2. Data Sets Resulting from Location-Based Social Network Simulation
Settings  # of Users  # of Check-Ins  # of Links  Period (month)

Figure 2 shows visualizations of the social networks of the 1K-agent scenarios for GMU, NOLA, TownS, and TownL at the end of the 15-month simulation. The visualizations reveal different types of network structures, such as two to three large social communities for the synthetic TownL and one large community each for GMU and NOLA.

Figure 2. Social networks for (a) GMU, (b) NOLA, (c) TownS, and (d) TownL. (Note: the location of a node does not represent its location in the spatial network.)

Since it is hard to describe the evolution of a social network over time in words, we have created a video for each of the four spatial areas that shows the evolution of the social network over the 129,600 steps of the 15-month simulation. These videos show how the social networks evolve from small, isolated cliques into a large and complex network with different sub-structures.
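The step count can be related to simulated time with simple arithmetic. The following sketch assumes 30-day months, which is an assumption made here and not stated in this section; under that assumption, 129,600 steps over 15 months correspond to 288 steps per day, or one step per 5 simulated minutes.

```python
TOTAL_STEPS = 129_600     # from the text
SIMULATED_MONTHS = 15     # from the text
DAYS_PER_MONTH = 30       # assumption: every simulated month has 30 days
MINUTES_PER_DAY = 24 * 60

steps_per_day = TOTAL_STEPS / (SIMULATED_MONTHS * DAYS_PER_MONTH)
minutes_per_step = MINUTES_PER_DAY / steps_per_day

print(f"{steps_per_day:.0f} steps per day, {minutes_per_step:.0f} minutes per step")
```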


[1] E. Cho, S. A. Myers, and J. Leskovec. Friendship and Mobility: User Movement in Location-based Social Networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1082–1090, 2011.
[2] H. Kavak, J.-S. Kim, A. Crooks, D. Pfoser, C. Wenk, and A. Züfle. Location-Based Social Simulation. In SSTD, pages 218–221, 2019.
[3] J.-S. Kim, H. Kavak, U. Manzoor, A. Crooks, D. Pfoser, C. Wenk, and A. Züfle. Simulating urban patterns of life: A geo-social data generation framework. In SIGSPATIAL, pages 576–579, 2019.
[4] J.-S. Kim, H. Jin, H. Kavak, O. C. Rouly, A. Crooks, D. Pfoser, C. Wenk, and A. Züfle. LBSN-data. (accessed 2020-05-21)
[5] J.-S. Kim, H. Jin, H. Kavak, O. C. Rouly, A. Crooks, D. Pfoser, C. Wenk, and A. Züfle. Location-based Social Network Data Generation Based on Patterns of Life. In IEEE International Conference on Mobile Data Management (MDM’20) (to appear). IEEE, 2020.
[6] J.-S. Kim, H. Kavak, and A. Crooks. Procedural City Generation Beyond Game Development. SIGSPATIAL Special, 10(2):34–41, 2018.
[7] M. Li, R. Westerholt, H. Fan, and A. Zipf. Assessing Spatiotemporal Predictability of LBSN: A Case Study of Three Foursquare Datasets. GeoInformatica, 22(3):541–561, 2018.
[8] Y. Liu, T.-A. N. Pham, G. Cong, and Q. Yuan. An Experimental Evaluation of Point-of-interest Recommendation in Location-based Social Networks. Proceedings of the VLDB Endowment, 10(10):1010–1021, 2017.
[9] S. Scellato, A. Noulas, and C. Mascolo. Exploiting Place Features in Link Prediction on Location-based Social Networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1046–1054, 2011.
[10] H. Wang, M. Terrovitis, and N. Mamoulis. Location recommendation in location-based social networks using user check-in data. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 374–383, 2013.
[11] M. Ye, P. Yin, and W.-C. Lee. Location Recommendation for Location-based Social Networks. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 458–461, 2010.
[12] J.-D. Zhang and C.-Y. Chow. Point-of-interest Recommendations in Location-based Social Networks. SIGSPATIAL Special, 7(3):26–33, 2016.