Experimental statistics: Comparison of the mobile network data structures of two mobile network operators

EXSTAT

The Federal Statistical Office is investigating the usability of mobile network data for official statistics purposes in various feasibility studies. Differences in mobile network coverage and customer acquisition activities between mobile network operators have a decisive impact on data structures and thus the usability of mobile network data. Since spring 2019, the Federal Statistical Office has had access, for the first time, to data from the Telefónica Deutschland network for the Land of Nordrhein-Westfalen (NRW) in addition to Telekom Deutschland data for the purpose of assessing the representativeness and structure of mobile network data. As there are three mobile network operators in the German mobile communications market with a market share of roughly one third each (see Federal Network Agency), it may be assumed that two thirds of the customers in NRW are covered by the data sets of these two operators. Using the two data sets, the Federal Statistical Office pursues the primary objective of increasing the representativeness of mobile network data while comparing the structures of the two sets of mobile network data.

Data basis: mobile network data of two network operators

The data sets available contain anonymised and aggregated mobile network activities of Telekom and Telefónica customers in NRW. A mobile network activity is an event or signal at a cell tower which is triggered when a mobile device stays in a studied area for a minimum length of time. The data sets available include the average mobile network activities of a statistical week consisting of 24-hour periods of selected days and months of 2018/2019. Such a week is subdivided into five types of day: Monday, Tuesday to Thursday, Friday, Saturday and Sunday. The geographical resolution of the area studied is the same in both data sets; it is based on INSPIRE-compliant grid cells which correspond to the grid cells of the 2011 Census Atlas. The Infrastructure for Spatial Information in the European Community (INSPIRE) is an initiative of the European Commission aiming to create a European geodata infrastructure.

The two sets of data are merged to find out whether the representativeness of the mobile network data can be increased. It is absolutely essential here that the mobile network data used for that purpose have not been extrapolated. In contrast to the mobile network data used and discussed so far (see the EXSTAT-article on "Mobile network data representing the population"), the mobile network activities of the two data sets now available have not been extrapolated, neither on the basis of regional market shares nor to the population. This means that the aggregated mobile network activities correspond to signals actually counted which have been produced by mobile devices within the network of the mobile operator, provided that there are at least five signals per area studied. In compliance with the data protection rules, only anonymised values based on a minimum of five mobile network activities per area studied are transmitted to the Federal Statistical Office so that it is not possible to derive information on individual devices or individuals. This is the first time that mobile network data provided by two operators for identical periods and regions can be merged and then checked for skewnesses and distortions. In addition, information on the socio-demographic variables of age and sex is available for the contract customers of these two mobile network operators.
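
The minimum-count rule mentioned above can be illustrated with a short sketch. It is only an assumption about how such a threshold might be applied; the operators' actual anonymisation procedures are not described here. The function name `suppress_small_counts`, the cell identifiers and all counts are invented for the example. Only the threshold of five is taken from the text.

```python
MIN_ACTIVITIES = 5  # threshold stated in the text: at least five activities per area

def suppress_small_counts(counts, threshold=MIN_ACTIVITIES):
    """Keep only areas whose aggregated count reaches the threshold."""
    return {area: n for area, n in counts.items() if n >= threshold}

# Invented raw counts per grid cell, for illustration only.
raw_counts = {"cell_A": 120, "cell_B": 4, "cell_C": 5, "cell_D": 0}

released = suppress_small_counts(raw_counts)
print(released)  # cell_B and cell_D are withheld
```

In this sketch, areas below the threshold are simply absent from the released data, so a missing grid cell does not necessarily mean that no device was present there.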

Structural comparison of mobile network data by mobile network activities and socio-demographic variables

To check the representativeness of the data, the two non-extrapolated sets of mobile network data, which contain mobile network activities from the networks of Telekom Deutschland and Telefónica Deutschland, are combined. To this end, the mobile network activities are filtered by weekday and hour and linked with each other by means of the underlying grid cells. A combined data record is then obtained by adding up the values that have been linked geographically. This is done for both mobile network activities and socio-demographic variables. It is assumed that merging mobile network data of different providers will increase the representativeness of the data and reduce possible distortions.
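
The linking step described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used by the Federal Statistical Office: the record layout (grid cell, type of day, hour, activities), the function name `merge_activities` and all values are assumptions made for the example. The sketch filters both data sets by type of day and hour and then adds up the activities per grid cell.

```python
from collections import defaultdict

# Hypothetical records: (grid_id, day_type, hour, activities). All values are invented.
telekom = [
    ("cell_A", "Sunday", 20, 120),
    ("cell_B", "Sunday", 20, 45),
    ("cell_A", "Monday", 8, 80),
]
telefonica = [
    ("cell_A", "Sunday", 20, 95),
    ("cell_C", "Sunday", 20, 30),
]

def merge_activities(datasets, day_type, hour):
    """Filter each data set by day type and hour, then sum activities per grid cell."""
    combined = defaultdict(int)
    for records in datasets:
        for grid_id, rec_day, rec_hour, activities in records:
            if rec_day == day_type and rec_hour == hour:
                combined[grid_id] += activities
    return dict(combined)

sunday_evening = merge_activities([telekom, telefonica], "Sunday", 20)
print(sunday_evening)
```

A grid cell that appears in only one data set, for instance because the other operator counted fewer than five activities there, contributes only that operator's value to the combined record.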

Mobile network activities

To find out whether and to what extent merging the two sets of mobile network data will increase their representativeness, the relationship between the combined mobile network activities (Telefónica + Telekom) of 2018/19 and the population figures from the 2011 Census is determined in the following. The population figures from the 2011 Census are used as a benchmark to check the representativeness of the merged mobile network data. The resulting Pearson correlation coefficient in Figure 1 describes the linear relationship between the two data sources for all weekdays by time of day. The closer the coefficient is to 1, the stronger the positive linear relationship between the two data sources.
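
For readers who want to retrace the calculation, the Pearson correlation coefficient can be computed as in the following sketch. The function `pearson` is the textbook formula written out by hand. The two lists stand for activities and inhabitants per grid cell and contain invented values; they are not taken from the data sets discussed here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Invented values per grid cell, for illustration only.
activities = [215, 45, 30, 160, 90]
population = [400, 100, 70, 310, 150]

r = pearson(activities, population)
print(round(r, 3))
```

In an analysis like the one shown in Figure 1, such a coefficient would be computed once per type of day and hour, each time across all grid cells.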

In all, the values of the correlation coefficient in Figure 1 reveal a very high positive correlation of up to 0.95 between the mobile network activities merged and population numbers during the evening hours and throughout Saturday and Sunday. Compared with the correlation analysis based on data of only one mobile network operator (see EXSTAT-article on "Mobile network data representing the population" Figure 1), the merging of mobile network data from various operators in Germany leads to a markedly better correlation with the distribution of official population figures. As shown in Figure 2, this approach also renders an almost perfect linear relationship between mobile network activities and the population figures of the 2011 Census especially on a Sunday evening. On account of their high correlation with the population figures of the 2011 Census, the mobile network activities on a Sunday evening are well-suited for deriving the resident population on the basis of mobile network data.

Figure 2 relates the distribution of the relative frequency of mobile network activities on a Sunday evening to the relative frequency of the number of inhabitants from the 2011 Census. Perfect correspondence of the two distributions is represented by the straight black line. The relative frequencies of mobile network activities are shown for each individual operator and for both operators combined (Telefónica + Telekom). When the relative frequency of the mobile network activities from the Telekom Deutschland network (red dots) is compared with the relative frequency of the number of inhabitants, the dots clearly scatter very widely around the straight black line. This can be seen especially at the beginning of the straight line, where the relative frequency of mobile network activities is roughly 0.03. In that section, the potential resident population according to mobile network data from the Telekom Deutschland network is clearly overestimated in areas with a rather low number of inhabitants. From a relative frequency of the number of inhabitants of approximately 0.08, the mobile network activities then underestimate the potential resident population in areas with a rather high number of inhabitants. The relative frequency of mobile network activities from the Telefónica Deutschland network (blue dots) underestimates the potential resident population in areas with a rather low number of inhabitants and, from a relative frequency of the number of inhabitants of roughly 0.04, overestimates it in areas with a rather high number of inhabitants. When the two distributions are combined by merging the mobile network activities of the two providers, the resulting relative frequencies (green dots) are clearly closer to the straight black line and thus to the relative frequencies of the number of inhabitants from the 2011 Census.
The combined distributions scatter less around the straight black line at both ends, thus largely offsetting the distortions in the individual distributions of the two operators. The comparison of the relative frequencies of mobile network activities and the population figures from the 2011 Census in Figure 2 shows, like the correlation analysis in Figure 1, that merging the mobile network activities of different mobile network operators notably increases the representativeness when the resident population on the basis of mobile network activities is taken as an example.
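
The comparison of relative frequencies can be sketched as follows. The exact definition of relative frequency behind Figure 2 is not spelled out here, so the sketch assumes one possible reading: the share of each grid cell in the respective total. The function name `relative_frequencies` and all counts are invented for illustration.

```python
def relative_frequencies(counts):
    """Share of each grid cell in the total, so that the shares sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("total must be positive")
    return {cell: value / total for cell, value in counts.items()}

# Invented counts per grid cell, for illustration only.
activities = {"cell_A": 215, "cell_B": 45, "cell_C": 40}
inhabitants = {"cell_A": 400, "cell_B": 100, "cell_C": 100}

rel_act = relative_frequencies(activities)
rel_pop = relative_frequencies(inhabitants)

# Positive deviation: the cell's share is overestimated by the mobile network data.
# Negative deviation: it is underestimated. Zero corresponds to the diagonal line.
deviation = {cell: rel_act[cell] - rel_pop[cell] for cell in rel_pop}
for cell, diff in deviation.items():
    print(cell, round(diff, 4))
```

Under this reading, a dot above the straight line corresponds to a positive deviation and a dot below it to a negative one.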

Socio-demographic variables

As the merging of data and the above correlations and analyses of relative frequencies show an increase in the representativeness of mobile network data, the representativeness of the socio-demographic variables will be studied more closely in a next step.

Since the customer structures of the individual network operators differ, some socio-demographic attributes are more common in the mobile network data than others. The resulting selective samples or selective mobile network data can cause distortions in the variable attributes. As a matter of fact, these selectivities clearly reflect the customer structure of the respective operator but also contribute to a distorted representation of the population when mobile network data of individual operators are used. In addition to that, only the socio-demographic attributes of contract customers of the two mobile network operators are available from the customer relationship management system. When the two operators’ socio-demographic attributes are merged, the result is as follows: Figures 3 and 4 show as an example the percentage distribution by sex and age group of the merged contract customers of the two network operators (Telefónica + Telekom) and the percentage shares according to the 2011 Census. As the socio-demographic attributes provided by the two operators are those of contract customers from the age of 20, only 2011 Census data for the population aged 20 years and over is considered here to ensure the comparability of the attributes.

It turns out that the percentage shares by sex and age group for the two operators continue to differ markedly from the 2011 Census distribution even after they have been merged. After the two operators’ socio-demographic information has been merged, the share of women is underestimated by 13 percentage points and the share of men is overestimated by 13 percentage points compared with the 2011 Census (see Figure 3). When the percentage shares by age group for the two operators are compared with the corresponding percentages from the 2011 Census as in Figure 4, a notable distortion is also revealed in the distribution by age group of the merged mobile network data. Compared with the 2011 Census, the 50-59 age group in particular is overrepresented in the merged mobile network data with a difference of 8 percentage points. The age group 69 and over, in contrast, is clearly underrepresented in the mobile network data with a difference of 13 percentage points compared with the 2011 Census data. One reason may be the comparatively low mobile phone penetration rate among the older German population. The visible skewness of the variables in the merged mobile network data suggests that there are differences between the socio-demographic variables of the two clienteles which cannot be compensated for. The individual variable attributes may supplement each other to a small extent, but the skewness and distortions cannot be corrected completely by the procedure described. Therefore, the variables cannot be referred to as representative.

Conclusion

The findings show that merging mobile network activities of several operators increases the representativeness considerably. As roughly 97.5% of households in Germany had a mobile phone or smartphone in 2020 (see continuous household budget surveys – only in German: Laufende Wirtschaftsrechnung (LWR)), it may also be assumed that merging the mobile network activities of all three operators in Germany will reflect the present population distribution in an almost perfect and representative manner. For the most part, however, the data do not include children and those parts of the older population who do not have a mobile device. These groups will have to be integrated by means of an extrapolation frame to be designed especially for that purpose.

It could also be shown that distortions in the variable attributes can be offset only to a limited extent by this procedure. This may be due to the fact that the market shares of the two operators differ in size in NRW, meaning that one operator is overrepresented or underrepresented in that Land and, consequently, that the clienteles differ in size in the area studied. As the socio-demographic variables are available for contract customers from the age of 20 only, it remains unclear how the variable attributes of contract customers are distributed in relation to prepaid customers. Family phone plans, two SIM cards per user and missing information, for example on prepaid customers, make it even more difficult to reflect the socio-demographic variables on the basis of mobile network data in a representative and detailed manner. There are also still uncertainties on the part of the Federal Statistical Office regarding the data generation process, as the mobile network data are anonymised and aggregated by the respective data providers according to their own and presumably different methods and concepts.

Finally, further steps have to be taken to obtain data from all mobile network operators within the country and thus increase the representativeness of the data for all of Germany. It is also necessary to create a legal basis in order to permanently secure the access to privately held data and enable their integration into official statistics production in the long term.