Exploratory Data Analysis On Telecom Italia Big Data Challenge
“The goal is to turn data into information and information into insight.”
— Carly Fiorina, former CEO and chair of Hewlett-Packard Company
Telecom Italia, a telecom company in Italy, organized a Big Data Challenge back in 2014. The goal of this challenge was to come up with technological ideas related to big data that in return may contribute to innovation in the sector. As a part of this challenge, Telecom Italia collaborated with other Italian industries and academic organizations to collect data between November 2013 and December 2013 to give to the participants to use.
This project is focused on studying trend of telecom data for 7 days (November 1 2013 through November 7 2013) for Milan province of Italy. The dataset used in this study contains spatially aggregated square cells of study area (Milan province) segmented in 235x235 meters grid structure, the boundary of the Milan Province, telecommunication activity per cell id giving details on to and from locations for telecom activity and what kind of activity-sms, calls or internet usage. The data is downloaded from Kaggle https://www.kaggle.com/marcodena/mobile-phone-activity.
Following are the data files used:
- mi-to-provinces-2013-11-01.csv
- mi-to-provinces-2013-11-02.csv
- mi-to-provinces-2013-11-03.csv
- mi-to-provinces-2013-11-04.csv
- mi-to-provinces-2013-11-05.csv
- mi-to-provinces-2013-11-06.csv
- mi-to-provinces-2013-11-07.csv
- milano-grid.geojson
- Italian_provinces.geojson.geojson
- sms-call-internet-mi-2013-11-01.csv
- sms-call-internet-mi-2013-11-02.csv
- sms-call-internet-mi-2013-11-03.csv
- sms-call-internet-mi-2013-11-04.csv
- sms-call-internet-mi-2013-11-05.csv
- sms-call-internet-mi-2013-11-06.csv
- sms-call-internet-mi-2013-11-07.csv
Final goal of this study is to analyze pattern of the data and observing trends on different kind of telecom activities like receiving and sending calls, sms and internet usages.
We start by importing necessary libraries.
Importing the Libraries
Here, opendatasets — a python library for downloading data from online sources like Kaggle and Google drive, is used to download the datasets from https://www.kaggle.com/marcodena/mobile-phone-activity.
Downloading the datasets from Kaggle:
Lets read the .geojson files for the Italian province boundary along with the 235x235 meters grid and store them in a geopandas dataframe. Since the boundary file contains provinces for the whole country, we will be filtering out all but Milan province.
Spatial Operations using Geopandas:
The Milan province’s grid breaks the province into ten thousand cells. Some of these cells are slightly outside of the province boundary, so those cells are not considered in the study. To do so, the grid is overlaid on the province boundary using the sjoin operation offered by geopands and only cells within the boundary are considered.
Notice on the top right section of grid on the map. These are some cells that are outside of Milan province boundary. Folium offers interactive map so you are able to zoom in and out of this map.
Lets check out ‘Milan_cells’ dataframe that contain the cells inside of the Milan province boundary. Notice how there are now 9,746 cells out of 10,000:
Study data trends using charts
Interaction from cells to the provinces of Italy :
The function defined below, ‘C2P_data’ reads the csv to pandas dataframe, only selects the cells that are chosen in the previous step, saves them to a new dataframe ‘new_C2P’. Further processing is done on this new dataframe to drop all the values that are NaN values and only ‘Cell2Province’ and ‘Province2Cell’ are copied over. The ‘datetime’ column is then broken down to show the day and time components and appended back into the ‘new_C2P’ dataframe. In the final dataframe ‘df’ values from ‘new_C2P’ dataframe are sorted for time and ‘cell2Province’ columns and only the first 200 rows are stored. The dataframes for 7 days are then merged into one dataframe, ‘merged_c2p_df’, that is then considers for making a histogram graph.
Lets try to see some pattern in the data for 7 days using a histogram. Plotly supports creating an animated frame which can be used to show trends for each day of the week.
From the histogram above, we can see that most cell to province activity for days 1, 2 and 4 happens at 11:00 hrs, for day 3 its at 17:00 hrs, for day 5, 6 and 7 its at noon time. We can also see a drastic increase in cell to province telecom activity from day 4 through day 7.
Same activity is repeated for finding out province to cell trend. A function ‘P2C_data’ returns a dataframe that has top 200 values for each day of the week with sorted values for ‘time’ and ‘Province2cell’ columns. The dataframes for 7 days are then merged into one dataframe, ‘merged_p2c_df’, that is then considers for making a histogram graph.
Interaction from provinces to the cells :
As can be seen by the histogram above, highest province to cell activity for days 1 and 2 happens at 11:00 hrs, for day 3 is at 17:00 hrs, for day 4 is at 13:00hrs, for day 5, 6, 7 is at noon time. And here too, province to cell activity increase from day 4 through day 7. Also a good point to note from the two histograms above is that the rise in cell 2 province and province to cell activity is at the same hour for day except for day 4 for cell 2 province is highest at 11:00 hrs while province to cell is at 13:00 hrs.
Next, the cells that are highly active are identified as ones that have both high cell to province and high province to cell activities. To pick out these cells, a function ‘pick_cellid’ is defined showing a peak for activities in histograms above. This is done by first obtaining the cumulative sum of activity grouped by hour of the day and then selecting the max value which results in peak in the histograms. Doing this, we can then see what is the value for time that yields the maximum value for sum of activity. The same function is used twice, first time to pick out the cells with high cell to province activity and second time for high province to cell activity
Pick highly active cells :
Now that we have a dataframe each for max cell to province and max province to cell activities, lets try to find out the cells that are common between the two. These will then be the most active cells.
Find the cells common between the cell to province and province to cell:
Lets now create a heatmap to get some idea on the trend of telecom activity on cell level. To do so, we need to start with structuring the dataframe in a way we can create the heatmap. For this we need to pivot the dataframe:
The trend as can be observed in the heatmap shows that the telecom activity for the cell ids have increased from day 4 to 7 and is relatively low from day 1 to 3. The darker the tone is the more active the cell is.
Lets try to look at the high telecom activity cells on the map. This may give us interesting insights on why these cells are so high in activity. For doing this, cell id with max activity(from the dataframe ‘revels’) are stored into a list which is then used to filter out records from ‘Milan_cells’ dataframe containing the cell grid geometry.
Now that we have visibility on how the highly active cells look on the map and where they lie, lets try to look into reasons why these cells are so active. Just a little pan on the map around the city of Milan tells us how busy the city looks, with mixed building types like hospitals, offices, churches, university/ schools etc. In the code below, there are a few buildings with latitudes and longitudes from the province that help explaining the active telecom activity in the cells displayed above.
Pick out some of the most popular locations in the province:
Lets try to pick out 15 popular destinations of the province. The idea is to show a relation between highly active cells with popular locations in the province. For this we first prepare a dataframe with the name, latitude, longitude and category of location and then use this dataframe to show points on map. The points will be be shown as CircleMarker on Folium map with pop-up showing name of the location and its category when clicked.
Above map helps in understanding why the overlaid cells may show high telecom activity. Because the points shown on the map are popular locations of the province, these locations may have high traffic which causes the telecom activity rise, especially the internet usage as we are about to observe from the study done in the next section. By clicking on these points, we can see the name of the location and the category of the location. Some of the categories for these points may include Churches, Hospitals and Universities.
SMS, Calls and Internet Datasets
In this section, new dataset is introduced to the study that has data with sms in, sms out, call in, call out and internet usage of the cells. The activities are proportional to the amount of issued sms in/sms out/calls in/ calls out/ internet usage inside a given cell id during a given time interval. From the prior knowledge of what are the cells with highest activity, a function ‘sci_activity’ is defined below to read the .csv files containing this data for selected cells ids. And rows containing NaN values for any field are eliminated.
Telecom Activity trend:
From the dataframe obtained above, ‘’merged_sci_df’, lets find out what kind of activity was highest.
As seen in the chart above, its clear that the internet activity is highest kind of activity, followed by calls going out issued by user, calls received by the user and sms sent out and the least kind of activity is sms coming in.
Top ten countries with most activity :
Lets check out top ten most contacted countries. This can be shown by using line chart. First, lets have top ten countries with highest activity into a list. Then, ‘merged_sci_df’ dataframe can be filtered to only have countries that are in the list.
From the dataframe obtained above, a ‘group by’ function is used to group by country codes and number of times they occur is added. For example, country with country code 39 comes 56450 times.
As can be seen on the chart, the highest calls are made to country code 39 for Italy followed by 41 for Switzerland , 33 for France, 86 for China, 7 for Kazakhstan, 46 for Sweden , 44 for UK, 34 for Spain, 49 for Germany and 1 for United States.
Study data distribution for each of these telecom activity:
Lets look at the box plots for each of these activities. A box plot is a great way of showing the distribution of data using box and whiskers. The data is distributed starts from ‘Minimum’ , going towards the First Quartile (Q1), in this region lie the first 25 percentile of data points, next is Median (Q2) which is the middle value of the data. This is followed by the the Third Quartile(Q3) which is the value between the median and the maximum value for the data and finally we have the Maximum value for the data. The outlier data values are shown as points in the box plot.
Plotly allows us to slice the boxplots. What this does is gives us a zoomed in view for the sliced portion of box plot. Here is a peak at how the boxplots look when zoomed in. The sliced sections are showing the data distribution for each case giving us a view between minimum and maximum values.
As can be observed from the box plots above, the maximum value for smsin activity is highest on day 7,maximum value for smsout, callin and callout are highest for day 5 and the maximum value for internet is highest for day 2. The data for internet and smsout shows a more consistent pattern for all the days where median value is at equal distance of Q1 and Q2, while for all other activities, the median value seem to be closer to Q1.However, for all the activities, the median values are somewhat equidistant from Q1 and Q2 for days 1–3. When we see a shorter box plot, that means that the data is distributed consistently around the median values whereas a taller box plot shows much more variation from median values in data. Here, callin data distribution from day 4 through 7 show taller box plots which means that the data distribution is varying from the median values. There are outliers for each activity’s distribution. These outlier points lie further from the maximum end of the whisker, which means that in all the cases the outliers are the values that are greater than the maximum values of the distribution. Another interesting observation from the box plot above is that in some cases, the minimum value (the lower whisker) is close to the Q1 value. This is because of the way the data is distributed.
Conclusion
- This study was done with the data provided by Telecom Italia Big Data Challenge 2014.
- The study handles data that’s 2GB in size which contains csv and .geojson file formats.
- A pattern in telecom activity from cell to province and vice versa is studied using histogram chart. For most of the days the time with highest activity is same between cell and province to and fro.
- A simple seaborn heatmap shows pattern of telecom activity for 100 cells being higher on day 4 through day 7. This can be extended to more number of cells.
- A use of folium maps is made to show Milan province boundary and grid cells on map. Some of the busy locations are shown as circular markers on the map. An overlay of such points on the grid cells with high telecom activity indicate that the busier a place is the higher the telecom activity is going to be.
- In all the telecom activities, the max activity is usage of internet followed by calls made out of the province.
- Box plot charts are a nice way to show spread of data. Data for internet activity seems to show much consistent pattern.
Future Scope
This study can be taken further by involving more variables like tweets and news feeds during that duration that gives information about the social pulse, the demographics of the area at the time and/or the census data. Next, this project was only focused on study area Milan, but the same can be extended to other provinces of Italy too if the data is available.
Reference
- Kaggle — Mobile Phone Activity in a City (https://www.kaggle.com/marcodena/mobile-phone-activity)
- Barlacchi, G., De Nadai, M., Larcher, R. et al. A multi-source dataset of urban life in the city of Milan and the Province of Trentino. Sci Data 2, 150055 (2015).
- Kung, K., Greco, K., Sobolevsky, S. & Ratti, C. Exploring universal patterns in human home-work commuting from mobile phone data. PLoS ONE 9, 6 (2014).
- Jovian Tutorials:
- Data Visualization using Python, Matplotlib and Seaborn (https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/visualization-with-matplotlib-and-seaborn)
- Visualization with Plotly & Folium (https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/interactive-visualization-with-plotly-and-folium)