P3: OpenStreetMap Data Case Study


Map Area

1. Schaumburg, IL, United States

This is a map of Schaumburg and neighboring towns representing my data. I’m interested to see what database querying reveals.

( Image: ScreenShot_03.PNG )

In [1]:
from IPython.display import Image
Image(filename='Images/ScreenShot_03.PNG')
Out[1]:

Overview of the Data

The choice of the databases for this project: SQL.

The extract was created on 2016 September 02, at 04:53 PM. OSM XML compressed file size is 3.7MB, uncompressed size 53.5MB.

To get the feeling on how much of which data I can expect to have in the map, I did the iterative parsing to process the map file and find out not only what tags are there, but also how many.

( Python file: Iterative_Parsing.py, Image: ScreenShot_09.PNG )

In [2]:
Image(filename='Images/ScreenShot_09.PNG')
Out[2]:

To see if there are any potential problems I explored the data a bit more. Before processing the data and adding it into the database, I checked the "k" value for each "tag".

The count of each of four tag categories in a dictionary:

  1. "lower", for tags that contain only lowercase letters and are valid,
  2. "lower_colon", for otherwise valid tags with a colon in their names,
  3. "problemchars", for tags with problematic characters, and
  4. "other", for other tags that do not fall into the other three categories.

( Python file: Tag_Types.py, Image: ScreenShot_10.PNG )

In [3]:
Image(filename='Images/ScreenShot_10.PNG')
Out[3]:

The number of unique users who have contributed to the map in this particular area is 390.

( Python file: Exploring_Users.py, Image: ScreenShot_11.PNG )

In [4]:
Image(filename='Images/ScreenShot_11.PNG')
Out[4]:

Problems Encountered in the Map

1. Abbreviated street names

I decided to explore the chosen Chicago west suburban area programmatically. After running audit_street_names.py and printing the results, one of the problems I have encountered was the inconsistency in naming the streets. See Image 01. Various abbreviations (with or without dots, upper/ lower case) of the names will be spelled out in full word with the first capital letter. Also, there are values which are assigned to a wrong map feature, e.g a zipcode as a street name. I am not planning to remove these values.

( Python file: audit_street_names.py, Image: ScreenShot_01.PNG )

In [5]:
Image(filename='Images/ScreenShot_01.PNG')
Out[5]:

2. Abbreviated street directions.

The other inconsistency is the street directions. Various abbreviations. All street directions will be spelled out in full word with the first capital letter, e.g. W -> West, E -> East. See image 01.

(Image: ScreenShot_01.PNG)

3. Spelling mistake

The spelling mistake was found after running the code with various attribute k values.

( Python file: audit_street_names.py, Image: ScreenShot_02.PNG )

In [6]:
Image(filename='Images/ScreenShot_02.PNG')
Out[6]:

3. Inconsistent phone number formats

After running the code with attribute k = 'phone', I found a wide variety in how phone numbers are presented, with regard to spacing and punctuation. I decided to follow the NANP convention, and chose to use instantly recognizable and most common format of the number (###) ###-####.

( Python file: audit_street_names.py, Image: ScreenShot_07.PNG )

In [7]:
Image(filename='Images/ScreenShot_07.PNG')
Out[7]:

The Cleaning and Re-shaping Data

Based on my findings, I created a dictionary mapping the incorrect street names to correct values.

( Python file: data.py, Image: ScreenShot_12.PNG )

In [8]:
Image(filename='Images/ScreenShot_12.PNG')
Out[8]:

The street names were fixed, cleaned and updated using function update_name(name, mapping)

( Python file: data.py, ways_tags.csv, ScreenShot_15.PNG )

In [9]:
Image(filename='Images/ScreenShot_15.PNG')
Out[9]:

With regular expressions pattern I checked for various formats of the phone number and made them uniform format with function update_phone(child_dict)

( Python file: data.py, nodes_tags.csv, Image: ScreenShot_13.PNG, ScreenShot_14.PNG )

In [22]:
Image(filename='Images/ScreenShot_14.PNG')
Out[22]:
In [11]:
Image(filename='Images/ScreenShot_13.PNG')
Out[11]:

Overview of the Data in my Database

I created a database Schaumburg.db consisting of 5 tables.

( Image: ScreenShot_16.PNG )

In [12]:
Image(filename='Images/ScreenShot_16.PNG')
Out[12]:

To get started, I was curious how many restaurants there are in my town and nearbay. Plenty of choices.

( Image: ScreenShot_18.PNG )

In [13]:
Image(filename='Images/ScreenShot_18.PNG')
Out[13]:

And here is beginning of the restaurant list:

( Image: ScreenShot_17.PNG )

In [14]:
Image(filename='Images/ScreenShot_17.PNG')
Out[14]:

And only one restaurant which serves Vietnamese cuisine.

( Image: ScreenShot_19.PNG )

In [15]:
Image(filename='Images/ScreenShot_19.PNG')
Out[15]:

Some other observations:

Found secondary level tags of the node 'node' element, which k attribute holds 'created_by' value. Not sure if this information has any value for the map.

<tag k="created_by" v="Potlatch 0.10b"/>

Some key attributes with value = 'building' lists an address instead of the name:

sqlite> SELECT a.value FROM nodes_tags as a LEFT JOIN nodes_tags as b ON a.id = b.id WHERE a.key = 'name' and b.key = 'building' ORDER BY a.value;

999 East Touhy

Full address as a house number value:

sqlite> SELECT b.value FROM nodes_tags as a LEFT JOIN nodes_tags as b ON a.id = b.id WHERE b.key = 'housenumber' and a.key = 'name';

2310 South Elmhurst Road

Most of the schools have no addresses listed. School count:

sqlite> SELECT COUNT(a.value) FROM nodes_tags as a LEFT JOIN nodes_tags as b ON a.id = b.id WHERE b.value = 'school' and a.key = 'name' ORDER BY a.value;

95

Count of schools with the address:

sqlite> SELECT COUNT(a.value) FROM nodes_tags as a LEFT JOIN nodes_tags as b ON a.id = b.id WHERE b.value = 'school' and a.key = 'street' ORDER BY a.value;

1

I checked, how many Churches are in the area:

sqlite> SELECT COUNT(a.value) FROM nodes_tags as a LEFT JOIN nodes_tags as b ON a.id = b.id WHERE b.value = 'place_of_worship' and a.key = 'name' ORDER BY a.value;

81

And this query might explains why the number of churches is so high - because of double entries:

sqlite> SELECT a.value FROM nodes_tags as a LEFT JOIN nodes_tags as b ON a.id = b.id WHERE b.value = 'place_of_worship' and a.key = 'name' ORDER BY a.value;

First United Methodist Church

First United Methodist Church

Immanuel Lutheran Church

Immanuel Lutheran Church

Saint Johns Church

Saint Johns Church

..

To visualize the process of finding, verifying and fixing the information of one of my questionable findings about the double entries, I ran a query on a node with the feature name 'Saint Johns Church'. The result is a list of 2 entries with a different latitude and longitude values.

sqlite> SELECT nodes.lat, nodes.lon FROM nodes JOIN nodes_tags ON nodes.id = nodes_tags.id WHERE nodes_tags.value = 'Saint Johns Church';

41.9936972|-88.1099734

42.0441949|-87.9486796

I am using Bing search and Bing maps to get the addressees, providing the coordinates and looking for a church nearby. First search finds the church. The second result, doesn't find an address, but gives me an approximate location. After looking up the official church Web pages, I find the names, which have to be corrected, and addresses which needs to be added. The next step is to actually drive by and verify the correctness of the information.

In [16]:
Image(filename='Images/ScreenShot_20.PNG')
Out[16]:
In [17]:
Image(filename='Images/ScreenShot_22.PNG')
Out[17]:
In [18]:
Image(filename='Images/ScreenShot_21.PNG')
Out[18]:
In [19]:
Image(filename='Images/ScreenShot_23.PNG')
Out[19]:

Conclusion

After taken a look at my neibourhud map, information available, and my research example, I realized that there could be done a lot of stuff to improve it. To list a few - almost all schools and churches have no listed address features, at the same time verify the names. That alone would be a ton of tedious work, there are no shortcuts to this process.

To add any features to the existing nodes, the challenge is to check and figure out, if the existing information is correct, if not, what exactly has to be fixed. I would need to do more research to get to know the process of getting involved.

It is probably easier to create and add a new node.

Recources