Center for Problem-Oriented Policing

POP Center Tools Understanding and Responding to Crime & Disorder Hot Spots Appendix A


Appendix A: Data Preparation

Cleaning the Data

Agencies and their systems, including computer-aided dispatch (CAD) and records management systems (RMS), differ in whether addresses are geocoded as part of the initial entry. The following section covers data that do not already have x and y coordinates for mapping, as well as datasets that do. If your system geocodes addresses at entry, you should familiarize yourself with the accuracy of the geocoded data it generates (e.g., whether it uses a current street file and whether place names that do not have an address are geocoded).

Before data can be geocoded, they must be adequately cleaned. Cleaning is the process of correcting inaccuracies in the dataset. This may be as simple as filling in gaps or supplementing the data. However, in some cases, it may require addressing duplicate records or records that are not of interest. The data used to diagnose hot spots often consist of thousands of records. Usually, the best way to clean data is to use some form of software built for manipulating and managing databases. There are a number of programs that have this capability. Both Microsoft Access and Microsoft Excel are commonly used because they are included in most versions of the Microsoft Office suite.

The software should be able to systematically correct the characteristics of the data that make geocoding difficult. Rather than a step-by-step tutorial, this section lists common issues. Some are specific to police data; others are found across datasets and should be checked before using any data file. To ensure data are properly cleaned, first check how the data in use are structured, because the same information can be recorded in a number of ways. The following issues often arise in police incident data.


  • Abbreviations instead of full text for street names and types
  • Missing data from the street file, which is used to create the address locator (e.g., address ranges for street segments)
  • Missing or omitted numbers from the building number of an address
  • Address is listed as a “block” or “block of” (e.g., “700 block Main St.”)
  • The inclusion of apartment or unit numbers for addresses that contain multiple housing units (e.g., “704 #2, Main St.”)
  • Event type classifications: Many police data systems, such as calls-for-service data, include traffic and administrative call types, which may need to be excluded so that only relevant event types (e.g., citizen-initiated calls) remain
  • Date and time fields are combined
  • Data are missing a unique identifier
  • The place of occurrence and the place of reporting are not distinguished or are inaccurately recorded
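Several of the issues listed above can be handled with a few rule-based passes over the data. The following Python sketch is illustrative only: the abbreviation table, the field names (`type`, `datetime`, `address`), the excluded call types, and the date format are all assumptions that would need to match your own CAD or RMS export. It also assumes that a "block" address should be reduced to the block's base number, which is one convention among several.

```python
import re
from datetime import datetime

# Hypothetical abbreviation table; a real one should mirror the local street file.
STREET_TYPES = {"ST": "STREET", "AVE": "AVENUE", "RD": "ROAD",
                "BLVD": "BOULEVARD", "DR": "DRIVE"}


def clean_address(raw):
    """Normalize one free-text address so a locator has a better chance of matching it."""
    text = raw.upper().strip()
    # Drop apartment/unit markers such as "#2", "APT 4B", or "UNIT 12".
    text = re.sub(r"(#|\b(?:APT|UNIT)\b\.?)\s*\w+", " ", text)
    # "700 BLOCK OF MAIN ST" -> "700 MAIN ST" (assumed convention: keep the base number).
    text = re.sub(r"\bBLOCK(\s+OF)?\b", " ", text)
    # Replace periods and commas with spaces; split() then collapses the whitespace.
    tokens = re.sub(r"[.,]", " ", text).split()
    # Expand only the final token, so a street *name* containing "ST" is left alone.
    if tokens and tokens[-1] in STREET_TYPES:
        tokens[-1] = STREET_TYPES[tokens[-1]]
    return " ".join(tokens)


def split_datetime(stamp, fmt="%m/%d/%Y %H:%M"):
    """Split a combined date-time string into separate ISO date and time strings."""
    parsed = datetime.strptime(stamp, fmt)
    return parsed.date().isoformat(), parsed.time().isoformat(timespec="minutes")


def prepare(records, excluded_types=("TRAFFIC STOP", "ADMIN")):
    """Filter out irrelevant call types, clean addresses, and assign a unique id."""
    cleaned = []
    for record in records:
        if record["type"].upper() in excluded_types:
            continue  # skip call types that are not of interest
        date, time = split_datetime(record["datetime"])
        cleaned.append({
            "id": len(cleaned) + 1,  # simple sequential unique identifier
            "type": record["type"],
            "address": clean_address(record["address"]),
            "date": date,
            "time": time,
        })
    return cleaned
```

Under these assumptions, both “700 block Main St.” and “704 #2, Main St.” are reduced to a building number, street name, and spelled-out street type. Rules like these should be checked against a sample of records before being applied to a full dataset.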

Mapping Crime and Disorder


Geocoding is the process of matching a location (typically an address) to a real place on Earth. Across all mapping software, geocoding requires a data file with the case information (e.g., event addresses) and an address locator file, which is a reference file built from a street network file. The locator works as an index of addresses: it reads the street name and the range of addresses for each street, then places a point at the estimated location. While it is likely impossible to perfectly geocode thousands of events in a dataset, a match rate of 85 percent is generally considered the minimum acceptable rate.47 That said, because many systems are automated and require complete address information, higher match rates (95 percent and above) are common and should be a goal for most datasets. In any case, we recommend reviewing unmatched records before moving forward with the analysis to see whether they share a common reason for not matching (e.g., out of jurisdiction, new address, alias that is not yet in the street file). Common errors can often be addressed in batches, which can substantially improve geocoding rates and hot spot identification.
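To make the idea of an address locator concrete, the sketch below estimates a point by interpolating a building number along a street segment's address range, then reports the match rate. It is a simplified illustration, not how any particular GIS package implements geocoding: it assumes one segment per street name, ignores odd/even sides of the street, and uses hypothetical field names (`street`, `number`, `from_number`, `to_number`, `start`, `end`).

```python
def interpolate_address(number, segment):
    """Estimate an (x, y) point for a building number along one street segment."""
    low, high = segment["from_number"], segment["to_number"]
    if not low <= number <= high:
        return None  # number falls outside the segment's address range
    # Fraction of the way along the segment; use the midpoint for a one-number range.
    fraction = 0.5 if high == low else (number - low) / (high - low)
    (x1, y1), (x2, y2) = segment["start"], segment["end"]
    return (x1 + fraction * (x2 - x1), y1 + fraction * (y2 - y1))


def geocode(records, locator):
    """Split records into matched (with coordinates) and unmatched lists."""
    matched, unmatched = [], []
    for record in records:
        segment = locator.get(record["street"])  # None if the street is not in the file
        point = interpolate_address(record["number"], segment) if segment else None
        if point is None:
            unmatched.append(record)  # keep for manual review
        else:
            matched.append({**record, "x": point[0], "y": point[1]})
    return matched, unmatched


def match_rate(matched, unmatched):
    """Percentage of records that were successfully geocoded."""
    total = len(matched) + len(unmatched)
    return 100.0 * len(matched) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical one-street locator and two incident records.
    locator = {"MAIN STREET": {"from_number": 700, "to_number": 798,
                               "start": (0.0, 0.0), "end": (100.0, 0.0)}}
    records = [{"street": "MAIN STREET", "number": 749},
               {"street": "ELM STREET", "number": 12}]
    matched, unmatched = geocode(records, locator)
    print(match_rate(matched, unmatched), unmatched)
```

Keeping the unmatched records in their own list makes it easier to look for a shared cause, such as a street missing from the reference file, and to fix it in a batch.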

Departments can also map crime and disorder call data using existing geographic coordinates (e.g., latitude and longitude or x and y coordinates). Mapping these data is much simpler because the GIS software reads the coordinates based on the coordinate system of other area files already in the map (e.g., police jurisdiction areas, city area). Most mapping software programs can also identify the coordinates of the centroid (the center point) of larger area files (polygons), or the end and center points of line files (polylines), such as streets. This is helpful because a number of hot spot generation techniques (as well as spatial analysis techniques) use point data.
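GIS software normally computes these points for you, but a short sketch shows what the calculation involves. The functions below are illustrative assumptions rather than any package's API: they take plain lists of (x, y) tuples in a projected (planar) coordinate system, the polygon is assumed to be simple (non-self-intersecting), and the centroid uses the standard shoelace formula.

```python
import math


def polygon_centroid(vertices):
    """Centroid of a simple polygon given as a list of (x, y) vertices."""
    area_sum = cx_sum = cy_sum = 0.0
    # Pair each vertex with the next one, wrapping around to close the ring.
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        cross = x1 * y2 - x2 * y1
        area_sum += cross
        cx_sum += (x1 + x2) * cross
        cy_sum += (y1 + y2) * cross
    if area_sum == 0:
        raise ValueError("polygon has zero area")
    # area_sum is twice the signed area, so the divisor 6 * area becomes 3 * area_sum.
    return (cx_sum / (3.0 * area_sum), cy_sum / (3.0 * area_sum))


def polyline_midpoint(points):
    """Point halfway along a polyline, measured by distance travelled."""
    pairs = list(zip(points, points[1:]))
    lengths = [math.dist(a, b) for a, b in pairs]
    remaining = sum(lengths) / 2.0
    for (a, b), length in zip(pairs, lengths):
        if length > 0 and remaining <= length:
            t = remaining / length  # fraction along this piece of the line
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        remaining -= length
    return points[-1]  # degenerate line (single point or zero length)
```

Note that the centroid of an irregular polygon can fall outside the polygon itself, which is worth checking before treating it as a representative point for an area.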

Joining/associating incident data

Once the data are cleaned and geocoded, it will likely be necessary to perform joins to synthesize all the relevant information in one place. A join is simply a link between two datasets, often between one that is already in a spatial format and one that is not. Joins bring all the data associated with a geographic unit into one place, so that the different features and characteristics associated with that unit can be analyzed together.

Two types of joins are especially useful when identifying hot spots. The first type is a spatial join, which joins data based on some type of spatial association. For instance, one approach to identifying hot spots is to obtain a count of policing incident points at each street segment.48 The counts are obtained for each segment through a spatial join of event points to street lines. Spatial joins are also often used to associate data from larger geographic units (like census block groups or tracts) with smaller features (like streets). However, this should be done with caution, because precision is lost when data from larger geographic units are assigned to smaller ones.
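The point-to-street spatial join described above can be sketched as a nearest-segment search. This is a minimal illustration under stated assumptions, not a substitute for a GIS package's spatial join tool: coordinates are assumed to be planar (projected), each street segment is a straight line between two endpoints, the segment identifiers are hypothetical, and the `max_distance` cutoff of 50 map units is an arbitrary example value.

```python
import math


def distance_to_segment(point, start, end):
    """Shortest planar distance from a point to a straight line segment."""
    px, py = point
    x1, y1 = start
    x2, y2 = end
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - x1, py - y1)  # segment is a single point
    # Project the point onto the line, clamped to the segment's endpoints.
    t = max(0.0, min(1.0, ((px - x1) * dx + (py - y1) * dy) / length_sq))
    return math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))


def count_points_per_segment(points, segments, max_distance=50.0):
    """Assign each point to its nearest segment and count points per segment."""
    counts = {segment_id: 0 for segment_id in segments}
    for point in points:
        nearest_id, nearest = None, None
        for segment_id, (start, end) in segments.items():
            d = distance_to_segment(point, start, end)
            # Ignore segments beyond the cutoff; keep the closest of the rest.
            if d <= max_distance and (nearest is None or d < nearest):
                nearest_id, nearest = segment_id, d
        if nearest_id is not None:
            counts[nearest_id] += 1
    return counts
```

Checking every point against every segment is adequate for small examples; for a citywide dataset, a spatial index in dedicated GIS software will be much faster. Points that fall outside the cutoff are left uncounted here and may deserve the same review as unmatched geocoding records.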

The second type of join is an attribute join, which joins two datasets based on a common attribute or field. This join is often used to associate census statistics with census area geographic boundaries, so that information like mean household income or population can be joined to the area in which the data were collected.
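An attribute join amounts to a lookup on the shared field. The sketch below is a minimal illustration with hypothetical field names (`tract_id`, `population`, `mean_income`); it keeps every geographic feature even when no statistics are found for it, which corresponds to what database software usually calls a left join.

```python
def attribute_join(features, table, key):
    """Attach the fields of each matching table row to each feature, by a shared key."""
    lookup = {row[key]: row for row in table}  # index the table by the common field
    joined = []
    for feature in features:
        match = lookup.get(feature[key], {})  # empty dict when nothing matches
        # Feature fields are written last, so they win if a field name is duplicated.
        joined.append({**match, **feature})
    return joined


if __name__ == "__main__":
    # Hypothetical census tract boundaries and a statistics table keyed on tract_id.
    tracts = [{"tract_id": "001", "name": "Downtown"},
              {"tract_id": "002", "name": "Riverside"}]
    stats = [{"tract_id": "001", "population": 4200, "mean_income": 51000}]
    for row in attribute_join(tracts, stats, "tract_id"):
        print(row)
```

Before joining, it is worth confirming that the key is stored the same way in both datasets (for example, as text with leading zeros preserved), because mismatched formats are a common reason that rows fail to join.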
