Case Study

The case study for this workshop was carried out in Python and the file is presented as a Jupyter Notebook. The case study can also be viewed in PDF format.

NIOSH - Data Science for Everyone Workshop - Accidents Data Case Study (Python)

Created by Leonid Shpaner for use in the NIOSH - Data Science for Everyone Workshop. The dataset originates from Data Mining for Business Analytics (Shmueli et al., 2018). The functions and syntax are presented in the most basic format to facilitate ease of use.

The Accidents dataset is presented as a flat .csv file containing 42,183 recorded automobile accidents from 2001 in the United States. Each accident has one of three observed outcomes: “NO INJURY, INJURY, or FATALITY.” Each accident is supplemented with additional information (e.g., day of the week, weather conditions, and road type). This may be of interest to an organization looking to develop “a system for quickly classifying the severity of an accident based on initial reports and associated data in the system (some of which rely on GPS-assisted reporting)” (Shmueli et al., 2018, p. 202).
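As a rough illustration of a first look at such a flat file, the sketch below reads CSV data with pandas, reports the number of rows and columns, and tallies the outcome codes stored in the MAX_SEV_IR column (described in the data dictionary). The rows here are made up and held in memory so the example runs on its own; with the real data, the in-memory `csv_source` would be replaced by the path to the .csv file, which this page does not specify.

```python
import io

import pandas as pd

# Stand-in for the real file: a few made-up rows with a handful of the
# dataset's columns. With the actual data, replace `csv_source` with the
# path to the .csv file (the path is not given in this document).
csv_source = io.StringIO(
    "HOUR_I_R,ALCHL_I,WKDY_I_R,SPD_LIM,WEATHER_R,MAX_SEV_IR\n"
    "0,2,1,40,1,1\n"
    "1,2,1,70,2,0\n"
    "1,2,0,35,2,0\n"
    "1,1,1,55,1,2\n"
    "0,2,1,35,1,0\n"
)

accidents = pd.read_csv(csv_source)

# Number of rows (accidents) and columns (variables).
n_rows, n_cols = accidents.shape
print(f"{n_rows} accidents, {n_cols} variables")

# Count of each outcome code:
# 0 = no injury, 1 = non-fatal injury, 2 = fatal injury.
outcome_counts = accidents["MAX_SEV_IR"].value_counts().sort_index()
print(outcome_counts)
```

On the full dataset, the same `shape` check should report 42,183 rows if the file matches the description above.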

Data Dictionary

Before going further, let us describe what each variable name means. To this end, we have the following data dictionary:

  1. HOUR_I_R - rush hour classification: 1 = rush hour, 0 = not rush hour (rush hour = 6-9 am, or 4-7 pm)
  2. ALCHL_I - alcohol involvement: 1 = alcohol involved, 2 = alcohol not involved
  3. ALIGN_I - road alignment: 1 = straight, 2 = curve
  4. STRATUM_R - National Automotive Sampling System (NASS) stratum: 1 = crash involving at least one passenger vehicle (i.e., a passenger car, sport utility vehicle, pickup truck, or van) towed from the crash scene due to damage, with no medium or heavy trucks involved; 0 = otherwise
  5. WRK_ZONE - work zone: 1 = yes, 0 = no
  6. WKDY_I_R - weekday or weekend: 1 = weekday, 0 = weekend
  7. INT_HWY - interstate highway: 1 = yes, 0 = no
  8. LGTCON_I_R - light conditions: 1 = day, 2 = dark (including dawn/dusk), 3 = dark, but lighted, 4 = dawn or dusk
  9. MANCOL_I - type of collision: 0 = no collision, 1 = head-on, 2 = other form of collision
  10. PED_ACC_R - collision involvement type: 1 = pedestrian/cyclist involved, 0 = not
  11. RELJCT_I_R - whether the collision occurred at an intersection: 1 = accident at intersection/interchange, 0 = not at intersection
  12. REL_RWY_R - related to roadway or not: 1 = accident on roadway, 0 = not on roadway
  13. PROFIL_I_R - road profile: 1 = level, 0 = other
  14. SPD_LIM - speed limit, miles per hour: numeric
  15. SUR_CON - surface conditions: 1 = dry, 2 = wet, 3 = snow/slush, 4 = ice, 5 = sand/dirt/oil, 8 = other, 9 = unknown
  16. TRAF_CON_R - traffic control device: 0 = none, 1 = signal, 2 = other (sign, officer, ...)
  17. TRAF_WAY - traffic type: 1 = two-way traffic, 2 = divided hwy, 3 = one-way road
  18. VEH_INVL - vehicle involvement: number of vehicles involved (numeric)
  19. WEATHER_R - weather conditions: 1 = no adverse conditions, 2 = rain, snow or other adverse condition
  20. INJURY_CRASH - injury crash: 1 = yes, 0 = no
  21. NO_INJ_I - number of injuries: numeric
  22. PRPTYDMG_CRASH - property damage: 1 = property damage, 2 = no property damage
  23. FATALITIES - fatalities: 1 = yes, 0 = no
  24. MAX_SEV_IR - maximum severity: 0 = no injury, 1 = non-fatal injury, 2 = fatal injury
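One way to put the data dictionary to work is to translate the numeric codes into readable labels before exploring the data. The following is a minimal sketch, assuming pandas and a small made-up sample rather than the real file; the `code_labels` lookup is a hypothetical helper that covers only three of the 24 variables, with the labels copied from the list above.

```python
import pandas as pd

# Code-to-label lookups copied from the data dictionary
# (only three of the 24 variables, for brevity).
code_labels = {
    "ALCHL_I": {1: "alcohol involved", 2: "alcohol not involved"},
    "WEATHER_R": {
        1: "no adverse conditions",
        2: "rain, snow or other adverse condition",
    },
    "MAX_SEV_IR": {0: "no injury", 1: "non-fatal injury", 2: "fatal injury"},
}

# Made-up rows for illustration; not taken from the real dataset.
sample = pd.DataFrame(
    {"ALCHL_I": [2, 1, 2], "WEATHER_R": [1, 2, 2], "MAX_SEV_IR": [0, 2, 1]}
)

decoded = sample.copy()
for column, labels in code_labels.items():
    # Series.map replaces each code with its label; codes missing from
    # the lookup become NaN, which makes unexpected values easy to spot.
    decoded[column] = sample[column].map(labels).astype("category")

print(decoded)
```

Keeping the original numeric columns in `sample` and writing labels to a copy means the coded values remain available for modeling, while `decoded` is easier to read in tables and plots.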

Reference

Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2020).
    Data mining for business analytics: Concepts, techniques and applications in Python. John Wiley & Sons, Inc. 