2  Our Datasets

2.1 Description

There are four main datasets that we plan to use for this project. All of the data is collected by the Metropolitan Transportation Authority, and were imported by directly downloading the CSV files from the official website.

2.1.1 MTA Subway Trains Delayed: Beginning 2020

This dataset (source) contains the number and percentage of subway trains delayed per weekday, broken down into specific delay categories (from 2020 onward). According to the source, the data on train arrivals is manually entered into a digital dispatching tool, and the number of trains delayed by more than five minutes or missing scheduled stops is summed up. Additionally, every delay is manually assigned a category and subcategory. Thus, the assignment of categories to delays depends on the subjective judgment of the dispatcher, and in some cases, the delays may be caused by more than one cause (which cannot be recorded in this format). The frequency of updates to this dataset varies widely, but it has been updated at least once every month. As of 11/19/2024, the dataset has 39051 rows and 7 columns.

The following columns are included in the data:

  1. month - The month in which the delays occur
  2. division - The subway line division, lettered or numbered
  3. line - The subway line
  4. day_type - Whether the day is a weekday (1) or a weekend (2)
  5. reporting_category - One of six categories causing the delays
  6. subcategory - The subcategory of the delays
  7. delays - The number of train delays for the category and time period

2.1.2 MTA Service Alerts: Beginning April 2020

This dataset (source) contains details of service alerts causing route changes in the MTA from April 2020 onward. Note that this contains service changes of MTA buses as well, but for now we are interested in the MTA subway alone. Real-time service alerts are first drafted within the MTA alert system, and then disseminated via various channels in real time, while planned alerts are entered into the system ahead of time. The frequency of updates to this dataset varies, but it has been updated at least once every month on average. As of 11/19/2024, the dataset has 351631 rows and 9 columns.

The following columns are included in the data:

  1. Alert ID
  2. Event ID
  3. Update Number
  4. Date
  5. Agency - Transit agency associated with the event (we are interested in the NYC subway)
  6. Status Label - Pipe-delimited list of service statuses
  7. Affected - Pipe-delimited list of affected routes
  8. Header - High-level overview of service impact
  9. Description - Additional information related to the alert

2.1.3 MTA Subway Wait Assessment: Beginning 2020

This dataset (source) measures how regularly trains are spaced on the MTA subway. Subway schedule data, including service changes, and actual train movement for each day are compiled into a pair of timetables, while the actual train movement and distance data come from a variety of sources. The dataset appears to have been updated at least once every month, and the data has 5306 rows and 8 columns as of 11/19/2024.

The following columns are included in the data:

  1. month
  2. division - The subway line division, lettered or numbered
  3. line - The subway line
  4. day_type - Whether the day is a weekday or a weekend
  5. period - Whether peak or off-peak service
  6. num_timepoints_passing_wait_assessment - Number of scheduled timepoints where trips pass the wait assessment standard
  7. num_sched_timepoints - Number of scheduled timepoints
  8. wait assessment - Percentage of trips that pass the wait assessment

2.1.4 MTA Subway Hourly Ridership: Beginning July 2020

This dataset (source) subway ridership estimates on an hourly basis by subway station complex and class of fare payment, beginning July 2020. It captures rider entries made at subway turnstiles by using OMNY and MetroCard taps/swipes. MetroCard data is aggregated from on-premises MTA data servers, while OMNY data is provided by a third-party vendor named Cubic. This dataset appears to have multiple updates in October and November 2024. As of 12/9/2024, the dataset has about 105 million rows and 12 columns. Due to its large size (about 17 GB), we used Google Colab to aggregate and preprocess the data and reduce its size (described later).

The following columns are included in the (original) data:

  1. transit_timestamp - Timestamp when the payment took place in local time, rounded down to the nearest hour
  2. transit_mode
  3. station_complex_id
  4. station_complex
  5. borough
  6. payment_method
  7. fare_class_category
  8. ridership
  9. transfers
  10. latitude
  11. longitude
  12. Georeference

2.1.4.1 Preprocessing and aggregation

As mentioned earlier, the original dataset was very large and difficult to work with. We discovered that the original dataset had many columns that were not necessary for our analysis. Moreover, a major factor contributing to the size of the dataset was the fact that the ridership data was collected and stored on an hourly basis, rather than monthly. Thus, to reduce the size of the data, we performed the following steps in Google Colab (code can be viewed in the file /preprocessing-code/MTA_Preprocessing.ipynb on GitHub):

  1. Filter the data by the condition transit_mode == 'subway'
  2. Extract the year and month of the transit_timestamp and store them in new columns year and month
  3. Check if the transit_timestamp represents a weekend or a weekday, and store the information in a new column day_type, in the same format as in the ‘MTA Subway Trains Delayed’ dataset
  4. Remove all columns except year, month, day_type, station_complex, and ridership
  5. Group by all the columns except ridership and take the sum of ridership for every group

This effectively computes the overall monthly stationwise ridership for weekdays and weekends respectively, and reduces the size of the dataset to about 2000 KB. The new data set has 44095 rows and 5 columns.

2.2 Missing value analysis

2.2.1 MTA Subway Trains Delayed: Beginning 2020

Code
subway_train_delays <- read_csv("data/MTA_Subway_Trains_Delayed__Beginning_2020_20241118.csv")
plot_missing(subway_train_delays, num_char = 8)

Here, the subcategory field is missing in less than 12.5% of the rows, while all the other values are present in all rows.

2.2.2 MTA Service Alerts: Beginning April 2020

Code
mta_service_alerts <- read_csv("data/MTA_Service_Alerts__Beginning_April_2020_20241118.csv")
plot_missing(mta_service_alerts, num_char = 7, max_cols = 4)

Here, in about 62.5% of the rows only the Description field is missing, while there are a few rows where the Affected and the Header fields are missing as well.

2.2.3 MTA Subway Wait Assessment: Beginning 2020

Code
subway_wait_assessment <- read_csv("data/MTA_Subway_Wait_Assessment__Beginning_2020_20241119.csv")
plot_missing(subway_wait_assessment, num_char = 7)

This dataset has no missing values.

2.2.4 MTA Subway Hourly Ridership: Beginning July 2020

Since the original dataset is very large, we did missing value analysis on our modified dataset. (Note: the first column represents a row index which is not relevant for our analysis)

Code
mta_ridership <- read_csv("data/MTA_Subway_Hourly_Ridership_Data_Modified_v2.csv")
plot_missing(mta_ridership)

This dataset has no missing values.