Mining Contributors to Train Wreck
ABSTRACT:
Rail accidents represent an important safety concern for the transportation industry in many countries. In the 11 years from 2001 to 2012, the U.S. had more than 40 000 rail accidents that cost more than $45 million. While most of the accidents during this period had very little cost, about 5200 had damages in excess of $141 500. To better understand the contributors to these extreme accidents, the Federal Railroad Administration has required the railroads involved in accidents to submit reports that contain both fixed field entries and narratives that describe the characteristics of the accident. While a number of studies have looked at the fixed fields, none have done an extensive analysis of the narratives. This paper describes the use of text mining with a combination of techniques to automatically discover accident characteristics that can inform a better understanding of the contributors to the accidents. The study evaluates the efficacy of text mining of accident narratives by assessing predictive performance for the costs of extreme accidents. The results show that predictive accuracy for accident costs significantly improves through the use of features found by text mining and predictive accuracy further improves through the use of modern ensemble methods. Importantly, this study also shows through case examples how the findings from text mining of the narratives can improve understanding of the contributors to rail accidents in ways not possible through only fixed field analysis of the accident reports.
EXISTING SYSTEM:
- Tey et al. describe the use of logistic regression and mixed regression to model the behavior of drivers at railway crossings.
- Akin and Akbas describe the use of neural networks to model intersection crashes and intersection characteristics such as lighting and surface materials. Taken together, these papers show the use of data mining to better understand the factors that can influence and improve safety at rail crossings.
- Nayak et al. used text mining to analyze road crash data in Australia. For text mining they employed Leximancer concept mapping, as implemented in a commercial product available from Leximancer.
DISADVANTAGES OF EXISTING SYSTEM:
- To date, existing studies have not reported a large-scale analysis of the accident narratives for information that could inform safety policies and design.
- Earlier text mining work focused on retrieval, not prediction.
PROPOSED SYSTEM:
- This paper describes an investigation to understand the possible predictors of, or contributors to, accidents obtained from "mining" the narrative text in rail accident reports. To do this, the approach combines several analytical methods to first identify the accidents of interest and then look for relationships in the structured and unstructured data that may suggest contributors to accidents.
- This study evaluates the efficacy of the features found from text mining by using models containing these features to predict the costs of extreme accidents. In performing this evaluation, the study also considers the usefulness of modern ensemble approaches that incorporate these text-mined features to predict accident costs.
- Finally, the study teases apart the text-mined features, whose importance is confirmed by predictive accuracy, for their insights into the contributors to rail accidents. The purpose of this final analysis is to understand the insights for rail safety that text mining can provide and that fixed field reports alone cannot.
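The selection and evaluation steps above can be sketched in a few lines. This is a minimal illustration, not the study's code (which was written in R): the record layout with a `damage` key and the function names are assumptions made for the example, while the $141,500 threshold is the one quoted in the abstract.

```python
import math

# Damage threshold (in dollars) for "extreme" accidents, as quoted in the abstract.
EXTREME_COST_THRESHOLD = 141_500


def select_extreme(accidents, threshold=EXTREME_COST_THRESHOLD):
    """Keep only accidents whose reported damage exceeds the threshold.

    `accidents` is assumed to be a list of dicts with a numeric "damage" key
    (a hypothetical layout chosen for this sketch).
    """
    return [a for a in accidents if a["damage"] > threshold]


def rmse(actual, predicted):
    """Root mean square error between actual and predicted costs."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def percent_change_rmse(baseline_rmse, new_rmse):
    """Percent change in RMSE relative to a baseline model.

    A negative value means the new model (for example, one that adds
    text-mined features) predicts costs more accurately than the baseline.
    """
    return 100.0 * (new_rmse - baseline_rmse) / baseline_rmse
```

Comparing the RMSE of a model built on fixed fields only with the RMSE of a model that also uses text-mined features is one simple way to express the improvement in predictive accuracy that the study reports.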
ADVANTAGES OF PROPOSED SYSTEM:
- First, this paper describes a broader comparison of techniques than previous studies.
- Its three-by-three design provides a broader range of evaluation than any previous study.
- This paper focuses on rail accident reports over a longer time span than other studies, namely 11 years.
- None of the text mining analytics described here have previously been applied to rail accident damage assessment.
- Finally, the methods in this paper are all available through open source software (R), and the code used in the analysis is also freely and openly available.
MODULES:
- Accident Report Generation
- Characteristics of Accident Report
- Text Mining Techniques
- Stored In DB
Module Descriptions:
Accident Report Generation:
This paper integrates methods for safety analysis with accident report data and text mining to uncover contributors to rail accidents. This section describes related work in rail and, more generally, transportation safety and also introduces the relevant data and text mining techniques.
Characteristics of Accident Report:
The accident report has a number of fields that describe the characteristics of the train or trains, the personnel on the trains, the operational conditions (e.g., speed at the time of the accident, highest speed before the accident, number of cars, and weight), and the primary cause of the accident.
Text mining, the field concerned with analyzing free text such as the report narratives, has become increasingly important because of the large amounts of data available in documents, news articles, research papers, and accident reports.
Text Mining Techniques:
Latent Dirichlet Allocation (LDA): LDA provides a method to identify topics in text. We applied LDA to the accident narratives twice, to obtain 10 topics and 100 topics. To incorporate the LDA topics into the ensemble models, we score each topic in each narrative by the proportion of topic words in the narrative. To compare the importance of topics, we also ran the ensemble models using the ten most important words in each topic.
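The topic-scoring step can be illustrated with a short sketch. It assumes the topics have already been fitted elsewhere and are available as lists of their most important words; the topic names, word lists, and tokenizer here are hypothetical. It also interprets "proportion of topic words in the narrative" as the fraction of the narrative's tokens that appear in a topic's word list, which is one plausible reading rather than the study's confirmed definition.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def topic_scores(narrative, topics):
    """Score each topic by the share of narrative tokens found in its word list.

    `topics` maps a topic name to its list of top words (hypothetical input,
    for example the ten most important words of each fitted LDA topic).
    """
    tokens = tokenize(narrative)
    if not tokens:
        # An empty narrative carries no evidence for any topic.
        return {name: 0.0 for name in topics}
    scores = {}
    for name, words in topics.items():
        vocabulary = set(words)
        hits = sum(1 for token in tokens if token in vocabulary)
        scores[name] = hits / len(tokens)
    return scores
```

Each narrative then yields one numeric score per topic, and these scores can be appended to the fixed-field variables as additional predictors for the ensemble models.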
Partial Least Squares (PLS): We measure variable importance as the percent change in root mean square error (RMSE) in the out-of-bag sample when that variable is removed. The results indicate that, of the 20 most important variables, 16 are LDA topics. For PLS we first obtained 1000 words from the LDA topics. We then estimated the number of PLS components using cross-validation. We incorporate the PLS component into the accident damage models using two approaches. The first approach is a two-step process in which we first predict damage with only the PLS component. In the second approach we use the PLS component to estimate the coefficients for each word and directly use the result as another predictor variable, the PLS predictor, in the random forest model.
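The second approach, turning per-word coefficients into a single PLS predictor, can be sketched with a one-component PLS regression. This is an illustration under stated assumptions rather than the study's implementation: the study worked in R and chose the number of components by cross-validation, whereas this sketch fixes a single component and uses plain Python lists. The word-count matrix `X`, the damage vector `y`, and the function names are hypothetical.

```python
import math


def fit_pls1_single(X, y):
    """Fit a one-component PLS regression of y on the columns of X.

    X is a list of rows (one row of word counts per narrative) and y holds
    the matching damage values. Returns the column means, the mean of y,
    and one regression coefficient per word.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X
    yc = [value - y_mean for value in y]                        # centered y

    # Weight vector: direction in word space with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0.0:
        raise ValueError("no word covaries with the response")
    w = [v / norm for v in w]

    # Scores of each narrative on the component, then regress y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    coefficients = [wj * q for wj in w]  # one coefficient per word
    return x_mean, y_mean, coefficients


def pls_predictor(row, x_mean, y_mean, coefficients):
    """Collapse a narrative's word counts into one PLS predictor value."""
    return y_mean + sum(
        (x - m) * c for x, m, c in zip(row, x_mean, coefficients)
    )
```

The value returned by `pls_predictor` for each narrative is a single number, so it can be added as one more column alongside the other variables supplied to a random forest.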
Stored In DB:
Text databases are semi-structured because, in addition to the free text, they also contain structured fields that hold titles, authors, dates, and other metadata. The accident reports used in this paper are semi-structured in the same way: each report combines fixed fields with a free-text narrative.
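A semi-structured accident report can be modeled as a record that keeps the fixed fields and the narrative side by side. The sketch below is only an illustration: the class name, field names, and units are assumptions for the example and do not reproduce the actual report form, although the kinds of fields follow those listed under Characteristics of Accident Report.

```python
from dataclasses import asdict, dataclass


@dataclass
class AccidentReport:
    """One semi-structured accident report (hypothetical field names)."""

    # Structured (fixed) fields
    speed_mph: float          # speed at the time of the accident
    max_speed_mph: float      # highest speed before the accident
    num_cars: int             # number of cars in the train
    weight_tons: float        # train weight
    primary_cause: str        # primary cause code
    damage_usd: float         # reported damage in dollars
    # Unstructured field
    narrative: str            # free-text description of the accident

    def structured_fields(self):
        """Return only the fixed fields, leaving out the free text."""
        fields = asdict(self)
        fields.pop("narrative")
        return fields
```

Keeping the two parts in one record makes it straightforward to store the fixed fields in ordinary database columns while passing the narrative to the text mining step.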
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
- System : Pentium i3 Processor
- Hard Disk : 500 GB.
- Monitor : 15" LED
- Input Devices : Keyboard, Mouse
- RAM : 4 GB.
SOFTWARE REQUIREMENTS:
- Operating system : Windows 10/11.
- Coding Language : C#.net.
- Frontend : ASP.Net, HTML, CSS, JavaScript.
- IDE Tool : VISUAL STUDIO.
- Database : SQL SERVER 2005.