
Racism Detection on Twitter Using Stanford NLP
ABSTRACT:
Social media platforms have become integral to modern communication, enabling millions of users worldwide to express their opinions, share experiences, and engage in discussions on diverse topics. Twitter, being one of the most popular microblogging platforms, witnesses an enormous volume of user-generated content daily. However, this freedom of expression has also given rise to a concerning proliferation of racist and discriminatory content that can cause psychological harm to individuals and communities, promote hatred, and disrupt social harmony. The anonymity and rapid dissemination capabilities of social media have made it challenging to monitor and control such offensive content manually, creating an urgent need for automated systems that can detect and flag racist posts in real-time.
The project titled “Racism Detection on Twitter Using Stanford NLP” addresses the growing challenge of identifying discriminatory and harmful content on social media platforms. This project aims to automate the detection of racism-related content using Natural Language Processing techniques integrated into a Java-based application. The developed platform, built using Java for backend logic, JSP, CSS, and JavaScript for the frontend, and MySQL for database management, provides a comprehensive analytical interface.
The proposed system processes user-uploaded datasets consisting of 1,500 tweet records, each containing a single attribute, text, representing the tweet content. Since the dataset is unlabeled, the system begins by performing extensive data preprocessing to clean and normalize the tweets by removing noise, unnecessary characters, and irrelevant patterns. Following this, the system utilizes Stanford NLP’s sentiment analysis model, adapted for tweet-level sentiment scoring, to classify each tweet as positive, negative, or neutral. These sentiment insights are then combined with rule-based and feature-based heuristics to determine whether a tweet exhibits racist tendencies, producing annotations labeled as Racism or None, represented numerically as 1 or 0. To enhance interpretability, the system also generates various visual summaries in charts highlighting racism vs. non-racism classification. These visual outputs allow users to quickly grasp trends and patterns within the analyzed dataset.
Overall, the project demonstrates an effective integration of Stanford NLP with a Java-based web application, enabling automated racism detection on Twitter through sentiment-aware text analysis and visual interpretability.
EXISTING SYSTEM:
- In the existing system, the analysis of tweets for detecting racist or harmful content heavily relied on traditional manual review and basic keyword-based filtering techniques. Social media researchers, moderators, and analysts typically collected tweets from various sources and examined them manually to identify patterns of abusive or discriminatory language. This approach depended on human interpretation, where experts read through the content to understand context, sentiment, and the presence of racially sensitive expressions.
- Additionally, in the existing system, automated methods primarily used simple text-matching rules, where predefined keywords or phrases related to racism were searched within the tweets. These systems performed direct string comparisons to detect whether a tweet contained specific terms. Basic preprocessing steps, such as removing noise or irrelevant symbols, were sometimes applied, but deeper linguistic analysis was generally limited. The earlier frameworks did not incorporate advanced Natural Language Processing models or sentiment scoring mechanisms and focused mainly on surface-level text patterns.
- Data storage and processing in the existing system were commonly handled through straightforward file-based approaches or simple relational databases. The tools used were primarily designed for small-scale analysis and offered limited support for handling large datasets or complex text patterns. Visual interpretation of the analyzed content was minimal, as most results were displayed in basic tabular formats without dynamic charts or graphical summaries.
- Overall, the existing system provided fundamental tweet collection and keyword-based analysis, serving as a starting point for understanding online content related to racism. It laid the groundwork for more advanced automated techniques by highlighting the essential need for structured processing, text cleaning, and content evaluation.
DISADVANTAGES OF EXISTING SYSTEM:
- High Dependence on Manual Effort: The existing system required significant human involvement to read, interpret, and categorize tweets. This made the process slow, labor-intensive, and impractical when dealing with large volumes of social media data.
- Inability to Understand Context or Hidden Racism: In the existing system, keyword-based detection methods could not interpret deeper linguistic meaning, sarcasm, indirect racism, or subtle discriminatory expressions. As a result, many tweets with indirect or context-dependent racist content went unnoticed.
- Limited Accuracy Due to Simple Keyword Matching: The existing system approach depended heavily on fixed word lists. Tweets that used creative spellings, slang, abbreviations, or coded language were usually not detected, reducing the overall reliability of the system.
- Lack of Advanced NLP and Sentiment Analysis: Existing systems did not incorporate modern NLP techniques such as sentiment scoring, tokenization, lemmatization, or feature extraction. Without these, the system could not assess emotional tone or differentiate between positive, neutral, and negative expressions.
- No Automated Labelling or Classification: Since the existing system lacked automated classification models, it could not assign sentiment types, annotations, or labels to tweets. This forced users to perform manual classification, leading to inconsistent and time-consuming outcomes.
- Minimal Data Visualization and Insights: The existing system typically displayed results in basic text or table formats. It lacked charts, graphical summaries, and analytical dashboards, making it difficult for users to interpret the data or identify trends quickly.
- Poor Scalability for Large Datasets: The existing system's manual nature and simple processing capabilities made it difficult to handle large collections of tweets. As social media generates massive data streams, the older approach was not scalable for real-world applications.
PROPOSED SYSTEM:
- The proposed system introduces an automated and NLP-driven approach to identify racism in tweets by leveraging Stanford NLP’s sentiment analysis capabilities. The system begins by allowing users to upload a dataset containing 1,500 tweet records, each consisting of a single text attribute. Once the dataset is uploaded, the system performs thorough preprocessing to clean, normalize, and prepare the tweet data. This includes removing unwanted characters, handling special symbols, and converting the text into a form suitable for linguistic analysis.
- After preprocessing, the system applies Natural Language Processing methods to extract informative features from the tweet content. These features include keywords, hashtags, sentiment expressions, and other linguistic elements that help in understanding the nature of each tweet. The core of the system utilizes Stanford NLP’s Sentiment Analysis for Tweets, which computes sentiment scores and assigns each tweet a sentiment category such as very positive, positive, neutral, negative, or very negative. These sentiment outcomes serve as a foundational component for further interpretation.
- Based on the extracted features and sentiment results, the system proceeds to generate racism-related annotations for each tweet. Alongside the annotation, the system stores additional computed fields such as Tweet ID, sentiment score, sentiment type, sentiment percentage breakdowns, and a final label indicating racism or non-racism. All processed data is managed through a structured MySQL database for organized retrieval, analysis, and visualization.
- The application is implemented using Java for backend processing, JSP for dynamic web interfaces, and CSS and JavaScript for user interaction and styling. After processing all tweets, the system presents a detailed analytical dashboard displaying tables and multiple graphical charts, including pie charts, bar charts, and donut charts. These visual elements illustrate the distribution of racism vs. non-racism classifications, sentiment type frequencies, and comparison metrics. The proposed system thus provides a complete pipeline from dataset handling to NLP processing and result visualization within an integrated web-based environment.
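The cleaning and normalization step described above can be sketched in plain Java. This is a minimal illustration only: the class name and the exact regex patterns are assumptions, not the project's actual code.

```java
// Illustrative sketch of the tweet-cleaning step; the class name and
// regex patterns are assumptions, not taken from the project source.
public class TweetPreprocessor {

    /** Lower-cases the tweet and strips URLs, @mentions,
     *  non-alphanumeric noise, and extra whitespace (hashtags kept). */
    public static String clean(String tweet) {
        return tweet.toLowerCase()
                    .replaceAll("https?://\\S+", " ")  // remove URLs
                    .replaceAll("@\\w+", " ")          // remove @mentions
                    .replaceAll("[^a-z0-9#\\s]", " ")  // keep letters, digits, hashtags
                    .replaceAll("\\s+", " ")           // collapse whitespace
                    .trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("Check this out http://t.co/abc @user #Topic!!!"));
        // prints: check this out #topic
    }
}
```

Hashtags are deliberately preserved here because the feature-extraction stage described above treats them as informative features.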
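The per-tweet sentiment percentage breakdowns stored alongside each record could be derived as the share of a tweet's sentences falling into each sentiment class. The project's exact formula is not specified, so the method name and input format below are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: derive a tweet-level percentage breakdown from
// the sentiment class assigned to each of its sentences.
public class SentimentBreakdown {

    private static final String[] CLASSES =
        {"Very Positive", "Positive", "Neutral", "Negative", "Very Negative"};

    /** Given one sentiment class per sentence, returns the percentage
     *  of sentences per class (all five classes always present). */
    public static Map<String, Double> percentages(List<String> sentenceClasses) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (String c : CLASSES) out.put(c, 0.0);
        if (sentenceClasses.isEmpty()) return out;   // avoid division by zero
        for (String c : sentenceClasses) out.merge(c, 1.0, Double::sum);
        for (String c : CLASSES) out.put(c, 100.0 * out.get(c) / sentenceClasses.size());
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> p =
            percentages(List.of("Negative", "Negative", "Neutral", "Very Negative"));
        System.out.println(p);  // e.g. "Negative" maps to 50.0
    }
}
```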
ADVANTAGES OF PROPOSED SYSTEM:
- Automated Detection of Racism in Tweets: The proposed system eliminates the need for manual review by automatically analyzing tweets using NLP and sentiment analysis techniques, resulting in faster and more consistent detection of racism-related content.
- Enhanced Understanding Through Advanced NLP Techniques: With the integration of Stanford NLP, the proposed system can interpret linguistic patterns, emotional tone, and contextual sentiment, allowing for more accurate identification of subtle, indirect, or sentiment-driven racism.
- Efficient Data Preprocessing and Feature Extraction: The proposed system performs extensive preprocessing and extracts meaningful features such as hashtags, keywords, and sentiment scores, improving the quality of analysis and ensuring reliable classification.
- Automated Sentiment Classification: In the proposed system, each tweet is systematically classified into sentiment categories (very positive, positive, neutral, negative, or very negative), providing deeper insights into how emotions correlate with racist expressions.
- Clear Annotation and Labelling: The proposed system assigns annotations (Racism / None) and numeric labels (1 / 0), creating structured and machine-readable outputs that support further analysis, reporting, and integration with other systems.
- Rich and Interactive Data Visualization: In the proposed system, charts such as pie charts, bar charts, and donut charts help users easily understand the distribution of racism vs. non-racism tweets and sentiment types, improving interpretability and decision-making.
- Scalable and Organized Data Management: With MySQL handling structured storage and retrieval, the proposed system can manage larger datasets efficiently while maintaining clean organization of processed results.
- Web-Based Access with User-Friendly Interface: Using Java, JSP, CSS, and JavaScript, the proposed system provides a smooth, interactive, and accessible platform where users can upload datasets, analyze tweets, and view results seamlessly from a browser.
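The annotation (Racism / None) and numeric label (1 / 0) described above combine sentiment results with rule-based features. A minimal sketch of one such rule follows; the flagged-term list and the rule itself are placeholder assumptions, not the project's actual lexicon or logic.

```java
import java.util.Set;

// Illustrative sketch of the sentiment-plus-rule labelling step.
// FLAGGED_TERMS and the combination rule are placeholder assumptions.
public class RacismLabeler {

    // Placeholder lexicon; a real deployment would use a curated list.
    private static final Set<String> FLAGGED_TERMS = Set.of("slur1", "slur2");

    /** Returns 1 (Racism) when the tweet is negatively toned AND
     *  contains a flagged term; otherwise 0 (None). */
    public static int label(String cleanedTweet, String sentimentType) {
        boolean negativeTone = sentimentType.equals("Negative")
                            || sentimentType.equals("Very Negative");
        boolean hasFlaggedTerm = false;
        for (String term : FLAGGED_TERMS) {
            if (cleanedTweet.contains(term)) { hasFlaggedTerm = true; break; }
        }
        return (negativeTone && hasFlaggedTerm) ? 1 : 0;
    }

    /** Maps the numeric label to the stored annotation string. */
    public static String annotation(int label) {
        return label == 1 ? "Racism" : "None";
    }

    public static void main(String[] args) {
        int l = label("they are slur1 and awful", "Very Negative");
        System.out.println(l + " -> " + annotation(l));  // prints: 1 -> Racism
    }
}
```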
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
- System : Intel Core i3 Processor.
- Hard Disk : 20 GB.
- Monitor : 15'' LED.
- Input Devices : Keyboard, Mouse.
- RAM : 8 GB.
SOFTWARE REQUIREMENTS:
- Operating system : Windows 10/11.
- Coding Language : Java.
- Frontend : JSP, CSS, JavaScript.
- JDK Version : JDK 23.0.1.
- IDE Tool : Apache Netbeans IDE 24.
- Tomcat Server Version : Apache Tomcat 9.0.84
- Database : MySQL.
Frequently Asked Questions (FAQs) and Answers

1. What is the main objective of this project?
The main objective is to automatically detect racist content in Twitter posts using NLP techniques. The system processes tweets, performs sentiment analysis using Stanford NLP, and classifies each tweet as racist or non-racist based on computed features and rules.

2. What technologies are used in this project?
The project is built using:
• Backend: Java
• Frontend: JSP, CSS, JavaScript
• Database: MySQL
• NLP Engine: Stanford NLP Sentiment Analysis
These tools work together to perform data processing, analysis, and visualization.

3. What type of dataset is required for this system?
The system requires a dataset of tweets in text format, where each record contains a single attribute called "text". The dataset used for this project contains 1,500 unlabeled tweets, which the system processes and analyzes.

4. Does the dataset need to be labeled before analysis?
No. The dataset is unlabeled, and the system generates labels automatically by performing sentiment analysis and applying racism detection logic. The labels produced are:
• 1 → Racism
• 0 → None

5. How does the system preprocess tweets?
The preprocessing module cleans and prepares the text by removing:
• URLs
• Special characters
• Extra spaces
• Unnecessary symbols
It may also apply tokenization and normalization to ensure consistent input for NLP processing.

6. How does the system detect racism in tweets?
The system uses Stanford NLP's tweet sentiment model to calculate sentiment scores and sentiment types. Based on specific patterns, linguistic features, and sentiment indicators, the system classifies tweets as Racism or None.

7. What sentiment categories are generated by the system?
Sentiment types include:
• Very Positive
• Positive
• Neutral
• Negative
• Very Negative
These categories help users understand the emotional tone of each tweet.

8. What results does the system display after analysis?
The application displays a detailed table with:
• Tweet ID
• Original Tweet Text
• Sentiment Score
• Sentiment Type
• Very Positive (%)
• Positive (%)
• Neutral (%)
• Negative (%)
• Very Negative (%)
• Annotation (Racism / None)
• Label (1 / 0)

9. What types of charts does the system generate?
The system generates interactive visual charts, including:
• Pie Chart – Racism vs. Non-Racism distribution
• Bar Chart – Comparison of racism and non-racism counts
• Sentiment Count Chart – Visual representation of all sentiment types
• Donut Chart – Racism vs. Non-Racism classification
These charts help users quickly interpret trends in the dataset.

10. Can this system analyze real-time Twitter data?
No, the current project analyzes data from uploaded datasets only. Real-time streaming can be added in future enhancements by integrating the Twitter API.

11. Who can benefit from using this system?
This system can be beneficial for:
• Researchers studying online hate speech
• Social media monitoring teams
• Policy makers analyzing public sentiment
• Academic institutions
• Cybersecurity and content moderation teams


