4. PROPOSED DETECTION PROCESS
1. Collect a dataset:
The dataset used is downloaded from Kaggle.
It contains the following headers:
- Id
- Title
- Text
- Date
Gather a dataset of news stories that have been classified as authentic or fraudulent. This dataset should be broad, covering a range of genres, sources, and themes.
For this project I used three CSVs spanning two datasets. The first dataset has two separate files, one for fake news and one for real news. The second dataset is data scraped from a fact-checking website. After loading, the data was balanced, cleaned, and processed to make it easier to use.
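The loading and balancing step might be sketched as follows with pandas. The file names and column choices here are assumptions; in the project the actual Kaggle CSVs would be read with `pd.read_csv`:

```python
import pandas as pd

# Toy stand-ins for the fake/real CSVs; in the project these would be
# loaded with pd.read_csv("Fake.csv") and pd.read_csv("True.csv")
# (file names are assumptions, not confirmed by the dataset).
fake = pd.DataFrame({"title": ["t1", "t2", "t3"], "text": ["a", "b", "c"]})
real = pd.DataFrame({"title": ["t4", "t5"], "text": ["d", "e"]})
fake["label"] = 1  # fraudulent
real["label"] = 0  # authentic

df = pd.concat([fake, real], ignore_index=True)

# Balance classes by downsampling the majority class to the minority size
n = df["label"].value_counts().min()
balanced = (df.groupby("label", group_keys=False)
              .apply(lambda g: g.sample(n=n, random_state=42))
              .reset_index(drop=True))
print(balanced["label"].value_counts().to_dict())
```

Downsampling is one simple way to balance; upsampling the minority class would work the same way with `sample(n=n, replace=True)`.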
2. Pre-process the data:
- Removal of special characters
- Removal of punctuation marks
- Spell checking
- Stemming words
- Removal of stop words
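A minimal sketch of these cleaning steps in Python, using scikit-learn's built-in English stop-word list. Spell checking and stemming would be added with libraries such as pyspellchecker and NLTK's PorterStemmer; they are omitted here to keep the sketch self-contained:

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(text):
    # Remove special characters, punctuation, and digits; lowercase the text
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Remove stop words (stemming and spell checking would slot in here)
    tokens = [t for t in text.split() if t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)

print(preprocess("Breaking!!! The secret plan was revealed in 2023..."))
# -> "breaking secret plan revealed"
```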
3. Extract features:
- Count Vectorizer
- TF-IDF Vectorizer
- Hash Vectorizer
- N-grams: sequences of N words are also used to capture word order in the Count Vectorizer and TF-IDF Vectorizer.
- By far the most popular method for weighting word frequencies is TF-IDF (Term Frequency – Inverse Document Frequency).
- Term Frequency: This summarizes how often a given word appears within a document.
- Inverse Document Frequency: This downscales words that appear a lot across documents.
4. Select a model:
Choose a model for fake news detection. Common models include rule-based models, machine learning models, and deep learning models.
• A classifier separates observations into groups based on their characteristics.
• For data classification, the following three classifiers are used.
- Multinomial Naïve-Bayes
- SVM
- Passive Aggressive
For this project I selected the Passive Aggressive classifier. Passive Aggressive algorithms are online learning algorithms: the approach remains passive when a sample is classified correctly and becomes aggressive on a misclassification, updating and tweaking the weights. Unlike most other algorithms, it does not converge; its goal is to make updates that correct the loss while changing the norm of the weight vector as little as possible.
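A minimal sketch of training scikit-learn's `PassiveAggressiveClassifier` on TF-IDF features; the four labelled snippets are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

texts = [
    "scientists publish peer reviewed study",   # real
    "miracle cure doctors hate revealed",       # fake
    "government releases official statistics",  # real
    "shocking secret celebrities want hidden",  # fake
]
labels = [0, 1, 0, 1]  # 0 = real, 1 = fake (toy labels)

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Each misclassified sample triggers an aggressive weight update;
# correctly classified samples leave the weights unchanged (passive)
clf = PassiveAggressiveClassifier(max_iter=50, random_state=0)
clf.fit(X, labels)
print(clf.predict(vec.transform(["miracle cure revealed"])))
```

`MultinomialNB` and `LinearSVC` could be dropped in on the same features for the other two classifiers listed above.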
5. Train & Test the model:
Use the extracted features to train the model on the dataset. Split the dataset into training and testing sets. As the model learns the patterns in the data, it tunes its parameters to reduce the error between the predicted and actual labels.
Evaluate the model on the testing set using measures such as accuracy, precision, recall, and F1 score. These metrics give a numerical assessment of the model's ability to distinguish legitimate news from false news.
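The four metrics can be computed with scikit-learn. On this small invented example every metric happens to work out to 2/3:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Invented labels: 1 = fake, 0 = real
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

print(accuracy_score(y_true, y_pred))   # 4 of 6 correct -> 0.666...
print(precision_score(y_true, y_pred))  # of 3 predicted fake, 2 are fake -> 0.666...
print(recall_score(y_true, y_pred))     # of 3 actual fakes, 2 were caught -> 0.666...
print(f1_score(y_true, y_pred))         # harmonic mean of the two -> 0.666...
```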
6. Improve the model:
Analyze the results and improve the model by adjusting the parameters or selecting different features.
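Parameter adjustment and feature selection can be automated with a grid search over the whole pipeline. The toy corpus and the grid values below are assumptions for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

texts = [
    "scientists publish peer reviewed study", "government releases statistics",
    "central bank confirms interest rates", "court upholds earlier ruling",
    "weather service issues storm warning",
    "miracle cure doctors hate revealed", "shocking secret celebrities hide",
    "aliens secretly control the markets", "one weird trick melts fat",
    "insider exposes hidden conspiracy",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = real, 1 = fake (toy labels)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", PassiveAggressiveClassifier(max_iter=50, random_state=0)),
])
# Try different n-gram features and aggressiveness parameters C
grid = {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0]}
search = GridSearchCV(pipe, grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

The best combination found on the cross-validated folds is then retrained on the full training set automatically by `GridSearchCV`.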