4. PROPOSED DETECTION PROCESS

    1. Collect a dataset: 

        The dataset used is downloaded from Kaggle.

         It contains the following headers: 

    • Id 
    • Title 
    • Text 
    • Date

    Gather a dataset of news stories that have been classified as authentic or fraudulent. This dataset should be broad, covering a range of literary genres, sources, and themes.

For this project I used three CSV files drawn from two datasets. The first dataset has two separate files for fake and real news. The second dataset is data scraped from a fact-checking website. After loading, the data was balanced, cleaned, and processed to make it easier to use.
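As an illustration, a minimal loading and balancing sketch in Python (pandas) is shown below; the file names (Fake.csv, True.csv, scraped_claims.csv) and the column layout are assumptions rather than the exact files used.

    import pandas as pd

    # Assumed file names for the Kaggle download and the scraped fact-checking data
    fake = pd.read_csv("Fake.csv")               # articles labelled fake
    real = pd.read_csv("True.csv")               # articles labelled real
    scraped = pd.read_csv("scraped_claims.csv")  # assumed to already contain a label column

    # Label the two halves of the first dataset and merge everything into one frame
    fake["label"] = 0
    real["label"] = 1
    data = pd.concat([fake, real, scraped], ignore_index=True)

    # Keep only the columns the pipeline needs and drop incomplete rows
    data = data[["title", "text", "label"]].dropna()

    # Balance the classes by downsampling the larger one
    min_count = data["label"].value_counts().min()
    data = pd.concat(
        [data[data["label"] == lbl].sample(min_count, random_state=42)
         for lbl in data["label"].unique()],
        ignore_index=True,
    )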



    2. Pre-process the data: 

        After the data is collected, the text must be preprocessed so that the program can handle it. Stop words, capitalization, and HTML elements are removed from the dataset to eliminate noise. To reduce the dimensionality of the data, tokenization, filtering, or lemmatization can also be applied, as illustrated in the sketch following the list below.

     Pre-processing involves:
  • Removal of special characters
  • Removal of punctuation marks
  • Spell checking
  • Stemming words
  • Removal of stop words
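A minimal pre-processing sketch using NLTK is given below; the regular expressions and the English stop word list are illustrative assumptions, and spell checking is omitted for brevity.

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords", quiet=True)
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(text):
        # Lowercase, strip HTML tags, and remove punctuation / special characters
        text = text.lower()
        text = re.sub(r"<[^>]+>", " ", text)
        text = re.sub(r"[^a-z\s]", " ", text)
        # Tokenize, remove stop words, and stem the remaining words
        tokens = [stemmer.stem(w) for w in text.split() if w not in stop_words]
        return " ".join(tokens)

    data["clean_text"] = data["text"].apply(preprocess)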


    3. Extract features: 

         Identify characteristics of the news articles that can be used to distinguish between legitimate and false news. Content analysis, word frequency, topic modelling, and network analysis are typical features.

        Vectorization is the process of turning a collection of text into numerical feature vectors. The features derived are:
  • Count Vectorizer 
  • TF-IDF Vectorizer 
  • Hash Vectorizer 
  • N-grams: Sequences of N words are also used in order to capture word order in the Count Vectorizer and TF-IDF Vectorizer.
        TF-IDF VECTORIZER:
  • By far the most popular method for calculating word frequencies is TF-IDF (Term Frequency – Inverse Document Frequency).
  • Term Frequency: summarizes how often a given word appears within a document.
  • Inverse Document Frequency: downscales words that appear frequently across documents.
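A minimal feature-extraction sketch with scikit-learn is shown below; the (1, 2) n-gram range and the max_df setting are assumed values for illustration, not the project's exact configuration.

    from sklearn.feature_extraction.text import (
        CountVectorizer, TfidfVectorizer, HashingVectorizer
    )

    texts = data["clean_text"]

    # Bag-of-words counts over unigrams and bigrams
    count_vec = CountVectorizer(ngram_range=(1, 2))
    X_counts = count_vec.fit_transform(texts)

    # TF-IDF weighting: term frequency downscaled by inverse document frequency
    tfidf_vec = TfidfVectorizer(ngram_range=(1, 2), max_df=0.7)
    X_tfidf = tfidf_vec.fit_transform(texts)

    # Hashing trick: fixed-size feature space without storing a vocabulary
    hash_vec = HashingVectorizer(n_features=2**18)
    X_hash = hash_vec.fit_transform(texts)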

     4. Select a model:

Choose a model for fake news detection. Common models include rule-based models, machine learning models, and deep learning models. 
• Classifiers separate observations into groups based on their characteristics.
• For data classification, the following three classifiers are used. 
  1. Multinomial Naïve-Bayes 
  2. SVM 
  3. Passive Aggressive
Here I select the Passive Aggressive classifier for this project. Passive Aggressive algorithms are online learning algorithms. Such an approach remains passive when a sample is classified correctly and becomes aggressive on a misclassification, updating and adjusting the model. Unlike most other algorithms, it does not converge; its goal is to make updates that correct the loss while changing the norm of the weight vector as little as possible.
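A minimal sketch instantiating the three candidate classifiers with scikit-learn is shown below; LinearSVC is taken as the SVM variant, and max_iter=50 is an assumed setting rather than the project's exact configuration.

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import PassiveAggressiveClassifier

    # The three candidate classifiers; Passive Aggressive is the one selected here
    nb_clf = MultinomialNB()
    svm_clf = LinearSVC()
    pa_clf = PassiveAggressiveClassifier(max_iter=50, random_state=42)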

 



    5. Train & Test the model: 

 Use the extracted features to train a model on the dataset. Split the dataset into training and testing sets. The model tunes its parameters as it learns the patterns in the data in order to reduce the error between the predicted and actual labels.

Use measures such as accuracy, precision, recall, and F1 score to test the model on the testing set and assess its performance. These metrics give a numerical evaluation of the model's ability to distinguish between legitimate and false news.
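The sketch below shows an assumed 80/20 train/test split and the evaluation metrics, reusing the TF-IDF features and the Passive Aggressive classifier defined above.

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Split the TF-IDF features and labels (80/20 ratio assumed)
    X_train, X_test, y_train, y_test = train_test_split(
        X_tfidf, data["label"], test_size=0.2, random_state=42
    )

    # Train on the training set and predict on the held-out test set
    pa_clf.fit(X_train, y_train)
    y_pred = pa_clf.predict(X_test)

    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1 score :", f1_score(y_test, y_pred))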

 




6. Improve the model:

Analyze the results and improve the model by adjusting the parameters or selecting different features.
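One common way to carry out this step is a grid search over the classifier's hyperparameters; the sketch below, including the chosen parameter grid, is an assumed example rather than the tuning actually performed in the project.

    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import PassiveAggressiveClassifier

    # Candidate values for the aggressiveness parameter C and iteration count (assumed grid)
    param_grid = {"C": [0.01, 0.1, 1.0], "max_iter": [50, 100]}

    search = GridSearchCV(
        PassiveAggressiveClassifier(random_state=42),
        param_grid,
        scoring="f1",
        cv=5,
    )
    search.fit(X_train, y_train)

    print("Best parameters:", search.best_params_)
    print("Best CV F1 score:", search.best_score_)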

 


