4. PROPOSED DETECTION PROCESS

    1. Collect a dataset: 

        The dataset used is downloaded from Kaggle.

         It contains the following headers: 

    • Id 
    • Title 
    • Text 
    • Date

    Gather a dataset of news stories that have been classified as authentic or fraudulent. This dataset should be broad, covering a range of literary genres, sources, and themes.

For this project I used three CSV files drawn from two datasets. The first dataset has two separate files for fake and real news. The second dataset is data scraped from a fact-checking website. After loading, the data was balanced, cleaned, and processed to make it easier to use.
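As an illustration, a minimal loading and balancing sketch in Python (pandas) is shown below; the file names (Fake.csv, True.csv, scraped_claims.csv) and the column layout are assumptions rather than the exact files used.

    import pandas as pd

    # Assumed file names for the Kaggle download and the scraped fact-checking data
    fake = pd.read_csv("Fake.csv")               # articles labelled fake
    real = pd.read_csv("True.csv")               # articles labelled real
    scraped = pd.read_csv("scraped_claims.csv")  # assumed to already contain a label column

    # Label the two halves of the first dataset and merge everything into one frame
    fake["label"] = 0
    real["label"] = 1
    data = pd.concat([fake, real, scraped], ignore_index=True)

    # Keep only the columns the pipeline needs and drop incomplete rows
    data = data[["title", "text", "label"]].dropna()

    # Balance the classes by downsampling the larger one
    min_count = data["label"].value_counts().min()
    data = pd.concat(
        [data[data["label"] == lbl].sample(min_count, random_state=42)
         for lbl in data["label"].unique()],
        ignore_index=True,
    )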



    2. Pre-process the data: 

        After the data is collected, the text must be preprocessed so that the program can handle it. Stop words, capitalization, and HTML elements are removed from the dataset to eliminate noise. To reduce the dimensionality of the data, tokenization, filtering, or lemmatization can also be applied, as illustrated in the sketch following the list below.

     Pre-processing involves:
  • Removal of special characters
  • Removal of punctuation marks
  • Spell checking
  • Stemming words
  • Removal of stop words
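A minimal pre-processing sketch using NLTK is given below; the regular expressions and the English stop word list are illustrative assumptions, and spell checking is omitted for brevity.

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords", quiet=True)
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(text):
        # Lowercase, strip HTML tags, and remove punctuation / special characters
        text = text.lower()
        text = re.sub(r"<[^>]+>", " ", text)
        text = re.sub(r"[^a-z\s]", " ", text)
        # Tokenize, remove stop words, and stem the remaining words
        tokens = [stemmer.stem(w) for w in text.split() if w not in stop_words]
        return " ".join(tokens)

    data["clean_text"] = data["text"].apply(preprocess)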


    3. Extract features: 

         Identify characteristics of the news articles that can be used to distinguish between legitimate and false news. Content analysis, word frequency, topic modelling, and network analysis are typical features.

        Vectorization is the process of turning a collection of text into numerical feature vectors. The features derived are:
  • Count Vectorizer 
  • TF-IDF Vectorizer 
  • Hash Vectorizer 
  • N-grams: Sequences of N words are also used in order to capture word order in the Count Vectorizer and TF-IDF Vectorizer.
        TF-IDF VECTORIZER:
  • By far the most popular method for calculating word frequencies is TF-IDF (Term Frequency – Inverse Document Frequency).
  • Term Frequency: summarizes how often a given word appears within a document.
  • Inverse Document Frequency: downscales words that appear frequently across documents.
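A minimal feature-extraction sketch with scikit-learn is shown below; the (1, 2) n-gram range and the max_df setting are assumed values for illustration, not the project's exact configuration.

    from sklearn.feature_extraction.text import (
        CountVectorizer, TfidfVectorizer, HashingVectorizer
    )

    texts = data["clean_text"]

    # Bag-of-words counts over unigrams and bigrams
    count_vec = CountVectorizer(ngram_range=(1, 2))
    X_counts = count_vec.fit_transform(texts)

    # TF-IDF weighting: term frequency downscaled by inverse document frequency
    tfidf_vec = TfidfVectorizer(ngram_range=(1, 2), max_df=0.7)
    X_tfidf = tfidf_vec.fit_transform(texts)

    # Hashing trick: fixed-size feature space without storing a vocabulary
    hash_vec = HashingVectorizer(n_features=2**18)
    X_hash = hash_vec.fit_transform(texts)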

     4. Select a model:

Choose a model for fake news detection. Common models include rule-based models, machine learning models, and deep learning models. 
• Classifiers separate observations into groups based on their characteristics.
• For data classification, the following three classifiers are used. 
  1. Multinomial Naïve-Bayes 
  2. SVM 
  3. Passive Aggressive
Here I select the Passive Aggressive classifier for this project. Passive Aggressive algorithms are online learning algorithms. Such an approach remains passive when a sample is classified correctly and becomes aggressive on a misclassification, updating and adjusting the model. Unlike most other algorithms, it does not converge; its goal is to make updates that correct the loss while changing the norm of the weight vector as little as possible.
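A minimal sketch instantiating the three candidate classifiers with scikit-learn is shown below; LinearSVC is taken as the SVM variant, and max_iter=50 is an assumed setting rather than the project's exact configuration.

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import PassiveAggressiveClassifier

    # The three candidate classifiers; Passive Aggressive is the one selected here
    nb_clf = MultinomialNB()
    svm_clf = LinearSVC()
    pa_clf = PassiveAggressiveClassifier(max_iter=50, random_state=42)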

 



    5. Train & Test the model: 

 Use the extracted features to train a model on the dataset. Split the dataset into training and testing sets. The model tunes its parameters as it learns the patterns in the data in order to reduce the error between the predicted and actual labels.

Use measures such as accuracy, precision, recall, and F1 score to test the model on the testing set and assess its performance. These metrics give a numerical evaluation of the model's ability to distinguish between legitimate and false news.
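The sketch below shows an assumed 80/20 train/test split and the evaluation metrics, reusing the TF-IDF features and the Passive Aggressive classifier defined above.

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Split the TF-IDF features and labels (80/20 ratio assumed)
    X_train, X_test, y_train, y_test = train_test_split(
        X_tfidf, data["label"], test_size=0.2, random_state=42
    )

    # Train on the training set and predict on the held-out test set
    pa_clf.fit(X_train, y_train)
    y_pred = pa_clf.predict(X_test)

    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1 score :", f1_score(y_test, y_pred))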

 




6. Improve the model:

Analyze the results and improve the model by adjusting the parameters or selecting different features.
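One common way to carry out this step is a grid search over the classifier's hyperparameters; the sketch below, including the chosen parameter grid, is an assumed example rather than the tuning actually performed in the project.

    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import PassiveAggressiveClassifier

    # Candidate values for the aggressiveness parameter C and iteration count (assumed grid)
    param_grid = {"C": [0.01, 0.1, 1.0], "max_iter": [50, 100]}

    search = GridSearchCV(
        PassiveAggressiveClassifier(random_state=42),
        param_grid,
        scoring="f1",
        cv=5,
    )
    search.fit(X_train, y_train)

    print("Best parameters:", search.best_params_)
    print("Best CV F1 score:", search.best_score_)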

 


