News Classification with BERT

In this post, I used Bidirectional Encoder Representations from Transformers (BERT) to classify whether a news article is fake or real. BERT is a state-of-the-art technique for Natural Language Processing (NLP) created and published by Google in 2018. Bidirectional means that it looks at both the left and the right context to understand the text. It can be used for next sentence prediction, question answering, language inference and more. Here, we use BERT for news classification via the transformers library from Hugging Face, which provides a PyTorch interface.

The analysis was created and executed in a Google Colaboratory notebook and is accessible here. Google Colaboratory, or Colab for short, is a free research tool provided by Google for executing Python code and performing machine learning tasks. It allows us to use a Graphics Processing Unit (GPU) or Tensor Processing Unit (TPU).

Import libraries

In [9]:
# Import libraries
import pandas as pd
import numpy as np
import re # to use regular expression pattern 
import datetime as dt  # to parse to datetime
import string
from scipy import stats
from collections import defaultdict

#for data preprocessing
from sklearn.model_selection import train_test_split  

# to evaluate model performance
from sklearn.metrics import confusion_matrix, classification_report

# for visualization
import matplotlib.pyplot as plt
from matplotlib import rc
from pylab import rcParams
import seaborn as sns 
%matplotlib inline

The analysis is discussed step by step as follows:

1) Read the data

The data is publicly available and can be downloaded as fake and true news separately. I downloaded the data and stored it in my Google Drive. Though there are different ways to get data into Google Colab, I access the data by mounting my Google Drive into the Colab environment.
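The mount step itself is not shown in the cells below; a minimal sketch of how it can be done (assuming the default Colab working directory /content, so that the relative path drive/My Drive/... used below resolves):

# Mount Google Drive so the CSV files are reachable at drive/My Drive/...
from google.colab import drive
drive.mount('/content/drive')  # asks for an authorization code on first run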

In [10]:
true= pd.read_csv("drive/My Drive/NLP_files/True.csv", parse_dates=["date"])     # the true news data
fake= pd.read_csv("drive/My Drive/NLP_files/Fake.csv")     # the fake news data
In [11]:
true.head(3) # the first few rows of the true news dataframe
Out[11]:
title text subject date
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews 2017-12-31
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews 2017-12-29
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews 2017-12-31
In [12]:
fake.head(3)
Out[12]:
title text subject date
0 Donald Trump Sends Out Embarrassing New Year’... Donald Trump just couldn t wish all Americans ... News December 31, 2017
1 Drunk Bragging Trump Staffer Started Russian ... House Intelligence Committee Chairman Devin Nu... News December 31, 2017
2 Sheriff David Clarke Becomes An Internet Joke... On Friday, it was revealed that former Milwauk... News December 30, 2017
In [13]:
# the size of fake and true news data sets
true.shape, fake.shape
Out[13]:
((21417, 4), (23481, 4))

1.1) Add labels (true/fake) to the dataframes

In [14]:
# create a new column called label and mark the true news as "true"
true["label"]= "true"
# create a new column called label and mark the fake news as "fake"
fake["label"]= "fake"
In [15]:
# parse the date column into datetime.
# Since the column uses three different date formats, we try each of them in turn,
# e.g. "December 31, 2017", "19-Feb-18" and "Dec 31, 2017".
def parsing_datetime(date_str):
    for f in ("%B %d, %Y", "%d-%b-%y", "%b %d, %Y"):
        try:
            return dt.datetime.strptime(date_str, f)
        except ValueError:
            pass
        
# parse the date column of fake dataframe into datetime
fake.date= fake.date.apply(lambda x: parsing_datetime(x))

1.2) Merge the real and fake news dataframes

In [16]:
# Merge the fake and true dataframes
news= pd.concat([true,fake], axis=0, ignore_index=True) # ignore_index=True to get a unique, continuous index
In [17]:
news.head(3)
Out[17]:
title text subject date label
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews 2017-12-31 true
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews 2017-12-29 true
2 Senior U.S. Republican senator: 'Let Mr. Muell... WASHINGTON (Reuters) - The special counsel inv... politicsNews 2017-12-31 true
In [18]:
sns.countplot(x="label", data=news)
plt.title("Count plot of fake and true news")
plt.show()

2) Data Cleaning and Preprocessing

Regular expression patterns are used to detect and remove emoji symbols, URL links, HTML tags, special characters and punctuation marks.

In [19]:
def remove_pattern(text, patterns):
    """Return the text with a set of regular-expression patterns removed.
       Parameters:
       ------------
       text: the text from which the patterns will be removed
       patterns: an iterable of patterns to remove from the text
       """

    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text
In [20]:
# patterns to be extracted and to be removed from the data 
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF\U000024C2-\U0001F251]+"

url= re.compile("https?://\S+|www\.\S+")                     # pattern for url
html= r'<.*?>'                                               # pattern for html tag
num_with_text= r"\S*\d+\S*"                                  # pattern for tokens that contain digits
reuters= r"(\s\(Reuters\))"                                  # pattern to detect " (Reuters)", a common marker in the true news
punctuation= r"[#@&%$~=\.;:\?,(){}\"\“\”\‘\'\*!\+`^<>\[\]\-]+"      #pattern for punctuations and special characters   
apostroph=r"\’s?"
# collect the patterns 
patterns=[emoji, url, html, num_with_text, apostroph, reuters, punctuation]

2.1) Clean merged data

At this stage of data cleaning, the emoji symbols, URL links, HTML tags, digits and special characters are removed.

In [21]:
# Clean the title and text of merged data using regular expression patterns
news_clean_title = news.title.apply(remove_pattern, patterns= patterns)
news_clean_text= news.text.apply(remove_pattern, patterns= patterns)

2.2) Split the data into training, validation and test sets

In [22]:
news["cleaned_title"]= news_clean_title # add the cleaned title as new column 
news["cleaned_text"]= news_clean_text  # add the cleaned text as new column 
dicmap= {"true": 0, "fake": 1} # label true news as 0 and fake news as 1
news["is_fake"]= news.label.apply(lambda x: dicmap[x]) 
In [23]:
news.head(2)
Out[23]:
title text subject date label cleaned_title cleaned_text is_fake
0 As U.S. budget fight looms, Republicans flip t... WASHINGTON (Reuters) - The head of a conservat... politicsNews 2017-12-31 true As US budget fight looms Republicans flip thei... WASHINGTON The head of a conservative Republi... 0
1 U.S. military to accept transgender recruits o... WASHINGTON (Reuters) - Transgender people will... politicsNews 2017-12-29 true US military to accept transgender recruits on ... WASHINGTON Transgender people will be allowed... 0
In [24]:
# copy the relevant columns of news into data_brt, to be split into training, validation and test sets
data_brt= news[["cleaned_title","cleaned_text","is_fake","label"]] 
In [25]:
data_brt.head(3)
Out[25]:
cleaned_title cleaned_text is_fake label
0 As US budget fight looms Republicans flip thei... WASHINGTON The head of a conservative Republi... 0 true
1 US military to accept transgender recruits on ... WASHINGTON Transgender people will be allowed... 0 true
2 Senior US Republican senator Let Mr Mueller do... WASHINGTON The special counsel investigation ... 0 true
In [26]:
#train_data, validate_data, test_data (70%, 15%, 15% respectively)
train_data, validate_data, test_data= np.split(data_brt.sample(frac=1, random_state=42), [ int(.7*len(news)), int(.85*len(news))])
In [27]:
# size of the training, validation and test data
train_data.shape, validate_data.shape, test_data.shape 
Out[27]:
((31428, 4), (6735, 4), (6735, 4))

3) Classification using BERT with the Transformers library from Hugging Face

Note that the steps in this section are similar to those in the Valkov article posted on the Curiously blog, with some adaptation to the data at hand.

In [75]:
!pip install -qq transformers 
In [29]:
# For  BERT tokenization and modeling 
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch
from textwrap import wrap
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
In [30]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
In [31]:
device
Out[31]:
device(type='cuda', index=0)

3.1) Tokenization

Tokenization is the process of breaking text into smaller units called tokens, for example breaking a sentence into a list of words. BERT comes in a case-sensitive (cased) and a case-insensitive (uncased) variant; here, the uncased version of BERT is used.

In [32]:
# uncased version of BERT
PRE_TRAINED_MODEL_NAME = 'bert-base-uncased'
In [33]:
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME, do_lower_case=True)
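Note that the BERT tokenizer splits text into WordPiece subwords rather than whole words; a quick check on a made-up phrase (output not shown):

# out-of-vocabulary words are broken into subword pieces prefixed with '##'
print(tokenizer.tokenize("Grassley demands testimony"))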

In [34]:
train_data.iloc[0,0] # get a sample news title from the training data 
Out[34]:
' BREAKING GOP Chairman Grassley Has Had Enough DEMANDS Trump Jr Testimony'
In [35]:
sample_txt=train_data.iloc[0,0]
In [36]:
# tokenizing sample text from training data 
encoding = tokenizer.encode_plus(
  sample_txt,
  max_length=20,
  add_special_tokens=True, # Add '[CLS]' and '[SEP]'
  return_token_type_ids=False,
  padding='max_length',
  return_attention_mask=True,
  return_tensors='pt',  # Return PyTorch tensors
  truncation=True # to truncate excess tokens to meet the maximum length
)
encoding.keys()
Out[36]:
dict_keys(['input_ids', 'attention_mask'])

Parameter explanations:

  • max_length: an integer controlling the length used for padding/truncation so that all sequences have a constant length
  • add_special_tokens: a boolean; whether to add the special tokens [CLS] at the start of each sentence and [SEP] to mark the end of a sentence
  • return_token_type_ids: a boolean; whether to return token type IDs (a list of token type ids to be fed to a model)
  • padding: activates and controls padding. If set to 'max_length', sequences are padded to the length specified by the argument max_length
  • return_attention_mask: whether to return the attention mask, an array of 1s (real tokens) and 0s (pad tokens) that indicates to the model which tokens should be attended to and which should not
  • return_tensors: can be set to 'tf', 'pt' or 'np' to return TensorFlow tf.constant, PyTorch torch.Tensor or NumPy arrays, respectively
  • truncation: a boolean; whether to truncate sequences longer than max_length

For more parameters and detailed explanations please refer here.

In [37]:
encoding["input_ids"]
Out[37]:
tensor([[  101,  4911,  2175,  2361,  3472,  5568,  3051,  2038,  2018,  2438,
          7670,  8398,  3781, 10896,   102,     0,     0,     0,     0,     0]])
In [38]:
encoding["attention_mask"]# shows 1 for real token and 0 for pad tokens 
Out[38]:
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])
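To see how these IDs map back to tokens, including the [CLS], [SEP] and padding tokens, the IDs can be converted back; a small sketch:

# map the input IDs back to tokens: [CLS] first, [SEP] after the title, [PAD] for the trailing zeros
print(tokenizer.convert_ids_to_tokens(encoding['input_ids'][0].tolist()))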

Set the maximum length for our training data

To set max_length, let us look at the token lengths of the news titles. In this post, only the title of each news item is used for classification. Even though the news text contains more detail and likely has better predictive potential, the limited memory of the free version of Google Colab prevented us from using it for classification. A similar approach can, however, be applied to classify using the news text.

In [39]:
token_length=[] # placeholder for the token count of each news title
for ttl in data_brt.cleaned_title:
  tokens = tokenizer.encode(ttl, max_length=512, truncation=True)
  token_length.append(len(tokens)) # list of token count of each news title
In [40]:
plt.figure(figsize=(12,4))
#sns.distplot(token_length)
sns.countplot(token_length)
plt.title("Frequency of token count")
plt.xlabel("Number of tokens in news title")
plt.ylabel("Frequency")
plt.show()
In [41]:
# see the median, 95% quantile and maximum token count
print("median number of tokens = {},95% quantile = {}, maximum number of tokens = {}"\
      .format(np.median(token_length),np.quantile(token_length,0.95),np.max(token_length)))
median number of tokens = 15.0,95% quantile = 26.0, maximum number of tokens = 60

To be safe, we can set the maximum length to 60, although a smaller value would also work since the majority of titles have fewer than 30 tokens.

In [42]:
max_len= 60  # maximum length
In [43]:
# define a class that can tokenize the data
class NewsDataset(Dataset):
  def __init__(self, text, targets, tokenizer, max_len):
    self.text = text
    self.targets = targets
    self.tokenizer = tokenizer
    self.max_len = max_len
  def __len__(self):
    return len(self.text)
  def __getitem__(self, item):
    row = str(self.text[item])
    target = self.targets[item]
    encoding = self.tokenizer.encode_plus(
      row,
      add_special_tokens=True,
      max_length=self.max_len,
      return_token_type_ids=False,
      padding="max_length",
      return_attention_mask=True,
      truncation=True, # to truncate excess tokens to meet the maximum length
      return_tensors='pt',
    )
    return {
      'row_text': row,
      'input_ids': encoding['input_ids'].flatten(),
      'attention_mask': encoding['attention_mask'].flatten(),
      'targets': torch.tensor(target, dtype=torch.long)
    }

Data loader

The following function is defined specifically for the news titles. For the news text, use a similar approach but feed in the "cleaned_text" column instead (see the sketch after the loader below).

In [44]:
# create a data loader for the news titles
def create_data_loader(data, tokenizer, max_len, batch_size):
    ds = NewsDataset(
    text=data.cleaned_title.to_numpy(),
    targets=data.is_fake.to_numpy(),
    tokenizer=tokenizer,
    max_len=max_len
    )
    return DataLoader(ds,batch_size=batch_size, num_workers=4)

BATCH_SIZE = 16 
train_data_loader = create_data_loader(train_data, tokenizer, max_len, BATCH_SIZE)
val_data_loader = create_data_loader(validate_data, tokenizer, max_len, BATCH_SIZE)
test_data_loader = create_data_loader(test_data, tokenizer, max_len, BATCH_SIZE)
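As mentioned above, the news text could be fed in the same way; a sketch of a hypothetical variant (create_text_data_loader is not part of the original analysis, and a larger max_len would likely be needed for full articles):

# hypothetical loader that uses the cleaned news text instead of the title
def create_text_data_loader(data, tokenizer, max_len, batch_size):
    ds = NewsDataset(
        text=data.cleaned_text.to_numpy(),   # the only change: cleaned_text instead of cleaned_title
        targets=data.is_fake.to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    return DataLoader(ds, batch_size=batch_size, num_workers=4)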
In [45]:
data = next(iter(train_data_loader)) # get one batch from the training data loader
data.keys()
Out[45]:
dict_keys(['row_text', 'input_ids', 'attention_mask', 'targets'])
In [46]:
# the shape of input_ids, attention_mask and targets for a batch of 16
print(data['input_ids'].shape) # batch size * max_length
print(data['attention_mask'].shape)  # batch size * max_length
print(data['targets'].shape)
torch.Size([16, 60])
torch.Size([16, 60])
torch.Size([16])

Let us see what the result looks like:

In [47]:
data["row_text"][0:2] # the first two news title in training data
Out[47]:
[' BREAKING GOP Chairman Grassley Has Had Enough DEMANDS Trump Jr Testimony',
 ' Failed GOP Candidates Remembered In Hilarious Mocking Eulogies VIDEO']
In [48]:
print(data['input_ids'][:2]) # the first two news input_ids in training data
tensor([[  101,  4911,  2175,  2361,  3472,  5568,  3051,  2038,  2018,  2438,
          7670,  8398,  3781, 10896,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  3478,  2175,  2361,  5347,  4622,  1999, 26316, 19545,  7327,
         21615,  2678,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]])
In [49]:
print(data['targets'][:2]) # target value of the first two rows from a batch of 16
tensor([1, 1])
In [50]:
# We can confirm the above result from the training dataframe
train_data.head(2)
Out[50]:
cleaned_title cleaned_text is_fake label
22216 BREAKING GOP Chairman Grassley Has Had Enough... Donald Trump s White House is in chaos and the... 1 fake
27917 Failed GOP Candidates Remembered In Hilarious... Now that Donald Trump is the presumptive GOP n... 1 fake

3.2) Modeling using BERT

The pretrained BERT base-uncased model was used for training and validation.

In [51]:
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)


In [52]:
# for the sample data (on newer transformers versions, pass return_dict=False
# to get a tuple instead of a ModelOutput object)
last_hidden_state, pooled_output = bert_model(
  input_ids=encoding['input_ids'],
  attention_mask=encoding['attention_mask']
)

last_hidden_state contains, for each token of the news title, the hidden state of the last encoder layer, whereas pooled_output is a summary of the whole news title derived from last_hidden_state.

In [53]:
last_hidden_state.shape #  the number of hidden units in the feedforward-networks=768
Out[53]:
torch.Size([1, 20, 768])
In [54]:
pooled_output.shape
Out[54]:
torch.Size([1, 768])
In [55]:
# to make classification
class NewsClassifier(nn.Module):
  def __init__(self, n_classes): 
    super(NewsClassifier, self).__init__()
    self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
    self.drop = nn.Dropout(p=0.1) # set dropout probability for regularization
    self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
  def forward(self, input_ids, attention_mask):
    # on newer transformers versions, add return_dict=False to keep tuple unpacking working
    _, pooled_output = self.bert(
      input_ids=input_ids,
      attention_mask=attention_mask
    )
    output = self.drop(pooled_output)
    return self.out(output)
In [56]:
#create classifier instance and move it to the GPU
model = NewsClassifier(2) # the argument 2 is the number of classes: fake and real news labels
model = model.to(device)
In [57]:
# move the batch of training data to GPU
input_ids = data['input_ids'].to(device)
attention_mask = data['attention_mask'].to(device)
print(input_ids.shape) # batch size x seq length
print(attention_mask.shape) # batch size x seq length
torch.Size([16, 60])
torch.Size([16, 60])
In [58]:
# predicted probabilities from the not-yet-fine-tuned model using the softmax function
nn.functional.softmax(model(input_ids, attention_mask), dim=1)
Out[58]:
tensor([[0.3579, 0.6421],
        [0.3359, 0.6641],
        [0.3831, 0.6169],
        [0.2716, 0.7284],
        [0.3741, 0.6259],
        [0.3672, 0.6328],
        [0.3551, 0.6449],
        [0.3852, 0.6148],
        [0.4510, 0.5490],
        [0.4240, 0.5760],
        [0.3056, 0.6944],
        [0.3381, 0.6619],
        [0.4151, 0.5849],
        [0.3521, 0.6479],
        [0.3809, 0.6191],
        [0.3937, 0.6063]], device='cuda:0', grad_fn=<SoftmaxBackward>)

3.3) Fine-tuning BERT model

Recommendations for fine-tuning from the BERT authors:

  • Batch size: 16 or 32; increasing the batch size reduces the training time but can lower accuracy
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5
  • Number of epochs: 2, 3, 4
In [59]:
#Training 
EPOCHS = 4
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
  )
loss_fn = nn.CrossEntropyLoss().to(device)

Cross entropy measures classification performance by comparing the predicted class probabilities with the actual labels (for a more detailed explanation please refer here).
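As a quick illustration with made-up numbers (a toy batch of two items, not taken from the data):

# two items: logits (raw model outputs) vs. the true class indices
toy_logits = torch.tensor([[2.0, -1.0],   # confidently predicts class 0 (true news)
                           [0.2,  0.4]])  # weakly predicts class 1 (fake news)
toy_targets = torch.tensor([0, 1])
print(nn.CrossEntropyLoss()(toy_logits, toy_targets))  # mean negative log-likelihood over the batch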

In [60]:
# helper function to train the model
def train_epoch( model, data_loader, loss_fn, optimizer, device, scheduler, n_examples):
  model = model.train()
  losses = []
  correct_predictions = 0 
  for d in data_loader:
    input_ids = d["input_ids"].to(device)
    attention_mask = d["attention_mask"].to(device)
    targets = d["targets"].to(device)
    outputs = model(
      input_ids=input_ids,
      attention_mask=attention_mask
    )
    _, preds = torch.max(outputs, dim=1) # take the class with the highest score as the prediction
    loss = loss_fn(outputs, targets)
    correct_predictions += torch.sum(preds == targets) # number of correct predictions
    losses.append(loss.item())
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
  return correct_predictions.double() / n_examples, np.mean(losses)  # accuracy, loss
In [61]:
#helper function to evaluate the model
def eval_model(model, data_loader, loss_fn, device, n_examples):
  model = model.eval()
  losses = []
  correct_predictions = 0
  with torch.no_grad():
    for d in data_loader:
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)
      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1) # take the maximum 
      loss = loss_fn(outputs, targets)
      correct_predictions += torch.sum(preds == targets)
      losses.append(loss.item())
  return correct_predictions.double() / n_examples, np.mean(losses)  # accuracy, loss 
In [62]:
%%time
history = defaultdict(list) # placeholder to store the history of training and validation performance
best_accuracy = 0
for epoch in range(EPOCHS):
  print(f'Epoch {epoch + 1}/{EPOCHS}')
  print('-' * 10)
  # training performance
  train_acc, train_loss = train_epoch(model,train_data_loader,loss_fn, optimizer, device, scheduler,len(train_data))
  print(f'Train loss {train_loss} accuracy {train_acc}')
  
  # validation performance 
  val_acc, val_loss = eval_model(model,val_data_loader, loss_fn, device,len(validate_data))
  print(f'Val   loss {val_loss} accuracy {val_acc}')
  print()

  history['train_acc'].append(train_acc)
  history['train_loss'].append(train_loss)
  history['val_acc'].append(val_acc)
  history['val_loss'].append(val_loss)
  if val_acc > best_accuracy:
    torch.save(model.state_dict(), 'best_model_state.bin') # store the best model
    best_accuracy = val_acc
Epoch 1/4
----------
Train loss 0.1296285122639349 accuracy 0.9602583683339697
Val   loss 0.09647069952418616 accuracy 0.9743132887899035

Epoch 2/4
----------
Train loss 0.0426868009054726 accuracy 0.9894043528064146
Val   loss 0.10733570853713174 accuracy 0.9775798069784707

Epoch 3/4
----------
Train loss 0.011607260809529509 accuracy 0.9973908616520301
Val   loss 0.12249030433684359 accuracy 0.9805493689680772

Epoch 4/4
----------
Train loss 0.003767455873490347 accuracy 0.9992045309914726
Val   loss 0.1310923327622118 accuracy 0.9809948032665182

CPU times: user 18min 32s, sys: 8min 20s, total: 26min 52s
Wall time: 27min 15s
In [63]:
# plot the accuracy of training and validation data for different epoch
plt.plot(history['train_acc'], label='train accuracy')
plt.plot(history['val_acc'], label='validation accuracy')
plt.title('Training history')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.ylim([0, 1]);

4) Model Evaluation

4.1) Validate with test data
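The best checkpoint saved during training can be reloaded before evaluation, which is handy if the Colab runtime has been restarted; a minimal sketch:

# restore the weights of the best-performing epoch saved earlier as best_model_state.bin
model.load_state_dict(torch.load('best_model_state.bin'))
model = model.to(device)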

In [74]:
# accuracy of the model for test data
test_acc, _ = eval_model(model, test_data_loader,loss_fn, device, len(test_data))
test_acc.item()
Out[74]:
0.9795100222717149

The accuracy of the model on the test data is very good, so the model can be applied to new data sets.

In [66]:
# A function to make predictions for new data. It returns the actual text,
# the predictions, the raw prediction scores and the real labels
def get_predictions(model, data_loader):
  model = model.eval()
  news_texts = []
  predictions = []
  prediction_probs = []
  real_values = []
  with torch.no_grad():
    for d in data_loader:
      texts = d["row_text"]
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)
      outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
      )
      _, preds = torch.max(outputs, dim=1)
      
      news_texts.extend(texts)
      predictions.extend(preds)
      prediction_probs.extend(outputs)
      real_values.extend(targets)
  predictions = torch.stack(predictions).cpu()
  prediction_probs = torch.stack(prediction_probs).cpu()
  real_values = torch.stack(real_values).cpu()
  return news_texts, predictions, prediction_probs, real_values
In [67]:
y_news_texts, y_pred, y_pred_probs, y_test = get_predictions(
  model, test_data_loader)
In [68]:
# classification report for test data
print(classification_report(y_test, y_pred, target_names=["true","fake"]))
              precision    recall  f1-score   support

        true       0.98      0.98      0.98      3238
        fake       0.98      0.98      0.98      3497

    accuracy                           0.98      6735
   macro avg       0.98      0.98      0.98      6735
weighted avg       0.98      0.98      0.98      6735

In [69]:
# to display the confusion matrix
def show_confusion_matrix(confusion_matrix):
  hmap = sns.heatmap(confusion_matrix, annot=True, fmt="d", cmap="Blues")
  hmap.yaxis.set_ticklabels(hmap.yaxis.get_ticklabels(), rotation=0, ha='right')
  hmap.xaxis.set_ticklabels(hmap.xaxis.get_ticklabels(), rotation=30, ha='right')
  plt.ylabel('True label')
  plt.xlabel('Predicted label');

cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, index=[0,1], columns=[0,1]) # [0, 1] are the class labels (0 = true, 1 = fake)
show_confusion_matrix(df_cm)

We can also look at individual examples from the test data and their corresponding predictions.

In [70]:
idx = 2 # index
class_names= ["true","fake"]
news_text_ttl = y_news_texts[idx] # extract the news title with the given index
true_label = y_test[idx]
pred_df = pd.DataFrame({'class_names': class_names, 'values': y_pred_probs[idx]})
print("\n".join(wrap(news_text_ttl)))
print()
print(f'True label: {class_names[true_label]}')
Lebanons grand mufti calls for national unity

True label: true
In [71]:
y_pred[2] #prediction
Out[71]:
tensor(0)
In [83]:
test_data.iloc[2,:] # we can confirm this with the third row (index 2) of the test data
Out[83]:
cleaned_title        Lebanons grand mufti calls for national unity
cleaned_text     BEIRUT  Lebanon s grand mufti the top cleric f...
is_fake                                                          0
label                                                         true
Name: 15155, dtype: object

4.2) Use the model to classify news titles from the Snopes website

Some news items from Snopes, a fact-checking website, were collected to test our model. Snopes rates news as false, true, mixed and so on. Archived news from 2016/2017 was used because the data at hand is from 2015 to 2018; current news would not be a good test, as news in 2020 is dominated by issues such as COVID-19. A total of twenty news titles, half of them labeled as fake and the rest as true, were used to test the model.

In [72]:
# news title and their corresponding labeling by Snopes
snopes_data= ["Is This James Earl Jones Dressed as Darth Vader", 
"David Rockefeller's Sixth Heart Transplant Successful at Age 99", 
"Did Bloomington Police Discover Over 200 Penises During Raid at a Mortician's Home?", 
"Is the Trump Administration Price Gouging Puerto Rico Evacuees and Seizing Passports?",
"2017 Tainted Halloween Candy Reports 11/5/2014", 
"Did President Trump Say Pedophiles Will Get the Death Penalty?", 
"Michelle Obama Never Placed Her Hand Over Her Heart During the National Anthem?",
"Katy Perry Reveals Penchant for Cannibalism?" ,
"Is a Virginia Church Ripping Out an 'Offensive' George Washington Plaque?", 
"Were Scientists Caught Tampering with Raw Data to Exaggerate Sea Level Rise?",
"Did Trump Retweet a Cartoon of a Train Hitting a CNN Reporter?",
"Did Pipe-Bombing Suspect Cesar Sayoc Attend Donald Trump Rallies?",
"Did President Trump’s Grandfather Beg the Government of Bavaria Not to Deport Him?",
"Did Gun Violence Kill More People in U.S. in 9 Weeks than U.S. Combatants Died on D-Day?",
 "Did the Florida Shooter’s Instagram Profile Picture Feature a ‘MAGA’ Hat?",
"Wisconsin Department of Natural Resources Removes References to ‘Climate’ from Web Site",
 "Hillary Clinton Referenced RFK Assassination as Reason to Continue 2008 Campaign",
  "Did Richard Nixon Write a Letter Predicting Donald Trump’s Success in Politics?", 
"Did a Twitter User Jeopardize Her NASA Internship by Insulting a Member of the National Space Council?",
"Did WaPo Headline Call IS Leader al-Baghdadi an ‘Austere Religious Scholar’?"]  

label_actual = ["fake", "fake","fake","fake","fake","mixed", "fake","fake","mostly_false","fake","true",
               "true", "true","true","true","true","true","true","true","true"] # rated by Snopes
label_adjusted = ["fake", "fake","fake","fake","fake","fake", "fake","fake","fake","fake","true","true",
                  "true","true","true","true","true","true","true","true"]  # adjusted to fake or true

Clean the news titles in the same way as the training data was cleaned and then make predictions.

In [84]:
snopes_pred=[]
count_true_pred=0
for pos,ttl in enumerate(snopes_data):
  ttl= remove_pattern(ttl, patterns) # clean the title
  encoded_snopes = tokenizer.encode_plus(
    ttl,
    max_length=max_len,
    add_special_tokens=True,
    return_token_type_ids=False,
    padding="max_length",
    return_attention_mask=True,
    truncation=True,
    return_tensors='pt',
  )

  input_ids = encoded_snopes['input_ids'].to(device)
  attention_mask = encoded_snopes['attention_mask'].to(device)
  output = model(input_ids, attention_mask)
  _, prediction = torch.max(output, dim=1)
  
  pred_label=class_names[prediction]  #  prediction class
  snopes_pred.append(pred_label)

  # compare the predicted and actual class label
  count_true_pred += (pred_label==label_adjusted[pos])
  
  
print(f'Snopes news title: {snopes_data}')
print(f'Prediction  : {snopes_pred}')
print(f'accuracy  : {count_true_pred/len(snopes_data)}')
Snopes news title: ['Is This James Earl Jones Dressed as Darth Vader', "David Rockefeller's Sixth Heart Transplant Successful at Age 99", "Did Bloomington Police Discover Over 200 Penises During Raid at a Mortician's Home?", 'Is the Trump Administration Price Gouging Puerto Rico Evacuees and Seizing Passports?', '2017 Tainted Halloween Candy Reports 11/5/2014', 'Did President Trump Say Pedophiles Will Get the Death Penalty?', 'Michelle Obama Never Placed Her Hand Over Her Heart During the National Anthem?', 'Katy Perry Reveals Penchant for Cannibalism?', "Is a Virginia Church Ripping Out an 'Offensive' George Washington Plaque?", 'Were Scientists Caught Tampering with Raw Data to Exaggerate Sea Level Rise?', 'Did Trump Retweet a Cartoon of a Train Hitting a CNN Reporter?', 'Did Pipe-Bombing Suspect Cesar Sayoc Attend Donald Trump Rallies?', 'Did President Trump’s Grandfather Beg the Government of Bavaria Not to Deport Him?', 'Did Gun Violence Kill More People in U.S. in 9 Weeks than U.S. Combatants Died on D-Day?', 'Did the Florida Shooter’s Instagram Profile Picture Feature a ‘MAGA’ Hat?', 'Wisconsin Department of Natural Resources Removes References to ‘Climate’ from Web Site', 'Hillary Clinton Referenced RFK Assassination as Reason to Continue 2008 Campaign', 'Did Richard Nixon Write a Letter Predicting Donald Trump’s Success in Politics?', 'Did a Twitter User Jeopardize Her NASA Internship by Insulting a Member of the National Space Council?', 'Did WaPo Headline Call IS Leader al-Baghdadi an ‘Austere Religious Scholar’?']
Prediction  : ['fake', 'true', 'fake', 'fake', 'fake', 'fake', 'fake', 'fake', 'fake', 'fake', 'fake', 'fake', 'fake', 'fake', 'fake', 'true', 'true', 'fake', 'fake', 'fake']
accuracy  : 0.55

Summary

The BERT model achieves better accuracy than the other models based on word counts, TF-IDF and word2vec (available in my GitHub repository). The test accuracy of the BERT model was approximately 0.98, whereas the accuracies of the other models were between 0.93 and 0.94. Although the memory limit prevented us from predicting using the news text, we expect a similar advantage over the corresponding models there as well. This is left for further experiments.