Learning what it takes to edit sentences.

Akarshan Kumar
10 min read · Jan 9, 2022
Photo by Aaron Burden on Unsplash

This blog discusses:

  1. The Problem Statement and Data
  2. EDA on data
  3. Building a base model with CNN and LSTM
  4. Using BERT for word embeddings and Model analysis
  5. Deployment of solution
  6. Conclusion and Future Work

The Problem Statement and Data

In the recent past, conventional ML models have been applied to sequence classification tasks (including but not limited to NLP), and they achieve decent performance scores on such tasks. But with the emergence of BERT and Transformers as the new standard for handling sequence classification, it is worth revisiting these tasks.

This blog trains classifiers built on three BERT variants (DistilBERT, RoBERTa, SciBERT) for an existing sentence classification task and compares the results with baseline models. The task chosen is detecting grammatical errors in a sentence: given a sentence, predict whether it needs editing or not. Since both false positives and false negatives matter for this task, the F1 score was chosen as the evaluation metric.

We have taken data from the paper ‘A Report on the Automatic Evaluation of Scientific Writing Shared Task’. AESW is the task of identifying sentences in need of correction to ensure their appropriateness in scientific prose. The data set comes from a professional editing company, VTeX, with two aligned versions of the same text — before and after editing — and covers a variety of textual infelicities that proofreaders have edited.

A fragment of the data.

EDA on data

The data first has to be extracted from the XML format into a CSV; a rough sketch of this step is shown below.
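The exact parsing code depends on the AESW XML schema, so the following is only a rough sketch: it assumes each sentence sits in a sentence element carrying sid and domain attributes, with del and ins child tags marking the edits (tag and attribute names, as well as the file paths, are assumptions to adjust against the actual files).

# Rough sketch: AESW XML -> CSV (tag/attribute names and paths are assumptions).
import csv
import xml.etree.ElementTree as ET

def flatten(sent, skip):
    """Rebuild the sentence text, skipping the contents of one edit tag."""
    parts = [sent.text or '']
    for child in sent:
        if child.tag != skip:
            parts.append(child.text or '')
        parts.append(child.tail or '')
    return ''.join(parts)

with open('Data//aesw_train.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['SID', 'domain', 'SBE', 'SAE', 'del_word', 'ins_word', 'label'])
    for _, sent in ET.iterparse('Data//aesw_train.xml'):
        if sent.tag != 'sentence':
            continue
        sbe = flatten(sent, skip='ins')   # text before editing keeps the deletions
        sae = flatten(sent, skip='del')   # text after editing keeps the insertions
        dels = ' '.join(d.text or '' for d in sent.findall('del'))
        inss = ' '.join(i.text or '' for i in sent.findall('ins'))
        writer.writerow([sent.get('sid'), sent.get('domain'),
                         sbe, sae, dels, inss, int(sbe != sae)])
        sent.clear()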

Preliminary exploration then shows that there are 1,189,412 data points and 7 extracted columns:

  1. SID: the sentence ID.
  2. domain: the scientific domain the sentence belongs to, stored as a string.
  3. SBE: the state of the sentence before editing.
  4. SAE: the state of the sentence after editing. If the sentence requires editing, SBE and SAE contain different sentences; otherwise they contain the same sentence.
  5. del_word: the text (including spaces) that had to be deleted to correct the sentence.
  6. ins_word: the text (including spaces) that had to be inserted to correct the sentence.
  7. label: 1 if the sentence is grammatically wrong, 0 otherwise.

Null values are rare in SBE and SAE, so we can simply drop those rows. Null values in del_word and ins_word can be imputed with blanks. Note: to correct a wrong sentence we do not always have to delete a word from it; we can just insert the missing text, and vice versa. For sentences that do not need editing, del_word and ins_word are always blank. A small sketch of this cleanup follows.
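A minimal pandas sketch of this cleanup (the CSV path is an assumption; column names are as described above):

import pandas as pd

df = pd.read_csv('Data//aesw_train.csv')   # assumed path

# Drop the rare rows where the sentence itself is missing.
df = df.dropna(subset=['SBE', 'SAE'])

# A missing del_word / ins_word simply means nothing was deleted or inserted.
df[['del_word', 'ins_word']] = df[['del_word', 'ins_word']].fillna('')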

Most Del_words from Sentences that need editing
Most Ins_words from Sentences that need editing
Top 10 Pairs of Del_words, Ins_word from Sentences that need editing

From the above three images, it is evident that most of the sentences that need editing contain only minuscule grammatical errors. This makes the job of finding errors in a sentence hard, since such small edits barely change the overall look of a sentence and the same surface features appear in both classes.

From the distribution of Label, we can see this is an imbalanced dataset with most of the sentences not needing an edit.

Percentwise Label distribution

From the distribution of the domain column, we can see that most of the sentences come from papers written in the domains of Mathematics, Physics, Engineering, and Computer Science.

Distribution of sentence length in SBE
Distribution of sentence length in SAE

From the above two graphs, we can see that the distribution of sentence length is right-skewed and roughly log-normal, as expected for sentence data, both before and after editing.

Word Share distribution between SAE and SBE

Word share is the ratio of the number of distinct words two sentences have in common to the total number of distinct words in the two sentences combined, so two identical sentences have a word share of exactly 0.5. From the above EDA we can see that for sentences not needing an edit it is exactly 0.5, while for sentences that need editing the distribution is left-skewed, yet many of them still have a word share close to 0.5. This again indicates that most sentences need only minor editing, as we saw in the earlier analysis. A small sketch of the computation is shown below.
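A small sketch of this feature as defined above (the helper name and the use of lowercased, whitespace-split distinct words are my assumptions):

def word_share(s1: str, s2: str) -> float:
    """Common distinct words divided by the total distinct words of both sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 and not w2:
        return 0.0
    return len(w1 & w2) / (len(w1) + len(w2))

# Identical sentences score exactly 0.5, matching the plot above.
df['word_share'] = [word_share(a, b) for a, b in zip(df['SBE'], df['SAE'])]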

Building a base model with CNN and LSTM

Let's build a baseline model first, to compare our results against the later models which will include BERT. For sequence classification we can use a CNN as well as an LSTM; you can have a look at this and this for a better understanding of using CNNs for sentence classification. We will use BERT-style tokenizers throughout so that the features stay as consistent as possible across all models, though for the base models we train the tokenizers on the AESW data itself.

bert_tokenizer_params = dict(lower_case=True)
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size=12000,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={},
)

The above cell shows how to set the arguments for initializing a BERT tokenizer as well as a vocabulary builder for it.

from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

vocabulary = bert_vocab.bert_vocab_from_dataset(
    train.batch(10).prefetch(2), **bert_vocab_args
)

This cell builds the vocabulary from the training data, which is fed to it in batches of 10 sentences at a time. The learned vocabulary then has to be written to a text file so that the tokenizer in the next cell can load it, as sketched below.
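A minimal sketch of that step (the helper name is mine; the file path matches the next cell; one wordpiece per line):

def write_vocab_file(filepath, vocab):
    # BertTokenizer expects a plain text file with one wordpiece per line.
    with open(filepath, 'w') as f:
        for token in vocab:
            print(token, file=f)

write_vocab_file('Data//vocab_tr_w.txt', vocabulary)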

import tensorflow_text as tf_text

tokenizer = tf_text.BertTokenizer('Data//vocab_tr_w.txt', **bert_tokenizer_params)

This is how the BERT tokenizer is initialized, with the vocabulary file path and the tokenizer arguments as a dictionary. After this, we pad the encoded sequences to a common length across all of the data and then train our CNN and LSTM models; a sketch of the padding step follows.
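A minimal sketch of the encoding and padding step, reusing the tokenizer above (MAX_LEN, the helper name, and the train_sentences variable are assumptions):

import tensorflow as tf

MAX_LEN = 128  # assumed maximum sequence length

def encode_and_pad(sentences):
    # tokenize() returns a RaggedTensor of shape (batch, words, wordpieces);
    # merge the last two axes to get one flat token sequence per sentence.
    tokens = tokenizer.tokenize(tf.constant(sentences)).merge_dims(-2, -1)
    # Pad (or cut) every sequence to MAX_LEN with the [PAD] id 0.
    return tokens.to_tensor(default_value=0, shape=[None, MAX_LEN])

X_train = encode_and_pad(train_sentences)  # train_sentences: list of SBE strings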

CNN architecture

The architecture of the CNN model used is shown above. It was observed that deeper and more complex models were not boosting performance much, so this low-complexity model with on-par results was the final CNN model used. The results were as follows.

Accuracy on test set: 50.90%, precision: 39.82%, recall: 49.99%, F1 score: 44.33%
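For reference, a rough Keras sketch of a comparably simple CNN text classifier (layer sizes, epochs, and the X_train/y_train names are illustrative assumptions, not the exact architecture from the figure):

import tensorflow as tf

cnn_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=12000, output_dim=128),  # vocab size from above
    tf.keras.layers.Conv1D(64, kernel_size=3, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # probability that the sentence needs editing
])
cnn_model.compile(optimizer='adam', loss='binary_crossentropy')
cnn_model.fit(X_train, y_train, validation_split=0.1, epochs=3, batch_size=256)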

The architecture of the LSTM model used is shown alongside. It was observed that deeper and more complex models were not boosting performance much, so this low-complexity model with on-par results was the final LSTM model used. The results were as follows.

Accuracy on test set: 50.70%, precision: 39.79%, recall: 50.74%, F1 score: 44.60%
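Likewise, a rough sketch of a comparably simple LSTM classifier (again, layer sizes and variable names are illustrative assumptions):

import tensorflow as tf

lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=12000, output_dim=128),
    tf.keras.layers.LSTM(64),                  # a single recurrent layer was enough
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
lstm_model.compile(optimizer='adam', loss='binary_crossentropy')
lstm_model.fit(X_train, y_train, validation_split=0.1, epochs=3, batch_size=256)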

Note: the models were used to produce prediction probabilities only. The actual classification was done after tuning the decision threshold (since the data is imbalanced), and the results above were calculated on those thresholded predictions; a sketch of the threshold tuning is shown below.
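A minimal sketch of the threshold tuning, assuming a held-out validation split (X_val, y_val, X_test are assumed names) and using scikit-learn's f1_score:

import numpy as np
from sklearn.metrics import f1_score

# Predicted probabilities on the validation split.
probs = cnn_model.predict(X_val).ravel()

# Pick the threshold that maximizes F1 on the validation data.
thresholds = np.linspace(0.05, 0.95, 91)
best_t = max(thresholds, key=lambda t: f1_score(y_val, (probs >= t).astype(int)))

# Apply the tuned threshold to the test predictions before computing metrics.
y_pred = (cnn_model.predict(X_test).ravel() >= best_t).astype(int)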

Using BERT for word embeddings and Model analysis

Transformers are the new standard for NLP models. They bring the power of the attention-based encoder-decoder mechanism, as explained in the paper "Attention Is All You Need". For further clarification, please go through this and this blog.

Bidirectional Encoder Representations from Transformers, or BERT, as the name suggests, uses only the encoder part of the Transformer. With BERT we can encode words and then use those encodings in a feed-forward network to make classifications, which is precisely what we are going to do in the following sections. But instead of using plain BERT, we try different variants of BERT and see which one performs best. In this blog we will use DistilBERT (a smaller and faster BERT-based model obtained by distilling the original BERT), RoBERTa (the same architecture as BERT but pre-trained on much more data, 160GB versus BERT's 13GB), and SciBERT (again the same architecture as BERT but pre-trained on 1.14M papers from semanticscholar.org).

The training method, i.e. extracting word embeddings from the model and then using an FFNN on top to classify sentences, will be the same for all three BERT-based models.

To get the embeddings from the BERT models, Hugging Face provides an excellent approach in the form of pipelines. A pipeline takes three main arguments: the model, the tokenizer, and the end task you want to perform with them.

from transformers import AutoConfig, AutoTokenizer, TFAutoModel, pipeline

model_name = 'distilbert-base-uncased'
config = AutoConfig.from_pretrained(model_name, num_labels=2)
config.output_hidden_states = False

BERT = TFAutoModel.from_pretrained(model_name, config=config)
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          do_lower_case=True,
                                          use_fast=True,
                                          max_length=MAX_LEN,
                                          truncation=True,
                                          pad_to_max_length=True)

pipe = pipeline('feature-extraction', model=BERT,
                tokenizer=tokenizer, device=1)

In the above snippet, we can simply plug in the appropriate model name and get word encodings from that model.

import numpy as np

features = np.array(pipe(df['SBE'].iloc[0:2000].to_list()))

lst = []
for idx in range(np.shape(features)[0]):
    # Average the word embeddings to get a single sentence embedding.
    sent_mean = np.mean(features[idx][0], axis=0)
    lst.append(sent_mean)
feature_matrix = np.array(lst)

In the above snippet, we first convert the embeddings into an array of word-wise embeddings. To get the embedding of a sentence, we simply take the mean of the embeddings of all the words present in that sentence, and store these sentence embeddings in a feature matrix. We do this for all three BERT-based models and save the feature matrices accordingly.

After passing this feature matrix to an FFNN, an improvement in performance is observed for all three variants of BERT. The best of the three was DistilBERT, with the following performance: accuracy on test set: 53.63%, precision: 60.97%, recall: 65.94%, F1 score: 63.36%. A sketch of such an FFNN head is shown below.
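A rough Keras sketch of such an FFNN head trained on the saved sentence embeddings (layer sizes are illustrative; labels is an assumed name for the target vector):

import tensorflow as tf

ffnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(feature_matrix.shape[1],)),  # 768-d for DistilBERT
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
ffnn.compile(optimizer='adam', loss='binary_crossentropy')
ffnn.fit(feature_matrix, labels, validation_split=0.1, epochs=10, batch_size=256)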

Now that we know DistilBERT performs the best, we can look at how it behaves on the actual data and analyze the model in depth.

Domain Distribution

Analyzing the false-positive data points, we can see that their domain distribution is almost the same as that of the whole train data, except for Physics. So false positives are only slightly dependent on the domain.

Del-Ins word pair for FP points.

The deleted-then-inserted word pairs for the false positives were mostly punctuation pairs, followed by is-are, section-Section, and n't-not pairs, and then article pairs (a, an, the), all of which got falsely predicted as not needing editing.

Binary cross-entropy loss

The distribution of the binary cross-entropy loss is highly right-skewed.

Del-Ins word pair for the most erroneous points.

Most of the highest-loss points are false positives, so their deleted-then-inserted word pairs closely follow the false-positive pattern above.

Deployment of solution

Deployment on Flask

I pickled the pipeline, the DistilBERT model, and the tokenizer, as well as the FFNN model, and used them to serve predictions. Then I made a simple form that opens in a web browser to take user input (minding all the boundary conditions) and used Flask to deploy the app locally, where one can enter a sentence and the page predicts whether the sentence needs editing or not. A stripped-down sketch of such an app is shown below.
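A stripped-down, illustrative sketch of such a Flask app (the pickle file names, the index.html template, and the predict route are assumptions, not the exact deployed code):

# app.py: minimal Flask wrapper around the saved models (illustrative only).
import pickle
import numpy as np
from flask import Flask, request, render_template

app = Flask(__name__)
pipe = pickle.load(open('pipe.pkl', 'rb'))    # feature-extraction pipeline (assumed file)
ffnn = pickle.load(open('ffnn.pkl', 'rb'))    # FFNN classification head (assumed file)

@app.route('/', methods=['GET', 'POST'])
def predict():
    verdict = None
    if request.method == 'POST':
        sentence = request.form['sentence']
        # Mean-pool the word embeddings, exactly as during training.
        embedding = np.mean(np.array(pipe([sentence]))[0][0], axis=0)
        prob = float(ffnn.predict(embedding.reshape(1, -1))[0][0])
        verdict = 'needs editing' if prob >= 0.5 else 'does not need editing'
    return render_template('index.html', verdict=verdict)

if __name__ == '__main__':
    app.run(debug=True)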

Conclusion

BERT, being a state-of-the-art family of models for NLP tasks, does improve performance on the task of predicting whether a sentence needs editing or not.

Future work

Using model retraining techniques, this fine-tuning setup can be taken further for better performance.

Profile

For .ipynb notebooks see here.

Contact me via LinkedIn.
