Cleaned Lang8 Dataset for Grammatical Error Detection (GED)
Grammatical Error Detection is an important problem in Natural Language Processing, and developing GED algorithms requires suitable datasets for training and testing Machine Learning models. One of the most popular datasets for this task is the Corpus of Linguistic Acceptability (CoLA), which contains around ten thousand sentences. However, CoLA is relatively small and highly unbalanced: its training set has far more correct sentences (6,023) than incorrect ones (2,528). This imbalance prompted us to explore a larger and more diverse dataset, Lang8, for more robust training and evaluation of language models.
The original Lang8 dataset can be obtained from Google Research Datasets and comprises 2,372,119 rows. The dataset was acquired by completing a form provided by the repository. The repository includes a run.sh script which, when executed on the raw data, produces the final tokenized dataset. Due to tokenization issues encountered while running the script, the final dataset we obtained had 2,350,982 rows.
This Lang8 dataset consists of two columns, labeled 0 and 1: column 0 contains the original sentences, and column 1 contains their corrected counterparts. Of the 2,350,982 rows, 991,358 contain grammatically correct sentences in column 0 and 1,359,624 contain grammatically incorrect ones. This is inferred simply by comparing the sentence in column 0 with the sentence in column 1: if the two are identical, the sentence in column 0 is grammatically correct.
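This labelling rule can be sketched with pandas (the toy DataFrame below stands in for the real table, which uses the same two-column layout):

```python
import pandas as pd

# Toy stand-in for the Lang8 table: column 0 holds the original sentence,
# column 1 its corrected counterpart.
df = pd.DataFrame({
    0: ["I goes to school.", "The cat is on the mat."],
    1: ["I go to school.", "The cat is on the mat."],
})

# A sentence in column 0 is grammatically correct exactly when it matches
# the corrected sentence in column 1.
df["is_correct"] = df[0] == df[1]
print(df["is_correct"].tolist())  # → [False, True]
```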
During a detailed manual analysis of this dataset, we noticed that many of the grammatically incorrect sentences in column 0 can be fixed with simple rules. It is important to remove such sentences, since Machine Learning models for Grammar Error Detection are primarily required to detect non-trivial grammatical errors, so the training data should not contain trivially fixable sentences. The data cleaning procedures detailed in this document refine the dataset accordingly, ensuring its suitability for linguistic analysis, language model training, and other research endeavors.
Text Normalisation
First, we normalised the text. This involved converting Unicode characters to their closest ASCII equivalents. Additionally, punctuation marks were replaced with spaces to standardise the text format. After normalisation, rows whose two sentences had become identical were removed.
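A minimal sketch of this normalisation using Python's standard library (the exact mapping used for cleaning may differ; characters with no ASCII decomposition, such as curly quotes, are simply dropped here):

```python
import string
import unicodedata

def to_ascii(text: str) -> str:
    """Map Unicode characters to their closest ASCII equivalents and
    replace punctuation marks with spaces."""
    # NFKD decomposition separates base characters from their accents;
    # the ASCII encode step then drops anything non-ASCII (accents, etc.).
    ascii_text = (unicodedata.normalize("NFKD", text)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    # Replace every ASCII punctuation mark with a space; extra spaces are
    # handled in the following Space Normalisation step.
    return ascii_text.translate(
        str.maketrans(string.punctuation, " " * len(string.punctuation)))

print(to_ascii("naïve café"))  # → "naive cafe"
```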
Space Normalisation
In the Space Normalisation step, we employed regular expressions to eliminate extra spaces in both columns 0 and 1. This included converting double spaces to single spaces and removing spaces preceding punctuation, contributing to the standardisation of the text format.
Moreover, we extended the normalisation to instances where spaces were incorrectly placed by the tokenizer, as observed in phrases like “ca n’t”. These were transformed to their correct forms, such as “can’t”, ensuring a more accurate representation of the grammatical structure. This both standardised spacing conventions and rectified improperly positioned spaces, further refining the dataset for subsequent linguistic analysis and language model training.
Original: “I do n’t want to go to school.”
Corrected: “I don’t want to go to school.”
Updated count of grammatically incorrect sentences: 1,323,190
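The fixes above can be sketched with a few regular expressions (the patterns below are illustrative; the actual cleaning rules may have covered more cases):

```python
import re

def normalise_spaces(text: str) -> str:
    # Re-attach clitics that the tokenizer split off ("do n't" -> "don't").
    text = re.sub(r"\b(\w+) (n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1\2", text)
    # Remove spaces that precede punctuation ("school ." -> "school.").
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # Collapse runs of whitespace to a single space.
    return re.sub(r"\s{2,}", " ", text).strip()

print(normalise_spaces("I do n't want to go to school ."))
# → "I don't want to go to school."
```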
Case Lowering
All sentences were converted to lowercase to ensure that different cases of the same word were treated uniformly. For example:
Original: “I am Going to School.”
Corrected: “I am going to school.”
Newly created duplicates due to this process were then removed.
Updated count of grammatically incorrect sentences: 1,251,300
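A minimal sketch of this step with pandas (toy data; in practice the lowering is applied to the full table):

```python
import pandas as pd

# Two rows that differ only in case before lowering.
df = pd.DataFrame({
    0: ["I am Going to School.", "i am going to school."],
    1: ["I am going to school.", "i am going to school."],
})

# Lowercase both columns so case differences no longer count as errors.
df[0] = df[0].str.lower()
df[1] = df[1].str.lower()

# Drop exact duplicate rows created by the lowering, then recount the
# pairs whose two columns still differ.
df = df.drop_duplicates()
wrong = df[df[0] != df[1]]
print(len(df), len(wrong))  # → 1 0
```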
Handling Contractions
Contractions in the sentences were expanded to their full forms. This step ensured that contractions did not lead to discrepancies in detecting grammatical errors. For example:
Original: “I can’t go to school.”
Corrected: “I cannot go to school.”
Newly created duplicates due to this process were then removed.
Updated count of grammatically incorrect sentences: 1,251,257
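A sketch of the expansion using a small lookup table (the real cleaning would use a much fuller contraction map, or a dedicated library):

```python
# Minimal, illustrative contraction map; the actual table would be larger.
CONTRACTIONS = {
    "can't": "cannot",
    "don't": "do not",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
}

def expand_contractions(text: str) -> str:
    # Replace each whitespace-separated token that is a known contraction.
    return " ".join(CONTRACTIONS.get(tok, tok) for tok in text.split())

print(expand_contractions("i can't go to school."))
# → "i cannot go to school."
```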
Punctuation Removal
Some of the sentences were labeled grammatically incorrect simply due to incorrect punctuation. Punctuation marks were removed from sentences in both columns using regular expressions, and the resulting sentences were compared. Pairs that became identical, now devoid of punctuation, were again filtered out.
Original: “I cannot go to school!”
Corrected: “I cannot go to school.”
Updated count of grammatically incorrect sentences: 1,182,692
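A minimal sketch of the punctuation stripping and the resulting duplicate check:

```python
import re

def strip_punctuation(text: str) -> str:
    # Drop every character that is neither a word character nor whitespace,
    # then tidy up any spacing left behind.
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()

# Both variants reduce to the same string, so the pair is treated as a
# duplicate and filtered out.
print(strip_punctuation("i cannot go to school!") ==
      strip_punctuation("i cannot go to school."))  # → True
```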
Levenshtein Distance Filtering
Finally, we wanted to retain only those sentences which have a reasonable length and also have significant enough grammatical errors. We first removed sentences shorter than 3 characters or longer than 100 characters. Next, we computed the Levenshtein Distance (LD) between the sentences in columns 0 and 1 and kept only pairs with LD between 7 and 42, resulting in 227,527 sentences.
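The distance computation and the filters just described can be sketched as follows (thresholds as stated above):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete,
    substitute, each costing 1), kept to two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def keep_pair(orig: str, corr: str) -> bool:
    # Length gate (3-100 characters) plus the LD window (7-42).
    if not (3 <= len(orig) <= 100):
        return False
    return 7 <= levenshtein(orig, corr) <= 42

print(levenshtein("kitten", "sitting"))  # → 3
```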
To gain a length-normalised perspective, we calculated the ratio of the Levenshtein distance to the length of the corrected sentence in column 1, which provides a balanced measure accounting for the inherent differences in sentence lengths.
Levenshtein Distance Normalisation
To provide a balanced measure of sentence similarity, the Levenshtein distance was normalised by dividing it by the length of the corrected sentence. Only sentence pairs with a normalised Levenshtein distance between 0.08 and 0.5 were retained, resulting in 217,018 sentences. For instance:
Original: “The cat sat on the mat.”
Corrected: “The cat is on the mat.”
Levenshtein Distance: 3 (“sat” -> “is”)
Length of corrected sentence: 22
Normalised Levenshtein Distance: 3 / 22 ≈ 0.14
This sentence pair was retained as the normalised distance fell within the specified range.
Number of Sentences After Levenshtein Distance Filtering: 2,17,018
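The normalised-distance check can be reproduced in a few lines; a different illustrative sentence pair is used below (the memoised edit distance here is equivalent to the two-row version):

```python
from functools import lru_cache

def levenshtein(a: str, b: str) -> int:
    # Small memoised edit distance; fine for sentence-length strings.
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j
        if j == 0:
            return i
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))

orig = "i have went to school"
corr = "i went to school"
# Normalise by the length of the corrected sentence (column 1).
ratio = levenshtein(orig, corr) / len(corr)
print(round(ratio, 2), 0.08 <= ratio <= 0.5)  # → 0.31 True
```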
Conclusion
The application of these data cleaning steps resulted in a refined Lang8 dataset containing 217,018 grammatically incorrect sentences in column 0, with their corrected versions in column 1. To balance the dataset, we also added an equal number of grammatically correct sentences to the CSV file. However, this pushed the file size past GitHub's 25 MB limit, so the final uploaded dataset contains a total of 200,000 sentences in column 0. This cleaned dataset will hopefully serve as a valuable resource for linguistic analysis, language model training, and related research endeavors.
We would also have liked to spell-check all the words and remove sentences that are grammatically wrong merely due to spelling errors, since our focus is on developing Machine Learning models that detect mistakes in sentence structure. However, existing spell checkers proved too slow to run over a dataset of this size, so this step was omitted.
It would also be interesting to analyse the sentences in columns 0 and 1 in terms of sentence similarity using embeddings from a Large Language Model such as BERT or SBERT. We hope to carry out this detailed analysis in the future, which will hopefully help in building robust models for Grammatical Error Detection.
Work done by my intern: Rahul Nihalani