Copyright (c) Meta Platforms, Inc. and affiliates.
This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
Use this notebook to pull in datasets and apply preprocessing. Most grammar datasets require preprocessing before they can be used for training. For example, JFLEG has four targets per input, so each input has to be re-paired with its targets as 1:1 pairs.
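The 1:1 re-pairing can be sketched on a single hand-written record. This is only an illustration, not the export code used in this notebook: the helper name `expand_pairs` and the toy record are assumptions made for the example, while the `sentence` and `corrections` field names follow the JFLEG features printed further down.

```python
def expand_pairs(record):
    """Turn one record with several corrections into (input, target) pairs.

    `expand_pairs` is a hypothetical helper for illustration only.
    """
    sentence = record["sentence"]
    # One output pair per non-empty correction; blank entries are skipped.
    return [(sentence, c) for c in record["corrections"] if sentence and c]


# Toy record in the same shape as a JFLEG row (contents invented for the demo).
toy = {
    "sentence": "She go to school .",
    "corrections": [
        "She goes to school .",
        "She went to school .",
        "",
        "She goes to school .",
    ],
}

pairs = expand_pairs(toy)
for source, target in pairs:
    print(source, "->", target)
```

With four corrections, one of them blank, the toy record yields three pairs. Duplicate corrections are kept as separate pairs in this sketch.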
%% Cell type:code id: tags:
``` python
import csv
from datasets import load_metric, load_dataset
from pathlib import Path
```
%% Cell type:code id: tags:
``` python
list_replacements = [
    (" .", "."),
    (" ,", ","),
    (" '", "'"),
    (" ?", "?"),
    (" !", "!"),
    (" :", ":"),
    (" ;", ";"),
    (" n't", "n't"),
    ("2 0 0 6", "2006"),
    ("5 5", "55"),
    ("4 0 0", "400"),
    ("1 7-5 0", "1750"),
    ("2 0 %", "20%"),
    ("5 0", "50"),
    ("1 2", "12"),
    ("1 0", "10"),
    ('" ballast water', '"ballast water'),
]
```
%% Cell type:code id: tags:
``` python
def correct_spacing(item):
    """Apply every replacement in list_replacements, in order, to one dataset item."""
    for fix in list_replacements:
        item = item.replace(fix[0], fix[1])
    return item
```
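%% Cell type:markdown id: tags:
As a quick check of what this does, the sketch below applies the same replace-in-order loop to two tokenized sentences. It is self-contained, so it redefines a shortened rule list under the hypothetical name `demo_replacements` and a hypothetical helper `apply_replacements` instead of relying on the cells above. The first sample is the JFLEG sentence shown at the end of this notebook; the second is invented for the demo.

```python
# Hypothetical, shortened rule list used only for this demonstration.
demo_replacements = [
    (" .", "."),
    (" ,", ","),
    (" n't", "n't"),
]


def apply_replacements(text, replacements):
    """Apply each (old, new) pair in order with str.replace."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text


jfleg_sample = (
    "Students can focus on only a few subjects they are intwerested in "
    "and they will become an experts in those areas ."
)
invented_sample = "I do n't know , sorry ."

cleaned_jfleg = apply_replacements(jfleg_sample, demo_replacements)
cleaned_invented = apply_replacements(invented_sample, demo_replacements)

print(cleaned_jfleg)
print(cleaned_invented)  # I don't know, sorry.
```

Only spacing is touched: the misspelling and grammar error in the JFLEG sentence are left for the model to learn to correct.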
%% Cell type:code id: tags:
``` python
def generate_csv(csv_path, dataset):
    """Apply spacing corrections and save matched input/target pairs to a CSV file."""
    with open(csv_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["input", "target"])
        for case in dataset:
            # Clean the source sentence once; it is reused for every correction.
            input_text = case["sentence"]
            input_text = correct_spacing(input_text)

            for correction in case["corrections"]:
                correction = correct_spacing(correction)
                # A few of the cases contain blank strings; skip those pairs.
                if input_text and correction:
                    writer.writerow([input_text, correction])
```
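%% Cell type:markdown id: tags:
Once a file has been written, it can be worth reading it back to confirm the header and that no row has an empty field. The following is a rough sketch using only the standard library. `summarize_pairs_csv` is a hypothetical helper, and the temporary file with two hand-written rows stands in for a real export.

```python
import csv
import tempfile
from pathlib import Path


def summarize_pairs_csv(csv_path):
    """Return (row_count, blank_count) for an input/target CSV file."""
    with open(csv_path, newline="") as csvfile:
        reader = csv.DictReader(csvfile)
        # The header is expected to match the one written by the export cell.
        if reader.fieldnames != ["input", "target"]:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        rows = list(reader)
    # Count rows where either side of the pair is missing or empty.
    blank = sum(1 for row in rows if not row["input"] or not row["target"])
    return len(rows), blank


# Stand-in file with invented rows, written the same way as the export cell.
demo_path = Path(tempfile.mkdtemp()) / "demo_pairs.csv"
with open(demo_path, "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["input", "target"])
    writer.writerow(["She go to school.", "She goes to school."])
    writer.writerow(["He have a car.", "He has a car."])

row_count, blank_count = summarize_pairs_csv(demo_path)
print(row_count, blank_count)
```

For a real export, the same function could be pointed at whatever path was passed to `generate_csv`.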
%% Cell type:markdown id: tags:
In JFLEG, the validation split is used as 'train' and the test split is used as 'validation'.
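A sketch of that split assignment is shown below. The exact call is an assumption: it uses `load_dataset("jfleg", split=...)` from the `datasets` library with the split names stated above. The helper `load_jfleg_splits`, its `loader` parameter, and the `SPLIT_FOR_ROLE` mapping are hypothetical names introduced for this example. Loading requires network access or a local cache.

```python
# Map the role used in this notebook to the JFLEG split that fills it.
SPLIT_FOR_ROLE = {"train": "validation", "validation": "test"}


def load_jfleg_splits(loader=None):
    """Return (train_dataset, eval_dataset) using the mapping above.

    `loader` defaults to datasets.load_dataset; it is a parameter only so the
    function can be exercised without downloading anything.
    """
    if loader is None:
        # Imported lazily; needs the `datasets` package to be installed.
        from datasets import load_dataset
        loader = load_dataset
    train = loader("jfleg", split=SPLIT_FOR_ROLE["train"])
    evaluation = loader("jfleg", split=SPLIT_FOR_ROLE["validation"])
    return train, evaluation
```

Calling `train_dataset, eval_dataset = load_jfleg_splits()` would then provide the two names that the next cell prints.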
%% Cell type:code id: tags:
``` python
print(train_dataset)
print(eval_dataset)
```
%% Output
Dataset({
    features: ['sentence', 'corrections'],
    num_rows: 755
})
Dataset({
    features: ['sentence', 'corrections'],
    num_rows: 748
})
%% Cell type:code id: tags:
``` python
print(train_dataset['sentence'][22])
print(train_dataset['corrections'][22])
```
%% Output
Students can focus on only a few subjects they are intwerested in and they will become an experts in those areas .
['Students can focus on only a few subjects they are interested in and they will become experts in those areas . ', 'Students can focus on only a few subjects they are interested in and they will become experts in those areas . ', 'Students can focus on only a few subjects they are interested in and they will become an expert in those areas . ', 'Students can focus on only a few subjects they are interested in and they will become an expert in those areas . ']