Copyright (c) Meta Platforms, Inc. and affiliates.
This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.
Use this notebook to pull in datasets and apply preprocessing. Most grammar datasets require preprocessing before they can be used for training. For example, JFLEG has 4 targets per input, so each input has to be rematched into 1:1 input/target pairings.
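The 1:1 rematching described above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: `expand_pairs` and `toy_record` are hypothetical names, and the record only assumes the `sentence` and `corrections` fields that the JFLEG rows expose.

```python
def expand_pairs(record):
    """Turn one record with several corrections into (input, target) pairs."""
    sentence = record["sentence"]
    # One output pair per non-empty correction; the input sentence is repeated.
    return [
        (sentence, target)
        for target in record["corrections"]
        if sentence and target
    ]


# Hypothetical toy record, shaped like a JFLEG row.
toy_record = {
    "sentence": "She go to school .",
    "corrections": ["She goes to school .", "She went to school .", ""],
}

pairs = expand_pairs(toy_record)
for source, target in pairs:
    print(source, "->", target)
```

The blank correction is skipped, so this record yields two pairs instead of three.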
%% Cell type:code id: tags:
``` python
import csv
from datasets import load_metric, load_dataset
from pathlib import Path
```
%% Cell type:code id: tags:
``` python
list_replacements = [
    (" .", "."),
    (" ,", ","),
    (" '", "'"),
    (" ?", "?"),
    (" !", "!"),
    (" :", ":"),
    (" ;", ";"),
    (" n't", "n't"),
    (" v", "v"),
    ("2 0 0 6", "2006"),
    ("5 5", "55"),
    ("4 0 0", "400"),
    ("1 7-5 0", "1750"),
    ("2 0 %", "20%"),
    ("5 0", "50"),
    ("1 2", "12"),
    ("1 0", "10"),
    ('" ballast water', '"ballast water')
]
```
%% Cell type:code id: tags:
``` python
def correct_spacing(item):
    """Apply every entry in list_replacements, in order, to one dataset string."""
    for fix in list_replacements:
        item = item.replace(fix[0], fix[1])
    return item
```
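%% Cell type:markdown id: tags:
To make the effect of the replacement pass concrete, here is a small self-contained sketch. It assumes a hypothetical three-entry subset of the table (`sample_replacements`) and a hypothetical helper `apply_replacements` that mirrors the loop in `correct_spacing`; the input sentence is made up for illustration.

```python
# Hypothetical subset of the replacement table, for illustration only.
sample_replacements = [
    (" .", "."),
    (" ,", ","),
    (" n't", "n't"),
]


def apply_replacements(text, replacements):
    """Apply each (old, new) pair to text, in list order."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text


raw = "I do n't know , really ."
cleaned = apply_replacements(raw, sample_replacements)
print(cleaned)  # I don't know, really.
```

Because `str.replace` matches substrings anywhere in the string, short patterns can also hit ordinary words: `apply_replacements("is very", [(" v", "v")])` returns `"isvery"`. Spot-checking a few cleaned rows is a cheap way to catch this.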
%% Cell type:code id: tags:
``` python
def generate_csv(csv_path, dataset):
    """Apply spacing corrections and save matched pairs to a CSV file as the dataset."""
    with open(csv_path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["input", "target"])
        for case in dataset:
            # Clean up the source sentence (no task prefix is added here).
            input_text = case["sentence"]
            input_text = correct_spacing(input_text)

            # Write one row per correction so each input maps 1:1 to a target.
            for correction in case["corrections"]:
                correction = correct_spacing(correction)
                # A few of the cases contain blank strings; skip them.
                if input_text and correction:
                    writer.writerow([input_text, correction])
```
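%% Cell type:markdown id: tags:
The CSV layout written above (an `input,target` header followed by one pair per row) can be checked by reading it back. The sketch below is only an illustration: it writes hypothetical pairs to an in-memory buffer instead of a file, then parses them with `csv.DictReader` from the standard library.

```python
import csv
import io

# Hypothetical pairs, standing in for the rows generate_csv would write.
pairs = [
    ("She go to school.", "She goes to school."),
    ("Yes , I agree.", "Yes, I agree."),
]

# newline="" mirrors the open(..., newline='') call used when writing the file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["input", "target"])
writer.writerows(pairs)

# Read the data back, keyed by the header row.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(len(rows), rows[1]["target"])
```

Fields that contain commas are quoted by `csv.writer` and restored by the reader, so sentences survive the round trip unchanged.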
%% Cell type:markdown id: tags:
In JFLEG, the validation split is used as 'train' and the test split is used as 'validation'.
Found cached dataset jfleg (/data/home/mreso/.cache/huggingface/datasets/jfleg/default/1.0.0/ed4ab2367351fe31949f48849ae6732b164f0d5ea6bb5d4357ff4293ac89511b)
Found cached dataset jfleg (/data/home/mreso/.cache/huggingface/datasets/jfleg/default/1.0.0/ed4ab2367351fe31949f48849ae6732b164f0d5ea6bb5d4357ff4293ac89511b)
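The split mapping described above could be loaded along these lines. This is a sketch under stated assumptions, not the notebook's actual cell: it assumes the Hugging Face `datasets` package, the dataset id `jfleg` that appears in the cache path, and hypothetical names `SPLIT_FOR` and `load_jfleg_splits`. Newer versions of the library may expect a namespaced dataset id instead.

```python
# Assumed mapping from the names used in this notebook to JFLEG's own splits.
SPLIT_FOR = {"train": "validation", "eval": "test"}


def load_jfleg_splits(dataset_id="jfleg"):
    """Load both splits; needs the `datasets` package plus network access or a cache."""
    # Imported lazily so the mapping above can be inspected without the package.
    from datasets import load_dataset

    return {
        name: load_dataset(dataset_id, split=split)
        for name, split in SPLIT_FOR.items()
    }


# Usage (requires the `datasets` package):
#   splits = load_jfleg_splits()
#   train_dataset, eval_dataset = splits["train"], splits["eval"]
```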
%% Cell type:code id: tags:
``` python
print(train_dataset)
print(eval_dataset)
```
%% Output
Dataset({
    features: ['sentence', 'corrections'],
    num_rows: 755
})
Dataset({
    features: ['sentence', 'corrections'],
    num_rows: 748
})
%% Cell type:code id: tags:
``` python
print(train_dataset['sentence'][22])
print(train_dataset['corrections'][22])
```
%% Output
Students can focus on only a few subjects they are intwerested in and they will become an experts in those areas .
['Students can focus on only a few subjects they are interested in and they will become experts in those areas . ', 'Students can focus on only a few subjects they are interested in and they will become experts in those areas . ', 'Students can focus on only a few subjects they are interested in and they will become an expert in those areas . ', 'Students can focus on only a few subjects they are interested in and they will become an expert in those areas . ']