[Experiment] Various tunings of the GPT2 RNM model


 I’ve previously made a Rick & Morty script generator using e-tony’s gpt2-rnm model on the Hugging Face Hub. This time, let’s experiment with various tunings of the gpt2-rnm model: tuning it with dialogue from a specific season of Rick & Morty, with the lines of a single character, and with data from a character in another animation. The data are the Rick and Morty scripts from Andrada Olteanu and the SpongeBob transcripts from Mikhail Gaerlan. Refer to this colab notebook to tune the gpt2-rnm model!


Specific season model: ainize/gpt2-rnm-with-season-1
Specific character model: ainize/gpt2-rnm-with-only-rick
Other animations model: ainize/gpt2-rnm-with-spongebob

Training with a specific season.

The Rick and Morty data contains dialogue from seasons 1 to 3. Here I will extract only one specific season and tune the gpt2-rnm model with it.

Data loading

First, let’s look at the CSV’s column information and a sample of the first row.

import csv
import os

input_path = '/content/RickAndMortyScripts.csv'

with open(input_path, 'r', encoding='utf-8-sig') as f:
    rdr = csv.reader(f)    # read csv

    csv_headings = next(rdr)    # column info
    first_line = next(rdr)      # First line sample

    for i in range(len(first_line)):
        # Shorten long values to make them easier to read.
        first_line[i] = first_line[i][:50] + '...' if len(first_line[i]) > 50 else first_line[i]

    result = {0: csv_headings, 1: first_line}

for i in range(len(result[0])):
    # Print each column name with its sample value.
    print(f'Index {i}: {result[0][i]} - {result[1][i]}')

The columns needed here are season no. (index 1), name (index 4), and line (index 5). Since I will train the model with season 1 data, I will extract only the rows from season 1.

character_line = 4   # column index of the character name
dialog_line = 5      # column index of the dialogue
season_line = 1      # column index of the season number
want_season = 1      # I want season 1
want_season = str(want_season)

result_path = f'/content/RickAndMorty-Season-{want_season}-Script.text'

with open(input_path, 'r', encoding='utf-8-sig') as f:
    # read csv
    rdr = csv.reader(f)

    with open(result_path, 'w', encoding='utf-8') as r:

        for idx, line in enumerate(rdr):
            if idx == 0:
                # The first row is the header, skip it.
                pass
            elif line[season_line] == want_season:
                # Get the character name and dialogue.
                # Normalize whitespace with split & join.
                # If there is no character name, unify it as Narrator.
                who = " ".join(str(line[character_line]).split()) if line[character_line] else 'Narrator'
                dialog = " ".join(str(line[dialog_line]).split())

                # Lines without dialogue are skipped.
                if dialog == '':
                    continue

                r.write(who + ': ' + dialog + '\n')
            else:
                pass


This completes the Season 1 data.
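If you want to make sure the extraction worked, here is a quick sanity check (my own addition, not part of the original notebook) that prints the first few lines of the new file:

# Optional check (my addition): print the first few lines of the extracted script.
with open(result_path, 'r', encoding='utf-8') as f:
    for _ in range(5):
        print(f.readline().rstrip())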

Tune the e-tony/rnm model with Rick and Morty season 1 data

Now it’s time to tune the rnm model. First, download and save the model and tokenizer.

from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments
from transformers import TextDataset, DataCollatorForLanguageModeling

# training parameters
output_dir = f'/content/rnm-season-{want_season}-model'
num_train_epochs = 3    # train for 3 epochs
batch_size = 16
block_size = 128

# Load the model and tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained('e-tony/gpt2-rnm')
model = GPT2LMHeadModel.from_pretrained('e-tony/gpt2-rnm')

# Save the model and tokenizer.
tokenizer.save_pretrained(output_dir)
model.save_pretrained(output_dir)

Load the dataset and data collator from the text file, and define the training arguments for tuning.

※ data collator: an object that forms batches from a dataset.

# Dataset loading
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=result_path,
    block_size=block_size,)

# make data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, 
    mlm=False,)

# make training arguments
training_args = TrainingArguments(
    output_dir=output_dir,     #The output directory
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=num_train_epochs, # number of training epochs
    per_device_train_batch_size=batch_size, # batch size for training
    save_steps=600,     # after 600 steps model is saved 
    )
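As a quick sanity check (my own addition, not in the original notebook), you can inspect one collated batch to confirm that, with mlm=False, the labels simply mirror the input_ids for causal language modeling:

# Quick peek (my addition): one collated batch from the dataset above.
batch = data_collator([train_dataset[i] for i in range(2)])
print(batch['input_ids'].shape)                             # (2, block_size)
print(bool((batch['input_ids'] == batch['labels']).all()))  # labels mirror input_ids for causal LM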

Now define the trainer using the loaded data and training arguments. It’s time to start training.

# Make trainer for gpt2
trainer = Trainer(
    model = model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)

# training start
trainer.train()

# save model
trainer.save_model()


Training is completed. The data is very small, so training finished quickly.

Test season 1 Model

Now it’s time to try out the Season 1 model. First, load the model and tokenizer.

# Load the model and tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = GPT2LMHeadModel.from_pretrained(output_dir)

Just put a sentence in the text variable and test it.

text = 'Rick: Haha!!!! Morty!!!'

# Encode text with tokenizer.
ids = tokenizer.encode(text, return_tensors='pt')

# Generate a sentence with a model.
final_outputs = model.generate(
        ids,
        do_sample=True,
        max_length=100,
        top_k=40,
        top_p=0.95,
    )

# Decode the generated sentences with the tokenizer.
print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))


Generated text of the Season 1 model.


Generated text of the original rnm model.

Hmmm… The generated result is good, but I can’t tell if there is any difference from the existing model…!! Still, it was a fun experiment!
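One way to check (my own addition, not in the original notebook; it assumes the imports and the text prompt from the cells above) is to run the same prompt through the original e-tony/gpt2-rnm model and compare:

# Optional comparison (my addition): generate from the original model with the same prompt.
base_tokenizer = GPT2Tokenizer.from_pretrained('e-tony/gpt2-rnm')
base_model = GPT2LMHeadModel.from_pretrained('e-tony/gpt2-rnm')

base_ids = base_tokenizer.encode(text, return_tensors='pt')
base_outputs = base_model.generate(base_ids, do_sample=True, max_length=100, top_k=40, top_p=0.95)
print(base_tokenizer.decode(base_outputs[0], skip_special_tokens=True))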



Original model: e-tony/gpt2-rnm
Model made by experiment: ainize/gpt2-rnm-with-season-1
Full code of this experiment: Tuning to Season 1


Training with a specific character.

In the previous experiment, I tuned the gpt2-rnm model on a specific season. So what happens if I tune it with only the dialogue of a specific character? Let’s find out.

Data loading

First, I load the data the same way as before and extract only Rick’s lines.

import csv
import os

input_path = '/content/RickAndMortyScripts.csv'
character_line = 4   # column index of the character name
dialog_line = 5      # column index of the dialogue
want_name = 'Rick'
result_path = f'/content/RickAndMorty-{want_name}-Script.text'

with open(input_path, 'r', encoding='utf-8-sig') as f:
    rdr = csv.reader(f)

    with open(result_path, 'w', encoding='utf-8') as r:

        for idx, line in enumerate(rdr):
            if idx == 0:
                # The first row is the header, skip it.
                pass
            elif line[character_line] == want_name:
                # Get the character name and dialogue.
                # Normalize whitespace with split & join.
                who = " ".join(str(line[character_line]).split())
                dialog = " ".join(str(line[dialog_line]).split())

                # Lines without dialogue are skipped.
                if dialog == '':
                    continue

                r.write(who + ': ' + dialog + '\n')
            else:
                pass


This completes the data containing only Rick’s conversations.
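As a quick check (my own addition, not in the original notebook), you can count how many lines of Rick dialogue were extracted:

# Optional check (my addition): count Rick's extracted lines.
with open(result_path, 'r', encoding='utf-8') as f:
    print(sum(1 for _ in f), 'lines of Rick dialogue')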

Tune the e-tony/rnm model with only Rick’s lines

Now let’s tune the model with only Rick’s lines. First, download and save the model and tokenizer.

from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments
from transformers import TextDataset, DataCollatorForLanguageModeling

output_dir = f'/content/rnm-{want_name}-model'
num_train_epochs = 1    # train for 1 epoch
batch_size = 16
block_size = 128

# Load the model and tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained('e-tony/gpt2-rnm')
model = GPT2LMHeadModel.from_pretrained('e-tony/gpt2-rnm')

# Save the model and tokenizer.
tokenizer.save_pretrained(output_dir)
model.save_pretrained(output_dir)

Then, define data and training arguments.

# Dataset loading
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=result_path,
    block_size=block_size,)

# make data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, 
    mlm=False,)

# make training arguments
training_args = TrainingArguments(
    output_dir=output_dir, #The output directory
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=num_train_epochs, # number of training epochs
    per_device_train_batch_size=batch_size, # batch size for training
    save_steps=600,     # after 600 steps model is saved 
    )

With the loaded data and training arguments, it’s time to define a trainer and start training.

# Make trainer for gpt2
trainer = Trainer(
    model = model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)

# training start
trainer.train()

# save model
trainer.save_model()


The data was so small that training finished quickly.

Test Rick’s Model

Let’s test the finished model.

# Load the model and tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = GPT2LMHeadModel.from_pretrained(output_dir)

text = 'Morty: Haha!!!! Rick!!!'

# Encode text with tokenizer.
ids = tokenizer.encode(text, return_tensors='pt')

# Generate a sentence with a model.
final_outputs = model.generate(
        ids,
        do_sample=True,
        max_length=100,
        top_k=40,
        top_p=0.95,
    )

# Decode the generated sentences with the tokenizer.
print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))

The result was a strange monologue by Rick. The experiment was not really a success, but it is quite an entertaining result.



Original model: e-tony/gpt2-rnm
Model made by experiment: ainize/gpt2-rnm-with-only-rick
Full code of this experiment: Tuning with Rick’s own lines

Training with other animations.

In this experiment, I will tune the model with SpongeBob data and Rick and Morty data.

Data loading

First, the SpongeBob data is spread across hundreds of transcript files, so they need to be merged into one file and then pre-processed.

import os
import unidecode

files_path = "/content/SpongeBob_SquarePants_Transcripts/"
spongebob_path = '/content/Spongebob-Script.txt'

file_list = os.listdir(files_path)

with open(spongebob_path, 'w', encoding='utf-8') as r:

    for file in file_list:
        name = file.split('.')

        if 'txt' not in name:
            continue

        # Read the files one by one and merge them into one file.
        with open(os.path.join(files_path, file), 'r', encoding='utf-8') as f:
            lines = f.readlines()

            count = 0

            for line in lines:
                if count == 0:
                    count = 1
                    continue

                r.write(line)


The files have been merged into one. Now I will pre-process.

with open(spongebob_path, 'r', encoding='utf-8-sig') as f:
    lines = f.readlines()

    with open(spongebob_path, 'w', encoding='utf-8-sig') as r:

        for line in lines:
            # Split the line on ':'.
            line = line.split(":")

            if len(line) < 2:
                # If there are fewer than 2 pieces,
                # there is no speaker or dialogue, so skip.
                continue
            elif len(line) > 2:
                # If there are more than 2 pieces,
                # rejoin the extra pieces onto the dialogue.
                for x in line[2:]:
                    line[1] += " : " + x

            # Transliterate broken characters with unidecode.
            # Normalize whitespace with split & join.
            line[0] = " ".join(unidecode.unidecode(line[0]).split())
            line[1] = " ".join(unidecode.unidecode(line[1]).split())

            if line[0] == "[":
                # If the speaker field is just '[', treat it as the Narrator.
                line = 'Narrator : ' + line[1]
            else:
                line = line[0] + ': ' + line[1]

            r.write(line.strip() + '\n')


Some lines in the data have no speaker, and some contain broken characters, so this logic handles them: using split and join, lines without speakers are dropped or labeled as Narrator, and broken characters are converted with the unidecode module.
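For reference (my own addition, not in the original notebook), unidecode simply transliterates non-ASCII characters to their closest ASCII equivalents:

# Small illustration (my addition) of unidecode's ASCII transliteration.
import unidecode

print(unidecode.unidecode("SpongeBob’s café"))   # -> "SpongeBob's cafe"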

The Rick and Morty data is also parsed from the CSV file. The parsing method is the same as in the previous two experiments, but this time every line is kept.

input_path = '/content/RickAndMortyScripts.csv'
character_line = 4   # column index of the character name
dialog_line = 5      # column index of the dialogue
result_path = '/content/RickAndMorty-Script.txt'

with open(input_path, 'r', encoding='utf-8-sig') as f:
    # read csv
    rdr = csv.reader(f)

    with open(result_path, 'w', encoding='utf-8') as r:

        for idx, line in enumerate(rdr):
            # The first row is the header, skip it.
            if idx == 0:
                pass
            else:
                # Get the character name and dialogue.
                # Normalize whitespace with split & join.
                # If there is no character name, unify it as Narrator.
                who = " ".join(str(line[character_line]).split()) if line[character_line] else 'Narrator'
                dialog = " ".join(str(line[dialog_line]).split())

                # Lines without dialogue are skipped.
                if dialog == '':
                    continue

                r.write(who + ': ' + dialog + '\n')


The Rick & Morty data is ready.

result_path = '/content/merged_script.txt'

with open(result_path, 'w', encoding='utf-8') as r:

    for file in ['/content/RickAndMorty-Script.txt', '/content/Spongebob-Script.txt']:
        name = file.split('.')

        if 'txt' not in name:
            continue

        # Read the files and combine them into one.
        with open(file, 'r', encoding='utf-8') as f:
            lines = f.readlines()

            count = 0

            for line in lines:
                if count == 0:
                    count = 1
                    continue

                r.write(line)

Finally, the training data is completed by combining the two files.
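As a quick check (my own addition, not in the original notebook), you can peek at the first and last lines of the merged file to confirm both scripts made it in:

# Optional check (my addition): confirm both scripts ended up in the merged file.
with open(result_path, 'r', encoding='utf-8') as f:
    lines = f.readlines()

print(len(lines), 'total lines')
print(lines[0].rstrip())    # should come from the Rick & Morty script
print(lines[-1].rstrip())   # should come from the SpongeBob script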

Tune the e-tony/rnm model with SpongeBob data

Load the necessary modules, models, and tokenizers, and declare variables.

from transformers import Trainer, TrainingArguments
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import TextDataset, DataCollatorForLanguageModeling

output_dir = '/content/rnm-with-spongebob-model'
num_train_epochs = 2    # train for 2 epochs
batch_size = 16
block_size = 128

# Load the model and tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained('e-tony/gpt2-rnm')
model = GPT2LMHeadModel.from_pretrained('e-tony/gpt2-rnm')

# Save the model and tokenizer.
tokenizer.save_pretrained(output_dir)
model.save_pretrained(output_dir)

Load the data set and data collator using a text file and define training arguments for the trainer.

# Dataset loading
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=result_path,
    block_size=block_size,)

# make data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, 
    mlm=False,)

# make training arguments
training_args = TrainingArguments(
    output_dir=output_dir, #The output directory
    overwrite_output_dir=True, #overwrite the content of the output directory
    num_train_epochs=num_train_epochs, # number of training epochs
    per_device_train_batch_size=batch_size, # batch size for training
    save_steps=10000,     # the model is saved every 10000 steps
    )

Declare the Trainer and start tuning.

# Make trainer for gpt2
trainer = Trainer(
    model = model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)

# training start
trainer.train()

# save model
trainer.save_model()

Unlike previous experiments, it took quite a while because there was a lot of data. Will Rick and Morty characters and SpongeBob characters have a conversation?
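If you want to speed up a run like this (my own addition, not in the original notebook), the same TrainingArguments can enable mixed precision when the Colab runtime has a GPU:

# Optional speed-up (my addition): the same arguments with mixed precision enabled.
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    save_steps=10000,
    fp16=True,    # only works on a CUDA GPU runtime
)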

Test Rick and Morty with SpongeBob Model

Now it’s time to test the model. Load the model and tokenizer, put a prompt in the text variable, and run it.

# Load the model and tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = GPT2LMHeadModel.from_pretrained(output_dir)

text = 'Rick: Hi Morty and SpongeBob!'

# Encode text with tokenizer.
ids = tokenizer.encode(text, return_tensors='pt')

# Generate a sentence with a model.
final_outputs = model.generate(
        ids,
        do_sample=True,
        max_length=100,
        top_k=40,
        top_p=0.95,
    )

# Decode the generated sentences with the tokenizer.
result = tokenizer.decode(final_outputs[0], skip_special_tokens=True).split('\n')

for r in result:
    print(r)

Rick, Morty, Sandy, and SpongeBob all ended up in one lively conversation. Interesting results came out. This experiment seems to be a success!
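If you would like to see several different crossover conversations from one prompt (my own addition, not in the original notebook), generate can return multiple samples at once:

# Optional (my addition): sample several continuations at once.
outputs = model.generate(
    ids,
    do_sample=True,
    max_length=100,
    top_k=40,
    top_p=0.95,
    num_return_sequences=3,
)

for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
    print('---')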



Original model: e-tony/gpt2-rnm
Model made by experiment: ainize/gpt2-rnm-with-spongebob
Full code of this experiment: Tuning with SpongeBob data
This model’s demo page: End point
