[Teachable NLP] GPT-2 Fairy Tales

Teachable NLP: link
Tabtab: link
Ainize: view API


Previously, I trained Teachable NLP’s GPT-2 model on Lord of the Rings text data. link

I found an interesting dataset on Kaggle: a collection of fairy tales. The data is CC0 (Public Domain), so anyone can use it freely without any copyright restrictions. Check out the details via the link. Now I’m going to train the GPT-2 model with it.

1. Data

Lovely Ilonka

There was once a king’s son who told his father that he wished to marry.

‘No, no!’ said the king; ‘you must not be in such a hurry. Wait till you
have done some great deed. My father did not let me marry till I had won
the golden sword you see me wear.’

The prince was much disappointed, but he never dreamed of disobeying his
father, and he began to think with all his might what he could do. It
was no use staying at home, so one day he wandered out into the world to
try his luck, and as he walked along he came to a little hut in which he
found an old woman crouching over the fire.

‘Good evening, mother. I see you have lived long in this world; do you
know anything about the three bulrushes?’

‘Yes, indeed, I’ve lived long and been much about in the world, but I
have never seen or heard anything of what you ask. Still, if you will
wait till to-morrow I may be able to tell you something.’

This is the data to be used this time. It’s a whopping 12.5 MB text file spanning 246,991 lines, including titles and whitespace! It’s good data, but some of it needs preprocessing.

Unnecessary whitespace and line breaks are not helpful for training; they can make the model’s output awkward, so I removed them.

I’m also going to give the titles special treatment, so that entering a title into the model generates a fairy tale. What is this special treatment? I’ll add a token so the model can identify the title. Teachable NLP currently has no way to add custom tokens, but a simple trick achieves a similar effect!

with open(source_dir + '/' + text_file, 'r', encoding='utf-8') as f:

    name = text_file.split('.')[0]

    with open(result_dir + '/' + name + '.txt', 'w', encoding='utf-8') as r:
        lines = f.readlines()

        # Counts consecutive blank lines.
        # The first line of the data is also a title, so it starts at 5
        # to satisfy the title condition on the very first pass.
        count = 5

        for line in lines:
            # Skip blank lines, keeping track of how many appear in a row.
            if line.strip() == "":
                count += 1
                continue
            # Judging from the data, a new fairy tale begins after more than
            # three blank lines have accumulated, and its first line is the title.
            elif count > 3:
                count = 0
                # Normalize the whitespace in the line; this collapses
                # consecutive spaces and removes the trailing newline.
                text = " ".join(line.split())
                # Wrap the title in <title> and </title>.
                # These serve as tokens that mark where a title is.
                r.write("\n<title>" + text + "</title>")
                continue

            # If one to three blank lines preceded this line (not enough to
            # start a new fairy tale), end the previous paragraph with a newline.
            if count > 0:
                r.write("\n ")
            count = 0

            # Normalize the whitespace in the line.
            text = " ".join(line.split())

            # Write the finished line, followed by a space so that
            # consecutive lines join into natural sentences.
            r.write(text + " ")

This is a simple Python script for preprocessing the fairy tale data. It drops the unnecessary lines with conditional statements, wraps each title in <title></title> tokens, and tidies the paragraphs by removing redundant newlines.

<title>Lovely Ilonka</title>
There was once a king’s son who told his father that he wished to marry.
‘No, no!’ said the king; ‘you must not be in such a hurry. Wait till you have done some great deed. My father did not let me marry till I had won the golden sword you see me wear.’
The prince was much disappointed, but he never dreamed of disobeying his father, and he began to think with all his might what he could do. It was no use staying at home, so one day he wandered out into the world to try his luck, and as he walked along he came to a little hut in which he found an old woman crouching over the fire.
‘Good evening, mother. I see you have lived long in this world; do you know anything about the three bulrushes?’
‘Yes, indeed, I’ve lived long and been much about in the world, but I have never seen or heard anything of what you ask. Still, if you will wait till to-morrow I may be able to tell you something.’
Well, he waited till the morning, and quite early the old woman appeared and took out a little pipe and blew in it, and in a moment all the crows in the world were flying about her. Not one was missing. Then she asked if they knew anything about the three bulrushes, but not one of them did.
The prince went on his way, and a little further on he found another hut in which lived an old man. On being questioned the old man said he knew nothing, but begged the prince to stay overnight, and the next morning the old man called all the ravens together, but they too had nothing to tell.
The prince bade him farewell and set out. He wandered so far that he crossed seven kingdoms, and at last, one evening, he came to a little house in which was an old woman.

Preprocessing is complete! The result is 12.5 MB of data in 41,882 lines. Fine-tuning the GPT-2 model on this will make a great fairy tale generator!
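As a quick sanity check, a few lines of Python can report the size of the preprocessed file, its line count, and how many <title> tokens (that is, fairy tales) it contains. The file name below is just a placeholder for wherever the preprocessing script wrote its output:

import os

path = "fairy_tales_clean.txt"  # placeholder: the preprocessed output file

with open(path, encoding="utf-8") as f:
    lines = f.readlines()

size_mb = os.path.getsize(path) / (1024 * 1024)
titles = sum(1 for line in lines if "<title>" in line)

print(f"{size_mb:.1f} MB, {len(lines)} lines, {titles} fairy tales")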

2. Training

This time, I set the Model Type to small (GPT-2 small) and trained for 5 epochs. I used the small model because it can produce varied results even when fine-tuned on a small amount of data.
Training took about an hour after I pressed the Train Model button. Fortunately, Teachable NLP keeps running even when the browser window is closed, so in the meantime I could do other things, such as browsing other people’s Showcases and trying out Ainize’s projects.
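Teachable NLP handles all of this through the web UI, so no code is needed. Just for reference, a roughly equivalent local fine-tuning setup with Hugging Face transformers might look like the sketch below; the file path, block size, and batch size are assumptions, and this is not what Teachable NLP runs internally.

from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # GPT-2 small, as used above

# Chunk the preprocessed file into fixed-length training blocks.
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="fairy_tales_clean.txt",  # placeholder path
                            block_size=512)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="gpt2-fairy-tales",
                         num_train_epochs=5,              # 5 epochs, as above
                         per_device_train_batch_size=2)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, data_collator=collator)
trainer.train()
trainer.save_model("gpt2-fairy-tales")  # weights can be reloaded later for generation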

3. Generate

<title>In my dream</title>
From the Danish.
There was once a man who had three sons, and they were all married, and very proud.
The youngest of the three princes was named Martin, for he was clever and agreeable, and besides, he was also a courageous youth, and knew how to make a voyage. So he said, ‘I am going to take a voyage to some great country, and you shall go with me.’
‘Do not trouble yourself about that,’ said the elder; ‘the journey is worth doing, and it will bring you luck.’
So he walked on with his brothers across the sea to the town where their father had lived before they were married. Then they entered a beautiful house, and the young man said, ‘Now listen; you went out into the world, and what do you find in it?’
‘Alas!’ said he, ‘I left home to seek my fortune, and now I have found it! Here I must stay till two beautiful days, and then the three days are just as full of joy as ever.’
‘But what happens after that?’ asked the younger brother, wondering at his brother’s strange history. ‘Oh, well! I am going to set out on the journey, and I will go on with you.’
And so he walked on with his brothers across the sea to the land of the little town where they had lived before they were married. The elder brother thought that he would like to stay with him, for he could accompany him on the journey, but the younger one wished to go on with him, for nothing.
When they reached home and saw his sister, who was much farther off than the other two days, and she was much more beautiful, and also younger, he said to himself, 'Why, but he would like to travel too. So he said to travel too, and he said, but he did not like it. So he travelled and he stayed at once.

This is the result of giving “In my dream” as input to the GPT-2 model in Tabtab. A plausible story about three brothers was created even though I only typed the title. Everyone, try the model I trained :slightly_smiling_face:
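Tabtab and the Ainize API are the easiest ways to play with the model, but if you have a locally fine-tuned checkpoint (for example, one saved by the sketch in the Training section), prompting it with the same <title> token would look roughly like this; the checkpoint path and sampling settings are assumptions:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")        # tokenizer is unchanged by fine-tuning
model = GPT2LMHeadModel.from_pretrained("gpt2-fairy-tales")  # placeholder: your fine-tuned checkpoint
model.eval()

# Prompt with the same <title> token that was added during preprocessing.
prompt = "<title>In my dream</title>\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(**inputs,
                            max_length=300,
                            do_sample=True,
                            top_k=50,
                            top_p=0.95,
                            pad_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(output[0]))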




