Teachable NLP : Training the Model
TabTab : Write your own Résumé
Ainize : Coming soon…
1. Background
Are you preparing to get a new job? I made a résumé generator service for developers who find it harder to introduce themselves in writing than in code. I built it with Teachable NLP, which trains GPT-2 on a text file of résumés. It is super easy if you follow the steps below.
My résumé was written in only a few minutes.
- Web Frontend with HTML, CSS
- Creating Web RESTful API
- Taking part in preprocessing steps of machine learning mainly missing value treatment, outlier detection, encoding, scaling, feature selection.
- Testing machine learn algorithms in python. optimizing of existing algorithms.
Isn’t it interesting? Let me show you how to make the résumé generator!
2. Acquiring Dataset
I acquired an up-to-date resume dataset from Kaggle and used it to train GPT-2 in Teachable NLP. The file is a .csv table with 2 columns, ‘Category’ and ‘Resume’.
3. Preprocessing Text
I used the Python package Pandas and did the preprocessing in a Jupyter notebook.
First of all, I loaded the table into a DataFrame and checked the basics. There is enough resume data for developers, and there are no null values in the table. If you find null values in your data, please remove them. Fortunately, in my data both columns are non-null.
import pandas as pd
import numpy as np
# Read File
data = pd.read_csv('/opt/notebooks/UpdatedResumeDataSet.csv')
# Check categories
print(data['Category'].unique())
"""
['Data Science' 'HR' 'Advocate' 'Arts' 'Web Designing'
'Mechanical Engineer' 'Sales' 'Health and fitness' 'Civil Engineer'
'Java Developer' 'Business Analyst' 'SAP Developer' 'Automation Testing'
'Electrical Engineering' 'Operations Manager' 'Python Developer'
'DevOps Engineer' 'Network Security Engineer' 'PMO' 'Database' 'Hadoop'
'ETL Developer' 'DotNet Developer' 'Blockchain' 'Testing']
"""
# Check the numbers of data
print(data['Category'].value_counts())
"""
Java Developer 84
Testing 70
DevOps Engineer 55
Python Developer 48
Web Designing 45
HR 44
...
"""
# Check null value
data['Resume'].isna().sum()
"""
0
"""
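If your copy of the data does contain null values, one way to remove them is sketched below. The tiny DataFrame is invented for illustration; only the column names match the dataset.

```python
import pandas as pd

# Toy stand-in for the resume table (values are invented for illustration)
toy = pd.DataFrame({
    "Category": ["Java Developer", "HR", "Testing"],
    "Resume": ["Skills * Java", None, "Skills * Selenium"],
})

# Count missing values per column
null_counts = toy.isna().sum()

# Drop rows whose Resume is missing, then renumber the index
toy_clean = toy.dropna(subset=["Resume"]).reset_index(drop=True)

print(null_counts["Resume"])  # number of missing resumes before cleaning
print(len(toy_clean))         # number of rows left after cleaning
```

Passing `subset=["Resume"]` drops a row only when the resume text itself is missing, which is the column that ends up in the training file.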
Then, by following the two steps below, you can get data specialized for developers.
A) Remove Unnecessary Words
B) Extract Resume Specialized For Developer
A) Remove Unnecessary Words
In the cleaning stage, numbers, stopwords (meaningless word tokens), and extremely short words are usually removed. However, I skipped those steps. When I removed them and trained GPT-2 on the resulting file, the resume formatting was lost and readability became poor. For example, the sentence "HTML Experience - Less than 3 months" becomes "html experience less than months" after cleaning, which sounds a little weird. Also, developers use lots of short abbreviations (e.g. nltk, api), so simply dropping words based on their length was not a good fit for this data.
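To see the effect, here is a minimal sketch of that kind of aggressive cleaning. The function `aggressive_clean` is a hypothetical helper written for this illustration, not the code I used; it lowercases the text and strips digits and punctuation.

```python
import re
import string

def aggressive_clean(text):
    """Hypothetical 'typical' cleaning: lowercase, drop digits and punctuation."""
    text = text.lower()
    # Remove every run of digits
    text = re.sub(r"\d+", "", text)
    # Remove all ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse the leftover whitespace
    return re.sub(r"\s+", " ", text).strip()

cleaned = aggressive_clean("HTML Experience - Less than 3 months")
print(cleaned)  # html experience less than months
```

The duration and the dash are gone, so the model never sees the "skill - experience - duration" pattern it is supposed to learn.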
For example, here is the first resume in the DataFrame. I have to remove the * that marks unordered list items, and the characters that cause encoding errors.
I considered removing parentheses and commas, but I didn’t, because I thought the meaning of the libraries, packages, and frameworks would be lost without them. So I kept them.
Rather, I thought that numbers, -, (, ), and , would let users know the format of a resume.
"Skills * Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, JavaScript/JQuery. * Machine learning: Regression, SVM, Na횄짱ve Bayes, KNN, Random Forest, Decision Trees, Boosting techniques, Cluster Analysis, Word Embedding, Sentiment Analysis, Natural Language processing, Dimensionality reduction, Topic Modelling (LDA, NMF), PCA & Neural Nets. * Database Visualizations: Mysql, SqlServer, Cassandra, Hbase, ElasticSearch D3.js, DC.js, Plotly, kibana, matplotlib, ggplot, Tableau. * Others: Regular Expression, HTML, CSS, Angular 6, Logstash, Kafka, Python Flask, Git, Docker, computer vision - Open CV and understanding of Deep learning.Education Details
Data Science Assurance Associate
Data Science Assurance Associate - Ernst & Young LLP
Skill Details
JAVASCRIPT- Exprience - 24 months
jQuery- Exprience - 24 months
Python- Exprience - 24 monthsCompany Details
.
.
.
MULTIPLE DATA SCIENCE AND ANALYTIC PROJECTS (USA CLIENTS)
TEXT ANALYTICS - MOTOR VEHICLE CUSTOMER REVIEW DATA * Received customer feedback survey data for past one year. Performed sentiment (Positive, Negative & Neutral) and time series analysis on customer comments across all 4 categories.
* Created heat map of terms by survey category based on frequency of words * Extracted Positive and Negative words across all the Survey categories and plotted Word cloud.
* Created customized tableau dashboards for effective reporting and visualizations.
CHATBOT * Developed a user friendly chatbot for one of our Products which handle simple questions about hours of operation, reservation options and so on.
* This chat bot serves entire product related questions. Giving overview of tool via QA platform and also give recommendation responses so that user question to build chain of relevant answer.
* This too has intelligence to build the pipeline of questions as per user requirement and asks the relevant /recommended questions.
.
.
.
창짖 FAP is a Fraud Analytics and investigative platform with inbuilt case manager and suite of Analytics for various ERP systems.
* It can be used by clients to interrogate their Accounting systems for identifying the anomalies which can be indicators of fraud by running advanced analytics
Tools & Technologies: HTML, JavaScript, SqlServer, JQuery, CSS, Bootstrap, Node.js, D3.js, DC.js"
The preprocessing is implemented in Python.
import re

def clean_text(text):
    # Lowercase everything
    text = text.lower()
    # Digits are kept on purpose (they carry the resume format)
    # text = ''.join([ch for ch in text if not ch.isdigit()])
    # Remove * (asterisk) used as a list marker
    text = re.sub(r'\*', '', text)
    # Replace each non-ASCII character with a space
    text = re.sub(r'[^\x00-\x7f]', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

# Store the cleaned text in a new column
data['cleaned_text'] = data['Resume'].apply(clean_text)
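You can clean the text with regex (regular expressions). It looks complicated, but each pattern is simple: \* matches an asterisk, [^\x00-\x7f] matches any single non-ASCII character, and \s+ matches a run of whitespace.
I added the preprocessed data to the DataFrame as a new column, cleaned_text, using the apply function.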
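The three substitutions can be tried one at a time on a short string. This is only an illustrative sketch; the sample text is made up, and lowercasing is left out so each regex step is easy to follow.

```python
import re

# Made-up sample with a list marker, extra spaces, a newline and a non-ASCII letter
sample = "Skills * Python (pandas, numpy)   Na\u00efve Bayes\n* SQL"

# Step 1: drop the asterisks that mark list items
no_asterisk = re.sub(r"\*", "", sample)

# Step 2: replace each non-ASCII character with a space
ascii_only = re.sub(r"[^\x00-\x7f]", " ", no_asterisk)

# Step 3: collapse spaces, tabs and newlines into single spaces
collapsed = re.sub(r"\s+", " ", ascii_only)

print(collapsed)  # Skills Python (pandas, numpy) Na ve Bayes SQL
```

Note that a non-ASCII letter inside a word leaves a gap ("Na ve"), which is the price of this simple approach to the encoding errors.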
B) Extract Resume Specialized For Developer
The Category column contains several jobs, including HR, Arts, and Mechanical Engineer. I kept only the Resume entries whose Category is a developer role, and then saved them to a text file.
java = data['Category'] == 'Java Developer'
testing = data['Category'] == 'Testing'
devops = data['Category'] == 'DevOps Engineer'
python = data['Category'] == 'Python Developer'
hadoop = data['Category'] == 'Hadoop'
etl = data['Category'] == 'ETL Developer'
block = data['Category'] == 'Blockchain'
dt = data['Category'] == 'Data Science'
database = data['Category'] == 'Database'
dn = data['Category'] == 'DotNet Developer'
network = data['Category'] == 'Network Security Engineer'
sap = data['Category'] == 'SAP Developer'
cleaned_data = data[java|testing|devops|python|hadoop|etl|block|dt|database|dn|network|sap]
# Join all resumes into one text
result = ""
for idx, row in cleaned_data.iterrows():
    result = result + row['cleaned_text'] + " "
# Save the text to a file (the with block closes it automatically)
with open("/opt/notebooks/developer.txt", "w") as f:
    f.write(result)
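As an alternative, the same filter can be written more compactly with `Series.isin`. This is a sketch under assumptions: the list name `DEVELOPER_CATEGORIES` and the toy DataFrame are hypothetical, and `" ".join` separates resumes with a single space without the trailing space that the loop above adds.

```python
import pandas as pd

# Hypothetical name; the values are the developer categories selected above
DEVELOPER_CATEGORIES = [
    "Java Developer", "Testing", "DevOps Engineer", "Python Developer",
    "Hadoop", "ETL Developer", "Blockchain", "Data Science",
    "Database", "DotNet Developer", "Network Security Engineer", "SAP Developer",
]

# Toy stand-in for the cleaned DataFrame (values invented for illustration)
toy = pd.DataFrame({
    "Category": ["Java Developer", "HR", "Hadoop", "Arts"],
    "cleaned_text": ["skills java", "skills hiring", "skills hive", "skills painting"],
})

# Keep only rows whose Category is in the developer list
developer_rows = toy[toy["Category"].isin(DEVELOPER_CATEGORIES)]

# Join the cleaned resumes into one training text
corpus = " ".join(developer_rows["cleaned_text"])
print(corpus)  # skills java skills hive
```

Adding or removing a category then means editing one list instead of a boolean mask per category.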
4. Teachable NLP
Teachable NLP is a program that fine-tunes GPT-2 on a text (.txt) file without requiring you to write any NLP code. You upload the preprocessed text file, start training, and get a fine-tuned GPT-2 model. I was worried that the dataset was not large enough, so I chose the medium model size and set the number of epochs to 3. In TabTab, you can test the model and generate a resume.
Write your own perfect résumé by choosing the most appropriate expression out of the 5 candidate sentences, and then show me your résumé in the Forum.