We all love TED Talks, don’t we? Recently I came across a dataset on Kaggle (https://www.kaggle.com/rounakbanik/ted-talks). This dataset gave me an opportunity to put some NLP and text mining techniques to the test. If you look at the data, you will realise that there are two files: one contains the metadata of the different TED Talks, such as speaker name, title, length of the talk, list of similar talks, etc. The other file has the complete transcripts of the talks. This got me thinking:
Given the transcripts of many TED talks, can I come up with a way to recommend talks based on their similarity, just as the official TED page does?
The data came in as a tabular flat file; the transcript for each talk was stored in a row, in a column named transcript. Here is how the file looked:
import pandas as pd

# Load the transcripts file; each row holds one talk's transcript and its url
transcripts = pd.read_csv("E:\\Kaggle\\ted-data\\transcripts.csv")
transcripts.head()
After examining the data, I figured out that I could easily extract the title of each talk from its url. My eventual goal was to use the text in the transcript column to create a measure of similarity, and then recommend the four most similar titles for a given talk. Separating the title from the url was a simple string split operation, as shown below:
# The last segment of each url is a slug that serves as the talk's title
transcripts['title'] = transcripts['url'].map(lambda x: x.split("/")[-1])
transcripts.head()
At this point, I was ready to begin piecing together the components that would help me build a talk recommender. In order to achieve this I had to:
- create a representation of each transcript that is amenable to comparison,
- compute a measure of similarity between these representations, and
- use that measure to pick the four most similar talks for any given talk.
Since our final goal is to recommend talks based on the similarity of their content, the first thing we have to do is create a representation of the transcripts that is amenable to comparison. One way of doing this is to create a tf-idf vector for each transcript. But what is this tf-idf business anyway? Let’s discuss that first.
To represent text, we will think of each transcript as one “document” and the set of all documents as a “corpus”. We will then create a vector for each document that counts how often each word occurs in it: one entry per word in the corpus vocabulary, holding the number of times that word appears in the document.
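To make this concrete, here is a minimal sketch on a toy two-document corpus, using scikit-learn’s CountVectorizer (the sentences are made up purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: two tiny "documents"
docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]

Each row is one document and each column is one vocabulary word, so the documents can now be compared as ordinary numeric vectors.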
Raw counts, however, give very common words like “the” just as much weight as distinctive ones. In order to understand how tf-idf helps in identifying the importance of words, let’s do a thought experiment and ask ourselves: what determines whether a word is important?
A word is important in a document if it occurs a lot in that document, but rarely in other documents in the corpus. Term frequency measures how often a word appears in a given document, while inverse document frequency measures how rare the word is across the corpus. The product of these two quantities measures the importance of the word and is known as tf-idf. Creating a tf-idf representation is fairly straightforward if you are working with a machine learning framework such as scikit-learn.
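In symbols, the textbook definition is:

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t))

Here tf(t, d) is the number of times term t appears in document d, N is the total number of documents in the corpus, and df(t) is the number of documents that contain t. (scikit-learn’s TfidfVectorizer actually uses a smoothed variant of idf and L2-normalizes each document vector by default, but the intuition is the same.) Here is how it looks in code: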
from sklearn.feature_extraction import text

# Build a tf-idf vector for every transcript, dropping common English stop words
Text = transcripts['transcript'].tolist()
tfidf = text.TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(Text)
# print(matrix.shape)  # (number of talks, vocabulary size)
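As a quick sanity check, we can peek at the words carrying the highest tf-idf weights in the first transcript (a sketch, assuming the tfidf and matrix variables fitted above):

words = tfidf.get_feature_names_out()
first = matrix[0].toarray().ravel()
print([words[i] for i in first.argsort()[-5:]])  # five highest-weighted words in talk 0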
Once we have sorted out the issue of representing the transcripts as vectors that take into account the importance of words, we are all set to tackle the next question: how do we find out which documents (in our case, TED talk transcripts) are similar to a given document?
To find similar documents, we need to compute a measure of similarity between them. When dealing with tf-idf vectors, we usually use cosine similarity.
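For two document vectors A and B, cosine similarity is the cosine of the angle between them:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)

For tf-idf vectors, which have no negative entries, it ranges from 0 (no shared words) to 1 (identical word proportions). And since TfidfVectorizer L2-normalizes its rows by default, the denominator here is 1, so the cosine similarity reduces to a plain dot product.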
Again, doing this with sklearn was very straightforward:
### Get Similarity Scores using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# sim_unigram[i][j] is the cosine similarity between transcripts i and j
sim_unigram = cosine_similarity(matrix)
All I had to do now was, for each transcript, find the four most similar ones based on cosine similarity. Algorithmically, this amounts to finding, for each row in the cosine similarity matrix constructed above, the indices of the four columns most similar to the document (transcript, in our case) corresponding to that row, excluding the row’s own column, since every document is trivially most similar to itself. This was accomplished using a few lines of code:
def get_similar_articles(x):
    # argsort is ascending, so the last index is the talk itself (similarity 1.0);
    # the four entries before it are the four most similar other talks
    return ",".join(transcripts['title'].loc[x.argsort()[-5:-1]])

transcripts['similar_articles_unigram'] = [get_similar_articles(x) for x in sim_unigram]
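To see why the [-5:-1] slice picks out the right indices, here is a quick illustration on a made-up similarity row (the numbers are hypothetical):

import numpy as np

row = np.array([0.10, 0.80, 0.05, 1.00, 0.30, 0.60])  # 1.00 = the talk vs itself
print(row.argsort())         # [2 0 4 5 1 3] -- indices in ascending order of similarity
print(row.argsort()[-5:-1])  # [0 4 5 1] -- the four most similar other talks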
Let’s check how we fared by examining the recommendations. Let’s pick any TED Talk title from the list; say we pick:
transcripts['title'].str.replace("_"," ").str.upper().str.strip()[1]
Then, based on our analysis, the four most similar titles are:
transcripts['similar_articles_unigram'].str.replace("_"," ").str.upper().str.strip().str.split("\n")[1]
You can clearly see that by using tf-idf vectors to compare the transcripts of the talks, we were able to pick up talks that were on similar themes. You can find the complete code HERE.