Language-Independent Document Clustering
I explored a novel way of clustering documents that is language-independent. The crux is that the same process applies to text in any language; only the data changes.
Let me note the approach in bullet points:
- Build word vectors from the available data.
- Build word clusters of the document word vectors.
- Find which cluster best represents the document.
- Aggregate these clusters (similar clusters go together).
- Documents with similar clusters go together.
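Steps 3 to 5 boil down to a simple grouping step. Here is a toy sketch; the doc-to-cluster mapping is hypothetical, standing in for the output of the cluster-matching step:

```python
from collections import defaultdict

# Hypothetical output of step 3: document id -> best-matching cluster id
doc_to_cluster = {"doc1": 0, "doc2": 1, "doc3": 0}

# Steps 4-5: documents sharing a representative cluster go together
groups = defaultdict(list)
for doc, cluster in doc_to_cluster.items():
    groups[cluster].append(doc)

print(dict(groups))  # {0: ['doc1', 'doc3'], 1: ['doc2']}
```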
Now we will walk through the process and code, starting with the word vectors. But first, a reminder:
"In Natural Language Processing (NLP), 90% of the work is preprocessing." — Some Researcher
So make sure you do all the preprocessing your data needs. Since this is a proof of concept (POC), I have excluded the preprocessing step.
We will use the gensim Python package to build word vectors. The function below creates the vectors and saves them in binary format.
# Create word vectors
import bz2
from gensim.models import Word2Vec

def create_vectors(filename):
    sentences = []
    with bz2.open(filename, 'rt') as df:
        for line in df:
            sentences.append(" ".join(line.split(",")).split())
    # gensim 4.x renamed the `size` parameter to `vector_size`
    model = Word2Vec(sentences, vector_size=25, window=5, min_count=1, workers=4)
    model.save(filename + ".vec")  # line changed here; refer to the code on GitHub
    return model
Once the word vectors are ready, we can move on to the next step, which is particularly exciting. I borrowed it from a blog post here. It's very simple, and it works!
Let’s say we have a document with n words. We start with the first word and look up its vector in the word-vector model; that vector starts the first cluster. Moving forward, each word either opens a new cluster with probability 1/(1+n) or is assigned to an existing cluster. When assigning to an existing cluster, we pick the most similar one. This approach separates the words into groups that make sense. Take a look:
More examples are in the notebook, here.
The algorithm above is known as the Chinese Restaurant Process.
I have put the code on GitHub, along with an accompanying notebook you can look through.
Improvements
A few suggestions to improve the above algorithm.
- More Data
- Better Preprocessing
- Higher-dimensional vectors (we used 25 dimensions)
- A better way of matching clusters (other than cosine similarity)
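For reference, the cosine matching mentioned above boils down to a single formula, shown here as a small self-contained helper:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors, e.g. a word vector
    # and a cluster centroid
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0 (identical)
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (orthogonal)
```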
Note: This is a very experimental approach. A lot is unclear and undefined as of now. I will keep improving and completing the approach. Your comments are appreciated.
For any questions and inquiries, visit us at thinkitive