Language-Independent Document Clustering


I explored a novel approach to document clustering that is language-independent. The crux is that it applies to any text in any language: the process stays the same as the data changes.

Let me note the approach in bullet points:

  1. Build word vectors from the available data.
  2. Build word clusters of the document word vectors.
  3. Find which cluster best represents the document.
  4. Aggregate these clusters (similar clusters go together).
  5. Documents with similar clusters go together.

Now we will walk through the process and the code, starting with creating word vectors. But before that, remember:

In Natural Language Processing (NLP) 90% of the work is preprocessing — Some Researcher

So make sure you do all the preprocessing your data needs. Since this is a proof of concept (POC), the preprocessing step is left out here.
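For illustration only (this is not the author's pipeline), a minimal language-agnostic cleanup could be as simple as lowercasing, dropping punctuation, and collapsing whitespace:

import re

def preprocess(line):
    # Minimal, language-agnostic cleanup: lowercase, strip punctuation,
    # collapse whitespace. A real pipeline would go much further
    # (stop-word removal, normalization, etc.) depending on the data.
    line = line.lower()
    line = re.sub(r"[^\w\s]", " ", line)
    return re.sub(r"\s+", " ", line).strip()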

We will use the gensim Python package to build word vectors. The function below creates the vectors and saves them in binary format.

# Create word vectors
import bz2
from gensim.models import Word2Vec

def create_vectors(filename):
    sentences = []
    # Each line of the bz2-compressed corpus becomes one sentence;
    # commas are treated as token separators along with whitespace.
    with bz2.open(filename, 'rt') as df:
        for l in df:
            sentences.append(" ".join(l.split(",")).split())
    # Note: in gensim 4.x the `size` argument is called `vector_size`.
    model = Word2Vec(sentences, size=25, window=5, min_count=1, workers=4)
    model.save(filename + ".vec")  # line simplified for the post; refer to the code on GitHub
    return model
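A quick usage sketch (the corpus path is a placeholder of my own, not from the post): train once, then reload the saved vectors instead of retraining.

# hypothetical corpus path, used only for illustration
model = create_vectors("corpus.csv.bz2")

# later, reload the saved vectors instead of retraining
model = Word2Vec.load("corpus.csv.bz2.vec")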

Once the word vectors are ready, we can move on to the next step. This step is particularly exciting; I stole the idea from a blog post. It’s very simple and it works!

Let’s say we have a document with n words. We start with the first word and look up its vector in the word2vec model; that vector starts the first cluster. Moving forward, each word is assigned either to a new cluster, with probability 1/(1+n), or to an existing cluster; in the latter case it goes to the most similar cluster. This approach separates the words into groups that make sense. Take a look:

[Figure: example word clusters (Language-Independent Document Clustering)]

There are more examples in the accompanying notebook here.

This algorithm is also known as the Chinese Restaurant Process; its full implementation is in the code on GitHub mentioned below.
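Purely as a sketch of the idea (not the author's exact code), assuming that a cluster is summarised by the mean of its word vectors, that similarity is cosine similarity (the post mentions the cosine approach in the improvements below), that n counts the words seen so far, and that out-of-vocabulary words are skipped, it might look like this:

import random
import numpy as np

def cosine(a, b):
    # cosine similarity between two vectors (epsilon avoids divide-by-zero)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def crp_word_clusters(words, wv):
    # Chinese-Restaurant-Process-style clustering of one document's words.
    # `wv` is the trained vectors (model.wv); each cluster is summarised by
    # the mean of its word vectors, and similarity is cosine similarity.
    vectors = []        # one list of word vectors per cluster
    cluster_words = []  # the words assigned to each cluster, for inspection
    for n, word in enumerate(words):
        if word not in wv:      # skip out-of-vocabulary words
            continue
        vec = wv[word]
        # open a new cluster with probability 1/(1+n); here n counts the
        # words seen so far (one reading of the description above)
        if not vectors or random.random() < 1.0 / (1 + n):
            vectors.append([vec])
            cluster_words.append([word])
        else:
            centroids = [np.mean(c, axis=0) for c in vectors]
            best = max(range(len(vectors)), key=lambda i: cosine(vec, centroids[i]))
            vectors[best].append(vec)
            cluster_words[best].append(word)
    return cluster_words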

I have put the code on GitHub, along with an accompanying notebook you can have a look at.
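The post does not show code for steps 3 to 5 of the outline (picking the cluster that best represents a document and grouping documents with similar clusters). One simple reading, reusing crp_word_clusters and cosine from the sketch above, taking the largest cluster as a document's signature, and grouping signatures by a hypothetical cosine threshold, could be:

import numpy as np

def document_signature(words, wv):
    # Step 3: represent a document by the centroid of its largest word cluster.
    # (Which cluster "best represents" a document isn't specified in the post;
    # cluster size is one simple choice.)
    clusters = crp_word_clusters(words, wv)
    biggest = max(clusters, key=len)
    return np.mean([wv[w] for w in biggest], axis=0)

def group_documents(docs, wv, threshold=0.8):
    # Steps 4-5: greedily group documents whose signatures are cosine-similar.
    # The 0.8 threshold is a hypothetical value, not from the post.
    groups, signatures = [], []
    for doc in docs:            # each doc is a list of tokens
        sig = document_signature(doc, wv)
        for gi, gsig in enumerate(signatures):
            if cosine(sig, gsig) >= threshold:
                groups[gi].append(doc)
                break
        else:
            groups.append([doc])
            signatures.append(sig)
    return groups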

Improvements 

A few suggestions for improving the above algorithm:

  1. More data
  2. Better preprocessing
  3. Higher-dimensional vectors (we used 25 dimensions)
  4. A better approach to matching clusters (other than cosine similarity)

Note: This is a very experimental approach. A lot is still unclear and undefined. I will keep improving and completing it. Your comments are appreciated.


For any questions and inquiries, visit us at Thinkitive.

Kaustubh

I look after Technology at Thinkitive. Interested in Machine Learning, Deep Learning, IoT, TinyML, and many other areas where machine learning is applied.
