Determining the Right Cluster

07 Feb 2023

When confronted with a broad spectrum of data, how do we analyse it meaningfully? That was the question Professor Anthony Tung, Deputy Head of the Department of Computer Science at the National University of Singapore, sought to answer at the DSTA Academy Technology Talk.

Speaking to DSTA staff on 6 February 2023, Professor Tung zoomed in on clustering, an unsupervised machine learning method that groups similar data points and is commonly used in data analytics applications.


He explained that despite its broad applications in many fields, clustering can produce unreliable and inconsistent results, especially in high-dimensional space with complex data types and attributes. This is because most clustering algorithms can only cater to specific data types, with features that are either all numerical or all categorical. Additionally, clustering high-dimensional data typically requires dimensionality reduction, which results in information loss and can lead to the formation of sub-optimal clusters. The difficulty of choosing the best settings for feature weights and scaling could also affect the quality of the resulting clusters, Professor Tung elaborated.

To overcome these challenges, Professor Tung proposed a novel approach of representing the cluster centroid using only a selected number of informative dimensions. He termed this centre representation FreqItem, and shared how it serves as the basis for sparse data clustering.
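As a rough illustration of a sparse centre, the Python sketch below keeps only the items that appear in a sizeable share of a cluster's members. The article does not say how the informative dimensions are selected, so the frequency threshold, the function name `sparse_centre` and the parameter `min_fraction` are assumptions made for this example rather than Professor Tung's definition.

```python
from collections import Counter


def sparse_centre(cluster, min_fraction=0.5):
    """Return the items present in at least `min_fraction` of the cluster's members.

    Hypothetical helper: the selection rule is an assumption, not the
    definition of FreqItem presented in the talk.
    """
    # Count in how many members each item occurs.
    counts = Counter(item for member in cluster for item in set(member))
    threshold = min_fraction * len(cluster)
    # Keep only the frequently occurring items as the centre.
    return {item for item, n in counts.items() if n >= threshold}


# Each object is a set of items (for example, one shopping basket).
cluster = [
    {"apple", "bread", "milk"},
    {"apple", "milk", "eggs"},
    {"apple", "milk", "tea"},
    {"apple", "bread"},
]

centre = sparse_centre(cluster, min_fraction=0.5)
print(sorted(centre))  # ['apple', 'bread', 'milk']
```

Raising `min_fraction` gives a smaller centre: at 0.75 only "apple" and "milk" remain for this toy cluster.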

He then introduced participants to k-FreqItems, a clustering algorithm that is built upon the sparse centre representation FreqItem and assigns objects to clusters by minimising the Jaccard distance. To scale k-FreqItems, Professor Tung presented an innovative method named SILK (short for Seeding method based on simILar bucKets), which selects good seeds for better clustering results. He further discussed how k-FreqItems allows data scientists to perform clustering on complex data such as graphs and mixed attributes, and how the algorithm could be used as a pre-processing tool in data analytics operations such as active learning, classification, and data indexing.
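The Jaccard distance between two sets A and B is 1 - |A ∩ B| / |A ∪ B|, so objects that share most of their items are close together. The Python sketch below shows only the assignment step described above, under the assumption that objects and centres are plain sets. The names `jaccard_distance` and `assign_to_centres` are hypothetical, and neither the SILK seeding method nor the full k-FreqItems algorithm is reproduced here.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    if not a and not b:
        return 0.0  # convention: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(a | b)


def assign_to_centres(objects, centres):
    """Return, for each object, the index of the centre with the smallest Jaccard distance."""
    return [
        min(range(len(centres)), key=lambda i: jaccard_distance(obj, centres[i]))
        for obj in objects
    ]


# Illustrative centres and objects (made-up data, not from the talk).
centres = [{"apple", "milk"}, {"tea", "sugar"}]
objects = [
    {"apple", "milk", "bread"},
    {"tea", "sugar", "lemon"},
    {"milk"},
]

labels = assign_to_centres(objects, centres)
print(labels)  # [0, 1, 0]
```

Because the distance depends only on which items are shared, no feature weights or scaling need to be chosen for this step.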


The talk concluded with a Q&A session moderated by Senior Data Scientist (Enterprise Digital Services) Chua Kah Sheng.

Engineer (C3 Development) Daryl Chua appreciated the opportunity to interact with academics like Professor Tung, which allows him to expand his toolbox. He shared: “I work on video analytics which involves supervised machine learning for object detection, so this lecture exposed me to a different paradigm. I was introduced to new perspectives and more options to solve the engineering problems I might encounter in my work. I look forward to attending more of such talks in the future!”

Beyond the lecture halls, the learning journey was also a mutually beneficial process.

“I was delighted by the depth of understanding and applications to real-life situations, based on the questions asked. I look forward to working closely with the relevant project teams in DSTA,” said Professor Tung. As DSTA’s Distinguished Professor (Data Science), Professor Tung also provides technical leadership and guidance to refine DSTA’s existing methodology and approach to data science projects.

The DSTA Academy Technology Talk is a special lecture series where subject matter experts are invited to share their insights and experience in emerging technologies with DSTA staff.


Head Data Scientist (Enterprise Digital Services) Koh Lay Tin presented a token of appreciation to Professor Tung.