Finding the Needle in a Haystack

Sieving valuable information out of datasets to support organisations’ decision-making will become increasingly difficult because of the sheer amount of data inundating organisations today, said Professor Anthony Tung, Deputy Head of the Department of Computer Science, National University of Singapore.

He explained: “Now that we are in the era of Big Data, the variety of data available is a challenge for data analytics. Data models that were used previously, such as classification and regression models, lose too much semantics when trying to represent scenarios that are overly complex.”

Professor Tung was speaking to over 100 participants at a DSTA Academy Technology Talk on 19 October 2021. The Technology Talks are a lecture series, open to DSTA staff and the Defence Technology Community, in which subject matter experts are invited to share their experience and provide insights into emerging technologies.

Participants were introduced to the concept of Similarity Functions, which link data and form models, and group similar objects to derive patterns. They are also used to summarise the discrepancy between observed values and expected values in a specific data model.

To explain the different types of Similarity Functions, Professor Tung created a framework that categorises them based on complexity. One such function, Euclidean Distance (the length of a line segment between two points), relies on physical features such as height, weight, and colour to determine the similarity between objects, and requires little processing of data.

Illustrating his point with tourist attractions in Singapore, Professor Tung used the function to determine which attraction (P1 to P6) would be the most similar to attraction Q based on features d1 to d10. The values in each box represent how near, or similar, an attraction’s feature is to Q’s. From the table below, it can be concluded that attraction P1 is the most similar to attraction Q as it has the shortest Euclidean Distance.

[Table: similarity of attractions P1 to P6 to attraction Q across features d1 to d10]
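For readers who want to see the idea in code, the comparison described above can be sketched in a few lines of Python. This is only an illustration, not the method or the data presented in the talk: the function names `euclidean_distance` and `most_similar` are hypothetical, the attractions are reduced to three made-up features rather than d1 to d10, and the numbers are invented so the arithmetic is easy to check by hand.

```python
import math


def euclidean_distance(p, q):
    """Length of the straight line segment between points p and q."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def most_similar(query, candidates):
    """Return (name, distance) for the candidate closest to the query."""
    distances = {
        name: euclidean_distance(features, query)
        for name, features in candidates.items()
    }
    name = min(distances, key=distances.get)
    return name, distances[name]


# Hypothetical feature values for illustration only; not the figures from the talk.
Q = [3.0, 1.0, 4.0]
attractions = {
    "P1": [3.0, 2.0, 4.0],
    "P2": [0.0, 1.0, 0.0],
    "P3": [6.0, 5.0, 4.0],
}

best, dist = most_similar(Q, attractions)
print(best, dist)  # prints: P1 1.0
```

With these invented values, P1 differs from Q in only one feature and so has the shortest distance, which mirrors the kind of conclusion drawn from the table. In practice, features measured on different scales would usually need to be normalised first so that no single feature dominates the distance.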

Engineer (Advanced Systems) Merrick Ho was glad to have attended the talk and broadened his knowledge of data analytics, which has proven useful in many projects such as those involving data science and artificial intelligence. He said: “I’ve always been a fan of data analytics, and I felt that Professor Tung’s insights and eloquence helped shed light on a complicated topic. The talk helped to aggregate some of the concepts I had learnt previously, and I am already thinking of ways to incorporate them when handling big data in my role in the advanced electro-optics cluster.”

Professor Tung is also DSTA’s first Distinguished Professor for Data Science. In this role, he provides technical leadership and guidance to help refine DSTA’s existing methodologies and approaches to data science, as DSTA continues to build up its capability in this domain.

Visuals in article courtesy of speaker.