The t-SNE Data Visualization Technique from Scratch Using C# -- Visual Studio Magazine (2024)

The Data Science Lab

The t-SNE Data Visualization Technique from Scratch Using C#

  • By James McCaffrey
  • 03/15/2024

The t-SNE ("t-distributed Stochastic Neighbor Embedding") technique is a method for visualizing high-dimensional data. The basic t-SNE technique is very specific: It converts data with three or more columns into a condensed version with just two columns. The condensed two-dimensional data can then be visualized as an XY graph.

In most situations, the easiest way to apply t-SNE is to use an existing library such as the Python language scikit-learn sklearn.manifold.TSNE module. But to really understand how t-SNE works, and its strengths and weaknesses, you can implement t-SNE from scratch. This article presents an implementation of t-SNE using raw C#.

The best way to understand where this article is headed is to take a look at the screenshot in Figure 1 and the graph in Figure 2. The demo program begins by loading a tiny 15-item subset of the UCI Digits dataset into memory. Each data item has 64 pixel values so there's no direct way to graph the data. The first five data items represent '0' digits, the next five items represent '1' digits, and the last five items represent '2' digits.

The t-SNE algorithm is applied to the source data, resulting in 15 items with just two columns. The reduced data is plotted as an XY graph where the first column is used as x values and the second column is used as the y values. The graph shows three distinct groups of data items as expected.

This article assumes you have intermediate or better programming skill but doesn't assume you know anything about t-SNE data visualization. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish.


The source code for the demo program is too long to be presented in its entirety in this article. The complete code is available in the accompanying file download, and is also available online. All normal error checking has been removed to keep the main ideas as clear as possible.

The UCI Digits Dataset
The UCI Digits dataset contains crude 8-by-8 images of handwritten digits '0' through '9'. Each image is stored as a comma-delimited row of 64 pixels with integer grayscale values between 0 (white) and 16 (black), followed by the class label (digit represented) from 0 to 9. The UCI Digits dataset is sometimes called the optdigits ("optical digits") dataset.
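To get a feel for the data format, it can be helpful to dump one 64-value row as a crude 8-by-8 character image. The ShowDigitImage() helper below is a hypothetical illustration, not part of the demo program; it assumes the 64 pixel values for one image have already been loaded into an array:

static void ShowDigitImage(double[] pixels64)
{
  // print one 64-pixel row as an 8-by-8 console image;
  // larger grayscale values map to darker characters
  string shades = " .:-=+*#%@";  // 10 levels, light to dark
  for (int r = 0; r < 8; ++r)
  {
    for (int c = 0; c < 8; ++c)
    {
      int v = (int)pixels64[r * 8 + c];          // value in [0,16]
      int idx = (v * (shades.Length - 1)) / 16;  // map to [0,9]
      Console.Write(shades[idx]);
    }
    Console.WriteLine("");
  }
}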

The full dataset has 3,823 training images and 1,797 test images. The images in Figure 3 show 10 representative data items. The 15-item dataset used in the demo is the first five '0' images from the test dataset, followed by the first five '1' images, followed by the first five '2' images.

The 15-item dataset is presented in Listing 1. To be displayed more easily on a web page, the data has a backslash and newline inserted every eight pixel values. To run the demo, you must delete the backslash-newline characters so that the 64 pixel values and the class label are on one line per image.

Listing 1: The Demo Dataset

# optdigits_test_012_15.txt
# UCI digits, just 5 each of '0', '1', '2'
# 64 pixel values in [0,16] then digit [0,1,2]
#
# remove the \-newline characters so that
# each line has 64 pixel values followed by
# the class label (0, 1, 2)
#
0,0,5,13,9,1,0,0,\
0,0,13,15,10,15,5,0,\
0,3,15,2,0,11,8,0,\
0,4,12,0,0,8,8,0,\
0,5,8,0,0,9,8,0,\
0,4,11,0,1,12,7,0,\
0,2,14,5,10,12,0,0,\
0,0,6,13,10,0,0,0, 0
#
0,0,1,9,15,11,0,0,\
0,0,11,16,8,14,6,0,\
0,2,16,10,0,9,9,0,\
0,1,16,4,0,8,8,0,\
0,4,16,4,0,8,8,0,\
0,1,16,5,1,11,3,0,\
0,0,12,12,10,10,0,0,\
0,0,1,10,13,3,0,0, 0
#
0,0,3,13,11,7,0,0,\
0,0,11,16,16,16,2,0,\
0,4,16,9,1,14,2,0,\
0,4,16,0,0,16,2,0,\
0,0,16,1,0,12,8,0,\
0,0,15,9,0,13,6,0,\
0,0,9,14,9,14,1,0,\
0,0,2,12,13,4,0,0, 0
#
0,0,10,14,11,3,0,0,\
0,4,16,13,6,14,1,0,\
0,4,16,2,0,11,7,0,\
0,8,16,0,0,10,5,0,\
0,8,16,0,0,14,4,0,\
0,8,16,0,1,16,1,0,\
0,4,16,1,11,15,0,0,\
0,0,11,16,12,3,0,0, 0
#
0,0,6,14,10,2,0,0,\
0,0,15,15,13,15,3,0,\
0,2,16,10,0,13,9,0,\
0,1,16,5,0,12,5,0,\
0,0,16,3,0,13,6,0,\
0,1,15,5,6,13,1,0,\
0,0,16,11,14,10,0,0,\
0,0,7,16,11,1,0,0, 0
#
#
0,0,0,12,13,5,0,0,\
0,0,0,11,16,9,0,0,\
0,0,3,15,16,6,0,0,\
0,7,15,16,16,2,0,0,\
0,0,1,16,16,3,0,0,\
0,0,1,16,16,6,0,0,\
0,0,1,16,16,6,0,0,\
0,0,0,11,16,10,0,0, 1
#
0,0,0,0,14,13,1,0,\
0,0,0,5,16,16,2,0,\
0,0,0,14,16,12,0,0,\
0,1,10,16,16,12,0,0,\
0,3,12,14,16,9,0,0,\
0,0,0,5,16,15,0,0,\
0,0,0,4,16,14,0,0,\
0,0,0,1,13,16,1,0, 1
#
0,0,0,2,16,16,2,0,\
0,0,0,4,16,16,2,0,\
0,1,4,12,16,12,0,0,\
0,7,16,16,16,12,0,0,\
0,0,3,10,16,14,0,0,\
0,0,0,8,16,12,0,0,\
0,0,0,6,16,16,2,0,\
0,0,0,2,12,15,4,0, 1
#
0,0,0,0,12,5,0,0,\
0,0,0,2,16,12,0,0,\
0,0,1,12,16,11,0,0,\
0,2,12,16,16,10,0,0,\
0,6,11,5,15,6,0,0,\
0,0,0,1,16,9,0,0,\
0,0,0,2,16,11,0,0,\
0,0,0,3,16,8,0,0, 1
#
0,0,0,1,11,9,0,0,\
0,0,0,7,16,13,0,0,\
0,0,4,14,16,9,0,0,\
0,10,16,11,16,8,0,0,\
0,0,0,3,16,6,0,0,\
0,0,0,3,16,8,0,0,\
0,0,0,5,16,10,0,0,\
0,0,0,2,14,6,0,0, 1
#
#
0,0,0,4,15,12,0,0,\
0,0,3,16,15,14,0,0,\
0,0,8,13,8,16,0,0,\
0,0,1,6,15,11,0,0,\
0,1,8,13,15,1,0,0,\
0,9,16,16,5,0,0,0,\
0,3,13,16,16,11,5,0,\
0,0,0,3,11,16,9,0, 2
#
0,0,5,12,1,0,0,0,\
0,0,15,14,7,0,0,0,\
0,0,13,1,12,0,0,0,\
0,2,10,0,14,0,0,0,\
0,0,2,0,16,1,0,0,\
0,0,0,6,15,0,0,0,\
0,0,9,16,15,9,8,2,\
0,0,3,11,8,13,12,4, 2
#
0,0,8,16,5,0,0,0,\
0,1,13,11,16,0,0,0,\
0,0,10,0,13,3,0,0,\
0,0,3,1,16,1,0,0,\
0,0,0,9,12,0,0,0,\
0,0,3,15,5,0,0,0,\
0,0,14,15,8,8,3,0,\
0,0,7,12,12,12,13,1, 2
#
0,0,0,5,14,12,2,0,\
0,0,7,15,8,14,4,0,\
0,0,6,2,3,13,1,0,\
0,0,0,1,13,4,0,0,\
0,0,1,11,9,0,0,0,\
0,8,16,13,0,0,0,0,\
0,5,14,16,11,2,0,0,\
0,0,0,6,12,13,3,0, 2
#
0,0,0,3,15,10,1,0,\
0,0,0,11,10,16,4,0,\
0,0,0,12,1,15,6,0,\
0,0,0,3,4,15,4,0,\
0,0,0,6,15,6,0,0,\
0,4,15,16,9,0,0,0,\
0,0,13,16,15,9,3,0,\
0,0,0,4,9,14,7,0, 2

I saved the 15-item source data as optdigits_test_012_15.txt in a Data directory located in the project root directory. The wacky data file name is intended to indicate that the file contains 15 items from the UCI Digits test data, where the items are class '0', '1' or '2'.

Unlike many machine learning techniques, t-SNE visualization does not require you to normalize the source data. Source data can be normalized, typically using z-score normalization, min-max normalization, or divide-by-constant normalization. But t-SNE often works fine with raw data when, as with the UCI Digits pixel values, all columns have roughly the same magnitude, so no single column overwhelms the Euclidean distance calculations that t-SNE performs.
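If you do decide to normalize, a z-score version of the data can be computed with a short helper. The MatNormalizeZScore() function below is a minimal sketch, not part of the demo download, that replaces each value by (value - column mean) / column standard deviation:

// optional z-score normalization sketch (not used by the demo)
static double[][] MatNormalizeZScore(double[][] X)
{
  int rows = X.Length; int cols = X[0].Length;
  double[][] result = new double[rows][];
  for (int i = 0; i < rows; ++i)
    result[i] = new double[cols];

  for (int j = 0; j < cols; ++j)
  {
    double mean = 0.0;
    for (int i = 0; i < rows; ++i) mean += X[i][j];
    mean /= rows;

    double variance = 0.0;
    for (int i = 0; i < rows; ++i)
      variance += (X[i][j] - mean) * (X[i][j] - mean);
    double sd = Math.Sqrt(variance / rows);
    if (sd == 0.0) sd = 1.0;  // constant column: just center the values

    for (int i = 0; i < rows; ++i)
      result[i][j] = (X[i][j] - mean) / sd;
  }
  return result;
}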

After applying t-SNE to the 15-item UCI Digits subset, the condensed two-dimensional values can be seen in the demo program output shown in Figure 1. The condensed two-dimensional data looks like:

[ 0]  -337.1624   -72.7990
[ 1]  -270.6862   -41.9135
. . .
[14]    15.6229    80.1027

The UCI Digits dataset has class labels, but t-SNE visualization can be applied to data with or without labels. When examining data that has explicit class labels, the labels are usually not included in the data that is fed to t-SNE. As in the demo, class labels are typically used only to verify the t-SNE visualization results.

The Demo Program
To create the t-SNE demo program I used Visual Studio 2022 (Community Free Edition). I created a new C# console application and checked the "Place solution and project in the same directory" option. I specified .NET version 8.0. I named the project TSNE. I checked the "Do not use top-level statements" option to avoid the rather annoying program entry point shortcut syntax.

The demo has no significant .NET dependencies and any relatively recent version of Visual Studio with .NET (Core) or the older .NET Framework will work fine. You can also use the Visual Studio Code program if you like.

After the template code loaded into the editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to the slightly more descriptive TSNEProgram.cs. I allowed Visual Studio to automatically rename class Program.

The C# demo program is a refactoring of the original t-SNE program which is written in Python. The source research paper is "Visualizing Data using t-SNE" (2008), by L. van der Maaten and G. Hinton. The original Python implementation was written by van der Maaten and can be found at https://lvdmaaten.github.io/tsne/. The Python implementation is labeled as free to use, modify or redistribute for non-commercial purposes.

The overall program structure is presented in Listing 2. The TSNEProgram class houses the Main() method, which has all the control logic. The TSNE class is a simple wrapper over the Reduce() function that does all the work.

Listing 2: The t-SNE Demo Program Structure

using System;
using System.IO;

namespace TSNE
{
  internal class TSNEProgram
  {
    static void Main(string[] args)
    {
      // load data from file into memory
      // apply the t-SNE Reduce() function
      // show reduced two-dimensional data
    }
  } // Program

  public class TSNE
  {
    // wrapper class for static Reduce() and helpers
    public static double[][] Reduce(double[][] X,
      int maxIter, int perplexity) { . . }

    // ----- two primary helper functions
    private static double[][] ComputeP(double[][] X,
      int perplexity) { . . }
    private static double[] ComputePH(double[] di,
      double beta, out double h) { . . }

    // ------------------------------------------------------
    //
    // 32 secondary helper functions, and one helper class
    //
    // MatSquaredDistances, MatCreate, MatOnes,
    // MatLoad, MatSave, VecLoad,
    // MatShow, MatShow, MatShow, (3 overloads)
    // VecShow, VecShow, MatCopy, MatAdd, MatTranspose,
    // MatProduct, MatDiff, MatExtractRow,
    // MatExtractColumn, VecMinusMat, VecMult, VecTile,
    // MatMultiply, MatMultLogDivide, MatRowSums,
    // MatColumnSums, MatColumnMeans, MatReplaceRow,
    // MatVecAdd, MatAddScalar, MatInvertElements,
    // MatSum, MatZeroOutDiag
    //
    // Gaussian class: NextGaussian()
    //
    // ------------------------------------------------------

  } // class TSNE
} // ns

The Reduce() function calls two primary helper functions, ComputeP() and ComputePH(). I renamed them from the x2p() and Hbeta() names used in the Python implementation. The bulk of the demo program consists of 32 secondary helper functions and a helper class. Most of these secondary helpers are simple and implement functionality that is built into Python. For example, in Python, if A and B are two NumPy matrices, you can write C = A + B. But when using raw C# you must implement a program-defined function such as:

private static double[][] MatAdd(double[][] A, double[][] B)
{
  int rows = A.Length;
  int cols = A[0].Length;
  double[][] result = MatCreate(rows, cols);  // helper
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < cols; ++j)
      result[i][j] = A[i][j] + B[i][j];
  return result;
}
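Another representative secondary helper is MatSquaredDistances(), which t-SNE uses to compute the pairwise squared Euclidean distances between all rows of a matrix. The version below is a sketch of how such a helper could look; the implementation in the demo download may differ in its details:

// sketch: result[i][j] is the squared Euclidean distance
// between row i and row j of X
private static double[][] MatSquaredDistances(double[][] X)
{
  int n = X.Length; int dim = X[0].Length;
  double[][] result = MatCreate(n, n);  // helper
  for (int i = 0; i < n; ++i)
  {
    for (int j = i; j < n; ++j)
    {
      double sum = 0.0;
      for (int k = 0; k < dim; ++k)
      {
        double diff = X[i][k] - X[j][k];
        sum += diff * diff;
      }
      result[i][j] = sum;
      result[j][i] = sum;  // distances are symmetric
    }
  }
  return result;
}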

The Calling Code
The Main() method begins by loading the 15-item UCI Digits subset into memory:

string fn = "..\\..\\..\\Data\\optdigits_test_012_15.txt";
int[] cols = new int[64];
for (int i = 0; i < 64; ++i)
  cols[i] = i;
double[][] X = TSNE.MatLoad(fn, cols, ',', "#");
Console.WriteLine("\nX (first 12 columns): ");
TSNE.MatShow(X, 12, 1, 6, true);

The arguments passed to the MatLoad() function mean load 64 comma-delimited columns [0] through [63] inclusive, where lines beginning with "#" are comments to be ignored. The arguments to MatShow() mean show the first 12 columns of data, using 1 decimal, with a field width of 6, and show row indices.
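The MatLoad() function is defined in the demo download, but based on this call, a simplified loader could look something like the sketch below. It assumes a using System.Collections.Generic directive and that the backslash continuations have been removed so each image is on one line, as described earlier; the actual implementation may differ:

// sketch of a MatLoad-style loader: usecols are the column
// indices to keep, sep is the delimiter, and lines that start
// with the comment token are skipped
static double[][] MatLoad(string fn, int[] usecols,
  char sep, string comment)
{
  List<double[]> rows = new List<double[]>();
  foreach (string line in File.ReadLines(fn))
  {
    if (line.StartsWith(comment) || line.Trim() == "") continue;
    string[] tokens = line.Split(sep);
    double[] row = new double[usecols.Length];
    for (int j = 0; j < usecols.Length; ++j)
      row[j] = double.Parse(tokens[usecols[j]].Trim());
    rows.Add(row);
  }
  return rows.ToArray();
}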

The demo loads the class labels in column [64] and displays them like so:

Console.WriteLine("y labels: ");int[] y = TSNE.VecLoad(fn, 64, "#"); // col [64]TSNE.VecShow(y, 3);

As mentioned previously, t-SNE does not need labels, and the demo program uses them only to verify that the reduced data accurately reflects the source data.

The source data is reduced and displayed using these statements:

Console.WriteLine("maxIter = 600, perplexity = 5 ");double[][] reduced = TSNE.Reduce(X, maxIter: 600, perplexity: 5);Console.WriteLine("Result: ");TSNE.MatShow(reduced, 4, 12, true);

The t-SNE reduction process is iterative. The maximum number of iterations must be determined by trial and error. The demo monitors error/cost every 100 iterations. Preliminary experiments showed that error stopped significantly decreasing at roughly 600 iterations.

The primary hyperparameter for t-SNE reduction is called perplexity. The perplexity value defines how many nearest data items are used during reduction. Perplexity can be thought of as a parameter to tune the balance between preserving the global and the local structure of the data. The t-SNE process is highly sensitive to the perplexity value, and a good value must be determined by trial and error. Because the demo dataset is so small, a small perplexity value of 5 was used. Typical perplexity values are between 5 and 100.
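The demo program itself doesn't create the XY graph. One simple approach, sketched below, is to write the reduced coordinates and the class labels to a comma-delimited file and then graph that file using Excel, matplotlib or a similar tool. The file name and the use of StreamWriter here are arbitrary choices, not part of the demo:

// sketch: save the reduced (x, y) coordinates and class labels
// to a comma-delimited file for graphing
using (StreamWriter sw = new StreamWriter("..\\..\\..\\Data\\reduced.txt"))
{
  sw.WriteLine("x,y,label");
  for (int i = 0; i < reduced.Length; ++i)
    sw.WriteLine(reduced[i][0].ToString("F4") + "," +
      reduced[i][1].ToString("F4") + "," + y[i]);
}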



FAQs

Is t-SNE only for visualization?

t-SNE (t-distributed Stochastic Neighbor Embedding) is an unsupervised non-linear dimensionality reduction technique for data exploration and visualizing high-dimensional data. Non-linear dimensionality reduction means that the algorithm allows us to separate data that cannot be separated by a straight line.

How is t-SNE used for visualization?

t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints.

What is better than t-SNE?

UMAP vs t-SNE, revisited

The biggest difference between the output of UMAP when compared with t-SNE is this balance between local and global structure - UMAP is often better at preserving global structure in the final projection.

When not to use t-SNE?

One of the main disadvantages of t-SNE is that it is computationally intensive and therefore relatively slow. This means that it is not appropriate for very large datasets with many observations. Part of the reason for this is that pairwise distances need to be calculated between all of the points in the dataset.

When should I use t-SNE?

The best way to use the algorithm is for exploratory data analysis. It will give you a very good sense of patterns hidden inside the data. It can also be used as an input to other classification and clustering algorithms.

What is the difference between PCA and t-SNE?

PCA simplifies and finds global patterns in the data. t-SNE emphasizes visualization and reveals local patterns and clusters. UMAP aims to preserve the local structure and handle complex relationships.

What are the advantages of t-SNE over PCA?

t-SNE is great for finding clusters and patterns in complex data, especially when they are not separable by a linear boundary. t-SNE can also reveal hidden features and structures that PCA might miss.
