
Audio Visualizations in C#

Experimenting with real time and generated audio visualizations

Hey everyone! I've been busy as I move toward graduating this December, but I'm gearing up for the final features I want in Melon v1.0. The last main features I'm tracking are:

  • Swap out Old UI system for MelonUI library
    • Waiting on two last things for MelonUI
  • Genre Canonicalization
    • Waiting on the new Discovery system
  • The new Discovery system
    • Hoping to work on this as my senior project in my last semester. I've pulled together lots of stuff for it but haven't started work on it (cause I'm not allowed to if I want it to be that project)
  • Visualizations
    • What I'm here to talk about today!

A big idealistic goal for Melon's development is that developers interfacing with the server should find it easy to implement features and reach parity with other clients. Melon will try to handle as much of the implementation and processing as it can, to make clients easier to build and able to run on more devices.


Melon already has one visualization feature built in: under api/download/track-waveform you can get a waveform of the whole track. The planned use in clients is to show the waveform as the playback progress bar. It's fast to compute and small to send to clients, so this was an easy one to implement. It's not the only visualization feature I want, though.
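
I won't paste Melon's implementation here, but the core idea is cheap: decode the track once, bucket the samples, and keep one peak per bucket. A minimal sketch of that idea (the function name and point count are mine, not Melon's):

// Hypothetical sketch: reduce a whole track's samples to N peak values,
// suitable for drawing a waveform progress bar.
static float[] BuildWaveform(float[] samples, int points = 600)
{
    var peaks = new float[points];
    int bucket = Math.Max(1, samples.Length / points);
    for (int p = 0; p < points; p++)
    {
        float max = 0;
        int start = p * bucket;
        int end = Math.Min(start + bucket, samples.Length);
        for (int i = start; i < end; i++)
            max = Math.Max(max, Math.Abs(samples[i])); // keep the loudest sample in the bucket
        peaks[p] = max;
    }
    return peaks;
}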


I'm a big fan of frequency visualizers, and have used many over the years, like Fountain of Colors with Rainmeter. I recently went back to that one in 2025 on Windows 11 and it's... a highly degraded experience (if it works at all; I had menus that would set their values to -1 if you interacted with them, breaking the visualizer). This prompted me to look around for other systems like it, and I couldn't find any that were as customizable. So I started work on my own! It uses C# + WPF for the app and UI, NAudio for WASAPI access, and FFTSharp for its much better implementation of the Fast Fourier Transform than mine. The app flow is simple in theory but a bit messy in practice: it takes in float data from WASAPI and builds frames to put in a queue that the UI can pull from and display. I'll go over how that works below, with summaries followed by more detailed comments.


FFT Visualization Pipeline Code


Getting Audio

We need to get a few things before we can build frames for our visualizer. First, get audio bytes from our source (in this case WASAPI Capture), then convert those into float samples.


This is our starting function. It sets up our variables and starts recording on the selected device, if any. It supports routing based on an output device, an input device, or a PID for per-process capture.


WasapiLoopbackCapture _capture;  
WaveFormat _CaptureFormat;       
WasapiCapture _inputCapture; 
// Checking if this is an input device or an output device
if (!isInput)
{
    var devEnum = new MMDeviceEnumerator(); // Audio Device enumerator
    
    // If no selected audio device setup with the default
    if (_audioDevice == null)
    {
        _capture = new WasapiLoopbackCapture();
        _CaptureFormat = _capture.WaveFormat;
    }
    else // Otherwise we need to setup a new capture using the selected device
    {
        MMDevice output = string.IsNullOrWhiteSpace(_audioDevice.FriendlyName)
            ? devEnum.GetDefaultAudioEndpoint(DataFlow.Render, Role.Console)
            : devEnum.EnumerateAudioEndPoints(DataFlow.Render, DeviceState.Active)
                      .First(d => d.FriendlyName == _audioDevice.FriendlyName);
        
        _capture = new WasapiLoopbackCapture(output);
        _CaptureFormat = _capture.WaveFormat;
    }
}
else // Same thing as above, except for Microphone inputs (DataFlow.Capture vs DataFlow.Render)
{
    var devEnum = new MMDeviceEnumerator();
    // Fall back to the default mic when no device is selected (avoids a null deref)
    MMDevice mic = _audioDevice == null || string.IsNullOrWhiteSpace(_audioDevice.FriendlyName)
        ? devEnum.GetDefaultAudioEndpoint(DataFlow.Capture, Role.Console)
        : devEnum.EnumerateAudioEndPoints(DataFlow.Capture, DeviceState.Active)
                  .First(d => d.FriendlyName == _audioDevice.FriendlyName);
    _inputCapture = new WasapiCapture(mic);
    _CaptureFormat = _inputCapture.WaveFormat;
}

// Selected app variable used to allow users to select per-process audio 
// This functionality as of this publication is NOT in NAudio, see "WASAPI Woes" section below.
if (SelectedApp != "")
{
    int pid = int.Parse(SelectedApp.Split(" - ")[0]);
    var cap = await WasapiCapture.CreateForProcessCaptureAsync(pid, true);
    _CaptureFormat = cap.WaveFormat;
    
    cap.DataAvailable += CaptureOnDataAvailable;
    cap.StartRecording();
    
    while (!token.IsCancellationRequested) { Thread.Sleep(50); } // idle instead of spinning a core
    cap.StopRecording();
}
else if (!isInput) // Output device 
{
    _capture.DataAvailable += CaptureOnDataAvailable; // Hook Capture function to event that's called when data is ready
    _capture.StartRecording(); // Start Recording
    while (!token.IsCancellationRequested) { Thread.Sleep(50); } // idle instead of spinning a core
    _capture.StopRecording();
}
else // Input device
{
    _inputCapture.DataAvailable += CaptureOnDataAvailable; // Hook Capture function to event that's called when data is ready
    _inputCapture.StartRecording(); // Start Recording
    while (!token.IsCancellationRequested) { Thread.Sleep(50); } // idle instead of spinning a core
    _inputCapture.StopRecording();
}


Next we have to do something with that data. Normally I'd just convert the bytes given by WASAPI into the float samples I need, but per-process capture returns PCM-16 rather than float (more on that later), so there are two paths for transforming bytes into samples. We also want ring buffers (circular buffers) for both the incoming bytes and the decoded samples. At first I used a plain buffer for the incoming bytes, but a single callback never fills it to the needed size in one go, which meant waiting multiple calls before a frame could be built and made the visualization look laggy. Instead, the bytes go into a ring buffer where the first byte is the oldest and gets pushed out as new ones arrive; once the buffer fills to the needed size for the first time, we can continually decode samples for running FFT. The CircularBuffer implementation comes from this lovely repo, though I have extended it to support thread safety.
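
The thread-safety extension isn't part of the repo, so here's roughly what mine does, reduced to the two operations the capture path needs. A sketch, not the repo's actual code:

// Sketch of the thread-safety idea: the capture thread pushes bytes while the
// FFT thread snapshots the window, so every access goes through one lock.
class SafeCircularBuffer
{
    private readonly byte[] _buf;
    private readonly object _gate = new();
    private int _head, _count;

    public SafeCircularBuffer(int capacity) => _buf = new byte[capacity];

    public int Count { get { lock (_gate) return _count; } }

    public void PushBack(byte b)
    {
        lock (_gate)
        {
            _buf[(_head + _count) % _buf.Length] = b;
            if (_count < _buf.Length) _count++;     // still filling
            else _head = (_head + 1) % _buf.Length; // full: overwrite the oldest byte
        }
    }

    public byte[] ToArray() // snapshot, oldest byte first
    {
        lock (_gate)
        {
            var snapshot = new byte[_count];
            for (int i = 0; i < _count; i++)
                snapshot[i] = _buf[(_head + i) % _buf.Length];
            return snapshot;
        }
    }
}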


private static void CaptureOnDataAvailable(object? sender, WaveInEventArgs e)
{
    if (MainWindow.FVZMode) return; // Disables live playback when playing fvz files (more on that later)

    try
    {
        var fmt = _CaptureFormat;

        // Calculate buffer size
        int samplesNeeded = InstanceOptions._fftSize / fmt.Channels;
        int bytesPerSample = fmt.BitsPerSample / 8;
        int bufferSizeInBytes = samplesNeeded * fmt.Channels * bytesPerSample;

        if (tempBuffer.Capacity != bufferSizeInBytes)
        {
            tempBuffer = new CircularBuffer(bufferSizeInBytes);
        }

        // Add new data to buffer
        for (int i = 0; i < e.BytesRecorded; i++)
        {
            tempBuffer.PushBack(e.Buffer[i]);
        }

        // Check if we have enough data
        if (tempBuffer.Count() < bufferSizeInBytes)
        {
            return;
        }

        // Extract bytes from the ring buffer
        byte[] buffer = new byte[bufferSizeInBytes];
        int idx = 0;
        foreach (var b in tempBuffer)
        {
            buffer[idx++] = b;
            if (idx >= bufferSizeInBytes) break;
        }

        // Process bytes into samples
        if (fmt.Encoding == WaveFormatEncoding.IeeeFloat ||
            (fmt.Encoding == WaveFormatEncoding.Extensible &&
             ((WaveFormatExtensible)fmt).SubFormat == AudioMediaSubtypes.MEDIASUBTYPE_IEEE_FLOAT))
        {
            ProcessFloatData(buffer, bufferSizeInBytes, fmt, samplesNeeded);
        }
        else if (fmt.Encoding == WaveFormatEncoding.Pcm && fmt.BitsPerSample == 16)
        {
            ProcessPcm16Data(buffer, bufferSizeInBytes, fmt, samplesNeeded);
        }
    }
    catch (Exception)
    {
        // Swallow transient capture errors; the next DataAvailable call simply retries
    }
}


Depending on which path is taken, one of the following functions runs; either way we end up with samples ready for FFT processing. For float data we get 4 bytes per sample, and in 2-channel stereo the samples alternate between left and right, which we average here for mono visualizations (2-channel visualizations could look cool too, and I may support them later). Then we push these samples into another CircularBuffer so the FFT thread can continuously build frames as soon as possible.


private static void ProcessFloatData(byte[] buffer, int bytesRecorded, WaveFormat fmt, int needed)
{
    // Since the data is already floats we can reinterpret the bytes directly
    ReadOnlySpan<float> span = MemoryMarshal.Cast<byte, float>(buffer.AsSpan(0, bytesRecorded));
    int channels = fmt.Channels;

    if (channels == 2)
    {
        // Stereo: average left/right into one mono sample
        for (int n = 0; n < needed; n++)
        {
            int i = n * 2;
            _samples.PushBack((span[i] + span[i + 1]) * 0.5f);
        }
    }
    else
    {
        // Mono
        for (int n = 0; n < needed; n++)
        {
            _samples.PushBack(span[n * channels]);
        }
    }
}


For PCM-16 it's very similar, but it uses shorts instead of floats:


private static void ProcessPcm16Data(byte[] buffer, int bytesRecorded, WaveFormat fmt, int needed)
{
    // Same byte reinterpretation, but to shorts instead
    ReadOnlySpan<short> span = MemoryMarshal.Cast<byte, short>(buffer.AsSpan(0, bytesRecorded));
    int channels = fmt.Channels;

    const float scale = 1.0f / 32768f;

    if (channels == 2)
    {
        // Stereo
        for (int n = 0; n < needed; n++)
        {
            int i = n * 2;
            _samples.PushBack(((span[i] + span[i + 1]) * 0.5f) * scale);
        }
    }
    else
    {
        // Mono
        for (int n = 0; n < needed; n++)
        {
            _samples.PushBack(span[n * channels] * scale);
        }
    }
}



Getting Visualizer Frames

Now we have usable samples to pass through the Fast Fourier Transform.


private static void ComputeFFT(int needed)
{
	// Get the sample rate
    int sampleRate = _CaptureFormat.SampleRate;
    while (!_fftCTS.IsCancellationRequested)
    {
        fpsMeter.StartFpsCounter();
        
        // Make sure the sample buffer is ready (and we aren't in FVZ Mode)
        if (!_samples.IsFull || _samples.IsEmpty || MainWindow.FVZMode)
        {
            fpsMeter.StopFpsCounter();
            continue;
        }
        try
        {
            // Apply window function using FFTSharp
            var slice = _samples.ToArray();
            _window.ApplyInPlace(slice);

            // FFT processing using FFTSharp
            _spectrum = FFT.Forward(slice);
            _magnitudes = FFT.Magnitude(_spectrum);

            // Build frame with current scale mode
            BuildFrame(_magnitudes, sampleRate);
        }
        catch (Exception)
        {
            // Ignore a bad window; the loop moves on to the next one
        }
        fpsMeter.StopFpsCounter();
    }
}


BuildFrame routes to one of several representations of the frequency spectrum, selected in the UI. I want to cover two of them, Log10 and "Normalized"; there is also support for Mel-spectrogram binning.


First, Log10. This function takes the magnitudes from the last step and averages them into the bars we display in the visualizer. If the binning were linear, we would split the FFT bins equally across the bars, but human hearing doesn't work that way. Pitch perception is roughly logarithmic: each octave doubles in frequency. So we double each section of bins, like so: 20-40, 40-80, 80-160, 160-320, etc. This means the high-frequency bins get packed into fewer bars.


private static void BuildFrameLogNormalized(double[] mags, int sampleRate)
{
	// Preparing Variables
    double fMin = InstanceOptions._fMin;
    double fMax = InstanceOptions._fMax == -1 ? sampleRate / 2.0 : InstanceOptions._fMax;
    int rows = InstanceOptions._bars;
    var power = new double[rows]; // Stores Accumulated Power sums for each bar
    var binCnt = new double[rows]; // Stores count of FFT bins -> Visualizer bins
	
    // Logarithmic edges 
    // Human hearing is said to be logarithmic, so we compensate by making the spectrum of frequencies take up more space as you get higher up. 
    // This means at the low end a bar might be 20-40, but higher up it will be 1600-3200. By contrast, linear binning would put the same amount of FFT bins into each visualizer bar: 20-40 at the bottom and 1600-1620 at the top.
    double logMin = Math.Log10(fMin);
    double logMax = Math.Log10(fMax);
    double logStep = (logMax - logMin) / rows;

    // Pre-compute bar edges 
    // Here we are calculating the boundaries between bars, what frequencies do they start/stop at
    // Edges contains the left edge of boundary (the start freq) at i, and the right edge at i+1
    double[] edges = new double[rows + 1];
    for (int r = 0; r <= rows; r++)
    {
        edges[r] = Math.Pow(10, logMin + r * logStep);
	}

    // Loop over each FFT bin and for each:
    //  - Find which left and right edge it sits between
    //  - Distribute its power between/on those edges
    for (int bin = 1; bin < mags.Length; bin++)
    {
        double f = bin * sampleRate / (double)InstanceOptions._fftSize;
        if (f < edges[0] || f >= edges[^1]) continue;

        // Locate left edge index k so that edges[k] <= f < edges[k+1]
        int k = Array.BinarySearch(edges, f);
        if (k < 0) k = ~k - 1; // BinarySearch peculiarity

        double l = edges[k];
        double rEdge = edges[k + 1];
        double t = (f - l) / (rEdge - l); // 0->1 position between the two edges

        // distribute energy into the two adjacent bars
        double e = mags[bin] * mags[bin]; // energy to distribute
        
        power[k] += e * (1 - t); // left bar distribution
        binCnt[k] += (1 - t);

        if (k + 1 < rows) // right bar distribution (still inside range)
        {
            power[k + 1] += e * t;
            binCnt[k + 1] += t;
        }
    }


    // Clamp in range + Normalize
    // - Normalize the power by the amount of FFT bins that contributed to a specific bar.
    // - Convert Amplitude to RMS to show the "average energy" of the bins that were added into a bar
    // - Clamp in the range of _dbRange where 0 is _dbFloor and 1 is _dbFloor + _dbRange
    var frame = new double[rows];
    for (int r = 0; r < rows; r++)
    {
        if (binCnt[r] == 0) { frame[r] = 0; continue; }

        // power -> amplitude -> dB
        double rms = Math.Sqrt(power[r] / binCnt[r]) * Math.Sqrt(binCnt[r]); 
        double db = 20 * Math.Log10(rms + 1e-20);

        double topDb = InstanceOptions._dbFloor + InstanceOptions._dbRange;
        double dbNorm = Math.Clamp((db - InstanceOptions._dbFloor) / InstanceOptions._dbRange, 0, 1);
        frame[r] = dbNorm;
    }

	// At this point we have values usable in our visualizer and format, 0.0->1.0 where 0 is no sound in(/around) that frequency and 1 is very loud sound in that frequency (clamped max, but can technically go "out of bounds")
	
	// Last part here is smoothing, we take each bar and "blur" it to the bars on either side
    var smoothed = new double[rows];
    for (int r = 0; r < rows; r++)
    {
        double sum = 0;
        int cnt = 0;
        // For -Smoothness to +Smoothness (2 smooth would be -2 -> 2)
        for (int s = -InstanceOptions._smooth; s <= InstanceOptions._smooth; s++)
        {
	        // If row +/- smooth is not negative and not more than the rows we have
	        if (r + s >= 0 && r + s < rows) 
	        { 
		        // Take that row and add it to our sum (and increase our total)
		        sum += frame[r + s]; 
				cnt++; 
	        }
        }
		
		// Avg the sum
        smoothed[r] = (sum / cnt);
    }

	// Now this frame is done and ready to be shown by the visualizer, so it's chucked into a ConcurrentQueue
	// I also check to make sure the UI isn't lagging too far behind by not letting the buffer get overfilled
    _frameQueue.Enqueue(smoothed);
    while (_frameQueue.Count > 13) _frameQueue.TryDequeue(out _);
}


Log10 Visualizer Gif

Now we've got frames to show! I won't go over how I display them in detail; it's super simple anyway: just telling a rectangle to get taller based on the 0->1 floats (plus coloring/positioning/etc).
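
For the curious, the display step boils down to something like this. A sketch: _frameQueue is the ConcurrentQueue from above, while _bars and _maxBarHeight are assumed to be a list of rectangles and their maximum pixel height:

// Sketch: on each WPF render tick, pull the newest frame and resize the bars.
CompositionTarget.Rendering += (s, e) =>
{
    if (!_frameQueue.TryDequeue(out double[] frame)) return; // nothing new yet

    for (int i = 0; i < _bars.Count && i < frame.Length; i++)
        _bars[i].Height = frame[i] * _maxBarHeight; // 0..1 value -> pixel height
};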

Making it look better

The visualization works and looks decent, but in my opinion it has some issues. For the algorithm above, I added smoothing to help fix the jagged bars, especially as the bar count grows in the low end. The Smoothness value determines how much peaks "slide down" to adjacent bars to smooth out jagged waveforms. This helped, but there were three other issues I wanted to fix:

  • The low end takes up too much space leading to wide square waveforms as bars increase
  • Lots of noise bleed leads to uniformity without raising the db Floor
  • The high end is quiet and not very responsive (especially compared to the lows)

For the first issue, we can swap out the Log10 binning for any function that takes in a frequency and returns the bin it should hit. My algorithm defines a low, mid, and high range, then scales them separately: compressing the lows more than the mids, and the highs more than both. However, splitting the compression like this leads to "boundary walls": as the frequency increases or decreases through a boundary, it sticks around the boundary point. This happens because the slope of each section mismatches, even though the bar values on either side of the edge may be close, if not the same. To fix this we can compute the derivatives at each section border and then use cubic Hermite splines to build a smooth connection between point A and point B, where A and B are the boundary +/- some transition width. I won't cover the cubic Hermite derivation because my brain doesn't like it and someone else could do it better; I highly recommend Freya Holmér's video on Splines, which helped me wrap my head around many of the concepts.
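
For reference though, the CubicHermite call the easing function makes below is just the textbook two-point Hermite formula. A minimal version matching the signature used here (my own condensation of the standard derivation, not necessarily my exact production code):

// Standard cubic Hermite interpolation between (t1, v1) with slope d1 and
// (t2, v2) with slope d2. s is t normalized into [0, 1] across the segment.
static double CubicHermite(double t, double t1, double v1, double d1,
                                     double t2, double v2, double d2)
{
    double dt = t2 - t1;
    double s = (t - t1) / dt;
    double s2 = s * s, s3 = s2 * s;

    return v1 * (2 * s3 - 3 * s2 + 1)
         + d1 * dt * (s3 - 2 * s2 + s)  // tangents scale by segment width
         + v2 * (-2 * s3 + 3 * s2)
         + d2 * dt * (s3 - s2);
}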



double[] edges = new double[rows + 1];
for (int r = 0; r <= rows; r++)
{
	// Old Easing
	// edges[r] = Math.Pow(10, logMin + r * logStep);
	
	// New Easing
    double t = r / (double)rows; // How far in the frequencies we are
    double easedT = TriEase(t); // Find where we should be instead (mapped bin)
    double logF = logMin + easedT * (logMax - logMin); // Map the T to the frequency range using the log fMin and log fMax
    edges[r] = Math.Pow(10, logF); // Convert back from log
}



// Our Easing function. You could swap this func out to adjust how frequencies are mapped to bins
static double TriEase(double t, double lowMid = 0.3, double highMid = 0.7)
{
	// We split by percent, so 40% of the frequencies from 20-20000 are considered the low end
    lowMid = 0.40;  // 40% lows (overrides the default parameter while tuning)
    highMid = 0.95; // 95% (55% mid section, 5% highs)
    var transitionWidth = 0.02; // Smoothness value between low / mid / high

    if (t <= 0.0) return 0.0;
    if (t >= 1.0) return 1.0;

	// Compute derivatives at boundary points for smooth matching
    if (t < lowMid - transitionWidth)
    {
        // Low section
        double x = t / lowMid;
        return 0.5 * Math.Pow(x, 0.5);
    }
    else if (t < lowMid + transitionWidth)
    {
        // Smooth transition from low to mid
        double t1 = lowMid - transitionWidth;
        double t2 = lowMid + transitionWidth;

        // Values at transition points
        double v1 = 0.5 * Math.Pow(t1 / lowMid, 0.5);
        double v2 = 0.5 + ((t2 - lowMid) / (highMid - lowMid)) * 0.4;

        // Derivatives at transition points  
        double d1 = 0.5 * 0.5 * Math.Pow(t1 / lowMid, -0.5) / lowMid;
        double d2 = 0.4 / (highMid - lowMid);

        return CubicHermite(t, t1, v1, d1, t2, v2, d2);
    }
    else if (t < highMid - transitionWidth)
    {
        // Mid section
        double x = (t - lowMid) / (highMid - lowMid);
        return 0.5 + x * 0.4;
    }
    else if (t < highMid + transitionWidth)
    {
        // Smooth transition from mid to high
        double t1 = highMid - transitionWidth;
        double t2 = highMid + transitionWidth;

        // Values at transition points
        double v1 = 0.5 + ((t1 - lowMid) / (highMid - lowMid)) * 0.4;
        double v2 = 0.9 + 0.1 * Math.Pow((t2 - highMid) / (1 - highMid), 0.9);

        // Derivatives at transition points
        double d1 = 0.4 / (highMid - lowMid);
        double d2 = 0.1 * 0.9 * Math.Pow((t2 - highMid) / (1 - highMid), -0.1) / (1 - highMid);

        return CubicHermite(t, t1, v1, d1, t2, v2, d2);
    }
    else
    {
        // High section  
        double x = (t - highMid) / (1 - highMid);
        return 0.9 + 0.1 * Math.Pow(x, 0.9);
    }
}


This also solves the third issue: because we compress the highs, their combined energy makes them more visible. The last improvement happens within the normalization + clamping step. Using a logistic sigmoid function we can gate out extra noise that washes out the spectrum's details. If we hard-gate the noise, say at 0.4, that audio gets chopped and shows up as a visible wall in the waveform (0-1 becomes more like 0.1-1). So we use the sigmoid as a soft gate that quickly ramps down the sounds we want muted. (This applies more to live audio visualizations; pre-generated ones don't pick up as much noise bleed. I'm not sure why that is, though.)


// The Normalizing + clamping section
var frame = new double[rows];
for (int r = 0; r < rows; r++)
{
    if (binCnt[r] == 0) { frame[r] = 0; continue; }

    // power -> amplitude -> dB
    double rms = Math.Sqrt(power[r] / binCnt[r]) * Math.Sqrt(binCnt[r]);
    double db = 20 * Math.Log10(rms + 1e-20);

    // Normalize the DB within the range
    double topDb = InstanceOptions._dbFloor + InstanceOptions._dbRange;
    double dbNorm = Math.Clamp((db - InstanceOptions._dbFloor) / InstanceOptions._dbRange, 0, 1);

	// New code!
	// Use the soft gate to remove noise while smoothing out the cliff a hard gate would leave
    double t = r / (double)(rows - 1); // bar position 0..1 (reserved for per-bar tuning)
    frame[r] = Math.Clamp(ApplySoftGate(dbNorm, t), 0, 1);
}



static double ApplySoftGate(double x, double t)
{
    double center = 0.4; // How high the gate sits
    double steep = 15.0; // How steep the rolloff is (lower leads to more uniform waveforms)
    return 1.0 / (1.0 + Math.Exp(-steep * (x - center))); // logistic sigmoid; t is currently unused
}
}


Normalized Visualizer Gif

Looking much better! BTW, both gifs show 0:13->0:28 of Caleb Belkin - For Her.


Visualizations using FVZ

Remember when I said "...it should be easy for developers interfacing with the server to be able to implement features... It's not the only visualization feature I want though."? Well, after working on a visualizer UI for so long, I thought about how visualizers could work in Melon. The Fast Fourier Transform is a faster-to-compute version of the Discrete Fourier Transform, a formula that can split a combined wave back into its component waves. Sound is just a wave made of the combined frequencies in the signal, so we can split songs into their frequency spectrums by running this formula over a small window of the audio that is playing right now. The original Discrete Fourier Transform is O(N^2); the FFT brings this down to O(N log N). That's an improvement, but it's still expensive, and the total work is mostly driven by Resolution x Number of Frames. On my PC (RTX 4080 + Ryzen 7900X3D + 32GB of memory), 60 seconds of audio at 8192 resolution and 60fps takes 2.6667 seconds to compute. That's only 0.7408ms per FFT, but running the FFT constantly in real time takes a toll. It's a costly operation requiring lots of multiply and add instructions on the CPU, and running it in real time while the system works on other tasks can be overwhelming, especially on mobile/embedded systems or systems without hardware acceleration. It can be a real battery drain.
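
Numbers like those come from quick-and-dirty benchmarks along these lines; FFT.Forward is the same FFTSharp call used in the pipeline above, and exact timings will vary wildly by machine:

// Rough benchmark sketch: time 3,600 FFTs of 8192 samples (60s of audio at 60fps).
using System;
using System.Diagnostics;
using FftSharp;

var rng = new Random(1);
double[] samples = new double[8192];
for (int i = 0; i < samples.Length; i++)
    samples[i] = rng.NextDouble() * 2 - 1; // fake audio in -1..1

int frames = 60 * 60; // 60 seconds at 60 fps
var sw = Stopwatch.StartNew();
for (int i = 0; i < frames; i++)
    _ = FFT.Forward(samples); // returns the Complex[] spectrum

sw.Stop();
Console.WriteLine($"{frames} FFTs in {sw.Elapsed.TotalSeconds:0.###}s " +
                  $"({sw.Elapsed.TotalMilliseconds / frames:0.####}ms per FFT)");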


So I decided to try to create a file format that could store pre-generated Frequency VisualiZations. This format needed a few things:

  • Flexibility
    • Users may want visualizations to feel different, or may have audio that needs special treatment; for example, quieter audio will need a lower dB floor.
  • Compression
    • Storing/sending data for pre-generated FFT frames takes a lot of space. First tests with uncompressed data saved Fox Stevenson - What Are You (Wow).flac at 250x8192@120 (Bars x Res @ FPS) as 24,348kb, which is roughly half the file's original size (a lot of extra data)
  • Fast
    • I have a LOT of music that will need pre-generating, and likely so will many people using Melon. So it needs to encode as fast as I can get it, and decode even faster on clients.

So say hello to the .fvz file! The encoder library allows customizing tons of options about the visualization's bars, but ultimately it just stores a header with some metadata and an array of arrays of floats from 0->1. Like I mentioned earlier, it's very large uncompressed, but luckily there are a few tricks I was able to use to compress it further.


You can enable compression steps in custom patterns. First is standard Zstd level-3 compression, which nets a large reduction off the bat. Next is quantization: our floats start as 32-bit, but we can quantize them down to 16 or 8 bits without losing too much information (4-bit starts to look bad). Lastly I've added delta encoding from frame to frame: instead of storing the actual value of each bar in each frame, we store the difference from the last frame. This only really helps when the change from frame to frame is small enough to keep variance across the file low enough for Zstd to optimize. Below is a size comparison chart using Fox Stevenson - What Are You (Wow).flac at 250x8192@120 (Bars x Res @ FPS)


Compression Type     File Size
Original FLAC file   52,249 kb
Uncompressed         24,348 kb
Zstd 3               22,047 kb
+ Quant -> 16        13,491 kb
+ Quant -> 8          6,160 kb
+ Delta Encode        3,394 kb

I think that's about as small as I can get it for now, and it's small enough for my use case: not too much extra data, and faster to transmit than the song itself. The file in this example is 4:19 long and took roughly 4.7 seconds to generate. I've seen a range of 2-8 seconds across the other songs and test files I've used; it seems to stay at roughly 1 second per minute of audio. I would love to find a way to compress below 1mb though!


FFTVIS Encoder/Decoder


The frame generator is mostly the same as the one shown in the live audio visualizer, just changed to take its data from a file rather than a live stream, so I won't cover it here. The Encoder and Decoder use a bitmask + bool from the header to determine the compression used. An .fvz file header looks like this:


public struct Metadata // 36 bytes long w/ padding
{
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 6)]
    public byte[] magic; // 'FFTVIS'
    public int version; // 2 currently
    public uint fftResolution; // FFT generation Resolution
    public ushort numBands; // Number of bands generated for display
    public ushort frameRate; // FPS of generation
    public uint totalFrames; // Total # of frames contained
    public float maxAmplitude; // Maximum amplitude of the samples read
    public ushort compressionType; // 0001 - ZSTD, 0010 - Quant, 0100 - Deltas
    public bool quantizeLevel; // false - 16, true - 8
}
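
Reading those flags back on the decoder side is just bit tests. A sketch, with md standing in for a parsed Metadata instance:

// Sketch: decide which decode steps to run from the header's bitmask.
bool useZstd   = (md.compressionType & 0b0001) != 0; // ZSTD pass
bool useQuant  = (md.compressionType & 0b0010) != 0; // Quantization pass
bool useDeltas = (md.compressionType & 0b0100) != 0; // Delta pass
int  quantBits = md.quantizeLevel ? 8 : 16;          // bool picks 8- vs 16-bit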


The Zstd step is made easy thanks to ZstdNet:


// Convert our frames into a flat array 
double[] all = new double[totalSamples];
for (int i = 0; i < GeneratedFrames.Length; i++)
    Array.Copy(GeneratedFrames[i], 0, all, i * frameLen, frameLen);

// Copy the flat array over to its bytes
bytes = new byte[all.Length * sizeof(double)];
Buffer.BlockCopy(all, 0, bytes, 0, bytes.Length);

// Compress
using (var compressor = new Compressor())
{
    byte[] compressed = compressor.Wrap(bytes);
    fs.Write(BitConverter.GetBytes(compressed.Length));
    fs.Write(compressed, 0, compressed.Length);
}


Quantization is also quite easy, but I have two versions, signed and unsigned. Signed is used when delta encoding, since deltas run -1 -> 1 instead of 0 -> 1. They all share the same structure:


private static ushort[] Quantize16(double[] input)
{
    ushort[] output = new ushort[input.Length];
    for (int i = 0; i < input.Length; i++)
        output[i] = (ushort)Math.Clamp((int)Math.Round(input[i] * 65535), 0, 65535);
    return output;
}


However, the actual math changes:


// Unsigned 16
(ushort)Math.Clamp((int)Math.Round(input[i] * 65535), 0, 65535);
// Signed 16
(short)Math.Clamp((int)Math.Round((value * 2.0 - 1.0) * 32767.0), -32767, 32767);
// Unsigned 8
(byte)Math.Clamp((int)Math.Round(input[i] * 255), 0, 255);
// Signed 8
(sbyte)Math.Clamp((int)Math.Round((value * 2.0 - 1.0) * 127.0), -127, 127);
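
Decoding mirrors those formulas: divide by the same scale, and for the signed variants undo the -1 -> 1 shift. A sketch, with u16/s16/u8/s8 standing in for the quantized values:

// The decoder's inverse mappings (sketch), mirroring each formula above
double fromU16 = u16 / 65535.0;               // unsigned 16 -> 0..1
double fromS16 = (s16 / 32767.0 + 1.0) / 2.0; // signed 16 -> undo the *2-1 shift
double fromU8  = u8 / 255.0;                  // unsigned 8 -> 0..1
double fromS8  = (s8 / 127.0 + 1.0) / 2.0;    // signed 8 -> undo the shift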


Lastly, I added delta encoding to try to get better compression, but it rarely wins out and usually ends up the same size as without it. Where it wins is when there is little variance in the audio being played, or when the visualization's dB floor/range let in little audio.


double[] prevFrame = new double[frameLen]; // Last frame to get differences from
double[] allDeltas = new double[totalSamples]; // output deltas

for (int i = 0; i < GeneratedFrames.Length; i++)
{
    var frame = GeneratedFrames[i];

    // Compute deltas between values 
    // (quantizing deltas leads to drift so quantize first)
    for (int j = 0; j < frameLen; j++)
    {
        allDeltas[i * frameLen + j] = frame[j] - prevFrame[j];
    }

    // Update previous frame
    prevFrame = frame;
}

bytes = new byte[allDeltas.Length * sizeof(double)];
Buffer.BlockCopy(allDeltas, 0, bytes, 0, bytes.Length);


The decoder is all the same but in reverse! Zstd decompresses first, then quantized values are scaled back to floats, and deltas are summed back up to their original values.
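
As a sketch of that last step, here's the delta encoding undone after dequantizing, using a running sum per bar (the surrounding framing is assumed):

// Sketch of delta decoding: after Zstd decompression and dequantization we
// have per-frame differences; summing them per bar restores absolute values.
static double[][] UndoDeltas(double[][] deltaFrames, int frameLen)
{
    var frames = new double[deltaFrames.Length][];
    var running = new double[frameLen]; // starts at zeros, like the encoder's prevFrame

    for (int i = 0; i < deltaFrames.Length; i++)
    {
        var frame = new double[frameLen];
        for (int j = 0; j < frameLen; j++)
        {
            running[j] += deltaFrames[i][j]; // accumulate the difference back
            frame[j] = running[j];
        }
        frames[i] = frame;
    }
    return frames;
}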


FVZ Settings

When generating .fvz files with the FFTVIS Library, you get lots of control over how visualizations should be created. The following settings are available:

  • Bar Count
    • How many bars to bin frequencies into
  • DB Floor
    • The floor to ignore sound below, as a negative number. Lower lets in more sound; -70 -> -90 is typically recommended for music files
  • DB Range
    • The range of dB amplitudes being displayed. Lower values exaggerate peaks, while higher values smooth the waveform out. 70-120 is typical.
  • Frequency Min
    • The lowest frequency to display, typically 20hz.
  • Frequency Max
    • The maximum frequency to display, typically 20000hz.
  • Smoothness
    • How much to smooth out peaks, by averaging +/- Smoothness bars.
  • Bin Map
    • How to map frequencies to bins
    • FFTVIS's C# Library supports Log10, Mel, and the custom Normalized preset. .fvz files are agnostic to this though, all they need to know is what the data looks like after. So you could add custom bin maps if mine aren't what you're looking for.
  • FFT Resolution
    • The window of samples to run FFT analysis on, MUST BE A POWER OF 2
    • Typically 2048, 4096, 8192, 16384
  • FPS
    • The frame rate to render at. Typically there's no need to exceed 240, and it's not worth going below 60.
  • Compression Type
    • A bitmask (in the C# lib, an Enum flag object) that describes which steps to use for encoding and decoding the file w/ compression.
    • 0001 - ZSTD, 0010 - Quant, 0100 - Deltas
  • Quantization Level
    • 16 bit or 8 bit, only used if the bitmask is set for Quantization

FVZ Examples

I've made a quick JS implementation of the decoder, so you can enjoy live examples of real .fvz files being played back! There's no audio (gotta play nice with Mr. Copyright), but can you guess the songs being played? (You can check by clicking the blurred text below!)


  • Scatman John - Scatman's World
  • Caleb Belkin - for her.
  • Fox Stevenson - What Are You (Wow)

WASAPI Woes


Per Process

While I was working on the audio visualizer, just for fun, I decided to see if I could split up process audio and show different colors/visualizer settings for individual processes. Starting in Windows 10 Build 20348 (aka Windows Server 2022) and up, you can use AUDIOCLIENT_ACTIVATION_PARAMS to specify a filter with a process ID. I struggled a lot trying to get this to work by P/Invoking Windows APIs, and at the time NAudio hadn't implemented support yet. After some digging I ran into JustArion's PR with a working implementation, but still ran into trouble. After many frustrating hours of debugging, I discovered that instead of providing 32-bit floats like WASAPI normally does, per-process capture provides the audio as PCM-16. (I never realized that would be the case D:) After decoding it as PCM-16 it looks (mostly) correct! If you're looking to do anything similar in C#, you can follow the reference cpp implementation and write a wrapper around the WASAPIs, check out this library from qH0sT, or use JustArion's NAudio fork on the process-audio-capture branch. NAudio hasn't merged support just yet (these are all relatively new APIs and implementations). This also means that if you want to build FreqFreak, you'll need to reference JustArion's version of NAudio.


Timing

Another fun issue I ran into was timing. WASAPI will just give you audio as needed, so the UI system is designed to ask for the current frame as fast as it can render. However, when building the FVZ player into FreqFreak, I ran into issues playing back any file rendered above 60fps. Firstly, you can't easily delay a thread on Windows for less than about 15ms (60fps already needs ~16ms frames), because the default system timer resolution is roughly 15.6ms. If you try to sleep for less than that, you still wait the full ~15ms: the clock literally cannot see time between its ticks, so it won't know your 5ms have passed until it next updates, up to 10ms later.
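
You can watch this happen with a few lines; on a default-resolution Windows system (no timeBeginPeriod tweaking), each requested 5ms sleep comes back as roughly 15ms:

// Quick demo of the ~15.6ms timer granularity: ask for 5ms, measure what you get.
using System;
using System.Diagnostics;
using System.Threading;

var sw = new Stopwatch();
for (int i = 0; i < 5; i++)
{
    sw.Restart();
    Thread.Sleep(5); // request 5ms...
    sw.Stop();
    Console.WriteLine($"asked 5ms, got {sw.Elapsed.TotalMilliseconds:0.0}ms");
}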


The second problem is WASAPI itself. There is an odd bug that has persisted across every form of WASAPI I've used (WASADK, the C++ APIs, NAudio's wrapper). If you seek the song using the seek function, there is a large chance that instead of seeking to, say, 1:00, it seeks to 1:00 but SAYS it's at 1:06. This causes two problems: the current timestamp is invalid, and the song ends roughly 6 seconds early, because it thinks it's farther along than it is. I have no idea which audio files trigger it; it happens with FLAC at 44100, 48000, and 88000, with mp3, and with m4a. Truly I have no clue what the deal is. But I do have a (very stupid) workaround: why seek your file using the API when you can just cut your file at the bytes you want to skip, then tell Windows to shut up and play this "new file" from the beginning? If you never seek, it never loses position.


I hadn't hit issues with this workaround until now. It means I have to track time separately: not only can I not trust WASAPI's position, I can't just ask NAudio's WaveOut how far into the file it is (because it isn't playing the full file). NAudio's AudioFileReader knows (I use the OffsetSampleProvider to skip bytes for seeking) and returns a timestamp, but if you bombard it asking for the current time as fast as possible, it grinds the app down to 5fps or less. And as mentioned above, I can't ask the thread to wait, say, 4.16ms for 240fps, because the system clock can't see that time. So my implementation just asks, unthrottled, for the next frame as soon as it's done showing the last one, and calculates where we are from when we started (or last seeked, and to where) versus now. The difference between start and now is the elapsed time, i.e. the position in the song, which determines the current frame to pull. This became a custom PlaybackTimer class. However, all of this is dumb; in JS I can just ask the audio player "hey, when are we?" and it just works, so thanks msft lol.
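
Here's a minimal sketch of that idea: AudioFileReader + OffsetSampleProvider handle the "cut the file" seek, and a Stopwatch-based timer (my stand-in for the real PlaybackTimer) decides which frame to draw. Only the NAudio types are real; the rest is illustrative.

// Sketch: trust our own wall clock for position instead of asking WASAPI.
using System;
using System.Diagnostics;
using NAudio.Wave;
using NAudio.Wave.SampleProviders;

class PlaybackTimer
{
    private readonly Stopwatch _sw = new();
    private TimeSpan _seekOffset; // where playback "started" within the song

    public void Start(TimeSpan seekOffset) { _seekOffset = seekOffset; _sw.Restart(); }

    // Position = where we last seeked to + wall time elapsed since then
    public TimeSpan Position => _seekOffset + _sw.Elapsed;

    // Which pre-generated frame should be on screen right now
    public int CurrentFrame(int fps) => (int)(Position.TotalSeconds * fps);
}

class Player
{
    static void Main()
    {
        var seekTo = TimeSpan.FromMinutes(1);
        var reader = new AudioFileReader("song.flac"); // hypothetical file
        // Skip into the song by reading past samples rather than seeking WASAPI
        var skipped = new OffsetSampleProvider(reader) { SkipOver = seekTo };

        using var output = new WaveOutEvent();
        output.Init(skipped);
        output.Play();

        var timer = new PlaybackTimer();
        timer.Start(seekTo);

        while (output.PlaybackState == PlaybackState.Playing)
        {
            int frame = timer.CurrentFrame(240); // e.g. a 240fps FVZ file
            Console.Write($"\rframe {frame}");   // ...draw that frame here...
        }
    }
}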


Lastly, I ran into a super small issue when building the FVZ encoder that also breaks higher frame rates. This one took me a while to figure out, partially because it happened at the same time as the timing issue above. Let's start with the original function for generating frames from a file. The goal is to take a known FPS + total time and decide how far to hop each frame in order to hit the FPS we want (then extract those samples to build into frames).


public bool GenerateFrames(IProgress<int>? progress = null)
{
    if (_audio.Length == 0) return false;

    // Hop forward by samplerate / FPS
    double hop = Math.Ceiling((double)_format.SampleRate / FPS);

    // How many frames will we need
    int frameCount = (int)Math.Ceiling(Math.Max(0, (_audio.Length - FFTResolution) / hop + 1));

    GeneratedFrames = new double[frameCount][]; // Storage

    // Parallel loop for SPEED
    Parallel.For(0, frameCount, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, i =>
    {
        var fb = CreateBuilder(); // FrameBuilder class
        // The window of data to process
        float[] window = _floatPool.Rent(FFTResolution);
        try
        {
            // What sample should we start at?
            int src = (int)Math.Floor(i * hop);
            if (_audio.Length < src + FFTResolution) // Not enough samples left for a full window
            {
                // Copy what remains and zero the tail (rented arrays aren't cleared)
                int remaining = _audio.Length - src;
                Array.Copy(_audio, src, window, 0, remaining);
                Array.Clear(window, remaining, FFTResolution - remaining);
            }
            else
            {
                // Copy normally
                Array.Copy(_audio, src, window, 0, FFTResolution);
            }
            GeneratedFrames[i] = fb.ProcessData(window, _format) ?? new double[BarCount];
        }
        finally
        {
            // Return the pooled array we rented so they don't charge us late fees
            _floatPool.Return(window);
        }
        progress?.Report(frameCount);
    });

    _md.totalFrames = (uint)frameCount;
    return true;
}


If you're smarter than me, you might have already caught the issue. At higher frame rates the hop size may be fractional, and eventually it has to become an int, because there are no fractional sample indexes. But in this attempt I did it wrong: I take the ceiling of the hop, making it an integer early (the Math.Floor on src then does nothing, since i * hop is already whole, and floor would have been the wrong choice anyway). By doing this, I quantize the hop before multiplying it by the frame index. That loses precision, and worse, the error compounds each frame: a fraction of a sample per frame becomes seconds of drift minutes into a song.


// Don't round your hop
// Ceiling leads to visuals being early
// Floor leads to them being late
double hop = Math.Ceiling((double)_format.SampleRate / FPS); 
int src = (int)Math.Floor(i * hop); // Don't use Floor, use Round


So those two lines are swapped out for these three. The hop size stays fractional, and, just as importantly, we use Round on the final sample index. Flooring or ceiling src would bias every frame consistently late or early; rounding flip-flops between slightly ahead and slightly behind by fractions of a sample, so it never becomes noticeable and the range stays bounded. For example, 60 seconds at 500 FPS (88.2 hop size) stays bounded within -0.4 -> 0.4 samples.


// Fractional hop
double hop = _format.SampleRate / (double)FPS; 

// Rounded src
double fracSrc = i * hop; 
int src = (int)Math.Round(fracSrc);


To "Show My Work" on the bounding, you can use this to watch the error value.


double fs = 44100.0; // Sample rate
double fps2 = 500.0; // Visualiser frame rate
double hop = fs / fps2; 

int frames = 60 * (int)fps2;
double errMin = 0.0, errMax = 0.0;

for (int i = 0; i < frames; i++)
{
    double exact = i * hop; // Ideal start (fractional)
    double rounded = Math.Round(exact); // Chosen integer sample
    double err = rounded - exact; // Instantaneous error

    if (err < errMin) errMin = err;
    if (err > errMax) errMax = err;
}

Console.WriteLine($"Error range after {frames} frames: {errMin:0.###} -> {errMax:0.###}");


Notes on Visualizations

Some interesting extra notes on things I've noticed playing around with visualization settings:

  • When using FVZ to generate, 8192 is typically perfect for 44100hz and 48000hz files. When playing 88khz files like The Weeknd - Hurry Up Tomorrow, 8192 looks more like 4096, so 16384 becomes a more valid FFT resolution.
  • When playing audio live, depending on application audio volume, -80 -> -120 can be good DB Floors, but when generating from a file direct, -60 -> -90 tends to work better.
  • DB Ranges of 40-60 are very peaky, 70-90 being balanced, and 100-130 being more uniform/flat
  • When using Normalized 20-20000hz is best, but in Log10 it can be helpful to cut out some lows to 40-60 to remove the odd stretching.
  • Live visuals are always more jittery than Generated.
  • Per-process audio visuals are far more jittery. My assumption is this is an effect of receiving PCM-16 instead of 32-bit float, but I don't know why, or how to fix it. The best workaround may be lowering attack/decay speeds for smoother playback.
  • There is a limit to how much you can "stretch" (add more bars) before bars like the lows turn into jagged/blocky representations, which suggests an upper "worth it" limit for bar count. It differs per FFT resolution, but in Live mode it is unreasonable to go higher than 16k resolution without "lag" from waiting for more samples to fill the buffer. In both modes, more than 16k res is almost never worth it (unless your sample rate is much higher; vinyl-grade 192000hz might need 32k), as it takes in too much of the song at a time and thus averages too much area to look accurate.

Wrap up and Downloads

SO! Lots of work went into this, and after multiple days of polishing, researching, and writing this blog, I'm ready to release it! There will be project pages with details + downloads for both FreqFreak, the Windows audio visualizer and FVZ player, and FFTVIS, the .fvz file format used to store pre-generated visuals, plus its C# library and JS decoder implementation. You can find links to them below or in the project section of this site :D


All the code for these projects is open source; feel free to download, use, modify, and build other systems with the tools found here! If you have suggestions or find any bugs, please submit an issue or a PR. I hope you enjoy them, and if you're working on similar projects, I hope some of the things I've learned and detailed here can be of assistance. Having desync from both dual timers and the float rounding was a nightmare lol; it's hard to place the cause when two parts of the pipeline each contribute their own level of drift.

