Monday, March 02, 2020

Generating Spectrograms with Neural Networks

In previous experiments, I used spectrograms instead of raw audio as inputs to neural networks while training them to recognize pitches, intervals, and chords.

I found that feeding the networks raw audio data went nowhere. Training was extremely slow, and losses plateaued at unacceptably high values. After switching to spectrograms, the networks started learning almost immediately -- it was quite remarkable!

This post is about generating spectrograms with neural networks.

These spectrograms were generated by a Neural Network

On Spectrograms


Spectrograms are 2-dimensional visual representations of slices of audio (or really, any signal.) On the x-axis of a spectrogram is time, and on the y-axis is frequency.

A Violin playing A4 (440hz)

Because the data is well correlated along both dimensions, spectrograms lend themselves well to both human analysis and convolutional neural networks.

So, I wondered, why can't the networks learn the spectrograms themselves? Under the covers, spectrograms are built with STFTs, which are entirely linear operations on data -- you slide a window over the data at some stride length, then perform a discrete Fourier transform to get the frequency components of the window.

Since the transformation is entirely linear, all you need is one network layer, no activations, no biases. This should theoretically collapse down to a simple regression problem. Right? Let's find out.

Generating Training Data


We start by synthesizing some training data. To keep things simple, let's assume that we want to generate spectrograms of 250ms of audio sampled at 8khz, which is 2000 samples. Round up (in binary) to 2048 to make things GPU-friendly.

The underlying STFT will use a Hanning window of size 256, an FFT size of 256, and a stride of 56 samples (i.e., an overlap of 200), producing 33x129 images. That's 33 frequency-domain slices along the time axis, each with 129 frequency bins (FFT bins 0 through 128).

Note that the spectrograms return complex values. We want to make sure the networks can learn to completely reconstruct both the magnitude and phase portions of the signal. Also note that we're going to teach our network how to compute hanning windows.

Here's the code to generate the training data -- we calculate batch_size (15,000) examples, each with 2048 samples and assign them to xs. We then calculate their spectrograms and assign them to ys (the targets.)

Note that we separate the real and imaginary components of the spectrogram and simply stack one atop the other. We also don't scale or normalize the data in any way. Let the network figure all that out! :-)
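Since the original code isn't reproduced here, a sketch of the data-generation step (assuming scipy; `random_tones` is a hypothetical synthesizer, and the hop of 56 samples / overlap of 200 is inferred from the 33x129 output shape):

```python
import numpy as np
from scipy import signal

SAMPLE_RATE = 8000
N_SAMPLES = 2048
BATCH_SIZE = 1000  # the post uses 15,000; smaller here to keep the demo light

def random_tones(n):
    # Hypothetical synthesizer: each example mixes a few random sine tones.
    t = np.arange(N_SAMPLES) / SAMPLE_RATE
    xs = np.zeros((n, N_SAMPLES), dtype=np.float32)
    for i in range(n):
        for freq in np.random.uniform(20, 4000, size=3):
            xs[i] += np.sin(2 * np.pi * freq * t + np.random.uniform(0, 2 * np.pi))
    return xs

def spectrograms(xs):
    # Complex STFT: Hanning window of 256, overlap 200 -> 129x33 per example.
    _, _, sxx = signal.spectrogram(
        xs, fs=SAMPLE_RATE, window='hann', nperseg=256, noverlap=200,
        detrend=False, mode='complex')
    # Stack real atop imaginary, then flatten: 2 * 129 * 33 = 8514 targets.
    stacked = np.concatenate([sxx.real, sxx.imag], axis=1)
    return stacked.reshape(len(xs), -1).astype(np.float32)

xs = random_tones(BATCH_SIZE)
ys = spectrograms(xs)  # shapes: (1000, 2048) and (1000, 8514)
```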

Building the Model


Now the fun part. We build a single-layer network with 2048 inputs for the audio slice, and row * col outputs for the image (times two to hold the real and imaginary components of the outputs.) Since the outputs are strictly a linear function of the inputs, we don't need a bias term or activation functions.


Again, this is really just linear regression. With, oh, about 17 million variables!
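A minimal sketch of such a model in Keras (assuming TensorFlow 2.x; the training call assumes `xs`/`ys` arrays prepared as described above):

```python
import tensorflow as tf

N_INPUTS = 2048
N_OUTPUTS = 2 * 129 * 33  # real + imaginary parts of the spectrogram, flattened

# One linear layer: no bias, no activation -- pure linear regression.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_INPUTS,)),
    tf.keras.layers.Dense(N_OUTPUTS, use_bias=False),
])
model.compile(optimizer='adam', loss='mse')
# model.fit(xs, ys, epochs=10, batch_size=256)  # xs/ys from the data step
```

That single weight matrix is 2048 x 8514 = 17,436,672 parameters, which matches the "about 17 million variables" count.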


This model trains very fast. In 4 epochs (about 80 seconds), the loss drops to 3.0e-08, which is sufficient for our experiments, and in 10 epochs (about 7 minutes), we can drop it all the way to 2.0e-15.


The Real Test


Our model is ready. Let's see how well this does on unseen data. We generate a slice of audio playing four tones, and compare scipy's spectrogram function with our neural network.
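The comparison can be sketched like this (the four frequencies are illustrative, and the commented lines assume a trained `model` like the one above):

```python
import numpy as np
from scipy import signal

SAMPLE_RATE = 8000
t = np.arange(2048) / SAMPLE_RATE
# Four illustrative tones mixed into one 2048-sample clip.
audio = sum(np.sin(2 * np.pi * f * t) for f in (440.0, 660.0, 880.0, 1320.0))

# Reference spectrogram from scipy.
_, _, reference = signal.spectrogram(
    audio, fs=SAMPLE_RATE, window='hann', nperseg=256, noverlap=200,
    detrend=False, mode='complex')

# Neural network version (assuming the trained model):
# predicted = model.predict(audio[np.newaxis, :]).reshape(2, 129, 33)
# predicted_complex = predicted[0] + 1j * predicted[1]

# Average magnitude per frequency bin -- peaks should sit at the four tones.
mags = np.abs(reference).mean(axis=1)
```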


Left: SciPy, Right: Neural Network

Wow, that's actually pretty good! However, when we look at a log-scaled version, you can see noise in the network-generated one.


Log-scaled spectrogram: Left: SciPy, Right: Neural Network

Maybe we can train it for a bit longer and try again.

Left: SciPy, Right: Neural Network

Oh yeah, that's way better!

Peeking into the Model


Okay, so we know that this works pretty well. It's worth taking a little time to dig in and see what exactly it learned. The best way to do this is by slicing through the layers and examining the weight matrices.


Lucky for us, there's just one layer with (2048, 8514) weights. The second dimension (8514) is just the flattened spectrogram (33 x 129 x 2) for each sample in the first. In the code above, we reshaped and transformed the data to make it easy to visualize.

Here it is below -- the weight maps for the first, 11th, 21st, and 31st slices (out of 33) of the output.


The vertical bands represent activated neurons. You can see how the bands move from left to right as they work on a 256-sample slice of the audio. But more interesting is the spiral pattern of the windows. What's going on there? Let's slice through one of the bands and plot just the inner dimension.


This is actually pretty cool -- each of the graphs below is a Hanning-Windowed sine wave of an integer frequency along each of the vertical bands. These sinusoids are correlated with the audio, one-by-one, to tease out the active frequencies in the slice of audio.

1, 5, and 10hz Sine Waves (Windowed)

To put it simply, those pretty spirally vertical bands are... Fourier Transforms!
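In fact, the weights a perfect network would converge to can be written down in closed form: each (frame, bin) output is a Hanning-windowed complex sinusoid placed at the frame's offset. A sketch (the hop of 56 samples is inferred from the 33x129 spectrogram shape):

```python
import numpy as np

N, WIN, HOP, BINS, FRAMES = 2048, 256, 56, 129, 33
window = np.hanning(WIN)

# weights[sample, frame, bin]: the exact linear map from 2048 audio
# samples to the complex STFT -- what the trained layer approximates.
weights = np.zeros((N, FRAMES, BINS), dtype=np.complex128)
for frame in range(FRAMES):
    start = frame * HOP
    for k in range(BINS):
        weights[start:start + WIN, frame, k] = (
            window * np.exp(-2j * np.pi * k * np.arange(WIN) / WIN))

# Slicing through one band, e.g. weights[:, 10, 5].real, shows the
# windowed sinusoids described above; outside its 256-sample window,
# each band is exactly zero.
```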

Learning the Discrete Fourier Transform


Exploring that network was fun; however, we must go deeper. Let's build a quick network to perform a 128-point DFT, without any windowing, and see if there's more we can learn.

This is a much simpler network, with only about 65k weights. It trains very fast, and works like a charm!
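A sketch of that network (assuming TensorFlow 2.x; stacking real and imaginary parts gives 256 inputs and 256 outputs, hence 256 x 256 = 65,536 weights):

```python
import numpy as np
import tensorflow as tf

N = 128  # DFT size

# Training data: random complex signals and their 128-point DFTs,
# with real and imaginary parts stacked side by side.
xs = np.random.uniform(-1, 1, size=(5000, 2 * N)).astype(np.float32)
dft = np.fft.fft(xs[:, :N] + 1j * xs[:, N:])
ys = np.concatenate([dft.real, dft.imag], axis=1).astype(np.float32)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2 * N,)),
    tf.keras.layers.Dense(2 * N, use_bias=False),  # 65,536 weights
])
model.compile(optimizer='adam', loss='mse')
model.fit(xs, ys, epochs=10, batch_size=256, verbose=0)

# The learned weight matrix approximates the DFT matrix; its columns
# are (approximately) complex sinusoids.
```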


Digging into the weights, you can clearly see the complex sinusoids used to calculate the Fourier transform.


Real (blue) and Imaginary (green) Components

If you look at the weight matrix as a whole, you see the same pattern we saw in the vertical bands of the spectrogram NN weights.


There's a lot more we can explore in these networks, but I should probably end here... this post is getting way too long.

Final Thoughts


It's impressive how well, and how quickly, this works. The neural networks we trained above are relatively crude, and there are techniques we can explore to optimize them.

For example, with the spectrogram networks -- instead of having it learn each FFT band independently for each window, we could use a different network architecture (like recurrent networks), or implement some kind of weight sharing strategy across multiple layers.

Either way, let me be clear: using Neural Networks to perform FFTs or generate spectrograms is completely impractical, and you shouldn't do it. Really, don't do it! It is, however, a great way to explore the guts of machine learning models as they learn to perform complicated tasks.



Friday, February 28, 2020

Time Frequency Duality

A particularly interesting characteristic of Fourier transforms is time-frequency duality. This duality exposes a beautiful, deep symmetry between the time and frequency domains of a signal.

For example, a sinusoid in the time domain is an impulse in the frequency domain, and vice versa.

Here's what a 1-second 20hz sine wave looks like. If you play this on your audio device, you'll hear a 20hz tone.



20hz Sine Wave

When you take the Fourier transform of the wave and plot the frequency domain representation of the signal, you get an impulse in the bin representing 20hz. (Ignore the tiny neighbours for now.)


Frequency Domain of 20hz Sine Wave

If you play this transformed representation out to your audio device, you'll hear a click, generated from the single impulse pushing the speaker's diaphragm. This is effectively an impulse signal.

Okay, let's create an impulse signal by hand -- a string of zeros, with a 1 somewhere in the middle. Play this on your speaker, and, again, you'll hear a click. This signal is no different from the previous transformed signal, except for maybe the position of the impulse.

So, check this out. If you take the FFT of the impulse and plot the frequency domain representation, you get... a sinusoid!
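This is easy to verify numerically -- a quick sketch with numpy:

```python
import numpy as np

n = 256
impulse = np.zeros(n)
impulse[32] = 1.0  # a single click, 32 samples in

spectrum = np.fft.fft(impulse)

# The magnitude is flat, while the real and imaginary parts are a cosine
# and a (negative) sine whose frequency depends on the impulse position.
assert np.allclose(np.abs(spectrum), 1.0)
assert np.allclose(spectrum.real, np.cos(2 * np.pi * 32 * np.arange(n) / n))
```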



This works both ways. You can take the inverse FFT of a sine wave in the frequency domain to produce an impulse in the time domain.


Inverse Fourier Transform of a Sine Wave

This is a wonderfully striking phenomenon, which I think reveals a lot about our perception of nature.

For example, here's another property of time-frequency duality -- convolutions in the time domain are multiplications in the frequency domain, and vice versa. Because multiplications require far fewer operations than convolutions, it's much simpler to operate on frequency domain representations of signals.

Your inner ear consists of lots of tiny hairs that vary in thickness and resonate at different frequencies, sending frequency-domain representations of sound to your brain -- i.e., your ear evolved a little DSP chip to make things easier on your brain.

Saturday, February 22, 2020

Pitch Detection with Convolutional Networks

While working on Pitchy Ninja and Vexflow, I explored a variety of different techniques for pitch detection that would also work well in a browser. Although I settled on a relatively well-known algorithm, the exploration took me down an interesting path -- I wondered if you could build neural networks to classify pitches, intervals, and chords in recorded audio.

Turns out the answer is yes. To all of them.

This post details some of the techniques I used to build a pitch-detection neural network. Although I focus on single-note pitch estimation, these methods seem to work well for multi-note chords too.

On Pitch Estimation


Pitch detection (also called fundamental frequency estimation) is not an exact science. What your brain perceives as pitch is a function of lots of different variables, from the physical materials that generate the sounds to your body's physiological structure.

One would presume that you can simply transform a signal to its frequency domain representation, and look at the peak frequencies. This would work for a sine wave, but as soon as you introduce any kind of timbre (e.g., when you sing, or play a note on a guitar), the spectrum is flooded with overtones and harmonic partials.

Here's a 330ms spectrogram of the note A4 (440hz) played on a piano. You can see a peak at 440hz, and another around 1760hz.



Here's the same A4 (440hz), but on a violin.


And here's a trumpet.


Notice how the thicker instruments have rich harmonic spectrums? These harmonics are what make them beautiful, and also what make pitch detection hard.

Estimation Techniques


A lot of the well-understood pitch estimation algorithms resort to transformations and heuristics that amplify the fundamental and cancel out the overtones. Some more advanced techniques work by (kind of) fingerprinting timbres and then attempting to correlate them with the signal.

For single tones, these techniques work well, but they do break down in their own unique ways. After all, they're heuristics that try to estimate human perception.

Convolutional Networks


Deep convolutional networks have been winning image labeling challenges for nearly a decade, starting with AlexNet in 2012. The key insight in these architectures is that detecting objects requires some level of locality in pattern recognition, i.e., learned features should be agnostic to translations, rotations, intensities, etc. Convolutional networks learn multiple layers of filters, each capturing some perceptual element.



For example, the bottom layer of an image recognition network might detect edges and curves, the next might detect simple shapes, and the next would detect objects, etc. Here's an example of extracted features from various layers (via DeepFeat.)


Convolutional Networks for Audio


For audio feature extraction, time domain representations don't seem to be very useful to convnets. However, in the frequency domain, convnets learn features extremely well. Once networks start looking at spectrograms, all kinds of patterns start to emerge.

In the next few sections, we'll build and train a simple convolutional network to detect fundamental frequencies across six octaves.

Getting Training Data


To do this well, we need data. Labeled. Lots of it! There are a few paths we can take:

Option 1: Go find a whole bunch of single-tone music online, slice it up into little bits, transcribe and label.

Option 2: Take out my trusty guitar, record, slice, and label. Then my keyboard, and my trumpet, and my clarinet. And maybe sing too. Ugh!

Option 3: Build synthetic samples with... code!

Since, you know, the ultimate programmer virtue is laziness, let's go with Option 3.

Tools of the Trade


The goal is to build a model that performs well and generalizes well, so we'll need to account for as much of the variability in real audio as we can -- which means using a variety of instruments, velocities, effects, envelopes, and noise profiles.

With a good MIDI library, a patch bank, and some savvy, we can get this done. Here's what we need:
  • MIDIUtil - Python library to generate MIDI files.
  • FluidSynth - Renders MIDI files to raw audio.
  • GeneralUser GS - A bank of GM instrument patches for FluidSynth.
  • sox - To post-process the audio (resample, normalize, etc.)
  • scipy - For reading audio and generating spectrograms.
  • Tensorflow - For building and training the models.

All of these are open-source and freely available. Download and install them before proceeding.


Synthesizing the Data


We start with picking a bunch of instruments encompassing a variety of different timbres and tonalities.



Pick the notes and octaves you want to be able to classify. I used all 12 tones between octaves 2 and 8 (and added some random detunings.) Here's a handy class to deal with note to MIDI value conversions.
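The class itself isn't reproduced here, but a minimal version might look like this (the name and methods are hypothetical; A4 = MIDI 69 = 440hz is the standard mapping):

```python
NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

class Note:
    """Converts a note name + octave to MIDI values and frequencies."""

    def __init__(self, name, octave):
        self.name = name
        self.octave = octave

    def midi_value(self):
        # MIDI convention: C-1 is 0, so C4 (middle C) is 60 and A4 is 69.
        return 12 * (self.octave + 1) + NOTES.index(self.name)

    def frequency(self):
        # Equal temperament, tuned to A4 = 440hz.
        return 440.0 * 2 ** ((self.midi_value() - 69) / 12)
```

For example, `Note('A', 4).midi_value()` gives 69 and `Note('A', 4).frequency()` gives 440.0.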


The next section is where the meat of the synthesis happens. It does the following:
  • Renders the MIDI files to raw audio (wav) using FluidSynth and a free GM sound font.
  • Resamples to single-channel, unsigned 16-bit, at 44.1khz, normalized.
  • Slices the sample up into its envelope components (attack, sustain, decay.)
  • Detunes some of the samples to cover more of the harmonic surface.

Finally, we use the Sample class to generate thousands of different 330ms-long MIDI files, each playing a single note. The labels are part of the filename, and include the note, octave, frequency, and envelope component.



Building the Network


Now that we have the training data, let's design the network.

I experimented with a variety of different architectures before I got here, starting with simple dense (non-convolution) networks with time-domain inputs, then moving on to one-dimensional LSTMs, then two-dimensional convolutional networks (convnets) with frequency-domain inputs.

As you can guess, the 2D networks with frequency-domain inputs worked significantly better. As soon as I got decent baseline performance with them, I focused on incrementally improving accuracy by reducing validation loss.

Model Inputs


The inputs to the network will be spectrograms, which are 2D images representing a slice of audio. The X-axis is usually time, and the Y-axis is frequency. They're great for visualizing audio spectrums, but also for more advanced audio analysis.

A Church Organ playing A4 (440hz)


Spectrograms are typically generated with Short Time Fourier Transforms (STFTs). In short, the algorithm slides a window over the audio, running FFTs over the windowed data. Depending on the parameters of the STFT (and the associated FFTs), the precision of the detected frequencies can be tweaked to match the use case.

For this experiment, we're working with 44.1khz 16-bit samples, 330ms long -- which is about 14,500 data points per sample. We first downsample the audio to 16khz, yielding 5280 data points per sample.

The spectrogram will be generated via STFT, using a window size of 256, an overlap of 200, and a 1024 point FFT zero-padded on both sides. This yields one 513x90 pixel image per sample.
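With scipy, those parameters look roughly like this (scipy zero-pads the FFT at the end of the window rather than symmetrically, so treat the padding detail as approximate):

```python
import numpy as np
from scipy import signal

SAMPLE_RATE = 16000            # after downsampling from 44.1khz
audio = np.random.randn(5280)  # stand-in for one 330ms sample

freqs, times, sxx = signal.spectrogram(
    audio, fs=SAMPLE_RATE, window='hann',
    nperseg=256, noverlap=200, nfft=1024)

# sxx.shape == (513, 90): 513 frequency bins x 90 time slices
```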

The 1024-point FFT also caps the resolution to about 19hz, which isn't perfect, but fine for distinguishing pitches.

The Network Model


Our network consists of 4 convolutional layers, with 64, 128, 128, and 256 filters respectively, which are then immediately downsampled with max-pooling layers. The input layer reshapes the input tensors by adding a channels dimension for Conv2D. We close out the model with two densely connected layers, and a final output node for the floating-point frequency.

To prevent overfitting, we regularize by aggressively adding dropout layers, including one right at the input which also doubles as an ad-hoc noise generator.
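Here's a sketch of that architecture in Keras (assuming TensorFlow 2.x; the kernel sizes, dense-layer widths, and dropout rates are guesses, since the post's exact values aren't reproduced here):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(513, 90)),
    tf.keras.layers.Reshape((513, 90, 1)),  # add a channels dim for Conv2D
    tf.keras.layers.Dropout(0.1),           # input dropout as ad-hoc noise
    tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(256, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1),  # the predicted frequency, in hz
])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
```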



Although we use mean-squared-error as our loss function, it's the mean-absolute-error that we need to watch, since it's easier to reason about. Let's take a look at the model summary.



Wow, 12 million parameters! Feels like a lot for an experiment, but it turns out we can train a model in less than 10 minutes on a modern GPU. Let's start training.


After 100 epochs, we can achieve a validation MSE of 0.002, and a validation MAE of 0.03.


You may be wondering why the validation MAE is so much better than the training MAE. This is because of the aggressive dropout regularization. Dropout layers are only activated during training, not prediction.

These results are quite promising for an experiment! For classification problems, we could use confusion matrices to see where the models mispredict. For regression problems (like this one), we can explore the losses a bit more by plotting a graph of errors by pitch.




Prediction Errors by Pitch

Already, we can see that the prediction errors are concentrated in the highest octaves. This is very likely due to our downsampling to 16khz, causing aliasing in the harmonics and confusing the model.

After discarding the last octave, we can take the mean of the prediction error, and what do we see?
np.mean(np.nan_to_num(errors_by_key[0:80]))
19.244542657486097

Pretty much exactly the resolution of the FFT we used. It's very hard to do better given the inputs.

The Real Test


So, how does this perform in the wild? To answer this question, I recorded a few samples of myself playing single notes on the guitar, and pulled some youtube videos of various instruments and sliced them up for analysis. I also crossed my fingers and sacrificed a dozen goats.

As hoped, the predictions were right within the tolerances of the model. Try it yourself and let me know how it works out.

Improvements and Variations


There are a few things we can do to improve what we have -- larger FFT and window sizes, higher sample rates, better data, etc. We can also turn this into a classification problem by using softmax at the bottom layer and training directly on musical pitches instead of frequencies.

This experiment was part of a whole suite of models I built for music recognition. In a future post I'll describe a more complex set of models I built to recognize roots, intervals, and 2-4 note chords.

Until then, hope you enjoyed this post. If you did, drop me a note at @11111110b.

All the source code for these experiments will be available on my Github page as soon as it's in slightly better shape.


Wednesday, February 19, 2020

No Servers, Just Buckets: Hosting Static Websites on the Cloud


For over two decades, I've hosted websites on managed servers. Starting with web hosting providers, going to dedicated machines, then dedicated VMs, then cloud VMs. Maintaining these servers tends to come at a high cognitive cost -- machine and network setup, OS patches, web server configuration, replication and high-availability, TLS and cert management, security... the list goes on.

Last year, I moved [almost] [all] [my] [websites] to cloud buckets, and it has been amazing! Life just got simpler. With just a few commands I got:

  • An HTTP(S) web server hosting my content.
  • Managed TLS certificates.
  • Compression, Caching, and Content Delivery.
  • Replication and High availability.
  • IPv6!
  • Fewer headaches, and more spending money. :-)

If you don't need tight control over how your data is served, I would strongly recommend that you host your sites on Cloud Buckets. (Yes, of course, servers are still involved, you just don't need to worry about them.)

In this post, I'll show you how I got the float64 website up and serving in almost no time.

What are Cloud Buckets?


Buckets are a storage abstraction for blobs of data offered by cloud providers. E.g., Google Cloud Storage or Amazon S3. Put simply, they're a place in the cloud where you can store directories of files (typically called objects.)

Data in buckets is managed by cloud providers -- they take care of all the heavy lifting around storing, replicating, backing up, and serving the data. You can access this data with command line tools, via language APIs, or from the browser. You can also manage permissions, ownership, replication, retention, encryption, and audit controls.

Hosting Websites on Cloud Buckets


Many cloud providers now allow you to serve files (sometimes called bucket objects) over the web, and let you distribute content over their respective CDNs. For this post, we'll upload a website to a Google Cloud Storage bucket and serve it over the web.

Make sure you have your Google Cloud account setup, command-line tools installed, and are logged in on your terminal.

gcloud auth login
gcloud config set project <your-project-id>


Create your storage bucket with gsutil mb. Bucket names must be globally unique, so you'll have to pick something no one else has used. Here I'm using float64 as my bucket name.

gsutil mb gs://float64

Copy your website content over to the bucket. We specify '-a public-read' to make the objects world-readable.

gsutil cp -a public-read index.html style.css index.AF4C.js gs://float64

That's it. Your content is now available at https://storage.googleapis.com/<BUCKET>/index.html. Like mine is here: https://storage.googleapis.com/float64/index.html.

Using your own Domain


To serve data over your own domain using HTTPS, you need to create a Cloud Load Balancer (or use an existing one.) Go to the Load Balancer Console, click "Create Load Balancer", and select the HTTP/HTTPS option.


The balancer configuration has three main parts: backend, routing rules, and frontend.

For the backend, select "backend buckets", and pick the bucket that you just created. Check the 'Enable CDN' box if you want your content cached and delivered over Google's worldwide Content Delivery Network.



For the routing rules, simply use your domain name (float64.dev) in the host field, your bucket (float64) in the backends field, and /* in Paths to say that all paths get routed to your bucket.

Finally, for the frontend, add a new IP address, and point your domain's A record at it. If you're with the times, you can also add an IPv6 address, and point your domain's AAAA record at it.



If you're serving over HTTPS, you can create a new managed certificate. These certs are issued by Let's Encrypt and managed by Google (i.e., Google takes care of attaching, verifying, and renewing them.) The certificates take about 30 minutes to propagate.

Save and apply your changes, and your custom HTTPS website is up! A few more odds and ends before we call it a day.

Setup Index and Error Pages


You probably don't want your users typing in the name of the index HTML file (https://float64.dev/index.html) every time they visit your site. You also probably want invalid URLs showing a pretty error page.

You can use gsutil web to configure the index and 404 pages for the bucket.

gsutil web set gs://my-super-bucket -m index.html -e 404.html

Caching, Compression, and Content Delivery


To take advantage of Google's CDN (or even simply to improve bandwidth usage and latency), you should set the Cache-Control headers on your files. I like to keep the expiries for the index page short, and everything else long (of course, also adding content hashes to frequently modified files.)

We also want to make sure that text files are served with gzip compression enabled. The -z flag compresses the file, and sets the content-encoding to gzip while serving over HTTP(s).

gsutil -h "Cache-control:public,max-age=86400" -m \
  cp -a public-read -z js,map,css,svg \
    $DIST/*.js $DIST/*.map $DIST/*.css \
    $DIST/*.jpg $DIST/*.svg $DIST/*.png $DIST/*.ico \
    gs://float64


gsutil -h "Cache-control:public,max-age=300" -m \
  cp -a public-read -z html \
  $DIST/index.html gs://float64

If you've made it this far, you now have a (nearly) production-ready website up and running. Congratulations!

So, how much does it cost?


I have about 8 different websites running on different domains, all using managed certificates and the CDN, and I pay about $20 a month.

I use a single load balancer ($18/mo) and one IP address ($2/mo) for all of them. I get about 10 - 20k requests a day across all my sites, and bandwidth costs are in the pennies.

Not cheap, but not expensive either given the cognitive savings. And there are cheaper options (as you'll see in the next section).

Alternatives


There are many ways to serve web content out of storage buckets, and this is just one. Depending on your traffic, the number of sites you're running, and what kinds of tradeoffs you're willing to make, you can optimize costs further.

Firebase Hosting sticks all of this into one pretty package, with a lower upfront cost (however, the bandwidth costs are higher as your traffic increases.)

Cloudflare has a free plan and lets you stick an SSL server and CDN in front of your Cloud Storage bucket. However if you want dedicated certificates, they charge you $5 each. Also, the minimum TTL on the free plan is 2 hours, which is not great if you're building static Javascript applications.

And there are CloudFront, Fastly, and Netlify, all of which provide various levels of managed infrastructure -- still all better than running your own servers.

Caveats


Obviously, there's no free lunch, and good engineering requires making tradeoffs. Here are a few things to consider before you decide to migrate from servers to buckets:

  • Vendor lock-in. Are you okay with using proprietary technologies for your stack? If not, you're better off running your own servers.
  • Control and Flexibility. Do you want advanced routing, URL rewriting, or other custom behavior? If so, you're better off running your own servers.
  • Cost transparency. Although both Google and Amazon do great jobs with billing and detailed price breakdowns, their pricing structures are super complicated and can change on a whim.
For a lot of what I do, these downsides are well worth it. The vendor lock-in troubles me the most; however, it's not hard to migrate this stuff to other providers if I need to.

If you liked this, check out some of my other stuff on this blog.



Tuesday, July 12, 2016

New in VexFlow: ES6, Visual Regression Tests, and more!

Lots of developments since the last time I posted about VexFlow.


VexFlow is ES6


Thanks to the heroics of SilverWolf90 and AaronMars, and the help from many others, VexFlow's entire src/ tree has been migrated to ES6. This is a huge benefit to the project and to the health of the codebase. Some of the wins are:

  • Real modules, which allows us to extract explicit dependency information and generate graphs like this.
  • Const-correctness and predictable variable scoping with const and let.
  • Classes, lambda functions, and lots of other structural enhancements that vastly improve the clarity and conciseness of the codebase.


Part of the migration effort also involved making everything lint-clean, improving the overall style and consistency of the codebase -- see SilverWolf90's brief document on how here.


Visual Regression Tests


VexFlow now has a visual regression test system, and all image-generating QUnit tests are automatically included.

The goal of this system is to detect differences in the rendered output without having to rely on human eyeballs, especially given the huge number of tests that exist today. It does this by calculating a perceptual hash (PHASH) of each test image and comparing it with the hash of a known-good blessed image. The larger the arithmetic distance between the hashes, the more different the two images are.

The system also generates a diff image, which is an overlay of the two images, with the differences highlighted, to ease debugging. Here's an example of a failing test:



These tests are run automatically for all PRs, commits, and releases. Props to Taehoon Moon for migrating the regression tests from NodeJS to SlimerJS, giving us headless support and Travis CI integration. To find out more, read the Wiki page on Visual Regression Tests.


Native SVG


Thanks to the awesome contribution of Gregory Ristow, VexFlow now has a native SVG rendering backend, and the RaphaelJS backend has been deprecated. This not only reduces the overall size and bloat, but also hugely improves rendering performance.

The new backend is called Rendering.Backends.SVG with the code at Vex.Flow.SVGContext. Here is a quick example of how to use the new backend: https://jsfiddle.net/nL0cn3vL/2/.

Improved Microtonal Support


VexFlow now has better support for Arabic, Turkish, and other microtonal music via accidentals and key signatures. Thanks to infojunkie for a lot of the heavy lifting here, and to all the contributors in the GitHub issue.



Microtonal support is by no means complete, but this is a noteworthy step forward in the space.

Other Stuff


Lots of other stuff worth mentioning:

  • Support for user interactivity in SVG notation. You can attach event-handlers to elements (or groups of elements) and dynamically modify various properties of the score.
  • Improved bounding-box support.
  • Alignment of clef, time signature, and other stave modifiers during mid-measure changes.
  • Lots of improvements to the build system and Travis CI integration.
  • Lots of bug fixes related to beaming, tuplets, annotations, etc.

Many thanks to all the contributors involved!

Friday, May 02, 2014

New in VexFlow (May 2014)

Lots of commits into the repository lately. Thanks to Cyril Silverman for many of these. Here are some of the highlights:

Chord Symbols

This includes subscript/superscript support in TextNote and support for common symbols (dim, half-dim, maj7, etc.)

Stave Line Arrows

This is typically used in instructional material.

Slurs

Finally, we have slurs. This uses a new VexFlow class called Curve. Slurs are highly configurable.

Improved auto-positioning of Annotations and Articulations

Annotations and Articulations now self-position based on note, stem, and beam configuration.

Grace Notes

VexFlow now has full support for Grace Notes. Grace Note groups can contain complex rhythmic elements, and are formatted using the same code as regular notes.

Auto-Beam Improvements

Lots more beaming options, including beaming over rests, stemlet rendering, and time-signature aware beaming.

Tab-Stem Features

You can (optionally) render Tab Stems through stave lines.

That's all, Folks!

Monday, January 02, 2012

More K-Means Clustering Experiments on Images

I spent a little more time experimenting with k-means clustering on images and realized that I could use these clusters to recolor the image in interesting ways.

I wrote the function save_recolor to replace pixels from the given clusters (replacements) with new ones of equal intensity, as specified by the rgb_factors vector. For example, the following code will convert pixels of the first two clusters to greyscale.

> save_recolor("baby.jpeg", "baby_new.jpg", replacements=c(1,2),
               rgb_factors=c(1/3, 1/3, 1/3))

It's greyscale because the rgb_factors vector distributes the pixel intensity evenly among the channels. A factor of c(20/100, 60/100, 20/100) would shift 60% of each pixel's intensity into the green channel.

Let's get to some examples. Here's an unprocessed image, alongside its color clusters. I picked k=10. You can set k by specifying the palette_size parameter to save_recolor.

Here's what happens when I remove the red (the first cluster).

> save_recolor("baby.jpeg", "baby_new.jpg", replacements=1)

In the next image, I keep the red, and remove everything else.

> save_recolor("baby.jpeg", "baby_new.jpg", replacements=2:10)

Below, I replace the red cluster pixels, with green ones of corresponding intensity.

> save_recolor("baby.jpeg", "baby_new.jpg", replacements=1,
               rgb_factors=c(10/100, 80/100, 10/100))

And this is a fun one: Get rid of everything, keep just the grass.

> save_recolor("baby.jpeg", "baby_new.jpg", replacements=c(1,3:10))

I tried this on various images, using different cluster sizes, replacements, and RGB factors, with lots of interesting results. Anyhow, you should experiment with this yourselves and let me know what you find.

I should point out that nothing here is novel or new -- it's all well known in image processing circles. It's still pretty impressive what you can do when you apply simple machine learning algorithms to other areas.

Okay, as in all my posts, the code is available in my GitHub repository:

https://github.com/0xfe/experiments/blob/master/r/recolor.rscript

Happy new year!