======Queensland Prosody - Exploring the sounds of Queensland voices through machine learning======

This workshop is a bit of a provocation, intended to lead into deeper engagement with State Library's original oral history collection, with the potential for a community-developed AI Queensland Voice(s). It covers the existing state of the art, and how even extremely large models still struggle to capture the sound of a regional accent.

In this workshop we'll look at how the unique accents, rhythms and tones of Queensland voices from State Library's collection fit into the burgeoning world of Text To Speech (TTS). We'll search the collection for some original audio sources, explore the resources available at The Edge for audio restoration, and finally learn how to use free and open source machine learning software to create a realistic voice. We'll briefly cover the legal and ethical background to 'voice fakes' - and discover how hard it really is to make a machine speak [[https://en.wikipedia.org/wiki/Strine|Strine]].

====Voices in the Collection====

How to find them? Digitised oral histories are a good bet, as a transcript should also be available.

  * search for "oral history digital"
  * refine by "original materials"
  * sort by date - "oldest to newest"

Let's go with [[http://onesearch.slq.qld.gov.au/primo-explore/fulldisplay?docid=slq_alma21148530540002061&context=L&vid=SLQ&lang=en_US&search_scope=SLQ_PCI_EBSCO&adaptor=Local%20Search%20Engine&tab=all&query=any,contains,oral%20history%20digital&sortby=lso01|164 oral history interviews regarding the history of the Cape York Peninsula by interviewer Duncan Jackson]]

====Finding Digital Audio and Transcripts====

Let's have a listen to Duncan Jackson's interview with Kathleen Jackson. Click on [[164 oral history interviews regarding the history of the Cape York Peninsula by interviewer Duncan Jackson|online access]] to open the DigiTool viewer, then open:

  * [[http://hdl.handle.net/10462/mp3/741|mp3]]
  * [[http://hdl.handle.net/10462/pdf/5853|PDF transcript]]

====Copyright and ethical considerations====

The conditions of access and use are always listed in OneSearch and the DigiTool viewer. In this case we have unrestricted access, and the material is in copyright. You are free to use it for personal research and study; for other uses contact copyright@slq.qld.gov.au

======Speech Synthesis======

Like many of the 20th century's technological innovations, the first modern speech synthesiser can be traced back to the invention of the [[https://en.wikipedia.org/wiki/Vocoder|vocoder]] at [[https://en.wikipedia.org/wiki/Bell_Labs|Bell Labs]]. Derived from this, the [[https://en.wikipedia.org/wiki/Voder|Voder]] was demonstrated at the 1939 New York World's Fair.

{{ :workshops:public:machine_learning:uncanny_valley:voder_demonstrated_on_1939_new_york_world_fair_-_the_voder_fascinates_the_crowds_-_bell_telephone_quarterly_january_1940_.jpg?direct&600 |}}

((By Internet Archive Book Images - https://www.flickr.com/photos/internetarchivebookimages/14776509983/ Source book page: https://archive.org/stream/belltelephonemag19amerrich/belltelephonemag19amerrich#page/n78/mode/1up Reference [Fig.4] The Voder Fascinates the Crowds from: Williams, Thomas W. (January 1940) I. At the New York World's Fair. "Our Exhibits at Two Fairs". Bell Telephone Quarterly XIX (1): 65. "The Voder Fascinates the Crowds - The manipulative skill of the operator's fingers makes the Voder's voice almost too good to be true", No restrictions, https://commons.wikimedia.org/w/index.php?curid=43343073))
==== ====

{{ :workshops:public:machine_learning:uncanny_valley:homer_dudley_october_1940_._the_carrier_nature_of_speech_._bell_system_technical_journal_xix_4_495-515._--_fig.8_schematic_circuit_of_the_voder.jpg?direct&600 |}}

====Historical Audio Examples====

Here is a playlist of various historical TTS methods:

https://soundcloud.com/user-552764043

======Modern State of the Art TTS======

Now it's time to have some fun with TTS - check out the man holding the frog below...

https://vo.codes/#speak

==== ====

And have a listen to some interesting examples from pop/meme culture.

https://fifteen.ai/examples

==== ====

https://www.youtube.com/watch?v=drirw-XvzzQ

==== ====

=====Wavenet=====

Modern deep learning based synthesis started with the release of [[https://deepmind.com/blog/article/wavenet-generative-model-raw-audio|WaveNet]] in 2016 by Google's [[https://deepmind.com|DeepMind]]. Earlier TTS systems typically stitched together fragments of recorded speech; in DeepMind's words:

"WaveNet changes this paradigm by directly modelling the raw waveform of the audio signal, one sample at a time. As well as yielding more natural-sounding speech, using raw waveforms means that WaveNet can model any kind of audio, including music."((https://deepmind.com/blog/article/wavenet-generative-model-raw-audio))

=====Tacotron and Tacotron2=====

WaveNet was followed by Tacotron (also from Google) in 2017:

https://google.github.io/tacotron/publications/tacotron/index.html

and then Tacotron2:

https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html

=====The next wave - Diffusion=====

In April 2022 OpenAI dropped [[https://openai.com/dall-e-2/|DALL-E 2]], which uses diffusion models.

"Diffusion Models are generative models, meaning that they are used to generate data similar to the data on which they are trained. Fundamentally, Diffusion Models work by destroying training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process. After training, we can use the Diffusion Model to generate data by simply passing randomly sampled noise through the learned denoising process."

These models can be applied to TTS, with [[https://github.com/neonbjb/tortoise-tts|tortoise-tts]] producing some [[https://nonint.com/static/tortoise_v2_examples.html|excellent examples]] of generated speech.

=====Getting Started=====

======Google Colab======

Google's Colaboratory((https://colab.research.google.com/notebooks/intro.ipynb)), or "Colab" for short, allows you to write and execute Python in your browser, with:

  * Zero configuration required
  * Free access to GPUs
  * Easy sharing

====Python====

Python is an open source programming language that was made to be easy to read and powerful((https://simple.wikipedia.org/wiki/Python_(programming_language))). Python is:

  * a high-level language (meaning the programmer can focus on what to do instead of how to do it)
  * an interpreted language (interpreted languages do not need to be compiled to run)
  * often described as a "batteries included" language due to its comprehensive standard library

==== ====

A program called an interpreter runs Python code on almost any kind of computer. In our case Python will be interpreted by Google Colab, which is based on Jupyter notebooks.
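To get a feel for the language before we jump in, here is a tiny made-up example of the kind of Python you could paste into a single notebook cell and run:

<code python>
# a small taste of Python - run this in any notebook cell
voices = ["Kathleen Jackson", "Duncan Jackson"]

for name in voices:
    # f-strings let us mix variables into text
    print(f"Oral history interview with {name}")
</code>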
====Jupyter Notebooks====

Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text((https://jupyter.org/)). Usually Jupyter notebooks require set-up for a specific purpose, but Colab takes care of all this for us.

======Getting Started with Colab======

The only requirement for using Colab is (unsurprisingly) a Google account. Once you have a Google account, let's jump into our first ML example - [[https://github.com/deezer/spleeter|Spleeter]] - which we mentioned earlier. Go to the Colab here:

https://colab.research.google.com/github/deezer/spleeter/blob/master/spleeter.ipynb

====Making a Colab Copy====

The first step is to make a copy of the notebook to our Google Drive - this means we can save any changes we like.

{{:workshops:public:machine_learning:uncanny_valley:01_colab_spleeter.jpg?direct&400|}}

==== ====

This will trigger a Google sign-in

{{:workshops:public:machine_learning:uncanny_valley:02_colab_spleeter.jpg?direct&400|}}

==== ====

and then your copy will open in a new tab.

{{:workshops:public:machine_learning:uncanny_valley:03_colab_spleeter.jpg?direct&400|}}

====Select a Runtime====

Next we change our runtime (the kind of processor we use)

{{:workshops:public:machine_learning:uncanny_valley:04_colab_spleeter.jpg?direct&400|}}

==== ====

to a GPU to take advantage of Google's free GPU offer.

{{:workshops:public:machine_learning:uncanny_valley:04.5_colab_spleeter.jpg?direct&400|}}

==== ====

Now let's connect to our hosted runtime

{{:workshops:public:machine_learning:uncanny_valley:05_colab_spleeter.jpg?direct&400|}}

==== ====

and check the specs...

{{:workshops:public:machine_learning:uncanny_valley:06_colab_spleeter.jpg?direct&400|}}

=====Step Through the Notebook=====

Now it's time to actually use the notebook! Before we start, let's go over how notebooks work:

  * The notebook is divided into sections, with each section made up of cells.
  * These cells have code pre-entered into them.
  * A play button on the left runs (executes) the code in the cell.
  * The output of the cell is printed (or displayed) directly below each cell.
  * The output could be text, pictures, audio or video.

==== ====

Cells usually contain Python code, but can also contain commands for bash - the UNIX command line shell. Bash commands start with an exclamation mark ''!''.

===== =====

Our first section is called "Install Spleeter" and contains the bash command ''apt install ffmpeg''. This installs ffmpeg in our runtime, which is used to process audio. Press the play button...

{{:workshops:public:machine_learning:uncanny_valley:07_colab_spleeter.jpg?direct&400|}}

==== ====

ffmpeg will be downloaded and installed to our runtime.

{{:workshops:public:machine_learning:uncanny_valley:08_colab_spleeter.jpg?direct&600|}}

==== ====

Next we run ''pip'', the [[https://pypi.org/project/pip/|Python package manager]], to install the spleeter Python package.
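Taken together, the two install cells boil down to something like this (a sketch of the install step - the exact cell contents in the Spleeter notebook may differ slightly):

<code python>
# bash commands in a Colab cell start with "!"
# ffmpeg handles reading and writing the audio files
!apt install ffmpeg

# pip, the Python package manager, installs the spleeter package from PyPI
!pip install spleeter
</code>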
{{:workshops:public:machine_learning:uncanny_valley:09_colab_spleeter.jpg?direct&1200|}}

==== ====

This will take a while - and at the end we will get a message saying we need to restart our runtime due to some compatibility issues ((this is not unusual when using a hosted runtime)).

{{:workshops:public:machine_learning:uncanny_valley:10_colab_spleeter.jpg?direct&1200|}}

==== ====

Go ahead and restart.

{{:workshops:public:machine_learning:uncanny_valley:11_colab_spleeter.jpg?direct&600|}}

==== ====

Next is another bash command, ''wget'', which we use to (web)get our example audio file.

{{:workshops:public:machine_learning:uncanny_valley:12_colab_spleeter.jpg?direct&800|}}

==== ====

And the next cell uses the Python ''Audio'' command to give us a nice little audio player so we can hear our example.

{{:workshops:public:machine_learning:uncanny_valley:13_colab_spleeter.jpg?direct&600|}}

==== ====

Now it's finally time to use the spleeter tool with the ''separate'' command ((confusingly, we need to call it from bash with the exclamation mark)) as ''!spleeter separate'', and let's pass the ''-h'' flag ((a fancy way of saying option)) to show us the built-in help for the command.

{{:workshops:public:machine_learning:uncanny_valley:14_colab_spleeter.jpg?direct&800|}}

==== ====

Now that we know what we are doing, we run the tool for real, using the ''-i'' flag to define the input as our downloaded example, and the ''-o'' flag to define our output destination as the directory (folder) ''output''. By default spleeter will download and use the [[https://github.com/deezer/spleeter/wiki/2.-Getting-started#using-2stems-model|2stems model]].

{{:workshops:public:machine_learning:uncanny_valley:15_colab_spleeter.jpg?direct&1200|}}

==== ====

Another bash command, ''ls'' (list), shows us the contents of our output directory.

{{:workshops:public:machine_learning:uncanny_valley:16_colab_spleeter.jpg?direct&800|}}

==== ====

And finally another couple of ''Audio'' commands to hear our results!

{{:workshops:public:machine_learning:uncanny_valley:17_colab_spleeter.jpg?direct&800|}}

====Things to try====

Check out the [[https://github.com/deezer/spleeter/wiki/2.-Getting-started#separate-sources|usage instructions]] for the separate tool on the GitHub site and try your own 4stems and 5stems separations. Use your own audio files to test the separation.

======Speech to Text with Mozilla DeepSpeech======

Our next challenge will be to adapt the latest version of Mozilla's DeepSpeech for use in Google Colab. We will use the documentation here:

https://deepspeech.readthedocs.io/en/v0.8.0/USING.html#getting-the-pre-trained-model

to adapt this Colab notebook so that it runs the latest version of Mozilla DeepSpeech:

https://colab.research.google.com/github/tugstugi/dl-colab-notebooks/blob/master/notebooks/MozillaDeepSpeech.ipynb#scrollTo=4OAYywPHApuz

======Text to Speech with Mozilla TTS======

Our final example is TTS with Mozilla TTS:

https://colab.research.google.com/drive/1u_16ZzHjKYFn1HNVuA4Qf_i2MMFB9olY?usp=sharing#scrollTo=6LWsNd3_M3MP

You can dive straight into this and use it to generate speech. This example uses Tacotron2 and MultiBand-MelGAN models and the LJSpeech dataset.
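If you would rather call the models from code than click through the notebook, here is a minimal sketch using the pip-installable ''TTS'' package (the Coqui continuation of Mozilla TTS) rather than the notebook's own setup - the model name below is an assumption, so check ''tts --list_models'' to see what your installed version actually offers:

<code python>
# minimal text-to-speech sketch - uses the Coqui TTS package, not the notebook's manual setup
# install first with: !pip install TTS
from TTS.api import TTS

# load a Tacotron2 model trained on LJSpeech (model name is an assumption - see `tts --list_models`)
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# synthesise a sentence and save it as a wav file we can play back with Audio()
tts.tts_to_file(text="The quick brown fox jumps over the lazy dog.",
                file_path="tts_output.wav")
</code>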
====Run All Cells====

{{:workshops:public:machine_learning:uncanny_valley:01_melgan.png?400|}}

====Generate Speech====

{{:workshops:public:machine_learning:uncanny_valley:02_melgan.png?400|}}

=====Going Further=====

ML is such a big and fast-moving area of research that there are countless other ways to explore and learn. Here are a couple of two-minute videos to pique your interest:

  * [[https://www.youtube.com/watch?v=EjVzjxihGvU|Video restoration]]
  * [[https://www.youtube.com/watch?v=Lu56xVlZ40M|OpenAI Plays Hide and Seek]]

==== ====

Make sure you check out the resources on Lynda, which you have free access to as a State Library of Queensland member.

======Links======

https://machinelearningforkids.co.uk/#!/links#top

https://experiments.withgoogle.com/collection/ai

https://openai.com/blog/