I'll read my blog posts for you (voice clones using AI)
I can read this blog post to you thanks to AI.
I recently relaunched my personal website with the goal of writing down personal achievements, experiments, or other interesting things I've done or encountered. I've tried this a couple of times before but never really committed to it. This time I'm aiming to take it more seriously. The posts on the site before this one are older pieces, added to make the site feel less empty and to give me a canvas to work from for this post. I've experimented with loads of stuff over the years, much of which I never actually wrote down anywhere.
Anyway, this experimental project that I wanted to really try was adding natural text-to-speech narration to my posts — in my own voice.
Partly just because it sounded fun, and partly to prove the viability of it. I've been experimenting with and adapting to AI tools lately (like leveraging LLMs for day-to-day work and play), and the pace at which new tools are coming out is kind of wild. I'd seen ElevenLabs used to generate text-to-speech voice clones of people's own voices. The idea of having a spoken version of my blog — read in a voice that sounds eerily like mine — appealed to me, and the potential real-world use cases get me quite excited.
The real-world use cases I can actually think of:
- Accessibility: Audio can help readers with visual impairments or reading difficulties engage with content more easily. To be clear though, this wasn't my focus here, so the implementation you can see on selected posts is not accessibility focused. Furthermore, this is now a business as well as a social consideration with the introduction of the European Accessibility Act 2025.
- Brand cohesion in digital media: In a commercial media setting (e.g. videos, tutorials, podcasts), a single representative voice (or selection of voices) can be used consistently across content while freeing the people behind it up to do more valuable work (context depending, of course).
Making the Voice Clone
The process of cloning my voice was surprisingly straightforward. I just signed up for an ElevenLabs account, and within minutes I was recording audio samples through their web interface. They prompt you to read a handful of predefined text snippets — I think I spent about ten minutes going through them. A couple of hours later your voice is ready and you can start generating speech.
To be clear, they do recommend providing a lot more data for better results, especially if you’re aiming for something high-quality or production-ready. But since this was just a fun experiment, I wasn’t too worried about perfection. The resulting voice isn't flawless, but it's close enough to be recognizable — and kind of uncanny in a fun way. Because of the minimal input I gave, the voice has a slightly harsh tone, but it is unmistakably mine. The fact that it has managed to produce these results from the little audio I provided is truly impressive.
How it works (in broad strokes)
Here’s how I did it:
1. Take the raw blog post text
This website is a simple Django app, and blog posts are stored as plain Markdown files on disk. Each post has a small bit of metadata defined in a Python config dictionary, which makes things easy to iterate over in code. All posts are committed to version control, so changes are tracked and the whole system stays pretty portable.
from project.pages.config import PAGES
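The PAGES config itself isn't shown here, and its real fields are specific to my site. But based on how the management command uses it (the 'filename' and 'has_spoken_version' keys), a minimal hypothetical version would look something like this:

```python
# Hypothetical sketch of the PAGES config dictionary; field values are
# illustrative, but 'filename' and 'has_spoken_version' are the keys
# the management command actually reads.
PAGES = {
    "voice-clones-using-ai": {
        "title": "I'll read my blog posts for you",
        "filename": "voice-clones-using-ai.md",
        "has_spoken_version": True,  # opt this post in to TTS generation
    },
    "an-older-post": {
        "title": "An older post",
        "filename": "an-older-post.md",
        "has_spoken_version": False,
    },
}

# The command iterates the dict to find posts that want narration:
to_narrate = [slug for slug, page in PAGES.items() if page.get("has_spoken_version")]
print(to_narrate)  # ['voice-clones-using-ai']
```

Keeping the config as a plain Python dictionary means no database is involved, which is part of what keeps the whole site portable.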
2. Send it to OpenAI for TTS-friendly transformation
In the management command, I call OpenAI to make the text more suitable for audio. It’s especially important for handling things like code blocks, which don't translate well to speech.
⚠️ Important: Make sure you instruct the LLM to return only the text you intend to turn into speech; anything else will ruin your output.
ONLY RESPOND WITH THE TEXT TO BE SPOKEN. DO NOT INCLUDE ANY OTHER TEXT AS THE RESPONSE WILL BE PASSED DIRECTLY INTO A TEXT TO SPEECH SYSTEM
This prevents the model from wrapping its response in markdown, explanations, or apologies — which would end up being read aloud if included. Without this instruction, you’ll likely get unexpected meta-commentary in your TTS audio.
def transform_text_for_tts(self, text):
    client = OpenAI(api_key=settings.OPENAI_API_KEY)
    response = client.responses.create(
        model=settings.GPT_MODEL_NAME,
        instructions=(
            "Transform the following article content to be more suitable for text-to-speech narration. "
            "When encountering code blocks, provide a concise, high-level explanation of what the code does "
            "instead of reading it verbatim. Preserve the key points and ideas while making the text more naturally spoken with minimal changes. "
            "ONLY RESPOND WITH THE TEXT TO BE SPOKEN. DO NOT INCLUDE ANY OTHER TEXT AS THE RESPONSE WILL BE PASSED DIRECTLY INTO A TEXT TO SPEECH SYSTEM"
        ),
        input=text
    )
    return response.output_text
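Even with that instruction, models occasionally wrap output in a code fence anyway. My command trusts the response as-is, but if you wanted a defensive pass, a small hypothetical sanitiser could strip a stray markdown fence before the text reaches the TTS step:

```python
def strip_markdown_fences(text: str) -> str:
    """Remove a wrapping ``` code fence if the model added one anyway."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        lines = cleaned.splitlines()
        # Drop the opening fence line (which may carry a language tag)...
        lines = lines[1:]
        # ...and the closing fence if present.
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        cleaned = "\n".join(lines).strip()
    return cleaned

print(strip_markdown_fences("```text\nHello, listener.\n```"))  # Hello, listener.
```

Plain text passes through untouched, so it's a cheap safety net against paying ElevenLabs credits to narrate three backticks.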
3. Send the revised text to ElevenLabs
The revised text is then sent to ElevenLabs using their Python client. It returns a stream of audio data in MP3 format.
def generate_audio(self, tts_text):
    client = ElevenLabs(api_key=settings.ELEVENLABS_API_KEY)
    audio = client.text_to_speech.convert(
        text=tts_text,
        voice_id=settings.ELEVENLABS_VOICE_ID,
        model_id=settings.ELEVENLABS_MODEL_ID,
        output_format=settings.ELEVENLABS_OUTPUT_FORMAT,
    )
    return audio
ElevenLabs' API returns the MP3 file content as the response data, so audio is effectively the MP3 file.
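One detail worth noting: the client hands the audio back as a stream of byte chunks rather than one bytes object, which is why the upload step joins them before saving. In isolation (with a fake generator standing in for the API response), that looks like:

```python
def collect_audio(chunks) -> bytes:
    """Join a stream of MP3 byte chunks into a single payload for upload."""
    return b"".join(chunks)

# A fake generator standing in for client.text_to_speech.convert(...):
fake_stream = (chunk for chunk in [b"ID3", b"\x00audio", b"\x00data"])
mp3_bytes = collect_audio(fake_stream)
print(len(mp3_bytes))  # 14
```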
4. Upload the MP3 to a CDN
Anywhere you see files being saved, I'm using DigitalOcean Spaces (essentially an S3-compatible object store) via the django-storages integration. It's a great library and I've been using it for years.
def upload_tts_audio(self, slug, audio_file):
    file_path = os.path.join(settings.TRANSCRIPT_DIR, slug, settings.SPOKEN_TRACK_FILENAME)
    default_storage.save(file_path, ContentFile(b"".join(audio_file)))
    s3_client = get_s3_client()
    s3_client.put_object_acl(
        ACL='public-read',
        Bucket=settings.AWS_STORAGE_BUCKET_NAME,
        Key=file_path
    )
    return file_path

def upload_tts_text(self, slug, tts_text):
    file_path = os.path.join(settings.TRANSCRIPT_DIR, slug, settings.TRANSCRIPT_TEXT_FILENAME)
    default_storage.save(file_path, ContentFile(tts_text))
    return file_path
5. Embed it in the blog post
With the MP3 hosted and available, I just drop a standard <audio> element into the post template:
<audio controls>
<source src="/tts/my-post-title.mp3" type="audio/mpeg">
Your browser does not support the audio element.
</audio>
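The src above is illustrative. Since the MP3 actually lives on the CDN, the template needs the full public URL; a hypothetical helper mirroring the upload path (TRANSCRIPT_DIR/slug/SPOKEN_TRACK_FILENAME, prefixed with the CDN_URL setting) might look like:

```python
def spoken_track_url(cdn_url: str, transcript_dir: str, slug: str, filename: str) -> str:
    """Build the public CDN URL for a post's narration MP3.

    Mirrors the storage path used at upload time:
    <TRANSCRIPT_DIR>/<slug>/<SPOKEN_TRACK_FILENAME>.
    """
    return f"{cdn_url.rstrip('/')}/{transcript_dir}/{slug}/{filename}"

# Example with made-up values matching the settings shown below:
url = spoken_track_url("https://cdn.example.com", "pages", "my-post-title", "tts.mp3")
print(url)  # https://cdn.example.com/pages/my-post-title/tts.mp3
```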
That’s it. It's simple, but it proves this can work. The real effort in producing production-ready results lies in two places: the size and quality of the samples given to ElevenLabs to build the voice clone, and the prompt sent to OpenAI to adapt blog posts for being spoken aloud without changing their meaning or losing important context.
Relevant settings
If you want to piece together something similar, here are the relevant settings I’m using in settings.py:
# OpenAI settings
OPENAI_API_KEY = env('OPENAI_API_KEY', default=None) # Your OpenAI API key
GPT_TEMPERATURE = env('GPT_TEMPERATURE', default=0.7) # Temperature for model creativity
GPT_MODEL_NAME = env('GPT_MODEL_NAME', default='gpt-4.1-nano') # Model to use for transformation
GPT_MAX_TOKENS = env('GPT_MAX_TOKENS', default=1500) # Max token count to stay within limits
# ElevenLabs settings
ELEVENLABS_API_KEY = env('ELEVENLABS_API_KEY', default=None)
ELEVENLABS_VOICE_ID = env('ELEVENLABS_VOICE_ID', default=None)
ELEVENLABS_MODEL_ID = env('ELEVENLABS_MODEL_ID', default='eleven_multilingual_v2')
ELEVENLABS_OUTPUT_FORMAT = env('ELEVENLABS_OUTPUT_FORMAT', default='mp3_44100_128')
# File and transcript settings
TRANSCRIPT_DIR = "pages" # Base directory for storing transcripts/audio
SPOKEN_TRACK_FILENAME = "tts.mp3"
TRANSCRIPT_TEXT_FILENAME = "tts.txt"
# S3 and django-storages configuration
S3_BUCKET_NAME = env('S3_BUCKET_NAME', default=None)
S3_ENDPOINT_URL = env('S3_ENDPOINT_URL', default=None)
CDN_URL = env('CDN_URL', default=None) # Used to serve public files
S3_ACCESS_KEY_ID = env('S3_ACCESS_KEY_ID', default=None)
S3_SECRET_ACCESS_KEY = env('S3_SECRET_ACCESS_KEY', default=None)
ENABLE_S3_FOR_STORAGE = env('ENABLE_S3_FOR_STORAGE', default=None)
if ENABLE_S3_FOR_STORAGE:
    STORAGES['default'] = {
        "BACKEND": "storages.backends.s3.S3Storage",
        "OPTIONS": {
            "default_acl": "private",
        },
    }
    STORAGES['staticfiles'] = {
        "BACKEND": "storages.backends.s3.S3Storage",
        "OPTIONS": {
            "default_acl": "public-read",
        },
    }
AWS_ACCESS_KEY_ID = S3_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY = S3_SECRET_ACCESS_KEY
AWS_STORAGE_BUCKET_NAME = S3_BUCKET_NAME
AWS_S3_ENDPOINT_URL = S3_ENDPOINT_URL
For completeness, here’s the full management command that ties everything together:
# generate_tts_for_pages.py
import os

from django.core.management.base import BaseCommand
from django.core.files.storage import default_storage
from django.core.files.base import ContentFile
from django.conf import settings
from openai import OpenAI
from elevenlabs.client import ElevenLabs

from project.pages.config import PAGES
from project.common.botoclient import get_s3_client


class Command(BaseCommand):
    help = "Generate TTS audio for pages with a spoken version enabled."

    def handle(self, *args, **options):
        self.process_pages()

    def process_pages(self):
        for slug, page in PAGES.items():
            if not page.get('has_spoken_version', False):
                continue
            if not self.transcript_exists(slug):
                content = self.fetch_page_content(page['filename'])
                tts_ready_text = self.transform_text_for_tts(content)
                audio_file = self.generate_audio(tts_ready_text)
                self.upload_tts_text(slug, tts_ready_text)
                self.upload_tts_audio(slug, audio_file)

    def transcript_exists(self, slug):
        file_path = os.path.join(settings.TRANSCRIPT_DIR, slug, settings.SPOKEN_TRACK_FILENAME)
        return default_storage.exists(file_path)

    def fetch_page_content(self, filename):
        filepath = os.path.join(settings.PAGES_BASE_DIR, filename)
        with open(filepath, "r", encoding="utf-8") as file:
            return file.read()

    def transform_text_for_tts(self, text):
        client = OpenAI(api_key=settings.OPENAI_API_KEY)
        response = client.responses.create(
            model=settings.GPT_MODEL_NAME,
            instructions=(
                "Transform the following article content to be more suitable for text-to-speech narration. "
                "When encountering code blocks, provide a concise, high-level explanation of what the code does "
                "instead of reading it verbatim. Preserve the key points and ideas while making the text more naturally spoken with minimal changes. "
                "ONLY RESPOND WITH THE TEXT TO BE SPOKEN. DO NOT INCLUDE ANY OTHER TEXT AS THE RESPONSE WILL BE PASSED DIRECTLY INTO A TEXT TO SPEECH SYSTEM"
            ),
            input=text
        )
        return response.output_text

    def generate_audio(self, tts_text):
        client = ElevenLabs(api_key=settings.ELEVENLABS_API_KEY)
        audio = client.text_to_speech.convert(
            text=tts_text,
            voice_id=settings.ELEVENLABS_VOICE_ID,
            model_id=settings.ELEVENLABS_MODEL_ID,
            output_format=settings.ELEVENLABS_OUTPUT_FORMAT,
        )
        return audio

    def upload_tts_audio(self, slug, audio_file):
        file_path = os.path.join(settings.TRANSCRIPT_DIR, slug, settings.SPOKEN_TRACK_FILENAME)
        default_storage.save(file_path, ContentFile(b"".join(audio_file)))
        s3_client = get_s3_client()
        s3_client.put_object_acl(
            ACL='public-read',
            Bucket=settings.AWS_STORAGE_BUCKET_NAME,
            Key=file_path
        )
        return file_path

    def upload_tts_text(self, slug, tts_text):
        file_path = os.path.join(settings.TRANSCRIPT_DIR, slug, settings.TRANSCRIPT_TEXT_FILENAME)
        default_storage.save(file_path, ContentFile(tts_text))
        return file_path
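The get_s3_client helper imported from project.common.botoclient isn't shown above. Its exact contents are specific to my project, but a minimal sketch for DigitalOcean Spaces would wrap boto3.client with the endpoint and credentials from the settings listed earlier, something like:

```python
import boto3

from django.conf import settings


def get_s3_client():
    """Hypothetical sketch: a boto3 S3 client pointed at DigitalOcean Spaces.

    Spaces is S3-compatible, so the only real difference from plain AWS
    is supplying the Spaces endpoint_url.
    """
    return boto3.client(
        's3',
        endpoint_url=settings.S3_ENDPOINT_URL,
        aws_access_key_id=settings.S3_ACCESS_KEY_ID,
        aws_secret_access_key=settings.S3_SECRET_ACCESS_KEY,
    )
```

This is essentially configuration; the important part is that the same client is reused anywhere an object's ACL needs flipping to public-read after django-storages saves it privately.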