AI Tools

My Laptop, My LLMs: Running AI Models Locally in 2024

Ever wondered about running AI on your personal machine? Let me show you how to set up three powerful open-source LLMs right on your laptop, complete with practical tips and honest opinions.

Elena Márquez
By Elena Márquez · Editor-in-ChiefReviewed by Mira Chen · Published
8 min read2,061 views

Over 60% of open-source LLMs released in the past year are actually fine-tuned versions of existing models, not entirely new creations. Why does this matter to you? It means you can often get seriously good performance from models that fit neatly on your laptop — no need for custom hardware or expensive cloud services. I’m going to walk you through getting three top-notch open-source large language models (LLMs) running on your computer, ready for private, uncensored use. It's truly liberating.

What You'll Achieve and What You Need

By the end of this guide, you’ll have Llama 3 8B, Phi-3 Mini, and Mistral 7B – three powerful open-source LLMs – living on your personal machine. You'll be able to chat with them, sum up long articles, brainstorm ideas, and even dabble in basic code generation. All this, without an internet connection or those pesky API fees. For me, it feels like having a private AI sandbox.

Before we jump in, let's talk hardware. You absolutely need a machine with a decent GPU for anything beyond a snail's pace. I’d recommend at least 16GB of RAM, and critically, an NVIDIA GPU with 8GB of VRAM or more (12GB+ is ideal for a smoother experience). If you're on AMD, local LLMs are getting better, but you'll still hit more snags. Intel Arc GPUs are also slowly making headway. Just make sure your graphics drivers are updated; this is a common overlooked step. Finally, budget about 50-70GB of free disk space for the models and the software we'll use.

For software, we'll primarily use `ollama`. It's a fantastic, user-friendly tool that streamlines getting many different LLMs up and running. You'll need to download and install `ollama` from their official website (ollama.ai) beforehand. It works on macOS, Linux, and Windows. A basic understanding of your operating system's command line helps, but don't worry, I’ll give you the exact commands.

Ollama download page
Ollama download page

Step-by-Step Setup: Getting Your LLMs Running Locally

This process is surprisingly straightforward, a big thanks to `ollama`. Follow these steps, and you’ll be chatting with AI in minutes. I promise, it's easier than it sounds.

Step 1: Install Ollama

If you haven't yet, download and install Ollama for your operating system. The installation is usually a standard double-click-and-follow-the-prompts process. Once it’s installed, open your terminal (or command prompt on Windows) and type `ollama run llama3`. This command is pretty magical: it automatically grabs the Llama 3 8B model (the default size if you don't specify) and kicks off a chat session. That initial download might take 5-15 minutes, depending on your internet speed, since the model file is a few gigabytes.

Step 2: Download Additional Models

After you've had a quick chat with Llama 3, you can download more models. Just type `/bye` or press `Ctrl+D` to exit the Llama 3 session.

To get the compact but surprisingly capable Phi-3 Mini, type: `ollama run phi3`

For the versatile Mistral 7B: `ollama run mistral`

Ollama handles all the complex stuff like quantization and compatibility in the background. Each `ollama run <model_name>` command will start the download for that model if it's not already on your system. Want to see what’s out there? Check ollama.ai/library, or just type `ollama list` in your terminal to see your downloaded models.

Step 3: Interacting with Your Models

Once a model is downloaded, you can run it anytime by simply typing `ollama run <model_name>`. The session starts, and you can just type your prompts. Try out different prompts across the models. You might notice Llama 3 feels more verbose or creative, while Phi-3 Mini could be surprisingly efficient for simpler tasks. I personally find Mistral 7B an excellent general-purpose choice.

Here are some commands you can use within the chat session: - `/bye` or `Ctrl+D`: Exit the current chat session. - `/set parameter top_k 40`: Change a model parameter (e.g., control randomness with `temperature`). You can peek at the Ollama documentation for more on these. - `/?`: Show help.

Remember, your interaction is entirely local. No data ever leaves your machine, which is a huge plus for privacy.

Terminal chat with an LLM
Terminal chat with an LLM

What I'd Skip and Common Mistakes

After spending my fair share of time wrestling with these setups, I've learned a few things to watch out for or just outright avoid, especially if you're new to this.

1. Trying to run 70B+ models on consumer hardware: Unless you've got a workstation boasting 48GB+ of VRAM, stick to 7B-13B models. Anything bigger is generally too sluggish to be useful on a desktop GPU. I learned this the hard way with a 34B model that took minutes per token. Never again. 2. Neglecting GPU driver updates: This comes up more often than you'd think. Outdated drivers can cause crashes, poor performance, or prevent Ollama from even seeing your GPU. Update those drivers before you start messing with anything. 3. Expecting GPT-4 level performance from 8B models: Look, these models are incredible for their size, but they aren't magic. They shine at summarization, creative writing prompts, and basic coding, but demanding reasoning or complex problem-solving might still be beyond their current capabilities. Keep your expectations grounded, and you'll be happier. 4. Downloading every single model variant: Ollama’s library is vast and has many versions of each model (e.g., `llama3:70b`, `llama3:8b-instruct-q4_K_M`). For general use, stick to the default or `instruct` versions. The `q` variants refer to quantization levels (like `q4_K_M`), which are often good compromises between file size and quality. Don't overthink these unless you have a very specific need. 5. Forgetting about `--verbose`: If you're trying to debug an issue, appending `--verbose` after `ollama run <model_name>` often spits out helpful error messages or performance metrics. It’s a little detail, but it can save you hours of head-scratching.

Your Cost Reality Check

Running these LLMs locally has one major upfront benefit: zero dollars for the software and models themselves. For many solopreneurs and hobbyists, this completely changes the landscape.

However, your initial hardware investment is where the money goes. If you already own a decent machine (say, a gaming PC you bought for $1500-$2500 three years ago), your marginal cost for AI is effectively nothing. If you need to buy a new machine just for this, expect to spend anywhere from $1000 for a capable mid-range desktop (Ryzen 5/Intel i5, 16GB RAM, NVIDIA RTX 3060/4060) to $2500+ for a powerful workstation (Ryzen 7/Intel i7/i9, 32GB RAM, NVIDIA RTX 4080/4090). A used NVIDIA RTX 3090, for example, can often be found for $600-$800, offering 24GB of VRAM and excellent performance for local LLMs — often a better value than a new mid-range card, if you can find one. It's often the best bang for your buck.

Energy costs are pretty minimal for occasional use, maybe an extra $5-$10 on your monthly electricity bill if you’re running models constantly. Compare that to cloud API costs for equivalent usage, which could easily hit $50-$200+ per month for any serious work. Dedicated local hardware often pays for itself surprisingly quickly. For example, generating 1 million tokens with GPT-3.5 Turbo costs around $1.50, and 1 million with GPT-4 Turbo can be $10-$30. If you're hitting those numbers daily, local quickly becomes the cheaper option.

Alternatives Worth Considering

While Ollama is a fantastic starting point for me, other tools offer slightly different features or capabilities. It's good to know about these as you get more comfortable.

- LM Studio: This is a more GUI-focused application for Windows and macOS. It makes downloading and running models incredibly intuitive. Plus, it lets you easily set up a local API endpoint that mimics OpenAI's, which means it plays well with existing tools. - Text Generation WebUI (oobabooga/text-generation-webui): This is the heavier hitter, with more features for power users. It gives you extensive control over model parameters, integrates with various backends (like `transformers` and `llama.cpp`), and has a wide range of extensions for things like agents, image generation, and more complex workflows. - KoboldCpp: This tool specializes in `ggml` and `GGUF` format models. These are highly optimized for CPU inference but also support GPU acceleration. It’s an excellent choice for slightly older hardware or specific use cases where you need maximum efficiency.

What's Next?

Now that you have these powerful LLMs running on your machine, the real fun begins. Here are a few ideas for where to go next, based on what I’ve enjoyed exploring:

1. Experiment with fine-tuning: Look into tools like `axolotl` or even Ollama's own `ollama push` and `ollama create` commands. These allow for simple fine-tuning if you have a dataset. This is how you really start tailoring models to specific tasks or your unique writing style. 2. Build local applications: Use the local API exposed by tools like Ollama or LM Studio to create your own apps. How about a local summarization tool, a creative writing assistant integrated with your word processor, or even a basic AI agent that interacts with your local files? Python's `ollama` client library makes this surprisingly straightforward. 3. Explore different model families: The open-source landscape is always moving. Keep an eye on new releases from research labs like Meta (Llama), Microsoft (Phi), and Mistral AI, but don't forget the independent researchers and communities that often release fantastic fine-tuned models. There's always something new to discover. 4. Join the community: Platforms like Hugging Face and various Discord servers are buzzing with local LLM enthusiasts. You'll find new ideas, support, and discussions about the latest advancements there. It’s where I often pick up new tricks.

Frequently Asked Questions

Q: Can I run these models on a Mac with an M-series chip? A: Absolutely! Ollama has fantastic support for Apple Silicon. M1, M2, and M3 chips, especially with 16GB or more unified memory, are incredibly efficient for running these LLMs. It’s a great experience.

Q: Do I need an internet connection to use the models after they're downloaded? A: No, once the models are downloaded and stored on your machine, you can run them completely offline. This is one of the major advantages for privacy and reliability when using local LLMs.

Q: What if my GPU VRAM is less than 8GB? A: You might still be able to run smaller models, particularly `phi3` or highly quantized versions of `mistral`. Performance will be slower, and you might see `out of memory` errors with larger models. Consider using CPU-only inference as a last resort, but prepare for very slow response times. It's often not worth the wait.

Q: Are these open-source models as good as proprietary ones like GPT-4? A: For many common tasks, especially creative writing, summarization, and coding assistance, models like Llama 3 8B and Mistral 7B are remarkably competitive. They may not perfectly match GPT-4's reasoning on complex, nuanced problems, but for local, private use, they offer incredible value and performance. I find them perfectly adequate for most of my needs.

Related articles

The AIWiki Sunday brief

One short email each Sunday — the AI tools, income ideas, and productivity reads our editors actually used that week.

No spam, unsubscribe in one click.