Frequently Asked Questions About the Google DeepMind Generative Media App-Building Framework
21 answers covering everything from basics to advanced usage.
// Basics
Can I use DeepMind's generative media models without writing any code?
Yes. AI Studio's Playground lets you test all of the models (Gemini, Nano Banana 2, Veo 3, and Lyria 3) through a visual interface without writing code. You can upload files, toggle tools, adjust parameters, and evaluate outputs interactively. When you're ready to move to code, the 'Get Code' button exports your exact configuration. AI Studio Build goes further, generating full-stack apps, including UI, database, and authentication, from natural-language specs.
What is the difference between Gemini Flash-Lite, Flash, and Pro?
These are Gemini's tiered models, ordered by cost and capability. Flash-Lite (~$0.25/M tokens) is the cheapest and suits prototyping and simpler tasks. Flash offers a middle ground with better reasoning. Pro is the most capable and the most expensive, roughly an order of magnitude more than Flash-Lite, and suits complex reasoning, long-document analysis, and production-quality output. Default to Flash-Lite during development and upgrade only when quality benchmarks justify the cost increase.
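To make the cost gap concrete, here is a minimal sketch of the arithmetic. Only the Flash-Lite price comes from the answer above; the Pro price is an assumed 10x multiple standing in for "roughly an order of magnitude", and the tier keys and function name are hypothetical.

```python
# USD per million tokens. Only the Flash-Lite figure is from the FAQ answer;
# the Pro figure is an ASSUMED 10x multiple used purely for illustration.
PRICE_PER_MILLION = {
    "flash-lite": 0.25,
    "pro": 2.50,  # assumption, not a published price
}


def estimate_cost(tier: str, tokens: int) -> float:
    """Return the estimated USD cost of processing `tokens` tokens on `tier`."""
    return PRICE_PER_MILLION[tier] * tokens / 1_000_000
```

Under these assumptions, a 2-million-token development run costs about $0.50 on Flash-Lite and about $5.00 on Pro, which is why it pays to benchmark quality before upgrading.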
What is Nano Banana 2 and how does it relate to Imagen?
Nano Banana 2 is Google DeepMind's current image generation and editing model, previously known as Imagen. It supports multiple aspect ratios, image-reference-based generation for character consistency, and Google Search grounding that lets it generate images informed by live web results. You access it through AI Studio or the Developer API, and it works best when Gemini generates its prompts rather than you writing them manually.
// How To
How do I generate video from an image using Veo 3?
Pass your starting image as the first frame to Veo 3 or Veo 3.1 Lite, along with a motion prompt generated by Gemini. Critically, do not reuse your image generation prompt: write a separate prompt describing what should happen after the starting frame (camera movements, character actions, environmental changes). Use Veo 3.1 Lite at $0.05/image-equivalent for prototyping and upgrade to full Veo 3 only when quality demands it.
How do I use Gemini's code execution tool in my app?
Code execution is a one-liner tool toggle in AI Studio that provisions a sandboxed Python environment with pre-installed data science libraries. Enable it in the Playground or via the API, and Gemini can write, run, and iterate on Python code — like drawing bounding boxes on images or generating segmentation masks — within a single inference call. This is exported as part of the 'Get Code' configuration and works identically in production.
How do I generate music for my app using Lyria 3?
Use Gemini to generate an audio prompt describing the desired mood, instrumentation, tempo, and feel. Pass this prompt to Lyria 3 via the API to generate 30-second clips or full 3-minute songs with lyrics. For real-time applications, Lyria RealTime generates music indefinitely and responds to prompt changes on the fly, functioning like an AI DJ. Pair Lyria 3 with Gemini's chat mode to maintain thematic consistency across multiple music pieces.
How do I handle file uploads when building with Gemini?
Use the File Upload API, which handles file storage without requiring you to configure cloud buckets. Upload assets (documents, images, audio, video) once using the client-side API. Each file receives a URI you reference in subsequent Gemini prompts. Combine this with chat mode to upload a large source asset once and issue multiple instructions against it across turns, dramatically reducing token costs and latency for multi-step generation pipelines.
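The upload-once pattern can be sketched without committing to a specific SDK. In the sketch below, `upload` is a hypothetical callable standing in for whatever File Upload API call you use (it takes a path and returns a URI), and the turn dictionaries are an illustrative shape, not the API's actual request format.

```python
class UploadCache:
    """Upload each file at most once and remember the URI it was given."""

    def __init__(self, upload):
        # `upload` is a hypothetical callable: local path -> file URI string.
        self._upload = upload
        self._uris = {}

    def uri_for(self, path: str) -> str:
        if path not in self._uris:
            self._uris[path] = self._upload(path)
        return self._uris[path]


def build_turns(cache: UploadCache, path: str, instructions: list) -> list:
    """Attach the file URI to the first chat turn only; later turns are text.

    This relies on chat mode keeping the uploaded asset in context.
    """
    if not instructions:
        raise ValueError("at least one instruction is required")
    first = {"file_uri": cache.uri_for(path), "text": instructions[0]}
    return [first] + [{"text": text} for text in instructions[1:]]
```

Calling `build_turns` again with the same path reuses the cached URI, so a large asset is transferred once no matter how many instructions are issued against it.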
Can I process a YouTube video directly with Gemini?
Yes. In AI Studio, use the 'Add Link' feature to input a YouTube URL with optional start and end timestamps. Gemini can ingest the video content — visual frames and audio — natively as part of a multimodal prompt. This means you can analyze, summarize, extract clips, or generate derivative content from YouTube videos without downloading or pre-processing them. The same configuration exports via 'Get Code' for production use.
What is Google Search grounding and when should I enable it?
Google Search grounding is a toggle in AI Studio that allows Gemini and Nano Banana 2 to retrieve live web information during generation. Enable it when your app needs factual accuracy beyond the model's training cutoff — like identifying current book editions, real product details, or recent events. For image generation, it lets Nano Banana 2 generate images informed by live web search results. It's a one-liner toggle in the Playground and exports cleanly via Get Code.
// Troubleshooting
Why are my AI-generated characters looking different in each image?
You're likely relying on long-context memory or text descriptions alone for character consistency. The fix is explicit reference image injection: generate one dedicated reference image per character first, then for every subsequent scene, pass only the specific reference images for characters in that scene. Don't pass your entire character library — pass only what's relevant. For complex characters, generate multiple reference angles (front, back, side) to give the model more visual grounding.
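A minimal sketch of the selection step follows, assuming a hypothetical `reference_library` dictionary that maps each character ID to its list of reference image paths (one per angle). The names and file layout are illustrative.

```python
def select_references(scene_character_ids, reference_library):
    """Return reference images only for the characters present in one scene."""
    missing = [cid for cid in scene_character_ids if cid not in reference_library]
    if missing:
        # Fail early: generating without a reference is what causes drift.
        raise KeyError(f"no reference image for: {missing}")
    references = []
    for cid in scene_character_ids:
        references.extend(reference_library[cid])
    return references
```

For a scene containing only the owl and the fox, this returns their images and leaves the rest of the library out of the request.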
Why is the TTS model ignoring my text input?
The TTS model requires a read/tell instruction prefix to process text. If you send raw text without prefixing it with something like 'Read this:' or 'Tell this story:', the model will ignore the content. For multi-character narration, rewrite dialogue as a play-style transcript with character labels and inline style descriptions (e.g., 'fast-paced, British accent, excited'). Reuse the same style tag for recurring characters to maintain voice consistency.
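One way to assemble such a prompt is sketched below. The exact transcript layout (`Speaker (style): text`) is an assumption chosen for illustration; the answer above only specifies an instruction prefix, character labels, and inline style descriptions.

```python
def build_tts_prompt(lines, styles, instruction="Read this:"):
    """Render (speaker, text) pairs as a play-style transcript.

    `styles` maps a speaker to one style description. Looking the style up
    by speaker guarantees that recurring characters reuse the same tag.
    """
    rendered = [instruction]  # the read/tell prefix always comes first
    for speaker, text in lines:
        style = styles.get(speaker)
        label = f"{speaker} ({style})" if style else speaker
        rendered.append(f"{label}: {text}")
    return "\n".join(rendered)
```

Keeping the styles in a single dictionary means a voice change is made in one place rather than on every line of dialogue.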
My AI Studio Build app has bugs — how do I debug generated code effectively?
Always instruct the model to add logging from the start — error messages alone are insufficient. Request separate files for each feature to isolate regressions and make code review tractable. When the model fixes errors, review file diffs carefully to detect if it's modifying unrelated logic. Watch which files change with each iteration. If a fix introduces new bugs elsewhere, revert and re-prompt with more specific instructions targeting only the affected module.
// Comparisons
How does building with DeepMind models compare to using LangChain or other agent frameworks?
The DeepMind framework explicitly warns against over-investing in agent frameworks — Paige Bailey's Sprint Warning notes that models tend to absorb these capabilities natively. Instead of building multi-step chains with external orchestration, the framework leverages Gemini's native multimodal capabilities, built-in tools (code execution, search grounding, function calling), and chat mode for context persistence. This reduces infrastructure complexity. Only add external orchestration when the model demonstrably cannot handle the task natively.
How does Gemma 4 compare to running Llama or Mistral locally?
Gemma 4 is Google DeepMind's open-weight model family under Apache 2.0 license, specifically designed for agentic use with built-in thinking and multimodal understanding. Its Effective 2B/4B models use a per-layer embedded architecture optimized for mobile/edge devices — actual 2-4B parameters performing like 5-8B models. Unlike Llama or Mistral, Gemma 4 is directly compatible with the same tooling and prompt patterns used across the DeepMind cloud suite, providing a consistent developer experience from cloud to edge.
How does AI Studio Build compare to v0.dev or Lovable for app scaffolding?
AI Studio Build is analogous to v0.dev or Lovable but integrates directly with Google DeepMind's model suite, Firebase/Firestore for databases, Google OAuth for authentication, and GitHub for version control. The key advantage is that your AI Studio playground prototypes — including model configurations, tool settings, and validated prompts — feed directly into the Build environment. You can paste existing notebooks or specs as context, and the generated app inherits access to the full DeepMind API ecosystem natively.
// Advanced
What does the service_tier parameter do and when should I use flex vs priority?
The service_tier parameter signals scheduling priority when calling Gemini models. Set 'flex' for batch or offline jobs: it's cheaper, but you must be able to tolerate minutes of latency, so it functions like a batch API. Set 'priority' for live user-facing requests: it costs roughly 2x but provides higher reliability and lower latency. During development, default to flex to minimize spend; switch to priority only for production endpoints serving real users.
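The decision rule above is small enough to encode directly. This is a sketch only: the helper names are hypothetical, and where the `service_tier` value is placed in a real request depends on the SDK you use.

```python
def pick_service_tier(user_facing: bool, production: bool) -> str:
    """Use 'priority' only for production traffic that serves real users."""
    return "priority" if user_facing and production else "flex"


def request_options(user_facing: bool, production: bool) -> dict:
    """Build the option to merge into a model call (placement is an assumption)."""
    return {"service_tier": pick_service_tier(user_facing, production)}
```

Centralizing the choice keeps a development notebook from silently paying the priority rate.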
What is Genie 3 and how does it create interactive environments?
Genie 3 is a world model from Google DeepMind that generates interactive, playable environments frame-by-frame from text descriptions and character prompts. It composes Nano Banana 2 (for visuals), Veo (for temporal coherence), and Gemini (for understanding) into a system that outputs raw pixel frames, with no game engine or 3D assets required. Users can navigate and interact with the generated world in real time, making it suitable for game prototyping, training simulations, and interactive storytelling.
How do I use structured outputs when chaining DeepMind model calls?
When Gemini's output feeds into downstream model calls — generating image prompts, character lists, or video motion descriptions — use structured outputs (JSON schema enforcement) to guarantee parseable responses. In AI Studio, toggle structured outputs on and define your schema. This prevents free-text formatting variations from breaking your pipeline. For example, request a JSON array of scene objects each containing character_ids, image_prompt, and motion_prompt fields, then programmatically route each to the correct generation model.
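The scene example above might look like the following sketch. The schema is written as a plain JSON-Schema-style dictionary and the routing function checks required fields by hand. How you register the schema with the model, and the job dictionary shapes, are assumptions for illustration.

```python
import json

# Schema for the scene list described above, in JSON Schema form.
SCENE_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "character_ids": {"type": "array", "items": {"type": "string"}},
            "image_prompt": {"type": "string"},
            "motion_prompt": {"type": "string"},
        },
        "required": ["character_ids", "image_prompt", "motion_prompt"],
    },
}


def route_scenes(raw_json: str):
    """Parse the model's JSON and split it into image jobs and video jobs."""
    scenes = json.loads(raw_json)
    required = SCENE_SCHEMA["items"]["required"]
    image_jobs, video_jobs = [], []
    for index, scene in enumerate(scenes):
        missing = [key for key in required if key not in scene]
        if missing:
            raise ValueError(f"scene {index} is missing fields: {missing}")
        image_jobs.append({
            "scene": index,
            "prompt": scene["image_prompt"],
            "character_ids": scene["character_ids"],
        })
        video_jobs.append({"scene": index, "prompt": scene["motion_prompt"]})
    return image_jobs, video_jobs
```

The image jobs go to the image model and the video jobs to the video model. The defensive field check is cheap insurance even when schema enforcement is switched on.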
How expensive is video generation with Veo 3 and how do I control costs?
Video generation with Veo can cost approximately $20 per run, depending on duration and quality tier. Use Veo 3.1 Lite ($0.05/image-equivalent) for all prototyping and upgrade to full Veo 3 only for production-quality outputs. Gate all expensive model calls behind explicit confirmation flags in your notebooks and apps, and never auto-run video generation in loops. Use service_tier='flex' for batch video processing and set budget alerts. Validate image prompts thoroughly before committing to video generation.
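A confirmation gate can be as simple as the sketch below. Here `generate` is a hypothetical callable that wraps your actual video call, the default `cost_per_run` reuses the approximate $20 figure from the answer above, and the `budget` default is an arbitrary illustrative cap.

```python
def run_video_jobs(jobs, generate, *, confirm=False, cost_per_run=20.0, budget=100.0):
    """Run video jobs only after explicit confirmation and a budget check."""
    estimated = cost_per_run * len(jobs)
    if not confirm:
        raise RuntimeError(
            f"Refusing to run {len(jobs)} video job(s) (~${estimated:.2f}); "
            "pass confirm=True to proceed."
        )
    if estimated > budget:
        raise RuntimeError(
            f"Estimated cost ${estimated:.2f} exceeds budget ${budget:.2f}."
        )
    # Nothing is generated until both checks above have passed.
    return [generate(job) for job in jobs]
```

Because `confirm` defaults to `False`, re-running a notebook cell cannot trigger a paid generation by accident.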
Should I build a vector database for RAG with DeepMind models?
Probably not yet. The Sprint Warning principle applies directly here: Gemini's context windows have expanded dramatically, absorbing much of the use case that vector databases served. Before investing in RAG infrastructure, test whether Gemini can handle your document set within its native context window using chat mode and the File Upload API. Only build vector database infrastructure if your data volume genuinely exceeds the model's context capacity and you've confirmed the gap won't close in the next release cycle.
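A rough pre-check for that test is sketched here. Every number in it is an assumption: the 4-characters-per-token heuristic, the 1,000,000-token default window, and the 20% headroom reserved for prompts and output. Use your model's real token counter and documented limits before making the decision.

```python
def fits_in_context(documents, context_window_tokens=1_000_000,
                    chars_per_token=4, headroom=0.2):
    """Estimate whether the documents fit in the native context window.

    All defaults are illustrative assumptions, not published limits.
    """
    estimated_tokens = sum(len(doc) for doc in documents) / chars_per_token
    usable_tokens = context_window_tokens * (1 - headroom)
    return estimated_tokens <= usable_tokens
```

If this returns `True` for your full document set, try the chat mode and File Upload API route before building retrieval infrastructure.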
How do I deploy a Gemma 4 model on a mobile device?
Use Gemma 4's Effective 2B or 4B models, which use a per-layer embedded architecture allowing embeddings to be stored on flash and paged in as needed. Deploy via AI Edge Gallery for Android devices, or use Ollama or LM Studio for other edge platforms. These models run on phones, Raspberry Pis, and Jetson Nanos. The Apache 2.0 license allows full customization. They support the same agentic patterns — thinking, multimodal understanding, tool use — as cloud Gemini, scaled for device constraints.