AI Desktop Agent - Cognitive Automation Guide

🤖 What Is This?

The AI Desktop Agent is a planning layer on top of the basic desktop control skill: it interprets a natural-language request, works out the steps needed, and carries them out on its own.

Unlike basic automation that requires exact instructions, the AI Agent:

  • Understands natural language ("Draw a cat in Paint")
  • Plans the steps automatically
  • Executes autonomously
  • Adapts based on what it sees

🎯 What Can It Do?

Autonomous Drawing

from skills.desktop_control.ai_agent import AIDesktopAgent

agent = AIDesktopAgent()

# Just describe what you want!
agent.execute_task("Draw a circle in Paint")
agent.execute_task("Draw a star in MS Paint")
agent.execute_task("Draw a house with a sun")

What it does:

  1. Opens MS Paint
  2. Selects pencil tool
  3. Figures out how to draw the requested shape
  4. Draws it autonomously
  5. Takes a screenshot of the result

Autonomous Text Entry

# It figures out where to type
agent.execute_task("Type 'Hello World' in Notepad")
agent.execute_task("Write an email saying thank you")

What it does:

  1. Opens Notepad (or finds the active text editor)
  2. Types the text naturally
  3. Formats if needed
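One way a request like this could be turned into something executable is simple pattern matching. The sketch below is an assumption, not the agent's actual parser: `parse_type_request` and its regular expression are hypothetical names introduced here to show how the quoted text and the target app might be pulled out of a request.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: the word "type", a quoted string, then an optional "in <app>".
_TYPE_PATTERN = re.compile(
    r"""type\s+(['"])(?P<text>.+?)\1(?:\s+in\s+(?P<app>.+))?$""",
    re.IGNORECASE,
)


def parse_type_request(task: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the text to type and the target app, or None if the task does not match."""
    match = _TYPE_PATTERN.search(task.strip())
    if match is None:
        return None
    app = match.group("app")
    return {
        "text": match.group("text"),
        "app": app.strip().lower() if app else None,
    }


request = parse_type_request("Type 'Hello World' in Notepad")
print(request)  # {'text': 'Hello World', 'app': 'notepad'}
```

A request without a quoted string, such as "Write an email saying thank you", does not match this pattern and would need a different planner or an LLM.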

Autonomous Application Control

# It knows how to open apps
agent.execute_task("Open Calculator")
agent.execute_task("Launch Microsoft Paint")
agent.execute_task("Open File Explorer")

Autonomous Game Playing (Advanced)

# It will try to play the game!
agent.execute_task("Play Solitaire for me")
agent.execute_task("Play Minesweeper")

What it does:

  1. Analyzes the game screen
  2. Detects game state (cards, mines, etc.)
  3. Decides best move
  4. Executes the move
  5. Repeats until win/lose

🏗️ How It Works

Architecture

User Request ("Draw a cat")
    ↓
Natural Language Understanding
    ↓
Task Planning (Step-by-step plan)
    ↓
Step Execution Loop:
    - Observe Screen (Computer Vision)
    - Decide Action (AI Reasoning)
    - Execute Action (Desktop Control)
    - Verify Result
    ↓
Task Complete!
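The step execution loop in the diagram can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in ai_agent.py: `run_plan` and the four callables are hypothetical, and a plain dict stands in for the real screen so the example runs without a desktop.

```python
from typing import Any, Callable, Dict, List

Step = Dict[str, Any]


def run_plan(
    plan: List[Step],
    observe: Callable[[], Dict[str, Any]],
    decide: Callable[[Step, Dict[str, Any]], Step],
    act: Callable[[Step], None],
    verify: Callable[[Step, Dict[str, Any]], bool],
    max_steps: int = 50,
) -> Dict[str, Any]:
    """Run each planned step through observe -> decide -> act -> verify."""
    executed: List[Dict[str, Any]] = []
    for index, step in enumerate(plan, start=1):
        if index > max_steps:
            return {"status": "failed", "success": False, "steps": executed,
                    "failed_at_step": index, "error": "max_steps exceeded"}
        observation = observe()              # Observe Screen
        action = decide(step, observation)   # Decide Action
        act(action)                          # Execute Action
        success = verify(action, observe())  # Verify Result
        executed.append({"step": step, "success": success})
        if not success:
            return {"status": "failed", "success": False,
                    "steps": executed, "failed_at_step": index}
    return {"status": "completed", "success": True, "steps": executed}


# Toy stand-in for the desktop: a dict that records which app is active.
fake_screen = {"active_app": None}


def observe() -> Dict[str, Any]:
    return dict(fake_screen)


def decide(step: Step, observation: Dict[str, Any]) -> Step:
    return step  # a real agent would adapt the step to what it sees


def act(action: Step) -> None:
    fake_screen["active_app"] = action["app"]


def verify(action: Step, observation: Dict[str, Any]) -> bool:
    return observation["active_app"] == action["app"]


result = run_plan([{"type": "launch_app", "app": "paint"}],
                  observe, decide, act, verify)
print(result["status"])  # completed
```

Stopping at the first failed verification keeps the agent from acting on a screen that is not in the state the plan expects.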

Key Components

  1. Task Planner - Breaks down high-level tasks into steps
  2. Vision System - Understands what's on screen (screenshots, OCR, object detection)
  3. Reasoning Engine - Decides what to do next
  4. Action Executor - Performs the actual mouse/keyboard actions
  5. Feedback Loop - Verifies actions succeeded

📋 Supported Tasks (Current)

Tier 1: Fully Automated

| Task Pattern | Example | Status |
|---|---|---|
| Draw shapes in Paint | "Draw a circle" | Working |
| Basic text entry | "Type Hello" | Working |
| Launch applications | "Open Paint" | Working |

Tier 2: Partially Automated 🔨

| Task Pattern | Example | Status |
|---|---|---|
| Form filling | "Fill out this form" | 🔨 In Progress |
| File operations | "Copy these files" | 🔨 In Progress |
| Web navigation | "Find on Google" | 🔨 Planned |

Tier 3: Experimental 🧪

| Task Pattern | Example | Status |
|---|---|---|
| Game playing | "Play Solitaire" | 🧪 Experimental |
| Image editing | "Resize this photo" | 🧪 Planned |
| Code editing | "Fix this bug" | 🧪 Research |

🎨 Example: Drawing in Paint

Simple Request

agent = AIDesktopAgent()
result = agent.execute_task("Draw a circle in Paint")

# Check result
print(f"Status: {result['status']}")
print(f"Steps taken: {len(result['steps'])}")

What Happens Behind the Scenes

1. Planning Phase:

Plan generated:
  Step 1: Launch MS Paint
  Step 2: Wait 2s for Paint to load
  Step 3: Activate Paint window
  Step 4: Select pencil tool (press 'P')
  Step 5: Draw circle at canvas center
  Step 6: Screenshot the result

2. Execution Phase:

[✓] Launched Paint via Win+R → mspaint
[✓] Waited 2.0s
[✓] Activated window "Paint"
[✓] Pressed 'P' to select pencil
[✓] Drew circle with 72 points
[✓] Screenshot saved: drawing_result.png

3. Result:

{
    "task": "Draw a circle in Paint",
    "status": "completed",
    "success": True,
    "steps": [... 6 steps ...],
    "screenshots": [... 6 screenshots ...],
}
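The log line "Drew circle with 72 points" suggests the circle is approximated by a polygon. Here is a minimal sketch of how such points could be generated, assuming evenly spaced angles (5° apart for 72 points). `circle_points` and its parameters are illustrative names, not the agent's actual drawing routine.

```python
import math
from typing import List, Tuple


def circle_points(cx: int, cy: int, radius: int,
                  n_points: int = 72) -> List[Tuple[int, int]]:
    """Return n_points screen coordinates evenly spaced around a circle."""
    points = []
    for i in range(n_points):
        angle = 2 * math.pi * i / n_points
        x = round(cx + radius * math.cos(angle))
        y = round(cy + radius * math.sin(angle))
        points.append((x, y))
    return points


points = circle_points(cx=400, cy=300, radius=100)
print(len(points))  # 72
print(points[0])    # (500, 300)
```

To close the shape, a drawing routine would drag through each point in order and finish back at `points[0]`.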

🎮 Example: Game Playing

agent = AIDesktopAgent()

# Play a simple game
result = agent.execute_task("Play Solitaire for me")

Game Playing Loop

1. Analyze screen → Detect cards, positions
2. Identify valid moves → Find legal plays
3. Evaluate moves → Which is best?
4. Execute move → Click and drag card
5. Repeat until game ends
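The loop above can be sketched as a greedy move selector. Everything here is hypothetical: `play_until_done` and its callables are illustrative, and a toy "take 1 or 2 from a pile" game stands in for screen analysis of a real game.

```python
from typing import Any, Callable, List


def play_until_done(
    analyze: Callable[[], Any],
    legal_moves: Callable[[Any], List[Any]],
    score: Callable[[Any, Any], float],
    execute: Callable[[Any], None],
    max_steps: int = 100,
) -> int:
    """Repeat analyze -> list moves -> pick best -> execute; return moves played."""
    moves_played = 0
    for _ in range(max_steps):
        state = analyze()                     # 1. Analyze screen
        moves = legal_moves(state)            # 2. Identify valid moves
        if not moves:                         # 5. No moves left: game over
            break
        best = max(moves, key=lambda move: score(state, move))  # 3. Evaluate
        execute(best)                         # 4. Execute move
        moves_played += 1
    return moves_played


# Toy game: remove 1 or 2 items from a pile until it is empty.
pile = {"count": 5}

moves_played = play_until_done(
    analyze=lambda: pile["count"],
    legal_moves=lambda count: [k for k in (1, 2) if k <= count],
    score=lambda count, k: k,  # greedy: prefer taking more
    execute=lambda k: pile.update(count=pile["count"] - k),
)
print(moves_played)  # 3
```

The `max_steps` cap plays the same role as the `max_steps` argument of `execute_task`: it bounds the loop if the game state never reaches an end.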

Game-Specific Intelligence

The agent can learn patterns for:

  • Solitaire: Card stacking rules, suit matching
  • Minesweeper: Probability calculations, safe clicks
  • 2048: Tile merging strategy
  • Chess (if integrated with engine): Move evaluation
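As an illustration of the Minesweeper "probability calculations" item, here is a simple local estimate: for a revealed number, the chance that any one of its hidden neighbors is a mine is the number of mines still unaccounted for divided by the number of hidden neighbors. This is a common naive heuristic that ignores constraints from other cells. The function name is hypothetical and nothing here is taken from the agent's code.

```python
def local_mine_probability(number: int, flagged: int, hidden: int) -> float:
    """Estimate the mine probability for each hidden neighbor of one revealed cell.

    number:  the digit shown on the revealed cell
    flagged: neighbors already flagged as mines
    hidden:  neighbors that are still covered and not flagged
    """
    if hidden <= 0:
        raise ValueError("cell has no hidden neighbors")
    remaining = number - flagged
    if not 0 <= remaining <= hidden:
        raise ValueError("inconsistent board state")
    return remaining / hidden


print(local_mine_probability(number=1, flagged=0, hidden=4))  # 0.25
print(local_mine_probability(number=2, flagged=2, hidden=3))  # 0.0
```

A result of 0.0 marks every hidden neighbor as a safe click; 1.0 marks them all as mines.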

🧠 Enhancing the AI

Adding Application Knowledge

# In ai_agent.py, add to app_knowledge:

self.app_knowledge = {
    "photoshop": {
        "name": "Adobe Photoshop",
        "launch_command": "photoshop",
        "common_actions": {
            "new_layer": {"hotkey": ["ctrl", "shift", "n"]},
            "brush_tool": {"hotkey": ["b"]},
            "eraser": {"hotkey": ["e"]},
        }
    }
}
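A knowledge base in this shape can be queried with a small lookup helper. The sketch below copies the Photoshop entry above; `find_hotkey` is a hypothetical helper, not a method known to exist on `AIDesktopAgent`.

```python
from typing import Any, Dict, List, Optional

APP_KNOWLEDGE: Dict[str, Dict[str, Any]] = {
    "photoshop": {
        "name": "Adobe Photoshop",
        "launch_command": "photoshop",
        "common_actions": {
            "new_layer": {"hotkey": ["ctrl", "shift", "n"]},
            "brush_tool": {"hotkey": ["b"]},
            "eraser": {"hotkey": ["e"]},
        },
    }
}


def find_hotkey(app_knowledge: Dict[str, Dict[str, Any]],
                app: str, action: str) -> Optional[List[str]]:
    """Return the key sequence for an action, or None if the app or action is unknown."""
    entry = app_knowledge.get(app.lower(), {})
    return entry.get("common_actions", {}).get(action, {}).get("hotkey")


keys = find_hotkey(APP_KNOWLEDGE, "Photoshop", "new_layer")
print(keys)  # ['ctrl', 'shift', 'n']
```

If the executor is built on a library such as PyAutoGUI (an assumption; the guide does not name one), the returned list could be passed on as `pyautogui.hotkey(*keys)`.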

Adding Custom Task Patterns

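from typing import Dict, List

# Add a custom planning method to AIDesktopAgent
def _plan_photo_edit(self, task: str) -> List[Dict]:
    """Plan for photo editing tasks."""
    # _extract_path and _extract_filter are hypothetical helpers that pull
    # the file path and the filter name out of the task text.
    extracted_path = self._extract_path(task)
    extracted_filter = self._extract_filter(task)
    return [
        {"type": "launch_app", "app": "photoshop"},
        {"type": "wait", "duration": 3.0},
        {"type": "open_file", "path": extracted_path},
        {"type": "apply_filter", "filter": extracted_filter},
        {"type": "save_file"},
    ]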

🔥 Advanced: Vision + Reasoning

Screen Analysis

The agent can analyze screenshots to:

  • Detect UI elements (buttons, text fields, menus)
  • Read text (OCR for labels, instructions)
  • Identify objects (icons, images, game pieces)
  • Understand layout (where things are)

# Analyze what's on screen
analysis = agent._analyze_screen()

print(analysis)
# Output:
# {
#     "active_window": "Untitled - Paint",
#     "mouse_position": (640, 480),
#     "detected_elements": [...],
#     "text_found": [...],
# }

Integration with OpenClaw LLM

# Future: Use OpenClaw's LLM for reasoning
agent = AIDesktopAgent(llm_client=openclaw_llm)

# The agent can now:
# - Reason about complex tasks
# - Understand context better
# - Plan more sophisticated workflows
# - Learn from feedback

🛠️ Extending for Your Needs

Add Support for New Apps

  1. Identify the app
  2. Document common actions
  3. Add to knowledge base
  4. Create planning method

Example: Adding Excel support

# Step 1: Add to app_knowledge
"excel": {
    "name": "Microsoft Excel",
    "launch_command": "excel",
    "common_actions": {
        "new_sheet": {"hotkey": ["shift", "f11"]},
        "sum_formula": {"action": "type", "text": "=SUM()"},
    }
}

# Step 2: Create planner
def _plan_excel_task(self, task: str) -> List[Dict]:
    return [
        {"type": "launch_app", "app": "excel"},
        {"type": "wait", "duration": 2.0},
        # ... specific Excel steps
    ]

# Step 3: Hook into main planner
if "excel" in task_lower or "spreadsheet" in task_lower:
    return self._plan_excel_task(task)
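As more apps are added, a chain of `if` checks can grow long. One alternative, sketched here as an assumption rather than how ai_agent.py is structured, is a keyword table that maps trigger words to planner functions. `PLANNERS`, `plan_task`, and the standalone planner functions are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

Plan = List[Dict]


def plan_excel_task(task: str) -> Plan:
    return [
        {"type": "launch_app", "app": "excel"},
        {"type": "wait", "duration": 2.0},
    ]


def plan_paint_task(task: str) -> Plan:
    return [{"type": "launch_app", "app": "paint"}]


# Each entry: (keywords that trigger the planner, planner function).
PLANNERS: List[Tuple[Tuple[str, ...], Callable[[str], Plan]]] = [
    (("excel", "spreadsheet"), plan_excel_task),
    (("paint", "draw"), plan_paint_task),
]


def plan_task(task: str) -> Plan:
    """Return the plan from the first planner whose keyword appears in the task."""
    task_lower = task.lower()
    for keywords, planner in PLANNERS:
        if any(keyword in task_lower for keyword in keywords):
            return planner(task)
    raise ValueError(f"No planner registered for task: {task!r}")


plan = plan_task("Sum column A in my spreadsheet")
print(plan[0]["app"])  # excel
```

Entries are checked in order, so more specific keywords should be listed before general ones.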

🎯 Real-World Use Cases

1. Automated Form Filling

agent.execute_task("Fill out the job application with my resume data")

2. Batch Image Processing

agent.execute_task("Resize all images in this folder to 800x600")

3. Social Media Posting

agent.execute_task("Post this image to Instagram with caption 'Beautiful sunset'")

4. Data Entry

agent.execute_task("Copy data from this PDF to Excel spreadsheet")

5. Testing

agent.execute_task("Test the login form with invalid credentials")

⚙️ Configuration

Enable/Disable Failsafe

# Safe mode (default)
agent = AIDesktopAgent(failsafe=True)

# Fast mode (no failsafe)
agent = AIDesktopAgent(failsafe=False)

Set Max Steps

# Prevent infinite loops
result = agent.execute_task("Play game", max_steps=100)

Access Action History

# Review what the agent did
print(agent.action_history)

🐛 Debugging

View Step-by-Step Execution

result = agent.execute_task("Draw a star in Paint")

for i, step in enumerate(result['steps'], 1):
    print(f"Step {i}: {step['step']['description']}")
    print(f"  Success: {step['success']}")
    if 'error' in step:
        print(f"  Error: {step['error']}")

View Screenshots

# Each step captures before/after screenshots
for screenshot_pair in result['screenshots']:
    before = screenshot_pair['before']
    after = screenshot_pair['after']
    
    # Display or save for analysis
    before.save(f"step_{screenshot_pair['step']}_before.png")
    after.save(f"step_{screenshot_pair['step']}_after.png")

🚀 Future Enhancements

Planned features:

  • Computer Vision: OCR, object detection, UI element recognition
  • LLM Integration: Natural language understanding with OpenClaw LLM
  • Learning: Remember successful patterns, improve over time
  • Multi-App Workflows: "Get data from Chrome and put in Excel"
  • Voice Control: "Alexa, draw a cat in Paint"
  • Autonomous Debugging: Fix errors automatically
  • Game AI: Reinforcement learning for game playing
  • Web Automation: Full browser control with understanding

📚 Full API

Main Methods

# Execute a task
# Signature: execute_task(task: str, max_steps: int = 50)
result = agent.execute_task("Draw a circle in Paint", max_steps=50)

# Analyze screen
analysis = agent._analyze_screen()

# Manual mode: Execute individual steps
step = {"type": "launch_app", "app": "paint"}
result = agent._execute_step(step)

Result Structure

{
    "task": str,                    # Original task
    "status": str,                  # "completed", "failed", "error"
    "success": bool,                # Overall success
    "steps": List[Dict],            # All steps executed
    "screenshots": List[Dict],      # Before/after screenshots
    "failed_at_step": int,          # If failed, which step
    "error": str,                   # Error message if failed
}
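For callers who want type hints on this structure, it can be mirrored as a `TypedDict`. The sketch below is illustrative: `TaskResult` and `summarize` are hypothetical names, and `total=False` reflects the assumption that `failed_at_step` and `error` are only present on failure.

```python
from typing import Any, Dict, List, TypedDict


class TaskResult(TypedDict, total=False):
    task: str
    status: str                        # "completed", "failed", "error"
    success: bool
    steps: List[Dict[str, Any]]
    screenshots: List[Dict[str, Any]]
    failed_at_step: int                # assumed present only on failure
    error: str                         # assumed present only on failure


def summarize(result: TaskResult) -> str:
    """Build a one-line summary of a task result."""
    if result.get("success"):
        return f"{result['task']}: completed in {len(result.get('steps', []))} steps"
    step = result.get("failed_at_step", "?")
    error = result.get("error", "no error message")
    return f"{result['task']}: {result.get('status', 'failed')} at step {step} ({error})"


ok: TaskResult = {"task": "Draw a circle in Paint", "status": "completed",
                  "success": True, "steps": [{} for _ in range(6)]}
print(summarize(ok))  # Draw a circle in Paint: completed in 6 steps
```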

🦞 Built for OpenClaw - The future of desktop automation!