# AI Desktop Agent - Cognitive Automation Guide
## 🤖 What Is This?
The **AI Desktop Agent** is an intelligent layer on top of the basic desktop control that **understands** what you want and figures out how to do it autonomously.
Unlike basic automation that requires exact instructions, the AI Agent:
- **Understands natural language** ("Draw a cat in Paint")
- **Plans the steps** automatically
- **Executes autonomously**
- **Adapts** based on what it sees
---
## 🎯 What Can It Do?
### ✅ Autonomous Drawing
```python
from skills.desktop_control.ai_agent import AIDesktopAgent
agent = AIDesktopAgent()
# Just describe what you want!
agent.execute_task("Draw a circle in Paint")
agent.execute_task("Draw a star in MS Paint")
agent.execute_task("Draw a house with a sun")
```
**What it does:**
1. Opens MS Paint
2. Selects pencil tool
3. Figures out how to draw the requested shape
4. Draws it autonomously
5. Takes a screenshot of the result
### ✅ Autonomous Text Entry
```python
# It figures out where to type
agent.execute_task("Type 'Hello World' in Notepad")
agent.execute_task("Write an email saying thank you")
```
**What it does:**
1. Opens Notepad (or finds active text editor)
2. Types the text naturally
3. Formats if needed
### ✅ Autonomous Application Control
```python
# It knows how to open apps
agent.execute_task("Open Calculator")
agent.execute_task("Launch Microsoft Paint")
agent.execute_task("Open File Explorer")
```
### ✅ Autonomous Game Playing (Advanced)
```python
# It will try to play the game!
agent.execute_task("Play Solitaire for me")
agent.execute_task("Play Minesweeper")
```
**What it does:**
1. Analyzes the game screen
2. Detects game state (cards, mines, etc.)
3. Decides best move
4. Executes the move
5. Repeats until win/lose
---
## 🏗️ How It Works
### Architecture
```
User Request ("Draw a cat")
    ↓
Natural Language Understanding
    ↓
Task Planning (Step-by-step plan)
    ↓
Step Execution Loop:
    - Observe Screen (Computer Vision)
    - Decide Action (AI Reasoning)
    - Execute Action (Desktop Control)
    - Verify Result
    ↓
Task Complete!
```
### Key Components
1. **Task Planner** - Breaks down high-level tasks into steps
2. **Vision System** - Understands what's on screen (screenshots, OCR, object detection)
3. **Reasoning Engine** - Decides what to do next
4. **Action Executor** - Performs the actual mouse/keyboard actions
5. **Feedback Loop** - Verifies actions succeeded
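
As a rough illustration of how these components could fit together, the sketch below runs a fixed plan through an observe, execute, verify cycle. It is not the agent's actual implementation: `run_plan`, `fake_observe`, `fake_execute`, and `fake_verify` are hypothetical names, and the stand-ins only update an in-memory dictionary instead of touching the desktop. The Reasoning Engine is left out because the plan here is decided up front. The returned keys mirror the result fields described in this guide.

```python
from typing import Callable, Dict, List

Step = Dict[str, object]


def run_plan(
    plan: List[Step],
    observe: Callable[[], Dict],
    execute: Callable[[Step], bool],
    verify: Callable[[Step, Dict, Dict], bool],
) -> Dict:
    """Run each planned step through observe -> execute -> verify."""
    executed = []
    for index, step in enumerate(plan, 1):
        before = observe()                        # Vision System: state before acting
        ok = execute(step)                        # Action Executor: perform the step
        after = observe()                         # Vision System: state after acting
        ok = ok and verify(step, before, after)   # Feedback Loop: did it work?
        executed.append({"step": step, "success": ok})
        if not ok:
            return {"status": "failed", "success": False,
                    "steps": executed, "failed_at_step": index}
    return {"status": "completed", "success": True, "steps": executed}


# Hypothetical stand-ins so the sketch runs without controlling a real desktop.
screen = {"windows": []}


def fake_observe() -> Dict:
    return {"windows": list(screen["windows"])}


def fake_execute(step: Step) -> bool:
    if step["type"] == "launch_app":
        screen["windows"].append(step["app"])
        return True
    return step["type"] == "wait"   # any other step type is unsupported here


def fake_verify(step: Step, before: Dict, after: Dict) -> bool:
    if step["type"] == "launch_app":
        return step["app"] in after["windows"]
    return True


demo_result = run_plan(
    [{"type": "launch_app", "app": "paint"}, {"type": "wait", "duration": 2.0}],
    fake_observe, fake_execute, fake_verify,
)
print(demo_result["status"])
```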
---
## 📋 Supported Tasks (Current)
### Tier 1: Fully Automated ✅
| Task Pattern | Example | Status |
|-------------|---------|---------|
| Draw shapes in Paint | "Draw a circle" | ✅ Working |
| Basic text entry | "Type Hello" | ✅ Working |
| Launch applications | "Open Paint" | ✅ Working |
### Tier 2: Partially Automated 🔨
| Task Pattern | Example | Status |
|-------------|---------|---------|
| Form filling | "Fill out this form" | 🔨 In Progress |
| File operations | "Copy these files" | 🔨 In Progress |
| Web navigation | "Find on Google" | 🔨 Planned |
### Tier 3: Experimental 🧪
| Task Pattern | Example | Status |
|-------------|---------|---------|
| Game playing | "Play Solitaire" | 🧪 Experimental |
| Image editing | "Resize this photo" | 🧪 Planned |
| Code editing | "Fix this bug" | 🧪 Research |
---
## 🎨 Example: Drawing in Paint
### Simple Request
```python
agent = AIDesktopAgent()
result = agent.execute_task("Draw a circle in Paint")
# Check result
print(f"Status: {result['status']}")
print(f"Steps taken: {len(result['steps'])}")
```
### What Happens Behind the Scenes
**1. Planning Phase:**
```
Plan generated:
Step 1: Launch MS Paint
Step 2: Wait 2s for Paint to load
Step 3: Activate Paint window
Step 4: Select pencil tool (press 'P')
Step 5: Draw circle at canvas center
Step 6: Screenshot the result
```
**2. Execution Phase:**
```
[✓] Launched Paint via Win+R → mspaint
[✓] Waited 2.0s
[✓] Activated window "Paint"
[✓] Pressed 'P' to select pencil
[✓] Drew circle with 72 points
[✓] Screenshot saved: drawing_result.png
```
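The log line "Drew circle with 72 points" suggests the outline is approximated by a polygon. One plausible way to produce such a path is to sample the circle every 5 degrees, as sketched here. The function name `circle_points`, the centre `(640, 480)`, and the radius of 100 pixels are assumptions for illustration; dragging the mouse along the path is left to the desktop control layer and is not shown.

```python
import math
from typing import List, Tuple


def circle_points(cx: int, cy: int, radius: int,
                  n_points: int = 72) -> List[Tuple[int, int]]:
    """Return pixel coordinates approximating a circle outline."""
    points = []
    for i in range(n_points):
        angle = 2 * math.pi * i / n_points      # 72 points -> one every 5 degrees
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        points.append((round(x), round(y)))
    points.append(points[0])                    # repeat the start to close the shape
    return points


path = circle_points(640, 480, 100)
print(len(path), path[0])   # 73 (740, 480)
```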
**3. Result:**
```python
{
    "task": "Draw a circle in Paint",
    "status": "completed",
    "success": True,
    "steps": [... 6 steps ...],
    "screenshots": [... 6 screenshots ...],
}
```
---
## 🎮 Example: Game Playing
```python
agent = AIDesktopAgent()
# Play a simple game
result = agent.execute_task("Play Solitaire for me")
```
### Game Playing Loop
```
1. Analyze screen → Detect cards, positions
2. Identify valid moves → Find legal plays
3. Evaluate moves → Which is best?
4. Execute move → Click and drag card
5. Repeat until game ends
```
### Game-Specific Intelligence
The agent can learn patterns for:
- **Solitaire**: Card stacking rules, suit matching
- **Minesweeper**: Probability calculations, safe clicks
- **2048**: Tile merging strategy
- **Chess** (if integrated with engine): Move evaluation
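
To make the Minesweeper item concrete, here is a small sketch of the simplest kind of reasoning involved: using one revealed number at a time to estimate mine probability and to find cells that are safe to click. This is a textbook single-constraint deduction, not the agent's own logic. The board encoding (`"0"`-`"8"`, `"?"`, `"F"`) and the function names are assumptions made for this example, and a real player would also combine constraints from several numbers.

```python
from fractions import Fraction
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]
Board = List[List[str]]  # "0"-"8" revealed number, "?" hidden, "F" flagged


def neighbours(board: Board, cell: Cell) -> List[Cell]:
    """Return the in-bounds cells surrounding `cell` (up to eight)."""
    rows, cols = len(board), len(board[0])
    r, c = cell
    return [
        (r + dr, c + dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols
    ]


def local_mine_probability(board: Board, number_cell: Cell) -> Optional[Fraction]:
    """Chance that each hidden neighbour of one revealed number is a mine.

    Only this single number is used as evidence, so it is a local estimate.
    Returns None if the cell is not a number or has no hidden neighbours.
    """
    r, c = number_cell
    if not board[r][c].isdigit():
        return None
    around = neighbours(board, number_cell)
    hidden = [n for n in around if board[n[0]][n[1]] == "?"]
    flagged = sum(1 for n in around if board[n[0]][n[1]] == "F")
    if not hidden:
        return None
    return Fraction(int(board[r][c]) - flagged, len(hidden))


def safe_cells(board: Board) -> Set[Cell]:
    """Hidden cells next to a number whose mines are all flagged already."""
    safe: Set[Cell] = set()
    for r, row in enumerate(board):
        for c, _ in enumerate(row):
            if local_mine_probability(board, (r, c)) == 0:
                safe.update(n for n in neighbours(board, (r, c))
                            if board[n[0]][n[1]] == "?")
    return safe


board = [
    ["1", "F", "1"],
    ["1", "1", "1"],
    ["?", "?", "?"],
]
print(sorted(safe_cells(board)))
print(local_mine_probability([["?", "?"], ["1", "?"]], (1, 0)))
```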
---
## 🧠 Enhancing the AI
### Adding Application Knowledge
```python
# In ai_agent.py, add to app_knowledge:
self.app_knowledge = {
    "photoshop": {
        "name": "Adobe Photoshop",
        "launch_command": "photoshop",
        "common_actions": {
            "new_layer": {"hotkey": ["ctrl", "shift", "n"]},
            "brush_tool": {"hotkey": ["b"]},
            "eraser": {"hotkey": ["e"]},
        }
    }
}
```
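Once an entry like this exists, a planner needs a way to read it. The sketch below shows one possible lookup that tolerates unknown apps and actions by returning `None`. `lookup_hotkey` is a hypothetical helper, not a method the agent is known to provide, and the dictionary is copied here as a module-level variable so the example runs on its own.

```python
from typing import Dict, List, Optional

# Same shape as the app_knowledge entry above, kept standalone for the example.
app_knowledge: Dict[str, Dict] = {
    "photoshop": {
        "name": "Adobe Photoshop",
        "launch_command": "photoshop",
        "common_actions": {
            "new_layer": {"hotkey": ["ctrl", "shift", "n"]},
            "brush_tool": {"hotkey": ["b"]},
            "eraser": {"hotkey": ["e"]},
        },
    }
}


def lookup_hotkey(knowledge: Dict[str, Dict], app: str,
                  action: str) -> Optional[List[str]]:
    """Return the key combination for an action, or None if it is unknown."""
    actions = knowledge.get(app, {}).get("common_actions", {})
    return actions.get(action, {}).get("hotkey")


print(lookup_hotkey(app_knowledge, "photoshop", "new_layer"))
print(lookup_hotkey(app_knowledge, "gimp", "new_layer"))
```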
### Adding Custom Task Patterns
```python
# Add a custom planning method (needs: from typing import Dict, List)
def _plan_photo_edit(self, task: str) -> List[Dict]:
    """Plan for photo editing tasks."""
    # extracted_path and extracted_filter are placeholders:
    # parse them from `task` before building the plan.
    return [
        {"type": "launch_app", "app": "photoshop"},
        {"type": "wait", "duration": 3.0},
        {"type": "open_file", "path": extracted_path},
        {"type": "apply_filter", "filter": extracted_filter},
        {"type": "save_file"},
    ]
```
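The planner above relies on a file path and a filter name taken from the task text. A minimal way to obtain them, assuming the path is quoted in the request and the filter is one of a small known set, is sketched here. `KNOWN_FILTERS`, `extract_photo_edit_params`, and the standalone `plan_photo_edit` are hypothetical names for illustration; a real agent might use an LLM or a richer parser instead of a regular expression.

```python
import re
from typing import Dict, List, Optional

KNOWN_FILTERS = ("blur", "sharpen", "grayscale")   # hypothetical filter names


def extract_photo_edit_params(task: str) -> Dict[str, Optional[str]]:
    """Pull a quoted image path and a known filter name out of the task text."""
    path_match = re.search(
        r"['\"]([^'\"]+\.(?:png|jpe?g|bmp))['\"]", task, re.IGNORECASE
    )
    task_lower = task.lower()
    found_filter = next((f for f in KNOWN_FILTERS if f in task_lower), None)
    return {
        "path": path_match.group(1) if path_match else None,
        "filter": found_filter,
    }


def plan_photo_edit(task: str) -> List[Dict]:
    """Build the photo-edit plan, refusing to guess missing parameters."""
    params = extract_photo_edit_params(task)
    if params["path"] is None or params["filter"] is None:
        raise ValueError(f"Could not find a file path and filter in: {task!r}")
    return [
        {"type": "launch_app", "app": "photoshop"},
        {"type": "wait", "duration": 3.0},
        {"type": "open_file", "path": params["path"]},
        {"type": "apply_filter", "filter": params["filter"]},
        {"type": "save_file"},
    ]


photo_plan = plan_photo_edit("Apply blur to 'C:/photos/cat.png'")
print(photo_plan[2], photo_plan[3])
```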
---
## 🔥 Advanced: Vision + Reasoning
### Screen Analysis
The agent can analyze screenshots to:
- **Detect UI elements** (buttons, text fields, menus)
- **Read text** (OCR for labels, instructions)
- **Identify objects** (icons, images, game pieces)
- **Understand layout** (where things are)
```python
# Analyze what's on screen
analysis = agent._analyze_screen()
print(analysis)
# Output:
# {
# "active_window": "Untitled - Paint",
# "mouse_position": (640, 480),
# "detected_elements": [...],
# "text_found": [...],
# }
```
### Integration with OpenClaw LLM
```python
# Future: Use OpenClaw's LLM for reasoning
agent = AIDesktopAgent(llm_client=openclaw_llm)
# The agent can now:
# - Reason about complex tasks
# - Understand context better
# - Plan more sophisticated workflows
# - Learn from feedback
```
---
## 🛠️ Extending for Your Needs
### Add Support for New Apps
1. **Identify the app**
2. **Document common actions**
3. **Add to knowledge base**
4. **Create planning method**

Example: Adding Excel support
```python
# Step 1: Add to app_knowledge
"excel": {
    "name": "Microsoft Excel",
    "launch_command": "excel",
    "common_actions": {
        "new_sheet": {"hotkey": ["shift", "f11"]},
        "sum_formula": {"action": "type", "text": "=SUM()"},
    }
}

# Step 2: Create planner
def _plan_excel_task(self, task: str) -> List[Dict]:
    return [
        {"type": "launch_app", "app": "excel"},
        {"type": "wait", "duration": 2.0},
        # ... specific Excel steps
    ]

# Step 3: Hook into main planner
if "excel" in task_lower or "spreadsheet" in task_lower:
    return self._plan_excel_task(task)
```
```
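As more apps are added, a chain of `if` checks in the main planner can grow unwieldy. One alternative, sketched here under the assumption that routing is purely keyword based as in Step 3, is a table that maps keywords to planner functions. `PLANNERS`, `plan_task`, and the two standalone planner functions are hypothetical names; the first matching entry wins, so order matters when keywords overlap.

```python
from typing import Callable, Dict, List, Tuple

Plan = List[Dict]


def plan_excel_task(task: str) -> Plan:
    return [
        {"type": "launch_app", "app": "excel"},
        {"type": "wait", "duration": 2.0},
    ]


def plan_paint_task(task: str) -> Plan:
    return [
        {"type": "launch_app", "app": "paint"},
        {"type": "wait", "duration": 2.0},
    ]


# Each entry pairs trigger keywords with the planner that handles them.
PLANNERS: List[Tuple[Tuple[str, ...], Callable[[str], Plan]]] = [
    (("excel", "spreadsheet"), plan_excel_task),
    (("paint", "draw"), plan_paint_task),
]


def plan_task(task: str) -> Plan:
    """Route a task to the first planner whose keyword appears in it."""
    task_lower = task.lower()
    for keywords, planner in PLANNERS:
        if any(keyword in task_lower for keyword in keywords):
            return planner(task)
    raise ValueError(f"No planner registered for task: {task!r}")


print(plan_task("Sum column A in my spreadsheet")[0]["app"])
```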
---
## 🎯 Real-World Use Cases
### 1. Automated Form Filling
```python
agent.execute_task("Fill out the job application with my resume data")
```
### 2. Batch Image Processing
```python
agent.execute_task("Resize all images in this folder to 800x600")
```
### 3. Social Media Posting
```python
agent.execute_task("Post this image to Instagram with caption 'Beautiful sunset'")
```
### 4. Data Entry
```python
agent.execute_task("Copy data from this PDF to Excel spreadsheet")
```
### 5. Testing
```python
agent.execute_task("Test the login form with invalid credentials")
```
---
## ⚙️ Configuration
### Enable/Disable Failsafe
```python
# Safe mode (default)
agent = AIDesktopAgent(failsafe=True)
# Fast mode (no failsafe)
agent = AIDesktopAgent(failsafe=False)
```
### Set Max Steps
```python
# Prevent infinite loops
result = agent.execute_task("Play game", max_steps=100)
```
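For open-ended tasks such as games, a step cap is what keeps the loop from running forever. The sketch below shows the general idea, not the agent's internals: `run_with_step_limit` and `next_step` are hypothetical names, and reporting the cap as `"failed"` with an `error` message is an assumption about how such a case might be surfaced.

```python
from typing import Callable, Dict, List, Optional


def run_with_step_limit(
    next_step: Callable[[int], Optional[Dict]],
    max_steps: int = 50,
) -> Dict:
    """Pull steps from next_step until it returns None or the cap is hit."""
    steps: List[Dict] = []
    for index in range(max_steps):
        step = next_step(index)
        if step is None:            # the task reports that it is finished
            return {"status": "completed", "success": True, "steps": steps}
        steps.append(step)
    return {
        "status": "failed",
        "success": False,
        "steps": steps,
        "error": f"Stopped after reaching max_steps={max_steps}",
    }


# A task that never finishes on its own, like a game that always offers a move.
endless = run_with_step_limit(lambda i: {"type": "click", "move": i}, max_steps=100)
print(endless["status"], len(endless["steps"]))   # failed 100
```

Note that this version stops as soon as the cap is reached, even if the very next check would have reported completion.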
### Access Action History
```python
# Review what the agent did
print(agent.action_history)
```
---
## 🐛 Debugging
### View Step-by-Step Execution
```python
result = agent.execute_task("Draw a star in Paint")
for i, step in enumerate(result['steps'], 1):
    print(f"Step {i}: {step['step']['description']}")
    print(f"  Success: {step['success']}")
    if 'error' in step:
        print(f"  Error: {step['error']}")
```
### View Screenshots
```python
# Each step captures before/after screenshots
for screenshot_pair in result['screenshots']:
    before = screenshot_pair['before']
    after = screenshot_pair['after']
    # Display or save for analysis
    before.save(f"step_{screenshot_pair['step']}_before.png")
    after.save(f"step_{screenshot_pair['step']}_after.png")
```
---
## 🚀 Future Enhancements
Planned features:
- [ ] **Computer Vision**: OCR, object detection, UI element recognition
- [ ] **LLM Integration**: Natural language understanding with OpenClaw LLM
- [ ] **Learning**: Remember successful patterns, improve over time
- [ ] **Multi-App Workflows**: "Get data from Chrome and put in Excel"
- [ ] **Voice Control**: "Alexa, draw a cat in Paint"
- [ ] **Autonomous Debugging**: Fix errors automatically
- [ ] **Game AI**: Reinforcement learning for game playing
- [ ] **Web Automation**: Full browser control with understanding
---
## 📚 Full API
### Main Methods
```python
# Execute a task
# Signature: execute_task(task: str, max_steps: int = 50)
result = agent.execute_task("Draw a circle in Paint", max_steps=50)
# Analyze screen
analysis = agent._analyze_screen()
# Manual mode: Execute individual steps
step = {"type": "launch_app", "app": "paint"}
result = agent._execute_step(step)
```
### Result Structure
```python
{
    "task": str,                # Original task
    "status": str,              # "completed", "failed", "error"
    "success": bool,            # Overall success
    "steps": List[Dict],        # All steps executed
    "screenshots": List[Dict],  # Before/after screenshots
    "failed_at_step": int,      # If failed, which step
    "error": str,               # Error message if failed
}
```
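If you want type checkers to understand this structure, one option is to mirror it as a `TypedDict`. The sketch below is an illustration rather than part of the agent's API: `TaskResult` and `summarize` are hypothetical names, and `total=False` is used because `failed_at_step` and `error` appear only on failure.

```python
from typing import Any, Dict, List, TypedDict


class TaskResult(TypedDict, total=False):
    task: str
    status: str                         # "completed", "failed", "error"
    success: bool
    steps: List[Dict[str, Any]]
    screenshots: List[Dict[str, Any]]
    failed_at_step: int                 # present only when a step failed
    error: str                          # present only when something went wrong


def summarize(result: TaskResult) -> str:
    """Turn a result dictionary into a one-line, human-readable summary."""
    step_count = len(result.get("steps", []))
    if result.get("success"):
        return f"{result['task']}: completed in {step_count} steps"
    where = result.get("failed_at_step", "?")
    reason = result.get("error", "no error message")
    return f"{result['task']}: {result['status']} at step {where} ({reason})"


ok_result: TaskResult = {
    "task": "Draw a circle in Paint",
    "status": "completed",
    "success": True,
    "steps": [{}] * 6,
}
print(summarize(ok_result))
```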
---
**🦞 Built for OpenClaw - The future of desktop automation!**