Building a Personal Intelligence System: Design Decisions and Lessons Learned

Ricky Lee · 14 min read

In Part 1, I built a pipeline that extracted, normalized, and imported 12,742 life events into Apple Calendar. That covered the what — now for the how and why. This post digs into the engineering decisions that shaped the system, a complementary Apple Notes strategy that gives Siri narrative context, and the practical lessons from building something this personal at scale.

Teaching Apple Intelligence Who You Are

Getting events into Apple Calendar is only half the story. For Apple’s personal intelligence to connect the dots, it needs narrative context — not just data points, but an understanding of who you are. This is where Apple Notes becomes unexpectedly powerful.

The Second-Person Context Strategy

Apple Intelligence draws personal context from Apple Notes, and I discovered that notes written in the second person — as if someone is describing you to an assistant — work best. I maintain 14 structured context documents in Apple Notes:

Each note, and what it tells Siri:

  • About Me: Name, location, career, partner, family, key facts
  • Career Summary: Role, company, tenure, progression, work patterns
  • Relationship Map: Who matters most, how you know them, key details
  • Atlanta Dining Favorites: Preferred restaurants, cuisine preferences, go-to spots
  • Dietary Guide: Dietary restrictions and safe food choices
  • Travel Profile: Countries visited, airline status, travel style
  • Health Profile: Medications, supplements, key conditions
  • Active Subscriptions: Services, costs, what you use regularly
  • Smart Home & Homelab: Home setup, devices, technical infrastructure
  • Favorite Entertainment: Shows, movies, music, podcasts, gaming
  • Food & Beverage: Cooking preferences, grocery habits, beverage favorites
  • Personal Interests: Hobbies, collections, values, personality traits
  • Emergency Contacts: Key people and numbers
  • Legal Summary: Relevant legal documents and milestones

The key insight is voice. Compare:

First person (less effective)

"I like sushi and my favorite restaurant is Nakato."

Second person (more effective)

"Your favorite sushi restaurant is Nakato Japanese Restaurant on Cheshire Bridge Road. You've been going there since 2016 and typically order the Deluxe Sashimi. Your partner prefers the teriyaki chicken."

The second-person framing mirrors how Apple Intelligence internally represents personal context — as facts about you rather than facts from you. In my experience, this subtle shift improves how well Siri connects queries to your personal data.

Writing Effective Context Notes

A few principles that emerged from iteration:

  1. Be specific, not general. “You enjoy dining out” is useless. “Your go-to weeknight spot is Tin Drum Asia Cafe on Lindbergh Drive, where you always order the Pad Thai” gives Siri something to work with.

  2. Include relationships. “Your best friend is [Name]. You’ve known each other since 2005 and met at college.” This lets Siri understand queries like “When did I last see [Name]?” by cross-referencing with calendar events.

  3. State preferences as facts. “You are a Delta Medallion member” and “You drive a Mazda Miata” are the kinds of concrete facts that make Siri’s answers feel genuinely personal.

  4. Cover the domains your timeline covers. Each context note should complement a calendar category. The dining note enriches orange Dining events; the travel profile enriches teal Travel events.

  5. Update regularly. These notes are living documents. When your circumstances change — new job, new medication, new favorite restaurant — update the relevant note.

Structured Event Metadata

Beyond Apple Notes, the events themselves carry metadata that enhances AI reasoning:

X-APPLE-STRUCTURED-LOCATION embeds coordinates and venue names that integrate with Maps. The pipeline maintains a dictionary of 40+ canonical venues with full street addresses:

CANONICAL_VENUES = {
    "Nakato Japanese Restaurant": "1776 Cheshire Bridge Rd NE, Atlanta, GA 30324",
    "The Melting Pot": "754 Peachtree St NE, Atlanta, GA 30308",
    "Tin Drum Asia Cafe": "88 Lindbergh Dr NE, Atlanta, GA 30305",
    "Ponce City Market": "675 Ponce De Leon Ave NE, Atlanta, GA 30308",
    # ... 36 more venues
}

When Calendar shows one of these events, you can tap the location and get directions in Maps. When Apple Intelligence processes location queries, it can reason about geography.

CATEGORIES maps events to semantic groupings — Dining, Travel, Entertainment — enabling queries scoped by type.

CREATED timestamps give every event a temporal anchor that AI can use for recency reasoning and chronological ordering.
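To make the location property concrete, here's a sketch of how such a line could be assembled in Python. The parameter set (X-ADDRESS, X-TITLE, a geo: URI) is an assumption based on commonly observed Apple ICS output, and the coordinates below are placeholders, not the venue's real geocode:

```python
# Sketch: emitting the Apple-specific location property for an ICS event.
# The parameter shape is an assumption; coordinates are placeholders.

def structured_location(venue: str, address: str, lat: float, lon: float) -> str:
    """Build an X-APPLE-STRUCTURED-LOCATION property line."""
    return (
        "X-APPLE-STRUCTURED-LOCATION;VALUE=URI;"
        f"X-ADDRESS={address};"
        f"X-TITLE={venue}:"
        f"geo:{lat},{lon}"
    )

line = structured_location(
    "Nakato Japanese Restaurant",
    "1776 Cheshire Bridge Rd NE\\, Atlanta\\, GA 30324",  # commas escaped per RFC 5545
    33.8150, -84.3570,  # placeholder coordinates, not the real geocode
)
print(line)
```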

EventKit Over ICS

I initially generated ICS files and imported them through Apple Calendar’s file import. This worked but had problems: Calendar would sometimes create duplicates on re-import, and there was no reliable way to delete-and-recreate calendars programmatically.

Switching to direct EventKit import via Python’s pyobjc-framework-EventKit solved everything:

from EventKit import (
    EKEventStore, EKEntityTypeEvent, EKEvent,
    EKSpanThisEvent, EKCalendar, EKSourceTypeCalDAV,
)
from Foundation import NSDate
from AppKit import NSColor

# Event store; the first access triggers the macOS Calendars permission dialog
store = EKEventStore.alloc().init()

# Find the iCloud CalDAV source among the store's sources
icloud_source = next(
    s for s in store.sources()
    if s.sourceType() == EKSourceTypeCalDAV and s.title() == "iCloud"
)

# Create a calendar targeting iCloud specifically
cal = EKCalendar.calendarForEntityType_eventStore_(EKEntityTypeEvent, store)
cal.setTitle_("Dining")
cal.setSource_(icloud_source)  # explicitly target iCloud, not "On My Mac"
cal.setColor_(NSColor.colorWithRed_green_blue_alpha_(0.96, 0.45, 0.20, 1.0))
store.saveCalendar_commit_error_(cal, True, None)

# Write an event
event = EKEvent.eventWithEventStore_(store)
event.setTitle_("Dining: Nakato Japanese Restaurant")
event.setCalendar_(cal)
event.setStartDate_(start_nsdate)
event.setEndDate_(end_nsdate)
event.setAllDay_(True)

store.saveEvent_span_commit_error_(event, EKSpanThisEvent, True, None)

The script creates (or recreates) all 10 calendars programmatically, sets their colors, and writes events directly to the iCloud calendar store. It’s idempotent — every run produces the same result.

Design Decisions

Every pipeline involves tradeoffs. Here are the ones that shaped this system — and the reasoning behind each choice.

CSV as the Master Store (Not SQLite)

The master timeline lives in a single CSV file: master-timeline-review.csv, 33,594 rows, 15 columns. This was a deliberate choice.

CSV wins

Human-readable, diffable in git, trivially inspectable. Opens in any spreadsheet app. Every pipeline stage reads and writes the same format — no ORM, no driver, no abstraction layer.

CSV costs

No schema enforcement, no indexes, no foreign keys. Querying 33K rows means full scans. Dedup loads entire date-buckets into memory. A typo in review_status is silent.

For a personal project that runs in batch mode once a day, these costs are manageable. For anything larger — multiple users, real-time updates, hundreds of thousands of events — I’d reach for SQLite on day one.
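Even while staying on CSV, the silent-typo cost can be softened with a validation pass. A minimal sketch, assuming the column and status names described above (`validate` and the sample rows are illustrative):

```python
import csv
import io

# Minimal guard against the "silent typo" problem: check enum-like columns
# while reading the master CSV. Status values follow the post's review
# workflow; blank means not yet reviewed.
VALID_STATUS = {"approved", "rejected", "pending", ""}

def validate(rows):
    """Collect (line number, value) for every row with an unknown review_status."""
    bad = []
    for i, row in enumerate(rows, start=2):  # row 1 is the header
        if row["review_status"] not in VALID_STATUS:
            bad.append((i, row["review_status"]))
    return bad

sample = io.StringIO(
    "date,event_title,review_status\n"
    "2024-03-15,Dining: Nakato,approved\n"
    "2024-03-16,Shopping: Amazon,aproved\n"  # the typo CSV would swallow silently
)
errors = validate(csv.DictReader(sample))
print(errors)  # [(3, 'aproved')]
```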

Batch Pipeline, Not Stream Processing

The pipeline rebuilds everything from scratch on each run. Every extraction script re-parses every source file. The merge script re-combines all 11 CSVs. Normalization re-processes all 33K events. This is intentionally wasteful.

Why? Idempotency. Every run produces identical output from identical input. If I fix a normalization mapping, the next run applies it retroactively to every historical event. If I add a new data source, it merges cleanly with everything else. There’s no state to corrupt, no partial updates to debug, no “which events were processed in which run?” provenance questions.

  • Full pipeline run: under 90 seconds
  • Events reprocessed: 33K
  • Source CSVs merged: 11
  • Idempotent: 100%

At this scale, the simplicity of batch processing outweighs the efficiency of incremental updates. The breakpoint — where I’d switch to an incremental model — would be somewhere around 100K events or sub-minute update requirements.
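The idempotency argument fits in a few lines: if a run is a pure function of its sources, two runs over identical input must produce identical output. A toy sketch, where `run`, `digest`, and the sample sources are illustrative stand-ins for the real pipeline stages:

```python
import hashlib
import json

def run(sources):
    """Rebuild the timeline from scratch: merge every source, then sort."""
    events = [e for rows in sources.values() for e in rows]
    return sorted(events, key=lambda e: (e["date"], e["title"]))

def digest(events):
    """Stable fingerprint of a run's output."""
    return hashlib.sha256(json.dumps(events, sort_keys=True).encode()).hexdigest()

sources = {
    "cc-statements": [{"date": "2024-03-15", "title": "Dining: Nakato"}],
    "email-mining": [{"date": "2024-02-01", "title": "Travel: Delta flight"}],
}
assert digest(run(sources)) == digest(run(sources))  # identical input, identical output
```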

EventKit Over CalDAV API

Apple exposes calendar data through two interfaces: the CalDAV network protocol and the local EventKit framework.

EventKit (chosen)

Local framework, direct iCloud store access. Create/delete/recreate calendars atomically. Set colors via NSColor. Explicitly target iCloud source. No OAuth, no tokens, no auth complexity. Works with Advanced Data Protection.

CalDAV

Network protocol, platform-independent. But: no calendar deletion, no color standard, defaults to "On My Mac" silently breaking sync. Requires OAuth tokens and refresh logic. Broken by Advanced Data Protection.

The tradeoff is platform lock-in — the import script only runs on macOS. For a system built entirely around Apple Calendar and iCloud, this is an acceptable constraint — and as it turns out, it’s not just acceptable, it’s required.

Advanced Data Protection Changes Everything

If you’ve enabled Advanced Data Protection (ADP) on your iCloud account — and you should — your calendar data is end-to-end encrypted. Apple can’t read it. And neither can anything that accesses it over the network.

This has a concrete implication for this project: CalDAV is effectively dead for ADP-enabled accounts. The CalDAV protocol authenticates against iCloud’s servers, but with ADP, the server doesn’t have the decryption keys for your calendar data. Third-party calendar apps that sync via CalDAV either won’t work or require you to carve out a specific exception for Calendar in your ADP settings — which partially defeats the purpose.

EventKit sidesteps this entirely. It accesses the local calendar store on your Mac, where the data is already decrypted (under ADP, the keys stay on your trusted devices, protected by the Secure Enclave). The import script never touches a network API, never sends credentials to a server, never negotiates with iCloud’s sync infrastructure. It writes directly to the local EventKit database, and iCloud syncs the encrypted result to your other devices through the standard ADP pipeline.

This means the pipeline’s entire data flow — from your local data exports, through Python processing, into EventKit, and out to iCloud — stays on-device until the final encrypted sync. Your 12,000+ life events, including financial transactions, health visits, and travel patterns, never transit the network in plaintext. For a system that’s essentially a structured autobiography, that’s not a nice-to-have — it’s a hard requirement.

The irony is that I chose EventKit over CalDAV for practical reasons (calendar deletion, color control, no OAuth complexity). ADP turned that convenience choice into a security architecture decision. If I’d built the pipeline around CalDAV, enabling ADP would have broken it entirely.

Three-Tier Provenance Model

Every event carries three provenance fields. This is over-engineered for a personal project — and deliberately so.

  • Tier 1 (origin): Which pipeline stage created the event. Examples: cc-statements, email-mining, google-takeout.
  • Tier 2 (audit_source): Which data source it came from. Examples: credit-statements, debit-statements, email-summaries.db.
  • Tier 3 (source_record_id): A pointer to the specific source record. Examples: email-1234, cc-statements:2018-05-05:a1b2c3d4.

When the dedup algorithm makes a questionable decision, provenance answers “where did this event come from and why does it look like that?” When a normalization rule produces unexpected results, source_record_id lets me trace back to the exact credit card transaction or email summary. Real IDs like email-1234 map directly to SELECT * FROM email_summaries WHERE id = 1234. Synthetic IDs like cc-statements:2018-05-05:a1b2c3d4 encode the source, date, and content hash.

100% provenance coverage across 33K+ events means no event is an orphan. This is the same principle behind data lineage systems at companies that operate data pipelines at scale — and it’s just as valuable at the scale of one person’s life.
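As a sketch of how such a synthetic ID might be minted: the post doesn't show the actual hashing scheme, so the SHA-256 truncation below is an assumption that merely reproduces the documented source:date:hash shape, and the raw record string is illustrative.

```python
import hashlib

def synthetic_id(source: str, date: str, raw_record: str) -> str:
    """Mint a deterministic ID for sources without native record IDs."""
    content_hash = hashlib.sha256(raw_record.encode()).hexdigest()[:8]
    return f"{source}:{date}:{content_hash}"

rid = synthetic_id("cc-statements", "2018-05-05", "NAKATO JAPANESE REST  42.17")
print(rid)  # cc-statements:2018-05-05: followed by 8 hex chars
```

Because the hash is derived from the record's content, re-running the pipeline mints the same ID for the same transaction, which keeps provenance stable across full rebuilds.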

Human-in-the-Loop Classification

The pipeline’s classification system is essentially a human-in-the-loop labeling workflow. The extraction scripts apply initial labels (category, signal level) using rules. The normalization layer refines those labels using a growing dictionary of mappings. But the final authority is human review — the review_status field that marks events as approved, rejected, or pending.

This mirrors active learning patterns in ML systems: the automated classifier handles the easy cases (credit card payments → skip, Delta Airlines → travel), surfaces the ambiguous cases for human judgment (is this pharmacy visit Health or Shopping?), and incorporates each human decision back into the classification rules for future runs. The AskUserQuestion tool formalized this into a structured labeling interface — essentially a lightweight annotation tool.

The high rejection rate is a precision/recall tradeoff. The extraction scripts optimize for recall — capture everything, let the human filter. The normalization and dedup layers optimize for precision — clean titles, deduplicated records, correct categories. The human review layer is the final precision gate.
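The split between easy and ambiguous cases can be sketched as a small rule table, where unmatched or deliberately ambiguous entries fall through to pending human review. The rules and sample titles below are illustrative, not the pipeline's real mapping:

```python
# Easy cases get auto-labeled; ambiguous ones surface for human judgment.
# A (None, None) entry marks a merchant the rules refuse to decide on.
RULES = {
    "delta air": ("Travel", "high"),
    "nakato": ("Dining", "high"),
    "cvs": (None, None),  # pharmacy: Health or Shopping? ask the human
}

def classify(title: str) -> dict:
    t = title.lower()
    for needle, (category, signal) in RULES.items():
        if needle in t:
            if category is None:
                return {"category": None, "review_status": "pending"}
            return {"category": category, "signal": signal, "review_status": "approved"}
    return {"category": None, "review_status": "pending"}  # recall-first default

print(classify("DELTA AIR LINES"))  # auto-labeled Travel
print(classify("CVS/PHARMACY"))     # ambiguous, queued for review
```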

Backpressure and Rate Limiting

The iCloud 503 errors taught me something that distributed systems engineers learn early: every external dependency has a rate limit, and if your system doesn’t respect it, the dependency will enforce it for you — usually at the worst possible time.

EventKit writes to the local calendar store, which is fast. But iCloud sync — the process that propagates those changes to Apple’s servers and your other devices — has its own throughput limits. Push 12,000 events in 90 seconds, and the sync daemon queues them. Follow up immediately with 46 surgical updates, and the queue overflows. iCloud responds with 503s and calendars go blank on other devices — a known behavior when the sync daemon’s internal queue saturates.

The correct engineering response is backpressure — don’t write faster than the downstream system can absorb. Options include batching EventKit commits (commit every 500 events instead of every event), inserting small delays between batches to let sync catch up, or separating full rebuilds and surgical updates with a cooldown period. My current approach is the simplest form of backpressure: don’t stack operations. Full rebuilds and surgical updates are mutually exclusive — pick one per session. It’s not elegant, but it’s reliable.
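The batching option can be sketched generically, with a `commit` callback standing in for EventKit's commit call; the batch size and cooldown are illustrative, not tuned values:

```python
import time

def batched_write(events, commit, batch_size=500, cooldown_s=0.0):
    """Accumulate writes and flush every batch_size events, pausing between flushes."""
    pending = 0
    commits = 0
    for _ in events:
        pending += 1
        if pending == batch_size:
            commit()                # flush one batch downstream
            commits += 1
            pending = 0
            time.sleep(cooldown_s)  # give iCloud sync room to drain its queue
    if pending:
        commit()                    # flush the final partial batch
        commits += 1
    return commits

calls = []
n = batched_write(range(1200), commit=lambda: calls.append(1), batch_size=500)
print(n)  # 3 commits: 500 + 500 + 200
```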

Making This Your Own

You don’t need my exact setup to build a personal timeline. The approach works with any AI-powered CLI tool that can read files and write code. Here’s the general playbook:

Step 1: Export Your Data

Start with whatever you have. The highest-value sources for most people:

  • Google Takeout (takeout.google.com): Calendar history, YouTube, Maps timeline, Gmail. This is the single richest source for most people.
  • Credit card statements: Download PDFs or CSVs from your card issuer’s website.
  • Amazon order history: Request your data export from Amazon’s privacy settings.

You don’t need all sources on day one. Start with one or two and iterate.

Step 2: Build Extraction Scripts

This is where the AI CLI tool earns its keep. Point it at your data exports and ask it to build extraction scripts. I used Claude Code — Anthropic’s command-line AI agent — which can read your data files directly, write Python extraction scripts, execute them, review the output, and iterate until the results are clean. The conversational loop is critical: “This merchant name is wrong” → Claude Code fixes the mapping → re-run → check again.

The key is defining a common output schema. Every extraction script should produce a CSV with the same columns: date, time, event_title, location, description, source_category. This standardization is what makes the merge step possible.

Step 3: Merge and Deduplicate

Once you have per-source CSVs, build a merge script that combines them and handles deduplication. Start simple — exact title match on the same date — and add fuzzy matching as you encounter duplicates that slip through.
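The "start simple" version of that dedup fits in a dozen lines: exact match on (date, case-folded title), keeping the first occurrence. The sample events are illustrative:

```python
def dedupe(events):
    """Keep the first event for each (date, normalized title) pair."""
    seen = set()
    kept = []
    for e in events:
        key = (e["date"], e["event_title"].strip().casefold())
        if key not in seen:
            seen.add(key)
            kept.append(e)
    return kept

merged = [
    {"date": "2024-03-15", "event_title": "Dining: Nakato"},
    {"date": "2024-03-15", "event_title": "dining: nakato "},  # same event, different source
    {"date": "2024-03-16", "event_title": "Dining: Nakato"},   # genuinely distinct visit
]
print(len(dedupe(merged)))  # 2
```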

Step 4: Normalize and Enrich

This is the iterative part. Review your merged timeline and fix what’s ugly: inconsistent merchant names, wrong categories, missing locations. Build up a normalization dictionary as you go. Claude Code is particularly good at this — ask it to scan your CSV for patterns and suggest normalizations. “Find all events that look like restaurant names but are categorized as Shopping” is the kind of prompt that surfaces hundreds of corrections in one pass.
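A normalization dictionary in that spirit might look like the sketch below. The Rudy's entry reflects the barbershop fix mentioned later in the post; the raw statement strings and the "Personal Care" category label are illustrative assumptions:

```python
# Raw statement strings mapped to clean titles and corrected categories.
# Entries are illustrative; the category names are not the pipeline's own.
NORMALIZATIONS = {
    "BLVD *RUDY'S PONCE CITY": ("Rudy's Barbershop (Ponce City Market)", "Personal Care"),
    "TST* TIN DRUM - LINDBERGH": ("Tin Drum Asia Cafe", "Dining"),
}

def normalize(event):
    """Apply the dictionary in place; unknown titles pass through untouched."""
    fix = NORMALIZATIONS.get(event["event_title"])
    if fix:
        event["event_title"], event["source_category"] = fix
    return event

e = normalize({"event_title": "BLVD *RUDY'S PONCE CITY", "source_category": "dining"})
print(e)
```

Because the dictionary grows with every review pass, each correction applies retroactively on the next full rebuild.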

Step 5: Import to Apple Calendar

For macOS users, the EventKit approach via pyobjc is the most reliable. If you’d rather not deal with Python/Objective-C bridging, ICS file import works too — just be aware of the re-import duplication issue.

For the ICS route, the key properties to set correctly:

DTSTART;VALUE=DATE:20240315        (all-day event)
CATEGORIES:Dining                   (calendar grouping)
TRANSP:TRANSPARENT                  (don't block your schedule)
X-APPLE-STRUCTURED-LOCATION;...    (Maps integration)
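Assembled into a complete event, those properties sit inside a VEVENT roughly like this. The UID and geo coordinates are illustrative placeholders, real Apple exports may carry additional X- parameters, and note the exclusive DTEND (March 16 for a March 15 all-day event):

```
BEGIN:VEVENT
UID:cc-statements-20240315-0001@example.local
DTSTART;VALUE=DATE:20240315
DTEND;VALUE=DATE:20240316
SUMMARY:Dining: Nakato Japanese Restaurant
CATEGORIES:Dining
TRANSP:TRANSPARENT
X-APPLE-STRUCTURED-LOCATION;VALUE=URI;X-TITLE=Nakato Japanese Restaurant:
 geo:0.0,0.0
END:VEVENT
```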

The Tools

I built this with Claude Code, but the approach is tool-agnostic. What matters is having an AI agent that can read files, write code, and iterate:

  • Claude Code: What I used. Anthropic’s CLI agent that can read your data files, write extraction scripts, run them, and iterate on the output. The ability to work in a persistent project directory and maintain context across sessions was essential for a pipeline this complex.
  • VS Code / Copilot / Cursor: IDE-based AI tools that can accomplish the same workflow with a more visual interface.
  • Any CLI-capable LLM tool: The pattern is general. If your tool can read a CSV and write a Python script, it can build this pipeline.

Pitfalls and Lessons Learned

Building this pipeline taught me several things the hard way.

  • AppleScript defaults to "On My Mac". AppleScript-created calendars live locally, not in iCloud: no sync, no Apple Intelligence indexing, silent failure. Fix: use EventKit with explicit iCloud source targeting.
  • Midnight UTC is the wrong day. An all-day event on March 15 at midnight UTC displays as March 14 in US timezones. Fix: use VALUE=DATE (no time) for ICS, or noon local time for EventKit.
  • EventKit's inclusive end date. ICS says DTEND is exclusive (March 16 for a March 15 event); EventKit says inclusive (March 15). Mixing them up creates two-day or zero-duration events.
  • pyobjc needs system Python. pyobjc-framework-EventKit must be installed system-wide, not in a virtualenv. The first run triggers a macOS permission dialog — check Privacy & Security → Calendars if it fails.

Apple Intelligence Is Still Learning

As of early 2026, Apple Intelligence’s calendar awareness is still evolving. A few observations for anyone building toward this:

  • Calendar search already works well. Even without AI, having 12,000+ structured events means macOS and iOS calendar search becomes genuinely useful. Searching “Starbucks” instantly shows every visit.
  • Cross-app context is the promise. The real payoff comes when Apple Intelligence can connect calendar events with Notes, Messages, and Photos — “Show me photos from my Nashville trip” should eventually cross-reference Travel calendar events with photo geotags.
  • Structured data wins. Clean titles, consistent categories, and real locations will matter more as AI features mature. The investment in normalization pays dividends later — garbage-in-garbage-out applies to AI reasoning just as much as it does to database queries.
  • TRANSPARENT events are essential. Mark all historical events as TRANSP:TRANSPARENT so they don’t pollute your actual availability or trigger scheduling conflicts. These are history, not commitments.

Curation Is the Work

The most time-consuming part isn’t building the pipeline — it’s reviewing the output. Is this Amazon order worth a timeline entry, or is it commodity? Should this pharmacy run be categorized as Health or Shopping? Does this transaction at “BLVD *RUDY’S PONCE CITY” deserve a clean venue name in the dictionary (it did — it’s a barbershop, not a restaurant, and the pipeline originally miscategorized it as dining)?

That curation can’t be fully automated — it requires judgment about what matters to you. But it gets faster over time as the normalization rules accumulate.

The One Thing I’d Change

If I were starting this project over, I’d use SQLite from day one.

The master CSV was the right choice for the first 5,000 events. It stopped being the right choice somewhere around 15,000. At 33,594 rows, I’m maintaining a flat file that wants to be a database.

The pain points are specific:

  • Deduplication. Today: a full scan every run that loads date-buckets into memory and compares every pair. With SQLite: a self-join on an indexed date column, persisting dedup decisions instead of recomputing them.
  • Provenance queries. Today: "show me rejected credit card events" requires grep and awk on a 33K-row CSV. With SQLite: WHERE origin = 'cc-statements' AND review_status = 'rejected' returns in milliseconds.
  • Schema enforcement. Today: nothing prevents writing dinning instead of dining, and a blank review_status is silent. With SQLite: CHECK constraints and enum tables catch errors at write time.

The migration cost is why I haven’t done it yet. Every pipeline script reads and writes CSV. The review_status and review_notes columns are hand-edited in a spreadsheet for bulk curation. Moving to SQLite means rewriting the I/O layer of every script and building a review interface to replace the spreadsheet workflow. It’s a weekend project that I keep pushing to next weekend.

The lesson is general: start with the simplest data format that works, but have a migration plan for when it stops working. CSV carried this project further than it should have, and the refactor cost grows with every script that assumes flat-file I/O.

What’s Next

The calendar timeline and Apple Notes knowledge base are two halves of the same system. Events give Siri temporal context — where you’ve been, what you’ve done. Notes give Siri personal context — who you are, what you prefer, how your life is structured. Together, they turn Apple Intelligence from a generic assistant into something that actually knows you.

The data was always there, scattered across bank statements and email archives and service exports. It just needed someone — or something — to pull it together. And with Apple Intelligence gaining more personal context awareness with each update, the investment made today becomes more valuable tomorrow.


Ricky Lee is a Staff Engineer at Fueled (formerly 10up), where he’s spent 11+ years building enterprise content management platforms. This project — a personal data pipeline that turned scattered life data into a structured, AI-ready timeline — grew out of curiosity about what happens when you point modern AI tooling at your own digital footprint. He writes at rickylee.com and is @rickalee on GitHub.