Tadpole – A modular and extensible DSL built for web scraping

20 points - today at 4:29 PM

bobajeff today at 5:57 PM
I had to look up what KDL is and what `Functional Source License, Version 1.1, ALv2 Future License` is.

So KDL is like another JSON or YAML. FSL-1.1-ALv2 is an almost-but-not-quite open source license that becomes available under a real open source license after two years. It's meant to prevent freeloading by companies or something. Sounds fine to me, actually.
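For anyone else who hasn't seen it: KDL is node-based rather than pure key/value, roughly like this (made-up example):

  // a node with an argument, a property, and children
  package "tadpole" {
    version "0.1.0"
    license "FSL-1.1-ALv2" becomes="Apache-2.0"
  }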

zachperkitny today at 4:29 PM
Hello!

I wanted to share my recent project: Tadpole. It is a custom DSL built on top of KDL specifically for web scraping and browser automation. I wanted there to be a standardized way of writing scrapers and reusing existing scraper logic. This was my solution.

Why?

    Abstraction: Simulating realistic human behavior (Bézier curves, easing) through high-level composed actions.
    Zero Config: Import and share scraper modules directly via Git, bypassing NPM/registry overhead.
    Reusability: Actions and evaluators can be composed through slots to create more complex workflows.

Example

This is a fully working example; @tadpole/cli is published on npm:

tadpole run redfin.kdl --input '{"text": "Seattle, WA"}' --auto --output output.json

  import "modules/redfin/mod.kdl" repo="github.com/tadpolehq/community"

  main {
    new_page {
      redfin.search text="=text"
      wait_until
      redfin.extract_from_card extract_to="addresses" {
        address {
          redfin.extract_address_from_card
        }
      }
    }
  }
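
For reference, `extract_to="addresses"` names the key the extracted records are collected under, so output.json comes out shaped roughly like this (values illustrative):

  {
    "addresses": [
      { "address": "123 Example St, Seattle, WA" }
    ]
  }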

Roadmap

Planned for 0.2.0

    Control Flow: Add maybe (effectively try/catch) and loop (while {}, do {}); a rough sketch follows this list
    DOMPick: Used to select elements by index
    DOMFilter: Used to filter elements using evaluators
    More Evaluators: Type casting, regex, exists
    Root Slots: Support for top level dynamic placeholders
    Error Reporting: More robust error reporting
    Logging: More consistent logging from actions and add log action to global registry
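
Purely as a sketch of where the control flow is headed (syntax not final; the selector and the click action here are made up for illustration):

  main {
    new_page {
      maybe {   // swallow failures from the block below, like a try/catch
        loop {
          while { exists selector=".next" }   // planned 'exists' evaluator
          do { click selector=".next" }       // hypothetical 'click' action
        }
      }
    }
  }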

0.3.0

    Piping: Allowing different files to chain input/output.
    Outputs: Complex output sinks to databases, S3, Kafka, etc.
    DAGs: Use directed acyclic graphs to create complex crawling scenarios and parallel compute.

GitHub repository: https://github.com/tadpolehq/tadpole

I've also created a community repository for sharing scraper logic: https://github.com/tadpolehq/community

Feedback would be greatly appreciated!