Codebeez

Blipz on the radar 2022: summaries

Thoughtworks keynote

Data Mesh concepts in practice

The Thoughtworks Technology Radar started as an internal tool and is now published for the outside world.

Alexandre Goedert and Roman, both from Thoughtworks, discussed data mesh, a topic that has gained significant attention since the publication of Zhamak Dehghani’s book this year.

Current Understanding of Data Mesh:

Data mesh is defined similarly across sources but with different emphases:

  • Platform architecture
  • Decentralized data architecture
  • Strategic approach to modern data management

While implementation efforts are underway with various clients, integration costs remain high due to significant startup expenses. Vendors typically focus on technology and architecture rather than addressing social aspects of adoption.

Core Concept:

Data mesh represents “a sociotechnical approach in managing and accessing analytical data at scale.” Analytical data encompasses temporal, historic, and aggregated views of business facts over time.

Zhamak’s book, released March 2022, represents a natural evolution of earlier approaches. Thoughtworks’ journey shows data becoming increasingly diverse since 2007, with growing volume and freshness requirements. Current architecture still separates operational and analytical data planes with ETL pipelines and data governance layers.

Principal Challenges:

Data landscapes and user requirements continuously evolve:

  • Data has diversified significantly since 2005
  • Demand for faster data access has increased
  • Reduced dependency on tribal knowledge is needed
  • Analysts require greater involvement with data
  • Low-code solutions are necessary to lower technical barriers

Current Pitfalls:

  • Fail to scale sources
  • Fail to scale consumers
  • Fail to bootstrap data products
  • Fail to materialize data-driven value

Data Mesh Solutions:

  • Distributed domain-driven architecture
  • Data as a product
  • Self-serve data infrastructure
  • Federated computational governance

Team Topologies:

Enabling teams facilitate adoption of self-service platforms for onboarding new teams.

Domain Data Products Should Be:

  • Discoverable
  • Addressable
  • Self-describing
  • Trustworthy
  • Secure
  • Interoperable

Each data product exists within a domain with input and output ports that function as modules at the domain level, allowing easy addition of new products using existing patterns.
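As a rough illustration of the idea above, a data product can be modeled as a small descriptor with named input and output ports that a catalog can search. This is a minimal Python sketch: the class names, fields, and the `example://` address are hypothetical and are not taken from the talk or from any data mesh platform.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    """A named input or output of a data product (hypothetical model)."""
    name: str
    address: str  # where the data can be read, e.g. a table or topic URI
    schema: dict  # column name -> type, which makes the port self-describing


@dataclass
class DataProduct:
    domain: str
    name: str
    owner: str
    input_ports: list = field(default_factory=list)
    output_ports: list = field(default_factory=list)


class Catalog:
    """In-memory registry that makes data products discoverable."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        # The (domain, name) pair acts as the product's unique address.
        self._products[(product.domain, product.name)] = product

    def find_by_column(self, column):
        """Return products exposing `column` on any output port."""
        return [
            p for p in self._products.values()
            if any(column in port.schema for port in p.output_ports)
        ]


catalog = Catalog()
catalog.register(DataProduct(
    domain="sales",
    name="daily_orders",
    owner="sales-data-team",
    output_ports=[Port("orders", "example://sales/daily_orders",
                       {"order_id": "string", "amount": "decimal"})],
))
matches = catalog.find_by_column("amount")
print([f"{p.domain}.{p.name}" for p in matches])
```

Adding a new product to a known domain then means registering another descriptor that follows the same pattern.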

Data Lineage and Discovery:

The approach emphasizes easy discoverability of data products. New products can be spun up within known domains, reusing established data patterns that separate creation planes from infrastructure management and exploration planes.

Data Ethics Canvas:

The framework balances data hoarding against data fearing—collecting unused data wastes resources. By pushing data products down to infrastructure levels, only required data gets collected by domains.

Implementation Readiness:

Companies should already be familiar with microservices and domain-driven architecture. However, “data-driven culture is very important across 3 different pillars”:

  • Data availability
  • Data accessibility
  • Data literacy-developer portal

Better than the new oil: Sustainable IT on the radar!

Tom Kennes

Tech and data represent valuable products but consume significant resources. This presentation raised awareness about institutes and tools advancing IT sustainability.

Cloud providers run on green energy, but this potentially leaves less renewable energy available to households.

Key Resources:

  • SDIA - European Alliance for Sustainable IT

Efficiency Principles:

Koomey’s Law: The energy efficiency of computing hardware doubles roughly every 18 months.

Jevons Paradox: Efficiency gains tend to be offset by increased consumption, so total resource use can still rise.
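The interaction between the two principles above can be made concrete with a little arithmetic. The sketch below assumes an 18-month doubling period and uses made-up workload figures; it only illustrates how total energy use can rise even though efficiency improves.

```python
def koomey_gain(years, doubling_period_years=1.5):
    """Efficiency multiplier after `years` if efficiency doubles every period."""
    return 2 ** (years / doubling_period_years)


def total_energy(workload, efficiency):
    """Energy used = work done / work per unit of energy."""
    return workload / efficiency


# Baseline: 100 units of work at efficiency 1.0.
before = total_energy(workload=100, efficiency=1.0)

# Three years later hardware is 4x as efficient (two doublings), but cheaper
# compute has led to 6x as much work being run (made-up figure).
after = total_energy(workload=600, efficiency=koomey_gain(3))

print(koomey_gain(3))   # 4.0
print(before, after)    # 100.0 150.0
```

In this example the hardware became four times as efficient, yet total energy use still grew by 50 percent because demand grew faster.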

Carbon Accounting:

EU regulations now require companies to report energy usage beginning in 2024. Trustworthy measurements require lower-level consumption data at CPU or package levels using RAPL (Running Average Power Limit), currently available only on certain Intel processors.
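On Linux, RAPL counters are typically exposed through the powercap sysfs interface as a cumulative energy counter in microjoules. The sketch below shows how average power could be derived from two readings, including counter wraparound. The sysfs path is an assumption about the machine (it differs per system and usually requires elevated permissions), and the readings in the example are made up.

```python
from pathlib import Path

# Assumed location of the package-0 RAPL domain on Linux (powercap sysfs).
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_counter_uj(domain=RAPL_DOMAIN):
    """Read the cumulative energy counter in microjoules (needs read access)."""
    return int((domain / "energy_uj").read_text())


def average_power_watts(start_uj, end_uj, seconds, max_range_uj):
    """Average power between two counter readings.

    The counter wraps around at max_range_uj, so a smaller end value
    is treated as one wraparound during the interval.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        delta_uj += max_range_uj
    return delta_uj / 1_000_000 / seconds


# Made-up readings: 30 J consumed over 2 s.
print(average_power_watts(1_000_000, 31_000_000, 2.0, 262_143_328_850))  # 15.0
# Same energy and interval, but the counter wrapped around once.
print(average_power_watts(90_000_000, 20_000_000, 2.0, 100_000_000))     # 15.0
```

On a real machine, `read_counter_uj()` would be called before and after the workload being measured, with `max_range_uj` read from the `max_energy_range_uj` file in the same directory.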

Sustainability Tools:

Developer Actions:

  • Relocate cloud resources to lower-consumption areas
  • Use minimal necessary resources
  • Use Leafcloud or Blockheating
  • Reduce webpage bloat
  • Select lower-consumption languages—“Rust and C use 75x less resources than Python for similar code”
  • Optimize code performance
  • Reduce deployment frequency

Polars

Ritchie Vink

Polars is a DataFrame library positioned as a replacement for pandas. Its author, Ritchie Vink, has a background in ML and software development and built Polars while it was incubated by Xomnia.

Motivation:

Current DataFrame implementations don’t apply 60 years of RDBMS design principles:

  • Almost all implementations use eager evaluation without query optimization
  • Massive wasteful materializations occur
  • Users bear responsibility for fast, memory-efficient compute
  • Parallelism is absent
  • Pandas remains largely single-threaded

Pandas inherited NumPy quirks regarding strings and missing data. Dask attempts parallelization by adding CPU power rather than addressing root problems.

Polars Architecture:

Polars functions as a frontend over Apache Arrow memory abstractions with a vectorized parallel query engine.

Foundations:

Arrow:

  • Columnar in-memory standard
  • Future of data communication
  • Eliminates serialization/deserialization costs
  • Enables free pointer sharing within processes
  • Arrow2 provides native Rust implementation

Rust:

  • No garbage collector; shared data is reference-counted
  • COW (Copy-on-Write) with atomic reference counting
  • No mutable aliases (compile-time checked)
  • Lock-free mutation with safe reference sharing
  • Very fast performance

Expression API:

The design reduces API surface while supporting powerful expressions optimizable into efficient execution plans—somewhat similar to PySpark but with more flexible expressions.

Performance:

Polars was presented as the fastest library on open-source benchmarks. Other libraries suffer from the GIL and string conversion overhead.

Future Development:

Out-of-core streaming support for larger-than-memory data is in development, with demonstrations showing full CPU utilization while maintaining stable, low memory usage.


CDK: Are we on the road to infrastructure nirvana?

Nico Krijnen

Cloud Development Kits:

CDKs enable infrastructure definition using programming languages instead of configuration files or templates. “People would rather read code than YAML files.”

Code is read far more frequently than written (10:1 ratio), so developers should optimize for readability. CDKs allow condensed infrastructure code that’s easier to read and change.
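The idea of condensing infrastructure into code can be sketched without any CDK at all. The example below is a toy, not the AWS CDK API: it uses plain Python to generate a CloudFormation-style template from one reusable function. The function names and bucket names are hypothetical.

```python
import json


def versioned_bucket(name):
    """Return a CloudFormation-style resource for a versioned S3 bucket."""
    return {
        "Type": "AWS::S3::Bucket",
        "Properties": {
            "BucketName": name,
            "VersioningConfiguration": {"Status": "Enabled"},
        },
    }


def build_template(environments):
    """One reusable block, instantiated once per environment."""
    return {
        "Resources": {
            f"{env.capitalize()}DataBucket": versioned_bucket(f"example-data-{env}")
            for env in environments
        }
    }


template = build_template(["dev", "test", "prod"])
print(json.dumps(template, indent=2))
```

A few lines of code expand into three near-identical resource definitions that would otherwise be copied by hand in a template file. A real CDK adds typed constructs and deployment tooling on top of this principle.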

Reusable Infrastructure Blocks:

Is infrastructure merely hardware? What can infrastructure learn from software engineering?

Software changes; code that’s easy to read is easier to modify. Version control and automated tests create safety nets for change.

CUPID replaces SOLID:

  • Composable → Plays well with others
  • Unix philosophy → Single-purpose code that works together
  • Predictable → Behaves as expected
  • Idiomatic → Feels natural
  • Domain based → Code resembles domain language and structure

Software value lies in solving business user problems. Infrastructure doesn’t immediately add business value, so spending less time on it allows more focus on business-value creation.


Train humans instead?

Vincent van Warmerdam

Warmerdam introduced Explosion, the Berlin-based company behind spaCy and Prodigy. Two demonstrations then explored how to rethink the way ML systems are constructed.

Part One: Credit Card Fraud Detection

The demo used the Keras credit card fraud detection data in JupyterLab, visualizing a pandas DataFrame with HiPlot, an interface that resembles a giant grid search view. By highlighting rows of data across the distributions and assigning colors, a rudimentary classifier emerged. Business rules were selected visually and combined with AND operators, producing a scikit-learn compatible benchmark model.

This approach puts model scores in a different light than a typical ML workflow does. Rather than creating “ML soup,” understanding the data better produces rules that generalize.
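A rule-based benchmark of the kind described above can be sketched in a few lines of plain Python. The transactions, thresholds, and rule names below are made up for illustration; the demo itself selected its rules visually rather than hard-coding them.

```python
# Made-up transactions: (amount, hour_of_day, is_fraud)
transactions = [
    (12.0, 14, False),
    (950.0, 3, True),
    (30.0, 2, False),
    (1200.0, 4, True),
    (800.0, 15, False),
    (5.0, 23, False),
]


def high_amount(tx):
    return tx[0] > 500


def night_time(tx):
    return tx[1] < 6


def all_of(*rules):
    """Combine rules with a logical AND."""
    return lambda tx: all(rule(tx) for rule in rules)


def precision_recall(predict, rows):
    tp = sum(1 for r in rows if predict(r) and r[2])
    fp = sum(1 for r in rows if predict(r) and not r[2])
    fn = sum(1 for r in rows if not predict(r) and r[2])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


benchmark = all_of(high_amount, night_time)
print(precision_recall(benchmark, transactions))    # (1.0, 1.0)
print(precision_recall(high_amount, transactions))  # precision 2/3, recall 1.0
```

Any ML model trained on the same data then has to beat this benchmark to justify its added complexity.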

Framework:

  • Data → Rules → Labels
  • Labels → ML → Rules

Visualizing differences between models and rule-based systems reveals what models actually add.

Part Two: Human-in-the-Loop Learning

Instead of treating ML models as black boxes, teach them interactively. Humans can steer models away from unethical or unfair behavior during training.

Metrics alone seem inadequate—actual predictions matter more. How might the system fail? Data quality is critical.

Demonstration Approach:

The demo started from the non-optimal embeddings of a pretrained image model. Reducing their dimensionality via PCA or UMAP produces clusters in which a whole class can be selected at once, which accelerates labeling.
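The bulk-selection step can be illustrated with a small sketch. It assumes the embeddings have already been reduced to two dimensions (for example with PCA or UMAP) and uses made-up coordinates. The `label_region` helper is hypothetical and stands in for selecting a cluster in an interactive plot.

```python
def label_region(points, x_range, y_range, label, labels=None):
    """Assign `label` to every still-unlabeled point inside a rectangle."""
    labels = dict(labels or {})
    for idx, (x, y) in enumerate(points):
        inside = x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]
        if inside and idx not in labels:
            labels[idx] = label
    return labels


# Made-up 2D coordinates standing in for reduced image embeddings.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (2.0, 2.1), (2.2, 1.9), (1.0, 1.0)]

labels = label_region(points, (0.0, 0.5), (0.0, 0.5), "cat")
labels = label_region(points, (1.8, 2.5), (1.8, 2.5), "dog", labels)

unlabeled = [i for i in range(len(points)) if i not in labels]
print(labels)     # {0: 'cat', 1: 'cat', 2: 'cat', 3: 'dog', 4: 'dog'}
print(unlabeled)  # [5]
```

Two selections label five of the six points; only the ambiguous point between the clusters is left for manual review.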

With a Prodigy server, the workflow is: start with a pretrained model, annotate data visually, train a dense layer on top of the pretrained representations for a specific task (e.g., cat classification), then visualize the output of this trained layer to check that it is relevant to the domain.

Key Principles:

  1. Annotate your own data
  2. Build on relatively simple tricks
  3. Construct systems facilitating this process
  4. Use tricks for data understanding and testing
  5. Consider interactivity as a design pattern

OSINT tips and tricks

Alwin Peppels, Cyberseals

Open-source intelligence exploration covering IoT, RF, AV, and locks.

Techniques:

  • Enriching data
  • Search by exclusion
  • Monitoring differences

Tools and Methods:

Maltego combines and enriches data about entities. Google query operators target specific content precisely, and file extension queries within a targeted domain narrow the results further. Google Lens loosely matches objects in images and can geolocate them.
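As a small illustration of combining query operators, the sketch below composes a search string from the `site:` and `filetype:` operators and the minus sign for exclusion. The `build_query` helper and the example domain are hypothetical.

```python
def build_query(terms="", site=None, filetype=None, exclude=()):
    """Compose a search query from common Google operators."""
    parts = []
    if terms:
        parts.append(terms)
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    # A leading minus excludes results containing that term.
    parts.extend(f"-{term}" for term in exclude)
    return " ".join(parts)


query = build_query('"annual report"', site="example.com",
                    filetype="pdf", exclude=["draft"])
print(query)  # "annual report" site:example.com filetype:pdf -draft
```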

Government Records:

KVK: Business owner information. Owners who do not yet have a business address often use personal details for their initial registration.

Kadaster: Combined with KVK data, provides birthdays and property values.

RDW: Current and past vehicle ownership information with estimated values. License plates can be guessed from images; a guess that validates against the RDW database in turn enables further image searches.

Uncommon Data Sources:

Data Breaches:

COMB (Compilation of Many Breaches) exposed 3.2 billion emails and passwords. Data breaches containing up to 100GB of data with passwords, emails, gender, and language information are discoverable.

Data Leakage:

Shodan port-scans the IPv4 address space and collects screenshots from unprotected devices such as cameras and other IoT devices. Private data in daily life, like mail tracking, leaks sender-to-receiver information.

Mobile number redaction varies across sites (PayPal, LastPass), with each site masking different digits. Combined like a sudoku puzzle, the partial views leave only a few unknown digits to guess. Forgotten password interfaces sometimes reveal whether an email address exists, which aids enumeration. Facebook and LinkedIn friend suggestions reveal phone numbers from contact lists.
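Why inconsistent masking is a problem becomes clear when two redacted views are overlaid. The sketch below uses a fictitious number and two hypothetical masking schemes; it is meant to show service owners how little remains hidden when each site reveals a different part.

```python
def merge_masks(*masked, mask_char="*"):
    """Merge differently redacted views of the same value."""
    if len({len(m) for m in masked}) != 1:
        raise ValueError("all masked values must have the same length")
    merged = []
    for chars in zip(*masked):
        known = {c for c in chars if c != mask_char}
        if len(known) > 1:
            raise ValueError("masked values conflict; not the same number")
        merged.append(known.pop() if known else mask_char)
    return "".join(merged)


# Fictitious number as shown by two hypothetical sites.
site_a = "06******89"   # reveals the last two digits
site_b = "0612345***"   # reveals the first seven digits
combined = merge_masks(site_a, site_b)
print(combined, combined.count("*"))  # 0612345*89 1
```

Only one digit remains unknown here, so ten guesses suffice. Masking the same positions everywhere avoids this leak.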


Keynote 2: Perimeter security is dead

Lechner

Technology changes over time; sometimes principles change too.

Perimeter Concept:

A perimeter is a closed wall around resources (like a castle) with a gate for authorized access.

Evolution:

Mainframes: Physical perimeter was the room housing the mainframe. Terminals existed, making physical security important.

Client-Server Networks: TCP/IP networks later connected to the internet, requiring firewalls to filter incoming and outgoing traffic—a digital perimeter security approach.

Shared Intranet Applications: Applications moved from local desktops to external data centers. VPN combined with intelligent firewalls secured connectivity.

Current State: Work-from-home models funnel all traffic through VPNs using platform-as-a-service, creating massive volume and diversity. Past attacks showed that filtering such packet volumes is difficult. Drawing perimeters across complex cloud services, office computers, remote workers, and software update services is impractical.

Alternatives:

Zero Trust Design:

  • Never trust; always verify
  • Implement least privilege
  • Assume breach
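The first two principles can be sketched as a check that runs on every request, no matter which network the request comes from. This is a minimal illustration using only the Python standard library. The service names, permission strings, and the HMAC-based signature are assumptions standing in for a real identity provider and policy engine.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-only-secret"  # placeholder; load from a secret store in practice

# Least privilege: each identity gets only the actions it needs.
PERMISSIONS = {
    "report-service": {"orders:read"},
    "billing-service": {"orders:read", "invoices:write"},
}


def sign(identity):
    """Issue a signature for an identity (stand-in for a real token issuer)."""
    return hmac.new(SECRET_KEY, identity.encode(), hashlib.sha256).hexdigest()


def authorize(identity, signature, action):
    """Verify every request, regardless of which network it came from."""
    if not hmac.compare_digest(sign(identity), signature):
        return False  # never trust: unverified callers are rejected
    return action in PERMISSIONS.get(identity, set())


token = sign("report-service")
print(authorize("report-service", token, "orders:read"))      # True
print(authorize("report-service", token, "invoices:write"))   # False
print(authorize("billing-service", token, "invoices:write"))  # False
```

The second call fails on least privilege and the third on verification, since the token was issued for a different identity.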

Application Security:

  • Encrypt network traffic
  • Maintain backups