DataEngineering

Copy Job vs. Copy Activity in Microsoft Fabric

Hello again!

Moving data remains one of the foundational activities in data engineering. Whether you’re consolidating information from multiple systems into a central lakehouse, performing regular incremental updates, or building complex ETL/ELT workflows, reliable and efficient data movement is essential for keeping analytics platforms current and accurate.

In Microsoft Fabric’s Data Factory, two powerful options stand out for handling these needs: Copy Job and Copy Data Activity. Both are designed to move data between a wide variety of sources and destinations, but they serve slightly different purposes depending on the complexity of the task.

What is a Copy Job?

Copy Job provides a streamlined, no-code (or low-code) experience for moving data without the need to build a full pipeline. It is ideal for straightforward data ingestion scenarios where you want quick setup and built-in intelligence for common patterns.

With a Copy Job, you select your source (databases, files, cloud storage, etc.), choose the destination (such as a Fabric Lakehouse or Warehouse), and configure how the data should be written. The interface guides you through connections, table selection, column mapping, and write behaviors.

Key capabilities include:

  • Full copy or incremental copy modes. In incremental mode, the job automatically tracks changes using watermark columns (such as timestamps or ROWVERSION) or Change Data Capture (CDC) for supported databases; a conceptual sketch of the watermark pattern follows this list.
  • Automatic table creation in the destination if the table doesn’t exist.
  • Options to truncate the destination table before loading.
  • Write methods such as Append, Upsert (merge based on keys), Overwrite, or SCD Type 2 (in preview for CDC scenarios).
  • Built-in audit columns that record extraction time, job ID, and source information for better traceability.
  • Scheduling support, including multiple schedules or event-based triggers.
  • Performance features like auto-partitioning (in preview) for large datasets.
  • Fault tolerance through resume-from-last-successful-run behavior.
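
To make the incremental mode concrete, here is a minimal PySpark sketch of the watermark pattern that a Copy Job automates for you. It is illustrative only: the table, column, and control-table names are invented, the real Copy Job manages its watermark internally, and `spark` is assumed to be the predefined session of a Fabric notebook.

```python
from pyspark.sql import functions as F

# 1. Read the high-water mark left by the previous run (hypothetical control table).
last_wm = (
    spark.read.table("control_watermarks")
    .filter(F.col("table_name") == "sales_orders")
    .agg(F.max("watermark_value"))
    .collect()[0][0]
)

# 2. Pull only the rows modified after that watermark.
delta = (
    spark.read.table("source_sales_orders")
    .filter(F.col("last_modified") > last_wm)
)

# 3. Append the delta to the destination table (a merge/upsert works similarly).
delta.write.mode("append").saveAsTable("lakehouse_sales_orders")

# 4. Compute the new high-water mark to persist for the next run.
new_wm = delta.agg(F.max("last_modified")).collect()[0][0]
```

A Copy Job does all four steps for you, including persisting the watermark and resuming from the last successful run.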

Copy Jobs excel in scenarios like daily or hourly batch loads, database-to-lakehouse synchronization, or multi-source data consolidation where minimal orchestration is required.

What is Copy Data Activity?

When your requirements go beyond simple movement and involve orchestration, custom logic, or integration with other steps, the Copy Data Activity becomes the better choice. This activity is added directly to a Fabric data pipeline canvas, allowing it to sit alongside other activities such as transformations, validations, or conditional branching.

You can launch it via the Copy Assistant for a guided setup or add it manually for full control. Configuration happens across tabs for source, destination, mapping, and settings.

Key capabilities include:

  • Support for custom SQL queries or stored procedures at the source.
  • Advanced performance tuning: intelligent throughput optimization, degree of copy parallelism, staging for large transfers, compression, and data consistency verification.
  • Fault tolerance options to skip incompatible rows.
  • Parameterization for reusable activities across environments.
  • Seamless integration into broader pipelines that may include Data Flows, notebooks, or other activities.

Copy Data Activity is particularly useful for complex migrations, scenarios requiring custom transformations immediately after copying, or when you need to coordinate multiple data movements with dependencies and error handling.

Comparison Table: Copy Job vs. Copy Data Activity

| Capability | Copy Job | Copy Data Activity (in Pipeline) |
| --- | --- | --- |
| Setup complexity | Simple, no pipeline required | Requires building and managing a pipeline |
| Flexibility | Easy to use, with advanced options | Fully customizable and advanced |
| Native incremental copy | Yes (watermark-based or CDC) | No (requires custom logic or queries) |
| CDC replication | Yes | No |
| User-defined query | Yes | Yes |
| Table & column management | Yes (auto-create, truncate, mapping) | Yes (mapping, create new tables) |
| Write behaviors | Append, Upsert, Overwrite, SCD Type 2 | Append, Upsert, Overwrite |
| Orchestration & chaining | Limited (can be called from a pipeline) | Excellent (integrates with other activities) |
| Scheduling | Yes (built-in, multiple schedules) | Yes (via pipeline triggers) |
| Performance tuning | Automatic optimization + auto-partitioning | Detailed control (parallelism, throughput, staging) |
| Audit & observability | Built-in audit columns | Advanced logging and pipeline monitoring |
| Best for | Routine batch/incremental loads | Complex workflows with transformations & logic |

This comparison is based on Fabric’s official decision guide for data movement.

Choosing Between Copy Job and Copy Data Activity

  • Choose Copy Job when you need fast, reliable data movement with native incremental support, table management, and scheduling — but without the overhead of pipeline development. It offers a good balance of simplicity and advanced features like CDC and upsert.
  • Choose Copy Data Activity when you require full customization, complex orchestration, or integration with other pipeline activities. It provides maximum flexibility at the cost of a bit more setup time.

Many teams use both approaches together: Copy Jobs for the majority of routine table syncs, and Copy Activities inside pipelines for the more intricate or transformation-heavy flows.

Both options are serverless, scale automatically, support on-premises sources via gateways, and integrate natively with the rest of Fabric (Lakehouse, Warehouse, Power BI, etc.). They also include robust monitoring so you can track run history, throughput, errors, and performance metrics.

In practice, starting with a Copy Job often gets you productive quickly. Once your needs evolve toward more sophisticated workflows, transitioning selected jobs into pipeline-based Copy Activities is straightforward.

Data movement doesn’t have to be complicated or fragile. With Copy Job and Copy Data Activity, Fabric makes it accessible, scalable, and observable — freeing data engineers to focus on higher-value work like modeling, analytics, and delivering business insights.

If you’re exploring Fabric Data Factory, I recommend trying a simple Copy Job first to experience the ease, then experimenting with the Copy Activity inside a pipeline to see the added power of orchestration.

What data movement challenges are you facing in your environment? Feel free to share in the comments.

Next in the data engineering series, we’ll explore how to combine these movement options with transformations and monitoring for robust end-to-end pipelines.

Thanks for reading! Stay tuned for more practical insights on Microsoft Fabric. Subscribe to the newsletter and keep exploring the world of data. 🚀

DataEngineering

Workspace Identity in Microsoft Fabric

If you’re starting with Microsoft Fabric or Power BI, you’ll often hear the term Workspace Identity. It may sound complex, but it’s actually a simple and powerful concept that improves security, automation, and governance in your data platform.

What Is Workspace Identity?

Workspace Identity is a system-assigned identity created for a workspace in Microsoft Fabric and Microsoft Power BI.

Think of it as a service account automatically managed by Microsoft that allows the workspace to securely access other resources without using personal user credentials.

Simple Definition

Workspace Identity = A secure, automatic identity that a workspace uses to access data and services.

Why Do We Need Workspace Identity?

Before Workspace Identity, many solutions relied on:

  • Personal accounts
  • Shared service accounts
  • Stored credentials in scripts

These approaches can cause security risks and maintenance issues.

Problems Without Workspace Identity

  • Password expiration breaks pipelines
  • Security risks from shared credentials
  • Difficult auditing and governance
  • Manual credential management

Benefits With Workspace Identity

✔ No stored passwords

✔ Centralized security management

✔ Supports automation & pipelines

✔ Improves compliance and governance

How Workspace Identity Works

A Workspace Identity is created and managed in Microsoft Entra ID (formerly Azure AD).

It authenticates the workspace when accessing services like storage, databases, or APIs.

Architecture Overview

1️⃣ Without Workspace Identity (Old Approach)

Explanation:

  • User credentials are stored in pipelines or notebooks
  • The Fabric workspace uses those credentials
  • Access is granted to data sources

❌ Risk: Credentials can expire or be exposed.

2️⃣ With Workspace Identity (Recommended Approach)

Explanation:

  • The workspace has a system-assigned identity
  • The identity is registered in Microsoft Entra ID
  • Data sources grant access to the workspace identity
  • Secure authentication happens automatically

✔ No passwords stored

✔ Secure & scalable
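
In practice, this means code running in the workspace can reach a data source directly once the identity has been granted access on that resource. A minimal sketch, assuming a Fabric notebook (where `spark` is predefined) and placeholder storage account, container, and path names:

```python
# The workspace identity was granted a role such as Storage Blob Data Reader
# on the storage account, so no key, SAS token, or password appears in code.
df = spark.read.parquet(
    "abfss://raw@contosodatalake.dfs.core.windows.net/sales/2024/"
)
df.show(5)
```

The authentication against Microsoft Entra ID happens behind the scenes; the code only ever sees the data.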

Key Components

🔹 Workspace

A container for reports, datasets, notebooks, and pipelines in Fabric/Power BI.

🔹 Workspace Identity

A system-managed identity linked to the workspace.

🔹 Microsoft Entra ID

Identity provider that authenticates the workspace.

🔹 Data Sources

Examples include:

  • Azure Data Lake
  • SQL databases
  • REST APIs
  • Key Vault

Real-World Example

Imagine you have a Fabric workspace that runs a pipeline to load data from Azure Data Lake.

Without Workspace Identity

  • The pipeline stores a service account password
  • The password expires → the pipeline fails

With Workspace Identity

  • The workspace authenticates using its identity
  • There is no password to manage
  • The pipeline runs reliably

When Should Beginners Use Workspace Identity?

Use Workspace Identity when:

✔ Accessing Azure resources securely

✔ Automating pipelines and notebooks

✔ Avoiding credential storage

✔ Implementing governance best practices

How to Enable Workspace Identity (High-Level Steps)

  1. Open your workspace in Microsoft Fabric / Power BI.
  2. Go to Workspace settings.
  3. Enable Workspace Identity.
  4. Assign permissions on Azure resources (IAM).

Security Best Practices

  • Grant least-privilege access
  • Monitor access using audit logs
  • Avoid using personal accounts in production
  • Use Workspace Identity for automation

Common Beginner Mistakes

❌ Using personal accounts in pipelines

❌ Hardcoding credentials in notebooks

❌ Granting excessive permissions

❌ Not documenting identity usage

Summary

Workspace Identity is a foundational security feature in Microsoft Fabric and Power BI that allows workspaces to authenticate securely without storing credentials.

Key Takeaways

  • It is a system-managed identity
  • Improves security and governance
  • Essential for automation and enterprise solutions
  • Recommended for all production workloads

Thanks for reading! Stay tuned for more practical insights on Microsoft Fabric. Subscribe to the newsletter and keep exploring the world of data. 🚀

DataEngineering

Notebooks in Microsoft Fabric

If you’re new to Microsoft Fabric and feeling a bit overwhelmed by all the tools at your disposal, you’re in the right place.

If you’re not quite sure what Microsoft Fabric is yet, I highly recommend checking out my introductory series on Microsoft Fabric before diving in.

Notebooks in Fabric are like your personal playground for coding, data wrangling, and even building machine learning models. They’re built on Apache Spark, making them perfect for data engineers and scientists alike. In this guide, we’ll walk through the basics of using notebooks, from creation to advanced features, sprinkled with handy tips and tricks to make your life easier. We’ll draw from Microsoft’s official documentation to keep things accurate and up-to-date.

Whether you’re ingesting data, transforming it, or experimenting with ML, notebooks offer an interactive, web-based environment that’s collaborative and powerful. Let’s dive in!

What Are Notebooks in Microsoft Fabric?

At their core, notebooks are interactive documents where you can mix executable code, visualizations, and explanatory text. Think of them as a blend of a code editor, a report builder, and a collaboration tool—all powered by Apache Spark for handling big data.

  • For Data Engineers: Use them to ingest, prepare, and transform data seamlessly.
  • For Data Scientists: Experiment with machine learning models, track progress, and deploy solutions.
  • Key Perks: Real-time visualizations, Markdown for documentation, and tight integration with Fabric’s ecosystem like lakehouses and pipelines.

Tip: If you’re coming from Jupyter Notebooks, you’ll feel right at home—Fabric supports importing .ipynb files directly!

Getting Started: Creating Your First Notebook

Starting is simple—no need for complex setups.

  1. Head to the Data Engineering homepage in Fabric.
  2. Click New in your workspace or use the Create Hub.
  3. Select Notebook, give it a name, and boom—you’re in!

You can also import existing notebooks:

  • From your local machine: Use the workspace toolbar to upload .ipynb, .py, .scala, or .sql files. Fabric converts them automatically.

Trick: Always start with a blank notebook for practice. Name it something descriptive like “MyFirstDataTransform” to keep your workspace organized.

Editing and Saving: The Basics

Once created, your notebook opens in Develop mode (if you have edit permissions). Here’s the lowdown:

  • Autosave is On by Default: Edits save automatically after you start working. No more losing progress!
  • Switch to Manual Save: Go to Edit > Save options > Manual if you prefer control. Then use Ctrl+S or the Save button.
  • Save a Copy: Clone your notebook to experiment without messing up the original—great for testing variations.

Tip: In a team setting, toggle to Run Only or View mode to avoid accidental changes when reviewing someone else’s work.

Trick: Use Save a Copy to create branches for different experiments, like one for data cleaning and another for visualization tweaks.

Working with Cells: Code and Markdown Magic

Notebooks are made of cells—building blocks for your content.

  • Code Cells: Write and run code in languages like Python, Scala, or SQL. Right-click files in the lakehouse explorer to auto-generate code snippets (e.g., loading a CSV with Spark or pandas; see the sketch after this list).
  • Markdown Cells: Add text, headings, lists, or even images for explanations. Perfect for documenting your thought process.
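
Here is roughly the kind of snippet Fabric generates when you right-click a CSV in the lakehouse explorer and pick the load option (the path is a placeholder):

```python
# Load a CSV from the attached lakehouse into a Spark DataFrame.
df = spark.read.format("csv").option("header", "true").load("Files/raw/orders.csv")

# display() is the notebook's built-in rich renderer for DataFrames.
display(df)
```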

To run a cell: Hit the play button or use shortcuts (more on those later).

Tip: Start every notebook with a Markdown cell outlining your goals—it keeps you focused and helps collaborators understand your flow.

Trick: Use magic commands (like %%sql for SQL queries or %%pyspark for PySpark code) to switch contexts quickly without restarting sessions. This is a game-changer for mixing languages in one notebook!
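
For example, a single SQL cell can query a lakehouse table directly (the table name is a placeholder from your own lakehouse):

```sql
%%sql
-- Aggregate a lakehouse table without leaving the notebook
SELECT region, COUNT(*) AS order_count
FROM sales_orders
GROUP BY region
```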

Integrating with Lakehouses and Managing Files

Fabric shines in data integration—notebooks connect seamlessly to lakehouses for file and table access.

  • Add a Lakehouse: From the Lakehouse explorer, attach an existing one or create a new one. Pin it as the default to get simple paths (e.g., read files as if they were local).
  • Browse and Operate: In the Lake view, explore tables and files. Right-click to copy paths or generate load code.
  • Resource Folders:
    • Built-in Resources: Per-notebook storage for small files (up to 500 MB total). Upload, download, or access via relative paths.
    • Environment Resources: Shared across notebooks in the same environment—ideal for common scripts.

Need to edit a file? Use the built-in File Editor for CSV, TXT, PY, etc. (up to 1 MB). Save with Ctrl+S.

Tip: After pinning or renaming a lakehouse, restart your Spark session to avoid path errors.

Trick: Drag and drop files into the resources folder for quick uploads. Use notebookutils.nbResPath in code to grab the absolute resource path dynamically—it saves debugging time!
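
As a sketch, reading a small lookup file from the built-in resources folder might look like this (the file name is hypothetical, and the notebookutils surface is still evolving, so verify against your runtime):

```python
import pandas as pd
import notebookutils  # available in the Fabric notebook runtime

# Relative access: files uploaded via the resources pane live under builtin/.
lookup = pd.read_csv("builtin/lookup.csv")
print(lookup.head())

# Absolute access: nbResPath resolves the resource root at run time, which
# helps when relative paths get ambiguous (e.g., handing a path to Spark).
print(notebookutils.nbResPath)
```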

Running Code: Sessions, Security, and Best Practices

Running code is interactive and secure:

  • Interactive Runs: Manual execution under your user context.
  • In Pipelines or Schedules: Runs under the pipeline editor’s or schedule creator’s identity—double-check permissions!

First-time users get a warning: Review code before running to avoid surprises.

Tip: For big jobs, monitor Spark sessions in the UI to spot bottlenecks early.

Trick: Use workspace stages (dev/test/prod) to test notebooks safely without risking production data. Always review version history before executing shared code.

Keyboard Shortcuts to Boost Productivity

Who doesn’t love shortcuts? Here are essentials:

  • Ctrl+S: Save (in manual mode).
  • In the file editor: Standard code navigation and editing keys work, with syntax highlighting.

Tip: Learn cell-specific shortcuts like Shift+Enter to run and move to the next cell—speeds up iterative testing.

Trick: Customize your workflow by combining shortcuts with magic commands for ultra-efficient debugging.

Collaboration: Team Up in Real Time

Notebooks aren’t solo affairs:

  • Co-editing: Multiple users edit simultaneously—see cursors, selections, and live changes.
  • Sharing: Grant Edit, Run, or Share permissions via the toolbar.
  • Comments: Add threaded discussions on cells. Tag @users for notifications (emails sent if needed).

Tip: Use comments for feedback loops in team projects—it beats endless email threads.

Trick: For pair programming, share in Develop mode and use real-time visibility to debug together remotely.

Version History: Track Changes Like a Pro

In preview, but super useful:

  • Checkpoints: Auto every 5 minutes, or manual for milestones.
  • Diff View: Compare versions to see changes in code, output, and metadata.
  • Restore or Copy: Roll back or branch from old versions.

Integrates with Git, VS Code, and pipelines for multi-source tracking.

Tip: Create manual checkpoints before major experiments—easy rollback if things go sideways.

Trick: Label versions descriptively (e.g., “Added ML Model v1”) to make history navigation a breeze.

Troubleshooting Common Hiccups

  • Session Issues: Restart after lakehouse changes.
  • File Limits: Stick to 100 MB per file in resources; use lakehouses for bigger stuff.
  • Permissions: Ensure collaborators have access to tagged resources.
  • No Autosave in Editor: Always Ctrl+S when editing files.

Best Practice: Verify the “last modified by” user in pipelines to maintain security.

Top Tips and Tricks for New Learners

Here’s a roundup to accelerate your learning:

  • Start Small: Begin with simple data loads from a lakehouse to build confidence.
  • Visualize Early: Use libraries like Matplotlib in code cells for quick charts (a tiny example follows this list)—Fabric handles rich outputs beautifully.
  • Experiment with Modes: Switch between Develop and Run Only to test execution without edits.
  • Leverage Integrations: Mount lakehouses as defaults to simplify paths; it’s a huge time-saver.
  • Security First: Always scan shared notebooks via version history.
  • Resource Optimization: Use shared environment folders for reusable code modules across projects.
  • Pro Debugging: Tag comments during co-edits for targeted fixes.
  • Bonus: If stuck, check Fabric’s troubleshooting sections or community forums for real-world advice.
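
On the "visualize early" point, a useful chart really can be this small (the numbers are invented for illustration):

```python
import matplotlib.pyplot as plt

# Made-up monthly order counts, standing in for your real query results.
months = ["Jan", "Feb", "Mar", "Apr"]
orders = [120, 135, 160, 142]

plt.plot(months, orders, marker="o")
plt.title("Orders per month")
plt.xlabel("Month")
plt.ylabel("Orders")
plt.show()
```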

Wrapping Up

Congratulations—you’re now equipped to tackle notebooks in Microsoft Fabric like a seasoned pro! They’re not just tools; they’re your gateway to efficient, collaborative data work. Practice with a sample dataset, experiment freely, and soon you’ll be building pipelines and ML models effortlessly.

Stay tuned for more advanced guides and real-world scenarios on Microsoft Fabric. Subscribe to the newsletter and keep exploring the world of data. 🚀
