Best Practices of Different Data Ingestion Options in Snowflake

Snowflake Wiki
3 min read · Mar 1, 2023

Data ingestion, the process of obtaining and importing data for immediate storage or use in a database, usually comes in two flavours: data ingested in batches and data streaming. Batch ingestion imports data in discrete chunks at pre-determined time slots, while real-time data streaming naturally follows an unpredictable ingestion schedule.

Snowflake’s Data Cloud solves many of the data ingestion problems that companies face with the following ingestion options: COPY for bulk/batch loading, Snowpipe for continuous file-based loading, and Snowpipe Streaming for row-based streaming ingestion.

*Snowpipe Streaming is currently a preview feature open to all accounts.

Let’s look at the benefits and best practices of each -

Benefits of COPY

~ Large scale bulk data ingestion
~ Scalable compute warehouse
~ Resilient error handling and retry
~ Deduplication support
~ Full support for semi-structured data on load
~ Pre-defined data format options
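For instance, here is a minimal sketch of a semi-structured load; the stage and table names (json_stage, raw_events) are hypothetical placeholders and the staged files are assumed to be JSON:

-- Hypothetical one-column table receiving raw JSON documents as VARIANT.
CREATE TABLE raw_events (payload VARIANT);

-- Pre-defined file format options handle the parsing on load.
COPY INTO raw_events
  FROM @json_stage
  FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE);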

COPY Best Practices

Do
=> Leverage object path partitioning for effective listing and loading
=> Use the FILES or PATTERN option to specify the exact files when loading ad-hoc
=> Utilize schema detection for simple table creation and schema evolution for continuous data loading
=> Take advantage of file format and copy options such as ON_ERROR (see the sketch after this list)

Don’t
<> Don’t upsize your warehouse to load a single file faster; larger warehouses won’t speed up single-file loads
<> Don’t over-complicate your large data loads; let Snowflake handle the job management
<> Don’t split or merge files unless your data source supports it natively; it’s not worth building a separate tool or process
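To make the Do items above concrete, here is a minimal sketch of an ad-hoc load and of schema detection; the table, stage, and file format names (sales, my_stage, my_csv_format, my_parquet_format) are hypothetical, and the stage is assumed to already exist:

-- Ad-hoc load: a partitioned path plus PATTERN (or FILES) targets the exact files,
-- and the ON_ERROR copy option keeps loading past bad records.
COPY INTO sales
  FROM @my_stage/sales/2023/03/01/
  PATTERN = '.*sales_.*[.]csv'
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  ON_ERROR = 'CONTINUE';

-- Schema detection: build the target table directly from staged Parquet files.
CREATE TABLE sales_from_parquet
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(
      LOCATION => '@my_stage/sales/',
      FILE_FORMAT => 'my_parquet_format')));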

Benefits of Snowpipe

~ Continuously generated data is available for analysis in <1 minute
~ You only pay for the compute time you use to load data
~ Avoid repeated manual COPY commands
~ Zero management. No indexing, tuning, partitioning or vacuuming on load
~ Error notifications for failure alerting
~ Serverless: No server to manage or concurrency to worry about.

Snowpipe Best Practices

Do
=> Utilize native cloud provider event filtering
=> Ensure optimal file sizing (100–250 MB)
=> Utilize the same notification integration for multiple Azure/GCP pipes
=> Utilize error notifications for failure alerting
=> Check pipe status for debugging and health (see the sketch after this list)

Don’t
<> Ingest tiny files (~1 KB); use Snowpipe Streaming where possible to bypass files altogether
<> Build your own Frankenstein auto-ingest solution or custom replication
<> Drop stages, tables, or integrations that are used by running pipes
<> Ignore COPY best practices
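As a rough sketch of the auto-ingest pattern (the database, schema, pipe, stage, and file format names are hypothetical, and the stage is assumed to be wired to cloud event notifications):

-- Hypothetical auto-ingest pipe: cloud storage notifications trigger the COPY below.
CREATE PIPE my_db.my_schema.sales_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO my_db.my_schema.sales
    FROM @my_db.my_schema.sales_stage
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');

-- Check pipe health (execution state, pending file count) while debugging.
SELECT SYSTEM$PIPE_STATUS('my_db.my_schema.sales_pipe');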

Benefits of Snowpipe Streaming

~ Data queryable in <5 seconds after Snowflake acknowledges the rows.
~ Low cost for both high-volume and trickle (low-volume) ingestion.
~ Direct Data Streaming: Ingest rows into tables over HTTPS; no files or pipes to manage.
~ Exactly-Once and Per-Channel Ordering: Get data in exactly once and process it in order.
~ High Throughput: GB/sec ingestion rates supported.
~ Low overhead: Minimal configuration and setup needed.

Snowpipe Streaming Best Practices

Do
=> Consider your business requirements to determine the right ingestion cost/latency trade-off
=> Leverage Snowpipe Streaming for Snowflake’s Kafka Connector with a simple upgrade (see the configuration sketch after this list)

Don’t
<> Build a streaming application just to read a large file or batch load existing data
<> Replace all batch ingestion; augment it instead with streaming
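As an illustrative sketch of the “simple upgrade” mentioned above, switching the Snowflake Kafka Connector from file-based Snowpipe to Snowpipe Streaming is primarily a configuration change; the role and topic names below are hypothetical, and the rest of the connector’s connection properties are omitted:

# Hypothetical Kafka Connector properties: switch the ingestion method to streaming.
snowflake.ingestion.method=SNOWPIPE_STREAMING
# A role with privileges on the target tables is assumed to be set for the streaming method.
snowflake.role.name=MY_INGEST_ROLE
topics=my_topic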

References:

https://docs.snowflake.com/en/user-guide/data-load-snowpipe-streaming-overview

https://docs.snowflake.com/en/user-guide/data-load-considerations

Follow and clap if you like the content, and feel free to ask any questions in the comments. I will be more than happy to assist and guide you.

