Best Practices for Different Data Ingestion Options in Snowflake

Snowflake Wiki
Mar 1, 2023 · 3 min read

Data ingestion, the process of obtaining and importing data for immediate storage or use in a database, usually comes in two flavours: batch ingestion and data streaming. Batch ingestion imports data in discrete chunks at pre-determined time slots, while real-time streaming follows a continuous and often unpredictable ingestion schedule.

Snowflake’s Data Cloud addresses many of the data ingestion problems that companies face with three ingestion options: COPY for batch loading, Snowpipe for continuous micro-batch loading, and Snowpipe Streaming* for row-level streaming.

*Snowpipe Streaming is currently a preview feature open to all accounts.

Let’s look at the benefits and best practices of each -

Benefits of COPY

~ Large-scale bulk data ingestion (see the sketch after this list)
~ Scalable compute warehouse
~ Resilient error handling and retry
~ Deduplication support
~ Full support for semi-structured data on load
~ Pre-defined data format options
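
To make the first benefit concrete, here is a minimal bulk-load sketch. The database, schema, stage, table, and file format names are hypothetical; adjust them to your environment.

-- Hypothetical named file format for staged CSV files
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = CSV
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1;

-- Bulk load every staged file under the events/ prefix in a single COPY
COPY INTO my_db.my_schema.events
  FROM @my_db.my_schema.raw_stage/events/
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');

Because COPY parallelizes across files rather than within a single file, warehouse sizing should track the number and size of staged files, not the size of any one file.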

COPY Best Practices

Do
=> Leverage object path partitioning for effective listing and loading
=> Use the FILES or PATTERN option to specify the exact files when loading ad-hoc
=> Utilize schema detection for simple table creation and schema evolution for continuous data loading
=> Take advantage of file format and copy options such as ON_ERROR (see the sketch after this list)
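
To make the Do items above concrete, here is a hedged sketch reusing the hypothetical names from the earlier example, showing FILES, ON_ERROR, and schema detection with INFER_SCHEMA:

-- Ad hoc load of specific files only, skipping bad records instead of aborting
COPY INTO my_db.my_schema.events
  FROM @my_db.my_schema.raw_stage/events/2023/02/
  FILES = ('events_001.csv', 'events_002.csv')   -- or: PATTERN = '.*events_.*[.]csv'
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
  ON_ERROR = 'CONTINUE';

-- Schema detection: derive the target table definition from the staged files
-- (shown for a hypothetical named Parquet file format)
CREATE OR REPLACE TABLE my_db.my_schema.events_parquet
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(
      LOCATION    => '@my_db.my_schema.raw_stage/events_parquet/',
      FILE_FORMAT => 'my_parquet_format'
    ))
  );

Use FILES for an explicit list of files or PATTERN for a regular-expression match, whichever fits how the ad hoc request arrives.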

Don’t
<> Don’t upsize your warehouse to load a single file faster; a bigger warehouse won’t speed up loading one file
<> Don’t over-complicate your large data loads; let Snowflake handle the job management
<> Don’t split or merge files unless your data source supports it natively; it’s not worth building a separate tool or process

Benefits of Snowpipe

~ Continuously generated data is available for analysis in under a minute
~ You only pay for the compute time you use to load data
~ Avoid repeated manual COPY commands (see the pipe sketch after this list)
~ Zero management. No indexing, tuning, partitioning or vacuuming on load
~ Error notifications for failure alerting
~ Serverless: No server to manage or concurrency to worry about.
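
As a minimal sketch of the benefits above, the pipe below simply wraps the same hypothetical COPY statement; Snowflake then runs it serverlessly whenever cloud storage event notifications report new files. Object names are illustrative, and the cloud-side notification setup is not shown.

-- Auto-ingest pipe wrapping a COPY into the hypothetical events table
CREATE PIPE my_db.my_schema.events_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO my_db.my_schema.events
    FROM @my_db.my_schema.raw_stage/events/
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');

-- The notification_channel column shows the queue to point bucket/container events at
SHOW PIPES LIKE 'events_pipe' IN SCHEMA my_db.my_schema;

Once the event notifications are wired up, new files landing under the events/ prefix are loaded automatically, typically within about a minute, with no warehouse to manage.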

Snowpipe Best Practices

Do
=> Utilize native cloud provider event filtering
=> Ensure optimal file sizing (100–250 MB compressed)
=> Utilize the same notification integration for multiple Azure/GCP pipes
=> Utilize error notifications for failure alerting
=> Check pipe status for debugging and health (see the sketch after this list)
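
A sketch of the monitoring side, reusing the hypothetical pipe and table names from the previous example; the error notification integration referenced at the end is assumed to exist already.

-- Execution state, pending file count, and details of the last received notification
SELECT SYSTEM$PIPE_STATUS('my_db.my_schema.events_pipe');

-- Per-file load results (including errors) for the last 24 hours
SELECT *
FROM TABLE(my_db.INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'my_schema.events',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())
));

-- Route pipe load failures to a cloud queue/topic for alerting
ALTER PIPE my_db.my_schema.events_pipe
  SET ERROR_INTEGRATION = my_error_notification_int;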

Don’t
<> Ingest tiny files (~1 KB); use Snowpipe Streaming where possible to bypass files altogether
<> Build your own Frankenstein auto-ingest solution or custom replication
<> Drop stages, tables, integrations on running pipes
<> Ignore COPY best practices

Benefits of Snowpipe Streaming

~ Data is queryable in under 5 seconds after Snowflake acknowledges the write.
~ Low cost for both steady high-volume streams and low-volume trickle feeds.
~ Direct Data Streaming: ingest rows into tables over HTTPS; no files or pipes to manage.
~ Exactly-Once and Per-Channel Ordering: Get data in once and process in order.
~ High Throughput: GB/sec ingestion rates supported.
~ Low overhead: Minimal configuration and setup needed.

Snowpipe Streaming Best Practices

Do
=> Consider your business requirements to determine best ingestion cost/latency
=> Leverage Snowpipe Streaming in Snowflake’s Kafka Connector with a simple configuration upgrade (see the sketch after this list)
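
As an illustration of the Kafka Connector upgrade mentioned above, here is a hedged sketch of the relevant connector properties. Only the ingestion method and role lines are specific to streaming; the connection settings (URL, user, private key, database, schema) stay as in an existing Snowpipe-based connector, and every value shown is a placeholder.

# Existing sink connector definition (connection settings omitted)
name=my_kafka_to_snowflake_connector
connector.class=com.snowflake.kafka.connector.SnowflakeSinkConnector
topics=events_topic

# Switch the ingestion path from file-based Snowpipe to Snowpipe Streaming
snowflake.ingestion.method=SNOWPIPE_STREAMING
snowflake.role.name=MY_INGEST_ROLE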

Don’t
<> Build a streaming application just to read a large file or batch load existing data
<> Replace all batch ingestion; augment it instead with streaming

References:

https://docs.snowflake.com/en/user-guide/data-load-snowpipe-streaming-overview

https://docs.snowflake.com/en/user-guide/data-load-considerations

Follow and clap if you like the content, and feel free to ask any questions in the comments. I will be more than happy to assist and guide you.


Snowflake Wiki

Snowflake Basics | Features | New releases | Tricks & Tips | SnowPro Certifications | Solutions | Knowledge Sharing | ~~~ By satitiru (Snowflake DataSuperHero)