In the context of data engineering we have our share of foot guns. One of those foot guns that I run into the most are stored procedures and views. In a team data engineering environment their nature is not conducive to effective work.
The most common use case for a stored procedure is to provide a standard SQL interface to common tasks performed against your data. For example, instead of directly running
INSERT against a table you could call a
insert_tableA stored procedure that could contain any custom checks, actions, or other tasks that should be performed when inserting into that particular table. This is mainly useful when multiple actors are repeatedley performing the same or similar actions.
Often, but not always, stored procedures are reserved for tasks that involve adding or changing data. On the flip-side many database systems provide views to enable a similar concept to stored procedures but for querying data. Views are typically, limited to a single
SELECT statement. This affords an interface for defining standardized “reports” that you can be sure are pulled the same way for everyone who runs them.
For complex reports, a stored procedure could be in order so that temporary tables and other facilities can be used when querying a complex dataset.
All of that is in fact useful. However, the important part is knowing in which situations to use these tools and when not to. If you are a DBA for a small company with non-technical users who need to run a few reports and maybe the part-time IT guy needs to insert data from Active Directory when a new employee joins, views and stored procedures can be great solutions to hide complexity from these kinds of incidental, business, or junior database users. However, in a data engineering context these tools can easily become a liability, especially if proper guardrails are not in place.
What else can we get out of stored procedures?
In addition to the basic use cases for stored procedures outlined above, stored procedures can provide other conveniences. For example, you can deploy code without a file system since the code is stored within the database itself. This makes deploying easy while sacrificing the control and access provided by a file system. Depending on your situation this can be a huge positive.
You can also log directly to a SQL table quite easily from a stored procedure. This is great if you’re working at your SQL command line however if your logs needs to be captured for reference or later auditing not having them on the file-system adds extra steps and could increase storage usage down the line.
I am certain that stored procedure aficionados could easily provide more pros for and reasons to use them. In my mind, the main issue to consider is where and how your work needs to be done. If you can always be directly inside your RDBMS doing work, then having these things at your fingertips in the console is nice. This, however, is usually only the case if you are a solo DBA or developer working on a single project. If you work on a team and code needs to be version controlled, shared, reviewed, modified and deployed by more than one developer to more than one place these SQL tools can quickly breakdown.
The tools we have for version control, code review, deployments, etc. all expect code to live in a file system. You cannot put a stored procedure into git from within MySQL (for example). You must make a file and copy the stored procedure’s code into it. This code in the file is no longer your stored procedure. Your stored procedure is still in the RDBMS. It is controlled completely by the RDBMS and your facsimile on the file system is no longer tied to the code in the RDBMS.
What I am getting at is that stored procedures can be modified at any time by anyone who has the access. So you might deploy a spiffy update to stored procedure only for it to be clobbered by another developer’s fat finger hours later. Or a disgruntled employee might change a stored procedure and go undetected for some time. And while there might be a record of when a stored procedure was last changed and by who, the code changes themselves are not logged anywhere.
Since there is not a single path to deployment chaos can ensue. There are certainly permissions and access rights that can be setup to help prevent some of this but this adds another layer of complexity to deployment and DB server management.
If instead, the code was stored on the file system and sent to the RDBMS in commands the entire history and lifetime of the code can be tracked and reviewed and reverted when needed. This is critical not only for production systems where knowing what code was executing when is paramount to security, quality, and reliability but it is also important for complying with various compliance regimes many organizations must follow.
A properly setup deployment and permissions system could surely alleviate many of these issues but that is considerable infrastructure for a benefit I have yet to see in the data engineering context. Most RDBMSs cannot exist on their own in a data pipeline. You are going to need to execute custom code (python, perl, PHP, ruby, Go, etc.) to connect to APIs, move data between systems, and more.
So if you must setup version control, deployment, and all of that for your PHP code (for example) why not use that same infrastructure for your SQL code rather than adding another layer for managing the stored procedures as well? At this point and in this context, none of the benefits mentioned above apply any longer so there is no reason to do this.
SQL Scripts are Still Great
None of the above is to meant to discount the SQL script (i.e a set of consecutive SQL statements that perform a task like generating a report or a reporting table). Stored procedures, more often than not, consist of a series of SQL statements that employee all kinds of logic, but this is not a unique nor defining characteristic.
Your SQL code should be made more visible and usable by lifting it out of the RDBMS layer and on to the file system with the rest of the code. However, when performing ETLs it ideal (IMHO) to keep as much business logic as possible in your programming language (python, perl, php etc.) so that can you can leverage unit testing, code discoverablility (think IDEs), and code reusability.
In the context of reporting, SQL scripts shine especially bright. Usually there is not much use for a traditional programming language when using an RDBMS to perform a complex data analyses. This is an excellent use case for a SQL script, but not a stored procedure!
If you are on a data engineering team building data pipelines, reports, or deploying models stored procedures do not offer many benefits and the practical trade offs required to employ them generally will not be worth any benefits they provide. If you have a different conclusion I would love to learn about your point of view. Send me your blog posts!