Hi folks,
I passed my Data Architecture and Management Designer exam a couple of months back and was keen to share my learnings, but thanks to my procrastination it kept getting delayed. Finally, I got a chance to post something in a structured way which might help you pass this credential.
Data Architect is a domain exam that falls under the Application Architect tree, which in turn leads to Technical Architect (it's a long way, but I'm getting there). I won't go into the details of the credential overview, since all the instructions are mentioned here; instead, I'll highlight the key areas you need to focus on to fully understand large data volume considerations and become an expert in data management. Roll up your sleeves and let's dig in:
Multi-tenancy & Metadata Overview:
Multitenancy is a means of providing a single application to multiple organizations, such as different companies or departments within a company, from a single hardware-software stack.
To ensure that tenant-specific customizations do not breach the security of other tenants or affect their performance, Salesforce uses a runtime engine that generates application components from those customizations. By maintaining boundaries between the architecture of the underlying application and that of each tenant, Salesforce protects the integrity of each tenant’s data and operations.
Whenever an organization makes customizations, the platform tracks them as metadata rather than changing the underlying database schema. Salesforce stores the application data for all virtual tables in a few large database tables, which are partitioned by tenant (organization) and serve as heap storage. The platform's engine then materializes virtual table data at runtime by consulting the corresponding metadata. Details here
Search Architecture:
Search is the capability to query records based on free-form text. For data to be searched, it must first be indexed. The indexes are created by the search indexing servers, which also generate and asynchronously process queue entries for newly created or modified data. After a searchable object's record is created or updated, it can take about 15 minutes or more for the updated text to become searchable.
Salesforce performs indexed searches by first searching the indexes for appropriate records, then narrowing down the results based on access permissions, search limits, and other filters. This process creates a result set, which typically contains the most relevant results. After the result set reaches a predetermined size, the remaining records are discarded. The result set is then used to query the records from the database to retrieve the fields that a user sees. Details here
Force.com query optimizer:
The Force.com query optimizer helps the database system's optimizer produce effective execution plans for Salesforce queries. It works on the queries that are automatically generated to handle reports, list views, and SOQL queries. (You can inspect the plans it considers with the Query Plan tool in the Developer Console.) Details here
Skinny Tables:
Salesforce creates skinny tables to contain frequently used fields and to avoid joins, and it keeps the skinny tables in sync with their source tables when the source tables are modified. To enable skinny tables, contact Salesforce Customer Support. For each object table, Salesforce maintains other, separate tables at the database level for standard and custom fields. This separation ordinarily necessitates a join when a query contains both kinds of fields. A skinny table contains both kinds of fields and does not include soft-deleted records.
The linked documentation shows an example of an Account view, the corresponding database tables, and a skinny table that would speed up Account queries. Details here
Indexes:
Salesforce supports custom indexes to speed up queries, and you can create custom indexes by contacting Salesforce Customer Support. Note that nulls in the filter criteria prevent the use of an index unless it was explicitly created to include null rows. Details here
Index Tables
The Salesforce multitenant architecture makes the underlying data table for custom fields unsuitable for indexing. To overcome this limitation, the platform creates an index table that contains a copy of the data, along with information about the data types. By default, the index tables do not include records that are null (records with empty values).
Standard and Custom Indexed Fields:
The Force.com query optimizer maintains a table containing statistics about the distribution of data in each index. It uses this table to perform pre-queries to determine whether using the index can speed up the query.
Standard Indexed Fields:
Used if the filter matches less than 30% of the total records, up to one million records.
For example, a standard index is used if:
• A query is executed against a table with two million records, and the filter matches 600,000 or fewer records.
• A query is executed against a table with five million records, and the filter matches one million or fewer records.
Custom Indexed Fields
Used if the filter matches less than 10% of the total records, up to 333,333 records.
For example, a custom index is used if:
• A query is executed against a table with 500,000 records, and the filter matches 50,000 or fewer records.
• A query is executed against a table with five million records, and the filter matches 333,333 or fewer records.
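To make selectivity concrete, here's a minimal Apex/SOQL sketch (the object and filter values are just examples) showing a filter that can use the standard index on Account Name versus one that typically can't:

    // Name carries a standard index; a selective equality filter lets the
    // optimizer use it instead of scanning the whole table:
    List<Account> selective = [SELECT Id FROM Account WHERE Name = 'Acme Corp'];

    // A leading wildcard defeats index use and forces a full scan on large tables:
    List<Account> unselective = [SELECT Id FROM Account WHERE Name LIKE '%corp%'];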
Two-column Custom Indexes
Two-column indexes are subject to the same restrictions as single-column indexes, with one exception. Two-column indexes can have nulls in the second column by default, whereas single-column indexes cannot, unless Salesforce Customer Support has explicitly enabled the option to include nulls.
PK Chunking:
Use the PK Chunking request header to enable automatic primary key (PK) chunking for a bulk query job. PK chunking splits bulk queries on very large tables into chunks based on the record IDs, or primary keys, of the queried records. Each chunk is processed as a separate batch that counts toward your daily batch limit, and you must download each batch’s results separately. Details here
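For illustration, here's a hedged Apex sketch of creating a Bulk API (1.0) query job with the PK chunking header set. In practice this call usually comes from an external integration client; the API version, chunk size, and the remote-site/named-credential setup needed for the callout are assumptions:

    // Create a bulk query job on Account with PK chunking enabled.
    HttpRequest req = new HttpRequest();
    req.setEndpoint(URL.getOrgDomainUrl().toExternalForm()
        + '/services/async/58.0/job');                         // Bulk API 1.0 job endpoint
    req.setMethod('POST');
    req.setHeader('X-SFDC-Session', UserInfo.getSessionId());  // Bulk 1.0 session header
    req.setHeader('Sforce-Enable-PKChunking', 'chunkSize=100000'); // 100,000 is the default
    req.setHeader('Content-Type', 'application/xml');
    req.setBody('<?xml version="1.0" encoding="UTF-8"?>'
        + '<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">'
        + '<operation>query</operation><object>Account</object>'
        + '<contentType>CSV</contentType></jobInfo>');
    HttpResponse res = new Http().send(req);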
Big Objects for data archiving:
Big objects are a capability that lets you store and manage data at scale on the Salesforce platform. This feature helps you engage directly with customers by preserving all your historical customer event data. If you have a large amount of data stored in standard or custom objects in Salesforce, use big objects to store the historical portion.
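Custom big objects use the __b suffix, and queries must filter on the object's index fields in their defined order. A hypothetical sketch, assuming a Customer_Event__b big object whose index starts with Customer_Id__c:

    // SOQL on a big object must filter on its index fields, left to right:
    List<Customer_Event__b> history = [
        SELECT Customer_Id__c, Event_Date__c, Event_Type__c
        FROM Customer_Event__b
        WHERE Customer_Id__c = '001xx0000012345'
    ];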
Best Practices for Data archiving
Big Objects
Data Skew:
“Data skew” refers to the condition of having the ratio of records to a parent record or to an owner out of balance. As a basic rule of thumb, you want to keep any one owner or parent record from “owning” more than 10,000 records. If you find that is happening a lot, revisit the idea of archiving your data, or create a plan to spread the wealth by assigning the records to more owners, or to more parent records.
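A quick way to spot skew is an aggregate query over the child object. This sketch uses Contacts under Accounts as an example; in a very large org you'd run it from Batch Apex or an external tool to stay within query-row limits:

    // Find accounts that parent more than 10,000 contacts:
    for (AggregateResult ar : [
            SELECT AccountId acctId, COUNT(Id) n
            FROM Contact
            GROUP BY AccountId
            HAVING COUNT(Id) > 10000]) {
        Id acctId = (Id) ar.get('acctId');
        Integer n = (Integer) ar.get('n');
        System.debug(acctId + ' has ' + n + ' contacts');
    }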
Data Governance
Data governance refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise. A sound data governance program includes a governing body or council, a defined set of procedures, and a plan to execute those procedures.
Salesforce governance simplified
Best Practice for Data Governance
Data Stewardship
Management and oversight of an organization's data assets to help provide business users with high-quality data that is easily accessible in a consistent manner.
Data Stewardship/ Data governance
Techniques for Optimizing Performance
1) Mashups - External Websites and Callouts: keep very large data sets outside Salesforce and present them in the UI through an external website, or fetch just the records you need on demand through callouts.
2) Using SOQL and SOSL
SOQL - Use it when you know in which objects or fields the data resides (for example, when multiple objects are related to each other).
SOSL - Use it when you don't know in which object or field the data resides and you want to find it in the most efficient way possible (for example, when the objects aren't related to each other). SOSL is generally faster than SOQL if the search expression uses a CONTAINS term.
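Here's a minimal Apex sketch contrasting the two (the object and field names are just examples):

    // SOQL: you know the object and fields you need.
    List<Account> accts = [SELECT Id, Name FROM Account WHERE Name = 'Acme'];

    // SOSL: free-text search across objects that don't have to be related.
    List<List<SObject>> hits = [FIND 'Acme' IN ALL FIELDS
                                RETURNING Account(Id, Name), Contact(Id, Email)];
    List<Account> foundAccounts = (List<Account>) hits[0];
    List<Contact> foundContacts = (List<Contact>) hits[1];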
3) Deleting Data:
While the data is soft deleted, it still affects database performance because the data is still resident, and deleted records have to be excluded from any queries.
Salesforce recommends using the Bulk API's hard delete function to delete large data volumes.
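As a sketch of what that looks like, a Bulk API 2.0 ingest job can be opened with the hardDelete operation (the running user needs the "Bulk API Hard Delete" permission; the API version and callout setup here are assumptions):

    // Open a Bulk API 2.0 ingest job that hard deletes Accounts;
    // hard-deleted rows bypass the Recycle Bin instead of lingering
    // as soft-deleted data.
    HttpRequest req = new HttpRequest();
    req.setEndpoint(URL.getOrgDomainUrl().toExternalForm()
        + '/services/data/v58.0/jobs/ingest');
    req.setMethod('POST');
    req.setHeader('Authorization', 'Bearer ' + UserInfo.getSessionId());
    req.setHeader('Content-Type', 'application/json');
    req.setBody('{"object":"Account","operation":"hardDelete"}');
    HttpResponse res = new Http().send(req);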
4) Defer Sharing Rules:
It allows you to defer the processing of sharing rules until after new users, rules, and other content have been loaded.
Best Practices for achieving good performance in deployments
1) Loading Data from the API
- Use the fastest operation possible: insert() is fastest, update() is next, and upsert() is next after that. If possible, also break upsert() into two operations: create() and update().
- When updating, send only the fields that have changed (delta-only loads).
- For custom integrations, authenticate once per load, not on each record.
- Use Public Read/Write security during the initial load to avoid sharing calculation overhead.
- When changing child records, group them by parent: put records with the same ParentId in the same batch to minimize locking conflicts.
2) Extracting Data from the API
- Use the getUpdated() and getDeleted() SOAP API calls to sync an external system with Salesforce at intervals greater than 5 minutes.
- Keep searches specific and avoid wildcards where possible. For example, search for Michael instead of Mi*.
- Use single-object searches for greater speed and accuracy.
- When deleting records that have many children, delete the children first.
- When deleting large volumes of data (one million or more records), use the hard delete option of the Bulk API.
Event monitoring is one of many tools that Salesforce provides to help keep your data secure. It lets you see the granular details of user activity in your organization. We refer to these user activities as events. You can view information about individual events or track trends in events to swiftly identify abnormal behavior and safeguard your company’s data. Details here
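Event log data is exposed through the EventLogFile object; here's a small sketch pulling recent login events (each row points at a downloadable CSV log file):

    // Recent login event logs, newest first:
    List<EventLogFile> logs = [
        SELECT Id, EventType, LogDate, LogFileLength
        FROM EventLogFile
        WHERE EventType = 'Login'
        ORDER BY LogDate DESC
        LIMIT 10
    ];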
Field Audit Trail
Field Audit Trail lets you define a policy to retain archived field history data for up to ten years, independent of field history tracking. Details here
Setup Audit Trail
Setup Audit Trail tracks the configuration and metadata changes that you and other admins have made to your org. Details here
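The audit trail is also queryable through the SetupAuditTrail object, for example:

    // Last 20 setup changes, newest first:
    List<SetupAuditTrail> changes = [
        SELECT Action, Section, CreatedBy.Name, CreatedDate
        FROM SetupAuditTrail
        ORDER BY CreatedDate DESC
        LIMIT 20
    ];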
Merge accounts
Merging multiple accounts into one helps you keep your data clean so you can focus on closing deals. Details here
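Apex even has a dedicated merge DML statement for accounts, contacts, and leads; here's a minimal sketch with hypothetical records:

    // The duplicate is deleted and its related records are reparented
    // onto the master account.
    Account master = [SELECT Id FROM Account WHERE Name = 'Acme' LIMIT 1];
    Account dup = [SELECT Id FROM Account WHERE Name = 'Acme, Inc.' LIMIT 1];
    merge master dup;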
I would highly suggest going through the guide on best practices for deployments with large data volumes.
Best of luck with your exam. I hope this information helps. Feel free to comment below if there's anything you'd like to understand better.
Cheers!