Finding Total Size of Hive Database’s data

Gomz
3 min readNov 13, 2019

--

By default, we can check the size of a table or database by using the hdfs command as below.

hdfs dfs -df -s -h <HDFS path of Database/table>

But when there are many databases or tables (especially external tables) with data present in multiple different directories in HDFS, the below might help in determining the size.

The below steps when when performed in the Hive Metastore DB would help you in getting the total size occupied by all the tables in Hive.

Note: The results you get for this query would be 100% correct only if all the tables are having their stats updated. [This can be checked in the table — TABLE_PARAMS in Metastore DB that I have also mentioned below (How it works?.b)]

Steps:

1. Login into Hive Metastore DB and use the database that is used by hive. hive1 by default.

2. Once done, you can execute the below query to get the total size of all the tables in Hive in bytes. The query takes the sum of total size of all the Hive tables based on the statistics of the tables.

MariaDB [hive1]> SELECT SUM(PARAM_VALUE) FROM TABLE_PARAMS WHERE PARAM_KEY="totalSize";
+------------------+
| SUM(PARAM_VALUE) |
+------------------+
| 30376289388684 |
+------------------+
1 row in set (0.00 sec)

3. Remember, the result derived above is for only one replication. 30376289388684 x 3 is the actual size in HDFS including the replication.

How it works?

a. Selecting a random table in Hive with id 5783 and name — test12345 from the TBLS table in Hive Metastore DB.

MariaDB [hive1]> SELECT * FROM TBLS WHERE TBL_ID=5783;
+--------+-------------+-------+------------------+-------+-----------+-------+-----------+---------------+--------------------+--------------------+----------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID |
+--------+-------------+-------+------------------+-------+-----------+-------+-----------+---------------+--------------------+--------------------+----------------+
| 5783 | 1555060992 | 1 | 0 | hive | 0 | 17249 | test12345 | MANAGED_TABLE | NULL | NULL | NULL |
+--------+-------------+-------+------------------+-------+-----------+-------+-----------+---------------+--------------------+--------------------+----------------+
1 row in set (0.00 sec)

b. Checking the different parameters of the table in Hive Metastore table — TABLE_PARAMS for the same Hive table with id — 5783. The totalSize record indicates the total size occupied by this table in HDFS for one of its replica. The next point © which is the hdfs du -s can be compared to check this.

The param COLUMN_STATS_ACCURATE with the value true says the table's statistics property is set to true. You can check for tables with this value as false to see if there are any tables in Hive those might have missing statistics.

MariaDB [hive1]> SELECT * FROM TABLE_PARAMS
-> WHERE TBL_ID=5783;
+--------+-----------------------+-------------+
| TBL_ID | PARAM_KEY | PARAM_VALUE |
+--------+-----------------------+-------------+
| 5783 | COLUMN_STATS_ACCURATE | true |
| 5783 | numFiles | 1 |
| 5783 | numRows | 1 |
| 5783 | rawDataSize | 2 |
| 5783 | totalSize | 324 |
| 5783 | transient_lastDdlTime | 1555061027 |
+--------+-----------------------+-------------+
6 rows in set (0.00 sec)

c. hdfs du -s output of the same table from HDFS. 324 and 972 are the sizes of one and three replicas of the table data in HDFS.

324  972  /user/hive/warehouse/test12345

Hope this helps!!

--

--