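Common HDFS commands

hdfs dfs and hadoop fs behave the same: if you look at the underlying source, both commands end up calling the same jar. The rest of this post therefore uses hadoop fs throughout.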

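Local > HDFS

-put copies a file from the local file system to an HDFS path; it is equivalent to -copyFromLocal:

hadoop fs -put ./zaiyiqi.txt /user/shuguo/test/

-moveFromLocal moves (cut and paste) a file from the local file system to HDFS:

hadoop fs -moveFromLocal ./kongming.txt /sanguo/shuguo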

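HDFS > local

-get, equivalent to -copyToLocal, downloads a file from HDFS to the local file system:

hadoop fs -get /sanguo/shuguo/kongming.txt ./

-getmerge downloads several files merged into a single local file. For example, if the HDFS directory /user/shuguo/test contains several files:

hadoop fs -getmerge /user/shuguo/test/* ./zaiyiqi.txt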

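HDFS > HDFS

-cp and -mv work like their Linux counterparts, copying and moving files within HDFS.

-chgrp, -chmod, and -chown work like their Linux counterparts: they change a file's group, permissions, and owner.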

-help prints the usage and options of a command:

hadoop fs -help rm

-ls lists the contents of a directory:

hadoop fs -ls /

-du reports the size of a directory:

hadoop fs -du -h /user/sanguo/test

-mkdir creates a directory on HDFS:

hadoop fs -mkdir -p /sanguo/shuguo

-cat prints the contents of a file:

hadoop fs -cat /sanguo/shuguo/kongming.txt

-rm -r deletes a file or a directory:

hadoop fs -rm -r /user/sanguo/test

-setrep sets the replication factor of a file in HDFS:

hadoop fs -setrep 10 /sanguo/shuguo/kongming.txt

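The replication factor set here is only recorded in the NameNode's metadata. Whether that many replicas actually exist depends on the number of DataNodes. With only 3 machines at the moment, there can be at most 3 replicas; the count reaches 10 only once the cluster grows to 10 nodes.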
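The rule above can be written down as a tiny sketch. It only models the arithmetic described in the paragraph (at most one replica of a block per DataNode); the class and method names are made up for illustration and are not part of Hadoop.

```java
// Sketch only: models the rule described above, not Hadoop's real block placement code.
class ReplicationSketch {

    /** Number of replicas that can actually exist: never more than one per DataNode. */
    static int effectiveReplicas(int requestedReplication, int liveDataNodes) {
        if (requestedReplication < 1 || liveDataNodes < 0) {
            throw new IllegalArgumentException("replication must be >= 1 and node count >= 0");
        }
        return Math.min(requestedReplication, liveDataNodes);
    }

    public static void main(String[] args) {
        int requested = 10; // the value passed to -setrep in the command above
        for (int nodes : new int[] {3, 5, 10, 12}) {
            System.out.println(nodes + " DataNodes -> "
                    + effectiveReplicas(requested, nodes) + " replicas");
        }
    }
}
```

With 3 DataNodes the sketch yields 3 replicas, and it reaches the requested 10 only from 10 DataNodes upward.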

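HDFS client (Java)

Basic setup

First, place a Windows build of Hadoop in a directory on your machine and add Hadoop's bin directory to the system environment variables. For the details of installing Hadoop on Windows, see this article: https://blog.csdn.net/weixin_41122339/article/details/81141913. Remember to restart the IDE after installing, otherwise it will not work. Windows 10 Home may not be usable.

Once it is installed, create a new Maven project and add the dependencies: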

<dependencies>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>RELEASE</version>
  </dependency>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.8.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.2</version>
  </dependency>
</dependencies>

Then create a log4j.properties file in the Maven project's resources directory and fill in the logging configuration:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

API

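Create the HdfsClient class:

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class HdfsClient {

    @Test
    public void testMkdirs() throws IOException, InterruptedException, URISyntaxException {

        // 1. Create the configuration used to obtain the file system
        Configuration conf = new Configuration();

        // 2. Obtain the HDFS client object (NameNode URI, configuration, user name)
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.189.138:9000"), conf, "root");

        // 3. Create the directory
        fs.mkdirs(new Path("/daxian"));

        // 4. Release the resources
        fs.close();
    }
}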

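Fill in the correct URI and user name, then run the test above. If it reports that winutils.exe cannot be found, something went wrong with the Hadoop installation. If the daxian directory is created under the HDFS root, the connection to the Hadoop cluster works.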

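In the example above, creating the configuration, obtaining the HDFS client object, and closing the resources are fixed boilerplate. Only step 3 differs: the call made there determines which operation is performed on HDFS.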
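Since a wrong URI is an easy way to fail at step 2, it can help to check the address before handing it to FileSystem.get. The following is a minimal sketch using only java.net.URI; the class and method names are hypothetical, and the checks (hdfs scheme, host present) are an assumption about what is worth validating, not something Hadoop requires you to do.

```java
import java.net.URI;

// Sketch only: a pre-flight check on the NameNode address, independent of Hadoop.
class NameNodeUriCheck {

    /** Parses the address and rejects anything that is not an hdfs:// URI with a host. */
    static URI parseNameNodeUri(String address) {
        URI uri = URI.create(address); // throws IllegalArgumentException on malformed input
        if (!"hdfs".equals(uri.getScheme())) {
            throw new IllegalArgumentException("expected an hdfs:// URI, got: " + address);
        }
        if (uri.getHost() == null) {
            throw new IllegalArgumentException("URI has no host: " + address);
        }
        return uri;
    }

    public static void main(String[] args) {
        URI uri = parseNameNodeUri("hdfs://192.168.189.138:9000");
        System.out.println("host = " + uri.getHost()); // host = 192.168.189.138
        System.out.println("port = " + uri.getPort()); // port = 9000
    }
}
```

The returned URI can be passed to FileSystem.get in place of the new URI(...) expression used in the test above.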

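Uploading a file:

fs.copyFromLocalFile(new Path("d:/banzhang.txt"), new Path("/xiaohua.txt"));

Downloading a file:

// delSrc = false keeps the file on HDFS; useRawLocalFileSystem = true skips the local .crc checksum file
fs.copyToLocalFile(false, new Path("/xiaohua.txt"), new Path("d:/xiaohua.txt"), true);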

Deleting a file:

// The second argument enables recursive deletion, which is required for non-empty directories
fs.delete(new Path("/xiaohua.txt"), true);

Renaming a file:

fs.rename(new Path("/xiaohua.txt"), new Path("/yanjing.txt"));

Telling files from directories:

FileStatus[] listStatus = fs.listStatus(new Path("/"));

for (FileStatus fileStatus : listStatus) {

  if (fileStatus.isFile()) {
    // Regular file
    System.out.println("file: " + fileStatus.getPath().getName());
  } else {
    // Directory
    System.out.println("directory: " + fileStatus.getPath().getName());
  }
}

Viewing file details:

// The second argument makes the listing recursive
RemoteIterator<LocatedFileStatus> listFiles = fs.listFiles(new Path("/"), true);

while (listFiles.hasNext()) {
  LocatedFileStatus fileStatus = listFiles.next();

  // Print the file name, permissions, length, and block information
  System.out.println(fileStatus.getPath().getName()); // file name
  System.out.println(fileStatus.getPermission());     // file permissions
  System.out.println(fileStatus.getLen());            // file length in bytes

  BlockLocation[] blockLocations = fileStatus.getBlockLocations();

  for (BlockLocation blockLocation : blockLocations) {

    // Hosts that store a replica of this block
    String[] hosts = blockLocation.getHosts();

    for (String host : hosts) {
      System.out.println(host);
    }
  }

  System.out.println("------ separator ------");
}

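HDFS configuration precedence

Configuration conf = new Configuration();

This line instantiates conf with the default HDFS configuration. conf is then passed as an argument when obtaining the HDFS file system instance, so its settings apply to the file operations performed against HDFS.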

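Besides these defaults, additional settings can be supplied in two places.

One is an hdfs-site.xml file created in the Maven project's resources directory. Its format is the same as the hdfs-site.xml of the Hadoop installation described in the previous post, and it is picked up automatically, for example when a file is uploaded.

The other is in code. After the conf instance is created, set values like this:

conf.set("dfs.replication", "2");

This sets dfs.replication, the same property that hdfs-site.xml can define, to 2. As before, conf is then passed as an argument when obtaining the HDFS file system instance, so the setting applies to the file operations performed against HDFS.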

The three sources have a precedence order: values set in client code > user-defined configuration file > default configuration.
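The precedence order can be illustrated with plain java.util.Properties. This is a sketch of the lookup order only, under the assumption that each source is a simple key/value map; it does not reproduce how Hadoop's Configuration class loads or merges resources. The class name, the method name, and the value 1 for the resources file are hypothetical; 3 is the HDFS default for dfs.replication and 2 is the value set in code above.

```java
import java.util.Properties;

// Sketch only: models "code > user config file > defaults", not Hadoop's Configuration class.
class ConfigPrioritySketch {

    /** Returns the value from the highest-priority source that defines the key, or null. */
    static String resolve(String key, Properties defaults, Properties siteFile, Properties inCode) {
        String value = inCode.getProperty(key);      // 1. value set in client code
        if (value == null) {
            value = siteFile.getProperty(key);       // 2. hdfs-site.xml under resources
        }
        if (value == null) {
            value = defaults.getProperty(key);       // 3. built-in default
        }
        return value;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("dfs.replication", "3");

        Properties siteFile = new Properties();
        siteFile.setProperty("dfs.replication", "1"); // hypothetical value in the resources file

        Properties inCode = new Properties();
        inCode.setProperty("dfs.replication", "2");   // conf.set("dfs.replication", "2")

        System.out.println(resolve("dfs.replication", defaults, siteFile, inCode));           // 2
        System.out.println(resolve("dfs.replication", defaults, siteFile, new Properties())); // 1
        System.out.println(resolve("dfs.replication", defaults, new Properties(), new Properties())); // 3
    }
}
```

Removing the higher-priority source each time makes the next one take effect, which is the behaviour the ordering above describes.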

Operating HDFS with Java

2020-08-17
Big data, distributed systems
